зеркало из https://github.com/microsoft/git.git
70 Коммитов
Автор | SHA1 | Сообщение | Дата |
---|---|---|---|
Derrick Stolee | 61f7a383d3 |
maintenance: use 'incremental' strategy by default
The 'git maintenance (register|start)' subcommands add the current repository to the global Git config so maintenance will operate on that repository. It does not specify what maintenance should occur or how often. To make it simple for users to start background maintenance with a recommended schedlue, update the 'maintenance.strategy' config option in both the 'register' and 'start' subcommands. This allows users to customize beyond the defaults using individual 'maintenance.<task>.schedule' options, but also the user can opt-out of this strategy using 'maintenance.strategy=none'. Helped-by: Martin Ågren <martin.agren@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
|
Derrick Stolee | a4cb1a2339 |
maintenance: create maintenance.strategy config
To provide an on-ramp for users to use background maintenance without several 'git config' commands, create a 'maintenance.strategy' config option. Currently, the only important value is 'incremental' which assigns the following schedule: * gc: never * prefetch: hourly * commit-graph: hourly * loose-objects: daily * incremental-repack: daily These tasks are chosen to minimize disruptions to foreground Git commands and use few compute resources. The 'maintenance.strategy' is intended as a baseline that can be customzied further by manually assigning 'maintenance.<task>.enabled' and 'maintenance.<task>.schedule' config options, which will override any recommendation from 'maintenance.strategy'. This operates similarly to config options like 'feature.experimental' which operate as "meta" config options that change default config values. This presents a way forward for updating the 'incremental' strategy in the future or adding new strategies. For example, a potential strategy could be to include a 'full' strategy that runs the 'gc' task weekly and no other tasks by default. Helped-by: Martin Ågren <martin.agren@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
|
Derrick Stolee | d334107c5d |
maintenance: core.commitGraph=false prevents writes
Recently, a user had an issue due to combining fetch.writeCommitGraph=true with core.commitGraph=false. The root bug has been resolved by preventing commit-graph writes when core.commitGraph is disabled. This happens inside the 'git commit-graph write' command, but we can be more aware of this situation and prevent that process from ever starting in the 'commit-graph' maintenance task. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
|
Derrick Stolee | 8f801804be |
maintenance: test commit-graph auto condition
The auto condition for the commit-graph maintenance task walks refs
looking for commits that are not in the commit-graph file. This was
added in
|
|
Derrick Stolee | 2fec604f8d |
maintenance: add start/stop subcommands
Add new subcommands to 'git maintenance' that start or stop background maintenance using 'cron', when available. This integration is as simple as I could make it, barring some implementation complications. The schedule is laid out as follows: 0 1-23 * * * $cmd maintenance run --schedule=hourly 0 0 * * 1-6 $cmd maintenance run --schedule=daily 0 0 * * 0 $cmd maintenance run --schedule=weekly where $cmd is a properly-qualified 'git for-each-repo' execution: $cmd=$path/git --exec-path=$path for-each-repo --config=maintenance.repo where $path points to the location of the Git executable running 'git maintenance start'. This is critical for systems with multiple versions of Git. Specifically, macOS has a system version at '/usr/bin/git' while the version that users can install resides at '/usr/local/bin/git' (symlinked to '/usr/local/libexec/git-core/git'). This will also use your locally-built version if you build and run this in your development environment without installing first. This conditional schedule avoids having cron launch multiple 'git for-each-repo' commands in parallel. Such parallel commands would likely lead to the 'hourly' and 'daily' tasks competing over the object database lock. This could lead to to some tasks never being run! Since the --schedule=<frequency> argument will run all tasks with _at least_ the given frequency, the daily runs will also run the hourly tasks. Similarly, the weekly runs will also run the daily and hourly tasks. The GIT_TEST_CRONTAB environment variable is not intended for users to edit, but instead as a way to mock the 'crontab [-l]' command. This variable is set in test-lib.sh to avoid a future test from accidentally running anything with the cron integration from modifying the user's schedule. We use GIT_TEST_CRONTAB='test-tool crontab <file>' in our tests to check how the schedule is modified in 'git maintenance (start|stop)' commands. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
|
Derrick Stolee | 0c18b70081 |
maintenance: add [un]register subcommands
In preparation for launching background maintenance from the 'git maintenance' builtin, create register/unregister subcommands. These commands update the new 'maintenance.repos' config option in the global config so the background maintenance job knows which repositories to maintain. These commands allow users to add a repository to the background maintenance list without disrupting the actual maintenance mechanism. For example, a user can run 'git maintenance register' when no background maintenance is running and it will not start the background maintenance. A later update to start running background maintenance will then pick up this repository automatically. The opposite example is that a user can run 'git maintenance unregister' to remove the current repository from background maintenance without halting maintenance for other repositories. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
|
Derrick Stolee | b08ff1fee0 |
maintenance: add --schedule option and config
Maintenance currently triggers when certain data-size thresholds are met, such as number of pack-files or loose objects. Users may want to run certain maintenance tasks based on frequency instead. For example, a user may want to perform a 'prefetch' task every hour, or 'gc' task every day. To help these users, update the 'git maintenance run' command to include a '--schedule=<frequency>' option. The allowed frequencies are 'hourly', 'daily', and 'weekly'. These values are also allowed in a new config value 'maintenance.<task>.schedule'. The 'git maintenance run --schedule=<frequency>' checks the '*.schedule' config value for each enabled task to see if the configured frequency is at least as frequent as the frequency from the '--schedule' argument. We use the following order, for full clarity: 'hourly' > 'daily' > 'weekly' Use new 'enum schedule_priority' to track these values numerically. The following cron table would run the scheduled tasks with the correct frequencies: 0 1-23 * * * git -C <repo> maintenance run --schedule=hourly 0 0 * * 1-6 git -C <repo> maintenance run --schedule=daily 0 0 * * 0 git -C <repo> maintenance run --schedule=weekly This cron schedule will run --schedule=hourly every hour except at midnight. This avoids a concurrent run with the --schedule=daily that runs at midnight every day except the first day of the week. This avoids a concurrent run with the --schedule=weekly that runs at midnight on the first day of the week. Since --schedule=daily also runs the 'hourly' tasks and --schedule=weekly runs the 'hourly' and 'daily' tasks, we will still see all tasks run with the proper frequencies. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
|
Derrick Stolee | 1942d48380 |
maintenance: optionally skip --auto process
Some commands run 'git maintenance run --auto --[no-]quiet' after doing their normal work, as a way to keep repositories clean as they are used. Currently, users who do not want this maintenance to occur would set the 'gc.auto' config option to 0 to avoid the 'gc' task from running. However, this does not stop the extra process invocation. On Windows, this extra process invocation can be more expensive than necessary. Allow users to drop this extra process by setting 'maintenance.auto' to 'false'. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
|
Derrick Stolee | e841a79a13 |
maintenance: add incremental-repack auto condition
The incremental-repack task updates the multi-pack-index by deleting pack- files that have been replaced with new packs, then repacking a batch of small pack-files into a larger pack-file. This incremental repack is faster than rewriting all object data, but is slower than some other maintenance activities. The 'maintenance.incremental-repack.auto' config option specifies how many pack-files should exist outside of the multi-pack-index before running the step. These pack-files could be created by 'git fetch' commands or by the loose-objects task. The default value is 10. Setting the option to zero disables the task with the '--auto' option, and a negative value makes the task run every time. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
|
Derrick Stolee | a13e3d0ec8 |
maintenance: auto-size incremental-repack batch
When repacking during the 'incremental-repack' task, we use the --batch-size option in 'git multi-pack-index repack'. The initial setting used --batch-size=0 to repack everything into a single pack-file. This is not sustainable for a large repository. The amount of work required is also likely to use too many system resources for a background job. Update the 'incremental-repack' task by dynamically computing a --batch-size option based on the current pack-file structure. The dynamic default size is computed with this idea in mind for a client repository that was cloned from a very large remote: there is likely one "big" pack-file that was created at clone time. Thus, do not try repacking it as it is likely packed efficiently by the server. Instead, we select the second-largest pack-file, and create a batch size that is one larger than that pack-file. If there are three or more pack-files, then this guarantees that at least two will be combined into a new pack-file. Of course, this means that the second-largest pack-file size is likely to grow over time and may eventually surpass the initially-cloned pack-file. Recall that the pack-file batch is selected in a greedy manner: the packs are considered from oldest to newest and are selected if they have size smaller than the batch size until the total selected size is larger than the batch size. Thus, that oldest "clone" pack will be first to repack after the new data creates a pack larger than that. We also want to place some limits on how large these pack-files become, in order to bound the amount of time spent repacking. A maximum batch-size of two gigabytes means that large repositories will never be packed into a single pack-file using this job, but also that repack is rather expensive. This is a trade-off that is valuable to have if the maintenance is being run automatically or in the background. Users who truly want to optimize for space and performance (and are willing to pay the upfront cost of a full repack) can use the 'gc' task to do so. Create a test for this two gigabyte limit by creating an EXPENSIVE test that generates two pack-files of roughly 2.5 gigabytes in size, then performs an incremental repack. Check that the --batch-size argument in the subcommand uses the hard-coded maximum. Helped-by: Chris Torek <chris.torek@gmail.com> Reported-by: Son Luong Ngoc <sluongng@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
|
Derrick Stolee | 52fe41ff1c |
maintenance: add incremental-repack task
The previous change cleaned up loose objects using the 'loose-objects' that can be run safely in the background. Add a similar job that performs similar cleanups for pack-files. One issue with running 'git repack' is that it is designed to repack all pack-files into a single pack-file. While this is the most space-efficient way to store object data, it is not time or memory efficient. This becomes extremely important if the repo is so large that a user struggles to store two copies of the pack on their disk. Instead, perform an "incremental" repack by collecting a few small pack-files into a new pack-file. The multi-pack-index facilitates this process ever since 'git multi-pack-index expire' was added in |
|
Derrick Stolee | 3e220e6069 |
maintenance: create auto condition for loose-objects
The loose-objects task deletes loose objects that already exist in a pack-file, then place the remaining loose objects into a new pack-file. If this step runs all the time, then we risk creating pack-files with very few objects with every 'git commit' process. To prevent overwhelming the packs directory with small pack-files, place a minimum number of objects to justify the task. The 'maintenance.loose-objects.auto' config option specifies a minimum number of loose objects to justify the task to run under the '--auto' option. This defaults to 100 loose objects. Setting the value to zero will prevent the step from running under '--auto' while a negative value will force it to run every time. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
|
Derrick Stolee | 252cfb7cb8 |
maintenance: add loose-objects task
One goal of background maintenance jobs is to allow a user to disable auto-gc (gc.auto=0) but keep their repository in a clean state. Without any cleanup, loose objects will clutter the object database and slow operations. In addition, the loose objects will take up extra space because they are not stored with deltas against similar objects. Create a 'loose-objects' task for the 'git maintenance run' command. This helps clean up loose objects without disrupting concurrent Git commands using the following sequence of events: 1. Run 'git prune-packed' to delete any loose objects that exist in a pack-file. Concurrent commands will prefer the packed version of the object to the loose version. (Of course, there are exceptions for commands that specifically care about the location of an object. These are rare for a user to run on purpose, and we hope a user that has selected background maintenance will not be trying to do foreground maintenance.) 2. Run 'git pack-objects' on a batch of loose objects. These objects are grouped by scanning the loose object directories in lexicographic order until listing all loose objects -or- reaching 50,000 objects. This is more than enough if the loose objects are created only by a user doing normal development. We noticed users with _millions_ of loose objects because VFS for Git downloads blobs on-demand when a file read operation requires populating a virtual file. This step is based on a similar step in Scalar [1] and VFS for Git. [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/LooseObjectsStep.cs Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
|
Derrick Stolee | 28cb5e66dd |
maintenance: add prefetch task
When working with very large repositories, an incremental 'git fetch'
command can download a large amount of data. If there are many other
users pushing to a common repo, then this data can rival the initial
pack-file size of a 'git clone' of a medium-size repo.
Users may want to keep the data on their local repos as close as
possible to the data on the remote repos by fetching periodically in
the background. This can break up a large daily fetch into several
smaller hourly fetches.
The task is called "prefetch" because it is work done in advance
of a foreground fetch to make that 'git fetch' command much faster.
However, if we simply ran 'git fetch <remote>' in the background,
then the user running a foreground 'git fetch <remote>' would lose
some important feedback when a new branch appears or an existing
branch updates. This is especially true if a remote branch is
force-updated and this isn't noticed by the user because it occurred
in the background. Further, the functionality of 'git push
--force-with-lease' becomes suspect.
When running 'git fetch <remote> <options>' in the background, use
the following options for careful updating:
1. --no-tags prevents getting a new tag when a user wants to see
the new tags appear in their foreground fetches.
2. --refmap= removes the configured refspec which usually updates
refs/remotes/<remote>/* with the refs advertised by the remote.
While this looks confusing, this was documented and tested by
|
|
Derrick Stolee | 916d0626c2 |
maintenance: use pointers to check --auto
The 'git maintenance run' command has an '--auto' option. This is used by other Git commands such as 'git commit' or 'git fetch' to check if maintenance should be run after adding data to the repository. Previously, this --auto option was only used to add the argument to the 'git gc' command as part of the 'gc' task. We will be expanding the other tasks to perform a check to see if they should do work as part of the --auto flag, when they are enabled by config. First, update the 'gc' task to perform the auto check inside the maintenance process. This prevents running an extra 'git gc --auto' command when not needed. It also shows a model for other tasks. Second, use the 'auto_condition' function pointer as a signal for whether we enable the maintenance task under '--auto'. For instance, we do not want to enable the 'fetch' task in '--auto' mode, so that function pointer will remain NULL. Now that we are not automatically calling 'git gc', a test in t5514-fetch-multiple.sh must be changed to watch for 'git maintenance' instead. We continue to pass the '--auto' option to the 'git gc' command when necessary, because of the gc.autoDetach config option changes behavior. Likely, we will want to absorb the daemonizing behavior implied by gc.autoDetach as a maintenance.autoDetach config option. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
|
Derrick Stolee | 65d655b52d |
maintenance: create maintenance.<task>.enabled config
Currently, a normal run of "git maintenance run" will only run the 'gc' task, as it is the only one enabled. This is mostly for backwards- compatible reasons since "git maintenance run --auto" commands replaced previous "git gc --auto" commands after some Git processes. Users could manually run specific maintenance tasks by calling "git maintenance run --task=<task>" directly. Allow users to customize which steps are run automatically using config. The 'maintenance.<task>.enabled' option then can turn on these other tasks (or turn off the 'gc' task). Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
|
Derrick Stolee | 090511bc0b |
maintenance: add --task option
A user may want to only run certain maintenance tasks in a certain order. Add the --task=<task> option, which allows a user to specify an ordered list of tasks to run. These cannot be run multiple times, however. Here is where our array of maintenance_task pointers becomes critical. We can sort the array of pointers based on the task order, but we do not want to move the struct data itself in order to preserve the hashmap references. We use the hashmap to match the --task=<task> arguments into the task struct data. Keep in mind that the 'enabled' member of the maintenance_task struct is a placeholder for a future 'maintenance.<task>.enabled' config option. Thus, we use the 'enabled' member to specify which tasks are run when the user does not specify any --task=<task> arguments. The 'enabled' member should be ignored if --task=<task> appears. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
|
Derrick Stolee | 663b2b1b90 |
maintenance: add commit-graph task
The first new task in the 'git maintenance' builtin is the 'commit-graph' task. This updates the commit-graph file incrementally with the command git commit-graph write --reachable --split By writing an incremental commit-graph file using the "--split" option we minimize the disruption from this operation. The default behavior is to merge layers until the new "top" layer is less than half the size of the layer below. This provides quick writes most of the time, with the longer writes following a power law distribution. Most importantly, concurrent Git processes only look at the commit-graph-chain file for a very short amount of time, so they will verly likely not be holding a handle to the file when we try to replace it. (This only matters on Windows.) If a concurrent process reads the old commit-graph-chain file, but our job expires some of the .graph files before they can be read, then those processes will see a warning message (but not fail). This could be avoided by a future update to use the --expire-time argument when writing the commit-graph. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
|
Derrick Stolee | 3ddaad0e06 |
maintenance: add --quiet option
Maintenance activities are commonly used as steps in larger scripts. Providing a '--quiet' option allows those scripts to be less noisy when run on a terminal window. Turn this mode on by default when stderr is not a terminal. Pipe the option to the 'git gc' child process. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |
|
Derrick Stolee | 2057d75038 |
maintenance: create basic maintenance runner
The 'gc' builtin is our current entrypoint for automatically maintaining a repository. This one tool does many operations, such as repacking the repository, packing refs, and rewriting the commit-graph file. The name implies it performs "garbage collection" which means several different things, and some users may not want to use this operation that rewrites the entire object database. Create a new 'maintenance' builtin that will become a more general- purpose command. To start, it will only support the 'run' subcommand, but will later expand to add subcommands for scheduling maintenance in the background. For now, the 'maintenance' builtin is a thin shim over the 'gc' builtin. In fact, the only option is the '--auto' toggle, which is handed directly to the 'gc' builtin. The current change is isolated to this simple operation to prevent more interesting logic from being lost in all of the boilerplate of adding a new builtin. Use existing builtin/gc.c file because we want to share code between the two builtins. It is possible that we will have 'maintenance' replace the 'gc' builtin entirely at some point, leaving 'git gc' as an alias for some specific arguments to 'git maintenance run'. Create a new test_subcommand helper that allows us to test if a certain subcommand was run. It requires storing the GIT_TRACE2_EVENT logs in a file. A negation mode is available that will be used in later tests. Helped-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> |