More header clean-up.
* en/header-split-cache-h-part-2: (22 commits)
reftable: ensure git-compat-util.h is the first (indirect) include
diff.h: reduce unnecessary includes
object-store.h: reduce unnecessary includes
commit.h: reduce unnecessary includes
fsmonitor: reduce includes of cache.h
cache.h: remove unnecessary headers
treewide: remove cache.h inclusion due to previous changes
cache,tree: move basic name compare functions from read-cache to tree
cache,tree: move cmp_cache_name_compare from tree.[ch] to read-cache.c
hash-ll.h: split out of hash.h to remove dependency on repository.h
tree-diff.c: move S_DIFFTREE_IFXMIN_NEQ define from cache.h
dir.h: move DTYPE defines from cache.h
versioncmp.h: move declarations for versioncmp.c functions from cache.h
ws.h: move declarations for ws.c functions from cache.h
match-trees.h: move declarations for match-trees.c functions from cache.h
pkt-line.h: move declarations for pkt-line.c functions from cache.h
base85.h: move declarations for base85.c functions from cache.h
copy.h: move declarations for copy.c functions from cache.h
server-info.h: move declarations for server-info.c functions from cache.h
packfile.h: move pack_window and pack_entry from cache.h
...
The code to parse capability list for v0 on-wire protocol fell into
an infinite loop when a capability appears multiple times, which
has been corrected.
* jk/protocol-cap-parse-fix:
v0 protocol: use size_t for capability length/offset
t5512: test "ls-remote --heads --symref" filtering with v0 and v2
t5512: allow any protocol version for filtered symref test
t5512: add v2 support for "ls-remote --symref" test
v0 protocol: fix sha1/sha256 confusion for capabilities^{}
t5512: stop referring to "v1" protocol
v0 protocol: fix infinite loop when parsing multi-valued capabilities
When parsing server capabilities, we use "int" to store lengths and
offsets. At first glance this seems like a spot where our parser may be
confused by integer overflow if somebody sent us a malicious response.
In practice these strings are all bounded by the 64k limit of a
pkt-line, so using "int" is OK. However, it makes the code simpler to
audit if they just use size_t everywhere. Note that because we take
these parameters as pointers, this also forces many callers to update
their declared types.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Several files were including cache.h solely to get other headers, such
as trace.h and trace2.h. Since the last few commits have modified
files to make these dependencies more explicit, the inclusion of cache.h
is no longer needed in several cases. Remove it.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Dozens of files made use of trace and trace2 functions, without
explicitly including trace.h or trace2.h. This made it more difficult
to find which files could remove a dependence on cache.h. Make C files
explicitly include trace.h or trace2.h if they are using them.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Split key function and data structure definitions out of cache.h to
new header files and adjust the users.
* en/header-split-cleanup:
csum-file.h: remove unnecessary inclusion of cache.h
write-or-die.h: move declarations for write-or-die.c functions from cache.h
treewide: remove cache.h inclusion due to setup.h changes
setup.h: move declarations for setup.c functions from cache.h
treewide: remove cache.h inclusion due to environment.h changes
environment.h: move declarations for environment.c functions from cache.h
treewide: remove unnecessary includes of cache.h
wrapper.h: move declarations for wrapper.c functions from cache.h
path.h: move function declarations for path.c functions from cache.h
cache.h: remove expand_user_path()
abspath.h: move absolute path functions from cache.h
environment: move comment_line_char from cache.h
treewide: remove unnecessary cache.h inclusion from several sources
treewide: remove unnecessary inclusion of gettext.h
treewide: be explicit about dependence on gettext.h
treewide: remove unnecessary cache.h inclusion from a few headers
Code clean-up around the use of the_repository.
* ab/remove-implicit-use-of-the-repository:
libs: use "struct repository *" argument, not "the_repository"
post-cocci: adjust comments for recent repo_* migration
cocci: apply the "revision.h" part of "the_repository.pending"
cocci: apply the "rerere.h" part of "the_repository.pending"
cocci: apply the "refs.h" part of "the_repository.pending"
cocci: apply the "promisor-remote.h" part of "the_repository.pending"
cocci: apply the "packfile.h" part of "the_repository.pending"
cocci: apply the "pretty.h" part of "the_repository.pending"
cocci: apply the "object-store.h" part of "the_repository.pending"
cocci: apply the "diff.h" part of "the_repository.pending"
cocci: apply the "commit.h" part of "the_repository.pending"
cocci: apply the "commit-reach.h" part of "the_repository.pending"
cocci: apply the "cache.h" part of "the_repository.pending"
cocci: add missing "the_repository" macros to "pending"
cocci: sort "the_repository" rules by header
cocci: fix incorrect & verbose "the_repository" rules
cocci: remove dead rule from "the_repository.pending.cocci"
* ab/remove-implicit-use-of-the-repository:
libs: use "struct repository *" argument, not "the_repository"
post-cocci: adjust comments for recent repo_* migration
cocci: apply the "revision.h" part of "the_repository.pending"
cocci: apply the "rerere.h" part of "the_repository.pending"
cocci: apply the "refs.h" part of "the_repository.pending"
cocci: apply the "promisor-remote.h" part of "the_repository.pending"
cocci: apply the "packfile.h" part of "the_repository.pending"
cocci: apply the "pretty.h" part of "the_repository.pending"
cocci: apply the "object-store.h" part of "the_repository.pending"
cocci: apply the "diff.h" part of "the_repository.pending"
cocci: apply the "commit.h" part of "the_repository.pending"
cocci: apply the "commit-reach.h" part of "the_repository.pending"
cocci: apply the "cache.h" part of "the_repository.pending"
cocci: add missing "the_repository" macros to "pending"
cocci: sort "the_repository" rules by header
cocci: fix incorrect & verbose "the_repository" rules
cocci: remove dead rule from "the_repository.pending.cocci"
Apply the part of "the_repository.pending.cocci" pertaining to
"object-store.h".
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
With the change in the last commit to move several functions to
write-or-die.h, csum-file.h no longer needs to include cache.h.
However, removing that include forces several other C files, which
directly or indirectly dependend upon csum-file.h's inclusion of
cache.h, to now be more explicit about their dependencies.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Dozens of files made use of gettext functions, without explicitly
including gettext.h. This made it more difficult to find which files
could remove a dependence on cache.h. Make C files explicitly include
gettext.h if they are using it.
However, while compat/fsmonitor/fsm-ipc-darwin.c should also gain an
include of gettext.h, it was left out to avoid conflicting with an
in-flight topic.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
More work towards -Wunused.
* jk/unused-post-2.39-part2: (21 commits)
help: mark unused parameter in git_unknown_cmd_config()
run_processes_parallel: mark unused callback parameters
userformat_want_item(): mark unused parameter
for_each_commit_graft(): mark unused callback parameter
rewrite_parents(): mark unused callback parameter
fetch-pack: mark unused parameter in callback function
notes: mark unused callback parameters
prio-queue: mark unused parameters in comparison functions
for_each_object: mark unused callback parameters
list-objects: mark unused callback parameters
mark unused parameters in signal handlers
run-command: mark error routine parameters as unused
mark "pointless" data pointers in callbacks
ref-filter: mark unused callback parameters
http-backend: mark unused parameters in virtual functions
http-backend: mark argc/argv unused
object-name: mark unused parameters in disambiguate callbacks
serve: mark unused parameters in virtual functions
serve: use repository pointer to get config
ls-refs: drop config caching
...
The for_each_cached_alternate() interface requires a callback that takes
a negotiator parameter, but not all implementations need it. Mark the
unused one as such to appease -Wunused-parameter.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This allows us to replace includes of cache.h with includes of the much
smaller alloc.h in many places. It does mean that we also need to add
includes of alloc.h in a number of C files.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The server_supports_v2() helper lets a caller find out if the server
supports a feature, and will optionally die if it's not supported. This
makes the return value confusing, as it's only meaningful when the
function is not asked to die.
Coverity flagged a new call like:
/* check that we support "foo" */
server_supports_v2("foo", 1);
complaining that we usually checked the return value, but this time we
didn't. But this call is correct, and other ones that did:
if (server_supports_v2("foo", 1))
do_something_with_foo();
are "wrong", in the sense that we know the conditional will always be
true (but there's no bug; the code is simply misleading).
Let's split the "die" behavior into its own function which returns void,
and modify each caller to use the correct one.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Undoes 'jk/unused-annotation' topic and redoes it to work around
Coccinelle rules misfiring false positives in unrelated codepaths.
* ab/unused-annotation:
git-compat-util.h: use "deprecated" for UNUSED variables
git-compat-util.h: use "UNUSED", not "UNUSED(var)"
Annotate function parameters that are not used (but cannot be
removed for structural reasons), to prepare us to later compile
with -Wunused warning turned on.
* jk/unused-annotation:
is_path_owned_by_current_uid(): mark "report" parameter as unused
run-command: mark unused async callback parameters
mark unused read_tree_recursive() callback parameters
hashmap: mark unused callback parameters
config: mark unused callback parameters
streaming: mark unused virtual method parameters
transport: mark bundle transport_options as unused
refs: mark unused virtual method parameters
refs: mark unused reflog callback parameters
refs: mark unused each_ref_fn parameters
git-compat-util: add UNUSED macro
As reported in [1] the "UNUSED(var)" macro introduced in
2174b8c75de (Merge branch 'jk/unused-annotation' into next,
2022-08-24) breaks coccinelle's parsing of our sources in files where
it occurs.
Let's instead partially go with the approach suggested in [2] of
making this not take an argument. As noted in [1] "coccinelle" will
ignore such tokens in argument lists that it doesn't know about, and
it's less of a surprise to syntax highlighters.
This undoes the "help us notice when a parameter marked as unused is
actually use" part of 9b24034754 (git-compat-util: add UNUSED macro,
2022-08-19), a subsequent commit will further tweak the macro to
implement a replacement for that functionality.
1. https://lore.kernel.org/git/220825.86ilmg4mil.gmgdl@evledraar.gmail.com/
2. https://lore.kernel.org/git/220819.868rnk54ju.gmgdl@evledraar.gmail.com/
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The start_async(), etc, functions need a "proc" callback that conforms
to a particular interface. Not every callback needs every parameter
(e.g., the caller might not even ask to open an input descriptor, in
which case there is no point in the callback looking at it). Let's mark
these for -Wunused-parameter.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Functions used with for_each_ref(), etc, need to conform to the
each_ref_fn interface. But most of them don't need every parameter;
let's annotate the unused ones to quiet -Wunused-parameter.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Currently, negotiation for V0/V1/V2 fetch have trace2 regions covering
the entire negotiation process. However, we'd like additional data, such
as timing for each round of negotiation or the number of "haves" in each
round. Additionally, "independent negotiation" (AKA push negotiation)
has no tracing at all. Having this data would allow us to compare the
performance of the various negotation implementations, and to debug
unexpectedly slow fetch & push sessions.
Add per-round trace2 regions for all negotiation implementations (V0+V1,
V2, and independent negotiation), as well as an overall region for
independent negotiation. Add trace2 data logging for the number of haves
and "in vain" objects for each round, and for the total number of rounds
once negotiation completes. Finally, add a few checks into various
tests to verify that the number of rounds is logged as expected.
Signed-off-by: Josh Steadmon <steadmon@google.com>
Acked-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git fetch" client logs the partial clone filter used in the trace2
output.
* jt/fetch-pack-trace2-filter-spec:
fetch-pack: write effective filter to trace2
Make our mergesort implementation type-safe.
* rs/mergesort:
mergesort: remove llist_mergesort()
packfile: use DEFINE_LIST_SORT
fetch-pack: use DEFINE_LIST_SORT
commit: use DEFINE_LIST_SORT
blame: use DEFINE_LIST_SORT
test-mergesort: use DEFINE_LIST_SORT
test-mergesort: use DEFINE_LIST_SORT_DEBUG
mergesort: add macros for typed sort of linked lists
mergesort: tighten merge loop
mergesort: unify ranks loops
Administrators of a managed Git environment (like the one at $DAYJOB)
might want to quantify the performance change of fetches with and
without filters from the client's point of view, and also detect if a
server does not support it. Therefore, log the filter information being
sent to the server whenever a fetch (or clone) occurs. Note that this is
not necessarily the same as what's specified on the CLI, because during
a fetch, the configured filter is used whenever a filter is not
specified on the CLI.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Build a static typed ref sorting function using DEFINE_LIST_SORT along
with a typed comparison function near its only two callers instead of
having an exported version that calls llist_mergesort(). This gets rid
of the next pointer accessor functions and their calling overhead at the
cost of a slightly increased object text size.
Before:
__TEXT __DATA __OBJC others dec hex
23231 389 0 113689 137309 2185d fetch-pack.o
29158 80 0 146864 176102 2afe6 remote.o
With this patch:
__TEXT __DATA __OBJC others dec hex
23591 389 0 117759 141739 229ab fetch-pack.o
29070 80 0 145718 174868 2ab14 remote.o
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Preliminary code refactoring around transport and bundle code.
* ds/bundle-uri:
bundle.h: make "fd" version of read_bundle_header() public
remote: allow relative_url() to return an absolute url
remote: move relative_url()
http: make http_get_file() external
fetch-pack: move --keep=* option filling to a function
fetch-pack: add a deref_without_lazy_fetch_extended()
dir API: add a generalized path_match_flags() function
connect.c: refactor sending of agent & object-format
"git fetch" unnecessarily failed when an unexpected optional
section appeared in the output, which has been corrected.
* jt/fetch-peek-optional-section:
fetch-pack: make unexpected peek result non-fatal
Move the populating of the --keep=* option argument to "index-pack" to
a static function, a subsequent commit will make use of it in another
function.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add a version of the deref_without_lazy_fetch function which can be
called with custom oi_flags and to grab information about the
"object_type". This will be used for the bundle-uri client in a
subsequent commit.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When a Git server responds to a fetch request, it may send optional
sections before the packfile section. To handle this, the Git client
calls packet_reader_peek() (see process_section_header()) in order to
see what's next without consuming the line.
However, as implemented, Git errors out whenever what's peeked is not an
ordinary line. This is not only unexpected (here, we only need to know
whether the upcoming line is the section header we want) but causes
errors to include the name of a section header that is irrelevant to the
cause of the error. For example, at $DAYJOB, we have seen "fatal: error
reading section header 'shallow-info'" error messages when none of the
repositories involved are shallow.
Therefore, fix this so that the peek returns 1 if the upcoming line is
the wanted section header and nothing else. Because of this change,
reader->line may now be NULL later in the function, so update the error
message printing code accordingly.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Allow a "refetch" where the contents of the local object store are
ignored and a full fetch is performed, not attempting to find or
negotiate common commits with the remote.
A key use case is to apply a new partial clone blob/tree filter and
refetch all the associated matching content, which would otherwise not
be transferred when the commit objects are already present locally.
Signed-off-by: Robert Coup <robert@coup.net.nz>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A couple of optimization to "git fetch".
* ps/fetch-optim-with-commit-graph:
fetch: skip computing output width when not printing anything
fetch-pack: use commit-graph when computing cutoff
The protocol keyword 'ready' isn't meant for translation. Pass it as
parameter instead of spell it in die() message (and potentially confuse
translators).
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
During packfile negotiation we iterate over all refs announced by the
remote side to check whether their IDs refer to commits already known to
us. If a commit is known to us already, then its date is a potential
cutoff point for commits we have in common with the remote side.
There is potentially a lot of commits announced by the remote depending
on how many refs there are in the remote repository, and for every one
of them we need to search for it in our object database and, if found,
parse the corresponding object to find out whether it is a candidate for
the cutoff date. This can be sped up by trying to look up commits via
the commit-graph first, which is a lot more efficient.
Benchmarks in a repository with about 2,1 million refs and an up-to-date
commit-graph show an almost 20% speedup when mirror-fetching:
Benchmark 1: git fetch +refs/*:refs/* (v2.35.0)
Time (mean ± σ): 115.587 s ± 2.009 s [User: 109.874 s, System: 11.305 s]
Range (min … max): 113.584 s … 118.820 s 5 runs
Benchmark 2: git fetch +refs/*:refs/* (HEAD)
Time (mean ± σ): 96.859 s ± 0.624 s [User: 91.948 s, System: 10.980 s]
Range (min … max): 96.180 s … 97.875 s 5 runs
Summary
'git fetch +refs/*:refs/* (HEAD)' ran
1.19 ± 0.02 times faster than 'git fetch +refs/*:refs/* (v2.35.0)'
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
They are all replaced by "the option '%s' requires '%s'", which is a
new string but replaces 17 previous unique strings.
Signed-off-by: Jean-Noël Avila <jn.avila@free.fr>
Reviewed-by: Johannes Sixt <j6t@kdbg.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git fetch", when received a bad packfile, can fail with SIGPIPE.
This wasn't wrong per-se, but we now detect the situation and fail
in a more predictable way.
* jk/fetch-pack-avoid-sigpipe-to-index-pack:
fetch-pack: ignore SIGPIPE when writing to index-pack
When fetching, we send the incoming pack to index-pack (or
unpack-objects) via the sideband demuxer. If index-pack hits an error
(e.g., because an object fails fsck), then it will die immediately. This
may cause us to get SIGPIPE on the fetch, as we're still trying to write
pack contents from the sideband demuxer (which is typically a thread,
and thus takes down the whole fetch process).
You can see this in action with:
./t5702-protocol-v2.sh --stress --run=59
which ends with (wrapped for readability):
test_must_fail: died by signal 13: git -c protocol.version=2 \
-c transfer.fsckobjects=1 -c fetch.uriprotocols=http,https \
clone http://127.0.0.1:5708/smart/http_parent http_child
not ok 59 - packfile-uri with transfer.fsckobjects fails on bad object
This is mostly cosmetic. The actual error of interest (in this case, the
object that failed the fsck check) comes from index-pack straight to
stderr, so the user still sees it. They _might_ even see fetch-pack
complaining about index-pack failing, because the main thread is racing
with the sideband-demuxer. But they'll definitely see the signal death
in the exit code, which is what the test is complaining about.
We can make this more predictable by just ignoring SIGPIPE. The sideband
demuxer uses write_or_die(), so it will notice and stop (gracefully,
because we hook die_routine() to exit just the thread). And during this
section we're not writing anywhere else where we'd be concerned about
SIGPIPE preventing us from wasting effort writing to nowhere.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In some setups, packfile uris act as bearer token. It is not
recommended to expose them plainly in logs, although in special
circunstances (e.g. debug) it makes sense to write them.
Redact the packfile URL paths by default, unless the GIT_TRACE_REDACT
variable is set to false. This mimics the redacting of the Authorization
header in HTTP.
Signed-off-by: Ivan Frade <ifrade@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In order to negotiate a packfile, we need to dereference refs to see
which commits we have in common with the remote. To do so, we first look
up the object's type -- if it's a tag, we peel until we hit a non-tag
object. If we hit a commit eventually, then we return that commit.
In case the object ID points to a commit directly, we can avoid the
initial lookup of the object type by opportunistically looking up the
commit via the commit-graph, if available, which gives us a slight speed
bump of about 2% in a huge repository with about 2.3M refs:
Benchmark #1: HEAD~: git-fetch
Time (mean ± σ): 31.634 s ± 0.258 s [User: 28.400 s, System: 5.090 s]
Range (min … max): 31.280 s … 31.896 s 5 runs
Benchmark #2: HEAD: git-fetch
Time (mean ± σ): 31.129 s ± 0.543 s [User: 27.976 s, System: 5.056 s]
Range (min … max): 30.172 s … 31.479 s 5 runs
Summary
'HEAD: git-fetch' ran
1.02 ± 0.02 times faster than 'HEAD~: git-fetch'
In case this fails, we fall back to the old code which peels the
objects to a commit.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The object ID iterator used by the connectivity checks returns the next
object ID via an out-parameter and then uses a return code to indicate
whether an item was found. This is a bit roundabout: instead of a
separate error code, we can just return the next object ID directly and
use `NULL` pointers as indicator that the iterator got no items left.
Furthermore, this avoids a copy of the object ID.
Refactor the iterator and all its implementations to return object IDs
directly. This brings a tiny performance improvement when doing a mirror-fetch of a repository with about 2.3M refs:
Benchmark #1: 328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch
Time (mean ± σ): 30.110 s ± 0.148 s [User: 27.161 s, System: 5.075 s]
Range (min … max): 29.934 s … 30.406 s 10 runs
Benchmark #2: 328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch
Time (mean ± σ): 29.899 s ± 0.109 s [User: 26.916 s, System: 5.104 s]
Range (min … max): 29.696 s … 29.996 s 10 runs
Summary
'328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch' ran
1.01 ± 0.01 times faster than '328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch'
While this 1% speedup could be labelled as statistically insignificant,
the speedup is consistent on my machine. Furthermore, this is an end to
end test, so it is expected that the improvement in the connectivity
check itself is more significant.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Loading of ref tips to prepare for common ancestry negotiation in
"git fetch-pack" has been optimized by taking advantage of the
commit graph when available.
* ps/fetch-pack-load-refs-optim:
fetch-pack: speed up loading of refs via commit graph
When doing reference negotiation, git-fetch-pack(1) is loading all refs
from disk in order to determine which commits it has in common with the
remote repository. This can be quite expensive in repositories with many
references though: in a real-world repository with around 2.2 million
refs, fetching a single commit by its ID takes around 44 seconds.
Dominating the loading time is decompression and parsing of the objects
which are referenced by commits. Given the fact that we only care about
commits (or tags which can be peeled to one) in this context, there is
thus an easy performance win by switching the parsing logic to make use
of the commit graph in case we have one available. Like this, we avoid
hitting the object database to parse these commits but instead only load
them from the commit-graph. This results in a significant performance
boost when executing git-fetch in said repository with 2.2 million refs:
Benchmark #1: HEAD~: git fetch $remote $commit
Time (mean ± σ): 44.168 s ± 0.341 s [User: 42.985 s, System: 1.106 s]
Range (min … max): 43.565 s … 44.577 s 10 runs
Benchmark #2: HEAD: git fetch $remote $commit
Time (mean ± σ): 19.498 s ± 0.724 s [User: 18.751 s, System: 0.690 s]
Range (min … max): 18.629 s … 20.454 s 10 runs
Summary
'HEAD: git fetch $remote $commit' ran
2.27 ± 0.09 times faster than 'HEAD~: git fetch $remote $commit'
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When fetching with the v0 protocol over ssh (or a local upload-pack with
pipes), the server closes the connection as soon as it is finished
sending the pack. So even though the client may still be operating on
the data via index-pack (e.g., resolving deltas, checking connectivity,
etc), the server has released all resources.
With the v2 protocol, however, the server considers the ssh session only
as a transport, with individual requests coming over it. After sending
the pack, it goes back to its main loop, waiting for another request to
come from the client. As a result, the ssh session hangs around until
the client process ends, which may be much later (because resolving
deltas, etc, may consume a lot of CPU).
This is bad for two reasons:
- it's consuming resources on the server to leave open a connection
that won't see any more use
- if something bad happens to the ssh connection in the meantime (say,
it gets killed by the network because it's idle, as happened in a
real-world report), then ssh will exit non-zero, and we'll propagate
the error up the stack.
The server is correct here not to hang up after serving the pack. The v2
protocol's design is meant to allow multiple requests like this, and
hanging up would be the wrong thing for a hypothetical client which was
planning to make more requests (though in practice, the git.git client
never would, and I doubt any other implementations would either).
The right thing is instead for the client to signal to the server that
it's not interested in making more requests. We can do that by closing
the pipe descriptor we use to write to ssh. This will propagate to the
server upload-pack as an EOF when it tries to read the next request (and
then it will close its half, and the whole connection will go away).
It's important to do this "half duplex" shutdown, because we have to do
it _before_ we actually receive the pack. This is an artifact of the way
fetch-pack and index-pack (or unpack-objects) interact. We hand the
connection off to index-pack (really, a sideband demuxer which feeds
it), and then wait until it returns. And it doesn't do that until it has
resolved all of the deltas in the pack, even though it was done reading
from the server long before.
So just closing the connection fully after index-pack returns would be
too late; we'd have held it open much longer than was necessary. And
teaching index-pack to close the connection is awkward. It's not even
seeing the whole conversation (the sideband demuxer is, but it doesn't
actually know what's in the packets, or when the end comes).
Note that this close() is happening deep within the transport code. It's
possible that a caller would want to perform other operations over the
same ssh transport after receiving the pack. But as of the current code,
none of the callers do, and there haven't been discussions of any plans
to change this. If we need to support that later, we can probably do so
by passing down a flag for "you're the last request on the transport;
it's OK to close" instead of the code just assuming that's true.
The description above all discusses v2 ssh, so it's worth thinking about
how this interacts with other protocols:
- in v0 protocols, we could do the same half-duplex shutdown (it just
goes into the v0 do_fetch_pack() instead). This does work, but since
it doesn't have the same persistence problem in the first place,
there's little reason to change it at this point.
- local fetches against git-upload-pack on the same machine will
behave the same as ssh (they are talking over two pipes, and see EOF
on their input pipe)
- fetches against git-daemon will run this same code, and close one of
the descriptors. In practice, this won't do anything, since there
our two descriptors are dups of each other, and not part of a
half-duplex pair. The right thing would probably be to call
shutdown(SHUT_WR) on it. I didn't bother with that here. It doesn't
face the same error-code problem (since it's just a TCP connection),
so it's really only an optimization problem. And git:// is not that
widely used these days, and has less impact on server resources than
an ssh termination.
- v2 http doesn't suffer from this problem in the first place, as our
pipes terminate at a local git-remote-https, which is passing data
along as individual requests via curl. Probably curl is keeping the
TCP/TLS connection open for more requests, and we might be able to
tell it manually "hey, we are done making requests now". But I think
that's much less important. It again doesn't suffer from the
error-code problem, and HTTP keepalive is pretty well understood
(importantly, the timeouts can be set low, because clients like curl
know how to reconnect for subsequent requests if necessary). So it's
probably not worth figuring out how to tell curl that we're done
(though if we do, this patch is probably the first step anyway;
fetch-pack closes the pipe back to remote-https, which would be the
signal that it should tell curl we're done).
The code is pretty straightforward. We close the pipe at the right
moment, and set it to -1 to mark it as invalid. I modified the later
cleanup code to avoid calling close(-1). That's not strictly necessary,
since close(-1) is a noop, but hopefully makes things a bit more obvious
to a reader.
I suspect that trying to call more transport functions after the close()
(e.g., calling transport_fetch_refs() again) would fail, as it's not
smart enough to realize we need to re-open the ssh connection. But
that's already true when v0 is in use. And no current callers want to do
that (and again, the solution is probably a flag in the transport code
to keep things open, which can be added later).
There's no test here, as the situation it covers is inherently racy (the
question is when upload-pack exits, compared to when index-pack finishes
resolving deltas and exits). The rather gross shell snippet below does
recreate the problematic situation; when run on a sufficiently-large
repository (git.git works fine), it kills an "idle" upload-pack while
the client is resolving deltas, leading to a failed clone.
(
git clone --no-local --progress . foo.git 2>&1
echo >&2 "clone exit code=$?"
) |
tr '\r' '\n' |
while read line
do
case "$done,$line" in
,Resolving*)
echo "hit resolving deltas; killing upload-pack"
killall -9 git-upload-pack
done=t
;;
esac
done
Reported-by: Greg Pflaum <greg.pflaum@pnp-hcl.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Currently, the packfile negotiation step within a Git fetch cannot be
done independent of sending the packfile, even though there is at least
one application wherein this is useful. Therefore, make it possible for
this negotiation step to be done independently. A subsequent commit will
use this for one such application - push negotiation.
This feature is for protocol v2 only. (An implementation for protocol v0
would require a separate implementation in the fetch, transport, and
transport helper code.)
In the protocol, the main hindrance towards independent negotiation is
that the server can unilaterally decide to send the packfile. This is
solved by a "wait-for-done" argument: the server will then wait for the
client to say "done". In practice, the client will never say it; instead
it will cease requests once it is satisfied.
In the client, the main change lies in the transport and transport
helper code. fetch_refs_via_pack() performs everything needed - protocol
version and capability checks, and the negotiation itself.
There are 2 code paths that do not go through fetch_refs_via_pack() that
needed to be individually excluded: the bundle transport (excluded
through requiring smart_options, which the bundle transport doesn't
support) and transport helpers that do not support takeover. If or when
we support independent negotiation for protocol v0, we will need to
modify these 2 code paths to support it. But for now, report failure if
independent negotiation is requested in these cases.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A subsequent commit will need this functionality independent of the rest
of send_fetch_request(), so put this into its own function.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>