When cloning an empty repository, protocol versions 0 and 1 currently
offer nothing but the header and flush packets for the /info/refs
endpoint. This means that no capabilities are provided, so the client
side doesn't know what capabilities are present.
However, this does pose a problem when working with SHA-256
repositories, since we use the capabilities to know the remote side's
object format (hash algorithm). As of 8b214c2e9d ("clone: propagate
object-format when cloning from void", 2023-04-05), this has been fixed
for protocol v2, since there we always read the hash algorithm from the
remote.
Fortunately, the push version of the protocol already offers a clue
for how to solve this. When the /info/refs endpoint is accessed for a
push and the remote is empty, we include a dummy "capabilities^{}" ref
pointing to the all-zeros object ID. The protocol documentation already
indicates this should _always_ be sent, even for fetches and clones, so
let's just do that, which means we'll properly announce the hash
algorithm as part of the capabilities. This just works with the
existing code because we share the same ref code for fetches and clones,
and libgit2, JGit, and dulwich do as well.
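For illustration, a v0 /info/refs advertisement from an empty SHA-256
repository would then look roughly like this (pkt-line framing and most
capabilities omitted; this is a sketch, not a byte-exact capture):

    <all-zeros object ID> capabilities^{}\0object-format=sha256 agent=git/2.x ...

The client can pick the object format out of that capability list even
though no real refs are advertised.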
There is one minor issue to fix, though. If we called send_ref with
namespaces, stripping the namespace from the capabilities entry would
return NULL, which would cause a crash. Instead, let's refactor out a
function to print
just the ref itself without stripping the namespace and use it for our
special capabilities entry.
Add several sets of tests for HTTP as well as for local clones. The
behavior can be slightly different for HTTP versus a local or SSH clone
because of the stateless-rpc functionality, so it's worth testing both.
Signed-off-by: brian m. carlson <bk2204@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The code to parse capability list for v0 on-wire protocol fell into
an infinite loop when a capability appeared multiple times, which
has been corrected.
* jk/protocol-cap-parse-fix:
v0 protocol: use size_t for capability length/offset
t5512: test "ls-remote --heads --symref" filtering with v0 and v2
t5512: allow any protocol version for filtered symref test
t5512: add v2 support for "ls-remote --symref" test
v0 protocol: fix sha1/sha256 confusion for capabilities^{}
t5512: stop referring to "v1" protocol
v0 protocol: fix infinite loop when parsing multi-valued capabilities
When parsing server capabilities, we use "int" to store lengths and
offsets. At first glance this seems like a spot where our parser may be
confused by integer overflow if somebody sent us a malicious response.
In practice these strings are all bounded by the 64k limit of a
pkt-line, so using "int" is OK. However, it makes the code simpler to
audit if they just use size_t everywhere. Note that because we take
these parameters as pointers, this also forces many callers to update
their declared types.
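For a flavor of the kind of code involved, a minimal sketch of such a
parser with size_t offsets and lengths might look like this (illustrative
only; the real parser in connect.c is structured differently):

    #include <string.h>

    /*
     * Illustrative only: find "name" as a whole word in the
     * space-separated capability list "caps"; on success report where
     * it starts and how long it is, using size_t throughout.
     */
    static int find_capability(const char *caps, const char *name,
                               size_t *offset, size_t *len)
    {
            size_t namelen = strlen(name);
            const char *p = caps;

            while ((p = strstr(p, name)) != NULL) {
                    if ((p == caps || p[-1] == ' ') &&
                        (p[namelen] == '\0' || p[namelen] == ' ' ||
                         p[namelen] == '=')) {
                            *offset = (size_t)(p - caps);
                            *len = namelen;
                            return 1;
                    }
                    /*
                     * Skip past this occurrence so that multi-valued
                     * capabilities cannot make us loop forever.
                     */
                    p += namelen;
            }
            return 0;
    }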
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
cache.h's nature as a dumping ground of includes prevented it from
being included in some compat/ files, forcing us into a workaround
of having a double forward declaration of the read_in_full() function
(see commit 14086b0a13 ("compat/pread.c: Add a forward declaration to
fix a warning", 2007-11-17)). Now that we have moved functions like
read_in_full() from cache.h to wrapper.h, and wrapper.h isn't littered
with unrelated and scary #defines, get rid of the extra forward
declaration and just have compat/pread.c include wrapper.h.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Several files were including cache.h solely to get other headers, such
as trace.h and trace2.h. Since the last few commits have modified
files to make these dependencies more explicit, the inclusion of cache.h
is no longer needed in several cases. Remove it.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Dozens of files made use of trace and trace2 functions, without
explicitly including trace.h or trace2.h. This made it more difficult
to find which files could remove a dependence on cache.h. Make C files
explicitly include trace.h or trace2.h if they are using them.
Signed-off-by: Elijah Newren <newren@gmail.com>
Acked-by: Calvin Wan <calvinwan@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Split key function and data structure definitions out of cache.h to
new header files and adjust the users.
* en/header-split-cleanup:
csum-file.h: remove unnecessary inclusion of cache.h
write-or-die.h: move declarations for write-or-die.c functions from cache.h
treewide: remove cache.h inclusion due to setup.h changes
setup.h: move declarations for setup.c functions from cache.h
treewide: remove cache.h inclusion due to environment.h changes
environment.h: move declarations for environment.c functions from cache.h
treewide: remove unnecessary includes of cache.h
wrapper.h: move declarations for wrapper.c functions from cache.h
path.h: move function declarations for path.c functions from cache.h
cache.h: remove expand_user_path()
abspath.h: move absolute path functions from cache.h
environment: move comment_line_char from cache.h
treewide: remove unnecessary cache.h inclusion from several sources
treewide: remove unnecessary inclusion of gettext.h
treewide: be explicit about dependence on gettext.h
treewide: remove unnecessary cache.h inclusion from a few headers
Code clean-up around the use of the_repository.
* ab/remove-implicit-use-of-the-repository:
libs: use "struct repository *" argument, not "the_repository"
post-cocci: adjust comments for recent repo_* migration
cocci: apply the "revision.h" part of "the_repository.pending"
cocci: apply the "rerere.h" part of "the_repository.pending"
cocci: apply the "refs.h" part of "the_repository.pending"
cocci: apply the "promisor-remote.h" part of "the_repository.pending"
cocci: apply the "packfile.h" part of "the_repository.pending"
cocci: apply the "pretty.h" part of "the_repository.pending"
cocci: apply the "object-store.h" part of "the_repository.pending"
cocci: apply the "diff.h" part of "the_repository.pending"
cocci: apply the "commit.h" part of "the_repository.pending"
cocci: apply the "commit-reach.h" part of "the_repository.pending"
cocci: apply the "cache.h" part of "the_repository.pending"
cocci: add missing "the_repository" macros to "pending"
cocci: sort "the_repository" rules by header
cocci: fix incorrect & verbose "the_repository" rules
cocci: remove dead rule from "the_repository.pending.cocci"
Apply the part of "the_repository.pending.cocci" pertaining to
"object-store.h".
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Dozens of files made use of gettext functions, without explicitly
including gettext.h. This made it more difficult to find which files
could remove a dependence on cache.h. Make C files explicitly include
gettext.h if they are using it.
However, while compat/fsmonitor/fsm-ipc-darwin.c should also gain an
include of gettext.h, it was left out to avoid conflicting with an
in-flight topic.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
More work towards -Wunused.
* jk/unused-post-2.39-part2: (21 commits)
help: mark unused parameter in git_unknown_cmd_config()
run_processes_parallel: mark unused callback parameters
userformat_want_item(): mark unused parameter
for_each_commit_graft(): mark unused callback parameter
rewrite_parents(): mark unused callback parameter
fetch-pack: mark unused parameter in callback function
notes: mark unused callback parameters
prio-queue: mark unused parameters in comparison functions
for_each_object: mark unused callback parameters
list-objects: mark unused callback parameters
mark unused parameters in signal handlers
run-command: mark error routine parameters as unused
mark "pointless" data pointers in callbacks
ref-filter: mark unused callback parameters
http-backend: mark unused parameters in virtual functions
http-backend: mark argc/argv unused
object-name: mark unused parameters in disambiguate callbacks
serve: mark unused parameters in virtual functions
serve: use repository pointer to get config
ls-refs: drop config caching
...
Each v2 "serve" action has a virtual function for advertising and
implementing the command. A few of these are so trivial that they don't
need to look at their parameters, especially the "repository" parameter.
We can mark them so that -Wunused-parameter doesn't complain.
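A minimal illustration of such an annotation (the function and field
names here are made up; only the UNUSED marker matters):

    static int trivial_advertise(struct repository *r UNUSED,
                                 struct strbuf *value)
    {
            if (value)
                    strbuf_addstr(value, "1");
            return 1;
    }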
Note that upload_pack_v2() probably _should_ be using its repository
pointer. But teaching the functions it calls to do so is non-trivial.
Even using it for something as simple as reading config is tricky, both
because it shares code with the v1 upload pack, and because the
git_protected_config() mechanism it uses does not have a repo-specific
interface. So we'll just annotate it for now, and cleaning it up can be
part of the larger work to drop references to the_repository.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A few of the v2 "serve" callbacks ignore their repository parameter and
read config using the_repository (either directly or implicitly by
calling wrapper functions). This isn't a bug since the server code only
handles a single main repository anyway (and indeed, if you look at the
callers, these repository parameters will always be the_repository). But
in the long run we want to get rid of the_repository, so let's take a
tiny step in that direction.
As a bonus, this silences some -Wunused-parameter warnings.
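A sketch of the pattern (the callback and config key here are
illustrative, not the actual ones touched by this patch):

    static int sample_advertise(struct repository *r, struct strbuf *value)
    {
            int advertise = 0;

            /* consult the repository we were handed, not the_repository */
            repo_config_get_bool(r, "example.advertise", &advertise);
            if (advertise && value)
                    strbuf_addstr(value, "1");
            return advertise;
    }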
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We're about to add a new argument to git-rev-list(1) that allows it to
add all references that are visible when taking `transfer.hideRefs` et
al into account. This will require us to potentially parse multiple sets
of hidden refs, which is not easily possible right now as there is only
a single, global instance of the list of parsed hidden refs.
Refactor `parse_hide_refs_config()` and `ref_is_hidden()` so that both
take the list of hidden references as input and adjust callers to keep a
local list, instead. This allows us to easily use multiple hidden-ref
lists. Furthermore, it allows us to properly free this list before we
exit.
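After the refactoring, a caller follows a pattern roughly like this (a
sketch; see the patch itself for the exact signatures):

    struct strvec hidden_refs = STRVEC_INIT;

    /* collect transfer.hideRefs et al. into a caller-owned list */
    parse_hide_refs_config(var, value, "uploadpack", &hidden_refs);
    ...
    /* consult the local list instead of a global one */
    if (ref_is_hidden(refname_nons, refname_full, &hidden_refs))
            return; /* pretend the ref does not exist */
    ...
    /* and release it before exiting */
    strvec_clear(&hidden_refs);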
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
A couple of bugfixes with code clean-up.
* jk/list-objects-filter-cleanup:
list-objects-filter: convert filter_spec to a strbuf
list-objects-filter: add and use initializers
list-objects-filter: handle null default filter spec
list-objects-filter: don't memset after releasing filter struct
Undoes 'jk/unused-annotation' topic and redoes it to work around
Coccinelle rules misfiring false positives in unrelated codepaths.
* ab/unused-annotation:
git-compat-util.h: use "deprecated" for UNUSED variables
git-compat-util.h: use "UNUSED", not "UNUSED(var)"
Annotate function parameters that are not used (but cannot be
removed for structural reasons), to prepare us to later compile
with -Wunused warning turned on.
* jk/unused-annotation:
is_path_owned_by_current_uid(): mark "report" parameter as unused
run-command: mark unused async callback parameters
mark unused read_tree_recursive() callback parameters
hashmap: mark unused callback parameters
config: mark unused callback parameters
streaming: mark unused virtual method parameters
transport: mark bundle transport_options as unused
refs: mark unused virtual method parameters
refs: mark unused reflog callback parameters
refs: mark unused each_ref_fn parameters
git-compat-util: add UNUSED macro
In 7e2619d8ff (list_objects_filter_options: plug leak of filter_spec
strings, 2022-09-08), we noted that the filter_spec string_list was
inconsistent in how it handled memory ownership of strings stored in the
list. The fix there was a bit of a band-aid to set the "strdup_strings"
variable right before adding anything.
That works OK, and it lets the users of the API continue to
zero-initialize the struct. But it makes the code a bit hard to follow
and accident-prone, as any other spots appending the filter_spec need to
think about whether to set the strdup_strings value, too (there's one
such spot in partial_clone_get_default_filter_spec(), which is probably
a memory leak).
So let's do that full cleanup now. We'll introduce a
LIST_OBJECTS_FILTER_INIT macro and matching function, and use them as
appropriate (even though they are for the "_options" struct, their naming
matches the corresponding list_objects_filter_release() function).
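The resulting API shape is roughly the following (a sketch, not the
verbatim header):

    /* at declaration time */
    struct list_objects_filter_options filter = LIST_OBJECTS_FILTER_INIT;

    /* or, for structs that cannot be initialized at their declaration */
    list_objects_filter_init(&filter);
    ...
    list_objects_filter_release(&filter);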
This is harder than it seems! Many other structs, like
git_transport_data, embed the filter struct. So they need to initialize
it themselves even if the rest of the enclosing struct is OK with
zero-initialization. I found all of the relevant spots by grepping
manually for declarations of list_objects_filter_options. And then doing
so recursively for structs which embed it, and ones which embed those,
and so on.
I'm pretty sure I got everything, but there's no change that would alert
the compiler if any topics in flight added new declarations. To catch
this case, we now double-check in the parsing function that things were
initialized as expected and BUG() if appropriate.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If the caller told us that they don't care about us checking the object
hash, then we're free to implement any optimizations that get us the
parsed value more quickly. An obvious one is to check the commit graph
before loading an object from disk. And in fact, both of the callers who
pass in this flag are already doing so before they call parse_object()!
So we can simplify those callers, as well as any possible future ones,
by moving the logic into parse_object().
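Inside parse_object() the moved logic boils down to something like this
(abridged sketch; "repl" is the replace-object-resolved oid discussed
below):

    if (skip_hash) {
            struct commit *commit = lookup_commit_in_graph(r, repl);
            if (commit)
                    return &commit->object;
    }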
There are two subtle things to note in the diff, but neither has any
impact in practice:
- it seems least-surprising here to do the graph lookup on the
git-replace'd oid, rather than the original. This is in theory a
change of behavior from the earlier code, as neither caller did a
replace lookup itself. But in practice it doesn't matter, as we
disable the commit graph entirely if there are any replace refs.
- the caller in get_reference() passes the skip_hash flag only if
revs->verify_objects isn't set, whereas it would look in the commit
graph unconditionally. In practice this should not matter as we
should disable the commit graph entirely when using verify_objects
(and that was done recently in another patch).
So this should be a pure cleanup with no behavior change.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Imagine we have a history with commit C pointing to a large blob B.
If a client asks us for C, we can generally serve both objects to them
without accessing the uncompressed contents of B. In upload-pack, we
figure out which commits we have and what the client has, and feed those
tips to pack-objects. In pack-objects, we traverse the commits and trees
(or use bitmaps!) to find the set of objects needed, but we never open
up B. When we serve it to the client, we can often pass the compressed
bytes directly from the on-disk packfile over the wire.
But if a client asks us directly for B, perhaps because they are doing
an on-demand fetch to fill in the missing blob of a partial clone, we
end up much slower. Upload-pack calls parse_object() on the oid we
receive, which opens up the object and re-checks its hash (even though
if it were a commit, we might skip this parse entirely in favor of the
commit graph!). And then we feed the oid directly to pack-objects, which
again calls parse_object() and opens the object. And then finally, when
we write out the result, we may send bytes straight from disk, but only
after having unnecessarily uncompressed and computed the sha1 of the
object twice!
This patch teaches both code paths to use the new SKIP_HASH_CHECK flag
for parse_object(). You can see the speed-up in p5600, which does a
blob:none clone followed by a checkout. The savings for git.git are
modest:
Test                           HEAD^              HEAD
----------------------------------------------------------------------
5600.3: checkout of result     2.23(4.19+0.24)    1.72(3.79+0.18) -22.9%
But the savings scale with the number of bytes. So on a repository like
linux.git with more files, we see more improvement (in both absolute and
relative numbers):
Test                           HEAD^                HEAD
----------------------------------------------------------------------------
5600.3: checkout of result     51.62(77.26+2.76)    34.86(61.41+2.63) -32.5%
And here's an even more extreme case. This is the android gradle-plugin
repository, whose tip checkout has ~3.7GB of files:
Test                           HEAD^                HEAD
--------------------------------------------------------------------------
5600.3: checkout of result     79.51(90.84+5.55)    40.28(51.88+5.67) -49.3%
Keep in mind that these timings are of the whole checkout operation. So
they count the client indexing the pack and actually writing out the
files. If we want to see just the server's view, we can hack up the
GIT_TRACE_PACKET output from those operations and replay it via
upload-pack. For the gradle example, that gives me:
Benchmark 1: GIT_PROTOCOL=version=2 git.old upload-pack ../gradle-plugin <input
Time (mean ± σ): 50.884 s ± 0.239 s [User: 51.450 s, System: 1.726 s]
Range (min … max): 50.608 s … 51.025 s 3 runs
Benchmark 2: GIT_PROTOCOL=version=2 git.new upload-pack ../gradle-plugin <input
Time (mean ± σ): 9.728 s ± 0.112 s [User: 10.466 s, System: 1.535 s]
Range (min … max): 9.618 s … 9.842 s 3 runs
Summary
'GIT_PROTOCOL=version=2 git.new upload-pack ../gradle-plugin <input' ran
5.23 ± 0.07 times faster than 'GIT_PROTOCOL=version=2 git.old upload-pack ../gradle-plugin <input'
So a server would see an 80% reduction in CPU serving the initial
checkout of a partial clone for this repository. Or possibly even more
depending on the packing; most of the time spent in the faster run was
on objects we had to open during the write phase.
In both cases skipping the extra hashing on the server should be pretty
safe. The client doesn't trust the server anyway, so it will re-hash all
of the objects via index-pack. There is one thing to note, though: the
change in get_reference() affects not just pack-objects, but rev-list,
git-log, etc. We could use a flag to limit to index-pack here, but we
may already skip hash checks in this instance. For commits, we'd skip
anything we load via the commit-graph. And while before this commit we
would check a blob fed directly to rev-list on the command-line, we'd
skip checking that same blob if we found it by traversing a tree.
The exception for both is if --verify-objects is used. In that case,
we'll skip this optimization, and the new test makes sure we do this
correctly.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As reported in [1] the "UNUSED(var)" macro introduced in
2174b8c75de (Merge branch 'jk/unused-annotation' into next,
2022-08-24) breaks coccinelle's parsing of our sources in files where
it occurs.
Let's instead partially go with the approach suggested in [2] of
making this not take an argument. As noted in [1] "coccinelle" will
ignore such tokens in argument lists that it doesn't know about, and
it's less of a surprise to syntax highlighters.
This undoes the "help us notice when a parameter marked as unused is
actually use" part of 9b24034754 (git-compat-util: add UNUSED macro,
2022-08-19), a subsequent commit will further tweak the macro to
implement a replacement for that functionality.
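The no-argument form boils down to something like this (a simplified
sketch, not the verbatim git-compat-util.h definition):

    #if defined(__GNUC__)
    #define UNUSED __attribute__((unused))
    #else
    #define UNUSED
    #endif

with the bare token placed next to the parameter, e.g. "int flags UNUSED".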
1. https://lore.kernel.org/git/220825.86ilmg4mil.gmgdl@evledraar.gmail.com/
2. https://lore.kernel.org/git/220819.868rnk54ju.gmgdl@evledraar.gmail.com/
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Functions used with for_each_ref(), etc, need to conform to the
each_ref_fn interface. But most of them don't need every parameter;
let's annotate the unused ones to quiet -Wunused-parameter.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Fix a memory leak that's been reported by some versions of "gcc" since
"output_state" became malloc'd in 55a9651d26 (upload-pack.c: increase
output buffer size, 2021-12-14).
In e75d2f7f73 (revisions API: have release_revisions() release
"filter", 2022-04-13) it was correctly marked as leak-free, the only
path through this function that doesn't reach the free(output_state)
is if we "goto fail", and that will invoke "die()".
Such leaks are not included with SANITIZE=leak (but e.g. valgrind will
still report them), but under some gcc optimization (I have not been
able to reproduce it with "clang") we'll report a leak here
anyway. E.g. gcc v12 with "-O2" and above will trigger it, but not
clang v13 with any "-On".
The GitHub CI would also run into this leak if the "linux-leaks" job
was made to run with "GIT_TEST_SANITIZE_LEAK_LOG=true".
See [1] for a past case where gcc had similar trouble analyzing leaks
involving a die() invocation in the function.
1. https://lore.kernel.org/git/patch-v3-5.6-9a44204c4c9-20211022T175227Z-avarab@gmail.com/
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
`uploadpack.packObjectsHook` is the only 'protected configuration only'
variable today, but we've noted that `safe.directory` and the upcoming
`safe.bareRepository` should also be 'protected configuration only'. So,
for consistency, we'd like to have a single implementation for protected
configuration.
The primary constraints are:
1. Reading from protected configuration should be fast. Nearly all "git"
commands inside a bare repository will read both `safe.directory` and
`safe.bareRepository`, so we cannot afford to be slow.
2. Protected configuration must be readable when the gitdir is not
known. `safe.directory` and `safe.bareRepository` both affect
repository discovery and the gitdir is not known at that point [1].
The chosen implementation in this commit is to read protected
configuration and cache the values in a global configset. This is
similar to the caching behavior we get with the_repository->config.
Introduce git_protected_config(), which reads protected configuration
and caches the values in the global configset protected_config. Then, refactor
`uploadpack.packObjectsHook` to use git_protected_config().
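The resulting call pattern in upload-pack.c looks roughly like this
(abridged sketch):

    static const char *pack_objects_hook;

    static int upload_pack_protected_config(const char *var,
                                            const char *value, void *cb_data)
    {
            if (!strcmp("uploadpack.packobjectshook", var))
                    return git_config_string(&pack_objects_hook, var, value);
            return 0;
    }

    ...
    git_protected_config(upload_pack_protected_config, NULL);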
The protected configuration functions are named similarly to their
non-protected counterparts, e.g. git_protected_config_check_init() vs
git_config_check_init().
In light of constraint 1, this implementation can still be improved.
git_protected_config() iterates through every variable in
protected_config, which is wasteful, but it makes the conversion simple
because it matches existing patterns. We will likely implement constant
time lookup functions for protected configuration in a future series
(such functions already exist for non-protected configuration, i.e.
repo_config_get_*()).
An alternative that avoids introducing another configset is to continue
to read all config using git_config(), but only accept values that have
the correct config scope [2]. This technically fulfills constraint 2,
because git_config() simply ignores the local and worktree config when
the gitdir is not known. However, this would read incomplete config into
the_repository->config, which would need to be reset when the gitdir is
known and git_config() needs to read the local and worktree config.
Resetting the_repository->config might be reasonable while we only have
these 'protected configuration only' variables, but it's not clear
whether this extends well to future variables.
[1] In this case, we do have a candidate gitdir though, so with a little
refactoring, it might be possible to provide a gitdir.
[2] This is how `uploadpack.packObjectsHook` was implemented prior to
this commit.
Signed-off-by: Glen Choo <chooglen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
During packfile negotiation the client will send "want" and "want-ref"
lines to the server to tell it which objects it is interested in. The
server side parses each of those and looks them up to see whether it
actually has the requested objects. This lookup is performed by calling
`parse_object()` directly, which thus hits the object database. In the
general case, though, most of the objects the client requests will be
commits. We can thus try to look up the object via the commit-graph
opportunistically, which is much faster than doing the same via the
object database.
Refactor parsing of both "want" and "want-ref" lines to do so.
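The opportunistic lookup amounts to something like this in the parsing
helpers (abridged sketch):

    struct commit *commit = lookup_commit_in_graph(the_repository, &oid);
    if (commit)
            o = &commit->object;
    else
            o = parse_object(the_repository, &oid);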
The following benchmark is executed in a repository with a huge number
of references. It uses a cached request from git-fetch(1) as input to
git-upload-pack(1) that contains about 876,000 "want" lines:
Benchmark 1: HEAD~
Time (mean ± σ): 7.113 s ± 0.028 s [User: 6.900 s, System: 0.662 s]
Range (min … max): 7.072 s … 7.168 s 10 runs
Benchmark 2: HEAD
Time (mean ± σ): 6.622 s ± 0.061 s [User: 6.452 s, System: 0.650 s]
Range (min … max): 6.535 s … 6.727 s 10 runs
Summary
'HEAD' ran
1.07 ± 0.01 times faster than 'HEAD~'
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When serving a fetch, git upload-pack copies data from a git
pack-objects stdout pipe to its stdout. This commit increases the size
of the buffer used for that copying from 8192 to 65515, the maximum
sideband-64k packet size.
Previously, this buffer was allocated on the stack. Because the new
buffer size is nearly 64KB, we switch this to a heap allocation.
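In code terms the change is roughly the following (field names are
illustrative; 65515 is the maximum sideband-64k payload mentioned above):

    struct output_state {
            char buffer[65515];   /* was an 8192-byte stack buffer */
            size_t used;
    };
    ...
    struct output_state *output_state = xcalloc(1, sizeof(*output_state));
    ...
    free(output_state);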
On GitLab.com we use GitLab's pack-objects cache which does writes of
65515 bytes. Because of the default 8KB buffer size, propagating these
cache writes requires 8 pipe reads and 8 pipe writes from
git-upload-pack, and 8 pipe reads from Gitaly (our Git RPC service).
If we increase the size of the buffer to the maximum Git packet size,
we need only 1 pipe read and 1 pipe write in git-upload-pack, and 1
pipe read in Gitaly to transfer the same amount of data. In benchmarks
with a pure fetch and 100% cache hit rate workload we are seeing CPU
utilization reductions of over 30%.
Signed-off-by: Jacob Vosmaer <jacob@gitlab.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Change a pattern of hardcoding an "argv" array size, populating it and
assigning to the "argv" member of "struct child_process" to instead
use "strvec_pushl()" to add data to the "args" member.
This implements the same behavior as before in fewer lines of code,
and moves us further towards being able to remove the "argv" member in
a subsequent commit.
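The pattern being converted looks roughly like this (argument values are
illustrative):

    /* before: a hardcoded array assigned to the "argv" member */
    const char *argv[] = { "pack-objects", "--stdout", NULL };
    cmd.argv = argv;

    /* after: push onto the "args" strvec instead */
    strvec_pushl(&cmd.args, "pack-objects", "--stdout", NULL);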
Since we've entirely removed the "argv" variable(s) we can be sure
that no potential logic errors of the type discussed in a preceding
commit are being introduced here, i.e. ones where the local "argv" was
being modified after the assignment to "struct child_process"'s
"argv".
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Code clean-up around "git serve".
* ab/serve-cleanup:
upload-pack: document and rename --advertise-refs
serve.[ch]: remove "serve_options", split up --advertise-refs code
{upload,receive}-pack tests: add --advertise-refs tests
serve.c: move version line to advertise_capabilities()
serve: move transfer.advertiseSID check into session_id_advertise()
serve.[ch]: don't pass "struct strvec *keys" to commands
serve: use designated initializers
transport: use designated initializers
transport: rename "fetch" in transport_vtable to "fetch_refs"
serve: mark has_capability() as static
Reduce number of write(2) system calls while sending the
ref advertisement.
* jv/pkt-line-batch:
upload-pack: use stdio in send_ref callbacks
pkt-line: add stdio packet write functions
In both protocol v0 and v2, upload-pack writes one pktline packet per
advertised ref to stdout. That means one or two write(2) syscalls per
ref. This is problematic if these writes become network sends with
high overhead.
This commit changes both send_ref callbacks to use buffered IO using
stdio.
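In the send_ref callbacks the change amounts to something like this
(sketch; it relies on the stdio packet helpers added by the pkt-line
patch in this series):

    /* format each ref into the stdio buffer... */
    packet_fwrite_fmt(stdout, "%s %s\n", oid_to_hex(oid), refname);
    ...
    /* ...and flush once the whole advertisement has been written */
    packet_fflush(stdout);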
To give an example of the impact: I set up a single-threaded loop that
calls ls-remote (with HTTP and protocol v2) on a local GitLab
instance, on a repository with 11K refs. When I switch from Git
v2.32.0 to this patch, I see a 40% reduction in CPU time for Git, and
65% for Gitaly (GitLab's Git RPC service).
So using buffered IO not only saves syscalls in upload-pack, it also
saves time in things that consume upload-pack's output.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jacob Vosmaer <jacob@gitlab.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When 'upload-pack' runs within the context of a git namespace, treat any
'want-ref' lines the client sends as relative to that namespace.
Also check if the wanted ref is hidden via 'hideRefs'. If it is hidden,
respond with an error as if the ref didn't exist.
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Kim Altintop <kim@eagain.st>
Reviewed-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "advertise capabilities" mode of serve.c added in
ed10cb952d (serve: introduce git-serve, 2018-03-15) is only used by
http-backend.c to call {upload,receive}-pack with the
--advertise-refs parameter. See 42526b478e (Add stateless RPC options
to upload-pack, receive-pack, 2009-10-30).
Let's just make cmd_upload_pack() take the two (v2) or three (v1)
parameters the v2/v1 servicing functions need directly, and pass
those in via the function signature. The logic of whether daemon mode
is implied by the timeout belongs in the v1 function (only used
there).
Once we split up the "advertise v2 refs" from "serve v2 request" it
becomes clear that v2 never cared about those in combination. The only
time it mattered was for v1 to emit its ref advertisement; in that
case we wanted to emit the smart-http-only "no-done" capability.
Since we only do that in the --advertise-refs codepath let's just have
it set "do_done" itself in v1's upload_pack() just before send_ref(),
at that point --advertise-refs and --stateless-rpc in combination are
redundant (the only user is get_info_refs() in http-backend.c), so we
can just pass in --advertise-refs only.
Since we need to touch all the serve() and advertise_capabilities()
codepaths let's rename them to less clever, more obvious names; this has
been suggested numerous times, the latest of which is [1]'s suggestion
for protocol_v2_serve_loop(). Let's go with that.
1. https://lore.kernel.org/git/CAFQ2z_NyGb8rju5CKzmo6KhZXD0Dp21u-BbyCb2aNxLEoSPRJw@mail.gmail.com/
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The serve.c API added in ed10cb952d (serve: introduce git-serve,
2018-03-15) was passing in the raw capabilities "keys", but nothing
downstream of it ever used them.
Let's remove that code because it's not needed. If we do end up
needing to pass information about the advertisement in the future
it'll make more sense to have serve.c parse the capabilities keys and
pass the result of its parsing, rather than expecting its
API users to parse the same keys again.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git push" learns to discover common ancestor with the receiving
end over protocol v2.
* jt/push-negotiation:
send-pack: support push negotiation
fetch: teach independent negotiation (no packfile)
fetch-pack: refactor command and capability write
fetch-pack: refactor add_haves()
fetch-pack: refactor process_acks()
Currently, the packfile negotiation step within a Git fetch cannot be
done independent of sending the packfile, even though there is at least
one application wherein this is useful. Therefore, make it possible for
this negotiation step to be done independently. A subsequent commit will
use this for one such application - push negotiation.
This feature is for protocol v2 only. (An implementation for protocol v0
would require a separate implementation in the fetch, transport, and
transport helper code.)
In the protocol, the main hindrance towards independent negotiation is
that the server can unilaterally decide to send the packfile. This is
solved by a "wait-for-done" argument: the server will then wait for the
client to say "done". In practice, the client will never say it; instead
it will cease requests once it is satisfied.
In the client, the main change lies in the transport and transport
helper code. fetch_refs_via_pack() performs everything needed - protocol
version and capability checks, and the negotiation itself.
There are 2 code paths that do not go through fetch_refs_via_pack() that
needed to be individually excluded: the bundle transport (excluded
through requiring smart_options, which the bundle transport doesn't
support) and transport helpers that do not support takeover. If or when
we support independent negotiation for protocol v0, we will need to
modify these 2 code paths to support it. But for now, report failure if
independent negotiation is requested in these cases.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
All of the other lookup_foo() functions take a repository argument, but
lookup_unknown_object() was never converted, and it uses the_repository
internally. Let's fix that.
We could leave a wrapper that uses the_repository, but there aren't that
many calls, so we'll just convert them all. I looked briefly at each
site to see if we had a repository struct (besides the_repository) we
could pass, but none of them do (so this conversion to pass
the_repository is a pure noop in each case, though it does take us one
step closer to eventually getting rid of the_repository).
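The conversion at each call site is mechanical, roughly:

    /* before */
    struct object *obj = lookup_unknown_object(&oid);

    /* after */
    struct object *obj = lookup_unknown_object(the_repository, &oid);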
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The commit-graph learned to use corrected commit dates instead of
the generation number to help topological revision traversal.
* ak/corrected-commit-date:
doc: add corrected commit date info
commit-reach: use corrected commit dates in paint_down_to_common()
commit-graph: use generation v2 only if entire chain does
commit-graph: implement generation data chunk
commit-graph: implement corrected commit date
commit-graph: return 64-bit generation number
commit-graph: add a slab to store topological levels
t6600-test-reach: generalize *_three_modes
commit-graph: consolidate fill_commit_graph_info
revision: parse parent in indegree_walk_step()
commit-graph: fix regression when computing Bloom filters
Fix in passing custom args from "git clone" to "upload-pack" on the
other side.
* jv/upload-pack-filter-spec-quotefix:
t5544: clarify 'hook works with partial clone' test
upload-pack.c: fix filter spec quoting bug
Fix a bug in upload-pack.c that occurs when you combine partial
clone and uploadpack.packObjectsHook. You can reproduce it as
follows:
git clone -u 'git -c uploadpack.allowfilter '\
'-c uploadpack.packobjectshook=env '\
'upload-pack' --filter=blob:none --no-local \
src.git dst.git
Be careful with the line endings because this has a long quoted
string as the -u argument.
The error I get when I run this is:
Cloning into '/tmp/broken'...
remote: fatal: invalid filter-spec ''blob:none''
error: git upload-pack: git-pack-objects died with error.
fatal: git upload-pack: aborting due to possible repository corruption on the remote side.
remote: aborting due to possible repository corruption on the remote side.
fatal: early EOF
fatal: index-pack failed
The problem is caused by unneeded quoting.
This bug was already present in 10ac85c785 (upload-pack: add object
filtering for partial clone, 2017-12-08) when the server side filter
support was introduced. In fact, in 10ac85c785 this was broken
regardless of uploadpack.packObjectsHook. Then in 0b6069fe0a
(fetch-pack: test support excluding large blobs, 2017-12-08) the
quoting was removed but only behind a conditional that depends on
whether uploadpack.packObjectsHook is set.
Because uploadpack.packObjectsHook is apparently rarely used, nobody
noticed the problematic quoting could still happen.
Remove the conditional quoting and add a test for partial clone in
t5544-pack-objects-hook.
Signed-off-by: Jacob Vosmaer <jacob@gitlab.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The peel_ref() interface is confusing and error-prone:
- it's typically used by ref iteration callbacks that have both a
refname and oid. But since they pass only the refname, we may load
the ref value from the filesystem again. This is inefficient, but
also means we are open to a race if somebody simultaneously updates
the ref. E.g., this:
  int some_ref_cb(const char *refname, const struct object_id *oid, ...)
  {
          if (!peel_ref(refname, &peeled))
                  printf("%s peels to %s",
                         oid_to_hex(oid), oid_to_hex(&peeled));
  }
could print nonsense. It is correct to say "refname peels to..."
(you may see the "before" value or the "after" value, either of
which is consistent), but mentioning both oids may be mixing
before/after values.
Worse, whether this is possible depends on whether the optimization
to read from the current iterator value kicks in. So it is actually
not possible with:
  for_each_ref(some_ref_cb);
but it _is_ possible with:
  head_ref(some_ref_cb);
which does not use the iterator mechanism (though in practice, HEAD
should never peel to anything, so this may not be triggerable).
- it must take a fully-qualified refname for the read_ref_full() code
path to work. Yet we routinely pass it partial refnames from
callbacks to for_each_tag_ref(), etc. This happens to work when
iterating because there we do not call read_ref_full() at all, and
only use the passed refname to check if it is the same as the
iterator. But the requirements for the function parameters are quite
unclear.
Instead of taking a refname, let's take an oid. That fixes both
problems. It's a little funny for a "ref" function not to involve refs
at all. The key thing is that it's optimizing under the hood based on
having access to the ref iterator. So let's change the name to make it
clear why you'd want this function versus just peel_object().
There are two other directions I considered but rejected:
- we could pass the peel information into the each_ref_fn callback.
However, we don't know if the caller actually wants it or not. For
packed-refs, providing it is essentially free. But for loose refs,
we actually have to peel the object, which would be wasteful in most
cases. We could likewise pass in a flag to the callback indicating
whether the peeled information is known, but that complicates those
callbacks, as they then have to decide whether to manually peel
themselves. Plus it requires changing the interface of every
callback, whether they care about peeling or not, and there are many
of them.
- we could make a function to return the peeled value of the current
iterated ref (computing it if necessary), and BUG() otherwise. I.e.:
    int peel_current_iterated_ref(struct object_id *out);
Each of the current callers is an each_ref_fn callback, so they'd
mostly be happy. But:
  - we use those callbacks with functions like head_ref(), which do
    not use the iteration code. So we'd need to handle the fallback
    case there, anyway.
  - it's possible that a caller would want to call into generic code
    that sometimes is used during iteration and sometimes not. This
    encapsulates the logic to do the fast thing when possible, and
    fall back when necessary.
The implementation is mostly obvious, but I want to call out a few
things in the patch:
- the test-tool coverage for peel_ref() is now meaningless, as it all
collapses to a single peel_object() call (arguably they were pretty
uninteresting before; the tricky part of that function is the
fast-path we see during iteration, but these calls didn't trigger
that). I've just dropped it entirely, though note that some other
tests relied on the tags we created; I've moved that creation to the
tests where it matters.
- we no longer need to take a ref_store parameter, since we'd never
look up a ref now. We do still rely on a global "current iterator"
variable which _could_ be kept per-ref-store. But in practice this
is only useful if there are multiple recursive iterations, at which
point the more appropriate solution is probably a stack of
iterators. No caller used the actual ref-store parameter anyway
(they all call the wrapper that passes the_repository).
- the original only kicked in the optimization when the "refname"
pointer matched (i.e., not string comparison). We do likewise with
the "oid" parameter here, but fall back to doing an actual oideq()
call. This in theory lets us kick in the optimization more often,
though in practice no current caller cares. It should never be
wrong, though (peeling is a property of an object, so two refs
pointing to the same object would peel identically).
- the original took care not to touch the peeled out-parameter unless
we found something to put in it. But no caller cares about this, and
anyway, it is enforced by peel_object() itself (and even in the
optimized iterator case, that's where we eventually end up). We can
shorten the code and avoid an extra copy by just passing the
out-parameter through the stack.
Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In a preparatory step for introducing corrected commit dates, let's
return timestamp_t values from commit_graph_generation(), use
timestamp_t for local variables and define GENERATION_NUMBER_INFINITY
as (2 ^ 63 - 1) instead.
We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
represent the largest topological level we can store in the commit data
chunk.
With corrected commit dates implemented, we will have two such *_MAX
variables to denote the largest offset and largest topological level
that can be stored.
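The net effect on the interface is roughly the following (a sketch; see
commit-graph.h for the exact definitions):

    /* corrected commit dates are timestamps, so "infinity" is 2^63 - 1 */
    #define GENERATION_NUMBER_INFINITY ((1ULL << 63) - 1)
    /* largest topological level storable in the commit data chunk */
    #define GENERATION_NUMBER_V1_MAX 0x3FFFFFFF

    timestamp_t commit_graph_generation(const struct commit *c);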
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Tighten error checking in the codepath that responds to "git fetch".
* jk/check-config-parsing-error-in-upload-pack:
upload-pack: propagate return value from object filter config callback
The transport layer was taught to optionally exchange the session
ID assigned by the trace2 subsystem during fetch/push transactions.
* js/trace2-session-id:
receive-pack: log received client session ID
send-pack: advertise session ID in capabilities
upload-pack, serve: log received client session ID
fetch-pack: advertise session ID in capabilities
transport: log received server session ID
serve: advertise session ID in v2 capabilities
receive-pack: advertise session ID in v0 capabilities
upload-pack: advertise session ID in v0 capabilities
trace2: add a public function for getting the SID
docs: new transfer.advertiseSID option
docs: new capability to advertise session IDs