pack-objects: break delta cycles before delta-search phase
We do not allow cycles in the delta graph of a pack (i.e., A
is a delta of B which is a delta of A) for the obvious
reason that you cannot actually access any of the objects in
such a case.
There's a last-ditch attempt to notice cycles during the
write phase, during which we issue a warning to the user and
write one of the objects out in full. However, this is
"last-ditch" for two reasons:
1. By this time, it's too late to find another delta for
the object, so the resulting pack is larger than it
otherwise could be.
2. The warning is there because this is something that
_shouldn't_ ever happen. If it does, then either:
a. a pack we are reusing deltas from had its own
cycle
b. we are reusing deltas from multiple packs, and
we found a cycle among them (i.e., A is a delta of
B in one pack, but B is a delta of A in another,
and we choose to use both deltas).
c. there is a bug in the delta-search code
So this code serves as a final check that none of these
things has happened, warns the user, and prevents us
from writing a bogus pack.
Right now, (2b) should never happen because of the static
ordering of packs in want_object_in_pack(). If two objects
have a delta relationship, then they must be in the same
pack, and therefore we will find them from that same pack.
However, a future patch would like to change that static
ordering, which will make (2b) a common occurrence. In
preparation, we should be able to handle those kinds of
cycles better. This patch does by introducing a
cycle-breaking step during the get_object_details() phase,
when we are deciding which deltas can be reused. That gives
us the chance to feed the objects into the delta search as
if the cycle did not exist.
We'll leave the detection and warning in the write_object()
phase in place, as it still serves as a check for case (2c).
This does mean we will stop warning for (2a). That case is
caused by bogus input packs, and we ideally would warn the
user about it. However, since those cycles show up after
picking reusable deltas, they look the same as (2b) to us;
our new code will break the cycles early and the last-ditch
check will never see them.
We could do analysis on any cycles that we find to
distinguish the two cases (i.e., it is a bogus pack if and
only if every delta in the cycle is in the same pack), but
we don't need to. If there is a cycle inside a pack, we'll
run into problems not only reusing the delta, but accessing
the object data at all. So when we try to dig up the actual
size of the object, we'll hit that same cycle and kick in
our usual complain-and-try-another-source code.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-11 12:26:36 +03:00
|
|
|
#!/bin/sh
|
|
|
|
|
|
|
|
test_description='test handling of inter-pack delta cycles during repack
|
|
|
|
|
|
|
|
The goal here is to create a situation where we have two blobs, A and B, with A
|
|
|
|
as a delta against B in one pack, and vice versa in the other. Then if we can
|
|
|
|
persuade a full repack to find A from one pack and B from the other, that will
|
|
|
|
give us a cycle when we attempt to reuse those deltas.
|
|
|
|
|
|
|
|
The trick is in the "persuade" step, as it depends on the internals of how
|
|
|
|
pack-objects picks which pack to reuse the deltas from. But we can assume
|
|
|
|
that it does so in one of two general strategies:
|
|
|
|
|
|
|
|
1. Using a static ordering of packs. In this case, no inter-pack cycles can
|
|
|
|
happen. Any objects with a delta relationship must be present in the same
|
|
|
|
pack (i.e., no "--thin" packs on disk), so we will find all related objects
|
|
|
|
from that pack. So assuming there are no cycles within a single pack (and
|
|
|
|
we avoid generating them via pack-objects or importing them via
|
|
|
|
index-pack), then our result will have no cycles.
|
|
|
|
|
|
|
|
So this case should pass the tests no matter how we arrange things.
|
|
|
|
|
|
|
|
2. Picking the next pack to examine based on locality (i.e., where we found
|
|
|
|
something else recently).
|
|
|
|
|
|
|
|
In this case, we want to make sure that we find the delta versions of A and
|
|
|
|
B and not their base versions. We can do this by putting two blobs in each
|
|
|
|
pack. The first is a "dummy" blob that can only be found in the pack in
|
|
|
|
question. And then the second is the actual delta we want to find.
|
|
|
|
|
|
|
|
The two blobs must be present in the same tree, not present in other trees,
|
|
|
|
and the dummy pathname must sort before the delta path.
|
|
|
|
|
|
|
|
The setup below focuses on case 2. We have two commits HEAD and HEAD^, each
|
|
|
|
which has two files: "dummy" and "file". Then we can make two packs which
|
|
|
|
contain:
|
|
|
|
|
|
|
|
[pack one]
|
|
|
|
HEAD:dummy
|
|
|
|
HEAD:file (as delta against HEAD^:file)
|
|
|
|
HEAD^:file (as base)
|
|
|
|
|
|
|
|
[pack two]
|
|
|
|
HEAD^:dummy
|
|
|
|
HEAD^:file (as delta against HEAD:file)
|
|
|
|
HEAD:file (as base)
|
|
|
|
|
|
|
|
Then no matter which order we start looking at the packs in, we know that we
|
|
|
|
will always find a delta for "file", because its lookup will always come
|
|
|
|
immediately after the lookup for "dummy".
|
|
|
|
'
|
|
|
|
|
2022-07-01 13:42:59 +03:00
|
|
|
TEST_PASSES_SANITIZE_LEAK=true
|
|
|
|
. ./test-lib.sh
|
pack-objects: break delta cycles before delta-search phase
We do not allow cycles in the delta graph of a pack (i.e., A
is a delta of B which is a delta of A) for the obvious
reason that you cannot actually access any of the objects in
such a case.
There's a last-ditch attempt to notice cycles during the
write phase, during which we issue a warning to the user and
write one of the objects out in full. However, this is
"last-ditch" for two reasons:
1. By this time, it's too late to find another delta for
the object, so the resulting pack is larger than it
otherwise could be.
2. The warning is there because this is something that
_shouldn't_ ever happen. If it does, then either:
a. a pack we are reusing deltas from had its own
cycle
b. we are reusing deltas from multiple packs, and
we found a cycle among them (i.e., A is a delta of
B in one pack, but B is a delta of A in another,
and we choose to use both deltas).
c. there is a bug in the delta-search code
So this code serves as a final check that none of these
things has happened, warns the user, and prevents us
from writing a bogus pack.
Right now, (2b) should never happen because of the static
ordering of packs in want_object_in_pack(). If two objects
have a delta relationship, then they must be in the same
pack, and therefore we will find them from that same pack.
However, a future patch would like to change that static
ordering, which will make (2b) a common occurrence. In
preparation, we should be able to handle those kinds of
cycles better. This patch does by introducing a
cycle-breaking step during the get_object_details() phase,
when we are deciding which deltas can be reused. That gives
us the chance to feed the objects into the delta search as
if the cycle did not exist.
We'll leave the detection and warning in the write_object()
phase in place, as it still serves as a check for case (2c).
This does mean we will stop warning for (2a). That case is
caused by bogus input packs, and we ideally would warn the
user about it. However, since those cycles show up after
picking reusable deltas, they look the same as (2b) to us;
our new code will break the cycles early and the last-ditch
check will never see them.
We could do analysis on any cycles that we find to
distinguish the two cases (i.e., it is a bogus pack if and
only if every delta in the cycle is in the same pack), but
we don't need to. If there is a cycle inside a pack, we'll
run into problems not only reusing the delta, but accessing
the object data at all. So when we try to dig up the actual
size of the object, we'll hit that same cycle and kick in
our usual complain-and-try-another-source code.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-11 12:26:36 +03:00
|
|
|
|
2019-12-15 18:12:24 +03:00
|
|
|
# Create a pack containing the tree $1 and blob $1:file, with
|
pack-objects: break delta cycles before delta-search phase
We do not allow cycles in the delta graph of a pack (i.e., A
is a delta of B which is a delta of A) for the obvious
reason that you cannot actually access any of the objects in
such a case.
There's a last-ditch attempt to notice cycles during the
write phase, during which we issue a warning to the user and
write one of the objects out in full. However, this is
"last-ditch" for two reasons:
1. By this time, it's too late to find another delta for
the object, so the resulting pack is larger than it
otherwise could be.
2. The warning is there because this is something that
_shouldn't_ ever happen. If it does, then either:
a. a pack we are reusing deltas from had its own
cycle
b. we are reusing deltas from multiple packs, and
we found a cycle among them (i.e., A is a delta of
B in one pack, but B is a delta of A in another,
and we choose to use both deltas).
c. there is a bug in the delta-search code
So this code serves as a final check that none of these
things has happened, warns the user, and prevents us
from writing a bogus pack.
Right now, (2b) should never happen because of the static
ordering of packs in want_object_in_pack(). If two objects
have a delta relationship, then they must be in the same
pack, and therefore we will find them from that same pack.
However, a future patch would like to change that static
ordering, which will make (2b) a common occurrence. In
preparation, we should be able to handle those kinds of
cycles better. This patch does by introducing a
cycle-breaking step during the get_object_details() phase,
when we are deciding which deltas can be reused. That gives
us the chance to feed the objects into the delta search as
if the cycle did not exist.
We'll leave the detection and warning in the write_object()
phase in place, as it still serves as a check for case (2c).
This does mean we will stop warning for (2a). That case is
caused by bogus input packs, and we ideally would warn the
user about it. However, since those cycles show up after
picking reusable deltas, they look the same as (2b) to us;
our new code will break the cycles early and the last-ditch
check will never see them.
We could do analysis on any cycles that we find to
distinguish the two cases (i.e., it is a bogus pack if and
only if every delta in the cycle is in the same pack), but
we don't need to. If there is a cycle inside a pack, we'll
run into problems not only reusing the delta, but accessing
the object data at all. So when we try to dig up the actual
size of the object, we'll hit that same cycle and kick in
our usual complain-and-try-another-source code.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-11 12:26:36 +03:00
|
|
|
# the latter stored as a delta against $2:file.
|
|
|
|
#
|
|
|
|
# We convince pack-objects to make the delta in the direction of our choosing
|
|
|
|
# by marking $2 as a preferred-base edge. That results in $1:file as a thin
|
|
|
|
# delta, and index-pack completes it by adding $2:file as a base.
|
|
|
|
#
|
|
|
|
# Note that the two variants of "file" must be similar enough to convince git
|
|
|
|
# to create the delta.
|
|
|
|
make_pack () {
|
|
|
|
{
|
|
|
|
printf '%s\n' "-$(git rev-parse $2)"
|
|
|
|
printf '%s dummy\n' "$(git rev-parse $1:dummy)"
|
|
|
|
printf '%s file\n' "$(git rev-parse $1:file)"
|
|
|
|
} |
|
|
|
|
git pack-objects --stdout |
|
|
|
|
git index-pack --stdin --fix-thin
|
|
|
|
}
|
|
|
|
|
|
|
|
test_expect_success 'setup' '
|
2018-03-24 10:44:42 +03:00
|
|
|
test-tool genrandom base 4096 >base &&
|
pack-objects: break delta cycles before delta-search phase
We do not allow cycles in the delta graph of a pack (i.e., A
is a delta of B which is a delta of A) for the obvious
reason that you cannot actually access any of the objects in
such a case.
There's a last-ditch attempt to notice cycles during the
write phase, during which we issue a warning to the user and
write one of the objects out in full. However, this is
"last-ditch" for two reasons:
1. By this time, it's too late to find another delta for
the object, so the resulting pack is larger than it
otherwise could be.
2. The warning is there because this is something that
_shouldn't_ ever happen. If it does, then either:
a. a pack we are reusing deltas from had its own
cycle
b. we are reusing deltas from multiple packs, and
we found a cycle among them (i.e., A is a delta of
B in one pack, but B is a delta of A in another,
and we choose to use both deltas).
c. there is a bug in the delta-search code
So this code serves as a final check that none of these
things has happened, warns the user, and prevents us
from writing a bogus pack.
Right now, (2b) should never happen because of the static
ordering of packs in want_object_in_pack(). If two objects
have a delta relationship, then they must be in the same
pack, and therefore we will find them from that same pack.
However, a future patch would like to change that static
ordering, which will make (2b) a common occurrence. In
preparation, we should be able to handle those kinds of
cycles better. This patch does by introducing a
cycle-breaking step during the get_object_details() phase,
when we are deciding which deltas can be reused. That gives
us the chance to feed the objects into the delta search as
if the cycle did not exist.
We'll leave the detection and warning in the write_object()
phase in place, as it still serves as a check for case (2c).
This does mean we will stop warning for (2a). That case is
caused by bogus input packs, and we ideally would warn the
user about it. However, since those cycles show up after
picking reusable deltas, they look the same as (2b) to us;
our new code will break the cycles early and the last-ditch
check will never see them.
We could do analysis on any cycles that we find to
distinguish the two cases (i.e., it is a bogus pack if and
only if every delta in the cycle is in the same pack), but
we don't need to. If there is a cycle inside a pack, we'll
run into problems not only reusing the delta, but accessing
the object data at all. So when we try to dig up the actual
size of the object, we'll hit that same cycle and kick in
our usual complain-and-try-another-source code.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-11 12:26:36 +03:00
|
|
|
for i in one two
|
|
|
|
do
|
|
|
|
# we want shared content here to encourage deltas...
|
|
|
|
cp base file &&
|
|
|
|
echo $i >>file &&
|
|
|
|
|
|
|
|
# ...whereas dummy should be short, because we do not want
|
|
|
|
# deltas that would create duplicates when we --fix-thin
|
|
|
|
echo $i >dummy &&
|
|
|
|
|
|
|
|
git add file dummy &&
|
|
|
|
test_tick &&
|
|
|
|
git commit -m $i ||
|
|
|
|
return 1
|
|
|
|
done &&
|
|
|
|
|
|
|
|
make_pack HEAD^ HEAD &&
|
|
|
|
make_pack HEAD HEAD^
|
|
|
|
'
|
|
|
|
|
|
|
|
test_expect_success 'repack' '
|
|
|
|
# We first want to check that we do not have any internal errors,
|
|
|
|
# and also that we do not hit the last-ditch cycle-breaking code
|
|
|
|
# in write_object(), which will issue a warning to stderr.
|
|
|
|
git repack -ad 2>stderr &&
|
tests: use 'test_must_be_empty' instead of 'test_cmp <empty> <out>'
Using 'test_must_be_empty' is shorter and more idiomatic than
>empty &&
test_cmp empty out
as it saves the creation of an empty file. Furthermore, sometimes the
expected empty file doesn't have such a descriptive name like 'empty',
and its creation is far away from the place where it's finally used
for comparison (e.g. in 't7600-merge.sh', where two expected empty
files are created in the 'setup' test, but are used only about 500
lines later).
These cases were found by instrumenting 'test_cmp' to error out the
test script when it's used to compare empty files, and then converted
manually.
Note that even after this patch there still remain a lot of cases
where we use 'test_cmp' to check empty files:
- Sometimes the expected output is not hard-coded in the test, but
'test_cmp' is used to ensure that two similar git commands produce
the same output, and that output happens to be empty, e.g. the
test 'submodule update --merge - ignores --merge for new
submodules' in 't7406-submodule-update.sh'.
- Repetitive common tasks, including preparing the expected results
and running 'test_cmp', are often extracted into a helper
function, and some of this helper's callsites expect no output.
- For the same reason as above, the whole 'test_expect_success'
block is within a helper function, e.g. in 't3070-wildmatch.sh'.
- Or 'test_cmp' is invoked in a loop, e.g. the test 'cvs update
(-p)' in 't9400-git-cvsserver-server.sh'.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-20 00:57:25 +03:00
|
|
|
test_must_be_empty stderr &&
|
pack-objects: break delta cycles before delta-search phase
We do not allow cycles in the delta graph of a pack (i.e., A
is a delta of B which is a delta of A) for the obvious
reason that you cannot actually access any of the objects in
such a case.
There's a last-ditch attempt to notice cycles during the
write phase, during which we issue a warning to the user and
write one of the objects out in full. However, this is
"last-ditch" for two reasons:
1. By this time, it's too late to find another delta for
the object, so the resulting pack is larger than it
otherwise could be.
2. The warning is there because this is something that
_shouldn't_ ever happen. If it does, then either:
a. a pack we are reusing deltas from had its own
cycle
b. we are reusing deltas from multiple packs, and
we found a cycle among them (i.e., A is a delta of
B in one pack, but B is a delta of A in another,
and we choose to use both deltas).
c. there is a bug in the delta-search code
So this code serves as a final check that none of these
things has happened, warns the user, and prevents us
from writing a bogus pack.
Right now, (2b) should never happen because of the static
ordering of packs in want_object_in_pack(). If two objects
have a delta relationship, then they must be in the same
pack, and therefore we will find them from that same pack.
However, a future patch would like to change that static
ordering, which will make (2b) a common occurrence. In
preparation, we should be able to handle those kinds of
cycles better. This patch does by introducing a
cycle-breaking step during the get_object_details() phase,
when we are deciding which deltas can be reused. That gives
us the chance to feed the objects into the delta search as
if the cycle did not exist.
We'll leave the detection and warning in the write_object()
phase in place, as it still serves as a check for case (2c).
This does mean we will stop warning for (2a). That case is
caused by bogus input packs, and we ideally would warn the
user about it. However, since those cycles show up after
picking reusable deltas, they look the same as (2b) to us;
our new code will break the cycles early and the last-ditch
check will never see them.
We could do analysis on any cycles that we find to
distinguish the two cases (i.e., it is a bogus pack if and
only if every delta in the cycle is in the same pack), but
we don't need to. If there is a cycle inside a pack, we'll
run into problems not only reusing the delta, but accessing
the object data at all. So when we try to dig up the actual
size of the object, we'll hit that same cycle and kick in
our usual complain-and-try-another-source code.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-11 12:26:36 +03:00
|
|
|
|
|
|
|
# And then double-check that the resulting pack is usable (i.e.,
|
|
|
|
# we did not fail to notice any cycles). We know we are accessing
|
|
|
|
# the objects via the new pack here, because "repack -d" will have
|
|
|
|
# removed the others.
|
|
|
|
git cat-file blob HEAD:file >/dev/null &&
|
|
|
|
git cat-file blob HEAD^:file >/dev/null
|
|
|
|
'
|
|
|
|
|
|
|
|
test_done
|