diffcore-rename had two different checks of the form
if ((a < limit || b < limit) &&
a * b <= limit * limit)
This can be simplified to
if (st_mult(a, b) <= st_mult(limit, limit))
which makes it clearer how we are checking for overflow, and makes it
much easier to parse given the drop from 8 to 4 variable appearances.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
too_many_rename_candidates() got the number of rename destinations via
an argument to the function, but the number of rename sources via a
global variable. That felt rather inconsistent. Pass in the number of
rename sources as an argument as well.
While we are at it... We had a local variable, num_src, that served two
purposes. Initially it was set to the global value, but later was used
for counting a subset of the number of sources. Since we now have a
function argument for the former usage, introduce a clearer variable
name for the latter usage.
This patch has no behavioral changes; it's just renaming and passing an
argument instead of grabbing it from the global namespace. (You may
find it easier to view the patch using git diff's --color-words option.)
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Our main data structures are rename_src and rename_dst. For counters of
these data structures, num_sources and num_destinations seem natural;
definitely more so than using num_create for the latter.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
hashmap_free(), hashmap_free_entries(), and hashmap_free_() have existed
for a while, but aren't necessarily the clearest names, especially with
hashmap_partial_clear() being added to the mix and lazy-initialization
now being supported. Peff suggested we adopt the following names[1]:
- hashmap_clear() - remove all entries and de-allocate any
hashmap-specific data, but be ready for reuse
- hashmap_clear_and_free() - ditto, but free the entries themselves
- hashmap_partial_clear() - remove all entries but don't deallocate
table
- hashmap_partial_clear_and_free() - ditto, but free the entries
This patch provides the new names and converts all existing callers over
to the new naming scheme.
[1] https://lore.kernel.org/git/20201030125059.GA3277724@coredump.intra.peff.net/
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The behavior of diff_populate_filespec() currently can be customized
through a bitflag, but a subsequent patch requires it to support a
non-boolean option. Replace the bitflag with an options struct.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Allow hash_object_file() to work on arbitrary repos by introducing a
git_hash_algo parameter. Change callers which have a struct repository
pointer in their scope to pass on the git_hash_algo from the said repo.
For all other callers, pass on the_hash_algo, which was already being
used internally at hash_object_file(). This functionality will be used
in the following patch to make check_object_signature() be able to work
on arbitrary repos (which, in turn, will be used to fix an
inconsistency at object.c:parse_object()).
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
While we cannot rely on a `__typeof__' operator being portable
to use with `offsetof'; we can calculate the pointer offset
using an existing pointer and the address of a member using
pointer arithmetic for compilers without `__typeof__'.
This allows us to simplify usage of hashmap iterator macros
by not having to specify a type when a pointer of that type
is already given.
In the future, list iterator macros (e.g. list_for_each_entry)
may also be implemented using OFFSETOF_VAR to save hackers the
trouble of using container_of/list_entry macros and without
relying on non-portable `__typeof__'.
v3: use `__typeof__' to avoid clang warnings
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
`hashmap_free_entries' behaves like `container_of' and passes
the offset of the hashmap_entry struct to the internal
`hashmap_free_' function, allowing the function to free any
struct pointer regardless of where the hashmap_entry field
is located.
`hashmap_free' no longer takes any arguments aside from
the hashmap itself.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Using `container_of' can be verbose and choosing names for
intermediate "struct hashmap_entry" pointers is a hard problem.
So introduce "*_entry" APIs inspired by similar linked-list
APIs in the Linux kernel.
Unfortunately, `__typeof__' is not portable C, so we need an
extra parameter to specify the type.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is a step towards removing the requirement for
hashmap_entry being the first field of a struct.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is less error-prone than "void *" as the compiler now
detects invalid types being passed.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is less error-prone than "const void *" as the compiler
now detects invalid types being passed.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
C compilers do type checking to make life easier for us. So
rely on that and update all hashmap_entry_init callers to take
"struct hashmap_entry *" to avoid future bugs while improving
safety and readability.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
During Git's rename detection, the file names are sorted. At the moment,
this job is performed by `qsort()`. As that function is not guaranteed
to implement a stable sort algorithm, this can lead to inconsistent
and/or surprising behavior: a rename might be detected differently
depending on the platform where Git was run.
The `qsort()` in MS Visual C's runtime does _not_ implement a stable
sort algorithm, and it even leads to an inconsistency leading to a test
failure in t3030.35 "merge-recursive remembers the names of all base
trees": a different code path than on Linux is taken in the rename
detection of an ambiguous rename between either `e` to `a` or
`a~Temporary merge branch 2_0` to `a` during a recursive merge,
unexpectedly resulting in a clean merge.
Let's use the stable sort provided by `git_stable_qsort()` to avoid this
inconsistency.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Code clean-up to remove hardcoded SHA-1 hash from many places.
* jk/oidhash:
hashmap: convert sha1hash() to oidhash()
hash.h: move object_id definition from cache.h
khash: rename oid helper functions
khash: drop sha1-specific map types
pack-bitmap: convert khash_sha1 maps into kh_oid_map
delta-islands: convert island_marks khash to use oids
khash: rename kh_oid_t to kh_oid_set
khash: drop broken oid_map typedef
object: convert create_object() to use object_id
object: convert internal hash_obj() to object_id
object: convert lookup_object() to use object_id
object: convert lookup_unknown_object() to use object_id
pack-objects: convert locate_object_entry_hash() to object_id
pack-objects: convert packlist_find() to use object_id
pack-bitmap-write: convert some helpers to use object_id
upload-pack: rename a "sha1" variable to "oid"
describe: fix accidental oid/hash type-punning
There are no callers left of sha1hash() that do not simply pass the
"hash" member of a "struct object_id". Let's get rid of the outdated
sha1-specific function and provide one that operates on the whole struct
(even though the technique, taking the first few bytes of the hash, will
remain the same).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Calculating the sum of two array indexes to find the midpoint between
them can overflow, i.e. code like this is unsafe for big arrays:
mid = (first + last) >> 1;
Make sure the intermediate value stays within the boundaries instead,
like this:
mid = first + ((last - first) >> 1);
The loop condition of the binary search makes sure that 'last' is
always greater than 'first', so this is safe as long as 'first' is
not negative. And that can be verified easily using the pre-context
of each change, except for name-hash.c, so add an assertion to that
effect there.
The unsafe calculations were found with:
git grep '(.*+.*) *>> *1'
This is a continuation of 19716b21a4 (cleanup: fix possible overflow
errors in binary search, 2017-10-08).
Signed-off-by: Rene Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Various codepaths in the core-ish part learn to work on an
arbitrary in-core index structure, not necessarily the default
instance "the_index".
* nd/the-index: (23 commits)
revision.c: reduce implicit dependency the_repository
revision.c: remove implicit dependency on the_index
ws.c: remove implicit dependency on the_index
tree-diff.c: remove implicit dependency on the_index
submodule.c: remove implicit dependency on the_index
line-range.c: remove implicit dependency on the_index
userdiff.c: remove implicit dependency on the_index
rerere.c: remove implicit dependency on the_index
sha1-file.c: remove implicit dependency on the_index
patch-ids.c: remove implicit dependency on the_index
merge.c: remove implicit dependency on the_index
merge-blobs.c: remove implicit dependency on the_index
ll-merge.c: remove implicit dependency on the_index
diff-lib.c: remove implicit dependency on the_index
read-cache.c: remove implicit dependency on the_index
diff.c: remove implicit dependency on the_index
grep.c: remove implicit dependency on the_index
diff.c: remove the_index dependency in textconv() functions
blame.c: rename "repo" argument to "r"
combine-diff.c: remove implicit dependency on the_index
...
diff and textconv code has so widespread use that it's hard to simply
update their api and all call sites at once because it would result in
a big patch. For now reduce the_index references to two places:
diff_setup() and fill_textconv().
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is the flip side of the previous two patches: checking
for a non-zero oidcmp() can be more strictly expressed as
inequality. Like those patches, we write "!= 0" in the
coccinelle transformation, which covers by isomorphism the
more common:
if (oidcmp(E1, E2))
As with the previous two patches, this patch can be achieved
almost entirely by running "make coccicheck"; the only
differences are manual line-wrap fixes to match the original
code.
There is one thing to note for anybody replicating this,
though: coccinelle 1.0.4 seems to miss the case in
builtin/tag.c, even though it's basically the same as all
the others. Running with 1.0.7 does catch this, so
presumably it's just a coccinelle bug that was fixed in the
interim.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This should make these functions easier to find and cache.h less
overwhelming to read.
In particular, this moves:
- read_object_file
- oid_object_info
- write_object_file
As a result, most of the codebase needs to #include object-store.h.
In this patch the #include is only added to files that would fail to
compile otherwise. It would be better to #include wherever
identifiers from the header are used. That can happen later
when we have better tooling for it.
Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Convert the declaration and definition of hash_sha1_file to use
struct object_id and adjust all function calls.
Rename this function to hash_object_file.
Signed-off-by: Patryk Obara <patryk.obara@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Use the helper macro MOVE_ARRAY to move arrays. This is shorter and
safer, as it automatically infers the size of elements.
Patch generated by Coccinelle and contrib/coccinelle/array.cocci in
Travis CI's static analysis build job.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In the documentation of diff-tree, it is stated that the -l option
"prevents rename/copy detection from running if the number of
rename/copy targets exceeds the specified number". The documentation
does not mention any special handling for the number 0, but the
implementation before commit 9f7e4bfa3b ("diff: remove silent clamp of
renameLimit", 2017-11-13) treated 0 as a special value indicating that
the rename limit is to be a very large number instead.
The commit 9f7e4bfa3b changed that behavior, treating 0 as 0. Revert
this behavior to what it was previously. This allows existing scripts
and tools that use "-l0" to continue working. The alternative (to have
"-l0" suppress rename detection) is probably much less useful, since
users can just refrain from specifying -M and/or -C to have the same
effect.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In commit 0024a5492 (Fix the rename detection limit checking; 2007-09-14),
the renameLimit was clamped to 32767. This appears to have been to simply
avoid integer overflow in the following computation:
num_create * num_src <= rename_limit * rename_limit
although it also could be viewed as a hardcoded bound on the amount of CPU
time we're willing to allow users to tell git to spend on handling
renames. An upper bound may make sense, but unfortunately this upper
bound was neither communicated to the users, nor documented anywhere.
Although large limits can make things slow, we have users who would be
ecstatic to have a small five file change be correctly cherry picked even
if they have to manually specify a large limit and wait ten minutes for
the renames to be detected.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The possibility of setting merge.renameLimit beyond 2^16 raises the
possibility that the values passed to progress can exceed 2^32.
Use uint64_t, because it "ought to be enough for anybody". :-)
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Remove the `DIFF_OPT_TST` macro and instead access the flags directly.
This conversion is done using the following semantic patch:
@@
expression E;
identifier fld;
@@
- DIFF_OPT_TST(&E, fld)
+ E.flags.fld
@@
type T;
T *ptr;
identifier fld;
@@
- DIFF_OPT_TST(ptr, fld)
+ ptr->flags.fld
Signed-off-by: Brandon Williams <bmwill@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We used to expose the full power of the delayed progress API to the
callers, so that they can specify, not just the message to show and
expected total amount of work that is used to compute the percentage
of work performed so far, the percent-threshold parameter P and the
delay-seconds parameter N. The progress meter starts to show at N
seconds into the operation only if we have not yet completed P per-cent
of the total work.
Most callers used either (0%, 2s) or (50%, 1s) as (P, N), but there
are oddballs that chose more random-looking values like 95%.
For a smoother workload, (50%, 1s) would allow us to start showing
the progress meter earlier than (0%, 2s), while keeping the chance
of not showing progress meter for long running operation the same as
the latter. For a task that would take 2s or more to complete, it
is likely that less than half of it would complete within the first
second, if the workload is smooth. But for a spiky workload whose
earlier part is easier, such a setting is likely to fail to show the
progress meter entirely and (0%, 2s) is more appropriate.
But that is merely a theory. Realistically, it is of dubious value
to ask each codepath to carefully consider smoothness of their
workload and specify their own setting by passing two extra
parameters. Let's simplify the API by dropping both parameters and
have everybody use (0%, 2s).
Oh, by the way, the percent-threshold parameter and the structure
member were consistently misspelled, which also is now fixed ;-)
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When using the hashmap a common need is to have access to caller provided
data in the compare function. A couple of times we abuse the keydata field
to pass in the data needed. This happens for example in patch-ids.c.
This patch changes the function signature of the compare function
to have one more void pointer available. The pointer given for each
invocation of the compare function must be defined in the init function
of the hashmap and is just passed through.
Documentation of this new feature is deferred to a later patch.
This is a rather mechanical conversion, just adding the new pass-through
parameter. However while at it improve the naming of the fields of all
compare functions used by hashmaps by ensuring unused parameters are
prefixed with 'unused_' and naming the parameters what they are (instead
of 'unused' make it 'unused_keydata').
Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A common pattern to free a piece of memory and assign NULL to the
pointer that used to point at it has been replaced with a new
FREE_AND_NULL() macro.
* ab/free-and-null:
*.[ch] refactoring: make use of the FREE_AND_NULL() macro
coccinelle: make use of the "expression" FREE_AND_NULL() rule
coccinelle: add a rule to make "expression" code use FREE_AND_NULL()
coccinelle: make use of the "type" FREE_AND_NULL() rule
coccinelle: add a rule to make "type" code use FREE_AND_NULL()
git-compat-util: add a FREE_AND_NULL() wrapper around free(ptr); ptr = NULL
Apply the result of the just-added coccinelle rule. This manually
excludes a few occurrences, mostly things that resulted in many
FREE_AND_NULL() on one line, that'll be manually fixed in a subsequent
change.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The delta_limit parameter to diffcore_count_changes() has been unused
since commit ba23bbc8e ("diffcore-delta: make change counter to byte
oriented again.", 2006-03-04).
Remove the parameter and adjust all callers.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Apply the semantic patch contrib/coccinelle/qsort.cocci to the code
base, replacing calls of qsort(3) with QSORT. The resulting code is
shorter and supports empty arrays with NULL pointers.
Signed-off-by: Rene Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The result of st_mult() is the same no matter the order of its
arguments. It invokes the macro unsigned_mult_overflows(), which
divides the second parameter by the first one. Pass constants
first to allow that division to be done already at compile time.
Signed-off-by: Rene Scharfe <l.s.r@web.de>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Now that this struct's sha1 member is called "oid", update the comment
and the sha1_valid member to be called "oid_valid" instead. The
following Coccinelle semantic patch was used to implement this, followed
by the transformations in object_id.cocci:
@@
struct diff_filespec o;
@@
- o.sha1_valid
+ o.oid_valid
@@
struct diff_filespec *p;
@@
- p->sha1_valid
+ p->oid_valid
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Convert struct diff_filespec's sha1 member to use a struct object_id
called "oid" instead. The following Coccinelle semantic patch was used
to implement this, followed by the transformations in object_id.cocci:
@@
struct diff_filespec o;
@@
- o.sha1
+ o.oid.hash
@@
struct diff_filespec *p;
@@
- p->sha1
+ p->oid.hash
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git diff -M" used to work better when two originally identical
files A and B got renamed to X/A and X/B by pairing A to X/A and B
to X/B, but this was broken in the 2.0 timeframe.
* sg/diff-multiple-identical-renames:
diffcore: fix iteration order of identical files during rename detection
"git diff -M" used to work better when two originally identical
files A and B got renamed to X/A and X/B by pairing A to X/A and B
to X/B, but this was broken in the 2.0 timeframe.
* sg/diff-multiple-identical-renames:
diffcore: fix iteration order of identical files during rename detection
If the two paths 'dir/A/file' and 'dir/B/file' have identical content
and the parent directory is renamed, e.g. 'git mv dir other-dir', then
diffcore reports the following exact renames:
renamed: dir/B/file -> other-dir/A/file
renamed: dir/A/file -> other-dir/B/file
While technically not wrong, this is confusing not only for the user,
but also for git commands that make decisions based on rename
information, e.g. 'git log --follow other-dir/A/file' follows
'dir/B/file' past the rename.
This behavior is a side effect of commit v2.0.0-rc4~8^2~14
(diffcore-rename.c: simplify finding exact renames, 2013-11-14): the
hashmap storing sources returns entries from the same bucket, i.e.
sources matching the current destination, in LIFO order. Thus the
iteration first examines 'other-dir/A/file' and 'dir/B/file' and, upon
finding identical content and basename, reports an exact rename.
Other hashmap users are apparently happy with the current iteration
order over the entries of a bucket. Changing the iteration order
would risk upsetting other hashmap users and would increase the memory
footprint of each bucket by a pointer to the tail element.
Fill the hashmap with source entries in reverse order to restore the
original exact rename detection behavior.
Reported-by: Bill Okara <billokara@gmail.com>
Signed-off-by: SZEDER Gábor <szeder@ira.uka.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If our size computation overflows size_t, we may allocate a
much smaller buffer than we expected and overflow it. It's
probably impossible to trigger an overflow in most of these
sites in practice, but it is easy enough convert their
additions and multiplications into overflow-checking
variants. This may be fixing real bugs, and it makes
auditing the code easier.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A corrupt input to "git diff -M" can cause us to segfault.
* jk/diffcore-rename-duplicate:
diffcore-rename: avoid processing duplicate destinations
diffcore-rename: split locate_rename_dst into two functions
The rename code cannot handle an input where we have
duplicate destinations (i.e., more than one diff_filepair in
the queue with the same string in its pair->two->path). We
end up allocating only one slot in the rename_dst mapping.
If we fill in the diff_filepair for that slot, when we
re-queue the results, we may queue that filepair multiple
times. When the diff is finally flushed, the filepair is
processed and free()d multiple times, leading to heap
corruption.
This situation should only happen when a tree diff sees
duplicates in one of the trees (see the added test for a
detailed example). Rather than handle it, the sanest thing
is just to turn off rename detection altogether for the
diff.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This function manages the mapping of destination pathnames
to filepairs, and it handles both insertion and lookup. This
makes the return value a bit confusing, as we return a newly
created entry (even though no caller cares), and have no
room to indicate to the caller that an entry already
existed.
Instead, let's break this up into two distinct functions,
both backed by a common binary search. The binary search
will use our normal "return the index if we found something,
or negative index minus one to show where it would have
gone" semantics.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Hashmap entries are typically looked up by just a key. The hashmap_get()
API expects an initialized entry structure instead, to support compound
keys. This flexibility is currently only needed by find_dir_entry() in
name-hash.c (and compat/win32/fscache.c in the msysgit fork). All other
(currently five) call sites of hashmap_get() have to set up a near emtpy
entry structure, resulting in duplicate code like this:
struct hashmap_entry keyentry;
hashmap_entry_init(&keyentry, hash(key));
return hashmap_get(map, &keyentry, key);
Add a hashmap_get_from_hash() API that allows hashmap lookups by just
specifying the key and its hash code, i.e.:
return hashmap_get_from_hash(map, hash(key), key);
Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Copying the first bytes of a SHA1 is duplicated in six places,
however, the implications (the actual value would depend on the
endianness of the platform) is documented only once.
Add a properly documented API for this.
Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Replace open-coded reallocation with ALLOC_GROW() macro.
* dd/use-alloc-grow:
sha1_file.c: use ALLOC_GROW() in pretend_sha1_file()
read-cache.c: use ALLOC_GROW() in add_index_entry()
builtin/mktree.c: use ALLOC_GROW() in append_to_tree()
attr.c: use ALLOC_GROW() in handle_attr_line()
dir.c: use ALLOC_GROW() in create_simplify()
reflog-walk.c: use ALLOC_GROW()
replace_object.c: use ALLOC_GROW() in register_replace_object()
patch-ids.c: use ALLOC_GROW() in add_commit()
diffcore-rename.c: use ALLOC_GROW()
diff.c: use ALLOC_GROW()
commit.c: use ALLOC_GROW() in register_commit_graft()
cache-tree.c: use ALLOC_GROW() in find_subtree()
bundle.c: use ALLOC_GROW() in add_to_ref_list()
builtin/pack-objects.c: use ALLOC_GROW() in check_pbase_path()
Use ALLOC_GROW() instead of open-coding it in locate_rename_dst()
and register_rename_src().
Signed-off-by: Dmitry S. Dolzhenko <dmitrys.dolzhenko@yandex.ru>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The find_exact_renames function currently only uses the hash table for
grouping, i.e.:
1. add sources
2. add destinations
3. iterate all buckets, per bucket:
4. split sources from destinations
5. iterate destinations, per destination:
6. iterate sources to find best match
This can be simplified by utilizing the lookup functionality of the hash
table, i.e.:
1. add sources
2. iterate destinations, per destination:
3. lookup sources matching the current destination
4. iterate sources to find best match
This saves several iterations and file_similarity allocations for the
destinations.
Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
No actual code changes, just move hash_filespec up and outdent part of
find_identical_files.
Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This avoids unnecessary re-allocations and reinsertions. On webkit.git
(i.e. about 182k inserts to the name hash table), this reduces about
100ms out of 3s user time.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We do not want a link to 0{40} object stored anywhere in our objects.
* jk/maint-null-in-trees:
fsck: detect null sha1 in tree entries
do not write null sha1s to on-disk index
diff: do not use null sha1 as a sentinel value
The diff code represents paths using the diff_filespec
struct. This struct has a sha1 to represent the sha1 of the
content at that path, as well as a sha1_valid member which
indicates whether its sha1 field is actually useful. If
sha1_valid is not true, then the filespec represents a
working tree file (e.g., for the no-index case, or for when
the index is not up-to-date).
The diff_filespec is only used internally, though. At the
interfaces to the diff subsystem, callers feed the sha1
directly, and we create a diff_filespec from it. It's at
that point that we look at the sha1 and decide whether it is
valid or not; callers may pass the null sha1 as a sentinel
value to indicate that it is not.
We should not typically see the null sha1 coming from any
other source (e.g., in the index itself, or from a tree).
However, a corrupt tree might have a null sha1, which would
cause "diff --patch" to accidentally diff the working tree
version of a file instead of treating it as a blob.
This patch extends the edges of the diff interface to accept
a "sha1_valid" flag whenever we accept a sha1, and to use
that flag when creating a filespec. In some cases, this
means passing the flag through several layers, making the
code change larger than would be desirable.
One alternative would be to simply die() upon seeing
corrupted trees with null sha1s. However, this fix more
directly addresses the problem (while bogus sha1s in a tree
are probably a bad thing, it is really the sentinel
confusion sending us down the wrong code path that is what
makes it devastating). And it means that git is more capable
of examining and debugging these corrupted trees. For
example, you can still "diff --raw" such a tree to find out
when the bogus entry was introduced; you just cannot do a
"--patch" diff (just as you could not with any other
corrupted tree, as we do not have any content to diff).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Our rename detection is a heuristic, matching pairs of
removed and added files with similar or identical content.
It's unlikely to be wrong when there is actual content to
compare, and we already take care not to do inexact rename
detection when there is not enough content to produce good
results.
However, we always do exact rename detection, even when the
blob is tiny or empty. It's easy to get false positives with
an empty blob, simply because it is an obvious content to
use as a boilerplate (e.g., when telling git that an empty
directory is worth tracking via an empty .gitignore).
This patch lets callers specify whether or not they are
interested in using empty files as rename sources and
destinations. The default is "yes", keeping the original
behavior. It works by detecting the empty-blob sha1 for
rename sources and destinations.
One more flexible alternative would be to allow the caller
to specify a minimum size for a blob to be "interesting" for
rename detection. But that would catch small boilerplate
files, not large ones (e.g., if you had the GPL COPYING file
in many directories).
A better alternative would be to allow a "-rename"
gitattribute to allow boilerplate files to be marked as
such. I'll leave the complexity of that solution until such
time as somebody actually wants it. The complaints we've
seen so far revolve around empty files, so let's start with
the simple thing.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since 9d8a5a5 (diffcore-rename: refactor "too many candidates" logic,
2011-01-06), diffcore_rename() initializes num_src but does not use it
anymore. "-Wunused-but-set-variable" in gcc-4.6 complains about this.
Signed-off-by: Jim Meyering <meyering@redhat.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since e9c8409 (diff-index --cached --raw: show tree entry on the LHS for
unmerged entries., 2007-01-05), an unmerged entry should be detected by
using DIFF_PAIR_UNMERGED(p), not by noticing both one and two sides of
the filepair records mode=0 entries. However, it forgot to update some
parts of the rename detection logic.
This only makes difference in the "diff --cached" codepath where an
unmerged filepair carries information on the entries that came from the
tree. It probably hasn't been noticed for a long time because nobody
would run "diff -M" during a conflict resolution, but "git status" uses
rename detection when it internally runs "diff-index" and "diff-files"
and gives nonsense results.
In an unmerged pair, "one" side can have a valid filespec to record the
tree entry (e.g. what's in HEAD) when running "diff --cached". This can
be used as a rename source to other paths in the index that are not
unmerged. The path that is unmerged by definition does not have the
final content yet (i.e. "two" side cannot have a valid filespec), so it
can never be a rename destination.
Use the DIFF_PAIR_UNMERGED() to detect unmerged filepair correctly, and
allow the valid "one" side of an unmerged filepair to be considered a
potential rename source, but never to be considered a rename destination.
Commit message and first two test cases by Junio, the rest by Martin.
Signed-off-by: Martin von Zweigbergk <martin.von.zweigbergk@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When there are too many paths in the project, the number of rename source
candidates "git diff -C -C" finds will exceed the rename detection limit,
and no inexact rename detection is performed. We however could fall back
to "git diff -C" if the number of modified paths is sufficiently small.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This will allow us to later skip unmodified entries added due to "-C -C".
We might also want to do something similar to rename_dst side, but that
would only be for the sake of symmetry and not necessary for this series.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Move the logic to a separate function, to be enhanced by later patches in
the series.
While at it, swap the condition used in the if statement from "if it is
too big then do this" to "if it would fit then do this".
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
Rebased to 'master' as the logic to use the result of this logic was
updated recently, together with the addition of eye-candy.
We might spend many seconds doing inexact rename detection
with no output. It's nice to let the user know that
something is actually happening.
This patch adds the infrastructure, but no callers actually
turn on progress reporting.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The warning is generated deep in the diffcore code, which
means that it will come first, followed possibly by a spew
of conflicts, making it hard to see.
Instead, let's have diffcore pass back the information about
how big the rename limit would needed to have been, and then
the caller can provide a more appropriate message (and at a
more appropriate time).
No refactoring of other non-merge callers is necessary,
because nobody else was even using the warn_on_rename_limit
feature.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The logic to quickly dismiss potential rename pairs was broken. It
would too eagerly dismiss possible renames when all of the difference
was due to pure new data (or deleted data).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We would allow rename detection to do copy detection even when asked
purely for renames. That confuses users, but more importantly it can
terminally confuse the recursive merge rename logic.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
For the find_exact_renames() function, this allows us to pass the
diff_options structure pointer to the low-level routines. We will use
that to distinguish between the "rename" and "copy" cases.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Refactor the diff_queue_struct code, this macro help
to reset the structure.
Signed-off-by: Bo Yang <struggleyb.nku@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
After running one round of estimate_similarity(), filespecs on either
side will have populated their cnt_data fields, and we do not need
the blob text anymore. We used to retain the blob data to optimize
for smaller projects (not freeing the blob data here would mean that
the final output phase would not have to re-read it), but we are
efficient enough without such optimization for smaller projects anyway,
and freeing memory early will help larger projects.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In diffcore_rename, we assume that the blob contents in the filespec
aren't required anymore after estimate_similarity has been called and thus
we free it. But estimate_similarity might return early when the file sizes
differ too much. In that case, cnt_data is never set and the next call to
estimate_similarity will populate the filespec again, eventually rereading
the same blob over and over again.
To fix that, we first get the blob sizes and only when the blob contents
are actually required, and when cnt_data will be set, the full filespec is
populated, once.
Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we refuse to do rename detection due to having too many files
created or deleted, let the user know the numbers. That way there is a
reasonable starting point for setting the diff.renamelimit option.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In many cases, the warning ends up as clutter, because the
diff is being done "behind the scenes" from the user (e.g.,
when generating a commit diffstat), and whether we show
renames or not is not particularly interesting to the user.
However, in the case of a merge (which is what motivated the
warning in the first place), it is a useful hint as to why a
merge with renames might have failed.
This patch makes the warning optional based on the code
calling into diffcore. We default to not showing the
warning, but turn it on for merges.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match. This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.
This instead allocates and keeps only a handful rename source
candidates per new files in the postimage. I.e. it makes the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
On Thu, 29 Nov 2007, Jeff King wrote:
>
> I think it will get worse, because you are simultaneously calculating
> all of the similarity scores bit by bit rather than doing a loop. Though
> perhaps you mean at the end you will end up with a list of src/dst pairs
> sorted by score, and you can loop over that.
Well, after thinking about this a bit, I think there's a solution that may
work well with the current thing too: instead of looping just *once* over
the list of rename pairs, loop twice - and simply refuse to do copies on
the first loop.
This trivial patch does that, and turns Kumar's test-case into a perfect
rename list.
It's not pretty, it's not smart, but it seems to work. There's something
to be said for keeping it simple and stupid.
And it should not be nearly as expensive as it may _look_. Yes, the loop
is "(i = 0; i < num_create * num_src; i++)", but the important part is
that the whole array is sorted by rename score, and we have a
if (mx[i].score < minimum_score)
break;
in it, so uthe loop actually would tend to terminate rather quickly.
Anyway, Kumar, the thing to take away from this is:
- git really doesn't even *care* about the whole "rename detection"
internally, and any commits you have done with renames are totally
independent of the heuristics we then use to *show* the renames.
- the rename detection really is for just two reasons: (a) keep humans
happy, and keep the diffs small and (b) help automatic merging across
renames. So getting renames right is certainly good, but it's more of a
"politeness" issue than a "correctness" issue, although the merge
portion of it does matter a lot sometimes.
- the important thing here is that you can commit your changes and not
worry about them being somehow "corrupted" by lack of rename detection,
even if you commit them with a version of git that doesn't do rename
detection the way you expected it. The rename detection is an
"after-the-fact" thing, not something that actually gets saved in the
repository, which is why we can change the heuristics _after_ seeing
examples, and the examples magically correct themselves!
- try out the two patches I've posted, and see if they work for you. They
pass the test-suite, and the output for your example commit looks sane,
but hey, if you have other test-cases, try them out.
Here's Kumar's pretty diffstat with both my patches:
Makefile | 6 +++---
board/{cds => freescale}/common/cadmus.c | 0
board/{cds => freescale}/common/cadmus.h | 0
board/{cds => freescale}/common/eeprom.c | 0
board/{cds => freescale}/common/eeprom.h | 0
board/{cds => freescale}/common/ft_board.c | 0
board/{cds => freescale}/common/via.c | 0
board/{cds => freescale}/common/via.h | 0
board/{cds => freescale}/mpc8541cds/Makefile | 0
board/{cds => freescale}/mpc8541cds/config.mk | 0
board/{cds => freescale}/mpc8541cds/init.S | 0
board/{cds => freescale}/mpc8541cds/mpc8541cds.c | 0
board/{cds => freescale}/mpc8541cds/u-boot.lds | 4 ++--
board/{cds => freescale}/mpc8548cds/Makefile | 0
board/{cds => freescale}/mpc8548cds/config.mk | 0
board/{cds => freescale}/mpc8548cds/init.S | 0
board/{cds => freescale}/mpc8548cds/mpc8548cds.c | 0
board/{cds => freescale}/mpc8548cds/u-boot.lds | 4 ++--
board/{cds => freescale}/mpc8555cds/Makefile | 0
board/{cds => freescale}/mpc8555cds/config.mk | 0
board/{cds => freescale}/mpc8555cds/init.S | 0
board/{cds => freescale}/mpc8555cds/mpc8555cds.c | 0
board/{cds => freescale}/mpc8555cds/u-boot.lds | 4 ++--
23 files changed, 9 insertions(+), 9 deletions(-)
and here it is before:
Makefile | 6 +-
board/cds/mpc8548cds/Makefile | 60 -----
board/cds/mpc8555cds/Makefile | 60 -----
board/cds/mpc8555cds/init.S | 255 --------------------
board/cds/mpc8555cds/u-boot.lds | 150 ------------
board/{cds => freescale}/common/cadmus.c | 0
board/{cds => freescale}/common/cadmus.h | 0
board/{cds => freescale}/common/eeprom.c | 0
board/{cds => freescale}/common/eeprom.h | 0
board/{cds => freescale}/common/ft_board.c | 0
board/{cds => freescale}/common/via.c | 0
board/{cds => freescale}/common/via.h | 0
board/{cds => freescale}/mpc8541cds/Makefile | 0
board/{cds => freescale}/mpc8541cds/config.mk | 0
board/{cds => freescale}/mpc8541cds/init.S | 0
board/{cds => freescale}/mpc8541cds/mpc8541cds.c | 0
board/{cds => freescale}/mpc8541cds/u-boot.lds | 4 +-
.../mpc8541cds => freescale/mpc8548cds}/Makefile | 0
board/{cds => freescale}/mpc8548cds/config.mk | 0
board/{cds => freescale}/mpc8548cds/init.S | 0
board/{cds => freescale}/mpc8548cds/mpc8548cds.c | 0
board/{cds => freescale}/mpc8548cds/u-boot.lds | 4 +-
.../mpc8541cds => freescale/mpc8555cds}/Makefile | 0
board/{cds => freescale}/mpc8555cds/config.mk | 0
.../mpc8541cds => freescale/mpc8555cds}/init.S | 0
board/{cds => freescale}/mpc8555cds/mpc8555cds.c | 0
.../mpc8541cds => freescale/mpc8555cds}/u-boot.lds | 4 +-
27 files changed, 9 insertions(+), 534 deletions(-)
so it certainly makes the diffs prettier.
Linus
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Kumar Gala had a case in the u-boot archive with multiple renames of files
with identical contents, and git would turn those into multiple "copy"
operations of one of the sources, and just deleting the other sources.
This patch makes the git exact rename detection prefer to spread out the
renames over the multiple sources, rather than do multiple copies of one
source.
NOTE! The changes are a bit larger than required, because I also renamed
the variables named "one" and "two" to "target" and "source" respectively.
That makes the logic easier to follow, especially as the "one" was
illogically the target and not the soruce, for purely historical reasons
(this piece of code used to traverse over sources and targets in the wrong
order, and when we fixed that, we didn't fix the names back then. So I
fixed them now).
The important part of this change is just the trivial score calculations
for when files have identical contents:
/* Give higher scores to sources that haven't been used already */
score = !source->rename_used;
score += basename_same(source, target);
and when we have multiple choices we'll now pick the choice that gets the
best rename score, rather than only looking at whether the basename
matched.
It's worth noting a few gotchas:
- this scoring is currently only done for the "exact match" case.
In particular, in Kumar's example, even after this patch, the inexact
match case is still done as a copy+delete rather than as two renames:
delete mode 100644 board/cds/mpc8555cds/u-boot.lds
copy board/{cds => freescale}/mpc8541cds/u-boot.lds (97%)
rename board/{cds/mpc8541cds => freescale/mpc8555cds}/u-boot.lds (97%)
because apparently the "cds/mpc8541cds/u-boot.lds" copy looked
a bit more similar to both end results. That said, I *suspect* we just
have the exact same issue there - the similarity analysis just gave
identical (or at least very _close_ to identical) similarity points,
and we do not have any logic to prefer multiple renames over a
copy/delete there.
That is a separate patch.
- When you have identical contents and identical basenames, the actual
entry that is chosen is still picked fairly "at random" for the first
one (but the subsequent ones will prefer entries that haven't already
been used).
It's not actually really random, in that it actually depends on the
relative alphabetical order of the files (which in turn will have
impacted the order that the entries got hashed!), so it gives
consistent results that can be explained. But I wanted to point it out
as an issue for when anybody actually does cross-renames.
In Kumar's case the choice is the right one (and for a single normal
directory rename it should always be, since the relative alphabetical
sorting of the files will be identical), and we now get:
rename board/{cds => freescale}/mpc8541cds/init.S (100%)
rename board/{cds => freescale}/mpc8548cds/init.S (100%)
which is the "expected" answer. However, it might still be better to
change the pedantic "exact same basename" on/off choice into a more
graduated "how similar are the pathnames" scoring situation, in order
to be more likely to get the exact rename choice that people *expect*
to see, rather than other alternatives that may *technically* be
equally good, but are surprising to a human.
It's also unclear whether we should consider "basenames are equal" or
"have already used this as a source" to be more important. This gives them
equal weight, but I suspect we might want to just multiple the "basenames
are equal" weight by two, or something, to prefer equal basenames even if
that causes a copy/delete pair. I dunno.
Anyway, what I'm just saying in a really long-winded manner is that I
think this is right as-is, but it's not the complete solution, and it may
want some further tweaking in the future.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we do the fuzzy rename detection, we don't care about the
destinations that we already handled with the exact rename detector.
And, in fact, the code already knew that - but the rename limiter, which
used to run *before* exact renames were detected, did not.
This fixes it so that the rename detection limiter now bases its
decisions on the *remaining* rename counts, rather than the original
ones.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
For historical reasons, the exact rename detection had populated the
filespecs for the entries it compared, and the rest of the similarity
analysis depended on that. I hadn't even bothered to debug why that was
the case when I re-did the rename detection, I just made the new one
have the same broken behaviour, with a note about this special case.
This fixes that fixme. The reason the exact rename detector needed to
fill in the file sizes of the files it checked was that the _inexact_
rename detector was broken, and started comparing file sizes before it
filled them in.
Fixing that allows the exact phase to do the sane thing of never even
caring (since all *it* cares about is really just the SHA1 itself, not
the size nor the contents).
It turns out that this also indirectly fixes a bug: trying to populate
all the filespecs will run out of virtual memory if there is tons and
tons of possible rename options. The fuzzy similarity analysis does the
right thing in this regard, and free's the blob info after it has
generated the hash tables, so the special case code caused more trouble
than just some extra illogical code.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Now that the exact rename detection is linear-time (with a very small
constant factor to boot), there is no longer any reason to limit it by
the number of files involved.
In some trivial testing, I created a repository with a directory that
had a hundred thousand files in it (all with different contents), and
then moved that directory to show the effects of renaming 100,000 files.
With the new code, that resulted in
[torvalds@woody big-rename]$ time ~/git/git show -C | wc -l
400006
real 0m2.071s
user 0m1.520s
sys 0m0.576s
ie the code can correctly detect the hundred thousand renames in about 2
seconds (the number "400006" comes from four lines for each rename:
diff --git a/really-big-dir/file-1-1-1-1-1 b/moved-big-dir/file-1-1-1-1-1
similarity index 100%
rename from really-big-dir/file-1-1-1-1-1
rename to moved-big-dir/file-1-1-1-1-1
and the extra six lines is from a one-liner commit message and all the
commit information and spacing).
Most of those two seconds weren't even really the rename detection, it's
really all the other stuff needed to get there.
With the old code, this wouldn't have been practically possible. Doing
a pairwise check of the ten billion possible pairs would have been
prohibitively expensive. In fact, even with the rename limiter in
place, the old code would waste a lot of time just on the diff_filespec
checks, and despite not even trying to find renames, it used to look
like:
[torvalds@woody big-rename]$ time git show -C | wc -l
1400006
real 0m12.337s
user 0m12.285s
sys 0m0.192s
ie we used to take 12 seconds for this load and not even do any rename
detection! (The number 1400006 comes from fourteen lines per file moved:
seven lines each for the delete and the create of a one-liner file, and
the same extra six lines of commit information).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This implements a smarter rename detector for exact renames, which
rather than doing a pairwise comparison (time O(m*n)) will just hash the
files into a hash-table (size O(n+m)), and only do pairwise comparisons
to renames that have the same hash (time O(n+m) except for unrealistic
hash collissions, which we just cull aggressively).
Admittedly the exact rename case is not nearly as interesting as the
generic case, but it's an important case none-the-less. A similar general
approach should work for the generic case too, but even then you do need
to handle the exact renames/copies separately (to avoid the inevitable
added cost factor that comes from the _size_ of the file), so this is
worth doing.
In the expectation that we will indeed do the same hashing trick for the
general rename case, this code uses a generic hash-table implementation
that can be used for other things too. In fact, we might be able to
consolidate some of our existing hash tables with the new generic code
in hash.[ch].
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The core rename detection had some rather stupid code to check if a
pathname was used by a later modification or rename, which basically
walked the whole pathname space for all renames for each rename, in
order to tell whether it was a pure rename (no remaining users) or
should be considered a copy (other users of the source file remaining).
That's really silly, since we can just keep a count of users around, and
replace all those complex and expensive loops with just testing that
simple counter (but this all depends on the previous commit that shared
the diff_filespec data structure by using a separate reference count).
Note that the reference count is not the same as the rename count: they
behave otherwise rather similarly, but the reference count is tied to
the allocation (and decremented at de-allocation, so that when it turns
zero we can get rid of the memory), while the rename count is tied to
the renames and is decremented when we find a rename (so that when it
turns zero we know that it was a rename, not a copy).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Rather than copy the filespecs when introducing new versions of them
(for rename or copy detection), use a refcount and increment the count
when reusing the diff_filespec.
This avoids unnecessary allocations, but the real reason behind this is
a future enhancement: we will want to track shared data across the
copy/rename detection. In order to efficiently notice when a filespec
is used by a rename, the rename machinery wants to keep track of a
rename usage count which is shared across all different users of the
filespec.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This makes the exact content match a separate function of its own.
Partly to cut down a bit on the size of the diffcore_rename() function
(which is too complex as it is), and partly because there are smarter
ways to do this than an O(m*n) loop over it all, and that function
should be rewritten to take that into account.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We find rename candidates by computing a fingerprint hash of
each file, and then comparing those fingerprints. There are
inherently O(n^2) comparisons, so it pays in CPU time to
hoist the (rather expensive) computation of the fingerprint
out of that loop (or to cache it once we have computed it once).
Previously, we didn't keep the filespec information around
because then we had the potential to consume a great deal of
memory. However, instead of keeping all of the filespec
data, we can instead just keep the fingerprint.
This patch implements and uses diff_free_filespec_data_large
to accomplish that goal. We also have to change
estimate_similarity not to needlessly repopulate the
filespec data when we already have the hash.
Practical tests showed 4.5x speedup for a 10% memory usage
increase.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This adds more proper rename detection limits. Instead of just checking
the limit against the number of potential rename destinations, we verify
that the rename matrix (which is what really matters) doesn't grow
ridiculously large, and we also make sure that we don't overflow when
doing the matrix size calculation.
This also changes the default limits from unlimited, to a rename matrix
that is limited to 100 entries on a side. You can raise it with the config
entry, or by using the "-l<n>" command line flag, but at least the default
is now a sane number that avoids spending lots of time (and memory) in
situations that likely don't merit it.
The choice of default value is of course very debatable. Limiting the
rename matrix to a 100x100 size will mean that even if you have just one
obvious rename, but you also create (or delete) 10,000 files, the rename
matrix will be so big that we disable the heuristics. Sounds reasonable to
me, but let's see if people hit this (and, perhaps more importantly,
actually *care*) in real life.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jc/diffcore:
diffcore-delta.c: Ignore CR in CRLF for text files
diffcore-delta.c: update the comment on the algorithm.
diffcore_filespec: add is_binary
diffcore_count_changes: pass diffcore_filespec
We may want to use richer information on the data we are dealing
with in this function, so instead of passing a buffer address
and length, just pass the diffcore_filespec structure. Existing
callers always call this function with parameters taken from a
filespec anyway, so there is no functionality changes.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This implements a suggestion from Johannes. It uses a separate field in
struct diff_score to keep the result of the file name comparison in the
rename detection logic. This reverts the value of the similarity index
to be a function of file contents, only, and basename comparison is only
used to decide between files with equal amounts of content changes.
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>