microsoft/git - git

Граф коммитов

Автор	SHA1	Сообщение	Дата
Junio C Hamano	26effb8487	Merge branch 'sg/diff-multiple-identical-renames' "git diff -M" used to work better when two originally identical files A and B got renamed to X/A and X/B by pairing A to X/A and B to X/B, but this was broken in the 2.0 timeframe. * sg/diff-multiple-identical-renames: diffcore: fix iteration order of identical files during rename detection	2016-04-13 14:12:32 -07:00
SZEDER Gábor	ca4e3ca029	diffcore: fix iteration order of identical files during rename detection If the two paths 'dir/A/file' and 'dir/B/file' have identical content and the parent directory is renamed, e.g. 'git mv dir other-dir', then diffcore reports the following exact renames: renamed: dir/B/file -> other-dir/A/file renamed: dir/A/file -> other-dir/B/file While technically not wrong, this is confusing not only for the user, but also for git commands that make decisions based on rename information, e.g. 'git log --follow other-dir/A/file' follows 'dir/B/file' past the rename. This behavior is a side effect of commit v2.0.0-rc4~8^2~14 (diffcore-rename.c: simplify finding exact renames, 2013-11-14): the hashmap storing sources returns entries from the same bucket, i.e. sources matching the current destination, in LIFO order. Thus the iteration first examines 'other-dir/A/file' and 'dir/B/file' and, upon finding identical content and basename, reports an exact rename. Other hashmap users are apparently happy with the current iteration order over the entries of a bucket. Changing the iteration order would risk upsetting other hashmap users and would increase the memory footprint of each bucket by a pointer to the tail element. Fill the hashmap with source entries in reverse order to restore the original exact rename detection behavior. Reported-by: Bill Okara <billokara@gmail.com> Signed-off-by: SZEDER Gábor <szeder@ira.uka.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-03-30 13:46:04 -07:00
Jeff King	50a6c8efa2	use st_add and st_mult for allocation size computation If our size computation overflows size_t, we may allocate a much smaller buffer than we expected and overflow it. It's probably impossible to trigger an overflow in most of these sites in practice, but it is easy enough convert their additions and multiplications into overflow-checking variants. This may be fixing real bugs, and it makes auditing the code easier. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-02-22 14:51:09 -08:00
Junio C Hamano	2d659f7d6e	Merge branch 'jk/diffcore-rename-duplicate' A corrupt input to "git diff -M" can cause us to segfault. * jk/diffcore-rename-duplicate: diffcore-rename: avoid processing duplicate destinations diffcore-rename: split locate_rename_dst into two functions	2015-03-10 13:52:39 -07:00
Jeff King	4d6be03b95	diffcore-rename: avoid processing duplicate destinations The rename code cannot handle an input where we have duplicate destinations (i.e., more than one diff_filepair in the queue with the same string in its pair->two->path). We end up allocating only one slot in the rename_dst mapping. If we fill in the diff_filepair for that slot, when we re-queue the results, we may queue that filepair multiple times. When the diff is finally flushed, the filepair is processed and free()d multiple times, leading to heap corruption. This situation should only happen when a tree diff sees duplicates in one of the trees (see the added test for a detailed example). Rather than handle it, the sanest thing is just to turn off rename detection altogether for the diff. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-02-27 13:41:29 -08:00
Jeff King	f98c2f7e53	diffcore-rename: split locate_rename_dst into two functions This function manages the mapping of destination pathnames to filepairs, and it handles both insertion and lookup. This makes the return value a bit confusing, as we return a newly created entry (even though no caller cares), and have no room to indicate to the caller that an entry already existed. Instead, let's break this up into two distinct functions, both backed by a common binary search. The binary search will use our normal "return the index if we found something, or negative index minus one to show where it would have gone" semantics. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-02-27 13:41:21 -08:00
Nguyễn Thái Ngọc Duy	8e5dd3d654	diff.c: allow to pass more flags to diff_populate_filespec Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-08-18 10:16:35 -07:00
Karsten Blees	ab73a9d119	hashmap: add simplified hashmap_get_from_hash() API Hashmap entries are typically looked up by just a key. The hashmap_get() API expects an initialized entry structure instead, to support compound keys. This flexibility is currently only needed by find_dir_entry() in name-hash.c (and compat/win32/fscache.c in the msysgit fork). All other (currently five) call sites of hashmap_get() have to set up a near emtpy entry structure, resulting in duplicate code like this: struct hashmap_entry keyentry; hashmap_entry_init(&keyentry, hash(key)); return hashmap_get(map, &keyentry, key); Add a hashmap_get_from_hash() API that allows hashmap lookups by just specifying the key and its hash code, i.e.: return hashmap_get_from_hash(map, hash(key), key); Signed-off-by: Karsten Blees <blees@dcon.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-07-07 13:56:35 -07:00
Karsten Blees	039dc71a7c	hashmap: factor out getting a hash code from a SHA1 Copying the first bytes of a SHA1 is duplicated in six places, however, the implications (the actual value would depend on the endianness of the platform) is documented only once. Add a properly documented API for this. Signed-off-by: Karsten Blees <blees@dcon.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-07-07 13:56:24 -07:00
Junio C Hamano	fe9122a352	Merge branch 'dd/use-alloc-grow' Replace open-coded reallocation with ALLOC_GROW() macro. * dd/use-alloc-grow: sha1_file.c: use ALLOC_GROW() in pretend_sha1_file() read-cache.c: use ALLOC_GROW() in add_index_entry() builtin/mktree.c: use ALLOC_GROW() in append_to_tree() attr.c: use ALLOC_GROW() in handle_attr_line() dir.c: use ALLOC_GROW() in create_simplify() reflog-walk.c: use ALLOC_GROW() replace_object.c: use ALLOC_GROW() in register_replace_object() patch-ids.c: use ALLOC_GROW() in add_commit() diffcore-rename.c: use ALLOC_GROW() diff.c: use ALLOC_GROW() commit.c: use ALLOC_GROW() in register_commit_graft() cache-tree.c: use ALLOC_GROW() in find_subtree() bundle.c: use ALLOC_GROW() in add_to_ref_list() builtin/pack-objects.c: use ALLOC_GROW() in check_pbase_path()	2014-03-18 13:50:21 -07:00
Junio C Hamano	0963008cbf	Merge branch 'nd/i18n-progress' Mark the progress indicators from various time-consuming commands for i18n/l10n. * nd/i18n-progress: i18n: mark all progress lines for translation	2014-03-14 14:26:31 -07:00
Dmitry S. Dolzhenko	337ce247e3	diffcore-rename.c: use ALLOC_GROW() Use ALLOC_GROW() instead of open-coding it in locate_rename_dst() and register_rename_src(). Signed-off-by: Dmitry S. Dolzhenko <dmitrys.dolzhenko@yandex.ru> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-03-03 14:49:02 -08:00
Nguyễn Thái Ngọc Duy	754dbc43f0	i18n: mark all progress lines for translation Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-02-24 09:08:37 -08:00
Karsten Blees	f79d9c5814	diffcore-rename.c: use new hash map implementation Signed-off-by: Karsten Blees <blees@dcon.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-11-18 13:04:23 -08:00
Karsten Blees	7c85f8acb2	diffcore-rename.c: simplify finding exact renames The find_exact_renames function currently only uses the hash table for grouping, i.e.: 1. add sources 2. add destinations 3. iterate all buckets, per bucket: 4. split sources from destinations 5. iterate destinations, per destination: 6. iterate sources to find best match This can be simplified by utilizing the lookup functionality of the hash table, i.e.: 1. add sources 2. iterate destinations, per destination: 3. lookup sources matching the current destination 4. iterate sources to find best match This saves several iterations and file_similarity allocations for the destinations. Signed-off-by: Karsten Blees <blees@dcon.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-11-18 13:04:23 -08:00
Karsten Blees	48f6407ffe	diffcore-rename.c: move code around to prepare for the next patch No actual code changes, just move hash_filespec up and outdent part of find_identical_files. Signed-off-by: Karsten Blees <blees@dcon.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-11-18 13:04:22 -08:00
Nguyễn Thái Ngọc Duy	c73592812d	Preallocate hash tables when the number of inserts are known in advance This avoids unnecessary re-allocations and reinsertions. On webkit.git (i.e. about 182k inserts to the name hash table), this reduces about 100ms out of 3s user time. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-03-16 22:57:29 -07:00
Junio C Hamano	3b753148b6	Merge branch 'jk/maint-null-in-trees' We do not want a link to 0{40} object stored anywhere in our objects. * jk/maint-null-in-trees: fsck: detect null sha1 in tree entries do not write null sha1s to on-disk index diff: do not use null sha1 as a sentinel value	2012-08-27 11:54:28 -07:00
Jeff King	e54501004a	diff: do not use null sha1 as a sentinel value The diff code represents paths using the diff_filespec struct. This struct has a sha1 to represent the sha1 of the content at that path, as well as a sha1_valid member which indicates whether its sha1 field is actually useful. If sha1_valid is not true, then the filespec represents a working tree file (e.g., for the no-index case, or for when the index is not up-to-date). The diff_filespec is only used internally, though. At the interfaces to the diff subsystem, callers feed the sha1 directly, and we create a diff_filespec from it. It's at that point that we look at the sha1 and decide whether it is valid or not; callers may pass the null sha1 as a sentinel value to indicate that it is not. We should not typically see the null sha1 coming from any other source (e.g., in the index itself, or from a tree). However, a corrupt tree might have a null sha1, which would cause "diff --patch" to accidentally diff the working tree version of a file instead of treating it as a blob. This patch extends the edges of the diff interface to accept a "sha1_valid" flag whenever we accept a sha1, and to use that flag when creating a filespec. In some cases, this means passing the flag through several layers, making the code change larger than would be desirable. One alternative would be to simply die() upon seeing corrupted trees with null sha1s. However, this fix more directly addresses the problem (while bogus sha1s in a tree are probably a bad thing, it is really the sentinel confusion sending us down the wrong code path that is what makes it devastating). And it means that git is more capable of examining and debugging these corrupted trees. For example, you can still "diff --raw" such a tree to find out when the bogus entry was introduced; you just cannot do a "--patch" diff (just as you could not with any other corrupted tree, as we do not have any content to diff). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2012-07-29 15:04:32 -07:00
Jeff King	90d43b0768	teach diffcore-rename to optionally ignore empty content Our rename detection is a heuristic, matching pairs of removed and added files with similar or identical content. It's unlikely to be wrong when there is actual content to compare, and we already take care not to do inexact rename detection when there is not enough content to produce good results. However, we always do exact rename detection, even when the blob is tiny or empty. It's easy to get false positives with an empty blob, simply because it is an obvious content to use as a boilerplate (e.g., when telling git that an empty directory is worth tracking via an empty .gitignore). This patch lets callers specify whether or not they are interested in using empty files as rename sources and destinations. The default is "yes", keeping the original behavior. It works by detecting the empty-blob sha1 for rename sources and destinations. One more flexible alternative would be to allow the caller to specify a minimum size for a blob to be "interesting" for rename detection. But that would catch small boilerplate files, not large ones (e.g., if you had the GPL COPYING file in many directories). A better alternative would be to allow a "-rename" gitattribute to allow boilerplate files to be marked as such. I'll leave the complexity of that solution until such time as somebody actually wants it. The complaints we've seen so far revolve around empty files, so let's start with the simple thing. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2012-03-23 13:52:49 -07:00
Junio C Hamano	2db8926236	Merge branch 'mz/maint-rename-unmerged' * mz/maint-rename-unmerged: diffcore-rename: don't consider unmerged path as source	2011-05-02 15:58:27 -07:00
Jim Meyering	dabdbee10b	diffcore-rename.c: avoid set-but-not-used warning Since `9d8a5a5` (diffcore-rename: refactor "too many candidates" logic, 2011-01-06), diffcore_rename() initializes num_src but does not use it anymore. "-Wunused-but-set-variable" in gcc-4.6 complains about this. Signed-off-by: Jim Meyering <meyering@redhat.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2011-04-29 08:20:10 -07:00
Martin von Zweigbergk	d7c9bf2235	diffcore-rename: don't consider unmerged path as source Since `e9c8409` (diff-index --cached --raw: show tree entry on the LHS for unmerged entries., 2007-01-05), an unmerged entry should be detected by using DIFF_PAIR_UNMERGED(p), not by noticing both one and two sides of the filepair records mode=0 entries. However, it forgot to update some parts of the rename detection logic. This only makes difference in the "diff --cached" codepath where an unmerged filepair carries information on the entries that came from the tree. It probably hasn't been noticed for a long time because nobody would run "diff -M" during a conflict resolution, but "git status" uses rename detection when it internally runs "diff-index" and "diff-files" and gives nonsense results. In an unmerged pair, "one" side can have a valid filespec to record the tree entry (e.g. what's in HEAD) when running "diff --cached". This can be used as a rename source to other paths in the index that are not unmerged. The path that is unmerged by definition does not have the final content yet (i.e. "two" side cannot have a valid filespec), so it can never be a rename destination. Use the DIFF_PAIR_UNMERGED() to detect unmerged filepair correctly, and allow the valid "one" side of an unmerged filepair to be considered a potential rename source, but never to be considered a rename destination. Commit message and first two test cases by Junio, the rest by Martin. Signed-off-by: Martin von Zweigbergk <martin.von.zweigbergk@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2011-03-23 22:44:22 -07:00
Junio C Hamano	f31027c99c	diffcore-rename: fall back to -C when -C -C busts the rename limit When there are too many paths in the project, the number of rename source candidates "git diff -C -C" finds will exceed the rename detection limit, and no inexact rename detection is performed. We however could fall back to "git diff -C" if the number of modified paths is sufficiently small. Signed-off-by: Junio C Hamano <gitster@pobox.com>	2011-03-22 14:29:07 -07:00
Junio C Hamano	e88d6bc6f9	diffcore-rename: record filepair for rename src This will allow us to later skip unmodified entries added due to "-C -C". We might also want to do something similar to rename_dst side, but that would only be for the sake of symmetry and not necessary for this series. Signed-off-by: Junio C Hamano <gitster@pobox.com>	2011-03-22 14:29:07 -07:00
Junio C Hamano	9d8a5a50b7	diffcore-rename: refactor "too many candidates" logic Move the logic to a separate function, to be enhanced by later patches in the series. While at it, swap the condition used in the if statement from "if it is too big then do this" to "if it would fit then do this". Signed-off-by: Junio C Hamano <gitster@pobox.com> --- Rebased to 'master' as the logic to use the result of this logic was updated recently, together with the addition of eye-candy.	2011-03-22 14:29:07 -07:00
Junio C Hamano	0ce6a51b43	Merge branch 'jk/merge-rename-ux' * jk/merge-rename-ux: pull: propagate --progress to merge merge: enable progress reporting for rename detection add inexact rename detection progress infrastructure commit: stop setting rename limit bump rename limit defaults (again) merge: improve inexact rename limit warning	2011-03-19 23:23:56 -07:00
Jeff King	3ac942d42e	add inexact rename detection progress infrastructure We might spend many seconds doing inexact rename detection with no output. It's nice to let the user know that something is actually happening. This patch adds the infrastructure, but no callers actually turn on progress reporting. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2011-02-21 10:24:07 -08:00
Jeff King	bf0ab10fa8	merge: improve inexact rename limit warning The warning is generated deep in the diffcore code, which means that it will come first, followed possibly by a spew of conflicts, making it hard to see. Instead, let's have diffcore pass back the information about how big the rename limit would needed to have been, and then the caller can provide a more appropriate message (and at a more appropriate time). No refactoring of other non-merge callers is necessary, because nobody else was even using the warn_on_rename_limit feature. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2011-02-21 10:21:26 -08:00
Linus Torvalds	3a4d67692b	diffcore-rename: improve estimate_similarity() heuristics The logic to quickly dismiss potential rename pairs was broken. It would too eagerly dismiss possible renames when all of the difference was due to pure new data (or deleted data). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2011-02-18 22:26:45 -08:00
Linus Torvalds	0940e5f211	diffcore-rename: properly honor the difference between -M and -C We would allow rename detection to do copy detection even when asked purely for renames. That confuses users, but more importantly it can terminally confuse the recursive merge rename logic. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2011-02-18 22:26:28 -08:00
Linus Torvalds	11f944dd6b	for_each_hash: allow passing a 'void *data' pointer to callback For the find_exact_renames() function, this allows us to pass the diff_options structure pointer to the low-level routines. We will use that to distinguish between the "rename" and "copy" cases. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2011-02-18 22:25:51 -08:00
Bo Yang	9ca5df9061	Add a macro DIFF_QUEUE_CLEAR. Refactor the diff_queue_struct code, this macro help to reset the structure. Signed-off-by: Bo Yang <struggleyb.nku@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2010-05-07 09:34:27 -07:00
Junio C Hamano	809809bb75	diffcore-rename: reduce memory footprint by freeing blob data early After running one round of estimate_similarity(), filespecs on either side will have populated their cnt_data fields, and we do not need the blob text anymore. We used to retain the blob data to optimize for smaller projects (not freeing the blob data here would mean that the final output phase would not have to re-read it), but we are efficient enough without such optimization for smaller projects anyway, and freeing memory early will help larger projects. Signed-off-by: Junio C Hamano <gitster@pobox.com>	2009-11-20 22:13:47 -08:00
Mike Ralphson	3ea3c215c0	Fix typos / spelling in comments Signed-off-by: Mike Ralphson <mike@abacus.co.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2009-04-22 19:02:12 -07:00
Björn Steinbrink	885c716f0f	Rename detection: Avoid repeated filespec population In diffcore_rename, we assume that the blob contents in the filespec aren't required anymore after estimate_similarity has been called and thus we free it. But estimate_similarity might return early when the file sizes differ too much. In that case, cnt_data is never set and the next call to estimate_similarity will populate the filespec again, eventually rereading the same blob over and over again. To fix that, we first get the blob sizes and only when the blob contents are actually required, and when cnt_data will be set, the full filespec is populated, once. Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2009-01-21 00:14:12 -08:00
Linus Torvalds	6e381d3aff	Add file delete/create info when we overflow rename_limit When we refuse to do rename detection due to having too many files created or deleted, let the user know the numbers. That way there is a reasonable starting point for setting the diff.renamelimit option. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2008-10-28 08:58:42 -07:00
Jeff King	b8960bbe7b	diff: make "too many files" rename warning optional In many cases, the warning ends up as clutter, because the diff is being done "behind the scenes" from the user (e.g., when generating a commit diffstat), and whether we show renames or not is not particularly interesting to the user. However, in the case of a merge (which is what motivated the warning in the first place), it is a useful hint as to why a merge with renames might have failed. This patch makes the warning optional based on the code calling into diffcore. We default to not showing the warning, but turn it on for merges. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2008-05-03 13:40:43 -07:00
Junio C Hamano	2a5fe25458	Merge branch 'jc/rename' * 'jc/rename' (early part): Optimize rename detection for a huge diff	2008-04-09 01:09:12 -07:00
Jeff King	ee542ee3fc	rename: warn user when we have turned off rename detection Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2008-03-01 01:30:15 -08:00
Junio C Hamano	6d24ad971c	Optimize rename detection for a huge diff When there are N deleted paths and M created paths, we used to allocate (N x M) "struct diff_score" that record how similar each of the pair is, and picked the <src,dst> pair that gives the best match first, and then went on to process worse matches. This sorting is done so that when two new files in the postimage that are similar to the same file deleted from the preimage, we can process the more similar one first, and when processing the second one, it can notice "Ah, the source I was planning to say I am a copy of is already taken by somebody else" and continue on to match itself with another file in the preimage with a lessor match. This matters to a change introduced between 1.5.3.X series and 1.5.4-rc, that lets the code to favor unused matches first and then falls back to using already used matches. This instead allocates and keeps only a handful rename source candidates per new files in the postimage. I.e. it makes the memory requirement from O(N x M) to O(M). For each dst, we compute similarlity with all sources (i.e. the number of similarity estimate computations is still O(N x M)), but we keep handful best src candidates for each dst. Signed-off-by: Junio C Hamano <gitster@pobox.com>	2008-02-13 15:44:20 -08:00
Linus Torvalds	9ae8fcb36a	Fix a pathological case in git detecting proper renames On Thu, 29 Nov 2007, Jeff King wrote: > > I think it will get worse, because you are simultaneously calculating > all of the similarity scores bit by bit rather than doing a loop. Though > perhaps you mean at the end you will end up with a list of src/dst pairs > sorted by score, and you can loop over that. Well, after thinking about this a bit, I think there's a solution that may work well with the current thing too: instead of looping just once over the list of rename pairs, loop twice - and simply refuse to do copies on the first loop. This trivial patch does that, and turns Kumar's test-case into a perfect rename list. It's not pretty, it's not smart, but it seems to work. There's something to be said for keeping it simple and stupid. And it should not be nearly as expensive as it may _look_. Yes, the loop is "(i = 0; i < num_create * num_src; i++)", but the important part is that the whole array is sorted by rename score, and we have a if (mx[i].score < minimum_score) break; in it, so uthe loop actually would tend to terminate rather quickly. Anyway, Kumar, the thing to take away from this is: - git really doesn't even care about the whole "rename detection" internally, and any commits you have done with renames are totally independent of the heuristics we then use to show the renames. - the rename detection really is for just two reasons: (a) keep humans happy, and keep the diffs small and (b) help automatic merging across renames. So getting renames right is certainly good, but it's more of a "politeness" issue than a "correctness" issue, although the merge portion of it does matter a lot sometimes. - the important thing here is that you can commit your changes and not worry about them being somehow "corrupted" by lack of rename detection, even if you commit them with a version of git that doesn't do rename detection the way you expected it. The rename detection is an "after-the-fact" thing, not something that actually gets saved in the repository, which is why we can change the heuristics _after_ seeing examples, and the examples magically correct themselves! - try out the two patches I've posted, and see if they work for you. They pass the test-suite, and the output for your example commit looks sane, but hey, if you have other test-cases, try them out. Here's Kumar's pretty diffstat with both my patches: Makefile \| 6 +++--- board/{cds => freescale}/common/cadmus.c \| 0 board/{cds => freescale}/common/cadmus.h \| 0 board/{cds => freescale}/common/eeprom.c \| 0 board/{cds => freescale}/common/eeprom.h \| 0 board/{cds => freescale}/common/ft_board.c \| 0 board/{cds => freescale}/common/via.c \| 0 board/{cds => freescale}/common/via.h \| 0 board/{cds => freescale}/mpc8541cds/Makefile \| 0 board/{cds => freescale}/mpc8541cds/config.mk \| 0 board/{cds => freescale}/mpc8541cds/init.S \| 0 board/{cds => freescale}/mpc8541cds/mpc8541cds.c \| 0 board/{cds => freescale}/mpc8541cds/u-boot.lds \| 4 ++-- board/{cds => freescale}/mpc8548cds/Makefile \| 0 board/{cds => freescale}/mpc8548cds/config.mk \| 0 board/{cds => freescale}/mpc8548cds/init.S \| 0 board/{cds => freescale}/mpc8548cds/mpc8548cds.c \| 0 board/{cds => freescale}/mpc8548cds/u-boot.lds \| 4 ++-- board/{cds => freescale}/mpc8555cds/Makefile \| 0 board/{cds => freescale}/mpc8555cds/config.mk \| 0 board/{cds => freescale}/mpc8555cds/init.S \| 0 board/{cds => freescale}/mpc8555cds/mpc8555cds.c \| 0 board/{cds => freescale}/mpc8555cds/u-boot.lds \| 4 ++-- 23 files changed, 9 insertions(+), 9 deletions(-) and here it is before: Makefile \| 6 +- board/cds/mpc8548cds/Makefile \| 60 ----- board/cds/mpc8555cds/Makefile \| 60 ----- board/cds/mpc8555cds/init.S \| 255 -------------------- board/cds/mpc8555cds/u-boot.lds \| 150 ------------ board/{cds => freescale}/common/cadmus.c \| 0 board/{cds => freescale}/common/cadmus.h \| 0 board/{cds => freescale}/common/eeprom.c \| 0 board/{cds => freescale}/common/eeprom.h \| 0 board/{cds => freescale}/common/ft_board.c \| 0 board/{cds => freescale}/common/via.c \| 0 board/{cds => freescale}/common/via.h \| 0 board/{cds => freescale}/mpc8541cds/Makefile \| 0 board/{cds => freescale}/mpc8541cds/config.mk \| 0 board/{cds => freescale}/mpc8541cds/init.S \| 0 board/{cds => freescale}/mpc8541cds/mpc8541cds.c \| 0 board/{cds => freescale}/mpc8541cds/u-boot.lds \| 4 +- .../mpc8541cds => freescale/mpc8548cds}/Makefile \| 0 board/{cds => freescale}/mpc8548cds/config.mk \| 0 board/{cds => freescale}/mpc8548cds/init.S \| 0 board/{cds => freescale}/mpc8548cds/mpc8548cds.c \| 0 board/{cds => freescale}/mpc8548cds/u-boot.lds \| 4 +- .../mpc8541cds => freescale/mpc8555cds}/Makefile \| 0 board/{cds => freescale}/mpc8555cds/config.mk \| 0 .../mpc8541cds => freescale/mpc8555cds}/init.S \| 0 board/{cds => freescale}/mpc8555cds/mpc8555cds.c \| 0 .../mpc8541cds => freescale/mpc8555cds}/u-boot.lds \| 4 +- 27 files changed, 9 insertions(+), 534 deletions(-) so it certainly makes the diffs prettier. Linus Signed-off-by: Junio C Hamano <gitster@pobox.com>	2007-11-30 15:49:17 -08:00
Linus Torvalds	32d75d29f9	Fix a pathological case in git detecting proper renames Kumar Gala had a case in the u-boot archive with multiple renames of files with identical contents, and git would turn those into multiple "copy" operations of one of the sources, and just deleting the other sources. This patch makes the git exact rename detection prefer to spread out the renames over the multiple sources, rather than do multiple copies of one source. NOTE! The changes are a bit larger than required, because I also renamed the variables named "one" and "two" to "target" and "source" respectively. That makes the logic easier to follow, especially as the "one" was illogically the target and not the soruce, for purely historical reasons (this piece of code used to traverse over sources and targets in the wrong order, and when we fixed that, we didn't fix the names back then. So I fixed them now). The important part of this change is just the trivial score calculations for when files have identical contents: /* Give higher scores to sources that haven't been used already / score = !source->rename_used; score += basename_same(source, target); and when we have multiple choices we'll now pick the choice that gets the best rename score, rather than only looking at whether the basename matched. It's worth noting a few gotchas: - this scoring is currently only done for the "exact match" case. In particular, in Kumar's example, even after this patch, the inexact match case is still done as a copy+delete rather than as two renames: delete mode 100644 board/cds/mpc8555cds/u-boot.lds copy board/{cds => freescale}/mpc8541cds/u-boot.lds (97%) rename board/{cds/mpc8541cds => freescale/mpc8555cds}/u-boot.lds (97%) because apparently the "cds/mpc8541cds/u-boot.lds" copy looked a bit more similar to both end results. That said, I suspect* we just have the exact same issue there - the similarity analysis just gave identical (or at least very _close_ to identical) similarity points, and we do not have any logic to prefer multiple renames over a copy/delete there. That is a separate patch. - When you have identical contents and identical basenames, the actual entry that is chosen is still picked fairly "at random" for the first one (but the subsequent ones will prefer entries that haven't already been used). It's not actually really random, in that it actually depends on the relative alphabetical order of the files (which in turn will have impacted the order that the entries got hashed!), so it gives consistent results that can be explained. But I wanted to point it out as an issue for when anybody actually does cross-renames. In Kumar's case the choice is the right one (and for a single normal directory rename it should always be, since the relative alphabetical sorting of the files will be identical), and we now get: rename board/{cds => freescale}/mpc8541cds/init.S (100%) rename board/{cds => freescale}/mpc8548cds/init.S (100%) which is the "expected" answer. However, it might still be better to change the pedantic "exact same basename" on/off choice into a more graduated "how similar are the pathnames" scoring situation, in order to be more likely to get the exact rename choice that people expect to see, rather than other alternatives that may technically be equally good, but are surprising to a human. It's also unclear whether we should consider "basenames are equal" or "have already used this as a source" to be more important. This gives them equal weight, but I suspect we might want to just multiple the "basenames are equal" weight by two, or something, to prefer equal basenames even if that causes a copy/delete pair. I dunno. Anyway, what I'm just saying in a really long-winded manner is that I think this is right as-is, but it's not the complete solution, and it may want some further tweaking in the future. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2007-11-30 15:49:17 -08:00
Linus Torvalds	42899ac898	Do the fuzzy rename detection limits with the exact renames removed When we do the fuzzy rename detection, we don't care about the destinations that we already handled with the exact rename detector. And, in fact, the code already knew that - but the rename limiter, which used to run before exact renames were detected, did not. This fixes it so that the rename detection limiter now bases its decisions on the remaining rename counts, rather than the original ones. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2007-10-26 23:18:06 -07:00
Linus Torvalds	81ac051d6a	Fix ugly magic special case in exact rename detection For historical reasons, the exact rename detection had populated the filespecs for the entries it compared, and the rest of the similarity analysis depended on that. I hadn't even bothered to debug why that was the case when I re-did the rename detection, I just made the new one have the same broken behaviour, with a note about this special case. This fixes that fixme. The reason the exact rename detector needed to fill in the file sizes of the files it checked was that the _inexact_ rename detector was broken, and started comparing file sizes before it filled them in. Fixing that allows the exact phase to do the sane thing of never even caring (since all it cares about is really just the SHA1 itself, not the size nor the contents). It turns out that this also indirectly fixes a bug: trying to populate all the filespecs will run out of virtual memory if there is tons and tons of possible rename options. The fuzzy similarity analysis does the right thing in this regard, and free's the blob info after it has generated the hash tables, so the special case code caused more trouble than just some extra illogical code. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2007-10-26 23:18:06 -07:00
Linus Torvalds	17559a643e	Do exact rename detection regardless of rename limits Now that the exact rename detection is linear-time (with a very small constant factor to boot), there is no longer any reason to limit it by the number of files involved. In some trivial testing, I created a repository with a directory that had a hundred thousand files in it (all with different contents), and then moved that directory to show the effects of renaming 100,000 files. With the new code, that resulted in [torvalds@woody big-rename]$ time ~/git/git show -C \| wc -l 400006 real 0m2.071s user 0m1.520s sys 0m0.576s ie the code can correctly detect the hundred thousand renames in about 2 seconds (the number "400006" comes from four lines for each rename: diff --git a/really-big-dir/file-1-1-1-1-1 b/moved-big-dir/file-1-1-1-1-1 similarity index 100% rename from really-big-dir/file-1-1-1-1-1 rename to moved-big-dir/file-1-1-1-1-1 and the extra six lines is from a one-liner commit message and all the commit information and spacing). Most of those two seconds weren't even really the rename detection, it's really all the other stuff needed to get there. With the old code, this wouldn't have been practically possible. Doing a pairwise check of the ten billion possible pairs would have been prohibitively expensive. In fact, even with the rename limiter in place, the old code would waste a lot of time just on the diff_filespec checks, and despite not even trying to find renames, it used to look like: [torvalds@woody big-rename]$ time git show -C \| wc -l 1400006 real 0m12.337s user 0m12.285s sys 0m0.192s ie we used to take 12 seconds for this load and not even do any rename detection! (The number 1400006 comes from fourteen lines per file moved: seven lines each for the delete and the create of a one-liner file, and the same extra six lines of commit information). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2007-10-26 23:18:06 -07:00
Linus Torvalds	9027f53cb5	Do linear-time/space rename logic for exact renames This implements a smarter rename detector for exact renames, which rather than doing a pairwise comparison (time O(m*n)) will just hash the files into a hash-table (size O(n+m)), and only do pairwise comparisons to renames that have the same hash (time O(n+m) except for unrealistic hash collissions, which we just cull aggressively). Admittedly the exact rename case is not nearly as interesting as the generic case, but it's an important case none-the-less. A similar general approach should work for the generic case too, but even then you do need to handle the exact renames/copies separately (to avoid the inevitable added cost factor that comes from the _size_ of the file), so this is worth doing. In the expectation that we will indeed do the same hashing trick for the general rename case, this code uses a generic hash-table implementation that can be used for other things too. In fact, we might be able to consolidate some of our existing hash tables with the new generic code in hash.[ch]. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2007-10-26 23:18:06 -07:00
Linus Torvalds	644797119d	copy vs rename detection: avoid unnecessary O(n*m) loops The core rename detection had some rather stupid code to check if a pathname was used by a later modification or rename, which basically walked the whole pathname space for all renames for each rename, in order to tell whether it was a pure rename (no remaining users) or should be considered a copy (other users of the source file remaining). That's really silly, since we can just keep a count of users around, and replace all those complex and expensive loops with just testing that simple counter (but this all depends on the previous commit that shared the diff_filespec data structure by using a separate reference count). Note that the reference count is not the same as the rename count: they behave otherwise rather similarly, but the reference count is tied to the allocation (and decremented at de-allocation, so that when it turns zero we can get rid of the memory), while the rename count is tied to the renames and is decremented when we find a rename (so that when it turns zero we know that it was a rename, not a copy). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2007-10-26 23:18:06 -07:00
Linus Torvalds	9fb88419ba	Ref-count the filespecs used by diffcore Rather than copy the filespecs when introducing new versions of them (for rename or copy detection), use a refcount and increment the count when reusing the diff_filespec. This avoids unnecessary allocations, but the real reason behind this is a future enhancement: we will want to track shared data across the copy/rename detection. In order to efficiently notice when a filespec is used by a rename, the rename machinery wants to keep track of a rename usage count which is shared across all different users of the filespec. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2007-10-26 23:18:05 -07:00
Linus Torvalds	cb1491b6bf	Split out "exact content match" phase of rename detection This makes the exact content match a separate function of its own. Partly to cut down a bit on the size of the diffcore_rename() function (which is too complex as it is), and partly because there are smarter ways to do this than an O(m*n) loop over it all, and that function should be rewritten to take that into account. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2007-10-26 23:18:05 -07:00

1 2 3

103 Коммитов