Commit graph

902 commits

Author SHA1 Message Date
Junio C Hamano 3ad8b5bf26 Merge branch 'mh/ref-remove-empty-directory'
Deletion of a branch "foo/bar" could remove .git/refs/heads/foo
once there is no longer any other branch whose name begins with
"foo/", but so far we did not do so.  Now we do.

* mh/ref-remove-empty-directory: (23 commits)
  files_transaction_commit(): clean up empty directories
  try_remove_empty_parents(): teach to remove parents of reflogs, too
  try_remove_empty_parents(): don't trash argument contents
  try_remove_empty_parents(): rename parameter "name" -> "refname"
  delete_ref_loose(): inline function
  delete_ref_loose(): derive loose reference path from lock
  log_ref_write_1(): inline function
  log_ref_setup(): manage the name of the reflog file internally
  log_ref_write_1(): don't depend on logfile argument
  log_ref_setup(): pass the open file descriptor back to the caller
  log_ref_setup(): improve robustness against races
  log_ref_setup(): separate code for create vs non-create
  log_ref_write(): inline function
  rename_tmp_log(): improve error reporting
  rename_tmp_log(): use raceproof_create_file()
  lock_ref_sha1_basic(): use raceproof_create_file()
  lock_ref_sha1_basic(): inline constant
  raceproof_create_file(): new function
  safe_create_leading_directories(): set errno on SCLD_EXISTS
  safe_create_leading_directories_const(): preserve errno
  ...
2017-02-27 13:57:12 -08:00
René Scharfe 886ddf4777 sha1_file: release fallback base's memory in unpack_entry()
If a pack entry that's used as a delta base is corrupt, unpack_entry()
marks it as unusable and then searches for the object again in the hope that
it can be found in another pack or in a loose file.  The memory for this
external base object is never released.  Free it after use.

Signed-off-by: Rene Scharfe <l.s.r@web.de>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-02-27 11:00:30 -08:00
brian m. carlson 76c1d9a096 Convert object iteration callbacks to struct object_id
Convert each_loose_object_fn and each_packed_object_fn to take a pointer
to struct object_id.  Update the various callbacks.  Convert several
40-based constants to use GIT_SHA1_HEXSZ.
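
A sketch of the converted callback signatures (the parameter names here
are assumptions; only the switch to struct object_id is from the commit):

  typedef int each_loose_object_fn(const struct object_id *oid,
                                   const char *path,
                                   void *data);
  typedef int each_packed_object_fn(const struct object_id *oid,
                                    struct packed_git *pack,
                                    uint32_t pos,
                                    void *data);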

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-02-22 10:12:15 -08:00
brian m. carlson 068f85e313 sha1_file: introduce an nth_packed_object_oid function
There are places in the code where we would like to provide a struct
object_id *, yet read the hash directly from the pack.  Provide an
nth_packed_object_oid function that is similar to the
nth_packed_object_sha1 function.

In order to avoid a potentially invalid cast, nth_packed_object_oid
provides a variable into which to store the value, which it returns on
success; on error, it returns NULL, as nth_packed_object_sha1 does.
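
A minimal sketch of how such a wrapper might look, given the described
contract (caller-provided storage, NULL on error); hashcpy() is git's
existing hash-copy helper:

  const struct object_id *nth_packed_object_oid(struct object_id *oid,
                                                struct packed_git *p,
                                                uint32_t n)
  {
          const unsigned char *hash = nth_packed_object_sha1(p, n);

          if (!hash)
                  return NULL;              /* propagate lookup failure */
          hashcpy(oid->hash, hash);         /* copy instead of casting */
          return oid;
  }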

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-02-22 10:12:15 -08:00
Junio C Hamano c54ba283fa Merge branch 'jk/clear-delta-base-cache-fix'
A crashing bug introduced in the v2.11 timeframe has been found (it is
triggerable only in fast-import) and fixed.

* jk/clear-delta-base-cache-fix:
  clear_delta_base_cache(): don't modify hashmap while iterating
2017-01-31 13:14:59 -08:00
Junio C Hamano 42ace93e41 Merge branch 'jk/loose-object-fsck'
"git fsck" inspects loose objects more carefully now.

* jk/loose-object-fsck:
  fsck: detect trailing garbage in all object types
  fsck: parse loose object paths directly
  sha1_file: add read_loose_object() function
  t1450: test fsck of packed objects
  sha1_file: fix error message for alternate objects
  t1450: refactor loose-object removal
2017-01-31 13:14:57 -08:00
Jeff King abd5a00268 clear_delta_base_cache(): don't modify hashmap while iterating
Removing entries while iterating causes fast-import to
access an already-freed `struct packed_git`, leading to
various confusing errors.

What happens is that clear_delta_base_cache() drops the
whole contents of the cache by iterating over the hashmap,
calling release_delta_base_cache() on each entry. That
function removes the item from the hashmap. The hashmap code
may then shrink the table, but the hashmap_iter struct
retains an offset from the old table.

As a result, the next call to hashmap_iter_next() may claim
that the iteration is done, even though some items haven't
been visited.

The only caller of clear_delta_base_cache() is fast-import,
which wants to clear the cache because it is discarding the
packed_git struct for its temporary pack. So by failing to
remove all of the entries, we still have references to the
freed packed_git.

To make things even more confusing, this doesn't seem to
trigger with the test suite, because it depends on
complexities like the size of the hash table, which entries
got cleared, whether we try to access them before they're
evicted from the cache, etc.

So I've been able to identify the problem with large
imports like freebsd's svn import, or a fast-export of
linux.git. But nothing that would be reasonable to run as
part of the normal test suite.

We can fix this easily by iterating over the lru linked list
instead of the hashmap. They both contain the same entries,
and we can use the "safe" variant of the list iterator,
which exists for exactly this case.
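
A sketch of the fix as described, assuming each cache entry carries
both a hashmap entry and an lru list node:

  void clear_delta_base_cache(void)
  {
          struct list_head *pos, *tmp;

          /* walk the lru list, not the hashmap we are emptying */
          list_for_each_safe(pos, tmp, &delta_base_cache_lru) {
                  struct delta_base_cache_entry *entry =
                          list_entry(pos, struct delta_base_cache_entry, lru);
                  release_delta_base_cache(entry);
          }
  }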

Let's also add a warning to the hashmap API documentation to
reduce the chances of getting bit by this again.

Reported-by: Ulrich Spörlein <uqs@freebsd.org>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-19 11:17:20 -08:00
Junio C Hamano 55d128ae06 Merge branch 'bw/grep-recurse-submodules'
"git grep" has been taught to optionally recurse into submodules.

* bw/grep-recurse-submodules:
  grep: search history of moved submodules
  grep: enable recurse-submodules to work on <tree> objects
  grep: optionally recurse into submodules
  grep: add submodules as a grep source type
  submodules: load gitmodules file from commit sha1
  submodules: add helper to determine if a submodule is initialized
  submodules: add helper to determine if a submodule is populated
  real_path: canonicalize directory separators in root parts
  real_path: have callers use real_pathdup and strbuf_realpath
  real_path: create real_pathdup
  real_path: convert real_path_internal to strbuf_realpath
  real_path: resolve symlinks by hand
2017-01-18 15:12:11 -08:00
Junio C Hamano bcaf277b4a Merge branch 'jk/quote-env-path-list-component' into maint
A recent update to receive-pack to make it easier to drop garbage
objects made it clear that GIT_ALTERNATE_OBJECT_DIRECTORIES cannot
have a pathname with a colon in it (no surprise!), and this in turn
made it impossible to push into a repository at such a path.  This
has been fixed by introducing a quoting mechanism used when
appending such a path to the colon-separated list.

* jk/quote-env-path-list-component:
  t5615-alternate-env: double-quotes in file names do not work on Windows
  t5547-push-quarantine: run the path separator test on Windows, too
  tmp-objdir: quote paths we add to alternates
  alternates: accept double-quoted paths
2017-01-17 15:11:06 -08:00
Jeff King cce044df7f fsck: detect trailing garbage in all object types
When a loose tree or commit is read by fsck (or any git
program), unpack_sha1_rest() checks whether there is extra
cruft at the end of the object file, after the zlib data.
Blobs that are streamed, however, do not have this check.

For normal git operations, it's not a big deal. We know the
sha1 and size checked out, so we have the object bytes we
wanted.  The trailing garbage doesn't affect what we're
trying to do.

But since the point of fsck is to find corruption or other
problems, it should be more thorough. This patch teaches its
loose-sha1 reader to detect extra bytes after the zlib
stream and complain.
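
Illustratively (a fragment, not git's exact code), the streamed path
can apply the same end-of-stream condition the buffered reader already
uses:

  /* after zlib reports it consumed the deflated data ... */
  if (status == Z_STREAM_END && stream.avail_in > 0)
          return error("garbage at end of loose object '%s'",
                       sha1_to_hex(sha1));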

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-15 15:59:03 -08:00
Jeff King f6371f9210 sha1_file: add read_loose_object() function
It's surprisingly hard to ask the sha1_file code to open a
_specific_ incarnation of a loose object. Most of the
functions take a sha1, and loop over the various object
types (packed versus loose) and locations (local versus
alternates) at a low level.

However, some tools like fsck need to look at a specific
file. This patch gives them a function they can use to open
the loose object at a given path.
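
The shape of the new entry point, per the description above (the exact
parameter list is an assumption):

  int read_loose_object(const char *path,
                        const unsigned char *expected_sha1,
                        enum object_type *type,
                        unsigned long *size,
                        void **contents);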

The implementation unfortunately ends up repeating bits of
related functions, but there's not a good way around it
without some major refactoring of the whole sha1_file stack.
We need to mmap the specific file, then partially read the
zlib stream to know whether we're streaming or not, and then
finally either stream it or copy the data to a buffer.

We can do that by assembling some of the more arcane
internal sha1_file functions, but we end up having to
essentially reimplement unpack_sha1_file(), along with the
streaming bits of check_sha1_signature().

Still, most of the ugliness is contained in the new
function, and the interface is clean enough that it may be
reusable (though it seems unlikely anything but git-fsck
would care about opening a specific file).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-15 15:59:03 -08:00
Jeff King 771e7d578e sha1_file: fix error message for alternate objects
When we fail to open a corrupt loose object, we report an
error and mention the filename via sha1_file_name().
However, that function will always give us a path in the
local repository, whereas the corrupt object may have come
from an alternate. The result is a very misleading error
message.

Teach the open_sha1_file() and stat_sha1_file() helpers to
pass back the path they found, so that we can report it
correctly.

Note that the pointers we return go to static storage (e.g.,
from sha1_file_name()), which is slightly dangerous.
However, these helpers are static local helpers, and the
names are used for immediately generating error messages.
The simplicity is an acceptable tradeoff for the danger.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-15 15:59:03 -08:00
Junio C Hamano 02d0457eb4 Merge branch 'jc/git-open-cloexec'
The codeflow of setting NOATIME and CLOEXEC on file descriptors Git
opens has been simplified.
We may want to drop the tip one, but we'll see.

* jc/git-open-cloexec:
  sha1_file: stop opening files with O_NOATIME
  git_open_cloexec(): use fcntl(2) w/ FD_CLOEXEC fallback
  git_open(): untangle possible NOATIME and CLOEXEC interactions
2017-01-10 15:24:26 -08:00
Michael Haggerty 177978f56a raceproof_create_file(): new function
Add a function that tries to create a file and any containing
directories in a way that is robust against races with other processes
that might be cleaning up empty directories at the same time.

The actual file creation is done by a callback function, which, if it
fails, should set errno to EISDIR or ENOENT according to the convention
of open(). raceproof_create_file() detects such failures, and
respectively either tries to delete empty directories that might be in
the way of the file or tries to create the containing directories. Then
it retries the callback function.

This function is not yet used.
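
A simplified sketch of the retry loop described above; the callback
typedef and remove_empty_dirs_in_path() are hypothetical stand-ins, and
the real function bounds its retries rather than looping forever:

  typedef int create_file_fn(const char *path, void *cb);

  int raceproof_create_file(const char *path, create_file_fn fn, void *cb)
  {
          int retries = 3;

          while (retries--) {
                  if (!fn(path, cb))
                          return 0;         /* callback succeeded */
                  if (errno == EISDIR) {
                          /* empty directories are in the way; drop them */
                          if (remove_empty_dirs_in_path(path))  /* hypothetical */
                                  break;
                  } else if (errno == ENOENT) {
                          /* leading directories are missing; create them */
                          if (safe_create_leading_directories_const(path))
                                  break;
                  } else {
                          break;            /* unrelated failure; give up */
                  }
          }
          return -1;                        /* errno set by last failure */
  }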

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-07 19:30:09 -08:00
Michael Haggerty 204a047f23 safe_create_leading_directories(): set errno on SCLD_EXISTS
The exit path for SCLD_EXISTS wasn't setting errno, which some callers
use to generate error messages for the user. Fix the problem and
document that the function sets errno correctly to help avoid similar
regressions in the future.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-07 19:30:08 -08:00
Michael Haggerty 029443070a safe_create_leading_directories_const(): preserve errno
Some implementations of free() change errno (even though they
shouldn't):

  https://sourceware.org/bugzilla/show_bug.cgi?id=17924

So preserve the errno from safe_create_leading_directories() across the
call to free().
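
A minimal sketch of the fix (very close to the shape the commit
describes; xstrdup() is git's allocation wrapper):

  enum scld_error safe_create_leading_directories_const(const char *path)
  {
          int save_errno;
          /* make a writable copy for the in-place worker function */
          char *buf = xstrdup(path);
          enum scld_error result = safe_create_leading_directories(buf);

          /* free() may clobber errno on some platforms; shield it */
          save_errno = errno;
          free(buf);
          errno = save_errno;
          return result;
  }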

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-07 19:30:08 -08:00
Junio C Hamano fe05033407 Merge branch 'jk/quote-env-path-list-component'
A recent update to receive-pack to make it easier to drop garbage
objects made it clear that GIT_ALTERNATE_OBJECT_DIRECTORIES cannot
have a pathname with a colon in it (no surprise!), and this in turn
made it impossible to push into a repository at such a path.  This
has been fixed by introducing a quoting mechanism used when
appending such a path to the colon-separated list.

* jk/quote-env-path-list-component:
  t5615-alternate-env: double-quotes in file names do not work on Windows
  t5547-push-quarantine: run the path separator test on Windows, too
  tmp-objdir: quote paths we add to alternates
  alternates: accept double-quoted paths
2016-12-21 14:55:02 -08:00
Brandon Williams 4ac9006f83 real_path: have callers use real_pathdup and strbuf_realpath
Migrate callers of real_path() that duplicate the return value to use
real_pathdup or strbuf_realpath.

Signed-off-by: Brandon Williams <bmwill@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-12-12 15:22:32 -08:00
Jeff King cf3c635210 alternates: accept double-quoted paths
We read lists of alternates from objects/info/alternates
files (delimited by newline), as well as from the
GIT_ALTERNATE_OBJECT_DIRECTORIES environment variable
(delimited by colon or semi-colon, depending on the
platform).

There's no mechanism for quoting the delimiters, so it's
impossible to specify an alternate path that contains a
colon in the environment, or one that contains a newline in
a file. We've lived with that restriction for ages because
both alternates and filenames with colons are relatively
rare, and it's only a problem when the two meet. But since
722ff7f87 (receive-pack: quarantine objects until
pre-receive accepts, 2016-10-03), which builds on the
alternates system, every push causes the receiver to set
GIT_ALTERNATE_OBJECT_DIRECTORIES internally.

It would be convenient to have some way to quote the
delimiter so that we can represent arbitrary paths.

The simplest thing would be an escape character before a
quoted delimiter (e.g., "\:" as a literal colon). But that
creates a backwards compatibility problem: any path which
uses that escape character is now broken, and we've just
shifted the problem. We could choose an unlikely escape
character (e.g., something from the non-printable ASCII
range), but that's awkward to use.

Instead, let's treat names as unquoted unless they begin
with a double-quote, in which case they are interpreted via
our usual C-style quoting rules. This also breaks
backwards-compatibility, but in a smaller way: it only
matters if your file has a double-quote as the very _first_
character in the path (whereas an escape character is a
problem anywhere in the path).  It's also consistent with
many other parts of git, which accept either a bare pathname
or a double-quoted one, and the sender can choose to quote
or not as required.
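
A sketch of the parsing rule, using git's existing unquote_c_style()
helper (the wrapper name here is an assumption):

  static int parse_alt_path(struct strbuf *out, const char *entry)
  {
          if (*entry == '"')
                  /* C-style unquoting; returns -1 on malformed input */
                  return unquote_c_style(out, entry, NULL);
          strbuf_addstr(out, entry);        /* bare paths are literal */
          return 0;
  }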

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-12-12 15:10:43 -08:00
Junio C Hamano 0538b84027 Merge branch 'jk/alt-odb-cleanup'
Fix a corner-case regression in a topic that graduated during the
v2.11 cycle.

* jk/alt-odb-cleanup:
  alternates: re-allow relative paths from environment
2016-11-10 13:17:30 -08:00
Jeff King 37a95862c6 alternates: re-allow relative paths from environment
Commit 670c359da (link_alt_odb_entry: handle normalize_path
errors, 2016-10-03) regressed the handling of relative paths
in the GIT_ALTERNATE_OBJECT_DIRECTORIES variable. It's not
entirely clear this was ever meant to work, but it _has_
worked for several years, so this commit restores the
original behavior.

When we get a path in GIT_ALTERNATE_OBJECT_DIRECTORIES, we
add the path to the list of alternate object directories
as if it were found in objects/info/alternates, but with one
difference: we do not provide the link_alt_odb_entry()
function with a base for relative paths. That function
doesn't turn it into an absolute path, and we end up feeding
the relative path to the strbuf_normalize_path() function.

Most relative paths break out of the top-level directory
(e.g., "../foo.git/objects"), and thus normalizing fails.
Prior to 670c359da, we simply ignored the error, and due to
the way normalize_path_copy() was implemented it happened to
return the original path in this case. We then accessed the
alternate objects using this relative path.

By storing the relative path in the alt_odb list, the path
is relative to wherever we happen to be at the time we do an
object lookup. That means we look from $GIT_DIR in a bare
repository, and from the top of the worktree in a non-bare
repository.

If this were being designed from scratch, it would make
sense to pick a stable location (probably $GIT_DIR, or even
the object directory) and use that as the relative base,
turning the result into an absolute path.  However, given
the history, at this point the minimal fix is to match the
pre-670c359da behavior.

We can do this simply by ignoring the error when we have no
relative base and using the original value (which we now
reliably have, thanks to strbuf_normalize_path()).
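
The minimal fix, sketched (close to, though not necessarily identical
to, the actual change):

  /*
   * A normalization failure is only fatal when we had a base to
   * resolve against; with no base, fall through and keep the
   * original relative path.
   */
  if (strbuf_normalize_path(&pathbuf) < 0 && relative_base) {
          error("unable to normalize alternate object path: %s",
                pathbuf.buf);
          goto error;
  }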

That still leaves us with a relative path that foils our
duplicate detection, and may act strangely if we ever
chdir() later in the process. We could solve that by storing
an absolute path based on getcwd(). That may be a good
future direction; for now we'll do just the minimum to fix
the regression.

The new t5615 script demonstrates the fix in its final three
tests. Since we didn't have any tests of the alternates
environment variable at all, it also adds some tests of
absolute paths.

Reported-by: Bryan Turner <bturner@atlassian.com>
Signed-off-by: Jeff King <peff@peff.net>
2016-11-08 15:28:22 -05:00
Junio C Hamano b4d065df03 sha1_file: stop opening files with O_NOATIME
When we open object files, we try to do so with O_NOATIME.
This dates back to 144bde78e9 (Use O_NOATIME when opening
the sha1 files., 2005-04-23), which is an optimization to
avoid creating a bunch of dirty inodes when we're accessing
many objects.  But a few things have changed since then:

  1. In June 2005, git learned about packfiles, which means
     we would do a lot fewer atime updates (rather than one
     per object access, we'd generally get one per packfile).

  2. In late 2006, Linux learned about "relatime", which is
     generally the default on modern installs. So
     performance around atimes updates is a non-issue there
     these days.

     All the world isn't Linux, but as it turns out, Linux
     is the only platform to implement O_NOATIME in the
     first place.

So it's very unlikely that this code is helping anybody
these days.

Helped-by: Jeff King <peff@peff.net>
[jc: took idea and log message from peff]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-11-02 19:34:41 -07:00
Junio C Hamano 1e3001a8e2 git_open_cloexec(): use fcntl(2) w/ FD_CLOEXEC fallback
A platform might not support open(2) with O_CLOEXEC but may support
setting the FD_CLOEXEC bit with fcntl(2) on an already-open file
descriptor.  Such a fallback is inherently racy, though, and may not
be worth doing.
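
A sketch of such a fallback, assuming only POSIX; note the window
between open(2) and fcntl(2) during which a concurrently forked child
can still inherit the descriptor:

  #include <fcntl.h>

  static int open_cloexec_fallback(const char *path, int flags)
  {
          int fd = open(path, flags);

          if (fd >= 0)
                  /* best effort; the descriptor is usable either way */
                  fcntl(fd, F_SETFD, FD_CLOEXEC);
          return fd;
  }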

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-11-02 19:34:16 -07:00
Junio C Hamano 906d6906fb Merge branch 'ls/git-open-cloexec'
Git generally does not explicitly close file descriptors that were
open in the parent process when spawning a child process, but most
of the time the child does not want to access them. As Windows does
not allow removing or renaming a file that has a file descriptor
open, a slow-to-exit child can even break the parent process by
holding onto them.  Use O_CLOEXEC flag to open files in various
codepaths.

* ls/git-open-cloexec:
  read-cache: make sure file handles are not inherited by child processes
  sha1_file: open window into packfiles with O_CLOEXEC
  sha1_file: rename git_open_noatime() to git_open()
2016-10-31 13:15:21 -07:00
Junio C Hamano 39000e8499 Merge branch 'jk/fetch-quick-tag-following' into maint
When fetching from a remote that has many tags that are irrelevant
to the branches we are following, we used to waste way too many
cycles by checking too carefully whether the object pointed at by a
tag (that we are not going to fetch!) exists in our repository.

* jk/fetch-quick-tag-following:
  fetch: use "quick" has_sha1_file for tag following
2016-10-28 09:01:17 -07:00
Junio C Hamano 1b8ac5ead5 git_open(): untangle possible NOATIME and CLOEXEC interactions
The way we structured the fallback/retry mechanism for opening with
O_NOATIME and O_CLOEXEC meant that if we failed due to lack of
support to open the file with O_NOATIME option (i.e. EINVAL), we
would still try to drop O_CLOEXEC first and retry, and then drop
O_NOATIME.  A platform on which O_NOATIME is defined in the header
without support from the kernel wouldn't have a chance to open with
O_CLOEXEC option due to this code structure.

Arguably, O_CLOEXEC is more important than O_NOATIME, as the latter
is mostly about performance, while the former can affect correctness.

Instead use O_CLOEXEC to open the file, and then use fcntl(2) to set
O_NOATIME on the resulting file descriptor.  open(2) itself does not
cause atime to be updated according to Linus [*1*].

The helper that does the former is also usable in the codepath in
ce_compare_data() that was recently added to open a file descriptor
with O_CLOEXEC; use it there while we are at it.

*1* <CA+55aFw83E+zOd+z5h-CA-3NhrLjVr-anL6pubrSWttYx3zu8g@mail.gmail.com>

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-28 06:23:07 -07:00
Junio C Hamano d7ae013a31 Merge branch 'jk/abbrev-auto'
Updates the way the approximate count of total objects is computed
while attempting to come up with a unique abbreviated object name,
which in turn needs to estimate how many hexdigits are necessary to
ensure uniqueness.

* jk/abbrev-auto:
  find_unique_abbrev: move logic out of get_short_sha1()
2016-10-27 14:58:47 -07:00
Junio C Hamano 9fcd14491d Merge branch 'jk/fetch-quick-tag-following'
When fetching from a remote that has many tags that are irrelevant
to the branches we are following, we used to waste way too many
cycles by checking too carefully whether the object pointed at by a
tag (that we are not going to fetch!) exists in our repository.

* jk/fetch-quick-tag-following:
  fetch: use "quick" has_sha1_file for tag following
2016-10-26 13:14:47 -07:00
Lars Schneider cd66ada065 sha1_file: open window into packfiles with O_CLOEXEC
All processes that the Git main process spawns inherit the open file
descriptors of the main process. These leaked file descriptors can
cause problems.

Use the O_CLOEXEC flag similar to 05d1ed61 to fix the leaked file
descriptors.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-25 11:09:54 -07:00
Lars Schneider a5436b5794 sha1_file: rename git_open_noatime() to git_open()
This function is meant to be used when reading from files in the
object store, and the original objective was to avoid smudging atime
of loose object files too often, hence its name.  Because the next
commit will extend its role to also arrange for the file descriptors
it returns to be auto-closed in child processes, rename it to drop
the "noatime" part, which is too specific.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-25 10:59:13 -07:00
Junio C Hamano dec040192f Merge branch 'jk/alt-odb-cleanup'
Codepaths involved in interacting with the alternate object store
have been cleaned up.

* jk/alt-odb-cleanup:
  alternates: use fspathcmp to detect duplicates
  sha1_file: always allow relative paths to alternates
  count-objects: report alternates via verbose mode
  fill_sha1_file: write into a strbuf
  alternates: store scratch buffer as strbuf
  fill_sha1_file: write "boring" characters
  alternates: use a separate scratch space
  alternates: encapsulate alt->base munging
  alternates: provide helper for allocating alternate
  alternates: provide helper for adding to alternates list
  link_alt_odb_entry: refactor string handling
  link_alt_odb_entry: handle normalize_path errors
  t5613: clarify "too deep" recursion tests
  t5613: do not chdir in main process
  t5613: whitespace/style cleanups
  t5613: use test_must_fail
  t5613: drop test_valid_repo function
  t5613: drop reachable_via function
2016-10-17 13:25:20 -07:00
Jeff King 5827a03545 fetch: use "quick" has_sha1_file for tag following
When we auto-follow tags in a fetch, we look at all of the
tags advertised by the remote and fetch ones where we don't
already have the tag, but we do have the object it peels to.
This involves a lot of calls to has_sha1_file(), some of
which we can reasonably expect to fail. Since 45e8a74
(has_sha1_file: re-check pack directory before giving up,
2013-08-30), this may cause many calls to
reprepare_packed_git(), which is potentially expensive.

This has gone unnoticed for several years because it
requires a fairly unique setup to matter:

  1. You need to have a lot of packs on the client side to
     make reprepare_packed_git() expensive (the most
     expensive part is finding duplicates in an unsorted
     list, which is currently quadratic).

  2. You need a large number of tag refs on the server side
     that are candidates for auto-following (i.e., that the
     client doesn't have). Each one triggers a re-read of
     the pack directory.

  3. Under normal circumstances, the client would
     auto-follow those tags and after one large fetch, (2)
     would no longer be true. But if those tags point to
     history which is disconnected from what the client
     otherwise fetches, then it will never auto-follow, and
     those candidates will impact it on every fetch.

So when all three are true, each fetch pays an extra
O(nr_tags * nr_packs^2) cost, mostly in string comparisons
on the pack names. This was exacerbated by 47bf4b0
(prepare_packed_git_one: refactor duplicate-pack check,
2014-06-30) which uses a slightly more expensive string
check, under the assumption that the duplicate check doesn't
happen very often (and it shouldn't; the real problem here
is how often we are calling reprepare_packed_git()).

This patch teaches fetch to use HAS_SHA1_QUICK to sacrifice
accuracy for speed, in cases where we might be racy with a
simultaneous repack. This is similar to the fix in 0eeb077
(index-pack: avoid excessive re-reading of pack directory,
2015-06-09). As with that case, it's OK for has_sha1_file()
occasionally say "no I don't have it" when we do, because
the worst case is not a corruption, but simply that we may
fail to auto-follow a tag that points to it.
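
A sketch of the kind of call site this changes, using the
HAS_SHA1_QUICK flag from the quoted 0eeb077 fix (the surrounding
tag-following code is paraphrased):

  /* during tag auto-following, a racy "no" is acceptable */
  if (!has_sha1_file_with_flags(ref->old_sha1, HAS_SHA1_QUICK))
          continue;   /* don't reprepare the pack directory for this */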

Here are results from the included perf script, which sets
up a situation similar to the one described above:

Test            HEAD^               HEAD
----------------------------------------------------------
5550.4: fetch   11.21(10.42+0.78)   0.08(0.04+0.02) -99.3%

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-14 11:31:32 -07:00
Junio C Hamano f7f0a87e0a Merge branch 'jc/verify-loose-object-header' into maint
Codepaths that read from an on-disk loose object were too loose in
validating that what they are reading is a proper object file, and
sometimes read past the end of the data read from the disk, which has
been corrected.  H/t to Gustavo Grieco for reporting.

* jc/verify-loose-object-header:
  unpack_sha1_header(): detect malformed object header
  streaming: make sure to notice corrupt object
2016-10-11 14:21:03 -07:00
Junio C Hamano e6e24c94df Merge branch 'jk/pack-objects-optim-mru'
"git pack-objects" in a repository with many packfiles used to
spend a lot of time looking for/at objects in them; the accesses to
the packfiles are now optimized by checking the most-recently-used
packfile first.

* jk/pack-objects-optim-mru:
  pack-objects: use mru list when iterating over packs
  pack-objects: break delta cycles before delta-search phase
  sha1_file: make packed_object_info public
  provide an initializer for "struct object_info"
2016-10-10 14:03:47 -07:00
Jeff King ea0fc3b417 alternates: use fspathcmp to detect duplicates
On a case-insensitive filesystem, we should realize that
"a/objects" and "A/objects" are the same path. We already
use fspathcmp() to check against the main object directory,
but until recently we couldn't use it for comparing against
other alternates (because their paths were not
NUL-terminated strings). But now we can, so let's do so.

Note that we also need to adjust count-objects to load the
config, so that it can see the setting of core.ignorecase
(this is required by the test, but is also a general bugfix
for users of count-objects).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10 13:52:37 -07:00
Jeff King 087b6d5840 sha1_file: always allow relative paths to alternates
We recursively expand alternates repositories, so that if A
borrows from B which borrows from C, A can see all objects.

For the root object database, we allow relative paths, so A
can point to B as "../B/objects". However, we currently do
not allow relative paths when recursing, so B must use an
absolute path to reach C.

That is an ancient protection from c2f493a (Transitively
read alternatives, 2006-05-07) that tries to avoid adding
the same alternate through two different paths. Since
5bdf0a8 (sha1_file: normalize alt_odb path before comparing
and storing, 2011-09-07), we use a normalized absolute path
for each alt_odb entry.

This means that in most cases the protection is no longer
necessary; we will detect the duplicate no matter how we got
there (but see below).  And it's a good idea to get rid of
it, as it creates an unnecessary complication when setting
up recursive alternates (B has to know that A is going to
borrow from it and make sure to use an absolute path).

Note that our normalization doesn't actually look at the
filesystem, so it can still be fooled by crossing symbolic
links. But that's also true of absolute paths, so it's not a
good reason to disallow only relative paths (it's
potentially a reason to switch to real_path(), but that's a
separate and non-trivial change).

We adjust the test script here to demonstrate that this now
works, and add new tests to show that the normalization does
indeed suppress duplicates.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10 13:52:37 -07:00
Jeff King f7b7774f34 fill_sha1_file: write into a strbuf
It's currently the responsibility of the caller to give
fill_sha1_file() enough bytes to write into, leading them to
manually compute the required lengths. Instead, let's just
write into a strbuf so that it's impossible to get this
wrong.

The alt_odb caller already has a strbuf, so this makes
things strictly simpler. The other caller, sha1_file_name(),
uses a static PATH_MAX buffer and dies when it would
overflow. We can convert this to a static strbuf, which
means our allocation cost is amortized (and as a bonus, we
no longer have to worry about PATH_MAX being too short for
normal use).
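
A sketch of the strbuf-based formatter (paraphrased; see the caveat
about per-character growth checks below):

  static void fill_sha1_path(struct strbuf *buf, const unsigned char *sha1)
  {
          static const char hex[] = "0123456789abcdef";
          int i;

          for (i = 0; i < 20; i++) {
                  unsigned char v = sha1[i];

                  strbuf_addch(buf, hex[v >> 4]);
                  strbuf_addch(buf, hex[v & 0xf]);
                  if (i == 0)
                          strbuf_addch(buf, '/');   /* fan-out directory */
          }
  }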

This does introduce some small overhead in fill_sha1_file(),
as each strbuf_addch() will check whether it needs to
grow. However, between the optimization in fec501d
(strbuf_addch: avoid calling strbuf_grow, 2015-04-16) and
the fact that this is not generally called in a tight loop
(after all, the next step is typically to access the file!)
this probably doesn't matter. And even if it did, the right
place to micro-optimize is inside fill_sha1_file(), by
calling a single strbuf_grow() there.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10 13:52:37 -07:00
Jeff King a5b34d2152 alternates: provide helper for adding to alternates list
The submodule code wants to temporarily add an alternate
object store to our in-memory alt_odb list, but does it
manually. Let's provide a helper so it can reuse the code in
link_alt_odb_entry().

While we're adding our new add_to_alternates_memory(), let's
document add_to_alternates_file(), as the two are related.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10 13:52:36 -07:00
Jeff King 38dbe5f078 alternates: store scratch buffer as strbuf
We pre-size the scratch buffer to hold a loose object
filename of the form "xx/yyyy...", which leads to allocation
code that is hard to verify. We have to use some magic
numbers during the initial allocation, and then writers must
blindly assume that the buffer is big enough. Using a strbuf
makes it more clear that we cannot overflow.

Unfortunately, we do still need some magic numbers to grow
our strbuf before calling fill_sha1_path(), but the strbuf
growth is much closer to the point of use. This makes it
easier to see that it's correct, and opens the possibility
of pushing it even further down if fill_sha1_path() learns
to work on strbufs.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10 13:52:36 -07:00
Jeff King 4ea82473aa link_alt_odb_entry: refactor string handling
The string handling in link_alt_odb_entry() is mostly an
artifact of the original version, which took the path as a
ptr/len combo, and did not have a NUL-terminated string
until we created one in the alternate_object_database
struct.  But since 5bdf0a8 (sha1_file: normalize alt_odb
path before comparing and storing, 2011-09-07), the first
thing we do is put the path into a strbuf, which gives us
some easy opportunities for cleanup.

In particular:

  - we call strlen(pathbuf.buf), which is silly; we can look
    at pathbuf.len.

  - even though we have a strbuf, we don't maintain its
    "len" field when chomping extra slashes from the
    end, and instead keep a separate "pfxlen" variable. We
    can fix this and then drop "pfxlen" entirely.

  - we don't check whether the path is usable until after we
    allocate the new struct, making extra cleanup work for
    ourselves. Since we have a NUL-terminated string, we can
    bump the "is it usable" checks higher in the function.
    While we're at it, we can move that logic to its own
    helper, which makes the flow of link_alt_odb_entry()
    easier to follow.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10 13:52:36 -07:00
Jeff King afbba2f09a fill_sha1_file: write "boring" characters
This function forms a sha1 as "xx/yyyy...", but skips over
the slot for the slash rather than writing it, leaving it to
the caller to do so. It also does not bother to put in a
trailing NUL, even though every caller would want it (we're
forming a path which by definition is not a directory, so
the only thing to do with it is feed it to a system call).

Let's make the lives of our callers easier by just writing
out the internal "/" and the NUL.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10 13:52:36 -07:00
Jeff King 670c359da3 link_alt_odb_entry: handle normalize_path errors
When we add a new alternate to the list, we try to normalize
out any redundant "..", etc. However, we do not look at the
return value of normalize_path_copy(), and will happily
continue with a path that could not be normalized. Worse,
the normalizing process is done in-place, so we are left
with whatever half-finished working state the normalizing
function was in.

Fortunately, this cannot cause us to read past the end of
our buffer, as that working state will always leave the
NUL from the original path in place. And we do tend to
notice problems when we check is_directory() on the path.
But you can see the nonsense that we feed to is_directory
with an entry like:

  this/../../is/../../way/../../too/../../deep/../../to/../../resolve

in your objects/info/alternates, which yields:

  error: object directory
  /to/e/deep/too/way//ects/this/../../is/../../way/../../too/../../deep/../../to/../../resolve
  does not exist; check .git/objects/info/alternates.

We can easily fix this just by checking the return value.
But that makes it hard to generate a good error message,
since we're normalizing in-place and our input value has
been overwritten by cruft.

Instead, let's provide a strbuf helper that does an in-place
normalize, but restores the original contents on error. This
uses a second buffer under the hood, which is slightly less
efficient, but this is not a performance-critical code path.

The strbuf helper can also properly set the "len" parameter
of the strbuf before returning. Just doing:

  normalize_path_copy(buf.buf, buf.buf);

will shorten the string, but leave buf.len at the original
length. That may be confusing to later code which uses the
strbuf.
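
A sketch of the helper as described: normalize into a second buffer so
the caller's contents survive a failure, and fix up "len" before
returning:

  int strbuf_normalize_path(struct strbuf *src)
  {
          struct strbuf dst = STRBUF_INIT;

          strbuf_grow(&dst, src->len);
          if (normalize_path_copy(dst.buf, src->buf) < 0) {
                  strbuf_release(&dst);
                  return -1;                /* src is left untouched */
          }

          /* normalize_path_copy() wrote a shorter, NUL-terminated string */
          strbuf_setlen(&dst, strlen(dst.buf));
          strbuf_swap(src, &dst);
          strbuf_release(&dst);
          return 0;
  }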

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10 13:52:36 -07:00
Jeff King 597f9134de alternates: use a separate scratch space
The alternate_object_database struct uses a single buffer
both for storing the path to the alternate, and as a scratch
buffer for forming object names. This is efficient (since
otherwise we'd end up storing the path twice), but it makes
life hard for callers who just want to know the path to the
alternate. They have to remember to stop reading after
"alt->name - alt->base" bytes, and to subtract one for the
trailing '/'.

It would be much simpler if they could simply access a
NUL-terminated path string. We could encapsulate this in a
function which puts a NUL in the scratch buffer and returns
the string, but that opens up questions about the lifetime
of the result. The first time another caller uses the
alternate, the scratch buffer may get other data tacked onto
it.

Let's instead just store the root path separately from the
scratch buffer. There aren't enough alternates being stored
for the duplicated data to matter for performance, and this
keeps things simple and safe for the callers.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10 13:52:36 -07:00
Jeff King 29ec6af2b8 alternates: encapsulate alt->base munging
The alternate_object_database struct holds a path to the
alternate objects, but we also use that buffer as scratch
space for forming loose object filenames. Let's pull that
logic into a helper function so that we can more easily
modify it.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10 13:52:36 -07:00
Jeff King 7f0fa2c02a alternates: provide helper for allocating alternate
Allocating a struct alternate_object_database is tricky, as
we must over-allocate the buffer to provide scratch space,
and then put in particular '/' and NUL markers.

Let's encapsulate this in a function so that the complexity
doesn't leak into callers (and so that we can modify it
later).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10 13:52:36 -07:00
Jeff King 8e3f52d778 find_unique_abbrev: move logic out of get_short_sha1()
The get_short_sha1() is only about reading short sha1s; we
do call it in a loop to check "is this long enough" for each
object, but otherwise it should not need to know about
things like our default_abbrev setting.

So instead of asking it to set default_automatic_abbrev as a
side-effect, let's just have find_unique_abbrev() pick the
right place to start its loop.  This requires a separate
approximate_object_count() function, but that naturally
belongs with the rest of sha1_file.c.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-03 21:03:14 -07:00
Junio C Hamano 71a57ab32d Merge branch 'jc/verify-loose-object-header'
Codepaths that read from an on-disk loose object were too loose in
validating that what they are reading is a proper object file, and
sometimes read past the end of the data read from the disk, which has
been corrected.  H/t to Gustavo Grieco for reporting.

* jc/verify-loose-object-header:
  unpack_sha1_header(): detect malformed object header
  streaming: make sure to notice corrupt object
2016-10-03 13:30:36 -07:00
Junio C Hamano d21f842690 unpack_sha1_header(): detect malformed object header
When opening a loose object file, we often do this sequence:

 - prepare a short buffer for the object header (on stack)

 - call unpack_sha1_header() and have early part of the object data
   inflated, enough to fill the buffer

 - parse that data in the short buffer, assuming that the first part
   of the object is <typename> SP <length> NUL

Because the parsing function parse_sha1_header_extended() is not
given the number of bytes inflated into the header buffer, if you
craft a file whose early part inflates to a garbage sequence without SP
or NUL, and replace a loose object with it, it will end up reading
past the end of the inflated data.

To correct this, do the following four things:

 - rename unpack_sha1_header() to unpack_sha1_short_header() and
   have unpack_sha1_header_to_strbuf() keep calling that as its
   helper function.  This will detect and report zlib errors, but is
   not aware of the format of a loose object (as before).

 - introduce unpack_sha1_header() that calls the same helper
   function, and when zlib reports it inflated OK into the buffer,
   check if the inflated data has NUL.  This would ensure that
   parsing function will terminate within the buffer that holds the
   inflated header.

 - update unpack_sha1_header_to_strbuf() to check if the resulting
   buffer has NUL for the same effect.

 - update parse_sha1_header_extended() to make sure that its loop to
   find the SP that terminates the <typename> stops at NUL.

Essentially, this makes unpack_*() functions that are asked to
unpack a loose object header to be a bit more strict and detect an
input that cannot possibly be a valid object header, even before the
parsing function kicks in.
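
A sketch of the stricter wrapper, following the renames described above
(buffer bookkeeping paraphrased):

  static int unpack_sha1_header(git_zstream *stream,
                                unsigned char *map, unsigned long mapsize,
                                void *buffer, unsigned long bufsiz)
  {
          int status = unpack_sha1_short_header(stream, map, mapsize,
                                                buffer, bufsiz);

          if (status < Z_OK)
                  return status;
          /* a valid header "<typename> SP <length> NUL" must fit here */
          if (!memchr(buffer, '\0',
                      stream->next_out - (unsigned char *)buffer))
                  return -1;
          return 0;
  }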

Reported-by: Gustavo Grieco <gustavo.grieco@imag.fr>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-26 10:48:22 -07:00
Junio C Hamano ee19836995 Merge branch 'rs/pack-sort-with-llist-mergesort'
Code cleanup.

* rs/pack-sort-with-llist-mergesort:
  sha1_file: use llist_mergesort() for sorting packs
2016-09-21 15:15:25 -07:00
Junio C Hamano 13307145a9 Merge branch 'jk/delta-base-cache'
Recently we updated the code to manage the in-core cache that holds
objects that have recently been used to reconstitute other objects
that are stored as deltas against them, but the update used an
incorrect API function to manage the list of these objects.  This
has been fixed.

* jk/delta-base-cache:
  add_delta_base_cache: use list_for_each_safe
2016-09-21 15:15:20 -07:00
René Scharfe c4c6effa9b sha1_file: use llist_mergesort() for sorting packs
Sort the linked list of packs directly using llist_mergesort() instead
of building an array, calling qsort(3) and fixing up the list pointers.
This is shorter and less complicated.
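
The call ends up looking roughly like this, with the get/set-next
accessors that llist_mergesort() requires (accessor names assumed):

  packed_git = llist_mergesort(packed_git,
                               get_next_packed_git,
                               set_next_packed_git,
                               sort_pack);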

Signed-off-by: Rene Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-13 16:26:46 -07:00
Junio C Hamano 305d7f1339 Merge branch 'jk/diff-submodule-diff-inline'
The "git diff --submodule={short,log}" mechanism has been enhanced
to allow "--submodule=diff" to show the patch between the submodule
commits bound to the superproject.

* jk/diff-submodule-diff-inline:
  diff: teach diff to display submodule difference with an inline diff
  submodule: refactor show_submodule_summary with helper function
  submodule: convert show_submodule_summary to use struct object_id *
  allow do_submodule_path to work even if submodule isn't checked out
  diff: prepare for additional submodule formats
  graph: add support for --line-prefix on all graph-aware output
  diff.c: remove output_prefix_length field
  cache: add empty_tree_oid object and helper function
2016-09-12 15:34:31 -07:00
Jeff King 4b92bae7d3 add_delta_base_cache: use list_for_each_safe
We may remove elements from the list while we are iterating,
which requires using a second temporary pointer. Otherwise
stepping to the next element of the list might involve
looking at freed memory (which generally works in practice,
as we _just_ freed it, but of course is wrong to rely on;
valgrind notices it).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-12 12:09:41 -07:00
Junio C Hamano 02c6c14d6c Merge branch 'sb/submodule-clone-rr'
"git clone --resurse-submodules --reference $path $URL" is a way to
reduce network transfer cost by borrowing objects in an existing
$path repository when cloning the superproject from $URL; it
learned to also peek into $path for the presence of corresponding
repositories of submodules and borrow objects from there when able.

* sb/submodule-clone-rr:
  clone: recursive and reference option triggers submodule alternates
  clone: implement optional references
  clone: clarify option_reference as required
  clone: factor out checking for an alternate path
  submodule--helper update-clone: allow multiple references
  submodule--helper module-clone: allow multiple references
  t7408: merge short tests, factor out testing method
  t7408: modernize style
2016-09-08 21:49:50 -07:00
Jacob Keller 8576fde6cb cache: add empty_tree_oid object and helper function
Similar to is_null_oid(), and is_empty_blob_sha1() add an
empty_tree_oid along with helper function is_empty_tree_oid(). For
completeness, also add an "is_empty_tree_sha1()",
"is_empty_blob_sha1()", "is_empty_tree_oid()" and "is_empty_blob_oid()"
helpers.

To ensure we only get one singleton, implement EMPTY_BLOB_SHA1_BIN as
simply getting the hash of the empty_blob_oid structure.

Signed-off-by: Jacob Keller <jacob.keller@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-31 18:07:08 -07:00
Jeff King 8261e1f139 delta_base_cache: use hashmap.h
The fundamental data structure of the delta base cache is a
hash table mapping pairs of "(packfile, offset)" into
structs containing the actual object data. The hash table
implementation dates back to e5e0161 (Implement a simple
delta_base cache, 2007-03-17), and uses a fixed-size table.
The current size is a hard-coded 256 entries.

Because we need to be able to remove objects from the hash
table, entry lookup does not do any kind of probing to
handle collisions. Colliding items simply replace whatever
is in their slot.  As a result, we have fewer usable slots
than even the 256 we allocate. At half full, each new item
has a 50% chance of displacing another one. Or another way
to think about it: every item has a 1/256 chance of being
ejected due to hash collision, without regard to our LRU
strategy.

So it would be interesting to see the effect of increasing
the cache size on the runtime for some common operations. As
with the previous patch, we'll measure "git log --raw" for
tree-only operations, and "git log -Sfoo --raw" for
operations that touch trees and blobs. All times are
wall-clock best-of-3, done against fully packed repos with
--depth=50, and the default core.deltaBaseCacheLimit of
96MB.

Here are timings for various values of MAX_DELTA_CACHE
against git.git (the asterisk marks the minimum time for
each operation):

    MAX_DELTA_CACHE    log-raw      log-S
    ---------------   ---------   ---------
                256   0m02.227s   0m12.821s
                512   0m02.143s   0m10.602s
               1024   0m02.127s   0m08.642s
               2048   0m02.148s   0m07.123s
               4096   0m02.194s   0m06.448s*
               8192   0m02.239s   0m06.504s
              16384   0m02.144s*  0m06.502s
              32768   0m02.202s   0m06.622s
              65536   0m02.230s   0m06.677s

The log-raw case isn't changed much at all here (probably
because our trees just aren't that big in the first place,
or possibly because we have so _few_ trees in git.git that
the 256-entry cache is enough). But once we start putting
blobs in the cache, too, we see a big improvement (almost
50%). The curve levels off around 4096, which means that we
can hold about that many entries before hitting the 96MB
memory limit (or possibly that the workload is small enough
that there is simply no more work to be optimized out by
caching more).

(As a side note, I initially timed my existing git.git pack,
which was a base of --aggressive combined with some pulls on
top. So it had quite a few deeper delta chains. The
256-cache case was more like 15s, and it still dropped to
~6.5s in the same way).

Here are the timings for linux.git:

    MAX_DELTA_CACHE    log-raw      log-S
    ---------------   ---------   ---------
                256   0m41.661s   5m12.410s
                512   0m39.547s   5m07.920s
               1024   0m37.054s   4m54.666s
               2048   0m35.871s   4m41.194s*
               4096   0m34.646s   4m51.648s
               8192   0m33.881s   4m55.342s
              16384   0m35.190s   5m00.122s
              32768   0m35.060s   4m58.851s
              65536   0m33.311s*  4m51.420s

As we grow we see a nice 20% speedup in the tree traversal,
and more modest 10% in the log-S. This is probably an
indication that we are bound less by the number of entries,
and more by the memory limit (more on that below). What is
interesting is that the numbers bounce around a bit;
increasing the number of entries isn't always a strict
improvement.

Partially this is due to noise in the measurement. But it
may also be an indication that our LRU ejection scheme is
not optimal. The smaller cache sizes introduce some
randomness into the ejection (due to collisions), which may
sometimes work in our favor (and sometimes not!).

So what is the optimal setting of MAX_DELTA_CACHE? The
"bouncing" in the linux.git log-S numbers notwithstanding,
it mostly seems like bigger is better. And even if we were
to try to find a "sweet spot", these are just two
repositories, that are not necessarily representative. The
shape of history, the size of trees and blobs, the memory
limit configuration, etc, all will affect the outcome.

Rather than trying to find the "right" number, another
strategy is to just switch to a hash table that can actually
store collisions: namely our hashmap.h implementation.

Here are numbers for that compared to the "best" we saw from
adjusting MAX_DELTA_CACHE:

        |       log-raw        |       log-S
        | best       hashmap   | best       hashmap
        | ---------  --------- | ---------  ---------
  git   | 0m02.144s  0m02.144s | 0m06.448s  0m06.688s
  linux | 0m33.311s  0m33.092s | 4m41.194s  4m57.172s

We can see the results are similar in most cases, which is
what we'd expect. We're not ejecting due to collisions at
all, so this is purely representing the LRU. So really, we'd
expect this to model most closely the larger values of the
static MAX_DELTA_CACHE limit. And that does seem to be
what's happening, including the "bounce" in the linux log-S
case.

So while the value for that case _isn't_ as good as the
optimal one measured above (which was 2048 entries), given
the bouncing I'm hesitant to suggest that 2048 is any kind
of optimum (not even for linux.git, let alone as a general
rule). The generic hashmap has the appeal that it drops the
number of tweakable numbers by one, which means we can focus
on tuning other elements, like the LRU strategy or the
core.deltaBaseCacheLimit setting.

And indeed, if we bump the cache limit to 1G (which is
probably silly for general use, but maybe something people
with big workstations would want to do), the linux.git log-S
time drops to 3m32s. That's something you really _can't_ do
easily with the static hash table, because the number of
entries needs to grow in proportion to the memory limit (so
2048 is almost certainly not going to be the right value
there).

This patch takes that direction, and drops the static hash
table entirely in favor of using the hashmap.h API.
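
A sketch of the resulting entry layout, keyed on the (packfile, offset)
pair via git's hashmap.h (field names and the exact hash mixing are
assumptions):

  struct delta_base_cache_key {
          struct packed_git *p;
          off_t base_offset;
  };

  struct delta_base_cache_entry {
          struct hashmap_entry ent;         /* hashmap.h linkage */
          struct delta_base_cache_key key;
          struct list_head lru;             /* LRU ordering is kept */
          void *data;
          unsigned long size;
          enum object_type type;
  };

  static unsigned int pack_entry_hash(struct packed_git *p, off_t base_offset)
  {
          unsigned int hash = (unsigned int)(uintptr_t)p +
                              (unsigned int)base_offset;

          hash += (hash >> 8) + (hash >> 16);       /* cheap mixing */
          return hash;
  }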

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-23 15:18:50 -07:00
Jeff King 6d9617f4f7 delta_base_cache: drop special treatment of blobs
When the delta base cache runs out of allowed memory, it has
to drop entries. It does so by walking an LRU list, dropping
objects until we are under the memory limit. But we actually
walk the list twice: once to drop blobs, and then again to
drop other objects (which are generally trees). This comes
from 18bdec1 (Limit the size of the new delta_base_cache,
2007-03-19).

This performs poorly as the number of entries grows, because
any time dropping blobs does not satisfy the limit, we have
to walk the _entire_ list, trees included, looking for blobs
to drop, before starting to drop any trees.

It's not generally a problem now, as the cache is limited to
only 256 entries. But as we could benefit from increasing
that in a future patch, it's worth looking at how it
performs as the cache size grows. And the answer is "not
well".

The table below shows times for various operations with
different values of MAX_DELTA_CACHE (which is not a run-time
knob; I recompiled with -DMAX_DELTA_CACHE=$n for each).

I chose "git log --raw" ("log-raw" in the table) because it
will access all of the trees, but no blobs at all (so in a
sense it is a worst case for this problem, because we will
always walk over the entire list of trees once before
realizing there are no blobs to drop). This is also
representative of other tree-only operations like "rev-list
--objects" and "git log -- <path>".

I also timed "git log -Sfoo --raw" ("log-S" in the table).
It similarly accesses all of the trees, but also the blobs
for each commit. It's representative of "git log -p", though
it emphasizes the cost of blob access more, as "-S" is
cheaper than computing an actual blob diff.

All timings are best-of-3 wall-clock times (though they all
were CPU bound, so the user CPU times are similar). The
repositories were fully packed with --depth=50, and the
default core.deltaBaseCacheLimit of 96M was in effect.  The
current value of MAX_DELTA_CACHE is 256, so I started there
and worked up by factors of 2.

First, here are values for git.git (the asterisk signals the
fastest run for each operation):

    MAX_DELTA_CACHE    log-raw       log-S
    ---------------   ---------    ---------
                256   0m02.212s    0m12.634s
                512   0m02.136s*   0m10.614s
               1024   0m02.156s    0m08.614s
               2048   0m02.208s    0m07.062s
               4096   0m02.190s    0m06.484s*
               8192   0m02.176s    0m07.635s
              16384   0m02.913s    0m19.845s
              32768   0m03.617s    1m05.507s
              65536   0m04.031s    1m18.488s

You can see that for the tree-only log-raw case, we don't
actually benefit that much as the cache grows (all the
differences up through 8192 are basically just noise; this
is probably because we don't actually have that many
distinct trees in git.git). But for log-S, we get a definite
speed improvement as the cache grows, but the improvements
are lost as cache size grows and the linear LRU management
starts to dominate.

Here's the same thing run against linux.git:

    MAX_DELTA_CACHE    log-raw       log-S
    ---------------   ---------    ----------
                256   0m40.987s     5m13.216s
                512   0m37.949s     5m03.243s
               1024   0m35.977s     4m50.580s
               2048   0m33.855s     4m39.818s
               4096   0m32.913s     4m47.299s*
               8192   0m32.176s*    5m14.650s
              16384   0m32.185s     6m31.625s
              32768   0m38.056s     9m31.136s
              65536   1m30.518s    17m38.549s

The pattern is similar, though the effect in log-raw is more
pronounced here. The times dip down in the middle, and then
go back up as we keep growing.

So we know there's a problem. What's the solution?

The obvious one is to improve the data structure to avoid
walking over tree entries during the looking-for-blobs
traversal. We can do this by keeping _two_ LRU lists: one
for blobs, and one for other objects. We drop items from the
blob LRU first, and then from the tree LRU (if necessary).

Here's git.git using that strategy:

    MAX_DELTA_CACHE    log-raw      log-S
    ---------------   ---------   ----------
                256   0m02.264s   0m12.830s
                512   0m02.201s   0m10.771s
               1024   0m02.181s   0m08.593s
               2048   0m02.205s   0m07.116s
               4096   0m02.158s   0m06.537s*
               8192   0m02.213s   0m07.246s
              16384   0m02.155s*  0m10.975s
              32768   0m02.159s   0m16.047s
              65536   0m02.181s   0m16.992s

The upswing on log-raw is gone completely. But log-S still
has it (albeit much better than without this strategy).
Let's see what linux.git shows:

    MAX_DELTA_CACHE    log-raw       log-S
    ---------------   ---------    ---------
                256   0m42.519s    5m14.654s
                512   0m39.106s    5m04.708s
               1024   0m36.802s    4m51.454s
               2048   0m34.685s    4m39.378s*
               4096   0m33.663s    4m44.047s
               8192   0m33.157s    4m50.644s
              16384   0m33.090s*   4m49.648s
              32768   0m33.458s    4m53.371s
              65536   0m33.563s    5m04.580s

The results are similar. The tree-only case again performs
well (not surprising; we're literally just dropping the one
useless walk, and not otherwise changing the cache eviction
strategy at all). But the log-S case again does a bit worse
as the cache grows (though possibly that's within the noise,
which is much larger for this case).

Perhaps this is an indication that the "remove blobs first"
strategy is not actually optimal. The intent of it is to
avoid blowing out the tree cache when we see large blobs,
but it also means we'll throw away useful, recent blobs in
favor of older trees.

Let's run the same numbers without caring about object type
at all (i.e., one LRU list, and always evicting whatever is
at the head, regardless of type).

Here's git.git:

    MAX_DELTA_CACHE    log-raw      log-S
    ---------------   ---------   ---------
                256   0m02.227s   0m12.821s
                512   0m02.143s   0m10.602s
               1024   0m02.127s   0m08.642s
               2048   0m02.148s   0m07.123s
               4096   0m02.194s   0m06.448s*
               8192   0m02.239s   0m06.504s
              16384   0m02.144s*  0m06.502s
              32768   0m02.202s   0m06.622s
              65536   0m02.230s   0m06.677s

Much smoother; there's no dramatic upswing as we increase
the cache size (some remains, though it's small enough that
it's mostly run-to-run noise. E.g., in the log-raw case,
note how 8192 is 50-100ms higher than its neighbors). Note
also that we stop getting any real benefit for log-S after
about 4096 entries; that number will depend on the size of
the repository, the size of the blob entries, and the memory
limit of the cache.

Let's see what linux.git shows for the same strategy:

    MAX_DELTA_CACHE    log-raw      log-S
    ---------------   ---------   ---------
                256   0m41.661s   5m12.410s
                512   0m39.547s   5m07.920s
               1024   0m37.054s   4m54.666s
               2048   0m35.871s   4m41.194s*
               4096   0m34.646s   4m51.648s
               8192   0m33.881s   4m55.342s
              16384   0m35.190s   5m00.122s
              32768   0m35.060s   4m58.851s
              65536   0m33.311s*  4m51.420s

It's similarly good. As with the "separate blob LRU"
strategy, there's a lot of noise on the log-S run here. But
it's certainly not any worse, possibly a bit better, and
the improvement over "separate blob LRU" on the git.git case
is dramatic.

So it seems like a clear winner, and that's what this patch
implements.

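As an illustration only (the names follow the surrounding
series, but the exact shape of the final code may differ),
type-blind eviction is just a walk of one list:

    /* sketch: drop LRU entries until we are under the limit */
    while (delta_base_cached > delta_base_cache_limit) {
            struct delta_base_cache_entry *ent =
                    list_entry(lru.next, struct delta_base_cache_entry, lru);
            release_delta_base_cache(ent); /* unlink and free */
    }
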
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-23 14:57:44 -07:00
Jeff King 12d95ef6fc delta_base_cache: use list.h for LRU
We keep an LRU list of entries for when we need to drop
something from an over-full cache. The list is implemented
as a circular doubly-linked list, which is exactly what
list.h provides. We can save a few lines by using the list.h
macros and functions. More importantly, this makes the code
easier to follow, as the reader sees explicit concepts like
"list_add_tail()" instead of pointer manipulation.

As a bonus, the list_entry() macro lets us place the lru
pointers anywhere inside the delta_base_cache_entry struct
(as opposed to just casting the pointer, which requires it
at the front of the struct). This will be useful in later
patches when we need to place other items at the front of
the struct (e.g., our hashmap implementation requires this).

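A rough sketch of the shape this enables (fields and names are
illustrative):

    struct delta_base_cache_entry {
            struct list_head lru; /* may live anywhere in the struct */
            void *data;
            unsigned long size;
            enum object_type type;
    };

    /* mark as most-recently used */
    list_del(&ent->lru);
    list_add_tail(&ent->lru, &lru);

    /* recover the entry from its embedded list node */
    ent = list_entry(pos, struct delta_base_cache_entry, lru);
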
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-23 14:52:00 -07:00
Jeff King f92dd60f95 release_delta_base_cache: reuse existing detach function
This function drops an entry entirely from the cache,
meaning that aside from the freeing of the buffer, it is
exactly equivalent to detach_delta_base_cache_entry(). Let's
build on top of the detach function, which shortens the code
and will make it simpler when we change out the underlying
storage in future patches.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-23 14:47:33 -07:00
Jeff King 4a5397ca79 clear_delta_base_cache_entry: use a more descriptive name
The delta base cache entries are stored in a fixed-length
hash table. So the way to remove an entry is to "clear" the
slot in the table, and that is what this function does.

However, the name is a leaky abstraction. If we were to
change the hash table implementation, it would no longer be
about "clearing". We should name it after _what_ it does,
not _how_ it does it. I.e., something like "remove" instead
of "clear".

But that does not tell the whole story, either. The subtle
thing about this function is that it removes the entry, but
does not free the entry data. So a more descriptive name is
"detach"; we give ownership of the data buffer to the
caller, and remove any other resources.

This patch uses the name detach_delta_base_cache_entry().
We could further model this after functions like
strbuf_detach(), which pass back all of the detached
information. However, since there are so many bits of
information in the struct (the data, the size, the type),
and so few callers (only one), it's not worth that
awkwardness. The name change and a comment can make the
intent clear.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-23 14:45:29 -07:00
Jeff King 85fe35ab9e cache_or_unpack_entry: drop keep_cache parameter
There is only one caller of cache_or_unpack_entry() and it
always passes 1 for the keep_cache parameter. We can
simplify it by dropping the "!keep_cache" case.

Another call, which did pass 0, was dropped in abe601b
(sha1_file: remove recursion in unpack_entry, 2013-03-27),
as unpack_entry() now does more complicated things than a
simple unpack when there is a cache miss.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-23 14:44:26 -07:00
Stefan Beller 9eeea7d2bc clone: factor out checking for an alternate path
In a later patch we want to determine if a path is suitable as an
alternate from commands other than builtin/clone. Move the checking
functionality of `add_one_reference` to `compute_alternate_path`, which
is declared in cache.h.

Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-15 15:28:01 -07:00
Junio C Hamano e6b8f80653 Merge branch 'vs/typofix'
* vs/typofix:
  Spelling fixes
2016-08-12 09:47:37 -07:00
Ville Skyttä 2e3a16b279 Spelling fixes
<BAD>                     <CORRECTED>
    accidently                accidentally
    commited                  committed
    dependancy                dependency
    emtpy                     empty
    existance                 existence
    explicitely               explicitly
    git-upload-achive         git-upload-archive
    hierachy                  hierarchy
    indegee                   indegree
    intial                    initial
    mulitple                  multiple
    non-existant              non-existent
    precendence.              precedence.
    priviledged               privileged
    programatically           programmatically
    psuedo-binary             pseudo-binary
    soemwhere                 somewhere
    successfull               successful
    transfering               transferring
    uncommited                uncommitted
    unkown                    unknown
    usefull                   useful
    writting                  writing

Signed-off-by: Ville Skyttä <ville.skytta@iki.fi>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-11 14:35:42 -07:00
Jeff King ca79c98572 sha1_file: make packed_object_info public
Some code may have a pack/offset pair for an object, but
would like to look up more information. Using
sha1_object_info() is too heavy-weight; it starts from the
sha1 and has to find the pack again (so not only does it waste
time, it might not even find the same instance).

In some cases, this problem is solved by helpers like
get_size_from_delta(), which is used by pack-objects to take
a shortcut for objects whose packed representation has
already been found. But there's no similar function for
getting the object type, for instance. Rather than introduce
one, let's just make the whole packed_object_info() available.
It is smart enough to spend effort only on the items the
caller wants.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-11 10:43:24 -07:00
Jeff King 27b5c1a065 provide an initializer for "struct object_info"
An all-zero initializer is fine for this struct, but because
the first element is a pointer, call sites need to know to
use "NULL" instead of "0". Otherwise some static checkers
like "sparse" will complain; see d099b71 (Fix some sparse
warnings, 2013-07-18) for example.  So let's provide an
initializer to make this easier to get right.

But let's also comment that memset() to zero is explicitly
OK[1]. One of the callers embeds object_info in another
struct which is initialized via memset (expand_data in
builtin/cat-file.c). Since our subset of C doesn't allow
assignment from a compound literal, handling this in any
other way is awkward, so we'd like to keep the ability to
initialize by memset(). By documenting this property, it
should make anybody who wants to change the initializer
think twice before doing so.

There's one other caller of interest. In parse_sha1_header(),
we did not initialize the struct fully in the first place.
This turned out not to be a bug because the sub-function it
calls does not look at any other fields except the ones we
did initialize. But that assumption might not hold in the
future, so it's a dangerous construct. This patch switches
it to initializing the whole struct, which protects us
against unexpected reads of the other fields.

[1] Obviously using memset() to initialize a pointer
    violates the C standard, but we long ago decided that it
    was an acceptable tradeoff in the real world.

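A minimal sketch of the resulting pattern (the macro body shown
here is an assumption based on the description above):

    #define OBJECT_INFO_INIT { NULL }

    struct object_info oi = OBJECT_INFO_INIT; /* not "= { 0 }" */
    memset(&oi, 0, sizeof(oi));               /* also explicitly OK */
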
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-11 10:42:23 -07:00
Junio C Hamano 1a5f1a3f25 Merge branch 'js/am-3-merge-recursive-direct'
"git am -3" calls "git merge-recursive" when it needs to fall back
to a three-way merge; this call has been turned into an internal
subroutine call instead of spawning a separate subprocess.

* js/am-3-merge-recursive-direct:
  merge-recursive: flush output buffer even when erroring out
  merge_trees(): ensure that the callers release output buffer
  merge-recursive: offer an option to retain the output in 'obuf'
  merge-recursive: write the commit title in one go
  merge-recursive: flush output buffer before printing error messages
  am -3: use merge_recursive() directly again
  merge-recursive: switch to returning errors instead of dying
  merge-recursive: handle return values indicating errors
  merge-recursive: allow write_tree_from_memory() to error out
  merge-recursive: avoid returning a wholesale struct
  merge_recursive: abort properly upon errors
  prepare the builtins for a libified merge_recursive()
  merge-recursive: clarify code in was_tracked()
  die(_("BUG")): avoid translating bug messages
  die("bug"): report bugs consistently
  t5520: verify that `pull --rebase` shows the helpful advice when failing
2016-08-10 12:33:20 -07:00
Junio C Hamano 78849622ec Merge branch 'jk/pack-objects-optim'
"git pack-objects" has a few options that tell it not to pack
objects found in certain packfiles, which require it to scan .idx
files of all available packs.  The codepaths involved in these
operations have been optimized for a common case of not having any
non-local pack and/or any .kept pack.

* jk/pack-objects-optim:
  pack-objects: compute local/ignore_pack_keep early
  pack-objects: break out of want_object loop early
  find_pack_entry: replace last_found_pack with MRU cache
  add generic most-recently-used list
  sha1_file: drop free_pack_by_name
  t/perf: add tests for many-pack scenarios
2016-08-08 14:48:39 -07:00
Junio C Hamano aa9136a87e Merge branch 'nd/pack-ofs-4gb-limit' into maint
"git pack-objects" and "git index-pack" mostly operate with off_t
when talking about the offset of objects in a packfile, but there
were a handful of places that used "unsigned long" to hold that
value, leading to an unintended truncation.

* nd/pack-ofs-4gb-limit:
  fsck: use streaming interface for large blobs in pack
  pack-objects: do not truncate result in-pack object size on 32-bit systems
  index-pack: correct "offset" type in unpack_entry_data()
  index-pack: report correct bad object offsets even if they are large
  index-pack: correct "len" type in unpack_data()
  sha1_file.c: use type off_t* for object_info->disk_sizep
  pack-objects: pass length to check_pack_crc() without truncation
2016-08-08 14:21:36 -07:00
Jeff King a73cdd21c4 find_pack_entry: replace last_found_pack with MRU cache
Each pack has an index for looking up entries in O(log n)
time, but if we have multiple packs, we have to scan through
them linearly. This can produce a measurable overhead for
some operations.

We dealt with this long ago in f7c22cc (always start looking
up objects in the last used pack first, 2007-05-30), which
keeps what is essentially a 1-element most-recently-used
cache. In theory, we should be able to do better by keeping
a similar but longer cache, that is the same length as the
pack-list itself.

Since we now have a convenient generic MRU structure, we can
plug it in and measure. Here are the numbers for running
p5303 against linux.git:

Test                      HEAD^                HEAD
------------------------------------------------------------------------
5303.3: rev-list (1)      31.56(31.28+0.27)    31.30(31.08+0.20) -0.8%
5303.4: repack (1)        40.62(39.35+2.36)    40.60(39.27+2.44) -0.0%
5303.6: rev-list (50)     31.31(31.06+0.23)    31.23(31.00+0.22) -0.3%
5303.7: repack (50)       58.65(69.12+1.94)    58.27(68.64+2.05) -0.6%
5303.9: rev-list (1000)   38.74(38.40+0.33)    31.87(31.62+0.24) -17.7%
5303.10: repack (1000)    367.20(441.80+4.62)  342.00(414.04+3.72) -6.9%

The main numbers of interest here are the rev-list ones
(since that is exercising the normal object lookup code
path).  The single-pack case shouldn't improve at all; the
260ms speedup there is just part of the run-to-run noise
(but it's important to note that we didn't make anything
worse with the overhead of maintaining our cache). In the
50-pack case, we see similar results. There may be a slight
improvement, but it's mostly within the noise.

The 1000-pack case does show a big improvement, though. That
carries over to the repack case, as well. Even though we
haven't touched its pack-search loop yet, it does still do a
lot of normal object lookups (e.g., for the internal
revision walk), and so improves.

As a point of reference, I also ran the 1000-pack test
against a version of HEAD^ with the last_found_pack
optimization disabled. It takes ~60s, so that gives an
indication of how much even the single-element cache is
helping.

For comparison, here's a smaller repository, git.git:

Test                      HEAD^               HEAD
---------------------------------------------------------------------
5303.3: rev-list (1)      1.56(1.54+0.01)    1.54(1.51+0.02) -1.3%
5303.4: repack (1)        1.84(1.80+0.10)    1.82(1.80+0.09) -1.1%
5303.6: rev-list (50)     1.58(1.55+0.02)    1.59(1.57+0.01) +0.6%
5303.7: repack (50)       2.50(3.18+0.04)    2.50(3.14+0.04) +0.0%
5303.9: rev-list (1000)   2.76(2.71+0.04)    2.24(2.21+0.02) -18.8%
5303.10: repack (1000)    13.21(19.56+0.25)  11.66(18.01+0.21) -11.7%

You can see that the percentage improvement is similar.
That's because the lookup we are optimizing is roughly
O(nr_objects * nr_packs). Since the number of packs is
constant in both tests, we'd expect the improvement to be
linear in the number of objects. But the whole process is
also linear in the number of objects, so the improvement
is a constant factor.

The exact improvement does also depend on the contents of
the packs. In p5303, the extra packs all have 5 first-parent
commits in them, which is a reasonable simulation of a
pushed-to repository. But it also means that only 250
first-parent commits are in those packs (compared to almost
50,000 total in linux.git), and the rest are in the huge
"base" pack. So once we start looking at history in that big
pack, that's where we'll find most everything, and even the
1-element cache gets close to 100% cache hits.  You could
almost certainly show better numbers with a more
pathological case (e.g., distributing the objects more
evenly across the packs). But that's simply not that
realistic a scenario, so it makes more sense to focus on
these numbers.

The implementation itself is a straightforward application
of the MRU code. We provide an MRU-ordered list of packs
that shadows the packed_git list. This is easy to do because
we only create and revise the pack list in one place. The
"reprepare" code path actually drops the whole MRU and
replaces it for simplicity. It would be more efficient to
just add new entries, but there's not much point in
optimizing here; repreparing happens rarely, and only after
doing a lot of other expensive work.  The key things to keep
optimized are traversal (which is just a normal linked list,
albeit with one extra level of indirection over the regular
packed_git list), and marking (which is a constant number of
pointer assignments, though slightly more than the old
last_found_pack was; it doesn't seem to create a measurable
slowdown, though).

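As a sketch, the lookup loop becomes roughly the following
(assuming the mru_mark() helper added earlier in this series;
fill_pack_entry() stands in for the per-pack index lookup):

    struct mru_entry *p;
    for (p = packed_git_mru.head; p; p = p->next) {
            struct packed_git *pack = p->item;
            if (fill_pack_entry(sha1, e, pack)) {
                    mru_mark(&packed_git_mru, p); /* move to front */
                    return 1;
            }
    }
    return 0;
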
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-29 11:05:07 -07:00
Jeff King 3157c880f6 sha1_file: drop free_pack_by_name
The point of this function is to drop an entry from the
"packed_git" cache that points to a file we might be
overwriting, because our contents may not be the same (and
hence the only caller was pack-objects as it moved a
temporary packfile into place).

In older versions of git, this could happen because the
names of packfiles were derived from the set of objects they
contained, not the actual bits on disk. But since 1190a1a
(pack-objects: name pack files after trailer hash,
2013-12-05), the name reflects the actual bits on disk, and
any two packfiles with the same name can be used
interchangeably.

Dropping this function not only saves a few lines of code,
it makes the lifetime of "struct packed_git" much easier to
reason about: namely, we now do not ever free these structs.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-29 11:05:06 -07:00
Junio C Hamano ad2d777604 Merge branch 'nd/pack-ofs-4gb-limit'
"git pack-objects" and "git index-pack" mostly operate with off_t
when talking about the offset of objects in a packfile, but there
were a handful of places that used "unsigned long" to hold that
value, leading to an unintended truncation.

* nd/pack-ofs-4gb-limit:
  fsck: use streaming interface for large blobs in pack
  pack-objects: do not truncate result in-pack object size on 32-bit systems
  index-pack: correct "offset" type in unpack_entry_data()
  index-pack: report correct bad object offsets even if they are large
  index-pack: correct "len" type in unpack_data()
  sha1_file.c: use type off_t* for object_info->disk_sizep
  pack-objects: pass length to check_pack_crc() without truncation
2016-07-28 10:34:42 -07:00
Johannes Schindelin ef1177d18e die("bug"): report bugs consistently
The vast majority of error messages in Git's source code which report a
bug use the convention to prefix the message with "BUG:".

As part of cleaning up merge-recursive to stop die()ing except in case of
detected bugs, let's just make the remainder of the bug reports consistent
with the de facto rule.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-26 11:13:44 -07:00
Nguyễn Thái Ngọc Duy 211c61c6cf pack-objects: pass length to check_pack_crc() without truncation
On 32 bit systems with large file support, unsigned long is 32-bit
while the two offsets in the subtraction expression (pack-objects has
the exact same expression as in sha1_file.c but not shown in diff) are
in 64-bit. If an in-pack object is larger than 2^32 len/datalen is
truncated and we get a misleading "error: bad packed object CRC for
..." as a result.

Use off_t for len and datalen. check_pack_crc() already accepts this
argument as off_t and can deal with 4+ GB.

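The problematic pattern, in miniature (variable names are
illustrative):

    /* before: on 32-bit, the 64-bit difference is truncated */
    unsigned long len = next_offset - obj_offset;

    /* after: off_t keeps the full width, matching check_pack_crc() */
    off_t len = next_offset - obj_offset;
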
Noticed-by: Christoph Michelbach <michelbach94@gmail.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-12 10:14:29 -07:00
Junio C Hamano 352d72a30e Merge branch 'nd/worktree-various-heads'
The experimental "multiple worktree" feature gains more safety to
forbid operations on a branch that is checked out or being actively
worked on elsewhere, by noticing that e.g. it is being rebased.

* nd/worktree-various-heads:
  branch: do not rename a branch under bisect or rebase
  worktree.c: check whether branch is bisected in another worktree
  wt-status.c: split bisect detection out of wt_status_get_state()
  worktree.c: check whether branch is rebased in another worktree
  worktree.c: avoid referencing to worktrees[i] multiple times
  wt-status.c: make wt_status_check_rebase() work on any worktree
  wt-status.c: split rebase detection out of wt_status_get_state()
  path.c: refactor and add worktree_git_path()
  worktree.c: mark current worktree
  worktree.c: make find_shared_symref() return struct worktree *
  worktree.c: store "id" instead of "git_dir"
  path.c: add git_common_path() and strbuf_git_common_path()
  dir.c: rename str(n)cmp_icase to fspath(n)cmp
2016-05-23 14:54:29 -07:00
Nguyễn Thái Ngọc Duy 7616c6ca9d sha1_file.c: use {error,die,warning}_errno()
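A representative before/after of this mechanical conversion (the
call site is illustrative):

    /* before */
    die("unable to open %s: %s", path, strerror(errno));

    /* after: die_errno() appends strerror(errno) itself */
    die_errno("unable to open %s", path);
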
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-05-09 12:29:08 -07:00
Nguyễn Thái Ngọc Duy ba0897e6ae dir.c: rename str(n)cmp_icase to fspath(n)cmp
These functions compare two paths that are taken from the file system.
Depending on the file system in use, paths may need to be compared
case-sensitively or not, and maybe even in some other way in the
future. The current names do not convey that well.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-04-22 14:09:37 -07:00
Junio C Hamano 090de6b289 Merge branch 'jk/pack-idx-corruption-safety'
The code to read the pack data using the offsets stored in the pack
idx file has been made to check the validity of the data in the idx
more carefully.

* jk/pack-idx-corruption-safety:
  sha1_file.c: mark strings for translation
  use_pack: handle signed off_t overflow
  nth_packed_object_offset: bounds-check extended offset
  t5313: test bounds-checks of corrupted/malicious pack/idx files
2016-03-04 13:45:47 -08:00
Nguyễn Thái Ngọc Duy 7465feba51 sha1_file.c: mark strings for translation
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-02-27 09:54:57 -08:00
Junio C Hamano 11529ecec9 Merge branch 'jk/tighten-alloc'
Update various codepaths to avoid manually-counted malloc().

* jk/tighten-alloc: (22 commits)
  ewah: convert to REALLOC_ARRAY, etc
  convert ewah/bitmap code to use xmalloc
  diff_populate_gitlink: use a strbuf
  transport_anonymize_url: use xstrfmt
  git-compat-util: drop mempcpy compat code
  sequencer: simplify memory allocation of get_message
  test-path-utils: fix normalize_path_copy output buffer size
  fetch-pack: simplify add_sought_entry
  fast-import: simplify allocation in start_packfile
  write_untracked_extension: use FLEX_ALLOC helper
  prepare_{git,shell}_cmd: use argv_array
  use st_add and st_mult for allocation size computation
  convert trivial cases to FLEX_ARRAY macros
  use xmallocz to avoid size arithmetic
  convert trivial cases to ALLOC_ARRAY
  convert manual allocations to argv_array
  argv-array: add detach function
  add helpers for allocating flex-array structs
  harden REALLOC_ARRAY and xcalloc against size_t overflow
  tree-diff: catch integer overflow in combine_diff_path allocation
  ...
2016-02-26 13:37:16 -08:00
Jeff King 13e0b0d3dc use_pack: handle signed off_t overflow
A v2 pack index file can specify an offset within a packfile
of up to 2^64-1 bytes. On a system with a signed 64-bit
off_t, we can represent only up to 2^63-1. This means that a
corrupted .idx file can end up with a negative offset in the
pack code. Our bounds-checking use_pack function looks for
too-large offsets, but not for ones that have wrapped around
to negative. Let's do so, which fixes an out-of-bounds
access demonstrated in t5313.

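The added guard is tiny; roughly:

    if (offset < 0)
            die(_("offset before end of packfile (broken .idx?)"));
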
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-02-25 11:32:46 -08:00
Jeff King 47fe3f6ef0 nth_packed_object_offset: bounds-check extended offset
If a pack .idx file has a corrupted offset for an object, we
may try to access an offset in the .idx or .pack file that
is larger than the file's size.  For the .pack case, we have
use_pack() to protect us, which realizes the access is out
of bounds. But if the corrupted value asks us to look in the
.idx file's secondary 64-bit offset table, we blindly add it
to the mmap'd index data and access arbitrary memory.

We can fix this with a simple bounds-check compared to the
size we found when we opened the .idx file.

Note that there's similar code in index-pack that is
triggered only during "index-pack --verify". To support
both, we pull the bounds-check into a separate function,
which dies when it sees a corrupted file.

It would be nice if we could return an error, so that the
pack code could try to find a good copy of the object
elsewhere. Currently nth_packed_object_offset doesn't have
any way to return an error, but it could probably use "0" as
a sentinel value (since no object can start there). This is
the minimal fix, and we can improve the resilience later on
top.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-02-25 11:32:43 -08:00
Jeff King 50a6c8efa2 use st_add and st_mult for allocation size computation
If our size computation overflows size_t, we may allocate a
much smaller buffer than we expected and overflow it. It's
probably impossible to trigger an overflow in most of these
sites in practice, but it is easy enough convert their
additions and multiplications into overflow-checking
variants. This may be fixing real bugs, and it makes
auditing the code easier.

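For example (illustrative call site):

    /* before: n * size + 1 can silently wrap around */
    buf = xmalloc(n * size + 1);

    /* after: st_mult()/st_add() die on size_t overflow */
    buf = xmalloc(st_add(st_mult(n, size), 1));
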
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-02-22 14:51:09 -08:00
Jeff King b32fa95fd8 convert trivial cases to ALLOC_ARRAY
Each of these cases can be converted to use ALLOC_ARRAY or
REALLOC_ARRAY, which has two advantages:

  1. It automatically checks the array-size multiplication
     for overflow.

  2. It always uses sizeof(*array) for the element-size,
     so that it can never go out of sync with the declared
     type of the array.

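A typical conversion looks like this (illustrative):

    /* before: unchecked multiplication, element size spelled out */
    items = xmalloc(nr * sizeof(struct item));

    /* after: overflow-checked, and sizeof(*items) cannot go stale */
    ALLOC_ARRAY(items, nr);
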
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-02-22 14:51:09 -08:00
Junio C Hamano 3f16396228 clone/sha1_file: read info/alternates with strbuf_getline()
$GIT_OBJECT_DIRECTORY/info/alternates is a text file that can be
edited with a DOS editor.  We do not want to use the real path with
CR appended at the end.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-01-15 10:34:53 -08:00
Junio C Hamano 8f309aeb82 strbuf: introduce strbuf_getline_{lf,nul}()
The strbuf_getline() interface allows a byte other than LF or NUL as
the line terminator, but this is only because I wrote these
codepaths anticipating that a value other than NUL and LF might be
useful when I introduced line_termination a long time ago.  No
useful caller that uses another value has emerged.

By now, it is clear that the interface is overly broad without a
good reason.  Many codepaths have hardcoded preference to read
either LF terminated or NUL terminated records from their input, and
then call strbuf_getline() with LF or NUL as the third parameter.

This step introduces two thin wrappers around strbuf_getline(),
namely, strbuf_getline_lf() and strbuf_getline_nul(), and
mechanically rewrites these call sites to call either one of
them.  The changes contained in this patch are:

 * introduction of these two functions in strbuf.[ch]

 * mechanical conversion of all callers to strbuf_getline() with
   either '\n' or '\0' as the third parameter to instead call the
   respective thin wrapper.

After this step, output from "git grep 'strbuf_getline('" would
become a lot smaller.  An interim goal of this series is to make
this an empty set, so that we can have strbuf_getline_crlf() take
over the shorter name strbuf_getline().

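The wrappers themselves are thin; roughly:

    int strbuf_getline_lf(struct strbuf *sb, FILE *fp)
    {
            return strbuf_getline(sb, fp, '\n');
    }

    int strbuf_getline_nul(struct strbuf *sb, FILE *fp)
    {
            return strbuf_getline(sb, fp, '\0');
    }
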
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-01-15 10:12:51 -08:00
Junio C Hamano fbe959dde7 Merge branch 'bc/format-patch-null-from-line'
"format-patch" has learned a new option to zero-out the commit
object name on the mbox "From " line.

* bc/format-patch-null-from-line:
  format-patch: check that header line has expected format
  format-patch: add an option to suppress commit hash
  sha1_file.c: introduce a null_oid constant
2015-12-21 10:59:08 -08:00
Junio C Hamano 897b18508b Merge branch 'jk/prune-mtime'
The helper used to iterate over loose object directories to prune
stale objects did not closedir() immediately when it was done with a
directory--a callback such as the one used for "git prune" may want
to rmdir() the directory, but that would fail on a directory that is
still open on platforms such as WinXP.

* jk/prune-mtime:
  prune: close directory earlier during loose-object directory traversal
2015-12-15 08:02:16 -08:00
brian m. carlson 3e56e7245c sha1_file.c: introduce a null_oid constant
null_oid is the struct object_id equivalent to null_sha1.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-12-14 13:35:54 -08:00
brian m. carlson b419aa25d5 sha1_file: introduce has_object_file helper.
Add has_object_file, which is a wrapper around has_sha1_file, but for
struct object_id.

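The wrapper is a one-liner; roughly:

    int has_object_file(const struct object_id *oid)
    {
            return has_sha1_file(oid->hash);
    }
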
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Jeff King <peff@peff.net>
2015-11-20 08:02:05 -05:00
Jeff King 45014beac0 Merge branch 'dk/gc-idx-wo-pack'
Having a leftover .idx file without corresponding .pack file in
the repository hurts performance; "git gc" learned to prune them.

* dk/gc-idx-wo-pack:
  gc: remove garbage .idx files from pack dir
  t5304: test cleaning pack garbage
  prepare_packed_git(): refactor garbage reporting in pack directory
2015-11-20 06:55:34 -05:00
Junio C Hamano 808d119263 Merge branch 'js/misc-fixes'
Various compilation fixes and squelching of warnings.

* js/misc-fixes:
  Correct fscanf formatting string for I64u values
  Silence GCC's "cast of pointer to integer of a different size" warning
  Squelch warning about an integer overflow
2015-10-30 13:07:00 -07:00
Johannes Schindelin 56a1a3ab44 Silence GCC's "cast of pointer to integer of a different size" warning
When calculating hashes from pointers, it actually makes sense to cut
off the most significant bits. In that case, said warning does not make
a whole lot of sense.

So let's just work around it by casting the pointer first to intptr_t
and then casting up/down to the final integral type.

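For example, when hashing a pointer (illustrative):

    unsigned int hash;

    /* before: GCC warns about the pointer-to-int size change */
    hash = (unsigned int)ptr;

    /* after: cast via intptr_t first, then narrow explicitly */
    hash = (unsigned int)(intptr_t)ptr;
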
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-10-26 13:24:03 -07:00
Junio C Hamano 78891795df Merge branch 'jk/war-on-sprintf'
Many allocations that are manually counted (correctly) and then
followed by strcpy/sprintf have been replaced with less error-prone
constructs such as xstrfmt.

Macintosh-specific breakage was noticed and corrected in this
reroll.

* jk/war-on-sprintf: (70 commits)
  name-rev: use strip_suffix to avoid magic numbers
  use strbuf_complete to conditionally append slash
  fsck: use for_each_loose_file_in_objdir
  Makefile: drop D_INO_IN_DIRENT build knob
  fsck: drop inode-sorting code
  convert strncpy to memcpy
  notes: document length of fanout path with a constant
  color: add color_set helper for copying raw colors
  prefer memcpy to strcpy
  help: clean up kfmclient munging
  receive-pack: simplify keep_arg computation
  avoid sprintf and strcpy with flex arrays
  use alloc_ref rather than hand-allocating "struct ref"
  color: add overflow checks for parsing colors
  drop strcpy in favor of raw sha1_to_hex
  use sha1_to_hex_r() instead of strcpy
  daemon: use cld->env_array when re-spawning
  stat_tracking_info: convert to argv_array
  http-push: use an argv_array for setup_revisions
  fetch-pack: use argv_array for index-pack / unpack-objects
  ...
2015-10-20 15:24:01 -07:00
Junio C Hamano db5adf24bf Merge branch 'js/clone-dissociate'
"git clone --dissociate" runs a big "git repack" process at the
end, and it helps to close file descriptors that are open on the
packs and their idx files before doing so on filesystems that
cannot remove a file that is still open.

* js/clone-dissociate:
  clone --dissociate: avoid locking pack files
  sha1_file.c: add a function to release all packs
  sha1_file: consolidate code to close a pack's file descriptor
  t5700: demonstrate a Windows file locking issue with `git clone --dissociate`
2015-10-15 15:43:49 -07:00
Johannes Schindelin 38849a8116 sha1_file.c: add a function to release all packs
On Windows, files that are in use cannot be removed or renamed. That
means that we have to release pack files when we are about to, say,
repack them. Let's introduce a convenient function to close all the
pack files and their idx files.

While at it, we consolidate the close windows/close fd/close index
stanza in `free_pack_by_name()` into the `close_pack()` function that
is used by the new `close_all_packs()` function to avoid repeated code.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-10-07 10:47:10 -07:00
Johannes Schindelin 71fe5d7fb0 sha1_file: consolidate code to close a pack's file descriptor
There was a lot of repeated code to close the file descriptor of
a given pack. Let's just refactor this code into a single function.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-10-05 14:43:58 -07:00
Jeff King c7ab0ba340 avoid sprintf and strcpy with flex arrays
When we are allocating a struct with a FLEX_ARRAY member, we
generally compute the size of the array and then sprintf or
strcpy into it. Normally we could improve a dynamic allocation
like this by using xstrfmt, but it doesn't work here; we
have to account for the size of the rest of the struct.

But we can improve things a bit by storing the length that
we use for the allocation, and then feeding it to xsnprintf
or memcpy, which makes it more obvious that we are not
writing more than the allocated number of bytes.

It would be nice if we had some kind of helper for
allocating generic flex arrays, but it doesn't work that
well:

 - the call signature is a little bit unwieldy:

      d = flex_struct(sizeof(*d), offsetof(d, path), fmt, ...);

   You need offsetof here instead of just writing to the
   end of the base size, because we don't know how the
   struct is packed (partially this is because FLEX_ARRAY
   might not be zero, though we can account for that; but
   the size of the struct may actually be rounded up for
   alignment, and we can't know that).

 - some sites do clever things, like over-allocating because
   they know they will write larger things into the buffer
   later (e.g., struct packed_git here).

So we're better off to just write out each allocation (or
add type-specific helpers, though many of these are one-off
allocations anyway).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-10-05 11:08:05 -07:00
Jeff King d4b3d11a03 write_loose_object: convert to strbuf
When creating a loose object tempfile, we use a fixed
PATH_MAX-sized buffer, and strcpy directly into it. This
isn't buggy, because we do a rough check of the size, but
there's no verification that our guesstimate of the required
space is enough (in fact, it's several bytes too big for the
current naming scheme).

Let's switch to a strbuf, which makes this much easier to
verify. The allocation overhead should be negligible, since
we are replacing a static buffer with a static strbuf, and
we'll only need to allocate on the first call.

While we're here, we can also document a subtle interaction
with mkstemp that would be easy to overlook.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-10-05 11:08:05 -07:00
Jeff King ac5190cc48 sha1_get_pack_name: use a strbuf
We do some manual memory computation here, and there's no
check that our 60 is not overflowed by the raw sprintf (it
isn't, because the "which" parameter is never longer than
"pack"). We can simplify this greatly with a strbuf.

Technically the end result is not identical, as the original
took care not to rewrite the object directory on each call
for performance reasons.  We could do that here, too (by
saving the baselen and resetting to it), but it's not worth
the complexity; this function is not called a lot (generally
once per packfile that we open).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-09-25 10:18:18 -07:00
Jeff King 9ae97018fb use strip_suffix and xstrfmt to replace suffix
When we want to convert "foo.pack" to "foo.idx", we do it by
duplicating the original string and then munging the bytes
in place. Let's use strip_suffix and xstrfmt instead, which
has several advantages:

  1. It's more clear what the intent is.

  2. It does not implicitly rely on the fact that
     strlen(".idx") <= strlen(".pack") to avoid an overflow.

  3. We communicate the assumption that the input file ends
     with ".pack" (and get a run-time check that this is so).

  4. We drop calls to strcpy, which makes auditing the code
     base easier.

Likewise, we can do this to convert ".pack" to ".bitmap",
avoiding some manual memory computation.

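The resulting pattern, roughly:

    size_t len;
    if (!strip_suffix(pack_name, ".pack", &len))
            die("BUG: pack name does not end in .pack: %s", pack_name);
    idx_name = xstrfmt("%.*s.idx", (int)len, pack_name);
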
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-09-25 10:18:18 -07:00
Jeff King 48bcc1c3cc add_packed_git: convert strcpy into xsnprintf
We have the path "foo.idx", and we create a buffer big
enough to hold "foo.pack" and "foo.keep", and then strcpy
straight into it. This isn't a bug (we have enough space),
but it's very hard to tell from the strcpy that this is so.

Let's instead use strip_suffix to take off the ".idx",
record the size of our allocation, and use xsnprintf to make
sure we don't violate our assumptions.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-09-25 10:18:18 -07:00
Jeff King ef1286d3c0 use xsnprintf for generating git object headers
We generally use 32-byte buffers to format git's "type size"
header fields. These should not generally overflow unless
you can produce some truly gigantic objects (and our types
come from our internal array of constant strings). But it is
a good idea to use xsnprintf to make sure this is the case.

Note that we slightly modify the interface to
write_sha1_file_prepare, which now uses "hdrlen" as an "in"
parameter as well as an "out" (on the way in it stores the
allocated size of the header, and on the way out it returns
the ultimate size of the header).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-09-25 10:18:18 -07:00
Junio C Hamano f0bc854623 Sync with 2.5.2 2015-09-09 14:30:35 -07:00
Junio C Hamano 3d3caf0b78 Sync with 2.4.9 2015-09-04 10:43:23 -07:00
Junio C Hamano ef0e938a1a Sync with 2.3.9 2015-09-04 10:34:19 -07:00
Junio C Hamano 8267cd11d6 Sync with 2.2.3 2015-09-04 10:29:28 -07:00
Jeff King 5015f01c12 read_info_alternates: handle paths larger than PATH_MAX
This function assumes that the relative_base path passed
into it is no larger than PATH_MAX, and writes into a
fixed-size buffer. However, this path may not have actually
come from the filesystem; for example, add_submodule_odb
generates a path using a strbuf and passes it in. This is
hard to trigger in practice, though, because the long
submodule directory would have to exist on disk before we
would try to open its info/alternates file.

We can easily avoid the bug, though, by simply creating the
filename on the heap.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-09-04 09:36:51 -07:00
Junio C Hamano cbcd3dcaa8 Merge branch 'cb/open-noatime-clear-errno' into maint
When trying to see that an object does not exist, a stale errno
leaked from our "first try to open a packfile with O_NOATIME and
then if it fails retry without it" logic on a system that refuses
O_NOATIME.  This confused us and caused us to die, saying that the
packfile is unreadable, when we should have just reported that the
object does not exist in that packfile to the caller.

* cb/open-noatime-clear-errno:
  git_open_noatime: return with errno=0 on success
2015-09-03 19:17:49 -07:00
Junio C Hamano 9b8d731995 Merge branch 'cb/open-noatime-clear-errno'
When trying to see that an object does not exist, a stale errno
leaked from our "first try to open a packfile with O_NOATIME and
then if it fails retry without it" logic on a system that refuses
O_NOATIME.  This confused us and caused us to die, saying that the
packfile is unreadable, when we should have just reported that the
object does not exist in that packfile to the caller.

* cb/open-noatime-clear-errno:
  git_open_noatime: return with errno=0 on success
2015-08-25 14:57:10 -07:00
Junio C Hamano 8c9155e031 Merge branch 'jk/git-path'
git_path() and mkpath() are handy helper functions, but they are easy
to misuse, as callers need to be careful to keep the number of
active results below 4.  Their uses have been reduced.

* jk/git-path:
  memoize common git-path "constant" files
  get_repo_path: refactor path-allocation
  find_hook: keep our own static buffer
  refs.c: remove_empty_directories can take a strbuf
  refs.c: avoid git_path assignment in lock_ref_sha1_basic
  refs.c: avoid repeated git_path calls in rename_tmp_log
  refs.c: simplify strbufs in reflog setup and writing
  path.c: drop git_path_submodule
  refs.c: remove extra git_path calls from read_loose_refs
  remote.c: drop extraneous local variable from migrate_file
  prefer mkpathdup to mkpath in assignments
  prefer git_pathdup to git_path in some possibly-dangerous cases
  add_to_alternates_file: don't add duplicate entries
  t5700: modernize style
  cache.h: complete set of git_path_submodule helpers
  cache.h: clarify documentation for git_path, et al
2015-08-19 14:48:56 -07:00
Junio C Hamano 51a22ce147 Merge branch 'jc/finalize-temp-file'
Long overdue micro clean-up.

* jc/finalize-temp-file:
  sha1_file.c: rename move_temp_to_file() to finalize_object_file()
2015-08-19 14:48:55 -07:00
Junio C Hamano 0a489b0680 prepare_packed_git(): refactor garbage reporting in pack directory
The hook to report "garbage" files in $GIT_OBJECT_DIRECTORY/pack/
could be generic but is too specific to count-objects' needs.

Move the part to produce human-readable messages to count-objects,
and refine the interface to call back with the "bits" with values
defined in the cache.h header file, so that other callers (e.g.
prune) can later use the same mechanism to enumerate different
kinds of garbage files and do something intelligent about them,
other than reporting in textual messages.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-08-17 09:14:59 -07:00
Clemens Buchacher dff6f280df git_open_noatime: return with errno=0 on success
In read_sha1_file_extended we die if read_object fails with a fatal
error. We detect a fatal error if errno is non-zero and is not
ENOENT. If the object could not be read because it does not exist,
this is not considered a fatal error and we want to return NULL.

Somewhere down the line, read_object calls git_open_noatime to open
a pack index file, for example. We first try open with O_NOATIME.
If O_NOATIME fails with EPERM, we retry without O_NOATIME. When the
second open succeeds, errno is however still set to EPERM from the
first attempt. When we finally determine that the object does not
exist, read_object returns NULL and read_sha1_file_extended dies
with a fatal error:

    fatal: failed to read object <sha1>: Operation not permitted

Fix this by resetting errno to zero before we call open again.

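In miniature, the fix looks like this (illustrative):

    fd = open(name, O_RDONLY | O_NOATIME);
    if (fd < 0 && errno == EPERM) {
            /* clear the EPERM left over from the first attempt */
            errno = 0;
            fd = open(name, O_RDONLY);
    }
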
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Clemens Buchacher <clemens.buchacher@intel.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-08-12 13:56:19 -07:00
Johannes Sixt 094c7e6352 prune: close directory earlier during loose-object directory traversal
27e1e22d (prune: factor out loose-object directory traversal, 2014-10-16)
introduced a new function for_each_loose_file_in_objdir() with a helper
for_each_file_in_obj_subdir(). The latter calls callbacks for each file
found during a directory traversal and finally also a callback for the
directory itself.

git-prune uses the function to clean up the object directory. In
particular, in the directory callback it calls rmdir(). On Windows XP,
this rmdir call fails, because the directory is still open while the
callback is called. Close the directory before calling the callback.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
Acked-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-08-12 12:06:00 -07:00
Jeff King 77b9b1d13a add_to_alternates_file: don't add duplicate entries
The add_to_alternates_file function blindly uses
hold_lock_file_for_append to copy the existing contents, and
then adds the new line to it. This has two minor problems:

  1. We might add duplicate entries, which are ugly and
     inefficient.

  2. We do not check that the file ends with a newline, in
     which case we would bogusly append to the final line.
     This is quite unlikely in practice, though, as we call
     this function only from git-clone, so presumably we are
     the only writers of the file (and we always add a
     newline).

Instead of using hold_lock_file_for_append, let's copy the
file line by line, which ensures all records are properly
terminated. If we see a line matching the entry we are about
to add, we can simply abort the update (there is no point in
even copying the rest, as we know that the result would be
identical to the original).

As a bonus, we also get rid of some calls to the
static-buffer mkpath and git_path functions.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-08-10 15:15:42 -07:00
Junio C Hamano cb5add5868 sha1_file.c: rename move_temp_to_file() to finalize_object_file()
Since 5a688fe4 ("core.sharedrepository = 0mode" should set, not
loosen, 2009-03-25), we kept reminding ourselves:

    NEEDSWORK: this should be renamed to finalize_temp_file() as
    "moving" is only a part of what it does, when no patch between
    master to pu changes the call sites of this function.

without doing anything about it.  Let's do so.

The purpose of this function was not to move but to finalize.  The
primary implementation detail of finalizing was to link the
temporary file to its final name and then to unlink it, which wasn't
even "moving".  The alternative implementation did "move" by calling
rename(2), which is a fun tangent.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-08-10 11:10:37 -07:00
Junio C Hamano 7c696007ca Merge branch 'jk/fix-refresh-utime' into maint
Fix a small bug in our use of umask() return value.

* jk/fix-refresh-utime:
  check_and_freshen_file: fix reversed success-check
2015-07-27 12:21:40 -07:00
Junio C Hamano de62fe8c42 Merge branch 'jk/index-pack-reduce-recheck' into maint
Disable "have we lost a race with competing repack?" check while
receiving a huge object transfer that runs index-pack.

* jk/index-pack-reduce-recheck:
  index-pack: avoid excessive re-reading of pack directory
2015-07-27 12:21:38 -07:00
Junio C Hamano 1f9e0a5348 Merge branch 'jk/fix-refresh-utime'
Fix a small bug in our use of umask() return value.

* jk/fix-refresh-utime:
  check_and_freshen_file: fix reversed success-check
2015-07-10 14:26:15 -07:00
Junio C Hamano c07173f215 Merge branch 'jk/maint-for-each-packed-object'
The for_each_packed_object() API function did not iterate over
objects in a packfile that hasn't been used yet.

* jk/maint-for-each-packed-object:
  for_each_packed_object: automatically open pack index
2015-07-09 14:31:43 -07:00
Jeff King 3096b2ecdb check_and_freshen_file: fix reversed success-check
When we want to write out a loose object file, we have
always first made sure we don't already have the object
somewhere. Since 33d4221 (write_sha1_file: freshen existing
objects, 2014-10-15), we also update the timestamp on the
file, so that a simultaneous prune knows somebody is
likely to reference it soon.

If our utime() call fails, we treat this the same as not
having the object in the first place; the safe thing to do
is write out another copy. However, the loose-object check
accidentally inverts the utime() check; it returns failure
_only_ when the utime() call actually succeeded. Thus it was
failing to protect us there, and in the normal case where
utime() succeeds, it caused us to pointlessly write out and
link the object.

This passed our freshening tests, because writing out the
new object is certainly _one_ way of updating its utime. So
the normal case was inefficient, but not wrong.

While we're here, let's also drop a comment in front of the
check_and_freshen functions, making a note of their return
type (since it is not our usual "0 for success, -1 for
error").

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-07-08 15:58:28 -07:00
Junio C Hamano c5baf18a40 Merge branch 'jk/diagnose-config-mmap-failure' into maint
The configuration reader/writer uses the mmap(2) interface to access
the files; when it found a directory, it barfed with "Out of memory?".

* jk/diagnose-config-mmap-failure:
  xmmap(): drop "Out of memory?"
  config.c: rewrite ENODEV into EISDIR when mmap fails
  config.c: avoid xmmap error messages
  config.c: fix mmap leak when writing config
  read-cache.c: drop PROT_WRITE from mmap of index
2015-06-25 11:02:11 -07:00
Junio C Hamano 712b351bd3 Merge branch 'jk/index-pack-reduce-recheck'
Disable "have we lost a race with competing repack?" check while
receiving a huge object transfer that runs index-pack.

* jk/index-pack-reduce-recheck:
  index-pack: avoid excessive re-reading of pack directory
2015-06-24 12:21:54 -07:00
Jeff King f813e9ea5f for_each_packed_object: automatically open pack index
When for_each_packed_object is called, we call
prepare_packed_git() to make sure we have the actual list of
packs. But the latter does not actually open the pack
indices, meaning that pack->nr_objects may simply be 0 if
the pack has not otherwise been used since the program
started.

In practice, this didn't come up for the current callers,
because they iterate the packed objects only after iterating
all reachable objects (so for it to matter you would have to
have a pack consisting only of unreachable objects). But it
is a dangerous and confusing interface that should be fixed
for future callers.

Note that we do not end the iteration when a pack cannot be
opened, but we do return an error. That lets you complete
the iteration even in actively-repacked repository where an
.idx file may racily go away, but it also lets callers know
that they may not have gotten the complete list (which the
current reachability-check caller does care about).

We have to tweak one of the prune tests due to the changed
return value; an earlier test creates bogus .idx files and
does not clean them up. Having to make this tweak is a good
thing; it means we will not prune in a broken repository,
and the test confirms that we do not negatively impact a
more lenient caller, count-objects.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-06-22 14:53:58 -07:00
Junio C Hamano 070d276cc1 Merge branch 'jh/filter-empty-contents' into maint
The clean/smudge interface did not work well when filtering empty
contents (it failed and then passed the empty input through).
It can be argued that a filter that produces anything but empty for
an empty input is nonsense, but if the user wants to do strange
things, then why not?

* jh/filter-empty-contents:
  sha1_file: pass empty buffer to index empty file
2015-06-16 14:33:44 -07:00
Junio C Hamano dee47925c1 Merge branch 'jk/diagnose-config-mmap-failure'
The configuration reader/writer uses the mmap(2) interface to access
the files; when it found a directory, it barfed with "Out of memory?".

* jk/diagnose-config-mmap-failure:
  xmmap(): drop "Out of memory?"
  config.c: rewrite ENODEV into EISDIR when mmap fails
  config.c: avoid xmmap error messages
  config.c: fix mmap leak when writing config
  read-cache.c: drop PROT_WRITE from mmap of index
2015-06-11 09:29:55 -07:00
Jeff King 0eeb077be7 index-pack: avoid excessive re-reading of pack directory
Since 45e8a74 (has_sha1_file: re-check pack directory before
giving up, 2013-08-30), we spend extra effort for
has_sha1_file to give the right answer when somebody else is
repacking. Usually this effort does not matter, because
after finding that the object does not exist, the next step
is usually to die().

However, some code paths make a large number of
has_sha1_file checks which are _not_ expected to return 1.
The collision test in index-pack.c is such a case. On a
local system, this can cause a performance slowdown of
around 5%. But on a system with high-latency system calls
(like NFS), it can be much worse.

This patch introduces a "quick" flag to has_sha1_file which
callers can use when they would prefer high performance at
the cost of false negatives during repacks. There may be
other code paths that can use this, but the index-pack one
is the most obviously critical, so we'll start with
switching that one.

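The collision check can then opt in; roughly (flag and helper
names as introduced by this patch):

    /* quick: skip the pack-directory re-scan; false negatives OK */
    collision_test_needed =
            has_sha1_file_with_flags(sha1, HAS_SHA1_QUICK);
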
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-06-09 12:26:35 -07:00
Junio C Hamano 3c91e9966a Merge branch 'jk/sha1-file-reduce-useless-warnings' into maint
* jk/sha1-file-reduce-useless-warnings:
  sha1_file: squelch "packfile cannot be accessed" warnings
2015-06-05 12:00:07 -07:00
Junio C Hamano 152722f155 Merge branch 'jh/filter-empty-contents'
The clean/smudge interface did not work well when filtering empty
contents (it failed and then passed the empty input through).
It can be argued that a filter that produces anything but empty for
an empty input is nonsense, but if the user wants to do strange
things, then why not?

* jh/filter-empty-contents:
  sha1_file: pass empty buffer to index empty file
2015-06-01 12:45:11 -07:00
Junio C Hamano 9ca0aaf6de xmmap(): drop "Out of memory?"
We show that message with die_errno(), but the OS ought to know
why mmap(2) failed much better than we do.  There is no reason for
us to say "Out of memory?" here.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-05-28 11:35:25 -07:00
Jeff King 1570856b51 config.c: avoid xmmap error messages
The config-writing code uses xmmap to map the existing
config file, which will die if the map fails. This has two
downsides:

  1. The error message is not very helpful, as it lacks any
     context about the file we are mapping:

       $ mkdir foo
       $ git config --file=foo some.key value
       fatal: Out of memory? mmap failed: No such device

  2. We normally do not die in this code path; instead, we'd
     rather report the error and return an appropriate exit
     status (which is part of the public interface
     documented in git-config.1).

This patch introduces a "gentle" form of xmmap which lets us
produce our own error message. We do not want to use mmap
directly, because we would like to use the other
compatibility elements of xmmap (e.g., handling 0-length
maps portably).

The end result is:

    $ git.compile config --file=foo some.key value
    error: unable to mmap 'foo': No such device
    $ echo $?
    3
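
A sketch of the split (mmap(2) is the real system call; the wrapper
bodies are reduced, and abort() stands in for die_errno()):

	#include <sys/types.h>
	#include <sys/mman.h>
	#include <stdlib.h>

	/* Gentle variant: report failure via MAP_FAILED, errno intact. */
	static void *xmmap_gentle(void *start, size_t length, int prot,
				  int flags, int fd, off_t offset)
	{
		return mmap(start, length, prot, flags, fd, offset);
	}

	/* Dying variant keeps its old contract for existing callers. */
	static void *xmmap(void *start, size_t length, int prot,
			   int flags, int fd, off_t offset)
	{
		void *ret = xmmap_gentle(start, length, prot, flags, fd, offset);
		if (ret == MAP_FAILED)
			abort();
		return ret;
	}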

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-05-28 11:33:18 -07:00
Junio C Hamano 1e6c8babf8 Merge branch 'jc/hash-object' into maint
"hash-object --literally" introduced in v2.2 was not prepared to
take a really long object type name.

* jc/hash-object:
  write_sha1_file(): do not use a separate sha1[] array
  t1007: add hash-object --literally tests
  hash-object --literally: fix buffer overrun with extra-long object type
  git-hash-object.txt: document --literally option
2015-05-26 13:49:25 -07:00
Junio C Hamano 3b7d373ae2 Merge branch 'kn/cat-file-literally'
Add the "--allow-unknown-type" option to "cat-file" to allow
inspecting loose objects of an experimental or a broken type.

* kn/cat-file-literally:
  t1006: add tests for git cat-file --allow-unknown-type
  cat-file: teach cat-file a '--allow-unknown-type' option
  cat-file: make the options mutually exclusive
  sha1_file: support reading from a loose object of unknown type
2015-05-19 13:17:58 -07:00
Jim Hill f6a1e1e288 sha1_file: pass empty buffer to index empty file
`git add` of an empty file with a filter pops complaints from
`copy_fd` about a bad file descriptor.

This traces back to these lines in sha1_file.c:index_core:

	if (!size) {
		ret = index_mem(sha1, NULL, size, type, path, flags);

The problem here is that content to be added to the index can be
supplied from an fd, or from a memory buffer, or from a pathname. This
call is supplying a NULL buffer pointer and a zero size.

Downstream logic takes the complete absence of a buffer to mean the
data is to be found elsewhere -- for instance, these, from convert.c:

	if (params->src) {
		write_err = (write_in_full(child_process.in, params->src, params->size) < 0);
	} else {
		write_err = copy_fd(params->fd, child_process.in);
	}

If there's a buffer, write from that, otherwise the data must be coming
from an open fd.

Perfectly reasonable logic in a routine that's going to write from
either a buffer or an fd.

So change `index_core` to supply an empty buffer when indexing an empty
file.
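
The fix amounts to one changed argument; a trimmed sketch (stub
index_mem, reduced parameter list):

	static int index_mem(unsigned char *sha1, const void *buf,
			     unsigned long size) { return 0; }

	static int index_core(unsigned char *sha1, int fd, unsigned long size)
	{
		if (!size)
			return index_mem(sha1, "", size);	/* was NULL */
		/* non-empty case: read from fd or a mapped buffer */
		return 0;
	}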

There's a patch out there that instead changes the logic quoted above to
take a `-1` fd to mean "use the buffer", but it seems to me that the
distinction between a missing buffer and an empty one carries intrinsic
semantics, whereas the logic change merely adapts the code to handle
incorrect arguments.

Signed-off-by: Jim Hill <gjthill@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-05-18 10:15:20 -07:00
Junio C Hamano ebb464f0cb Merge branch 'jk/prune-mtime' into maint
Access to objects in repositories that borrow from another one on a
slow NFS server unnecessarily got more expensive due to recent code
becoming more cautious in a naive way not to lose objects to pruning.

* jk/prune-mtime:
  sha1_file: only freshen packs once per run
  sha1_file: freshen pack objects before loose
  reachable: only mark local objects as recent
2015-05-13 14:05:50 -07:00
Junio C Hamano 051086b947 Merge branch 'jc/hash-object'
"hash-object --literally" introduced in v2.2 was not prepared to
take a really long object type name.

* jc/hash-object:
  write_sha1_file(): do not use a separate sha1[] array
  t1007: add hash-object --literally tests
  hash-object --literally: fix buffer overrun with extra-long object type
  git-hash-object.txt: document --literally option
2015-05-11 14:23:59 -07:00
Junio C Hamano cedeffeee0 Merge branch 'jk/sha1-file-reduce-useless-warnings'
* jk/sha1-file-reduce-useless-warnings:
  sha1_file: squelch "packfile cannot be accessed" warnings
2015-05-11 14:23:41 -07:00
Junio C Hamano 68a2e6a2c8 Merge branch 'nd/multiple-work-trees'
A replacement for contrib/workdir/git-new-workdir that does not
rely on symbolic links and makes sharing of objects and refs safer
by making the borrowee and borrowers aware of each other.

* nd/multiple-work-trees: (41 commits)
  prune --worktrees: fix expire vs worktree existence condition
  t1501: fix test with split index
  t2026: fix broken &&-chain
  t2026 needs precondition SANITY
  git-checkout.txt: a note about multiple checkout support for submodules
  checkout: add --ignore-other-worktrees
  checkout: pass whole struct to parse_branchname_arg instead of individual flags
  git-common-dir: make "modules/" per-working-directory directory
  checkout: do not fail if target is an empty directory
  t2025: add a test to make sure grafts is working from a linked checkout
  checkout: don't require a work tree when checking out into a new one
  git_path(): keep "info/sparse-checkout" per work-tree
  count-objects: report unused files in $GIT_DIR/worktrees/...
  gc: support prune --worktrees
  gc: factor out gc.pruneexpire parsing code
  gc: style change -- no SP before closing parenthesis
  checkout: clean up half-prepared directories in --to mode
  checkout: reject if the branch is already checked out elsewhere
  prune: strategies for linked checkouts
  checkout: support checking out into a new working directory
  ...
2015-05-11 14:23:39 -07:00
Karthik Nayak 46f034483e sha1_file: support reading from a loose object of unknown type
Update sha1_loose_object_info() to optionally allow it to read
from a loose object file of unknown/bogus type; as the function
usually returns the type of the object it read in the form of an enum
for known types, add an optional "typename" field to receive the
name of the type in textual form and a flag to indicate the reading
of a loose object file of unknown/bogus type.

Add parse_sha1_header_extended() which acts as a wrapper around
parse_sha1_header() allowing more information to be obtained.

Add unpack_sha1_header_to_strbuf() to unpack sha1 headers of
unknown/corrupt objects which have an unknown sha1 header size to
a strbuf structure. This was written by Junio C Hamano but tested
by me.
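
A rough model of the extended parse: capture the type as text rather
than rejecting anything outside the known enum (buffer handling
simplified; git uses a strbuf here):

	#include <ctype.h>
	#include <stdlib.h>

	/*
	 * Parse "<type> <size>" from an inflated loose-object header,
	 * copying the raw type name out for the caller instead of
	 * failing on unknown types.
	 */
	static int parse_header_extended(const char *hdr, char *type,
					 size_t type_len, unsigned long *sizep)
	{
		size_t i = 0;
		while (hdr[i] && hdr[i] != ' ') {
			if (i + 1 >= type_len)
				return -1;	/* type name too long */
			type[i] = hdr[i];
			i++;
		}
		type[i] = '\0';
		if (hdr[i] != ' ' || !isdigit((unsigned char)hdr[i + 1]))
			return -1;
		*sizep = strtoul(hdr + i + 1, NULL, 10);
		return 0;
	}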

Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-05-06 13:35:48 -07:00
Junio C Hamano e3b199aef1 Merge branch 'jk/prune-mtime'
Access to objects in repositories that borrow from another one on a
slow NFS server unnecessarily got more expensive due to recent code
becoming more cautious in a naive way not to lose objects to pruning.

* jk/prune-mtime:
  sha1_file: only freshen packs once per run
  sha1_file: freshen pack objects before loose
  reachable: only mark local objects as recent
2015-05-05 21:00:37 -07:00
Junio C Hamano 1427a7ff70 write_sha1_file(): do not use a separate sha1[] array
In the beginning, write_sha1_file() did not have a way to tell the
caller the name of the object it wrote.  This was changed in
d6d3f9d0 (This implements the new "recursive tree"
write-tree., 2005-04-09) by adding the "returnsha1" parameter to the
function so that the callers who are interested in the value can
optionally pass a pointer to receive it.

It turns out that all callers do want to know the name of the object
it has just written.  Nobody passes a NULL to this parameter, hence
it is not necessary to use a separate sha1[] array to receive the
result from write_sha1_file_prepare(), and copy the result to the
returnsha1 supplied by the caller.
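
The shape of the change, sketched (a stub hashing step stands in for
write_sha1_file_prepare(); 20-byte SHA-1 assumed):

	static void prepare(const void *buf, unsigned long len,
			    unsigned char *sha1) { /* hashes into sha1 */ }

	static void write_object(const void *buf, unsigned long len,
				 unsigned char *returnsha1)
	{
		/*
		 * Before: hash into a local sha1[20], then memcpy() it
		 * into returnsha1.  After: every caller passes a real
		 * buffer, so hash straight into it.
		 */
		prepare(buf, len, returnsha1);
	}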

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-05-05 10:17:56 -07:00
Eric Sunshine 0c3db67cc8 hash-object --literally: fix buffer overrun with extra-long object type
"hash-object" learned in 5ba9a93 (hash-object: add --literally
option, 2014-09-11) to allow crafting a corrupt/broken object of
unknown type.

When the user-provided type is particularly long, however, it can
overflow the relatively small stack-based character array handed to
write_sha1_file_prepare() by hash_sha1_file() and write_sha1_file(),
leading to stack corruption (and crash).  Introduce a custom helper
to allow arbitrarily long typenames just for "hash-object --literally".
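
The essence of such a helper, sketched (names illustrative; the real
patch differs in detail):

	#include <stdio.h>
	#include <stdlib.h>

	/*
	 * Build "<type> <len>" plus its trailing NUL for any type
	 * length, instead of snprintf'ing into a small stack array.
	 */
	static char *format_object_header(const char *type, unsigned long len)
	{
		int n = snprintf(NULL, 0, "%s %lu", type, len);
		char *hdr = malloc(n + 1);
		if (hdr)
			snprintf(hdr, n + 1, "%s %lu", type, len);
		return hdr;	/* n + 1 bytes including the NUL */
	}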

[jc: Eric's original used a strbuf in the more common codepaths, and
I rewrote it to avoid penalizing the non-literally code. Bugs are mine]

Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-05-05 10:14:18 -07:00
Jeff King ee1c6c34ac sha1_file: only freshen packs once per run
Since 33d4221 (write_sha1_file: freshen existing objects,
2014-10-15), we update the mtime of existing objects that we
would have written out (had they not existed). For the
common case in which many objects are packed, we may update
the mtime on a single packfile repeatedly. This can result
in a noticeable performance problem if calling utime() is
expensive (e.g., because your storage is on NFS).

We can fix this by keeping a per-pack flag that lets us
freshen only once per program invocation.
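
Roughly (the flag and helper are illustrative; utime(2) is real):

	#include <utime.h>

	struct packed_git {
		const char *pack_name;
		unsigned freshened : 1;	/* bumped mtime this run? */
	};

	static int freshen_pack(struct packed_git *p)
	{
		if (p->freshened)
			return 1;	/* skip the utime() call */
		if (utime(p->pack_name, NULL))	/* set mtime to now */
			return 0;
		p->freshened = 1;
		return 1;
	}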

An alternative would be to keep the packed_git.mtime flag up
to date as we freshen, and freshen only once every N
seconds. In practice, it's not worth the complexity. We are
racing against prune expiration times here, which inherently
must be set to accommodate reasonable program running times
(because they really care about the time between an object
being written and it becoming referenced, and the latter is
typically the last step a program takes).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-04-20 13:09:40 -07:00
Jeff King b5f52f372e sha1_file: freshen pack objects before loose
When writing out an object file, we first check whether it
already exists and if so optimize out the write. Prior to
33d4221, we did this by calling has_sha1_file(), which will
check for packed objects followed by loose. Since that
commit, we check loose objects first.

For the common case of a repository whose objects are mostly
packed, this means we will make a lot of extra access()
system calls checking for loose objects. We should follow
the same packed-then-loose order that all of our other
lookups use.
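
So the existence check becomes, in sketch form (stubs in place of the
real freshen helpers):

	static int freshen_packed_object(const unsigned char *sha1) { return 0; }
	static int freshen_loose_object(const unsigned char *sha1) { return 0; }

	/* Consult packs before hitting the filesystem with access(2). */
	static int freshen_existing_object(const unsigned char *sha1)
	{
		return freshen_packed_object(sha1) ||
		       freshen_loose_object(sha1);
	}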

Reported-by: Stefan Saasen <ssaasen@atlassian.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-04-20 13:09:38 -07:00
Jeff King 1385bb7ba3 reachable: only mark local objects as recent
When pruning and repacking a repository that has an
alternate object store configured, we may traverse a large
number of objects in the alternate. This serves no purpose,
and may be expensive to do. A longer explanation is below.

Commits d3038d2 and abcb865 taught prune and pack-objects
(respectively) to treat "recent" objects as tips for
reachability, so that we keep whole chunks of history. They
built on the object traversal in 660c889 (sha1_file: add
for_each iterators for loose and packed objects,
2014-10-15), which covers both local and alternate objects.

In both cases, covering alternate objects is unnecessary, as
both commands can only drop objects from the local
repository. In the case of prune, we traverse only the local
object directory. And in the case of repacking, while we may
or may not include local objects in our pack, we will never
reach into the alternate with "repack -d". The "-l" option
is only a question of whether we are migrating objects from
the alternate into our repository, or leaving them
untouched.

It is possible that we may drop an object that is depended
upon by another object in the alternate. For example,
imagine two repositories, A and B, with A pointing to B as
an alternate. Now imagine a commit that is in B which
references a tree that is only in A. Traversing from recent
objects in B might prevent A from dropping that tree. But
this case isn't worth covering. Repo B should take
responsibility for its own objects. It would never have had
the commit in the first place if it did not also have the
tree, and assuming it is using the same "keep recent chunks
of history" scheme, then it would itself keep the tree, as
well.

So checking the alternate objects is not worth doing, and
comes with a significant performance impact. In both cases,
we skip any recent objects that have already been marked
SEEN (i.e., that we know are already reachable for prune, or
included in the pack for a repack). So there is a slight
waste of time in opening the alternate packs at all, only to
notice that we have already considered each object. But much
worse, the alternate repository may have a large number of
objects that are not reachable from the local repository at
all, and we end up adding them to the traversal.

We can fix this by considering only local unseen objects.
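
In code terms, the traversal gains a local-only condition along these
lines (sketch; the list and flag mirror git's packed_git bookkeeping):

	struct packed_git {
		struct packed_git *next;
		unsigned pack_local : 1;	/* lives in our object dir? */
	};

	/* Skip alternates entirely when marking recent objects. */
	static void mark_recent_packs(struct packed_git *packs,
				      void (*mark)(struct packed_git *))
	{
		struct packed_git *p;
		for (p = packs; p; p = p->next) {
			if (!p->pack_local)
				continue;	/* alternate: not ours to prune */
			mark(p);
		}
	}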

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-04-20 13:09:27 -07:00
Jeff King 319b678a7b sha1_file: squelch "packfile cannot be accessed" warnings
When we find an object in a packfile index, we make sure we
can still open the packfile itself (or that it is already
open), as it might have been deleted by a simultaneous
repack. If we can't access the packfile, we print a warning
for the user and tell the caller that we don't have the
object (we can then look in other packfiles, or find a loose
version, before giving up).

The warning we print to the user isn't really accomplishing
anything, and it is potentially confusing to users. In the
normal case, it is complete noise; we find the object
elsewhere, and the user does not have to care that we racily
saw a packfile index that became stale. It didn't affect the
operation at all.

A possibly more interesting case is when we later can't find
the object, and report failure to the user. In this case the
warning could be considered a clue toward that ultimate
failure. But it's not really a useful clue in practice. We
wouldn't even print it consistently (since we are racing
with another process, we might not even see the .idx file,
or we might win the race and open the packfile, completing
the operation).

This patch drops the warning entirely (not only from the
fill_pack_entry site, but also from an identical use in
pack-objects). If we did find the warning interesting in the
error case, we could stuff it away and reveal it to the user
when we later die() due to the broken object. But that
complexity just isn't worth it.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-03-30 21:47:39 -07:00
Junio C Hamano a393c6bfd9 Merge branch 'rs/deflate-init-cleanup' into maint
Code simplification.

* rs/deflate-init-cleanup:
  zlib: initialize git_zstream in git_deflate_init{,_gzip,_raw}
2015-03-23 11:23:38 -07:00
Junio C Hamano 6902c4da58 Merge branch 'rs/deflate-init-cleanup'
Code simplification.

* rs/deflate-init-cleanup:
  zlib: initialize git_zstream in git_deflate_init{,_gzip,_raw}
2015-03-17 16:01:26 -07:00
René Scharfe 9a6f1287fb zlib: initialize git_zstream in git_deflate_init{,_gzip,_raw}
Clear the git_zstream variable at the start of git_deflate_init() etc.
so that callers don't have to do that.
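
A sketch against zlib's real API (git_zstream reduced to a bare
wrapper here):

	#include <string.h>
	#include <zlib.h>

	typedef struct { z_stream z; } git_zstream;	/* reduced stand-in */

	static void git_deflate_init(git_zstream *strm, int level)
	{
		memset(strm, 0, sizeof(*strm));	/* callers need not clear */
		deflateInit(&strm->z, level);
	}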

Signed-off-by: Rene Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-03-05 15:46:03 -08:00