2005-04-19 00:04:43 +04:00
|
|
|
/*
|
|
|
|
* GIT - The information manager from hell
|
|
|
|
*
|
|
|
|
* Copyright (C) Linus Torvalds, 2005
|
|
|
|
*
|
2020-12-31 14:56:21 +03:00
|
|
|
* This handles basic git object files - packing, unpacking,
|
2005-04-19 00:04:43 +04:00
|
|
|
* creation etc.
|
|
|
|
*/
|
2023-04-22 23:17:23 +03:00
|
|
|
#include "git-compat-util.h"
|
2023-03-21 09:25:58 +03:00
|
|
|
#include "abspath.h"
|
2023-02-24 03:09:24 +03:00
|
|
|
#include "alloc.h"
|
2017-06-14 21:07:36 +03:00
|
|
|
#include "config.h"
|
2023-04-11 06:00:40 +03:00
|
|
|
#include "convert.h"
|
2023-03-21 09:26:03 +03:00
|
|
|
#include "environment.h"
|
2023-03-21 09:25:54 +03:00
|
|
|
#include "gettext.h"
|
2023-02-24 03:09:27 +03:00
|
|
|
#include "hex.h"
|
2012-11-05 12:41:22 +04:00
|
|
|
#include "string-list.h"
|
2014-10-01 14:28:42 +04:00
|
|
|
#include "lockfile.h"
|
2005-06-27 14:35:33 +04:00
|
|
|
#include "delta.h"
|
2005-06-29 01:21:02 +04:00
|
|
|
#include "pack.h"
|
2006-04-02 16:44:09 +04:00
|
|
|
#include "blob.h"
|
|
|
|
#include "commit.h"
|
2011-05-08 12:47:35 +04:00
|
|
|
#include "run-command.h"
|
2006-04-02 16:44:09 +04:00
|
|
|
#include "tag.h"
|
|
|
|
#include "tree.h"
|
2011-02-05 13:52:21 +03:00
|
|
|
#include "tree-walk.h"
|
2007-04-10 08:20:29 +04:00
|
|
|
#include "refs.h"
|
2008-02-28 08:25:19 +03:00
|
|
|
#include "pack-revindex.h"
|
2020-12-31 14:56:23 +03:00
|
|
|
#include "hash-lookup.h"
|
2011-10-29 01:48:40 +04:00
|
|
|
#include "bulk-checkin.h"
|
2018-03-23 20:20:57 +03:00
|
|
|
#include "repository.h"
|
2018-04-12 03:21:06 +03:00
|
|
|
#include "replace-object.h"
|
2012-03-07 14:54:18 +04:00
|
|
|
#include "streaming.h"
|
2013-02-15 16:07:10 +04:00
|
|
|
#include "dir.h"
|
2016-08-23 00:59:42 +03:00
|
|
|
#include "list.h"
|
2016-09-13 20:54:42 +03:00
|
|
|
#include "mergesort.h"
|
alternates: accept double-quoted paths
We read lists of alternates from objects/info/alternates
files (delimited by newline), as well as from the
GIT_ALTERNATE_OBJECT_DIRECTORIES environment variable
(delimited by colon or semi-colon, depending on the
platform).
There's no mechanism for quoting the delimiters, so it's
impossible to specify an alternate path that contains a
colon in the environment, or one that contains a newline in
a file. We've lived with that restriction for ages because
both alternates and filenames with colons are relatively
rare, and it's only a problem when the two meet. But since
722ff7f87 (receive-pack: quarantine objects until
pre-receive accepts, 2016-10-03), which builds on the
alternates system, every push causes the receiver to set
GIT_ALTERNATE_OBJECT_DIRECTORIES internally.
It would be convenient to have some way to quote the
delimiter so that we can represent arbitrary paths.
The simplest thing would be an escape character before a
quoted delimiter (e.g., "\:" as a literal colon). But that
creates a backwards compatibility problem: any path which
uses that escape character is now broken, and we've just
shifted the problem. We could choose an unlikely escape
character (e.g., something from the non-printable ASCII
range), but that's awkward to use.
Instead, let's treat names as unquoted unless they begin
with a double-quote, in which case they are interpreted via
our usual C-style quoting rules. This also breaks
backwards-compatibility, but in a smaller way: it only
matters if your file has a double-quote as the very _first_
character in the path (whereas an escape character is a
problem anywhere in the path). It's also consistent with
many other parts of git, which accept either a bare pathname
or a double-quoted one, and the sender can choose to quote
or not as required.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-12-12 22:52:22 +03:00
|
|
|
#include "quote.h"
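The quoting rule described in the commit message above can be sketched as follows. This is only an illustration under stated assumptions, not Git's implementation: `parse_alternate_entry` is a hypothetical helper, it assumes `:` as the delimiter, and it decodes just a small subset of C-style escapes (`\\`, `\"`, `\n`, `\t`). Git relies on its own routines from quote.h, which handle more cases.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Git): copy the first entry of a
 * ':'-delimited list into out (capacity outsz, which must be nonzero)
 * and return a pointer to the rest of the list, or NULL on malformed
 * input or when out is too small.
 *
 * An entry that begins with '"' is read up to the closing quote, with a
 * small subset of C-style escapes decoded; any other entry is taken
 * literally up to the next ':'.
 */
static const char *parse_alternate_entry(const char *in, char *out, size_t outsz)
{
	size_t n = 0;

	assert(outsz > 0);
	if (*in == '"') {
		in++;
		while (*in && *in != '"') {
			char c = *in++;
			if (c == '\\') {
				switch (*in++) {
				case 'n': c = '\n'; break;
				case 't': c = '\t'; break;
				case '\\': c = '\\'; break;
				case '"': c = '"'; break;
				default: return NULL; /* unsupported escape */
				}
			}
			if (n + 1 >= outsz)
				return NULL;
			out[n++] = c;
		}
		if (*in != '"')
			return NULL; /* unterminated quote */
		in++;
	} else {
		while (*in && *in != ':') {
			if (n + 1 >= outsz)
				return NULL;
			out[n++] = *in++;
		}
	}
	out[n] = '\0';
	if (*in == ':')
		in++; /* step over the delimiter */
	else if (*in)
		return NULL; /* junk after the closing quote */
	return in;
}
```

With this sketch, the list `"/a:b":/c` yields the two entries `/a:b` and `/c`, while an unquoted `/a:b` would still be split at the colon, which is the backwards-compatible behavior the message argues for.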
|
2017-08-19 01:20:16 +03:00
|
|
|
#include "packfile.h"
|
2023-04-11 10:41:53 +03:00
|
|
|
#include "object-file.h"
|
2018-03-23 20:20:55 +03:00
|
|
|
#include "object-store.h"
|
2023-04-22 23:17:27 +03:00
|
|
|
#include "oidtree.h"
|
2019-06-25 16:40:31 +03:00
|
|
|
#include "promisor-remote.h"
|
2023-03-21 09:26:05 +03:00
|
|
|
#include "setup.h"
|
2021-08-17 00:09:51 +03:00
|
|
|
#include "submodule.h"
|
hash-object: use fsck for object checks
Since c879daa237 (Make hash-object more robust against malformed
objects, 2011-02-05), we've done some rudimentary checks against objects
we're about to write by running them through our usual parsers for
trees, commits, and tags.
These parsers catch some problems, but they are not nearly as careful as
the fsck functions (which makes sense; the parsers are designed to be
fast and forgiving, bailing only when the input is unintelligible). We
are better off doing the more thorough fsck checks when writing objects.
Doing so at write time is much better than writing garbage only to find
out later (after building more history atop it!) that fsck complains
about it, or hosts with transfer.fsckObjects reject it.
This is obviously going to be a user-visible behavior change, and the
test changes earlier in this series show the scope of the impact. But
I'd argue that this is OK:
- the documentation for hash-object is already vague about which
checks we might do, saying that --literally will allow "any
garbage[...] which might not otherwise pass standard object parsing
or git-fsck checks". So we are already covered under the documented
behavior.
- users don't generally run hash-object anyway. There are a lot of
spots in the tests that needed to be updated because creating
garbage objects is something that Git's tests disproportionately do.
- it's hard to imagine anyone thinking the new behavior is worse. Any
object we reject would be a potential problem down the road for the
user. And if they really want to create garbage, --literally is
already the escape hatch they need.
Note that the change here is actually in index_mem(), which handles the
HASH_FORMAT_CHECK flag passed by hash-object. That flag is also used by
"git-replace --edit" to sanity-check the result. Covering that with more
thorough checks likewise seems like a good thing.
Besides being more thorough, there are a few other bonuses:
- we get rid of some questionable stack allocations of object structs.
These don't seem to currently cause any problems in practice, but
they subtly violate some of the assumptions made by the rest of the
code (e.g., the "struct commit" we put on the stack and
zero-initialize will not have a proper index from
alloc_commit_index()).
- likewise, those parsed object structs are the source of some small
memory leaks
- the resulting messages are much better. For example:
[before]
$ echo 'tree 123' | git hash-object -t commit --stdin
error: bogus commit object 0000000000000000000000000000000000000000
fatal: corrupt commit
[after]
$ echo 'tree 123' | git.compile hash-object -t commit --stdin
error: object fails fsck: badTreeSha1: invalid 'tree' line format - bad sha1
fatal: refusing to create malformed object
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-18 23:44:12 +03:00
|
|
|
#include "fsck.h"
|
2023-03-21 09:26:01 +03:00
|
|
|
#include "wrapper.h"
|
2017-01-10 21:47:14 +03:00
|
|
|
#include "trace.h"
|
|
|
|
#include "hook.h"
|
2017-03-15 21:43:05 +03:00
|
|
|
#include "sigchain.h"
|
|
|
|
#include "sub-process.h"
|
|
|
|
#include "pkt-line.h"
|
2007-01-10 07:07:11 +03:00
|
|
|
|
2018-03-12 05:27:55 +03:00
|
|
|
/* The maximum size for an object header. */
|
|
|
|
#define MAX_HEADER_LEN 32
|
|
|
|
|
2018-05-02 03:26:07 +03:00
|
|
|
|
|
|
|
#define EMPTY_TREE_SHA1_BIN_LITERAL \
|
|
|
|
"\x4b\x82\x5d\xc6\x42\xcb\x6e\xb9\xa0\x60" \
|
|
|
|
"\xe5\x4b\xf8\xd6\x92\x88\xfb\xee\x49\x04"
|
2018-11-14 07:09:36 +03:00
|
|
|
#define EMPTY_TREE_SHA256_BIN_LITERAL \
|
|
|
|
"\x6e\xf1\x9b\x41\x22\x5c\x53\x69\xf1\xc1" \
|
|
|
|
"\x04\xd4\x5d\x8d\x85\xef\xa9\xb0\x57\xb5" \
|
|
|
|
"\x3b\x14\xb4\xb9\xb9\x39\xdd\x74\xde\xcc" \
|
|
|
|
"\x53\x21"
|
2018-05-02 03:26:07 +03:00
|
|
|
|
|
|
|
#define EMPTY_BLOB_SHA1_BIN_LITERAL \
|
|
|
|
"\xe6\x9d\xe2\x9b\xb2\xd1\xd6\x43\x4b\x8b" \
|
|
|
|
"\x29\xae\x77\x5a\xd8\xc2\xe4\x8c\x53\x91"
|
2018-11-14 07:09:36 +03:00
|
|
|
#define EMPTY_BLOB_SHA256_BIN_LITERAL \
|
|
|
|
"\x47\x3a\x0f\x4c\x3b\xe8\xa9\x36\x81\xa2" \
|
|
|
|
"\x67\xe3\xb1\xe9\xa7\xdc\xda\x11\x85\x43" \
|
|
|
|
"\x6f\xe1\x41\xf7\x74\x91\x20\xa3\x03\x72" \
|
|
|
|
"\x18\x13"
|
2018-05-02 03:26:07 +03:00
|
|
|
|
|
|
|
static const struct object_id empty_tree_oid = {
|
2021-04-26 04:02:55 +03:00
|
|
|
.hash = EMPTY_TREE_SHA1_BIN_LITERAL,
|
|
|
|
.algo = GIT_HASH_SHA1,
|
2016-09-01 02:27:18 +03:00
|
|
|
};
|
2018-05-02 03:26:07 +03:00
|
|
|
static const struct object_id empty_blob_oid = {
|
2021-04-26 04:02:55 +03:00
|
|
|
.hash = EMPTY_BLOB_SHA1_BIN_LITERAL,
|
|
|
|
.algo = GIT_HASH_SHA1,
|
2016-09-01 02:27:18 +03:00
|
|
|
};
|
2021-04-26 04:02:56 +03:00
|
|
|
static const struct object_id null_oid_sha1 = {
|
|
|
|
.hash = {0},
|
|
|
|
.algo = GIT_HASH_SHA1,
|
|
|
|
};
|
2018-11-14 07:09:36 +03:00
|
|
|
static const struct object_id empty_tree_oid_sha256 = {
|
2021-04-26 04:02:55 +03:00
|
|
|
.hash = EMPTY_TREE_SHA256_BIN_LITERAL,
|
|
|
|
.algo = GIT_HASH_SHA256,
|
2018-11-14 07:09:36 +03:00
|
|
|
};
|
|
|
|
static const struct object_id empty_blob_oid_sha256 = {
|
2021-04-26 04:02:55 +03:00
|
|
|
.hash = EMPTY_BLOB_SHA256_BIN_LITERAL,
|
|
|
|
.algo = GIT_HASH_SHA256,
|
2018-11-14 07:09:36 +03:00
|
|
|
};
|
2021-04-26 04:02:56 +03:00
|
|
|
static const struct object_id null_oid_sha256 = {
|
|
|
|
.hash = {0},
|
|
|
|
.algo = GIT_HASH_SHA256,
|
|
|
|
};
|
2005-10-01 01:02:47 +04:00
|
|
|
|
2018-02-01 05:18:38 +03:00
|
|
|
static void git_hash_sha1_init(git_hash_ctx *ctx)
|
2017-11-13 00:28:52 +03:00
|
|
|
{
|
2018-02-01 05:18:38 +03:00
|
|
|
git_SHA1_Init(&ctx->sha1);
|
2017-11-13 00:28:52 +03:00
|
|
|
}
|
|
|
|
|
2020-02-22 23:17:27 +03:00
|
|
|
static void git_hash_sha1_clone(git_hash_ctx *dst, const git_hash_ctx *src)
|
|
|
|
{
|
|
|
|
git_SHA1_Clone(&dst->sha1, &src->sha1);
|
|
|
|
}
|
|
|
|
|
2018-02-01 05:18:38 +03:00
|
|
|
static void git_hash_sha1_update(git_hash_ctx *ctx, const void *data, size_t len)
|
2017-11-13 00:28:52 +03:00
|
|
|
{
|
2018-02-01 05:18:38 +03:00
|
|
|
git_SHA1_Update(&ctx->sha1, data, len);
|
2017-11-13 00:28:52 +03:00
|
|
|
}
|
|
|
|
|
2018-02-01 05:18:38 +03:00
|
|
|
static void git_hash_sha1_final(unsigned char *hash, git_hash_ctx *ctx)
|
2017-11-13 00:28:52 +03:00
|
|
|
{
|
2018-02-01 05:18:38 +03:00
|
|
|
git_SHA1_Final(hash, &ctx->sha1);
|
2017-11-13 00:28:52 +03:00
|
|
|
}
|
|
|
|
|
2021-04-26 04:02:52 +03:00
|
|
|
static void git_hash_sha1_final_oid(struct object_id *oid, git_hash_ctx *ctx)
|
|
|
|
{
|
|
|
|
git_SHA1_Final(oid->hash, &ctx->sha1);
|
|
|
|
memset(oid->hash + GIT_SHA1_RAWSZ, 0, GIT_MAX_RAWSZ - GIT_SHA1_RAWSZ);
|
2021-04-26 04:02:55 +03:00
|
|
|
oid->algo = GIT_HASH_SHA1;
|
2021-04-26 04:02:52 +03:00
|
|
|
}
|
|
|
|
|
2018-11-14 07:09:36 +03:00
|
|
|
|
|
|
|
static void git_hash_sha256_init(git_hash_ctx *ctx)
|
|
|
|
{
|
|
|
|
git_SHA256_Init(&ctx->sha256);
|
|
|
|
}
|
|
|
|
|
2020-02-22 23:17:27 +03:00
|
|
|
static void git_hash_sha256_clone(git_hash_ctx *dst, const git_hash_ctx *src)
|
|
|
|
{
|
|
|
|
git_SHA256_Clone(&dst->sha256, &src->sha256);
|
|
|
|
}
|
|
|
|
|
2018-11-14 07:09:36 +03:00
|
|
|
static void git_hash_sha256_update(git_hash_ctx *ctx, const void *data, size_t len)
|
|
|
|
{
|
|
|
|
git_SHA256_Update(&ctx->sha256, data, len);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void git_hash_sha256_final(unsigned char *hash, git_hash_ctx *ctx)
|
|
|
|
{
|
|
|
|
git_SHA256_Final(hash, &ctx->sha256);
|
|
|
|
}
|
|
|
|
|
2021-04-26 04:02:52 +03:00
|
|
|
static void git_hash_sha256_final_oid(struct object_id *oid, git_hash_ctx *ctx)
|
|
|
|
{
|
|
|
|
git_SHA256_Final(oid->hash, &ctx->sha256);
|
|
|
|
/*
|
|
|
|
* This currently does nothing, so the compiler should optimize it out,
|
|
|
|
* but keep it in case we extend the hash size again.
|
|
|
|
*/
|
|
|
|
memset(oid->hash + GIT_SHA256_RAWSZ, 0, GIT_MAX_RAWSZ - GIT_SHA256_RAWSZ);
|
2021-04-26 04:02:55 +03:00
|
|
|
oid->algo = GIT_HASH_SHA256;
|
2021-04-26 04:02:52 +03:00
|
|
|
}
|
|
|
|
|
2022-10-18 04:05:28 +03:00
|
|
|
static void git_hash_unknown_init(git_hash_ctx *ctx UNUSED)
|
2017-11-13 00:28:52 +03:00
|
|
|
{
|
2018-07-21 10:49:19 +03:00
|
|
|
BUG("trying to init unknown hash");
|
2017-11-13 00:28:52 +03:00
|
|
|
}
|
|
|
|
|
2022-10-18 04:05:28 +03:00
|
|
|
static void git_hash_unknown_clone(git_hash_ctx *dst UNUSED,
|
|
|
|
const git_hash_ctx *src UNUSED)
|
2020-02-22 23:17:27 +03:00
|
|
|
{
|
|
|
|
BUG("trying to clone unknown hash");
|
|
|
|
}
|
|
|
|
|
2022-10-18 04:05:28 +03:00
|
|
|
static void git_hash_unknown_update(git_hash_ctx *ctx UNUSED,
|
|
|
|
const void *data UNUSED,
|
|
|
|
size_t len UNUSED)
|
2017-11-13 00:28:52 +03:00
|
|
|
{
|
2018-07-21 10:49:19 +03:00
|
|
|
BUG("trying to update unknown hash");
|
2017-11-13 00:28:52 +03:00
|
|
|
}
|
|
|
|
|
2022-10-18 04:05:28 +03:00
|
|
|
static void git_hash_unknown_final(unsigned char *hash UNUSED,
|
|
|
|
git_hash_ctx *ctx UNUSED)
|
2017-11-13 00:28:52 +03:00
|
|
|
{
|
2018-07-21 10:49:19 +03:00
|
|
|
BUG("trying to finalize unknown hash");
|
2017-11-13 00:28:52 +03:00
|
|
|
}
|
|
|
|
|
2022-10-18 04:05:28 +03:00
|
|
|
static void git_hash_unknown_final_oid(struct object_id *oid UNUSED,
|
|
|
|
git_hash_ctx *ctx UNUSED)
|
2021-04-26 04:02:52 +03:00
|
|
|
{
|
|
|
|
BUG("trying to finalize unknown hash");
|
|
|
|
}
|
|
|
|
|
2017-11-13 00:28:52 +03:00
|
|
|
const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
|
|
|
|
{
|
2022-02-24 12:33:01 +03:00
|
|
|
.name = NULL,
|
|
|
|
.format_id = 0x00000000,
|
|
|
|
.rawsz = 0,
|
|
|
|
.hexsz = 0,
|
|
|
|
.blksz = 0,
|
|
|
|
.init_fn = git_hash_unknown_init,
|
|
|
|
.clone_fn = git_hash_unknown_clone,
|
|
|
|
.update_fn = git_hash_unknown_update,
|
|
|
|
.final_fn = git_hash_unknown_final,
|
|
|
|
.final_oid_fn = git_hash_unknown_final_oid,
|
|
|
|
.empty_tree = NULL,
|
|
|
|
.empty_blob = NULL,
|
|
|
|
.null_oid = NULL,
|
2017-11-13 00:28:52 +03:00
|
|
|
},
|
|
|
|
{
|
2022-02-24 12:33:01 +03:00
|
|
|
.name = "sha1",
|
|
|
|
.format_id = GIT_SHA1_FORMAT_ID,
|
|
|
|
.rawsz = GIT_SHA1_RAWSZ,
|
|
|
|
.hexsz = GIT_SHA1_HEXSZ,
|
|
|
|
.blksz = GIT_SHA1_BLKSZ,
|
|
|
|
.init_fn = git_hash_sha1_init,
|
|
|
|
.clone_fn = git_hash_sha1_clone,
|
|
|
|
.update_fn = git_hash_sha1_update,
|
|
|
|
.final_fn = git_hash_sha1_final,
|
|
|
|
.final_oid_fn = git_hash_sha1_final_oid,
|
|
|
|
.empty_tree = &empty_tree_oid,
|
|
|
|
.empty_blob = &empty_blob_oid,
|
|
|
|
.null_oid = &null_oid_sha1,
|
2017-11-13 00:28:52 +03:00
|
|
|
},
|
2018-11-14 07:09:36 +03:00
|
|
|
{
|
2022-02-24 12:33:01 +03:00
|
|
|
.name = "sha256",
|
|
|
|
.format_id = GIT_SHA256_FORMAT_ID,
|
|
|
|
.rawsz = GIT_SHA256_RAWSZ,
|
|
|
|
.hexsz = GIT_SHA256_HEXSZ,
|
|
|
|
.blksz = GIT_SHA256_BLKSZ,
|
|
|
|
.init_fn = git_hash_sha256_init,
|
|
|
|
.clone_fn = git_hash_sha256_clone,
|
|
|
|
.update_fn = git_hash_sha256_update,
|
|
|
|
.final_fn = git_hash_sha256_final,
|
|
|
|
.final_oid_fn = git_hash_sha256_final_oid,
|
|
|
|
.empty_tree = &empty_tree_oid_sha256,
|
|
|
|
.empty_blob = &empty_blob_oid_sha256,
|
|
|
|
.null_oid = &null_oid_sha256,
|
2018-11-14 07:09:36 +03:00
|
|
|
}
|
2017-11-13 00:28:52 +03:00
|
|
|
};
|
|
|
|
|
2021-04-26 04:02:56 +03:00
|
|
|
const struct object_id *null_oid(void)
|
|
|
|
{
|
|
|
|
return the_hash_algo->null_oid;
|
|
|
|
}
|
|
|
|
|
2018-05-02 03:25:54 +03:00
|
|
|
const char *empty_tree_oid_hex(void)
|
|
|
|
{
|
|
|
|
static char buf[GIT_MAX_HEXSZ + 1];
|
|
|
|
return oid_to_hex_r(buf, the_hash_algo->empty_tree);
|
|
|
|
}
|
|
|
|
|
|
|
|
const char *empty_blob_oid_hex(void)
|
|
|
|
{
|
|
|
|
static char buf[GIT_MAX_HEXSZ + 1];
|
|
|
|
return oid_to_hex_r(buf, the_hash_algo->empty_blob);
|
|
|
|
}
|
|
|
|
|
2018-10-22 05:43:32 +03:00
|
|
|
int hash_algo_by_name(const char *name)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
if (!name)
|
|
|
|
return GIT_HASH_UNKNOWN;
|
|
|
|
for (i = 1; i < GIT_HASH_NALGOS; i++)
|
|
|
|
if (!strcmp(name, hash_algos[i].name))
|
|
|
|
return i;
|
|
|
|
return GIT_HASH_UNKNOWN;
|
|
|
|
}
|
|
|
|
|
|
|
|
int hash_algo_by_id(uint32_t format_id)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
for (i = 1; i < GIT_HASH_NALGOS; i++)
|
|
|
|
if (format_id == hash_algos[i].format_id)
|
|
|
|
return i;
|
|
|
|
return GIT_HASH_UNKNOWN;
|
|
|
|
}
|
|
|
|
|
2019-02-19 03:05:17 +03:00
|
|
|
int hash_algo_by_length(int len)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
for (i = 1; i < GIT_HASH_NALGOS; i++)
|
|
|
|
if (len == hash_algos[i].rawsz)
|
|
|
|
return i;
|
|
|
|
return GIT_HASH_UNKNOWN;
|
|
|
|
}
|
2018-10-22 05:43:32 +03:00
|
|
|
|
2011-02-05 17:03:01 +03:00
|
|
|
/*
|
|
|
|
* This is meant to hold a *small* number of objects that you would
|
2023-03-28 16:58:57 +03:00
|
|
|
* want repo_read_object_file() to be able to return, but yet you do not want
|
2011-02-05 17:03:01 +03:00
|
|
|
* to write them into the object store (e.g. a browse-only
|
|
|
|
* application).
|
|
|
|
*/
|
|
|
|
static struct cached_object {
|
2018-05-02 03:26:03 +03:00
|
|
|
struct object_id oid;
|
2011-02-05 17:03:01 +03:00
|
|
|
enum object_type type;
|
|
|
|
void *buf;
|
|
|
|
unsigned long size;
|
|
|
|
} *cached_objects;
|
|
|
|
static int cached_object_nr, cached_object_alloc;
|
|
|
|
|
|
|
|
static struct cached_object empty_tree = {
|
2022-03-17 20:27:17 +03:00
|
|
|
.oid = {
|
|
|
|
.hash = EMPTY_TREE_SHA1_BIN_LITERAL,
|
|
|
|
},
|
|
|
|
.type = OBJ_TREE,
|
|
|
|
.buf = "",
|
2011-02-05 17:03:01 +03:00
|
|
|
};
|
|
|
|
|
2018-05-02 03:26:03 +03:00
|
|
|
static struct cached_object *find_cached_object(const struct object_id *oid)
|
2011-02-05 17:03:01 +03:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
struct cached_object *co = cached_objects;
|
|
|
|
|
|
|
|
for (i = 0; i < cached_object_nr; i++, co++) {
|
convert "oidcmp() == 0" to oideq()
Using the more restrictive oideq() should, in the long run,
give the compiler more opportunities to optimize these
callsites. For now, this conversion should be a complete
noop with respect to the generated code.
The result is also perhaps a little more readable, as it
avoids the "zero is equal" idiom. Since it's so prevalent in
C, I think seasoned programmers tend not to even notice it
anymore, but it can sometimes make for awkward double
negations (e.g., we can drop a few !!oidcmp() instances
here).
This patch was generated almost entirely by the included
coccinelle patch. This mechanical conversion should be
completely safe, because we check explicitly for cases where
oidcmp() is compared to 0, which is what oideq() is doing
under the hood. Note that we don't have to catch "!oidcmp()"
separately; coccinelle's standard isomorphisms make sure the
two are treated equivalently.
I say "almost" because I did hand-edit the coccinelle output
to fix up a few style violations (it mostly keeps the
original formatting, but sometimes unwraps long lines).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-29 00:22:40 +03:00
|
|
|
if (oideq(&co->oid, oid))
|
2011-02-05 17:03:01 +03:00
|
|
|
return co;
|
|
|
|
}
|
2018-08-29 00:22:40 +03:00
|
|
|
if (oideq(oid, the_hash_algo->empty_tree))
|
2011-02-05 17:03:01 +03:00
|
|
|
return &empty_tree;
|
|
|
|
return NULL;
|
|
|
|
}
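The "zero is equal" idiom that the oideq() commit message refers to can be illustrated with a small standalone sketch. The names `demo_oid`, `demo_oidcmp`, and `demo_oideq` are hypothetical stand-ins chosen for this example, and the fixed 20-byte size is an assumption; Git's real helpers operate on `struct object_id` and are defined elsewhere in the tree.

```c
#include <string.h>

#define DEMO_RAWSZ 20 /* SHA-1 sized, for illustration only */

struct demo_oid {
	unsigned char hash[DEMO_RAWSZ];
};

/* Three-way comparison: negative, zero, or positive, like memcmp(). */
static int demo_oidcmp(const struct demo_oid *a, const struct demo_oid *b)
{
	return memcmp(a->hash, b->hash, DEMO_RAWSZ);
}

/* Equality only: returns 1 when the ids match and 0 otherwise. */
static int demo_oideq(const struct demo_oid *a, const struct demo_oid *b)
{
	return !demo_oidcmp(a, b);
}
```

A caller that only needs equality can then write `if (demo_oideq(a, b))` rather than `if (!demo_oidcmp(a, b))`, which avoids the double negation the message mentions.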
|
|
|
|
|
2017-11-16 19:38:28 +03:00
|
|
|
|
2018-01-14 01:49:31 +03:00
|
|
|
static int get_conv_flags(unsigned flags)
|
2017-11-16 19:38:28 +03:00
|
|
|
{
|
|
|
|
if (flags & HASH_RENORMALIZE)
|
2018-01-14 01:49:31 +03:00
|
|
|
return CONV_EOL_RENORMALIZE;
|
2017-11-16 19:38:28 +03:00
|
|
|
else if (flags & HASH_WRITE_OBJECT)
|
2018-04-15 21:16:07 +03:00
|
|
|
return global_conv_flags_eol | CONV_WRITE_OBJECT;
|
2017-11-16 19:38:28 +03:00
|
|
|
else
|
2018-01-14 01:49:31 +03:00
|
|
|
return 0;
|
2017-11-16 19:38:28 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
|
2011-03-11 03:02:50 +03:00
|
|
|
int mkdir_in_gitdir(const char *path)
|
|
|
|
{
|
|
|
|
if (mkdir(path, 0777)) {
|
|
|
|
int saved_errno = errno;
|
|
|
|
struct stat st;
|
|
|
|
struct strbuf sb = STRBUF_INIT;
|
|
|
|
|
|
|
|
if (errno != EEXIST)
|
|
|
|
return -1;
|
|
|
|
/*
|
|
|
|
* Are we looking at a path in a symlinked worktree
|
|
|
|
* whose original repository does not yet have it?
|
|
|
|
* e.g. .git/rr-cache pointing at its original
|
|
|
|
* repository in which the user hasn't performed any
|
|
|
|
* conflict resolution yet?
|
|
|
|
*/
|
|
|
|
if (lstat(path, &st) || !S_ISLNK(st.st_mode) ||
|
|
|
|
strbuf_readlink(&sb, path, st.st_size) ||
|
|
|
|
!is_absolute_path(sb.buf) ||
|
|
|
|
mkdir(sb.buf, 0777)) {
|
|
|
|
strbuf_release(&sb);
|
|
|
|
errno = saved_errno;
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
strbuf_release(&sb);
|
|
|
|
}
|
|
|
|
return adjust_shared_perm(path);
|
|
|
|
}
|
|
|
|
|
2020-12-02 02:45:04 +03:00
|
|
|
static enum scld_error safe_create_leading_directories_1(char *path, int share)
|
2005-07-06 12:11:52 +04:00
|
|
|
{
|
2014-01-06 17:45:22 +04:00
|
|
|
char *next_component = path + offset_1st_component(path);
|
2014-01-06 17:45:25 +04:00
|
|
|
enum scld_error ret = SCLD_OK;
|
2006-02-10 04:56:13 +03:00
|
|
|
|
2014-01-06 17:45:25 +04:00
|
|
|
while (ret == SCLD_OK && next_component) {
|
2014-01-06 17:45:20 +04:00
|
|
|
struct stat st;
|
2014-01-19 03:40:44 +04:00
|
|
|
char *slash = next_component, slash_character;
|
2014-01-06 17:45:20 +04:00
|
|
|
|
2014-01-19 03:40:44 +04:00
|
|
|
while (*slash && !is_dir_sep(*slash))
|
|
|
|
slash++;
|
|
|
|
|
|
|
|
if (!*slash)
|
2005-07-06 12:11:52 +04:00
|
|
|
break;
|
2014-01-06 17:45:23 +04:00
|
|
|
|
2014-01-06 17:45:22 +04:00
|
|
|
next_component = slash + 1;
|
2014-01-19 03:40:44 +04:00
|
|
|
while (is_dir_sep(*next_component))
|
2014-01-06 17:45:23 +04:00
|
|
|
next_component++;
|
2014-01-06 17:45:22 +04:00
|
|
|
if (!*next_component)
|
2008-09-03 01:10:15 +04:00
|
|
|
break;
|
2014-01-06 17:45:21 +04:00
|
|
|
|
2014-01-19 03:40:44 +04:00
|
|
|
slash_character = *slash;
|
2014-01-06 17:45:21 +04:00
|
|
|
*slash = '\0';
|
2006-02-10 04:56:13 +03:00
|
|
|
if (!stat(path, &st)) {
|
|
|
|
/* path exists */
|
2017-01-06 19:22:25 +03:00
|
|
|
if (!S_ISDIR(st.st_mode)) {
|
|
|
|
errno = ENOTDIR;
|
2014-01-06 17:45:25 +04:00
|
|
|
ret = SCLD_EXISTS;
|
2017-01-06 19:22:25 +03:00
|
|
|
}
|
2014-01-06 17:45:19 +04:00
|
|
|
} else if (mkdir(path, 0777)) {
|
2013-03-17 18:09:27 +04:00
|
|
|
if (errno == EEXIST &&
|
2014-01-06 17:45:24 +04:00
|
|
|
!stat(path, &st) && S_ISDIR(st.st_mode))
|
2013-03-17 18:09:27 +04:00
|
|
|
; /* somebody created it since we checked */
|
2014-01-06 17:45:27 +04:00
|
|
|
else if (errno == ENOENT)
|
|
|
|
/*
|
|
|
|
* Either mkdir() failed because
|
|
|
|
* somebody just pruned the containing
|
|
|
|
* directory, or stat() failed because
|
|
|
|
* the file that was in our way was
|
|
|
|
* just removed. Either way, inform
|
|
|
|
* the caller that it might be worth
|
|
|
|
* trying again:
|
|
|
|
*/
|
|
|
|
ret = SCLD_VANISHED;
|
2014-01-06 17:45:24 +04:00
|
|
|
else
|
2014-01-06 17:45:25 +04:00
|
|
|
ret = SCLD_FAILED;
|
2020-12-02 02:45:04 +03:00
|
|
|
} else if (share && adjust_shared_perm(path)) {
|
2014-01-06 17:45:25 +04:00
|
|
|
ret = SCLD_PERMS;
|
2005-12-23 01:13:56 +03:00
|
|
|
}
|
2014-01-19 03:40:44 +04:00
|
|
|
*slash = slash_character;
|
2005-07-06 12:11:52 +04:00
|
|
|
}
|
2014-01-06 17:45:24 +04:00
|
|
|
return ret;
|
2005-07-06 12:11:52 +04:00
|
|
|
}
|
2005-07-05 22:31:32 +04:00
|
|
|
|
2020-12-02 02:45:04 +03:00
|
|
|
enum scld_error safe_create_leading_directories(char *path)
|
|
|
|
{
|
|
|
|
return safe_create_leading_directories_1(path, 1);
|
|
|
|
}
|
|
|
|
|
|
|
|
enum scld_error safe_create_leading_directories_no_share(char *path)
|
|
|
|
{
|
|
|
|
return safe_create_leading_directories_1(path, 0);
|
|
|
|
}
|
|
|
|
|
2014-01-06 17:45:25 +04:00
|
|
|
enum scld_error safe_create_leading_directories_const(const char *path)
|
2008-06-25 09:41:34 +04:00
|
|
|
{
|
2017-01-06 19:22:24 +03:00
|
|
|
int save_errno;
|
2008-06-25 09:41:34 +04:00
|
|
|
/* path points to cache entries, so xstrdup before messing with it */
|
|
|
|
char *buf = xstrdup(path);
|
2014-01-06 17:45:25 +04:00
|
|
|
enum scld_error result = safe_create_leading_directories(buf);
|
2017-01-06 19:22:24 +03:00
|
|
|
|
|
|
|
save_errno = errno;
|
2008-06-25 09:41:34 +04:00
|
|
|
free(buf);
|
2017-01-06 19:22:24 +03:00
|
|
|
errno = save_errno;
|
2008-06-25 09:41:34 +04:00
|
|
|
return result;
|
|
|
|
}
|
|
|
|
|
sha1-file: modernize loose object file functions
The loose object access code in sha1-file.c is some of the oldest in
Git, and could use some modernizing. It mostly uses "unsigned char *"
for object ids, which these days should be "struct object_id".
It also uses the term "sha1_file" in many functions, which is confusing.
The term "loose_objects" is much better. It clearly distinguishes
them from packed objects (which didn't even exist back when the name
"sha1_file" came into being). And it also distinguishes it from the
checksummed-file concept in csum-file.c (which until recently was
actually called "struct sha1file"!).
This patch converts the functions {open,close,map,stat}_sha1_file() into
open_loose_object(), etc, and switches their sha1 arguments for
object_id structs. Similarly, path functions like fill_sha1_path()
become fill_loose_path() and use object_ids.
The function sha1_loose_object_info() already says "loose", so we can
just drop the "sha1" (and teach it to use object_id).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-07 11:35:42 +03:00
|
|
|
static void fill_loose_path(struct strbuf *buf, const struct object_id *oid)
|
2005-05-07 11:38:04 +04:00
|
|
|
{
|
|
|
|
int i;
|
2018-07-16 04:28:07 +03:00
|
|
|
for (i = 0; i < the_hash_algo->rawsz; i++) {
|
2005-05-07 11:38:04 +04:00
|
|
|
static char hex[] = "0123456789abcdef";
|
2019-01-07 11:35:42 +03:00
|
|
|
unsigned int val = oid->hash[i];
|
fill_sha1_file: write into a strbuf
It's currently the responsibility of the caller to give
fill_sha1_file() enough bytes to write into, leading them to
manually compute the required lengths. Instead, let's just
write into a strbuf so that it's impossible to get this
wrong.
The alt_odb caller already has a strbuf, so this makes
things strictly simpler. The other caller, sha1_file_name(),
uses a static PATH_MAX buffer and dies when it would
overflow. We can convert this to a static strbuf, which
means our allocation cost is amortized (and as a bonus, we
no longer have to worry about PATH_MAX being too short for
normal use).
This does introduce some small overhead in fill_sha1_file(),
as each strbuf_addch() will check whether it needs to
grow. However, between the optimization in fec501d
(strbuf_addch: avoid calling strbuf_grow, 2015-04-16) and
the fact that this is not generally called in a tight loop
(after all, the next step is typically to access the file!)
this probably doesn't matter. And even if it did, the right
place to micro-optimize is inside fill_sha1_file(), by
calling a single strbuf_grow() there.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-03 23:36:09 +03:00
|
|
|
strbuf_addch(buf, hex[val >> 4]);
|
|
|
|
strbuf_addch(buf, hex[val & 0xf]);
|
2016-10-03 23:35:55 +03:00
|
|
|
if (!i)
|
2016-10-03 23:36:09 +03:00
|
|
|
strbuf_addch(buf, '/');
|
2005-05-07 11:38:04 +04:00
|
|
|
}
|
|
|
|
}
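The loose-object path layout built by fill_loose_path() (two hex digits, a slash, then the remaining hex digits) can be tried out with a standalone sketch. `demo_loose_path` is a hypothetical stand-in that writes into a plain char buffer rather than a strbuf, so the caller has to size the buffer itself, which is exactly the burden the strbuf conversion removed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for fill_loose_path() that writes into a plain
 * char buffer instead of a strbuf. out must hold 2 * rawsz + 2 bytes:
 * two hex digits per hash byte, one '/', and the trailing NUL.
 */
static void demo_loose_path(char *out, size_t outsz,
			    const unsigned char *hash, size_t rawsz)
{
	static const char hex[] = "0123456789abcdef";
	size_t i, n = 0;

	assert(outsz >= 2 * rawsz + 2);
	for (i = 0; i < rawsz; i++) {
		unsigned int val = hash[i];
		out[n++] = hex[val >> 4];
		out[n++] = hex[val & 0xf];
		if (!i)
			out[n++] = '/'; /* first byte names the fan-out directory */
	}
	out[n] = '\0';
}
```

Fed the 20 bytes of EMPTY_BLOB_SHA1_BIN_LITERAL defined earlier in this file, the sketch produces `e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391`, the path of the empty blob relative to an object directory.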
|
|
|
|
|
sha1-file: use an object_directory for the main object dir

Our handling of alternate object directories is needlessly different
from the main object directory. As a result, many places in the code
basically look like this:

  do_something(r->objects->objectdir);

  for (odb = r->objects->alt_odb_list; odb; odb = odb->next)
        do_something(odb->path);

That gets annoying when do_something() is non-trivial, and we've
resorted to gross hacks like creating fake alternates (see
find_short_object_filename()).

Instead, let's give each raw_object_store a unified list of
object_directory structs. The first will be the main store, and
everything after is an alternate. Very few callers even care about the
distinction, and can just loop over the whole list (and those who care
can just treat the first element differently).

A few observations:

  - we don't need r->objects->objectdir anymore, and can just
    mechanically convert that to r->objects->odb->path

  - object_directory's path field needs to become a real pointer rather
    than a FLEX_ARRAY, in order to fill it with expand_base_dir()

  - we'll call prepare_alt_odb() earlier in many functions (i.e.,
    outside of the loop). This may result in us calling it even when our
    function would be satisfied looking only at the main odb.

    But this doesn't matter in practice. It's not a very expensive
    operation in the first place, and in the majority of cases it will
    be a noop. We call it already (and cache its results) in
    prepare_packed_git(), and we'll generally check packs before loose
    objects. So essentially every program is going to call it
    immediately once per program.

    Arguably we should just prepare_alt_odb() immediately upon setting
    up the repository's object directory, which would save us sprinkling
    calls throughout the code base (and forgetting to do so has been a
    source of subtle bugs in the past). But I've stopped short of that
    here, since there are already a lot of other moving parts in this
    patch.

  - Most call sites just get shorter. The check_and_freshen() functions
    are an exception, because they have entry points to handle local and
    nonlocal directories separately.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12 17:50:39 +03:00
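The unified list described in the message above can be sketched as a singly linked list whose head is the main object directory. The struct below is a deliberately reduced stand-in for Git's `struct object_directory` (only `next` and `path` are kept), and `for_each_object_dir()`, `count_dir()` and `count_object_dirs()` are hypothetical helpers introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for Git's struct object_directory. */
struct object_directory {
	struct object_directory *next;
	const char *path;
};

typedef void (*odb_fn)(const struct object_directory *odb, void *data);

/*
 * Visit every object directory. The head of the list is the main store;
 * everything after it is an alternate, so one loop covers both.
 */
static void for_each_object_dir(const struct object_directory *head,
				odb_fn fn, void *data)
{
	assert(fn);
	for (; head; head = head->next)
		fn(head, data);
}

/* Callback that counts the directories it is shown. */
static void count_dir(const struct object_directory *odb, void *data)
{
	(void)odb;
	(*(size_t *)data)++;
}

static size_t count_object_dirs(const struct object_directory *head)
{
	size_t n = 0;

	for_each_object_dir(head, count_dir, &n);
	return n;
}
```

A caller that only cares about alternates can start from `head->next`, which matches the "treat the first element differently" remark in the message.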
|
|
|
static const char *odb_loose_path(struct object_directory *odb,
|
|
|
|
struct strbuf *buf,
|
sha1-file: modernize loose object file functions
The loose object access code in sha1-file.c is some of the oldest in
Git, and could use some modernizing. It mostly uses "unsigned char *"
for object ids, which these days should be "struct object_id".
It also uses the term "sha1_file" in many functions, which is confusing.
The term "loose_objects" is much better. It clearly distinguishes
them from packed objects (which didn't even exist back when the name
"sha1_file" came into being). And it also distinguishes it from the
checksummed-file concept in csum-file.c (which until recently was
actually called "struct sha1file"!).
This patch converts the functions {open,close,map,stat}_sha1_file() into
open_loose_object(), etc, and switches their sha1 arguments for
object_id structs. Similarly, path functions like fill_sha1_path()
become fill_loose_path() and use object_ids.
The function sha1_loose_object_info() already says "loose", so we can
just drop the "sha1" (and teach it to use object_id).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-07 11:35:42 +03:00
|
|
|
const struct object_id *oid)
|
2005-04-19 00:04:43 +04:00
|
|
|
{
|
2018-11-12 17:48:56 +03:00
|
|
|
strbuf_reset(buf);
|
2018-11-12 17:50:39 +03:00
|
|
|
strbuf_addstr(buf, odb->path);
|
2018-01-18 13:08:54 +03:00
|
|
|
strbuf_addch(buf, '/');
|
2019-01-07 11:35:42 +03:00
|
|
|
fill_loose_path(buf, oid);
|
2018-11-12 17:49:35 +03:00
|
|
|
return buf->buf;
|
2005-04-19 00:04:43 +04:00
|
|
|
}
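As a rough illustration of what odb_loose_path() produces, the sketch below builds the same shape of path from a hex object name: the object directory, a slash, the first two hex digits as a fan-out directory, another slash, and the remaining digits. It is an assumption-laden simplification: `sketch_loose_path()` is a hypothetical name, it takes an already-formatted hex string and a fixed-size output buffer, whereas the real code works from a `struct object_id` and a strbuf.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "<objdir>/<first two hex digits>/<remaining hex digits>" into out.
 * Returns 0 on success, -1 if the name is too short or out is too small.
 */
static int sketch_loose_path(char *out, size_t outlen,
			     const char *objdir, const char *hex)
{
	int n;

	assert(out && objdir && hex);
	if (strlen(hex) < 3)
		return -1; /* need a fan-out prefix and a non-empty file name */
	n = snprintf(out, outlen, "%s/%.2s/%s", objdir, hex, hex + 2);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

The explicit length check plays the role that the growable strbuf plays in the real function: the caller never has to guess how long the path will be.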
|
|
|
|
|
2018-11-12 17:49:35 +03:00
|
|
|
const char *loose_object_path(struct repository *r, struct strbuf *buf,
|
2019-01-07 11:35:42 +03:00
|
|
|
const struct object_id *oid)
|
2016-10-03 23:35:43 +03:00
|
|
|
{
|
2019-01-07 11:35:42 +03:00
|
|
|
return odb_loose_path(r->objects->odb, buf, oid);
|
2005-04-19 00:04:43 +04:00
|
|
|
}
|
|
|
|
|
link_alt_odb_entry: refactor string handling

The string handling in link_alt_odb_entry() is mostly an
artifact of the original version, which took the path as a
ptr/len combo, and did not have a NUL-terminated string
until we created one in the alternate_object_database
struct. But since 5bdf0a8 (sha1_file: normalize alt_odb
path before comparing and storing, 2011-09-07), the first
thing we do is put the path into a strbuf, which gives us
some easy opportunities for cleanup.

In particular:

  - we call strlen(pathbuf.buf), which is silly; we can look
    at pathbuf.len.

  - even though we have a strbuf, we don't maintain its
    "len" field when chomping extra slashes from the
    end, and instead keep a separate "pfxlen" variable. We
    can fix this and then drop "pfxlen" entirely.

  - we don't check whether the path is usable until after we
    allocate the new struct, making extra cleanup work for
    ourselves. Since we have a NUL-terminated string, we can
    bump the "is it usable" checks higher in the function.
    While we're at it, we can move that logic to its own
    helper, which makes the flow of link_alt_odb_entry()
    easier to follow.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-03 23:34:48 +03:00
|
|
|
/*
|
|
|
|
* Return non-zero iff the path is usable as an alternate object database.
|
|
|
|
*/
|
2018-03-23 20:21:03 +03:00
|
|
|
static int alt_odb_usable(struct raw_object_store *o,
|
|
|
|
struct strbuf *path,
|
2021-07-08 02:10:15 +03:00
|
|
|
const char *normalized_objdir, khiter_t *pos)
|
2016-10-03 23:34:48 +03:00
|
|
|
{
|
2021-07-08 02:10:15 +03:00
|
|
|
int r;
|
2016-10-03 23:34:48 +03:00
|
|
|
|
|
|
|
/* Detect cases where alternate disappeared */
|
|
|
|
if (!is_directory(path->buf)) {
|
2018-07-21 10:49:39 +03:00
|
|
|
error(_("object directory %s does not exist; "
|
|
|
|
"check .git/objects/info/alternates"),
|
2016-10-03 23:34:48 +03:00
|
|
|
path->buf);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Prevent the common mistake of listing the same
|
|
|
|
* thing twice, or object directory itself.
|
|
|
|
*/
|
2021-07-08 02:10:15 +03:00
|
|
|
if (!o->odb_by_path) {
|
|
|
|
khiter_t p;
|
|
|
|
|
|
|
|
o->odb_by_path = kh_init_odb_path_map();
|
|
|
|
assert(!o->odb->next);
|
|
|
|
p = kh_put_odb_path_map(o->odb_by_path, o->odb->path, &r);
|
|
|
|
assert(r == 1); /* never used */
|
|
|
|
kh_value(o->odb_by_path, p) = o->odb;
|
2016-10-03 23:34:48 +03:00
|
|
|
}
|
2021-07-08 02:10:15 +03:00
|
|
|
if (fspatheq(path->buf, normalized_objdir))
|
2016-10-03 23:34:48 +03:00
|
|
|
return 0;
|
2021-07-08 02:10:15 +03:00
|
|
|
*pos = kh_put_odb_path_map(o->odb_by_path, path->buf, &r);
|
|
|
|
/* r: 0 = exists, 1 = never used, 2 = deleted */
|
|
|
|
return r == 0 ? 0 : 1;
|
2016-10-03 23:34:48 +03:00
|
|
|
}
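The duplicate check in alt_odb_usable() hinges on a "put" operation that reports whether the key was already present. The sketch below shows that idea with a tiny fixed-size set instead of khash. Everything in it is hypothetical: `struct path_set`, `path_set_put()` and `alternate_is_new()` are illustrative names, plain strcmp() stands in for fspatheq(), and the main object directory is passed in separately rather than seeded into the map as the real code does.

```c
#include <assert.h>
#include <string.h>

#define PATH_SET_MAX 16

/* Hypothetical stand-in for the khash-backed odb_by_path map. */
struct path_set {
	const char *paths[PATH_SET_MAX];
	size_t nr;
};

/*
 * Insert `path` unless an equal string is already present.
 * Returns 1 if the path was newly added, 0 if it was a duplicate,
 * and -1 if the set is full.
 */
static int path_set_put(struct path_set *set, const char *path)
{
	size_t i;

	assert(set && path);
	for (i = 0; i < set->nr; i++)
		if (!strcmp(set->paths[i], path))
			return 0;
	if (set->nr == PATH_SET_MAX)
		return -1;
	set->paths[set->nr++] = path;
	return 1;
}

/* Reject the object directory itself and anything already listed. */
static int alternate_is_new(struct path_set *seen, const char *objdir,
			    const char *path)
{
	if (!strcmp(path, objdir))
		return 0;
	return path_set_put(seen, path) == 1;
}
```

Combining lookup and insertion in one call means a path is recorded the first time it is seen, so a later repeat is rejected without a separate search.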
|
|
|
|
|
2005-05-09 00:51:13 +04:00
|
|
|
/*
|
|
|
|
* Prepare alternate object database registry.
|
2005-08-15 04:25:57 +04:00
|
|
|
*
|
|
|
|
* The variable alt_odb_list points at the list of struct
|
2018-11-12 17:48:47 +03:00
|
|
|
* object_directory. The elements on this list come from
|
2005-08-15 04:25:57 +04:00
|
|
|
* non-empty elements from colon separated ALTERNATE_DB_ENVIRONMENT
|
|
|
|
* environment variable, and $GIT_OBJECT_DIRECTORY/info/alternates,
|
2005-12-05 09:48:43 +03:00
|
|
|
* whose contents are similar to that environment variable but can be
|
|
|
|
* LF separated. Its base points at a statically allocated buffer that
|
2005-08-15 04:25:57 +04:00
|
|
|
* contains "/the/directory/corresponding/to/.git/objects/...", while
|
|
|
|
* its name points just after the slash at the end of ".git/objects/"
|
2020-12-31 14:56:21 +03:00
|
|
|
* in the example above, and has enough space to hold all hex characters
|
|
|
|
* of the object ID, an extra slash for the first level indirection, and
|
|
|
|
* the terminating NUL.
|
2005-05-09 00:51:13 +04:00
|
|
|
*/
|
2018-03-23 20:21:08 +03:00
|
|
|
static void read_info_alternates(struct repository *r,
|
|
|
|
const char *relative_base,
|
|
|
|
int depth);
|
2021-07-08 02:10:16 +03:00
|
|
|
static int link_alt_odb_entry(struct repository *r, const struct strbuf *entry,
|
2018-03-23 20:21:04 +03:00
|
|
|
const char *relative_base, int depth, const char *normalized_objdir)
|
2005-05-07 11:38:04 +04:00
|
|
|
{
|
2018-11-12 17:48:47 +03:00
|
|
|
struct object_directory *ent;
|
2011-09-07 14:37:47 +04:00
|
|
|
struct strbuf pathbuf = STRBUF_INIT;
|
object-file: use real paths when adding alternates

When adding an alternate ODB, we check if the alternate has the same
path as the object dir, and if so, we do nothing. However, that
comparison does not resolve symlinks. This makes it possible to add the
object dir as an alternate, which may result in bad behavior. For
example, it can trick "git repack -a -l -d" (possibly run by "git gc")
into thinking that all packs come from an alternate and delete all
objects.

	rm -rf test &&
	git clone https://github.com/git/git test &&
	(
		cd test &&
		ln -s objects .git/alt-objects &&
		# -c repack.updateserverinfo=false silences a warning about not
		# being able to update "info/refs", it isn't needed to show the
		# bad behavior
		GIT_ALTERNATE_OBJECT_DIRECTORIES=".git/alt-objects" git \
			-c repack.updateserverinfo=false repack -a -l -d &&
		# It's broken!
		git status
		# Because there are no more objects!
		ls .git/objects/pack
	)

Fix this by resolving symlinks and relative paths before comparing the
alternate and object dir. This lets us clean up a number of issues noted
in 37a95862c6 (alternates: re-allow relative paths from environment,
2016-11-07):

  - Now that we compare the real paths, duplicate detection is no longer
    foiled by relative paths.

  - Using strbuf_realpath() allows us to "normalize" paths that
    strbuf_normalize_path() can't, so we can stop silently ignoring errors
    when "normalizing" paths from the environment.

  - We now store an absolute path based on getcwd() (the "future
    direction" named in 37a95862c6), so chdir()-ing in the process no
    longer changes the directory pointed to by the alternate. This is a
    change in behavior, but a desirable one.

Signed-off-by: Glen Choo <chooglen@google.com>
Acked-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-11-24 03:55:31 +03:00
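The fix described above amounts to comparing resolved paths rather than the strings as written. A minimal sketch of that comparison, assuming a POSIX.1-2008 system where realpath() accepts a NULL output buffer, might look like the following. `same_real_path()` is a hypothetical helper, and Git itself uses strbuf_realpath() rather than calling realpath() directly.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700 /* for realpath() under strict -std= modes */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return 1 if both paths resolve to the same real path, 0 if they differ,
 * and -1 if either cannot be resolved (for example, it does not exist).
 */
static int same_real_path(const char *a, const char *b)
{
	char *ra, *rb;
	int ret = -1;

	assert(a && b);
	ra = realpath(a, NULL); /* resolves symlinks, "." and ".." */
	rb = realpath(b, NULL);
	if (ra && rb)
		ret = !strcmp(ra, rb);
	free(ra);
	free(rb);
	return ret;
}
```

With a check of this kind, a symlink such as `.git/alt-objects` pointing at `objects` compares equal to the object directory, so it would be rejected as an alternate instead of being treated as a separate store.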
|
|
|
struct strbuf tmp = STRBUF_INIT;
|
2021-07-08 02:10:15 +03:00
|
|
|
khiter_t pos;
|
2022-11-24 03:55:31 +03:00
|
|
|
int ret = -1;
|
2005-08-15 04:25:57 +04:00
|
|
|
|
2021-07-08 02:10:16 +03:00
|
|
|
if (!is_absolute_path(entry->buf) && relative_base) {
|
2016-12-12 21:16:55 +03:00
|
|
|
strbuf_realpath(&pathbuf, relative_base, 1);
|
2011-09-07 14:37:47 +04:00
|
|
|
strbuf_addch(&pathbuf, '/');
|
2006-05-07 22:19:21 +04:00
|
|
|
}
|
2021-07-08 02:10:16 +03:00
|
|
|
strbuf_addbuf(&pathbuf, entry);
|
2006-05-07 22:19:21 +04:00
|
|
|
|
2022-11-24 03:55:31 +03:00
|
|
|
if (!strbuf_realpath(&tmp, pathbuf.buf, 0)) {
|
2018-07-21 10:49:39 +03:00
|
|
|
error(_("unable to normalize alternate object path: %s"),
|
link_alt_odb_entry: handle normalize_path errors

When we add a new alternate to the list, we try to normalize
out any redundant "..", etc. However, we do not look at the
return value of normalize_path_copy(), and will happily
continue with a path that could not be normalized. Worse,
the normalizing process is done in-place, so we are left
with whatever half-finished working state the normalizing
function was in.

Fortunately, this cannot cause us to read past the end of
our buffer, as that working state will always leave the
NUL from the original path in place. And we do tend to
notice problems when we check is_directory() on the path.
But you can see the nonsense that we feed to is_directory
with an entry like:

  this/../../is/../../way/../../too/../../deep/../../to/../../resolve

in your objects/info/alternates, which yields:

  error: object directory
  /to/e/deep/too/way//ects/this/../../is/../../way/../../too/../../deep/../../to/../../resolve
  does not exist; check .git/objects/info/alternates.

We can easily fix this just by checking the return value.
But that makes it hard to generate a good error message,
since we're normalizing in-place and our input value has
been overwritten by cruft.

Instead, let's provide a strbuf helper that does an in-place
normalize, but restores the original contents on error. This
uses a second buffer under the hood, which is slightly less
efficient, but this is not a performance-critical code path.

The strbuf helper can also properly set the "len" parameter
of the strbuf before returning. Just doing:

  normalize_path_copy(buf.buf, buf.buf);

will shorten the string, but leave buf.len at the original
length. That may be confusing to later code which uses the
strbuf.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-03 23:34:17 +03:00
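The "normalize in place, but keep the original on error" pattern from the message above can be sketched with a scratch buffer. The normalizer here is a simplified assumption, not Git's normalize_path_copy(): `sketch_normalize()` only handles absolute paths and only collapses repeated slashes, "." and "..". `normalize_in_place()` is likewise a hypothetical wrapper operating on a plain C string rather than a strbuf.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Simplified normalizer: collapses "//", "." and ".." in an absolute
 * path. Returns -1 if the path is relative or ".." would climb above
 * the root. dst must hold at least strlen(src) + 1 bytes.
 */
static int sketch_normalize(char *dst, const char *src)
{
	size_t len = 0;

	if (*src != '/')
		return -1;
	while (*src) {
		const char *end;
		size_t n;

		while (*src == '/')
			src++;
		end = strchr(src, '/');
		n = end ? (size_t)(end - src) : strlen(src);
		if (n == 2 && src[0] == '.' && src[1] == '.') {
			if (!len)
				return -1; /* nothing left to strip */
			while (dst[--len] != '/')
				; /* drop the last component */
		} else if (n && !(n == 1 && src[0] == '.')) {
			dst[len++] = '/';
			memcpy(dst + len, src, n);
			len += n;
		}
		src += n;
	}
	if (!len)
		dst[len++] = '/';
	dst[len] = '\0';
	return 0;
}

/*
 * Normalize `path` in place, but leave it untouched on failure so the
 * caller can still print the original in an error message.
 */
static int normalize_in_place(char *path)
{
	char *tmp;
	int ret;

	assert(path);
	tmp = malloc(strlen(path) + 1);
	if (!tmp)
		return -1;
	ret = sketch_normalize(tmp, path);
	if (!ret)
		strcpy(path, tmp); /* result is never longer than the input */
	free(tmp);
	return ret;
}
```

The extra allocation is the cost the message mentions; in exchange, a failed normalization leaves the caller with an intact string to report.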
|
|
|
pathbuf.buf);
|
2022-11-24 03:55:31 +03:00
|
|
|
goto error;
|
2016-10-03 23:34:17 +03:00
|
|
|
}
|
2022-11-24 03:55:31 +03:00
	strbuf_swap(&pathbuf, &tmp);
2011-09-07 14:37:47 +04:00

	/*
	 * The trailing slash after the directory name is given by
	 * this function at the end. Remove duplicates.
	 */
link_alt_odb_entry: refactor string handling
The string handling in link_alt_odb_entry() is mostly an
artifact of the original version, which took the path as a
ptr/len combo, and did not have a NUL-terminated string
until we created one in the alternate_object_database
struct. But since 5bdf0a8 (sha1_file: normalize alt_odb
path before comparing and storing, 2011-09-07), the first
thing we do is put the path into a strbuf, which gives us
some easy opportunities for cleanup.
In particular:
- we call strlen(pathbuf.buf), which is silly; we can look
at pathbuf.len.
- even though we have a strbuf, we don't maintain its
"len" field when chomping extra slashes from the
end, and instead keep a separate "pfxlen" variable. We
can fix this and then drop "pfxlen" entirely.
- we don't check whether the path is usable until after we
allocate the new struct, making extra cleanup work for
ourselves. Since we have a NUL-terminated string, we can
bump the "is it usable" checks higher in the function.
While we're at it, we can move that logic to its own
helper, which makes the flow of link_alt_odb_entry()
easier to follow.
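
The second point, keeping the length field in step with the buffer while chomping slashes, can be sketched without Git's strbuf as follows. `struct tiny_buf` and `chomp_trailing_slashes()` are hypothetical names used only for this illustration.

```c
#include <stddef.h>

/* Hypothetical counted string; "len" must always match the contents. */
struct tiny_buf {
	char *buf;
	size_t len;
};

/* Strip trailing '/' characters, updating len on every step. */
static void chomp_trailing_slashes(struct tiny_buf *tb)
{
	while (tb->len && tb->buf[tb->len - 1] == '/')
		tb->buf[--tb->len] = '\0';
}
```

With the length maintained this way, later code can rely on `len` directly and needs neither `strlen()` nor a separate prefix-length variable.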
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-03 23:34:48 +03:00
	while (pathbuf.len && pathbuf.buf[pathbuf.len - 1] == '/')
		strbuf_setlen(&pathbuf, pathbuf.len - 1);
2011-09-07 14:37:47 +04:00

2022-11-24 03:55:31 +03:00
	if (!alt_odb_usable(r->objects, &pathbuf, normalized_objdir, &pos))
		goto error;
2006-05-07 22:19:21 +04:00

2021-03-13 19:17:22 +03:00
	CALLOC_ARRAY(ent, 1);
2021-07-08 02:10:15 +03:00
	/* pathbuf.buf is already in r->objects->odb_by_path */
	ent->path = strbuf_detach(&pathbuf, NULL);
2006-05-07 22:19:21 +04:00

	/* add the alternate entry */
sha1-file: use an object_directory for the main object dir
Our handling of alternate object directories is needlessly different
from the main object directory. As a result, many places in the code
basically look like this:
do_something(r->objects->objdir);
for (odb = r->objects->alt_odb_list; odb; odb = odb->next)
do_something(odb->path);
That gets annoying when do_something() is non-trivial, and we've
resorted to gross hacks like creating fake alternates (see
find_short_object_filename()).
Instead, let's give each raw_object_store a unified list of
object_directory structs. The first will be the main store, and
everything after is an alternate. Very few callers even care about the
distinction, and can just loop over the whole list (and those who care
can just treat the first element differently).
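
As a rough illustration of the unified list, here is a self-contained sketch of a singly linked list with a tail pointer, where the first element plays the role of the main store. `struct dir_entry`, `struct store` and the `store_*()` helpers are hypothetical names and are not Git's actual types.

```c
#include <stdlib.h>

/* Hypothetical stand-in for one object directory. */
struct dir_entry {
	const char *path;
	struct dir_entry *next;
};

/* head is the main store; everything after it is an alternate. */
struct store {
	struct dir_entry *head;
	struct dir_entry **tail; /* points at the NULL link to fill next */
};

static void store_init(struct store *s)
{
	s->head = NULL;
	s->tail = &s->head;
}

/* Append in O(1) by writing through the tail pointer. */
static struct dir_entry *store_append(struct store *s, const char *path)
{
	struct dir_entry *ent = calloc(1, sizeof(*ent));

	if (!ent)
		return NULL;
	ent->path = path;
	*s->tail = ent;
	s->tail = &ent->next;
	return ent;
}

/* Callers that do not care about the distinction just walk the list. */
static size_t store_count(const struct store *s)
{
	const struct dir_entry *e;
	size_t n = 0;

	for (e = s->head; e; e = e->next)
		n++;
	return n;
}
```

A caller that does care about the main store can special-case `s->head` and loop over `s->head->next` for the alternates.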
A few observations:
- we don't need r->objects->objectdir anymore, and can just
mechanically convert that to r->objects->odb->path
- object_directory's path field needs to become a real pointer rather
than a FLEX_ARRAY, in order to fill it with expand_base_dir()
- we'll call prepare_alt_odb() earlier in many functions (i.e.,
outside of the loop). This may result in us calling it even when our
function would be satisfied looking only at the main odb.
But this doesn't matter in practice. It's not a very expensive
operation in the first place, and in the majority of cases it will
be a noop. We call it already (and cache its results) in
prepare_packed_git(), and we'll generally check packs before loose
objects. So essentially every program is going to call it
immediately once per program.
Arguably we should just prepare_alt_odb() immediately upon setting
up the repository's object directory, which would save us sprinkling
calls throughout the code base (and forgetting to do so has been a
source of subtle bugs in the past). But I've stopped short of that
here, since there are already a lot of other moving parts in this
patch.
- Most call sites just get shorter. The check_and_freshen() functions
are an exception, because they have entry points to handle local and
nonlocal directories separately.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12 17:50:39 +03:00
	*r->objects->odb_tail = ent;
	r->objects->odb_tail = &(ent->next);
2006-05-07 22:19:21 +04:00
	ent->next = NULL;
2021-07-08 02:10:15 +03:00
	assert(r->objects->odb_by_path);
	kh_value(r->objects->odb_by_path, pos) = ent;
2006-05-07 22:19:21 +04:00

	/* recursively add alternates */
2021-07-08 02:10:15 +03:00
	read_info_alternates(r, ent->path, depth + 1);
2022-11-24 03:55:31 +03:00
	ret = 0;
 error:
	strbuf_release(&tmp);
	strbuf_release(&pathbuf);
	return ret;
2006-05-07 22:19:21 +04:00
}

alternates: accept double-quoted paths
We read lists of alternates from objects/info/alternates
files (delimited by newline), as well as from the
GIT_ALTERNATE_OBJECT_DIRECTORIES environment variable
(delimited by colon or semi-colon, depending on the
platform).
There's no mechanism for quoting the delimiters, so it's
impossible to specify an alternate path that contains a
colon in the environment, or one that contains a newline in
a file. We've lived with that restriction for ages because
both alternates and filenames with colons are relatively
rare, and it's only a problem when the two meet. But since
722ff7f87 (receive-pack: quarantine objects until
pre-receive accepts, 2016-10-03), which builds on the
alternates system, every push causes the receiver to set
GIT_ALTERNATE_OBJECT_DIRECTORIES internally.
It would be convenient to have some way to quote the
delimiter so that we can represent arbitrary paths.
The simplest thing would be an escape character before a
quoted delimiter (e.g., "\:" as a literal colon). But that
creates a backwards compatibility problem: any path which
uses that escape character is now broken, and we've just
shifted the problem. We could choose an unlikely escape
character (e.g., something from the non-printable ASCII
range), but that's awkward to use.
Instead, let's treat names as unquoted unless they begin
with a double-quote, in which case they are interpreted via
our usual C-style quoting rules. This also breaks
backwards-compatibility, but in a smaller way: it only
matters if your file has a double-quote as the very _first_
character in the path (whereas an escape character is a
problem anywhere in the path). It's also consistent with
many other parts of git, which accept either a bare pathname
or a double-quoted one, and the sender can choose to quote
or not as required.
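
The rule can be sketched in isolation as below. This is a simplified illustration, not Git's implementation: `unquote_simple()`, `find_sep()` and `parse_entry()` are hypothetical helpers, the unquoting handles only the `\\` and `\"` escapes (Git's real C-style unquoting supports more), and output goes to a fixed-size buffer instead of a strbuf.

```c
#include <string.h>

/*
 * Simplified unquoting (assumption: only \\ and \" are escapes).
 * On success, stores the text in out, points *end just past the
 * closing quote and returns 0. Returns -1 on broken quoting.
 */
static int unquote_simple(char *out, size_t outsz,
			  const char *in, const char **end)
{
	size_t n = 0;

	if (!outsz || *in++ != '"')
		return -1;
	while (*in && *in != '"') {
		if (*in == '\\') {
			in++;
			if (*in != '\\' && *in != '"')
				return -1;
		}
		if (n + 1 >= outsz)
			return -1;
		out[n++] = *in++;
	}
	if (*in != '"')
		return -1; /* no closing quote */
	out[n] = '\0';
	*end = in + 1;
	return 0;
}

/* Return a pointer to the next separator, or to the final NUL. */
static const char *find_sep(const char *s, int sep)
{
	while (*s && *s != sep)
		s++;
	return s;
}

/*
 * Parse one entry into out (outsz must be at least 1) and return
 * the start of the next entry. Comments yield an empty entry;
 * broken quoting falls back to the unquoted interpretation.
 */
static const char *parse_entry(const char *s, int sep,
			       char *out, size_t outsz)
{
	const char *end;

	out[0] = '\0';
	if (*s == '#') {
		end = find_sep(s, sep);
	} else if (*s == '"' && !unquote_simple(out, outsz, s, &end)) {
		; /* quoted entry: out and end are already set */
	} else {
		size_t n;

		end = find_sep(s, sep);
		n = (size_t)(end - s);
		if (n >= outsz)
			n = outsz - 1; /* this sketch truncates */
		memcpy(out, s, n);
		out[n] = '\0';
	}
	if (*end)
		end++;
	return end;
}
```

Only an entry whose very first character is a double-quote is treated specially, so a quote or backslash anywhere else in a bare path keeps its old meaning.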
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-12-12 22:52:22 +03:00
static const char *parse_alt_odb_entry(const char *string,
				       int sep,
				       struct strbuf *out)
{
	const char *end;

	strbuf_reset(out);

	if (*string == '#') {
		/* comment; consume up to next separator */
		end = strchrnul(string, sep);
	} else if (*string == '"' && !unquote_c_style(out, string, &end)) {
		/*
		 * quoted path; unquote_c_style has copied the
		 * data for us and set "end". Broken quoting (e.g.,
		 * an entry that doesn't end with a quote) falls
		 * back to the unquoted case below.
		 */
	} else {
		/* normal, unquoted path */
		end = strchrnul(string, sep);
		strbuf_add(out, string, end - string);
	}

	if (*end)
		end++;
	return end;
}

2018-03-23 20:21:08 +03:00
static void link_alt_odb_entries(struct repository *r, const char *alt,
				 int sep, const char *relative_base, int depth)
2006-05-07 22:19:21 +04:00
{
2014-07-15 15:29:45 +04:00
	struct strbuf objdirbuf = STRBUF_INIT;
2016-12-12 22:52:22 +03:00
	struct strbuf entry = STRBUF_INIT;
2006-05-07 22:19:21 +04:00

2017-11-12 13:27:39 +03:00
	if (!alt || !*alt)
		return;

2006-05-07 22:19:21 +04:00
	if (depth > 5) {
2018-07-21 10:49:39 +03:00
		error(_("%s: ignoring alternate object stores, nesting too deep"),
2006-05-07 22:19:21 +04:00
		      relative_base);
		return;
	}

2022-11-24 03:55:31 +03:00
	strbuf_realpath(&objdirbuf, r->objects->odb->path, 1);
2014-07-15 15:29:45 +04:00

2016-12-12 22:52:22 +03:00
	while (*alt) {
		alt = parse_alt_odb_entry(alt, sep, &entry);
		if (!entry.len)
2005-08-17 05:22:05 +04:00
			continue;
2021-07-08 02:10:16 +03:00
		link_alt_odb_entry(r, &entry,
2018-03-23 20:21:04 +03:00
				   relative_base, depth, objdirbuf.buf);
2005-08-17 05:22:05 +04:00
	}
2016-12-12 22:52:22 +03:00
	strbuf_release(&entry);
2014-07-15 15:29:45 +04:00
	strbuf_release(&objdirbuf);
2005-08-15 04:25:57 +04:00
}

2018-03-23 20:21:08 +03:00
static void read_info_alternates(struct repository *r,
				 const char *relative_base,
				 int depth)
2005-08-15 04:25:57 +04:00
{
2015-08-19 21:12:45 +03:00
	char *path;
read_info_alternates: read contents into strbuf
This patch fixes a regression in v2.11.1 where we might read
past the end of an mmap'd buffer. It was introduced in
cf3c635210.
The link_alt_odb_entries() function has always taken a
ptr/len pair as input. Until cf3c635210 (alternates: accept
double-quoted paths, 2016-12-12), we made a copy of those
bytes in a string. But after that commit, we switched to
parsing the input left-to-right, and we ignore "len"
totally, instead reading until we hit a NUL.
This has mostly gone unnoticed for a few reasons:
1. All but one caller passes a NUL-terminated string, with
"len" pointing to the NUL.
2. The remaining caller, read_info_alternates(), passes in
an mmap'd file. Unless the file is an exact multiple of
the page size, it will generally be followed by NUL
padding to the end of the page, which just works.
The easiest way to demonstrate the problem is to build with:
make SANITIZE=address NO_MMAP=Nope test
Any test which involves $GIT_DIR/info/alternates will fail,
as the mmap emulation (correctly) does not add an extra NUL,
and ASAN complains about reading past the end of the buffer.
One solution would be to teach link_alt_odb_entries() to
respect "len". But it's actually a bit tricky, since we
depend on unquote_c_style() under the hood, and it has no
ptr/len variant.
We could also just make a NUL-terminated copy of the input
bytes and operate on that. But since all but one caller
already is passing a string, instead let's just fix that
caller to provide NUL-terminated input in the first place,
by swapping out mmap for strbuf_read_file().
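
To illustrate why reading into a growable, NUL-terminated buffer sidesteps the problem, here is a plain stdio sketch. `slurp_stream()` is a hypothetical helper written for this example; it is not Git's strbuf_read_file(), only an analogue of the same idea.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Read the rest of fp into a freshly allocated buffer that is
 * always NUL-terminated, so a left-to-right parser that stops at
 * NUL can never run past the end. The caller frees the result.
 * Returns NULL on allocation failure or read error.
 */
static char *slurp_stream(FILE *fp, size_t *len_out)
{
	size_t len = 0, cap = 1024;
	char *buf = malloc(cap);

	if (!buf)
		return NULL;
	for (;;) {
		/* always leave one byte free for the terminator */
		size_t got = fread(buf + len, 1, cap - len - 1, fp);

		len += got;
		if (got == 0)
			break;
		if (cap - len <= 1) {
			char *bigger = realloc(buf, cap * 2);

			if (!bigger) {
				free(buf);
				return NULL;
			}
			buf = bigger;
			cap *= 2;
		}
	}
	if (ferror(fp)) {
		free(buf);
		return NULL;
	}
	buf[len] = '\0';
	if (len_out)
		*len_out = len;
	return buf;
}
```

Unlike an mmap'd region, the terminator here is guaranteed by construction and does not depend on the file size relative to the page size.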
There's no advantage to using mmap on the alternates file.
It's not expected to be large (and anyway, we're copying its
contents into an in-memory linked list). Nor is using
git_open() buying us anything here, since we don't keep the
descriptor open for a long period of time.
Let's also drop the "len" parameter entirely from
link_alt_odb_entries(), since it's completely ignored. That
will avoid any new callers re-introducing a similar bug.
Reported-by: Michael Haggerty <mhagger@alum.mit.edu>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-09-19 22:41:07 +03:00
	struct strbuf buf = STRBUF_INIT;
2005-08-15 04:25:57 +04:00

2015-08-19 21:12:45 +03:00
	path = xstrfmt("%s/info/alternates", relative_base);
2017-09-19 22:41:07 +03:00
	if (strbuf_read_file(&buf, path, 1024) < 0) {
2017-09-19 22:41:10 +03:00
		warn_on_fopen_errors(path);
2017-09-19 22:41:07 +03:00
		free(path);
2005-06-29 01:56:57 +04:00
		return;
2005-05-07 11:38:04 +04:00
	}
2005-08-15 04:25:57 +04:00

2018-03-23 20:21:08 +03:00
	link_alt_odb_entries(r, buf.buf, '\n', relative_base, depth);
2017-09-19 22:41:07 +03:00
	strbuf_release(&buf);
	free(path);
2005-05-07 11:38:04 +04:00
}

2008-04-18 03:32:30 +04:00
void add_to_alternates_file(const char *reference)
{
2017-10-05 23:32:03 +03:00
	struct lock_file lock = LOCK_INIT;
add_to_alternates_file: don't add duplicate entries

The add_to_alternates_file function blindly uses
hold_lock_file_for_append to copy the existing contents, and
then adds the new line to it. This has two minor problems:

  1. We might add duplicate entries, which are ugly and
     inefficient.

  2. We do not check that the file ends with a newline, in
     which case we would bogusly append to the final line.
     This is quite unlikely in practice, though, as we call
     this function only from git-clone, so presumably we are
     the only writers of the file (and we always add a
     newline).

Instead of using hold_lock_file_for_append, let's copy the
file line by line, which ensures all records are properly
terminated. If we see an extra line, we can simply abort the
update (there is no point in even copying the rest, as we
know that it would be identical to the original).

As a bonus, we also get rid of some calls to the
static-buffer mkpath and git_path functions.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-08-10 12:34:46 +03:00
|
|
|
char *alts = git_pathdup("objects/info/alternates");
|
|
|
|
FILE *in, *out;
|
2017-10-05 23:32:03 +03:00
|
|
|
int found = 0;
|
2015-08-10 12:34:46 +03:00
|
|
|
|
2017-10-05 23:32:03 +03:00
|
|
|
hold_lock_file_for_update(&lock, alts, LOCK_DIE_ON_ERROR);
|
|
|
|
out = fdopen_lock_file(&lock, "w");
|
2015-08-10 12:34:46 +03:00
|
|
|
if (!out)
|
2018-07-21 10:49:39 +03:00
|
|
|
die_errno(_("unable to fdopen alternates lockfile"));
|
2015-08-10 12:34:46 +03:00
|
|
|
|
|
|
|
in = fopen(alts, "r");
|
|
|
|
if (in) {
|
|
|
|
struct strbuf line = STRBUF_INIT;
|
|
|
|
|
2015-10-28 23:29:24 +03:00
|
|
|
while (strbuf_getline(&line, in) != EOF) {
|
2015-08-10 12:34:46 +03:00
|
|
|
if (!strcmp(reference, line.buf)) {
|
|
|
|
found = 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
fprintf_or_die(out, "%s\n", line.buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
strbuf_release(&line);
|
|
|
|
fclose(in);
|
|
|
|
}
|
|
|
|
else if (errno != ENOENT)
|
2018-07-21 10:49:39 +03:00
|
|
|
die_errno(_("unable to read alternates file"));
|
2015-08-10 12:34:46 +03:00
|
|
|
|
2017-10-05 23:32:03 +03:00
|
|
|
if (found) {
|
|
|
|
rollback_lock_file(&lock);
|
|
|
|
} else {
|
2015-08-10 12:34:46 +03:00
|
|
|
fprintf_or_die(out, "%s\n", reference);
|
2017-10-05 23:32:03 +03:00
|
|
|
if (commit_lock_file(&lock))
|
2018-07-21 10:49:39 +03:00
|
|
|
die_errno(_("unable to move new alternates file into place"));
|
sha1-file: use an object_directory for the main object dir

Our handling of alternate object directories is needlessly different
from the main object directory. As a result, many places in the code
basically look like this:

    do_something(r->objects->objdir);
    for (odb = r->objects->alt_odb_list; odb; odb = odb->next)
        do_something(odb->path);

That gets annoying when do_something() is non-trivial, and we've
resorted to gross hacks like creating fake alternates (see
find_short_object_filename()).

Instead, let's give each raw_object_store a unified list of
object_directory structs. The first will be the main store, and
everything after is an alternate. Very few callers even care about the
distinction, and can just loop over the whole list (and those who care
can just treat the first element differently).

A few observations:

  - we don't need r->objects->objectdir anymore, and can just
    mechanically convert that to r->objects->odb->path

  - object_directory's path field needs to become a real pointer rather
    than a FLEX_ARRAY, in order to fill it with expand_base_dir()

  - we'll call prepare_alt_odb() earlier in many functions (i.e.,
    outside of the loop). This may result in us calling it even when our
    function would be satisfied looking only at the main odb.

    But this doesn't matter in practice. It's not a very expensive
    operation in the first place, and in the majority of cases it will
    be a noop. We call it already (and cache its results) in
    prepare_packed_git(), and we'll generally check packs before loose
    objects. So essentially every program is going to call it
    immediately once per program.

    Arguably we should just prepare_alt_odb() immediately upon setting
    up the repository's object directory, which would save us sprinkling
    calls throughout the code base (and forgetting to do so has been a
    source of subtle bugs in the past). But I've stopped short of that
    here, since there are already a lot of other moving parts in this
    patch.

  - Most call sites just get shorter. The check_and_freshen() functions
    are an exception, because they have entry points to handle local and
    nonlocal directories separately.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12 17:50:39 +03:00
|
|
|
if (the_repository->objects->loaded_alternates)
|
2018-03-23 20:21:06 +03:00
|
|
|
link_alt_odb_entries(the_repository, reference,
|
|
|
|
'\n', NULL, 0);
|
2015-08-10 12:34:46 +03:00
|
|
|
}
|
|
|
|
free(alts);
|
2008-04-18 03:32:30 +04:00
|
|
|
}
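The copy-and-check loop in add_to_alternates_file() can be illustrated without git's lockfile machinery. The following is only a sketch under stated assumptions: `copy_and_append_unique()` is a hypothetical name, plain `FILE *` streams stand in for the lock file and the existing alternates file, and lines longer than the fixed buffer are not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not git's API): copy `in` to `out` line by line.
 * Returns 1 and stops early if `entry` is already present; otherwise
 * appends `entry` as a properly terminated last line and returns 0.
 * `in` may be NULL, which models a file that does not exist yet.
 */
static int copy_and_append_unique(FILE *in, FILE *out, const char *entry)
{
	char line[4096];

	while (in && fgets(line, sizeof(line), in)) {
		line[strcspn(line, "\n")] = '\0'; /* strip newline if any */
		if (!strcmp(line, entry))
			return 1; /* duplicate: caller should discard `out` */
		/* Re-adding "\n" repairs a missing final newline. */
		fprintf(out, "%s\n", line);
	}
	fprintf(out, "%s\n", entry);
	return 0;
}
```

A return value of 1 corresponds to the rollback_lock_file() branch in the function above, and 0 to the branch that commits the new file.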
|
|
|
|
|
2016-10-03 23:35:03 +03:00
|
|
|
void add_to_alternates_memory(const char *reference)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Make sure alternates are initialized, or else our entry may be
|
|
|
|
* overwritten when they are.
|
|
|
|
*/
|
2018-03-23 20:21:07 +03:00
|
|
|
prepare_alt_odb(the_repository);
|
2016-10-03 23:35:03 +03:00
|
|
|
|
2018-03-23 20:21:06 +03:00
|
|
|
link_alt_odb_entries(the_repository, reference,
|
|
|
|
'\n', NULL, 0);
|
2016-10-03 23:35:03 +03:00
|
|
|
}
|
|
|
|
|
2021-12-07 01:05:04 +03:00
|
|
|
struct object_directory *set_temporary_primary_odb(const char *dir, int will_destroy)
|
|
|
|
{
|
|
|
|
struct object_directory *new_odb;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure alternates are initialized, or else our entry may be
|
|
|
|
* overwritten when they are.
|
|
|
|
*/
|
|
|
|
prepare_alt_odb(the_repository);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make a new primary odb and link the old primary ODB in as an
|
|
|
|
* alternate
|
|
|
|
*/
|
|
|
|
new_odb = xcalloc(1, sizeof(*new_odb));
|
|
|
|
new_odb->path = xstrdup(dir);
|
2021-12-07 01:05:05 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Disable ref updates while a temporary odb is active, since
|
|
|
|
* the objects in the database may roll back.
|
|
|
|
*/
|
|
|
|
new_odb->disable_ref_updates = 1;
|
2021-12-07 01:05:04 +03:00
|
|
|
new_odb->will_destroy = will_destroy;
|
|
|
|
new_odb->next = the_repository->objects->odb;
|
|
|
|
the_repository->objects->odb = new_odb;
|
|
|
|
return new_odb->next;
|
|
|
|
}
|
|
|
|
|
|
|
|
void restore_primary_odb(struct object_directory *restore_odb, const char *old_path)
|
|
|
|
{
|
|
|
|
struct object_directory *cur_odb = the_repository->objects->odb;
|
|
|
|
|
|
|
|
if (strcmp(old_path, cur_odb->path))
|
|
|
|
BUG("expected %s as primary object store; found %s",
|
|
|
|
old_path, cur_odb->path);
|
|
|
|
|
|
|
|
if (cur_odb->next != restore_odb)
|
|
|
|
BUG("we expect the old primary object store to be the first alternate");
|
|
|
|
|
|
|
|
the_repository->objects->odb = restore_odb;
|
|
|
|
free_object_directory(cur_odb);
|
|
|
|
}
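The pairing of set_temporary_primary_odb() and restore_primary_odb() is a push and checked pop on the head of a singly linked list. A minimal sketch of that pattern follows. It assumes a hypothetical `struct dir_node` in place of struct object_directory and returns -1 where the real code calls BUG().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct dir_node {		/* hypothetical stand-in for object_directory */
	char *path;
	struct dir_node *next;
};

/* Push a temporary node in front of *head; return the previous head. */
static struct dir_node *push_temporary(struct dir_node **head, const char *path)
{
	struct dir_node *node = calloc(1, sizeof(*node));
	size_t n = strlen(path) + 1;

	if (!node)
		abort();
	node->path = malloc(n);
	if (!node->path)
		abort();
	memcpy(node->path, path, n);
	node->next = *head;
	*head = node;
	return node->next;
}

/*
 * Undo push_temporary(). Returns -1 without changing anything if the
 * current head is not the expected temporary node, or if the node
 * being restored is not the one directly behind it.
 */
static int pop_temporary(struct dir_node **head, struct dir_node *restore,
			 const char *expected_path)
{
	struct dir_node *cur = *head;

	if (!cur || strcmp(cur->path, expected_path) || cur->next != restore)
		return -1;
	*head = restore;
	free(cur->path);
	free(cur);
	return 0;
}
```

Returning the old head from the push lets the caller hand it back later, so the pop can verify that nothing else was linked in between.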
|
|
|
|
|
2016-08-16 00:53:24 +03:00
|
|
|
/*
|
|
|
|
* Compute the exact path an alternate is at and return it. In case of
|
|
|
|
* error, NULL is returned and a human-readable error is added to `err`.
|
2018-06-03 17:32:50 +03:00
|
|
|
* `path` may be relative and should point to $GIT_DIR.
|
2016-08-16 00:53:24 +03:00
|
|
|
* `err` must not be null.
|
|
|
|
*/
|
|
|
|
char *compute_alternate_path(const char *path, struct strbuf *err)
|
|
|
|
{
|
|
|
|
char *ref_git = NULL;
|
2020-03-10 16:11:23 +03:00
|
|
|
const char *repo;
|
2016-08-16 00:53:24 +03:00
|
|
|
int seen_error = 0;
|
|
|
|
|
2020-03-10 16:11:23 +03:00
|
|
|
ref_git = real_pathdup(path, 0);
|
|
|
|
if (!ref_git) {
|
2016-08-16 00:53:24 +03:00
|
|
|
seen_error = 1;
|
|
|
|
strbuf_addf(err, _("path '%s' does not exist"), path);
|
|
|
|
goto out;
|
2020-03-10 16:11:23 +03:00
|
|
|
}
|
2016-08-16 00:53:24 +03:00
|
|
|
|
|
|
|
repo = read_gitfile(ref_git);
|
|
|
|
if (!repo)
|
|
|
|
repo = read_gitfile(mkpath("%s/.git", ref_git));
|
|
|
|
if (repo) {
|
|
|
|
free(ref_git);
|
|
|
|
ref_git = xstrdup(repo);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!repo && is_directory(mkpath("%s/.git/objects", ref_git))) {
|
|
|
|
char *ref_git_git = mkpathdup("%s/.git", ref_git);
|
|
|
|
free(ref_git);
|
|
|
|
ref_git = ref_git_git;
|
|
|
|
} else if (!is_directory(mkpath("%s/objects", ref_git))) {
|
|
|
|
struct strbuf sb = STRBUF_INIT;
|
|
|
|
seen_error = 1;
|
|
|
|
if (get_common_dir(&sb, ref_git)) {
|
|
|
|
strbuf_addf(err,
|
|
|
|
_("reference repository '%s' as a linked "
|
|
|
|
"checkout is not supported yet."),
|
|
|
|
path);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
strbuf_addf(err, _("reference repository '%s' is not a "
|
|
|
|
"local repository."), path);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!access(mkpath("%s/shallow", ref_git), F_OK)) {
|
|
|
|
strbuf_addf(err, _("reference repository '%s' is shallow"),
|
|
|
|
path);
|
|
|
|
seen_error = 1;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!access(mkpath("%s/info/grafts", ref_git), F_OK)) {
|
|
|
|
strbuf_addf(err,
|
|
|
|
_("reference repository '%s' is grafted"),
|
|
|
|
path);
|
|
|
|
seen_error = 1;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
if (seen_error) {
|
2017-06-16 02:15:46 +03:00
|
|
|
FREE_AND_NULL(ref_git);
|
2016-08-16 00:53:24 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
return ref_git;
|
|
|
|
}
|
|
|
|
|
midx: avoid opening multiple MIDXs when writing

Opening multiple instances of the same MIDX can lead to problems like two
separate packed_git structures which represent the same pack being added
to the repository's object store.

The above scenario can happen because prepare_midx_pack() checks if
`m->packs[pack_int_id]` is NULL in order to determine if a pack has been
opened and installed in the repository before. But a caller can
construct two copies of the same MIDX by calling get_multi_pack_index()
and load_multi_pack_index() since the former manipulates the
object store directly but the latter is a lower-level routine which
allocates a new MIDX for each call.

So if prepare_midx_pack() is called on multiple MIDXs with the same
pack_int_id, then that pack will be installed twice in the object
store's packed_git pointer.

This can lead to problems in, e.g., the pack-bitmap code, which does
something like the following (in pack-bitmap.c:open_pack_bitmap()):

    struct bitmap_index *bitmap_git = ...;
    for (p = get_all_packs(r); p; p = p->next) {
        if (open_pack_bitmap_1(bitmap_git, p) == 0)
            ret = 0;
    }

which is a problem if two copies of the same pack exist in the
packed_git list because pack-bitmap.c:open_pack_bitmap_1() contains a
conditional like the following:

    if (bitmap_git->pack || bitmap_git->midx) {
        /* ignore extra bitmap file; we can only handle one */
        warning("ignoring extra bitmap file: %s", packfile->pack_name);
        close(fd);
        return -1;
    }

Avoid this scenario by not letting write_midx_internal() open a MIDX
that isn't also pointed at by the object store. So long as this is the
case, other routines should prefer to open MIDXs with
get_multi_pack_index() or reprepare_packed_git() instead of creating
instances on their own. Because get_multi_pack_index() returns
`r->object_store->multi_pack_index` if it is non-NULL, we'll only have
one instance of a MIDX open at one time, avoiding these problems.

To encourage this, drop the `struct multi_pack_index *` parameter from
`write_midx_internal()`, and rely instead on the `object_dir` to find
(or initialize) the correct MIDX instance.

Likewise, replace the call to `close_midx()` with
`close_object_store()`, since we're about to replace the MIDX with a new
one and should invalidate the object store's memory of any MIDX that
might have existed beforehand.

Note that this now forbids passing object directories that don't belong
to alternate repositories over `--object-dir`, since before we would
have happily opened a MIDX in any directory, but now restrict ourselves
to only those reachable by `r->objects->multi_pack_index` (and alternate
MIDXs that we can see by walking the `next` pointer).

As far as I can tell, supporting arbitrary directories with
`--object-dir` was a historical accident, since even the documentation
says `<alt>` when referring to the value passed to this option.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01 23:34:01 +03:00
|
|
|
struct object_directory *find_odb(struct repository *r, const char *obj_dir)
|
|
|
|
{
|
|
|
|
struct object_directory *odb;
|
|
|
|
char *obj_dir_real = real_pathdup(obj_dir, 1);
|
|
|
|
struct strbuf odb_path_real = STRBUF_INIT;
|
|
|
|
|
|
|
|
prepare_alt_odb(r);
|
|
|
|
for (odb = r->objects->odb; odb; odb = odb->next) {
|
|
|
|
strbuf_realpath(&odb_path_real, odb->path, 1);
|
|
|
|
if (!strcmp(obj_dir_real, odb_path_real.buf))
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
free(obj_dir_real);
|
|
|
|
strbuf_release(&odb_path_real);
|
|
|
|
|
|
|
|
if (!odb)
|
|
|
|
die(_("could not find object directory matching %s"), obj_dir);
|
|
|
|
return odb;
|
|
|
|
}
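find_odb() matches directories by comparing their canonical forms rather than the strings as given. A rough equivalent using POSIX realpath() is sketched below. `same_real_path()` is a hypothetical helper, and unlike real_pathdup() with its die-on-error flag set, it reports an unresolvable path with -1 instead of dying.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* for realpath() on strict compilers */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if two paths name the same location
 * after resolving symlinks, "." and "..", 0 if they differ, and -1 if
 * either path cannot be resolved.
 */
static int same_real_path(const char *a, const char *b)
{
	/* With a NULL second argument, realpath() allocates the result. */
	char *ra = realpath(a, NULL);
	char *rb = realpath(b, NULL);
	int ret = (ra && rb) ? !strcmp(ra, rb) : -1;

	free(ra);	/* free(NULL) is a no-op */
	free(rb);
	return ret;
}
```

Comparing resolved paths means that "objects", "./objects" and a symlink pointing at the same directory are all treated as the same entry.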
|
|
|
|
|
2019-07-01 16:17:40 +03:00
|
|
|
static void fill_alternate_refs_command(struct child_process *cmd,
|
|
|
|
const char *repo_path)
|
|
|
|
{
|
|
|
|
const char *value;
|
|
|
|
|
|
|
|
if (!git_config_get_value("core.alternateRefsCommand", &value)) {
|
|
|
|
cmd->use_shell = 1;
|
|
|
|
|
2020-07-28 23:25:12 +03:00
|
|
|
strvec_push(&cmd->args, value);
|
|
|
|
strvec_push(&cmd->args, repo_path);
|
2019-07-01 16:17:40 +03:00
|
|
|
} else {
|
|
|
|
cmd->git_cmd = 1;
|
|
|
|
|
2020-07-28 23:25:12 +03:00
|
|
|
strvec_pushf(&cmd->args, "--git-dir=%s", repo_path);
|
|
|
|
strvec_push(&cmd->args, "for-each-ref");
|
|
|
|
strvec_push(&cmd->args, "--format=%(objectname)");
|
2019-07-01 16:17:40 +03:00
|
|
|
|
|
|
|
if (!git_config_get_value("core.alternateRefsPrefixes", &value)) {
|
2020-07-28 23:25:12 +03:00
|
|
|
strvec_push(&cmd->args, "--");
|
|
|
|
strvec_split(&cmd->args, value);
|
2019-07-01 16:17:40 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-06-02 12:09:50 +03:00
|
|
|
strvec_pushv(&cmd->env, (const char **)local_repo_env);
|
2019-07-01 16:17:40 +03:00
|
|
|
cmd->out = -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void read_alternate_refs(const char *path,
|
|
|
|
alternate_ref_fn *cb,
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
struct child_process cmd = CHILD_PROCESS_INIT;
|
|
|
|
struct strbuf line = STRBUF_INIT;
|
|
|
|
FILE *fh;
|
|
|
|
|
|
|
|
fill_alternate_refs_command(&cmd, path);
|
|
|
|
|
|
|
|
if (start_command(&cmd))
|
|
|
|
return;
|
|
|
|
|
|
|
|
fh = xfdopen(cmd.out, "r");
|
|
|
|
while (strbuf_getline_lf(&line, fh) != EOF) {
|
|
|
|
struct object_id oid;
|
|
|
|
const char *p;
|
|
|
|
|
|
|
|
if (parse_oid_hex(line.buf, &oid, &p) || *p) {
|
|
|
|
warning(_("invalid line while parsing alternate refs: %s"),
|
|
|
|
line.buf);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
cb(&oid, data);
|
|
|
|
}
|
|
|
|
|
|
|
|
fclose(fh);
|
|
|
|
finish_command(&cmd);
|
2019-08-07 14:15:25 +03:00
|
|
|
strbuf_release(&line);
|
2019-07-01 16:17:40 +03:00
|
|
|
}
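read_alternate_refs() expects one full object name per line and stops at the first line that does not parse. The sketch below imitates that loop with standard C only. It assumes 40-digit SHA-1 names, and `is_full_hex_oid()` and `count_oid_lines()` are hypothetical stand-ins for parse_oid_hex() and the callback invocation.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define HEX_LEN 40	/* assumption: SHA-1 sized object names */

/* Return 1 if `line` is exactly HEX_LEN hex digits, else 0. */
static int is_full_hex_oid(const char *line)
{
	size_t i;

	/* A short line fails at its NUL, so we never read past the end. */
	for (i = 0; i < HEX_LEN; i++)
		if (!isxdigit((unsigned char)line[i]))
			return 0;
	return line[HEX_LEN] == '\0';	/* reject trailing garbage */
}

/* Count valid object names read from `fh`, stopping at the first bad line. */
static int count_oid_lines(FILE *fh)
{
	char line[256];
	int count = 0;

	while (fgets(line, sizeof(line), fh)) {
		line[strcspn(line, "\n")] = '\0';
		if (!is_full_hex_oid(line))
			break;
		count++;
	}
	return count;
}
```

The trailing-character check plays the role of the `|| *p` test above: a valid prefix followed by anything else is still treated as an invalid line.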
|
|
|
|
|
|
|
|
struct alternate_refs_data {
|
|
|
|
alternate_ref_fn *fn;
|
|
|
|
void *data;
|
|
|
|
};
|
|
|
|
|
|
|
|
static int refs_from_alternate_cb(struct object_directory *e,
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
struct strbuf path = STRBUF_INIT;
|
|
|
|
size_t base_len;
|
|
|
|
struct alternate_refs_data *cb = data;
|
|
|
|
|
|
|
|
if (!strbuf_realpath(&path, e->path, 0))
|
|
|
|
goto out;
|
|
|
|
if (!strbuf_strip_suffix(&path, "/objects"))
|
|
|
|
goto out;
|
|
|
|
base_len = path.len;
|
|
|
|
|
|
|
|
/* Is this a git repository with refs? */
|
|
|
|
strbuf_addstr(&path, "/refs");
|
|
|
|
if (!is_directory(path.buf))
|
|
|
|
goto out;
|
|
|
|
strbuf_setlen(&path, base_len);
|
|
|
|
|
|
|
|
read_alternate_refs(path.buf, cb->fn, cb->data);
|
|
|
|
|
|
|
|
out:
|
|
|
|
strbuf_release(&path);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
void for_each_alternate_ref(alternate_ref_fn fn, void *data)
|
|
|
|
{
|
|
|
|
struct alternate_refs_data cb;
|
|
|
|
cb.fn = fn;
|
|
|
|
cb.data = data;
|
|
|
|
foreach_alt_odb(refs_from_alternate_cb, &cb);
|
|
|
|
}
|
|
|
|
|
2014-10-16 02:33:13 +04:00
|
|
|
int foreach_alt_odb(alt_odb_fn fn, void *cb)
|
push: receiver end advertises refs from alternate repositories

Earlier, when pushing into a repository that borrows from alternate object
stores, we followed the longstanding design decision not to trust refs in
the alternate repository that houses the object store we are borrowing
from. If your public repository is borrowing from Linus's public
repository, you pushed into it a long time ago, and now when you try to push
your updated history that is in sync with more recent history from Linus,
you will end up sending not just your own development, but also the
changes you acquired through Linus's tree, even though the objects needed
for the latter already exist at the receiving end. This is because the
receiving end does not advertise that the objects only reachable from the
borrowed repository (i.e. Linus's) are already available there.

This solves the issue by making the receiving end advertise refs from
borrowed repositories. They are not sent with their true names but with a
phoney name ".have" to make sure that the old senders will safely ignore
them (otherwise, the old senders will misbehave, trying to push matching
refs, and mirror push that deletes refs that only exist at the receiving
end).

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-09-09 12:27:10 +04:00
|
|
|
{
|
2018-11-12 17:48:47 +03:00
|
|
|
struct object_directory *ent;
|
2014-10-16 02:33:13 +04:00
|
|
|
int r = 0;
|
2008-09-09 12:27:10 +04:00
|
|
|
|
2018-03-23 20:21:07 +03:00
|
|
|
prepare_alt_odb(the_repository);
|
2018-11-12 17:50:39 +03:00
|
|
|
for (ent = the_repository->objects->odb->next; ent; ent = ent->next) {
|
2014-10-16 02:33:13 +04:00
|
|
|
r = fn(ent, cb);
|
|
|
|
if (r)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return r;
|
2008-09-09 12:27:10 +04:00
|
|
|
}
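foreach_alt_odb() walks every entry after the first (the main store) and lets the callback end the walk early by returning nonzero. The same shape in isolation might look like the sketch below, where `struct dir_entry`, `for_each_after_first()` and `count_up_to_two()` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct dir_entry {		/* hypothetical stand-in for object_directory */
	const char *path;
	struct dir_entry *next;
};

typedef int (*dir_entry_fn)(struct dir_entry *, void *);

/*
 * Call `fn` on every entry after the first (the first is the main
 * store). A nonzero return from `fn` stops the walk and is returned.
 */
static int for_each_after_first(struct dir_entry *head, dir_entry_fn fn,
				void *cb)
{
	struct dir_entry *ent;
	int r = 0;

	for (ent = head ? head->next : NULL; ent; ent = ent->next) {
		r = fn(ent, cb);
		if (r)
			break;
	}
	return r;
}

/* Example callback: count entries, stop once the count reaches two. */
static int count_up_to_two(struct dir_entry *ent, void *cb)
{
	int *count = cb;

	(void)ent;	/* the entry itself is not needed for counting */
	return ++*count >= 2;
}
```

Passing the caller's state through a `void *` keeps the iterator generic, which is how for_each_alternate_ref() reuses it with its own struct alternate_refs_data.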
|
|
|
|
|
2018-03-23 20:21:09 +03:00
|
|
|
void prepare_alt_odb(struct repository *r)
|
2006-05-07 22:19:21 +04:00
|
|
|
{
|
sha1-file: use an object_directory for the main object dir

Our handling of alternate object directories is needlessly different
from the main object directory. As a result, many places in the code
basically look like this:
do_something(r->objects->objdir);
for (odb = r->objects->alt_odb_list; odb; odb = odb->next)
do_something(odb->path);
That gets annoying when do_something() is non-trivial, and we've
resorted to gross hacks like creating fake alternates (see
find_short_object_filename()).
Instead, let's give each raw_object_store a unified list of
object_directory structs. The first will be the main store, and
everything after is an alternate. Very few callers even care about the
distinction, and can just loop over the whole list (and those who care
can just treat the first element differently).
A few observations:
- we don't need r->objects->objectdir anymore, and can just
mechanically convert that to r->objects->odb->path
- object_directory's path field needs to become a real pointer rather
than a FLEX_ARRAY, in order to fill it with expand_base_dir()
- we'll call prepare_alt_odb() earlier in many functions (i.e.,
outside of the loop). This may result in us calling it even when our
function would be satisfied looking only at the main odb.
But this doesn't matter in practice. It's not a very expensive
operation in the first place, and in the majority of cases it will
be a noop. We call it already (and cache its results) in
prepare_packed_git(), and we'll generally check packs before loose
objects. So essentially every program is going to call it
immediately once per program.
Arguably we should just prepare_alt_odb() immediately upon setting
up the repository's object directory, which would save us sprinkling
calls throughout the code base (and forgetting to do so has been a
source of subtle bugs in the past). But I've stopped short of that
here, since there are already a lot of other moving parts in this
patch.
- Most call sites just get shorter. The check_and_freshen() functions
are an exception, because they have entry points to handle local and
nonlocal directories separately.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
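
The unified list described in this message can be pictured with a small, self-contained sketch. The `demo_odb` struct and the `demo_*` helpers are hypothetical stand-ins invented for illustration; they only mirror the shape of `struct object_directory` (a `next` pointer and a `path`) and are not git's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct object_directory. */
struct demo_odb {
	struct demo_odb *next;
	const char *path;
};

/*
 * Visit the main store and every alternate with a single loop;
 * no special case is needed for the first element.
 */
static int demo_count_odbs(const struct demo_odb *head)
{
	int n = 0;
	const struct demo_odb *odb;

	for (odb = head; odb; odb = odb->next)
		n++;
	return n;
}

/*
 * Callers that do care about the distinction look past the first
 * element: anything after the main store is an alternate.
 */
static int demo_has_alternates(const struct demo_odb *head)
{
	assert(head != NULL);	/* assumption: the main store always exists */
	return head->next != NULL;
}
```

With a two-element list (main store followed by one alternate), `demo_count_odbs()` visits both entries and `demo_has_alternates()` reports 1, which is the same test `has_alt_odb()` performs with `!!r->objects->odb->next`.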
2018-11-12 17:50:39 +03:00
|
|
|
if (r->objects->loaded_alternates)
|
2007-05-26 09:24:40 +04:00
|
|
|
return;
|
|
|
|
|
2018-03-23 20:21:09 +03:00
|
|
|
link_alt_odb_entries(r, r->objects->alternate_db, PATH_SEP, NULL, 0);
|
2006-05-07 22:19:21 +04:00
|
|
|
|
sha1-file: use an object_directory for the main object dir
2018-11-12 17:50:39 +03:00
|
|
|
read_info_alternates(r, r->objects->odb->path, 0);
|
|
|
|
r->objects->loaded_alternates = 1;
|
2006-05-07 22:19:21 +04:00
|
|
|
}
|
|
|
|
|
2023-04-14 09:02:12 +03:00
|
|
|
int has_alt_odb(struct repository *r)
|
|
|
|
{
|
|
|
|
prepare_alt_odb(r);
|
|
|
|
return !!r->objects->odb->next;
|
|
|
|
}
|
|
|
|
|
2017-03-15 21:43:05 +03:00
|
|
|
#define CAP_GET (1u<<0)
|
|
|
|
|
|
|
|
static int subprocess_map_initialized;
|
|
|
|
static struct hashmap subprocess_map;
|
|
|
|
|
|
|
|
struct read_object_process {
|
|
|
|
struct subprocess_entry subprocess;
|
|
|
|
unsigned int supported_capabilities;
|
|
|
|
};
|
|
|
|
|
|
|
|
static int start_read_object_fn(struct subprocess_entry *subprocess)
|
|
|
|
{
|
|
|
|
struct read_object_process *entry = (struct read_object_process *)subprocess;
|
|
|
|
static int versions[] = {1, 0};
|
|
|
|
static struct subprocess_capability capabilities[] = {
|
|
|
|
{ "get", CAP_GET },
|
|
|
|
{ NULL, 0 }
|
|
|
|
};
|
|
|
|
|
|
|
|
return subprocess_handshake(subprocess, "git-read-object", versions,
|
|
|
|
NULL, capabilities,
|
|
|
|
&entry->supported_capabilities);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int read_object_process(const struct object_id *oid)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
struct read_object_process *entry;
|
|
|
|
struct child_process *process;
|
|
|
|
struct strbuf status = STRBUF_INIT;
|
|
|
|
const char *cmd = find_hook("read-object");
|
|
|
|
uint64_t start;
|
|
|
|
|
|
|
|
start = getnanotime();
|
|
|
|
|
|
|
|
if (!subprocess_map_initialized) {
|
|
|
|
subprocess_map_initialized = 1;
|
|
|
|
hashmap_init(&subprocess_map, (hashmap_cmp_fn)cmd2process_cmp,
|
|
|
|
NULL, 0);
|
|
|
|
entry = NULL;
|
|
|
|
} else {
|
|
|
|
entry = (struct read_object_process *) subprocess_find_entry(&subprocess_map, cmd);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!entry) {
|
|
|
|
entry = xmalloc(sizeof(*entry));
|
|
|
|
entry->supported_capabilities = 0;
|
|
|
|
|
|
|
|
if (subprocess_start(&subprocess_map, &entry->subprocess, cmd,
|
|
|
|
start_read_object_fn)) {
|
|
|
|
free(entry);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
process = &entry->subprocess.process;
|
|
|
|
|
|
|
|
if (!(CAP_GET & entry->supported_capabilities))
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
sigchain_push(SIGPIPE, SIG_IGN);
|
|
|
|
|
|
|
|
err = packet_write_fmt_gently(process->in, "command=get\n");
|
|
|
|
if (err)
|
|
|
|
goto done;
|
|
|
|
|
|
|
|
err = packet_write_fmt_gently(process->in, "sha1=%s\n", oid_to_hex(oid));
|
|
|
|
if (err)
|
|
|
|
goto done;
|
|
|
|
|
|
|
|
err = packet_flush_gently(process->in);
|
|
|
|
if (err)
|
|
|
|
goto done;
|
|
|
|
|
|
|
|
err = subprocess_read_status(process->out, &status);
|
|
|
|
err = err ? err : strcmp(status.buf, "success");
|
|
|
|
|
|
|
|
done:
|
|
|
|
sigchain_pop(SIGPIPE);
|
|
|
|
|
|
|
|
if (err || errno == EPIPE) {
|
|
|
|
err = err ? err : errno;
|
|
|
|
if (!strcmp(status.buf, "error")) {
|
|
|
|
/* The process signaled a problem with the file. */
|
|
|
|
}
|
|
|
|
else if (!strcmp(status.buf, "abort")) {
|
|
|
|
/*
|
|
|
|
* The process signaled a permanent problem. Don't try to read
|
|
|
|
* objects with the same command for the lifetime of the current
|
|
|
|
* Git process.
|
|
|
|
*/
|
|
|
|
entry->supported_capabilities &= ~CAP_GET;
|
|
|
|
}
|
|
|
|
else {
|
|
|
|
/*
|
|
|
|
* Something went wrong with the read-object process.
|
|
|
|
* Force shutdown and restart if needed.
|
|
|
|
*/
|
|
|
|
error("external process '%s' failed", cmd);
|
|
|
|
subprocess_stop(&subprocess_map,
|
|
|
|
(struct subprocess_entry *)entry);
|
|
|
|
free(entry);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
trace_performance_since(start, "read_object_process");
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
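
The request written by read_object_process() above is three packets: "command=get", "sha1=<hex>", and a flush. As a rough sketch of the framing, assuming the usual pkt-line layout (four lowercase hex digits giving the total length including the prefix itself, with "0000" as the flush packet), the bytes on the wire could be built as follows. The `demo_*` names are hypothetical and this does not reproduce git's `packet_write_fmt_gently()`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Encode one pkt-line into out: 4 hex digits of total length
 * (prefix included), then the payload. Returns the number of bytes
 * written (excluding the trailing NUL), or -1 if it does not fit.
 */
static int demo_pkt_line(char *out, size_t outsz, const char *payload)
{
	size_t len;

	assert(out && payload);
	len = strlen(payload) + 4;
	if (len > 65520 || len + 1 > outsz)
		return -1;
	snprintf(out, outsz, "%04zx%s", len, payload);
	return (int)len;
}

/* Build "command=get", "sha1=<hex>" and a flush packet back to back. */
static int demo_build_get_request(char *out, size_t outsz, const char *hex)
{
	char line[128];
	int n, total = 0;

	n = demo_pkt_line(out, outsz, "command=get\n");
	if (n < 0)
		return -1;
	total += n;

	if (snprintf(line, sizeof(line), "sha1=%s\n", hex) >= (int)sizeof(line))
		return -1;
	n = demo_pkt_line(out + total, outsz - total, line);
	if (n < 0)
		return -1;
	total += n;

	if (outsz - total < 5)
		return -1;
	memcpy(out + total, "0000", 5);	/* flush packet, plus NUL */
	return total + 4;
}
```

For the placeholder id "abc" (not a real object id), the buffer holds `0010command=get\n000dsha1=abc\n0000`.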
|
|
|
|
|
check_and_freshen_file: fix reversed success-check
When we want to write out a loose object file, we have
always first made sure we don't already have the object
somewhere. Since 33d4221 (write_sha1_file: freshen existing
objects, 2014-10-15), we also update the timestamp on the
file, so that a simultaneous prune knows somebody is
likely to reference it soon.

If our utime() call fails, we treat this the same as not
having the object in the first place; the safe thing to do
is write out another copy. However, the loose-object check
accidentally inverts the utime() check; it returns failure
_only_ when the utime() call actually succeeded. Thus it was
failing to protect us there, and in the normal case where
utime() succeeds, it caused us to pointlessly write out and
link the object.

This passed our freshening tests, because writing out the
new object is certainly _one_ way of updating its utime. So
the normal case was inefficient, but not wrong.

While we're here, let's also drop a comment in front of the
check_and_freshen functions, making a note of their return
type (since it is not our usual "0 for success, -1 for
error").
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
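
The return convention this message describes (1 means "exists and was freshened", 0 means "write another copy") is easy to get backwards, so here is a minimal sketch of the caller's side. The `demo_*` functions are hypothetical illustrations that follow the shape of check_and_freshen_file(); they are not the functions in this file.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>
#include <utime.h>

/* Returns 1 if fn exists and (when requested) its mtime was updated, else 0. */
static int demo_check_and_freshen_file(const char *fn, int freshen)
{
	assert(fn != NULL);
	if (access(fn, F_OK))
		return 0;	/* file does not exist */
	if (freshen && utime(fn, NULL))
		return 0;	/* utime() failed: treat as "not freshened" */
	return 1;
}

/* Returns 1 when the caller must write a new copy of the object. */
static int demo_must_write(const char *fn)
{
	return !demo_check_and_freshen_file(fn, 1);
}

/* Hypothetical helper for trying the sketch: create an empty file. */
static int demo_create_empty(const char *fn)
{
	FILE *f = fopen(fn, "w");

	if (!f)
		return -1;
	return fclose(f);
}
```

The reversed check that this commit fixed would correspond to returning 0 when utime() succeeds, so every existing object would be rewritten and a failed utime() would be silently accepted.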
2015-07-08 23:33:52 +03:00
|
|
|
/* Returns 1 if we have successfully freshened the file, 0 otherwise. */
|
2014-10-16 02:42:22 +04:00
|
|
|
static int freshen_file(const char *fn)
|
2005-05-07 11:38:04 +04:00
|
|
|
{
|
2020-04-14 17:27:26 +03:00
|
|
|
return !utime(fn, NULL);
|
2008-11-10 08:59:57 +03:00
|
|
|
}
|
2005-05-07 11:38:04 +04:00
|
|
|
|
check_and_freshen_file: fix reversed success-check
2015-07-08 23:33:52 +03:00
|
|
|
/*
|
|
|
|
* All of the check_and_freshen functions return 1 if the file exists and was
|
|
|
|
* freshened (if freshening was requested), 0 otherwise. If they return
|
|
|
|
* 0, you should not assume that it is safe to skip a write of the object (it
|
|
|
|
* either does not exist on disk, or has a stale mtime and may be subject to
|
|
|
|
* pruning).
|
|
|
|
*/
|
2017-02-27 21:00:11 +03:00
|
|
|
int check_and_freshen_file(const char *fn, int freshen)
|
2014-10-16 02:42:22 +04:00
|
|
|
{
|
|
|
|
if (access(fn, F_OK))
|
|
|
|
return 0;
|
check_and_freshen_file: fix reversed success-check
2015-07-08 23:33:52 +03:00
|
|
|
if (freshen && !freshen_file(fn))
|
2014-10-16 02:42:22 +04:00
|
|
|
return 0;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
sha1-file: use an object_directory for the main object dir
2018-11-12 17:50:39 +03:00
|
|
|
static int check_and_freshen_odb(struct object_directory *odb,
|
|
|
|
const struct object_id *oid,
|
|
|
|
int freshen)
|
2014-10-16 02:42:22 +04:00
|
|
|
{
|
sha1-file: use an object_directory for the main object dir
2018-11-12 17:50:39 +03:00
|
|
|
static struct strbuf path = STRBUF_INIT;
|
sha1-file: modernize loose object file functions
The loose object access code in sha1-file.c is some of the oldest in
Git, and could use some modernizing. It mostly uses "unsigned char *"
for object ids, which these days should be "struct object_id".
It also uses the term "sha1_file" in many functions, which is confusing.
The term "loose_objects" is much better. It clearly distinguishes
them from packed objects (which didn't even exist back when the name
"sha1_file" came into being). And it also distinguishes it from the
checksummed-file concept in csum-file.c (which until recently was
actually called "struct sha1file"!).
This patch converts the functions {open,close,map,stat}_sha1_file() into
open_loose_object(), etc, and switches their sha1 arguments for
object_id structs. Similarly, path functions like fill_sha1_path()
become fill_loose_path() and use object_ids.
The function sha1_loose_object_info() already says "loose", so we can
just drop the "sha1" (and teach it to use object_id).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-07 11:35:42 +03:00
|
|
|
odb_loose_path(odb, &path, oid);
|
sha1-file: use an object_directory for the main object dir
2018-11-12 17:50:39 +03:00
|
|
|
return check_and_freshen_file(path.buf, freshen);
|
|
|
|
}
|
2018-01-17 20:54:54 +03:00
|
|
|
|
sha1-file: use an object_directory for the main object dir
2018-11-12 17:50:39 +03:00
|
|
|
static int check_and_freshen_local(const struct object_id *oid, int freshen)
|
|
|
|
{
|
|
|
|
return check_and_freshen_odb(the_repository->objects->odb, oid, freshen);
|
2014-10-16 02:42:22 +04:00
|
|
|
}
|
|
|
|
|
2018-05-02 03:25:34 +03:00
|
|
|
static int check_and_freshen_nonlocal(const struct object_id *oid, int freshen)
|
2008-11-10 08:59:57 +03:00
|
|
|
{
|
2018-11-12 17:48:47 +03:00
|
|
|
struct object_directory *odb;
|
2018-11-12 17:49:35 +03:00
|
|
|
|
2018-03-23 20:21:07 +03:00
|
|
|
prepare_alt_odb(the_repository);
|
sha1-file: use an object_directory for the main object dir
2018-11-12 17:50:39 +03:00
|
|
|
for (odb = the_repository->objects->odb->next; odb; odb = odb->next) {
|
|
|
|
if (check_and_freshen_odb(odb, oid, freshen))
|
2008-06-14 22:43:01 +04:00
|
|
|
return 1;
|
2005-05-07 11:38:04 +04:00
|
|
|
}
|
2008-06-14 22:43:01 +04:00
|
|
|
return 0;
|
2005-05-07 11:38:04 +04:00
|
|
|
}
|
|
|
|
|
2017-09-08 12:32:43 +03:00
|
|
|
static int check_and_freshen(const struct object_id *oid, int freshen,
|
|
|
|
int skip_virtualized_objects)
|
2014-10-16 02:42:22 +04:00
|
|
|
{
|
2017-03-15 21:43:05 +03:00
|
|
|
int ret;
|
|
|
|
int tried_hook = 0;
|
|
|
|
|
|
|
|
retry:
|
|
|
|
ret = check_and_freshen_local(oid, freshen) ||
|
2018-05-02 03:25:34 +03:00
|
|
|
check_and_freshen_nonlocal(oid, freshen);
|
2017-09-08 12:32:43 +03:00
|
|
|
if (!ret && core_virtualize_objects && !skip_virtualized_objects &&
|
|
|
|
!tried_hook) {
|
2017-03-15 21:43:05 +03:00
|
|
|
tried_hook = 1;
|
|
|
|
if (!read_object_process(oid))
|
|
|
|
goto retry;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
2014-10-16 02:42:22 +04:00
|
|
|
}
|
|
|
|
|
2018-05-02 03:25:34 +03:00
|
|
|
int has_loose_object_nonlocal(const struct object_id *oid)
|
2014-10-16 02:42:22 +04:00
|
|
|
{
|
2018-05-02 03:25:34 +03:00
|
|
|
return check_and_freshen_nonlocal(oid, 0);
|
2014-10-16 02:42:22 +04:00
|
|
|
}
|
|
|
|
|
builtin/pack-objects.c: --cruft without expiration
Teach `pack-objects` how to generate a cruft pack when no objects are
dropped (i.e., `--cruft-expiration=never`). Later patches will teach
`pack-objects` how to generate a cruft pack that prunes objects.

When generating a cruft pack which does not prune objects, we want to
collect all unreachable objects into a single pack (noting and updating
their mtimes as we accumulate them). Ordinary use will pass the result
of a `git repack -A` as a kept pack, so when this patch says "kept
pack", readers should think "reachable objects".

Generating a non-expiring cruft pack works as follows:

  - Callers provide a list of every pack they know about, and indicate
    which packs are about to be removed.

  - All packs which are going to be removed (we'll call these the
    redundant ones) are marked as kept in-core.

    Any packs the caller did not mention (but are known to the
    `pack-objects` process) are also marked as kept in-core. Packs not
    mentioned by the caller are assumed to be unknown to them, i.e.,
    they entered the repository after the caller decided which packs
    should be kept and which should be discarded.

    Since we do not want to include objects in these "unknown" packs
    (because we don't know which of their objects are or aren't
    reachable), these are also marked as kept in-core.

  - Then, we enumerate all objects in the repository, and add them to
    our packing list if they do not appear in an in-core kept pack.

This results in a new cruft pack which contains all known objects that
aren't included in the kept packs. When the kept pack is the result of
`git repack -A`, the resulting pack contains all unreachable objects.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
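
The final enumeration step above ("add them to our packing list if they do not appear in an in-core kept pack") can be sketched in isolation. This is an illustration only: `demo_pack`, `demo_copy` and both functions are hypothetical, the kept-in-core flags are assumed to have been set by the earlier steps, and real object ids are replaced by short placeholder strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pack record; kept_in_core is assumed to be set already. */
struct demo_pack {
	const char *name;
	int kept_in_core;
};

/* One on-disk copy of an object: pack is an index, or -1 for loose. */
struct demo_copy {
	const char *oid;
	int pack;
};

/* Does any copy of oid live in a pack that is marked kept in-core? */
static int demo_in_kept_pack(const char *oid, const struct demo_copy *copies,
			     size_t n, const struct demo_pack *packs)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (copies[i].pack >= 0 &&
		    packs[copies[i].pack].kept_in_core &&
		    !strcmp(copies[i].oid, oid))
			return 1;
	return 0;
}

/*
 * Collect each object once if no copy of it is in a kept pack.
 * out must have room for n entries. Returns the number collected.
 */
static size_t demo_collect_cruft(const struct demo_copy *copies, size_t n,
				 const struct demo_pack *packs,
				 const char **out)
{
	size_t i, j, nout = 0;

	assert(copies && packs && out);
	for (i = 0; i < n; i++) {
		if (demo_in_kept_pack(copies[i].oid, copies, n, packs))
			continue;
		for (j = 0; j < nout; j++)
			if (!strcmp(out[j], copies[i].oid))
				break;
		if (j == nout)	/* not collected yet */
			out[nout++] = copies[i].oid;
	}
	return nout;
}
```

The quadratic scans keep the sketch short; a real implementation would presumably rely on indexed lookups instead.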
2022-05-21 02:17:52 +03:00
|
|
|
int has_loose_object(const struct object_id *oid)
|
2008-11-10 08:59:57 +03:00
|
|
|
{
|
2017-09-08 12:32:43 +03:00
|
|
|
return check_and_freshen(oid, 0, 0);
|
2008-11-10 08:59:57 +03:00
|
|
|
}
|
|
|
|
|
2014-08-26 19:23:23 +04:00
|
|
|
static void mmap_limit_check(size_t length)
|
|
|
|
{
|
|
|
|
static size_t limit = 0;
|
|
|
|
if (!limit) {
|
|
|
|
limit = git_env_ulong("GIT_MMAP_LIMIT", 0);
|
|
|
|
if (!limit)
|
|
|
|
limit = SIZE_MAX;
|
|
|
|
}
|
|
|
|
if (length > limit)
|
2018-07-21 10:49:39 +03:00
|
|
|
die(_("attempting to mmap %"PRIuMAX" over limit %"PRIuMAX),
|
2014-08-26 19:23:23 +04:00
|
|
|
(uintmax_t)length, (uintmax_t)limit);
|
|
|
|
}
|
|
|
|
|
2015-05-28 10:56:15 +03:00
|
|
|
void *xmmap_gently(void *start, size_t length,
|
|
|
|
int prot, int flags, int fd, off_t offset)
|
2010-11-06 14:44:11 +03:00
|
|
|
{
|
2014-08-26 19:23:23 +04:00
|
|
|
void *ret;
|
|
|
|
|
|
|
|
mmap_limit_check(length);
|
|
|
|
ret = mmap(start, length, prot, flags, fd, offset);
|
packfile: drop release_pack_memory()
Long ago, in 97bfeb34df (Release pack windows before reporting out of
memory., 2006-12-24), we taught xmalloc() and friends to try unmapping
pack windows when malloc() failed. It's unlikely that this helps a lot in
practice, and it has some downsides. First, the downsides:

  1. It makes xmalloc() not thread-safe. We've worked around this in
     pack-objects.c, which installs its own locking version of the
     try_to_free_routine(). But other threaded code doesn't.

  2. It makes the system as a whole harder to reason about. Functions
     which allocate heap memory under the hood may have farther-reaching
     effects than expected.

That might be worth the tradeoff if there's a benefit. But in practice,
it seems unlikely. We're generally dealing with mmap'd files, so the OS
is going to do a much better job at responding to memory pressure by
dropping individual pages (the exception is systems with NO_MMAP, but
even there the OS can probably respond just as well with swapping).

So the only thing we're really freeing is address space. On 64-bit
systems, we have plenty of that to go around. On 32-bit systems, it
could possibly help. But around the same time we made two other changes:
77ccc5bbd1 (Introduce new config option for mmap limit., 2006-12-23) and
60bb8b1453 (Fully activate the sliding window pack access., 2006-12-23).
Together that means that a 32-bit system should have no more than 256MB
total of packed-git mmaps at one time, split between a few 32MB windows.
It's unlikely we have any address space problems since then, but we
don't have any data since the features were all added at the same time.

Likewise, xmmap() will try to free memory. At first glance, it seems
like we'd need this (when we try to mmap a new window, we might need to
close an old one to save address space on a 32-bit system). But we're
saved again by core.packedGitLimit: if we're going to exceed our 256MB
limit, we'll close an existing window before we even call mmap().

So it seems unlikely that this feature is actually doing anything
useful. And while we don't have reports of it harming anything (probably
because it rarely if ever kicks in), it would be nice to simplify the
system overall. This patch drops the whole try_to_free system from
xmalloc(), as well as the manual pack memory release in xmmap().
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-08-12 23:50:21 +03:00
|
|
|
if (ret == MAP_FAILED && !length)
|
|
|
|
ret = NULL;
|
2010-11-06 14:44:11 +03:00
|
|
|
return ret;
|
|
|
|
}
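
The `ret == MAP_FAILED && !length` case in xmmap_gently() above covers zero-length requests, which POSIX says mmap() rejects. A small sketch of that behaviour follows, assuming a POSIX system where a zero-length mmap() fails with MAP_FAILED; the `demo_*` names are hypothetical and the sketch omits the limit check.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>

/* Like xmmap_gently(), but read-only and private, and without the limit check. */
static void *demo_mmap_gently(size_t length, int fd)
{
	void *ret = mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0);

	if (ret == MAP_FAILED && !length)
		ret = NULL;	/* empty mapping: report "nothing", not failure */
	return ret;
}

/*
 * Map a 4-byte temporary file, then ask for a zero-length mapping.
 * Returns 1 if the zero-length request came back as NULL, 0 if it
 * did not, and -1 if the setup itself failed.
 */
static int demo_zero_length_is_null(void)
{
	FILE *f = tmpfile();
	void *p;

	if (!f)
		return -1;
	if (fputs("data", f) == EOF || fflush(f)) {
		fclose(f);
		return -1;
	}
	p = demo_mmap_gently(4, fileno(f));
	if (!p || p == MAP_FAILED) {
		fclose(f);
		return -1;
	}
	assert(((const char *)p)[0] == 'd');
	munmap(p, 4);

	p = demo_mmap_gently(0, fileno(f));
	fclose(f);
	return p == NULL;
}
```

Callers such as xmmap() only die on MAP_FAILED, so mapping an empty file yields a NULL pointer with length 0 rather than a fatal error.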
|
|
|
|
|
2021-06-30 03:01:32 +03:00
|
|
|
const char *mmap_os_err(void)
|
|
|
|
{
|
|
|
|
static const char blank[] = "";
|
|
|
|
#if defined(__linux__)
|
|
|
|
if (errno == ENOMEM) {
|
|
|
|
/* this continues an existing error message: */
|
|
|
|
static const char enomem[] =
|
|
|
|
", check sys.vm.max_map_count and/or RLIMIT_DATA";
|
|
|
|
return enomem;
|
|
|
|
}
|
|
|
|
#endif /* OS-specific bits */
|
|
|
|
return blank;
|
|
|
|
}
|
|
|
|
|
2015-05-28 10:56:15 +03:00
|
|
|
void *xmmap(void *start, size_t length,
|
|
|
|
int prot, int flags, int fd, off_t offset)
|
|
|
|
{
|
|
|
|
void *ret = xmmap_gently(start, length, prot, flags, fd, offset);
|
|
|
|
if (ret == MAP_FAILED)
|
2021-06-30 03:01:32 +03:00
|
|
|
die_errno(_("mmap failed%s"), mmap_os_err());
|
2015-05-28 10:56:15 +03:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-02-05 02:48:25 +03:00
|
|
|
static int format_object_header_literally(char *str, size_t size,
|
|
|
|
const char *type, size_t objsize)
|
|
|
|
{
|
|
|
|
return xsnprintf(str, size, "%s %"PRIuMAX, type, (uintmax_t)objsize) + 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
int format_object_header(char *str, size_t size, enum object_type type,
|
|
|
|
size_t objsize)
|
|
|
|
{
|
|
|
|
const char *name = type_name(type);
|
|
|
|
|
|
|
|
if (!name)
|
|
|
|
BUG("could not get a type name for 'enum object_type' value %d", type);
|
|
|
|
|
|
|
|
return format_object_header_literally(str, size, name, objsize);
|
|
|
|
}
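The "+ 1" in format_object_header_literally() counts the NUL terminator as part of the header. A minimal sketch of that convention follows, assuming plain snprintf() in place of git's xsnprintf(); the name sketch_format_header() is hypothetical and is not part of git.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for format_object_header_literally(), using plain
 * snprintf() instead of git's xsnprintf(). Returns the header length
 * including the trailing NUL, or -1 if the buffer is too small.
 */
static int sketch_format_header(char *str, size_t size,
				const char *type, size_t objsize)
{
	int len;

	assert(str && type);
	len = snprintf(str, size, "%s %" PRIuMAX, type, (uintmax_t)objsize);
	if (len < 0 || (size_t)len >= size)
		return -1;
	return len + 1; /* the NUL terminator is counted as header data */
}
```

For a 12-byte blob this writes "blob 12" and reports a length of 8, one more than the visible characters. Unlike xsnprintf(), which dies on truncation, the sketch reports truncation to the caller.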
|
|
|
|
|
2020-01-30 23:32:23 +03:00
|
|
|
int check_object_signature(struct repository *r, const struct object_id *oid,
|
2022-02-05 02:48:32 +03:00
|
|
|
void *buf, unsigned long size,
|
|
|
|
enum object_type type)
|
sha1_file: introduce close_one_pack() to close packs on fd pressure

When the number of open packs exceeds pack_max_fds, unuse_one_window()
is called repeatedly to attempt to release the least-recently-used
pack windows, which, as a side-effect, will also close a pack file
after closing its last open window. If a pack file has been opened,
but no windows have been allocated into it, it will never be selected
by unuse_one_window() and hence its file descriptor will not be
closed. When this happens, git may exceed the number of file
descriptors permitted by the system.

This latter situation can occur in show-ref or receive-pack during ref
advertisement. During ref advertisement, receive-pack will iterate
over every ref in the repository and advertise it to the client after
ensuring that the ref exists in the local repository. If the ref is
located inside a pack, then the pack is opened to ensure that it
exists, but since the object is not actually read from the pack, no
mmap windows are allocated. When the number of open packs exceeds
pack_max_fds, unuse_one_window() will not be able to find any windows to
free and will not be able to close any packs. Once the per-process
file descriptor limit is exceeded, receive-pack will produce a warning,
not an error, for each pack it cannot open, and will then most likely
fail with an error to spawn rev-list or index-pack like:

   error: cannot create standard input pipe for rev-list: Too many open files
   error: Could not run 'git rev-list'

This may also occur during upload-pack when refs are packed (in the
packed-refs file) and the number of packs that must be opened to
verify that these packed refs exist exceeds the file descriptor
limit. If the refs are loose, then upload-pack will read each ref
from the object database (if the object is in a pack, allocating one
or more mmap windows for it) in order to peel tags and advertise the
underlying object. But when the refs are packed and peeled,
upload-pack will use the peeled sha1 in the packed-refs file and
will not need to read from the pack files, so no mmap windows will
be allocated and just like with receive-pack, unuse_one_window()
will never select these opened packs to close.

When we have file descriptor pressure, we just need to find an open
pack to close. We can leave the existing mmap windows open. If
additional windows need to be mapped into the pack file, it will be
reopened when necessary. If the pack file has been rewritten in the
mean time, open_packed_git_1() should notice when it compares the file
size or the pack's sha1 checksum to what was previously read from the
pack index, and reject it.

Let's introduce a new function close_one_pack() designed specifically
for this purpose to search for and close the least-recently-used pack,
where LRU is defined as (in order of preference):

   * pack with oldest mtime and no allocated mmap windows
   * pack with the least-recently-used windows, i.e. the pack
     with the oldest most-recently-used window, where none of
     the windows are in use
   * pack with the least-recently-used windows

Signed-off-by: Brandon Casey <drafnel@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
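The three-tier preference order described in this commit message can be sketched as a ranking function over open packs. This is only an illustration under stated assumptions: struct sketch_pack, its fields, and the sketch_* functions are hypothetical simplifications, not git's struct packed_git or the actual close_one_pack() implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified stand-in for a pack and its mmap windows. */
struct sketch_pack {
	int fd;                  /* -1 when the pack file is not open */
	time_t mtime;            /* modification time of the pack file */
	int nr_windows;          /* number of mmap windows allocated */
	int nr_inuse;            /* how many of those windows are in use */
	unsigned long last_used; /* tick of the most recently used window */
};

/* 0 = no windows, 1 = windows but none in use, 2 = some window in use. */
static int sketch_rank(const struct sketch_pack *p)
{
	if (!p->nr_windows)
		return 0;
	return p->nr_inuse ? 2 : 1;
}

/* Return nonzero if "a" is a better candidate to close than "b". */
static int sketch_better(const struct sketch_pack *a,
			 const struct sketch_pack *b)
{
	int ra = sketch_rank(a), rb = sketch_rank(b);

	if (ra != rb)
		return ra < rb;
	if (!ra)
		return a->mtime < b->mtime;     /* oldest mtime first */
	return a->last_used < b->last_used;     /* oldest MRU window first */
}

/* Pick the index of the open pack to close, or -1 if none is open. */
static int sketch_pick_pack_to_close(const struct sketch_pack *packs,
				     size_t nr)
{
	int best = -1;
	size_t i;

	assert(packs || !nr);
	for (i = 0; i < nr; i++) {
		if (packs[i].fd < 0)
			continue; /* already closed, nothing to gain */
		if (best < 0 || sketch_better(&packs[i], &packs[best]))
			best = (int)i;
	}
	return best;
}
```

Only the file descriptor would be closed for the chosen pack; as the message notes, existing mmap windows can stay mapped.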
2013-08-02 09:36:33 +04:00
|
|
|
{
|
2022-02-05 02:48:30 +03:00
|
|
|
struct object_id real_oid;
|
|
|
|
|
|
|
|
hash_object_file(r->hash_algo, buf, size, type, &real_oid);
|
|
|
|
|
|
|
|
return !oideq(oid, &real_oid) ? -1 : 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
int stream_object_signature(struct repository *r, const struct object_id *oid)
|
|
|
|
{
|
|
|
|
struct object_id real_oid;
|
|
|
|
unsigned long size;
|
2012-03-07 14:54:18 +04:00
|
|
|
enum object_type obj_type;
|
|
|
|
struct git_istream *st;
|
2018-02-01 05:18:41 +03:00
|
|
|
git_hash_ctx c;
|
2018-03-12 05:27:55 +03:00
|
|
|
char hdr[MAX_HEADER_LEN];
|
2012-03-07 14:54:18 +04:00
|
|
|
int hdrlen;
|
2013-08-02 09:36:33 +04:00
|
|
|
|
2020-01-30 23:32:23 +03:00
|
|
|
st = open_istream(r, oid, &obj_type, &size, NULL);
|
2012-03-07 14:54:18 +04:00
|
|
|
if (!st)
|
|
|
|
return -1;
|
2013-08-02 09:36:33 +04:00
|
|
|
|
2012-03-07 14:54:18 +04:00
|
|
|
/* Generate the header */
|
2022-02-05 02:48:25 +03:00
|
|
|
hdrlen = format_object_header(hdr, sizeof(hdr), obj_type, size);
|
2013-08-02 09:36:33 +04:00
|
|
|
|
2012-03-07 14:54:18 +04:00
|
|
|
/* Sha1.. */
|
2020-01-30 23:32:23 +03:00
|
|
|
r->hash_algo->init_fn(&c);
|
|
|
|
r->hash_algo->update_fn(&c, hdr, hdrlen);
|
2012-03-07 14:54:18 +04:00
|
|
|
for (;;) {
|
|
|
|
char buf[1024 * 16];
|
|
|
|
ssize_t readlen = read_istream(st, buf, sizeof(buf));
|
2005-06-27 14:35:33 +04:00
|
|
|
|
2013-03-26 00:17:17 +04:00
|
|
|
if (readlen < 0) {
|
|
|
|
close_istream(st);
|
|
|
|
return -1;
|
|
|
|
}
|
2012-03-07 14:54:18 +04:00
|
|
|
if (!readlen)
|
|
|
|
break;
|
2020-01-30 23:32:23 +03:00
|
|
|
r->hash_algo->update_fn(&c, buf, readlen);
|
2010-04-19 18:23:06 +04:00
|
|
|
}
|
2022-02-05 02:48:30 +03:00
|
|
|
r->hash_algo->final_oid_fn(&real_oid, &c);
|
2012-03-07 14:54:18 +04:00
|
|
|
close_istream(st);
|
2022-02-05 02:48:30 +03:00
|
|
|
return !oideq(oid, &real_oid) ? -1 : 0;
|
2010-04-19 18:23:06 +04:00
|
|
|
}
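stream_object_signature() relies on the hash being incremental: feeding the header and then the body in 16 KiB pieces gives the same digest as hashing everything at once. The sketch below illustrates that property with a toy 64-bit FNV-1a hash as a stand-in; it is not the hash git uses, and every sketch_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy incremental hash (64-bit FNV-1a); a stand-in, NOT what git uses. */
struct sketch_ctx {
	uint64_t h;
};

static void sketch_init(struct sketch_ctx *c)
{
	c->h = UINT64_C(14695981039346656037); /* FNV offset basis */
}

static void sketch_update(struct sketch_ctx *c, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--) {
		c->h ^= *p++;
		c->h *= UINT64_C(1099511628211); /* FNV prime */
	}
}

/* Hash the header (including its NUL), then the body in "chunk"-sized reads. */
static uint64_t sketch_hash_stream(const char *hdr, size_t hdrlen,
				   const unsigned char *body, size_t size,
				   size_t chunk)
{
	struct sketch_ctx c;
	size_t off = 0;

	assert(chunk > 0);
	sketch_init(&c);
	sketch_update(&c, hdr, hdrlen);
	while (off < size) {
		size_t n = size - off < chunk ? size - off : chunk;

		sketch_update(&c, body + off, n);
		off += n;
	}
	return c.h;
}
```

Because the result does not depend on the chunk size, the real function can verify an object without holding all of it in memory.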
|
|
|
|
|
2016-10-28 16:23:07 +03:00
|
|
|
int git_open_cloexec(const char *name, int flags)
|
2012-08-24 13:52:22 +04:00
|
|
|
{
|
2016-10-31 20:41:41 +03:00
|
|
|
int fd;
|
|
|
|
static int o_cloexec = O_CLOEXEC;
|
2012-08-24 13:52:22 +04:00
|
|
|
|
2016-10-31 20:41:41 +03:00
|
|
|
fd = open(name, flags | o_cloexec);
|
|
|
|
if ((o_cloexec & O_CLOEXEC) && fd < 0 && errno == EINVAL) {
|
2016-10-24 21:02:59 +03:00
|
|
|
/* Try again w/o O_CLOEXEC: the kernel might not support it */
|
2016-10-31 20:41:41 +03:00
|
|
|
o_cloexec &= ~O_CLOEXEC;
|
|
|
|
fd = open(name, flags | o_cloexec);
|
2013-12-19 02:59:12 +04:00
|
|
|
}
|
|
|
|
|
2017-07-15 21:55:40 +03:00
|
|
|
#if defined(F_GETFD) && defined(F_SETFD) && defined(FD_CLOEXEC)
|
2013-12-19 02:59:12 +04:00
|
|
|
{
|
2016-10-31 20:41:41 +03:00
|
|
|
static int fd_cloexec = FD_CLOEXEC;
|
2012-08-24 13:52:22 +04:00
|
|
|
|
2016-10-31 20:41:41 +03:00
|
|
|
if (!o_cloexec && 0 <= fd && fd_cloexec) {
|
|
|
|
/* Opened w/o O_CLOEXEC? try with fcntl(2) to add it */
|
2017-07-15 21:55:40 +03:00
|
|
|
int flags = fcntl(fd, F_GETFD);
|
|
|
|
if (fcntl(fd, F_SETFD, flags | fd_cloexec))
|
2016-10-31 20:41:41 +03:00
|
|
|
fd_cloexec = 0;
|
2016-10-24 21:02:59 +03:00
|
|
|
}
|
2008-06-14 22:32:37 +04:00
|
|
|
}
|
2012-08-24 13:52:22 +04:00
|
|
|
#endif
|
2016-10-28 16:23:07 +03:00
|
|
|
return fd;
|
2012-08-24 13:52:22 +04:00
|
|
|
}
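The fallback in git_open_cloexec() can be reduced to a smaller, non-caching form: try O_CLOEXEC first, and if the kernel rejects the flag with EINVAL, reopen without it and set FD_CLOEXEC through fcntl(2). The sketch below assumes a POSIX system and opens existing files only (no O_CREAT mode argument); sketch_open_cloexec() is a hypothetical name. Unlike the function above, it does not remember across calls that O_CLOEXEC failed.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical, non-caching variant: open "name" with close-on-exec set,
 * preferring O_CLOEXEC and falling back to fcntl(F_SETFD).
 */
static int sketch_open_cloexec(const char *name, int flags)
{
	int fd;

	assert(name);
#ifdef O_CLOEXEC
	fd = open(name, flags | O_CLOEXEC);
	if (fd >= 0 || errno != EINVAL)
		return fd; /* success, or a failure unrelated to the flag */
#endif
	fd = open(name, flags);
	if (fd >= 0) {
		int fdflags = fcntl(fd, F_GETFD);

		/* Best effort: a failure here leaves the fd usable. */
		if (fdflags >= 0)
			(void)fcntl(fd, F_SETFD, fdflags | FD_CLOEXEC);
	}
	return fd;
}
```

The fcntl() path leaves a short window between open() and F_SETFD in which another thread could fork and exec, which is presumably why O_CLOEXEC is tried first.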
|
|
|
|
|
2007-02-02 11:00:03 +03:00
|
|
|
/*
|
sha1-file: modernize loose object file functions
The loose object access code in sha1-file.c is some of the oldest in
Git, and could use some modernizing. It mostly uses "unsigned char *"
for object ids, which these days should be "struct object_id".
It also uses the term "sha1_file" in many functions, which is confusing.
The term "loose_objects" is much better. It clearly distinguishes
them from packed objects (which didn't even exist back when the name
"sha1_file" came into being). And it also distinguishes it from the
checksummed-file concept in csum-file.c (which until recently was
actually called "struct sha1file"!).
This patch converts the functions {open,close,map,stat}_sha1_file() into
open_loose_object(), etc, and switches their sha1 arguments for
object_id structs. Similarly, path functions like fill_sha1_path()
become fill_loose_path() and use object_ids.
The function sha1_loose_object_info() already says "loose", so we can
just drop the "sha1" (and teach it to use object_id).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-07 11:35:42 +03:00
|
|
|
* Find "oid" as a loose object in the local repository or in an alternate.
|
2017-01-13 20:54:39 +03:00
|
|
|
* Returns 0 on success, negative on failure.
|
|
|
|
*
|
|
|
|
* The "path" out-parameter will give the path of the object we found (if any).
|
|
|
|
* Note that it may point to static storage and is only valid until another
|
2019-01-07 11:35:42 +03:00
|
|
|
* call to stat_loose_object().
|
2007-02-02 11:00:03 +03:00
|
|
|
*/
|
2019-01-07 11:35:42 +03:00
|
|
|
static int stat_loose_object(struct repository *r, const struct object_id *oid,
|
|
|
|
struct stat *st, const char **path)
|
2005-06-27 14:35:33 +04:00
|
|
|
{
|
2018-11-12 17:48:47 +03:00
|
|
|
struct object_directory *odb;
|
2018-01-17 20:54:54 +03:00
|
|
|
static struct strbuf buf = STRBUF_INIT;
|
|
|
|
|
2018-03-23 20:21:17 +03:00
|
|
|
prepare_alt_odb(r);
|
sha1-file: use an object_directory for the main object dir

Our handling of alternate object directories is needlessly different
from the main object directory. As a result, many places in the code
basically look like this:

  do_something(r->objects->objdir);

  for (odb = r->objects->alt_odb_list; odb; odb = odb->next)
        do_something(odb->path);

That gets annoying when do_something() is non-trivial, and we've
resorted to gross hacks like creating fake alternates (see
find_short_object_filename()).

Instead, let's give each raw_object_store a unified list of
object_directory structs. The first will be the main store, and
everything after is an alternate. Very few callers even care about the
distinction, and can just loop over the whole list (and those who care
can just treat the first element differently).

A few observations:

  - we don't need r->objects->objectdir anymore, and can just
    mechanically convert that to r->objects->odb->path

  - object_directory's path field needs to become a real pointer rather
    than a FLEX_ARRAY, in order to fill it with expand_base_dir()

  - we'll call prepare_alt_odb() earlier in many functions (i.e.,
    outside of the loop). This may result in us calling it even when our
    function would be satisfied looking only at the main odb.

    But this doesn't matter in practice. It's not a very expensive
    operation in the first place, and in the majority of cases it will
    be a noop. We call it already (and cache its results) in
    prepare_packed_git(), and we'll generally check packs before loose
    objects. So essentially every program is going to call it
    immediately once per program.

    Arguably we should just prepare_alt_odb() immediately upon setting
    up the repository's object directory, which would save us sprinkling
    calls throughout the code base (and forgetting to do so has been a
    source of subtle bugs in the past). But I've stopped short of that
    here, since there are already a lot of other moving parts in this
    patch.

  - Most call sites just get shorter. The check_and_freshen() functions
    are an exception, because they have entry points to handle local and
    nonlocal directories separately.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12 17:50:39 +03:00
|
|
|
for (odb = r->objects->odb; odb; odb = odb->next) {
|
2019-01-07 11:35:42 +03:00
|
|
|
*path = odb_loose_path(odb, &buf, oid);
|
2017-01-13 20:54:39 +03:00
|
|
|
if (!lstat(*path, st))
|
sha1_loose_object_info: make type lookup optional
Until recently, the only items to request from
sha1_object_info_extended were type and size. This meant
that we always had to open a loose object file to determine
one or the other. But with the addition of the disk_size
query, it's possible that we can fulfill the query without
even opening the object file at all. However, since the
function interface always returns the type, we have no way
of knowing whether the caller cares about it or not.
This patch only modified sha1_loose_object_info to make type
lookup optional using an out-parameter, similar to the way
the size is handled (and the return value is "0" or "-1" for
success or error, respectively).
There should be no functional change yet, though, as
sha1_object_info_extended, the only caller, will always ask
for a type.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-07-12 10:30:48 +04:00
|
|
|
return 0;
|
2011-02-28 23:52:39 +03:00
|
|
|
}
|
|
|
|
|
2007-02-02 11:00:03 +03:00
|
|
|
return -1;
|
|
|
|
}
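The lookup in stat_loose_object() walks one unified list, main object directory first and alternates after it, and stops at the first directory where lstat() succeeds. A rough sketch of that loop follows. It assumes the usual loose-object layout of a two-hex-digit subdirectory followed by the rest of the hex object id; struct sketch_odb and the sketch_* functions are hypothetical and much simpler than git's object_directory and odb_loose_path().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical stand-in for git's list of object directories. */
struct sketch_odb {
	const char *path;
	struct sketch_odb *next;
};

/* Build "<dir>/<first two hex digits>/<remaining hex digits>". */
static int sketch_loose_path(char *out, size_t size,
			     const char *dir, const char *hex)
{
	int len;

	assert(strlen(hex) > 2);
	len = snprintf(out, size, "%s/%.2s/%s", dir, hex, hex + 2);
	return (len < 0 || (size_t)len >= size) ? -1 : 0;
}

/* Return the first directory in the list that holds the object, or NULL. */
static const struct sketch_odb *sketch_find_loose(const struct sketch_odb *odb,
						  const char *hex)
{
	char path[4096];
	struct stat st;

	for (; odb; odb = odb->next) {
		if (sketch_loose_path(path, sizeof(path), odb->path, hex))
			continue; /* path too long for this directory */
		if (!lstat(path, &st))
			return odb;
	}
	return NULL;
}
```

The sketch uses a stack buffer for the path, so it avoids the static-storage caveat that the comment on stat_loose_object() warns about, at the cost of not returning the path to the caller.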
|
|
|
|
|
2017-01-13 20:54:39 +03:00
|
|
|
/*
|
2019-01-07 11:35:42 +03:00
|
|
|
* Like stat_loose_object(), but actually open the object and return the
|
2017-01-13 20:54:39 +03:00
|
|
|
* descriptor. See the caveats on the "path" parameter above.
|
|
|
|
*/
|
2019-01-07 11:35:42 +03:00
|
|
|
static int open_loose_object(struct repository *r,
|
|
|
|
const struct object_id *oid, const char **path)
|
2006-12-23 10:34:28 +03:00
|
|
|
{
|
2008-06-14 22:32:37 +04:00
|
|
|
int fd;
|
2018-11-12 17:48:47 +03:00
|
|
|
struct object_directory *odb;
|
2018-11-12 17:50:39 +03:00
|
|
|
int most_interesting_errno = ENOENT;
|
2018-01-17 20:54:54 +03:00
|
|
|
static struct strbuf buf = STRBUF_INIT;
|
|
|
|
|
2018-03-23 20:21:18 +03:00
|
|
|
prepare_alt_odb(r);
|
2018-11-12 17:50:39 +03:00
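The unified list described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not Git's actual definitions: the two-field `struct object_directory` and the helpers `count_matching()`, `is_main()` and `path_is_nonempty()` are hypothetical stand-ins. The point is only that the main store is the first element and the alternates follow, so a single loop covers both.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for Git's struct object_directory. */
struct object_directory {
	struct object_directory *next;
	const char *path;
};

/*
 * Walk the whole list (main store first, then alternates) and count
 * the directories whose path satisfies pred(). No special case is
 * needed for the main object directory.
 */
static int count_matching(const struct object_directory *head,
			  int (*pred)(const char *path))
{
	const struct object_directory *odb;
	int n = 0;

	for (odb = head; odb; odb = odb->next)
		if (pred(odb->path))
			n++;
	return n;
}

/* Callers that care about the distinction compare against the head. */
static int is_main(const struct object_directory *head,
		   const struct object_directory *odb)
{
	return odb == head;
}

/* Example predicate (hypothetical): the path is set and non-empty. */
static int path_is_nonempty(const char *path)
{
	return path && *path;
}
```

A caller would build the list with the main store at the head and pass that head to `count_matching()`; only code that really needs to treat the main store differently would call `is_main()`.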
|
|
|
for (odb = r->objects->odb; odb; odb = odb->next) {
|
sha1-file: modernize loose object file functions
The loose object access code in sha1-file.c is some of the oldest in
Git, and could use some modernizing. It mostly uses "unsigned char *"
for object ids, which these days should be "struct object_id".
It also uses the term "sha1_file" in many functions, which is confusing.
The term "loose_objects" is much better. It clearly distinguishes
them from packed objects (which didn't even exist back when the name
"sha1_file" came into being). And it also distinguishes it from the
checksummed-file concept in csum-file.c (which until recently was
actually called "struct sha1file"!).
This patch converts the functions {open,close,map,stat}_sha1_file() into
open_loose_object(), etc, and switches their sha1 arguments for
object_id structs. Similarly, path functions like fill_sha1_path()
become fill_loose_path() and use object_ids.
The function sha1_loose_object_info() already says "loose", so we can
just drop the "sha1" (and teach it to use object_id).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-07 11:35:42 +03:00
|
|
|
*path = odb_loose_path(odb, &buf, oid);
|
2017-01-13 20:54:39 +03:00
|
|
|
fd = git_open(*path);
|
2008-06-14 22:32:37 +04:00
|
|
|
if (fd >= 0)
|
|
|
|
return fd;
|
2018-11-12 17:50:39 +03:00
|
|
|
|
2014-05-15 12:54:06 +04:00
|
|
|
if (most_interesting_errno == ENOENT)
|
|
|
|
most_interesting_errno = errno;
|
2006-12-23 10:34:08 +03:00
|
|
|
}
|
2014-05-15 12:54:06 +04:00
|
|
|
errno = most_interesting_errno;
|
2008-06-14 22:32:37 +04:00
|
|
|
return -1;
|
2005-06-27 14:35:33 +04:00
|
|
|
}
|
|
|
|
|
sha1-file: use loose object cache for quick existence check
In cases where we expect to ask has_sha1_file() about a lot of objects
that we are not likely to have (e.g., during fetch negotiation), we
already use OBJECT_INFO_QUICK to sacrifice accuracy (due to racing with
a simultaneous write or repack) for speed (we avoid re-scanning the pack
directory).
However, even checking for loose objects can be expensive, as we will
stat() each one. On many systems this cost isn't too noticeable, but
stat() can be particularly slow on some operating systems, or due to
network filesystems.
Since the QUICK flag already tells us that we're OK with a slightly
stale answer, we can use that as a cue to look in our in-memory cache of
each object directory. That basically trades an in-memory binary search
for a stat() call.
Note that it is possible for this to actually be _slower_. We'll do a
full readdir() to fill the cache, so if you have a very large number of
loose objects and a very small number of lookups, that readdir() may end
up more expensive.
This shouldn't be a big deal in practice. If you have a large number of
reachable loose objects, you'll already run into performance problems
(which you should remedy by repacking). You may have unreachable objects
which wouldn't otherwise impact performance. Usually these would go away
with the prune step of "git gc", but they may be held for up to 2 weeks
in the default configuration.
So it comes down to how many such objects you might reasonably expect to
have, and how much slower readdir() on N entries is than M stat() calls
(here we really care about the syscall backing readdir(), like
getdents() on Linux, but I'll just call this readdir() below).
If N is much smaller than M (a typical packed repo), we know this is a
big win (few readdirs() followed by many uses of the resulting cache).
When N and M are similar in size, it's also a win. We care about the
latency of making a syscall, and readdir() should be giving us many
values in a single call. How many?
On Linux, running "strace -e getdents ls" shows a 32k buffer getting 512
entries per call (which is 64 bytes per entry; the name itself is 38
bytes, plus there are some other fields). So we can imagine that this is
always a win as long as the number of loose objects in the repository is
a factor of 500 less than the number of lookups you make. It's hard to
auto-tune this because we don't generally know up front how many lookups
we're going to do. But it's unlikely for this to perform significantly
worse.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12 17:54:42 +03:00
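The trade described in the commit message above (one readdir() pass to fill a cache, then an in-memory binary search per lookup instead of a stat() call) can be sketched as below. This is an assumption-laden illustration: `struct loose_cache`, `cache_sort()` and `cache_contains()` are hypothetical names, the cache is a plain sorted array of hex strings, and the directory scan that would fill it is left out. It does not reflect the oidtree-based lookup used by the current code.

```c
#include <stdlib.h>
#include <string.h>

#define HEX_LEN 40 /* SHA-1 hex length; an assumption for this sketch */

/* Hypothetical in-memory cache of loose object names for one directory. */
struct loose_cache {
	char (*names)[HEX_LEN + 1]; /* NUL-terminated hex names */
	size_t nr;
};

/* Both qsort() and bsearch() hand us pointers to NUL-terminated names. */
static int cmp_name(const void *a, const void *b)
{
	return strcmp(a, b);
}

/* Sort once after the cache has been filled (e.g. from a directory scan). */
static void cache_sort(struct loose_cache *c)
{
	qsort(c->names, c->nr, sizeof(*c->names), cmp_name);
}

/* Each lookup is a binary search in memory; no stat() is issued. */
static int cache_contains(const struct loose_cache *c, const char *hex)
{
	return bsearch(hex, c->names, c->nr, sizeof(*c->names),
		       cmp_name) != NULL;
}
```

The answer can be stale if another process writes an object after the cache is filled, which is the accuracy the QUICK flag already agrees to give up.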
|
|
|
static int quick_has_loose(struct repository *r,
|
2019-01-07 11:37:29 +03:00
|
|
|
const struct object_id *oid)
|
2018-11-12 17:54:42 +03:00
|
|
|
{
|
|
|
|
struct object_directory *odb;
|
|
|
|
|
|
|
|
prepare_alt_odb(r);
|
|
|
|
for (odb = r->objects->odb; odb; odb = odb->next) {
|
2021-07-08 02:10:19 +03:00
|
|
|
if (oidtree_contains(odb_loose_cache(odb, oid), oid))
|
2018-11-12 17:54:42 +03:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-01-13 20:58:16 +03:00
|
|
|
/*
|
2022-12-14 22:17:41 +03:00
|
|
|
* Map and close the given loose object fd. The path argument is used for
|
|
|
|
* error reporting.
|
2017-01-13 20:58:16 +03:00
|
|
|
*/
|
2022-12-14 22:17:41 +03:00
|
|
|
static void *map_fd(int fd, const char *path, unsigned long *size)
|
2008-06-25 02:58:06 +04:00
|
|
|
{
|
2022-12-14 22:17:41 +03:00
|
|
|
void *map = NULL;
|
|
|
|
struct stat st;
|
2005-06-27 14:35:33 +04:00
|
|
|
|
2022-12-14 22:17:41 +03:00
|
|
|
if (!fstat(fd, &st)) {
|
|
|
|
*size = xsize_t(st.st_size);
|
|
|
|
if (!*size) {
|
|
|
|
/* mmap() is forbidden on empty files */
|
|
|
|
error(_("object file %s is empty"), path);
|
|
|
|
close(fd);
|
|
|
|
return NULL;
|
2005-04-23 22:09:32 +04:00
|
|
|
}
|
2022-12-14 22:17:41 +03:00
|
|
|
map = xmmap(NULL, *size, PROT_READ, MAP_PRIVATE, fd, 0);
|
2007-04-09 09:06:35 +04:00
|
|
|
}
|
2022-12-14 22:17:41 +03:00
|
|
|
close(fd);
|
2005-04-19 00:04:43 +04:00
|
|
|
return map;
|
2007-04-09 09:06:35 +04:00
|
|
|
}
|
|
|
|
|
2019-01-07 11:35:42 +03:00
|
|
|
void *map_loose_object(struct repository *r,
|
|
|
|
const struct object_id *oid,
|
|
|
|
unsigned long *size)
|
2017-02-22 02:47:34 +03:00
|
|
|
{
|
2022-12-14 22:17:41 +03:00
|
|
|
const char *p;
|
|
|
|
int fd = open_loose_object(r, oid, &p);
|
|
|
|
|
|
|
|
if (fd < 0)
|
|
|
|
return NULL;
|
|
|
|
return map_fd(fd, p, size);
|
2017-02-22 02:47:34 +03:00
|
|
|
}
|
|
|
|
|
2021-10-01 12:16:49 +03:00
|
|
|
enum unpack_loose_header_result unpack_loose_header(git_zstream *stream,
|
|
|
|
unsigned char *map,
|
|
|
|
unsigned long mapsize,
|
|
|
|
void *buffer,
|
|
|
|
unsigned long bufsiz,
|
|
|
|
struct strbuf *header)
|
2016-02-25 17:22:52 +03:00
|
|
|
{
|
2021-10-01 12:16:48 +03:00
|
|
|
int status;
|
object-store: allow threaded access to object reading
Allow object reading to be performed by multiple threads protecting it
with an internal lock, the obj_read_mutex. The lock usage can be toggled
with enable_obj_read_lock() and disable_obj_read_lock(). Currently, the
functions which can be safely called in parallel are:
read_object_file_extended(), repo_read_object_file(),
read_object_file(), read_object_with_reference(), read_object(),
oid_object_info() and oid_object_info_extended(). It's also possible
to use obj_read_lock() and obj_read_unlock() to protect other sections
that cannot execute in parallel with object reading.
Probably there are many spots in the functions listed above that could
be executed unlocked (and thus, in parallel). But, for now, we are most
interested in allowing parallel access to zlib inflation. This is one of
the sections where object reading spends most of its time (e.g. up to
one-third of git-grep's execution time in the chromium repo corresponds
to inflation), and it's already thread-safe. So, to take advantage of
that, the obj_read_mutex is released when calling git_inflate() and
re-acquired right after, for every calling spot in
oid_object_info_extended()'s call chain. We may refine this lock to also
exploit other possible parallel spots in the future, but for now,
threaded zlib inflation should already give great speedups for threaded
object reading callers.
Note that add_delta_base_cache() was also modified to skip adding
already present entries to the cache. This wasn't possible before, but
it would be now, with the parallel inflation. Take for example the
following situation, where two threads - A and B - are executing the
code at unpack_entry():
1. Thread A is performing the decompression of a base O (which is not
yet in the cache) at PHASE II. Thread B is simultaneously trying to
unpack O, but just starting at PHASE I.
2. Since O is not yet in the cache, B will go to PHASE II to also
perform the decompression.
3. When they finish decompressing, one of them will get the object
reading mutex and go to PHASE III while the other waits for the
mutex. Let’s say A got the mutex first.
4. Thread A will add O to the cache, go through the rest of PHASE III
and return.
5. Thread B gets the mutex, also adds O to the cache (if the check wasn't
there) and returns.
Finally, it is also important to highlight that the object reading lock
can only ensure thread-safety in the mentioned functions thanks to two
complementary mechanisms: the use of 'struct raw_object_store's
replace_mutex, which guards sections in the object reading machinery
that would otherwise be thread-unsafe; and the 'struct pack_window's
inuse_cnt, which protects window reading operations (such as the one
performed during the inflation of a packed object), allowing them to
execute without the acquisition of the obj_read_mutex.
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-16 05:39:53 +03:00
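The "skip already present entries" guard from steps 4 and 5 of the commit message above can be sketched in isolation as below. This is only an illustration of the idea: `struct base_cache`, `cache_has()` and `cache_add()` are hypothetical names, the cache is a small fixed array keyed by an integer, and the locking around it is omitted. Git's real delta base cache is structured differently.

```c
#include <stddef.h>

#define CACHE_SLOTS 8 /* hypothetical fixed capacity for this sketch */

/* Hypothetical cache keyed by an integer standing in for a delta base. */
struct base_cache {
	unsigned long keys[CACHE_SLOTS];
	size_t nr;
};

/* Linear scan; enough to show the presence check. */
static int cache_has(const struct base_cache *c, unsigned long key)
{
	size_t i;

	for (i = 0; i < c->nr; i++)
		if (c->keys[i] == key)
			return 1;
	return 0;
}

/*
 * Returns 1 if the entry was added, 0 if it was skipped because it is
 * already present (or the cache is full). A second thread that
 * decompressed the same base hits the "already present" path.
 */
static int cache_add(struct base_cache *c, unsigned long key)
{
	if (cache_has(c, key) || c->nr == CACHE_SLOTS)
		return 0;
	c->keys[c->nr++] = key;
	return 1;
}
```

In the scenario above, thread A's call would add the entry and thread B's later call for the same base would return 0 without creating a duplicate.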
|
|
|
|
2005-06-02 04:54:59 +04:00
|
|
|
/* Get the data stream */
|
|
|
|
memset(stream, 0, sizeof(*stream));
|
|
|
|
stream->next_in = map;
|
|
|
|
stream->avail_in = mapsize;
|
|
|
|
stream->next_out = buffer;
|
2006-07-11 23:48:08 +04:00
|
|
|
stream->avail_out = bufsiz;
|
2016-02-25 17:22:52 +03:00
|
|
|
|
2009-01-08 06:54:47 +03:00
|
|
|
git_inflate_init(stream);
|
2020-01-16 05:39:53 +03:00
|
|
|
obj_read_unlock();
|
2021-10-01 12:16:48 +03:00
|
|
|
status = git_inflate(stream, 0);
|
2020-01-16 05:39:53 +03:00
|
|
|
obj_read_lock();
|
unpack_sha1_header(): detect malformed object header
When opening a loose object file, we often do this sequence:
- prepare a short buffer for the object header (on stack)
- call unpack_sha1_header() and have early part of the object data
inflated, enough to fill the buffer
- parse that data in the short buffer, assuming that the first part
of the object is <typename> SP <length> NUL
Because the parsing function parse_sha1_header_extended() is not
given the number of bytes inflated into the header buffer, if you
craft a file whose early part inflates to a garbage sequence without SP
or NUL, and replace a loose object with it, the parser will end up
reading past the end of the inflated data.
To correct this, do the following four things:
- rename unpack_sha1_header() to unpack_sha1_short_header() and
have unpack_sha1_header_to_strbuf() keep calling that as its
helper function. This will detect and report zlib errors, but is
not aware of the format of a loose object (as before).
- introduce unpack_sha1_header() that calls the same helper
function, and when zlib reports it inflated OK into the buffer,
check if the inflated data has NUL. This would ensure that
parsing function will terminate within the buffer that holds the
inflated header.
- update unpack_sha1_header_to_strbuf() to check if the resulting
buffer has NUL for the same effect.
- update parse_sha1_header_extended() to make sure that its loop to
find the SP that terminates the <typename> stops at NUL.
Essentially, this makes the unpack_*() functions that are asked to
unpack a loose object header a bit more strict, so that they detect an
input that cannot possibly be a valid object header even before the
parsing function kicks in.
Reported-by: Gustavo Grieco <gustavo.grieco@imag.fr>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-26 07:29:04 +03:00
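The two checks described in the commit message above (confirm the inflated bytes contain a NUL, and stop the type-name scan at NUL) can be sketched as follows. The helper names `header_has_nul()` and `type_name_len()` are hypothetical and the functions are simplified; they only illustrate bounding the parse by the number of bytes actually inflated.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if the first len inflated bytes contain a NUL terminator. */
static int header_has_nul(const char *buf, size_t len)
{
	return memchr(buf, '\0', len) != NULL;
}

/*
 * Return the length of the <typename> part (the bytes before the first
 * SP), or -1 if a NUL or the end of the inflated data comes first. The
 * scan never reads beyond buf[len - 1].
 */
static long type_name_len(const char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (buf[i] == ' ')
			return (long)i;
		if (buf[i] == '\0')
			break;
	}
	return -1;
}
```

For a well-formed header such as "blob 12" followed by NUL, `type_name_len()` yields 4; for garbage with neither SP nor NUL, both helpers report failure instead of running off the end of the buffer.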
|
|
|
if (status < Z_OK)
|
2021-10-01 12:16:49 +03:00
|
|
|
return ULHR_BAD;
|
2011-03-02 21:01:54 +03:00
|
|
|
|
2015-05-03 17:29:59 +03:00
|
|
|
/*
|
|
|
|
* Check if entire header is unpacked in the first iteration.
|
2011-03-02 21:01:54 +03:00
|
|
|
*/
|
2015-05-03 17:29:59 +03:00
|
|
|
if (memchr(buffer, '\0', stream->next_out - (unsigned char *)buffer))
|
2021-10-01 12:16:49 +03:00
|
|
|
return ULHR_OK;
|
2011-03-02 21:01:54 +03:00
|
|
|
|
2021-10-01 12:16:48 +03:00
|
|
|
/*
|
|
|
|
* We have a header longer than MAX_HEADER_LEN. The "header"
|
|
|
|
* here is only non-NULL when we run "cat-file
|
|
|
|
* --allow-unknown-type".
|
|
|
|
*/
|
|
|
|
if (!header)
|
2021-10-01 12:16:50 +03:00
|
|
|
return ULHR_TOO_LONG;
|
2011-03-02 21:01:54 +03:00
|
|
|
|
2015-05-03 17:29:59 +03:00
|
|
|
/*
|
|
|
|
* buffer[0..bufsiz] was not large enough. Copy the partial
|
|
|
|
* result out to header, and then append the result of further
|
|
|
|
* reading the stream.
|
|
|
|
*/
|
|
|
|
strbuf_add(header, buffer, stream->next_out - (unsigned char *)buffer);
|
|
|
|
stream->next_out = buffer;
|
|
|
|
stream->avail_out = bufsiz;
|
2011-03-02 21:01:54 +03:00
|
|
|
|
2015-05-03 17:29:59 +03:00
|
|
|
do {
|
2020-01-16 05:39:53 +03:00
|
|
|
obj_read_unlock();
|
2015-05-03 17:29:59 +03:00
|
|
|
status = git_inflate(stream, 0);
|
object-store: allow threaded access to object reading
Allow object reading to be performed by multiple threads protecting it
with an internal lock, the obj_read_mutex. The lock usage can be toggled
with enable_obj_read_lock() and disable_obj_read_lock(). Currently, the
functions which can be safely called in parallel are:
read_object_file_extended(), repo_read_object_file(),
read_object_file(), read_object_with_reference(), read_object(),
oid_object_info() and oid_object_info_extended(). It's also possible
to use obj_read_lock() and obj_read_unlock() to protect other sections
that cannot execute in parallel with object reading.
Probably there are many spots in the functions listed above that could
be executed unlocked (and thus, in parallel). But, for now, we are most
interested in allowing parallel access to zlib inflation. This is one of
the sections where object reading spends most of its time (e.g. up to
one-third of git-grep's execution time in the chromium repo corresponds
to inflation) and it's already thread-safe. So, to take advantage of
that, the obj_read_mutex is released when calling git_inflate() and
re-acquired right after, for every calling spot in
oid_object_info_extended()'s call chain. We may refine this lock to also
exploit other possible parallel spots in the future, but for now,
threaded zlib inflation should already give great speedups for threaded
object reading callers.
Note that add_delta_base_cache() was also modified to skip adding
already present entries to the cache. This wasn't possible before, but
it would be now, with the parallel inflation. Take for example the
following situation, where two threads - A and B - are executing the
code at unpack_entry():
1. Thread A is performing the decompression of a base O (which is not
yet in the cache) at PHASE II. Thread B is simultaneously trying to
unpack O, but just starting at PHASE I.
2. Since O is not yet in the cache, B will go to PHASE II to also
perform the decompression.
3. When they finish decompressing, one of them will get the object
reading mutex and go to PHASE III while the other waits for the
mutex. Let’s say A got the mutex first.
4. Thread A will add O to the cache, go through the rest of PHASE III
and return.
5. Thread B gets the mutex, also adds O to the cache (if the check wasn't
there) and returns.
Finally, it is also important to highlight that the object reading lock
can only ensure thread-safety in the mentioned functions thanks to two
complementary mechanisms: the use of 'struct raw_object_store's
replace_mutex, which guards sections in the object reading machinery
that would otherwise be thread-unsafe; and the 'struct pack_window's
inuse_cnt, which protects window reading operations (such as the one
performed during the inflation of a packed object), allowing them to
execute without the acquisition of the obj_read_mutex.
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-16 05:39:53 +03:00
|
|
|
obj_read_lock();
|
2015-05-03 17:29:59 +03:00
|
|
|
strbuf_add(header, buffer, stream->next_out - (unsigned char *)buffer);
|
|
|
|
if (memchr(buffer, '\0', stream->next_out - (unsigned char *)buffer))
|
|
|
|
return 0;
|
|
|
|
stream->next_out = buffer;
|
|
|
|
stream->avail_out = bufsiz;
|
|
|
|
} while (status != Z_STREAM_END);
|
2021-10-01 12:16:50 +03:00
|
|
|
return ULHR_TOO_LONG;
|
2011-03-02 21:01:54 +03:00
|
|
|
}
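The five-step scenario in the commit message above (threads A and B both inflating the same base O) comes down to a "skip if already present" check at insertion time. The following is only a minimal single-threaded sketch of that check: `toy_cache`, `toy_cache_add()` and `replay_race()` are hypothetical names, not Git's delta base cache API, and the locking itself is left out so that the two insertions can be replayed in order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a delta base cache. */
#define TOY_CACHE_SLOTS 8

struct toy_cache {
	const char *keys[TOY_CACHE_SLOTS];
	size_t nr;
};

/* Returns 1 if `key` is already cached, 0 otherwise. */
static int toy_cache_has(const struct toy_cache *c, const char *key)
{
	size_t i;

	for (i = 0; i < c->nr; i++)
		if (!strcmp(c->keys[i], key))
			return 1;
	return 0;
}

/*
 * Returns 1 if the entry was added, 0 if it was skipped because it
 * is already present (or because the toy cache is full).
 */
static int toy_cache_add(struct toy_cache *c, const char *key)
{
	if (toy_cache_has(c, key))
		return 0;
	if (c->nr == TOY_CACHE_SLOTS)
		return 0;
	c->keys[c->nr++] = key;
	return 1;
}

/*
 * Replays steps 4 and 5: both threads finished inflating O without
 * the lock, then each tries to insert it while holding the lock.
 */
static void replay_race(struct toy_cache *c)
{
	int a_added = toy_cache_add(c, "O"); /* thread A, step 4 */
	int b_added = toy_cache_add(c, "O"); /* thread B, step 5 */

	assert(a_added && !b_added);
}
```

Without the presence check, the second call would store O twice, which is the situation the commit message says the modified add_delta_base_cache() avoids.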
|
|
|
|
|
2019-01-07 11:37:02 +03:00
|
|
|
static void *unpack_loose_rest(git_zstream *stream,
|
|
|
|
void *buffer, unsigned long size,
|
|
|
|
const struct object_id *oid)
|
2012-02-01 17:48:54 +04:00
|
|
|
{
|
2005-06-02 18:57:25 +04:00
|
|
|
int bytes = strlen(buffer) + 1;
|
2010-01-26 21:24:14 +03:00
|
|
|
unsigned char *buf = xmallocz(size);
|
2006-07-11 23:48:08 +04:00
|
|
|
unsigned long n;
|
2007-03-05 11:21:37 +03:00
|
|
|
int status = Z_OK;
|
2012-02-01 17:48:54 +04:00
|
|
|
|
2006-07-11 23:48:08 +04:00
|
|
|
n = stream->total_out - bytes;
|
|
|
|
if (n > size)
|
|
|
|
n = size;
|
|
|
|
memcpy(buf, (char *) buffer + bytes, n);
|
|
|
|
bytes = n;
|
2007-03-20 08:49:53 +03:00
|
|
|
if (bytes <= size) {
|
|
|
|
/*
|
|
|
|
* The above condition must be (bytes <= size), not
|
|
|
|
* (bytes < size). In other words, even though we
|
2011-05-15 23:16:03 +04:00
|
|
|
* expect no more output and set avail_out to zero,
|
2007-03-20 08:49:53 +03:00
|
|
|
* the input zlib stream may have bytes that express
|
|
|
|
* "this concludes the stream", and we *do* want to
|
|
|
|
* eat that input.
|
|
|
|
*
|
|
|
|
* Otherwise we would not be able to test that we
|
|
|
|
* consumed all the input to reach the expected size;
|
|
|
|
* we also want to check that zlib tells us that all
|
|
|
|
* went well with status == Z_STREAM_END at the end.
|
|
|
|
*/
|
2005-06-02 18:57:25 +04:00
|
|
|
stream->next_out = buf + bytes;
|
|
|
|
stream->avail_out = size - bytes;
|
2020-01-16 05:39:53 +03:00
|
|
|
while (status == Z_OK) {
|
|
|
|
obj_read_unlock();
|
2009-01-08 06:54:47 +03:00
|
|
|
status = git_inflate(stream, Z_FINISH);
|
2020-01-16 05:39:53 +03:00
|
|
|
obj_read_lock();
|
|
|
|
}
|
2005-06-02 18:57:25 +04:00
|
|
|
}
|
2007-03-20 08:49:53 +03:00
|
|
|
if (status == Z_STREAM_END && !stream->avail_in) {
|
2009-01-08 06:54:47 +03:00
|
|
|
git_inflate_end(stream);
|
2007-03-05 11:21:37 +03:00
|
|
|
return buf;
|
2012-02-01 17:48:54 +04:00
|
|
|
}
|
|
|
|
|
2007-03-05 11:21:37 +03:00
|
|
|
if (status < 0)
|
2019-01-07 11:37:02 +03:00
|
|
|
error(_("corrupt loose object '%s'"), oid_to_hex(oid));
|
2007-03-05 11:21:37 +03:00
|
|
|
else if (stream->avail_in)
|
2018-07-21 10:49:39 +03:00
|
|
|
error(_("garbage at end of loose object '%s'"),
|
2019-01-07 11:37:02 +03:00
|
|
|
oid_to_hex(oid));
|
2007-03-05 11:21:37 +03:00
|
|
|
free(buf);
|
|
|
|
return NULL;
|
2012-02-01 17:48:54 +04:00
|
|
|
}
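The first few statements of unpack_loose_rest() above recover whatever payload was already inflated into the header buffer, past the NUL that ends the header, and clamp it to the expected object size. A minimal sketch of just that step, with zlib left out; `copy_initial_payload()` and its parameters are hypothetical names introduced for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * `hdrbuf` holds "<type> SP <size> NUL" followed by the first payload
 * bytes; `total_out` is how many bytes were inflated into it so far.
 * Copies at most `size` payload bytes into `dst` and returns the count.
 */
static size_t copy_initial_payload(const char *hdrbuf, size_t total_out,
				   unsigned char *dst, size_t size)
{
	size_t hdr_len = strlen(hdrbuf) + 1; /* header plus its NUL */
	size_t n;

	assert(total_out >= hdr_len);
	n = total_out - hdr_len;
	if (n > size)
		n = size; /* never copy more than the declared object size */
	memcpy(dst, hdrbuf + hdr_len, n);
	return n;
}
```

In the real function the returned count then decides where further inflation resumes (`buf + bytes`) and how much output space is left (`size - bytes`).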
|
|
|
|
|
2014-02-21 20:32:06 +04:00
|
|
|
/*
|
2005-06-02 18:57:25 +04:00
|
|
|
* We used to just use "sscanf()", but that's actually way
|
|
|
|
* too permissive for what we want to check. So do an anal
|
|
|
|
* object header parse by hand.
|
2014-02-21 20:32:06 +04:00
|
|
|
*/
|
object-file.c: stop dying in parse_loose_header()
Make parse_loose_header() return error codes and data instead of
invoking die() by itself.
For now we'll move the relevant die() call to loose_object_info() and
read_loose_object() to keep this change smaller. In a subsequent
commit we'll make read_loose_object() return an error code instead of
dying. We should also address the "allow_unknown" case (should be
moved to builtin/cat-file.c), but for now I'll be leaving it.
For making parse_loose_header() not die() change its prototype to
accept a "struct object_info *" instead of the "unsigned long *sizep"
it accepted before. Its callers can now check the populated
"oi->typep".
Because of this we don't need to pass in the "unsigned int flags"
which we used for OBJECT_INFO_ALLOW_UNKNOWN_TYPE, we can instead do
that check in loose_object_info().
This also refactors some confusing control flow around the "status"
variable. In some cases we set it to the return value of "error()",
i.e. -1, and later checked if "status < 0" was true.
Since 93cff9a978e (sha1_loose_object_info: return error for corrupted
objects, 2017-04-01) the return value of loose_object_info() (then
named sha1_loose_object_info()) had been a "status" variable that could be
any negative value, as we were expecting to return the "enum
object_type".
The only negative type happens to be OBJ_BAD, but the code still
assumed that more might be added. This was then used later in
e.g. c84a1f3ed4d (sha1_file: refactor read_object, 2017-06-21). Now
that parse_loose_header() will return 0 on success instead of the
type (which it'll stick into the "struct object_info") we don't need
to conflate these two cases in its callers.
Since parse_loose_header() doesn't need to return an arbitrary
"status" we only need to treat its "ret < 0" specially, but can
idiomatically overwrite it with our own error() return. This along
with having made unpack_loose_header() return an "enum
unpack_loose_header_result" in an earlier commit means that we can
move the previously nested if/else cases mostly into the "ULHR_OK"
branch of the "switch" statement.
We should be less silent if we reach that "status = -1" branch, which
happens if we've got trailing garbage in loose objects, see
f6371f92104 (sha1_file: add read_loose_object() function, 2017-01-13)
for a better way to handle it. For now let's punt on it, a subsequent
commit will address that edge case.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01 12:16:51 +03:00
|
|
|
int parse_loose_header(const char *hdr, struct object_info *oi)
|
2005-06-27 14:35:33 +04:00
|
|
|
{
|
2015-05-03 17:29:59 +03:00
|
|
|
const char *type_buf = hdr;
|
2021-11-02 18:46:10 +03:00
|
|
|
size_t size;
|
2015-05-03 17:29:59 +03:00
|
|
|
int type, type_len = 0;
|
2006-09-21 08:05:37 +04:00
|
|
|
|
2005-06-02 18:57:25 +04:00
|
|
|
/*
|
2015-05-03 17:29:59 +03:00
|
|
|
* The type can be of any size but is followed by
|
2007-02-26 22:55:59 +03:00
|
|
|
* a space.
|
2005-06-02 18:57:25 +04:00
|
|
|
*/
|
|
|
|
for (;;) {
|
|
|
|
char c = *hdr++;
|
unpack_sha1_header(): detect malformed object header
When opening a loose object file, we often do this sequence:
- prepare a short buffer for the object header (on stack)
- call unpack_sha1_header() and have early part of the object data
inflated, enough to fill the buffer
- parse that data in the short buffer, assuming that the first part
of the object is <typename> SP <length> NUL
Because the parsing function parse_sha1_header_extended() is not
given the number of bytes inflated into the header buffer, if you
craft a file whose early part inflates to a garbage sequence without SP
or NUL, and replace a loose object with it, it will end up reading
past the end of the inflated data.
To correct this, do the following four things:
- rename unpack_sha1_header() to unpack_sha1_short_header() and
have unpack_sha1_header_to_strbuf() keep calling that as its
helper function. This will detect and report zlib errors, but is
not aware of the format of a loose object (as before).
- introduce unpack_sha1_header() that calls the same helper
function, and when zlib reports it inflated OK into the buffer,
check if the inflated data has NUL. This would ensure that
parsing function will terminate within the buffer that holds the
inflated header.
- update unpack_sha1_header_to_strbuf() to check if the resulting
buffer has NUL for the same effect.
- update parse_sha1_header_extended() to make sure that its loop to
find the SP that terminates the <typename> stops at NUL.
Essentially, this makes unpack_*() functions that are asked to
unpack a loose object header to be a bit more strict and detect an
input that cannot possibly be a valid object header, even before the
parsing function kicks in.
Reported-by: Gustavo Grieco <gustavo.grieco@imag.fr>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-26 07:29:04 +03:00
|
|
|
if (!c)
|
|
|
|
return -1;
|
2005-06-02 18:57:25 +04:00
|
|
|
if (c == ' ')
|
|
|
|
break;
|
2015-05-03 17:29:59 +03:00
|
|
|
type_len++;
|
2005-06-02 18:57:25 +04:00
|
|
|
}
|
2005-06-27 14:35:33 +04:00
|
|
|
|
2015-05-03 17:29:59 +03:00
|
|
|
type = type_from_string_gently(type_buf, type_len, 1);
|
2018-02-14 21:59:23 +03:00
|
|
|
if (oi->type_name)
|
|
|
|
strbuf_add(oi->type_name, type_buf, type_len);
|
2015-05-03 17:29:59 +03:00
|
|
|
if (oi->typep)
|
|
|
|
*oi->typep = type;
|
2005-06-02 18:57:25 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The length must follow immediately, and be in canonical
|
|
|
|
* decimal format (ie "010" is not valid).
|
|
|
|
*/
|
|
|
|
size = *hdr++ - '0';
|
|
|
|
if (size > 9)
|
|
|
|
return -1;
|
|
|
|
if (size) {
|
|
|
|
for (;;) {
|
|
|
|
unsigned long c = *hdr - '0';
|
|
|
|
if (c > 9)
|
|
|
|
break;
|
|
|
|
hdr++;
|
2021-11-02 18:46:10 +03:00
|
|
|
size = st_add(st_mult(size, 10), c);
|
2014-02-21 20:32:04 +04:00
|
|
|
}
|
2012-02-01 17:48:55 +04:00
|
|
|
}
|
2015-05-03 17:29:59 +03:00
|
|
|
|
|
|
|
if (oi->sizep)
|
2021-11-02 18:46:10 +03:00
|
|
|
*oi->sizep = cast_size_t_to_ulong(size);
|
2005-06-02 18:57:25 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The length must be followed by a zero byte
|
|
|
|
*/
|
2021-10-01 12:16:51 +03:00
|
|
|
if (*hdr)
|
|
|
|
return -1;
|
2007-02-26 22:55:59 +03:00
|
|
|
|
2021-10-01 12:16:51 +03:00
|
|
|
/*
|
|
|
|
* The format is valid, but the type may still be bogus. The
|
|
|
|
* caller needs to check its oi->typep.
|
|
|
|
*/
|
|
|
|
return 0;
|
2005-08-01 04:53:44 +04:00
|
|
|
}
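The checks performed by parse_loose_header() above, together with the NUL test described in the unpack_sha1_header() commit message, can be tried in isolation. This is a minimal sketch under stated assumptions: `header_is_terminated()` and `toy_parse_header()` are hypothetical stand-ins, the type name is only measured (not looked up), and the overflow protection that the real code gets from st_add()/st_mult() is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the first `len` inflated bytes contain a NUL terminator. */
static int header_is_terminated(const char *buf, size_t len)
{
	assert(buf != NULL);
	return memchr(buf, '\0', len) != NULL;
}

/*
 * Parses "<type> SP <size> NUL". Returns 0 and fills *type_len and
 * *size on success, -1 on malformed input.
 */
static int toy_parse_header(const char *hdr, size_t *type_len,
			    unsigned long *size)
{
	const char *p = hdr;
	unsigned long n;
	unsigned int digit;

	assert(hdr && type_len && size);

	/* The type runs up to the first space; hitting NUL first is an error. */
	while (*p != ' ') {
		if (!*p)
			return -1;
		p++;
	}
	*type_len = (size_t)(p - hdr);
	p++; /* skip the space */

	/* The size must start with a digit. */
	digit = (unsigned int)(unsigned char)*p++ - '0';
	if (digit > 9)
		return -1;
	n = digit;

	/* Canonical decimal: a leading zero may not be followed by digits. */
	if (n) {
		for (;;) {
			digit = (unsigned int)(unsigned char)*p - '0';
			if (digit > 9)
				break;
			p++;
			n = n * 10 + digit; /* overflow checks omitted */
		}
	}

	/* The size must be followed directly by the terminating NUL. */
	if (*p)
		return -1;
	*size = n;
	return 0;
}
```

A caller would first confirm `header_is_terminated()` on the inflated bytes and only then call the parser, so that the scan for the space cannot run past the end of the buffer.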
|
|
|
|
|
sha1-file: modernize loose object file functions
The loose object access code in sha1-file.c is some of the oldest in
Git, and could use some modernizing. It mostly uses "unsigned char *"
for object ids, which these days should be "struct object_id".
It also uses the term "sha1_file" in many functions, which is confusing.
The term "loose_objects" is much better. It clearly distinguishes
them from packed objects (which didn't even exist back when the name
"sha1_file" came into being). And it also distinguishes it from the
checksummed-file concept in csum-file.c (which until recently was
actually called "struct sha1file"!).
This patch converts the functions {open,close,map,stat}_sha1_file() into
open_loose_object(), etc, and switches their sha1 arguments for
object_id structs. Similarly, path functions like fill_sha1_path()
become fill_loose_path() and use object_ids.
The function sha1_loose_object_info() already says "loose", so we can
just drop the "sha1" (and teach it to use object_id).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-07 11:35:42 +03:00
|
|
|
static int loose_object_info(struct repository *r,
|
|
|
|
const struct object_id *oid,
|
|
|
|
struct object_info *oi, int flags)
|
2005-06-03 02:20:54 +04:00
|
|
|
{
|
2015-05-03 17:29:59 +03:00
|
|
|
int status = 0;
|
2022-12-14 22:17:42 +03:00
|
|
|
int fd;
|
2015-05-03 17:29:59 +03:00
|
|
|
unsigned long mapsize;
|
2022-12-14 22:17:42 +03:00
|
|
|
const char *path;
|
2005-06-03 02:20:54 +04:00
|
|
|
void *map;
|
2011-06-10 22:52:15 +04:00
|
|
|
git_zstream stream;
|
2018-03-12 05:27:55 +03:00
|
|
|
char hdr[MAX_HEADER_LEN];
|
2015-05-03 17:29:59 +03:00
|
|
|
struct strbuf hdrbuf = STRBUF_INIT;
|
2017-06-22 03:40:21 +03:00
|
|
|
unsigned long size_scratch;
|
2021-10-01 12:16:51 +03:00
|
|
|
enum object_type type_scratch;
|
2021-10-01 12:16:48 +03:00
|
|
|
int allow_unknown = flags & OBJECT_INFO_ALLOW_UNKNOWN_TYPE;
|
2005-06-03 02:20:54 +04:00
|
|
|
|
2020-02-24 07:36:56 +03:00
|
|
|
if (oi->delta_base_oid)
|
|
|
|
oidclr(oi->delta_base_oid);
|
2013-12-21 18:24:20 +04:00
|
|
|
|
sha1_loose_object_info: make type lookup optional
Until recently, the only items to request from
sha1_object_info_extended were type and size. This meant
that we always had to open a loose object file to determine
one or the other. But with the addition of the disk_size
query, it's possible that we can fulfill the query without
even opening the object file at all. However, since the
function interface always returns the type, we have no way
of knowing whether the caller cares about it or not.
This patch only modifies sha1_loose_object_info to make type
lookup optional using an out-parameter, similar to the way
the size is handled (and the return value is "0" or "-1" for
success or error, respectively).
There should be no functional change yet, though, as
sha1_object_info_extended, the only caller, will always ask
for a type.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-07-12 10:30:48 +04:00
|
|
|
/*
|
|
|
|
* If we don't care about type or size, then we don't
|
2013-11-06 22:00:57 +04:00
|
|
|
* need to look inside the object at all. Note that we
|
|
|
|
* do not optimize out the stat call, even if the
|
|
|
|
* caller doesn't care about the disk-size, since our
|
|
|
|
* return value implicitly indicates whether the
|
|
|
|
* object even exists.
|
2013-07-12 10:30:48 +04:00
|
|
|
*/
|
2018-02-14 21:59:23 +03:00
|
|
|
if (!oi->typep && !oi->type_name && !oi->sizep && !oi->contentp) {
|
2013-11-06 22:00:57 +04:00
|
|
|
struct stat st;
|
sha1-file: use loose object cache for quick existence check
In cases where we expect to ask has_sha1_file() about a lot of objects
that we are not likely to have (e.g., during fetch negotiation), we
already use OBJECT_INFO_QUICK to sacrifice accuracy (due to racing with
a simultaneous write or repack) for speed (we avoid re-scanning the pack
directory).
However, even checking for loose objects can be expensive, as we will
stat() each one. On many systems this cost isn't too noticeable, but
stat() can be particularly slow on some operating systems, or due to
network filesystems.
Since the QUICK flag already tells us that we're OK with a slightly
stale answer, we can use that as a cue to look in our in-memory cache of
each object directory. That basically trades an in-memory binary search
for a stat() call.
Note that it is possible for this to actually be _slower_. We'll do a
full readdir() to fill the cache, so if you have a very large number of
loose objects and a very small number of lookups, that readdir() may end
up more expensive.
This shouldn't be a big deal in practice. If you have a large number of
reachable loose objects, you'll already run into performance problems
(which you should remedy by repacking). You may have unreachable objects
which wouldn't otherwise impact performance. Usually these would go away
with the prune step of "git gc", but they may be held for up to 2 weeks
in the default configuration.
So it comes down to how many such objects you might reasonably expect to
have, and how much slower readdir() on N entries is versus M stat() calls
(and here we really care about the syscall backing readdir(), like
getdents() on Linux, but I'll just call this readdir() below).
If N is much smaller than M (a typical packed repo), we know this is a
big win (few readdirs() followed by many uses of the resulting cache).
When N and M are similar in size, it's also a win. We care about the
latency of making a syscall, and readdir() should be giving us many
values in a single call. How many?
On Linux, running "strace -e getdents ls" shows a 32k buffer getting 512
entries per call (which is 64 bytes per entry; the name itself is 38
bytes, plus there are some other fields). So we can imagine that this is
always a win as long as the number of loose objects in the repository is
a factor of 500 less than the number of lookups you make. It's hard to
auto-tune this because we don't generally know up front how many lookups
we're going to do. But it's unlikely for this to perform significantly
worse.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12 17:54:42 +03:00
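The trade described in the commit message above (one in-memory binary search instead of one stat() per lookup) can be sketched roughly as follows. This is a minimal illustration, not Git's loose object cache: `struct loose_cache`, `cache_has()` and `entries_per_call()` are hypothetical names, and the cache is assumed to be a sorted array of hex object names filled once by a readdir() pass.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cache: hex object names, sorted with strcmp(), filled once. */
struct loose_cache {
	const char **names;
	size_t nr;
};

/* bsearch() passes the key first and a pointer to the array element second. */
static int cmp_name(const void *key, const void *elem)
{
	return strcmp((const char *)key, *(const char *const *)elem);
}

/* Returns 1 if the name is cached, 0 otherwise; no stat() call is made. */
static int cache_has(const struct loose_cache *c, const char *hex)
{
	assert(c && hex);
	return bsearch(hex, c->names, c->nr, sizeof(*c->names), cmp_name) != NULL;
}

/* Directory entries returned per readdir()-style syscall for a given buffer. */
static size_t entries_per_call(size_t buf_bytes, size_t entry_bytes)
{
	assert(entry_bytes > 0);
	return buf_bytes / entry_bytes;
}
```

With the figures quoted in the message, `entries_per_call(32768, 64)` gives the 512 entries per call, which is where the "factor of 500" estimate comes from.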
|
|
|
if (!oi->disk_sizep && (flags & OBJECT_INFO_QUICK))
|
2019-01-07 11:37:29 +03:00
|
|
|
return quick_has_loose(r, oid) ? 0 : -1;
|
sha1-file: modernize loose object file functions
The loose object access code in sha1-file.c is some of the oldest in
Git, and could use some modernizing. It mostly uses "unsigned char *"
for object ids, which these days should be "struct object_id".
It also uses the term "sha1_file" in many functions, which is confusing.
The term "loose_objects" is much better. It clearly distinguishes
them from packed objects (which didn't even exist back when the name
"sha1_file" came into being). And it also distinguishes them from the
checksummed-file concept in csum-file.c (which until recently was
actually called "struct sha1file"!).
This patch converts the functions {open,close,map,stat}_sha1_file() into
open_loose_object(), etc, and switches their sha1 arguments for
object_id structs. Similarly, path functions like fill_sha1_path()
become fill_loose_path() and use object_ids.
The function sha1_loose_object_info() already says "loose", so we can
just drop the "sha1" (and teach it to use object_id).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-07 11:35:42 +03:00
|
|
|
if (stat_loose_object(r, oid, &st, &path) < 0)
|
2013-11-06 22:00:57 +04:00
|
|
|
return -1;
|
|
|
|
if (oi->disk_sizep)
|
2013-07-12 10:37:53 +04:00
|
|
|
*oi->disk_sizep = st.st_size;
|
sha1_loose_object_info: make type lookup optional
Until recently, the only items to request from
sha1_object_info_extended were type and size. This meant
that we always had to open a loose object file to determine
one or the other. But with the addition of the disk_size
query, it's possible that we can fulfill the query without
even opening the object file at all. However, since the
function interface always returns the type, we have no way
of knowing whether the caller cares about it or not.
This patch only modifies sha1_loose_object_info to make type
lookup optional using an out-parameter, similar to the way
the size is handled (and the return value is "0" or "-1" for
success or error, respectively).
There should be no functional change yet, though, as
sha1_object_info_extended, the only caller, will always ask
for a type.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-07-12 10:30:48 +04:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2022-12-14 22:17:42 +03:00
|
|
|
fd = open_loose_object(r, oid, &path);
|
|
|
|
if (fd < 0) {
|
|
|
|
if (errno != ENOENT)
|
|
|
|
error_errno(_("unable to open loose object %s"), oid_to_hex(oid));
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
map = map_fd(fd, path, &mapsize);
|
2006-11-28 02:18:55 +03:00
|
|
|
if (!map)
|
2013-05-31 00:00:22 +04:00
|
|
|
return -1;
|
2017-06-22 03:40:21 +03:00
|
|
|
|
|
|
|
if (!oi->sizep)
|
|
|
|
oi->sizep = &size_scratch;
|
object-file.c: stop dying in parse_loose_header()
Make parse_loose_header() return error codes and data instead of
invoking die() by itself.
For now we'll move the relevant die() call to loose_object_info() and
read_loose_object() to keep this change smaller. In a subsequent
commit we'll make read_loose_object() return an error code instead of
dying. We should also address the "allow_unknown" case (should be
moved to builtin/cat-file.c), but for now I'll be leaving it.
To make parse_loose_header() not die(), change its prototype to
accept a "struct object_info *" instead of the "unsigned long *sizep"
it accepted before. Its callers can now check the populated
"oi->typep".
Because of this we don't need to pass in the "unsigned int flags"
which we used for OBJECT_INFO_ALLOW_UNKNOWN_TYPE, we can instead do
that check in loose_object_info().
This also refactors some confusing control flow around the "status"
variable. In some cases we set it to the return value of "error()",
i.e. -1, and later checked if "status < 0" was true.
Since 93cff9a978e (sha1_loose_object_info: return error for corrupted
objects, 2017-04-01) the return value of loose_object_info() (then
named sha1_loose_object_info()) had been a "status" variable that could be
any negative value, as we were expecting to return the "enum
object_type".
The only negative type happens to be OBJ_BAD, but the code still
assumed that more might be added. This was then used later in
e.g. c84a1f3ed4d (sha1_file: refactor read_object, 2017-06-21). Now
that parse_loose_header() will return 0 on success instead of the
type (which it'll stick into the "struct object_info") we don't need
to conflate these two cases in its callers.
Since parse_loose_header() doesn't need to return an arbitrary
"status" we only need to treat its "ret < 0" specially, but can
idiomatically overwrite it with our own error() return. This along
with having made unpack_loose_header() return an "enum
unpack_loose_header_result" in an earlier commit means that we can
move the previously nested if/else cases mostly into the "ULHR_OK"
branch of the "switch" statement.
We should be less silent if we reach that "status = -1" branch, which
happens if we've got trailing garbage in loose objects, see
f6371f92104 (sha1_file: add read_loose_object() function, 2017-01-13)
for a better way to handle it. For now let's punt on it, a subsequent
commit will address that edge case.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01 12:16:51 +03:00
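The interface change described above (return 0 or -1, report the type through an out-parameter, and let the caller decide whether a bad type is fatal) might look roughly like the sketch below. It assumes a header of the form "<type> <size>" terminated by NUL. Every name here (`sketch_parse_header`, `struct sketch_info`, `enum sketch_type`) is hypothetical and this is not Git's parse_loose_header(); size overflow is deliberately not handled.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the object type enum and struct object_info. */
enum sketch_type { SK_BAD = -1, SK_COMMIT = 1, SK_TREE, SK_BLOB, SK_TAG };

struct sketch_info {
	enum sketch_type *typep;
	unsigned long *sizep;
};

static enum sketch_type type_from_name(const char *s, size_t len)
{
	static const char *names[] = { NULL, "commit", "tree", "blob", "tag" };

	for (int i = 1; i <= 4; i++)
		if (strlen(names[i]) == len && !memcmp(names[i], s, len))
			return (enum sketch_type)i;
	return SK_BAD;
}

/*
 * Returns 0 when the header is well-formed, even if the type name is
 * unknown (then *typep is SK_BAD), and -1 when it is malformed.
 * It never exits the program; the caller chooses how to react.
 */
static int sketch_parse_header(const char *hdr, struct sketch_info *oi)
{
	const char *sp = strchr(hdr, ' ');
	unsigned long size = 0;
	const char *p;

	assert(oi && oi->typep && oi->sizep);
	if (!sp || sp == hdr)
		return -1;		/* no type name or no size field */
	p = sp + 1;
	if (*p < '0' || *p > '9')
		return -1;		/* size must start with a digit */
	if (*p == '0' && p[1] != '\0')
		return -1;		/* reject leading zeros */
	for (; *p >= '0' && *p <= '9'; p++)
		size = size * 10 + (unsigned long)(*p - '0');
	if (*p != '\0')
		return -1;		/* trailing garbage after the size */
	*oi->typep = type_from_name(hdr, (size_t)(sp - hdr));
	*oi->sizep = size;
	return 0;
}
```

A caller that does not allow unknown types would check `*oi->typep < 0` after a successful return, which mirrors the check the message says moves into loose_object_info().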
|
|
|
if (!oi->typep)
|
|
|
|
oi->typep = &type_scratch;
|
2017-06-22 03:40:21 +03:00
|
|
|
|
2013-07-12 10:37:53 +04:00
|
|
|
if (oi->disk_sizep)
|
|
|
|
*oi->disk_sizep = mapsize;
|
2021-10-01 12:16:48 +03:00
|
|
|
|
2021-10-01 12:16:49 +03:00
|
|
|
switch (unpack_loose_header(&stream, map, mapsize, hdr, sizeof(hdr),
|
|
|
|
allow_unknown ? &hdrbuf : NULL)) {
|
|
|
|
case ULHR_OK:
|
2021-10-01 12:16:51 +03:00
|
|
|
if (parse_loose_header(hdrbuf.len ? hdrbuf.buf : hdr, oi) < 0)
|
|
|
|
status = error(_("unable to parse %s header"), oid_to_hex(oid));
|
|
|
|
else if (!allow_unknown && *oi->typep < 0)
|
|
|
|
die(_("invalid object type"));
|
|
|
|
|
|
|
|
if (!oi->contentp)
|
|
|
|
break;
|
|
|
|
*oi->contentp = unpack_loose_rest(&stream, hdr, *oi->sizep, oid);
|
|
|
|
if (*oi->contentp)
|
|
|
|
goto cleanup;
|
|
|
|
|
|
|
|
status = -1;
|
2021-10-01 12:16:49 +03:00
|
|
|
break;
|
|
|
|
case ULHR_BAD:
|
2018-07-21 10:49:39 +03:00
|
|
|
status = error(_("unable to unpack %s header"),
|
2019-01-07 11:35:42 +03:00
|
|
|
oid_to_hex(oid));
|
2021-10-01 12:16:49 +03:00
|
|
|
break;
|
2021-10-01 12:16:50 +03:00
|
|
|
case ULHR_TOO_LONG:
|
|
|
|
status = error(_("header for %s too long, exceeds %d bytes"),
|
|
|
|
oid_to_hex(oid), MAX_HEADER_LEN);
|
|
|
|
break;
|
2021-10-01 12:16:49 +03:00
|
|
|
}
|
2017-06-22 03:40:21 +03:00
|
|
|
|
2022-12-14 22:17:42 +03:00
|
|
|
if (status && (flags & OBJECT_INFO_DIE_IF_CORRUPT))
|
|
|
|
die(_("loose object %s (stored in %s) is corrupt"),
|
|
|
|
oid_to_hex(oid), path);
|
|
|
|
|
2021-10-01 12:16:51 +03:00
|
|
|
git_inflate_end(&stream);
|
|
|
|
cleanup:
|
2005-06-03 02:20:54 +04:00
|
|
|
munmap(map, mapsize);
|
2017-06-22 03:40:21 +03:00
|
|
|
if (oi->sizep == &size_scratch)
|
|
|
|
oi->sizep = NULL;
|
2015-05-03 17:29:59 +03:00
|
|
|
strbuf_release(&hdrbuf);
|
2021-10-01 12:16:51 +03:00
|
|
|
if (oi->typep == &type_scratch)
|
|
|
|
oi->typep = NULL;
|
2017-08-11 23:36:14 +03:00
|
|
|
oi->whence = OI_LOOSE;
|
2021-10-01 12:16:51 +03:00
|
|
|
return status;
|
2005-06-03 02:20:54 +04:00
|
|
|
}
|
|
|
|
|
object-store: allow threaded access to object reading
Allow object reading to be performed by multiple threads protecting it
with an internal lock, the obj_read_mutex. The lock usage can be toggled
with enable_obj_read_lock() and disable_obj_read_lock(). Currently, the
functions which can be safely called in parallel are:
read_object_file_extended(), repo_read_object_file(),
read_object_file(), read_object_with_reference(), read_object(),
oid_object_info() and oid_object_info_extended(). It's also possible
to use obj_read_lock() and obj_read_unlock() to protect other sections
that cannot execute in parallel with object reading.
Probably there are many spots in the functions listed above that could
be executed unlocked (and thus, in parallel). But, for now, we are most
interested in allowing parallel access to zlib inflation. This is one of
the sections where object reading spends most of its time (e.g. up to
one-third of git-grep's execution time in the chromium repo corresponds
to inflation) and it's already thread-safe. So, to take advantage of
that, the obj_read_mutex is released when calling git_inflate() and
re-acquired right after, for every calling spot in
oid_object_info_extended()'s call chain. We may refine this lock to also
exploit other possible parallel spots in the future, but for now,
threaded zlib inflation should already give great speedups for threaded
object reading callers.
Note that add_delta_base_cache() was also modified to skip adding
already present entries to the cache. This wasn't possible before, but
it would be now, with the parallel inflation. Take for example the
following situation, where two threads - A and B - are executing the
code at unpack_entry():
1. Thread A is performing the decompression of a base O (which is not
yet in the cache) at PHASE II. Thread B is simultaneously trying to
unpack O, but just starting at PHASE I.
2. Since O is not yet in the cache, B will go to PHASE II to also
perform the decompression.
3. When they finish decompressing, one of them will get the object
reading mutex and go to PHASE III while the other waits for the
mutex. Let’s say A got the mutex first.
4. Thread A will add O to the cache, go through the rest of PHASE III
and return.
5. Thread B gets the mutex, also adds O to the cache (if the check wasn't
there) and returns.
Finally, it is also important to highlight that the object reading lock
can only ensure thread-safety in the mentioned functions thanks to two
complementary mechanisms: the use of 'struct raw_object_store's
replace_mutex, which guards sections in the object reading machinery
that would otherwise be thread-unsafe; and the 'struct pack_window's
inuse_cnt, which protects window reading operations (such as the one
performed during the inflation of a packed object), allowing them to
execute without the acquisition of the obj_read_mutex.
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-16 05:39:53 +03:00
|
|
|
int obj_read_use_lock = 0;
|
|
|
|
pthread_mutex_t obj_read_mutex;
|
|
|
|
|
|
|
|
void enable_obj_read_lock(void)
|
|
|
|
{
|
|
|
|
if (obj_read_use_lock)
|
|
|
|
return;
|
|
|
|
|
|
|
|
obj_read_use_lock = 1;
|
|
|
|
init_recursive_mutex(&obj_read_mutex);
|
|
|
|
}
|
|
|
|
|
|
|
|
void disable_obj_read_lock(void)
|
|
|
|
{
|
|
|
|
if (!obj_read_use_lock)
|
|
|
|
return;
|
|
|
|
|
|
|
|
obj_read_use_lock = 0;
|
|
|
|
pthread_mutex_destroy(&obj_read_mutex);
|
|
|
|
}
|
|
|
|
|
2017-12-08 18:27:14 +03:00
|
|
|
int fetch_if_missing = 1;
|
|
|
|
|
2020-01-16 05:39:53 +03:00
|
|
|
static int do_oid_object_info_extended(struct repository *r,
|
|
|
|
const struct object_id *oid,
|
|
|
|
struct object_info *oi, unsigned flags)
|
2006-11-28 02:18:55 +03:00
|
|
|
{
|
2017-06-22 03:40:23 +03:00
|
|
|
static struct object_info blank_oi = OBJECT_INFO_INIT;
|
sha1-file: remove OBJECT_INFO_SKIP_CACHED
In a partial clone, if a user provides the hash of the empty tree ("git
mktree </dev/null" - for SHA-1, this is 4b825d...) to a command which
requires that that object be parsed, for example:
git diff-tree 4b825d <a non-empty tree>
then Git will lazily fetch the empty tree, unnecessarily, because
parsing of that object invokes repo_has_object_file(), which does not
special-case the empty tree.
Instead, teach repo_has_object_file() to consult find_cached_object()
(which handles the empty tree), thus bringing it in line with the rest
of the object-store-accessing functions. A cost is that
repo_has_object_file() will now need to oideq upon each invocation, but
that is trivial compared to the filesystem lookup or the pack index
search required anyway. (And if find_cached_object() needs to do more
because of previous invocations to pretend_object_file(), all the more
reason to be consistent in whether we present cached objects.)
As a historical note, the function now known as repo_read_object_file()
was taught the empty tree in 346245a1bb ("hard-code the empty tree
object", 2008-02-13), and the function now known as oid_object_info()
was taught the empty tree in c4d9986f5f ("sha1_object_info: examine
cached_object store too", 2011-02-07). repo_has_object_file() was never
updated, perhaps due to oversight. The flag OBJECT_INFO_SKIP_CACHED,
introduced later in dfdd4afcf9 ("sha1_file: teach
sha1_object_info_extended more flags", 2017-06-26) and used in
e83e71c5e1 ("sha1_file: refactor has_sha1_file_with_flags", 2017-06-26),
was introduced to preserve this difference in empty-tree handling, but
now it can be removed.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-02 23:16:30 +03:00
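The behaviour described above (consult the in-memory cached objects before any on-disk or remote lookup, so that the empty tree never triggers a lazy fetch) can be sketched like this. The names are hypothetical, the object id is the abbreviated form quoted in the message, and `sketch_lookup_on_disk()` is only a counter standing in for the real filesystem, pack index, or fetch path.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical table of objects that always exist in memory. */
static const char *const sketch_cached[] = { "4b825d" };

static int sketch_find_cached(const char *id)
{
	for (size_t i = 0; i < sizeof(sketch_cached) / sizeof(sketch_cached[0]); i++)
		if (!strcmp(sketch_cached[i], id))
			return 1;
	return 0;
}

static int sketch_disk_lookups;		/* counts expensive lookups, for illustration */

/* Stand-in for the filesystem or pack index search; always misses here. */
static int sketch_lookup_on_disk(const char *id)
{
	(void)id;
	sketch_disk_lookups++;
	return 0;
}

/* Cheap string comparison first, expensive lookup only on a cache miss. */
static int sketch_has_object(const char *id)
{
	assert(id);
	if (sketch_find_cached(id))
		return 1;
	return sketch_lookup_on_disk(id);
}
```

The extra comparison on every call is the cost the message refers to; it is small next to the lookup it can avoid.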
|
|
|
struct cached_object *co;
|
2006-11-28 02:18:55 +03:00
|
|
|
struct pack_entry e;
|
2013-07-12 10:34:57 +04:00
|
|
|
int rtype;
|
2018-03-12 05:27:54 +03:00
|
|
|
const struct object_id *real = oid;
|
2017-12-08 18:27:14 +03:00
|
|
|
int already_retried = 0;
|
2017-01-10 21:47:14 +03:00
|
|
|
int tried_hook = 0;
|
2006-11-28 02:18:55 +03:00
|
|
|
|
2020-01-16 05:39:53 +03:00
|
|
|
|
2018-03-12 05:27:54 +03:00
|
|
|
if (flags & OBJECT_INFO_LOOKUP_REPLACE)
|
2018-04-25 21:21:06 +03:00
|
|
|
real = lookup_replace_object(r, oid);
|
2006-11-28 02:18:55 +03:00
|
|
|
|
2018-03-12 05:27:54 +03:00
|
|
|
if (is_null_oid(real))
|
sha1_file: fast-path null sha1 as a missing object
In theory nobody should ever ask the low-level object code
for a null sha1. It's used as a sentinel for "no such
object" in lots of places, so leaking through to this level
is a sign that the higher-level code is not being careful
about its error-checking. In practice, though, quite a few
code paths seem to rely on the null sha1 lookup failing as a
way to quietly propagate non-existence (e.g., by feeding it
to lookup_commit_reference_gently(), which then returns
NULL).
When this happens, we do two inefficient things:
1. We actually search for the null sha1 in packs and in
the loose object directory.
2. When we fail to find it, we re-scan the pack directory
in case a simultaneous repack happened to move it from
loose to packed. This can be very expensive if you have
a large number of packs.
Only the second one actually causes noticeable performance
problems, so we could treat them independently. But for the
sake of simplicity (both of code and of reasoning about it),
it makes sense to just declare that the null sha1 cannot be
a real on-disk object, and looking it up will always return
"no such object".
There's no real loss of functionality to do so. Its use as a
sentinel value means that anybody who is unlucky enough to
hit the 2^-160th chance of generating an object with that
sha1 is already going to find the object largely unusable.
In an ideal world, we'd simply fix all of the callers to
notice the null sha1 and avoid passing it to us. But a
simple experiment to catch this with a BUG() shows that
there are a large number of code paths that do so.
So in the meantime, let's fix the performance problem by
taking a fast exit from the object lookup when we see a null
sha1. p5551 shows off the improvement (when a fetched ref is
new, the "old" sha1 is 0{40}, which ends up being passed for
fast-forward checks, the status table abbreviations, etc):
Test HEAD^ HEAD
--------------------------------------------------------
5551.4: fetch 5.51(5.03+0.48) 0.17(0.10+0.06) -96.9%
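The fast exit can be sketched roughly as below. This is an illustration
under assumptions, not the patch itself: struct oid, oid_is_null(),
expensive_lookup() and object_info() are hypothetical names, and the
"expensive" path is a stub that only counts how often it is reached:

```c
#include <assert.h>
#include <string.h>

#define RAWSZ 20	/* raw size of a SHA-1 */

struct oid {
	unsigned char hash[RAWSZ];
};

/* Counts how many times the slow path was taken. */
static int expensive_lookups;

static int oid_is_null(const struct oid *oid)
{
	static const unsigned char zero[RAWSZ];

	return !memcmp(oid->hash, zero, RAWSZ);
}

/* Stand-in for searching packs and loose objects; it never finds anything. */
static int expensive_lookup(const struct oid *oid)
{
	(void)oid;
	expensive_lookups++;
	return -1;
}

static int object_info(const struct oid *oid)
{
	assert(oid);
	if (oid_is_null(oid))
		return -1;	/* fast exit: "no such object" */
	return expensive_lookup(oid);
}
```

A null id reports "missing" without touching the slow path, while any
other id still goes through the full lookup.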
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-11-22 02:17:39 +03:00
|
|
|
return -1;
|
|
|
|
|
2017-06-22 03:40:23 +03:00
|
|
|
if (!oi)
|
|
|
|
oi = &blank_oi;
|
|
|
|
|
2017-01-10 21:47:14 +03:00
|
|
|
retry:
|
sha1-file: remove OBJECT_INFO_SKIP_CACHED
In a partial clone, if a user provides the hash of the empty tree ("git
mktree </dev/null" - for SHA-1, this is 4b825d...) to a command which
requires that that object be parsed, for example:
git diff-tree 4b825d <a non-empty tree>
then Git will lazily fetch the empty tree, unnecessarily, because
parsing of that object invokes repo_has_object_file(), which does not
special-case the empty tree.
Instead, teach repo_has_object_file() to consult find_cached_object()
(which handles the empty tree), thus bringing it in line with the rest
of the object-store-accessing functions. A cost is that
repo_has_object_file() will now need to oideq upon each invocation, but
that is trivial compared to the filesystem lookup or the pack index
search required anyway. (And if find_cached_object() needs to do more
because of previous invocations to pretend_object_file(), all the more
reason to be consistent in whether we present cached objects.)
As a historical note, the function now known as repo_read_object_file()
was taught the empty tree in 346245a1bb ("hard-code the empty tree
object", 2008-02-13), and the function now known as oid_object_info()
was taught the empty tree in c4d9986f5f ("sha1_object_info: examine
cached_object store too", 2011-02-07). repo_has_object_file() was never
updated, perhaps due to oversight. The flag OBJECT_INFO_SKIP_CACHED,
introduced later in dfdd4afcf9 ("sha1_file: teach
sha1_object_info_extended more flags", 2017-06-26) and used in
e83e71c5e1 ("sha1_file: refactor has_sha1_file_with_flags", 2017-06-26),
was introduced to preserve this difference in empty-tree handling, but
now it can be removed.
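As a rough sketch of the idea (consult the in-memory cached objects before
doing anything expensive), assuming hypothetical names throughout:
find_cached(), lookup_or_fetch() and has_object() are illustrative only,
and the abbreviated string "4b825d" is used as a plain lookup key rather
than a real object id:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct cached {
	const char *id;
	const char *buf;
	size_t size;
};

/* Hypothetical built-in entry for the empty tree (abbreviated key). */
static const struct cached empty_tree = { "4b825d", "", 0 };

/* Counts how often we fell through to the expensive path. */
static int lazy_fetches;

static const struct cached *find_cached(const char *id)
{
	assert(id);
	return strcmp(id, empty_tree.id) ? NULL : &empty_tree;
}

/* Stand-in for the on-disk search plus lazy fetch; it always misses. */
static int lookup_or_fetch(const char *id)
{
	(void)id;
	lazy_fetches++;
	return 0;
}

static int has_object(const char *id)
{
	if (find_cached(id))
		return 1;	/* answered from memory, no fetch */
	return lookup_or_fetch(id);
}
```

The string comparison in find_cached() plays the role of the extra oideq
mentioned above: it costs something on every call, but far less than the
path it avoids.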
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-02 23:16:30 +03:00
|
|
|
co = find_cached_object(real);
|
|
|
|
if (co) {
|
|
|
|
if (oi->typep)
|
|
|
|
*(oi->typep) = co->type;
|
|
|
|
if (oi->sizep)
|
|
|
|
*(oi->sizep) = co->size;
|
|
|
|
if (oi->disk_sizep)
|
|
|
|
*(oi->disk_sizep) = 0;
|
2020-02-24 07:36:56 +03:00
|
|
|
if (oi->delta_base_oid)
|
|
|
|
oidclr(oi->delta_base_oid);
|
2020-01-02 23:16:30 +03:00
|
|
|
if (oi->type_name)
|
|
|
|
strbuf_addstr(oi->type_name, type_name(co->type));
|
|
|
|
if (oi->contentp)
|
|
|
|
*oi->contentp = xmemdupz(co->buf, co->size);
|
|
|
|
oi->whence = OI_CACHED;
|
|
|
|
return 0;
|
2011-02-05 17:03:02 +03:00
|
|
|
}
|
|
|
|
|
2017-12-08 18:27:14 +03:00
|
|
|
while (1) {
|
2018-05-30 08:04:10 +03:00
|
|
|
if (find_pack_entry(r, real, &e))
|
2017-12-08 18:27:14 +03:00
|
|
|
break;
|
|
|
|
|
2008-08-06 00:08:41 +04:00
|
|
|
/* Most likely it's a loose object. */
|
sha1-file: modernize loose object file functions
The loose object access code in sha1-file.c is some of the oldest in
Git, and could use some modernizing. It mostly uses "unsigned char *"
for object ids, which these days should be "struct object_id".
It also uses the term "sha1_file" in many functions, which is confusing.
The term "loose_objects" is much better. It clearly distinguishes
them from packed objects (which didn't even exist back when the name
"sha1_file" came into being). And it also distinguishes it from the
checksummed-file concept in csum-file.c (which until recently was
actually called "struct sha1file"!).
This patch converts the functions {open,close,map,stat}_sha1_file() into
open_loose_object(), etc, and switches their sha1 arguments for
object_id structs. Similarly, path functions like fill_sha1_path()
become fill_loose_path() and use object_ids.
The function sha1_loose_object_info() already says "loose", so we can
just drop the "sha1" (and teach it to use object_id).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-07 11:35:42 +03:00
|
|
|
if (!loose_object_info(r, real, oi, flags))
|
2013-07-12 10:34:57 +04:00
|
|
|
return 0;
|
2008-08-06 00:08:41 +04:00
|
|
|
|
|
|
|
/* Not a loose object; someone else may have just packed it. */
|
2018-03-13 18:30:29 +03:00
|
|
|
if (!(flags & OBJECT_INFO_QUICK)) {
|
2018-04-25 21:21:06 +03:00
|
|
|
reprepare_packed_git(r);
|
2018-05-30 08:04:10 +03:00
|
|
|
if (find_pack_entry(r, real, &e))
|
2018-03-13 18:30:29 +03:00
|
|
|
break;
|
2017-01-10 21:47:14 +03:00
|
|
|
if (core_virtualize_objects && !tried_hook) {
|
|
|
|
tried_hook = 1;
|
2017-03-15 21:43:05 +03:00
|
|
|
if (!read_object_process(oid))
|
2017-01-10 21:47:14 +03:00
|
|
|
goto retry;
|
|
|
|
}
|
2018-03-13 18:30:29 +03:00
|
|
|
}
|
2017-12-08 18:27:14 +03:00
|
|
|
|
2021-10-09 00:08:18 +03:00
|
|
|
/*
|
|
|
|
* If r is the_repository, this might be an attempt at
|
|
|
|
* accessing a submodule object as if it were in the_repository
|
|
|
|
* (having called add_submodule_odb() on that submodule's ODB).
|
|
|
|
* If any such ODBs exist, register them and try again.
|
|
|
|
*/
|
|
|
|
if (r == the_repository &&
|
|
|
|
register_all_submodule_odb_as_alternates())
|
2021-08-17 00:09:51 +03:00
|
|
|
/* We added some alternates; retry */
|
|
|
|
continue;
|
|
|
|
|
2017-12-08 18:27:14 +03:00
|
|
|
/* Check if it is a missing object */
|
2021-06-17 20:13:26 +03:00
|
|
|
if (fetch_if_missing && repo_has_promisor_remote(r) &&
|
|
|
|
!already_retried &&
|
2019-05-28 18:19:07 +03:00
|
|
|
!(flags & OBJECT_INFO_SKIP_FETCH_OBJECT)) {
|
2019-06-25 16:40:31 +03:00
|
|
|
promisor_remote_get_direct(r, real, 1);
|
2017-12-08 18:27:14 +03:00
|
|
|
already_retried = 1;
|
|
|
|
continue;
|
2017-06-22 03:40:22 +03:00
|
|
|
}
|
2017-12-08 18:27:14 +03:00
|
|
|
|
2022-12-14 22:17:42 +03:00
|
|
|
if (flags & OBJECT_INFO_DIE_IF_CORRUPT) {
|
|
|
|
const struct packed_git *p;
|
|
|
|
if ((flags & OBJECT_INFO_LOOKUP_REPLACE) && !oideq(real, oid))
|
|
|
|
die(_("replacement %s not found for %s"),
|
|
|
|
oid_to_hex(real), oid_to_hex(oid));
|
|
|
|
if ((p = has_packed_and_bad(r, real)))
|
|
|
|
die(_("packed object %s (stored in %s) is corrupt"),
|
|
|
|
oid_to_hex(real), p->pack_name);
|
|
|
|
}
|
2017-12-08 18:27:14 +03:00
|
|
|
return -1;
|
2006-11-28 02:18:55 +03:00
|
|
|
}
|
2008-10-30 02:02:47 +03:00
|
|
|
|
2017-06-22 03:40:23 +03:00
|
|
|
if (oi == &blank_oi)
|
|
|
|
/*
|
|
|
|
* We know that the caller doesn't actually need the
|
|
|
|
* information below, so return early.
|
|
|
|
*/
|
|
|
|
return 0;
|
2018-04-25 21:21:06 +03:00
|
|
|
rtype = packed_object_info(r, e.p, e.offset, oi);
|
2013-07-12 10:32:25 +04:00
|
|
|
if (rtype < 0) {
|
2021-09-11 23:40:33 +03:00
|
|
|
mark_bad_packed_object(e.p, real);
|
2020-01-16 05:39:53 +03:00
|
|
|
return do_oid_object_info_extended(r, real, oi, 0);
|
2017-08-11 23:36:14 +03:00
|
|
|
} else if (oi->whence == OI_PACKED) {
|
2011-05-13 02:51:38 +04:00
|
|
|
oi->u.packed.offset = e.offset;
|
|
|
|
oi->u.packed.pack = e.p;
|
|
|
|
oi->u.packed.is_delta = (rtype == OBJ_REF_DELTA ||
|
|
|
|
rtype == OBJ_OFS_DELTA);
|
2008-10-30 02:02:47 +03:00
|
|
|
}
|
|
|
|
|
2013-07-12 10:34:57 +04:00
|
|
|
return 0;
|
2006-11-28 02:18:55 +03:00
|
|
|
}
|
|
|
|
|
2020-01-16 05:39:53 +03:00
|
|
|
int oid_object_info_extended(struct repository *r, const struct object_id *oid,
|
|
|
|
struct object_info *oi, unsigned flags)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
obj_read_lock();
|
|
|
|
ret = do_oid_object_info_extended(r, oid, oi, flags);
|
|
|
|
obj_read_unlock();
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2013-10-27 02:34:30 +04:00
|
|
|
/* returns enum object_type or negative */
|
2018-04-25 21:21:06 +03:00
|
|
|
int oid_object_info(struct repository *r,
|
|
|
|
const struct object_id *oid,
|
|
|
|
unsigned long *sizep)
|
2011-05-13 02:51:38 +04:00
|
|
|
{
|
2013-07-12 10:34:57 +04:00
|
|
|
enum object_type type;
|
provide an initializer for "struct object_info"
An all-zero initializer is fine for this struct, but because
the first element is a pointer, call sites need to know to
use "NULL" instead of "0". Otherwise some static checkers
like "sparse" will complain; see d099b71 (Fix some sparse
warnings, 2013-07-18) for example. So let's provide an
initializer to make this easier to get right.
But let's also comment that memset() to zero is explicitly
OK[1]. One of the callers embeds object_info in another
struct which is initialized via memset (expand_data in
builtin/cat-file.c). Since our subset of C doesn't allow
assignment from a compound literal, handling this in any
other way is awkward, so we'd like to keep the ability to
initialize by memset(). By documenting this property, it
should make anybody who wants to change the initializer
think twice before doing so.
There's one other caller of interest. In parse_sha1_header(),
we did not initialize the struct fully in the first place.
This turned out not to be a bug because the sub-function it
calls does not look at any other fields except the ones we
did initialize. But that assumption might not hold in the
future, so it's a dangerous construct. This patch switches
it to initializing the whole struct, which protects us
against unexpected reads of the other fields.
[1] Obviously using memset() to initialize a pointer
violates the C standard, but we long ago decided that it
was an acceptable tradeoff in the real world.
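The two initialization styles discussed above can be compared in a small
sketch. The struct and macro names here (struct info, INFO_INIT, struct
wrapper) are hypothetical stand-ins for object_info, its initializer and
an embedding struct such as expand_data; the memset() half relies on
all-bits-zero being a null pointer, which holds on mainstream platforms
but is not guaranteed by the C standard:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct info {
	int *typep;		/* first member is a pointer */
	unsigned long *sizep;
	int whence;
};

/* NULL rather than 0, so static checkers do not complain. */
#define INFO_INIT { NULL }

/* An outer struct that embeds the info struct. */
struct wrapper {
	int flags;
	struct info info;
};

static int info_is_blank(const struct info *i)
{
	assert(i);
	return !i->typep && !i->sizep && !i->whence;
}

static struct info make_with_initializer(void)
{
	struct info i = INFO_INIT;	/* remaining members are zeroed too */

	return i;
}

static struct wrapper make_with_memset(void)
{
	struct wrapper w;

	memset(&w, 0, sizeof(w));	/* zeroes the embedded struct as well */
	return w;
}
```

Both routes should yield a blank struct on the platforms this tradeoff
was accepted for.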
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-11 12:24:35 +03:00
|
|
|
struct object_info oi = OBJECT_INFO_INIT;
|
2011-05-13 02:51:38 +04:00
|
|
|
|
2013-07-12 10:34:57 +04:00
|
|
|
oi.typep = &type;
|
2011-05-13 02:51:38 +04:00
|
|
|
oi.sizep = sizep;
|
2018-04-25 21:21:06 +03:00
|
|
|
if (oid_object_info_extended(r, oid, &oi,
|
|
|
|
OBJECT_INFO_LOOKUP_REPLACE) < 0)
|
2013-07-12 10:34:57 +04:00
|
|
|
return -1;
|
|
|
|
return type;
|
2011-05-13 02:51:38 +04:00
|
|
|
}
|
|
|
|
|
2018-01-28 03:13:11 +03:00
|
|
|
int pretend_object_file(void *buf, unsigned long len, enum object_type type,
|
|
|
|
struct object_id *oid)
|
2007-02-05 08:42:38 +03:00
|
|
|
{
|
|
|
|
struct cached_object *co;
|
|
|
|
|
2022-02-05 02:48:32 +03:00
|
|
|
hash_object_file(the_hash_algo, buf, len, type, oid);
|
2023-03-28 16:58:50 +03:00
|
|
|
if (repo_has_object_file_with_flags(the_repository, oid, OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT) ||
|
2020-07-22 01:50:20 +03:00
|
|
|
find_cached_object(oid))
|
2007-02-05 08:42:38 +03:00
|
|
|
return 0;
|
2014-03-04 02:32:02 +04:00
|
|
|
ALLOC_GROW(cached_objects, cached_object_nr + 1, cached_object_alloc);
|
2007-02-05 08:42:38 +03:00
|
|
|
co = &cached_objects[cached_object_nr++];
|
|
|
|
co->size = len;
|
2007-02-26 22:55:59 +03:00
|
|
|
co->type = type;
|
2007-02-16 04:02:06 +03:00
|
|
|
co->buf = xmalloc(len);
|
|
|
|
memcpy(co->buf, buf, len);
|
2018-05-02 03:26:03 +03:00
|
|
|
oidcpy(&co->oid, oid);
|
2007-02-05 08:42:38 +03:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2010-10-28 22:13:06 +04:00
|
|
|
/*
|
|
|
|
* This function dies on corrupt objects; the callers who want to
|
2023-01-07 16:48:55 +03:00
|
|
|
* deal with them should arrange to call oid_object_info_extended() and give
|
|
|
|
* error messages themselves.
|
2010-10-28 22:13:06 +04:00
|
|
|
*/
|
2023-01-07 16:50:33 +03:00
|
|
|
void *repo_read_object_file(struct repository *r,
|
|
|
|
const struct object_id *oid,
|
|
|
|
enum object_type *type,
|
|
|
|
unsigned long *size)
|
2008-07-15 05:46:48 +04:00
|
|
|
{
|
2023-01-07 16:48:55 +03:00
|
|
|
struct object_info oi = OBJECT_INFO_INIT;
|
2023-01-07 16:50:19 +03:00
|
|
|
unsigned flags = OBJECT_INFO_DIE_IF_CORRUPT | OBJECT_INFO_LOOKUP_REPLACE;
|
2010-10-28 22:13:06 +04:00
|
|
|
void *data;
|
2010-10-28 22:13:06 +04:00
|
|
|
|
2023-01-07 16:48:55 +03:00
|
|
|
oi.typep = type;
|
|
|
|
oi.sizep = size;
|
|
|
|
oi.contentp = &data;
|
|
|
|
if (oid_object_info_extended(r, oid, &oi, flags))
|
2023-01-12 19:06:49 +03:00
|
|
|
return NULL;
|
2009-01-23 12:06:53 +03:00
|
|
|
|
2023-01-07 16:48:55 +03:00
|
|
|
return data;
|
2008-07-15 05:46:48 +04:00
|
|
|
}
|
|
|
|
|
2019-06-27 12:28:47 +03:00
|
|
|
void *read_object_with_reference(struct repository *r,
|
|
|
|
const struct object_id *oid,
|
2022-02-05 02:48:34 +03:00
|
|
|
enum object_type required_type,
|
2005-04-29 03:42:27 +04:00
|
|
|
unsigned long *size,
|
2018-03-12 05:27:52 +03:00
|
|
|
struct object_id *actual_oid_return)
|
2005-04-21 05:06:49 +04:00
|
|
|
{
|
2022-02-05 02:48:34 +03:00
|
|
|
enum object_type type;
|
2005-04-21 05:06:49 +04:00
|
|
|
void *buffer;
|
|
|
|
unsigned long isize;
|
2018-03-12 05:27:52 +03:00
|
|
|
struct object_id actual_oid;
|
2005-04-21 05:06:49 +04:00
|
|
|
|
2018-03-12 05:27:52 +03:00
|
|
|
oidcpy(&actual_oid, oid);
|
2005-04-29 03:42:27 +04:00
|
|
|
while (1) {
|
|
|
|
int ref_length = -1;
|
|
|
|
const char *ref_type = NULL;
|
2005-04-21 05:06:49 +04:00
|
|
|
|
2019-06-27 12:28:47 +03:00
|
|
|
buffer = repo_read_object_file(r, &actual_oid, &type, &isize);
|
2005-04-29 03:42:27 +04:00
|
|
|
if (!buffer)
|
|
|
|
return NULL;
|
2007-02-26 22:55:59 +03:00
|
|
|
if (type == required_type) {
|
2005-04-29 03:42:27 +04:00
|
|
|
*size = isize;
|
2018-03-12 05:27:52 +03:00
|
|
|
if (actual_oid_return)
|
|
|
|
oidcpy(actual_oid_return, &actual_oid);
|
2005-04-29 03:42:27 +04:00
|
|
|
return buffer;
|
|
|
|
}
|
|
|
|
/* Handle references */
|
2007-02-26 22:55:59 +03:00
|
|
|
else if (type == OBJ_COMMIT)
|
2005-04-29 03:42:27 +04:00
|
|
|
ref_type = "tree ";
|
2007-02-26 22:55:59 +03:00
|
|
|
else if (type == OBJ_TAG)
|
2005-04-29 03:42:27 +04:00
|
|
|
ref_type = "object ";
|
|
|
|
else {
|
|
|
|
free(buffer);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
ref_length = strlen(ref_type);
|
2005-04-21 05:06:49 +04:00
|
|
|
|
2018-07-16 04:28:07 +03:00
|
|
|
if (ref_length + the_hash_algo->hexsz > isize ||
|
2008-02-18 23:47:52 +03:00
|
|
|
memcmp(buffer, ref_type, ref_length) ||
|
2018-03-12 05:27:52 +03:00
|
|
|
get_oid_hex((char *) buffer + ref_length, &actual_oid)) {
|
2005-04-29 03:42:27 +04:00
|
|
|
free(buffer);
|
|
|
|
return NULL;
|
|
|
|
}
|
2005-08-08 22:44:43 +04:00
|
|
|
free(buffer);
|
2005-04-29 03:42:27 +04:00
|
|
|
/* Now we have the ID of the referred-to object in
|
2018-03-12 05:27:52 +03:00
|
|
|
* actual_oid. Check again. */
|
2005-04-21 05:06:49 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-02-05 02:48:33 +03:00
|
|
|
static void hash_object_body(const struct git_hash_algo *algo, git_hash_ctx *c,
|
hash algorithms: use size_t for section lengths
Continue walking the code path for the >4GB `hash-object --literally`
test to the hash algorithm step for LLP64 systems.
This patch lets the SHA1DC code use `size_t`, making it compatible with
LLP64 data models (as used e.g. by Windows).
The interested reader of this patch will note that we adjust the
signature of the `git_SHA1DCUpdate()` function without updating _any_
call site. This certainly puzzled at least one reviewer already, so here
is an explanation:
This function is never called directly, but always via the macro
`platform_SHA1_Update`, which is usually called via the macro
`git_SHA1_Update`. However, we never call `git_SHA1_Update()` directly
in `struct git_hash_algo`. Instead, we call `git_hash_sha1_update()`,
which is defined thusly:
static void git_hash_sha1_update(git_hash_ctx *ctx,
const void *data, size_t len)
{
git_SHA1_Update(&ctx->sha1, data, len);
}
i.e. it contains an implicit downcast from `size_t` to `unsigned long`
(before this here patch). With this patch, there is no downcast anymore.
With this patch, finally, the t1007-hash-object.sh "files over 4GB hash
literally" test case is fixed.
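The downcast can be illustrated without an LLP64 machine by modelling the
two parameter widths with fixed-size types. This is only a sketch: the
helper names are hypothetical, uint32_t stands in for a 32-bit `unsigned
long` and uint64_t for a 64-bit `size_t`:

```c
#include <assert.h>
#include <stdint.h>

/* Models passing a length through a 32-bit `unsigned long` parameter. */
static uint32_t pass_len_narrow(uint64_t len)
{
	return (uint32_t)len;	/* keeps only the low 32 bits */
}

/* Models passing the same length through a 64-bit `size_t` parameter. */
static uint64_t pass_len_wide(uint64_t len)
{
	return len;
}

/* Returns 1 if the narrow parameter would hash the wrong number of bytes. */
static int narrowing_loses_bytes(uint64_t len)
{
	return pass_len_narrow(len) != pass_len_wide(len);
}

static void demo(void)
{
	uint64_t big = (UINT64_C(1) << 32) + 5;	/* just over 4GB */

	assert(pass_len_narrow(big) == 5);
	assert(pass_len_wide(big) == big);
}
```

For lengths below 4GB the two paths agree, which is presumably why the
downcast went unnoticed until the >4GB test exercised it.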
Signed-off-by: Philip Oakley <philipoakley@iee.email>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2021-11-13 00:16:51 +03:00
|
|
|
const void *buf, size_t len,
|
2022-02-05 02:48:33 +03:00
|
|
|
struct object_id *oid,
|
2021-11-13 00:14:50 +03:00
|
|
|
char *hdr, size_t *hdrlen)
|
2022-02-05 02:48:33 +03:00
|
|
|
{
|
|
|
|
algo->init_fn(c);
|
|
|
|
algo->update_fn(c, hdr, *hdrlen);
|
|
|
|
algo->update_fn(c, buf, len);
|
|
|
|
algo->final_oid_fn(oid, c);
|
|
|
|
}
|
|
|
|
|
2020-01-30 23:32:21 +03:00
|
|
|
static void write_object_file_prepare(const struct git_hash_algo *algo,
|
2021-11-13 00:14:50 +03:00
|
|
|
const void *buf, size_t len,
|
2022-02-05 02:48:33 +03:00
|
|
|
enum object_type type, struct object_id *oid,
|
2021-11-13 00:14:50 +03:00
|
|
|
char *hdr, size_t *hdrlen)
|
2005-06-28 06:03:13 +04:00
|
|
|
{
|
2018-02-01 05:18:41 +03:00
|
|
|
git_hash_ctx c;
|
2005-06-28 06:03:13 +04:00
|
|
|
|
|
|
|
/* Generate the header */
|
2022-02-05 02:48:33 +03:00
|
|
|
*hdrlen = format_object_header(hdr, *hdrlen, type, len);
|
2005-06-28 06:03:13 +04:00
|
|
|
|
2021-11-13 00:16:51 +03:00
|
|
|
/* Hash (function pointers) computation */
|
2022-02-05 02:48:33 +03:00
|
|
|
hash_object_body(algo, &c, buf, len, oid, hdr, hdrlen);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void write_object_file_prepare_literally(const struct git_hash_algo *algo,
|
2021-11-13 00:07:03 +03:00
|
|
|
const void *buf, size_t len,
|
2022-02-05 02:48:33 +03:00
|
|
|
const char *type, struct object_id *oid,
|
2021-11-13 00:14:50 +03:00
|
|
|
char *hdr, size_t *hdrlen)
|
2022-02-05 02:48:33 +03:00
|
|
|
{
|
|
|
|
git_hash_ctx c;
|
|
|
|
|
|
|
|
*hdrlen = format_object_header_literally(hdr, *hdrlen, type, len);
|
|
|
|
hash_object_body(algo, &c, buf, len, oid, hdr, hdrlen);
|
2005-06-28 06:03:13 +04:00
|
|
|
}
|
|
|
|
|
Create object subdirectories on demand
This makes it possible to have a "sparse" git object subdirectory
structure, something that has become much more attractive now that people
use pack-files all the time.
As a result of pack-files, a git object directory doesn't necessarily have
any individual objects lying around, and in that case it's just wasting
space to keep the empty first-level object directories around: on many
filesystems the 256 empty directories will be about 1MB of diskspace.
Even more importantly, after you re-pack a project that _used_ to be
unpacked, you could be left with huge directories that no longer contain
anything, but that waste space and take time to look through.
With this change, "git prune-packed" can just do an rmdir() on the
directories, and they'll get removed if empty, and re-created on demand.
This patch also tries to fix up "write_sha1_from_fd()" to use the new
common infrastructure for creating the object files, closing a hole where
we might otherwise leave half-written objects in the object database.
[jc: I unoptimized the part that really removes the fan-out directories
to ease transition. init-db still wastes 1MB of diskspace to hold 256
empty fan-outs, and prune-packed rmdir()'s the grown but empty directories,
but runs mkdir() immediately after that -- reducing the saving from 150KB
to 146KB. These parts will be re-introduced when everybody has the
on-demand capability.]
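"Re-created on demand" can be sketched as: try to create the object file,
and only if that fails because the fan-out directory is missing, create
the directory and retry. The function below is a hypothetical
illustration using plain POSIX calls; it is not the code from this patch,
and the file and directory modes are assumptions:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open `path` for writing, creating its parent directory `dir` only if
 * the first attempt fails with ENOENT. Returns a file descriptor, or -1
 * on failure.
 */
static int open_creating_dir(const char *dir, const char *path)
{
	int fd;

	assert(dir && path);
	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0444);
	if (fd >= 0 || errno != ENOENT)
		return fd;

	/* The fan-out directory was missing (e.g. removed by rmdir()). */
	if (mkdir(dir, 0777) && errno != EEXIST)
		return -1;
	return open(path, O_WRONLY | O_CREAT | O_EXCL, 0444);
}
```

Tolerating EEXIST covers the case where another process created the same
directory between the failed open() and the mkdir().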
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-10-09 02:54:01 +04:00
|
|
|
/*
|
2009-03-26 02:19:36 +03:00
|
|
|
* Move the just written object into its final resting place.
|
2005-10-09 02:54:01 +04:00
|
|
|
*/
|
2015-08-08 00:40:24 +03:00
|
|
|
int finalize_object_file(const char *tmpfile, const char *filename)
|
2005-10-09 02:54:01 +04:00
|
|
|
{
|
2008-09-19 02:24:46 +04:00
|
|
|
int ret = 0;
|
2009-03-26 02:19:36 +03:00
|
|
|
|
2009-04-28 02:32:25 +04:00
|
|
|
if (object_creation_mode == OBJECT_CREATION_USES_RENAMES)
|
2009-04-25 13:57:14 +04:00
|
|
|
goto try_rename;
|
|
|
|
else if (link(tmpfile, filename))
|
2008-09-19 02:24:46 +04:00
|
|
|
ret = errno;
|
2005-10-26 21:27:36 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Coda hack - coda doesn't like cross-directory links,
|
|
|
|
* so we fall back to a rename, which will mean that it
|
|
|
|
* won't be able to check collisions, but that's not a
|
|
|
|
* big deal.
|
|
|
|
*
|
|
|
|
* The same holds for FAT formatted media.
|
|
|
|
*
|
2009-03-28 09:14:39 +03:00
|
|
|
* When this succeeds, we just return. We have nothing
|
2005-10-26 21:27:36 +04:00
|
|
|
* left to unlink.
|
|
|
|
*/
|
|
|
|
if (ret && ret != EEXIST) {
|
2009-04-25 13:57:14 +04:00
|
|
|
try_rename:
|
2005-10-26 21:27:36 +04:00
|
|
|
if (!rename(tmpfile, filename))
|
2009-03-28 09:14:39 +03:00
|
|
|
goto out;
|
2005-10-26 03:41:20 +04:00
|
|
|
ret = errno;
|
2005-10-09 02:54:01 +04:00
|
|
|
}
|
2009-04-30 01:22:56 +04:00
|
|
|
unlink_or_warn(tmpfile);
|
2005-10-09 02:54:01 +04:00
|
|
|
if (ret) {
|
|
|
|
if (ret != EEXIST) {
|
2019-01-07 11:39:33 +03:00
|
|
|
return error_errno(_("unable to write file %s"), filename);
|
2005-10-09 02:54:01 +04:00
|
|
|
}
|
|
|
|
/* FIXME!!! Collision check here ? */
|
|
|
|
}
|
|
|
|
|
2009-03-28 09:14:39 +03:00
|
|
|
out:
|
2010-02-23 01:32:16 +03:00
|
|
|
if (adjust_shared_perm(filename))
|
2018-07-21 10:49:39 +03:00
|
|
|
return error(_("unable to set permission to '%s'"), filename);
|
2005-10-09 02:54:01 +04:00
|
|
|
return 0;
|
|
|
|
}
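The link-then-rename fallback in finalize_object_file() can be looked at in isolation. The following is a minimal sketch using only POSIX calls, not Git's actual code: `publish_file` is a hypothetical name, and the sketch leaves out the `object_creation_mode` switch, `adjust_shared_perm()` and Git's error reporting. As in the function above, an existing destination (EEXIST) is treated as success and no content comparison is done.

```c
#include <errno.h>
#include <stdio.h>	/* rename() */
#include <unistd.h>	/* link(), unlink() */

/*
 * Hypothetical helper (not part of Git): publish `tmpfile` under the
 * name `filename`. Returns 0 on success, including the case where
 * `filename` already existed, or an errno value on failure.
 */
static int publish_file(const char *tmpfile, const char *filename)
{
	int ret = 0;

	/* link() never overwrites: an existing target yields EEXIST. */
	if (link(tmpfile, filename))
		ret = errno;

	/*
	 * Any failure other than EEXIST (for example a filesystem
	 * without hard links): fall back to rename(), which gives up
	 * the collision check.
	 */
	if (ret && ret != EEXIST) {
		if (!rename(tmpfile, filename))
			return 0;	/* tmpfile is gone, nothing left to unlink */
		ret = errno;
	}

	/* The link succeeded, the target existed, or both calls failed. */
	unlink(tmpfile);
	return ret == EEXIST ? 0 : ret;
}
```

Trying link() first is what makes a collision observable at all: rename() would silently replace an existing object file.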
|
|
|
|
|
2022-02-05 02:48:32 +03:00
|
|
|
static void hash_object_file_literally(const struct git_hash_algo *algo,
|
2021-11-13 00:14:50 +03:00
|
|
|
const void *buf, size_t len,
|
2022-02-05 02:48:32 +03:00
|
|
|
const char *type, struct object_id *oid)
|
2006-10-14 14:45:36 +04:00
|
|
|
{
|
2018-03-12 05:27:55 +03:00
|
|
|
char hdr[MAX_HEADER_LEN];
|
2021-11-13 00:14:50 +03:00
|
|
|
size_t hdrlen = sizeof(hdr);
|
2022-02-05 02:48:32 +03:00
|
|
|
|
2022-02-05 02:48:33 +03:00
|
|
|
write_object_file_prepare_literally(algo, buf, len, type, oid, hdr, &hdrlen);
|
2006-10-14 14:45:36 +04:00
|
|
|
}
|
|
|
|
|
2022-02-05 02:48:32 +03:00
|
|
|
void hash_object_file(const struct git_hash_algo *algo, const void *buf,
|
2021-11-13 00:14:50 +03:00
|
|
|
size_t len, enum object_type type,
|
2022-02-05 02:48:32 +03:00
|
|
|
struct object_id *oid)
|
|
|
|
{
|
|
|
|
hash_object_file_literally(algo, buf, len, type_name(type), oid);
|
2006-10-14 14:45:36 +04:00
|
|
|
}
|
|
|
|
|
2008-06-11 05:47:18 +04:00
|
|
|
/* Finalize a file on disk, and close it. */
|
2022-03-30 21:14:15 +03:00
|
|
|
static void close_loose_object(int fd, const char *filename)
|
2008-06-11 05:47:18 +04:00
|
|
|
{
|
2022-03-11 01:43:21 +03:00
|
|
|
if (the_repository->objects->odb->will_destroy)
|
|
|
|
goto out;
|
2021-12-07 01:05:04 +03:00
|
|
|
|
core.fsyncmethod: batched disk flushes for loose-objects
When adding many objects to a repo with `core.fsync=loose-object`,
the cost of fsync'ing each object file can become prohibitive.
One major source of the cost of fsync is the implied flush of the
hardware writeback cache within the disk drive. This commit introduces
a new `core.fsyncMethod=batch` option that batches up hardware flushes.
It hooks into the bulk-checkin odb-transaction functionality, takes
advantage of tmp-objdir, and uses the writeout-only support code.
When the new mode is enabled, we do the following for each new object:
1a. Create the object in a tmp-objdir.
2a. Issue a pagecache writeback request and wait for it to complete.
At the end of the entire transaction when unplugging bulk checkin:
1b. Issue an fsync against a dummy file to flush the log and hardware
writeback cache, which should by now have seen the tmp-objdir writes.
2b. Rename all of the tmp-objdir files to their final names.
3b. When updating the index and/or refs, we assume that Git will issue
another fsync internal to that operation. This is not the default
today, but the user now has the option of syncing the index and there
is a separate patch series to implement syncing of refs.
On a filesystem with a singular journal that is updated during name
operations (e.g. create, link, rename, etc), such as NTFS, HFS+, or XFS
we would expect the fsync to trigger a journal writeout so that this
sequence is enough to ensure that the user's data is durable by the time
the git command returns. This sequence also ensures that no object files
appear in the main object store unless they are fsync-durable.
Batch mode is only enabled if core.fsync includes loose-objects. If
the legacy core.fsyncObjectFiles setting is enabled, but core.fsync does
not include loose-objects, we will use file-by-file fsyncing.
In step (1a) of the sequence, the tmp-objdir is created lazily to avoid
work if no loose objects are ever added to the ODB. We use a tmp-objdir
to maintain the invariant that no loose-objects are visible in the main
ODB unless they are properly fsync-durable. This is important since
future ODB operations that try to create an object with specific
contents will silently drop the new data if an object with the target
hash exists without checking that the loose-object contents match the
hash. Only a full git-fsck would restore the ODB to a functional state
where dataloss doesn't occur.
In step (1b) of the sequence, we issue a fsync against a dummy file
created specifically for the purpose. This method has a little higher
cost than using one of the input object files, but makes adding new
callers of this mechanism easier, since we don't need to figure out
which object file is "last" or risk sharing violations by caching the fd
of the last object file.
_Performance numbers_:
Linux - Hyper-V VM running Kernel 5.11 (Ubuntu 20.04) on a fast SSD.
Mac - macOS 11.5.1 running on a Mac mini on a 1TB Apple SSD.
Windows - Same host as Linux, a preview version of Windows 11.
Adding 500 files to the repo with 'git add'. Times reported in seconds.
object file syncing | Linux | Mac   | Windows
--------------------|-------|-------|--------
disabled            | 0.06  |  0.35 | 0.61
fsync               | 1.88  | 11.18 | 2.47
batch               | 0.15  |  0.41 | 1.53
Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-05 08:20:09 +03:00
|
|
|
if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
|
2022-06-04 00:30:34 +03:00
|
|
|
fsync_loose_object_bulk_checkin(fd, filename);
|
2022-04-05 08:20:09 +03:00
|
|
|
else if (fsync_object_files > 0)
|
2022-03-30 21:14:15 +03:00
|
|
|
fsync_or_die(fd, filename);
|
2022-03-11 01:43:21 +03:00
|
|
|
else
|
|
|
|
fsync_component_or_die(FSYNC_COMPONENT_LOOSE_OBJECT, fd,
|
2022-03-30 21:14:15 +03:00
|
|
|
filename);
|
2022-03-11 01:43:21 +03:00
|
|
|
|
|
|
|
out:
|
2008-06-11 05:47:18 +04:00
|
|
|
if (close(fd) != 0)
|
2019-01-07 11:39:24 +03:00
|
|
|
die_errno(_("error when closing loose object file"));
|
2008-06-11 05:47:18 +04:00
|
|
|
}
|
|
|
|
|
2008-06-14 21:50:12 +04:00
|
|
|
/* Size of directory component, including the ending '/' */
|
|
|
|
static inline int directory_size(const char *filename)
|
|
|
|
{
|
|
|
|
const char *s = strrchr(filename, '/');
|
|
|
|
if (!s)
|
|
|
|
return 0;
|
|
|
|
return s - filename + 1;
|
|
|
|
}
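A few worked values make the "including the ending '/'" convention concrete. This is a standalone sketch of the same computation; `dir_prefix_len` is a hypothetical name chosen here so it does not clash with the function above, and it returns size_t rather than int.

```c
#include <string.h>

/*
 * Hypothetical stand-in for directory_size(): length of the directory
 * part of `path`, counting the final '/', or 0 if there is no '/'.
 */
static size_t dir_prefix_len(const char *path)
{
	const char *slash = strrchr(path, '/');

	return slash ? (size_t)(slash - path) + 1 : 0;
}

/*
 * dir_prefix_len("objects/12/ab34") is 11, the length of "objects/12/"
 * dir_prefix_len("ab34")            is 0,  no directory component
 * dir_prefix_len("/ab34")           is 1,  just the leading "/"
 */
```

Because the count includes the slash, a caller can copy that many bytes and append a file name directly, without adding a separator.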
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This creates a temporary file in the same directory as the final
|
|
|
|
* 'filename'
|
|
|
|
*
|
|
|
|
* We want to avoid cross-directory filename renames, because those
|
|
|
|
* can have problems on various filesystems (FAT, NFS, Coda).
|
|
|
|
*/
|
2015-09-25 00:07:49 +03:00
|
|
|
static int create_tmpfile(struct strbuf *tmp, const char *filename)
|
2008-06-14 21:50:12 +04:00
|
|
|
{
|
|
|
|
int fd, dirlen = directory_size(filename);
|
|
|
|
|
2015-09-25 00:07:49 +03:00
|
|
|
strbuf_reset(tmp);
|
|
|
|
strbuf_add(tmp, filename, dirlen);
|
|
|
|
strbuf_addstr(tmp, "tmp_obj_XXXXXX");
|
|
|
|
fd = git_mkstemp_mode(tmp->buf, 0444);
|
sha1_file: avoid bogus "file exists" error message
This avoids the following misleading error message:
error: unable to create temporary sha1 filename ./objects/15: File exists
mkstemp can fail for many reasons, one of which, ENOENT, can occur if
the directory for the temp file doesn't exist. create_tmpfile tried to
handle this case by always trying to mkdir the directory, even if it
already existed. This caused errno to be clobbered, so one cannot tell
why mkstemp really failed, and it truncated the buffer to just the
directory name, resulting in the strange error message shown above.
Note that in both occasions that I've seen this failure, it has not been
due to a missing directory, or bad permissions, but some other, unknown
mkstemp failure mode that did not occur when I ran git again. This code
could perhaps be made more robust by retrying mkstemp, in case it was a
transient failure.
Signed-off-by: Joey Hess <joey@kitenet.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-11-20 21:56:28 +03:00
|
|
|
if (fd < 0 && dirlen && errno == ENOENT) {
|
2015-09-25 00:07:49 +03:00
|
|
|
/*
|
|
|
|
* Make sure the directory exists; note that the contents
|
|
|
|
* of the buffer are undefined after mkstemp returns an
|
|
|
|
* error, so we have to rewrite the whole buffer from
|
|
|
|
* scratch.
|
|
|
|
*/
|
|
|
|
strbuf_reset(tmp);
|
|
|
|
strbuf_add(tmp, filename, dirlen - 1);
|
|
|
|
if (mkdir(tmp->buf, 0777) && errno != EEXIST)
|
sha1_file.c:create_tmpfile(): Fix race when creating loose object dirs
There are cases (e.g. when running concurrent fetches in a repo) where
multiple Git processes concurrently attempt to create loose objects
within the same objects/XX/ dir. The creation of the loose object files
is (AFAICS) safe from races, but the creation of the objects/XX/ dir in
which the loose objects reside is unsafe, for example:
Two concurrent fetches - A and B. As part of its fetch, A needs to store
12aaaaa as a loose object. B, on the other hand, needs to store 12bbbbb
as a loose object. The objects/12 directory does not already exist.
Concurrently, both A and B determine that they need to create the
objects/12 directory (because their first call to git_mkstemp_mode()
within create_tmpfile() fails with ENOENT). One of them - let's say A -
executes the following mkdir() call before the other. This first call
returns success, and A moves on. When B gets around to calling mkdir(),
it fails with EEXIST, because A won the race. The mkdir() error causes B
to return -1 from create_tmpfile(), which propagates all the way,
resulting in the fetch failing with:
error: unable to create temporary file: File exists
fatal: failed to write object
fatal: unpack-objects failed
Although it's hard to add a testcase reproducing this issue, it's easy
to provoke if we insert a sleep after the
    if (mkdir(buffer, 0777) || adjust_shared_perm(buffer))
        return -1;
block, and then run two concurrent "git fetch"es against the same repo.
The fix is to simply handle mkdir() failing with EEXIST as a success.
If EEXIST is somehow returned for the wrong reasons (because the relevant
objects/XX is not a directory, or is otherwise unsuitable for object
storage), the following call to adjust_shared_perm(), or ultimately the
retried call to git_mkstemp_mode() will fail, and we end up returning
error from create_tmpfile() in any case.
Note that there are still cases where two users with unsuitable umasks
in a shared repo can end up in two races where one user first wins the
mkdir() race to create an objects/XX/ directory, and then the other user
wins the adjust_shared_perm() race to chmod() that directory, but fails
because it is (transiently, until the first user completes its chmod())
unwriteable to the other user. However, (an equivalent of) this race also
exists before this patch, and is made no worse by this patch.
Signed-off-by: Johan Herland <johan@herland.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-10-27 15:35:43 +04:00
|
|
|
return -1;
|
2015-09-25 00:07:49 +03:00
|
|
|
if (adjust_shared_perm(tmp->buf))
|
2008-06-14 21:50:12 +04:00
|
|
|
return -1;
|
|
|
|
|
|
|
|
/* Try again */
|
2015-09-25 00:07:49 +03:00
|
|
|
strbuf_addstr(tmp, "/tmp_obj_XXXXXX");
|
|
|
|
fd = git_mkstemp_mode(tmp->buf, 0444);
|
2008-06-14 21:50:12 +04:00
|
|
|
}
|
|
|
|
return fd;
|
|
|
|
}
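The retry logic in create_tmpfile() can be reproduced with plain POSIX calls. The sketch below is an illustration under stated assumptions, not Git's code: `open_tmp_beside` is a hypothetical name, a caller-supplied char buffer stands in for the strbuf, mkstemp() stands in for git_mkstemp_mode(), and the adjust_shared_perm() step is omitted. It keeps the two points the comments and commit messages above stress: the template is rebuilt from scratch after a failed mkstemp(), and mkdir() failing with EEXIST counts as success so that a concurrent writer creating the same directory does not cause an error.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Git): create a temporary file in
 * the same directory as `filename`, creating that directory on demand.
 * On success returns an open fd and leaves the temp path in `buf`;
 * returns -1 on failure.
 */
static int open_tmp_beside(char *buf, size_t buflen, const char *filename)
{
	const char *slash = strrchr(filename, '/');
	int dirlen = slash ? (int)(slash - filename) + 1 : 0;
	int n, fd;

	n = snprintf(buf, buflen, "%.*stmp_obj_XXXXXX", dirlen, filename);
	if (n < 0 || (size_t)n >= buflen) {
		errno = ENAMETOOLONG;
		return -1;
	}

	fd = mkstemp(buf);
	if (fd < 0 && dirlen && errno == ENOENT) {
		/* Do not trust the template after a failure: rebuild it. */
		snprintf(buf, buflen, "%.*s", dirlen - 1, filename);

		/* Another process may have won the race: EEXIST is fine. */
		if (mkdir(buf, 0777) && errno != EEXIST)
			return -1;

		/* Try again */
		snprintf(buf, buflen, "%.*stmp_obj_XXXXXX", dirlen, filename);
		fd = mkstemp(buf);
	}
	return fd;
}
```

If EEXIST was reported for something that is not a usable directory, the second mkstemp() fails and the caller still sees -1, which mirrors the reasoning in the race-fix commit message.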
|
|
|
|
|
2022-06-11 05:44:17 +03:00
|
|
|
/**
|
|
|
|
* Common steps for loose object writers to start writing loose
|
|
|
|
* objects:
|
|
|
|
*
|
|
|
|
* - Create tmpfile for the loose object.
|
|
|
|
* - Setup zlib stream for compression.
|
|
|
|
* - Start to feed header to zlib stream.
|
|
|
|
*
|
|
|
|
* Returns a "fd", which should later be provided to
|
|
|
|
* end_loose_object_common().
|
|
|
|
*/
|
|
|
|
static int start_loose_object_common(struct strbuf *tmp_file,
|
|
|
|
const char *filename, unsigned flags,
|
|
|
|
git_zstream *stream,
|
|
|
|
unsigned char *buf, size_t buflen,
|
|
|
|
git_hash_ctx *c,
|
|
|
|
char *hdr, int hdrlen)
|
|
|
|
{
|
|
|
|
int fd;
|
|
|
|
|
|
|
|
fd = create_tmpfile(tmp_file, filename);
|
|
|
|
if (fd < 0) {
|
|
|
|
if (flags & HASH_SILENT)
|
|
|
|
return -1;
|
|
|
|
else if (errno == EACCES)
|
|
|
|
return error(_("insufficient permission for adding "
|
|
|
|
"an object to repository database %s"),
|
|
|
|
get_object_directory());
|
|
|
|
else
|
|
|
|
return error_errno(
|
|
|
|
_("unable to create temporary file"));
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Setup zlib stream for compression */
|
|
|
|
git_deflate_init(stream, zlib_compression_level);
|
|
|
|
stream->next_out = buf;
|
|
|
|
stream->avail_out = buflen;
|
|
|
|
the_hash_algo->init_fn(c);
|
|
|
|
|
|
|
|
/* Start to feed header to zlib stream */
|
|
|
|
stream->next_in = (unsigned char *)hdr;
|
|
|
|
stream->avail_in = hdrlen;
|
|
|
|
while (git_deflate(stream, 0) == Z_OK)
|
|
|
|
; /* nothing */
|
|
|
|
the_hash_algo->update_fn(c, hdr, hdrlen);
|
|
|
|
|
|
|
|
return fd;
|
|
|
|
}
|
|
|
|
|
2022-06-11 05:44:18 +03:00
|
|
|
/**
|
|
|
|
* Common steps for the inner git_deflate() loop for writing loose
|
|
|
|
* objects. Returns what git_deflate() returns.
|
|
|
|
*/
|
|
|
|
static int write_loose_object_common(git_hash_ctx *c,
|
|
|
|
git_zstream *stream, const int flush,
|
|
|
|
unsigned char *in0, const int fd,
|
|
|
|
unsigned char *compressed,
|
|
|
|
const size_t compressed_len)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = git_deflate(stream, flush ? Z_FINISH : 0);
|
|
|
|
the_hash_algo->update_fn(c, in0, stream->next_in - in0);
|
2022-12-13 22:35:07 +03:00
|
|
|
if (write_in_full(fd, compressed, stream->next_out - compressed) < 0)
|
|
|
|
die_errno(_("unable to write loose object file"));
|
2022-06-11 05:44:18 +03:00
|
|
|
stream->next_out = compressed;
|
|
|
|
stream->avail_out = compressed_len;
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-06-11 05:44:17 +03:00
|
|
|
/**
|
|
|
|
* Common steps for loose object writers to end writing loose objects:
|
|
|
|
*
|
|
|
|
* - End the compression of zlib stream.
|
|
|
|
* - Get the calculated oid to "oid".
|
|
|
|
*/
|
|
|
|
static int end_loose_object_common(git_hash_ctx *c, git_zstream *stream,
|
|
|
|
struct object_id *oid)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
ret = git_deflate_end_gently(stream);
|
|
|
|
if (ret != Z_OK)
|
|
|
|
return ret;
|
|
|
|
the_hash_algo->final_oid_fn(oid, c);
|
|
|
|
|
|
|
|
return Z_OK;
|
|
|
|
}
|
|
|
|
|
2018-01-28 03:13:21 +03:00
|
|
|
static int write_loose_object(const struct object_id *oid, char *hdr,
|
|
|
|
int hdrlen, const void *buf, unsigned long len,
|
2021-10-12 17:30:49 +03:00
|
|
|
time_t mtime, unsigned flags)
|
2005-04-19 00:04:43 +04:00
|
|
|
{
|
2009-01-29 08:56:34 +03:00
|
|
|
int fd, ret;
|
2010-02-21 07:27:31 +03:00
|
|
|
unsigned char compressed[4096];
|
2011-06-10 22:52:15 +04:00
|
|
|
git_zstream stream;
|
2018-02-01 05:18:41 +03:00
|
|
|
git_hash_ctx c;
|
2018-01-28 03:13:21 +03:00
|
|
|
struct object_id parano_oid;
|
2015-09-25 00:07:49 +03:00
|
|
|
static struct strbuf tmp_file = STRBUF_INIT;
|
2018-01-17 20:54:54 +03:00
|
|
|
static struct strbuf filename = STRBUF_INIT;
|
|
|
|
|
2022-04-05 08:20:09 +03:00
|
|
|
if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
|
|
|
|
prepare_loose_object_bulk_checkin();
|
|
|
|
|
sha1-file: modernize loose object file functions
The loose object access code in sha1-file.c is some of the oldest in
Git, and could use some modernizing. It mostly uses "unsigned char *"
for object ids, which these days should be "struct object_id".
It also uses the term "sha1_file" in many functions, which is confusing.
The term "loose_objects" is much better. It clearly distinguishes
them from packed objects (which didn't even exist back when the name
"sha1_file" came into being). And it also distinguishes it from the
checksummed-file concept in csum-file.c (which until recently was
actually called "struct sha1file"!).
This patch converts the functions {open,close,map,stat}_sha1_file() into
open_loose_object(), etc, and switches their sha1 arguments for
object_id structs. Similarly, path functions like fill_sha1_path()
become fill_loose_path() and use object_ids.
The function sha1_loose_object_info() already says "loose", so we can
just drop the "sha1" (and teach it to use object_id).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-07 11:35:42 +03:00
|
|
|
loose_object_path(the_repository, &filename, oid);
|
2005-04-25 21:19:53 +04:00
|
|
|
|
2022-06-11 05:44:17 +03:00
|
|
|
fd = start_loose_object_common(&tmp_file, filename.buf, flags,
|
|
|
|
&stream, compressed, sizeof(compressed),
|
|
|
|
&c, hdr, hdrlen);
|
|
|
|
if (fd < 0)
|
|
|
|
return -1;
|
2005-04-25 21:19:53 +04:00
|
|
|
|
|
|
|
/* Then the data itself.. */
|
2010-04-02 04:03:18 +04:00
|
|
|
stream.next_in = (void *)buf;
|
2005-04-25 21:19:53 +04:00
|
|
|
stream.avail_in = len;
|
2010-02-21 07:27:31 +03:00
|
|
|
do {
|
2010-02-21 23:48:06 +03:00
|
|
|
unsigned char *in0 = stream.next_in;
|
2022-06-11 05:44:18 +03:00
|
|
|
|
|
|
|
ret = write_loose_object_common(&c, &stream, 1, in0, fd,
|
|
|
|
compressed, sizeof(compressed));
|
2010-02-21 07:27:31 +03:00
|
|
|
} while (ret == Z_OK);
|
|
|
|
|
Be more careful about zlib return values
When creating a new object, we use "deflate(stream, Z_FINISH)" in a loop
until it no longer returns Z_OK, and then we do "deflateEnd()" to finish
up business.
That should all work, but the fact is, it's not how you're _supposed_ to
use the zlib return values properly:
- deflate() should never return Z_OK in the first place, except if we
need to increase the output buffer size (which we're not doing, and
should never need to do, since we pre-allocated a buffer that is
supposed to be able to hold the output in full). So the "while()" loop
was incorrect: Z_OK doesn't actually mean "ok, continue", it means "ok,
allocate more memory for me and continue"!
- if we got an error return, we would consider it to be end-of-stream,
but it could be some internal zlib error. In short, we should check
for Z_STREAM_END explicitly, since that's the only valid return value
anyway for the Z_FINISH case.
- we never checked deflateEnd() return codes at all.
Now, admittedly, none of these issues should ever happen, unless there is
some internal bug in zlib. So this patch should make zero difference, but
it seems to be the right thing to do.
We should probably be anal and check the return value of "deflateInit()"
too!
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-20 21:38:34 +03:00
|
|
|
if (ret != Z_STREAM_END)
|
2018-07-21 10:49:39 +03:00
|
|
|
die(_("unable to deflate new object %s (%d)"), oid_to_hex(oid),
|
2018-01-28 03:13:21 +03:00
|
|
|
ret);
|
2022-06-11 05:44:17 +03:00
|
|
|
ret = end_loose_object_common(&c, &stream, &parano_oid);
|
2007-03-20 21:38:34 +03:00
|
|
|
if (ret != Z_OK)
|
2018-07-21 10:49:39 +03:00
|
|
|
die(_("deflateEnd on object %s failed (%d)"), oid_to_hex(oid),
|
2018-01-28 03:13:21 +03:00
|
|
|
ret);
|
2018-08-29 00:22:48 +03:00
|
|
|
if (!oideq(oid, &parano_oid))
|
2018-07-21 10:49:39 +03:00
|
|
|
die(_("confused by unstable object source data for %s"),
|
2018-01-28 03:13:21 +03:00
|
|
|
oid_to_hex(oid));
|
2007-03-20 21:38:34 +03:00
|
|
|
|
2022-03-30 21:14:15 +03:00
|
|
|
close_loose_object(fd, tmp_file.buf);
|
2005-04-19 00:04:43 +04:00
|
|
|
|
2008-05-14 09:32:48 +04:00
|
|
|
if (mtime) {
|
|
|
|
struct utimbuf utb;
|
|
|
|
utb.actime = mtime;
|
|
|
|
utb.modtime = mtime;
|
2021-10-12 17:30:49 +03:00
|
|
|
if (utime(tmp_file.buf, &utb) < 0 &&
|
|
|
|
!(flags & HASH_SILENT))
|
2018-07-21 10:49:39 +03:00
|
|
|
warning_errno(_("failed utime() on %s"), tmp_file.buf);
|
2008-05-14 09:32:48 +04:00
|
|
|
}
|
|
|
|
|
2018-01-17 20:54:54 +03:00
|
|
|
return finalize_object_file(tmp_file.buf, filename.buf);
|
2005-04-19 00:04:43 +04:00
|
|
|
}
|
2005-04-24 05:47:23 +04:00
|
|
|
|
2017-09-08 12:32:43 +03:00
|
|
|
static int freshen_loose_object(const struct object_id *oid,
|
|
|
|
int skip_virtualized_objects)
|
2014-10-16 02:42:22 +04:00
|
|
|
{
|
2017-09-08 12:32:43 +03:00
|
|
|
return check_and_freshen(oid, 1, skip_virtualized_objects);
|
2014-10-16 02:42:22 +04:00
|
|
|
}
|
|
|
|
|
2018-05-02 03:25:34 +03:00
|
|
|
static int freshen_packed_object(const struct object_id *oid)
|
2014-10-16 02:42:22 +04:00
|
|
|
{
|
|
|
|
struct pack_entry e;
|
2018-05-02 03:25:35 +03:00
|
|
|
if (!find_pack_entry(the_repository, oid, &e))
|
2015-04-20 22:55:00 +03:00
|
|
|
return 0;
|
2022-05-21 02:18:17 +03:00
|
|
|
if (e.p->is_cruft)
|
|
|
|
return 0;
|
2015-04-20 22:55:00 +03:00
|
|
|
if (e.p->freshened)
|
|
|
|
return 1;
|
|
|
|
if (!freshen_file(e.p->pack_name))
|
|
|
|
return 0;
|
|
|
|
e.p->freshened = 1;
|
|
|
|
return 1;
|
2014-10-16 02:42:22 +04:00
|
|
|
}
|
|
|
|
|
2022-06-11 05:44:19 +03:00
|
|
|
int stream_loose_object(struct input_stream *in_stream, size_t len,
|
|
|
|
struct object_id *oid)
|
|
|
|
{
|
|
|
|
int fd, ret, err = 0, flush = 0;
|
|
|
|
unsigned char compressed[4096];
|
|
|
|
git_zstream stream;
|
|
|
|
git_hash_ctx c;
|
|
|
|
struct strbuf tmp_file = STRBUF_INIT;
|
|
|
|
struct strbuf filename = STRBUF_INIT;
|
|
|
|
int dirlen;
|
|
|
|
char hdr[MAX_HEADER_LEN];
|
|
|
|
int hdrlen;
|
|
|
|
|
|
|
|
if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
|
|
|
|
prepare_loose_object_bulk_checkin();
|
|
|
|
|
|
|
|
/* Since oid is not determined, save tmp file to odb path. */
|
|
|
|
strbuf_addf(&filename, "%s/", get_object_directory());
|
|
|
|
hdrlen = format_object_header(hdr, sizeof(hdr), OBJ_BLOB, len);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Common steps for write_loose_object and stream_loose_object to
|
|
|
|
* start writing loose objects:
|
|
|
|
*
|
|
|
|
* - Create tmpfile for the loose object.
|
|
|
|
* - Setup zlib stream for compression.
|
|
|
|
* - Start to feed header to zlib stream.
|
|
|
|
*/
|
|
|
|
fd = start_loose_object_common(&tmp_file, filename.buf, 0,
|
|
|
|
&stream, compressed, sizeof(compressed),
|
|
|
|
&c, hdr, hdrlen);
|
|
|
|
if (fd < 0) {
|
|
|
|
err = -1;
|
|
|
|
goto cleanup;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Then the data itself.. */
|
|
|
|
do {
|
|
|
|
unsigned char *in0 = stream.next_in;
|
|
|
|
|
|
|
|
if (!stream.avail_in && !in_stream->is_finished) {
|
|
|
|
const void *in = in_stream->read(in_stream, &stream.avail_in);
|
|
|
|
stream.next_in = (void *)in;
|
|
|
|
in0 = (unsigned char *)in;
|
|
|
|
/* All data has been read. */
|
|
|
|
if (in_stream->is_finished)
|
|
|
|
flush = 1;
|
|
|
|
}
|
|
|
|
ret = write_loose_object_common(&c, &stream, flush, in0, fd,
|
|
|
|
compressed, sizeof(compressed));
|
|
|
|
/*
|
|
|
|
* Unlike write_loose_object(), we do not have the entire
|
|
|
|
* buffer. If we get Z_BUF_ERROR due to too few input bytes,
|
|
|
|
* then we'll replenish them in the next input_stream->read()
|
|
|
|
* call when we loop.
|
|
|
|
*/
|
|
|
|
} while (ret == Z_OK || ret == Z_BUF_ERROR);
|
|
|
|
|
|
|
|
if (stream.total_in != len + hdrlen)
|
|
|
|
die(_("write stream object %ld != %"PRIuMAX), stream.total_in,
|
|
|
|
(uintmax_t)len + hdrlen);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Common steps for write_loose_object and stream_loose_object to
|
|
|
|
* end writing loose object:
|
|
|
|
*
|
|
|
|
* - End the compression of zlib stream.
|
|
|
|
* - Get the calculated oid.
|
|
|
|
*/
|
|
|
|
if (ret != Z_STREAM_END)
|
|
|
|
die(_("unable to stream deflate new object (%d)"), ret);
|
|
|
|
ret = end_loose_object_common(&c, &stream, oid);
|
|
|
|
if (ret != Z_OK)
|
|
|
|
die(_("deflateEnd on stream object failed (%d)"), ret);
|
|
|
|
close_loose_object(fd, tmp_file.buf);
|
|
|
|
|
2017-09-08 12:32:43 +03:00
|
|
|
if (freshen_packed_object(oid) || freshen_loose_object(oid, 1)) {
|
2022-06-11 05:44:19 +03:00
|
|
|
unlink_or_warn(tmp_file.buf);
|
|
|
|
goto cleanup;
|
|
|
|
}
|
|
|
|
|
|
|
|
loose_object_path(the_repository, &filename, oid);
|
|
|
|
|
|
|
|
/* We finally know the object path, and create the missing dir. */
|
|
|
|
dirlen = directory_size(filename.buf);
|
|
|
|
if (dirlen) {
|
|
|
|
struct strbuf dir = STRBUF_INIT;
|
|
|
|
strbuf_add(&dir, filename.buf, dirlen);
|
|
|
|
|
|
|
|
if (mkdir_in_gitdir(dir.buf) && errno != EEXIST) {
|
|
|
|
err = error_errno(_("unable to create directory %s"), dir.buf);
|
|
|
|
strbuf_release(&dir);
|
|
|
|
goto cleanup;
|
|
|
|
}
|
|
|
|
strbuf_release(&dir);
|
|
|
|
}
|
|
|
|
|
|
|
|
err = finalize_object_file(tmp_file.buf, filename.buf);
|
|
|
|
cleanup:
|
|
|
|
strbuf_release(&tmp_file);
|
|
|
|
strbuf_release(&filename);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2021-11-13 00:14:50 +03:00
|
|
|
int write_object_file_flags(const void *buf, size_t len,
|
2022-02-05 02:48:26 +03:00
|
|
|
enum object_type type, struct object_id *oid,
|
2021-10-12 17:30:49 +03:00
|
|
|
unsigned flags)
|
2008-05-14 09:32:48 +04:00
|
|
|
{
|
2018-03-12 05:27:55 +03:00
|
|
|
char hdr[MAX_HEADER_LEN];
|
2021-11-13 00:14:50 +03:00
|
|
|
size_t hdrlen = sizeof(hdr);
|
2008-05-14 09:32:48 +04:00
|
|
|
|
|
|
|
/* Normally if we have it in the pack then we do not bother writing
|
|
|
|
* it out into .git/objects/??/?{38} file.
|
|
|
|
*/
|
2020-01-30 23:32:21 +03:00
|
|
|
write_object_file_prepare(the_hash_algo, buf, len, type, oid, hdr,
|
|
|
|
&hdrlen);
|
2017-09-08 12:32:43 +03:00
|
|
|
if (freshen_packed_object(oid) || freshen_loose_object(oid, 1))
|
2008-05-14 09:32:48 +04:00
|
|
|
return 0;
|
2021-10-12 17:30:49 +03:00
|
|
|
return write_loose_object(oid, hdr, hdrlen, buf, len, 0, flags);
|
2008-05-14 09:32:48 +04:00
|
|
|
}
|
|
|
|
|
2021-11-13 00:07:03 +03:00
|
|
|
int write_object_file_literally(const void *buf, size_t len,
|
2022-02-05 02:48:31 +03:00
|
|
|
const char *type, struct object_id *oid,
|
|
|
|
unsigned flags)
|
2015-05-04 10:25:15 +03:00
|
|
|
{
|
|
|
|
char *header;
|
2021-11-13 00:14:50 +03:00
|
|
|
size_t hdrlen;
|
|
|
|
int status = 0;
|
2015-05-04 10:25:15 +03:00
|
|
|
|
|
|
|
/* type string, SP, %lu of the length plus NUL must fit this */
|
2018-03-12 05:27:55 +03:00
|
|
|
hdrlen = strlen(type) + MAX_HEADER_LEN;
|
2015-09-25 00:06:42 +03:00
|
|
|
header = xmalloc(hdrlen);
|
2022-02-05 02:48:33 +03:00
|
|
|
write_object_file_prepare_literally(the_hash_algo, buf, len, type,
|
|
|
|
oid, header, &hdrlen);
|
2015-05-04 10:25:15 +03:00
|
|
|
|
|
|
|
if (!(flags & HASH_WRITE_OBJECT))
|
|
|
|
goto cleanup;
|
2017-09-08 12:32:43 +03:00
|
|
|
if (freshen_packed_object(oid) || freshen_loose_object(oid, 1))
|
2015-05-04 10:25:15 +03:00
|
|
|
goto cleanup;
|
2021-10-12 17:30:49 +03:00
|
|
|
status = write_loose_object(oid, header, hdrlen, buf, len, 0, 0);
|
2015-05-04 10:25:15 +03:00
|
|
|
|
|
|
|
cleanup:
|
|
|
|
free(header);
|
|
|
|
return status;
|
|
|
|
}
|
|
|
|
|
2018-01-28 03:13:20 +03:00
|
|
|
int force_object_loose(const struct object_id *oid, time_t mtime)
|
2008-05-14 09:32:48 +04:00
|
|
|
{
|
|
|
|
void *buf;
|
|
|
|
unsigned long len;
|
2023-01-07 16:48:55 +03:00
|
|
|
struct object_info oi = OBJECT_INFO_INIT;
|
2008-05-14 09:32:48 +04:00
|
|
|
enum object_type type;
|
2018-03-12 05:27:55 +03:00
|
|
|
char hdr[MAX_HEADER_LEN];
|
2008-05-14 09:32:48 +04:00
|
|
|
int hdrlen;
|
2008-10-18 04:37:31 +04:00
|
|
|
int ret;
|
2008-05-14 09:32:48 +04:00
|
|
|
|
2018-05-02 03:25:34 +03:00
|
|
|
if (has_loose_object(oid))
|
2008-05-14 09:32:48 +04:00
|
|
|
return 0;
|
2023-01-07 16:48:55 +03:00
|
|
|
oi.typep = &type;
|
|
|
|
oi.sizep = &len;
|
|
|
|
oi.contentp = &buf;
|
|
|
|
if (oid_object_info_extended(the_repository, oid, &oi, 0))
|
2019-01-07 11:39:33 +03:00
|
|
|
return error(_("cannot read object for %s"), oid_to_hex(oid));
|
2022-02-05 02:48:25 +03:00
|
|
|
hdrlen = format_object_header(hdr, sizeof(hdr), type, len);
|
2021-10-12 17:30:49 +03:00
|
|
|
ret = write_loose_object(oid, hdr, hdrlen, buf, len, mtime, 0);
|
2008-10-18 04:37:31 +04:00
|
|
|
free(buf);
|
|
|
|
|
|
|
|
return ret;
|
2008-05-14 09:32:48 +04:00
|
|
|
}
|
|
|
|
|
2020-08-06 02:06:49 +03:00
|
|
|
int has_object(struct repository *r, const struct object_id *oid,
|
|
|
|
unsigned flags)
|
|
|
|
{
|
|
|
|
int quick = !(flags & HAS_OBJECT_RECHECK_PACKED);
|
|
|
|
unsigned object_info_flags = OBJECT_INFO_SKIP_FETCH_OBJECT |
|
|
|
|
(quick ? OBJECT_INFO_QUICK : 0);
|
|
|
|
|
|
|
|
if (!startup_info->have_repository)
|
|
|
|
return 0;
|
|
|
|
return oid_object_info_extended(r, oid, NULL, object_info_flags) >= 0;
|
|
|
|
}
|
|
|
|
|
2019-02-07 09:05:27 +03:00
|
|
|
int repo_has_object_file_with_flags(struct repository *r,
|
|
|
|
const struct object_id *oid, int flags)
|
2005-04-24 05:47:23 +04:00
|
|
|
{
|
2017-04-12 01:47:13 +03:00
|
|
|
if (!startup_info->have_repository)
|
|
|
|
return 0;
|
sha1-file: remove OBJECT_INFO_SKIP_CACHED
In a partial clone, if a user provides the hash of the empty tree ("git
mktree </dev/null" - for SHA-1, this is 4b825d...) to a command which
requires that that object be parsed, for example:
git diff-tree 4b825d <a non-empty tree>
then Git will lazily fetch the empty tree, unnecessarily, because
parsing of that object invokes repo_has_object_file(), which does not
special-case the empty tree.
Instead, teach repo_has_object_file() to consult find_cached_object()
(which handles the empty tree), thus bringing it in line with the rest
of the object-store-accessing functions. A cost is that
repo_has_object_file() will now need to oideq upon each invocation, but
that is trivial compared to the filesystem lookup or the pack index
search required anyway. (And if find_cached_object() needs to do more
because of previous invocations to pretend_object_file(), all the more
reason to be consistent in whether we present cached objects.)
As a historical note, the function now known as repo_read_object_file()
was taught the empty tree in 346245a1bb ("hard-code the empty tree
object", 2008-02-13), and the function now known as oid_object_info()
was taught the empty tree in c4d9986f5f ("sha1_object_info: examine
cached_object store too", 2011-02-07). repo_has_object_file() was never
updated, perhaps due to oversight. The flag OBJECT_INFO_SKIP_CACHED,
introduced later in dfdd4afcf9 ("sha1_file: teach
sha1_object_info_extended more flags", 2017-06-26) and used in
e83e71c5e1 ("sha1_file: refactor has_sha1_file_with_flags", 2017-06-26),
was introduced to preserve this difference in empty-tree handling, but
now it can be removed.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-02 23:16:30 +03:00
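For context on why the empty tree can be hard-coded at all: an object id is computed over a header of the form "<type> <size>" followed by a NUL and then the payload, so an object with no payload is fully determined by its header. A minimal sketch of that header formatting follows; `sketch_object_header` is a hypothetical stand-in, not Git's format_object_header().

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical stand-in: write "<type> <size>" plus the terminating NUL
 * into `hdr`. Returns the header length including the NUL, or -1 if it
 * does not fit.
 */
static int sketch_object_header(char *hdr, size_t hdrlen,
				const char *type, unsigned long size)
{
	int n;

	assert(hdr && type);
	n = snprintf(hdr, hdrlen, "%s %lu", type, size);
	if (n < 0 || (size_t)n + 1 > hdrlen)
		return -1;
	return n + 1; /* the NUL is part of the hashed header */
}
```

For the empty tree the type is "tree" and the size is 0, so the header alone is the entire hashed input.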
|
|
|
return oid_object_info_extended(r, oid, NULL, flags) >= 0;
|
2015-11-10 05:22:19 +03:00
|
|
|
}
|
|
|
|
|
2018-11-14 03:12:48 +03:00
|
|
|
int repo_has_object_file(struct repository *r,
|
|
|
|
const struct object_id *oid)
|
2015-11-10 05:22:19 +03:00
|
|
|
{
|
2019-02-07 09:05:27 +03:00
|
|
|
return repo_has_object_file_with_flags(r, oid, 0);
|
fetch: use "quick" has_sha1_file for tag following
When we auto-follow tags in a fetch, we look at all of the
tags advertised by the remote and fetch ones where we don't
already have the tag, but we do have the object it peels to.
This involves a lot of calls to has_sha1_file(), some of
which we can reasonably expect to fail. Since 45e8a74
(has_sha1_file: re-check pack directory before giving up,
2013-08-30), this may cause many calls to
reprepare_packed_git(), which is potentially expensive.
This has gone unnoticed for several years because it
requires a fairly unique setup to matter:
1. You need to have a lot of packs on the client side to
make reprepare_packed_git() expensive (the most
expensive part is finding duplicates in an unsorted
list, which is currently quadratic).
2. You need a large number of tag refs on the server side
that are candidates for auto-following (i.e., that the
client doesn't have). Each one triggers a re-read of
the pack directory.
3. Under normal circumstances, the client would
auto-follow those tags and after one large fetch, (2)
would no longer be true. But if those tags point to
history which is disconnected from what the client
otherwise fetches, then it will never auto-follow, and
those candidates will impact it on every fetch.
So when all three are true, each fetch pays an extra
O(nr_tags * nr_packs^2) cost, mostly in string comparisons
on the pack names. This was exacerbated by 47bf4b0
(prepare_packed_git_one: refactor duplicate-pack check,
2014-06-30) which uses a slightly more expensive string
check, under the assumption that the duplicate check doesn't
happen very often (and it shouldn't; the real problem here
is how often we are calling reprepare_packed_git()).
This patch teaches fetch to use HAS_SHA1_QUICK to sacrifice
accuracy for speed, in cases where we might be racy with a
simultaneous repack. This is similar to the fix in 0eeb077
(index-pack: avoid excessive re-reading of pack directory,
2015-06-09). As with that case, it's OK for has_sha1_file() to
occasionally say "no I don't have it" when we do, because
the worst case is not a corruption, but simply that we may
fail to auto-follow a tag that points to it.
Here are results from the included perf script, which sets
up a situation similar to the one described above:
Test HEAD^ HEAD
----------------------------------------------------------
5550.4: fetch 11.21(10.42+0.78) 0.08(0.04+0.02) -99.3%
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-13 19:53:44 +03:00
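The trade-off in this message can be sketched with a toy object store: a lookup that misses normally re-reads the pack directory before giving up, while a "quick" lookup skips that step and accepts an occasional false negative. Everything here (`toy_store`, `TOY_QUICK`, the rescan counter) is made up for illustration and does not mirror Git's data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_QUICK 1 /* hypothetical flag: do not rescan on a miss */

struct toy_store {
	const char **known; /* ids currently loaded */
	size_t nr;
	unsigned rescans;   /* times the "pack directory" was re-read */
};

static int toy_find(const struct toy_store *s, const char *id)
{
	size_t i;

	for (i = 0; i < s->nr; i++)
		if (!strcmp(s->known[i], id))
			return 1;
	return 0;
}

/* Stand-in for the expensive re-read; here it only bumps a counter. */
static void toy_rescan(struct toy_store *s)
{
	s->rescans++;
}

static int toy_has_object(struct toy_store *s, const char *id, unsigned flags)
{
	assert(s && id);
	if (toy_find(s, id))
		return 1;
	if (flags & TOY_QUICK)
		return 0; /* accept a possible false negative */
	toy_rescan(s);    /* pay the rescan cost on every miss */
	return toy_find(s, id);
}
```

With many expected misses, as in tag auto-following, the quick variant never triggers a rescan, which is where the savings reported above come from.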
|
|
|
}
|
|
|
|
|
hash-object: use fsck for object checks
Since c879daa237 (Make hash-object more robust against malformed
objects, 2011-02-05), we've done some rudimentary checks against objects
we're about to write by running them through our usual parsers for
trees, commits, and tags.
These parsers catch some problems, but they are not nearly as careful as
the fsck functions (which makes sense; the parsers are designed to be
fast and forgiving, bailing only when the input is unintelligible). We
are better off doing the more thorough fsck checks when writing objects.
Doing so at write time is much better than writing garbage only to find
out later (after building more history atop it!) that fsck complains
about it, or hosts with transfer.fsckObjects reject it.
This is obviously going to be a user-visible behavior change, and the
test changes earlier in this series show the scope of the impact. But
I'd argue that this is OK:
- the documentation for hash-object is already vague about which
checks we might do, saying that --literally will allow "any
garbage[...] which might not otherwise pass standard object parsing
or git-fsck checks". So we are already covered under the documented
behavior.
- users don't generally run hash-object anyway. There are a lot of
spots in the tests that needed to be updated because creating
garbage objects is something that Git's tests disproportionately do.
- it's hard to imagine anyone thinking the new behavior is worse. Any
object we reject would be a potential problem down the road for the
user. And if they really want to create garbage, --literally is
already the escape hatch they need.
Note that the change here is actually in index_mem(), which handles the
HASH_FORMAT_CHECK flag passed by hash-object. That flag is also used by
"git-replace --edit" to sanity-check the result. Covering that with more
thorough checks likewise seems like a good thing.
Besides being more thorough, there are a few other bonuses:
- we get rid of some questionable stack allocations of object structs.
These don't seem to currently cause any problems in practice, but
they subtly violate some of the assumptions made by the rest of the
code (e.g., the "struct commit" we put on the stack and
zero-initialize will not have a proper index from
alloc_commit_index()).
- likewise, those parsed object structs are the source of some small
memory leaks
- the resulting messages are much better. For example:
[before]
$ echo 'tree 123' | git hash-object -t commit --stdin
error: bogus commit object 0000000000000000000000000000000000000000
fatal: corrupt commit
[after]
$ echo 'tree 123' | git.compile hash-object -t commit --stdin
error: object fails fsck: badTreeSha1: invalid 'tree' line format - bad sha1
fatal: refusing to create malformed object
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-18 23:44:12 +03:00
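To give a feel for the kind of strictness involved, here is a small sketch that validates the first line of a commit: the literal "tree ", then a full-length hex id, then a newline. It is not Git's fsck code; `sketch_check_tree_line` is hypothetical, and the 40-digit length assumes SHA-1.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define SKETCH_HEXSZ 40 /* assumption: SHA-1 object ids */

/*
 * Returns 0 if `buf` starts with a well-formed "tree <hex>\n" line,
 * else -1. isxdigit() accepts both hex cases, a simplification here.
 */
static int sketch_check_tree_line(const char *buf, size_t len)
{
	size_t i;

	assert(buf);
	if (len < 5 + SKETCH_HEXSZ + 1 || memcmp(buf, "tree ", 5))
		return -1;
	for (i = 0; i < SKETCH_HEXSZ; i++)
		if (!isxdigit((unsigned char)buf[5 + i]))
			return -1;
	return buf[5 + SKETCH_HEXSZ] == '\n' ? 0 : -1;
}
```

The input from the example above, 'tree 123', fails the length check immediately, which corresponds to the "invalid 'tree' line format" class of error.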
|
|
|
/*
|
|
|
|
* We can't use the normal fsck_error_function() for index_mem(),
|
|
|
|
* because we don't yet have a valid oid for it to report. Instead,
|
|
|
|
* report the minimal fsck error here, and rely on the caller to
|
|
|
|
* give more context.
|
|
|
|
*/
|
|
|
|
static int hash_format_check_report(struct fsck_options *opts,
|
|
|
|
const struct object_id *oid,
|
|
|
|
enum object_type object_type,
|
|
|
|
enum fsck_msg_type msg_type,
|
|
|
|
enum fsck_msg_id msg_id,
|
|
|
|
const char *message)
|
|
|
|
{
|
|
|
|
error(_("object fails fsck: %s"), message);
|
|
|
|
return 1;
|
2011-02-05 13:52:21 +03:00
|
|
|
}
|
|
|
|
|
2018-09-21 18:57:31 +03:00
|
|
|
static int index_mem(struct index_state *istate,
|
|
|
|
struct object_id *oid, void *buf, size_t size,
|
2011-05-08 12:47:33 +04:00
|
|
|
enum object_type type,
|
|
|
|
const char *path, unsigned flags)
|
2006-05-23 22:19:04 +04:00
|
|
|
{
|
2022-02-05 02:48:24 +03:00
|
|
|
int ret = 0;
|
2022-02-05 02:48:23 +03:00
|
|
|
int re_allocated = 0;
|
2011-05-08 12:47:33 +04:00
|
|
|
int write_object = flags & HASH_WRITE_OBJECT;
|
2005-05-02 10:45:49 +04:00
|
|
|
|
2005-07-09 03:51:55 +04:00
|
|
|
if (!type)
|
2007-02-28 22:45:56 +03:00
|
|
|
type = OBJ_BLOB;
|
Lazy man's auto-CRLF
It currently does NOT know about file attributes, so it does its
conversion purely based on content. Maybe that is more in the "git
philosophy" anyway, since content is king, but I think we should try to do
the file attributes to turn it off on demand.
Anyway, BY DEFAULT it is off regardless, because it requires a
[core]
AutoCRLF = true
in your config file to be enabled. We could make that the default for
Windows, of course, the same way we do some other things (filemode etc).
But you can actually enable it on UNIX, and it will cause:
- "git update-index" will write blobs without CRLF
- "git diff" will diff working tree files without CRLF
- "git checkout" will write files to the working tree _with_ CRLF
and things work fine.
Funnily, it actually shows an odd file in git itself:
git clone -n git test-crlf
cd test-crlf
git config core.autocrlf true
git checkout
git diff
shows a diff for "Documentation/docbook-xsl.css". Why? Because we have
actually checked in that file *with* CRLF! So when "core.autocrlf" is
true, we'll always generate a *different* hash for it in the index,
because the index hash will be for the content _without_ CRLF.
Is this complete? I dunno. It seems to work for me. It doesn't use the
filename at all right now, and that's probably a deficiency (we could
certainly make the "is_binary()" heuristics also take standard filename
heuristics into account).
I don't pass in the filename at all for the "index_fd()" case
(git-update-index), so that would need to be passed around, but this
actually works fine.
NOTE NOTE NOTE! The "is_binary()" heuristics are totally made-up by yours
truly. I will not guarantee that they work at all reasonably. Caveat
emptor. But it _is_ simple, and it _is_ safe, since it's all off by
default.
The patch is pretty simple - the biggest part is the new "convert.c" file,
but even that is really just basic stuff that anybody can write in
"Teaching C 101" as a final project for their first class in programming.
Not to say that it's bug-free, of course - but at least we're not talking
about rocket surgery here.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-02-13 22:07:23 +03:00
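On the way into the repository, the content-based conversion described here amounts to dropping the CR from each CRLF pair. The sketch below does only that, in place; it leaves out the "is_binary()" heuristics and the config check, and `sketch_crlf_to_lf` is a hypothetical name rather than anything in convert.c.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Rewrite `buf` in place so every "\r\n" becomes "\n"; a lone '\r' is kept.
 * Returns the new length.
 */
static size_t sketch_crlf_to_lf(char *buf, size_t len)
{
	size_t in, out = 0;

	assert(buf || !len);
	for (in = 0; in < len; in++) {
		if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n')
			continue; /* skip the CR; the LF is copied next round */
		buf[out++] = buf[in];
	}
	return out;
}
```

Because the output can only shrink, the conversion is safe to do in the same buffer; the reverse direction (LF to CRLF on checkout) would need a larger destination.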
|
|
|
|
|
|
|
/*
|
|
|
|
* Convert blobs to git internal format
|
|
|
|
*/
|
2008-08-03 08:39:16 +04:00
|
|
|
if ((type == OBJ_BLOB) && path) {
|
2008-10-09 23:12:12 +04:00
|
|
|
struct strbuf nbuf = STRBUF_INIT;
|
2018-09-21 18:57:31 +03:00
|
|
|
if (convert_to_git(istate, path, buf, size, &nbuf,
|
2018-01-14 01:49:31 +03:00
|
|
|
get_conv_flags(flags))) {
|
2007-09-27 14:58:23 +04:00
|
|
|
buf = strbuf_detach(&nbuf, &size);
|
2007-02-13 22:07:23 +03:00
|
|
|
re_allocated = 1;
|
|
|
|
}
|
|
|
|
}
|
2011-05-08 12:47:33 +04:00
|
|
|
if (flags & HASH_FORMAT_CHECK) {
|
2023-01-18 23:44:12 +03:00
|
|
|
struct fsck_options opts = FSCK_OPTIONS_DEFAULT;
|
|
|
|
|
|
|
|
opts.strict = 1;
|
|
|
|
opts.error_func = hash_format_check_report;
|
|
|
|
if (fsck_buffer(null_oid(), type, buf, size, &opts))
|
|
|
|
die(_("refusing to create malformed object"));
|
|
|
|
fsck_finish(&opts);
|
2011-02-05 13:52:21 +03:00
|
|
|
}
|
Lazy man's auto-CRLF
It currently does NOT know about file attributes, so it does its
conversion purely based on content. Maybe that is more in the "git
philosophy" anyway, since content is king, but I think we should try to do
the file attributes to turn it off on demand.
Anyway, BY DEFAULT it is off regardless, because it requires a
[core]
AutoCRLF = true
in your config file to be enabled. We could make that the default for
Windows, of course, the same way we do some other things (filemode etc).
But you can actually enable it on UNIX, and it will cause:
- "git update-index" will write blobs without CRLF
- "git diff" will diff working tree files without CRLF
- "git checkout" will write files to the working tree _with_ CRLF
and things work fine.
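The three bullets above come down to one transformation on the way into the index: CRLF pairs become LF. A minimal sketch of that step follows, assuming a caller-supplied output buffer; `crlf_to_lf()` is a hypothetical name and this is not the code in convert.c, which also applies the is_binary() heuristics before converting.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not git's convert.c): copy src into dst while
 * turning every "\r\n" pair into "\n". A lone '\r' is kept as-is.
 * dst must have room for len bytes; returns the number of bytes written.
 */
static size_t crlf_to_lf(const char *src, size_t len, char *dst)
{
	size_t i, out = 0;

	assert(src && dst);
	for (i = 0; i < len; i++) {
		if (src[i] == '\r' && i + 1 < len && src[i + 1] == '\n')
			continue; /* drop the CR; the LF is copied on the next pass */
		dst[out++] = src[i];
	}
	return out;
}
```

Calling `crlf_to_lf("a\r\nb\r\n", 6, out)` writes the four bytes `a\nb\n`. The checkout direction would be the inverse mapping.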
Funnily, it actually shows an odd file in git itself:
git clone -n git test-crlf
cd test-crlf
git config core.autocrlf true
git checkout
git diff
shows a diff for "Documentation/docbook-xsl.css". Why? Because we have
actually checked in that file *with* CRLF! So when "core.autocrlf" is
true, we'll always generate a *different* hash for it in the index,
because the index hash will be for the content _without_ CRLF.
Is this complete? I dunno. It seems to work for me. It doesn't use the
filename at all right now, and that's probably a deficiency (we could
certainly make the "is_binary()" heuristics also take standard filename
heuristics into account).
I don't pass in the filename at all for the "index_fd()" case
(git-update-index), so that would need to be passed around, but this
actually works fine.
NOTE NOTE NOTE! The "is_binary()" heuristics are totally made-up by yours
truly. I will not guarantee that they work at all reasonably. Caveat
emptor. But it _is_ simple, and it _is_ safe, since it's all off by
default.
The patch is pretty simple - the biggest part is the new "convert.c" file,
but even that is really just basic stuff that anybody can write in
"Teaching C 101" as a final project for their first class in programming.
Not to say that it's bug-free, of course - but at least we're not talking
about rocket surgery here.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-02-13 22:07:23 +03:00
|
|
|
|
2005-07-09 03:51:55 +04:00
|
|
|
if (write_object)
|
2022-02-05 02:48:26 +03:00
|
|
|
ret = write_object_file(buf, size, type, oid);
|
2006-10-14 14:45:36 +04:00
|
|
|
else
|
2022-02-05 02:48:32 +03:00
|
|
|
hash_object_file(the_hash_algo, buf, size, type, oid);
|
2008-08-03 08:39:16 +04:00
|
|
|
if (re_allocated)
|
2007-02-13 22:07:23 +03:00
|
|
|
free(buf);
|
2008-08-03 08:39:16 +04:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-09-21 18:57:31 +03:00
|
|
|
static int index_stream_convert_blob(struct index_state *istate,
|
|
|
|
struct object_id *oid,
|
|
|
|
int fd,
|
|
|
|
const char *path,
|
|
|
|
unsigned flags)
|
2014-08-26 19:23:25 +04:00
|
|
|
{
|
2022-02-05 02:48:24 +03:00
|
|
|
int ret = 0;
|
2014-08-26 19:23:25 +04:00
|
|
|
const int write_object = flags & HASH_WRITE_OBJECT;
|
|
|
|
struct strbuf sbuf = STRBUF_INIT;
|
|
|
|
|
|
|
|
assert(path);
|
2018-09-21 18:57:31 +03:00
|
|
|
assert(would_convert_to_git_filter_fd(istate, path));
|
2014-08-26 19:23:25 +04:00
|
|
|
|
2018-09-21 18:57:31 +03:00
|
|
|
convert_to_git_filter_fd(istate, path, fd, &sbuf,
|
2018-01-14 01:49:31 +03:00
|
|
|
get_conv_flags(flags));
|
2014-08-26 19:23:25 +04:00
|
|
|
|
|
|
|
if (write_object)
|
2022-02-05 02:48:26 +03:00
|
|
|
ret = write_object_file(sbuf.buf, sbuf.len, OBJ_BLOB,
|
2018-01-28 03:13:19 +03:00
|
|
|
oid);
|
2014-08-26 19:23:25 +04:00
|
|
|
else
|
2022-02-05 02:48:32 +03:00
|
|
|
hash_object_file(the_hash_algo, sbuf.buf, sbuf.len, OBJ_BLOB,
|
|
|
|
oid);
|
2014-08-26 19:23:25 +04:00
|
|
|
strbuf_release(&sbuf);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-09-21 18:57:31 +03:00
|
|
|
static int index_pipe(struct index_state *istate, struct object_id *oid,
|
|
|
|
int fd, enum object_type type,
|
2011-05-08 12:47:34 +04:00
|
|
|
const char *path, unsigned flags)
|
|
|
|
{
|
|
|
|
struct strbuf sbuf = STRBUF_INIT;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (strbuf_read(&sbuf, fd, 4096) >= 0)
|
2018-09-21 18:57:31 +03:00
|
|
|
ret = index_mem(istate, oid, sbuf.buf, sbuf.len, type, path, flags);
|
2011-05-08 12:47:34 +04:00
|
|
|
else
|
|
|
|
ret = -1;
|
|
|
|
strbuf_release(&sbuf);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2010-02-21 09:32:19 +03:00
|
|
|
#define SMALL_FILE_SIZE (32*1024)
|
|
|
|
|
2018-09-21 18:57:31 +03:00
|
|
|
static int index_core(struct index_state *istate,
|
|
|
|
struct object_id *oid, int fd, size_t size,
|
2011-05-08 12:47:34 +04:00
|
|
|
enum object_type type, const char *path,
|
|
|
|
unsigned flags)
|
2008-08-03 08:39:16 +04:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2011-05-08 12:47:34 +04:00
|
|
|
if (!size) {
|
2018-09-21 18:57:31 +03:00
|
|
|
ret = index_mem(istate, oid, "", size, type, path, flags);
|
2010-02-21 09:32:19 +03:00
|
|
|
} else if (size <= SMALL_FILE_SIZE) {
|
|
|
|
char *buf = xmalloc(size);
|
2017-09-27 09:01:07 +03:00
|
|
|
ssize_t read_result = read_in_full(fd, buf, size);
|
|
|
|
if (read_result < 0)
|
2018-07-21 10:49:39 +03:00
|
|
|
ret = error_errno(_("read error while indexing %s"),
|
2017-09-27 09:01:07 +03:00
|
|
|
path ? path : "<unknown>");
|
|
|
|
else if (read_result != size)
|
2018-07-21 10:49:39 +03:00
|
|
|
ret = error(_("short read while indexing %s"),
|
2017-09-27 09:01:07 +03:00
|
|
|
path ? path : "<unknown>");
|
2010-02-21 09:32:19 +03:00
|
|
|
else
|
2018-09-21 18:57:31 +03:00
|
|
|
ret = index_mem(istate, oid, buf, size, type, path, flags);
|
2010-02-21 09:32:19 +03:00
|
|
|
free(buf);
|
2010-05-11 01:38:17 +04:00
|
|
|
} else {
|
2008-08-03 08:39:16 +04:00
|
|
|
void *buf = xmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
|
2018-09-21 18:57:31 +03:00
|
|
|
ret = index_mem(istate, oid, buf, size, type, path, flags);
|
2005-05-03 22:46:16 +04:00
|
|
|
munmap(buf, size);
|
2010-05-11 01:38:17 +04:00
|
|
|
}
|
2011-05-08 12:47:34 +04:00
|
|
|
return ret;
|
|
|
|
}
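The small-file branch of index_core() distinguishes a read error from a short read, which only makes sense if the read helper retries until the buffer is full or EOF is reached. A minimal sketch of such a helper, assuming plain POSIX read(); `read_fully()` is a hypothetical stand-in and may differ from git's read_in_full() in detail:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a "read in full" helper: keep calling read()
 * until count bytes have arrived, EOF is hit, or a real error occurs.
 * Returns the number of bytes read (short only at EOF) or -1 on error.
 */
static ssize_t read_fully(int fd, void *buf, size_t count)
{
	char *p = buf;
	size_t total = 0;

	assert(buf || !count);
	while (total < count) {
		ssize_t n = read(fd, p + total, count - total);

		if (n < 0) {
			if (errno == EINTR)
				continue; /* interrupted by a signal; retry */
			return -1;
		}
		if (!n)
			break; /* EOF before count bytes: caller sees a short read */
		total += (size_t)n;
	}
	return (ssize_t)total;
}
```

A caller can then compare the return value with the size it expected from stat(): a negative value is an I/O error, and a smaller value means the file shrank while it was being read.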
|
|
|
|
|
2011-05-08 12:47:35 +04:00
|
|
|
/*
|
2011-10-29 01:48:40 +04:00
|
|
|
* This creates one packfile per large blob unless bulk-checkin
|
|
|
|
* machinery is "plugged".
|
2011-05-08 12:47:35 +04:00
|
|
|
*
|
|
|
|
* This also bypasses the usual "convert-to-git" dance, and that is on
|
|
|
|
* purpose. We could write a streaming version of the converting
|
|
|
|
* functions and insert that before feeding the data to fast-import
|
do not stream large files to pack when filters are in use
Because git's object format requires us to specify the
number of bytes in the object in its header, we must know
the size before streaming a blob into the object database.
This is not a problem when adding a regular file, as we can
get the size from stat(). However, when filters are in use
(such as autocrlf, or the ident, filter, or eol
gitattributes), we have no idea what the ultimate size will
be.
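To make the size constraint concrete: the object header is the type name, a space, the decimal payload size, and a NUL byte, and it is hashed before the payload. A minimal sketch, assuming that layout; `object_header()` is a hypothetical name for illustration, not a function in git:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format "<type> <size>" followed by a NUL byte into buf.
 * Returns the header length including the NUL, or -1 if buf is too small.
 * object_header() is a hypothetical name used only in this sketch.
 */
static int object_header(char *buf, size_t buflen,
			 const char *type, unsigned long size)
{
	int n;

	assert(buf && type);
	n = snprintf(buf, buflen, "%s %lu", type, size);
	if (n < 0 || (size_t)n >= buflen)
		return -1;
	return n + 1; /* the terminating NUL is part of the header */
}
```

Because the size has to be written first, a filter whose output length is unknown cannot be streamed straight into the object without buffering the result somewhere.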
The current code just punts on the whole issue and ignores
filter configuration entirely for files larger than
core.bigfilethreshold. This can generate confusing results
if you use filters for large binary files, as the filter
will suddenly stop working as the file goes over a certain
size. Rather than try to handle unknown input sizes with
streaming, this patch just turns off the streaming
optimization when filters are in use.
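The resulting choice of code path can be sketched as a pure function. This is only an illustration of the decision described here and visible in index_fd(); the enum, the struct, and `pick_route()` are hypothetical names, and the flags stand in for the real would_convert_to_git*() checks.

```c
#include <assert.h>

enum index_route { ROUTE_FILTER_FD, ROUTE_PIPE, ROUTE_CORE, ROUTE_STREAM };

/* Hypothetical summary of the inputs the dispatch looks at. */
struct index_input {
	int is_blob;
	int has_path;
	int is_regular_file;
	int filter_reads_fd;   /* a filter can consume the fd directly */
	int needs_conversion;  /* some convert-to-git step applies */
	unsigned long long size;
};

static enum index_route pick_route(const struct index_input *in,
				   unsigned long long big_file_threshold)
{
	assert(in);
	if (in->is_blob && in->has_path && in->filter_reads_fd)
		return ROUTE_FILTER_FD;
	if (!in->is_regular_file)
		return ROUTE_PIPE;
	/* Streaming is only for large blobs that need no conversion. */
	if (in->size <= big_file_threshold || !in->is_blob ||
	    (in->has_path && in->needs_conversion))
		return ROUTE_CORE;
	return ROUTE_STREAM;
}
```

With this shape, a large blob that has a filter configured falls back to the in-core path instead of silently skipping the filter.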
This has a slight performance regression in a very specific
case: if you have autocrlf on, but no gitattributes, a large
binary file will avoid the streaming code path because we
don't know beforehand whether it will need conversion or
not. But if you are handling large binary files, you should
be marking them as such via attributes (or at least not
using autocrlf, and instead marking your text files as
such). And the flip side is that if you have a large
_non_-binary file, there is a correctness improvement;
before we did not apply the conversion at all.
The first half of the new t1051 script covers these failures
on input. The second half tests the matching output code
paths. These already work correctly, and do not need any
adjustment.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-02-25 02:10:17 +04:00
|
|
|
* (or equivalent in-core API described above). However, that is
|
|
|
|
* somewhat complicated, as we do not know the size of the filter
|
|
|
|
* result, which we need to know beforehand when writing a git object.
|
|
|
|
* Since the primary motivation for trying to stream from the working
|
|
|
|
* tree file and to avoid mmaping it in core is to deal with large
|
|
|
|
* binary blobs, they generally do not want to get any conversion, and
|
|
|
|
* callers should avoid this code path when filters are requested.
|
2011-05-08 12:47:35 +04:00
|
|
|
*/
|
2017-08-20 23:09:31 +03:00
|
|
|
static int index_stream(struct object_id *oid, int fd, size_t size,
|
2011-05-08 12:47:35 +04:00
|
|
|
enum object_type type, const char *path,
|
|
|
|
unsigned flags)
|
|
|
|
{
|
2018-03-12 05:27:21 +03:00
|
|
|
return index_bulk_checkin(oid, fd, size, type, path, flags);
|
2011-05-08 12:47:35 +04:00
|
|
|
}
|
|
|
|
|
2018-09-21 18:57:31 +03:00
|
|
|
int index_fd(struct index_state *istate, struct object_id *oid,
|
|
|
|
int fd, struct stat *st,
|
2011-05-08 12:47:34 +04:00
|
|
|
enum object_type type, const char *path, unsigned flags)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2014-09-21 14:03:26 +04:00
|
|
|
/*
|
|
|
|
* Call xsize_t() only when needed to avoid potentially unnecessary
|
|
|
|
* die() for large files.
|
|
|
|
*/
|
2018-09-21 18:57:31 +03:00
|
|
|
if (type == OBJ_BLOB && path && would_convert_to_git_filter_fd(istate, path))
|
|
|
|
ret = index_stream_convert_blob(istate, oid, fd, path, flags);
|
2014-08-26 19:23:25 +04:00
|
|
|
else if (!S_ISREG(st->st_mode))
|
2018-09-21 18:57:31 +03:00
|
|
|
ret = index_pipe(istate, oid, fd, type, path, flags);
|
2014-09-21 14:03:26 +04:00
|
|
|
else if (st->st_size <= big_file_threshold || type != OBJ_BLOB ||
|
2018-09-21 18:57:31 +03:00
|
|
|
(path && would_convert_to_git(istate, path)))
|
|
|
|
ret = index_core(istate, oid, fd, xsize_t(st->st_size),
|
|
|
|
type, path, flags);
|
2011-05-08 12:47:35 +04:00
|
|
|
else
|
2017-08-20 23:09:31 +03:00
|
|
|
ret = index_stream(oid, fd, xsize_t(st->st_size), type, path,
|
2014-09-21 14:03:26 +04:00
|
|
|
flags);
|
2008-08-03 08:39:16 +04:00
|
|
|
close(fd);
|
2005-05-03 22:46:16 +04:00
|
|
|
return ret;
|
2005-05-02 10:45:49 +04:00
|
|
|
}
|
2005-10-07 14:42:00 +04:00
|
|
|
|
2018-09-21 18:57:31 +03:00
|
|
|
int index_path(struct index_state *istate, struct object_id *oid,
|
|
|
|
const char *path, struct stat *st, unsigned flags)
|
2005-10-07 14:42:00 +04:00
|
|
|
{
|
|
|
|
int fd;
|
2008-12-17 20:51:53 +03:00
|
|
|
struct strbuf sb = STRBUF_INIT;
|
2017-08-30 21:00:29 +03:00
|
|
|
int rc = 0;
|
2005-10-07 14:42:00 +04:00
|
|
|
|
|
|
|
switch (st->st_mode & S_IFMT) {
|
|
|
|
case S_IFREG:
|
|
|
|
fd = open(path, O_RDONLY);
|
|
|
|
if (fd < 0)
|
2016-05-08 12:47:56 +03:00
|
|
|
return error_errno("open(\"%s\")", path);
|
2018-09-21 18:57:31 +03:00
|
|
|
if (index_fd(istate, oid, fd, st, OBJ_BLOB, path, flags) < 0)
|
2018-07-21 10:49:39 +03:00
|
|
|
return error(_("%s: failed to insert into database"),
|
2005-10-07 14:42:00 +04:00
|
|
|
path);
|
|
|
|
break;
|
|
|
|
case S_IFLNK:
|
2016-05-08 12:47:56 +03:00
|
|
|
if (strbuf_readlink(&sb, path, st->st_size))
|
|
|
|
return error_errno("readlink(\"%s\")", path);
|
2011-05-08 12:47:33 +04:00
|
|
|
if (!(flags & HASH_WRITE_OBJECT))
|
2020-01-30 23:32:22 +03:00
|
|
|
hash_object_file(the_hash_algo, sb.buf, sb.len,
|
2022-02-05 02:48:32 +03:00
|
|
|
OBJ_BLOB, oid);
|
2022-02-05 02:48:26 +03:00
|
|
|
else if (write_object_file(sb.buf, sb.len, OBJ_BLOB, oid))
|
2018-07-21 10:49:39 +03:00
|
|
|
rc = error(_("%s: failed to insert into database"), path);
|
2008-12-17 20:51:53 +03:00
|
|
|
strbuf_release(&sb);
|
2005-10-07 14:42:00 +04:00
|
|
|
break;
|
2007-04-10 08:20:29 +04:00
|
|
|
case S_IFDIR:
|
refs: convert resolve_gitlink_ref to struct object_id
Convert the declaration and definition of resolve_gitlink_ref to use
struct object_id and apply the following semantic patch:
@@
expression E1, E2, E3;
@@
- resolve_gitlink_ref(E1, E2, E3.hash)
+ resolve_gitlink_ref(E1, E2, &E3)
@@
expression E1, E2, E3;
@@
- resolve_gitlink_ref(E1, E2, E3->hash)
+ resolve_gitlink_ref(E1, E2, E3)
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-10-16 01:07:07 +03:00
|
|
|
return resolve_gitlink_ref(path, "HEAD", oid);
|
2005-10-07 14:42:00 +04:00
|
|
|
default:
|
2018-07-21 10:49:39 +03:00
|
|
|
return error(_("%s: unsupported file type"), path);
|
2005-10-07 14:42:00 +04:00
|
|
|
}
|
2017-08-30 21:00:29 +03:00
|
|
|
return rc;
|
2005-10-07 14:42:00 +04:00
|
|
|
}
|
2007-01-23 08:55:18 +03:00
|
|
|
|
|
|
|
int read_pack_header(int fd, struct pack_header *header)
|
|
|
|
{
|
2017-09-13 21:47:22 +03:00
|
|
|
if (read_in_full(fd, header, sizeof(*header)) != sizeof(*header))
|
2008-05-03 17:27:26 +04:00
|
|
|
/* "eof before pack header was fully read" */
|
|
|
|
return PH_ERROR_EOF;
|
|
|
|
|
2007-01-23 08:55:18 +03:00
|
|
|
if (header->hdr_signature != htonl(PACK_SIGNATURE))
|
|
|
|
/* "protocol error (pack signature mismatch detected)" */
|
|
|
|
return PH_ERROR_PACK_SIGNATURE;
|
|
|
|
if (!pack_version_ok(header->hdr_version))
|
|
|
|
/* "protocol error (pack version unsupported)" */
|
|
|
|
return PH_ERROR_PROTOCOL;
|
|
|
|
return 0;
|
|
|
|
}
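The same three checks can be sketched against an in-memory buffer. This assumes the usual 12-byte pack header (signature "PACK", version, entry count, each a big-endian 32-bit field) and that versions 2 and 3 are the accepted ones; the names below are hypothetical and avoid htonl() so the sketch does not depend on host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PACK_SIGNATURE 0x5041434bu /* the bytes "PACK" */

enum { HDR_OK = 0, HDR_SHORT = -1, HDR_BAD_SIGNATURE = -2, HDR_BAD_VERSION = -3 };

/* Decode a big-endian 32-bit value byte by byte. */
static uint32_t get_be32_sketch(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Validate the header in buf; on success store the entry count. */
static int check_pack_header(const unsigned char *buf, size_t len,
			     uint32_t *nr_entries)
{
	uint32_t version;

	assert(buf && nr_entries);
	if (len < 12)
		return HDR_SHORT;
	if (get_be32_sketch(buf) != SKETCH_PACK_SIGNATURE)
		return HDR_BAD_SIGNATURE;
	version = get_be32_sketch(buf + 4);
	if (version != 2 && version != 3)
		return HDR_BAD_VERSION;
	*nr_entries = get_be32_sketch(buf + 8);
	return HDR_OK;
}
```

Returning distinct codes instead of printing, as read_pack_header() does with its PH_ERROR_* values, lets each caller pick its own error message.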
|
make commit_tree a library function
Until now, this has been part of the commit-tree builtin.
However, it is already used by other builtins (like commit,
merge, and notes), and it would be useful to access it from
library code.
The check_valid helper has to come along, too, but is given
a more library-ish name of "assert_sha1_type".
Otherwise, the code is unchanged. There are still a few
rough edges for a library function, like printing the utf8
warning to stderr, but we can address those if and when they
come up as inappropriate.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-04-02 04:05:23 +04:00
|
|
|
|
2018-03-12 05:27:42 +03:00
|
|
|
void assert_oid_type(const struct object_id *oid, enum object_type expect)
|
2010-04-02 04:05:23 +04:00
|
|
|
{
|
2018-04-25 21:20:59 +03:00
|
|
|
enum object_type type = oid_object_info(the_repository, oid, NULL);
|
2010-04-02 04:05:23 +04:00
|
|
|
if (type < 0)
|
2018-07-21 10:49:39 +03:00
|
|
|
die(_("%s is not a valid object"), oid_to_hex(oid));
|
2010-04-02 04:05:23 +04:00
|
|
|
if (type != expect)
|
2018-07-21 10:49:39 +03:00
|
|
|
die(_("%s is not a valid '%s' object"), oid_to_hex(oid),
|
2018-02-14 21:59:24 +03:00
|
|
|
type_name(expect));
|
2010-04-02 04:05:23 +04:00
|
|
|
}
|
2014-10-16 02:38:55 +04:00
|
|
|
|
2017-06-24 17:09:39 +03:00
|
|
|
int for_each_file_in_obj_subdir(unsigned int subdir_nr,
|
2017-06-22 21:19:48 +03:00
|
|
|
struct strbuf *path,
|
|
|
|
each_loose_object_fn obj_cb,
|
|
|
|
each_loose_cruft_fn cruft_cb,
|
|
|
|
each_loose_subdir_fn subdir_cb,
|
|
|
|
void *data)
|
2014-10-16 02:38:55 +04:00
|
|
|
{
|
2017-06-24 15:12:30 +03:00
|
|
|
size_t origlen, baselen;
|
|
|
|
DIR *dir;
|
2014-10-16 02:38:55 +04:00
|
|
|
struct dirent *de;
|
|
|
|
int r = 0;
|
2017-10-31 16:50:06 +03:00
|
|
|
struct object_id oid;
|
2014-10-16 02:38:55 +04:00
|
|
|
|
2017-06-24 17:09:39 +03:00
|
|
|
if (subdir_nr > 0xff)
|
|
|
|
BUG("invalid loose object subdirectory: %x", subdir_nr);
|
|
|
|
|
2017-06-24 15:12:30 +03:00
|
|
|
origlen = path->len;
|
|
|
|
strbuf_complete(path, '/');
|
|
|
|
strbuf_addf(path, "%02x", subdir_nr);
|
|
|
|
|
|
|
|
dir = opendir(path->buf);
|
2014-10-16 02:38:55 +04:00
|
|
|
if (!dir) {
|
2017-06-24 15:12:30 +03:00
|
|
|
if (errno != ENOENT)
|
2018-07-21 10:49:39 +03:00
|
|
|
r = error_errno(_("unable to open %s"), path->buf);
|
2017-06-24 15:12:30 +03:00
|
|
|
strbuf_setlen(path, origlen);
|
|
|
|
return r;
|
2014-10-16 02:38:55 +04:00
|
|
|
}
|
|
|
|
|
2017-10-31 16:50:06 +03:00
|
|
|
oid.hash[0] = subdir_nr;
|
2017-12-04 17:06:03 +03:00
|
|
|
strbuf_addch(path, '/');
|
|
|
|
baselen = path->len;
|
2017-10-31 16:50:06 +03:00
|
|
|
|
2021-05-12 20:28:22 +03:00
|
|
|
while ((de = readdir_skip_dot_and_dotdot(dir))) {
|
2017-12-04 17:06:03 +03:00
|
|
|
size_t namelen;
|
2014-10-16 02:38:55 +04:00
|
|
|
|
2017-12-04 17:06:03 +03:00
|
|
|
namelen = strlen(de->d_name);
|
2014-10-16 02:38:55 +04:00
|
|
|
strbuf_setlen(path, baselen);
|
2017-12-04 17:06:03 +03:00
|
|
|
strbuf_add(path, de->d_name, namelen);
|
2018-07-16 04:28:07 +03:00
|
|
|
if (namelen == the_hash_algo->hexsz - 2 &&
|
2017-10-31 16:50:06 +03:00
|
|
|
!hex_to_bytes(oid.hash + 1, de->d_name,
|
2018-07-16 04:28:07 +03:00
|
|
|
the_hash_algo->rawsz - 1)) {
|
2021-04-26 04:02:55 +03:00
|
|
|
oid_set_algo(&oid, the_hash_algo);
|
2017-10-31 16:50:06 +03:00
|
|
|
if (obj_cb) {
|
|
|
|
r = obj_cb(&oid, path->buf, data);
|
|
|
|
if (r)
|
|
|
|
break;
|
2014-10-16 02:38:55 +04:00
|
|
|
}
|
2017-10-31 16:50:06 +03:00
|
|
|
continue;
|
2014-10-16 02:38:55 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
if (cruft_cb) {
|
|
|
|
r = cruft_cb(de->d_name, path->buf, data);
|
|
|
|
if (r)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2015-08-12 20:43:01 +03:00
|
|
|
closedir(dir);
|
2014-10-16 02:38:55 +04:00
|
|
|
|
2017-12-04 17:06:03 +03:00
|
|
|
strbuf_setlen(path, baselen - 1);
|
2014-10-16 02:38:55 +04:00
|
|
|
if (!r && subdir_cb)
|
|
|
|
r = subdir_cb(subdir_nr, path->buf, data);
|
|
|
|
|
2017-06-24 15:12:30 +03:00
|
|
|
strbuf_setlen(path, origlen);
|
|
|
|
|
2014-10-16 02:38:55 +04:00
|
|
|
return r;
|
|
|
|
}
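The object/cruft split in the loop above depends on rebuilding an object ID from the two-digit subdirectory number plus the file name. A minimal sketch of that mapping, assuming SHA-1 sized names (38 hex digits after the subdirectory) and accepting either letter case; `loose_name_to_hash()` and `hexval_sketch()` are hypothetical names, not git functions:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hexval_sketch(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Fill hash[] from a loose object location: the first byte is the
 * subdirectory number, the other 19 come from the 38-digit file name.
 * Returns 0 for an object name, -1 for anything else ("cruft").
 */
static int loose_name_to_hash(unsigned int subdir_nr, const char *name,
			      unsigned char hash[20])
{
	size_t i;

	assert(name && hash);
	if (subdir_nr > 0xff || strlen(name) != 38)
		return -1;
	hash[0] = (unsigned char)subdir_nr;
	for (i = 0; i < 19; i++) {
		int hi = hexval_sketch(name[2 * i]);
		int lo = hexval_sketch(name[2 * i + 1]);

		if (hi < 0 || lo < 0)
			return -1;
		hash[i + 1] = (unsigned char)((hi << 4) | lo);
	}
	return 0;
}
```

A name such as `tmp_obj_abc` fails the length check and would be handed to the cruft callback rather than the object callback.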
|
|
|
|
|
2015-02-09 04:13:22 +03:00
|
|
|
int for_each_loose_file_in_objdir_buf(struct strbuf *path,
|
2014-10-16 02:38:55 +04:00
|
|
|
each_loose_object_fn obj_cb,
|
|
|
|
each_loose_cruft_fn cruft_cb,
|
|
|
|
each_loose_subdir_fn subdir_cb,
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
int r = 0;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < 256; i++) {
|
2015-02-09 04:13:22 +03:00
|
|
|
r = for_each_file_in_obj_subdir(i, path, obj_cb, cruft_cb,
|
2014-10-16 02:38:55 +04:00
|
|
|
subdir_cb, data);
|
|
|
|
if (r)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2015-02-09 04:13:22 +03:00
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
int for_each_loose_file_in_objdir(const char *path,
|
|
|
|
each_loose_object_fn obj_cb,
|
|
|
|
each_loose_cruft_fn cruft_cb,
|
|
|
|
each_loose_subdir_fn subdir_cb,
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
struct strbuf buf = STRBUF_INIT;
|
|
|
|
int r;
|
|
|
|
|
|
|
|
strbuf_addstr(&buf, path);
|
|
|
|
r = for_each_loose_file_in_objdir_buf(&buf, obj_cb, cruft_cb,
|
|
|
|
subdir_cb, data);
|
2014-10-16 02:38:55 +04:00
|
|
|
strbuf_release(&buf);
|
2015-02-09 04:13:22 +03:00
|
|
|
|
2014-10-16 02:38:55 +04:00
|
|
|
return r;
|
|
|
|
}
|
2014-10-16 02:41:21 +04:00
|
|
|
|
2018-08-11 02:09:44 +03:00
|
|
|
int for_each_loose_object(each_loose_object_fn cb, void *data,
|
|
|
|
enum for_each_object_flags flags)
|
2014-10-16 02:41:21 +04:00
|
|
|
{
|
sha1-file: use an object_directory for the main object dir
Our handling of alternate object directories is needlessly different
from the main object directory. As a result, many places in the code
basically look like this:
do_something(r->objects->objdir);
for (odb = r->objects->alt_odb_list; odb; odb = odb->next)
do_something(odb->path);
That gets annoying when do_something() is non-trivial, and we've
resorted to gross hacks like creating fake alternates (see
find_short_object_filename()).
Instead, let's give each raw_object_store a unified list of
object_directory structs. The first will be the main store, and
everything after is an alternate. Very few callers even care about the
distinction, and can just loop over the whole list (and those who care
can just treat the first element differently).
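The unified-list idea can be sketched with a toy linked list. The struct and function names here are hypothetical stand-ins, not git's object_directory API; the only assumption taken from the text is that the first element is the main store and the rest are alternates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an object directory list entry. */
struct odb_dir {
	const char *path;
	struct odb_dir *next;
};

typedef int (*odb_dir_fn)(const char *path, void *data);

/*
 * Visit every directory in the list, main store first. With local_only
 * set, stop after the first element. A non-zero callback result aborts
 * the walk and is returned to the caller.
 */
static int for_each_odb_dir(struct odb_dir *list, int local_only,
			    odb_dir_fn fn, void *data)
{
	struct odb_dir *d;

	assert(fn);
	for (d = list; d; d = d->next) {
		int r = fn(d->path, data);

		if (r)
			return r;
		if (local_only)
			break;
	}
	return 0;
}

/* Example callback: count how many directories were visited. */
static int count_dir(const char *path, void *data)
{
	(void)path;
	++*(int *)data;
	return 0;
}
```

Callers that care about the main/alternate distinction only need the `local_only` switch; everyone else walks one list with one loop.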
A few observations:
- we don't need r->objects->objectdir anymore, and can just
mechanically convert that to r->objects->odb->path
- object_directory's path field needs to become a real pointer rather
than a FLEX_ARRAY, in order to fill it with expand_base_dir()
- we'll call prepare_alt_odb() earlier in many functions (i.e.,
outside of the loop). This may result in us calling it even when our
function would be satisfied looking only at the main odb.
But this doesn't matter in practice. It's not a very expensive
operation in the first place, and in the majority of cases it will
be a noop. We call it already (and cache its results) in
prepare_packed_git(), and we'll generally check packs before loose
objects. So essentially every program is going to call it
immediately once per program.
Arguably we should just prepare_alt_odb() immediately upon setting
up the repository's object directory, which would save us sprinkling
calls throughout the code base (and forgetting to do so has been a
source of subtle bugs in the past). But I've stopped short of that
here, since there are already a lot of other moving parts in this
patch.
- Most call sites just get shorter. The check_and_freshen() functions
are an exception, because they have entry points to handle local and
nonlocal directories separately.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12 17:50:39 +03:00
|
|
|
struct object_directory *odb;
|
2015-02-09 04:15:39 +03:00
|
|
|
|
2018-11-12 17:50:39 +03:00
|
|
|
prepare_alt_odb(the_repository);
|
|
|
|
for (odb = the_repository->objects->odb; odb; odb = odb->next) {
|
|
|
|
int r = for_each_loose_file_in_objdir(odb->path, cb, NULL,
|
|
|
|
NULL, data);
|
|
|
|
if (r)
|
|
|
|
return r;
|
2014-10-16 02:41:21 +04:00
|
|
|
|
2018-11-12 17:50:39 +03:00
|
|
|
if (flags & FOR_EACH_OBJECT_LOCAL_ONLY)
|
|
|
|
break;
|
|
|
|
}
|
reachable: only mark local objects as recent
When pruning and repacking a repository that has an
alternate object store configured, we may traverse a large
number of objects in the alternate. This serves no purpose,
and may be expensive to do. A longer explanation is below.
Commits d3038d2 and abcb865 taught prune and pack-objects
(respectively) to treat "recent" objects as tips for
reachability, so that we keep whole chunks of history. They
built on the object traversal in 660c889 (sha1_file: add
for_each iterators for loose and packed objects,
2014-10-15), which covers both local and alternate objects.
In both cases, covering alternate objects is unnecessary, as
both commands can only drop objects from the local
repository. In the case of prune, we traverse only the local
object directory. And in the case of repacking, while we may
or may not include local objects in our pack, we will never
reach into the alternate with "repack -d". The "-l" option
is only a question of whether we are migrating objects from
the alternate into our repository, or leaving them
untouched.
It is possible that we may drop an object that is depended
upon by another object in the alternate. For example,
imagine two repositories, A and B, with A pointing to B as
an alternate. Now imagine a commit that is in B which
references a tree that is only in A. Traversing from recent
objects in B might prevent A from dropping that tree. But
this case isn't worth covering. Repo B should take
responsibility for its own objects. It would never have had
the commit in the first place if it did not also have the
tree, and assuming it is using the same "keep recent chunks
of history" scheme, then it would itself keep the tree, as
well.
So checking the alternate objects is not worth doing, and
comes with a significant performance impact. In both cases,
we skip any recent objects that have already been marked
SEEN (i.e., that we know are already reachable for prune, or
included in the pack for a repack). So there is a slight
waste of time in opening the alternate packs at all, only to
notice that we have already considered each object. But much
worse, the alternate repository may have a large number of
objects that are not reachable from the local repository at
all, and we end up adding them to the traversal.
We can fix this by considering only local unseen objects.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
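The "only local unseen objects" rule from the message above can be sketched as a simple filter. This is a minimal illustration, not git's implementation: `struct candidate`, its `is_local` and `seen` fields, and `count_recent_tips()` are hypothetical names chosen for the example.

```c
#include <stddef.h>

/* Hypothetical stand-in for an object found while scanning object stores. */
struct candidate {
	int is_local; /* nonzero if it lives in the local object directory */
	int seen;     /* nonzero if already marked SEEN (reachable or packed) */
};

/*
 * Count the objects that would be added as "recent" traversal tips:
 * only local objects that have not been seen yet qualify.
 */
static size_t count_recent_tips(const struct candidate *objs, size_t nr)
{
	size_t i, tips = 0;

	for (i = 0; i < nr; i++) {
		if (!objs[i].is_local)
			continue; /* the alternate is responsible for its own objects */
		if (objs[i].seen)
			continue; /* already reachable, or already in the pack */
		tips++;
	}
	return tips;
}
```

Checking the cheap locality test first means objects in the alternate are skipped before any further work is done on them, which is where the message locates most of the cost.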
2015-03-27 14:32:41 +03:00
|
|
|
|
sha1-file: use an object_directory for the main object dir
Our handling of alternate object directories is needlessly different
from the main object directory. As a result, many places in the code
basically look like this:
	do_something(r->objects->objdir);
	for (odb = r->objects->alt_odb_list; odb; odb = odb->next)
		do_something(odb->path);
That gets annoying when do_something() is non-trivial, and we've
resorted to gross hacks like creating fake alternates (see
find_short_object_filename()).
Instead, let's give each raw_object_store a unified list of
object_directory structs. The first will be the main store, and
everything after is an alternate. Very few callers even care about the
distinction, and can just loop over the whole list (and those who care
can just treat the first element differently).
A few observations:
- we don't need r->objects->objectdir anymore, and can just
mechanically convert that to r->objects->odb->path
- object_directory's path field needs to become a real pointer rather
than a FLEX_ARRAY, in order to fill it with expand_base_dir()
- we'll call prepare_alt_odb() earlier in many functions (i.e.,
outside of the loop). This may result in us calling it even when our
function would be satisfied looking only at the main odb.
But this doesn't matter in practice. It's not a very expensive
operation in the first place, and in the majority of cases it will
be a noop. We call it already (and cache its results) in
prepare_packed_git(), and we'll generally check packs before loose
objects. So essentially every program is going to call it
immediately once per program.
Arguably we should just prepare_alt_odb() immediately upon setting
up the repository's object directory, which would save us sprinkling
calls throughout the code base (and forgetting to do so has been a
source of subtle bugs in the past). But I've stopped short of that
here, since there are already a lot of other moving parts in this
patch.
- Most call sites just get shorter. The check_and_freshen() functions
are an exception, because they have entry points to handle local and
nonlocal directories separately.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
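The unified list described in this message, with the main store first and alternates after it, might look like the sketch below. The struct is a simplified stand-in for git's `struct object_directory` (the real one has more fields), and `collect_odb_paths()` is a hypothetical helper written only for this example.

```c
#include <stddef.h>

/* Simplified stand-in for git's struct object_directory (assumption). */
struct object_directory {
	struct object_directory *next;
	const char *path;
};

/*
 * Walk the unified list and store each directory's path in out[].
 * out[0] is the main object directory; later entries are alternates.
 * Returns the number of paths stored (at most max).
 */
static size_t collect_odb_paths(const struct object_directory *odb,
				const char **out, size_t max)
{
	size_t n = 0;

	for (; odb && n < max; odb = odb->next)
		out[n++] = odb->path;
	return n;
}
```

A caller that cares about the local/alternate distinction only needs to treat index 0 differently; every other caller can loop over the whole list with a single code path.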
2018-11-12 17:50:39 +03:00
|
|
|
return 0;
|
2014-10-16 02:41:21 +04:00
|
|
|
}
|
|
|
|
|
2023-02-24 09:39:24 +03:00
|
|
|
static int append_loose_object(const struct object_id *oid,
|
|
|
|
const char *path UNUSED,
|
2018-11-12 17:50:56 +03:00
|
|
|
void *data)
|
2014-10-16 02:41:21 +04:00
|
|
|
{
|
2021-07-08 02:10:19 +03:00
|
|
|
oidtree_insert(data, oid);
|
2018-11-12 17:50:56 +03:00
|
|
|
return 0;
|
|
|
|
}
|
2014-10-16 02:41:21 +04:00
|
|
|
|
2021-07-08 02:10:19 +03:00
|
|
|
struct oidtree *odb_loose_cache(struct object_directory *odb,
|
2019-01-06 19:45:30 +03:00
|
|
|
const struct object_id *oid)
|
|
|
|
{
|
|
|
|
int subdir_nr = oid->hash[0];
|
2018-11-12 17:50:56 +03:00
|
|
|
struct strbuf buf = STRBUF_INIT;
|
2021-07-08 02:10:17 +03:00
|
|
|
size_t word_bits = bitsizeof(odb->loose_objects_subdir_seen[0]);
|
|
|
|
size_t word_index = subdir_nr / word_bits;
|
2021-12-01 03:29:02 +03:00
|
|
|
size_t mask = (size_t)1u << (subdir_nr % word_bits);
|
2021-07-08 02:10:17 +03:00
|
|
|
uint32_t *bitmap;
|
2014-10-16 02:41:21 +04:00
|
|
|
|
2018-11-12 17:50:56 +03:00
|
|
|
if (subdir_nr < 0 ||
|
2021-07-08 02:10:17 +03:00
|
|
|
subdir_nr >= bitsizeof(odb->loose_objects_subdir_seen))
|
2018-11-12 17:50:56 +03:00
|
|
|
BUG("subdir_nr out of range");
|
2015-03-27 14:32:41 +03:00
|
|
|
|
2021-07-08 02:10:17 +03:00
|
|
|
bitmap = &odb->loose_objects_subdir_seen[word_index];
|
|
|
|
if (*bitmap & mask)
|
2021-07-08 02:10:19 +03:00
|
|
|
return odb->loose_objects_cache;
|
|
|
|
if (!odb->loose_objects_cache) {
|
|
|
|
ALLOC_ARRAY(odb->loose_objects_cache, 1);
|
|
|
|
oidtree_init(odb->loose_objects_cache);
|
|
|
|
}
|
2018-11-12 17:50:56 +03:00
|
|
|
strbuf_addstr(&buf, odb->path);
|
|
|
|
for_each_file_in_obj_subdir(subdir_nr, &buf,
|
|
|
|
append_loose_object,
|
|
|
|
NULL, NULL,
|
2021-07-08 02:10:19 +03:00
|
|
|
odb->loose_objects_cache);
|
2021-07-08 02:10:17 +03:00
|
|
|
*bitmap |= mask;
|
2018-11-22 20:53:00 +03:00
|
|
|
strbuf_release(&buf);
|
2021-07-08 02:10:19 +03:00
|
|
|
return odb->loose_objects_cache;
|
2014-10-16 02:41:21 +04:00
|
|
|
}
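The "seen" bitmap used by odb_loose_cache() above tracks which of the 256 object subdirectories (one per possible first hash byte) have already been scanned. Below is a standalone sketch of that bookkeeping only. `struct subdir_seen` and `mark_subdir_seen()` are hypothetical names, and the sketch assumes 32-bit words, as with the `uint32_t *bitmap` in the function.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define NR_SUBDIRS 256 /* one per possible first hash byte */
#define WORD_BITS (sizeof(uint32_t) * CHAR_BIT)

struct subdir_seen {
	uint32_t words[NR_SUBDIRS / 32];
};

/*
 * Mark subdir_nr as scanned. Returns 1 if it had already been marked
 * (the cached result can be reused), 0 if this is the first visit.
 */
static int mark_subdir_seen(struct subdir_seen *s, unsigned char subdir_nr)
{
	size_t word_index = subdir_nr / WORD_BITS;
	uint32_t mask = (uint32_t)1 << (subdir_nr % WORD_BITS);
	int was_seen = (s->words[word_index] & mask) != 0;

	s->words[word_index] |= mask;
	return was_seen;
}
```

For example, subdirectory 0xe6 (230) maps to word 7, bit 6. Resetting the whole structure to zero, as odb_clear_loose_cache() does with memset(), forces every subdirectory to be rescanned on next use.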
|
|
|
|
|
2019-01-06 19:45:39 +03:00
|
|
|
void odb_clear_loose_cache(struct object_directory *odb)
|
|
|
|
{
|
2021-07-08 02:10:19 +03:00
|
|
|
oidtree_clear(odb->loose_objects_cache);
|
|
|
|
FREE_AND_NULL(odb->loose_objects_cache);
|
2019-01-06 19:45:39 +03:00
|
|
|
memset(&odb->loose_objects_subdir_seen, 0,
|
|
|
|
sizeof(odb->loose_objects_subdir_seen));
|
|
|
|
}
|
|
|
|
|
2019-01-07 11:37:02 +03:00
|
|
|
static int check_stream_oid(git_zstream *stream,
|
|
|
|
const char *hdr,
|
|
|
|
unsigned long size,
|
|
|
|
const char *path,
|
|
|
|
const struct object_id *expected_oid)
|
2017-01-13 20:58:16 +03:00
|
|
|
{
|
2018-02-01 05:18:41 +03:00
|
|
|
git_hash_ctx c;
|
2019-01-07 11:37:02 +03:00
|
|
|
struct object_id real_oid;
|
2017-01-13 20:58:16 +03:00
|
|
|
unsigned char buf[4096];
|
|
|
|
unsigned long total_read;
|
|
|
|
int status = Z_OK;
|
|
|
|
|
2018-02-01 05:18:41 +03:00
|
|
|
the_hash_algo->init_fn(&c);
|
|
|
|
the_hash_algo->update_fn(&c, hdr, stream->total_out);
|
2017-01-13 20:58:16 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We already read some bytes into hdr, but the ones up to the NUL
|
|
|
|
* do not count against the object's content size.
|
|
|
|
*/
|
|
|
|
total_read = stream->total_out - strlen(hdr) - 1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This size comparison must be "<=" to read the final zlib packets;
|
2019-01-07 11:37:02 +03:00
|
|
|
* see the comment in unpack_loose_rest for details.
|
2017-01-13 20:58:16 +03:00
|
|
|
*/
|
|
|
|
while (total_read <= size &&
|
2018-10-31 07:12:12 +03:00
|
|
|
(status == Z_OK ||
|
|
|
|
(status == Z_BUF_ERROR && !stream->avail_out))) {
|
2017-01-13 20:58:16 +03:00
|
|
|
stream->next_out = buf;
|
|
|
|
stream->avail_out = sizeof(buf);
|
|
|
|
if (size - total_read < stream->avail_out)
|
|
|
|
stream->avail_out = size - total_read;
|
|
|
|
status = git_inflate(stream, Z_FINISH);
|
2018-02-01 05:18:41 +03:00
|
|
|
the_hash_algo->update_fn(&c, buf, stream->next_out - buf);
|
2017-01-13 20:58:16 +03:00
|
|
|
total_read += stream->next_out - buf;
|
|
|
|
}
|
|
|
|
git_inflate_end(stream);
|
|
|
|
|
|
|
|
if (status != Z_STREAM_END) {
|
2019-01-07 11:37:02 +03:00
|
|
|
error(_("corrupt loose object '%s'"), oid_to_hex(expected_oid));
|
2017-01-13 20:58:16 +03:00
|
|
|
return -1;
|
|
|
|
}
|
2017-01-13 21:00:25 +03:00
|
|
|
if (stream->avail_in) {
|
2018-07-21 10:49:39 +03:00
|
|
|
error(_("garbage at end of loose object '%s'"),
|
2019-01-07 11:37:02 +03:00
|
|
|
oid_to_hex(expected_oid));
|
2017-01-13 21:00:25 +03:00
|
|
|
return -1;
|
|
|
|
}
|
2017-01-13 20:58:16 +03:00
|
|
|
|
2021-04-26 04:02:53 +03:00
|
|
|
the_hash_algo->final_oid_fn(&real_oid, &c);
|
2019-01-07 11:37:02 +03:00
|
|
|
if (!oideq(expected_oid, &real_oid)) {
|
2019-01-07 11:40:34 +03:00
|
|
|
error(_("hash mismatch for %s (expected %s)"), path,
|
2019-01-07 11:37:02 +03:00
|
|
|
oid_to_hex(expected_oid));
|
2017-01-13 20:58:16 +03:00
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
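The `total_read` initialization in check_stream_oid() above subtracts the loose object header and its NUL terminator from the bytes already inflated. A minimal sketch of that arithmetic follows. `content_bytes_read()` is a hypothetical helper, and it assumes the NUL terminator lies within the first `total_out` bytes of the buffer.

```c
#include <stddef.h>
#include <string.h>

/*
 * Given the buffer that received the first inflated bytes and the number
 * of bytes inflated so far, return how many of them are object content.
 * The "<type> <size>" header and its terminating NUL are not content.
 */
static size_t content_bytes_read(const char *hdr, size_t total_out)
{
	return total_out - strlen(hdr) - 1;
}
```

With a header of "blob 5" followed by NUL and three content bytes, ten bytes have been inflated but only three count against the object's size of five.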
|
|
|
|
|
|
|
|
int read_loose_object(const char *path,
|
2018-03-12 05:27:38 +03:00
|
|
|
const struct object_id *expected_oid,
|
fsck: report invalid object type-path combinations
Improve the error that's emitted in cases where we find a loose object
we parse, but which isn't at the location we expect it to be.
Before this change we'd prefix the error with a not-a-OID derived from
the path at which the object was found, due to an emergent behavior in
how we'd end up with an "OID" in these codepaths.
Now we'll instead say what object we hashed, and what path it was
found at. Before this patch series e.g.:
$ git hash-object --stdin -w -t blob </dev/null
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
$ mv objects/e6/ objects/e7
Would emit ("[...]" used to abbreviate the OIDs):
$ git fsck
error: hash mismatch for ./objects/e7/9d[...] (expected e79d[...])
error: e79d[...]: object corrupt or missing: ./objects/e7/9d[...]
Now we'll instead emit:
error: e69d[...]: hash-path mismatch, found at: ./objects/e7/9d[...]
Furthermore, we'll do the right thing when the object type and its
location are bad. I.e. this case:
$ git hash-object --stdin -w -t garbage --literally </dev/null
8315a83d2acc4c174aed59430f9a9c4ed926440f
$ mv objects/83 objects/84
As noted in earlier commits we'd simply die early in those cases,
until preceding commits fixed the hard die on invalid object type:
$ git fsck
fatal: invalid object type
Now we'll instead emit sensible error messages:
$ git fsck
error: 8315[...]: hash-path mismatch, found at: ./objects/84/15[...]
error: 8315[...]: object is of unknown type 'garbage': ./objects/84/15[...]
In both fsck.c and object-file.c we're using null_oid as a sentinel
value for checking whether we got far enough to be certain that the
issue was indeed this OID mismatch.
We need to add the "object corrupt or missing" special-case to deal
with cases where read_loose_object() will return an error before
completing check_object_signature(), e.g. if we have an error in
unpack_loose_rest() because we find garbage after the valid gzip
content:
$ git hash-object --stdin -w -t blob </dev/null
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
$ chmod 755 objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
$ echo garbage >>objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
$ git fsck
error: garbage at end of loose object 'e69d[...]'
error: unable to unpack contents of ./objects/e6/9d[...]
error: e69d[...]: object corrupt or missing: ./objects/e6/9d[...]
There is currently some weird messaging in the edge case when the two
are combined, i.e. because we're not explicitly passing along an error
state about this specific scenario from check_stream_oid() via
read_loose_object() we'll end up printing the null OID if an object is
of an unknown type *and* it can't be unpacked by zlib, e.g.:
$ git hash-object --stdin -w -t garbage --literally </dev/null
8315a83d2acc4c174aed59430f9a9c4ed926440f
$ chmod 755 objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f
$ echo garbage >>objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f
$ /usr/bin/git fsck
fatal: invalid object type
$ ~/g/git/git fsck
error: garbage at end of loose object '8315a83d2acc4c174aed59430f9a9c4ed926440f'
error: unable to unpack contents of ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f
error: 8315a83d2acc4c174aed59430f9a9c4ed926440f: object corrupt or missing: ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f
error: 0000000000000000000000000000000000000000: object is of unknown type 'garbage': ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f
[...]
I think it's OK to leave that for future improvements, which would
involve enum-ifying more error state as we've done with "enum
unpack_loose_header_result" in preceding commits. In these
increasingly more obscure cases the worst that can happen is that
we'll get slightly nonsensical or inapplicable error messages.
There are other such potential edge cases, all of which might produce
some confusing messaging, but still be handled correctly as far as
passing along errors goes. E.g. if check_object_signature() returns
and oideq(real_oid, null_oid()) is true, which could happen if it
returns -1 due to the read_istream() call having failed.
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
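The sentinel idea from this message, using an all-zero OID to tell whether hashing got far enough to confirm a mismatch, can be sketched as follows. Everything here is an assumption made for illustration: the `sketch_` types and `classify_failure()` are hypothetical, the 20-byte length assumes SHA-1, and the exact condition git's fsck uses may differ.

```c
#include <string.h>

#define SKETCH_OID_LEN 20 /* SHA-1 sized; an assumption for this sketch */

struct sketch_oid {
	unsigned char hash[SKETCH_OID_LEN];
};

enum sketch_report {
	SKETCH_CORRUPT_OR_MISSING,
	SKETCH_HASH_PATH_MISMATCH
};

/* Returns 1 if every byte of the OID is zero, i.e. it is still the sentinel. */
static int sketch_is_null(const struct sketch_oid *oid)
{
	static const struct sketch_oid null_oid; /* zero-initialized */

	return !memcmp(oid->hash, null_oid.hash, SKETCH_OID_LEN);
}

/*
 * Called only after the reader reported failure. If real_oid is still the
 * sentinel, hashing never completed, so a mismatch cannot be confirmed.
 */
static enum sketch_report classify_failure(const struct sketch_oid *real_oid)
{
	if (sketch_is_null(real_oid))
		return SKETCH_CORRUPT_OR_MISSING;
	return SKETCH_HASH_PATH_MISMATCH;
}
```

The caller initializes `real_oid` to the sentinel before calling the reader, so a failure that happens before hashing completes leaves it untouched and falls back to the generic report.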
2021-10-01 12:16:53 +03:00
|
|
|
struct object_id *real_oid,
|
2021-10-01 12:16:52 +03:00
|
|
|
void **contents,
|
|
|
|
struct object_info *oi)
|
2017-01-13 20:58:16 +03:00
|
|
|
{
|
|
|
|
int ret = -1;
|
2022-12-14 22:17:41 +03:00
|
|
|
int fd;
|
2017-01-13 20:58:16 +03:00
|
|
|
void *map = NULL;
|
|
|
|
unsigned long mapsize;
|
|
|
|
git_zstream stream;
|
2018-03-12 05:27:55 +03:00
|
|
|
char hdr[MAX_HEADER_LEN];
|
2021-10-01 12:16:52 +03:00
|
|
|
unsigned long *size = oi->sizep;
|
2017-01-13 20:58:16 +03:00
|
|
|
|
2022-12-14 22:17:41 +03:00
|
|
|
fd = git_open(path);
|
|
|
|
if (fd >= 0)
|
|
|
|
map = map_fd(fd, path, &mapsize);
|
2017-01-13 20:58:16 +03:00
|
|
|
if (!map) {
|
2018-07-21 10:49:39 +03:00
|
|
|
error_errno(_("unable to mmap %s"), path);
|
2017-01-13 20:58:16 +03:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2022-05-16 17:53:27 +03:00
|
|
|
if (unpack_loose_header(&stream, map, mapsize, hdr, sizeof(hdr),
|
|
|
|
NULL) != ULHR_OK) {
|
2018-07-21 10:49:39 +03:00
|
|
|
error(_("unable to unpack header of %s"), path);
|
2017-01-13 20:58:16 +03:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2021-10-01 12:16:52 +03:00
|
|
|
if (parse_loose_header(hdr, oi) < 0) {
|
2018-07-21 10:49:39 +03:00
|
|
|
error(_("unable to parse header of %s"), path);
|
2017-01-13 20:58:16 +03:00
|
|
|
git_inflate_end(&stream);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2021-10-01 12:16:52 +03:00
|
|
|
if (*oi->typep == OBJ_BLOB && *size > big_file_threshold) {
|
2019-01-07 11:37:02 +03:00
|
|
|
if (check_stream_oid(&stream, hdr, *size, path, expected_oid) < 0)
|
2017-01-13 20:58:16 +03:00
|
|
|
goto out;
|
|
|
|
} else {
|
2019-01-07 11:37:02 +03:00
|
|
|
*contents = unpack_loose_rest(&stream, hdr, *size, expected_oid);
|
2017-01-13 20:58:16 +03:00
|
|
|
if (!*contents) {
|
2018-07-21 10:49:39 +03:00
|
|
|
error(_("unable to unpack contents of %s"), path);
|
2017-01-13 20:58:16 +03:00
|
|
|
git_inflate_end(&stream);
|
|
|
|
goto out;
|
|
|
|
}
|
2022-02-05 02:48:32 +03:00
|
|
|
hash_object_file_literally(the_repository->hash_algo,
|
object-file: free(*contents) only in read_loose_object() caller
In the preceding commit a free() of uninitialized memory regression in
96e41f58fe1 (fsck: report invalid object type-path combinations,
2021-10-01) was fixed, but we'd still have an issue with leaking
memory from fsck_loose(). Let's fix that issue too.
That issue was introduced in my 31deb28f5e0 (fsck: don't hard die on
invalid object types, 2021-10-01). It can be reproduced under
SANITIZE=leak with the test I added in 093fffdfbec (fsck tests: add
test for fsck-ing an unknown type, 2021-10-01):
./t1450-fsck.sh --run=84 -vixd
In some sense it's not a problem, we lost the same amount of memory in
terms of things malloc'd and not free'd. It just moved from the "still
reachable" to "definitely lost" column in valgrind(1) nomenclature[1],
since we'd have die()'d before.
But now that we don't hard die() anymore in the library let's properly
free() it. Doing so makes this code much easier to follow, since we'll
now have one function owning the freeing of the "contents" variable,
not two.
For context on that memory management pattern the read_loose_object()
function was added in f6371f92104 (sha1_file: add read_loose_object()
function, 2017-01-13) and subsequently used in c68b489e564 (fsck:
parse loose object paths directly, 2017-01-13). The pattern of it
being the task of both sides to free() the memory has been there in
this form since its inception.
1. https://valgrind.org/docs/manual/mc-manual.html#mc-manual.leaks
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
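The ownership rule this message settles on, where the caller alone frees `contents` whether or not the read succeeded, can be illustrated with a small sketch. `read_thing()` and `check_thing()` are hypothetical functions, not git APIs. They only model the pattern of a reader that may fail after allocating.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical reader: on success or failure, anything stored in
 * *contents belongs to the caller, who is the only one to free it.
 */
static int read_thing(const char *src, int fail_late, char **contents)
{
	size_t len = strlen(src) + 1;

	*contents = malloc(len);
	if (!*contents)
		return -1;
	memcpy(*contents, src, len);
	if (fail_late)
		return -1; /* error after allocating: do not free here */
	return 0;
}

/* The caller owns the buffer and frees it on every path. */
static int check_thing(const char *src, int fail_late)
{
	char *contents = NULL;
	int ret = read_thing(src, fail_late, &contents);

	free(contents); /* free(NULL) is a no-op, so this is always safe */
	return ret;
}
```

Initializing the pointer to NULL in the caller is what makes the single free() safe on early failures, which is the kind of uninitialized-free problem the preceding commit is said to have fixed.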
2021-11-11 08:18:56 +03:00
|
|
|
*contents, *size,
|
2022-02-05 02:48:32 +03:00
|
|
|
oi->type_name->buf, real_oid);
|
2022-02-05 02:48:30 +03:00
|
|
|
if (!oideq(expected_oid, real_oid))
|
2017-01-13 20:58:16 +03:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = 0; /* everything checks out */
|
|
|
|
|
|
|
|
out:
|
|
|
|
if (map)
|
|
|
|
munmap(map, mapsize);
|
|
|
|
return ret;
|
|
|
|
}
|