2006-09-05 08:50:12 +04:00
|
|
|
#include "cache.h"
|
|
|
|
#include "tag.h"
|
|
|
|
#include "commit.h"
|
|
|
|
#include "tree.h"
|
|
|
|
#include "blob.h"
|
|
|
|
#include "diff.h"
|
|
|
|
#include "tree-walk.h"
|
|
|
|
#include "revision.h"
|
|
|
|
#include "list-objects.h"
|
|
|
|
|
|
|
|
static void process_blob(struct rev_info *revs,
|
|
|
|
struct blob *blob,
|
process_{tree,blob}: show objects without buffering
Here's a less trivial thing, and slightly more dubious one.
I was looking at that "struct object_array objects", and wondering why we
do that. I have honestly totally forgotten. Why not just call the "show()"
function as we encounter the objects? Rather than add the objects to the
object_array, and then at the very end going through the array and doing a
'show' on all, just do things more incrementally.
Now, there are possible downsides to this:
- the "buffer using object_array" _can_ in theory result in at least
better I-cache usage (two tight loops rather than one more spread out
one). I don't think this is a real issue, but in theory..
- this _does_ change the order of the objects printed. Instead of doing a
"process_tree(revs, commit->tree, &objects, NULL, "");" in the loop
over the commits (which puts all the root trees _first_ in the object
list), this patch just adds them to the list of pending objects, and
then we'll traverse them in that order (and thus show each root tree
object together with the objects we discover under it)
I _think_ the new ordering actually makes more sense, but the object
ordering is actually a subtle thing when it comes to packing
efficiency, so any change in order is going to have implications for
packing. Good or bad, I dunno.
- There may be some reason why we did it that odd way with the object
array, that I have simply forgotten.
Anyway, now that we don't buffer up the objects before showing them,
that may actually result in lower memory usage during that whole
traverse_commit_list() phase.
This is seriously not very deeply tested. It makes sense to me, it seems
to pass all the tests, it looks ok, but...
Does anybody remember why we did that "object_array" thing? It used to be
an "object_list" a long long time ago, but got changed into the array due
to better memory usage patterns (those linked lists of objects are
horrible from a memory allocation standpoint). But I wonder why we didn't
do this back then. Maybe there's a reason for it.
Or maybe there _used_ to be a reason, and no longer is.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-04-11 04:27:58 +04:00
|
|
|
show_object_fn show,
|
2006-09-05 08:50:12 +04:00
|
|
|
struct name_path *path,
|
2011-09-02 02:43:33 +04:00
|
|
|
const char *name,
|
|
|
|
void *cb_data)
|
2006-09-05 08:50:12 +04:00
|
|
|
{
|
|
|
|
struct object *obj = &blob->object;
|
|
|
|
|
|
|
|
if (!revs->blob_objects)
|
|
|
|
return;
|
2008-02-18 23:47:56 +03:00
|
|
|
if (!obj)
|
|
|
|
die("bad blob object");
|
2006-09-05 08:50:12 +04:00
|
|
|
if (obj->flags & (UNINTERESTING | SEEN))
|
|
|
|
return;
|
|
|
|
obj->flags |= SEEN;
|
2011-09-02 02:43:33 +04:00
|
|
|
show(obj, path, name, cb_data);
|
2006-09-05 08:50:12 +04:00
|
|
|
}
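The guard sequence in process_blob() above (skip objects already flagged UNINTERESTING or SEEN, then mark the object and hand it to the callback immediately instead of buffering it) can be sketched in isolation. This is only a minimal illustration: struct toy_object, the TOY_* flag values, toy_count() and toy_show_once() are hypothetical stand-ins and not git's own definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; git defines its own values elsewhere. */
#define TOY_UNINTERESTING (1u << 0)
#define TOY_SEEN          (1u << 1)

struct toy_object {
	unsigned flags;
};

typedef void (*toy_show_fn)(struct toy_object *obj, void *cb_data);

/* Illustration callback: counts how many objects were shown. */
static void toy_count(struct toy_object *obj, void *cb_data)
{
	(void)obj;
	++*(int *)cb_data;
}

/*
 * Show obj at most once, and never when it is uninteresting.
 * Returns 1 if the callback ran, 0 if the object was skipped.
 */
static int toy_show_once(struct toy_object *obj, toy_show_fn show, void *cb_data)
{
	assert(obj != NULL);
	if (obj->flags & (TOY_UNINTERESTING | TOY_SEEN))
		return 0;
	obj->flags |= TOY_SEEN;   /* later visits become no-ops */
	show(obj, cb_data);       /* emit right away, no buffering */
	return 1;
}
```

Calling toy_show_once() twice on the same object invokes the callback only the first time, which is what lets a walk visit shared blobs and trees from many parents without reporting them repeatedly.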
|
|
|
|
|
2007-04-13 20:25:01 +04:00
|
|
|
/*
|
|
|
|
* Processing a gitlink entry currently does nothing, since
|
|
|
|
* we do not recurse into the subproject.
|
|
|
|
*
|
|
|
|
* We *could* eventually add a flag that actually does that,
|
|
|
|
* which would involve:
|
|
|
|
* - is the subproject actually checked out?
|
|
|
|
* - if so, see if the subproject has already been added
|
|
|
|
* to the alternates list, and add it if not.
|
|
|
|
* - process the commit (or tag) the gitlink points to
|
|
|
|
* recursively.
|
|
|
|
*
|
|
|
|
* However, it's unclear whether there is really ever any
|
|
|
|
* reason to see superprojects and subprojects as such a
|
|
|
|
* "unified" object pool (potentially resulting in a totally
|
|
|
|
* humongous pack - avoiding which was the whole point of
|
|
|
|
* having gitlinks in the first place!).
|
|
|
|
*
|
|
|
|
* So for now, there is just a note that we *could* follow
|
|
|
|
* the link, and how to do it. Whether it necessarily makes
|
|
|
|
* any sense what-so-ever to ever do that is another issue.
|
|
|
|
*/
|
|
|
|
static void process_gitlink(struct rev_info *revs,
|
|
|
|
const unsigned char *sha1,
|
2009-04-11 04:27:58 +04:00
|
|
|
show_object_fn show,
|
2007-04-13 20:25:01 +04:00
|
|
|
struct name_path *path,
|
2011-09-02 02:43:33 +04:00
|
|
|
const char *name,
|
|
|
|
void *cb_data)
|
2007-04-13 20:25:01 +04:00
|
|
|
{
|
|
|
|
/* Nothing to do */
|
|
|
|
}
|
|
|
|
|
2006-09-05 08:50:12 +04:00
|
|
|
static void process_tree(struct rev_info *revs,
|
|
|
|
struct tree *tree,
|
2009-04-11 04:27:58 +04:00
|
|
|
show_object_fn show,
|
2006-09-05 08:50:12 +04:00
|
|
|
struct name_path *path,
|
2010-12-17 16:26:47 +03:00
|
|
|
struct strbuf *base,
|
2011-09-02 02:43:33 +04:00
|
|
|
const char *name,
|
|
|
|
void *cb_data)
|
2006-09-05 08:50:12 +04:00
|
|
|
{
|
|
|
|
struct object *obj = &tree->object;
|
|
|
|
struct tree_desc desc;
|
|
|
|
struct name_entry entry;
|
|
|
|
struct name_path me;
|
2011-10-24 10:36:10 +04:00
|
|
|
enum interesting match = revs->diffopt.pathspec.nr == 0 ?
|
|
|
|
all_entries_interesting: entry_not_interesting;
|
2010-12-17 16:26:47 +03:00
|
|
|
int baselen = base->len;
|
2006-09-05 08:50:12 +04:00
|
|
|
|
|
|
|
if (!revs->tree_objects)
|
|
|
|
return;
|
2008-02-18 23:47:56 +03:00
|
|
|
if (!obj)
|
|
|
|
die("bad tree object");
|
2006-09-05 08:50:12 +04:00
|
|
|
if (obj->flags & (UNINTERESTING | SEEN))
|
|
|
|
return;
|
add `ignore_missing_links` mode to revwalk
When pack-objects is computing the reachability bitmap to
serve a fetch request, it can erroneously die() if some of
the UNINTERESTING objects are not present. Upload-pack
throws away HAVE lines from the client for objects we do not
have, but we may have a tip object without all of its
ancestors (e.g., if the tip is no longer reachable and was
new enough to survive a `git prune`, but some of its
reachable objects did get pruned).
In the non-bitmap case, we do a revision walk with the HAVE
objects marked as UNINTERESTING. The revision walker
explicitly ignores errors in accessing UNINTERESTING commits
to handle this case (and we do not bother looking at
UNINTERESTING trees or blobs at all).
When we have bitmaps, however, the process is quite
different. The bitmap index for a pack-objects run is
calculated in two separate steps:
First, we perform an extensive walk from all the HAVEs to
find the full set of objects reachable from them. This walk
is usually optimized away because we are expected to hit an
object with a bitmap during the traversal, which allows us
to terminate early.
Secondly, we perform an extensive walk from all the WANTs,
which usually also terminates early because we hit a commit
with an existing bitmap.
Once we have the resulting bitmaps from the two walks, we
AND-NOT them together to obtain the resulting set of objects
we need to pack.
When we are walking the HAVE objects, the revision walker
does not know that we are walking it only to mark the
results as uninteresting. We strip out the UNINTERESTING flag,
because those objects _are_ interesting to us during the
first walk. We want to keep going to get a complete set of
reachable objects if we can.
We need some way to tell the revision walker that it's OK to
silently truncate the HAVE walk, just like it does for the
UNINTERESTING case. This patch introduces a new
`ignore_missing_links` flag to the `rev_info` struct, which
we set only for the HAVE walk.
It also adds tests to cover UNINTERESTING objects missing
from several positions: a missing blob, a missing tree, and
a missing parent commit. The missing blob already worked (as
we do not care about its contents at all), but the other two
cases caused us to die().
Note that there are a few cases we do not need to test:
1. We do not need to test a missing tree, with the blob
still present. Without the tree that refers to it, we
would not know that the blob is relevant to our walk.
2. We do not need to test a tip commit that is missing.
Upload-pack omits these for us (and in fact, we
complain even in the non-bitmap case if it fails to do
so).
Reported-by: Siddharth Agarwal <sid0@fb.com>
Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-03-28 14:00:43 +04:00
|
|
|
if (parse_tree(tree) < 0) {
|
|
|
|
if (revs->ignore_missing_links)
|
|
|
|
return;
|
2006-09-05 08:50:12 +04:00
|
|
|
die("bad tree object %s", sha1_to_hex(obj->sha1));
|
2014-03-28 14:00:43 +04:00
|
|
|
}
|
2006-09-05 08:50:12 +04:00
|
|
|
obj->flags |= SEEN;
|
2011-09-02 02:43:33 +04:00
|
|
|
show(obj, path, name, cb_data);
|
2006-09-05 08:50:12 +04:00
|
|
|
me.up = path;
|
|
|
|
me.elem = name;
|
|
|
|
me.elem_len = strlen(name);
|
|
|
|
|
2011-03-25 12:34:20 +03:00
|
|
|
if (!match) {
|
2010-12-17 16:26:47 +03:00
|
|
|
strbuf_addstr(base, name);
|
|
|
|
if (base->len)
|
|
|
|
strbuf_addch(base, '/');
|
|
|
|
}
|
|
|
|
|
2007-03-21 20:08:25 +03:00
|
|
|
init_tree_desc(&desc, tree->buffer, tree->size);
|
2006-09-05 08:50:12 +04:00
|
|
|
|
|
|
|
while (tree_entry(&desc, &entry)) {
|
2011-10-24 10:36:10 +04:00
|
|
|
if (match != all_entries_interesting) {
|
2011-03-25 12:34:20 +03:00
|
|
|
match = tree_entry_interesting(&entry, base, 0,
|
|
|
|
&revs->diffopt.pathspec);
|
2011-10-24 10:36:10 +04:00
|
|
|
if (match == all_entries_not_interesting)
|
2010-12-17 16:26:47 +03:00
|
|
|
break;
|
2011-10-24 10:36:10 +04:00
|
|
|
if (match == entry_not_interesting)
|
2010-12-17 16:26:47 +03:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2006-09-05 08:50:12 +04:00
|
|
|
if (S_ISDIR(entry.mode))
|
|
|
|
process_tree(revs,
|
|
|
|
lookup_tree(entry.sha1),
|
2011-09-02 02:43:33 +04:00
|
|
|
show, &me, base, entry.path,
|
|
|
|
cb_data);
|
2007-05-22 00:08:28 +04:00
|
|
|
else if (S_ISGITLINK(entry.mode))
|
2007-04-13 20:25:01 +04:00
|
|
|
process_gitlink(revs, entry.sha1,
|
2011-09-02 02:43:33 +04:00
|
|
|
show, &me, entry.path,
|
|
|
|
cb_data);
|
2006-09-05 08:50:12 +04:00
|
|
|
else
|
|
|
|
process_blob(revs,
|
|
|
|
lookup_blob(entry.sha1),
|
2011-09-02 02:43:33 +04:00
|
|
|
show, &me, entry.path,
|
|
|
|
cb_data);
|
2006-09-05 08:50:12 +04:00
|
|
|
}
|
2010-12-17 16:26:47 +03:00
|
|
|
strbuf_setlen(base, baselen);
|
2013-06-06 02:37:39 +04:00
|
|
|
free_tree_buffer(tree);
|
2006-09-05 08:50:12 +04:00
|
|
|
}
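process_tree() above records base->len in baselen, appends the directory name plus a '/' before walking the entries, and restores the old length with strbuf_setlen() on the way out (the append appears to happen only when a pathspec is in effect, as the `if (!match)` guard suggests). The save/extend/restore pattern can be sketched on its own. The fixed-size struct toy_buf and the toy_* helpers below are hypothetical stand-ins for git's growable strbuf, chosen only to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size stand-in for a growable string buffer. */
struct toy_buf {
	char buf[256];
	size_t len;
};

/* Append s, truncating silently if the buffer would overflow. */
static void toy_addstr(struct toy_buf *b, const char *s)
{
	size_t n = strlen(s);
	size_t room = sizeof(b->buf) - 1 - b->len;

	if (n > room)
		n = room;
	memcpy(b->buf + b->len, s, n);
	b->len += n;
	b->buf[b->len] = '\0';
}

/* Truncate back to a previously recorded length. */
static void toy_setlen(struct toy_buf *b, size_t len)
{
	assert(len <= b->len);
	b->len = len;
	b->buf[len] = '\0';
}

/*
 * Enter a directory: extend the base path with "name/" and return
 * the length to restore when leaving it. An empty name (the root
 * tree) leaves an empty base untouched.
 */
static size_t toy_enter(struct toy_buf *base, const char *name)
{
	size_t baselen = base->len;

	toy_addstr(base, name);
	if (base->len)
		toy_addstr(base, "/");
	return baselen;
}
```

Because every level restores the length it saved, one buffer can be shared across the whole recursion instead of allocating a fresh path string per directory.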
|
|
|
|
|
2006-09-06 12:42:23 +04:00
|
|
|
static void mark_edge_parents_uninteresting(struct commit *commit,
|
|
|
|
struct rev_info *revs,
|
|
|
|
show_edge_fn show_edge)
|
|
|
|
{
|
|
|
|
struct commit_list *parents;
|
|
|
|
|
|
|
|
for (parents = commit->parents; parents; parents = parents->next) {
|
|
|
|
struct commit *parent = parents->item;
|
|
|
|
if (!(parent->object.flags & UNINTERESTING))
|
|
|
|
continue;
|
|
|
|
mark_tree_uninteresting(parent->tree);
|
|
|
|
if (revs->edge_hint && !(parent->object.flags & SHOWN)) {
|
|
|
|
parent->object.flags |= SHOWN;
|
|
|
|
show_edge(parent);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-08-16 13:52:06 +04:00
|
|
|
void mark_edges_uninteresting(struct rev_info *revs, show_edge_fn show_edge)
|
2006-09-06 12:42:23 +04:00
|
|
|
{
|
2013-08-16 13:52:06 +04:00
|
|
|
struct commit_list *list;
|
list-objects: mark more commits as edges in mark_edges_uninteresting
The purpose of edge commits is to let pack-objects know what objects
it can use as a base, but does not need to include in the thin pack
because the other side is supposed to already have them. So far we
mark uninteresting parents of interesting commits as edges. But even
an unrelated uninteresting commit (that the other side has) may
become a good base for pack-objects and help produce more efficient
packs.
This is especially true for shallow clones, when the client issues a
fetch with a depth smaller than or equal to the number of commits the
server is ahead of the client. For example, in this commit history
the client has up to "A" and the server has up to "B":
-------A---B
 have--^   ^
          /
   want--+
If depth 1 is requested, the commit list to send to the client
includes only B. As m_e_u (mark_edges_uninteresting) currently works,
it checks whether the parent commits of B are uninteresting and, if
so, marks them as edges. Due to the shallow effect, commit B is grafted
to have no parents and the revision walker never sees A as the parent
of B. In fact it marks no edges at all in this simple case and sends
everything B has to the client, even though it could have excluded
what A (and therefore the client) already has.
In a slightly different case where A is not a direct parent of B
(i.e. there are commits in between A and B), marking A as an edge can
still save some data because B may still share content with the far
ancestor A.
There is another case from the earlier patch, when we deepen a ref
from C->E to A->E:
---A---B    C---D---E
 want--^    ^       ^
   shallow-+       /
       have-------+
In this case we need to send A and B to the client, and C (i.e. the
current shallow point that the client reports to the server) is a very
good base because it's closest to A and B. Normal m_e_u won't recognize
C as an edge because it only looks back to parents (i.e. A<-B), not the
opposite way (B->C), even if C is already marked as an uninteresting
commit by the previous patch.
This patch includes all uninteresting commits from command line as
edges and lets pack-objects decide what's best to do. The upside is we
have a better chance of producing better packs in certain cases. The
downside is we may need to process some extra objects on the server
side.
For the shallow case on git.git, when the client is 5 commits behind
and does "fetch --depth=3", the resulting pack is 99.26 KiB instead of
4.92 MiB.
Reported-and-analyzed-by: Matthijs Kooijman <matthijs@stdin.nl>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-16 13:52:07 +04:00
|
|
|
int i;
|
|
|
|
|
2013-08-16 13:52:06 +04:00
|
|
|
for (list = revs->commits; list; list = list->next) {
|
2006-09-06 12:42:23 +04:00
|
|
|
struct commit *commit = list->item;
|
|
|
|
|
|
|
|
if (commit->object.flags & UNINTERESTING) {
|
|
|
|
mark_tree_uninteresting(commit->tree);
|
2014-12-25 02:05:39 +03:00
|
|
|
if (revs->edge_hint_aggressive && !(commit->object.flags & SHOWN)) {
|
2013-08-16 13:52:07 +04:00
|
|
|
commit->object.flags |= SHOWN;
|
|
|
|
show_edge(commit);
|
|
|
|
}
|
2006-09-06 12:42:23 +04:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
mark_edge_parents_uninteresting(commit, revs, show_edge);
|
|
|
|
}
|
2014-12-25 02:05:39 +03:00
|
|
|
if (revs->edge_hint_aggressive) {
|
list-objects: only look at cmdline trees with edge_hint
When rev-list is given a command-line like:
git rev-list --objects $commit --not --all
the most accurate answer is the difference between the set
of objects reachable from $commit and the set reachable from
all of the existing refs. However, we have not historically
provided that answer, because it is very expensive to
calculate. We would have to open every tree of every commit
in the entire history.
Instead, we find the accurate set difference of the
reachable commits, and then mark the trees at the boundaries
as uninteresting. This misses objects which appear in the
trees of both the interesting commits and deep within the
uninteresting history.
Commit fbd4a70 (list-objects: mark more commits as edges in
mark_edges_uninteresting, 2013-08-16) noticed that we miss
those objects during pack-objects, and added code to examine
the trees of all of the "--not" refs given on the
command-line. Note that this is still not the complete set
difference, because we look only at the tips of the
command-line arguments, not all of their reachable commits.
But it increases the set of boundary objects we consider,
which is especially important for shallow fetches. So we
are trading extra CPU time for a larger set of boundary
objects, which can improve the resulting pack size for a
--thin pack.
This tradeoff probably makes sense in the context of
pack-objects, where we have set revs->edge_hint to have the
traversal feed us the set of boundary objects. For a
regular rev-list, though, it is probably not a good
tradeoff. It is true that it makes our list slightly closer
to a true set difference, but it is a rare case where this
is important. And because we do not have revs->edge_hint
set, we do nothing useful with the larger set of boundary
objects.
This patch therefore ties the extra tree examination to the
revs->edge_hint flag; it is the presence of that flag that
makes the tradeoff worthwhile.
Here is output from the p0001-rev-list showing the
improvement in performance:
Test                                             HEAD^             HEAD
-----------------------------------------------------------------------------------------
0001.1: rev-list --all                           0.69(0.65+0.02)   0.69(0.66+0.02) +0.0%
0001.2: rev-list --all --objects                 3.22(3.19+0.03)   3.23(3.20+0.03) +0.3%
0001.4: rev-list $commit --not --all             0.04(0.04+0.00)   0.04(0.04+0.00) +0.0%
0001.5: rev-list --objects $commit --not --all   0.27(0.26+0.01)   0.04(0.04+0.00) -85.2%
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-01-21 06:25:40 +04:00
|
|
|
for (i = 0; i < revs->cmdline.nr; i++) {
|
|
|
|
struct object *obj = revs->cmdline.rev[i].item;
|
|
|
|
struct commit *commit = (struct commit *)obj;
|
|
|
|
if (obj->type != OBJ_COMMIT || !(obj->flags & UNINTERESTING))
|
|
|
|
continue;
|
|
|
|
mark_tree_uninteresting(commit->tree);
|
|
|
|
if (!(obj->flags & SHOWN)) {
|
|
|
|
obj->flags |= SHOWN;
|
|
|
|
show_edge(commit);
|
|
|
|
}
|
2013-08-16 13:52:07 +04:00
|
|
|
}
|
|
|
|
}
|
2006-09-06 12:42:23 +04:00
|
|
|
}
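The core of the edge marking above (for every interesting commit, report each UNINTERESTING parent exactly once, using the SHOWN flag to suppress repeats) can be sketched with toy types. This is a simplified illustration under stated assumptions: struct toy_commit, the TOY_* flags and the extra cb_data argument on the callback are hypothetical, and the sketch leaves out the tree marking (mark_tree_uninteresting) and the edge_hint checks that the real functions perform.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for the sketch. */
#define TOY_UNINTERESTING (1u << 0)
#define TOY_SHOWN         (1u << 1)

struct toy_commit {
	unsigned flags;
	struct toy_commit **parents; /* NULL-terminated array, or NULL */
};

typedef void (*toy_edge_fn)(struct toy_commit *c, void *cb_data);

/* Illustration callback: counts reported edge commits. */
static void toy_count_edge(struct toy_commit *c, void *cb_data)
{
	(void)c;
	++*(int *)cb_data;
}

/* Report each uninteresting parent of an interesting commit once. */
static void toy_mark_edges(struct toy_commit **list, size_t nr,
			   toy_edge_fn show_edge, void *cb_data)
{
	size_t i;

	assert(list != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		struct toy_commit *c = list[i];
		struct toy_commit **p;

		if (c->flags & TOY_UNINTERESTING)
			continue; /* only interesting commits have edges here */
		for (p = c->parents; p && *p; p++) {
			if (!((*p)->flags & TOY_UNINTERESTING))
				continue;
			if ((*p)->flags & TOY_SHOWN)
				continue; /* already reported via another child */
			(*p)->flags |= TOY_SHOWN;
			show_edge(*p, cb_data);
		}
	}
}
```

With two interesting commits sharing one uninteresting parent, the callback fires once; the SHOWN bit is what prevents the duplicate report.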
|
|
|
|
|
2009-04-11 04:27:58 +04:00
|
|
|
static void add_pending_tree(struct rev_info *revs, struct tree *tree)
|
|
|
|
{
|
|
|
|
add_pending_object(revs, &tree->object, "");
|
|
|
|
}
|
|
|
|
|
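The trade-off discussed in the commit message above (collect every object into an array and show them all at the end, versus calling show() as each object is encountered) can be illustrated with a small toy model. This is only a sketch under stated assumptions: `toy_object`, `toy_array`, `walk_buffered`, `walk_incremental` and `count_shown` are hypothetical names invented for illustration and are not part of git. The toy also does not model the reordering the message mentions, which comes from deferring root trees to the pending list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a git object: a "tree" has children, a "blob" has none. */
struct toy_object {
	const char *name;
	struct toy_object **children;
	size_t nr_children;
};

typedef void (*toy_show_fn)(struct toy_object *obj, void *data);

/* Growable array, playing the role of the old "struct object_array objects". */
struct toy_array {
	struct toy_object **items;
	size_t nr, alloc;
};

static void toy_array_add(struct toy_array *a, struct toy_object *obj)
{
	if (a->nr == a->alloc) {
		a->alloc = a->alloc ? 2 * a->alloc : 8;
		a->items = realloc(a->items, a->alloc * sizeof(*a->items));
		if (!a->items)
			abort();
	}
	a->items[a->nr++] = obj;
}

/* Buffered style, step 1: record the object and everything below it. */
static void collect(struct toy_object *obj, struct toy_array *out)
{
	size_t i;

	toy_array_add(out, obj);
	for (i = 0; i < obj->nr_children; i++)
		collect(obj->children[i], out);
}

/* Buffered style, step 2: show everything at the very end.
 * Returns how many entries had to be held in memory at once. */
static size_t walk_buffered(struct toy_object *root, toy_show_fn show, void *data)
{
	struct toy_array buf = { NULL, 0, 0 };
	size_t i, buffered;

	collect(root, &buf);
	buffered = buf.nr;
	for (i = 0; i < buf.nr; i++)
		show(buf.items[i], data);
	free(buf.items);
	return buffered;
}

/* Incremental style: show each object as soon as it is encountered. */
static void walk_incremental(struct toy_object *obj, toy_show_fn show, void *data)
{
	size_t i;

	show(obj, data);
	for (i = 0; i < obj->nr_children; i++)
		walk_incremental(obj->children[i], show, data);
}

/* Example show callback: counts how many objects were shown. */
static void count_shown(struct toy_object *obj, void *data)
{
	(void)obj;
	++*(size_t *)data;
}
```

Both walkers call the callback once per object; the buffered one additionally holds one array slot per object until the final loop, which is the memory cost the commit message suspects can be avoided.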
2006-09-05 08:50:12 +04:00
|
|
|
void traverse_commit_list(struct rev_info *revs,
|
2009-04-06 23:28:36 +04:00
|
|
|
show_commit_fn show_commit,
|
|
|
|
show_object_fn show_object,
|
|
|
|
void *data)
|
2006-09-05 08:50:12 +04:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
struct commit *commit;
|
2010-12-17 16:26:47 +03:00
|
|
|
struct strbuf base;
|
2006-09-05 08:50:12 +04:00
|
|
|
|
2010-12-17 16:26:47 +03:00
|
|
|
strbuf_init(&base, PATH_MAX);
|
2006-09-05 08:50:12 +04:00
|
|
|
while ((commit = get_revision(revs)) != NULL) {
|
2011-03-14 22:29:50 +03:00
|
|
|
/*
|
|
|
|
* an uninteresting boundary commit may not have its tree
|
|
|
|
* parsed yet, but we are not going to show them anyway
|
|
|
|
*/
|
|
|
|
if (commit->tree)
|
|
|
|
add_pending_tree(revs, commit->tree);
|
2009-04-06 23:28:36 +04:00
|
|
|
show_commit(commit, data);
|
2006-09-05 08:50:12 +04:00
|
|
|
}
|
|
|
|
for (i = 0; i < revs->pending.nr; i++) {
|
|
|
|
struct object_array_entry *pending = revs->pending.objects + i;
|
|
|
|
struct object *obj = pending->item;
|
|
|
|
const char *name = pending->name;
|
traverse_commit_list: support pending blobs/trees with paths
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).
When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blob to the empty string.
However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:
1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).
2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.
Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-10-16 02:43:19 +04:00
|
|
|
const char *path = pending->path;
|
2006-09-05 08:50:12 +04:00
|
|
|
if (obj->flags & (UNINTERESTING | SEEN))
|
|
|
|
continue;
|
|
|
|
if (obj->type == OBJ_TAG) {
|
|
|
|
obj->flags |= SEEN;
|
2011-09-02 02:43:33 +04:00
|
|
|
show_object(obj, NULL, name, data);
|
2006-09-05 08:50:12 +04:00
|
|
|
continue;
|
|
|
|
}
|
2014-10-16 02:43:19 +04:00
|
|
|
if (!path)
|
|
|
|
path = "";
|
2006-09-05 08:50:12 +04:00
|
|
|
if (obj->type == OBJ_TREE) {
|
2009-04-11 04:27:58 +04:00
|
|
|
process_tree(revs, (struct tree *)obj, show_object,
|
2014-10-16 02:43:19 +04:00
|
|
|
NULL, &base, path, data);
|
2006-09-05 08:50:12 +04:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (obj->type == OBJ_BLOB) {
|
2009-04-11 04:27:58 +04:00
|
|
|
process_blob(revs, (struct blob *)obj, show_object,
|
2014-10-16 02:43:19 +04:00
|
|
|
NULL, path, data);
|
2006-09-05 08:50:12 +04:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
die("unknown pending object %s (%s)",
|
|
|
|
sha1_to_hex(obj->sha1), name);
|
|
|
|
}
|
2014-10-16 02:34:34 +04:00
|
|
|
object_array_clear(&revs->pending);
|
2010-12-17 16:26:47 +03:00
|
|
|
strbuf_release(&base);
|
2006-09-05 08:50:12 +04:00
|
|
|
}
|
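The name-versus-path distinction that traverse_commit_list relies on (use the pending entry's path as the root path, defaulting to "" when it is unset, and never use the name) can be sketched in isolation. This is a minimal illustration, not git's implementation: `toy_pending`, `toy_root_path` and `toy_join` are hypothetical names, the struct is a trimmed-down model rather than git's real object_array entry, and the "/" joining is only an assumption about how a root path and an entry name combine.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, trimmed-down model of a pending entry; not git's real struct. */
struct toy_pending {
	const char *name; /* how the object was named, e.g. the ref it was found at */
	const char *path; /* path of the object within a tree, or NULL if unknown */
};

/* Mirrors the `if (!path) path = "";` default in traverse_commit_list. */
static const char *toy_root_path(const struct toy_pending *p)
{
	return p->path ? p->path : "";
}

/* Join a root path and an entry name into out (assumed "/" separator); returns out. */
static char *toy_join(char *out, size_t len, const char *root, const char *entry)
{
	if (*root)
		snprintf(out, len, "%s/%s", root, entry);
	else
		snprintf(out, len, "%s", entry);
	return out;
}
```

With an entry whose name is "v2.6.11-tree" and whose path is NULL, `toy_root_path()` yields "" and a child entry is reported as plain "Makefile"; joining with the name instead would give the nonsensical "v2.6.11-tree/Makefile" from the commit message. A caller that does know the path (say, a hypothetical entry with path "Documentation") gets "Documentation/Makefile".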