<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/object.h, branch v2.45.3</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=v2.45.3</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.45.3'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2024-03-28T21:13:50Z</updated>
<entry>
<title>Merge branch 'eb/hash-transition'</title>
<updated>2024-03-28T21:13:50Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2024-03-28T21:13:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=1002f28a527d33893f7dab068dbac7011f84af65'/>
<id>urn:sha1:1002f28a527d33893f7dab068dbac7011f84af65</id>
<content type='text'>
Work to support a repository that work with both SHA-1 and SHA-256
hash algorithms has started.

* eb/hash-transition: (30 commits)
  t1016-compatObjectFormat: add tests to verify the conversion between objects
  t1006: test oid compatibility with cat-file
  t1006: rename sha1 to oid
  test-lib: compute the compatibility hash so tests may use it
  builtin/ls-tree: let the oid determine the output algorithm
  object-file: handle compat objects in check_object_signature
  tree-walk: init_tree_desc take an oid to get the hash algorithm
  builtin/cat-file: let the oid determine the output algorithm
  rev-parse: add an --output-object-format parameter
  repository: implement extensions.compatObjectFormat
  object-file: update object_info_extended to reencode objects
  object-file-convert: convert commits that embed signed tags
  object-file-convert: convert commit objects when writing
  object-file-convert: don't leak when converting tag objects
  object-file-convert: convert tag objects when writing
  object-file-convert: add a function to convert trees between algorithms
  object: factor out parse_mode out of fast-import and tree-walk into in object.h
  cache: add a function to read an OID of a specific algorithm
  tag: sign both hashes
  commit: export add_header_signature to support handling signatures on tags
  ...
</content>
</entry>
<entry>
<title>upload-pack: free tree buffers after parsing</title>
<updated>2024-02-28T22:42:01Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2024-02-28T22:39:07Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=6cd05e768b7e54ca48b16fb0214df4c70aecd46c'/>
<id>urn:sha1:6cd05e768b7e54ca48b16fb0214df4c70aecd46c</id>
<content type='text'>
When a client sends us a "want" or "have" line, we call parse_object()
to get an object struct. If the object is a tree, then the parsed state
means that tree-&gt;buffer points to the uncompressed contents of the tree.
But we don't really care about it. We only really need to parse commits
and tags; for trees and blobs, the important output is just a "struct
object" with the correct type.

But much worse, we do not ever free that tree buffer. It's not leaked in
the traditional sense, in that we still have a pointer to it from the
global object hash. But if the client requests many trees, we'll hold
all of their contents in memory at the same time.

Nobody really noticed because it's rare for clients to directly request
a tree. It might happen for a lightweight tag pointing straight at a
tree, or it might happen for a "tree:depth" partial clone filling in
missing trees.

But it's also possible for a malicious client to request a lot of trees,
causing upload-pack's memory to balloon. For example, without this
patch, requesting every tree in git.git like:

  pktline() {
    local msg="$*"
    printf "%04x%s\n" $((1+4+${#msg})) "$msg"
  }

  want_trees() {
    pktline command=fetch
    printf 0001
    git cat-file --batch-all-objects --batch-check='%(objectname) %(objecttype)' |
      while read oid type; do
        test "$type" = "tree" || continue
        pktline want $oid
      done
      pktline done
      printf 0000
  }

  want_trees | GIT_PROTOCOL=version=2 valgrind --tool=massif ./git upload-pack . &gt;/dev/null

shows a peak heap usage of ~3.7GB. Which is just about the sum of the
sizes of all of the uncompressed trees. For linux.git, it's closer to
17GB.

So the obvious thing to do is to call free_tree_buffer() after we
realize that we've parsed a tree. We know that upload-pack won't need it
later. But let's push the logic into parse_object_with_flags(), telling
it to discard the tree buffer immediately. There are two reasons for
this. One, all of the relevant call-sites already call the with_options
variant to pass the SKIP_HASH flag. So it actually ends up as less code
than manually free-ing in each spot. And two, it enables an extra
optimization that I'll discuss below.

I've touched all of the sites that currently use SKIP_HASH in
upload-pack. That drops the peak heap of the upload-pack invocation
above from 3.7GB to ~24MB.

I've also modified the caller in get_reference(); a partial clone
benefits from its use in pack-objects for the reasons given in
0bc2557951 (upload-pack: skip parse-object re-hashing of "want" objects,
2022-09-06), where we were measuring blob requests. But note that the
results of get_reference() are used for traversing, as well; so we
really would _eventually_ use the tree contents. That makes this at
first glance a space/time tradeoff: we won't hold all of the trees in
memory at once, but we'll have to reload them each when it comes time to
traverse.

And here's where our extra optimization comes in. If the caller is not
going to immediately look at the tree contents, and it doesn't care
about checking the hash, then parse_object() can simply skip loading the
tree entirely, just like we do for blobs! And now it's not a space/time
tradeoff in get_reference() anymore. It's just a lazy-load: we're
delaying reading the tree contents until it's time to actually traverse
them one by one.

And of course for upload-pack, this optimization means we never load the
trees at all, saving lots of CPU time. Timing the "every tree from
git.git" request above shows upload-pack dropping from 32 seconds of CPU
to 19 (the remainder is mostly due to pack-objects actually sending the
pack; timing just the upload-pack portion shows we go from 13s to
~0.28s).

These are all highly gamed numbers, of course. For real-world
partial-clone requests we're saving only a small bit of time in
practice. But it does help harden upload-pack against malicious
denial-of-service attacks.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>object: factor out parse_mode out of fast-import and tree-walk into in object.h</title>
<updated>2023-10-02T21:57:39Z</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2023-10-02T02:40:18Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=45b3b1214154975ac03c818838e3d12571f57d1e'/>
<id>urn:sha1:45b3b1214154975ac03c818838e3d12571f57d1e</id>
<content type='text'>
builtin/fast-import.c and tree-walk.c have almost identical version of
get_mode.  The two functions started out the same but have diverged
slightly.  The version in fast-import changed mode to a uint16_t to
save memory.  The version in tree-walk started erroring if no mode was
present.

As far as I can tell both of these changes are valid for both of the
callers, so add the both changes and place the common parsing helper
in object.h

Rename the helper from get_mode to parse_mode so it does not
conflict with another helper named get_mode in diff-no-index.c

This will be used shortly in a new helper decode_tree_entry_raw
which is used to compute cmpatibility objects as part of
the sha256 transition.

Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'tb/pack-bitmap-traversal-with-boundary'</title>
<updated>2023-06-22T23:29:05Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2023-06-22T23:29:05Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=f2ffc7418685f75e43e2919426276141fd62c656'/>
<id>urn:sha1:f2ffc7418685f75e43e2919426276141fd62c656</id>
<content type='text'>
The object traversal using reachability bitmap done by
"pack-object" has been tweaked to take advantage of the fact that
using "boundary" commits as representative of all the uninteresting
ones can save quite a lot of object enumeration.

* tb/pack-bitmap-traversal-with-boundary:
  pack-bitmap.c: use commit boundary during bitmap traversal
  pack-bitmap.c: extract `fill_in_bitmap()`
  object: add object_array initializer helper function
</content>
</entry>
<entry>
<title>object: add object_array initializer helper function</title>
<updated>2023-05-08T19:05:55Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2023-05-08T17:38:06Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=fe90355361430dc52f858845a821370db0c54c80'/>
<id>urn:sha1:fe90355361430dc52f858845a821370db0c54c80</id>
<content type='text'>
The object_array API has an OBJECT_ARRAY_INIT macro, but lacks a
function to initialize an object_array at a given location in memory.

Introduce `object_array_init()` to implement such a function.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>hash-ll.h: split out of hash.h to remove dependency on repository.h</title>
<updated>2023-04-24T19:47:32Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2023-04-22T20:17:20Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=d1cbe1e6d8a9cab2b4ffe8a17d34db214dce1e49'/>
<id>urn:sha1:d1cbe1e6d8a9cab2b4ffe8a17d34db214dce1e49</id>
<content type='text'>
hash.h depends upon and includes repository.h, due to the definition and
use of the_hash_algo (defined as the_repository-&gt;hash_algo).  However,
most headers trying to include hash.h are only interested in the layout
of the structs like object_id.  Move the parts of hash.h that do not
depend upon repository.h into a new file hash-ll.h (the "low level"
parts of hash.h), and adjust other files to use this new header where
the convenience inline functions aren't needed.

This allows hash.h and object.h to be fairly small, minimal headers.  It
also exposes a lot of hidden dependencies on both path.h (which was
brought in by repository.h) and repository.h (which was previously
implicitly brought in by object.h), so also adjust other files to be
more explicit about what they depend upon.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>object.h: move some inline functions and defines from cache.h</title>
<updated>2023-04-11T15:52:10Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2023-04-11T07:41:55Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=8876ea83a74b45046c3dabdd5f55f07852bb98ae'/>
<id>urn:sha1:8876ea83a74b45046c3dabdd5f55f07852bb98ae</id>
<content type='text'>
The object_type() inline function is very tied to the enum object_type
declaration within object.h, and just seemed to make more sense to live
there.  That makes S_ISGITLINK and some other defines make sense to go
with it, as well as the create_ce_mode() and canon_mode() inline
functions.  Move all these inline functions and defines from cache.h to
object.h.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Acked-by: Calvin Wan &lt;calvinwan@google.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>object.h: stop depending on cache.h; make cache.h depend on object.h</title>
<updated>2023-02-24T01:25:29Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2023-02-24T00:09:30Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=a64215b6cd5e67939187475c5b248dc5d13e3d60'/>
<id>urn:sha1:a64215b6cd5e67939187475c5b248dc5d13e3d60</id>
<content type='text'>
Things should be able to depend on object.h without pulling in all of
cache.h.  Move an enum to allow this.

Note that a couple files previously depended on things brought in
through cache.h indirectly (revision.h -&gt; commit.h -&gt; object.h -&gt;
cache.h).  As such, this change requires making existing dependencies
more explicit in half a dozen files.  The inclusion of strbuf.h in
some headers if of particular note: these headers directly embedded a
strbuf in some new structs, meaning they should have been including
strbuf.h all along but were indirectly getting the necessary
definitions.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>parse_object(): allow skipping hash check</title>
<updated>2022-09-07T19:18:57Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2022-09-06T23:01:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=c868d8e91fee01999eb34c5559250ea059293b33'/>
<id>urn:sha1:c868d8e91fee01999eb34c5559250ea059293b33</id>
<content type='text'>
The parse_object() function checks the object hash of any object it
parses. This is a nice feature, as it means we may catch bit corruption
during normal use, rather than waiting for specific fsck operations.

But it also can be slow. It's particularly noticeable for blobs, where
except for the hash check, we could return without loading the object
contents at all. Now one may wonder what is the point of calling
parse_object() on a blob in the first place then, but usually it's not
intentional: we were fed an oid from somewhere, don't know the type, and
want an object struct. For commits and trees, the parsing is usually
helpful; we're about to look at the contents anyway. But this is less
true for blobs, where we may be collecting them as part of a
reachability traversal, etc, and don't actually care what's in them. And
blobs, of course, tend to be larger.

We don't want to just throw out the hash-checks for blobs, though. We do
depend on them in some circumstances (e.g., rev-list --verify-objects
uses parse_object() to check them). It's only the callers that know
how they're going to use the result. And so we can help them by
providing a special flag to skip the hash check.

We could just apply this to blobs, as they're going to be the main
source of performance improvement. But if a caller doesn't care about
checking the hash, we might as well skip it for other object types, too.
Even though we can't avoid reading the object contents, we can still
skip the actual hash computation.

If this seems like it is making Git a little bit less safe against
corruption, it may be. But it's part of a series of tradeoffs we're
already making. For instance, "rev-list --objects" does not open the
contents of blobs it prints. And when a commit graph is present, we skip
opening most commits entirely. The important thing will be to use this
flag in cases where it's safe to skip the check. For instance, when
serving a pack for a fetch, we know the client will fully index the
objects and do a connectivity check itself. There's little to be gained
from the server side re-hashing a blob itself. And indeed, most of the
time we don't! The revision machinery won't open up a blob reached by
traversal, but only one requested directly with a "want" line. So
applied properly, this new feature shouldn't make anything less safe in
practice.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>revision: allow --ancestry-path to take an argument</title>
<updated>2022-08-19T17:45:08Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2022-08-19T04:28:10Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=257418c59040c13bfa839e01922e21833cda6a52'/>
<id>urn:sha1:257418c59040c13bfa839e01922e21833cda6a52</id>
<content type='text'>
We have long allowed users to run e.g.
    git log --ancestry-path master..seen
which shows all commits which satisfy all three of these criteria:
  * are an ancestor of seen
  * are not an ancestor of master
  * have master as an ancestor

This commit allows another variant:
    git log --ancestry-path=$TOPIC master..seen
which shows all commits which satisfy all of these criteria:
  * are an ancestor of seen
  * are not an ancestor of master
  * have $TOPIC in their ancestry-path
that last bullet can be defined as commits meeting any of these
criteria:
    * are an ancestor of $TOPIC
    * have $TOPIC as an ancestor
    * are $TOPIC

This also allows multiple --ancestry-path arguments, which can be
used to find commits with any of the given topics in their ancestry
path.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Acked-by: Derrick Stolee &lt;derrickstolee@github.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
