<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/packfile.c, branch v2.30.2</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=v2.30.2</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.30.2'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2020-12-08T23:11:18Z</updated>
<entry>
<title>Merge branch 'tb/idx-midx-race-fix'</title>
<updated>2020-12-08T23:11:18Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2020-12-08T23:11:18Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=6bac6a1ef9e9c2bcf6f927e2e58e66ef39659e93'/>
<id>urn:sha1:6bac6a1ef9e9c2bcf6f927e2e58e66ef39659e93</id>
<content type='text'>
Processes that access packdata while the .idx file gets removed
(e.g. while repacking) did not fail or fall back gracefully as they
could.

* tb/idx-midx-race-fix:
  midx.c: protect against disappearing packs
  packfile.c: protect against disappearing indexes
</content>
</entry>
<entry>
<title>packfile.c: protect against disappearing indexes</title>
<updated>2020-11-25T21:15:49Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2020-11-25T17:17:28Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=c8a45eb66e0a1086a31af4d76cd0f5e228497229'/>
<id>urn:sha1:c8a45eb66e0a1086a31af4d76cd0f5e228497229</id>
<content type='text'>
In 17c35c8969 (packfile: skip loading index if in multi-pack-index,
2018-07-12) we stopped loading the .idx file for packs that are
contained within a multi-pack index.

This saves us the effort of loading an .idx and doing some lightweight
validity checks by way of 'packfile.c:load_idx()', but introduces a race
between processes that need to load the index (e.g., to generate a
reverse index) and processes that can delete the index.

For example, running the following in your shell:

    $ git init repo &amp;&amp; cd repo
    $ git commit --allow-empty -m 'base'
    $ git repack -ad &amp;&amp; git multi-pack-index write

followed by:

    $ rm -f .git/objects/pack/pack-*.idx
    $ git rev-parse HEAD | git cat-file --batch-check='%(objectsize:disk)'

will result in a segfault prior to this patch. What's happening here is
that we notice that the pack is in the multi-pack index, and so don't
check that it still has a .idx. When we then try and load that index to
generate a reverse index, we don't have it, so the call to
'find_pack_revindex()' in 'packfile.c:packed_object_info()' returns
NULL, and then dereferencing it causes a segfault.

Of course, we don't ever expect someone to remove the index file by
hand, or to be in a state where we never wrote it to begin with (yet
find that pack in the multi-pack-index). But, this can happen in a
timing race with 'git repack -ad', which removes all existing packs
after writing a new pack containing all of their objects.

Avoid this by reverting the hunk of 17c35c8969 which stops loading the
index when the pack is contained in a MIDX. This makes the latter half
of 17c35c8969 useless, since we'll always have a non-NULL
'p-&gt;index_data', in which case that if statement isn't guarding
anything.

These two together effectively revert 17c35c8969, and avoid the race
explained above.

Co-authored-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>packfile: detect overflow in .idx file size checks</title>
<updated>2020-11-16T21:41:35Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2020-11-13T05:07:19Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=81c4c5cf2e18d2b562639596385e49c3c48677de'/>
<id>urn:sha1:81c4c5cf2e18d2b562639596385e49c3c48677de</id>
<content type='text'>
In load_idx(), we check that the .idx file is sized appropriately for
the number of objects it claims to have. We recently fixed the case
where the number of objects caused our expected size to overflow a
32-bit unsigned int, and we switched to size_t.

On a 64-bit system, this is fine; our size_t covers any expected size.
On a 32-bit system, though, it won't. The file may claim to have 2^31
objects, which will overflow even a size_t.

This doesn't hurt us at all for a well-formed idx file. A 32-bit system
would already have failed to mmap such a file, since it would be too
big. But an .idx file which _claims_ to have 2^31 objects but is
actually much smaller would fool our check.

This is a broken file, and for the most part we don't care that much
what happens. But:

  - it's a little friendlier to notice up front "woah, this file is
    broken" than it is to get nonsense results

  - later access of the data assumes that the loading function
    sanity-checked that we have at least enough bytes for the regular
    object-id table. A malformed .idx file could lead to an
    out-of-bounds read.

So let's use our overflow-checking functions to make sure that we're not
fooled by a malformed file.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>use size_t to store pack .idx byte offsets</title>
<updated>2020-11-16T21:41:35Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2020-11-13T05:07:01Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=a9bc372ef8210e60d924ec6c0b46624c6d0a4033'/>
<id>urn:sha1:a9bc372ef8210e60d924ec6c0b46624c6d0a4033</id>
<content type='text'>
We sometimes store the offset into a pack .idx file as an "unsigned
long", but the mmap'd size of a pack .idx file can exceed 4GB. This is
sufficient on LP64 systems like Linux, but will be too small on LLP64
systems like Windows, where "unsigned long" is still only 32 bits. Let's
use size_t, which is a better type for an offset into a memory buffer.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>compute pack .idx byte offsets using size_t</title>
<updated>2020-11-16T21:41:35Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2020-11-13T05:06:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=f86f769550ef7fb5162a9cd4be5e2be7981d063b'/>
<id>urn:sha1:f86f769550ef7fb5162a9cd4be5e2be7981d063b</id>
<content type='text'>
A pack and its matching .idx file are limited to 2^32 objects, because
the pack format contains a 32-bit field to store the number of objects.
Hence we use uint32_t in the code.

But the byte count of even a .idx file can be much larger than that,
because it stores at least a hash and an offset for each object. So
using SHA-1, a v2 .idx file will cross the 4GB boundary at 153,391,650
objects. This confuses load_idx(), which computes the minimum size like
this:

  unsigned long min_size = 8 + 4*256 + nr*(hashsz + 4 + 4) + hashsz + hashsz;

Even though min_size will be big enough on most 64-bit platforms, the
actual arithmetic is done as a uint32_t, resulting in a truncation. We
actually exceed that min_size, but then we do:

  unsigned long max_size = min_size;
  if (nr)
          max_size += (nr - 1)*8;

to account for the variable-sized table. That computation doesn't
overflow quite so low, but with the truncation for min_size, we end up
with a max_size that is much smaller than our actual size. So we
complain that the idx is invalid, and can't find any of its objects.

We can fix this case by casting "nr" to a size_t, which will do the
multiplication in 64-bits (assuming you're on a 64-bit platform; this
will never work on a 32-bit system since we couldn't map the whole .idx
anyway). Likewise, we don't have to worry about further additions,
because adding a smaller number to a size_t will convert the other side
to a size_t.

A few notes:

  - obviously we could just declare "nr" as a size_t in the first place
    (and likewise, packed_git.num_objects).  But it's conceptually a
    uint32_t because of the on-disk format, and we correctly treat it
    that way in other contexts that don't need to compute byte offsets
    (e.g., iterating over the set of objects should and generally does
    use a uint32_t). Switching to size_t would make all of those other
    cases look wrong.

  - it could be argued that the proper type is off_t to represent the
    file offset. But in practice the .idx file must fit within memory,
    because we mmap the whole thing. And the rest of the code (including
    the idx_size variable we're comparing against) uses size_t.

  - we'll add the same cast to the max_size arithmetic line. Even though
    we're adding to a larger type, which will convert our result, the
    multiplication is still done as a 32-bit value and can itself
    overflow. I didn't check this with my test case, since it would need
    an even larger pack (~530M objects), but looking at compiler output
    shows that it works this way. The standard should agree, but I
    couldn't find anything explicit in 6.3.1.8 ("usual arithmetic
    conversions").

The case in load_idx() was the most immediate one that I was able to
trigger. After fixing it, looking up actual objects (including the very
last one in sha1 order) works in a test repo with 153,725,110 objects.
That's because bsearch_hash() works with uint32_t entry indices, and the
actual byte access:

  int cmp = hashcmp(table + mi * stride, sha1);

is done with "stride" as a size_t, causing the uint32_t "mi" to be
promoted to a size_t. This is the way most code will access the index
data.

However, I audited all of the other byte-wise accesses of
packed_git.index_data, and many of the others are suspect (they are
similar to the max_size one, where we are adding to a properly sized
offset or directly to a pointer, but the multiplication in the
sub-expression can overflow). I didn't trigger any of these in practice,
but I believe they're potential problems, and certainly adding in the
cast is not going to hurt anything here.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'mt/delta-base-cache-races'</title>
<updated>2020-10-04T19:49:15Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2020-10-04T19:49:15Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=26b42b4dd8edb5da12f4d83b4e1f77f90cb532d8'/>
<id>urn:sha1:26b42b4dd8edb5da12f4d83b4e1f77f90cb532d8</id>
<content type='text'>
A race that leads to an access to a free'd data was corrected in
the codepath that reads pack files.

* mt/delta-base-cache-races:
  packfile: fix memory leak in add_delta_base_cache()
  packfile: fix race condition on unpack_entry()
</content>
</entry>
<entry>
<title>packfile: fix memory leak in add_delta_base_cache()</title>
<updated>2020-09-29T00:41:53Z</updated>
<author>
<name>Matheus Tavares</name>
<email>matheus.bernardino@usp.br</email>
</author>
<published>2020-09-29T00:01:53Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=bda959c4766d73ab435f26f2cc7c8c67b9443f5a'/>
<id>urn:sha1:bda959c4766d73ab435f26f2cc7c8c67b9443f5a</id>
<content type='text'>
When add_delta_base_cache() is called with a base that is already in the
cache, no operation is performed. But the check is done after allocating
space for a new entry, so we end up leaking memory on the early return.
In addition, the caller never free()'s the base as it expects the
function to take ownership of it. But the base is not released when we
skip insertion, so it also gets leaked. To fix these problems, move the
allocation of a new entry further down in add_delta_base_cache(), and
free() the base on early return.

Signed-off-by: Matheus Tavares &lt;matheus.bernardino@usp.br&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>packfile: fix race condition on unpack_entry()</title>
<updated>2020-09-29T00:41:52Z</updated>
<author>
<name>Matheus Tavares</name>
<email>matheus.bernardino@usp.br</email>
</author>
<published>2020-09-29T00:01:52Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=74b052f8c269ef0b6f61a5ecf04a8568399345d9'/>
<id>urn:sha1:74b052f8c269ef0b6f61a5ecf04a8568399345d9</id>
<content type='text'>
The third phase of unpack_entry() performs the following sequence in a
loop, until all the deltas enumerated in phase one are applied and the
entry is fully reconstructed:

1. Add the current base entry to the delta base cache
2. Unpack the next delta
3. Patch the unpacked delta on top of the base

When the optional object reading lock is enabled, the above steps will
be performed while holding the lock. However, step 2. momentarily
releases it so that inflation can be performed in parallel for increased
performance. Because the `base` buffer inserted in the cache at 1. is
not duplicated, another thread can potentially free() it while the lock
is released at 2. (e.g. when there is no space left in the cache to
insert another entry). In this case, the later attempt to dereference
`base` at 3. will cause a segmentation fault. This problem was observed
during a multithreaded git-grep execution on a repository with large
objects.

To fix the race condition (and later segmentation fault), let's reorder
the aforementioned steps so that `base` is only added to the cache at
the end. This will prevent the buffer from being released by another
thread while it is still in use. An alternative solution which would not
require the reordering would be to duplicate `base` before inserting it
in the cache. However, as Phil Hord mentioned, memcpy()'ing large bases
can negatively affect performance: in his experiments, this alternative
approach slowed git-grep down by 10% to 20%.

Reported-by: Phil Hord &lt;phil.hord@gmail.com&gt;
Signed-off-by: Matheus Tavares &lt;matheus.bernardino@usp.br&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'jk/dont-count-existing-objects-twice'</title>
<updated>2020-09-22T19:36:32Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2020-09-22T19:36:31Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=221b755f3a99f600d08441fc5436d0f2ffe57065'/>
<id>urn:sha1:221b755f3a99f600d08441fc5436d0f2ffe57065</id>
<content type='text'>
There is a logic to estimate how many objects are in the
repository, which is mean to run once per process invocation, but
it ran every time the estimated value was requested.

* jk/dont-count-existing-objects-twice:
  packfile: actually set approximate_object_count_valid
</content>
</entry>
<entry>
<title>packfile: actually set approximate_object_count_valid</title>
<updated>2020-09-17T18:36:14Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2020-09-17T16:47:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=67bb65de5ddf008cb39206354ae4b7af66c05b6c'/>
<id>urn:sha1:67bb65de5ddf008cb39206354ae4b7af66c05b6c</id>
<content type='text'>
The approximate_object_count() function tries to compute the count only
once per process. But ever since it was introduced in 8e3f52d778
(find_unique_abbrev: move logic out of get_short_sha1(), 2016-10-03), we
failed to actually set the "valid" flag, meaning we'd compute it fresh
on every call.

This turns out not to be _too_ bad, because we're only iterating through
the packed_git list, and not making any system calls. But since it may
get called for every abbreviated hash we output, even this can add up if
you have many packs.

Here are before-and-after timings for a new perf test which just asks
rev-list to abbreviate each commit hash (the test repo is linux.git,
with commit-graphs):

  Test                            origin              HEAD
  ----------------------------------------------------------------------------
  5303.3: rev-list (1)            28.91(28.46+0.44)   29.03(28.65+0.38) +0.4%
  5303.4: abbrev-commit (1)       1.18(1.06+0.11)     1.17(1.02+0.14) -0.8%
  5303.7: rev-list (50)           28.95(28.56+0.38)   29.50(29.17+0.32) +1.9%
  5303.8: abbrev-commit (50)      3.67(3.56+0.10)     3.57(3.42+0.15) -2.7%
  5303.11: rev-list (1000)        30.34(29.89+0.43)   30.82(30.35+0.46) +1.6%
  5303.12: abbrev-commit (1000)   86.82(86.52+0.29)   77.82(77.59+0.22) -10.4%
  5303.15: load 10,000 packs      0.08(0.02+0.05)     0.08(0.02+0.06) +0.0%

It doesn't help at all when we have 1 pack (5303.4), but we get a 10%
speedup when there are 1000 packs (5303.12). That's a modest speedup for
a case that's already slow and we'd hope to avoid in general (note how
slow it is even after, because we have to look in each of those packs
for abbreviations). But it's a one-line change that clearly matches the
original intent, so it seems worth doing.

The included perf test may also be useful for keeping an eye on any
regressions in the overall abbreviation code.

Reported-by: Rasmus Villemoes &lt;rv@rasmusvillemoes.dk&gt;
Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
