<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/pack-objects.c, branch v2.50.0</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=v2.50.0</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.50.0'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2023-12-14T22:38:07Z</updated>
<entry>
<title>pack-objects: free packing_data in more places</title>
<updated>2023-12-14T22:38:07Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2023-12-14T22:23:39Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=66f0c71073ee5fe1c9d12d2952305a4793d7b43f'/>
<id>urn:sha1:66f0c71073ee5fe1c9d12d2952305a4793d7b43f</id>
<content type='text'>
The pack-objects internals use a packing_data struct to track what
objects are part of the pack(s) being formed.

Since these structures contain allocated fields, failing to
appropriately free() them results in a leak. Plug that leak by
introducing a clear_packing_data() function, and call it in the
appropriate spots.

This is a fairly straightforward leak to plug, since none of the callers
expect to read any values or have any references to parts of the address
space being freed.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>parse: separate out parsing functions from config.h</title>
<updated>2023-09-29T22:14:57Z</updated>
<author>
<name>Calvin Wan</name>
<email>calvinwan@google.com</email>
</author>
<published>2023-09-29T21:20:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=b1bda751739d90e1a94b47397676bcb8eebf16d5'/>
<id>urn:sha1:b1bda751739d90e1a94b47397676bcb8eebf16d5</id>
<content type='text'>
The files config.{h,c} contain functions that have to do with parsing,
but not config.

In order to further reduce all-in-one headers, separate out functions in
config.c that do not operate on config into its own file, parse.h,
and update the include directives in the .c files that need only such
functions accordingly.

Signed-off-by: Calvin Wan &lt;calvinwan@google.com&gt;
Signed-off-by: Jonathan Tan &lt;jonathantanmy@google.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>git-compat-util: move alloc macros to git-compat-util.h</title>
<updated>2023-07-05T18:42:31Z</updated>
<author>
<name>Calvin Wan</name>
<email>calvinwan@google.com</email>
</author>
<published>2023-07-05T17:09:24Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=91c080dff511b7a81f91d1cc79589b49e16a2b7a'/>
<id>urn:sha1:91c080dff511b7a81f91d1cc79589b49e16a2b7a</id>
<content type='text'>
alloc_nr, ALLOC_GROW, and ALLOC_GROW_BY are commonly used macros for
dynamic array allocation. Moving these macros to git-compat-util.h with
the other alloc macros focuses alloc.[ch] to allocation for Git objects
and additionally allows us to remove inclusions to alloc.h from files
that solely used the above macros.

Signed-off-by: Calvin Wan &lt;calvinwan@google.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>alloc.h: move ALLOC_GROW() functions from cache.h</title>
<updated>2023-02-24T01:25:28Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2023-02-24T00:09:24Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=36bf19589055fb71aac0ed6719dfe5b385adc2bf'/>
<id>urn:sha1:36bf19589055fb71aac0ed6719dfe5b385adc2bf</id>
<content type='text'>
This allows us to replace includes of cache.h with includes of the much
smaller alloc.h in many places.  It does mean that we also need to add
includes of alloc.h in a number of C files.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>pack-mtimes: support writing pack .mtimes files</title>
<updated>2022-05-26T22:48:26Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2022-05-20T23:17:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=5dfaf49a5a651d3e8c3552bc2e833312a3671a4f'/>
<id>urn:sha1:5dfaf49a5a651d3e8c3552bc2e833312a3671a4f</id>
<content type='text'>
Now that the `.mtimes` format is defined, supplement the pack-write API
to be able to conditionally write an `.mtimes` file along with a pack by
setting an additional flag and passing an oidmap that contains the
timestamps corresponding to each object in the pack.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>use CALLOC_ARRAY</title>
<updated>2021-03-14T00:00:09Z</updated>
<author>
<name>René Scharfe</name>
<email>l.s.r@web.de</email>
</author>
<published>2021-03-13T16:17:22Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=ca56dadb4b65ccaeab809d80db80a312dc00941a'/>
<id>urn:sha1:ca56dadb4b65ccaeab809d80db80a312dc00941a</id>
<content type='text'>
Add and apply a semantic patch for converting code that open-codes
CALLOC_ARRAY to use it instead.  It shortens the code and infers the
element size automatically.

Signed-off-by: René Scharfe &lt;l.s.r@web.de&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>pack-objects: convert oe_set_delta_ext() to use object_id</title>
<updated>2020-02-24T20:55:52Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2020-02-24T04:30:07Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=a93c141ddef25dc999fff73c590b42d3af606ff3'/>
<id>urn:sha1:a93c141ddef25dc999fff73c590b42d3af606ff3</id>
<content type='text'>
We already store an object_id internally, and now our sole caller also
has one. Let's stop passing around the internal hash array, which adds a
bit of type safety.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'jk/optim-in-pack-idx-conversion'</title>
<updated>2019-12-01T17:04:38Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2019-12-01T17:04:38Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=8faff3899e1fdefbdd143eaf5ce3b96532205bc7'/>
<id>urn:sha1:8faff3899e1fdefbdd143eaf5ce3b96532205bc7</id>
<content type='text'>
Code clean-up.

* jk/optim-in-pack-idx-conversion:
  pack-objects: avoid pointless oe_map_new_pack() calls
</content>
</entry>
<entry>
<title>pack-objects: avoid pointless oe_map_new_pack() calls</title>
<updated>2019-11-12T04:36:36Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2019-11-11T11:12:49Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=f66e0401abaa884aa91100b4c155c4d042c47e0d'/>
<id>urn:sha1:f66e0401abaa884aa91100b4c155c4d042c47e0d</id>
<content type='text'>
This patch fixes an extreme slowdown in pack-objects when you have more
than 1023 packs. See below for numbers.

Since 43fa44fa3b (pack-objects: move in_pack out of struct object_entry,
2018-04-14), we use a complicated system to save some per-object memory.

Each object_entry structs gets a 10-bit field to store the index of the
pack it's in. We map those indices into pointers using
packing_data-&gt;in_pack_by_idx, which we initialize at the start of the
program. If we have 2^10 or more packs, then we instead create an array
of pack pointers, one per object. This is packing_data-&gt;in_pack.

So far so good. But there's one other tricky case: if a new pack arrives
after we've initialized in_pack_by_idx, it won't have an index yet. We
solve that by calling oe_map_new_pack(), which just switches on the fly
to the less-optimal in_pack mechanism, allocating the array and
back-filling it for already-seen objects.

But that logic kicks in even when we've switched to it already (whether
because we really did see a new pack, or because we had too many packs
in the first place). The result doesn't produce a wrong outcome, but
it's very slow. What happens is this:

  - imagine you have a repo with 500k objects and 2000 packs that you
    want to repack.

  - before looking at any objects, we call prepare_in_pack_by_idx(). It
    starts allocating an index for each pack. On the 1024th pack, it
    sees there are too many, so it bails, leaving in_pack_by_idx as
    NULL.

  - while actually adding objects to the packing list, we call
    oe_set_in_pack(), which checks whether the pack already has an
    index. If it's one of the packs after the first 1023, then it
    doesn't have one, and we'll call oe_map_new_pack().

    But there's no useful work for that function to do. We're already
    using in_pack, so it just uselessly walks over the complete list of
    objects, trying to backfill in_pack.

    And we end up doing this for almost 1000 packs (each of which may be
    triggered by more than one object). And each time it triggers, we
    may iterate over up to 500k objects. So in the absolute worst case,
    this is quadratic in the number of objects.

The solution is simple: we don't need to bother checking whether the
pack has an index if we've already converted to using in_pack, since by
definition we're not going to use it. So we can just push the "does the
pack have a valid index" check down into that half of the conditional,
where we know we're going to use it.

The current test in p5303 sadly doesn't notice this problem, since it
maxes out at 1000 packs. If we add a new test to it at 2000 packs, it
does show the improvement:

  Test                      HEAD^               HEAD
  ----------------------------------------------------------------------
  5303.12: repack (2000)    26.72(39.68+0.67)   15.70(28.70+0.66) -41.2%

However, these many-pack test cases are rather expensive to run, so
adding larger and larger numbers isn't appealing. Instead, we can show
it off more easily by using GIT_TEST_FULL_IN_PACK_ARRAY, which forces us
into the absolute worst case: no pack has an index, so we'll trigger
oe_map_new_pack() pointlessly for every single object, making it truly
quadratic.

Here are the numbers (on git.git) with the included change to p5303:

  Test                      HEAD^               HEAD
  ----------------------------------------------------------------------
  5303.3: rev-list (1)      2.05(1.98+0.06)     2.06(1.99+0.06) +0.5%
  5303.4: repack (1)        33.45(33.46+0.19)   2.75(2.73+0.22) -91.8%
  5303.6: rev-list (50)     2.07(2.01+0.06)     2.06(2.01+0.05) -0.5%
  5303.7: repack (50)       34.21(35.18+0.16)   3.49(4.50+0.12) -89.8%
  5303.9: rev-list (1000)   2.87(2.78+0.08)     2.88(2.80+0.07) +0.3%
  5303.10: repack (1000)    41.26(51.30+0.47)   10.75(20.75+0.44) -73.9%

Again, those improvements aren't realistic for the 1-pack case (because
in the real world, the full-array solution doesn't kick in), but it's
more useful to be testing the more-complicated code path.

While we're looking at this issue, we'll tweak one more thing: in
oe_map_new_pack(), we call REALLOC_ARRAY(pack-&gt;in_pack). But we'd never
expect to get here unless we're back-filling it for the first time, in
which case it would be NULL. So let's switch that to ALLOC_ARRAY() for
clarity, and add a BUG() to document the expectation. Unfortunately this
code isn't well-covered in the test suite because it's inherently racy
(it only kicks in if somebody else adds a new pack while we're in the
middle of repacking).

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Reviewed-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>pack-objects: drop packlist index_pos optimization</title>
<updated>2019-09-06T18:03:42Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2019-09-06T01:36:05Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=3a37876b5dca4c18bda67bcdead9c1d79a59933d'/>
<id>urn:sha1:3a37876b5dca4c18bda67bcdead9c1d79a59933d</id>
<content type='text'>
Once upon a time, the code to add an object to our packing list in
pack-objects all lived in a single function. It computed the position
within the hash table once, then used it to check if the object was
already present, and if not, to add it.

Later, in 2834bc27c1 (pack-objects: refactor the packing list,
2013-10-24), this was split into two functions: packlist_find() and
packlist_alloc(). We ended up with an "index_pos" variable that gets
passed through several functions to make it from one to the other.

The resulting code is rather confusing to follow. The "index_pos"
variable is sometimes undefined, if we don't yet have a hash table. This
works out in practice because in that case packlist_alloc() won't use it
at all, since it will have to create/grow the hash table. But it's hard
to verify that, and it does cause gcc 9.2.1's -Wmaybe-uninitialized to
complain when compiled with "-flto -O3" (rightfully, since we do pass
the uninitialized value as a function parameter, even if nobody ends up
using it).

All of this is to save computing the hash index again when we're
inserting into the hash table, which I found doesn't make a measurable
difference in the program runtime (which is not surprising, since we're
doing all kinds of other heavyweight things for each object).

Let's just drop this index_pos variable entirely, simplifying the code
(and pleasing the compiler).

We might be better still refactoring this custom hash table to use one
of our existing implementations (an oidmap, or a kh_oid_map). I stopped
short of that here, but this would be the likely first step towards that
anyway.

Reported-by: Stephan Beyer &lt;s-beyer@gmx.net&gt;
Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
