<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/midx.c, branch v2.45.2</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=v2.45.2</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.45.2'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2024-04-23T18:52:40Z</updated>
<entry>
<title>Merge branch 'ps/missing-btmp-fix'</title>
<updated>2024-04-23T18:52:40Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2024-04-23T18:52:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=567293123d874bfb0607de7e8d6c65642107f132'/>
<id>urn:sha1:567293123d874bfb0607de7e8d6c65642107f132</id>
<content type='text'>
GIt 2.44 introduced a regression that makes the updated code to
barf in repositories with multi-pack index written by older
versions of Git, which has been corrected.

* ps/missing-btmp-fix:
  pack-bitmap: gracefully handle missing BTMP chunks
</content>
</entry>
<entry>
<title>pack-bitmap: gracefully handle missing BTMP chunks</title>
<updated>2024-04-15T17:42:00Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-04-15T06:41:25Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=795006fff45fb2b5e92be0121f92a876c409a1a3'/>
<id>urn:sha1:795006fff45fb2b5e92be0121f92a876c409a1a3</id>
<content type='text'>
In 0fea6b73f1 (Merge branch 'tb/multi-pack-verbatim-reuse', 2024-01-12)
we have introduced multi-pack verbatim reuse of objects. This series has
introduced a new BTMP chunk, which encodes information about bitmapped
objects in the multi-pack index. Starting with dab60934e3 (pack-bitmap:
pass `bitmapped_pack` struct to pack-reuse functions, 2023-12-14) we use
this information to figure out objects which we can reuse from each of
the packfiles.

One thing that we glossed over though is backwards compatibility with
repositories that do not yet have BTMP chunks in their multi-pack index.
In that case, `nth_bitmapped_pack()` would return an error, which causes
us to emit a warning followed by another error message. These warnings
are visible to users that fetch from a repository:

```
$ git fetch
...
remote: error: MIDX does not contain the BTMP chunk
remote: warning: unable to load pack: 'pack-f6bb7bd71d345ea9fe604b60cab9ba9ece54ffbe.idx', disabling pack-reuse
remote: Enumerating objects: 40, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 40 (delta 5), reused 0 (delta 0), pack-reused 0 (from 0)
...
```

While the fetch succeeds the user is left wondering what they did wrong.
Furthermore, as visible both from the warning and from the reuse stats,
pack-reuse is completely disabled in such repositories.

What is quite interesting is that this issue can even be triggered in
case `pack.allowPackReuse=single` is set, which is the default value.
One could have expected that in this case we fall back to the old logic,
which is to use the preferred packfile without consulting BTMP chunks at
all. But either we fail with the above error in case they are missing,
or we use the first pack in the multi-pack-index. The former case
disables pack-reuse altogether, whereas the latter case may result in
reusing objects from a suboptimal packfile.

Fix this issue by partially reverting the logic back to what we had
before this patch series landed. Namely, in the case where we have no
BTMP chunks or when `pack.allowPackReuse=single` are set, we use the
preferred pack instead of consulting the BTMP chunks.

Helped-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx-write: move writing-related functions from midx.c</title>
<updated>2024-04-01T21:18:16Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2024-04-01T21:16:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=748b88a0214d4f65fd56d4728234ea86dbcfffd6'/>
<id>urn:sha1:748b88a0214d4f65fd56d4728234ea86dbcfffd6</id>
<content type='text'>
Introduce a new midx-write.c source file, which holds all of the
functionality from the MIDX sub-system related to writing new MIDX files.

Similar to the relationship between "pack-bitmap.c" and
"pack-bitmap-write.c", this source file will hold code that is specific
to writing MIDX files as opposed to reading them (the latter will remain
in midx.c).

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx: use strvec_pushf() for pack-objects base name</title>
<updated>2024-03-25T19:03:27Z</updated>
<author>
<name>René Scharfe</name>
<email>l.s.r@web.de</email>
</author>
<published>2024-03-24T16:40:00Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=4d45e79e115ef3259179032367cb84d08198c6c5'/>
<id>urn:sha1:4d45e79e115ef3259179032367cb84d08198c6c5</id>
<content type='text'>
Build the pack base name argument directly using strvec_pushf() instead
of with an intermediate strbuf.  This is shorter, simpler and avoids the
need for explicit cleanup.

Signed-off-by: René Scharfe &lt;l.s.r@web.de&gt;
Reviewed-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx: implement `midx_preferred_pack()`</title>
<updated>2023-12-14T22:38:08Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2023-12-14T22:24:25Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=b1e3333068247ddd44021a0b69457c249ddee7a1'/>
<id>urn:sha1:b1e3333068247ddd44021a0b69457c249ddee7a1</id>
<content type='text'>
When performing a binary search over the objects in a MIDX's bitmap
(i.e. in pseudo-pack order), the reader reconstructs the pseudo-pack
ordering using a combination of (a) the preferred pack, (b) the pack's
lexical position in the MIDX based on pack names, and (c) the object
offset within the pack.

In order to perform this binary search, the reader must know the
identity of the preferred pack. This could be stored in the MIDX, but
isn't for historical reasons, mostly because it can easily be inferred
at read-time by looking at the object in the first bit position and
finding out which pack it was selected from in the MIDX, like so:

    nth_midxed_pack_int_id(m, pack_pos_to_midx(m, 0));

In midx_to_pack_pos() which performs this binary search, we look up the
identity of the preferred pack before each search. This is relatively
quick, since it involves two table-driven lookups (one in the MIDX's
revindex for `pack_pos_to_midx()`, and another in the MIDX's object
table for `nth_midxed_pack_int_id()`).

But since the preferred pack does not change after the MIDX is written,
it is safe to cache this value on the MIDX itself.

Write a helper to do just that, and rewrite all of the existing
call-sites that care about the identity of the preferred pack in terms
of this new helper.

This will prepare us for a subsequent patch where we will need to binary
search through the MIDX's pseudo-pack order multiple times.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx: implement `midx_locate_pack()`</title>
<updated>2023-12-14T22:38:07Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2023-12-14T22:23:54Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=307d75bbe6b33267872633173a5d9b5d88b87793'/>
<id>urn:sha1:307d75bbe6b33267872633173a5d9b5d88b87793</id>
<content type='text'>
The multi-pack index API exposes a `midx_contains_pack()` function that
takes in a string ending in either ".idx" or ".pack" and returns whether
or not the MIDX contains a given pack corresponding to that string.

There is no corresponding function to locate the position of a pack
within the MIDX's pack order (sorted lexically by pack filename).

We could add an optional out parameter to `midx_contains_pack()` that is
filled out with the pack's position when the parameter is non-NULL. To
minimize the amount of fallout from this change, instead introduce a new
function by renaming `midx_contains_pack()` to `midx_locate_pack()`,
adding that output parameter, and then reimplementing
`midx_contains_pack()` in terms of it.

Future patches will make use of this new function.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx: implement `BTMP` chunk</title>
<updated>2023-12-14T22:38:07Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2023-12-14T22:23:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=5f5ccd959573f88e126d53df16b149c64e6e9091'/>
<id>urn:sha1:5f5ccd959573f88e126d53df16b149c64e6e9091</id>
<content type='text'>
When a multi-pack bitmap is used to implement verbatim pack reuse (that
is, when verbatim chunks from an on-disk packfile are copied
directly[^1]), it does so by using its "preferred pack" as the source
for pack-reuse.

This allows repositories to pack the majority of their objects into a
single (often large) pack, and then use it as the single source for
verbatim pack reuse. This increases the amount of objects that are
reused verbatim (and consequently, decrease the amount of time it takes
to generate many packs). But this performance comes at a cost, which is
that the preferred packfile must pace its growth with that of the entire
repository in order to maintain the utility of verbatim pack reuse.

As repositories grow beyond what we can reasonably store in a single
packfile, the utility of verbatim pack reuse diminishes. Or, at the very
least, it becomes increasingly more expensive to maintain as the pack
grows larger and larger.

It would be beneficial to be able to perform this same optimization over
multiple packs, provided some modest constraints (most importantly, that
the set of packs eligible for verbatim reuse are disjoint with respect
to the subset of their objects being sent).

If we assume that the packs which we treat as candidates for verbatim
reuse are disjoint with respect to any of their objects we may output,
we need to make only modest modifications to the verbatim pack-reuse
code itself. Most notably, we need to remove the assumption that the
bits in the reachability bitmap corresponding to objects from the single
reuse pack begin at the first bit position.

Future patches will unwind these assumptions and reimplement their
existing functionality as special cases of the more general assumptions
(e.g. that reuse bits can start anywhere within the bitset, but happen
to start at 0 for all existing cases).

This patch does not yet relax any of those assumptions. Instead, it
implements a foundational data-structure, the "Bitampped Packs" (`BTMP`)
chunk of the multi-pack index. The `BTMP` chunk's contents are described
in detail here. Importantly, the `BTMP` chunk contains information to
map regions of a multi-pack index's reachability bitmap to the packs
whose objects they represent.

For now, this chunk is only written, not read (outside of the test-tool
used in this patch to test the new chunk's behavior). Future patches
will begin to make use of this new chunk.

[^1]: Modulo patching any `OFS_DELTA`'s that cross over a region of the
  pack that wasn't used verbatim.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx: factor out `fill_pack_info()`</title>
<updated>2023-12-14T22:38:07Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2023-12-14T22:23:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=fba68184b8de477934d1642ac1c311cfcf0f6041'/>
<id>urn:sha1:fba68184b8de477934d1642ac1c311cfcf0f6041</id>
<content type='text'>
When selecting which packfiles will be written while generating a MIDX,
the MIDX internals fill out a 'struct pack_info' with various pieces of
book-keeping.

Instead of filling out each field of the `pack_info` structure
individually in each of the two spots that modify the array of such
structures (`ctx-&gt;info`), extract a common routine that does this for
us.

This reduces the code duplication by a modest amount. But more
importantly, it zero-initializes the structure before assigning values
into it. This hardens us for a future change which will add additional
fields to this structure which (until this patch) was not
zero-initialized.

As a result, any new fields added to the `pack_info` structure need only
be updated in a single location, instead of at each spot within midx.c.

There are no functional changes in this patch.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>pack-objects: free packing_data in more places</title>
<updated>2023-12-14T22:38:07Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2023-12-14T22:23:39Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=66f0c71073ee5fe1c9d12d2952305a4793d7b43f'/>
<id>urn:sha1:66f0c71073ee5fe1c9d12d2952305a4793d7b43f</id>
<content type='text'>
The pack-objects internals use a packing_data struct to track what
objects are part of the pack(s) being formed.

Since these structures contain allocated fields, failing to
appropriately free() them results in a leak. Plug that leak by
introducing a clear_packing_data() function, and call it in the
appropriate spots.

This is a fairly straightforward leak to plug, since none of the callers
expect to read any values or have any references to parts of the address
space being freed.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx: check consistency of fanout table</title>
<updated>2023-11-09T10:07:52Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2023-11-09T07:12:07Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=9d78fb0eb6fa67cee069d3a403acb2fadd46f058'/>
<id>urn:sha1:9d78fb0eb6fa67cee069d3a403acb2fadd46f058</id>
<content type='text'>
The commit-graph, midx, and pack idx on-disk formats all have oid fanout
tables which are fed to bsearch_hash(). If these tables do not increase
monotonically, then the binary search may not only produce bogus values,
it may cause out of bounds reads.

We fixed this for commit graphs in 4169d89645 (commit-graph: check
consistency of fanout table, 2023-10-09). That commit argued that we did
not need to do the same for midx and pack idx files, because they
already did this check. However, that is wrong. We _do_ check the fanout
table for pack idx files when we load them, but we only do so for midx
files when running "git multi-pack-index verify". So it is possible to
get an out-of-bounds read by running a normal command with a specially
crafted midx file.

Let's fix this using the same solution (and roughly the same test) we
did for the commit-graph in 4169d89645. This replaces the same check
from "multi-pack-index verify", because verify uses the same read
routines, we'd bail on reading the midx much sooner now. So let's make
sure to copy its verbose error message.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
