<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/midx.h, branch jch</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=jch</id>
<link rel='self' href='https://git.shady.money/git/atom?h=jch'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2026-02-24T19:16:34Z</updated>
<entry>
<title>midx: implement MIDX compaction</title>
<updated>2026-02-24T19:16:34Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2026-02-24T19:00:41Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=9df44a97f165bbdd8d591c6e99c8894ae036b5bd'/>
<id>urn:sha1:9df44a97f165bbdd8d591c6e99c8894ae036b5bd</id>
<content type='text'>
When managing a MIDX chain with many layers, it is convenient to combine
a sequence of adjacent layers into a single layer to prevent the chain
from growing too long.

While it is conceptually possible to "compact" a sequence of MIDX layers
together by running "git multi-pack-index write --stdin-packs", there
are a few drawbacks that make this less than desirable:

 - Preserving the MIDX chain is impossible, since there is no way to
   write a MIDX layer that contains objects or packs found in an earlier
   MIDX layer already part of the chain. So callers would have to write
   an entirely new (non-incremental) MIDX containing only the compacted
   layers, discarding all other objects/packs from the MIDX.

 - There is (currently) no way to write a MIDX layer outside of the MIDX
   chain to work around the above, such that the MIDX chain could be
   reassembled substituting the compacted layers with the MIDX that was
   written.

 - The `--stdin-packs` command-line option does not allow us to specify
   the order of packs as they appear in the MIDX. Therefore, even if
   there were workarounds for the previous two challenges, any bitmaps
   belonging to layers which come after the compacted layer(s) would no
   longer be valid.

This commit introduces a way to compact a sequence of adjacent MIDX
layers into a single layer while preserving the MIDX chain, as well as
any bitmap(s) in layers which are newer than the compacted ones.

Implementing MIDX compaction does not require a significant number of
changes to how MIDX layers are written. The main changes are as follows:

 - Instead of calling `fill_packs_from_midx()`, we call a new function
   `fill_packs_from_midx_range()`, which walks backwards along the
   portion of the MIDX chain which we are compacting, and adds packs one
   layer a time.

   In order to preserve the pseudo-pack order, the concatenated pack
   order is preserved, with the exception of preferred packs which are
   always added first.

 - After adding entries from the set of packs in the compaction range,
   `compute_sorted_entries()` must adjust the `pack_int_id`'s for all
   objects added in each fanout layer to match their original
   `pack_int_id`'s (as opposed to the index at which each pack appears
   in `ctx.info`).

   Note that we cannot reuse `midx_fanout_add_midx_fanout()` directly
   here, as it unconditionally recurs through the `-&gt;base_midx`. Factor
   out a `_1()` variant that operates on a single layer, reimplement
   the existing function in terms of it, and use the new variant from
   `midx_fanout_add_compact()`.

   Since we are sorting the list of objects ourselves, the order we add
   them in does not matter.

 - When writing out the new 'multi-pack-index-chain' file, discard any
   layers in the compaction range, replacing them with the newly written
   layer, instead of keeping them and placing the new layer at the end
   of the chain.

This ends up being sufficient to implement MIDX compaction in such a way
that preserves bitmaps corresponding to more recent layers in the MIDX
chain.

The tests for MIDX compaction are so far fairly spartan, since the main
interesting behavior here is ensuring that the right packs/objects are
selected from each layer, and that the pack order is preserved despite
whether or not they are sorted in lexicographic order in the original
MIDX chain.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx: do not require packs to be sorted in lexicographic order</title>
<updated>2026-02-24T19:16:33Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2026-02-24T19:00:13Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=b2ec8e90c201d2395b67f89f3bf38bb472e43460'/>
<id>urn:sha1:b2ec8e90c201d2395b67f89f3bf38bb472e43460</id>
<content type='text'>
The MIDX file format currently requires that pack files be identified by
the lexicographic ordering of their names (that is, a pack having a
checksum beginning with "abc" would have a numeric pack_int_id which is
smaller than the same value for a pack beginning with "bcd").

As a result, it is impossible to combine adjacent MIDX layers together
without permuting bits from bitmaps that are in more recent layer(s).

To see why, consider the following example:

          | packs       | preferred pack
  --------+-------------+---------------
  MIDX #0 | { X, Y, Z } | Y
  MIDX #1 | { A, B, C } | B
  MIDX #2 | { D, E, F } | D

, where MIDX #2's base MIDX is MIDX #1, and so on. Suppose that we want
to combine MIDX layers #0 and #1, to create a new layer #0' containing
the packs from both layers. With the original three MIDX layers, objects
are laid out in the bitmap in the order they appear in their source
pack, and the packs themselves are arranged according to the pseudo-pack
order. In this case, that ordering is Y, X, Z, B, A, C.

But recall that the pseudo-pack ordering is defined by the order that
packs appear in the MIDX, with the exception of the preferred pack,
which sorts ahead of all other packs regardless of its position within
the MIDX. In the above example, that means that pack 'Y' could be placed
anywhere (so long as it is designated as preferred), however, all other
packs must be placed in the location listed above.

Because that ordering isn't sorted lexicographically, it is impossible
to compact MIDX layers in the above configuration without permuting the
object-to-bit-position mapping. Changing this mapping would affect all
bitmaps belonging to newer layers, rendering the bitmaps associated with
MIDX #2 unreadable.

One of the goals of MIDX compaction is that we are able to shrink the
length of the MIDX chain *without* invalidating bitmaps that belong to
newer layers, and the lexicographic ordering constraint is at odds with
this goal.

However, packs do not *need* to be lexicographically ordered within the
MIDX. As far as I can gather, the only reason they are sorted lexically
is to make it possible to perform a binary search over the pack names in
a MIDX, necessary to make `midx_contains_pack()`'s performance
logarithmic in the number of packs rather than linear.

Relax this constraint by allowing MIDX writes to proceed with packs that
are not arranged in lexicographic order. `midx_contains_pack()` will
lazily instantiate a `pack_names_sorted` array on the MIDX, which will
be used to implement the binary search over pack names.

This change produces MIDXs which may not be correctly read with external
tools or older versions of Git. Though older versions of Git know how to
gracefully degrade and ignore any MIDX(s) they consider corrupt,
external tools may not be as robust. To avoid unintentionally breaking
any such tools, guard this change behind a version bump in the MIDX's
on-disk format.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx: introduce `midx_get_checksum_hex()`</title>
<updated>2026-02-24T19:16:32Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2026-02-24T18:59:44Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=6e86f679248580bec5c105b37217862f3022b504'/>
<id>urn:sha1:6e86f679248580bec5c105b37217862f3022b504</id>
<content type='text'>
When trying to print out, say, the hexadecimal representation of a
MIDX's hash, our code will do something like:

    hash_to_hex_algop(midx_get_checksum_hash(m),
                      m-&gt;source-&gt;odb-&gt;repo-&gt;hash_algo);

, which is both cumbersome and repetitive. In fact, all but a handful of
callers to `midx_get_checksum_hash()` do exactly the above. Reduce the
repetitive nature of calling `midx_get_checksum_hash()` by having it
return a pointer into a static buffer containing the above result.

For the handful of callers that do need to compare the raw bytes and
don't want to deal with an encoded copy (e.g., because they are passing
it to hasheq() or similar), they may still rely on
`midx_get_checksum_hash()` which returns the raw bytes.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx: rename `get_midx_checksum()` to `midx_get_checksum_hash()`</title>
<updated>2026-02-24T19:16:32Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2026-02-24T18:59:39Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=de811c26bb97ac324b60883ae4f4db84a83be2f1'/>
<id>urn:sha1:de811c26bb97ac324b60883ae4f4db84a83be2f1</id>
<content type='text'>
Since 541204aabea (Documentation: document naming schema for structs and
their functions, 2024-07-30), we have adopted a naming convention for
functions that would prefer a name like, say, `midx_get_checksum()` over
`get_midx_checksum()`.

Adopt this convention throughout the midx.h API. Since this function
returns a raw (that is, non-hex encoded) hash, let's suffix the function
with "_hash()" to make this clear. As a side effect, this prepares us
for the subsequent change which will introduce a "_hex()" variant that
encodes the checksum itself.

Suggested-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx: mark `get_midx_checksum()` arguments as const</title>
<updated>2026-02-24T19:16:31Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2026-02-24T18:59:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=00c0d84e6f0cdb91b0f261e12ed058ad3baa49bf'/>
<id>urn:sha1:00c0d84e6f0cdb91b0f261e12ed058ad3baa49bf</id>
<content type='text'>
To make clear that the function `get_midx_checksum()` does not do
anything to modify its argument, mark the MIDX pointer as const.

The following commit will rename this function altogether to make clear
that it returns the raw bytes of the checksum, not a hex-encoded copy of
it.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>packfile: move `get_multi_pack_index()` into "midx.c"</title>
<updated>2025-09-24T18:53:50Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2025-09-23T10:17:11Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=ab8aff4a6b2a1d5aa79deeb64bdeecc0234b4ddf'/>
<id>urn:sha1:ab8aff4a6b2a1d5aa79deeb64bdeecc0234b4ddf</id>
<content type='text'>
The `get_multi_pack_index()` function is declared and implemented in the
packfile subsystem, even though it really belongs into the multi-pack
index subsystem. The reason for this is likely that it needs to call
`packfile_store_prepare()`, which is not exposed by the packfile system.
In a subsequent commit we're about to add another caller outside of the
packfile system though, so we'll have to expose the function anyway.

Do so now already and move `get_multi_pack_index()` into the MIDX
subsystem.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx: compute paths via their source</title>
<updated>2025-08-11T16:22:23Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2025-08-11T13:46:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=13296ac909d53e14712f89a7f4fda94dd0465479'/>
<id>urn:sha1:13296ac909d53e14712f89a7f4fda94dd0465479</id>
<content type='text'>
With the preceding commits we started to always have the object database
source available when we load, write or access multi-pack indices. With
this in place we can change how MIDX paths are computed so that we don't
have to pass in the combination of a hash algorithm and object directory
anymore, but only the object database source.

Refactor the code accordingly.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx: stop duplicating info redundant with its owning source</title>
<updated>2025-08-11T16:22:23Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2025-08-11T13:46:49Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=7744936f374308d6fa3c6e317fb8fe0b685d0ef2'/>
<id>urn:sha1:7744936f374308d6fa3c6e317fb8fe0b685d0ef2</id>
<content type='text'>
Multi-pack indices store some information that is redundant with their
owning source:

  - The locality bit that tracks whether the source is the primary
    object source or an alternate.

  - The object directory path the multi-pack index is located in.

  - The pointer to the owning parent directory.

All of this information is already contained in `struct odb_source`. So
now that we always have that struct available when loading a multi-pack
index we have it readily accessible.

Drop the redundant information and instead store a pointer to the object
source.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx: write multi-pack indices via their source</title>
<updated>2025-08-11T16:22:22Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2025-08-11T13:46:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=c3f5d251469525a52074b0373671a588f0e5b972'/>
<id>urn:sha1:c3f5d251469525a52074b0373671a588f0e5b972</id>
<content type='text'>
Similar to the preceding commit, refactor the writing side of multi-pack
indices so that we pass in the object database source where the index
should be written to.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx: load multi-pack indices via their source</title>
<updated>2025-08-11T16:22:22Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2025-08-11T13:46:47Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=017db7bb14246dea55b678fc20e34ce91c28968a'/>
<id>urn:sha1:017db7bb14246dea55b678fc20e34ce91c28968a</id>
<content type='text'>
To load a multi-pack index the caller is expected to pass both the
repository and the object directory where the multi-pack index is
located. While this works, this layout has a couple of downsides:

  - We need to pass in information reduntant with the owning source,
    namely its object directory and whether the source is local or not.

  - We don't have access to the source when loading the multi-pack
    index. If we had that access, we could store a pointer to the owning
    source in the MIDX and thus deduplicate some information.

  - Multi-pack indices are inherently specific to the object source and
    its format. With the goal of pluggable object backends in mind we
    will eventually want the backends to own the logic of reading and
    writing multi-pack indices. Making the logic work on top of object
    sources is a step into that direction.

Refactor loading of multi-pack indices accordingly.

This surfaces one small problem though: git-multi-pack-index(1) and our
MIDX test helper both know to read and write multi-pack-indices located
in a different object directory. This issue is addressed by adding the
user-provided object directory as an in-memory alternate.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
