<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/pack-bitmap.h, branch v2.45.4</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=v2.45.4</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.45.4'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2023-12-14T22:38:09Z</updated>
<entry>
<title>pack-bitmap: enable reuse from all bitmapped packs</title>
<updated>2023-12-14T22:38:09Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2023-12-14T22:24:44Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=af626ac0e02570e3afac8b4238199157181d43c2'/>
<id>urn:sha1:af626ac0e02570e3afac8b4238199157181d43c2</id>
<content type='text'>
Now that both the pack-bitmap and pack-objects code are prepared to
handle marking and using objects from multiple bitmapped packs for
verbatim reuse, allow marking objects from all bitmapped packs as
eligible for reuse.

Within the `reuse_partial_packfile_from_bitmap()` function, we no longer
only mark the pack whose first object is at bit position zero for reuse,
and instead mark any pack contained in the MIDX as a reuse candidate.

Provide a handful of test cases in a new script (t5332) exercising
interesting behavior for multi-pack reuse to ensure that we performed
all of the previous steps correctly.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx: implement `midx_preferred_pack()`</title>
<updated>2023-12-14T22:38:08Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2023-12-14T22:24:25Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=b1e3333068247ddd44021a0b69457c249ddee7a1'/>
<id>urn:sha1:b1e3333068247ddd44021a0b69457c249ddee7a1</id>
<content type='text'>
When performing a binary search over the objects in a MIDX's bitmap
(i.e. in pseudo-pack order), the reader reconstructs the pseudo-pack
ordering using a combination of (a) the preferred pack, (b) the pack's
lexical position in the MIDX based on pack names, and (c) the object
offset within the pack.

In order to perform this binary search, the reader must know the
identity of the preferred pack. This could be stored in the MIDX, but
isn't for historical reasons, mostly because it can easily be inferred
at read-time by looking at the object in the first bit position and
finding out which pack it was selected from in the MIDX, like so:

    nth_midxed_pack_int_id(m, pack_pos_to_midx(m, 0));

In midx_to_pack_pos() which performs this binary search, we look up the
identity of the preferred pack before each search. This is relatively
quick, since it involves two table-driven lookups (one in the MIDX's
revindex for `pack_pos_to_midx()`, and another in the MIDX's object
table for `nth_midxed_pack_int_id()`).

But since the preferred pack does not change after the MIDX is written,
it is safe to cache this value on the MIDX itself.

Write a helper to do just that, and rewrite all of the existing
call-sites that care about the identity of the preferred pack in terms
of this new helper.

This will prepare us for a subsequent patch where we will need to binary
search through the MIDX's pseudo-pack order multiple times.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()`</title>
<updated>2023-12-14T22:38:08Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2023-12-14T22:24:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=83296d20e84e248ea539fe1332fca2139cfcfb8b'/>
<id>urn:sha1:83296d20e84e248ea539fe1332fca2139cfcfb8b</id>
<content type='text'>
Further prepare for enabling verbatim pack-reuse over multiple packfiles
by changing the signature of reuse_partial_packfile_from_bitmap() to
populate an array of `struct bitmapped_pack *`'s instead of a pointer to
a single packfile.

Since the array we're filling out is sized dynamically[^1], add an
additional `size_t *` parameter which will hold the number of reusable
packs (equal to the number of elements in the array).

Note that since we still have not implemented true multi-pack reuse,
these changes aren't propagated out to the rest of the caller in
builtin/pack-objects.c.

In the interim state, we expect that the array has a single element, and
we use that element to fill out the static `reuse_packfile` variable
(which is a bog-standard `struct packed_git *`). Future commits will
continue to push this change further out through the pack-objects code.

[^1]: That is, even though we know the number of packs which are
  candidates for pack-reuse, we do not know how many of those
  candidates we can actually reuse.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature</title>
<updated>2023-12-14T22:38:08Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2023-12-14T22:24:01Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=35e156b9de1dcc43673c6050cdb65735a7457c1a'/>
<id>urn:sha1:35e156b9de1dcc43673c6050cdb65735a7457c1a</id>
<content type='text'>
The signature of `reuse_partial_packfile_from_bitmap()` currently takes
in a bitmap, as well as three output parameters (filled through
pointers, and passed as arguments), and also returns an integer result.

The output parameters are filled out with: (a) the packfile used for
pack-reuse, (b) the number of objects from that pack that we can reuse,
and (c) a bitmap indicating which objects we can reuse. The return value
is either -1 (when there are no objects to reuse), or 0 (when there is
at least one object to reuse).

Some of these parameters are redundant. Notably, we can infer from the
bitmap how many objects are reused by calling bitmap_popcount(). And we
can similar compute the return value based on that number as well.

As such, clean up the signature of this function to drop the "*entries"
parameter, as well as the int return value, since the single caller of
this function can infer these values themself.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>midx: implement `BTMP` chunk</title>
<updated>2023-12-14T22:38:07Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2023-12-14T22:23:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=5f5ccd959573f88e126d53df16b149c64e6e9091'/>
<id>urn:sha1:5f5ccd959573f88e126d53df16b149c64e6e9091</id>
<content type='text'>
When a multi-pack bitmap is used to implement verbatim pack reuse (that
is, when verbatim chunks from an on-disk packfile are copied
directly[^1]), it does so by using its "preferred pack" as the source
for pack-reuse.

This allows repositories to pack the majority of their objects into a
single (often large) pack, and then use it as the single source for
verbatim pack reuse. This increases the amount of objects that are
reused verbatim (and consequently, decrease the amount of time it takes
to generate many packs). But this performance comes at a cost, which is
that the preferred packfile must pace its growth with that of the entire
repository in order to maintain the utility of verbatim pack reuse.

As repositories grow beyond what we can reasonably store in a single
packfile, the utility of verbatim pack reuse diminishes. Or, at the very
least, it becomes increasingly more expensive to maintain as the pack
grows larger and larger.

It would be beneficial to be able to perform this same optimization over
multiple packs, provided some modest constraints (most importantly, that
the set of packs eligible for verbatim reuse are disjoint with respect
to the subset of their objects being sent).

If we assume that the packs which we treat as candidates for verbatim
reuse are disjoint with respect to any of their objects we may output,
we need to make only modest modifications to the verbatim pack-reuse
code itself. Most notably, we need to remove the assumption that the
bits in the reachability bitmap corresponding to objects from the single
reuse pack begin at the first bit position.

Future patches will unwind these assumptions and reimplement their
existing functionality as special cases of the more general assumptions
(e.g. that reuse bits can start anywhere within the bitset, but happen
to start at 0 for all existing cases).

This patch does not yet relax any of those assumptions. Instead, it
implements a foundational data-structure, the "Bitampped Packs" (`BTMP`)
chunk of the multi-pack index. The `BTMP` chunk's contents are described
in detail here. Importantly, the `BTMP` chunk contains information to
map regions of a multi-pack index's reachability bitmap to the packs
whose objects they represent.

For now, this chunk is only written, not read (outside of the test-tool
used in this patch to test the new chunk's behavior). Future patches
will begin to make use of this new chunk.

[^1]: Modulo patching any `OFS_DELTA`'s that cross over a region of the
  pack that wasn't used verbatim.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'tb/pack-bitmap-traversal-with-boundary'</title>
<updated>2023-06-22T23:29:05Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2023-06-22T23:29:05Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=f2ffc7418685f75e43e2919426276141fd62c656'/>
<id>urn:sha1:f2ffc7418685f75e43e2919426276141fd62c656</id>
<content type='text'>
The object traversal using reachability bitmap done by
"pack-object" has been tweaked to take advantage of the fact that
using "boundary" commits as representative of all the uninteresting
ones can save quite a lot of object enumeration.

* tb/pack-bitmap-traversal-with-boundary:
  pack-bitmap.c: use commit boundary during bitmap traversal
  pack-bitmap.c: extract `fill_in_bitmap()`
  object: add object_array initializer helper function
</content>
</entry>
<entry>
<title>pack-bitmap.c: use commit boundary during bitmap traversal</title>
<updated>2023-05-08T19:05:55Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2023-05-08T17:38:12Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=b0afdce5dab61f224fd66c13768facc36a7f8705'/>
<id>urn:sha1:b0afdce5dab61f224fd66c13768facc36a7f8705</id>
<content type='text'>
When reachability bitmap coverage exists in a repository, Git will use a
different (and hopefully faster) traversal to compute revision walks.

Consider a set of positive and negative tips (which we'll refer to with
their standard bitmap parlance by "wants", and "haves"). In order to
figure out what objects exist between the tips, the existing traversal
in `prepare_bitmap_walk()` does something like:

  1. Consider if we can even compute the set of objects with bitmaps,
     and fall back to the usual traversal if we cannot. For example,
     pathspec limiting traversals can't be computed using bitmaps (since
     they don't know which objects are at which paths). The same is true
     of certain kinds of non-trivial object filters.

  2. If we can compute the traversal with bitmaps, partition the
     (dereferenced) tips into two object lists, "haves", and "wants",
     based on whether or not the objects have the UNINTERESTING flag,
     respectively.

  3. Fall back to the ordinary object traversal if either (a) there are
     more than zero haves, none of which are in the bitmapped pack or
     MIDX, or (b) there are no wants.

  4. Construct a reachability bitmap for the "haves" side by walking
     from the revision tips down to any existing bitmaps, OR-ing in any
     bitmaps as they are found.

  5. Then do the same for the "wants" side, stopping at any objects that
     appear in the "haves" bitmap.

  6. Filter the results if any object filter (that can be easily
     computed with bitmaps alone) was given, and then return back to the
     caller.

When there is good bitmap coverage relative to the traversal tips, this
walk is often significantly faster than an ordinary object traversal
because it can visit far fewer objects.

But in certain cases, it can be significantly *slower* than the usual
object traversal. Why? Because we need to compute complete bitmaps on
either side of the walk. If either one (or both) of the sides require
walking many (or all!) objects before they get to an existing bitmap,
the extra bitmap machinery is mostly or all overhead.

One of the benefits, however, is that even if the walk is slower, bitmap
traversals are guaranteed to provide an *exact* answer. Unlike the
traditional object traversal algorithm, which can over-count the results
by not opening trees for older commits, the bitmap walk builds an exact
reachability bitmap for either side, meaning the results are never
over-counted.

But producing non-exact results is OK for our traversal here (both in
the bitmap case and not), as long as the results are over-counted, not
under.

Relaxing the bitmap traversal to allow it to produce over-counted
results gives us the opportunity to make some significant improvements.
Instead of the above, the new algorithm only has to walk from the
*boundary* down to the nearest bitmap, instead of from each of the
UNINTERESTING tips.

The boundary-based approach still has degenerate cases, but we'll show
in a moment that it is often a significant improvement.

The new algorithm works as follows:

  1. Build a (partial) bitmap of the haves side by first OR-ing any
     bitmap(s) that already exist for UNINTERESTING commits between the
     haves and the boundary.

  2. For each commit along the boundary, add it as a fill-in traversal
     tip (where the traversal terminates once an existing bitmap is
     found), and perform fill-in traversal.

  3. Build up a complete bitmap of the wants side as usual, stopping any
     time we intersect the (partial) haves side.

  4. Return the results.

And is more-or-less equivalent to using the *old* algorithm with this
invocation:

    $ git rev-list --objects --use-bitmap-index $WANTS --not \
        $(git rev-list --objects --boundary $WANTS --not $HAVES |
          perl -lne 'print $1 if /^-(.*)/')

The new result performs significantly better in many cases, particularly
when the distance from the boundary commit(s) to an existing bitmap is
shorter than the distance from (all of) the have tips to the nearest
bitmapped commit.

Note that when using the old bitmap traversal algorithm, the results can
be *slower* than without bitmaps! Under the new algorithm, the result is
computed faster with bitmaps than without (at the cost of over-counting
the true number of objects in a similar fashion as the non-bitmap
traversal):

    # (Computing the number of tagged objects not on any branches
    # without bitmaps).
    $ time git rev-list --count --objects --tags --not --branches
    20

    real	0m1.388s
    user	0m1.092s
    sys	0m0.296s

    # (Computing the same query using the old bitmap traversal).
    $ time git rev-list --count --objects --tags --not --branches --use-bitmap-index
    19

    real	0m22.709s
    user	0m21.628s
    sys	0m1.076s

    # (this commit)
    $ time git.compile rev-list --count --objects --tags --not --branches --use-bitmap-index
    19

    real	0m1.518s
    user	0m1.234s
    sys	0m0.284s

The new algorithm is still slower than not using bitmaps at all, but it
is nearly a 15-fold improvement over the existing traversal.

In a more realistic setting (using my local copy of git.git), I can
observe a similar (if more modest) speed-up:

    $ argv="--count --objects --branches --not --tags"
    hyperfine \
      -n 'no bitmaps' "git.compile rev-list $argv" \
      -n 'existing traversal' "git.compile rev-list --use-bitmap-index $argv" \
      -n 'boundary traversal' "git.compile -c pack.useBitmapBoundaryTraversal=true rev-list --use-bitmap-index $argv"
    Benchmark 1: no bitmaps
      Time (mean ± σ):     124.6 ms ±   2.1 ms    [User: 103.7 ms, System: 20.8 ms]
      Range (min … max):   122.6 ms … 133.1 ms    22 runs

    Benchmark 2: existing traversal
      Time (mean ± σ):     368.6 ms ±   3.0 ms    [User: 325.3 ms, System: 43.1 ms]
      Range (min … max):   365.1 ms … 374.8 ms    10 runs

    Benchmark 3: boundary traversal
      Time (mean ± σ):     167.6 ms ±   0.9 ms    [User: 139.5 ms, System: 27.9 ms]
      Range (min … max):   166.1 ms … 169.2 ms    17 runs

    Summary
      'no bitmaps' ran
        1.34 ± 0.02 times faster than 'boundary traversal'
        2.96 ± 0.05 times faster than 'existing traversal'

Here, the new algorithm is also still slower than not using bitmaps, but
represents a more than 2-fold improvement over the existing traversal in
a more modest example.

Since this algorithm was originally written (nearly a year and a half
ago, at the time of writing), the bitmap lookup table shipped, making
the new algorithm's result more competitive. A few other future
directions for improving bitmap traversal times beyond not using bitmaps
at all:

  - Decrease the cost to decompress and OR together many bitmaps
    together (particularly when enumerating the uninteresting side of
    the walk). Here we could explore more efficient bitmap storage
    techniques, like Roaring+Run and/or use SIMD instructions to speed
    up ORing them together.

  - Store pseudo-merge bitmaps, which could allow us to OR together
    fewer "summary" bitmaps (which would also help with the above).

Helped-by: Jeff King &lt;peff@peff.net&gt;
Helped-by: Derrick Stolee &lt;derrickstolee@github.com&gt;
Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>fsck: verify checksums of all .bitmap files</title>
<updated>2023-05-02T15:48:22Z</updated>
<author>
<name>Derrick Stolee</name>
<email>derrickstolee@github.com</email>
</author>
<published>2023-05-02T13:27:21Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=756f1bcd29a711075cbaecb02c923328e86a41a4'/>
<id>urn:sha1:756f1bcd29a711075cbaecb02c923328e86a41a4</id>
<content type='text'>
If a filesystem-level corruption occurs in a .bitmap file, Git can react
poorly. This could take the form of a run-time error due to failing to
parse an EWAH bitmap or be more subtle such as returning the wrong set
of objects to a fetch or clone.

A natural first response to either of these kinds of errors is to run
'git fsck' to see if any files are corrupt. This currently ignores all
.bitmap files.

Add checks to 'git fsck' for all .bitmap files that are currently
associated with a multi-pack-index or pack file. Verify their checksums
using the hashfile API.

We iterate through all multi-pack-indexes and pack-files to be sure to
check all .bitmap files, not just the one that would be read by the
process. For example, a multi-pack-index bitmap overrules a pack-bitmap.
However, if the multi-pack-index is removed, the pack-bitmap may be
selected instead. Be thorough to include every file that could become
active in such a way. This includes checking files in alternates.

There is potential that we could extend this effort to check the
structure of the reachability bitmaps themselves, but it is very
expensive to do so. At minimum, it's as expensive as generating the
bitmaps in the first place, and that's assuming that we don't use the
trivial algorithm of verifying each bitmap individually. The trivial
algorithm will result in quadratic behavior (number of objects times
number of bitmapped commits) while the bitmap building operation
constructs a lattice of commits to build bitmaps incrementally and then
generate the final bitmaps from a subset of those commits.

If we were to extend 'git fsck' to check .bitmap file contents more
closely like this, then we would likely want to hide it behind an option
that signals the user is more willing to do expensive operations such as
this.

For testing, set up a repository with a pack-bitmap _and_ a
multi-pack-index bitmap. This requires some file movement to avoid
deleting the pack-bitmap during the repack that creates the
multi-pack-index bitmap. We can then verify that 'git fsck' is checking
all files, not just the "active" bitmap.

Signed-off-by: Derrick Stolee &lt;derrickstolee@github.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>pack-bitmap: prepare to read lookup table extension</title>
<updated>2022-08-26T17:13:58Z</updated>
<author>
<name>Abhradeep Chakraborty</name>
<email>chakrabortyabhradeep79@gmail.com</email>
</author>
<published>2022-08-14T16:55:10Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=28cd730680dd7d5c0e0971827315bf8ae32f47b7'/>
<id>urn:sha1:28cd730680dd7d5c0e0971827315bf8ae32f47b7</id>
<content type='text'>
Earlier change teaches Git to write bitmap lookup table. But Git
does not know how to parse them.

Teach Git to parse the existing bitmap lookup table. The older
versions of Git are not affected by it. Those versions ignore the
lookup table.

Mentored-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Co-Mentored-by: Kaartic Sivaraam &lt;kaartic.sivaraam@gmail.com&gt;
Signed-off-by: Abhradeep Chakraborty &lt;chakrabortyabhradeep79@gmail.com&gt;
Reviewed-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>pack-bitmap-write.c: write lookup table extension</title>
<updated>2022-08-26T17:13:50Z</updated>
<author>
<name>Abhradeep Chakraborty</name>
<email>chakrabortyabhradeep79@gmail.com</email>
</author>
<published>2022-08-14T16:55:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=93eb41e2403788fa9105211956e87b6b2c22c68c'/>
<id>urn:sha1:93eb41e2403788fa9105211956e87b6b2c22c68c</id>
<content type='text'>
The bitmap lookup table extension was documented by an earlier
change, but Git does not yet know how to write that extension.

Teach Git to write bitmap lookup table extension. The table contains
the list of `N` &lt;commit_pos, offset, xor_row&gt;` triplets. These
triplets are sorted according to their commit pos (ascending order).
The meaning of each data in the i'th triplet is given below:

  - commit_pos stores commit position (in the pack-index or midx).
    It is a 4 byte network byte order unsigned integer.

  - offset is the position (in the bitmap file) from which that
    commit's bitmap can be read.

  - xor_row is the position of the triplet in the lookup table
    whose bitmap is used to compress this bitmap, or `0xffffffff`
    if no such bitmap exists.

Mentored-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Co-mentored-by: Kaartic Sivaraam &lt;kaartic.sivaraam@gmail.com&gt;
Co-authored-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Abhradeep Chakraborty &lt;chakrabortyabhradeep79@gmail.com&gt;
Reviewed-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
