<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/reftable, branch v2.45.4</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=v2.45.4</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.45.4'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2024-04-23T18:52:37Z</updated>
<entry>
<title>Merge branch 'ps/reftable-block-iteration-optim'</title>
<updated>2024-04-23T18:52:37Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2024-04-23T18:52:37Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=33bbc21c922798c9dc882e42162c5f2ad7b77852'/>
<id>urn:sha1:33bbc21c922798c9dc882e42162c5f2ad7b77852</id>
<content type='text'>
The code to iterate over reftable blocks has seen some optimization
to reduce memory allocation and deallocation.

* ps/reftable-block-iteration-optim:
  reftable/block: avoid copying block iterators on seek
  reftable/block: reuse `zstream` state on inflation
  reftable/block: open-code call to `uncompress2()`
  reftable/block: reuse uncompressed blocks
  reftable/reader: iterate to next block in place
  reftable/block: move ownership of block reader into `struct table_iter`
  reftable/block: introduce `block_reader_release()`
  reftable/block: better grouping of functions
  reftable/block: merge `block_iter_seek()` and `block_reader_seek()`
  reftable/block: rename `block_reader_start()`
</content>
</entry>
<entry>
<title>Merge branch 'jt/reftable-geometric-compaction'</title>
<updated>2024-04-16T21:50:30Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2024-04-16T21:50:30Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=82a31ec32441cd06daa5e0397a73f4159cdaad4b'/>
<id>urn:sha1:82a31ec32441cd06daa5e0397a73f4159cdaad4b</id>
<content type='text'>
The strategy to compact multiple tables of reftables after many
operations accumulate many entries has been improved to avoid
accumulating too many tables uncollected.

* jt/reftable-geometric-compaction:
  reftable/stack: use geometric table compaction
  reftable/stack: add env to disable autocompaction
  reftable/stack: expose option to disable auto-compaction
</content>
</entry>
<entry>
<title>reftable/block: avoid copying block iterators on seek</title>
<updated>2024-04-15T17:37:59Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-04-08T12:17:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=9da5c992dd8856035003cb06ff9c996a23956951'/>
<id>urn:sha1:9da5c992dd8856035003cb06ff9c996a23956951</id>
<content type='text'>
When seeking a reftable record in a block we need to position the
iterator _before_ the sought-after record so that the next call to
`block_iter_next()` would yield that record. To achieve this, the loop
that performs the linear seeks to restore the previous position once it
has found the record.

This is done by advancing two `block_iter`s: one to check whether the
next record is our sought-after record, and one that we update after
every iteration. This of course involves quite a lot of copying and also
leads to needless memory allocations.

Refactor the code to get rid of the `next` iterator and the copying this
involves. Instead, we can restore the previous offset such that the call
to `next` will return the correct record.

Next to being simpler conceptually this also leads to a nice speedup.
The following benchmark parser 10k refs out of 100k existing refs via
`git-rev-list --no-walk`:

  Benchmark 1: rev-list: print many refs (HEAD~)
    Time (mean ± σ):     170.2 ms ±   1.7 ms    [User: 86.1 ms, System: 83.6 ms]
    Range (min … max):   166.4 ms … 180.3 ms    500 runs

  Benchmark 2: rev-list: print many refs (HEAD~)
    Time (mean ± σ):     161.6 ms ±   1.6 ms    [User: 78.1 ms, System: 83.0 ms]
    Range (min … max):   158.4 ms … 172.3 ms    500 runs

  Summary
    rev-list: print many refs (HEAD) ran
      1.05 ± 0.01 times faster than rev-list: print many refs (HEAD~)

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>reftable/block: reuse `zstream` state on inflation</title>
<updated>2024-04-15T17:36:09Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-04-08T12:17:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=ce1f213cc91cf545736048f28117fe1de89b8134'/>
<id>urn:sha1:ce1f213cc91cf545736048f28117fe1de89b8134</id>
<content type='text'>
When calling `inflateInit()` and `inflate()`, the zlib library will
allocate several data structures for the underlying `zstream` to keep
track of various information. Thus, when inflating repeatedly, it is
possible to optimize memory allocation patterns by reusing the `zstream`
and then calling `inflateReset()` on it to prepare it for the next chunk
of data to inflate.

This is exactly what the reftable code is doing: when iterating through
reflogs we need to potentially inflate many log blocks, but we discard
the `zstream` every single time. Instead, as we reuse the `block_reader`
for each of the blocks anyway, we can initialize the `zstream` once and
then reuse it for subsequent inflations.

Refactor the code to do so, which leads to a significant reduction in
the number of allocations. The following measurements were done when
iterating through 1 million reflog entries. Before:

  HEAP SUMMARY:
      in use at exit: 13,473 bytes in 122 blocks
    total heap usage: 23,028 allocs, 22,906 frees, 162,813,552 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,473 bytes in 122 blocks
    total heap usage: 302 allocs, 180 frees, 88,352 bytes allocated

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>reftable/block: open-code call to `uncompress2()`</title>
<updated>2024-04-15T17:36:09Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-04-08T12:16:59Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=15a60b747e4f0e0d11353f8e89bc9ce7b36c5512'/>
<id>urn:sha1:15a60b747e4f0e0d11353f8e89bc9ce7b36c5512</id>
<content type='text'>
The reftable format stores log blocks in a compressed format. Thus,
whenever we want to read such a block we first need to decompress it.
This is done by calling the convenience function `uncompress2()` of the
zlib library, which is a simple wrapper that manages the lifecycle of
the `zstream` structure for us.

While nice for one-off inflation of data, when iterating through reflogs
we will likely end up inflating many such log blocks. This requires us
to reallocate the state of the `zstream` every single time, which adds
up over time. It would thus be great to reuse the `zstream` instead of
discarding it after every inflation.

Open-code the call to `uncompress2()` such that we can start reusing the
`zstream` in the subsequent commit. Note that our open-coded variant is
different from `uncompress2()` in two ways:

  - We do not loop around `inflate()` until we have processed all input.
    As our input is limited by the maximum block size, which is 16MB, we
    should not hit limits of `inflate()`.

  - We use `Z_FINISH` instead of `Z_NO_FLUSH`. Quoting the `inflate()`
    documentation: "inflate() should normally be called until it returns
    Z_STREAM_END or an error. However if all decompression is to be
    performed in a single step (a single call of inflate), the parameter
    flush should be set to Z_FINISH."

    Furthermore, "Z_FINISH also informs inflate to not maintain a
    sliding window if the stream completes, which reduces inflate's
    memory footprint."

Other than that this commit is expected to be functionally equivalent
and does not yet reuse the `zstream`.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>reftable/block: reuse uncompressed blocks</title>
<updated>2024-04-15T17:36:09Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-04-08T12:16:54Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=dd347bbce6053538d3332e6e8e499a3bebdbd251'/>
<id>urn:sha1:dd347bbce6053538d3332e6e8e499a3bebdbd251</id>
<content type='text'>
The reftable backend stores reflog entries in a compressed format and
thus needs to uncompress blocks before one can read records from it.
For each reflog block we thus have to allocate an array that we can
decompress the block contents into. This block is being discarded
whenever the table iterator moves to the next block. Consequently, we
reallocate a new array on every block, which is quite wasteful.

Refactor the code to reuse the uncompressed block data when moving the
block reader to a new block. This significantly reduces the number of
allocations when iterating through many compressed blocks. The following
measurements are done with `git reflog list` when listing 100k reflogs.
Before:

  HEAP SUMMARY:
      in use at exit: 13,473 bytes in 122 blocks
    total heap usage: 45,755 allocs, 45,633 frees, 254,779,456 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,473 bytes in 122 blocks
    total heap usage: 23,028 allocs, 22,906 frees, 162,813,547 bytes allocated

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>reftable/reader: iterate to next block in place</title>
<updated>2024-04-15T17:36:09Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-04-08T12:16:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=b00bcb7c49a4f96d39e4a448998b366bcd484de2'/>
<id>urn:sha1:b00bcb7c49a4f96d39e4a448998b366bcd484de2</id>
<content type='text'>
The table iterator has to iterate towards the next block once it has
yielded all records of the current block. This is done by creating a new
table iterator, initializing it to the next block, releasing the old
iterator and then copying over the data.

Refactor the code to instead advance the table iterator in place. This
is simpler and unlocks some optimizations in subsequent patches. Also,
it allows us to avoid some allocations.

The following measurements show a single matching ref out of 1 million
refs. Before this change:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 7,235 allocs, 7,110 frees, 301,481 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 315 allocs, 190 frees, 107,027 bytes allocated

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>reftable/block: move ownership of block reader into `struct table_iter`</title>
<updated>2024-04-15T17:36:09Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-04-08T12:16:45Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=bcdc586db0b3310d05256cbe38724551e4f70475'/>
<id>urn:sha1:bcdc586db0b3310d05256cbe38724551e4f70475</id>
<content type='text'>
The table iterator allows the caller to iterate through all records in a
reftable table. To do so it iterates through all blocks of the desired
type one by one, where for each block it creates a new block iterator
and yields all its entries.

One of the things that is somewhat confusing in this context is who owns
the block reader that is being used to read the blocks and pass them to
the block iterator. Intuitively, as the table iterator is responsible
for iterating through the blocks, one would assume that this iterator is
also responsible for managing the lifecycle of the reader. And while it
somewhat is, the block reader is ultimately stored inside of the block
iterator.

Refactor the code such that the block reader is instead fully managed by
the table iterator. Instead of passing the reader to the block iterator,
we now only end up passing the block data to it. Despite clearing up the
lifecycle of the reader, it will also allow for better reuse of the
reader in subsequent patches.

The following benchmark prints a single matching ref out of 1 million
refs. Before:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 6,607 allocs, 6,482 frees, 509,635 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 7,235 allocs, 7,110 frees, 301,481 bytes allocated

Note that while there are more allocation and free calls now, the
overall number of bytes allocated is significantly lower. The number of
allocations will be reduced significantly by the next patch though.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>reftable/block: introduce `block_reader_release()`</title>
<updated>2024-04-15T17:36:09Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-04-08T12:16:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=b371221a607fdb4ea781733e9449e3835be12c91'/>
<id>urn:sha1:b371221a607fdb4ea781733e9449e3835be12c91</id>
<content type='text'>
Introduce a new function `block_reader_release()` that releases
resources acquired by the block reader. This function will be extended
in a subsequent commit.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>reftable/block: better grouping of functions</title>
<updated>2024-04-15T17:36:09Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-04-08T12:16:36Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=aac8c03cc45e047b1b721a61580c920995bf43a3'/>
<id>urn:sha1:aac8c03cc45e047b1b721a61580c920995bf43a3</id>
<content type='text'>
Function definitions and declaration of `struct block_reader` and
`struct block_iter` are somewhat mixed up, making it hard to see which
functions belong together. Rearrange them.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
