<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/reftable/reader.c, branch v2.45.2</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=v2.45.2</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.45.2'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2024-04-15T17:36:09Z</updated>
<entry>
<title>reftable/block: reuse `zstream` state on inflation</title>
<updated>2024-04-15T17:36:09Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-04-08T12:17:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=ce1f213cc91cf545736048f28117fe1de89b8134'/>
<id>urn:sha1:ce1f213cc91cf545736048f28117fe1de89b8134</id>
<content type='text'>
When calling `inflateInit()` and `inflate()`, the zlib library will
allocate several data structures for the underlying `zstream` to keep
track of various information. Thus, when inflating repeatedly, it is
possible to optimize memory allocation patterns by reusing the `zstream`
and then calling `inflateReset()` on it to prepare it for the next chunk
of data to inflate.

This is exactly what the reftable code is doing: when iterating through
reflogs we need to potentially inflate many log blocks, but we discard
the `zstream` every single time. Instead, as we reuse the `block_reader`
for each of the blocks anyway, we can initialize the `zstream` once and
then reuse it for subsequent inflations.

Refactor the code to do so, which leads to a significant reduction in
the number of allocations. The following measurements were done when
iterating through 1 million reflog entries. Before:

  HEAP SUMMARY:
      in use at exit: 13,473 bytes in 122 blocks
    total heap usage: 23,028 allocs, 22,906 frees, 162,813,552 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,473 bytes in 122 blocks
    total heap usage: 302 allocs, 180 frees, 88,352 bytes allocated

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>reftable/block: reuse uncompressed blocks</title>
<updated>2024-04-15T17:36:09Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-04-08T12:16:54Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=dd347bbce6053538d3332e6e8e499a3bebdbd251'/>
<id>urn:sha1:dd347bbce6053538d3332e6e8e499a3bebdbd251</id>
<content type='text'>
The reftable backend stores reflog entries in a compressed format and
thus needs to uncompress blocks before one can read records from it.
For each reflog block we thus have to allocate an array that we can
decompress the block contents into. This block is being discarded
whenever the table iterator moves to the next block. Consequently, we
reallocate a new array on every block, which is quite wasteful.

Refactor the code to reuse the uncompressed block data when moving the
block reader to a new block. This significantly reduces the number of
allocations when iterating through many compressed blocks. The following
measurements are done with `git reflog list` when listing 100k reflogs.
Before:

  HEAP SUMMARY:
      in use at exit: 13,473 bytes in 122 blocks
    total heap usage: 45,755 allocs, 45,633 frees, 254,779,456 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,473 bytes in 122 blocks
    total heap usage: 23,028 allocs, 22,906 frees, 162,813,547 bytes allocated

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>reftable/reader: iterate to next block in place</title>
<updated>2024-04-15T17:36:09Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-04-08T12:16:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=b00bcb7c49a4f96d39e4a448998b366bcd484de2'/>
<id>urn:sha1:b00bcb7c49a4f96d39e4a448998b366bcd484de2</id>
<content type='text'>
The table iterator has to iterate towards the next block once it has
yielded all records of the current block. This is done by creating a new
table iterator, initializing it to the next block, releasing the old
iterator and then copying over the data.

Refactor the code to instead advance the table iterator in place. This
is simpler and unlocks some optimizations in subsequent patches. Also,
it allows us to avoid some allocations.

The following measurements show a single matching ref out of 1 million
refs. Before this change:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 7,235 allocs, 7,110 frees, 301,481 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 315 allocs, 190 frees, 107,027 bytes allocated

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>reftable/block: move ownership of block reader into `struct table_iter`</title>
<updated>2024-04-15T17:36:09Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-04-08T12:16:45Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=bcdc586db0b3310d05256cbe38724551e4f70475'/>
<id>urn:sha1:bcdc586db0b3310d05256cbe38724551e4f70475</id>
<content type='text'>
The table iterator allows the caller to iterate through all records in a
reftable table. To do so it iterates through all blocks of the desired
type one by one, where for each block it creates a new block iterator
and yields all its entries.

One of the things that is somewhat confusing in this context is who owns
the block reader that is being used to read the blocks and pass them to
the block iterator. Intuitively, as the table iterator is responsible
for iterating through the blocks, one would assume that this iterator is
also responsible for managing the lifecycle of the reader. And while it
somewhat is, the block reader is ultimately stored inside of the block
iterator.

Refactor the code such that the block reader is instead fully managed by
the table iterator. Instead of passing the reader to the block iterator,
we now only end up passing the block data to it. Despite clearing up the
lifecycle of the reader, it will also allow for better reuse of the
reader in subsequent patches.

The following benchmark prints a single matching ref out of 1 million
refs. Before:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 6,607 allocs, 6,482 frees, 509,635 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 7,235 allocs, 7,110 frees, 301,481 bytes allocated

Note that while there are more allocation and free calls now, the
overall number of bytes allocated is significantly lower. The number of
allocations will be reduced significantly by the next patch though.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>reftable/block: introduce `block_reader_release()`</title>
<updated>2024-04-15T17:36:09Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-04-08T12:16:40Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=b371221a607fdb4ea781733e9449e3835be12c91'/>
<id>urn:sha1:b371221a607fdb4ea781733e9449e3835be12c91</id>
<content type='text'>
Introduce a new function `block_reader_release()` that releases
resources acquired by the block reader. This function will be extended
in a subsequent commit.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>reftable/block: merge `block_iter_seek()` and `block_reader_seek()`</title>
<updated>2024-04-15T17:36:09Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-04-08T12:16:31Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=42c7bdc36d0aacfb7c0910126257a2009de0b1ca'/>
<id>urn:sha1:42c7bdc36d0aacfb7c0910126257a2009de0b1ca</id>
<content type='text'>
The function `block_iter_seek()` is merely a simple wrapper around
`block_reader_seek()`. Merge those two functions into a new function
`block_iter_seek_key()` that more clearly says what it is actually
doing.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>reftable/block: rename `block_reader_start()`</title>
<updated>2024-04-15T17:36:09Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-04-08T12:16:26Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=3122d44025e29571aecda02fe3699f240246f3b2'/>
<id>urn:sha1:3122d44025e29571aecda02fe3699f240246f3b2</id>
<content type='text'>
The function `block_reader_start()` does not really apply to the block
reader, but to the block iterator. It's name is thus somewhat confusing.
Rename it to `block_iter_seek_start()` to clarify.

We will rename `block_reader_seek()` in similar spirit in the next
commit.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'ps/reftable-iteration-perf'</title>
<updated>2024-02-27T02:10:24Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2024-02-27T02:10:24Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=9f67cbd0a7725857d219da6720fc9a5acfda2960'/>
<id>urn:sha1:9f67cbd0a7725857d219da6720fc9a5acfda2960</id>
<content type='text'>
The code to iterate over refs with the reftable backend has seen
some optimization.

* ps/reftable-iteration-perf:
  reftable/reader: add comments to `table_iter_next()`
  reftable/record: don't try to reallocate ref record name
  reftable/block: swap buffers instead of copying
  reftable/pq: allocation-less comparison of entry keys
  reftable/merged: skip comparison for records of the same subiter
  reftable/merged: allocation-less dropping of shadowed records
  reftable/record: introduce function to compare records by key
</content>
</entry>
<entry>
<title>Merge branch 'ps/reftable-styles'</title>
<updated>2024-02-12T21:16:10Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2024-02-12T21:16:10Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=f424d7c33df373fb8eb4c9dc63ab6dc24de24aa5'/>
<id>urn:sha1:f424d7c33df373fb8eb4c9dc63ab6dc24de24aa5</id>
<content type='text'>
Code clean-up in various reftable code paths.

* ps/reftable-styles:
  reftable/record: improve semantics when initializing records
  reftable/merged: refactor initialization of iterators
  reftable/merged: refactor seeking of records
  reftable/stack: use `size_t` to track stack length
  reftable/stack: use `size_t` to track stack slices during compaction
  reftable/stack: index segments with `size_t`
  reftable/stack: fix parameter validation when compacting range
  reftable: introduce macros to allocate arrays
  reftable: introduce macros to grow arrays
</content>
</entry>
<entry>
<title>reftable/reader: add comments to `table_iter_next()`</title>
<updated>2024-02-12T17:19:27Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-02-12T08:32:57Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=c68ca7abd30b22404ce59d5133566729c07ffe8f'/>
<id>urn:sha1:c68ca7abd30b22404ce59d5133566729c07ffe8f</id>
<content type='text'>
While working on the optimizations in the preceding patches I stumbled
upon `table_iter_next()` multiple times. It is quite easy to miss the
fact that we don't call `table_iter_next_in_block()` twice, but that the
second call is in fact `table_iter_next_block()`.

Add comments to explain what exactly is going on here to make things
more obvious. While at it, touch up the code to conform to our code
style better.

Note that one of the refactorings merges two conditional blocks into
one. Before, we had the following code:

```
err = table_iter_next_block(&amp;next, ti);
if (err != 0) {
	ti-&gt;is_finished = 1;
}
table_iter_block_done(ti);
if (err != 0) {
	return err;
}
```

As `table_iter_block_done()` does not care about `is_finished`, the
conditional blocks can be merged into one block:

```
err = table_iter_next_block(&amp;next, ti);
table_iter_block_done(ti);
if (err != 0) {
	ti-&gt;is_finished = 1;
	return err;
}
```

This is both easier to reason about and more performant because we have
one branch less.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
