git/reftable/reader.c, branch v2.46.2

Merge branch 'ps/reftable-reusable-iterator'

2024-05-30T21:15:12Z

Code clean-up to make the reftable iterator closer to be reusable. * ps/reftable-reusable-iterator: reftable/merged: adapt interface to allow reuse of iterators reftable/stack: provide convenience functions to create iterators reftable/reader: adapt interface to allow reuse of iterators reftable/generic: adapt interface to allow reuse of iterators reftable/generic: move seeking of records into the iterator reftable/merged: simplify indices for subiterators reftable/merged: split up initialization and seeking of records reftable/reader: set up the reader when initializing table iterator reftable/reader: inline `reader_seek_internal()` reftable/reader: separate concerns of table iter and reftable reader reftable/reader: unify indexed and linear seeking reftable/reader: avoid copying index iterator reftable/block: use `size_t` to track restart point index

reftable/reader: adapt interface to allow reuse of iterators

2024-05-14T00:04:18Z

Refactor the interfaces exposed by `struct reftable_reader` and `struct table_iterator` such that they support iterator reuse. This is done by separating initialization of the iterator and seeking on it. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

reftable/generic: move seeking of records into the iterator

2024-05-14T00:04:18Z

Reftable iterators are created by seeking on the parent structure of a corresponding record. For example, to create an iterator for the merged table you would call `reftable_merged_table_seek_ref()`. Most notably, it is not posible to create an iterator and then seek it afterwards. While this may be a bit easier to reason about, it comes with two significant downsides. The first downside is that the logic to find records is split up between the parent data structure and the iterator itself. Conceptually, it is more straight forward if all that logic was contained in a single place, which should be the iterator. The second and more significant downside is that it is impossible to reuse iterators for multiple seeks. Whenever you want to look up a record, you need to re-create the whole infrastructure again, which is quite a waste of time. Furthermore, it is impossible to optimize seeks, such as when seeking the same record multiple times. To address this, we essentially split up the concerns properly such that the parent data structure is responsible for setting up the iterator via a new `init_iter()` callback, whereas the iterator handles seeks via a new `seek()` callback. This will eventually allow us to call `seek()` on the iterator multiple times, where every iterator can potentially optimize for certain cases. Note that at this point in time we are not yet ready to reuse the iterators. This will be left for a future patch series. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

reftable/reader: set up the reader when initializing table iterator

2024-05-14T00:04:17Z

All the seeking functions accept a `struct reftable_reader` as input such that they can use the reader to look up the respective blocks. Refactor the code to instead set up the reader as a member of `struct table_iter` during initialization such that we don't have to pass the reader on every single call. This step is required to move seeking of records into the generic `struct reftable_iterator` infrastructure. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

reftable/reader: inline `reader_seek_internal()`

2024-05-14T00:04:17Z

We have both `reader_seek()` and `reader_seek_internal()`, where the former function only exists so that we can exit early in case the given table has no records of the sought-after type. Merge these two functions into one. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

reftable/reader: separate concerns of table iter and reftable reader

2024-05-14T00:04:17Z

In "reftable/reader.c" we implement two different interfaces: - The reftable reader contains the logic to read reftables. - The table iterator is used to iterate through a single reftable read by the reader. The way those two types are used in the code is somewhat confusing though because seeking inside a table is implemented as if it was part of the reftable reader, even though it is ultimately more of a detail implemented by the table iterator. Make the boundary between those two types clearer by renaming functions that seek records in a table such that they clearly belong to the table iterator's logic. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

reftable/reader: unify indexed and linear seeking

2024-05-14T00:04:16Z

In `reader_seek_internal()` we either end up doing an indexed seek when there is one or a linear seek otherwise. These two code paths are disjunct without a good reason, where the indexed seek will cause us to exit early. Refactor the two code paths such that it becomes possible to share a bit more code between them. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

reftable/reader: avoid copying index iterator

2024-05-14T00:04:16Z

When doing an indexed seek we need to walk down the multi-level index until we finally hit a record of the desired indexed type. This loop performs a copy of the index iterator on every iteration, which is both hard to understand and completely unnecessary. Refactor the code so that we use a single iterator to walk down the indices, only. Note that while this should improve performance, the improvement is negligible in all but the most unreasonable repositories. This is because the effect is only really noticeable when we have to walk down many levels of indices, which is not something that a repository would typically have. So the motivation for this change is really only about readability. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

reftable/dump: support dumping a table's block structure

2024-05-14T00:02:38Z

We're about to introduce new configs that will allow users to have more control over how exactly reftables are written. To verify that these configs are effective we will need to take a peak into the actual blocks written by the reftable backend. Introduce a new mode to the dumping logic that prints out the block structure. This logic can be invoked via `test-tool dump-reftables -b`. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

reftable/block: reuse `zstream` state on inflation

2024-04-15T17:36:09Z

When calling `inflateInit()` and `inflate()`, the zlib library will allocate several data structures for the underlying `zstream` to keep track of various information. Thus, when inflating repeatedly, it is possible to optimize memory allocation patterns by reusing the `zstream` and then calling `inflateReset()` on it to prepare it for the next chunk of data to inflate. This is exactly what the reftable code is doing: when iterating through reflogs we need to potentially inflate many log blocks, but we discard the `zstream` every single time. Instead, as we reuse the `block_reader` for each of the blocks anyway, we can initialize the `zstream` once and then reuse it for subsequent inflations. Refactor the code to do so, which leads to a significant reduction in the number of allocations. The following measurements were done when iterating through 1 million reflog entries. Before: HEAP SUMMARY: in use at exit: 13,473 bytes in 122 blocks total heap usage: 23,028 allocs, 22,906 frees, 162,813,552 bytes allocated After: HEAP SUMMARY: in use at exit: 13,473 bytes in 122 blocks total heap usage: 302 allocs, 180 frees, 88,352 bytes allocated Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano