packfile: fix approximation of object counts - git - Mirror of https://git.kernel.org/pub/scm/git/git.git/

author	Patrick Steinhardt <ps@pks.im>	2025-10-30 11:38:41 +0100
committer	Junio C Hamano <gitster@pobox.com>	2025-10-30 07:09:52 -0700
commit	02a7f6ffab9ec7641f88032f30998976bca07820 (patch)
tree	dc918724f836b06ad846aaef3f63dab54e46a801 /refs/iterator.c
parent	http: refactor subsystem to use `packfile_list`s (diff)

packfile: fix approximation of object counts

When approximating the number of objects in a repository we only take into account two data sources, the multi-pack index and the packfile indices, as both of these data structures allow us to easily figure out how many objects they contain.

But the way we currently approximate the number of objects is broken in the presence of a multi-pack index. This is due to two separate reasons:

  - We have recently introduced initial infrastructure for incremental multi-pack indices. Starting with that series, `num_objects` only counts the number of objects of a specific layer of the MIDX chain, so we do not take into account objects from parent layers.

    This issue is fixed by adding `num_objects_in_base`, which contains the sum of all objects in previous layers.

  - When using the multi-pack index we may count objects contained in packfiles twice: once via the multi-pack index, and then again via the packfile itself.

    This issue is fixed by skipping any packfiles that have an MIDX.

Overall, given that we _always_ count the packs, we can only end up overestimating the number of objects, and the overestimation is limited to a factor of two at most.

The consequences of those issues are very limited though, as we only approximate object counts in a small number of cases:

  - When writing a commit-graph we use the approximate object count to display the upper limit of a progress display.

  - In `repo_find_unique_abbrev_r()` we use it to specify a lower limit of how many hex digits we want to abbreviate to. Given that we use a power of two here to derive the lower limit, we may end up with an abbreviated hash that is one digit longer than required.

  - In `estimate_repack_memory()` we may end up overestimating how much memory a repack needs to pack objects. Consequently, we may end up dropping some packfiles from a repack.

None of these are really game-changing. But it's nice to fix those issues regardless.

While at it, convert the code to use `repo_for_each_pack()`. Furthermore, use `odb_prepare_alternates()` instead of explicitly preparing the packfile store. We really only want to prepare the object database sources, and `get_multi_pack_index()` already knows to prepare the packfile store for us.

Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
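
The two counting problems described in the message can be illustrated with a small, self-contained sketch. This is not the code from the patch: `struct midx_layer`, `struct pack`, the `covered_by_midx` flag and both functions are hypothetical stand-ins invented for illustration. Only the field names `num_objects` and `num_objects_in_base` are taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one layer of a MIDX chain. */
struct midx_layer {
	uint32_t num_objects;         /* objects in this layer only */
	uint32_t num_objects_in_base; /* sum of objects in all previous layers */
};

/* Hypothetical, simplified stand-in for a packfile. */
struct pack {
	uint32_t num_objects;
	int covered_by_midx; /* assumption: nonzero if an MIDX covers this pack */
};

/* Behaviour as described for the old code: tip layer only, plus every pack. */
static unsigned long approximate_count_old(const struct midx_layer *tip,
					   const struct pack *packs, size_t nr)
{
	unsigned long count = 0;
	size_t i;

	assert(packs || !nr);
	if (tip)
		count += tip->num_objects;
	for (i = 0; i < nr; i++)
		count += packs[i].num_objects;
	return count;
}

/* Behaviour as described for the fix: whole chain, plus uncovered packs only. */
static unsigned long approximate_count_fixed(const struct midx_layer *tip,
					     const struct pack *packs, size_t nr)
{
	unsigned long count = 0;
	size_t i;

	assert(packs || !nr);
	if (tip)
		count += (unsigned long)tip->num_objects + tip->num_objects_in_base;
	for (i = 0; i < nr; i++) {
		if (packs[i].covered_by_midx)
			continue; /* already counted via the multi-pack index */
		count += packs[i].num_objects;
	}
	return count;
}
```

In this sketch, take a tip layer with 30 objects on top of 70 base objects, and packs of 70 and 30 objects (both covered by the MIDX) and 5 objects (not covered). The old variant counts 135 objects, while the fixed variant counts 105, which matches the actual total.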

Diffstat (limited to 'refs/iterator.c')

0 files changed, 0 insertions, 0 deletions

