git/bloom.c, branch v2.37.2

commit-graph: fix corrupt upgrade from generation v1 to v2

2022-07-15T23:51:39Z

The previous commit demonstrates a bug where a commit-graph using generation v2 could enter a state where one of the GDA2 values has its most-significant bit set (indicating that its value should be read from the extended offset table in the GDO2 chunk) without having a GDO2 chunk to read from. This results in the following error message being displayed to the caller: fatal: commit-graph requires overflow generation data but has none This bug arises in the following scenario: - We decide to write a commit-graph using generation number v2, and decide (correctly) that no GDO2 chunk is necessary (e.g., because all of the commiter date offsets are no larger than 2^31-1). - The v2 generation numbers are stored in the `->generation` member of the commit slab holding `struct commit_graph_data`'s. - Later on, `load_commit_graph_info()` is called, overwriting the v2 generation data in the aforementioned slab with any existing v1 generation data. Then, when the commit-graph code goes to write the GDA2 chunk via `write_graph_chunk_generation_data()`, we use the overwritten generation v1 data in a place where we expect to use a v2 generation number: offset = commit_graph_data_at(c)->generation - c->date; ...because `commit_graph_data_at(c)->generation` used to hold the v2 generation data, but it was overwritten to contain the v1 generation number via `load_commit_graph_info()`. If the `offset` computation above overflows the v2 generation number max, then `write_graph_chunk_generation_data()` will update its count of large offsets and write the marker accordingly: if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) { offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows; num_generation_data_overflows++; } and reads will look for the GDO2 chunk containing the overflowing v2 generation number, *after* the commit-graph code decided that no such chunk was necessary. The main problem is that the slab containing `struct commit_graph_data` has a dual purpose. It is used to hold data that we are about to write to disk while generating a commit-graph, as well as hold data that was read from an existing commit-graph. When the two mix, namely when the result of reading the commit-graph has a side-effect that mixes poorly with an in-progress commit-graph write, we end up with corrupt data. A complete fix might be to introduce a new slab that is used exclusively for writing, and gate access between the two slabs based on context provided by the caller (e.g., whether this computation is part of a "read" or "write" operation). But a more minimal fix addresses the only known path which overwrites the slab data, which is `compute_bloom_filters()` -> `get_or_compute_bloom_filter()` -> `load_commit_graph_info()` -> `fill_commit_graph_info()` by avoiding the last call which clobbers the data altogether. This path only needs to learn the graph position of a given commit so that it can be used in `load_bloom_filter_from_graph()`. By replacing the last steps of the above with one that records the graph position into a temporary variable which is then used to load the existing Bloom data, we eliminate the clobbering, removing the corruption. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano

Merge branch 'ah/plugleaks'

2021-05-07T03:47:41Z

Plug various leans reported by LSAN. * ah/plugleaks: builtin/rm: avoid leaking pathspec and seen builtin/rebase: release git_format_patch_opt too builtin/for-each-ref: free filter and UNLEAK sorting. mailinfo: also free strbuf lists when clearing mailinfo builtin/checkout: clear pending objects after diffing builtin/check-ignore: clear_pathspec before returning builtin/bugreport: don't leak prefixed filename branch: FREE_AND_NULL instead of NULL'ing real_ref bloom: clear each bloom_key after use ls-files: free max_prefix when done wt-status: fix multiple small leaks revision: free remainder of old commit list in limit_list

bloom: clear each bloom_key after use

2021-04-28T00:25:44Z

fill_bloom_key() allocates memory into bloom_key, we need to clean that up once the key is no longer needed. This leak was found while running t0002-t0099. Although this leak is happening in code being called from a test-helper, the same code is also used in various locations around git, and can therefore happen during normal usage too. Gabor's analysis shows that peak-memory usage during 'git commit-graph write' is reduced on the order of 10% for a selection of larger repos (along with an even larger reduction if we override modified path bloom filter limits): https://lore.kernel.org/git/20210411072651.GF2947267@szeder.dev/ LSAN output: Direct leak of 308 byte(s) in 11 object(s) allocated from: #0 0x49a5e2 in calloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3 #1 0x6f4032 in xcalloc wrapper.c:140:8 #2 0x4f2905 in fill_bloom_key bloom.c:137:28 #3 0x4f34c1 in get_or_compute_bloom_filter bloom.c:284:4 #4 0x4cb484 in get_bloom_filter_for_commit t/helper/test-bloom.c:43:11 #5 0x4cb072 in cmd__bloom t/helper/test-bloom.c:97:3 #6 0x4ca7ef in cmd_main t/helper/test-tool.c:121:11 #7 0x4caace in main common-main.c:52:11 #8 0x7f798af95349 in __libc_start_main (/lib64/libc.so.6+0x24349) SUMMARY: AddressSanitizer: 308 byte(s) leaked in 11 allocation(s). Signed-off-by: Andrzej Hunt Signed-off-by: Junio C Hamano

use CALLOC_ARRAY

2021-03-14T00:00:09Z

Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe Signed-off-by: Junio C Hamano

Use new HASHMAP_INIT macro to simplify hashmap initialization

2020-11-11T20:55:27Z

Now that hashamp has lazy initialization and a HASHMAP_INIT macro, hashmaps allocated on the stack can be initialized without a call to hashmap_init() and in some cases makes the code a bit shorter. Convert some callsites over to take advantage of this. Signed-off-by: Elijah Newren Signed-off-by: Junio C Hamano

hashmap: provide deallocation function names

2020-11-02T20:15:50Z

hashmap_free(), hashmap_free_entries(), and hashmap_free_() have existed for a while, but aren't necessarily the clearest names, especially with hashmap_partial_clear() being added to the mix and lazy-initialization now being supported. Peff suggested we adopt the following names[1]: - hashmap_clear() - remove all entries and de-allocate any hashmap-specific data, but be ready for reuse - hashmap_clear_and_free() - ditto, but free the entries themselves - hashmap_partial_clear() - remove all entries but don't deallocate table - hashmap_partial_clear_and_free() - ditto, but free the entries This patch provides the new names and converts all existing callers over to the new naming scheme. [1] https://lore.kernel.org/git/20201030125059.GA3277724@coredump.intra.peff.net/ Signed-off-by: Elijah Newren Signed-off-by: Junio C Hamano

builtin/commit-graph.c: introduce '--max-new-filters='

2020-09-18T17:35:39Z

Introduce a command-line flag to specify the maximum number of new Bloom filters that a 'git commit-graph write' is willing to compute from scratch. Prior to this patch, a commit-graph write with '--changed-paths' would compute Bloom filters for all selected commits which haven't already been computed (i.e., by a previous commit-graph write with '--split' such that a roll-up or replacement is performed). This behavior can cause prohibitively-long commit-graph writes for a variety of reasons: * There may be lots of filters whose diffs take a long time to generate (for example, they have close to the maximum number of changes, diffing itself takes a long time, etc). * Old-style commit-graphs (which encode filters with too many entries as not having been computed at all) cause us to waste time recomputing filters that appear to have not been computed only to discover that they are too-large. This can make the upper-bound of the time it takes for 'git commit-graph write --changed-paths' to be rather unpredictable. To make this command behave more predictably, introduce '--max-new-filters=' to allow computing at most '' Bloom filters from scratch. This lets "computing" already-known filters proceed quickly, while bounding the number of slow tasks that Git is willing to do. Helped-by: Junio C Hamano Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano

bloom: encode out-of-bounds filters as non-empty

2020-09-18T04:55:50Z

When a changed-path Bloom filter has either zero, or more than a certain number (commonly 512) of entries, the commit-graph machinery encodes it as "missing". More specifically, it sets the indices adjacent in the BIDX chunk as equal to each other to indicate a "length 0" filter; that is, that the filter occupies zero bytes on disk. This has heretofore been fine, since the commit-graph machinery has no need to care about these filters with too few or too many changed paths. Both cases act like no filter has been generated at all, and so there is no need to store them. In a subsequent commit, however, the commit-graph machinery will learn to only compute Bloom filters for some commits in the current commit-graph layer. This is a change from the current implementation which computes Bloom filters for all commits that are in the layer being written. Critically for this patch, only computing some of the Bloom filters means adding a third state for length 0 Bloom filters: zero entries, too many entries, or "hasn't been computed". It will be important for that future patch to distinguish between "not representable" (i.e., zero or too-many changed paths), and "hasn't been computed". In particular, we don't want to waste time recomputing filters that have already been computed. To that end, change how we store Bloom filters in the "computed but not representable" category: - Bloom filters with no entries are stored as a single byte with all bits low (i.e., all queries to that Bloom filter will return "definitely not") - Bloom filters with too many entries are stored as a single byte with all bits set high (i.e., all queries to that Bloom filter will return "maybe"). These rules are sufficient to not incur a behavior change by changing the on-disk representation of these two classes. Likewise, no specification changes are necessary for the commit-graph format, either: - Filters that were previously empty will be recomputed and stored according to the new rules, and - old clients reading filters generated by new clients will interpret the filters correctly and be none the wiser to how they were generated. Clients will invoke the Bloom machinery in more cases than before, but this can be addressed by returning a NULL filter when all bits are set high. This can be addressed in a future patch. Note that this does increase the size of on-disk commit-graphs, but far less than other proposals. In particular, this is generally more efficient than storing a bitmap for which commits haven't computed their Bloom filters. Storing a bitmap incurs a penalty of one bit per commit, whereas storing explicit filters as above incurs a penalty of one byte per too-large or empty commit. In practice, these boundary commits likely occupy a small proportion of the overall number of commits, and so the size penalty is likely smaller than storing a bitmap for all commits. See, for example, these relative proportions of such boundary commits (collected by SZEDER Gábor): | Percentage of | commit-graph | | | commits modifying | file size | | ├────────┬──────────────┼───────────────────┤ pct. | | 0 path | >= 512 paths | before | after | change | ┌────────────────┼────────┼──────────────┼─────────┼─────────┼───────────┤ | android-base | 13.20% | 0.13% | 37.468M | 37.534M | +0.1741 % | | cmssw | 0.15% | 0.23% | 17.118M | 17.119M | +0.0091 % | | cpython | 3.07% | 0.01% | 7.967M | 7.971M | +0.0423 % | | elasticsearch | 0.70% | 1.00% | 8.833M | 8.835M | +0.0128 % | | gcc | 0.00% | 0.08% | 16.073M | 16.074M | +0.0030 % | | gecko-dev | 0.14% | 0.64% | 59.868M | 59.874M | +0.0105 % | | git | 0.11% | 0.02% | 3.895M | 3.895M | +0.0020 % | | glibc | 0.02% | 0.10% | 3.555M | 3.555M | +0.0021 % | | go | 0.00% | 0.07% | 3.186M | 3.186M | +0.0018 % | | homebrew-cask | 0.40% | 0.02% | 7.035M | 7.035M | +0.0065 % | | homebrew-core | 0.01% | 0.01% | 11.611M | 11.611M | +0.0002 % | | jdk | 0.26% | 5.64% | 5.537M | 5.540M | +0.0590 % | | linux | 0.01% | 0.51% | 63.735M | 63.740M | +0.0073 % | | llvm-project | 0.12% | 0.03% | 25.515M | 25.516M | +0.0050 % | | rails | 0.10% | 0.10% | 6.252M | 6.252M | +0.0027 % | | rust | 0.07% | 0.17% | 9.364M | 9.364M | +0.0033 % | | tensorflow | 0.09% | 1.02% | 7.009M | 7.010M | +0.0158 % | | webkit | 0.05% | 0.31% | 17.405M | 17.406M | +0.0047 % | (where the above increase is determined by computing a non-split commit-graph before and after this patch). Given that these projects are all "large" by commit count, the storage cost by writing these filters explicitly is negligible. In the most extreme example, android-base (which has 494,848 commits at the time of writing) would have its commit-graph increase by a modest 68.4 KB. Finally, a test to exercise filters which contain too many changed path entries will be introduced in a subsequent patch. Suggested-by: SZEDER Gábor Suggested-by: Jakub Narębski Helped-by: Derrick Stolee Helped-by: SZEDER Gábor Helped-by: Junio C Hamano Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano

bloom/diff: properly short-circuit on max_changes

2020-09-17T16:31:25Z

Commit e3696980 (diff: halt tree-diff early after max_changes, 2020-03-30) intended to create a mechanism to short-circuit a diff calculation after a certain number of paths were modified. By incrementing a "num_changes" counter throughout the recursive ll_diff_tree_paths(), this was supposed to match the number of changes that would be written into the changed-path Bloom filters. Unfortunately, this was not implemented correctly and instead misses simple cases like file modifications. This then does not stop very large changed-path filters from being written (unless they add or remove many files). To start, change the implementation in ll_diff_tree_paths() to instead use the global diff_queue_diff struct's 'nr' member as the count. This is a way to simplify the logic instead of making more mistakes in the complicated diff code. This has a drawback: the diff_queue_diff struct only lists the paths corresponding to blob changes, not their leading directories. Thus, get_or_compute_bloom_filter() needs an additional check to see if the hashmap with the leading directories becomes too large. One reason why this was not caught by test cases was that the test in t4216-log-bloom.sh that was supposed to check this "too many changes" condition only checked this on the initial commit of a repository. The old logic counted these values correctly. Update this test in a few ways: 1. Use GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS to reduce the limit, allowing smaller commits to engage with this logic. 2. Create several interesting cases of edits, adds, removes, and mode changes (in the second commit). By testing both sides of the inequality with the *_MAX_CHANGED_PATHS variable, we can see that the count is exactly correct, so none of these changes are missed or over-counted. 3. Use the trace2 data value filter_found_large to verify that these commits are on the correct side of the limit. Another way to verify the behavior is correct is through performance tests. By testing on my local copies of the Git repository and the Linux kernel repository, I could measure the effect of these short-circuits when computing a fresh commit-graph file with changed-path Bloom filters using the command GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=N time \ git commit-graph write --reachable --changed-paths and reporting the wall time and resulting commit-graph size. For Git, the results are | | N=1 | N=10 | N=512 | |--------|----------------|----------------|----------------| | HEAD~1 | 10.90s 9.18MB | 11.11s 9.34MB | 11.31s 9.35MB | | HEAD | 9.21s 8.62MB | 11.11s 9.29MB | 11.29s 9.34MB | For Linux, the results are | | N=1 | N=20 | N=512 | |--------|----------------|---------------|---------------| | HEAD~1 | 61.28s 64.3MB | 76.9s 72.6MB | 77.6s 72.6MB | | HEAD | 49.44s 56.3MB | 68.7s 65.9MB | 69.2s 65.9MB | Naturally, the improvement becomes much less as the limit grows, as fewer commits satisfy the short-circuit. Reported-by: SZEDER Gábor Signed-off-by: Derrick Stolee Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano

bloom: use provided 'struct bloom_filter_settings'

2020-09-17T16:31:25Z

When 'get_or_compute_bloom_filter()' needs to compute a Bloom filter from scratch, it looks to the default 'struct bloom_filter_settings' in order to determine the maximum number of changed paths, number of bits per entry, and so on. All of these values have so far been constant, and so there was no need to pass in a pointer from the caller (eg., the one that is stored in the 'struct write_commit_graph_context'). Start passing in a 'struct bloom_filter_settings *' instead of using the default values to respect graph-specific settings (eg., in the case of setting 'GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS'). In order to have an initialized value for these settings, move its initialization to earlier in the commit-graph write. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano