git/bloom.c, branch v2.45.4

treewide: remove unnecessary includes in source files

2023-12-26T20:04:31Z

Each of these were checked with gcc -E -I. ${SOURCE_FILE} | grep ${HEADER_FILE} to ensure that removing the direct inclusion of the header actually resulted in that header no longer being included at all (i.e. that no other header pulled it in transitively). ...except for a few cases where we verified that although the header was brought in transitively, nothing from it was directly used in that source file. These cases were: * builtin/credential-cache.c * builtin/pull.c * builtin/send-pack.c Signed-off-by: Elijah Newren Signed-off-by: Junio C Hamano

commit-graph: detect out-of-order BIDX offsets

2023-10-09T22:55:02Z

The BIDX chunk tells us the offsets at which each commit's Bloom filters can be found in the BDAT chunk. We compute the length of each filter by checking the offsets of neighbors and subtracting them. If the offsets are out of order, then we'll get a negative length, which we then store as a very large unsigned value. This can cause us to read out-of-bounds memory, as we access the hash data modulo "filter->len * BITS_PER_WORD". We can easily detect this case when loading the individual filters. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

commit-graph: check bounds when accessing BDAT chunk

2023-10-09T22:55:01Z

When loading Bloom filters from a commit-graph file, we use the offset values in the BIDX chunk to index into the memory mapped for the BDAT chunk. But since we don't record how big the BDAT chunk is, we just trust that the BIDX offsets won't cause us to read outside of the chunk memory. A corrupted or malicious commit-graph file will cause us to segfault (in practice this isn't a very interesting attack, since commit-graph files are local-only, and the worst case is an out-of-bounds read). We can't fix this by checking the chunk size during parsing, since the data in the BDAT chunk doesn't have a fixed size (that's why we need the BIDX in the first place). So we'll fix it in two parts: 1. Record the BDAT chunk size during parsing, and then later check that the BIDX offsets we look up are within bounds. 2. Because the offsets are relative to the end of the BDAT header, we must also make sure that the BDAT chunk is at least as large as the expected header size. Otherwise, we overflow when trying to move past the header, even for an offset of "0". We can check this early, during the parsing stage. The error messages are rather verbose, but since this is not something you'd expect to see outside of severe bugs or corruption, it makes sense to err on the side of too many details. Sadly we can't mention the filename during the chunk-parsing stage, as we haven't set g->filename at this point, nor passed it down through the stack. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

commit.h: reduce unnecessary includes

2023-04-24T19:47:33Z

Signed-off-by: Elijah Newren Signed-off-by: Junio C Hamano

git-compat-util.h: use "UNUSED", not "UNUSED(var)"

2022-09-01T17:49:48Z

As reported in [1] the "UNUSED(var)" macro introduced in 2174b8c75de (Merge branch 'jk/unused-annotation' into next, 2022-08-24) breaks coccinelle's parsing of our sources in files where it occurs. Let's instead partially go with the approach suggested in [2] of making this not take an argument. As noted in [1] "coccinelle" will ignore such tokens in argument lists that it doesn't know about, and it's less of a surprise to syntax highlighters. This undoes the "help us notice when a parameter marked as unused is actually use" part of 9b240347543 (git-compat-util: add UNUSED macro, 2022-08-19), a subsequent commit will further tweak the macro to implement a replacement for that functionality. 1. https://lore.kernel.org/git/220825.86ilmg4mil.gmgdl@evledraar.gmail.com/ 2. https://lore.kernel.org/git/220819.868rnk54ju.gmgdl@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason Signed-off-by: Junio C Hamano

hashmap: mark unused callback parameters

2022-08-19T19:18:55Z

Hashmap comparison functions must conform to a particular callback interface, but many don't use all of their parameters. Especially the void cmp_data pointer, but some do not use keydata either (because they can easily form a full struct to pass when doing lookups). Let's mark these to make -Wunused-parameter happy. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

commit-graph: fix corrupt upgrade from generation v1 to v2

2022-07-15T23:51:39Z

The previous commit demonstrates a bug where a commit-graph using generation v2 could enter a state where one of the GDA2 values has its most-significant bit set (indicating that its value should be read from the extended offset table in the GDO2 chunk) without having a GDO2 chunk to read from. This results in the following error message being displayed to the caller: fatal: commit-graph requires overflow generation data but has none This bug arises in the following scenario: - We decide to write a commit-graph using generation number v2, and decide (correctly) that no GDO2 chunk is necessary (e.g., because all of the commiter date offsets are no larger than 2^31-1). - The v2 generation numbers are stored in the `->generation` member of the commit slab holding `struct commit_graph_data`'s. - Later on, `load_commit_graph_info()` is called, overwriting the v2 generation data in the aforementioned slab with any existing v1 generation data. Then, when the commit-graph code goes to write the GDA2 chunk via `write_graph_chunk_generation_data()`, we use the overwritten generation v1 data in a place where we expect to use a v2 generation number: offset = commit_graph_data_at(c)->generation - c->date; ...because `commit_graph_data_at(c)->generation` used to hold the v2 generation data, but it was overwritten to contain the v1 generation number via `load_commit_graph_info()`. If the `offset` computation above overflows the v2 generation number max, then `write_graph_chunk_generation_data()` will update its count of large offsets and write the marker accordingly: if (offset > GENERATION_NUMBER_V2_OFFSET_MAX) { offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows; num_generation_data_overflows++; } and reads will look for the GDO2 chunk containing the overflowing v2 generation number, *after* the commit-graph code decided that no such chunk was necessary. The main problem is that the slab containing `struct commit_graph_data` has a dual purpose. It is used to hold data that we are about to write to disk while generating a commit-graph, as well as hold data that was read from an existing commit-graph. When the two mix, namely when the result of reading the commit-graph has a side-effect that mixes poorly with an in-progress commit-graph write, we end up with corrupt data. A complete fix might be to introduce a new slab that is used exclusively for writing, and gate access between the two slabs based on context provided by the caller (e.g., whether this computation is part of a "read" or "write" operation). But a more minimal fix addresses the only known path which overwrites the slab data, which is `compute_bloom_filters()` -> `get_or_compute_bloom_filter()` -> `load_commit_graph_info()` -> `fill_commit_graph_info()` by avoiding the last call which clobbers the data altogether. This path only needs to learn the graph position of a given commit so that it can be used in `load_bloom_filter_from_graph()`. By replacing the last steps of the above with one that records the graph position into a temporary variable which is then used to load the existing Bloom data, we eliminate the clobbering, removing the corruption. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano

Merge branch 'ah/plugleaks'

2021-05-07T03:47:41Z

Plug various leans reported by LSAN. * ah/plugleaks: builtin/rm: avoid leaking pathspec and seen builtin/rebase: release git_format_patch_opt too builtin/for-each-ref: free filter and UNLEAK sorting. mailinfo: also free strbuf lists when clearing mailinfo builtin/checkout: clear pending objects after diffing builtin/check-ignore: clear_pathspec before returning builtin/bugreport: don't leak prefixed filename branch: FREE_AND_NULL instead of NULL'ing real_ref bloom: clear each bloom_key after use ls-files: free max_prefix when done wt-status: fix multiple small leaks revision: free remainder of old commit list in limit_list

bloom: clear each bloom_key after use

2021-04-28T00:25:44Z

fill_bloom_key() allocates memory into bloom_key, we need to clean that up once the key is no longer needed. This leak was found while running t0002-t0099. Although this leak is happening in code being called from a test-helper, the same code is also used in various locations around git, and can therefore happen during normal usage too. Gabor's analysis shows that peak-memory usage during 'git commit-graph write' is reduced on the order of 10% for a selection of larger repos (along with an even larger reduction if we override modified path bloom filter limits): https://lore.kernel.org/git/20210411072651.GF2947267@szeder.dev/ LSAN output: Direct leak of 308 byte(s) in 11 object(s) allocated from: #0 0x49a5e2 in calloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3 #1 0x6f4032 in xcalloc wrapper.c:140:8 #2 0x4f2905 in fill_bloom_key bloom.c:137:28 #3 0x4f34c1 in get_or_compute_bloom_filter bloom.c:284:4 #4 0x4cb484 in get_bloom_filter_for_commit t/helper/test-bloom.c:43:11 #5 0x4cb072 in cmd__bloom t/helper/test-bloom.c:97:3 #6 0x4ca7ef in cmd_main t/helper/test-tool.c:121:11 #7 0x4caace in main common-main.c:52:11 #8 0x7f798af95349 in __libc_start_main (/lib64/libc.so.6+0x24349) SUMMARY: AddressSanitizer: 308 byte(s) leaked in 11 allocation(s). Signed-off-by: Andrzej Hunt Signed-off-by: Junio C Hamano

use CALLOC_ARRAY

2021-03-14T00:00:09Z

Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe Signed-off-by: Junio C Hamano