<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/bloom.c, branch v2.37.2</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=v2.37.2</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.37.2'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2022-07-15T23:51:39Z</updated>
<entry>
<title>commit-graph: fix corrupt upgrade from generation v1 to v2</title>
<updated>2022-07-15T23:51:39Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2022-07-12T23:10:33Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=9550f6c16a8be18bd4868909d4d5e29d05bd9733'/>
<id>urn:sha1:9550f6c16a8be18bd4868909d4d5e29d05bd9733</id>
<content type='text'>
The previous commit demonstrates a bug where a commit-graph using
generation v2 could enter a state where one of the GDA2 values has its
most-significant bit set (indicating that its value should be read from
the extended offset table in the GDO2 chunk) without having a GDO2 chunk
to read from.

This results in the following error message being displayed to the
caller:

    fatal: commit-graph requires overflow generation data but has none

This bug arises in the following scenario:

  - We decide to write a commit-graph using generation number v2, and
    decide (correctly) that no GDO2 chunk is necessary (e.g., because
    all of the commiter date offsets are no larger than 2^31-1).

  - The v2 generation numbers are stored in the `-&gt;generation` member of
    the commit slab holding `struct commit_graph_data`'s.

  - Later on, `load_commit_graph_info()` is called, overwriting the
    v2 generation data in the aforementioned slab with any existing v1
    generation data.

Then, when the commit-graph code goes to write the GDA2 chunk via
`write_graph_chunk_generation_data()`, we use the overwritten generation
v1 data in a place where we expect to use a v2 generation number:

    offset = commit_graph_data_at(c)-&gt;generation - c-&gt;date;

...because `commit_graph_data_at(c)-&gt;generation` used to hold the v2
generation data, but it was overwritten to contain the v1 generation
number via `load_commit_graph_info()`.

If the `offset` computation above overflows the v2 generation number
max, then `write_graph_chunk_generation_data()` will update its count of
large offsets and write the marker accordingly:

    if (offset &gt; GENERATION_NUMBER_V2_OFFSET_MAX) {
        offset = CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW | num_generation_data_overflows;
        num_generation_data_overflows++;
    }

and reads will look for the GDO2 chunk containing the overflowing v2
generation number, *after* the commit-graph code decided that no such
chunk was necessary.

The main problem is that the slab containing `struct commit_graph_data`
has a dual purpose. It is used to hold data that we are about to write
to disk while generating a commit-graph, as well as hold data that was
read from an existing commit-graph.

When the two mix, namely when the result of reading the commit-graph has
a side-effect that mixes poorly with an in-progress commit-graph write,
we end up with corrupt data.

A complete fix might be to introduce a new slab that is used exclusively
for writing, and gate access between the two slabs based on context
provided by the caller (e.g., whether this computation is part of a
"read" or "write" operation).

But a more minimal fix addresses the only known path which overwrites
the slab data, which is `compute_bloom_filters()` -&gt;
`get_or_compute_bloom_filter()` -&gt; `load_commit_graph_info()` -&gt;
`fill_commit_graph_info()` by avoiding the last call which clobbers the
data altogether.

This path only needs to learn the graph position of a given commit so
that it can be used in `load_bloom_filter_from_graph()`. By replacing
the last steps of the above with one that records the graph position
into a temporary variable which is then used to load the existing Bloom
data, we eliminate the clobbering, removing the corruption.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'ah/plugleaks'</title>
<updated>2021-05-07T03:47:41Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2021-05-07T03:47:41Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=936e58851af0324d408c9e1a70cddd288d892a45'/>
<id>urn:sha1:936e58851af0324d408c9e1a70cddd288d892a45</id>
<content type='text'>
Plug various leans reported by LSAN.

* ah/plugleaks:
  builtin/rm: avoid leaking pathspec and seen
  builtin/rebase: release git_format_patch_opt too
  builtin/for-each-ref: free filter and UNLEAK sorting.
  mailinfo: also free strbuf lists when clearing mailinfo
  builtin/checkout: clear pending objects after diffing
  builtin/check-ignore: clear_pathspec before returning
  builtin/bugreport: don't leak prefixed filename
  branch: FREE_AND_NULL instead of NULL'ing real_ref
  bloom: clear each bloom_key after use
  ls-files: free max_prefix when done
  wt-status: fix multiple small leaks
  revision: free remainder of old commit list in limit_list
</content>
</entry>
<entry>
<title>bloom: clear each bloom_key after use</title>
<updated>2021-04-28T00:25:44Z</updated>
<author>
<name>Andrzej Hunt</name>
<email>ajrhunt@google.com</email>
</author>
<published>2021-04-25T14:16:11Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=b180c681bb911714e2c9735038effe0251d6586e'/>
<id>urn:sha1:b180c681bb911714e2c9735038effe0251d6586e</id>
<content type='text'>
fill_bloom_key() allocates memory into bloom_key, we need to clean that
up once the key is no longer needed.

This leak was found while running t0002-t0099. Although this leak is
happening in code being called from a test-helper, the same code is also
used in various locations around git, and can therefore happen during
normal usage too. Gabor's analysis shows that peak-memory usage during
'git commit-graph write' is reduced on the order of 10% for a selection
of larger repos (along with an even larger reduction if we override
modified path bloom filter limits):
https://lore.kernel.org/git/20210411072651.GF2947267@szeder.dev/

LSAN output:

Direct leak of 308 byte(s) in 11 object(s) allocated from:
    #0 0x49a5e2 in calloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3
    #1 0x6f4032 in xcalloc wrapper.c:140:8
    #2 0x4f2905 in fill_bloom_key bloom.c:137:28
    #3 0x4f34c1 in get_or_compute_bloom_filter bloom.c:284:4
    #4 0x4cb484 in get_bloom_filter_for_commit t/helper/test-bloom.c:43:11
    #5 0x4cb072 in cmd__bloom t/helper/test-bloom.c:97:3
    #6 0x4ca7ef in cmd_main t/helper/test-tool.c:121:11
    #7 0x4caace in main common-main.c:52:11
    #8 0x7f798af95349 in __libc_start_main (/lib64/libc.so.6+0x24349)

SUMMARY: AddressSanitizer: 308 byte(s) leaked in 11 allocation(s).

Signed-off-by: Andrzej Hunt &lt;ajrhunt@google.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>use CALLOC_ARRAY</title>
<updated>2021-03-14T00:00:09Z</updated>
<author>
<name>René Scharfe</name>
<email>l.s.r@web.de</email>
</author>
<published>2021-03-13T16:17:22Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=ca56dadb4b65ccaeab809d80db80a312dc00941a'/>
<id>urn:sha1:ca56dadb4b65ccaeab809d80db80a312dc00941a</id>
<content type='text'>
Add and apply a semantic patch for converting code that open-codes
CALLOC_ARRAY to use it instead.  It shortens the code and infers the
element size automatically.

Signed-off-by: René Scharfe &lt;l.s.r@web.de&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Use new HASHMAP_INIT macro to simplify hashmap initialization</title>
<updated>2020-11-11T20:55:27Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2020-11-11T20:02:20Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=b19315d8ab46ae50410dba9228a4c08ae5c73e38'/>
<id>urn:sha1:b19315d8ab46ae50410dba9228a4c08ae5c73e38</id>
<content type='text'>
Now that hashamp has lazy initialization and a HASHMAP_INIT macro,
hashmaps allocated on the stack can be initialized without a call to
hashmap_init() and in some cases makes the code a bit shorter.  Convert
some callsites over to take advantage of this.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>hashmap: provide deallocation function names</title>
<updated>2020-11-02T20:15:50Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2020-11-02T18:55:05Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=6da1a258142ac2422c8c57c54b92eaed3c86226e'/>
<id>urn:sha1:6da1a258142ac2422c8c57c54b92eaed3c86226e</id>
<content type='text'>
hashmap_free(), hashmap_free_entries(), and hashmap_free_() have existed
for a while, but aren't necessarily the clearest names, especially with
hashmap_partial_clear() being added to the mix and lazy-initialization
now being supported.  Peff suggested we adopt the following names[1]:

  - hashmap_clear() - remove all entries and de-allocate any
    hashmap-specific data, but be ready for reuse

  - hashmap_clear_and_free() - ditto, but free the entries themselves

  - hashmap_partial_clear() - remove all entries but don't deallocate
    table

  - hashmap_partial_clear_and_free() - ditto, but free the entries

This patch provides the new names and converts all existing callers over
to the new naming scheme.

[1] https://lore.kernel.org/git/20201030125059.GA3277724@coredump.intra.peff.net/

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>builtin/commit-graph.c: introduce '--max-new-filters=&lt;n&gt;'</title>
<updated>2020-09-18T17:35:39Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2020-09-18T13:27:27Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=809e0327f579267ea78a1b2f727d3b63c1f5d044'/>
<id>urn:sha1:809e0327f579267ea78a1b2f727d3b63c1f5d044</id>
<content type='text'>
Introduce a command-line flag to specify the maximum number of new Bloom
filters that a 'git commit-graph write' is willing to compute from
scratch.

Prior to this patch, a commit-graph write with '--changed-paths' would
compute Bloom filters for all selected commits which haven't already
been computed (i.e., by a previous commit-graph write with '--split'
such that a roll-up or replacement is performed).

This behavior can cause prohibitively-long commit-graph writes for a
variety of reasons:

  * There may be lots of filters whose diffs take a long time to
    generate (for example, they have close to the maximum number of
    changes, diffing itself takes a long time, etc).

  * Old-style commit-graphs (which encode filters with too many entries
    as not having been computed at all) cause us to waste time
    recomputing filters that appear to have not been computed only to
    discover that they are too-large.

This can make the upper-bound of the time it takes for 'git commit-graph
write --changed-paths' to be rather unpredictable.

To make this command behave more predictably, introduce
'--max-new-filters=&lt;n&gt;' to allow computing at most '&lt;n&gt;' Bloom filters
from scratch. This lets "computing" already-known filters proceed
quickly, while bounding the number of slow tasks that Git is willing to
do.

Helped-by: Junio C Hamano &lt;gitster@pobox.com&gt;
Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>bloom: encode out-of-bounds filters as non-empty</title>
<updated>2020-09-18T04:55:50Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2020-09-18T02:59:44Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=59f0d5073fd9f0497b9ff9a48fb3a7a8d82d1f9d'/>
<id>urn:sha1:59f0d5073fd9f0497b9ff9a48fb3a7a8d82d1f9d</id>
<content type='text'>
When a changed-path Bloom filter has either zero, or more than a
certain number (commonly 512) of entries, the commit-graph machinery
encodes it as "missing". More specifically, it sets the indices adjacent
in the BIDX chunk as equal to each other to indicate a "length 0"
filter; that is, that the filter occupies zero bytes on disk.

This has heretofore been fine, since the commit-graph machinery has no
need to care about these filters with too few or too many changed paths.
Both cases act like no filter has been generated at all, and so there is
no need to store them.

In a subsequent commit, however, the commit-graph machinery will learn
to only compute Bloom filters for some commits in the current
commit-graph layer. This is a change from the current implementation
which computes Bloom filters for all commits that are in the layer being
written. Critically for this patch, only computing some of the Bloom
filters means adding a third state for length 0 Bloom filters: zero
entries, too many entries, or "hasn't been computed".

It will be important for that future patch to distinguish between "not
representable" (i.e., zero or too-many changed paths), and "hasn't been
computed". In particular, we don't want to waste time recomputing
filters that have already been computed.

To that end, change how we store Bloom filters in the "computed but not
representable" category:

  - Bloom filters with no entries are stored as a single byte with all
    bits low (i.e., all queries to that Bloom filter will return
    "definitely not")

  - Bloom filters with too many entries are stored as a single byte with
    all bits set high (i.e., all queries to that Bloom filter will
    return "maybe").

These rules are sufficient to not incur a behavior change by changing
the on-disk representation of these two classes. Likewise, no
specification changes are necessary for the commit-graph format, either:

  - Filters that were previously empty will be recomputed and stored
    according to the new rules, and

  - old clients reading filters generated by new clients will interpret
    the filters correctly and be none the wiser to how they were
    generated.

Clients will invoke the Bloom machinery in more cases than before, but
this can be addressed by returning a NULL filter when all bits are set
high. This can be addressed in a future patch.

Note that this does increase the size of on-disk commit-graphs, but far
less than other proposals. In particular, this is generally more
efficient than storing a bitmap for which commits haven't computed their
Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
whereas storing explicit filters as above incurs a penalty of one byte
per too-large or empty commit.

In practice, these boundary commits likely occupy a small proportion of
the overall number of commits, and so the size penalty is likely smaller
than storing a bitmap for all commits.

See, for example, these relative proportions of such boundary commits
(collected by SZEDER Gábor):

                  |     Percentage of     |    commit-graph   |           |
                  |   commits modifying   |     file size     |           |
                  ├────────┬──────────────┼───────────────────┤    pct.   |
                  | 0 path | &gt;= 512 paths | before  |  after  |   change  |
 ┌────────────────┼────────┼──────────────┼─────────┼─────────┼───────────┤
 | android-base   | 13.20% |        0.13% | 37.468M | 37.534M | +0.1741 % |
 | cmssw          |  0.15% |        0.23% | 17.118M | 17.119M | +0.0091 % |
 | cpython        |  3.07% |        0.01% |  7.967M |  7.971M | +0.0423 % |
 | elasticsearch  |  0.70% |        1.00% |  8.833M |  8.835M | +0.0128 % |
 | gcc            |  0.00% |        0.08% | 16.073M | 16.074M | +0.0030 % |
 | gecko-dev      |  0.14% |        0.64% | 59.868M | 59.874M | +0.0105 % |
 | git            |  0.11% |        0.02% |  3.895M |  3.895M | +0.0020 % |
 | glibc          |  0.02% |        0.10% |  3.555M |  3.555M | +0.0021 % |
 | go             |  0.00% |        0.07% |  3.186M |  3.186M | +0.0018 % |
 | homebrew-cask  |  0.40% |        0.02% |  7.035M |  7.035M | +0.0065 % |
 | homebrew-core  |  0.01% |        0.01% | 11.611M | 11.611M | +0.0002 % |
 | jdk            |  0.26% |        5.64% |  5.537M |  5.540M | +0.0590 % |
 | linux          |  0.01% |        0.51% | 63.735M | 63.740M | +0.0073 % |
 | llvm-project   |  0.12% |        0.03% | 25.515M | 25.516M | +0.0050 % |
 | rails          |  0.10% |        0.10% |  6.252M |  6.252M | +0.0027 % |
 | rust           |  0.07% |        0.17% |  9.364M |  9.364M | +0.0033 % |
 | tensorflow     |  0.09% |        1.02% |  7.009M |  7.010M | +0.0158 % |
 | webkit         |  0.05% |        0.31% | 17.405M | 17.406M | +0.0047 % |

(where the above increase is determined by computing a non-split
commit-graph before and after this patch).

Given that these projects are all "large" by commit count, the storage
cost by writing these filters explicitly is negligible. In the most
extreme example, android-base (which has 494,848 commits at the time of
writing) would have its commit-graph increase by a modest 68.4 KB.

Finally, a test to exercise filters which contain too many changed path
entries will be introduced in a subsequent patch.

Suggested-by: SZEDER Gábor &lt;szeder.dev@gmail.com&gt;
Suggested-by: Jakub Narębski &lt;jnareb@gmail.com&gt;
Helped-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Helped-by: SZEDER Gábor &lt;szeder.dev@gmail.com&gt;
Helped-by: Junio C Hamano &lt;gitster@pobox.com&gt;
Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>bloom/diff: properly short-circuit on max_changes</title>
<updated>2020-09-17T16:31:25Z</updated>
<author>
<name>Derrick Stolee</name>
<email>dstolee@microsoft.com</email>
</author>
<published>2020-09-16T18:07:52Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=b16a8277644e4b1d21c08d97de757105039dc7ae'/>
<id>urn:sha1:b16a8277644e4b1d21c08d97de757105039dc7ae</id>
<content type='text'>
Commit e3696980 (diff: halt tree-diff early after max_changes,
2020-03-30) intended to create a mechanism to short-circuit a diff
calculation after a certain number of paths were modified. By
incrementing a "num_changes" counter throughout the recursive
ll_diff_tree_paths(), this was supposed to match the number of changes
that would be written into the changed-path Bloom filters.
Unfortunately, this was not implemented correctly and instead misses
simple cases like file modifications. This then does not stop very
large changed-path filters from being written (unless they add or remove
many files).

To start, change the implementation in ll_diff_tree_paths() to instead
use the global diff_queue_diff struct's 'nr' member as the count. This
is a way to simplify the logic instead of making more mistakes in the
complicated diff code.

This has a drawback: the diff_queue_diff struct only lists the paths
corresponding to blob changes, not their leading directories. Thus,
get_or_compute_bloom_filter() needs an additional check to see if the
hashmap with the leading directories becomes too large.

One reason why this was not caught by test cases was that the test in
t4216-log-bloom.sh that was supposed to check this "too many changes"
condition only checked this on the initial commit of a repository. The
old logic counted these values correctly. Update this test in a few
ways:

1. Use GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS to reduce the limit,
   allowing smaller commits to engage with this logic.

2. Create several interesting cases of edits, adds, removes, and mode
   changes (in the second commit). By testing both sides of the
   inequality with the *_MAX_CHANGED_PATHS variable, we can see that
   the count is exactly correct, so none of these changes are missed
   or over-counted.

3. Use the trace2 data value filter_found_large to verify that these
   commits are on the correct side of the limit.

Another way to verify the behavior is correct is through performance
tests. By testing on my local copies of the Git repository and the Linux
kernel repository, I could measure the effect of these short-circuits
when computing a fresh commit-graph file with changed-path Bloom filters
using the command

  GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=N time \
    git commit-graph write --reachable --changed-paths

and reporting the wall time and resulting commit-graph size.

For Git, the results are

|        |      N=1       |       N=10     |      N=512     |
|--------|----------------|----------------|----------------|
| HEAD~1 | 10.90s  9.18MB | 11.11s  9.34MB | 11.31s  9.35MB |
| HEAD   |  9.21s  8.62MB | 11.11s  9.29MB | 11.29s  9.34MB |

For Linux, the results are

|        |       N=1      |     N=20      |     N=512     |
|--------|----------------|---------------|---------------|
| HEAD~1 | 61.28s  64.3MB | 76.9s  72.6MB | 77.6s  72.6MB |
| HEAD   | 49.44s  56.3MB | 68.7s  65.9MB | 69.2s  65.9MB |

Naturally, the improvement becomes much less as the limit grows, as
fewer commits satisfy the short-circuit.

Reported-by: SZEDER Gábor &lt;szeder.dev@gmail.com&gt;
Signed-off-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>bloom: use provided 'struct bloom_filter_settings'</title>
<updated>2020-09-17T16:31:25Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2020-09-16T18:07:46Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=9a7a9ed10d56d6c22a0f16d7baf3f9895c47d693'/>
<id>urn:sha1:9a7a9ed10d56d6c22a0f16d7baf3f9895c47d693</id>
<content type='text'>
When 'get_or_compute_bloom_filter()' needs to compute a Bloom filter
from scratch, it looks to the default 'struct bloom_filter_settings' in
order to determine the maximum number of changed paths, number of bits
per entry, and so on.

All of these values have so far been constant, and so there was no need
to pass in a pointer from the caller (eg., the one that is stored in the
'struct write_commit_graph_context').

Start passing in a 'struct bloom_filter_settings *' instead of using the
default values to respect graph-specific settings (eg., in the case of
setting 'GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS').

In order to have an initialized value for these settings, move its
initialization to earlier in the commit-graph write.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
