git/t/t5324-split-commit-graph.sh, branch v2.36.2

commit-graph: start parsing generation v2 (again)

2022-03-01T20:15:06Z

The 'read_generation_data' member of 'struct commit_graph' was introduced by 1fdc383c5 (commit-graph: use generation v2 only if entire chain does, 2021-01-16). The intention was to avoid using corrected commit dates if not all layers of a commit-graph had that data stored. The logic in validate_mixed_generation_chain() at that point incorrectly initialized read_generation_data to 1 if and only if the tip commit-graph contained the Corrected Commit Date chunk. This was "fixed" in 448a39e65 (commit-graph: validate layers for generation data, 2021-02-02) to validate that read_generation_data was either non-zero for all layers, or it would set read_generation_data to zero for all layers. The problem here is that read_generation_data is not initialized to be non-zero anywhere! This change initializes read_generation_data immediately after the chunk is parsed, so each layer will have its value present as soon as possible. The read_generation_data member is used in fill_commit_graph_info() to determine if we should use the corrected commit date or the topological levels stored in the Commit Data chunk. Due to this bug, all previous versions of Git were defaulting to topological levels in all cases! This can be measured with some performance tests. Using the Linux kernel as a testbed, I generated a complete commit-graph containing corrected commit dates and tested the 'new' version against the previous, 'old' version. First, rev-list with --topo-order demonstrates a 26% improvement using corrected commit dates: hyperfine \ -n "old" "$OLD_GIT rev-list --topo-order -1000 v3.6" \ -n "new" "$NEW_GIT rev-list --topo-order -1000 v3.6" \ --warmup=10 Benchmark 1: old Time (mean ± σ): 57.1 ms ± 3.1 ms Range (min … max): 52.9 ms … 62.0 ms 55 runs Benchmark 2: new Time (mean ± σ): 45.5 ms ± 3.3 ms Range (min … max): 39.9 ms … 51.7 ms 59 runs Summary 'new' ran 1.26 ± 0.11 times faster than 'old' These performance improvements are due to the algorithmic improvements given by walking fewer commits due to the higher cutoffs from corrected commit dates. However, this comes at a cost. The additional I/O cost of parsing the corrected commit dates is visible in case of merge-base commands that do not reduce the overall number of walked commits. hyperfine \ -n "old" "$OLD_GIT merge-base v4.8 v4.9" \ -n "new" "$NEW_GIT merge-base v4.8 v4.9" \ --warmup=10 Benchmark 1: old Time (mean ± σ): 110.4 ms ± 6.4 ms Range (min … max): 96.0 ms … 118.3 ms 25 runs Benchmark 2: new Time (mean ± σ): 150.7 ms ± 1.1 ms Range (min … max): 149.3 ms … 153.4 ms 19 runs Summary 'old' ran 1.36 ± 0.08 times faster than 'new' Performance issues like this are what motivated 702110aac (commit-graph: use config to specify generation type, 2021-02-25). In the future, we could fix this performance problem by inserting the corrected commit date offsets into the Commit Date chunk instead of having that data in an extra chunk. Signed-off-by: Derrick Stolee Signed-off-by: Junio C Hamano

test-read-graph: include extra post-parse info

2022-03-01T20:09:55Z

It can be helpful to verify that the 'struct commit_graph' that results from parsing a commit-graph is correctly structured. The existence of different chunks is not enough to verify that all of the optional features are correctly enabled. Update 'test-tool read-graph' to output an "options:" line that includes information for different parts of the struct commit_graph. In particular, this change demonstrates that the read_generation_data option is never being enabled, which will be fixed in a later change. Signed-off-by: Derrick Stolee Signed-off-by: Junio C Hamano

commit-graph tests: fix another graph_git_two_modes() helper

2021-10-15T16:21:30Z

In 135a7123755 (commit-graph: add --split option to builtin, 2019-06-18) this function was copy/pasted to the split commit-graph tests, as in the preceding commit we need to fix this to use &&-chaining, so it won't be hiding errors. Unlike its sister function in "t5318-commit-graph.sh", which we got lucky with, this one was hiding a real test failure. A tests added in c523035cbd8 (commit-graph: allow cross-alternate chains, 2019-06-18) has never worked as intended. Unlike most other graph_git_behavior uses in this file it clones the repository into a sub-directory, so we'll need to refer to "commits/6" as "origin/commits/6". It's not easy to simply move the "graph_git_behavior" to the test above it, since it itself spawns a "test_expect_success". Let's instead add support to "graph_git_behavior()" and "graph_git_two_modes()" to pass a "-C" argument to git. We also need to add a "test -d fork" here, because otherwise we'll fail on e.g.: GIT_SKIP_TESTS=t5324.13 ./t5324-split-commit-graph.sh Signed-off-by: Ævar Arnfjörð Bjarmason Signed-off-by: Junio C Hamano

commit-graph: use config to specify generation type

2021-02-25T23:10:41Z

We have two established generation number versions: 1: topological levels 2: corrected commit dates The corrected commit dates are enabled by default, but they also write extra data in the GDAT and GDOV chunks. Services that host Git data might want to have more control over when this feature rolls out than just updating the Git binaries. Add a new "commitGraph.generationVersion" config option that specifies the intended generation number version. If this value is less than 2, then the GDAT chunk is never written _or read_ from an existing file. This can replace our use of the GIT_TEST_COMMIT_GRAPH_NO_GDAT environment variable in the test suite. Remove it. Signed-off-by: Derrick Stolee Signed-off-by: Junio C Hamano

commit-graph: use generation v2 only if entire chain does

2021-01-19T00:21:18Z

Since there are released versions of Git that understand generation numbers in the commit-graph's CDAT chunk but do not understand the GDAT chunk, the following scenario is possible: 1. "New" Git writes a commit-graph with the GDAT chunk. 2. "Old" Git writes a split commit-graph on top without a GDAT chunk. If each layer of split commit-graph is treated independently, as it was the case before this commit, with Git inspecting only the current layer for chunk_generation_data pointer, commits in the lower layer (one with GDAT) whould have corrected commit date as their generation number, while commits in the upper layer would have topological levels as their generation. Corrected commit dates usually have much larger values than topological levels. This means that if we take two commits, one from the upper layer, and one reachable from it in the lower layer, then the expectation that the generation of a parent is smaller than the generation of a child would be violated. It is difficult to expose this issue in a test. Since we _start_ with artificially low generation numbers, any commit walk that prioritizes generation numbers will walk all of the commits with high generation number before walking the commits with low generation number. In all the cases I tried, the commit-graph layers themselves "protect" any incorrect behavior since none of the commits in the lower layer can reach the commits in the upper layer. This issue would manifest itself as a performance problem in this case, especially with something like "git log --graph" since the low generation numbers would cause the in-degree queue to walk all of the commits in the lower layer before allowing the topo-order queue to write anything to output (depending on the size of the upper layer). Therefore, When writing the new layer in split commit-graph, we write a GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees that if a layer has GDAT chunk, all lower layers must have a GDAT chunk as well. Rewriting layers follows similar approach: if the topmost layer below the set of layers being rewritten (in the split commit-graph chain) exists, and it does not contain GDAT chunk, then the result of rewrite does not have GDAT chunks either. Signed-off-by: Derrick Stolee Signed-off-by: Abhishek Kumar Reviewed-by: Taylor Blau Reviewed-by: Derrick Stolee Signed-off-by: Junio C Hamano

commit-graph: implement generation data chunk

2021-01-19T00:21:18Z

As discovered by Ævar, we cannot increment graph version to distinguish between generation numbers v1 and v2 [1]. Thus, one of pre-requistes before implementing generation number v2 was to distinguish between graph versions in a backwards compatible manner. We are going to introduce a new chunk called Generation DATa chunk (or GDAT). GDAT will store corrected committer date offsets whereas CDAT will still store topological level. Old Git does not understand GDAT chunk and would ignore it, reading topological levels from CDAT. New Git can parse GDAT and take advantage of newer generation numbers, falling back to topological levels when GDAT chunk is missing (as it would happen with a commit-graph written by old Git). We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT' which forces commit-graph file to be written without generation data chunk to emulate a commit-graph file written by old Git. To minimize the space required to store corrrected commit date, Git stores corrected commit date offsets into the commit-graph file, instea of corrected commit dates. This saves us 4 bytes per commit, decreasing the GDAT chunk size by half, but it's possible for the offset to overflow the 4-bytes allocated for storage. As such overflows are and should be exceedingly rare, we use the following overflow management scheme: We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV') to store corrected commit dates for commits with offsets greater than GENERATION_NUMBER_V2_OFFSET_MAX. If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set the MSB of the offset and the other bits store the position of corrected commit date in GDOV chunk, similar to how Extra Edge List is maintained. We test the overflow-related code with the following repo history: F - N - U / \ U - N - U N \ / N - F - N Where the commits denoted by U have committer date of zero seconds since Unix epoch, the commits denoted by N have committer date of 1112354055 (default committer date for the test suite) seconds since Unix epoch and the commits denoted by F have committer date of (2 ^ 31 - 2) seconds since Unix epoch. The largest offset observed is 2 ^ 31, just large enough to overflow. [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/ Signed-off-by: Abhishek Kumar Reviewed-by: Taylor Blau Reviewed-by: Derrick Stolee Signed-off-by: Junio C Hamano

Merge branch 'ds/commit-graph-merging-fix'

2020-11-02T21:17:39Z

When "git commit-graph" detects the same commit recorded more than once while it is merging the layers, it used to die. The code now ignores all but one of them and continues. * ds/commit-graph-merging-fix: commit-graph: don't write commit-graph when disabled commit-graph: ignore duplicates when merging layers

commit-graph: don't write commit-graph when disabled

2020-10-09T21:16:32Z

The core.commitGraph config setting can be set to 'false' to prevent parsing commits from the commit-graph file(s). This causes an issue when trying to write with "--split" which needs to distinguish between commits that are in the existing commit-graph layers and commits that are not. The existing mechanism uses parse_commit() and follows by checking if there is a 'graph_pos' that shows the commit was parsed from the commit-graph file. When core.commitGraph=false, we do not parse the commits from the commit-graph and 'graph_pos' indicates that no commits are in the existing file. The --split logic moves forward creating a new layer on top that holds all reachable commits, then possibly merges down into those layers, resulting in duplicate commits. The previous change makes that merging process more robust to such a situation in case it happens in the written commit-graph data. The easy answer here is to avoid writing a commit-graph if reading the commit-graph is disabled. Since the resulting commit-graph will would not be read by subsequent Git processes. This is more natural than forcing core.commitGraph to be true for the 'write' process. Reported-by: Thomas Braun Helped-by: Jeff King Helped-by: Taylor Blau Signed-off-by: Derrick Stolee Signed-off-by: Junio C Hamano

commit-graph: ignore duplicates when merging layers

2020-10-09T21:16:23Z

Thomas reported [1] that a "git fetch" command was failing with an error saying "unexpected duplicate commit id". The root cause is that they had fetch.writeCommitGraph enabled which generates commit-graph chains, and this instance was merging two layers that both contained the same commit ID. [1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@virtuell-zuhause.de/ The initial assumption is that Git would not write a commit ID into a commit-graph layer if it already exists in a lower commit-graph layer. Somehow, this specific case did get into that situation, leading to this error. While unexpected, this isn't actually invalid (as long as the two layers agree on the metadata for the commit). When we parse a commit that does not have a graph_pos in the commit_graph_data_slab, we use binary search in the commit-graph layers to find the commit and set graph_pos. That position is never used again in this case. However, when we parse a commit from the commit-graph file, we load its parents from the commit-graph and assign graph_pos at that point. If those parents were already parsed from the commit-graph, then nothing needs to be done. Otherwise, this graph_pos is a valid position in the commit-graph so we can parse the parents, when necessary. Thus, this die() is too aggressive. The easiest thing to do would be to ignore the duplicates. If we only ignore the duplicates, then we will produce a commit-graph that has identical commit IDs listed in adjacent positions. This excess data will never be removed from the commit-graph, which could cascade into significantly bloated file sizes. Thankfully, we can collapse the list to erase the duplicate commit pointers. This allows us to get the end result we want without extra memory costs and minimal CPU time. The root cause is due to disabling core.commitGraph, which prevents parsing commits from the lower layers during a 'git commit-graph write --split' command. Since we use the 'graph_pos' value to determine whether a commit is in a lower layer, we never discover that those commits are already in the commit-graph chain and add them to the top layer. This layer is then merged down, creating duplicates. The test added in t5324-split-commit-graph.sh fails without this change. However, we still have not completely removed the need for this duplicate check. That will come in a follow-up change. Reported-by: Thomas Braun Helped-by: Taylor Blau Co-authored-by: Jeff King Signed-off-by: Derrick Stolee Signed-off-by: Junio C Hamano

Merge branch 'tb/bloom-improvements'

2020-09-29T21:01:20Z

"git commit-graph write" learned to limit the number of bloom filters that are computed from scratch with the --max-new-filters option. * tb/bloom-improvements: commit-graph: introduce 'commitGraph.maxNewFilters' builtin/commit-graph.c: introduce '--max-new-filters=' commit-graph: rename 'split_commit_graph_opts' bloom: encode out-of-bounds filters as non-empty bloom/diff: properly short-circuit on max_changes bloom: use provided 'struct bloom_filter_settings' bloom: split 'get_bloom_filter()' in two commit-graph.c: store maximum changed paths commit-graph: respect 'commitGraph.readChangedPaths' t/helper/test-read-graph.c: prepare repo settings commit-graph: pass a 'struct repository *' in more places t4216: use an '&&'-chain commit-graph: introduce 'get_bloom_filter_settings()'