<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/t/t5324-split-commit-graph.sh, branch v2.36.2</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=v2.36.2</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.36.2'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2022-03-01T20:15:06Z</updated>
<entry>
<title>commit-graph: start parsing generation v2 (again)</title>
<updated>2022-03-01T20:15:06Z</updated>
<author>
<name>Derrick Stolee</name>
<email>derrickstolee@github.com</email>
</author>
<published>2022-03-01T19:48:31Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=3b0199d4c3c9cc2ec39413c52c15cfd16e4d0980'/>
<id>urn:sha1:3b0199d4c3c9cc2ec39413c52c15cfd16e4d0980</id>
<content type='text'>
The 'read_generation_data' member of 'struct commit_graph' was
introduced by 1fdc383c5 (commit-graph: use generation v2 only if entire
chain does, 2021-01-16). The intention was to avoid using corrected
commit dates if not all layers of a commit-graph had that data stored.
The logic in validate_mixed_generation_chain() at that point incorrectly
initialized read_generation_data to 1 if and only if the tip
commit-graph contained the Corrected Commit Date chunk.

This was "fixed" in 448a39e65 (commit-graph: validate layers for
generation data, 2021-02-02) to validate that read_generation_data was
either non-zero for all layers, or it would set read_generation_data to
zero for all layers.

The problem here is that read_generation_data is not initialized to be
non-zero anywhere!

This change initializes read_generation_data immediately after the chunk
is parsed, so each layer will have its value present as soon as
possible.

The read_generation_data member is used in fill_commit_graph_info() to
determine if we should use the corrected commit date or the topological
levels stored in the Commit Data chunk. Due to this bug, all previous
versions of Git were defaulting to topological levels in all cases!

This can be measured with some performance tests. Using the Linux kernel
as a testbed, I generated a complete commit-graph containing corrected
commit dates and tested the 'new' version against the previous, 'old'
version.

First, rev-list with --topo-order demonstrates a 26% improvement using
corrected commit dates:

hyperfine \
	-n "old" "$OLD_GIT rev-list --topo-order -1000 v3.6" \
	-n "new" "$NEW_GIT rev-list --topo-order -1000 v3.6" \
	--warmup=10

Benchmark 1: old
  Time (mean ± σ):      57.1 ms ±   3.1 ms
  Range (min … max):    52.9 ms …  62.0 ms    55 runs

Benchmark 2: new
  Time (mean ± σ):      45.5 ms ±   3.3 ms
  Range (min … max):    39.9 ms …  51.7 ms    59 runs

Summary
  'new' ran
    1.26 ± 0.11 times faster than 'old'

These performance improvements are due to the algorithmic improvements
given by walking fewer commits due to the higher cutoffs from corrected
commit dates.

However, this comes at a cost. The additional I/O cost of parsing the
corrected commit dates is visible in case of merge-base commands that do
not reduce the overall number of walked commits.

hyperfine \
        -n "old" "$OLD_GIT merge-base v4.8 v4.9" \
        -n "new" "$NEW_GIT merge-base v4.8 v4.9" \
        --warmup=10

Benchmark 1: old
  Time (mean ± σ):     110.4 ms ±   6.4 ms
  Range (min … max):    96.0 ms … 118.3 ms    25 runs

Benchmark 2: new
  Time (mean ± σ):     150.7 ms ±   1.1 ms
  Range (min … max):   149.3 ms … 153.4 ms    19 runs

Summary
  'old' ran
    1.36 ± 0.08 times faster than 'new'

Performance issues like this are what motivated 702110aac (commit-graph:
use config to specify generation type, 2021-02-25).

In the future, we could fix this performance problem by inserting the
corrected commit date offsets into the Commit Date chunk instead of
having that data in an extra chunk.

Signed-off-by: Derrick Stolee &lt;derrickstolee@github.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>test-read-graph: include extra post-parse info</title>
<updated>2022-03-01T20:09:55Z</updated>
<author>
<name>Derrick Stolee</name>
<email>derrickstolee@github.com</email>
</author>
<published>2022-03-01T19:48:28Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=c78c7a959cacf20eb55a694ed9e20921afb306cf'/>
<id>urn:sha1:c78c7a959cacf20eb55a694ed9e20921afb306cf</id>
<content type='text'>
It can be helpful to verify that the 'struct commit_graph' that results
from parsing a commit-graph is correctly structured. The existence of
different chunks is not enough to verify that all of the optional
features are correctly enabled.

Update 'test-tool read-graph' to output an "options:" line that includes
information for different parts of the struct commit_graph.

In particular, this change demonstrates that the read_generation_data
option is never being enabled, which will be fixed in a later change.

Signed-off-by: Derrick Stolee &lt;derrickstolee@github.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph tests: fix another graph_git_two_modes() helper</title>
<updated>2021-10-15T16:21:30Z</updated>
<author>
<name>Ævar Arnfjörð Bjarmason</name>
<email>avarab@gmail.com</email>
</author>
<published>2021-10-14T23:37:15Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=a046aa38ca9e0b62dc847a00a1bd8efaa682ef31'/>
<id>urn:sha1:a046aa38ca9e0b62dc847a00a1bd8efaa682ef31</id>
<content type='text'>
In 135a7123755 (commit-graph: add --split option to builtin,
2019-06-18) this function was copy/pasted to the split commit-graph
tests, as in the preceding commit we need to fix this to use
&amp;&amp;-chaining, so it won't be hiding errors.

Unlike its sister function in "t5318-commit-graph.sh", which we got
lucky with, this one was hiding a real test failure. A tests added in
c523035cbd8 (commit-graph: allow cross-alternate chains, 2019-06-18)
has never worked as intended. Unlike most other graph_git_behavior
uses in this file it clones the repository into a sub-directory, so
we'll need to refer to "commits/6" as "origin/commits/6".

It's not easy to simply move the "graph_git_behavior" to the test
above it, since it itself spawns a "test_expect_success". Let's
instead add support to "graph_git_behavior()" and
"graph_git_two_modes()" to pass a "-C" argument to git.

We also need to add a "test -d fork" here, because otherwise we'll
fail on e.g.:

    GIT_SKIP_TESTS=t5324.13 ./t5324-split-commit-graph.sh

Signed-off-by: Ævar Arnfjörð Bjarmason &lt;avarab@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: use config to specify generation type</title>
<updated>2021-02-25T23:10:41Z</updated>
<author>
<name>Derrick Stolee</name>
<email>dstolee@microsoft.com</email>
</author>
<published>2021-02-25T18:19:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=702110aac63556b4572d9c7b65c9123ec8038ebf'/>
<id>urn:sha1:702110aac63556b4572d9c7b65c9123ec8038ebf</id>
<content type='text'>
We have two established generation number versions:

 1: topological levels
 2: corrected commit dates

The corrected commit dates are enabled by default, but they also write
extra data in the GDAT and GDOV chunks. Services that host Git data
might want to have more control over when this feature rolls out than
just updating the Git binaries.

Add a new "commitGraph.generationVersion" config option that specifies
the intended generation number version. If this value is less than 2,
then the GDAT chunk is never written _or read_ from an existing file.

This can replace our use of the GIT_TEST_COMMIT_GRAPH_NO_GDAT
environment variable in the test suite. Remove it.

Signed-off-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: use generation v2 only if entire chain does</title>
<updated>2021-01-19T00:21:18Z</updated>
<author>
<name>Abhishek Kumar</name>
<email>abhishekkumar8222@gmail.com</email>
</author>
<published>2021-01-16T18:11:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=1fdc383c5c87b6f2bcacddd07d5d45dd04c2d146'/>
<id>urn:sha1:1fdc383c5c87b6f2bcacddd07d5d45dd04c2d146</id>
<content type='text'>
Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

If each layer of split commit-graph is treated independently, as it was
the case before this commit, with Git inspecting only the current layer
for chunk_generation_data pointer, commits in the lower layer (one with
GDAT) whould have corrected commit date as their generation number,
while commits in the upper layer would have topological levels as their
generation. Corrected commit dates usually have much larger values than
topological levels. This means that if we take two commits, one from the
upper layer, and one reachable from it in the lower layer, then the
expectation that the generation of a parent is smaller than the
generation of a child would be violated.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

Therefore, When writing the new layer in split commit-graph, we write a
GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
that if a layer has GDAT chunk, all lower layers must have a GDAT chunk
as well.

Rewriting layers follows similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain GDAT chunk, then the result of rewrite
does not have GDAT chunks either.

Signed-off-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Abhishek Kumar &lt;abhishekkumar8222@gmail.com&gt;
Reviewed-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Reviewed-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: implement generation data chunk</title>
<updated>2021-01-19T00:21:18Z</updated>
<author>
<name>Abhishek Kumar</name>
<email>abhishekkumar8222@gmail.com</email>
</author>
<published>2021-01-16T18:11:15Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=e8b63005c48696a26f976f5f9b0ccaf1983e439d'/>
<id>urn:sha1:e8b63005c48696a26f976f5f9b0ccaf1983e439d</id>
<content type='text'>
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.

We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).

We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.

To minimize the space required to store corrrected commit date, Git
stores corrected commit date offsets into the commit-graph file, instea
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4-bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:

We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.

If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.

We test the overflow-related code with the following repo history:

           F - N - U
          /         \
U - N - U            N
         \          /
	  N - F - N

Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.

The largest offset observed is 2 ^ 31, just large enough to overflow.

[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

Signed-off-by: Abhishek Kumar &lt;abhishekkumar8222@gmail.com&gt;
Reviewed-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Reviewed-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'ds/commit-graph-merging-fix'</title>
<updated>2020-11-02T21:17:39Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2020-11-02T21:17:39Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=307a53dd9914f65c9b6399221574e24234e4b49f'/>
<id>urn:sha1:307a53dd9914f65c9b6399221574e24234e4b49f</id>
<content type='text'>
When "git commit-graph" detects the same commit recorded more than
once while it is merging the layers, it used to die.  The code now
ignores all but one of them and continues.

* ds/commit-graph-merging-fix:
  commit-graph: don't write commit-graph when disabled
  commit-graph: ignore duplicates when merging layers
</content>
</entry>
<entry>
<title>commit-graph: don't write commit-graph when disabled</title>
<updated>2020-10-09T21:16:32Z</updated>
<author>
<name>Derrick Stolee</name>
<email>dstolee@microsoft.com</email>
</author>
<published>2020-10-09T20:53:52Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=85102ac71b98466eaa2b9b5a568c3a1de736202d'/>
<id>urn:sha1:85102ac71b98466eaa2b9b5a568c3a1de736202d</id>
<content type='text'>
The core.commitGraph config setting can be set to 'false' to prevent
parsing commits from the commit-graph file(s). This causes an issue when
trying to write with "--split" which needs to distinguish between
commits that are in the existing commit-graph layers and commits that
are not. The existing mechanism uses parse_commit() and follows by
checking if there is a 'graph_pos' that shows the commit was parsed from
the commit-graph file.

When core.commitGraph=false, we do not parse the commits from the
commit-graph and 'graph_pos' indicates that no commits are in the
existing file. The --split logic moves forward creating a new layer on
top that holds all reachable commits, then possibly merges down into
those layers, resulting in duplicate commits. The previous change makes
that merging process more robust to such a situation in case it happens
in the written commit-graph data.

The easy answer here is to avoid writing a commit-graph if reading the
commit-graph is disabled. Since the resulting commit-graph will would not
be read by subsequent Git processes. This is more natural than forcing
core.commitGraph to be true for the 'write' process.

Reported-by: Thomas Braun &lt;thomas.braun@virtuell-zuhause.de&gt;
Helped-by: Jeff King &lt;peff@peff.net&gt;
Helped-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: ignore duplicates when merging layers</title>
<updated>2020-10-09T21:16:23Z</updated>
<author>
<name>Derrick Stolee</name>
<email>dstolee@microsoft.com</email>
</author>
<published>2020-10-09T20:53:51Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=150f11574bcb53a6880a03593ff46d6b25fa03a1'/>
<id>urn:sha1:150f11574bcb53a6880a03593ff46d6b25fa03a1</id>
<content type='text'>
Thomas reported [1] that a "git fetch" command was failing with an error
saying "unexpected duplicate commit id". The root cause is that they had
fetch.writeCommitGraph enabled which generates commit-graph chains, and
this instance was merging two layers that both contained the same commit
ID.

[1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@virtuell-zuhause.de/

The initial assumption is that Git would not write a commit ID into a
commit-graph layer if it already exists in a lower commit-graph layer.
Somehow, this specific case did get into that situation, leading to this
error.

While unexpected, this isn't actually invalid (as long as the two layers
agree on the metadata for the commit). When we parse a commit that does
not have a graph_pos in the commit_graph_data_slab, we use binary search
in the commit-graph layers to find the commit and set graph_pos. That
position is never used again in this case. However, when we parse a
commit from the commit-graph file, we load its parents from the
commit-graph and assign graph_pos at that point. If those parents were
already parsed from the commit-graph, then nothing needs to be done.
Otherwise, this graph_pos is a valid position in the commit-graph so we
can parse the parents, when necessary.

Thus, this die() is too aggressive. The easiest thing to do would be to
ignore the duplicates.

If we only ignore the duplicates, then we will produce a commit-graph
that has identical commit IDs listed in adjacent positions. This excess
data will never be removed from the commit-graph, which could cascade
into significantly bloated file sizes.

Thankfully, we can collapse the list to erase the duplicate commit
pointers. This allows us to get the end result we want without extra
memory costs and minimal CPU time.

The root cause is due to disabling core.commitGraph, which prevents
parsing commits from the lower layers during a 'git commit-graph write
--split' command. Since we use the 'graph_pos' value to determine
whether a commit is in a lower layer, we never discover that those
commits are already in the commit-graph chain and add them to the top
layer. This layer is then merged down, creating duplicates.

The test added in t5324-split-commit-graph.sh fails without this change.
However, we still have not completely removed the need for this
duplicate check. That will come in a follow-up change.

Reported-by: Thomas Braun &lt;thomas.braun@virtuell-zuhause.de&gt;
Helped-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Co-authored-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'tb/bloom-improvements'</title>
<updated>2020-09-29T21:01:20Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2020-09-29T21:01:20Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=288ed98bf768f4df9b569d51a52c233a1402c0f5'/>
<id>urn:sha1:288ed98bf768f4df9b569d51a52c233a1402c0f5</id>
<content type='text'>
"git commit-graph write" learned to limit the number of bloom
filters that are computed from scratch with the --max-new-filters
option.

* tb/bloom-improvements:
  commit-graph: introduce 'commitGraph.maxNewFilters'
  builtin/commit-graph.c: introduce '--max-new-filters=&lt;n&gt;'
  commit-graph: rename 'split_commit_graph_opts'
  bloom: encode out-of-bounds filters as non-empty
  bloom/diff: properly short-circuit on max_changes
  bloom: use provided 'struct bloom_filter_settings'
  bloom: split 'get_bloom_filter()' in two
  commit-graph.c: store maximum changed paths
  commit-graph: respect 'commitGraph.readChangedPaths'
  t/helper/test-read-graph.c: prepare repo settings
  commit-graph: pass a 'struct repository *' in more places
  t4216: use an '&amp;&amp;'-chain
  commit-graph: introduce 'get_bloom_filter_settings()'
</content>
</entry>
</feed>
