<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/commit-graph.h, branch v2.40.2</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=v2.40.2</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.40.2'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2022-08-03T20:36:08Z</updated>
<entry>
<title>Merge branch 'tb/commit-graph-genv2-upgrade-fix'</title>
<updated>2022-08-03T20:36:08Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2022-08-03T20:36:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=37e4bdd5ee5d6a7e09feaf5857299aac8fd56aeb'/>
<id>urn:sha1:37e4bdd5ee5d6a7e09feaf5857299aac8fd56aeb</id>
<content type='text'>
There was a bug in the codepath to upgrade generation information
in commit-graph from v1 to v2 format, which has been corrected.

* tb/commit-graph-genv2-upgrade-fix:
  commit-graph: fix corrupt upgrade from generation v1 to v2
  commit-graph: introduce `repo_find_commit_pos_in_graph()`
  t5318: demonstrate commit-graph generation v2 corruption
</content>
</entry>
<entry>
<title>commit-graph: introduce `repo_find_commit_pos_in_graph()`</title>
<updated>2022-07-15T23:51:39Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2022-07-12T23:10:31Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=7805360b7a3be02057385bc9d17aa493120b9538'/>
<id>urn:sha1:7805360b7a3be02057385bc9d17aa493120b9538</id>
<content type='text'>
Low-level callers in systems that are adjacent to the commit-graph (like
the changed-path Bloom filter code) could benefit from being able to
call a function like `parse_commit_in_graph()` without modifying the
corresponding commit slab data.

This is useful in contexts where that slab data is being used to prepare
for an upcoming commit-graph write, where Git must be careful to avoid
clobbering any of that data during a read operation.

Introduce a low-level variant of `parse_commit_in_graph()` which returns
the graph position of a given commit only, without modifying any of the
slab data.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: pass repo_settings instead of repository</title>
<updated>2022-07-14T22:42:17Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2022-07-14T21:43:06Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=a92d8523cef66d46e24cd5ef2f01ef97dc4ab239'/>
<id>urn:sha1:a92d8523cef66d46e24cd5ef2f01ef97dc4ab239</id>
<content type='text'>
The parse_commit_graph() function takes a 'struct repository *' pointer,
but it only ever accesses config settings (either directly or through
the .settings field of the repo struct). Move all relevant config
settings into the repo_settings struct, and update parse_commit_graph()
and its existing callers so that it takes 'struct repo_settings *'
instead.

Callers of parse_commit_graph() will now need to call
prepare_repo_settings() themselves, or initialize a 'struct
repo_settings' directly.

Prior to ab14d0676c (commit-graph: pass a 'struct repository *' in more
places, 2020-09-09), parsing a commit-graph was a pure function
depending only on the contents of the commit-graph itself. Commit
ab14d0676c introduced a dependency on a `struct repository` pointer, and
later commits such as b66d84756f (commit-graph: respect
'commitGraph.readChangedPaths', 2020-09-09) added dependencies on config
settings, which were accessed through the `settings` field of the
repository pointer. This field was initialized via a call to
`prepare_repo_settings()`.

Additionally, this fixes an issue in fuzz-commit-graph: In 44c7e62
(2021-12-06, repo-settings:prepare_repo_settings only in git repos),
prepare_repo_settings was changed to issue a BUG() if it is called by a
process whose CWD is not a Git repository.

The combination of commits mentioned above broke fuzz-commit-graph,
which attempts to parse arbitrary fuzzing-engine-provided bytes as a
commit graph file. Prior to this change, parse_commit_graph() called
prepare_repo_settings(), but since we run the fuzz tests without a valid
repository, we are hitting the BUG() from 44c7e62 for every test case.

Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Josh Steadmon &lt;steadmon@google.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: fix memory leak in misused string_list API</title>
<updated>2022-03-04T21:24:18Z</updated>
<author>
<name>Ævar Arnfjörð Bjarmason</name>
<email>avarab@gmail.com</email>
</author>
<published>2022-03-04T18:32:12Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=4a0479086a9bea5d31c4588b07bd45ae92a12b71'/>
<id>urn:sha1:4a0479086a9bea5d31c4588b07bd45ae92a12b71</id>
<content type='text'>
When this code was migrated to the string_list API in
d88b14b3fd6 (commit-graph: use string-list API for input, 2018-06-27)
it was made to use used both STRING_LIST_INIT_NODUP and a
strbuf_detach() pattern.

Those should not be used together if string_list_clear() is expected
to free the memory, instead we need to either use STRING_LIST_INIT_DUP
with a string_list_append_nodup(), or a STRING_LIST_INIT_NODUP and
manually fiddle with the "strdup_strings" member before calling
string_list_clear(). Let's do the former.

Since "strdup_strings = 1" is set now other code might be broken by
relying on "pack_indexes" not to duplicate it strings, but that
doesn't happen. When we pass this down to write_commit_graph() that
code uses the "struct string_list" without modifying it. Let's add a
"const" to the variable to have the compiler enforce that assumption.

Signed-off-by: Ævar Arnfjörð Bjarmason &lt;avarab@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>revision: avoid hitting packfiles when commits are in commit-graph</title>
<updated>2021-08-09T16:51:12Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2021-08-09T08:12:03Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=f559d6d45e7e58ae1f922213948723de77ea77bd'/>
<id>urn:sha1:f559d6d45e7e58ae1f922213948723de77ea77bd</id>
<content type='text'>
When queueing references in git-rev-list(1), we try to optimize parsing
of commits via the commit-graph. To do so, we first look up the object's
type, and if it is a commit we call `repo_parse_commit()` instead of
`parse_object()`. This is quite inefficient though given that we're
always uncompressing the object header in order to determine the type.
Instead, we can opportunistically search the commit-graph for the object
ID: in case it's found, we know it's a commit and can directly fill in
the commit object without having to uncompress the object header.

Expose a new function `lookup_commit_in_graph()`, which tries to find a
commit in the commit-graph by ID, and convert `get_reference()` to use
this function. This provides a big performance win in cases where we
load references in a repository with lots of references pointing to
commits. The following has been executed in a real-world repository with
about 2.2 million refs:

    Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
      Time (mean ± σ):      4.458 s ±  0.044 s    [User: 4.115 s, System: 0.342 s]
      Range (min … max):    4.409 s …  4.534 s    10 runs

    Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
      Time (mean ± σ):      3.089 s ±  0.015 s    [User: 2.768 s, System: 0.321 s]
      Range (min … max):    3.061 s …  3.105 s    10 runs

    Summary
      'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran
        1.44 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: use config to specify generation type</title>
<updated>2021-02-25T23:10:41Z</updated>
<author>
<name>Derrick Stolee</name>
<email>dstolee@microsoft.com</email>
</author>
<published>2021-02-25T18:19:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=702110aac63556b4572d9c7b65c9123ec8038ebf'/>
<id>urn:sha1:702110aac63556b4572d9c7b65c9123ec8038ebf</id>
<content type='text'>
We have two established generation number versions:

 1: topological levels
 2: corrected commit dates

The corrected commit dates are enabled by default, but they also write
extra data in the GDAT and GDOV chunks. Services that host Git data
might want to have more control over when this feature rolls out than
just updating the Git binaries.

Add a new "commitGraph.generationVersion" config option that specifies
the intended generation number version. If this value is less than 2,
then the GDAT chunk is never written _or read_ from an existing file.

This can replace our use of the GIT_TEST_COMMIT_GRAPH_NO_GDAT
environment variable in the test suite. Remove it.

Signed-off-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-reach: use corrected commit dates in paint_down_to_common()</title>
<updated>2021-01-19T00:21:18Z</updated>
<author>
<name>Abhishek Kumar</name>
<email>abhishekkumar8222@gmail.com</email>
</author>
<published>2021-01-16T18:11:17Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=8d00d7c3df8b7947c9154873116b5153c1a84dbf'/>
<id>urn:sha1:8d00d7c3df8b7947c9154873116b5153c1a84dbf</id>
<content type='text'>
091f4cf (commit: don't use generation numbers if not needed,
2018-08-30) changed paint_down_to_common() to use commit dates instead
of generation numbers v1 (topological levels) as the performance
regressed on certain topologies. With generation number v2 (corrected
commit dates) implemented, we no longer have to rely on commit dates and
can use generation numbers.

For example, the command `git merge-base v4.8 v4.9` on the Linux
repository walks 167468 commits, taking 0.135s for committer date and
167496 commits, taking 0.157s for corrected committer date respectively.

While using corrected commit dates, Git walks nearly the same number of
commits as commit date, the process is slower as for each comparision we
have to access a commit-slab (for corrected committer date) instead of
accessing struct member (for committer date).

This change incidentally broke the fragile t6404-recursive-merge test.
t6404-recursive-merge sets up a unique repository where all commits have
the same committer date without a well-defined merge-base.

While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
conflicts' merges commits in the order:
- Merge C with B to form an intermediate commit.
- Merge the intermediate commit with A.

With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
use the corrected committer date, which changes the order in which
commits are merged:
- Merge A with B to form an intermediate commit.
- Merge the intermediate commit with C.

While resulting repositories are equivalent, 6404.4 'virtual trees were
processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
different merge-bases and thus have different object ids for the
intermediate commits.

As this has already causes problems (as noted in 859fdc0 (commit-graph:
define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
within t6404-recursive-merge.

Signed-off-by: Abhishek Kumar &lt;abhishekkumar8222@gmail.com&gt;
Reviewed-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Reviewed-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: use generation v2 only if entire chain does</title>
<updated>2021-01-19T00:21:18Z</updated>
<author>
<name>Abhishek Kumar</name>
<email>abhishekkumar8222@gmail.com</email>
</author>
<published>2021-01-16T18:11:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=1fdc383c5c87b6f2bcacddd07d5d45dd04c2d146'/>
<id>urn:sha1:1fdc383c5c87b6f2bcacddd07d5d45dd04c2d146</id>
<content type='text'>
Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

If each layer of split commit-graph is treated independently, as it was
the case before this commit, with Git inspecting only the current layer
for chunk_generation_data pointer, commits in the lower layer (one with
GDAT) whould have corrected commit date as their generation number,
while commits in the upper layer would have topological levels as their
generation. Corrected commit dates usually have much larger values than
topological levels. This means that if we take two commits, one from the
upper layer, and one reachable from it in the lower layer, then the
expectation that the generation of a parent is smaller than the
generation of a child would be violated.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

Therefore, When writing the new layer in split commit-graph, we write a
GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
that if a layer has GDAT chunk, all lower layers must have a GDAT chunk
as well.

Rewriting layers follows similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain GDAT chunk, then the result of rewrite
does not have GDAT chunks either.

Signed-off-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Abhishek Kumar &lt;abhishekkumar8222@gmail.com&gt;
Reviewed-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Reviewed-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: implement generation data chunk</title>
<updated>2021-01-19T00:21:18Z</updated>
<author>
<name>Abhishek Kumar</name>
<email>abhishekkumar8222@gmail.com</email>
</author>
<published>2021-01-16T18:11:15Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=e8b63005c48696a26f976f5f9b0ccaf1983e439d'/>
<id>urn:sha1:e8b63005c48696a26f976f5f9b0ccaf1983e439d</id>
<content type='text'>
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.

We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).

We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.

To minimize the space required to store corrrected commit date, Git
stores corrected commit date offsets into the commit-graph file, instea
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4-bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:

We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.

If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.

We test the overflow-related code with the following repo history:

           F - N - U
          /         \
U - N - U            N
         \          /
	  N - F - N

Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.

The largest offset observed is 2 ^ 31, just large enough to overflow.

[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

Signed-off-by: Abhishek Kumar &lt;abhishekkumar8222@gmail.com&gt;
Reviewed-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Reviewed-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: return 64-bit generation number</title>
<updated>2021-01-19T00:21:18Z</updated>
<author>
<name>Abhishek Kumar</name>
<email>abhishekkumar8222@gmail.com</email>
</author>
<published>2021-01-16T18:11:13Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=d7f92784c65b143c5ec9f435484146b6098c2f85'/>
<id>urn:sha1:d7f92784c65b143c5ec9f435484146b6098c2f85</id>
<content type='text'>
In a preparatory step for introducing corrected commit dates, let's
return timestamp_t values from commit_graph_generation(), use
timestamp_t for local variables and define GENERATION_NUMBER_INFINITY
as (2 ^ 63 - 1) instead.

We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
represent the largest topological level we can store in the commit data
chunk.

With corrected commit dates implemented, we will have two such *_MAX
variables to denote the largest offset and largest topological level
that can be stored.

Signed-off-by: Abhishek Kumar &lt;abhishekkumar8222@gmail.com&gt;
Reviewed-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Reviewed-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
