<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/t/t5318-commit-graph.sh, branch v2.45.2</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=v2.45.2</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.45.2'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2023-12-18T22:10:11Z</updated>
<entry>
<title>Merge branch 'ps/commit-graph-less-paranoid'</title>
<updated>2023-12-18T22:10:11Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2023-12-18T22:10:11Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=66685e85556ad1890d896bc30ba3a7a99bf4dd78'/>
<id>urn:sha1:66685e85556ad1890d896bc30ba3a7a99bf4dd78</id>
<content type='text'>
Earlier we stopped relying on commit-graph that (still) records
information about commits that are lost from the object store,
which has negative performance implications.  The default has been
flipped to disable this pessimization.

* ps/commit-graph-less-paranoid:
  commit-graph: disable GIT_COMMIT_GRAPH_PARANOIA by default
</content>
</entry>
<entry>
<title>Merge branch 'jk/chunk-bounds-more'</title>
<updated>2023-12-10T00:37:48Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2023-12-10T00:37:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=34401b7ddbd79c8511e864048bc52d896eac2f22'/>
<id>urn:sha1:34401b7ddbd79c8511e864048bc52d896eac2f22</id>
<content type='text'>
Code clean-up for jk/chunk-bounds topic.

* jk/chunk-bounds-more:
  commit-graph: mark chunk error messages for translation
  commit-graph: drop verify_commit_graph_lite()
  commit-graph: check order while reading fanout chunk
  commit-graph: use fanout value for graph size
  commit-graph: abort as soon as we see a bogus chunk
  commit-graph: clarify missing-chunk error messages
  commit-graph: drop redundant call to "lite" verification
  midx: check consistency of fanout table
  commit-graph: handle overflow in chunk_size checks
</content>
</entry>
<entry>
<title>commit-graph: disable GIT_COMMIT_GRAPH_PARANOIA by default</title>
<updated>2023-11-26T01:10:00Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2023-11-24T11:08:21Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=b1df3b3867e351913887121063cbd69de24e83fc'/>
<id>urn:sha1:b1df3b3867e351913887121063cbd69de24e83fc</id>
<content type='text'>
In 7a5d604443 (commit: detect commits that exist in commit-graph but not
in the ODB, 2023-10-31), we have introduced a new object existence check
into `repo_parse_commit_internal()` so that we do not parse commits via
the commit-graph that don't have a corresponding object in the object
database. This new check of course comes with a performance penalty,
which the commit put at around 30% for `git rev-list --topo-order`. But
there are in fact scenarios where the performance regression is even
higher. The following benchmark was run against linux.git with a
fully built commit-graph:

  Benchmark 1: git.v2.42.1 rev-list --count HEAD
    Time (mean ± σ):     658.0 ms ±   5.2 ms    [User: 613.5 ms, System: 44.4 ms]
    Range (min … max):   650.2 ms … 666.0 ms    10 runs

  Benchmark 2: git.v2.43.0-rc1 rev-list --count HEAD
    Time (mean ± σ):      1.333 s ±  0.019 s    [User: 1.263 s, System: 0.069 s]
    Range (min … max):    1.302 s …  1.361 s    10 runs

  Summary
    git.v2.42.1 rev-list --count HEAD ran
      2.03 ± 0.03 times faster than git.v2.43.0-rc1 rev-list --count HEAD

While it's a noble goal to ensure that results are the same regardless
of whether or not we have a potentially stale commit-graph, taking twice
as much time is a tough sell. Furthermore, we can generally assume that
the commit-graph will be updated by git-gc(1) or git-maintenance(1) as
required so that the case where the commit-graph is stale should not at
all be common.

With that in mind, default-disable GIT_COMMIT_GRAPH_PARANOIA and restore
the behaviour, and thus the performance, from before the mentioned
commit. To stay consistent, also disable this behaviour by default in
`lookup_commit_in_graph()`, where the object existence check has been
introduced right at its inception via f559d6d45e (revision: avoid
hitting packfiles when commits are in commit-graph, 2021-08-09).

This results in another speedup in commands that end up calling this
function, even though it's less pronounced compared to the above
benchmark. The following has been executed in linux.git with ~1.2
million references:

  Benchmark 1: GIT_COMMIT_GRAPH_PARANOIA=true git rev-list --all --no-walk=unsorted
    Time (mean ± σ):      2.947 s ±  0.003 s    [User: 2.412 s, System: 0.534 s]
    Range (min … max):    2.943 s …  2.949 s    3 runs

  Benchmark 2: GIT_COMMIT_GRAPH_PARANOIA=false git rev-list --all --no-walk=unsorted
    Time (mean ± σ):      2.724 s ±  0.030 s    [User: 2.207 s, System: 0.514 s]
    Range (min … max):    2.704 s …  2.759 s    3 runs

  Summary
    GIT_COMMIT_GRAPH_PARANOIA=false git rev-list --all --no-walk=unsorted ran
      1.08 ± 0.01 times faster than GIT_COMMIT_GRAPH_PARANOIA=true git rev-list --all --no-walk=unsorted

So whereas 7a5d604443 initially introduced the logic to start doing an
object existence check in `repo_parse_commit_internal()` by default, the
updated logic will now instead cause `lookup_commit_in_graph()` to stop
doing the check by default. This behaviour continues to be tweakable by
the user via the GIT_COMMIT_GRAPH_PARANOIA environment variable.
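
Scripts that still want the stricter behaviour can set the variable per
invocation. A minimal Python sketch, assuming git is on PATH when the
command is actually run; the helper name is hypothetical and it only
builds the argument list and environment:

```python
import os

def paranoid_git_command(args, paranoid=True):
    """Build (argv, env) for running git with the existence checks toggled."""
    env = dict(os.environ)
    # The variable takes a boolean value, as in the benchmark above.
    env["GIT_COMMIT_GRAPH_PARANOIA"] = "true" if paranoid else "false"
    return ["git"] + list(args), env

argv, env = paranoid_git_command(["rev-list", "--count", "HEAD"])
print(argv, env["GIT_COMMIT_GRAPH_PARANOIA"])
# To execute: subprocess.run(argv, env=env, check=True)
```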

Note that this requires us to amend some tests to manually turn on the
paranoid checks again. This is because we cause repository corruption by
manually deleting objects which are part of the commit graph already.
These circumstances shouldn't usually happen in repositories.

Reported-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: check order while reading fanout chunk</title>
<updated>2023-11-09T10:07:53Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2023-11-09T07:25:07Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=06fb135f8eddc64071a719fe309c771883c07775'/>
<id>urn:sha1:06fb135f8eddc64071a719fe309c771883c07775</id>
<content type='text'>
We read the fanout chunk, storing a pointer to it, but only confirm that
the entries are monotonic in a final "lite" verification step. Let's
move that into the actual OIDF chunk callback, so that we can report
problems immediately (for all the reasons given in the previous
"commit-graph: abort as soon as we see a bogus chunk" commit).
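
As a rough illustration of the check (a Python sketch, not the C code in
commit-graph.c; the function name and the raw-bytes input are
assumptions made for this example), a fanout table is acceptable only if
its 256 cumulative big-endian counts never decrease:

```python
import struct

def fanout_is_monotonic(chunk):
    """Return True if the 256 big-endian 32-bit counts never decrease.

    `chunk` is the raw OIDF chunk payload (hypothetical input for this sketch).
    """
    if len(chunk) != 256 * 4:
        return False  # wrong size: treat as a bogus chunk
    counts = struct.unpack("!256I", chunk)
    # Cumulative counts are in order exactly when sorting changes nothing.
    return list(counts) == sorted(counts)

good = struct.pack("!256I", *range(256))
bad = struct.pack("!256I", *([5] + [0] * 255))
print(fanout_is_monotonic(good), fanout_is_monotonic(bad))
```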

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: use fanout value for graph size</title>
<updated>2023-11-09T10:07:53Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2023-11-09T07:24:35Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=d3b6f6c63137b72df5055b71721825e786bcbd6e'/>
<id>urn:sha1:d3b6f6c63137b72df5055b71721825e786bcbd6e</id>
<content type='text'>
Commit-graph, midx, and pack idx files all have both a lookup table of
oids and an oid fanout table. In midx and pack idx files, we take the
final entry of the fanout table as the source of truth for the number of
entries, and then verify that the size of the lookup table matches that.
But for commit-graph files, we do the opposite: we use the size of the
lookup table as the source of truth, and then check the final fanout
entry against it.

As noted in 4169d89645 (commit-graph: check consistency of fanout
table, 2023-10-09), either is correct. But there are a few reasons to
prefer the fanout table as the source of truth:

  1. The fanout entries are 32 bits on disk, and that defines the
     maximum number of entries we can store. But since the size of the
     lookup table is only bounded by the filesystem, it can be much
     larger. And hence computing it as the commit-graph does means that
     we may truncate the result when storing it in a uint32_t.

  2. We read the fanout first, then the lookup table. If we're verifying
     the chunks as we read them, then we'd want to take the fanout as
     truth (we have nothing yet to check it against) and then we can
     check that the lookup table matches what we already know.

  3. It is pointlessly inconsistent with the midx and pack idx code.
     Since the three have to do similar size and bounds checks, it is
     easier to reason about all three if they use the same approach.
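
The truncation risk in point 1 can be sketched with plain arithmetic.
This is a Python illustration under assumptions: a 20-byte (SHA-1) oid
width and a hypothetical lookup chunk large enough to hold 2^32 + 5
entries; the helper name is made up for this example.

```python
HASH_LEN = 20  # assumed oid width in bytes (SHA-1)

def as_uint32(value):
    """Mimic storing a value in a C uint32_t by keeping the low 32 bits."""
    return value % 2**32

lookup_chunk_size = (2**32 + 5) * HASH_LEN  # hypothetical oversized chunk
entries = lookup_chunk_size // HASH_LEN     # true count: 4294967301
print(entries, as_uint32(entries))          # truncated count: 5
```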

So this patch moves the assignment of g-&gt;num_commits to the fanout
parser, and then we can check the size of the lookup chunk as soon as we
try to load it.

There's already a test covering this situation, which munges the final
fanout entry to 2^32-1. In the current code we complain that it does not
agree with the table size. But now that we treat the munged value as the
source of truth, we'll complain that the lookup table is the wrong size
(again, either is correct). So we'll have to update the message we
expect (and likewise for an earlier test which does similar munging).

There's a similar test for this situation on the midx side, but rather
than making a very-large fanout value, it just truncates the lookup
table. We could do that here, too, but the very-large fanout value
actually shows an interesting corner case. On a 32-bit system,
multiplying to find the expected table size would cause an integer
overflow. Using st_mult() would detect that, but cause us to die()
rather than falling back to the non-graph code path. Checking the size
using division (as we do with existing chunk-size checks) avoids the
overflow entirely, and the test demonstrates this when run on a 32-bit
system.
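
A small sketch of why the division-based check is safe. It simulates a
32-bit size_t in Python; the function names and the 20-byte oid width
are assumptions for illustration and do not mirror the real C code.

```python
SIZE_MAX_32 = 2**32 - 1  # largest value of a 32-bit size_t
HASH_LEN = 20            # assumed oid width in bytes (SHA-1)

def check_by_multiplying(chunk_size, num_commits):
    """Naive check: the product wraps around modulo 2^32 on a 32-bit system."""
    expected = (num_commits * HASH_LEN) % (SIZE_MAX_32 + 1)
    return chunk_size == expected

def check_by_dividing(chunk_size, num_commits):
    """Overflow-free check: divide the chunk size by the record width."""
    return chunk_size // HASH_LEN == num_commits

munged = 2**32 - 1                     # final fanout entry from the test
wrapped = (munged * HASH_LEN) % 2**32  # size that fools the naive check
print(check_by_multiplying(wrapped, munged))  # wrongly accepted
print(check_by_dividing(wrapped, munged))     # correctly rejected
```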

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: clarify missing-chunk error messages</title>
<updated>2023-11-09T10:07:53Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2023-11-09T07:14:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=93d29247298e9ae3fbc6dd8e022a6260b568191a'/>
<id>urn:sha1:93d29247298e9ae3fbc6dd8e022a6260b568191a</id>
<content type='text'>
When a required commit-graph chunk cannot be loaded, we leave its entry
in the struct NULL, and then later complain that it is missing. But
that's just one reason we might not have loaded it, as we also do some
data quality checks.

Let's switch these messages to say "missing or corrupted", which is
exactly what the midx code says for the same cases. Likewise, we'll use
the same phrasing and capitalization as those for consistency. And while
we're here, we can mark them for translation (just like the midx ones).

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'jc/test-i18ngrep'</title>
<updated>2023-11-08T02:04:02Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2023-11-08T02:04:02Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=a8e2394704d0543f4e1f1ac6ea532d098316d97e'/>
<id>urn:sha1:a8e2394704d0543f4e1f1ac6ea532d098316d97e</id>
<content type='text'>
Another step to deprecate test_i18ngrep.

* jc/test-i18ngrep:
  tests: teach callers of test_i18ngrep to use test_grep
  test framework: further deprecate test_i18ngrep
</content>
</entry>
<entry>
<title>Merge branch 'ps/do-not-trust-commit-graph-blindly-for-existence'</title>
<updated>2023-11-08T02:03:59Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2023-11-08T02:03:59Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=42b87f7ee60323f2e45a91233db80d44d3e33ad2'/>
<id>urn:sha1:42b87f7ee60323f2e45a91233db80d44d3e33ad2</id>
<content type='text'>
The codepath to traverse the commit-graph learned to notice that a
commit is missing (e.g., corrupt repository lost an object), even
though it knows something about the commit (like its parents) from
what is in commit-graph.

* ps/do-not-trust-commit-graph-blindly-for-existence:
  commit: detect commits that exist in commit-graph but not in the ODB
  commit-graph: introduce envvar to disable commit existence checks
</content>
</entry>
<entry>
<title>tests: teach callers of test_i18ngrep to use test_grep</title>
<updated>2023-11-02T08:13:44Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2023-10-31T05:23:30Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=6789275d3780bcb950e6be8557aeedf160d4ad6d'/>
<id>urn:sha1:6789275d3780bcb950e6be8557aeedf160d4ad6d</id>
<content type='text'>
They are equivalent and the former still exists, so as long as the
only change this commit makes is to rewrite test_i18ngrep to
test_grep, there won't be any new bug, even if there still are
callers of test_i18ngrep remaining in the tree, or when merged to
other topics that add new uses of test_i18ngrep.

This patch was produced more or less with

    git grep -l -e 'test_i18ngrep ' 't/t[0-9][0-9][0-9][0-9]-*.sh' |
    xargs perl -p -i -e 's/test_i18ngrep /test_grep /'

and a good way to sanity check the result yourself is to run the
above in a checkout of c4603c1c (test framework: further deprecate
test_i18ngrep, 2023-10-31) and compare the resulting working tree
contents with the result of applying this patch to the same commit.
You'll see that uses of test_i18ngrep in a few t/lib-*.sh files were
also corrected, in addition to what the manual reproduction changes.
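
For readers less familiar with perl, the per-line substitution can be
mirrored in Python. This is only an illustrative sketch: the function
name and the sample line are invented, and the patch itself was
produced with the perl one-liner.

```python
def rewrite_line(line):
    """Replace the first 'test_i18ngrep ' on a line, like s/// without /g."""
    return line.replace("test_i18ngrep ", "test_grep ", 1)

sample = "test_i18ngrep pattern file"  # made-up test line
print(rewrite_line(sample))
```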

Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit: detect commits that exist in commit-graph but not in the ODB</title>
<updated>2023-11-01T03:04:06Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2023-10-31T07:16:18Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=7a5d604443ffc7afcd3788818f8fe00fc68c054d'/>
<id>urn:sha1:7a5d604443ffc7afcd3788818f8fe00fc68c054d</id>
<content type='text'>
Commit graphs can become stale and contain references to commits that do
not exist in the object database anymore. Theoretically, this can lead
to a scenario where we are able to successfully look up any such commit
via the commit graph even though such a lookup would fail if done via
the object database directly.

As the commit graph is mostly intended as a sort of cache to speed up
parsing of commits we do not want to have diverging behaviour in a
repository with and a repository without commit graphs, no matter
whether they are stale or not. As commits are otherwise immutable, the
only thing that we really need to care about is thus the presence or
absence of a commit.

To address potentially stale commit data that may exist in the graph,
our `lookup_commit_in_graph()` function will check for the commit's
existence in both the commit graph and the object database. So
even if we were able to look up the commit's data in the graph, we would
still pretend as if the commit didn't exist if it is missing in the
object database.

We don't have the same safety net in `parse_commit_in_graph_one()`
though. This function is mostly used internally in "commit-graph.c"
itself to validate the commit graph, and this usage is fine. We do
expose its functionality via `parse_commit_in_graph()` though, which
gets called by `repo_parse_commit_internal()`, and that function is in
turn used in many places in our codebase.

As far as I can see this function is never used to directly turn an object
ID into a commit object without additional safety checks before or after
this lookup. What it is being used for though is to walk history via the
parent chain of commits. So when commits in the parent chain of a graph
walk are missing it is possible that we wouldn't notice if that missing
commit was part of the commit graph. Thus, a query like `git rev-parse
HEAD~2` can succeed even if the intermediate commit is missing.

It's unclear whether there are additional ways in which such stale
commit graphs can lead to problems. In any case, it feels like this is a
bigger bug waiting to happen when we gain additional direct or indirect
callers of `repo_parse_commit_internal()`. So let's fix the inconsistent
behaviour by checking for object existence via the object database, as
well.
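
The inconsistency can be modelled in a few lines. This is a toy Python
sketch, not Git's implementation: the commit graph is a dict from commit
name to first parent, the object database is a set of names, and all
names are invented for the example.

```python
def walk_first_parents(start, steps, graph, odb, check_odb):
    """Follow `steps` first-parent links using cached graph data.

    With check_odb=True, a commit missing from the object database makes
    the walk fail, even though the (stale) graph still knows about it.
    """
    current = start
    for _ in range(steps):
        current = graph[current]
        if check_odb and current not in odb:
            raise LookupError("missing commit: " + current)
    return current

graph = {"C": "B", "B": "A"}  # stale graph still records B
odb = {"C", "A"}              # B was lost from the object store

print(walk_first_parents("C", 2, graph, odb, check_odb=False))  # like HEAD~2
try:
    walk_first_parents("C", 2, graph, odb, check_odb=True)
except LookupError as err:
    print(err)
```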

This check of course comes with a performance penalty. The following
benchmarks have been executed in a clone of linux.git with stable tags
added:

    Benchmark 1: git -c core.commitGraph=true rev-list --topo-order --all (git = master)
      Time (mean ± σ):      2.913 s ±  0.018 s    [User: 2.363 s, System: 0.548 s]
      Range (min … max):    2.894 s …  2.950 s    10 runs

    Benchmark 2: git -c core.commitGraph=true rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
      Time (mean ± σ):      3.834 s ±  0.052 s    [User: 3.276 s, System: 0.556 s]
      Range (min … max):    3.780 s …  3.961 s    10 runs

    Benchmark 3: git -c core.commitGraph=false rev-list --topo-order --all (git = master)
      Time (mean ± σ):     13.841 s ±  0.084 s    [User: 13.152 s, System: 0.687 s]
      Range (min … max):   13.714 s … 13.995 s    10 runs

    Benchmark 4: git -c core.commitGraph=false rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
      Time (mean ± σ):     13.762 s ±  0.116 s    [User: 13.094 s, System: 0.667 s]
      Range (min … max):   13.645 s … 14.038 s    10 runs

    Summary
      git -c core.commitGraph=true rev-list --topo-order --all (git = master) ran
        1.32 ± 0.02 times faster than git -c core.commitGraph=true rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
        4.72 ± 0.05 times faster than git -c core.commitGraph=false rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
        4.75 ± 0.04 times faster than git -c core.commitGraph=false rev-list --topo-order --all (git = master)

We are looking at a ~30% regression, but overall we're still a
whole lot faster than without the commit graph. To counteract this, the
new check can be turned off with the `GIT_COMMIT_GRAPH_PARANOIA` envvar.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
