<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/connected.h, branch jch</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=jch</id>
<link rel='self' href='https://git.shady.money/git/atom?h=jch'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2022-11-17T21:22:52Z</updated>
<entry>
<title>receive-pack: only use visible refs for connectivity check</title>
<updated>2022-11-17T21:22:52Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2022-11-17T05:47:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=bcec6780b2ec77ea5f846d5448771f97110041e1'/>
<id>urn:sha1:bcec6780b2ec77ea5f846d5448771f97110041e1</id>
<content type='text'>
When serving a push, git-receive-pack(1) needs to verify that the
packfile sent by the client contains all objects that are required by
the updated references. This connectivity check works by marking all
preexisting references as uninteresting and using the new reference tips
as starting point for a graph walk.

Marking all preexisting references as uninteresting can be a problem
when it comes to performance. Git forges tend to do internal bookkeeping
to keep alive sets of objects for internal use or make them easy to find
via certain references. These references are typically hidden away from
the user so that they are neither advertised nor writeable. At GitLab,
we have one particular repository that contains a total of 7 million
references, of which 6.8 million are indeed internal references. With
the current connectivity check we are forced to load all these
references in order to mark them as uninteresting, and this alone takes
around 15 seconds to compute.

We can optimize this by only taking into account the set of visible refs
when marking objects as uninteresting. This means that we may now walk
more objects until we hit any object that is marked as uninteresting.
But it is rather unlikely that clients send objects that make large
parts of objects reachable that have previously only ever been hidden,
whereas the common case is to push incremental changes that build on top
of the visible object graph.

This provides a huge boost to performance in the mentioned repository,
where the vast majority of its refs hidden. Pushing a new commit into
this repo with `transfer.hideRefs` set up to hide 6.8 million of 7 refs
as it is configured in Gitaly leads to a 4.5-fold speedup:

    Benchmark 1: main
      Time (mean ± σ):     30.977 s ±  0.157 s    [User: 30.226 s, System: 1.083 s]
      Range (min … max):   30.796 s … 31.071 s    3 runs

    Benchmark 2: pks-connectivity-check-hide-refs
      Time (mean ± σ):      6.799 s ±  0.063 s    [User: 6.803 s, System: 0.354 s]
      Range (min … max):    6.729 s …  6.850 s    3 runs

    Summary
      'pks-connectivity-check-hide-refs' ran
        4.56 ± 0.05 times faster than 'main'

As we mostly go through the same codepaths even in the case where there
are no hidden refs at all compared to the code before there is no change
in performance when no refs are hidden:

    Benchmark 1: main
      Time (mean ± σ):     48.188 s ±  0.432 s    [User: 49.326 s, System: 5.009 s]
      Range (min … max):   47.706 s … 48.539 s    3 runs

    Benchmark 2: pks-connectivity-check-hide-refs
      Time (mean ± σ):     48.027 s ±  0.500 s    [User: 48.934 s, System: 5.025 s]
      Range (min … max):   47.504 s … 48.500 s    3 runs

    Summary
      'pks-connectivity-check-hide-refs' ran
        1.00 ± 0.01 times faster than 'main'

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
</content>
</entry>
<entry>
<title>connected: refactor iterator to return next object ID directly</title>
<updated>2021-09-01T19:43:56Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2021-09-01T13:09:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=9fec7b213045135655354e864d15894175428d5a'/>
<id>urn:sha1:9fec7b213045135655354e864d15894175428d5a</id>
<content type='text'>
The object ID iterator used by the connectivity checks returns the next
object ID via an out-parameter and then uses a return code to indicate
whether an item was found. This is a bit roundabout: instead of a
separate error code, we can just return the next object ID directly and
use `NULL` pointers as indicator that the iterator got no items left.
Furthermore, this avoids a copy of the object ID.

Refactor the iterator and all its implementations to return object IDs
directly. This brings a tiny performance improvement when doing a mirror-fetch of a repository with about 2.3M refs:

    Benchmark #1: 328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch
      Time (mean ± σ):     30.110 s ±  0.148 s    [User: 27.161 s, System: 5.075 s]
      Range (min … max):   29.934 s … 30.406 s    10 runs

    Benchmark #2: 328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch
      Time (mean ± σ):     29.899 s ±  0.109 s    [User: 26.916 s, System: 5.104 s]
      Range (min … max):   29.696 s … 29.996 s    10 runs

    Summary
      '328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch' ran
        1.01 ± 0.01 times faster than '328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch'

While this 1% speedup could be labelled as statistically insignificant,
the speedup is consistent on my machine. Furthermore, this is an end to
end test, so it is expected that the improvement in the connectivity
check itself is more significant.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>connected: always use partial clone optimization</title>
<updated>2020-03-29T17:37:44Z</updated>
<author>
<name>Jonathan Tan</name>
<email>jonathantanmy@google.com</email>
</author>
<published>2020-03-20T22:00:45Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=2b98478c6fc32b04434981c783b0472e0137cfad'/>
<id>urn:sha1:2b98478c6fc32b04434981c783b0472e0137cfad</id>
<content type='text'>
With 50033772d5 ("connected: verify promisor-ness of partial clone",
2020-01-30), the fast path (checking promisor packs) in
check_connected() now passes a subset of the slow path (rev-list) - if
all objects to be checked are found in promisor packs, both the fast
path and the slow path will pass; otherwise, the fast path will
definitely not pass. This means that we can always attempt the fast path
whenever we need to do the slow path.

The fast path is currently guarded by a flag; therefore, remove that
flag. Also, make the fast path fallback to the slow path - if the fast
path fails, the failing OID and all remaining OIDs will be passed to
rev-list.

The main user-visible benefit is the performance of fetch from a partial
clone - specifically, the speedup of the connectivity check done before
the fetch. In particular, a no-op fetch into a partial clone on my
computer was sped up from 7 seconds to 0.01 seconds. This is a
complement to the work in 2df1aa239c ("fetch: forgo full
connectivity check if --filter", 2020-01-30), which is the child of the
aforementioned 50033772d5. In that commit, the connectivity check
*after* the fetch was sped up.

The addition of the fast path might cause performance reductions in
these cases:

 - If a partial clone or a fetch into a partial clone fails, Git will
   fruitlessly run rev-list (it is expected that everything fetched
   would go into promisor packs, so if that didn't happen, it is most
   likely that rev-list will fail too).

 - Any connectivity checks done by receive-pack, in the (in my opinion,
   unlikely) event that a partial clone serves receive-pack.

I think that these cases are rare enough, and the performance reduction
in this case minor enough (additional object DB access), that the
benefit of avoiding a flag outweighs these.

Signed-off-by: Jonathan Tan &lt;jonathantanmy@google.com&gt;
Reviewed-by: Josh Steadmon &lt;steadmon@google.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>connected: verify promisor-ness of partial clone</title>
<updated>2020-01-30T18:55:31Z</updated>
<author>
<name>Jonathan Tan</name>
<email>jonathantanmy@google.com</email>
</author>
<published>2020-01-12T04:15:24Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=50033772d50ef3c4023d63561d20bc61db96500e'/>
<id>urn:sha1:50033772d50ef3c4023d63561d20bc61db96500e</id>
<content type='text'>
Commit dfa33a298d ("clone: do faster object check for partial clones",
2019-04-21) optimized the connectivity check done when cloning with
--filter to check only the existence of objects directly pointed to by
refs. But this is not sufficient: they also need to be promisor objects.
Make this check more robust by instead checking that these objects are
promisor objects, that is, they appear in a promisor pack.

Signed-off-by: Jonathan Tan &lt;jonathantanmy@google.com&gt;
Reviewed-by: Jonathan Nieder &lt;jrnieder@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>clone: do faster object check for partial clones</title>
<updated>2019-04-21T05:08:53Z</updated>
<author>
<name>Josh Steadmon</name>
<email>steadmon@google.com</email>
</author>
<published>2019-04-19T21:00:13Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=dfa33a298de2ab724d4812633bb009a90d1df790'/>
<id>urn:sha1:dfa33a298de2ab724d4812633bb009a90d1df790</id>
<content type='text'>
For partial clones, doing a full connectivity check is wasteful; we skip
promisor objects (which, for a partial clone, is all known objects), and
enumerating them all to exclude them from the connectivity check can
take a significant amount of time on large repos.

At most, we want to make sure that we get the objects referred to by any
wanted refs. For partial clones, just check that these objects were
transferred.

Signed-off-by: Josh Steadmon &lt;steadmon@google.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>connected: document connectivity in partial clones</title>
<updated>2018-09-21T20:20:47Z</updated>
<author>
<name>Jonathan Tan</name>
<email>jonathantanmy@google.com</email>
</author>
<published>2018-09-21T18:22:37Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=4937291b45800372c2c34855a6a36d5d2a77af2c'/>
<id>urn:sha1:4937291b45800372c2c34855a6a36d5d2a77af2c</id>
<content type='text'>
In acb0c57260 ("fetch: support filters", 2017-12-08), check_connected()
was extended to allow objects to either be promised to be available (if
the repository is a partial clone) or to be present; previously, this
function required the latter. However, this change was not reflected in
the documentation of that function. Update the documentation
accordingly.

Signed-off-by: Jonathan Tan &lt;jonathantanmy@google.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Add missing includes and forward declarations</title>
<updated>2018-08-15T18:52:09Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2018-08-15T17:54:05Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=ef3ca95475ce467ae883cc8175ed40e6f7d27800'/>
<id>urn:sha1:ef3ca95475ce467ae883cc8175ed40e6f7d27800</id>
<content type='text'>
I looped over the toplevel header files, creating a temporary two-line C
program for each consisting of
  #include "git-compat-util.h"
  #include $HEADER
This patch is the result of manually fixing errors in compiling those
tiny programs.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>fetch-pack: write shallow, then check connectivity</title>
<updated>2018-07-03T21:57:44Z</updated>
<author>
<name>Jonathan Tan</name>
<email>jonathantanmy@google.com</email>
</author>
<published>2018-07-02T22:08:43Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=cf1e7c07705eb21c30d0ee414810e7bc8fdf7d82'/>
<id>urn:sha1:cf1e7c07705eb21c30d0ee414810e7bc8fdf7d82</id>
<content type='text'>
When fetching, connectivity is checked after the shallow file is
updated. There are 2 issues with this: (1) the connectivity check is
only performed up to ancestors of existing refs (which is not thorough
enough if we were deepening an existing ref in the first place), and (2)
there is no rollback of the shallow file if the connectivity check
fails.

To solve (1), update the connectivity check to check the ancestry chain
completely in the case of a deepening fetch by refraining from passing
"--not --all" when invoking rev-list in connected.c.

To solve (2), have fetch_pack() perform its own connectivity check
before updating the shallow file. To support existing use cases in which
"git fetch-pack" is used to download objects without much regard as to
the connectivity of the resulting objects with respect to the existing
repository, the connectivity check is only done if necessary (that is,
the fetch is not a clone, and the fetch involves shallow/deepen
functionality). "git fetch" still performs its own connectivity check,
preserving correctness but sometimes performing redundant work. This
redundancy is mitigated by the fact that fetch_pack() reports if it has
performed a connectivity check itself, and if the transport supports
connect or stateless-connect, it will bubble up that report so that "git
fetch" knows not to perform the connectivity check in such a case.

This was noticed when a user tried to deepen an existing repository by
fetching with --no-shallow from a server that did not send all necessary
objects - the connectivity check as run by "git fetch" succeeded, but a
subsequent "git fsck" failed.

Signed-off-by: Jonathan Tan &lt;jonathantanmy@google.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Convert check_connected to use struct object_id</title>
<updated>2017-10-16T02:05:50Z</updated>
<author>
<name>brian m. carlson</name>
<email>sandals@crustytoothpaste.net</email>
</author>
<published>2017-10-15T22:06:54Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=6ccac9eed56280f035d84605b4451ae1721a3100'/>
<id>urn:sha1:6ccac9eed56280f035d84605b4451ae1721a3100</id>
<content type='text'>
Convert check_connected and the callbacks it takes to use struct
object_id.

Signed-off-by: brian m. carlson &lt;sandals@crustytoothpaste.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>check_connected: accept an env argument</title>
<updated>2016-10-10T20:54:02Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2016-10-03T20:49:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=526f108a271b331af9ae92796215e560e5ec4677'/>
<id>urn:sha1:526f108a271b331af9ae92796215e560e5ec4677</id>
<content type='text'>
This lets callers influence the environment seen by
rev-list, which will be useful when we start providing
quarantined objects.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
