<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/connected.c, branch v2.40.2</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=v2.40.2</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.40.2'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2022-12-14T06:55:46Z</updated>
<entry>
<title>Merge branch 'ab/various-leak-fixes'</title>
<updated>2022-12-14T06:55:46Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2022-12-14T06:55:46Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=9ea1378d046d764642718f1b070689072bde4601'/>
<id>urn:sha1:9ea1378d046d764642718f1b070689072bde4601</id>
<content type='text'>
Various leak fixes.

* ab/various-leak-fixes:
  built-ins: use free() not UNLEAK() if trivial, rm dead code
  revert: fix parse_options_concat() leak
  cherry-pick: free "struct replay_opts" members
  rebase: don't leak on "--abort"
  connected.c: free the "struct packed_git"
  sequencer.c: fix "opts-&gt;strategy" leak in read_strategy_opts()
  ls-files: fix a --with-tree memory leak
  revision API: call graph_clear() in release_revisions()
  unpack-file: fix ancient leak in create_temp_file()
  built-ins &amp; libs &amp; helpers: add/move destructors, fix leaks
  dir.c: free "ident" and "exclude_per_dir" in "struct untracked_cache"
  read-cache.c: clear and free "sparse_checkout_patterns"
  commit: discard partial cache before (re-)reading it
  {reset,merge}: call discard_index() before returning
  tests: mark tests as passing with SANITIZE=leak
</content>
</entry>
<entry>
<title>connected.c: free the "struct packed_git"</title>
<updated>2022-11-21T03:32:48Z</updated>
<author>
<name>Ævar Arnfjörð Bjarmason</name>
<email>avarab@gmail.com</email>
</author>
<published>2022-11-08T18:17:47Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=dd4143e7bf406a5907ef86c50c77c4667f62375d'/>
<id>urn:sha1:dd4143e7bf406a5907ef86c50c77c4667f62375d</id>
<content type='text'>
The "new_pack" we allocate in check_connected() wasn't being
free'd. Let's do that before we return from the function. This has
leaked ever since "new_pack" was added to this function in
c6807a40dcd (clone: open a shortcut for connectivity check,
2013-05-26).

Signed-off-by: Ævar Arnfjörð Bjarmason &lt;avarab@gmail.com&gt;
Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
</content>
</entry>
<entry>
<title>receive-pack: only use visible refs for connectivity check</title>
<updated>2022-11-17T21:22:52Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2022-11-17T05:47:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=bcec6780b2ec77ea5f846d5448771f97110041e1'/>
<id>urn:sha1:bcec6780b2ec77ea5f846d5448771f97110041e1</id>
<content type='text'>
When serving a push, git-receive-pack(1) needs to verify that the
packfile sent by the client contains all objects that are required by
the updated references. This connectivity check works by marking all
preexisting references as uninteresting and using the new reference tips
as starting point for a graph walk.

Marking all preexisting references as uninteresting can be a problem
when it comes to performance. Git forges tend to do internal bookkeeping
to keep alive sets of objects for internal use or make them easy to find
via certain references. These references are typically hidden away from
the user so that they are neither advertised nor writeable. At GitLab,
we have one particular repository that contains a total of 7 million
references, of which 6.8 million are indeed internal references. With
the current connectivity check we are forced to load all these
references in order to mark them as uninteresting, and this alone takes
around 15 seconds to compute.

We can optimize this by only taking into account the set of visible refs
when marking objects as uninteresting. This means that we may now walk
more objects until we hit any object that is marked as uninteresting.
But it is rather unlikely that clients send objects that make large
parts of objects reachable that have previously only ever been hidden,
whereas the common case is to push incremental changes that build on top
of the visible object graph.

This provides a huge boost to performance in the mentioned repository,
where the vast majority of its refs hidden. Pushing a new commit into
this repo with `transfer.hideRefs` set up to hide 6.8 million of 7 refs
as it is configured in Gitaly leads to a 4.5-fold speedup:

    Benchmark 1: main
      Time (mean ± σ):     30.977 s ±  0.157 s    [User: 30.226 s, System: 1.083 s]
      Range (min … max):   30.796 s … 31.071 s    3 runs

    Benchmark 2: pks-connectivity-check-hide-refs
      Time (mean ± σ):      6.799 s ±  0.063 s    [User: 6.803 s, System: 0.354 s]
      Range (min … max):    6.729 s …  6.850 s    3 runs

    Summary
      'pks-connectivity-check-hide-refs' ran
        4.56 ± 0.05 times faster than 'main'

As we mostly go through the same codepaths even in the case where there
are no hidden refs at all compared to the code before there is no change
in performance when no refs are hidden:

    Benchmark 1: main
      Time (mean ± σ):     48.188 s ±  0.432 s    [User: 49.326 s, System: 5.009 s]
      Range (min … max):   47.706 s … 48.539 s    3 runs

    Benchmark 2: pks-connectivity-check-hide-refs
      Time (mean ± σ):     48.027 s ±  0.500 s    [User: 48.934 s, System: 5.025 s]
      Range (min … max):   47.504 s … 48.500 s    3 runs

    Summary
      'pks-connectivity-check-hide-refs' ran
        1.00 ± 0.01 times faster than 'main'

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
</content>
</entry>
<entry>
<title>run-command API: rename "env_array" to "env"</title>
<updated>2022-06-02T21:31:16Z</updated>
<author>
<name>Ævar Arnfjörð Bjarmason</name>
<email>avarab@gmail.com</email>
</author>
<published>2022-06-02T09:09:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=29fda24dd11e90583f3ea9ff2f90ee9acacd7792'/>
<id>urn:sha1:29fda24dd11e90583f3ea9ff2f90ee9acacd7792</id>
<content type='text'>
Start following-up on the rename mentioned in c7c4bdeccf3 (run-command
API: remove "env" member, always use "env_array", 2021-11-25) of
"env_array" to "env".

The "env_array" name was picked in 19a583dc39e (run-command: add
env_array, an optional argv_array for env, 2014-10-19) because "env"
was taken. Let's not forever keep the oddity of "*_array" for this
"struct strvec", but not for its "args" sibling.

This commit is almost entirely made with a coccinelle rule[1]. The
only manual change here is in run-command.h to rename the struct
member itself and to change "env_array" to "env" in the
CHILD_PROCESS_INIT initializer.

The rest of this is all a result of applying [1]:

 * make contrib/coccinelle/run_command.cocci.patch
 * patch -p1 &lt;contrib/coccinelle/run_command.cocci.patch
 * git add -u

1. cat contrib/coccinelle/run_command.pending.cocci
   @@
   struct child_process E;
   @@
   - E.env_array
   + E.env

   @@
   struct child_process *E;
   @@
   - E-&gt;env_array
   + E-&gt;env

I've avoided changing any comments and derived variable names here,
that will all be done in the next commit.

Signed-off-by: Ævar Arnfjörð Bjarmason &lt;avarab@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>run-command API: remove "env" member, always use "env_array"</title>
<updated>2021-11-26T06:15:08Z</updated>
<author>
<name>Ævar Arnfjörð Bjarmason</name>
<email>avarab@gmail.com</email>
</author>
<published>2021-11-25T22:52:24Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=c7c4bdeccf3e737e6e674cd9f0828922e629ab06'/>
<id>urn:sha1:c7c4bdeccf3e737e6e674cd9f0828922e629ab06</id>
<content type='text'>
Remove the "env" member from "struct child_process" in favor of always
using the "env_array". As with the preceding removal of "argv" in
favor of "args" this gets rid of current and future oddities around
memory management at the API boundary (see the amended API docs).

For some of the conversions we can replace patterns like:

    child.env = env-&gt;v;

With:

    strvec_pushv(&amp;child.env_array, env-&gt;v);

But for others we need to guard the strvec_pushv() with a NULL check,
since we're not passing in the "v" member of a "struct strvec",
e.g. in the case of tmp_objdir_env()'s return value.

Ideally we'd rename the "env_array" member to simply "env" as a
follow-up, since it and "args" are now inconsistent in not having an
"_array" suffix, and seemingly without any good reason, unless we look
at the history of how they came to be.

But as we've currently got 122 in-tree hits for a "git grep env_array"
let's leave that for now (and possibly forever). Doing that rename
would be too disruptive.

Signed-off-by: Ævar Arnfjörð Bjarmason &lt;avarab@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'ps/connectivity-optim'</title>
<updated>2021-11-12T23:29:24Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2021-11-12T23:29:24Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=8996d68ac7052bf96a38a2b36546c4239280279e'/>
<id>urn:sha1:8996d68ac7052bf96a38a2b36546c4239280279e</id>
<content type='text'>
Regression fix.

* ps/connectivity-optim:
  Revert "connected: do not sort input revisions"
</content>
</entry>
<entry>
<title>Revert "connected: do not sort input revisions"</title>
<updated>2021-11-11T20:34:41Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2021-11-11T20:34:41Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=a7df4f52affe49f6b06246b9398e99e1937f9efc'/>
<id>urn:sha1:a7df4f52affe49f6b06246b9398e99e1937f9efc</id>
<content type='text'>
This reverts commit f45022dc2fd692fd024f2eb41a86a66f19013d43,
as this is like breakage in the traversal more likely.  In a
history with 10 single strand of pearls,

   1--&gt;2--&gt;3--...-&gt;7--&gt;8--&gt;9--&gt;10

asking "rev-list --unsorted-input 1 10 --not 9 8 7 6 5 4" fails to
paint the bottom 1 uninteresting as the traversal stops, without
completing the propagation of uninteresting bit starting at 4 down
through 3 and 2 to 1.
</content>
</entry>
<entry>
<title>Merge branch 'ps/fetch-optim'</title>
<updated>2021-09-20T22:20:39Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2021-09-20T22:20:39Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=deec8aa2d05aba0a2ba495f9d41ab4e2805c8b9e'/>
<id>urn:sha1:deec8aa2d05aba0a2ba495f9d41ab4e2805c8b9e</id>
<content type='text'>
Optimize code that handles large number of refs in the "git fetch"
code path.

* ps/fetch-optim:
  fetch: avoid second connectivity check if we already have all objects
  fetch: merge fetching and consuming refs
  fetch: refactor fetch refs to be more extendable
  fetch-pack: optimize loading of refs via commit graph
  connected: refactor iterator to return next object ID directly
  fetch: avoid unpacking headers in object existence check
  fetch: speed up lookup of want refs via commit-graph
</content>
</entry>
<entry>
<title>connected: refactor iterator to return next object ID directly</title>
<updated>2021-09-01T19:43:56Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2021-09-01T13:09:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=9fec7b213045135655354e864d15894175428d5a'/>
<id>urn:sha1:9fec7b213045135655354e864d15894175428d5a</id>
<content type='text'>
The object ID iterator used by the connectivity checks returns the next
object ID via an out-parameter and then uses a return code to indicate
whether an item was found. This is a bit roundabout: instead of a
separate error code, we can just return the next object ID directly and
use `NULL` pointers as indicator that the iterator got no items left.
Furthermore, this avoids a copy of the object ID.

Refactor the iterator and all its implementations to return object IDs
directly. This brings a tiny performance improvement when doing a mirror-fetch of a repository with about 2.3M refs:

    Benchmark #1: 328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch
      Time (mean ± σ):     30.110 s ±  0.148 s    [User: 27.161 s, System: 5.075 s]
      Range (min … max):   29.934 s … 30.406 s    10 runs

    Benchmark #2: 328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch
      Time (mean ± σ):     29.899 s ±  0.109 s    [User: 26.916 s, System: 5.104 s]
      Range (min … max):   29.696 s … 29.996 s    10 runs

    Summary
      '328dc58b49919c43897240f2eabfa30be2ce32a4: git-fetch' ran
        1.01 ± 0.01 times faster than '328dc58b49919c43897240f2eabfa30be2ce32a4~: git-fetch'

While this 1% speedup could be labelled as statistically insignificant,
the speedup is consistent on my machine. Furthermore, this is an end to
end test, so it is expected that the improvement in the connectivity
check itself is more significant.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>connected: do not sort input revisions</title>
<updated>2021-08-09T16:51:12Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2021-08-09T08:11:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=f45022dc2fd692fd024f2eb41a86a66f19013d43'/>
<id>urn:sha1:f45022dc2fd692fd024f2eb41a86a66f19013d43</id>
<content type='text'>
In order to compute whether objects reachable from a set of tips are all
connected, we do a revision walk with these tips as positive references
and `--not --all`. `--not --all` will cause the revision walk to load
all preexisting references as uninteresting, which can be very expensive
in repositories with many references.

Benchmarking the git-rev-list(1) command highlights that by far the most
expensive single phase is initial sorting of the input revisions: after
all references have been loaded, we first sort commits by author date.
In a real-world repository with about 2.2 million references, it makes
up about 40% of the total runtime of git-rev-list(1).

Ultimately, the connectivity check shouldn't really bother about the
order of input revisions at all. We only care whether we can actually
walk all objects until we hit the cut-off point. So sorting the input is
a complete waste of time.

Introduce a new "--unsorted-input" flag to git-rev-list(1) which will
cause it to not sort the commits and adjust the connectivity check to
always pass the flag. This results in the following speedups, executed
in a clone of gitlab-org/gitlab [1]:

    Benchmark #1: git rev-list  --objects --quiet --not --all --not $(cat newrev)
      Time (mean ± σ):      7.639 s ±  0.065 s    [User: 7.304 s, System: 0.335 s]
      Range (min … max):    7.543 s …  7.742 s    10 runs

    Benchmark #2: git rev-list --unsorted-input --objects --quiet --not --all --not $newrev
      Time (mean ± σ):      4.995 s ±  0.044 s    [User: 4.657 s, System: 0.337 s]
      Range (min … max):    4.909 s …  5.048 s    10 runs

    Summary
      'git rev-list --unsorted-input --objects --quiet --not --all --not $(cat newrev)' ran
        1.53 ± 0.02 times faster than 'git rev-list  --objects --quiet --not --all --not $newrev'

[1]: https://gitlab.com/gitlab-org/gitlab.git. Note that not all refs
     are visible to clients.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
