<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/upload-pack.c, branch v2.45.4</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=v2.45.4</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.45.4'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2024-03-07T23:59:42Z</updated>
<entry>
<title>Merge branch 'jk/upload-pack-v2-capability-cleanup'</title>
<updated>2024-03-07T23:59:42Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2024-03-07T23:59:42Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=a82fa7bce88e9795f18d01ed11df927e4de70ccf'/>
<id>urn:sha1:a82fa7bce88e9795f18d01ed11df927e4de70ccf</id>
<content type='text'>
The upload-pack program, when talking over v2, accepted the
packfile-uris protocol extension from the client, even if it did
not advertise the capability, which has been corrected.

* jk/upload-pack-v2-capability-cleanup:
  upload-pack: only accept packfile-uris if we advertised it
  upload-pack: use existing config mechanism for advertisement
  upload-pack: centralize setup of sideband-all config
  upload-pack: use repository struct to get config
</content>
</entry>
<entry>
<title>Merge branch 'jk/upload-pack-bounded-resources'</title>
<updated>2024-03-07T23:59:42Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2024-03-07T23:59:42Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=56d608456000121f141a40c4501d41d712381674'/>
<id>urn:sha1:56d608456000121f141a40c4501d41d712381674</id>
<content type='text'>
Various parts of upload-pack has been updated to bound the resource
consumption relative to the size of the repository to protect from
abusive clients.

* jk/upload-pack-bounded-resources:
  upload-pack: free tree buffers after parsing
  upload-pack: use PARSE_OBJECT_SKIP_HASH_CHECK in more places
  upload-pack: always turn off save_commit_buffer
  upload-pack: disallow object-info capability by default
  upload-pack: accept only a single packfile-uri line
  upload-pack: use a strmap for want-ref lines
  upload-pack: use oidset for deepen_not list
  upload-pack: switch deepen-not list to an oid_array
  upload-pack: drop separate v2 "haves" array
</content>
</entry>
<entry>
<title>upload-pack: only accept packfile-uris if we advertised it</title>
<updated>2024-02-29T16:10:42Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2024-02-28T22:50:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=a922bfa3b5a6b2ac5e98f0e3405d66c1847aa7e8'/>
<id>urn:sha1:a922bfa3b5a6b2ac5e98f0e3405d66c1847aa7e8</id>
<content type='text'>
Clients are only supposed to request particular capabilities or features
if the server advertised them. For the "packfile-uris" feature, we only
advertise it if uploadpack.blobpacfileuri is set, but we always accept a
request from the client regardless.

In practice this doesn't really hurt anything, as we'd pass the client's
protocol list on to pack-objects, which ends up ignoring it. But we
should try to follow the protocol spec, and tightening this up may catch
buggy or misbehaving clients more easily.

Thanks to recent refactoring, we can hoist the config check from
upload_pack_advertise() into upload_pack_config(). Note the subtle
handling of a value-less bool (which does not count for triggering an
advertisement).

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>upload-pack: use existing config mechanism for advertisement</title>
<updated>2024-02-28T23:30:41Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2024-02-28T22:48:18Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=9a7b22959ad078df1f50d15e86a169aaebfb8c4c'/>
<id>urn:sha1:9a7b22959ad078df1f50d15e86a169aaebfb8c4c</id>
<content type='text'>
When serving a v2 capabilities request, we call upload_pack_advertise()
to tell us the set of features we can advertise to the client. That
involves looking at various config options, all of which need to be kept
in sync with the rules we use in upload_pack_config to set flags like
allow_filter, allow_sideband_all, and so on. If these two pieces of code
get out of sync then we may refuse to respect a capability we
advertised, or vice versa accept one that we should not.

Instead, let's call the same config helper that we'll use for processing
the actual client request, and then just pick the values out of the
resulting struct. This is only a little bit shorter than the current
code, but we don't repeat any policy logic (e.g., we don't have to worry
about the magic sideband-all environment variable here anymore).

And this reveals a gap in the existing code: there is no struct flag for
the packfile-uris capability (we accept it even if it is not advertised,
which we should not). We'll leave the advertisement code for now and
deal with it in the next patch.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>upload-pack: centralize setup of sideband-all config</title>
<updated>2024-02-28T23:30:41Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2024-02-28T22:47:18Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=37aa89b068810d4ed7f262adcfecca1e07cf553c'/>
<id>urn:sha1:37aa89b068810d4ed7f262adcfecca1e07cf553c</id>
<content type='text'>
We read uploadpack.allowsidebandall to set a matching flag in our
upload_pack_data struct. But for our tests, we also respect
GIT_TEST_SIDEBAND_ALL from the environment, and anybody looking at the
flag in the struct needs to remember to check both. There's only one
such piece of code now, but we're about to add another.

So let's have the config step actually fold the environment value into
the struct, letting the rest of the code use the flag in the obvious
way.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>upload-pack: use repository struct to get config</title>
<updated>2024-02-28T22:48:35Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2024-02-28T22:46:47Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=922fdefb84c4341c0bb10f3ca090725f9fcc8bcb'/>
<id>urn:sha1:922fdefb84c4341c0bb10f3ca090725f9fcc8bcb</id>
<content type='text'>
Our upload_pack_v2() function gets a repository struct, but we ignore it
totally.  In practice this doesn't cause any problems, as it will never
differ from the_repository. But in the spirit of taking a small step
towards getting rid of the_repository, let's at least starting using it
to grab config. There are probably other spots that could benefit, but
it's a start.

Note that we don't need to pass the repo for protected_config(); the
whole point there is that we are not looking at repo config, so there is
no repo-specific version of the function.

For the v0 version of the protocol, we're not passed a repository
struct, so we'll continue to use the_repository there.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>upload-pack: free tree buffers after parsing</title>
<updated>2024-02-28T22:42:01Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2024-02-28T22:39:07Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=6cd05e768b7e54ca48b16fb0214df4c70aecd46c'/>
<id>urn:sha1:6cd05e768b7e54ca48b16fb0214df4c70aecd46c</id>
<content type='text'>
When a client sends us a "want" or "have" line, we call parse_object()
to get an object struct. If the object is a tree, then the parsed state
means that tree-&gt;buffer points to the uncompressed contents of the tree.
But we don't really care about it. We only really need to parse commits
and tags; for trees and blobs, the important output is just a "struct
object" with the correct type.

But much worse, we do not ever free that tree buffer. It's not leaked in
the traditional sense, in that we still have a pointer to it from the
global object hash. But if the client requests many trees, we'll hold
all of their contents in memory at the same time.

Nobody really noticed because it's rare for clients to directly request
a tree. It might happen for a lightweight tag pointing straight at a
tree, or it might happen for a "tree:depth" partial clone filling in
missing trees.

But it's also possible for a malicious client to request a lot of trees,
causing upload-pack's memory to balloon. For example, without this
patch, requesting every tree in git.git like:

  pktline() {
    local msg="$*"
    printf "%04x%s\n" $((1+4+${#msg})) "$msg"
  }

  want_trees() {
    pktline command=fetch
    printf 0001
    git cat-file --batch-all-objects --batch-check='%(objectname) %(objecttype)' |
      while read oid type; do
        test "$type" = "tree" || continue
        pktline want $oid
      done
      pktline done
      printf 0000
  }

  want_trees | GIT_PROTOCOL=version=2 valgrind --tool=massif ./git upload-pack . &gt;/dev/null

shows a peak heap usage of ~3.7GB. Which is just about the sum of the
sizes of all of the uncompressed trees. For linux.git, it's closer to
17GB.

So the obvious thing to do is to call free_tree_buffer() after we
realize that we've parsed a tree. We know that upload-pack won't need it
later. But let's push the logic into parse_object_with_flags(), telling
it to discard the tree buffer immediately. There are two reasons for
this. One, all of the relevant call-sites already call the with_options
variant to pass the SKIP_HASH flag. So it actually ends up as less code
than manually free-ing in each spot. And two, it enables an extra
optimization that I'll discuss below.

I've touched all of the sites that currently use SKIP_HASH in
upload-pack. That drops the peak heap of the upload-pack invocation
above from 3.7GB to ~24MB.

I've also modified the caller in get_reference(); a partial clone
benefits from its use in pack-objects for the reasons given in
0bc2557951 (upload-pack: skip parse-object re-hashing of "want" objects,
2022-09-06), where we were measuring blob requests. But note that the
results of get_reference() are used for traversing, as well; so we
really would _eventually_ use the tree contents. That makes this at
first glance a space/time tradeoff: we won't hold all of the trees in
memory at once, but we'll have to reload them each when it comes time to
traverse.

And here's where our extra optimization comes in. If the caller is not
going to immediately look at the tree contents, and it doesn't care
about checking the hash, then parse_object() can simply skip loading the
tree entirely, just like we do for blobs! And now it's not a space/time
tradeoff in get_reference() anymore. It's just a lazy-load: we're
delaying reading the tree contents until it's time to actually traverse
them one by one.

And of course for upload-pack, this optimization means we never load the
trees at all, saving lots of CPU time. Timing the "every tree from
git.git" request above shows upload-pack dropping from 32 seconds of CPU
to 19 (the remainder is mostly due to pack-objects actually sending the
pack; timing just the upload-pack portion shows we go from 13s to
~0.28s).

These are all highly gamed numbers, of course. For real-world
partial-clone requests we're saving only a small bit of time in
practice. But it does help harden upload-pack against malicious
denial-of-service attacks.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>upload-pack: use PARSE_OBJECT_SKIP_HASH_CHECK in more places</title>
<updated>2024-02-28T22:42:01Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2024-02-28T22:39:03Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=a6ca601cdf82dda85db86c12973f2ecd746cb4bd'/>
<id>urn:sha1:a6ca601cdf82dda85db86c12973f2ecd746cb4bd</id>
<content type='text'>
In commit 0bc2557951 (upload-pack: skip parse-object re-hashing of
"want" objects, 2022-09-06), we optimized the parse_object() calls for
v2 "want" lines from the client so that they avoided parsing blobs, and
so that they used the commit-graph rather than parsing commit objects
from scratch.

We should extend that to two other spots:

  1. We parse "have" objects in the got_oid() function. These won't
     generally be non-commits (unlike "want" lines from a partial
     clone). But we still benefit from the use of the commit-graph.

  2. For v0, the "want" lines are parsed in receive_needs(). These are
     also less likely to be non-commits because by default they have to
     be ref tips. There are config options you might set to allow
     non-tip objects, but you'd mostly do so to support partial clones,
     and clients recent enough to support partial clone will generally
     speak v2 anyway.

So I don't expect this change to improve performance much for day-to-day
operations. But both are possible denial-of-service vectors, where an
attacker can waste our time by sending over a large number of objects to
parse (of course we may waste even more time serving a pack to them, but
we try as much as possible to optimize that in pack-objects; we should
do what we can here in upload-pack, too).

With this patch, running p5600 with GIT_TEST_PROTOCOL_VERSION=0 shows
similar results to what we saw in 0bc2557951 (which ran with the v2
protocol by default). Here are the numbers for linux.git:

  Test                          HEAD^                 HEAD
  -----------------------------------------------------------------------------
  5600.3: checkout of result    50.91(87.95+2.93)     41.75(79.00+3.18) -18.0%

Or for a more extreme (and malicious) case, we can claim to "have" every
blob in git.git over the v0 protocol:

  $ {
      echo "0032want $(git rev-parse HEAD)"
      printf 0000
      git cat-file --batch-all-objects --batch-check='%(objectname) %(objecttype)' |
      perl -alne 'print "0032have $F[0]" if $F[1] eq "blob"'
    } &gt;input

  $ time ./git.old upload-pack . &lt;input &gt;/dev/null
  real	0m52.951s
  user	0m51.633s
  sys	0m1.304s

  $ time ./git.new upload-pack . &lt;input &gt;/dev/null
  real	0m0.261s
  user	0m0.156s
  sys	0m0.105s

(Note that these don't actually compute a pack because of the hacky
protocol usage, so those numbers are representing the raw blob-parsing
effort done by upload-pack).

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>upload-pack: always turn off save_commit_buffer</title>
<updated>2024-02-28T22:42:01Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2024-02-28T22:39:00Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=5f64279443520922949ea89babe9bd3712c11fec'/>
<id>urn:sha1:5f64279443520922949ea89babe9bd3712c11fec</id>
<content type='text'>
When the client sends us "want $oid" lines, we call parse_object($oid)
to get an object struct. It's important to parse the commits because we
need to traverse them in the negotiation phase. But of course we don't
need to hold on to the commit messages for each one.

We've turned off the save_commit_buffer flag in get_common_commits() for
a long time, since f0243f26f6 (git-upload-pack: More efficient usage of
the has_sha1 array, 2005-10-28). That helps with the commits we see
while actually traversing. But:

  1. That function is only used by the v0 protocol. I think the v2
     protocol's code path leaves the flag on (and thus pays the extra
     memory penalty), though I didn't measure it specifically.

  2. If the client sends us a bunch of "want" lines, that happens before
     the negotiation phase. So we'll hold on to all of those commit
     messages. Generally the number of "want" lines scales with the
     refs, not with the number of objects in the repo. But a malicious
     client could send a lot in order to waste memory.

As an example of (2), if I generate a request to fetch all commits in
git.git like this:

  pktline() {
    local msg="$*"
    printf "%04x%s\n" $((1+4+${#msg})) "$msg"
  }

  want_commits() {
    pktline command=fetch
    printf 0001
    git cat-file --batch-all-objects --batch-check='%(objectname) %(objecttype)' |
      while read oid type; do
        test "$type" = "commit" || continue
        pktline want $oid
      done
      pktline done
      printf 0000
  }

  want_commits | GIT_PROTOCOL=version=2 valgrind --tool=massif git-upload-pack . &gt;/dev/null

before this patch upload-pack peaks at ~125MB, and after at ~35MB. The
difference is not coincidentally about the same as the sum of all commit
object sizes as computed by:

  git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectsize)' |
  perl -alne '$v += $F[1] if $F[0] eq "commit"; END { print $v }'

In a larger repository like linux.git, that number is ~1GB.

In a repository with a full commit-graph file this will have no impact
(and the commit graph would save us from parsing at all, so is a much
better solution!). But it's easy to do, might help a little in
real-world cases (where even if you have a commit graph it might not be
fully up to date), and helps a lot for a worst-case malicious request.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>upload-pack: accept only a single packfile-uri line</title>
<updated>2024-02-28T22:42:01Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2024-02-28T22:38:46Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=179776f9e6c4bd2f423c78fcf8b1e4cb4afc4c45'/>
<id>urn:sha1:179776f9e6c4bd2f423c78fcf8b1e4cb4afc4c45</id>
<content type='text'>
When we see a packfile-uri line from the client, we use
string_list_split() to split it on commas and store the result in a
string_list.  A single packfile-uri line is therefore limited to storing
~64kb, the size of a pkt-line.

But we'll happily accept multiple such lines, and each line appends to
the string list, growing without bound.

In theory this could be useful, making:

  0017packfile-uris http
  0018packfile-uris https

equivalent to:

  001dpackfile-uris http,https

But the protocol documentation doesn't indicate that this should work
(and indeed, refers to this in the singular as "the following argument
can be included in the client's request"). And the client-side
implementation in fetch-pack has always sent a single line (JGit appears
to understand the line on the server side but has no client-side
implementation, and libgit2 understands neither).

If we were worried about compatibility, we could instead just put a
limit on the maximum number of values we'd accept. The current client
implementation limits itself to only two values: "http" and "https", so
something like "256" would be more than enough. But accepting only a
single line seems more in line with the protocol documentation, and
matches other parts of the protocol (e.g., we will not accept a second
"filter" line).

We'll also make this more explicit in the protocol documentation; as
above, I think this was always the intent, but there's no harm in making
it clear.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
