<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/dir.c, branch v2.46.2</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=v2.46.2</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.46.2'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2024-07-13T23:23:36Z</updated>
<entry>
<title>win32: override `fspathcmp()` with a directory separator-aware version</title>
<updated>2024-07-13T23:23:36Z</updated>
<author>
<name>Johannes Schindelin</name>
<email>johannes.schindelin@gmx.de</email>
</author>
<published>2024-07-13T21:08:20Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=193eda7507d0ccb2fe6fd42b403c56ecc205e546'/>
<id>urn:sha1:193eda7507d0ccb2fe6fd42b403c56ecc205e546</id>
<content type='text'>
On Windows, the backslash is the directory separator, even if the
forward slash can be used, too, at least since Windows NT.

This means that the paths `a/b` and `a\b` are equivalent, and
`fspathcmp()` needs to be made aware of that fact.

Note that we have to override both `fspathcmp()` and `fspathncmp()`, and
the former cannot be a mere pre-processor constant that transforms calls
to `fspathcmp(a, b)` into `fspathncmp(a, b, (size_t)-1)` because the
function `report_collided_checkout()` in `unpack-trees.c` wants to
assign `list.cmp = fspathcmp`.

Also note that `fspatheq()` does _not_ need to be overridden because it
calls `fspathcmp()` internally.

Signed-off-by: Johannes Schindelin &lt;johannes.schindelin@gmx.de&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>global: introduce `USE_THE_REPOSITORY_VARIABLE` macro</title>
<updated>2024-06-14T17:26:33Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-06-14T06:50:23Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=e7da9385708accf518a80a1e17969020fb361048'/>
<id>urn:sha1:e7da9385708accf518a80a1e17969020fb361048</id>
<content type='text'>
Use of the `the_repository` variable is deprecated nowadays, and we
slowly but steadily convert the codebase to not use it anymore. Instead,
callers should be passing down the repository to work on via parameters.

It is hard though to prove that a given code unit does not use this
variable anymore. The most trivial case, merely demonstrating that there
is no direct use of `the_repository`, is already a bit of a pain during
code reviews as the reviewer needs to manually verify claims made by the
patch author. The bigger problem though is that we have many interfaces
that implicitly rely on `the_repository`.

Introduce a new `USE_THE_REPOSITORY_VARIABLE` macro that allows code
units to opt into usage of `the_repository`. The intent of this macro is
to demonstrate that a certain code unit does not use this variable
anymore, and to keep it from new dependencies on it in future changes,
be it explicit or implicit

For now, the macro only guards `the_repository` itself as well as
`the_hash_algo`. There are many more known interfaces where we have an
implicit dependency on `the_repository`, but those are not guarded at
the current point in time. Over time though, we should start to add
guards as required (or even better, just remove them).

Define the macro as required in our code units. As expected, most of our
code still relies on the global variable. Nearly all of our builtins
rely on the variable as there is no way yet to pass `the_repository` to
their entry point. For now, declare the macro in "biultin.h" to keep the
required changes at least a little bit more contained.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>hash: require hash algorithm in `oidread()` and `oidclr()`</title>
<updated>2024-06-14T17:26:32Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2024-06-14T06:49:54Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=9da95bda74cf10e1475384a71fd20914c3b99784'/>
<id>urn:sha1:9da95bda74cf10e1475384a71fd20914c3b99784</id>
<content type='text'>
Both `oidread()` and `oidclr()` use `the_repository` to derive the hash
function that shall be used. Require callers to pass in the hash
algorithm to get rid of this implicit dependency.

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'jk/sparse-leakfix'</title>
<updated>2024-06-12T20:37:17Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2024-06-12T20:37:17Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=51ea70c18ad4646fdcccdce72a1a457f7ada6297'/>
<id>urn:sha1:51ea70c18ad4646fdcccdce72a1a457f7ada6297</id>
<content type='text'>
Many memory leaks in the sparse-checkout code paths have been
plugged.

* jk/sparse-leakfix:
  sparse-checkout: free duplicate hashmap entries
  sparse-checkout: free string list after displaying
  sparse-checkout: free pattern list in sparse_checkout_list()
  sparse-checkout: free sparse_filename after use
  sparse-checkout: refactor temporary sparse_checkout_patterns
  sparse-checkout: always free "line" strbuf after reading input
  sparse-checkout: reuse --stdin buffer when reading patterns
  dir.c: always copy input to add_pattern()
  dir.c: free removed sparse-pattern hashmap entries
  sparse-checkout: clear patterns when init() sees existing sparse file
  dir.c: free strings in sparse cone pattern hashmaps
  sparse-checkout: pass string literals directly to add_pattern()
  sparse-checkout: free string list in write_cone_to_file()
</content>
</entry>
<entry>
<title>Merge branch 'jk/cap-exclude-file-size'</title>
<updated>2024-06-12T20:37:17Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2024-06-12T20:37:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=c2f79440acc0d24e5a8a61648cf92055cf4ca5c8'/>
<id>urn:sha1:c2f79440acc0d24e5a8a61648cf92055cf4ca5c8</id>
<content type='text'>
An overly large ".gitignore" files are now rejected silently.

* jk/cap-exclude-file-size:
  dir.c: reduce max pattern file size to 100MB
  dir.c: skip .gitignore, etc larger than INT_MAX
</content>
</entry>
<entry>
<title>dir.c: always copy input to add_pattern()</title>
<updated>2024-06-05T16:51:42Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2024-06-04T10:13:22Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=eed1fbe73b46e5c7f87d724f97fba6d8a4391fc9'/>
<id>urn:sha1:eed1fbe73b46e5c7f87d724f97fba6d8a4391fc9</id>
<content type='text'>
The add_pattern() function has a subtle and undocumented gotcha: the
pattern string you pass in must remain valid as long as the pattern_list
is in use (and nor do we take ownership of it). This is easy to get
wrong, causing either subtle bugs (because you free or reuse the string
buffer) or leaks (because you copy the string, but don't track ownership
separately).

All of this "pattern" code was originally the "exclude" mechanism. So
this _usually_ works OK because you add entries in one of two ways:

  1. From the command-line (e.g., "--exclude"), in which case we're
     pointing to an argv entry which remains valid for the lifetime of
     the program.

  2. From a file (e.g., ".gitignore"), in which case we read the whole
     file into a buffer, attach it to the pattern_list's "filebuf"
     entry, then parse the buffer in-place (adding NULs). The strings
     point into the filebuf, which is cleaned up when the whole
     pattern_list goes away.

But other code, like sparse-checkout, reads individual lines from stdin
and passes them one by one to add_pattern(), leaking each. We could fix
this by refactoring it to take in the whole buffer at once, like (2)
above, and stuff it in "filebuf". But given how subtle the interface is,
let's just fix it to always copy the string.

That seems at first like we'd be wasting extra memory, but we can
mitigate that:

  a. The path_pattern struct already uses a FLEXPTR, since we sometimes
     make a copy (when we see "foo/", we strip off the trailing slash,
     requiring a modifiable copy of the string).

     Since we'll now always embed the string inside the struct, we can
     switch to the regular FLEX_ARRAY pattern, saving us 8 bytes of
     pointer. So patterns with a trailing slash and ones under 8 bytes
     actually get smaller.

  b. Now that we don't need the original string to hang around, we can
     get rid of the "filebuf" mechanism entirely, and just free the file
     contents after parsing. Since files are the sources we'd expect to
     have the largest pattern sets, we should mostly break even on
     stuffing the same data into the individual structs.

This patch just adjusts the add_pattern() interface; it doesn't fix any
leaky callers yet.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>dir.c: reduce max pattern file size to 100MB</title>
<updated>2024-06-05T16:23:42Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2024-06-05T08:03:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=e7c3d1ddba0b3eb9d52780588636833055830c6f'/>
<id>urn:sha1:e7c3d1ddba0b3eb9d52780588636833055830c6f</id>
<content type='text'>
In a2bc523e1e (dir.c: skip .gitignore, etc larger than INT_MAX,
2024-05-31) we put capped the size of some files whose parsing code and
data structures used ints. Setting the limit to INT_MAX was a natural
spot, since we know the parsing code would misbehave above that.

But it also leaves the possibility of overflow errors when we multiply
that limit to allocate memory. For instance, a file consisting only of
"a\na\n..." could have INT_MAX/2 entries. Allocating an array of
pointers for each would need INT_MAX*4 bytes on a 64-bit system, enough
to overflow a 32-bit int.

So let's give ourselves a bit more safety margin by giving a much
smaller limit. The size 100MB is somewhat arbitrary, but is based on the
similar value for attribute files added by 3c50032ff5 (attr: ignore
overly large gitattributes files, 2022-12-01).

There's no particular reason these have to be the same, but the idea is
that they are in the ballpark of "so huge that nobody would care, but
small enough to avoid malicious overflow". So lacking a better guess, it
makes sense to use the same value. The implementation here doesn't share
the same constant, but we could change that later (or even give it a
runtime config knob, though nobody has complained yet about the
attribute limit).

And likewise, let's add a few tests that exercise the limits, based on
the attr ones. In this case, though, we never read .gitignore from the
index; the blob code is exercised only for sparse filters. So we'll
trigger it that way.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>dir.c: free removed sparse-pattern hashmap entries</title>
<updated>2024-06-04T17:38:23Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2024-06-04T10:13:20Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=4c844c2f499a573e5801bd3cb83b8d42d3f1566a'/>
<id>urn:sha1:4c844c2f499a573e5801bd3cb83b8d42d3f1566a</id>
<content type='text'>
In add_pattern_to_hashsets(), we remove entries from the
recursive_hashmap when adding similar ones to the parent_hashmap. I
won't pretend to understand all of what's going on here, but there's an
obvious leak: whatever we removed from recursive_hashmap is not
referenced anywhere else, and is never free()d.

We can easily fix this by asking the hashmap to return a pointer to the
old entry. This makes t7002 now completely leak-free.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>dir.c: free strings in sparse cone pattern hashmaps</title>
<updated>2024-06-04T17:38:23Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2024-06-04T10:13:14Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=4318d3ab6589af78cef344afbab100d639219468'/>
<id>urn:sha1:4318d3ab6589af78cef344afbab100d639219468</id>
<content type='text'>
The pattern_list structs used for cone-mode sparse lookups use a few
extra hashmaps. These store pattern_entry structs, each of which has its
own heap-allocated pattern string. When we clean up the hashmaps, we
free the individual pattern_entry structs, but forget to clean up the
embedded strings, causing memory leaks.

We can fix this by iterating over the hashmaps to free the extra
strings. This reduces the numbers of leaks in t7002 from 22 to 9.

One alternative here would be to make the string a FLEX_ARRAY member of
the pattern_entry. Then there's no extra free() required, and as a bonus
it would be a little more efficient. However, some of the refactoring
gets awkward, as we are often assigning strings allocated by helper
functions. So let's just fix the leak for now, and we can explore bigger
refactoring separately.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>dir.c: skip .gitignore, etc larger than INT_MAX</title>
<updated>2024-05-31T22:30:32Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2024-05-31T12:00:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=a2bc523e1ea2ef9b59eb0c26331b6e7d9dc5a812'/>
<id>urn:sha1:a2bc523e1ea2ef9b59eb0c26331b6e7d9dc5a812</id>
<content type='text'>
We use add_patterns() to read .gitignore, .git/info/exclude, etc, as
well as other pattern-like files like sparse-checkout. The parser for
these uses an "int" as an index, meaning that files over 2GB will
generally cause signed integer overflow and out-of-bounds access.

This is unlikely to happen in any real files, but we do read .gitignore
files from the tree. A malicious tree could cause an out-of-bounds read
and segfault (we also write NULs over newlines, so in theory it could be
an out-of-bounds write, too, but as we go char-by-char, the first thing
that happens is trying to read a negative 2GB offset).

We could fix the most obvious issue by replacing one "int" with a
"size_t". But there are tons of "int" sprinkled throughout this code for
things like pattern lengths, number of patterns, and so on. Since nobody
would actually want a 2GB .gitignore file, an easy defensive measure is
to just refuse to parse them.

The "int" in question is in add_patterns_from_buffer(), so we could
catch it there. But by putting the checks in its two callers, we can
produce more useful error messages.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
