git/dir.c, branch v2.46.2

win32: override `fspathcmp()` with a directory separator-aware version

2024-07-13T23:23:36Z

On Windows, the backslash is the directory separator, even if the forward slash can be used, too, at least since Windows NT. This means that the paths `a/b` and `a\b` are equivalent, and `fspathcmp()` needs to be made aware of that fact. Note that we have to override both `fspathcmp()` and `fspathncmp()`, and the former cannot be a mere pre-processor constant that transforms calls to `fspathcmp(a, b)` into `fspathncmp(a, b, (size_t)-1)` because the function `report_collided_checkout()` in `unpack-trees.c` wants to assign `list.cmp = fspathcmp`. Also note that `fspatheq()` does _not_ need to be overridden because it calls `fspathcmp()` internally. Signed-off-by: Johannes Schindelin Signed-off-by: Junio C Hamano

global: introduce `USE_THE_REPOSITORY_VARIABLE` macro

2024-06-14T17:26:33Z

Use of the `the_repository` variable is deprecated nowadays, and we slowly but steadily convert the codebase to not use it anymore. Instead, callers should be passing down the repository to work on via parameters. It is hard though to prove that a given code unit does not use this variable anymore. The most trivial case, merely demonstrating that there is no direct use of `the_repository`, is already a bit of a pain during code reviews as the reviewer needs to manually verify claims made by the patch author. The bigger problem though is that we have many interfaces that implicitly rely on `the_repository`. Introduce a new `USE_THE_REPOSITORY_VARIABLE` macro that allows code units to opt into usage of `the_repository`. The intent of this macro is to demonstrate that a certain code unit does not use this variable anymore, and to keep it from new dependencies on it in future changes, be it explicit or implicit For now, the macro only guards `the_repository` itself as well as `the_hash_algo`. There are many more known interfaces where we have an implicit dependency on `the_repository`, but those are not guarded at the current point in time. Over time though, we should start to add guards as required (or even better, just remove them). Define the macro as required in our code units. As expected, most of our code still relies on the global variable. Nearly all of our builtins rely on the variable as there is no way yet to pass `the_repository` to their entry point. For now, declare the macro in "biultin.h" to keep the required changes at least a little bit more contained. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

hash: require hash algorithm in `oidread()` and `oidclr()`

2024-06-14T17:26:32Z

Both `oidread()` and `oidclr()` use `the_repository` to derive the hash function that shall be used. Require callers to pass in the hash algorithm to get rid of this implicit dependency. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

Merge branch 'jk/sparse-leakfix'

2024-06-12T20:37:17Z

Many memory leaks in the sparse-checkout code paths have been plugged. * jk/sparse-leakfix: sparse-checkout: free duplicate hashmap entries sparse-checkout: free string list after displaying sparse-checkout: free pattern list in sparse_checkout_list() sparse-checkout: free sparse_filename after use sparse-checkout: refactor temporary sparse_checkout_patterns sparse-checkout: always free "line" strbuf after reading input sparse-checkout: reuse --stdin buffer when reading patterns dir.c: always copy input to add_pattern() dir.c: free removed sparse-pattern hashmap entries sparse-checkout: clear patterns when init() sees existing sparse file dir.c: free strings in sparse cone pattern hashmaps sparse-checkout: pass string literals directly to add_pattern() sparse-checkout: free string list in write_cone_to_file()

Merge branch 'jk/cap-exclude-file-size'

2024-06-12T20:37:17Z

An overly large ".gitignore" files are now rejected silently. * jk/cap-exclude-file-size: dir.c: reduce max pattern file size to 100MB dir.c: skip .gitignore, etc larger than INT_MAX

dir.c: always copy input to add_pattern()

2024-06-05T16:51:42Z

The add_pattern() function has a subtle and undocumented gotcha: the pattern string you pass in must remain valid as long as the pattern_list is in use (and nor do we take ownership of it). This is easy to get wrong, causing either subtle bugs (because you free or reuse the string buffer) or leaks (because you copy the string, but don't track ownership separately). All of this "pattern" code was originally the "exclude" mechanism. So this _usually_ works OK because you add entries in one of two ways: 1. From the command-line (e.g., "--exclude"), in which case we're pointing to an argv entry which remains valid for the lifetime of the program. 2. From a file (e.g., ".gitignore"), in which case we read the whole file into a buffer, attach it to the pattern_list's "filebuf" entry, then parse the buffer in-place (adding NULs). The strings point into the filebuf, which is cleaned up when the whole pattern_list goes away. But other code, like sparse-checkout, reads individual lines from stdin and passes them one by one to add_pattern(), leaking each. We could fix this by refactoring it to take in the whole buffer at once, like (2) above, and stuff it in "filebuf". But given how subtle the interface is, let's just fix it to always copy the string. That seems at first like we'd be wasting extra memory, but we can mitigate that: a. The path_pattern struct already uses a FLEXPTR, since we sometimes make a copy (when we see "foo/", we strip off the trailing slash, requiring a modifiable copy of the string). Since we'll now always embed the string inside the struct, we can switch to the regular FLEX_ARRAY pattern, saving us 8 bytes of pointer. So patterns with a trailing slash and ones under 8 bytes actually get smaller. b. Now that we don't need the original string to hang around, we can get rid of the "filebuf" mechanism entirely, and just free the file contents after parsing. Since files are the sources we'd expect to have the largest pattern sets, we should mostly break even on stuffing the same data into the individual structs. This patch just adjusts the add_pattern() interface; it doesn't fix any leaky callers yet. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

dir.c: reduce max pattern file size to 100MB

2024-06-05T16:23:42Z

In a2bc523e1e (dir.c: skip .gitignore, etc larger than INT_MAX, 2024-05-31) we put capped the size of some files whose parsing code and data structures used ints. Setting the limit to INT_MAX was a natural spot, since we know the parsing code would misbehave above that. But it also leaves the possibility of overflow errors when we multiply that limit to allocate memory. For instance, a file consisting only of "a\na\n..." could have INT_MAX/2 entries. Allocating an array of pointers for each would need INT_MAX*4 bytes on a 64-bit system, enough to overflow a 32-bit int. So let's give ourselves a bit more safety margin by giving a much smaller limit. The size 100MB is somewhat arbitrary, but is based on the similar value for attribute files added by 3c50032ff5 (attr: ignore overly large gitattributes files, 2022-12-01). There's no particular reason these have to be the same, but the idea is that they are in the ballpark of "so huge that nobody would care, but small enough to avoid malicious overflow". So lacking a better guess, it makes sense to use the same value. The implementation here doesn't share the same constant, but we could change that later (or even give it a runtime config knob, though nobody has complained yet about the attribute limit). And likewise, let's add a few tests that exercise the limits, based on the attr ones. In this case, though, we never read .gitignore from the index; the blob code is exercised only for sparse filters. So we'll trigger it that way. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

dir.c: free removed sparse-pattern hashmap entries

2024-06-04T17:38:23Z

In add_pattern_to_hashsets(), we remove entries from the recursive_hashmap when adding similar ones to the parent_hashmap. I won't pretend to understand all of what's going on here, but there's an obvious leak: whatever we removed from recursive_hashmap is not referenced anywhere else, and is never free()d. We can easily fix this by asking the hashmap to return a pointer to the old entry. This makes t7002 now completely leak-free. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

dir.c: free strings in sparse cone pattern hashmaps

2024-06-04T17:38:23Z

The pattern_list structs used for cone-mode sparse lookups use a few extra hashmaps. These store pattern_entry structs, each of which has its own heap-allocated pattern string. When we clean up the hashmaps, we free the individual pattern_entry structs, but forget to clean up the embedded strings, causing memory leaks. We can fix this by iterating over the hashmaps to free the extra strings. This reduces the numbers of leaks in t7002 from 22 to 9. One alternative here would be to make the string a FLEX_ARRAY member of the pattern_entry. Then there's no extra free() required, and as a bonus it would be a little more efficient. However, some of the refactoring gets awkward, as we are often assigning strings allocated by helper functions. So let's just fix the leak for now, and we can explore bigger refactoring separately. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

dir.c: skip .gitignore, etc larger than INT_MAX

2024-05-31T22:30:32Z

We use add_patterns() to read .gitignore, .git/info/exclude, etc, as well as other pattern-like files like sparse-checkout. The parser for these uses an "int" as an index, meaning that files over 2GB will generally cause signed integer overflow and out-of-bounds access. This is unlikely to happen in any real files, but we do read .gitignore files from the tree. A malicious tree could cause an out-of-bounds read and segfault (we also write NULs over newlines, so in theory it could be an out-of-bounds write, too, but as we go char-by-char, the first thing that happens is trying to read a negative 2GB offset). We could fix the most obvious issue by replacing one "int" with a "size_t". But there are tons of "int" sprinkled throughout this code for things like pattern lengths, number of patterns, and so on. Since nobody would actually want a 2GB .gitignore file, an easy defensive measure is to just refuse to parse them. The "int" in question is in add_patterns_from_buffer(), so we could catch it there. But by putting the checks in its two callers, we can produce more useful error messages. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano