aboutsummaryrefslogtreecommitdiffstats
path: root/line-log.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2025-09-18Merge branch 'sg/line-log-boundary-fixes'Junio C Hamano1-3/+12
A corner case bug in "git log -L..." has been corrected. * sg/line-log-boundary-fixes: line-log: show all line ranges touched by the same diff range line-log: fix assertion error
2025-08-25line-log: simplify condition checking for merge commitsSZEDER Gábor1-3/+3
In process_ranges_arbitrary_commit() the condition deciding whether the given commit is not a merge, i.e. that it doesn't have more than one parent, is head-scratchingly backwards, flip it. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-08-25line-log: initialize diff queue in process_ranges_ordinary_commit()SZEDER Gábor1-1/+1
process_ranges_ordinary_commit() uses a local diff queue variable, which it leaves uninitialized before passing its address to queue_diffs(). This is not an issue, because at the end of that function the contents of an other diff queue is moved into it by simply overwriting whatever is in there, i.e. without reading any uninitialized memory. Still, seeing the uninitialized diff queue being passed around scared me more than once, so out of caution let's make sure that it's initialized. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-08-25line-log: get rid of the parents array in process_ranges_merge_commit()SZEDER Gábor1-12/+12
We can easily iterate through the parents of a merge commit without turning the list of parents into a dynamically allocated array of parents, so let's do so. This way we can avoid a memory allocation for each processed merge commit, though its effect on runtime seems to be unmeasurable. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-08-25line-log: avoid unnecessary tree diffs when processing merge commitsSZEDER Gábor1-15/+5
In process_ranges_merge_commit(), the line-level log first creates an array of diff queues by iterating over all parents of a merge commit and computing a tree diff for each. Then in a second loop it iterates over those diff queues, and if it finds that none of the interesting paths were modified in one of them, then it will return early. This means that when none of the interesting paths were modified between a merge and its first parent, then the tree diff between the merge and its second (Nth...) parent was computed in vain. Unify these two loops, so when it iterates over all parents of a merge commit, then it first computes the tree diff between the merge and that particular parent and then processes the resulting diff queue right away. This way we can spare some tree diff computing, thereby speeding up line-level log in repositories with mergy history: # git.git, 25.8% of commits are merges: Benchmark 1: ./git_v2.51.0 -C ~/src/git log -L:'lookup_commit(':commit.c v2.51.0 Time (mean ± σ): 1.001 s ± 0.009 s [User: 0.906 s, System: 0.095 s] Range (min … max): 0.991 s … 1.023 s 10 runs Benchmark 2: ./git -C ~/src/git log -L:'lookup_commit(':commit.c v2.51.0 Time (mean ± σ): 445.5 ms ± 3.4 ms [User: 358.8 ms, System: 84.3 ms] Range (min … max): 440.1 ms … 450.3 ms 10 runs Summary './git -C ~/src/git log -L:'lookup_commit(':commit.c v2.51.0' ran 2.25 ± 0.03 times faster than './git_v2.51.0 -C ~/src/git log -L:'lookup_commit(':commit.c v2.51.0' # linux.git, 7.5% of commits are merges: Benchmark 1: ./git_v2.51.0 -C ~/src/linux.git log -L:build_restore_work_registers:arch/mips/mm/tlbex.c v6.16 Time (mean ± σ): 3.246 s ± 0.007 s [User: 2.835 s, System: 0.409 s] Range (min … max): 3.232 s … 3.255 s 10 runs Benchmark 2: ./git -C ~/src/linux.git log -L:build_restore_work_registers:arch/mips/mm/tlbex.c v6.16 Time (mean ± σ): 2.467 s ± 0.014 s [User: 2.113 s, System: 0.353 s] Range (min … max): 2.455 s … 2.505 s 10 runs Summary './git -C ~/src/linux.git log -L:build_restore_work_registers:arch/mips/mm/tlbex.c v6.16' ran 1.32 ± 0.01 times faster than './git_v2.51.0 -C ~/src/linux.git log -L:build_restore_work_registers:arch/mips/mm/tlbex.c v6.16' And since now each iteration computes a tree diff and processes its result, there is no reason to store the diff queues for each merge parent anymore, so replace that diff queue array with a loop-local diff queue variable. With this change the static free_diffqueues() helper function in 'line-log.c' has no more callers left, remove it. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-08-20line-log: show all line ranges touched by the same diff rangeSZEDER Gábor1-0/+9
When line-level log is invoked with more than one disjoint line range in the same file, and one of the commits happens to change that file such that one diff range modifies more than one line range, then changes to all modified line ranges should be shown, but only the changes in the first modified line range are: $ git log --oneline -p 80ca903 (HEAD -> master) Initial diff --git a/file b/file new file mode 100644 index 0000000..00935f1 --- /dev/null +++ b/file @@ -0,0 +1,10 @@ +Line 1 +Line 2 +Line 3 +Line 4 +Line 5 +Line 6 +Line 7 +Line 8 +Line 9 +Line 10 $ git log --oneline -L1,2:file -L4,5:file -L7,8:file 80ca903 (HEAD -> master) Initial diff --git a/file b/file --- /dev/null +++ b/file @@ -0,0 +1,2 @@ +Line 1 +Line 2 The line-log-specific diff printer is already clever enough to handle the case when one line range covers multiple diff ranges, but the possibility of one diff range touching multiple disjoint line ranges was apparently overlooked. Add the necessary condition to dump_diff_hacky_one() to handle this case as well, and show all modified line ranges: $ git log --oneline -L1,2:file -L4,5:file -L7,8:file 0f9a5b4 (HEAD -> master) Initial diff --git a/file b/file --- /dev/null +++ b/file @@ -0,0 +1,2 @@ +Line 1 +Line 2 @@ -0,0 +4,2 @@ +Line 4 +Line 5 @@ -0,0 +7,2 @@ +Line 7 +Line 8 This bug was already present in the initial line-log implementation added in 2da1d1f6f (Implement line-history search (git log -L), 2013-03-28). Interestingly, that commit already contained a canned test case covering a similar scenario: "-L '/long f/',/^}/:a.c -L /main/,/^}/:a.c simple" This test case looks for two line ranges in the same file, and both trace back disjointly to the test repository's inital commit, therefore changes to both line ranges should have been shown for the initial commit, but only changes for the first line range are shown. So this test case should have failed from the very beginning, but it never did, because, unfortunately, the canned expected result is incorrect, as it doesn't include changes for the second line range. A similar test with a similarly incorrect canned expected result was added later in 209618860c (log -L: fix overlapping input ranges, 2013-04-05). Correct these two canned expected results to contain the changes for the second line range for the initial commit as well. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-08-20line-log: fix assertion errorSZEDER Gábor1-3/+3
When line-level log is invoked with more than one disjoint line range in the same file, and one of the commits happens to change that file such that: - the last line of a line range R(n) immediately preceeds the first line modified or added by a hunk H, and - subtracting the number of lines added by hunk H from the start and end of the subsequent line range R(n+1) would result in a range overlapping with line range R(n), then git aborts with an assertion error, because those overlapping line ranges violate the invariants: $ git log --oneline -p 73e4e2f (HEAD -> master) Add lines 6 7 8 9 10 diff --git a/file b/file index 572d5d9..00935f1 100644 --- a/file +++ b/file @@ -3,3 +3,8 @@ Line 2 Line 3 Line 4 Line 5 +Line 6 +Line 7 +Line 8 +Line 9 +Line 10 66e3561 Add lines 1 2 3 4 5 diff --git a/file b/file new file mode 100644 index 0000000..572d5d9 --- /dev/null +++ b/file @@ -0,0 +1,5 @@ +Line 1 +Line 2 +Line 3 +Line 4 +Line 5 $ git log --oneline -L3,5:file -L7,8:file git: line-log.c:73: range_set_append: Assertion `rs->nr == 0 || rs->ranges[rs->nr-1].end <= a' failed. Aborted (core dumped) The line-log machinery encodes line and diff ranges internally as [start, end) pairs, i.e. include 'start' but exclude 'end', and line numbering starts at 0 (as opposed to the -LX,Y option, where it starts at 1, IOW the parameter -L3,5 is represented internally as { start = 2, end = 5 }). The reason for this assertion error and some related issues is that there are a couple of places where 'end' is mistakenly considered to be part of the range: - When a commit modifies an interesting path, the line-log machinery first checks which diff range (i.e. hunk) modify any line ranges. This is done in diff_ranges_filter_touched(), where the outer loop iterates over the diff ranges, and in each iteration the inner loop advances the line ranges supposedly until the current line range ends at or after the current diff range starts, and then the current diff and line ranges are checked for overlap. For HEAD in the above example the first line range [2, 5) ends just before the diff range [5, 10) starts, so the inner loop should advance, and then the second line range [6, 8) and the diff range should be checked for overlap. Unfortunately, the condition of the inner loop mistakenly considers 'end' as part of the line range, and, seeing the diff range starting at 5 and the line range ending at 5, it doesn't skip the first range. Consequently, the diff range and the first line range are checked for overlap, and after that the outer loop runs out of diff ranges, and then the processing goes on in the false belief that this commit didn't touch any of the interesting line ranges. The line-log machinery later shifts the line ranges to account for any added/removed lines in the diff ranges preceeding each line range. This leaves the first line range intact, but attempts to shift the second line range [6, 8) by 5 lines towards the beginning of the file, resulting in [1, 3), triggering the assertion error, because the two overlapping line ranges violate the invariants. Fix that loop condition in diff_ranges_filter_touched() to not treat 'end' as part of the line range. - With the above fix the assertion error is gone... but, alas, we now get stuck in an endless loop! This happens in range_set_difference(), where a couple of nested loops iterate over the line and diff ranges, and a condition is supposed to break the middle loop when the current line range ends before the current diff range, so processing could continue with the next line range. For HEAD in the above example the first line range [2, 5) ends just before the diff range [5, 10) starts, so this condition should trigger and break the middle loop. Unfortunately, just like in the case of the assertion error, this conditions mistakenly considers 'end' as part of the line range, and, seeing the line range ending at 5 and the diff range starting at 5, it doesn't break the loop, which will then go on and on. Fix this condition in range_set_difference() to not treat 'end' as part of the line range. - With the above fix the endless loop is gone... but, alas, the output is now wrong, as it shows both line ranges for HEAD, even though the first line range is not modified by that commit: $ git log --oneline -L3,5:file -L7,8:file 73e4e2f (HEAD -> master) Add lines 6 7 8 9 10 diff --git a/file b/file --- a/file +++ b/file @@ -3,3 +3,3 @@ Line 3 Line 4 Line 5 @@ -6,0 +7,2 @@ +Line 7 +Line 8 66e3561 Add lines 1 2 3 4 5 diff --git a/file b/file --- /dev/null +++ b/file @@ -0,0 +3,3 @@ +Line 3 +Line 4 +Line 5 In dump_diff_hacky_one() a couple of nested loops are responsible for finding and printing the modified line ranges: the big outer loop iterates over all line ranges, and the first inner loop skips over the diff ranges that end before the start of the current line range. This is followed by a condition checking whether the current diff range starts after the end of the current line range, which, when fulfilled, continues and advances the outer loop to the next line range. For HEAD in the above example the first line range [2, 5) ends just before the diff range [5, 10), so this condition should trigger, and the outer loop should advance to the second line range. Unfortunately, just like in the previous cases, this condition mistakenly considers 'end' as part of the line range, and, seeing the first line range ending at 5 and the diff range starting at 5, it doesn't continue to advance the outher loop, but goes on to show the (unmodified) first line range. Fix this condition to not treat 'end' as part of the line range, just like in the previous cases. After all this the command in the above example finally finishes and produces the right output: $ git log --oneline -L3,5:file -L7,8:file 73e4e2f (HEAD -> master) Add lines 6 7 8 9 10 diff --git a/file b/file --- a/file +++ b/file @@ -6,0 +7,2 @@ +Line 7 +Line 8 66e3561 Add lines 1 2 3 4 5 diff --git a/file b/file --- /dev/null +++ b/file @@ -0,0 +3,3 @@ +Line 3 +Line 4 +Line 5 Add a canned test similar to the above example, with the line ranges adjusted to the test repository's history. Reported-by: Evgeni Chasnovski <evgeni.chasnovski@gmail.com> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-07-14bloom: rename function operates on bloom_keyLidong Yan1-2/+3
git code style requires that functions operating on a struct S should be named in the form S_verb. However, the functions operating on struct bloom_key do not follow this convention. Therefore, fill_bloom_key() and clear_bloom_key() are renamed to bloom_key_fill() and bloom_key_clear(), respectively. Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-06global: mark code units that generate warnings with `-Wsign-compare`Patrick Steinhardt1-0/+2
Mark code units that generate warnings with `-Wsign-compare`. This allows for a structured approach to get rid of all such warnings over time in a way that can be easily measured. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-21line-log: fix leak when rewriting commit parentsPatrick Steinhardt1-0/+1
In `process_ranges_merge_commit()` we try to figure out which of the parents can be blamed for the given line changes. When we figure out that none of the files in the line-log have changed we assign the complete blame to that commit and rewrite the parents of the current commit to only use that single parent. This is done via `commit_list_append()`, which is misleadingly _not_ appending to the list of parents. Instead, we overwrite the parents with the blamed parent. This makes us lose track of the old pointers, creating a memory leak. Fix this issue by freeing the parents before we overwrite them. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-10-10Merge branch 'jk/output-prefix-cleanup'Junio C Hamano1-12/+2
Code clean-up. * jk/output-prefix-cleanup: diff: store graph prefix buf in git_graph struct diff: return line_prefix directly when possible diff: return const char from output_prefix callback diff: drop line_prefix_length field line-log: use diff_line_prefix() instead of custom helper
2024-10-10Merge branch 'ps/leakfixes-part-8'Junio C Hamano1-27/+42
More leakfixes. * ps/leakfixes-part-8: (23 commits) builtin/send-pack: fix leaking list of push options remote: fix leaking push reports t/helper: fix leaks in proc-receive helper pack-write: fix return parameter of `write_rev_file_order()` revision: fix leaking saved parents revision: fix memory leaks when rewriting parents midx-write: fix leaking buffer pack-bitmap-write: fix leaking OID array pseudo-merge: fix leaking strmap keys pseudo-merge: fix various memory leaks line-log: fix several memory leaks diff: improve lifecycle management of diff queues builtin/revert: fix leaking `gpg_sign` and `strategy` config t/helper: fix leaking repository in partial-clone helper builtin/clone: fix leaking repo state when cloning with bundle URIs builtin/pack-redundant: fix various memory leaks builtin/stash: fix leaking `pathspec_from_file` submodule: fix leaking submodule entry list wt-status: fix leaking buffer with sparse directories shell: fix leaking strings ...
2024-10-03line-log: use diff_line_prefix() instead of custom helperJeff King1-12/+2
Our local output_prefix() is exactly the same as the public diff_line_prefix() function. Let's just use that one, saving us a little bit of code. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-10-03line-log: protect inner strbuf from freeDerrick Stolee1-6/+4
The output_prefix() method in line-log.c may call a function pointer via the diff_options struct. This function pointer returns a strbuf struct and then its buffer is passed back. However, that implies that the consumer is responsible to free the string. This is especially true because the default behavior is to duplicate the empty string. The existing functions used in the output_prefix pointer include: 1. idiff_prefix_cb() in diff-lib.c. This returns the data pointer, so the value exists across multiple calls. 2. diff_output_prefix_callback() in graph.c. This uses a static strbuf struct, so it reuses buffers across calls. These should not be freed. 3. output_prefix_cb() in range-diff.c. This is similar to the diff-lib.c case. In each case, we should not be freeing this buffer. We can convert the output_prefix() function to return a const char pointer and stop freeing the result. This choice is essentially the opposite of what was done in 394affd46d (line-log: always allocate the output prefix, 2024-06-07). This was discovered via 'valgrind' while investigating a public report of a bug in 'git log --graph -L' [1]. [1] https://github.com/git-for-windows/git/issues/5185 This issue would have been caught by the new test, when Git is compiled with ASan to catch these double frees. Co-authored-by: Jeff King <peff@peff.net> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-30line-log: fix several memory leaksPatrick Steinhardt1-19/+35
As described in "line-log.c" itself, the code is "leaking like a sieve". These leaks are all of rather trivial nature, so this commit plugs them without going too much into details for each of those leaks. The leaks are hit by t4211, but plugging them alone does not make the full test suite pass. The remaining leaks are unrelated to the line-log subsystem. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-09-30diff: improve lifecycle management of diff queuesPatrick Steinhardt1-8/+7
The lifecycle management of diff queues is somewhat confusing: - For most of the part this can be attributed to `DIFF_QUEUE_CLEAR()`, which does not release any memory but rather initializes the queue, only. This is in contrast to our common naming schema, where "clearing" means that we release underlying memory and then re-initialize the data structure such that it is ready to use. - A second offender is `diff_free_queue()`, which does not free the queue structure itself. It is rather a release-style function. Refactor the code to make things less confusing. `DIFF_QUEUE_CLEAR()` is replaced by `DIFF_QUEUE_INIT` and `diff_queue_init()`, while `diff_free_queue()` is replaced by `diff_queue_release()`. While on it, adapt callsites where we call `DIFF_QUEUE_CLEAR()` with the intent to release underlying memory to instead call `diff_queue_clear()` to fix memory leaks. This memory leak is exposed by t4211, but plugging it alone does not make the whole test suite pass. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-07line-log: always allocate the output prefixPatrick Steinhardt1-7/+11
The returned string by `output_prefix()` is sometimes a string constant and sometimes an allocated string. This has been fine until now because we always leak the allocated strings, and thus we never tried to free the string constant. Fix the code to always return an allocated string and free the returned value at all callsites. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-07line-log: stop assigning string constant to file parent bufferPatrick Steinhardt1-1/+3
Stop assigning a string constant to the file parent buffer and instead assign an allocated string. While the code is fine in practice, it will break once we compile with `-Wwrite-strings`. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-26treewide: remove unnecessary includes in source filesElijah Newren1-1/+0
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-26line-log.h: remove unnecessary includeElijah Newren1-0/+1
The unnecessary include in the header transitively pulled in some other headers actually needed by source files, though. Have those source files explicitly include the headers they need. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-26treewide: remove unnecessary includes in source filesElijah Newren1-2/+0
Each of these were checked with gcc -E -I. ${SOURCE_FILE} | grep ${HEADER_FILE} to ensure that removing the direct inclusion of the header actually resulted in that header no longer being included at all (i.e. that no other header pulled it in transitively). ...except for a few cases where we verified that although the header was brought in transitively, nothing from it was directly used in that source file. These cases were: * builtin/credential-cache.c * builtin/pull.c * builtin/send-pack.c Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-05revision: clear decoration structs during release_revisions()Jeff King1-0/+10
The point of release_revisions() is to free memory associated with the rev_info struct, but we have several "struct decoration" members that are left untouched. Since the previous commit introduced a function to do that, we can just call it. We do have to provide some specialized callbacks to map the void pointers onto real ones (the alternative would be casting the existing function pointers; this generally works because "void *" is usually interchangeable with a struct pointer, but it is technically forbidden by the standard). Since the line-log code does not expose the type it stores in the decoration (nor of course the function to free it), I put this behind a generic line_log_free() entry point. It's possible we may need to add more line-log specific bits anyway (running t4211 shows a number of other leaks in the line-log code). While this doubtless cleans up many leaks triggered by the test suite, the only script which becomes leak-free is t4217, as it does very little beyond a simple traversal (its existing leak was from the use of --children, which is now fixed). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-05git-compat-util: move alloc macros to git-compat-util.hCalvin Wan1-1/+0
alloc_nr, ALLOC_GROW, and ALLOC_GROW_BY are commonly used macros for dynamic array allocation. Moving these macros to git-compat-util.h with the other alloc macros focuses alloc.[ch] to allocation for Git objects and additionally allows us to remove inclusions to alloc.h from files that solely used the above macros. Signed-off-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21diff.h: remove unnecessary include of oidset.hElijah Newren1-0/+1
This also made it clear that several .c files depended upon various things that oidset included, but had omitted the direct #include for those headers. Add those now. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-24diff.h: reduce unnecessary includesElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21treewide: remove cache.h inclusion due to setup.h changesElijah Newren1-1/+0
By moving several declarations to setup.h, the previous patch made it possible to remove the include of cache.h in several source files. Do so. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21setup.h: move declarations for setup.c functions from cache.hElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-17Merge branch 'jk/unused-post-2.39-part2'Junio C Hamano1-1/+2
More work towards -Wunused. * jk/unused-post-2.39-part2: (21 commits) help: mark unused parameter in git_unknown_cmd_config() run_processes_parallel: mark unused callback parameters userformat_want_item(): mark unused parameter for_each_commit_graft(): mark unused callback parameter rewrite_parents(): mark unused callback parameter fetch-pack: mark unused parameter in callback function notes: mark unused callback parameters prio-queue: mark unused parameters in comparison functions for_each_object: mark unused callback parameters list-objects: mark unused callback parameters mark unused parameters in signal handlers run-command: mark error routine parameters as unused mark "pointless" data pointers in callbacks ref-filter: mark unused callback parameters http-backend: mark unused parameters in virtual functions http-backend: mark argc/argv unused object-name: mark unused parameters in disambiguate callbacks serve: mark unused parameters in virtual functions serve: use repository pointer to get config ls-refs: drop config caching ...
2023-02-24rewrite_parents(): mark unused callback parameterJeff King1-1/+2
The rewrite_parents() function takes a callback, but not every callback needs the "rev" parameter. Mark the unused one so -Wunused-parameter will be happy. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23cache.h: remove dependence on hex.h; make other files include it explicitlyElijah Newren1-0/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23alloc.h: move ALLOC_GROW() functions from cache.hElijah Newren1-0/+1
This allows us to replace includes of cache.h with includes of the much smaller alloc.h in many places. It does mean that we also need to add includes of alloc.h in a number of C files. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-11-02line-log: free the diff queues' arrays when processing merge commitsSZEDER Gábor1-4/+2
When processing merge commits, the line-level log first creates an array of diff queues, each comparing the merge commit with one of its parents, to check whether any of the files in the given line ranges were modified. Alas, when freeing these queues it only frees the filepairs in the queues, but not the queues' internal arrays holding pointers to those filepairs. Use the diff_free_queue() helper function introduced in the previous commit to free the diff queues' internal arrays as well. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Taylor Blau <me@ttaylorr.com>
2022-11-02line-log: free diff queue when processing non-merge commitsSZEDER Gábor1-0/+1
When processing a non-merge commit, the line-level log first asks the tree-diff machinery whether any of the files in the given line ranges were modified between the current commit and its parent, and if some of them were, then it loads the contents of those files from both commits to see whether their line ranges were modified and/or need to be adjusted. Alas, it doesn't free() the diff queue holding the results of that query and the contents of those files once its done. This can add up to a substantial amount of leaked memory, especially when the file in question is big and is frequently modified: a user reported "Out of memory, malloc failed" errors with a 2MB text file that was modified ~2800 times [1] (I estimate the leak would use up almost 11GB memory in that case). Free that diff queue to plug this memory leak. However, instead of simply open-coding the necessary three lines, add them as a helper function to the diff API, because it will be useful elsewhere as well. [1] https://public-inbox.org/git/CAFOPqVXz2XwzX8vGU7wLuqb2ZuwTuOFAzBLRM_QPk+NJa=eC-g@mail.gmail.com/ Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Taylor Blau <me@ttaylorr.com>
2021-03-13use CALLOC_ARRAYRené Scharfe1-1/+1
Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-12line-log: handle deref_tag() returning NULLRené Scharfe1-1/+1
Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-29Merge branch 'tb/bloom-improvements'Junio C Hamano1-1/+1
"git commit-graph write" learned to limit the number of bloom filters that are computed from scratch with the --max-new-filters option. * tb/bloom-improvements: commit-graph: introduce 'commitGraph.maxNewFilters' builtin/commit-graph.c: introduce '--max-new-filters=<n>' commit-graph: rename 'split_commit_graph_opts' bloom: encode out-of-bounds filters as non-empty bloom/diff: properly short-circuit on max_changes bloom: use provided 'struct bloom_filter_settings' bloom: split 'get_bloom_filter()' in two commit-graph.c: store maximum changed paths commit-graph: respect 'commitGraph.readChangedPaths' t/helper/test-read-graph.c: prepare repo settings commit-graph: pass a 'struct repository *' in more places t4216: use an '&&'-chain commit-graph: introduce 'get_bloom_filter_settings()'
2020-09-17bloom: split 'get_bloom_filter()' in twoTaylor Blau1-1/+1
'get_bloom_filter' takes a flag to control whether it will compute a Bloom filter if the requested one is missing. In the next patch, we'll add yet another parameter to this method, which would force all but one caller to specify an extra 'NULL' parameter at the end. Instead of doing this, split 'get_bloom_filter' into two functions: 'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only looks up a Bloom filter (and does not compute one if it's missing, thus dropping the 'compute_if_not_present' flag). The latter does compute missing Bloom filters, with an additional parameter to store whether or not it needed to do so. This simplifies many call-sites, since the majority of existing callers to 'get_bloom_filter' do not want missing Bloom filters to be computed (so they can drop the parameter entirely and use the simpler version of the function). While we're at it, instrument the new 'get_or_compute_bloom_filter()' with counters in the 'write_commit_graph_context' struct which store the number of filters that we did and didn't compute, as well as filters that were truncated. It would be nice to drop the 'compute_if_not_present' flag entirely, since all remaining callers of 'get_or_compute_bloom_filter' pass it as '1', but this will change in a future patch and hence cannot be removed. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-28strvec: convert more callers away from argv_array nameJeff King1-3/+3
We eventually want to drop the argv_array name and just use strvec consistently. There's no particular reason we have to do it all at once, or care about interactions between converted and unconverted bits. Because of our preprocessor compat layer, the names are interchangeable to the compiler (so even a definition and declaration using different names is OK). This patch converts remaining files from the first half of the alphabet, to keep the diff to a manageable size. The conversion was done purely mechanically with: git ls-files '*.c' '*.h' | xargs perl -i -pe ' s/ARGV_ARRAY/STRVEC/g; s/argv_array/strvec/g; ' and then selectively staging files with "git add '[abcdefghjkl]*'". We'll deal with any indentation/style fallouts separately. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-28strvec: rename files from argv-array to strvecJeff King1-1/+1
This requires updating #include lines across the code-base, but that's all fairly mechanical, and was done with: git ls-files '*.c' '*.h' | xargs perl -i -pe 's/argv-array.h/strvec.h/' Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08Merge branch 'ds/line-log-on-bloom'Junio C Hamano1-3/+40
"git log -L..." now takes advantage of the "which paths are touched by this commit?" info stored in the commit-graph system. * ds/line-log-on-bloom: line-log: integrate with changed-path Bloom filters line-log: try to use generation number-based topo-ordering line-log: more responsive, incremental 'git log -L' t4211-line-log: add tests for parent oids line-log: remove unused fields from 'struct line_log_data'
2020-05-11line-log: integrate with changed-path Bloom filtersDerrick Stolee1-1/+38
The previous changes to the line-log machinery focused on making the first result appear faster. This was achieved by no longer walking the entire commit history before returning the early results. There is still another way to improve the performance: walk most commits much faster. Let's use the changed-path Bloom filters to reduce time spent computing diffs. Since the line-log computation requires opening blobs and checking the content-diff, there is still a lot of necessary computation that cannot be replaced with changed-path Bloom filters. The part that we can reduce is most effective when checking the history of a file that is deep in several directories and those directories are modified frequently. In this case, the computation to check if a commit is TREESAME to its first parent takes a large fraction of the time. That is ripe for improvement with changed-path Bloom filters. We must ensure that prepare_to_use_bloom_filters() is called in revision.c so that the bloom_filter_settings are loaded into the struct rev_info from the commit-graph. Of course, some cases are still forbidden, but in the line-log case the pathspec is provided in a different way than normal. Since multiple paths and segments could be requested, we compute the struct bloom_key data dynamically during the commit walk. This could likely be improved, but adds code complexity that is not valuable at this time. There are two cases to care about: merge commits and "ordinary" commits. Merge commits have multiple parents, but if we are TREESAME to our first parent in every range, then pass the blame for all ranges to the first parent. Ordinary commits have the same condition, but each is done slightly differently in the process_ranges_[merge|ordinary]_commit() methods. By checking if the changed-path Bloom filter can guarantee TREESAME, we can avoid that tree-diff cost. If the filter says "probably changed", then we need to run the tree-diff and then the blob-diff if there was a real edit. The Linux kernel repository is a good testing ground for the performance improvements claimed here. There are two different cases to test. The first is the "entire history" case, where we output the entire history to /dev/null to see how long it would take to compute the full line-log history. The second is the "first result" case, where we find how long it takes to show the first value, which is an indicator of how quickly a user would see responses when waiting at a terminal. To test, I selected the paths that were changed most frequently in the top 10,000 commits using this command (stolen from StackOverflow [1]): git log --pretty=format: --name-only -n 10000 | sort | \ uniq -c | sort -rg | head -10 which results in 121 MAINTAINERS 63 fs/namei.c 60 arch/x86/kvm/cpuid.c 59 fs/io_uring.c 58 arch/x86/kvm/vmx/vmx.c 51 arch/x86/kvm/x86.c 45 arch/x86/kvm/svm.c 42 fs/btrfs/disk-io.c 42 Documentation/scsi/index.rst (along with a bogus first result). It appears that the path arch/x86/kvm/svm.c was renamed, so we ignore that entry. This leaves the following results for the real command time: | | Entire History | First Result | | Path | Before | After | Before | After | |------------------------------|--------|--------|--------|--------| | MAINTAINERS | 4.26 s | 3.87 s | 0.41 s | 0.39 s | | fs/namei.c | 1.99 s | 0.99 s | 0.42 s | 0.21 s | | arch/x86/kvm/cpuid.c | 5.28 s | 1.12 s | 0.16 s | 0.09 s | | fs/io_uring.c | 4.34 s | 0.99 s | 0.94 s | 0.27 s | | arch/x86/kvm/vmx/vmx.c | 5.01 s | 1.34 s | 0.21 s | 0.12 s | | arch/x86/kvm/x86.c | 2.24 s | 1.18 s | 0.21 s | 0.14 s | | fs/btrfs/disk-io.c | 1.82 s | 1.01 s | 0.06 s | 0.05 s | | Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s | It is worth noting that the least speedup comes for the MAINTAINERS file which is * edited frequently, * low in the directory heirarchy, and * quite a large file. All of those points lead to spending more time doing the blob diff and less time doing the tree diff. Still, we see some improvement in that case and significant improvement in other cases. A 2-4x speedup is likely the more typical case as opposed to the small 5% change for that file. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-11line-log: more responsive, incremental 'git log -L'SZEDER Gábor1-2/+2
The current line-level log implementation performs a preprocessing step in prepare_revision_walk(), during which the line_log_filter() function filters and rewrites history to keep only commits modifying the given line range. This preprocessing affects both responsiveness and correctness: - Git doesn't produce any output during this preprocessing step. Checking whether a commit modified the given line range is somewhat expensive, so depending on the size of the given revision range this preprocessing can result in a significant delay before the first commit is shown. - Limiting the number of displayed commits (e.g. 'git log -3 -L...') doesn't limit the amount of work during preprocessing, because that limit is applied during history traversal. Alas, by that point this expensive preprocessing step has already churned through the whole revision range to find all commits modifying the revision range, even though only a few of them need to be shown. - It rewrites parents, with no way to turn it off. Without the user explicitly requesting parent rewriting any parent object ID shown should be that of the immediate parent, just like in case of a pathspec-limited history traversal without parent rewriting. However, after that preprocessing step rewrote history, the subsequent "regular" history traversal (i.e. get_revision() in a loop) only sees commits modifying the given line range. Consequently, it can only show the object ID of the last ancestor that modified the given line range (which might happen to be the immediate parent, but many-many times it isn't). This patch addresses both the correctness and, at least for the common case, the responsiveness issues by integrating line-level log filtering into the regular revision walking machinery: - Make process_ranges_arbitrary_commit(), the static function in 'line-log.c' deciding whether a commit modifies the given line range, public by removing the static keyword and adding the 'line_log_' prefix, so it can be called from other parts of the revision walking machinery. - If the user didn't explicitly ask for parent rewriting (which, I believe, is the most common case): - Call this now-public function during regular history traversal, namely from get_commit_action() to ignore any commits not modifying the given line range. Note that while this check is relatively expensive, it must be performed before other, much cheaper conditions, because the tracked line range must be adjusted even when the commit will end up being ignored by other conditions. - Skip the line_log_filter() call, i.e. the expensive preprocessing step, in prepare_revision_walk(), because, thanks to the above points, the revision walking machinery is now able to filter out commits not modifying the given line range while traversing history. This way the regular history traversal sees the unmodified history, and is therefore able to print the object ids of the immediate parents of the listed commits. The eliminated preprocessing step can greatly reduce the delay before the first commit is shown, see the numbers below. - However, if the user did explicitly ask for parent rewriting via '--parents' or a similar option, then stick with the current implementation for now, i.e. perform that expensive filtering and history rewriting in the preprocessing step just like we did before, leaving the initial delay as long as it was. I tried to integrate line-level log filtering with parent rewriting into the regular history traversal, but, unfortunately, several subtleties resisted... :) Maybe someday we'll figure out how to do that, but until then at least the simple and common (i.e. without parent rewriting) 'git log -L:func:file' commands can benefit from the reduced delay. This change makes the failing 'parent oids without parent rewriting' test in 't4211-line-log.sh' succeed. The reduced delay is most noticable when there's a commit modifying the line range near the tip of a large-ish revision range: # no parent rewriting requested, no commit-graph present $ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0 Before: real 0m9.570s user 0m9.494s sys 0m0.076s After: real 0m0.718s user 0m0.674s sys 0m0.044s A significant part of the remaining delay is spent reading and parsing commit objects in limit_list(). With the help of the commit-graph we can eliminate most of that reading and parsing overhead, so here are the timing results of the same command as above, but this time using the commit-graph: Before: real 0m8.874s user 0m8.816s sys 0m0.057s After: real 0m0.107s user 0m0.091s sys 0m0.013s The next patch will further reduce the remaining delay. To be clear: this patch doesn't actually optimize the line-level log, but merely moves most of the work from the preprocessing step to the history traversal, so the commits modifying the line range can be shown as soon as they are processed, and the traversal can be terminated as soon as the given number of commits are shown. Consequently, listing the full history of a line range, potentially all the way to the root commit, will take the same time as before (but at least the user might start reading the output earlier). Furthermore, if the most recent commit modifying the line range is far away from the starting revision, then that initial delay will still be significant. Additional testing by Derrick Stolee: In the Linux kernel repository, the MAINTAINERS file was changed ~3,500 times across the ~915,000 commits. In addition to that edit frequency, the file itself is quite large (~18,700 lines). This means that a significant portion of the computation is taken up by computing the patch-diff of the file. This patch improves the real time it takes to output the first result quite a bit: Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null Before: 3.88 s After: 0.71 s If we drop the "-n 1" in the command, then there is no change in end-to-end process time. This is because the command still needs to walk the entire commit history, which negates the point of this patch. This is expected. As a note for future reference, the ~4.3 seconds in the old code spends ~2.6 seconds computing the patch-diffs, and the rest of the time is spent walking commits and computing diffs for which paths changed at each commit. The changed-path Bloom filters could improve the end-to-end computation time (i.e. no "-n 1" in the command). Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-07diff: make diff_populate_filespec_options structJonathan Tan1-3/+3
The behavior of diff_populate_filespec() currently can be customized through a bitflag, but a subsequent patch requires it to support a non-boolean option. Replace the bitflag with an options struct. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-18Merge branch 'sg/line-log-tree-diff-optim'Junio C Hamano1-19/+52
Optimize unnecessary full-tree diff away from "git log -L" machinery. * sg/line-log-tree-diff-optim: line-log: avoid unnecessary full tree diffs line-log: extract pathspec parsing from line ranges into a helper function
2019-08-21line-log: avoid unnecessary full tree diffsSZEDER Gábor1-7/+36
With rename detection enabled the line-level log is able to trace the evolution of line ranges across whole-file renames [1]. Alas, to achieve that it uses the diff machinery very inefficiently, making the operation very slow [2]. And since rename detection is enabled by default, the line-level log is very slow by default. When the line-level log processes a commit with rename detection enabled, it currently does the following (see queue_diffs()): 1. Computes a full tree diff between the commit and (one of) its parent(s), i.e. invokes diff_tree_oid() with an empty 'diffopt->pathspec'. 2. Checks whether any paths in the line ranges were modified. 3. Checks whether any modified paths in the line ranges are missing in the parent commit's tree. 4. If there is such a missing path, then calls diffcore_std() to figure out whether the path was indeed renamed based on the previously computed full tree diff. 5. Continues doing stuff that are unrelated to the slowness. So basically the line-level log computes a full tree diff for each commit-parent pair in step (1) to be used for rename detection in step (4) in the off chance that an interesting path is missing from the parent. Avoid these expensive and mostly unnecessary full tree diffs by limiting the diffs to paths in the line ranges. This is much cheaper, and makes step (2) unnecessary. If it turns out that an interesting path is missing from the parent, then fall back and compute a full tree diff, so the rename detection will still work. Care must be taken when to update the pathspec used to limit the diff in case of renames. A path might be renamed on one branch and modified on several parallel running branches, and while processing commits on these branches the line-level log might have to alternate between looking at a path's new and old name. However, at any one time there is only a single 'diffopt->pathspec'. So add a step (0) to the above to ensure that the paths in the pathspec match the paths in the line ranges associated with the currently processed commit, and re-parse the pathspec from the paths in the line ranges if they differ. The new test cases include a specially crafted piece of history with two merged branches and two files, where each branch modifies both files, renames on of them, and then modifies both again. Then two separate 'git log -L' invocations check the line-level log of each of those two files, which ensures that at least one of those invocations have to do that back-and-forth between the file's old and new name (no matter which branch is traversed first). 't/t4211-line-log.sh' already contains two tests involving renames, they don't don't trigger this back-and-forth. Avoiding these unnecessary full tree diffs can have huge impact on performance, especially in big repositories with big trees and mergy history. Tracing the evolution of a function through the whole history: # git.git $ time git --no-pager log -L:read_alternate_refs:sha1-file.c v2.23.0 Before: real 0m8.874s user 0m8.816s sys 0m0.057s After: real 0m2.516s user 0m2.456s sys 0m0.060s # linux.git $ time ~/src/git/git --no-pager log \ -L:build_restore_work_registers:arch/mips/mm/tlbex.c v5.2 Before: real 3m50.033s user 3m48.041s sys 0m0.300s After: real 0m2.599s user 0m2.466s sys 0m0.157s That's just over 88x speedup. [1] Line-level log's rename following is quite similar to 'git log --follow path', with the notable differences that it does handle multiple paths at once as well, and that it doesn't show the commit performing the rename if it's an exact rename. [2] This slowness might not have been apparent initially, because back when the line-level log feature was introduced rename detection was not yet enabled by default; 12da1d1f6f (Implement line-history search (git log -L), 2013-03-28) and 5404c116aa (diff: activate diff.renames by default, 2016-02-25). Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-08-21line-log: extract pathspec parsing from line ranges into a helper functionSZEDER Gábor1-14/+18
A helper function to parse the paths involved in the line ranges and to turn them into a pathspec will be useful in the next patch. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-27tree-walk.c: remove the_repo from get_tree_entry()Nguyễn Thái Ngọc Duy1-3/+4
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-09Merge branch 'en/merge-directory-renames'Junio C Hamano1-1/+1
"git merge-recursive" backend recently learned a new heuristics to infer file movement based on how other files in the same directory moved. As this is inherently less robust heuristics than the one based on the content similarity of the file itself (rather than based on what its neighbours are doing), it sometimes gives an outcome unexpected by the end users. This has been toned down to leave the renamed paths in higher/conflicted stages in the index so that the user can examine and confirm the result. * en/merge-directory-renames: merge-recursive: switch directory rename detection default merge-recursive: give callers of handle_content_merge() access to contents merge-recursive: track information associated with directory renames t6043: fix copied test description to match its purpose merge-recursive: switch from (oid,mode) pairs to a diff_filespec merge-recursive: cleanup handle_rename_* function signatures merge-recursive: track branch where rename occurred in rename struct merge-recursive: remove ren[12]_other fields from rename_conflict_info merge-recursive: shrink rename_conflict_info merge-recursive: move some struct declarations together merge-recursive: use 'ci' for rename_conflict_info variable name merge-recursive: rename locals 'o' and 'a' to 'obuf' and 'abuf' merge-recursive: rename diff_filespec 'one' to 'o' merge-recursive: rename merge_options argument from 'o' to 'opt' Use 'unsigned short' for mode, like diff_filespec does
2019-04-08Use 'unsigned short' for mode, like diff_filespec doesElijah Newren1-1/+1
struct diff_filespec defines mode to be an 'unsigned short'. Several other places in the API which we'd like to interact with using a diff_filespec used a plain unsigned (or unsigned int). This caused problems when taking addresses, so switch to unsigned short. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-03-08line-log: suppress diff output with "-s"Jeff King1-2/+4
When "-L" is in use, we ignore any diff output format that the user provides to us, and just always print a patch (with extra context lines covering the whole area of interest). It's not entirely clear what we should do with all formats (e.g., should "--stat" show just the diffstat of the touched lines, or the stat for the whole file?). But "-s" is pretty clear: the user probably wants to see just the commits that touched those lines, without any diff at all. Let's at least make that work. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>