git/tree-diff.c, branch v2.32.2

hash: provide per-algorithm null OIDs

2021-04-27T07:31:39Z

Up until recently, object IDs did not have an algorithm member, only a hash. Consequently, it was possible to share one null (all-zeros) object ID among all hash algorithms. Now that we're going to be handling objects from multiple hash algorithms, it's important to make sure that all object IDs have a correct algorithm field. Introduce a per-algorithm null OID, and add it to struct hash_algo. Introduce a wrapper function as well, and use it everywhere we used to use the null_oid constant. Signed-off-by: brian m. carlson Signed-off-by: Junio C Hamano

bloom/diff: properly short-circuit on max_changes

2020-09-17T16:31:25Z

Commit e3696980 (diff: halt tree-diff early after max_changes, 2020-03-30) intended to create a mechanism to short-circuit a diff calculation after a certain number of paths were modified. By incrementing a "num_changes" counter throughout the recursive ll_diff_tree_paths(), this was supposed to match the number of changes that would be written into the changed-path Bloom filters. Unfortunately, this was not implemented correctly and instead misses simple cases like file modifications. This then does not stop very large changed-path filters from being written (unless they add or remove many files). To start, change the implementation in ll_diff_tree_paths() to instead use the global diff_queue_diff struct's 'nr' member as the count. This is a way to simplify the logic instead of making more mistakes in the complicated diff code. This has a drawback: the diff_queue_diff struct only lists the paths corresponding to blob changes, not their leading directories. Thus, get_or_compute_bloom_filter() needs an additional check to see if the hashmap with the leading directories becomes too large. One reason why this was not caught by test cases was that the test in t4216-log-bloom.sh that was supposed to check this "too many changes" condition only checked this on the initial commit of a repository. The old logic counted these values correctly. Update this test in a few ways: 1. Use GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS to reduce the limit, allowing smaller commits to engage with this logic. 2. Create several interesting cases of edits, adds, removes, and mode changes (in the second commit). By testing both sides of the inequality with the *_MAX_CHANGED_PATHS variable, we can see that the count is exactly correct, so none of these changes are missed or over-counted. 3. Use the trace2 data value filter_found_large to verify that these commits are on the correct side of the limit. Another way to verify the behavior is correct is through performance tests. By testing on my local copies of the Git repository and the Linux kernel repository, I could measure the effect of these short-circuits when computing a fresh commit-graph file with changed-path Bloom filters using the command GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=N time \ git commit-graph write --reachable --changed-paths and reporting the wall time and resulting commit-graph size. For Git, the results are | | N=1 | N=10 | N=512 | |--------|----------------|----------------|----------------| | HEAD~1 | 10.90s 9.18MB | 11.11s 9.34MB | 11.31s 9.35MB | | HEAD | 9.21s 8.62MB | 11.11s 9.29MB | 11.29s 9.34MB | For Linux, the results are | | N=1 | N=20 | N=512 | |--------|----------------|---------------|---------------| | HEAD~1 | 61.28s 64.3MB | 76.9s 72.6MB | 77.6s 72.6MB | | HEAD | 49.44s 56.3MB | 68.7s 65.9MB | 69.2s 65.9MB | Naturally, the improvement becomes much less as the limit grows, as fewer commits satisfy the short-circuit. Reported-by: SZEDER Gábor Signed-off-by: Derrick Stolee Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano

diff.h: drop diff_tree_oid() & friends' return value

2020-06-08T19:28:49Z

ll_diff_tree_oid() has only ever returned 0 [1], so it's return value is basically useless. It's only caller diff_tree_oid() has only ever returned the return value of ll_diff_tree_oid() as-is [2], so its return value is just as useless. Most of diff_tree_oid()'s callers simply ignore its return value, except: - diff_root_tree_oid() is a thin wrapper around diff_tree_oid() and returns with its return value, but all of diff_root_tree_oid()'s callers ignore its return value. - rev_compare_tree() and rev_same_tree_as_empty() do look at the return value in a condition, but, since the return value is always 0, the former's < 0 condition is never fulfilled, while the latter's >= 0 condition is always fulfilled. So let's drop the return value of ll_diff_tree_oid(), diff_tree_oid() and diff_root_tree_oid(), and drop those conditions from rev_compare_tree() and rev_same_tree_as_empty() as well. [1] ll_diff_tree_oid() and its ancestors have been returning only 0 ever since it was introduced as diff_tree() in 9174026cfe (Add "diff-tree" program to show which files have changed between two trees., 2005-04-09). [2] diff_tree_oid() traces back to diff-tree.c:main() in 9174026cfe as well. Signed-off-by: SZEDER Gábor Signed-off-by: Derrick Stolee Signed-off-by: Junio C Hamano

diff: halt tree-diff early after max_changes

2020-03-30T16:59:53Z

When computing the changed-paths bloom filters for the commit-graph, we limit the size of the filter by restricting the number of paths in the diff. Instead of computing a large diff and then ignoring the result, it is better to halt the diff computation early. Create a new "max_changes" option in struct diff_options. If non-zero, then halt the diff computation after discovering strictly more changed paths. This includes paths corresponding to trees that change. Use this max_changes option in the bloom filter calculations. This reduces the time taken to compute the filters for the Linux kernel repo from 2m50s to 2m35s. On a large internal repository with ~500 commits that perform tree-wide changes, the time reduced from 6m15s to 3m48s. Signed-off-by: Derrick Stolee Signed-off-by: Garima Singh Signed-off-by: Junio C Hamano

tree-walk.c: remove the_repo from fill_tree_descriptor()

2019-06-27T19:45:17Z

While at there, clean up the_repo usage in builtin/merge-tree.c a tiny bit. Signed-off-by: Nguyễn Thái Ngọc Duy Signed-off-by: Junio C Hamano

Use 'unsigned short' for mode, like diff_filespec does

2019-04-08T07:02:07Z

struct diff_filespec defines mode to be an 'unsigned short'. Several other places in the API which we'd like to interact with using a diff_filespec used a plain unsigned (or unsigned int). This caused problems when taking addresses, so switch to unsigned short. Signed-off-by: Elijah Newren Signed-off-by: Junio C Hamano

Merge branch 'bc/tree-walk-oid'

2019-01-29T20:47:56Z

The code to walk tree objects has been taught that we may be working with object names that are not computed with SHA-1. * bc/tree-walk-oid: cache: make oidcpy always copy GIT_MAX_RAWSZ bytes tree-walk: store object_id in a separate member match-trees: use hashcpy to splice trees match-trees: compute buffer offset correctly when splicing tree-walk: copy object ID before use

tree-walk: store object_id in a separate member

2019-01-15T17:57:41Z

When parsing a tree, we read the object ID directly out of the tree buffer. This is normally fine, but such an object ID cannot be used with oidcpy, which copies GIT_MAX_RAWSZ bytes, because if we are using SHA-1, there may not be that many bytes to copy. Instead, store the object ID in a separate struct member. Since we can no longer efficiently compute the path length, store that information as well in struct name_entry. Ensure we only copy the object ID into the new buffer if the path length is nonzero, as some callers will pass us an empty path with no object ID following it, and we will not want to read past the end of the buffer. Signed-off-by: brian m. carlson Signed-off-by: Junio C Hamano

tree-walk.c: make tree_entry_interesting() take an index

2018-11-19T01:50:33Z

In order to support :(attr) when matching pathspec on a tree, tree_entry_interesting() needs to take an index (because git_check_attr() needs it). This is the preparation step for it. This also makes it clearer what index we fall back to when looking up attributes during an unpack-trees operation: the source index. This also fixes revs->pruning.repo initialization that should have been done in 2abf350385 (revision.c: remove implicit dependency on the_index - 2018-09-21). Without it, skip_uninteresting() will dereference a NULL pointer through this call chain get_revision(revs) get_revision_internal get_revision_1 try_to_simplify_commit rev_compare_tree diff_tree_oid(..., &revs->pruning) ll_diff_tree_oid diff_tree_paths ll_diff_tree skip_uninteresting Signed-off-by: Nguyễn Thái Ngọc Duy Signed-off-by: Junio C Hamano

Merge branch 'nd/the-index'

2018-10-19T04:34:02Z

Various codepaths in the core-ish part learn to work on an arbitrary in-core index structure, not necessarily the default instance "the_index". * nd/the-index: (23 commits) revision.c: reduce implicit dependency the_repository revision.c: remove implicit dependency on the_index ws.c: remove implicit dependency on the_index tree-diff.c: remove implicit dependency on the_index submodule.c: remove implicit dependency on the_index line-range.c: remove implicit dependency on the_index userdiff.c: remove implicit dependency on the_index rerere.c: remove implicit dependency on the_index sha1-file.c: remove implicit dependency on the_index patch-ids.c: remove implicit dependency on the_index merge.c: remove implicit dependency on the_index merge-blobs.c: remove implicit dependency on the_index ll-merge.c: remove implicit dependency on the_index diff-lib.c: remove implicit dependency on the_index read-cache.c: remove implicit dependency on the_index diff.c: remove implicit dependency on the_index grep.c: remove implicit dependency on the_index diff.c: remove the_index dependency in textconv() functions blame.c: rename "repo" argument to "r" combine-diff.c: remove implicit dependency on the_index ...