<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/tree-diff.c, branch v2.32.2</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=v2.32.2</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.32.2'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2021-04-27T07:31:39Z</updated>
<entry>
<title>hash: provide per-algorithm null OIDs</title>
<updated>2021-04-27T07:31:39Z</updated>
<author>
<name>brian m. carlson</name>
<email>sandals@crustytoothpaste.net</email>
</author>
<published>2021-04-26T01:02:56Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=14228447c9ce664a4e9c31ba10344ec5e4ea4ba5'/>
<id>urn:sha1:14228447c9ce664a4e9c31ba10344ec5e4ea4ba5</id>
<content type='text'>
Up until recently, object IDs did not have an algorithm member, only a
hash.  Consequently, it was possible to share one null (all-zeros)
object ID among all hash algorithms.  Now that we're going to be
handling objects from multiple hash algorithms, it's important to make
sure that all object IDs have a correct algorithm field.

Introduce a per-algorithm null OID, and add it to struct hash_algo.
Introduce a wrapper function as well, and use it everywhere we used to
use the null_oid constant.

Signed-off-by: brian m. carlson &lt;sandals@crustytoothpaste.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>bloom/diff: properly short-circuit on max_changes</title>
<updated>2020-09-17T16:31:25Z</updated>
<author>
<name>Derrick Stolee</name>
<email>dstolee@microsoft.com</email>
</author>
<published>2020-09-16T18:07:52Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=b16a8277644e4b1d21c08d97de757105039dc7ae'/>
<id>urn:sha1:b16a8277644e4b1d21c08d97de757105039dc7ae</id>
<content type='text'>
Commit e3696980 (diff: halt tree-diff early after max_changes,
2020-03-30) intended to create a mechanism to short-circuit a diff
calculation after a certain number of paths were modified. By
incrementing a "num_changes" counter throughout the recursive
ll_diff_tree_paths(), this was supposed to match the number of changes
that would be written into the changed-path Bloom filters.
Unfortunately, this was not implemented correctly and instead misses
simple cases like file modifications. This then does not stop very
large changed-path filters from being written (unless they add or remove
many files).

To start, change the implementation in ll_diff_tree_paths() to instead
use the global diff_queue_diff struct's 'nr' member as the count. This
is a way to simplify the logic instead of making more mistakes in the
complicated diff code.

This has a drawback: the diff_queue_diff struct only lists the paths
corresponding to blob changes, not their leading directories. Thus,
get_or_compute_bloom_filter() needs an additional check to see if the
hashmap with the leading directories becomes too large.

One reason why this was not caught by test cases was that the test in
t4216-log-bloom.sh that was supposed to check this "too many changes"
condition only checked this on the initial commit of a repository. The
old logic counted these values correctly. Update this test in a few
ways:

1. Use GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS to reduce the limit,
   allowing smaller commits to engage with this logic.

2. Create several interesting cases of edits, adds, removes, and mode
   changes (in the second commit). By testing both sides of the
   inequality with the *_MAX_CHANGED_PATHS variable, we can see that
   the count is exactly correct, so none of these changes are missed
   or over-counted.

3. Use the trace2 data value filter_found_large to verify that these
   commits are on the correct side of the limit.

Another way to verify the behavior is correct is through performance
tests. By testing on my local copies of the Git repository and the Linux
kernel repository, I could measure the effect of these short-circuits
when computing a fresh commit-graph file with changed-path Bloom filters
using the command

  GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=N time \
    git commit-graph write --reachable --changed-paths

and reporting the wall time and resulting commit-graph size.

For Git, the results are

|        |      N=1       |       N=10     |      N=512     |
|--------|----------------|----------------|----------------|
| HEAD~1 | 10.90s  9.18MB | 11.11s  9.34MB | 11.31s  9.35MB |
| HEAD   |  9.21s  8.62MB | 11.11s  9.29MB | 11.29s  9.34MB |

For Linux, the results are

|        |       N=1      |     N=20      |     N=512     |
|--------|----------------|---------------|---------------|
| HEAD~1 | 61.28s  64.3MB | 76.9s  72.6MB | 77.6s  72.6MB |
| HEAD   | 49.44s  56.3MB | 68.7s  65.9MB | 69.2s  65.9MB |

Naturally, the improvement becomes much less as the limit grows, as
fewer commits satisfy the short-circuit.

Reported-by: SZEDER Gábor &lt;szeder.dev@gmail.com&gt;
Signed-off-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>diff.h: drop diff_tree_oid() &amp; friends' return value</title>
<updated>2020-06-08T19:28:49Z</updated>
<author>
<name>SZEDER Gábor</name>
<email>szeder.dev@gmail.com</email>
</author>
<published>2020-06-05T13:00:27Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=0ee3cb888dc4da900b2e701ef8106205d8963ffb'/>
<id>urn:sha1:0ee3cb888dc4da900b2e701ef8106205d8963ffb</id>
<content type='text'>
ll_diff_tree_oid() has only ever returned 0 [1], so it's return value
is basically useless.  It's only caller diff_tree_oid() has only ever
returned the return value of ll_diff_tree_oid() as-is [2], so its
return value is just as useless.  Most of diff_tree_oid()'s callers
simply ignore its return value, except:

  - diff_root_tree_oid() is a thin wrapper around diff_tree_oid() and
    returns with its return value, but all of diff_root_tree_oid()'s
    callers ignore its return value.

  - rev_compare_tree() and rev_same_tree_as_empty() do look at the
    return value in a condition, but, since the return value is always
    0, the former's &lt; 0 condition is never fulfilled, while the
    latter's &gt;= 0 condition is always fulfilled.

So let's drop the return value of ll_diff_tree_oid(), diff_tree_oid()
and diff_root_tree_oid(), and drop those conditions from
rev_compare_tree() and rev_same_tree_as_empty() as well.

[1] ll_diff_tree_oid() and its ancestors have been returning only 0
    ever since it was introduced as diff_tree() in 9174026cfe (Add
    "diff-tree" program to show which files have changed between two
    trees., 2005-04-09).
[2] diff_tree_oid() traces back to diff-tree.c:main() in 9174026cfe as
    well.

Signed-off-by: SZEDER Gábor &lt;szeder.dev@gmail.com&gt;
Signed-off-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>diff: halt tree-diff early after max_changes</title>
<updated>2020-03-30T16:59:53Z</updated>
<author>
<name>Derrick Stolee</name>
<email>dstolee@microsoft.com</email>
</author>
<published>2020-03-30T00:31:27Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=e3696980163bdbd3bc56e5ffc69e8770015f366f'/>
<id>urn:sha1:e3696980163bdbd3bc56e5ffc69e8770015f366f</id>
<content type='text'>
When computing the changed-paths bloom filters for the commit-graph,
we limit the size of the filter by restricting the number of paths
in the diff. Instead of computing a large diff and then ignoring the
result, it is better to halt the diff computation early.

Create a new "max_changes" option in struct diff_options. If non-zero,
then halt the diff computation after discovering strictly more changed
paths. This includes paths corresponding to trees that change.

Use this max_changes option in the bloom filter calculations. This
reduces the time taken to compute the filters for the Linux kernel
repo from 2m50s to 2m35s. On a large internal repository with ~500
commits that perform tree-wide changes, the time reduced from
6m15s to 3m48s.

Signed-off-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Garima Singh &lt;garima.singh@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>tree-walk.c: remove the_repo from fill_tree_descriptor()</title>
<updated>2019-06-27T19:45:17Z</updated>
<author>
<name>Nguyễn Thái Ngọc Duy</name>
<email>pclouds@gmail.com</email>
</author>
<published>2019-06-27T09:28:48Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=5e575807337d15dbecf88767e709df08304e3b94'/>
<id>urn:sha1:5e575807337d15dbecf88767e709df08304e3b94</id>
<content type='text'>
While at there, clean up the_repo usage in builtin/merge-tree.c a tiny
bit.

Signed-off-by: Nguyễn Thái Ngọc Duy &lt;pclouds@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Use 'unsigned short' for mode, like diff_filespec does</title>
<updated>2019-04-08T07:02:07Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2019-04-05T15:00:12Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=5ec1e72823735b5682389589b6bee774ae70fa49'/>
<id>urn:sha1:5ec1e72823735b5682389589b6bee774ae70fa49</id>
<content type='text'>
struct diff_filespec defines mode to be an 'unsigned short'.  Several
other places in the API which we'd like to interact with using a
diff_filespec used a plain unsigned (or unsigned int).  This caused
problems when taking addresses, so switch to unsigned short.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'bc/tree-walk-oid'</title>
<updated>2019-01-29T20:47:56Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2019-01-29T20:47:56Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=371820d5f1bb3c3e691ad21cee652c02c36ea758'/>
<id>urn:sha1:371820d5f1bb3c3e691ad21cee652c02c36ea758</id>
<content type='text'>
The code to walk tree objects has been taught that we may be
working with object names that are not computed with SHA-1.

* bc/tree-walk-oid:
  cache: make oidcpy always copy GIT_MAX_RAWSZ bytes
  tree-walk: store object_id in a separate member
  match-trees: use hashcpy to splice trees
  match-trees: compute buffer offset correctly when splicing
  tree-walk: copy object ID before use
</content>
</entry>
<entry>
<title>tree-walk: store object_id in a separate member</title>
<updated>2019-01-15T17:57:41Z</updated>
<author>
<name>brian m. carlson</name>
<email>sandals@crustytoothpaste.net</email>
</author>
<published>2019-01-15T00:39:44Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=ea82b2a0857e3e0449bdce4e3987dee6adbc51ae'/>
<id>urn:sha1:ea82b2a0857e3e0449bdce4e3987dee6adbc51ae</id>
<content type='text'>
When parsing a tree, we read the object ID directly out of the tree
buffer. This is normally fine, but such an object ID cannot be used with
oidcpy, which copies GIT_MAX_RAWSZ bytes, because if we are using SHA-1,
there may not be that many bytes to copy.

Instead, store the object ID in a separate struct member. Since we can
no longer efficiently compute the path length, store that information as
well in struct name_entry. Ensure we only copy the object ID into the
new buffer if the path length is nonzero, as some callers will pass us
an empty path with no object ID following it, and we will not want to
read past the end of the buffer.

Signed-off-by: brian m. carlson &lt;sandals@crustytoothpaste.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>tree-walk.c: make tree_entry_interesting() take an index</title>
<updated>2018-11-19T01:50:33Z</updated>
<author>
<name>Nguyễn Thái Ngọc Duy</name>
<email>pclouds@gmail.com</email>
</author>
<published>2018-11-18T16:47:57Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=67022e02145b3be774febf38e9dc228ed11e6f25'/>
<id>urn:sha1:67022e02145b3be774febf38e9dc228ed11e6f25</id>
<content type='text'>
In order to support :(attr) when matching pathspec on a tree,
tree_entry_interesting() needs to take an index (because
git_check_attr() needs it). This is the preparation step for it. This
also makes it clearer what index we fall back to when looking up
attributes during an unpack-trees operation: the source index.

This also fixes revs-&gt;pruning.repo initialization that should have
been done in 2abf350385 (revision.c: remove implicit dependency on
the_index - 2018-09-21). Without it, skip_uninteresting() will
dereference a NULL pointer through this call chain

  get_revision(revs)
  get_revision_internal
  get_revision_1
  try_to_simplify_commit
  rev_compare_tree
  diff_tree_oid(..., &amp;revs-&gt;pruning)
  ll_diff_tree_oid
  diff_tree_paths
  ll_diff_tree
  skip_uninteresting

Signed-off-by: Nguyễn Thái Ngọc Duy &lt;pclouds@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'nd/the-index'</title>
<updated>2018-10-19T04:34:02Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2018-10-19T04:34:02Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=11877b9ebec87175a9cc55676311ef2d98585cca'/>
<id>urn:sha1:11877b9ebec87175a9cc55676311ef2d98585cca</id>
<content type='text'>
Various codepaths in the core-ish part learn to work on an
arbitrary in-core index structure, not necessarily the default
instance "the_index".

* nd/the-index: (23 commits)
  revision.c: reduce implicit dependency the_repository
  revision.c: remove implicit dependency on the_index
  ws.c: remove implicit dependency on the_index
  tree-diff.c: remove implicit dependency on the_index
  submodule.c: remove implicit dependency on the_index
  line-range.c: remove implicit dependency on the_index
  userdiff.c: remove implicit dependency on the_index
  rerere.c: remove implicit dependency on the_index
  sha1-file.c: remove implicit dependency on the_index
  patch-ids.c: remove implicit dependency on the_index
  merge.c: remove implicit dependency on the_index
  merge-blobs.c: remove implicit dependency on the_index
  ll-merge.c: remove implicit dependency on the_index
  diff-lib.c: remove implicit dependency on the_index
  read-cache.c: remove implicit dependency on the_index
  diff.c: remove implicit dependency on the_index
  grep.c: remove implicit dependency on the_index
  diff.c: remove the_index dependency in textconv() functions
  blame.c: rename "repo" argument to "r"
  combine-diff.c: remove implicit dependency on the_index
  ...
</content>
</entry>
</feed>
