<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/diffcore-rename.c, branch v2.32.2</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/
</subtitle>
<id>https://git.shady.money/git/atom?h=v2.32.2</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.32.2'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2021-04-16T20:53:33Z</updated>
<entry>
<title>Merge branch 'en/ort-perf-batch-10'</title>
<updated>2021-04-16T20:53:33Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2021-04-16T20:53:33Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=e2e1a03f6b9c0869e64710c94fc43df82a06f0c8'/>
<id>urn:sha1:e2e1a03f6b9c0869e64710c94fc43df82a06f0c8</id>
<content type='text'>
Various rename detection optimization to help "ort" merge strategy
backend.

* en/ort-perf-batch-10:
  diffcore-rename: determine which relevant_sources are no longer relevant
  merge-ort: record the reason that we want a rename for a file
  diffcore-rename: add computation of number of unknown renames
  diffcore-rename: check if we have enough renames for directories early on
  diffcore-rename: only compute dir_rename_count for relevant directories
  merge-ort: record the reason that we want a rename for a directory
  merge-ort, diffcore-rename: tweak dirs_removed and relevant_source type
  diffcore-rename: take advantage of "majority rules" to skip more renames
</content>
</entry>
<entry>
<title>Merge branch 'en/ort-perf-batch-9'</title>
<updated>2021-04-08T20:23:26Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2021-04-08T20:23:26Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=1b31224e59750f515f7ceb7adab2a7609371327d'/>
<id>urn:sha1:1b31224e59750f515f7ceb7adab2a7609371327d</id>
<content type='text'>
The ort merge backend has been optimized by skipping irrelevant
renames.

* en/ort-perf-batch-9:
  diffcore-rename: avoid doing basename comparisons for irrelevant sources
  merge-ort: skip rename detection entirely if possible
  merge-ort: use relevant_sources to filter possible rename sources
  merge-ort: precompute whether directory rename detection is needed
  merge-ort: introduce wrappers for alternate tree traversal
  merge-ort: add data structures for an alternate tree traversal
  merge-ort: precompute subset of sources for which we need rename detection
  diffcore-rename: enable filtering possible rename sources
</content>
</entry>
<entry>
<title>Merge branch 'en/ort-perf-batch-8'</title>
<updated>2021-03-22T21:00:24Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2021-03-22T21:00:23Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=dd4048d1c76f067ad7b339dfb3a5f4765f5ae979'/>
<id>urn:sha1:dd4048d1c76f067ad7b339dfb3a5f4765f5ae979</id>
<content type='text'>
Rename detection rework continues.

* en/ort-perf-batch-8:
  diffcore-rename: compute dir_rename_guess from dir_rename_counts
  diffcore-rename: limit dir_rename_counts computation to relevant dirs
  diffcore-rename: compute dir_rename_counts in stages
  diffcore-rename: extend cleanup_dir_rename_info()
  diffcore-rename: move dir_rename_counts into dir_rename_info struct
  diffcore-rename: add function for clearing dir_rename_count
  Move computation of dir_rename_count from merge-ort to diffcore-rename
  diffcore-rename: add a mapping of destination names to their indices
  diffcore-rename: provide basic implementation of idx_possible_rename()
  diffcore-rename: use directory rename guided basename comparisons
</content>
</entry>
<entry>
<title>diffcore-rename: determine which relevant_sources are no longer relevant</title>
<updated>2021-03-18T21:32:56Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2021-03-13T22:22:08Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=9bd342137eb4dfe3852f657f8afcc637e68b1439'/>
<id>urn:sha1:9bd342137eb4dfe3852f657f8afcc637e68b1439</id>
<content type='text'>
As noted a few commits ago ("diffcore-rename: only compute
dir_rename_count for relevant directories"), when a source file rename
is used as part of directory rename detection, we need to increment
counts for each ancestor directory in dirs_removed with value
RELEVANT_FOR_SELF.  However, a few commits ago ("diffcore-rename: check
if we have enough renames for directories early on"), we may have
downgraded all relevant ancestor directories from RELEVANT_FOR_SELF to
RELEVANT_FOR_ANCESTOR.

For a given file, if no ancestor directory is found in dirs_removed with
a value of RELEVANT_FOR_SELF, then we can downgrade
relevant_source[PATH] from RELEVANT_LOCATION to RELEVANT_NO_MORE.  This
means we can skip detecting a rename for that particular path (and any
other paths in the same directory).

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:        5.680 s ±  0.096 s     5.665 s ±  0.129 s
    mega-renames:     13.812 s ±  0.162 s    11.435 s ±  0.158 s
    just-one-mega:   506.0  ms ±  3.9  ms   494.2  ms ±  6.1  ms

While this improvement looks rather modest for these testcases (because
all the previous optimizations were sufficient to nearly remove all time
spent in rename detection already),  consider this alternative testcase
tweaked from the ones in commit 557ac0350d as follows

    &lt;Same initial setup as commit 557ac0350d, then...&gt;
    $ git switch -c add-empty-file v5.5
    $ &gt;drivers/gpu/drm/i915/new-empty-file
    $ git add drivers/gpu/drm/i915/new-empty-file
    $ git commit -m "new file"
    $ git switch 5.4-rename
    $ git cherry-pick --strategy=ort add-empty-file

For this testcase, we see the following improvement:

                            Before                  After
    pick-empty:        1.936 s ±  0.024 s     688.1 ms ±  4.2 ms

So roughly a factor of 3 speedup.  At $DAYJOB, there was a particular
repository and cherry-pick that inspired this optimization; for that
case I saw a speedup factor of 7 with this optimization.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>diffcore-rename: add computation of number of unknown renames</title>
<updated>2021-03-18T21:32:56Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2021-03-13T22:22:06Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=bf238b71377e924066ca5155a787c167177f706c'/>
<id>urn:sha1:bf238b71377e924066ca5155a787c167177f706c</id>
<content type='text'>
The previous commit can only be effective if we have a computation of
the number of paths under a given directory which are still have pending
renames, and expected this number to be recorded in the dir_rename_count
map under the key UNKNOWN_DIR.  Add the code necessary to compute these
values.

Note that this change means dir_rename_count might have a directory
whose only entry (for UNKNOWN_DIR) was removed by the time merge-ort
goes to check it.  To account for this, merge-ort needs to check for the
case where the max count is 0.

With this change we are now computing the necessary value for each
directory in dirs_removed, but are not using that value anywhere.  The
next two commits will make use of the values stored in dirs_removed in
order to compute whether each relevant_source (that is needed only for
directory rename detection) has become unnecessary.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>diffcore-rename: check if we have enough renames for directories early on</title>
<updated>2021-03-18T21:32:56Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2021-03-13T22:22:05Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=0491d39297132c48e36c2f73cb0dfe7e22b3c52f'/>
<id>urn:sha1:0491d39297132c48e36c2f73cb0dfe7e22b3c52f</id>
<content type='text'>
As noted in the past few commits, if we can determine that a directory
already has enough renames to determine how directory rename detection
will be decided for that directory, then we can mark that directory as
no longer needing any more renames detected for files underneath it.
For such directories, we change the value in the dirs_removed map from
RELEVANT_TO_SELF to RELEVANT_FOR_ANCESTOR.

A subsequent patch will use this information while iterating over the
remaining potential rename sources to mark ones that were only
location_relevant as unneeded if no containing directory is still marked
as RELEVANT_TO_SELF.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>diffcore-rename: only compute dir_rename_count for relevant directories</title>
<updated>2021-03-18T21:32:55Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2021-03-13T22:22:04Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=e54385b97a69e09ec12be5f8eef234777c801257'/>
<id>urn:sha1:e54385b97a69e09ec12be5f8eef234777c801257</id>
<content type='text'>
When one side adds files to a directory that the other side renamed,
directory rename detection is used to either move the new paths to the
newer directory or warn the user about the fact that another path
location might be better.

If a parent of the given directory had new files added to it, any
renames in the current directory are also part of determining where the
parent directory is renamed to.  Thus, naively, we need to record each
rename N times for a path at depth N.  However, we can use the
additional information added to dirs_removed in the last commit to avoid
traversing all N parent directories in many cases.  Let's use an example
to explain how this works.  If we have a path named
   src/old_dir/a/b/file.c
and src/old_dir doesn't exist on one side of history, but the other
added a file named src/old_dir/newfile.c, then if one side renamed
   src/old_dir/a/b/file.c =&gt; source/new_dir/a/b/file.c
then this file would affect potential directory rename detection counts
for
   src/old_dir/a/b =&gt; source/new_dir/a/b
   src/old_dir/a   =&gt; source/new_dir/a
   src/old_dir     =&gt; source/new_dir
   src             =&gt; source
adding a weight of 1 to each in dir_rename_counts.  However, if src/
exists on both sides of history, then we don't need to track any entries
for it in dir_rename_counts.  That was implemented previously.  What we
are adding now, is that if no new files were added to src/old_dir/a or
src/old_dir/b, then we don't need to have counts in dir_rename_count
for those directories either.

In short, we only need to track counts in dir_rename_count for
directories whose dirs_removed value is RELEVANT_FOR_SELF.  And as soon
as we reach a directory that isn't in dirs_removed (signalled by
returning the default value of NOT_RELEVANT from strintmap_get()), we
can stop looking any further up the directory hierarchy.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>merge-ort: record the reason that we want a rename for a directory</title>
<updated>2021-03-18T21:32:55Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2021-03-13T22:22:03Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=fb52938eec1cf6ec3169152362fe774849f5ac9b'/>
<id>urn:sha1:fb52938eec1cf6ec3169152362fe774849f5ac9b</id>
<content type='text'>
When one side of history renames a directory, and the other side of
history added files to the old directory, directory rename detection is
used to warn about the location of the added files so the user can
move them to the old directory or keep them with the new one.

This sets up three different types of directories:
  * directories that had new files added to them
  * directories underneath a directory that had new files added to them
  * directories where no new files were added to it or any leading path

Save this information in dirs_removed; the next several commits will
make use of this information.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>merge-ort, diffcore-rename: tweak dirs_removed and relevant_source type</title>
<updated>2021-03-18T21:32:55Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2021-03-13T22:22:02Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=a49b55d52e7dcfd628b6328c9098d734ebe7a97d'/>
<id>urn:sha1:a49b55d52e7dcfd628b6328c9098d734ebe7a97d</id>
<content type='text'>
As noted in the previous commit, we want to be able to take advantage of
the "majority rules" portion of directory rename detection to avoid
detecting more renames than necessary.  However, for diffcore-rename to
take advantage of that, it needs to know whether a rename source file
was needed for just directory rename detection reasons, or if it is
wanted for potential three-way content merging.  Modify relevant_sources
from a strset to a strintmap, so we can encode additional information.

We also modify dirs_removed from a strset to a strintmap at the same
time because trying to determine what files are needed for directory
rename detection will require us tracking a bit more information for
each directory.

This commit only changes the types of the two variables from strset to
strintmap; it does not actually store any special values yet and for now
only checks for presence of entries in the strintmap.  Thus, the code is
functionally identical to how it behaved before.  Future commits will
start associating values with each key for these two maps.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>diffcore-rename: take advantage of "majority rules" to skip more renames</title>
<updated>2021-03-18T21:32:55Z</updated>
<author>
<name>Elijah Newren</name>
<email>newren@gmail.com</email>
</author>
<published>2021-03-13T22:22:01Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=ae1db7b31c54c95b1ecb2005f7a37e3645ec9eda'/>
<id>urn:sha1:ae1db7b31c54c95b1ecb2005f7a37e3645ec9eda</id>
<content type='text'>
In directory rename detection (when a directory is removed on one side
of history and the other side adds new files to that directory), we work
to find where the greatest number of files within that directory were
renamed to so that the new files can be moved with the majority of the
files.

Naively, we can just do this by detecting renames for *all* files within
the removed/renamed directory, looking at all the destination
directories where files within that directory were moved, and if there
is more than one such directory then taking the one with the greatest
number of files as the directory where the old directory was renamed to.

However, sometimes there are enough renames from exact rename detection
or basename-guided rename detection that we have enough information to
determine the majority winner already.  Add a function meant to compute
whether particular renames are still needed based on this majority rules
check.  The next several commits will then add the necessary
infrastructure to get the information we need to compute which
additional rename sources we can skip.

An important side note for future further optimization:

There is a possible improvement to this optimization that I have not yet
attempted and will not be included in this series of patches: we could
first check whether exact renames provide enough information for us to
determine directory renames, and avoid doing basename-guided rename
detection on some or all of the RELEVANT_LOCATION files within those
directories.  In effect, this variant would mean doing the
handle_early_known_dir_renames() both after exact rename detection and
again after basename-guided rename detection, though it would also mean
decrementing the number of "unknown" renames for each rename we found
from basename-guided rename detection.  Adding this additional check for
skippable renames right after exact rename detection might turn out to
be valuable, especially for partial clones where it might allow us to
download certain source files entirely.  However, this particular
optimization was actually the last one I did in original implementation
order, and by the time I implemented this idea, every testcase I had was
sufficiently fast that further optimization was unwarranted.  If future
testcases arise that tax rename detection more heavily (or perhaps
partial clones can benefit from avoiding loading more objects), it may
be worth implementing this more involved variant.

Signed-off-by: Elijah Newren &lt;newren@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
