<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/builtin/unpack-objects.c, branch v2.40.2</title>
<subtitle>Mirror of https://git.kernel.org/pub/scm/git/git.git/</subtitle>
<id>https://git.shady.money/git/atom?h=v2.40.2</id>
<link rel='self' href='https://git.shady.money/git/atom?h=v2.40.2'/>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/'/>
<updated>2022-06-13T17:22:36Z</updated>
<entry>
<title>unpack-objects: use stream_loose_object() to unpack large objects</title>
<updated>2022-06-13T17:22:36Z</updated>
<author>
<name>Han Xin</name>
<email>hanxin.hx@alibaba-inc.com</email>
</author>
<published>2022-06-11T02:44:21Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=aaf81223f48f710a3b9a64cc84fac93deed806b6'/>
<id>urn:sha1:aaf81223f48f710a3b9a64cc84fac93deed806b6</id>
<content type='text'>
Make use of the stream_loose_object() function introduced in the
preceding commit to unpack large objects. Before this we'd need to
malloc() the size of the blob before unpacking it, which could cause
OOM with very large blobs.
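
To make the shape of the change concrete, here is a minimal C sketch of
the dispatch, not the actual patch: unpack_one_object(),
unpack_buffered() and unpack_streamed() are made-up stand-ins, and only
the blob/threshold check mirrors what builtin/unpack-objects.c now does
via stream_loose_object():

	#include &lt;stddef.h&gt;

	/* Hypothetical stand-ins for git's internals; illustration only. */
	enum object_type { OBJ_NONE, OBJ_COMMIT, OBJ_TREE, OBJ_BLOB, OBJ_TAG };
	extern size_t big_file_threshold;  /* backs core.bigFileThreshold */
	extern void unpack_buffered(enum object_type, size_t); /* old malloc(size) path */
	extern void unpack_streamed(enum object_type, size_t); /* streaming path */

	static void unpack_one_object(enum object_type type, size_t size)
	{
		/*
		 * Large blobs take the streaming path with bounded memory
		 * use; everything else keeps the faster in-core path.
		 */
		if (type == OBJ_BLOB &amp;&amp; size &gt; big_file_threshold)
			unpack_streamed(type, size);
		else
			unpack_buffered(type, size);
	}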

We could use the new streaming interface to unpack all blobs, but
doing so would be much slower, as demonstrated e.g. with this
benchmark using git-hyperfine[0]:

	rm -rf /tmp/scalar.git &amp;&amp;
	git clone --bare https://github.com/Microsoft/scalar.git /tmp/scalar.git &amp;&amp;
	mv /tmp/scalar.git/objects/pack/*.pack /tmp/scalar.git/my.pack &amp;&amp;
	git hyperfine \
		-r 2 --warmup 1 \
		-L rev origin/master,HEAD -L v "10,512,1k,1m" \
		-s 'make' \
		-p 'git init --bare dest.git' \
		-c 'rm -rf dest.git' \
		'./git -C dest.git -c core.bigFileThreshold={v} unpack-objects &lt;/tmp/scalar.git/my.pack'

With this change we'll perform worse in terms of speed the lower the
core.bigFileThreshold setting, but we get lower memory use in return:

	Summary
	  './git -C dest.git -c core.bigFileThreshold=10 unpack-objects &lt;/tmp/scalar.git/my.pack' in 'origin/master' ran
	    1.01 ± 0.01 times faster than './git -C dest.git -c core.bigFileThreshold=1k unpack-objects &lt;/tmp/scalar.git/my.pack' in 'origin/master'
	    1.01 ± 0.01 times faster than './git -C dest.git -c core.bigFileThreshold=1m unpack-objects &lt;/tmp/scalar.git/my.pack' in 'origin/master'
	    1.01 ± 0.02 times faster than './git -C dest.git -c core.bigFileThreshold=1m unpack-objects &lt;/tmp/scalar.git/my.pack' in 'HEAD'
	    1.02 ± 0.00 times faster than './git -C dest.git -c core.bigFileThreshold=512 unpack-objects &lt;/tmp/scalar.git/my.pack' in 'origin/master'
	    1.09 ± 0.01 times faster than './git -C dest.git -c core.bigFileThreshold=1k unpack-objects &lt;/tmp/scalar.git/my.pack' in 'HEAD'
	    1.10 ± 0.00 times faster than './git -C dest.git -c core.bigFileThreshold=512 unpack-objects &lt;/tmp/scalar.git/my.pack' in 'HEAD'
	    1.11 ± 0.00 times faster than './git -C dest.git -c core.bigFileThreshold=10 unpack-objects &lt;/tmp/scalar.git/my.pack' in 'HEAD'

A better benchmark for demonstrating the benefits of this change is the
following, which creates an artificial repo with 1, 25, 50, 75 and
100MB blobs:

	rm -rf /tmp/repo &amp;&amp;
	git init /tmp/repo &amp;&amp;
	(
		cd /tmp/repo &amp;&amp;
		for i in 1 25 50 75 100
		do
			dd if=/dev/urandom of=blob.$i count=$(($i*1024)) bs=1024
		done &amp;&amp;
		git add blob.* &amp;&amp;
		git commit -mblobs &amp;&amp;
		git gc &amp;&amp;
		PACK=$(echo .git/objects/pack/pack-*.pack) &amp;&amp;
		cp "$PACK" my.pack
	) &amp;&amp;
	git hyperfine \
		--show-output \
		-L rev origin/master,HEAD -L v "512,50m,100m" \
		-s 'make' \
		-p 'git init --bare dest.git' \
		-c 'rm -rf dest.git' \
		'/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold={v} unpack-objects &lt;/tmp/repo/my.pack 2&gt;&amp;1 | grep Maximum'

Using this test we'll always use &gt;100MB of memory on origin/master
(around 105MB), but max out at e.g. ~55MB if we set
core.bigFileThreshold=50m.

The relevant "Maximum resident set size" lines were manually added
below the relevant benchmark:

  '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=50m unpack-objects &lt;/tmp/repo/my.pack 2&gt;&amp;1 | grep Maximum' in 'origin/master' ran
        Maximum resident set size (kbytes): 107080
    1.02 ± 0.78 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=512 unpack-objects &lt;/tmp/repo/my.pack 2&gt;&amp;1 | grep Maximum' in 'origin/master'
        Maximum resident set size (kbytes): 106968
    1.09 ± 0.79 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=100m unpack-objects &lt;/tmp/repo/my.pack 2&gt;&amp;1 | grep Maximum' in 'origin/master'
        Maximum resident set size (kbytes): 107032
    1.42 ± 1.07 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=100m unpack-objects &lt;/tmp/repo/my.pack 2&gt;&amp;1 | grep Maximum' in 'HEAD'
        Maximum resident set size (kbytes): 107072
    1.83 ± 1.02 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=50m unpack-objects &lt;/tmp/repo/my.pack 2&gt;&amp;1 | grep Maximum' in 'HEAD'
        Maximum resident set size (kbytes): 55704
    2.16 ± 1.19 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=512 unpack-objects &lt;/tmp/repo/my.pack 2&gt;&amp;1 | grep Maximum' in 'HEAD'
        Maximum resident set size (kbytes): 4564

This shows that, given enough memory, this new streaming method gets
slower the lower you set the streaming threshold, but in return memory
use stays bounded.

An earlier version of this patch introduced a new
"core.bigFileStreamingThreshold" variable instead of re-using the
existing "core.bigFileThreshold"[1]. As noted in a detailed overview of
its users in [2], the existing variable already carries several
different meanings.

Still, we consider it good enough to simply re-use it. While it's
possible that someone might want to e.g. consider objects "small" for
the purposes of diffing but "big" for the purposes of writing them,
such use-cases are probably too obscure to worry about. We can always
split up "core.bigFileThreshold" in the future if there's a need for
that.

0. https://github.com/avar/git-hyperfine/
1. https://lore.kernel.org/git/20211210103435.83656-1-chiyutianyi@gmail.com/
2. https://lore.kernel.org/git/20220120112114.47618-5-chiyutianyi@gmail.com/

Helped-by: Ævar Arnfjörð Bjarmason &lt;avarab@gmail.com&gt;
Helped-by: Derrick Stolee &lt;stolee@gmail.com&gt;
Helped-by: Jiang Xin &lt;zhiyou.jx@alibaba-inc.com&gt;
Signed-off-by: Han Xin &lt;chiyutianyi@gmail.com&gt;
Signed-off-by: Ævar Arnfjörð Bjarmason &lt;avarab@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>unpack-objects: low memory footprint for get_data() in dry_run mode</title>
<updated>2022-06-13T17:22:35Z</updated>
<author>
<name>Han Xin</name>
<email>hanxin.hx@alibaba-inc.com</email>
</author>
<published>2022-06-11T02:44:16Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=a1bf5ca29f7ebb41e24318147518fc7c738303e5'/>
<id>urn:sha1:a1bf5ca29f7ebb41e24318147518fc7c738303e5</id>
<content type='text'>
As the name implies, "get_data(size)" will allocate and return a given
amount of memory. Allocating memory for a large blob object may cause the
system to run out of memory. Before preparing to replace calling of
"get_data()" to unpack large blob objects in latter commits, refactor
"get_data()" to reduce memory footprint for dry_run mode.

Because in dry_run mode, "get_data()" is only used to check the
integrity of data, and the returned buffer is not used at all, we can
allocate a smaller buffer and use it as zstream output. Make the function
return NULL in the dry-run mode, as no callers use the returned buffer.
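
As a rough sketch of that idea (using plain zlib rather than git's
zstream wrappers; the name, signature and input refilling here are not
the real code), the dry-run path can inflate into a small scratch
buffer that is reused for every chunk:

	#include &lt;stdio.h&gt;
	#include &lt;stdlib.h&gt;
	#include &lt;zlib.h&gt;

	/* Sketch only: verify the deflated stream, keep no output around. */
	static void *get_data_dry_run(z_stream *stream, size_t expected_size)
	{
		unsigned char scratch[8192];   /* reused; bounds memory use */
		size_t total = 0;
		int status;

		do {
			stream-&gt;next_out = scratch;
			stream-&gt;avail_out = sizeof(scratch);
			/* the real code refills stream-&gt;next_in from the pack here */
			status = inflate(stream, Z_NO_FLUSH);
			if (status != Z_OK &amp;&amp; status != Z_STREAM_END) {
				fprintf(stderr, "inflate error %d\n", status);
				exit(1);
			}
			total += sizeof(scratch) - stream-&gt;avail_out;
		} while (status != Z_STREAM_END);

		if (total != expected_size)
			fprintf(stderr, "inflated size mismatch\n");
		return NULL;   /* dry run: no caller looks at the contents */
	}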

The "find [...]objects/?? -type f | wc -l" test idiom being used here
is adapted from the same "find" use added to another test in
d9545c7f465 (fast-import: implement unpack limit, 2016-04-25).

Suggested-by: Jiang Xin &lt;zhiyou.jx@alibaba-inc.com&gt;
Signed-off-by: Han Xin &lt;chiyutianyi@gmail.com&gt;
Signed-off-by: Ævar Arnfjörð Bjarmason &lt;avarab@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'ns/batch-fsync'</title>
<updated>2022-06-03T21:30:34Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2022-06-03T21:30:34Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=83937e9592832408670da38bfe6e96c90ad63521'/>
<id>urn:sha1:83937e9592832408670da38bfe6e96c90ad63521</id>
<content type='text'>
Introduce a filesystem-dependent mechanism to optimize the way the
bits for many loose object files are ensured to hit the disk
platter.

* ns/batch-fsync:
  core.fsyncmethod: performance tests for batch mode
  t/perf: add iteration setup mechanism to perf-lib
  core.fsyncmethod: tests for batch mode
  test-lib-functions: add parsing helpers for ls-files and ls-tree
  core.fsync: use batch mode and sync loose objects by default on Windows
  unpack-objects: use the bulk-checkin infrastructure
  update-index: use the bulk-checkin infrastructure
  builtin/add: add ODB transaction around add_files_to_cache
  cache-tree: use ODB transaction around writing a tree
  core.fsyncmethod: batched disk flushes for loose-objects
  bulk-checkin: rebrand plug/unplug APIs as 'odb transactions'
  bulk-checkin: rename 'state' variable and separate 'plugged' boolean
</content>
</entry>
<entry>
<title>unpack-objects: use the bulk-checkin infrastructure</title>
<updated>2022-04-06T20:13:26Z</updated>
<author>
<name>Neeraj Singh</name>
<email>neerajsi@microsoft.com</email>
</author>
<published>2022-04-05T05:20:13Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=425d290ce564ada84a5322545dbbd1866f2a29c4'/>
<id>urn:sha1:425d290ce564ada84a5322545dbbd1866f2a29c4</id>
<content type='text'>
The unpack-objects functionality is used by fetch, push, and
fast-import to turn the transferred data into object database entries
when there are fewer objects than the 'unpackLimit' setting.

By enabling an odb-transaction when unpacking objects, we can take advantage
of batched fsyncs.
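
The pattern is simple; a minimal sketch, assuming the
begin_odb_transaction()/end_odb_transaction() pair that the
"bulk-checkin: rebrand plug/unplug APIs as 'odb transactions'" change
earlier in this series provides (unpack_next_object() is a made-up
placeholder for the per-object work):

	/* Assumed shape of the bulk-checkin API; sketch only. */
	void begin_odb_transaction(void);
	void end_odb_transaction(void);
	int unpack_next_object(void);    /* hypothetical per-object step */

	static void unpack_all(int nr_objects)
	{
		int i;

		begin_odb_transaction();   /* defer per-object fsyncs */
		for (i = 0; i &lt; nr_objects; i++)
			unpack_next_object();
		end_odb_transaction();     /* flush once for the whole batch */
	}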

Here are some performance numbers to justify batch mode for
unpack-objects, collected on a WSL2 Ubuntu VM.

Fsync Mode | Time for 90 objects (ms)
-------------------------------------
       Off | 170
  On,fsync | 760
  On,batch | 230

Note that the default unpackLimit is 100 objects, so there's a 3x
benefit in the worst case. The non-batch mode fsync scales linearly
with the number of objects, so there are significant benefits even with
smaller numbers of objects.

Signed-off-by: Neeraj Singh &lt;neerajsi@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>object-file API: have hash_object_file() take "enum object_type"</title>
<updated>2022-02-26T01:16:32Z</updated>
<author>
<name>Ævar Arnfjörð Bjarmason</name>
<email>avarab@gmail.com</email>
</author>
<published>2022-02-04T23:48:32Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=44439c1c5827480f68b37c3cc38f257eaeb3ed2c'/>
<id>urn:sha1:44439c1c5827480f68b37c3cc38f257eaeb3ed2c</id>
<content type='text'>
Change the hash_object_file() function to take an "enum
object_type".

Since a preceding commit, all of its callers pass either
"{commit,tree,blob,tag}_type" or the result of a call to type_name();
the parse_object() caller that would have passed NULL now uses
stream_object_signature().
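
For illustration only (the argument list is abbreviated; the real
prototype also takes the hash algorithm), a call site changes roughly
like this:

	struct object_id;
	enum object_type { OBJ_NONE, OBJ_COMMIT, OBJ_TREE, OBJ_BLOB, OBJ_TAG };

	/* assumed, abbreviated shape of the new prototype */
	void hash_object_file(const void *buf, unsigned long len,
			      enum object_type type, struct object_id *oid);

	static void hash_a_blob(const void *buf, unsigned long len,
				struct object_id *oid)
	{
		/* was: hash_object_file(buf, len, blob_type, oid); */
		hash_object_file(buf, len, OBJ_BLOB, oid);
	}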

Signed-off-by: Ævar Arnfjörð Bjarmason &lt;avarab@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>object-file API: have write_object_file() take "enum object_type"</title>
<updated>2022-02-26T01:16:31Z</updated>
<author>
<name>Ævar Arnfjörð Bjarmason</name>
<email>avarab@gmail.com</email>
</author>
<published>2022-02-04T23:48:26Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=c80d226a046170b1c8dd82ef72a27373ddd5880e'/>
<id>urn:sha1:c80d226a046170b1c8dd82ef72a27373ddd5880e</id>
<content type='text'>
Change the write_object_file() function to take an "enum object_type"
instead of a "const char *type". Its callers either passed
{commit,tree,blob,tag}_type and can pass the corresponding OBJ_* type
instead, or were hardcoding strings like "blob".

This avoids the back &amp; forth fragility where callers of
write_object_file() would have the enum type and convert it themselves
via type_name(). We now have to do that conversion ourselves before
calling write_object_file_prepare(), but those codepaths will be
similarly adjusted in subsequent commits.
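
As a sketch of that internal conversion (the *_sketch names are made up
and stand in for the real helpers), the string form is now derived once
inside the object-file code rather than at every call site:

	enum object_type { OBJ_NONE, OBJ_COMMIT, OBJ_TREE, OBJ_BLOB, OBJ_TAG };

	static const char *type_name_sketch(enum object_type type)
	{
		static const char *names[] = { "", "commit", "tree", "blob", "tag" };
		return names[type];
	}

	static void write_object_file_sketch(const void *buf, unsigned long len,
					     enum object_type type)
	{
		/* one internal conversion instead of one per caller */
		const char *type_str = type_name_sketch(type);
		/* ... the "&lt;type&gt; &lt;size&gt;" header is built from type_str ... */
		(void)buf; (void)len; (void)type_str;
	}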

Signed-off-by: Ævar Arnfjörð Bjarmason &lt;avarab@gmail.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>hash: provide per-algorithm null OIDs</title>
<updated>2021-04-27T07:31:39Z</updated>
<author>
<name>brian m. carlson</name>
<email>sandals@crustytoothpaste.net</email>
</author>
<published>2021-04-26T01:02:56Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=14228447c9ce664a4e9c31ba10344ec5e4ea4ba5'/>
<id>urn:sha1:14228447c9ce664a4e9c31ba10344ec5e4ea4ba5</id>
<content type='text'>
Up until recently, object IDs did not have an algorithm member, only a
hash.  Consequently, it was possible to share one null (all-zeros)
object ID among all hash algorithms.  Now that we're going to be
handling objects from multiple hash algorithms, it's important to make
sure that all object IDs have a correct algorithm field.

Introduce a per-algorithm null OID, and add it to struct hash_algo.
Introduce a wrapper function as well, and use it everywhere we used to
use the null_oid constant.
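
A minimal sketch of the shape of the change, with simplified *_sketch
stand-ins for the real types in hash.h:

	/* simplified stand-ins for struct object_id / struct git_hash_algo */
	struct object_id_sketch { unsigned char hash[32]; int algo; };

	struct hash_algo_sketch {
		const char *name;
		const struct object_id_sketch *null_oid; /* per-algorithm all-zeros OID */
		/* ... sizes, hashing function pointers, ... */
	};

	extern const struct hash_algo_sketch *the_hash_algo;

	/* wrapper used at call sites instead of the old global constant */
	static const struct object_id_sketch *null_oid_sketch(void)
	{
		return the_hash_algo-&gt;null_oid;
	}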

Signed-off-by: brian m. carlson &lt;sandals@crustytoothpaste.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Use the final_oid_fn to finalize hashing of object IDs</title>
<updated>2021-04-27T07:31:38Z</updated>
<author>
<name>brian m. carlson</name>
<email>sandals@crustytoothpaste.net</email>
</author>
<published>2021-04-26T01:02:53Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=5951bf467ea92458c3bea3051c8413041f3b27d5'/>
<id>urn:sha1:5951bf467ea92458c3bea3051c8413041f3b27d5</id>
<content type='text'>
When we're hashing a value which is going to be an object ID, we want to
zero-pad that value if necessary.  To do so, use the final_oid_fn
instead of the final_fn anytime we're going to create an object ID to
ensure we perform this operation.
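
Conceptually (a sketch only, with simplified stand-ins for git's hash
plumbing), the *_oid variant finishes the hash and zero-pads any bytes
beyond the algorithm's raw size, so a shorter hash such as SHA-1 never
leaves trailing garbage in the fixed-size buffer:

	#include &lt;string.h&gt;

	#define MAX_RAWSZ 32   /* large enough for any supported algorithm */

	struct object_id_sketch { unsigned char hash[MAX_RAWSZ]; };

	static void final_oid_sketch(struct object_id_sketch *oid,
				     const unsigned char *raw, size_t rawsz)
	{
		memcpy(oid-&gt;hash, raw, rawsz);
		memset(oid-&gt;hash + rawsz, 0, MAX_RAWSZ - rawsz); /* zero-pad */
	}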

Signed-off-by: brian m. carlson &lt;sandals@crustytoothpaste.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Always use oidread to read into struct object_id</title>
<updated>2021-04-27T07:31:38Z</updated>
<author>
<name>brian m. carlson</name>
<email>sandals@crustytoothpaste.net</email>
</author>
<published>2021-04-26T01:02:50Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=92e2cab96b8b8ea9a076dc279864226b3d0863e9'/>
<id>urn:sha1:92e2cab96b8b8ea9a076dc279864226b3d0863e9</id>
<content type='text'>
In the future, we'll want oidread to automatically set the hash
algorithm member for an object ID we read into it, so ensure we use
oidread instead of hashcpy everywhere we're copying a hash value into a
struct object_id.
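
A sketch of the substitution (simplified stand-ins with explicit
parameters; the real oidread() consults the repository's hash algorithm
itself): the point is that oidread() targets a whole struct object_id,
so there is somewhere to record the algorithm, which a bare hashcpy()
of raw bytes never could:

	#include &lt;string.h&gt;

	struct object_id_sketch {
		unsigned char hash[32];
		int algo;   /* which hash algorithm these bytes belong to */
	};

	static void oidread_sketch(struct object_id_sketch *oid,
				   const unsigned char *hash,
				   size_t rawsz, int algo)
	{
		memcpy(oid-&gt;hash, hash, rawsz);
		oid-&gt;algo = algo;   /* hashcpy() alone could never set this */
	}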

Signed-off-by: brian m. carlson &lt;sandals@crustytoothpaste.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'ab/fsck-api-cleanup'</title>
<updated>2021-04-07T23:54:09Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2021-04-07T23:54:09Z</published>
<link rel='alternate' type='text/html' href='https://git.shady.money/git/commit/?id=5644419d04a6bb25c36c965acf0da0e700f1bd3f'/>
<id>urn:sha1:5644419d04a6bb25c36c965acf0da0e700f1bd3f</id>
<content type='text'>
Fsck API clean-up.

* ab/fsck-api-cleanup:
  fetch-pack: use new fsck API to print dangling submodules
  fetch-pack: use file-scope static struct for fsck_options
  fetch-pack: don't needlessly copy fsck_options
  fsck.c: move gitmodules_{found,done} into fsck_options
  fsck.c: add an fsck_set_msg_type() API that takes enums
  fsck.c: pass along the fsck_msg_id in the fsck_error callback
  fsck.[ch]: move FOREACH_FSCK_MSG_ID &amp; fsck_msg_id from *.c to *.h
  fsck.c: give "FOREACH_MSG_ID" a more specific name
  fsck.c: undefine temporary STR macro after use
  fsck.c: call parse_msg_type() early in fsck_set_msg_type()
  fsck.h: re-order and re-assign "enum fsck_msg_type"
  fsck.h: move FSCK_{FATAL,INFO,ERROR,WARN,IGNORE} into an enum
  fsck.c: refactor fsck_msg_type() to limit scope of "int msg_type"
  fsck.c: rename remaining fsck_msg_id "id" to "msg_id"
  fsck.c: remove (mostly) redundant append_msg_id() function
  fsck.c: rename variables in fsck_set_msg_type() for less confusion
  fsck.h: use "enum object_type" instead of "int"
  fsck.h: use designated initializers for FSCK_OPTIONS_{DEFAULT,STRICT}
  fsck.c: refactor and rename common config callback
</content>
</entry>
</feed>
