author Derrick Stolee <derrickstolee@github.com> 2025-02-03 17:11:06 +0000
committer Junio C Hamano <gitster@pobox.com> 2025-02-03 16:12:42 -0800
commit bff455576750bd013a3c87b15cc7086cb8c1eab0
tree 28c1c20f55e9da7b10e59b83d49e1ddcac775a08 /path-walk.c
parent backfill: add --min-batch-size=<n> option
backfill: add --sparse option
One way to significantly reduce the cost of a Git clone and later
fetches is to use a blobless partial clone and combine that with a
sparse-checkout that reduces the paths that need to be populated in the
working directory. Not only does this reduce the cost of clones and
fetches, the sparse-checkout reduces the number of objects needed to
download from a promisor remote.

However, history investigations can be expensive as computing blob
diffs will trigger promisor remote requests for one object at a time.
This can be avoided by downloading the blobs needed for the given
sparse-checkout using 'git backfill' and its new '--sparse' mode, at a
time that the user is willing to pay that extra cost.

Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees
and we are using client-side logic to avoid downloading blobs outside of
the sparse-checkout cone. This avoids the server-side cost of walking
trees while also achieving a similar goal. It also downloads in batches
based on similar path names, presenting a resumable download if things
are interrupted.

This augments the path-walk API to have a possibly-NULL 'pl' member that
may point to a 'struct pattern_list'. This could be more general than
the sparse-checkout definition at HEAD, but 'git backfill --sparse' is
currently the only consumer.

Be sure to test this in both cone mode and not cone mode. Cone mode has
the benefit that the path-walk can skip certain paths once they would
expand beyond the sparse-checkout. Non-cone mode can describe the
included files using both positive and negative patterns, which changes
the possible return values of path_matches_pattern_list(). Test both
kinds of matches for increased coverage.

To test this, we can create a blobless sparse clone, expand the
sparse-checkout slightly, and then run 'git backfill --sparse' to see
how much data is downloaded. The general steps are

 1. git clone --filter=blob:none --sparse <url>
 2. git sparse-checkout set <dir1> ... <dirN>
 3. git backfill --sparse

For the Git repository with the 'builtin' directory in the
sparse-checkout, we get these results for various batch sizes:

| Batch Size      | Pack Count | Pack Size | Time  |
|-----------------|------------|-----------|-------|
| (Initial clone) | 3          | 110 MB    |       |
| 10K             | 12         | 192 MB    | 17.2s |
| 15K             | 9          | 192 MB    | 15.5s |
| 20K             | 8          | 192 MB    | 15.5s |
| 25K             | 7          | 192 MB    | 14.7s |

This case matters less because a full clone of the Git repository from
GitHub is currently at 277 MB.

Using a copy of the Linux repository with the 'kernel/' directory in the
sparse-checkout, we get these results:

| Batch Size      | Pack Count | Pack Size | Time |
|-----------------|------------|-----------|------|
| (Initial clone) | 2          | 1,876 MB  |      |
| 10K             | 11         | 2,187 MB  | 46s  |
| 25K             | 7          | 2,188 MB  | 43s  |
| 50K             | 5          | 2,194 MB  | 44s  |
| 100K            | 4          | 2,194 MB  | 48s  |

This case is more meaningful because a full clone of the Linux
repository is currently over 6 GB, so this is a valuable way to download
a fraction of the repository and no longer need network access for all
reachable objects within the sparse-checkout.

Choosing a batch size will depend on a lot of factors, including the
user's network speed or reliability, the repository's file structure,
and how many versions there are of the file within the sparse-checkout
scope. There will not be a one-size-fits-all solution.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Diffstat (limited to 'path-walk.c')
-rw-r--r-- path-walk.c 28
1 files changed, 23 insertions, 5 deletions
diff --git a/path-walk.c b/path-walk.c
index 9715a5550e..341bdd2ba4 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -12,6 +12,7 @@
 #include "object.h"
 #include "oid-array.h"
 #include "prio-queue.h"
+#include "repository.h"
 #include "revision.h"
 #include "string-list.h"
 #include "strmap.h"
@@ -172,6 +173,23 @@ static int add_tree_entries(struct path_walk_context *ctx,
 		if (type == OBJ_TREE)
 			strbuf_addch(&path, '/');
 
+		if (ctx->info->pl) {
+			int dtype;
+			enum pattern_match_result match;
+			match = path_matches_pattern_list(path.buf, path.len,
+							  path.buf + base_len, &dtype,
+							  ctx->info->pl,
+							  ctx->repo->index);
+
+			if (ctx->info->pl->use_cone_patterns &&
+			    match == NOT_MATCHED)
+				continue;
+			else if (!ctx->info->pl->use_cone_patterns &&
+				 type == OBJ_BLOB &&
+				 match != MATCHED)
+				continue;
+		}
+
 		if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
 			CALLOC_ARRAY(list, 1);
 			list->type = type;
@@ -582,10 +600,10 @@ void path_walk_info_init(struct path_walk_info *info)
 	memcpy(info, &empty, sizeof(empty));
 }
 
-void path_walk_info_clear(struct path_walk_info *info UNUSED)
+void path_walk_info_clear(struct path_walk_info *info)
 {
-	/*
-	 * This destructor is empty for now, as info->revs
-	 * is not owned by 'struct path_walk_info'.
-	 */
+	if (info->pl) {
+		clear_pattern_list(info->pl);
+		free(info->pl);
+	}
 }