You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
One way to significantly reduce the cost of a Git clone and later fetches is
to use a blobless partial clone and combine that with a sparse-checkout that
reduces the paths that need to be populated in the working directory. Not
only does this reduce the cost of clones and fetches, the sparse-checkout
reduces the number of objects needed to download from a promisor remote.
However, history investigations can be expensive as computing blob diffs
will trigger promisor remote requests for one object at a time. This can be
avoided by downloading the blobs needed for the given sparse-checkout using
'git backfill' and its new '--sparse' mode, at a time that the user is
willing to pay that extra cost.
Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees and
we are using client-side logic to avoid downloading blobs outside of the
sparse-checkout cone. This avoids the server-side cost of walking trees
while also achieving a similar goal. It also downloads in batches based on
similar path names, presenting a resumable download if things are
interrupted.
This augments the path-walk API to have a possibly-NULL 'pl' member that may
point to a 'struct pattern_list'. This could be more general than the
sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
the only consumer.
Be sure to test this in both cone mode and not cone mode. Cone mode has the
benefit that the path-walk can skip certain paths once they would expand
beyond the sparse-checkout. Non-cone mode can describe the included files
using both positive and negative patterns, which changes the possible return
values of path_matches_pattern_list(). Test both kinds of matches for
increased coverage.
To test this, we can create a blobless sparse clone, expand the
sparse-checkout slightly, and then run 'git backfill --sparse' to see
how much data is downloaded. The general steps are
1. git clone --filter=blob:none --sparse <url>
2. git sparse-checkout set <dir1> ... <dirN>
3. git backfill --sparse
For the Git repository with the 'builtin' directory in the
sparse-checkout, we get these results for various batch sizes:
| Batch Size | Pack Count | Pack Size | Time |
|-----------------|------------|-----------|-------|
| (Initial clone) | 3 | 110 MB | |
| 10K | 12 | 192 MB | 17.2s |
| 15K | 9 | 192 MB | 15.5s |
| 20K | 8 | 192 MB | 15.5s |
| 25K | 7 | 192 MB | 14.7s |
This case matters less because a full clone of the Git repository from
GitHub is currently at 277 MB.
Using a copy of the Linux repository with the 'kernel/' directory in the
sparse-checkout, we get these results:
| Batch Size | Pack Count | Pack Size | Time |
|-----------------|------------|-----------|------|
| (Initial clone) | 2 | 1,876 MB | |
| 10K | 11 | 2,187 MB | 46s |
| 25K | 7 | 2,188 MB | 43s |
| 50K | 5 | 2,194 MB | 44s |
| 100K | 4 | 2,194 MB | 48s |
This case is more meaningful because a full clone of the Linux
repository is currently over 6 GB, so this is a valuable way to download
a fraction of the repository and no longer need network access for all
reachable objects within the sparse-checkout.
Choosing a batch size will depend on a lot of factors, including the
user's network speed or reliability, the repository's file structure,
and how many versions there are of the file within the sparse-checkout
scope. There will not be a one-size-fits-all solution.
Signed-off-by: Derrick Stolee <[email protected]>
0 commit comments