aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/technical/commit-graph.txt
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/technical/commit-graph.txt')
-rw-r--r--Documentation/technical/commit-graph.txt401
1 files changed, 0 insertions, 401 deletions
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
deleted file mode 100644
index 2c26e95e51..0000000000
--- a/Documentation/technical/commit-graph.txt
+++ /dev/null
@@ -1,401 +0,0 @@
-Git Commit-Graph Design Notes
-=============================
-
-Git walks the commit graph for many reasons, including:
-
-1. Listing and filtering commit history.
-2. Computing merge bases.
-
-These operations can become slow as the commit count grows. The merge
-base calculation shows up in many user-facing commands, such as 'merge-base'
-or 'status' and can take minutes to compute depending on history shape.
-
-There are two main costs here:
-
-1. Decompressing and parsing commits.
-2. Walking the entire graph to satisfy topological order constraints.
-
-The commit-graph file is a supplemental data structure that accelerates
-commit graph walks. If a user downgrades or disables the 'core.commitGraph'
-config setting, then the existing object database is sufficient. The file is stored
-as "commit-graph" either in the .git/objects/info directory or in the info
-directory of an alternate.
-
-The commit-graph file stores the commit graph structure along with some
-extra metadata to speed up graph walks. By listing commit OIDs in
-lexicographic order, we can identify an integer position for each commit
-and refer to the parents of a commit using those integer positions. We
-use binary search to find initial commits and then use the integer
-positions for fast lookups during the walk.
-
-A consumer may load the following info for a commit from the graph:
-
-1. The commit OID.
-2. The list of parents, along with their integer position.
-3. The commit date.
-4. The root tree OID.
-5. The generation number (see definition below).
-
-Values 1-4 satisfy the requirements of parse_commit_gently().
-
-There are two definitions of generation number:
-1. Corrected committer dates (generation number v2)
-2. Topological levels (generation number v1)
-
-Define "corrected committer date" of a commit recursively as follows:
-
- * A commit with no parents (a root commit) has corrected committer date
- equal to its committer date.
-
- * A commit with at least one parent has corrected committer date equal to
- the maximum of its committer date and one more than the largest corrected
- committer date among its parents.
-
- * As a special case, a root commit with timestamp zero has corrected commit
- date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
- (that is, an uncomputed corrected commit date).
-
-Define the "topological level" of a commit recursively as follows:
-
- * A commit with no parents (a root commit) has topological level of one.
-
- * A commit with at least one parent has topological level one more than
- the largest topological level among its parents.
-
-Equivalently, the topological level of a commit A is one more than the
-length of a longest path from A to a root commit. The recursive definition
-is easier to use for computation and observing the following property:
-
- If A and B are commits with generation numbers N and M, respectively,
- and N <= M, then A cannot reach B. That is, we know without searching
- that B is not an ancestor of A because it is further from a root commit
- than A.
-
- Conversely, when checking if A is an ancestor of B, then we only need
- to walk commits until all commits on the walk boundary have generation
- number at most N. If we walk commits using a priority queue seeded by
- generation numbers, then we always expand the boundary commit with highest
- generation number and can easily detect the stopping condition.
-
-The property applies to both versions of generation number, that is both
-corrected committer dates and topological levels.
-
-This property can be used to significantly reduce the time it takes to
-walk commits and determine topological relationships. Without generation
-numbers, the general heuristic is the following:
-
- If A and B are commits with commit time X and Y, respectively, and
- X < Y, then A _probably_ cannot reach B.
-
-In absence of corrected commit dates (for example, old versions of Git or
-mixed generation graph chains),
-this heuristic is currently used whenever the computation is allowed to
-violate topological relationships due to clock skew (such as "git log"
-with default order), but is not used when the topological order is
-required (such as merge base calculations, "git log --graph").
-
-In practice, we expect some commits to be created recently and not stored
-in the commit-graph. We can treat these commits as having "infinite"
-generation number and walk until reaching commits with known generation
-number.
-
-We use the macro GENERATION_NUMBER_INFINITY to mark commits not
-in the commit-graph file. If a commit-graph file was written by a version
-of Git that did not compute generation numbers, then those commits will
-have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
-
-Since the commit-graph file is closed under reachability, we can guarantee
-the following weaker condition on all commits:
-
- If A and B are commits with generation numbers N and M, respectively,
- and N < M, then A cannot reach B.
-
-Note how the strict inequality differs from the inequality when we have
-fully-computed generation numbers. Using strict inequality may result in
-walking a few extra commits, but the simplicity in dealing with commits
-with generation number *_INFINITY or *_ZERO is valuable.
-
-We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
-topological levels (generation number v1) are computed to be at least
-this value. We limit at this value since it is the largest value that
-can be stored in the commit-graph file using the 30 bits available
-to topological levels. This presents another case where a commit can
-have generation number equal to that of a parent.
-
-Design Details
---------------
-
-- The commit-graph file is stored in a file named 'commit-graph' in the
- .git/objects/info directory. This could be stored in the info directory
- of an alternate.
-
-- The core.commitGraph config setting must be on to consume graph files.
-
-- The file format includes parameters for the object ID hash function,
- so a future change of hash algorithm does not require a change in format.
-
-- Commit grafts and replace objects can change the shape of the commit
- history. The latter can also be enabled/disabled on the fly using
- `--no-replace-objects`. This leads to difficulty storing both possible
- interpretations of a commit id, especially when computing generation
- numbers. The commit-graph will not be read or written when
- replace-objects or grafts are present.
-
-- Shallow clones create grafts of commits by dropping their parents. This
- leads the commit-graph to think those commits have generation number 1.
- If and when those commits are made unshallow, those generation numbers
- become invalid. Since shallow clones are intended to restrict the commit
- history to a very small set of commits, the commit-graph feature is less
- helpful for these clones, anyway. The commit-graph will not be read or
- written when shallow commits are present.
-
-Commit-Graphs Chains
---------------------
-
-Typically, repos grow with near-constant velocity (commits per day). Over time,
-the number of commits added by a fetch operation is much smaller than the
-number of commits in the full history. By creating a "chain" of commit-graphs,
-we enable fast writes of new commit data without rewriting the entire commit
-history -- at least, most of the time.
-
-## File Layout
-
-A commit-graph chain uses multiple files, and we use a fixed naming convention
-to organize these files. Each commit-graph file has a name
-`$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex-
-valued hash stored in the footer of that file (which is a hash of the file's
-contents before that hash). For a chain of commit-graph files, a plain-text
-file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the
-hashes for the files in order from "lowest" to "highest".
-
-For example, if the `commit-graph-chain` file contains the lines
-
-```
- {hash0}
- {hash1}
- {hash2}
-```
-
-then the commit-graph chain looks like the following diagram:
-
- +-----------------------+
- | graph-{hash2}.graph |
- +-----------------------+
- |
- +-----------------------+
- | |
- | graph-{hash1}.graph |
- | |
- +-----------------------+
- |
- +-----------------------+
- | |
- | |
- | |
- | graph-{hash0}.graph |
- | |
- | |
- | |
- +-----------------------+
-
-Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of
-commits in `graph-{hash1}.graph`, and X2 be the number of commits in
-`graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`,
-then we interpret this as being the commit in position (X0 + X1 + i), and that
-will be used as its "graph position". The commits in `graph-{hash2}.graph` use these
-positions to refer to their parents, which may be in `graph-{hash1}.graph` or
-`graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking
-its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 +
-X2).
-
-Each commit-graph file (except the base, `graph-{hash0}.graph`) contains data
-specifying the hashes of all files in the lower layers. In the above example,
-`graph-{hash1}.graph` contains `{hash0}` while `graph-{hash2}.graph` contains
-`{hash0}` and `{hash1}`.
-
-## Merging commit-graph files
-
-If we only added a new commit-graph file on every write, we would run into a
-linear search problem through many commit-graph files. Instead, we use a merge
-strategy to decide when the stack should collapse some number of levels.
-
-The diagram below shows such a collapse. As a set of new commits are added, it
-is determined by the merge strategy that the files should collapse to
-`graph-{hash1}`. Thus, the new commits, the commits in `graph-{hash2}` and
-the commits in `graph-{hash1}` should be combined into a new `graph-{hash3}`
-file.
-
- +---------------------+
- | |
- | (new commits) |
- | |
- +---------------------+
- | |
- +-----------------------+ +---------------------+
- | graph-{hash2} |->| |
- +-----------------------+ +---------------------+
- | | |
- +-----------------------+ +---------------------+
- | | | |
- | graph-{hash1} |->| |
- | | | |
- +-----------------------+ +---------------------+
- | tmp_graphXXX
- +-----------------------+
- | |
- | |
- | |
- | graph-{hash0} |
- | |
- | |
- | |
- +-----------------------+
-
-During this process, the commits to write are combined, sorted and we write the
-contents to a temporary file, all while holding a `commit-graph-chain.lock`
-lock-file. When the file is flushed, we rename it to `graph-{hash3}`
-according to the computed `{hash3}`. Finally, we write the new chain data to
-`commit-graph-chain.lock`:
-
-```
- {hash3}
- {hash0}
-```
-
-We then close the lock-file.
-
-## Merge Strategy
-
-When writing a set of commits that do not exist in the commit-graph stack of
-height N, we default to creating a new file at level N + 1. We then decide to
-merge with the Nth level if one of two conditions hold:
-
- 1. `--size-multiple=<X>` is specified or X = 2, and the number of commits in
- level N is less than X times the number of commits in level N + 1.
-
- 2. `--max-commits=<C>` is specified with non-zero C and the number of commits
- in level N + 1 is more than C commits.
-
-This decision cascades down the levels: when we merge a level we create a new
-set of commits that then compares to the next level.
-
-The first condition bounds the number of levels to be logarithmic in the total
-number of commits. The second condition bounds the total number of commits in
-a `graph-{hashN}` file and not in the `commit-graph` file, preventing
-significant performance issues when the stack merges and another process only
-partially reads the previous stack.
-
-The merge strategy values (2 for the size multiple, 64,000 for the maximum
-number of commits) could be extracted into config settings for full
-flexibility.
-
-## Handling Mixed Generation Number Chains
-
-With the introduction of generation number v2 and generation data chunk, the
-following scenario is possible:
-
-1. "New" Git writes a commit-graph with the corrected commit dates.
-2. "Old" Git writes a split commit-graph on top without corrected commit dates.
-
-A naive approach of using the newest available generation number from
-each layer would lead to violated expectations: the lower layer would
-use corrected commit dates which are much larger than the topological
-levels of the higher layer. For this reason, Git inspects the topmost
-layer to see if the layer is missing corrected commit dates. In such a case
-Git only uses topological level for generation numbers.
-
-When writing a new layer in split commit-graph, we write corrected commit
-dates if the topmost layer has corrected commit dates written. This
-guarantees that if a layer has corrected commit dates, all lower layers
-must have corrected commit dates as well.
-
-When merging layers, we do not consider whether the merged layers had corrected
-commit dates. Instead, the new layer will have corrected commit dates if the
-layer below the new layer has corrected commit dates.
-
-While writing or merging layers, if the new layer is the only layer, it will
-have corrected commit dates when written by compatible versions of Git. Thus,
-rewriting split commit-graph as a single file (`--split=replace`) creates a
-single layer with corrected commit dates.
-
-## Deleting graph-{hash} files
-
-After a new tip file is written, some `graph-{hash}` files may no longer
-be part of a chain. It is important to remove these files from disk, eventually.
-The main reason to delay removal is that another process could read the
-`commit-graph-chain` file before it is rewritten, but then look for the
-`graph-{hash}` files after they are deleted.
-
-To allow holding old split commit-graphs for a while after they are unreferenced,
-we update the modified times of the files when they become unreferenced. Then,
-we scan the `$OBJDIR/info/commit-graphs/` directory for `graph-{hash}`
-files whose modified times are older than a given expiry window. This window
-defaults to zero, but can be changed using command-line arguments or a config
-setting.
-
-## Chains across multiple object directories
-
-In a repo with alternates, we look for the `commit-graph-chain` file starting
-in the local object directory and then in each alternate. The first file that
-exists defines our chain. As we look for the `graph-{hash}` files for
-each `{hash}` in the chain file, we follow the same pattern for the host
-directories.
-
-This allows commit-graphs to be split across multiple forks in a fork network.
-The typical case is a large "base" repo with many smaller forks.
-
-As the base repo advances, it will likely update and merge its commit-graph
-chain more frequently than the forks. If a fork updates their commit-graph after
-the base repo, then it should "reparent" the commit-graph chain onto the new
-chain in the base repo. When reading each `graph-{hash}` file, we track
-the object directory containing it. During a write of a new commit-graph file,
-we check for any changes in the source object directory and read the
-`commit-graph-chain` file for that source and create a new file based on those
-files. During this "reparent" operation, we necessarily need to collapse all
-levels in the fork, as all of the files are invalid against the new base file.
-
-It is crucial to be careful when cleaning up "unreferenced" `graph-{hash}.graph`
-files in this scenario. It falls to the user to define the proper settings for
-their custom environment:
-
- 1. When merging levels in the base repo, the unreferenced files may still be
- referenced by chains from fork repos.
-
- 2. The expiry time should be set to a length of time such that every fork has
- time to recompute their commit-graph chain to "reparent" onto the new base
- file(s).
-
- 3. If the commit-graph chain is updated in the base, the fork will not have
- access to the new chain until its chain is updated to reference those files.
- (This may change in the future [5].)
-
-Related Links
--------------
-[0] https://bugs.chromium.org/p/git/issues/detail?id=8
- Chromium work item for: Serialized Commit Graph
-
-[1] https://lore.kernel.org/git/20110713070517.GC18566@sigill.intra.peff.net/
- An abandoned patch that introduced generation numbers.
-
-[2] https://lore.kernel.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
- Discussion about generation numbers on commits and how they interact
- with fsck.
-
-[3] https://lore.kernel.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/
- More discussion about generation numbers and not storing them inside
- commit objects. A valuable quote:
-
- "I think we should be moving more in the direction of keeping
- repo-local caches for optimizations. Reachability bitmaps have been
- a big performance win. I think we should be doing the same with our
- properties of commits. Not just generation numbers, but making it
- cheap to access the graph structure without zlib-inflating whole
- commit objects (i.e., packv4 or something like the "metapacks" I
- proposed a few years ago)."
-
-[4] https://lore.kernel.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
- A patch to remove the ahead-behind calculation from 'status'.
-
-[5] https://lore.kernel.org/git/f27db281-abad-5043-6d71-cbb083b1c877@gmail.com/
- A discussion of a "two-dimensional graph position" that can allow reading
- multiple commit-graph chains at the same time.