Let’s look at how Git stores information in its “object database”. You can read the full details in various places such as the Git Community Book or the Git Book, but here is a quick summary.
A Git repository is a key-value data store containing immutable objects. Every object has a hexadecimal ID and some content. Some objects can contain the IDs of other objects, forming edges in a directed acyclic graph. There are four types of objects: blobs (file contents), trees (directories, which point to blobs and other trees by ID), commits (which point to a root tree and parent commits), and tags.
Objects are identified by a hash of their content, forming a blockchain… ahem, a Merkle tree. This is important because it means that objects aren’t identified by where they are, only by what’s inside. As long as the content is the same, one object can be reused in multiple places. E.g. if two different directories contain the same file, the file content will be stored only once, and the two tree objects will point to the same blob object with different filenames.
This automatic deduplication is usually a good thing. It’s the whole reason why Git doesn’t make a full copy of the whole project when you change a single file. But it also means that a repository can be much bigger than it seems to be.
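Content addressing is easy to reproduce by hand. In a classic (SHA-1) repository, a blob’s ID is the hash of a short header followed by the raw content; identical content always yields the identical ID, which is exactly why duplicates are free. A minimal Python sketch:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """The ID of a blob in a classic (SHA-1) repository: the hash of
    the header "blob <size>\\0" followed by the raw content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# The ID depends only on the content, so the same file in two different
# directories is stored as a single blob object.
assert git_blob_id(b"same bytes") == git_blob_id(b"same bytes")
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The printed value is the well-known ID of the empty blob, the same one `git hash-object` would report.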
The easy branch contains about 100,000 files whose total size is over 450 GiB, not including filesystem overhead. If you try to switch to it with git checkout, you will probably run out of disk space or patience.
But as long as you’re careful and use commands that don’t read the file content, you should be fine. There are many ways to get the file list. In particular, git ls-tree can print it in exactly the right format:
git ls-tree -r --name-only origin/easy
git rev-list --objects origin/easy
git show --name-status origin/easy
If you try to switch to the hard branch, git checkout won’t even show you “0%” because it can’t compute how many files there are. The easy branch had lots of duplicate files, but the graph was still mostly treelike. The hard branch is a true directed acyclic graph where almost every node has an indegree and outdegree larger than 1.
When the total number of files is exponential, we clearly need another approach. If it’s not possible to enumerate every single file, we must stop relying on Git’s own commands and look at the whole graph ourselves. From a bird’s-eye view, we must: list all tree objects in the repository, parse each of them into its list of entries, and then compute the answer directly on the graph, processing every tree only once.
We could do everything in a single program, but it can help to split it into a chain of programs and use the strengths of different programming languages.
First we list all objects:
git cat-file --batch-check --batch-all-objects (in Git 2.6.0 or newer)
git verify-pack -v .git/objects/pack/*.idx
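Either command prints one object per line, so extracting the tree IDs is a short filter. A Python sketch over made-up sample output (each line of `git cat-file --batch-check` is “<id> <type> <size>”):

```python
# Made-up sample of `git cat-file --batch-check --batch-all-objects` output.
sample = """\
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 blob 0
4b825dc642cb6eb9a060e54bf8d69288fbee4904 tree 0
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa commit 240
"""

tree_ids = []
for line in sample.splitlines():
    oid, otype, size = line.split()
    if otype == "tree":
        tree_ids.append(oid)

print(tree_ids)  # ['4b825dc642cb6eb9a060e54bf8d69288fbee4904']
```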
Then we filter out everything that isn’t a tree.
Git doesn’t have a built-in way to read a list of tree objects. Here are three possible strategies.
1. Run “git ls-tree $id” or “git cat-file -p $id” for each one.
2. Pipe all the IDs into git cat-file --batch, and parse its output. The binary format can be recognized from a hexdump, but one tricky part is that object IDs of tree entries are stored as 20-byte binary strings instead of 40-byte hexadecimal strings.
3. Use a library such as libgit2.
The first strategy is slower than the other two because of the overhead of spawning so many processes. But developer time (installing a library or understanding the binary format of tree objects) is usually more expensive than computer time, and you can work on other problems while it’s running.
We tested the first strategy on different machines and operating systems. For some unexplained reason, the timing is very inconsistent: one machine can finish in 5–10 minutes while another needs half an hour. In that case, the other two strategies might be more worthwhile.
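If you do choose the second strategy, the parsing itself is short. A Python sketch of the tree-object body format (the sample data is hand-built for illustration, not a real object):

```python
import binascii

def parse_tree(data: bytes):
    """Parse the body of a Git tree object: a sequence of
    "<mode> <name>\\0" followed by a 20-byte binary object ID."""
    entries = []
    i = 0
    while i < len(data):
        space = data.index(b" ", i)
        nul = data.index(b"\x00", space)
        mode = data[i:space].decode()
        name = data[space + 1:nul].decode()
        # The tricky part: the ID is 20 raw bytes, not 40 hex characters.
        oid = binascii.hexlify(data[nul + 1:nul + 21]).decode()
        entries.append((mode, name, oid))
        i = nul + 21
    return entries

# A hand-built two-entry tree: a blob "a" and a subdirectory "d".
# (Note that Git stores the directory mode as "40000", without a leading zero.)
raw = (b"100644 a\x00" + bytes(range(20)) +
       b"40000 d\x00" + bytes(20))
print(parse_tree(raw))
```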
Now we finally have a normal algorithmic problem. We have a list of nodes, and for every node, an ordered list of its outgoing edges and their names.
How do we compute the output when the total length of the file list is exponential? We must take advantage of the final checksum and compute it directly in one go. But at least we don’t have to deal with sorting – Git already stores the entries in the correct sorted order, even taking care of the implied “/” after directory names.
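That sort order is easy to reproduce: directories compare as if their name ended with “/”. This is visible in the example below, where “v.txt” sorts before the directory “v”, which sorts before “v0” (in ASCII, “.” is 46, “/” is 47, “0” is 48). A small Python sketch:

```python
def git_sort_key(name: str, is_dir: bool) -> str:
    # Directories compare as if their name ended with "/".
    return name + "/" if is_dir else name

entries = [("v0", False), ("v", True), ("v.txt", False)]
entries.sort(key=lambda e: git_sort_key(*e))
print([name for name, _ in entries])  # ['v.txt', 'v', 'v0']
```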
In the original problem, a tree’s contribution to the final output depends on its path from the root, and there can be multiple paths leading to the same tree. We can represent every tree as a function from a “path prefix” string to a “file list” string. In the example from the problem statement, they would be defined like this:
$$
\begin{align*}
f_1(p) & = p + \texttt{"r\textbackslash{}n"} + p + \texttt{"r.h\textbackslash{}n"} \\
f_2(p) & = p + \texttt{"-\textbackslash{}n"} \\
f_3(p) & = p + \texttt{"v.txt\textbackslash{}n"} + f_2(p + \texttt{"v/"}) + p + \texttt{"v0\textbackslash{}n"} \\
f_4(p) & = f_1(p + \texttt{"12/"}) + p + \texttt{"3\textbackslash{}n"} + p + \texttt{"ll\textbackslash{}n"} + f_3(p + \texttt{"lll/"}) + p + \texttt{"llll\textbackslash{}n"} \\
\textrm{result} & = f_4(\texttt{""}) = f_1(\texttt{"12/"}) + \texttt{"3\textbackslash{}n"} + \texttt{"ll\textbackslash{}n"} + f_3(\texttt{"lll/"}) + \texttt{"llll\textbackslash{}n"} = \ldots
\end{align*}
$$
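A direct transcription of these functions into Python makes the problem concrete, and also shows why this representation is too slow in general: every shared tree gets re-expanded once per path leading to it.

```python
# Naive transcription of the example functions (exponential in general).
f1 = lambda p: p + "r\n" + p + "r.h\n"
f2 = lambda p: p + "-\n"
f3 = lambda p: p + "v.txt\n" + f2(p + "v/") + p + "v0\n"
f4 = lambda p: f1(p + "12/") + p + "3\n" + p + "ll\n" + f3(p + "lll/") + p + "llll\n"

print(f4(""))
```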
This works, but we need something faster. Something that would allow us to only process each tree node once. Instead of functions on strings, we want polynomials.
Let’s describe every string constant $x$ with three numbers $l_x$, $s_x$, $c_x$: its length, its sum of character values, and its checksum $c_x = \sum_{i=1}^{l_x} i \cdot x[i]$. Instead of describing each tree with a function $f_t(p)$ from string to string, we will have three functions $l_t(l_p, s_p, c_p)$, $s_t(l_p, s_p, c_p)$, $c_t(l_p, s_p, c_p)$ which can tell us the length, sum and checksum of the tree’s file list given the length, sum and checksum of the tree’s path prefix.
We can prove by induction that the functions will always be “simple” and possible to describe with a constant number of coefficients. For every tree $t$, they have the form:
$$
\begin{align*}
l_t(l_p, s_p, c_p) & = a_l l_p + a_0 \\
s_t(l_p, s_p, c_p) & = a_s s_p + a_0 \\
c_t(l_p, s_p, c_p) & = a_l l_p + a_s s_p + a_{ls} l_p s_p + a_c c_p + a_0
\end{align*}
$$
(with separate coefficients for each tree and each function; for the length and sum, $a_l$ and $a_s$ are simply the number of lines in the file list). The length and sum are easy, but the checksum is more complex.
Concatenation of two constants. Given two strings $x$ and $y$, the checksum of their concatenation $x + y$ is $c_{x+y} = c_x + c_y + l_x s_y$: every character of $y$ is shifted $l_x$ positions to the right, which adds $l_x s_y$ to its contribution.
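The identity is easy to check numerically. A small Python sketch defining the (length, sum, checksum) triple:

```python
def summarize(x: str):
    """Return (l_x, s_x, c_x): length, sum of character values,
    and checksum sum of i * x[i] (1-indexed)."""
    return (len(x),
            sum(ord(ch) for ch in x),
            sum(i * ord(ch) for i, ch in enumerate(x, start=1)))

x, y = "abc", "defg"
lx, sx, cx = summarize(x)
ly, sy, cy = summarize(y)
# Every character of y is shifted l_x positions to the right:
assert summarize(x + y)[2] == cx + cy + lx * sy
```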
Prefix extension. Given a function $f(p)$ describing a tree given its prefix $p$, we want the function $g(q) = f(q + e)$ describing a named tree entry $e$ (e.g. “foo/”) pointing to that tree, given a parent prefix $q$.
$$
\begin{align*}
p & = q + e \\
l_p & = l_q + l_e \\
s_p & = s_q + s_e \\
c_p & = c_q + c_e + l_q s_e \\
c_g(l_q, s_q, c_q) & = c_f(l_p, s_p, c_p) \\
c_g(l_q, s_q, c_q) & = a_l l_p + a_s s_p + a_{ls} l_p s_p + a_c c_p + a_0 \\
c_g(l_q, s_q, c_q) & = a_l (l_q + l_e) + a_s (s_q + s_e) + a_{ls} (l_q + l_e) (s_q + s_e) + a_c (c_q + c_e + l_q s_e) + a_0 \\
c_g(l_q, s_q, c_q) & = a_l l_q + a_l l_e + a_s s_q + a_s s_e + a_{ls} l_q s_q + a_{ls} l_q s_e + a_{ls} l_e s_q + a_{ls} l_e s_e + a_c c_q + a_c c_e + a_c l_q s_e + a_0 \\
c_g(l_q, s_q, c_q) & = (a_l + a_{ls} s_e + a_c s_e) l_q + (a_s + a_{ls} l_e) s_q + (a_{ls}) l_q s_q + (a_c) c_q + (a_l l_e + a_s s_e + a_{ls} l_e s_e + a_c c_e + a_0)
\end{align*}
$$
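The last line gives a constant-time rule for rewriting the five checksum coefficients. A Python sketch (the example function f, a tree whose file list is p + "a\n", is hypothetical):

```python
def summarize(x: str):
    return (len(x),
            sum(ord(ch) for ch in x),
            sum(i * ord(ch) for i, ch in enumerate(x, start=1)))

def extend(coeffs, entry: str):
    """Given the coefficients (a_l, a_s, a_ls, a_c, a_0) of c_f(l_p, s_p, c_p),
    return the coefficients of c_g(l_q, s_q, c_q) where p = q + entry,
    following the last line of the derivation."""
    al, as_, als, ac, a0 = coeffs
    le, se, ce = summarize(entry)
    return (al + als * se + ac * se,                           # coefficient of l_q
            as_ + als * le,                                    # coefficient of s_q
            als,                                               # coefficient of l_q * s_q
            ac,                                                # coefficient of c_q
            al * le + as_ * se + als * le * se + ac * ce + a0) # constant term

def evaluate(coeffs, l, s, c):
    al, as_, als, ac, a0 = coeffs
    return al * l + as_ * s + als * l * s + ac * c + a0

# Hypothetical f: a tree whose file list is p + "a\n".
# Its checksum is c_p + c_e + l_p * s_e for e = "a\n".
le, se, ce = summarize("a\n")
f = (se, 0, 0, 1, ce)
g = extend(f, "d/")  # g(q) = checksum of q + "d/a\n"
q = "some/prefix/"
assert evaluate(g, *summarize(q)) == summarize(q + "d/a\n")[2]
```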
Concatenation of $n$ functions. Given a tree object $t$ with entries $1, 2, \ldots, n$ represented by functions $l_1, s_1, c_1, \ldots, l_n, s_n, c_n$ (which were created with prefix extension), the checksum $c_t(l_p, s_p, c_p)$ is:
$$
\begin{align*}
c_t(l_p, s_p, c_p) & = c_1(l_p, s_p, c_p) \\
& + c_2(l_p, s_p, c_p) + l_1(l_p, s_p, c_p) \cdot s_2(l_p, s_p, c_p) \\
& + c_3(l_p, s_p, c_p) + (l_1(l_p, s_p, c_p) + l_2(l_p, s_p, c_p)) \cdot s_3(l_p, s_p, c_p) + \ldots
\end{align*}
$$
Now we have everything we need to recursively compute the coefficients of the functions $l_t$, $s_t$, $c_t$ for every tree object, ending with the root tree $r$, and get the final checksum as $c_r(0, 0, 0)$.
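Putting all the pieces together, here is a Python sketch of the whole algorithm on a made-up toy DAG (TREES stands in for the parsed tree objects; names are chosen for illustration). Each tree is represented by its line count, the constant terms of its length and sum functions, and the five checksum coefficients; memoization guarantees each tree is processed only once, and the tiny example lets us verify against a brute-force expansion.

```python
from functools import lru_cache

def summarize(x: str):
    return (len(x),
            sum(ord(ch) for ch in x),
            sum(i * ord(ch) for i, ch in enumerate(x, start=1)))

# Toy stand-in for the parsed tree objects: tree id -> ordered entries,
# where a child of None is a blob (file) and a string is a subtree.
# The subtree "sub" is shared, like in the real DAG.
TREES = {
    "root": [("a", None), ("d1", "sub"), ("d2", "sub")],
    "sub":  [("f", None), ("g", None)],
}

# Each function f_t is stored as (k, l0, s0, c_coeffs):
#   l_t(l_p) = k * l_p + l0,  s_t(s_p) = k * s_p + s0  (k = number of lines),
#   c_t(l_p, s_p, c_p) = a_l*l_p + a_s*s_p + a_ls*l_p*s_p + a_c*c_p + a_0.

def leaf(name):
    # A blob entry: f(p) = p + name + "\n".
    le, se, ce = summarize(name + "\n")
    return (1, le, se, (se, 0, 0, 1, ce))

def extend(fn, entry):
    # Prefix extension: reparametrize by the parent prefix q, with p = q + entry.
    k, l0, s0, (al, as_, als, ac, a0) = fn
    le, se, ce = summarize(entry)
    return (k, l0 + k * le, s0 + k * se,
            (al + als * se + ac * se,
             as_ + als * le,
             als, ac,
             al * le + as_ * se + als * le * se + ac * ce + a0))

def concat(f, g):
    # c_{f+g}(p) = c_f(p) + c_g(p) + l_f(p) * s_g(p), expanded into coefficients.
    kf, lf, sf, (al1, as1, als1, ac1, a01) = f
    kg, lg, sg, (al2, as2, als2, ac2, a02) = g
    return (kf + kg, lf + lg, sf + sg,
            (al1 + al2 + kf * sg,
             as1 + as2 + kg * lf,
             als1 + als2 + kf * kg,
             ac1 + ac2,
             a01 + a02 + lf * sg))

@lru_cache(maxsize=None)   # each tree is processed only once
def tree_fn(tid):
    fn = None
    for name, child in TREES[tid]:
        part = leaf(name) if child is None else extend(tree_fn(child), name + "/")
        fn = part if fn is None else concat(fn, part)
    return fn

checksum = tree_fn("root")[3][4]   # the constant term a_0 is c_root(0, 0, 0)
# Brute-force check, only feasible because this example is tiny:
assert checksum == summarize("a\nd1/f\nd1/g\nd2/f\nd2/g\n")[2]
print(checksum)  # 15074
```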