The Internals of Git
I have been using Git for the past couple of years and remember how long it took me to get my head around the workflow. Over the past couple of months, thanks to a couple of well-timed findings, I have gained an interest in how Git works internally. In this post I hope to explain how Git composes well-designed, low-level commands to create the high-level actions we use on a day-to-day basis.
At its core a Git repository is a key-value object store, where each SHA-1 key is derived from the type, size and contents of the object it identifies. This file-based store is used to represent a Directed Acyclic Graph which, through the use of commit, tree and blob objects, represents the project's history.
To better understand how history is represented within Git, let's first commit a plain-text file entitled ‘foo.txt’ with the contents ‘Hello, world’. The only catch is that, instead of using high-level commands such as git-add and git-commit, we will manually index and commit the file.
The first step is to store a blob object containing the contents of the ‘foo.txt’ file in the data-store. Its hash is generated from the contents and size of the file (or standard input) supplied. In this example we will use standard input, and not only query Git for the generated hash, but also persist the blob into the object graph.
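To make the hashing step concrete, here is a minimal Python sketch of what I understand `git hash-object -w --stdin` to do for a blob. It assumes the default SHA-1 object format and the loose-object layout (a zlib-compressed file under a two-character directory). The function names and the temporary directory standing in for ‘.git/objects’ are my own, not part of Git.

```python
import hashlib
import os
import tempfile
import zlib


def encode_object(obj_type: str, payload: bytes) -> bytes:
    """Prefix the payload with a '<type> <size>' header terminated by a NUL byte."""
    return f"{obj_type} {len(payload)}".encode() + b"\x00" + payload


def hash_object(payload: bytes, obj_type: str = "blob") -> str:
    """Return the SHA-1 key for this content (header plus payload)."""
    return hashlib.sha1(encode_object(obj_type, payload)).hexdigest()


def write_object(objects_dir: str, payload: bytes, obj_type: str = "blob") -> str:
    """Persist the object as a zlib-compressed loose file and return its key."""
    key = hash_object(payload, obj_type)
    path = os.path.join(objects_dir, key[:2], key[2:])
    if not os.path.exists(path):  # identical content is only ever written once
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as fh:
            fh.write(zlib.compress(encode_object(obj_type, payload)))
    return key


if __name__ == "__main__":
    # Hypothetical stand-in for a repository's .git/objects directory.
    with tempfile.TemporaryDirectory() as objects_dir:
        print(write_object(objects_dir, b"Hello, world\n"))
```

Because the key depends only on the object type and the bytes supplied, running this on any machine should print the same hash for the same input.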
You will notice, if you are following along, that the hash you get for this blob is identical to mine, because it depends only on the content supplied. That is one of Git’s ingenious strengths: identical files are only ever stored once, regardless of where they appear in the history graph.
The next step is to add the blob (via its hash key) to the index (staging area), specifying the file permissions and the directory/name you wish to associate it with. We can then write the index out to the file-store as a tree object, which returns its computed hash.
The pending index is stored in a file called ‘index’ until it is persisted into the store. Tree objects store a combination of blob objects with file attributes such as the ‘foo.txt’ filename, along with other tree objects for subdirectories. We are now ready to associate this generated tree with a commit object.
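As a rough illustration of what ends up inside a tree object, the sketch below serializes a list of entries and hashes the result. It assumes the SHA-1 object format, where each entry is the mode, a space, the name, a NUL byte and the 20 raw bytes of the referenced hash. The helper names are hypothetical, and sorting by plain name is a simplification of Git's ordering rules for directories.

```python
import hashlib
from typing import Iterable, Tuple


def object_id(obj_type: str, payload: bytes) -> str:
    """SHA-1 over a '<type> <size>' header, a NUL byte, then the payload."""
    header = f"{obj_type} {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def tree_payload(entries: Iterable[Tuple[str, str, str]]) -> bytes:
    """Serialize (mode, name, hex_id) entries into a tree object's body."""
    body = b""
    # Simplification: entries are ordered by name only.
    for mode, name, hex_id in sorted(entries, key=lambda entry: entry[1].encode()):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
    return body


if __name__ == "__main__":
    blob_id = object_id("blob", b"Hello, world\n")
    # 100644 marks a regular, non-executable file.
    tree_id = object_id("tree", tree_payload([("100644", "foo.txt", blob_id)]))
    print(tree_id)
```

Note that the filename and permissions live in the tree entry rather than in the blob, which is why the same blob can appear under different names.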
This commit hash will differ from the one you generate, as factors such as the name/email and current timestamp play a role in its generation. With this commit now persisted in the store we can update the repository’s ‘HEAD’ reference to point to the generated commit hash.
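To show why the commit hash varies between people and runs, here is a small sketch of a root commit's body as I understand the format: a ‘tree’ line, ‘author’ and ‘committer’ lines carrying an identity and timestamp, a blank line, then the message. The identity, timestamps and message are made-up values, the timezone is fixed to UTC for brevity, and the function names are my own.

```python
import hashlib

# The id of a tree with no entries, used here only as a placeholder tree.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def commit_payload(tree_id: str, identity: str, timestamp: int, message: str) -> bytes:
    """Build the text body of a parentless commit object."""
    stamp = f"{identity} {timestamp} +0000"  # assumption: UTC offset for simplicity
    text = f"tree {tree_id}\nauthor {stamp}\ncommitter {stamp}\n\n{message}\n"
    return text.encode()


def commit_id(payload: bytes) -> str:
    """Hash the body with a 'commit <size>' header, as for any other object."""
    header = f"commit {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


if __name__ == "__main__":
    me = "Jane Doe <jane@example.com>"  # hypothetical identity
    first = commit_id(commit_payload(EMPTY_TREE, me, 1400000000, "Initial commit"))
    later = commit_id(commit_payload(EMPTY_TREE, me, 1400000001, "Initial commit"))
    print(first != later)  # one second apart is enough to change the hash
```

The same tree committed by a different author, or a second later, produces a different key, whereas the blob and tree hashes stay the same for everyone.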
As you can see, it only takes three persisted object files to represent a one-file commit history. Finally, let's generate a new file ‘baz.txt’ which is stored in a directory called ‘bar’.
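Adding a directory means one tree now points at another. The sketch below builds the trees recursively from a nested dictionary and keeps every object in an in-memory dictionary standing in for the object store. The contents of ‘baz.txt’ are made up for illustration, the helper names are hypothetical, and the plain name ordering is again a simplification.

```python
import hashlib
from typing import Dict


def put(store: Dict[str, bytes], obj_type: str, payload: bytes) -> str:
    """Store an object under its SHA-1 key and return that key."""
    data = f"{obj_type} {len(payload)}".encode() + b"\x00" + payload
    key = hashlib.sha1(data).hexdigest()
    store[key] = data  # writing the same content again is a no-op
    return key


def write_tree(node: dict, store: Dict[str, bytes]) -> str:
    """Persist a nested {name: bytes | dict} structure as blobs and trees."""
    body = b""
    for name in sorted(node):
        child = node[name]
        if isinstance(child, dict):
            mode, child_id = "40000", write_tree(child, store)  # subdirectory
        else:
            mode, child_id = "100644", put(store, "blob", child)  # regular file
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(child_id)
    return put(store, "tree", body)


if __name__ == "__main__":
    store: Dict[str, bytes] = {}
    project = {"foo.txt": b"Hello, world\n", "bar": {"baz.txt": b"Hello again\n"}}
    root_id = write_tree(project, store)
    print(root_id, len(store))  # two blobs and two trees are stored
```

The root tree holds an entry for ‘foo.txt’ and an entry for the ‘bar’ tree, which in turn holds the entry for ‘baz.txt’.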
In a similar manner to the first example, we persist both the file-contents blob and tree objects. The difference arises when we wish to generate the second commit: we must now supply the parent commit’s hash (in our case the first commit), which will be used to reconstruct the graph’s history when desired.
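The parent link is what turns individual commits into a history. The following sketch stores two commits in an in-memory dictionary and walks back from the newest one by reading each ‘parent’ line. The identity, timestamps and messages are invented, the function names are hypothetical, and only the first parent is followed, so merges are out of scope here.

```python
import hashlib
from typing import Dict, Iterator, Optional

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # placeholder tree id


def put_commit(store: Dict[str, bytes], tree_id: str, stamp: str,
               message: str, parent: Optional[str] = None) -> str:
    """Build a commit body, store it under its SHA-1 key and return the key."""
    lines = [f"tree {tree_id}"]
    if parent is not None:
        lines.append(f"parent {parent}")  # the link back to the previous commit
    lines += [f"author {stamp}", f"committer {stamp}", "", message]
    payload = ("\n".join(lines) + "\n").encode()
    key = hashlib.sha1(f"commit {len(payload)}".encode() + b"\x00" + payload).hexdigest()
    store[key] = payload
    return key


def history(store: Dict[str, bytes], commit_id: Optional[str]) -> Iterator[str]:
    """Yield commit ids from newest to oldest by following first parents."""
    while commit_id is not None:
        yield commit_id
        headers = store[commit_id].decode().split("\n\n", 1)[0].splitlines()
        parents = [line.split(" ", 1)[1] for line in headers if line.startswith("parent ")]
        commit_id = parents[0] if parents else None


if __name__ == "__main__":
    store: Dict[str, bytes] = {}
    who = "Jane Doe <jane@example.com>"  # hypothetical identity
    first = put_commit(store, EMPTY_TREE, f"{who} 1400000000 +0000", "Add foo.txt")
    second = put_commit(store, EMPTY_TREE, f"{who} 1400000060 +0000", "Add bar/baz.txt", first)
    print(list(history(store, second)) == [second, first])
```

Since the parent hash is part of the commit body, changing any earlier commit would change the hash of every commit that follows it.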
I hope this quick introduction to the internals of Git has helped you realise how simple, yet powerful, the system’s underlying model is.