random-state.net / Getting Git, part 2 (January 27th 2008)

Getting Git, part 2
hacking, January 27th 2008

Read part 1 first.

At the core of Git is the object database. It is not an implementation detail, but a fundamental part of the whole. Don't ignore it, and don't be scared of it. You don't have to use it directly, but knowing the basics makes life a lot easier.

So:

The object database lives somewhere under .git/ in each and every repository clone; never mind where exactly.

The object database is garbage collected: the only way to delete an object is to remove all roots that point to it, and let the GC reclaim it. Roots also live under .git/, but are distinct from the database.

Since content is immutable, there is no way to mutate anything in the database — you can only add new objects.
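To make the garbage collection rule concrete, here is a toy sketch in Python. It is not how Git actually stores or collects anything: the `objects` dictionary, the object names, and both helper functions are made up for illustration. It only shows the idea that an object survives exactly as long as some root can reach it.

```python
# Toy model, not Git's real storage: each object id maps to the ids it points to.
objects = {
    "a": ["b"],      # a points to b
    "b": ["c"],      # b points to c
    "c": [],         # c points to nothing
    "orphan": [],    # nothing points here, and no root names it
}

def reachable(roots, objects):
    """Return every object id that can be reached from the given roots."""
    seen = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(objects[oid])
    return seen

def collect_garbage(roots, objects):
    """Return a new store that keeps only the objects reachable from the roots."""
    live = reachable(roots, objects)
    return {oid: kids for oid, kids in objects.items() if oid in live}

roots = {"a"}  # hypothetical root; roots are kept outside the database itself
after_gc = collect_garbage(roots, objects)
print(sorted(after_gc))  # ['a', 'b', 'c']
```

Note that nothing here deletes "orphan" directly. It disappears only because no root leads to it, which is the point of the rule above.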

Now, there are four kinds of objects in the database. Today we will cover just two of them — the lower level, if you will. This is content at its most content-seeming:

Blobs are binary content. They don't contain any pointers. Blobs are used to store file contents. Not file names, etc — just contents. If you have files x/foo.txt and y/bar.txt, which both contain just the string "foobar", then in the object database there will be a blob that stores the string "foobar" — representing the content of both files. Remember: content is identity.
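If you want to see "content is identity" in action for blobs, the following Python sketch computes a blob identifier the way Git does to the best of my knowledge: a SHA-1 over a small header plus the raw bytes. The function name `blob_id` is my own, and the file paths are the ones from the example above. Treat it as an illustration rather than a reference implementation.

```python
import hashlib

def blob_id(content: bytes) -> str:
    """SHA-1 over a 'blob <size>' header, a NUL byte, and the raw bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Two different paths, same bytes: only the content goes into the hash.
files = {"x/foo.txt": b"foobar", "y/bar.txt": b"foobar"}
ids = {path: blob_id(data) for path, data in files.items()}

print(ids["x/foo.txt"] == ids["y/bar.txt"])  # True: one blob serves both files
print(blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty blob)
```

The file name never enters the computation, which is why both files end up pointing at the same blob.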

Trees are lists of entries. The entries represent other objects in the database: for each entry the tree stores the object type, the pointer/SHA1 for the object, the object name, and the mode (the executable bit, really). A single tree object represents a single directory with its files and subdirectories; in normal circumstances a tree will only contain blob and tree entries. Again, content is identity: if you have two directories containing identically named files with identical contents and executable bits, both will be represented by the same tree object.
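The same idea can be sketched for trees. The Python below hashes a directory listing made of (mode, name, object id) entries, roughly the way I understand Git to serialize a tree. The function `tree_id` and the file names are hypothetical, and the sorting is simplified: plain byte-wise order by name, which ignores the special ordering Git applies to subdirectory names.

```python
import hashlib

def tree_id(entries):
    """Hash a directory listing given as (mode, name, hex object id) tuples."""
    body = b""
    # Sorting by name makes the order of the input listing irrelevant.
    # Assumption: byte-wise sort, without Git's special case for subdirectories.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_id)
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"  # id of a zero-byte blob

dir_a = [("100644", "notes.txt", EMPTY_BLOB), ("100755", "run.sh", EMPTY_BLOB)]
dir_b = [("100755", "run.sh", EMPTY_BLOB), ("100644", "notes.txt", EMPTY_BLOB)]
dir_c = [("100644", "notes.txt", EMPTY_BLOB), ("100644", "run.sh", EMPTY_BLOB)]

print(tree_id(dir_a) == tree_id(dir_b))  # True: same names, contents, and modes
print(tree_id(dir_a) == tree_id(dir_c))  # False: the executable bit differs
print(tree_id([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (the empty tree)
```

The directory's own name is nowhere in the hashed data, so two directories with identical listings collapse into one tree object no matter what they are called or where they live.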

Consider the directories and the file in this path: x/y/z.txt

We have:

Blob-object B_z for the contents of z.txt.
Tree-object T_y for y/, containing the name z.txt, and a pointer to B_z.
Tree-object T_x for x/, containing the name y, and a pointer to T_y.

Now, if we change the contents of z.txt, and commit the new content to the object database — what happens to the object graph as a whole?

Remember: Content is identity, and not just for blobs, but all objects. If you change the contents of a file, you need a new tree object containing a pointer to the new blob, etc. This is a really important bit, so make sure you understand this: any pointer into the object database is a unique identifier for the whole object graph reachable from that point.

So, you will have:

Blob-object B_z2 for the new contents of z.txt.
Tree-object T_y2 for y/, containing the name z.txt, and a pointer to B_z2.
Tree-object T_x2 for x/, containing the name y, and a pointer to T_y2.

The old versions are still there: content is immutable — as long as GC hasn't reclaimed them, we can get at them.
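Here is the whole x/y/z.txt story as a small Python sketch. The `store` dictionary is a stand-in for the object database, and `put`, `put_tree`, and `snapshot` are names I made up; the hashing follows Git's object format as far as I know it, but this is a model of the idea, not of Git's on-disk layout.

```python
import hashlib

store = {}  # toy object database: id -> (type, raw body); entries are only ever added

def put(kind: str, body: bytes) -> str:
    """Add an object and return its id, which is derived purely from its content."""
    header = kind.encode() + b" " + str(len(body)).encode("ascii") + b"\0"
    oid = hashlib.sha1(header + body).hexdigest()
    store[oid] = (kind, body)
    return oid

def put_tree(entries) -> str:
    """Store a tree from (mode, name, id) tuples; the child ids are part of its body."""
    body = b"".join(
        mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(oid)
        for mode, name, oid in sorted(entries, key=lambda e: e[1])
    )
    return put("tree", body)

def snapshot(z_contents: bytes):
    """Store x/y/z.txt with the given contents; return the ids (B_z, T_y, T_x)."""
    b_z = put("blob", z_contents)
    t_y = put_tree([("100644", "z.txt", b_z)])
    t_x = put_tree([("40000", "y", t_y)])
    return b_z, t_y, t_x

old = snapshot(b"first version\n")
new = snapshot(b"second version\n")

print(all(o != n for o, n in zip(old, new)))  # True: blob and both trees changed
print(all(oid in store for oid in old))       # True: the old objects are still there
print(len(store))                             # 6
```

Changing one file rippled up through every tree above it, because each tree's body contains the id of its child. Storing the first version again would add nothing new: the same content hashes to the same three ids.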

Now, as long as you remember that a tree object represents the whole state of the whole directory structure under it, including file contents, you can forget about blobs. Just think of trees, and you will be fine.

Review: ~~Why did the hacker cross the road?~~ Where does Git store content? How are files and directories stored? Can you mutate stored content; if so, how; if not, why not? Can you delete stored content; if so, how; if not, why not? What does a tree object represent?

Next time: how history is content, and how it is stored.