Getting into SCM with git

Understanding git

There are various tutorials that explain git; most of them assume that
the reader is already familiar with other SCM tools, and in fact many
assume that the reader is an expert on distributed project management
(kernel hacker, etc.).

This short tutorial attempts to explain the git system to people
who have little or no experience with SCM and version control tools.

Git as a tool for backups

Suppose you have a project with 10,000 files and you want to take
periodic backups.  This task is not as simple as it seems.

One solution is to create a tar.gz archive of the entire thing and
store it in the "safe location".  However this is dangerous.
Suppose that you create one such tarball `backup.tgz` on Monday
and save it.  Then on Wednesday you accidentally delete 9,000
files, and at midnight the crontab creates a new tarball
and stores it in the "safe location" over the previous `backup.tgz`.
Files lost!

Experienced system administrators are aware of the above phenomenon,
and for that reason each tarball gets a new name that contains
the date when the tarball was created.  Thus, every day of the week
we have a new tarball created as `backup-01-07-2007.tgz`, etc.
This scheme is safe, but it has the disadvantage that it takes too
much space, since each tarball includes the entire project.
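The dated-tarball scheme above can be sketched in a few lines of Python (a sketch; the source and destination directories are whatever your setup uses):

```python
import datetime
import os
import tarfile

def dated_backup(src_dir, dest_dir):
    # Name the tarball after today's date, so a new backup never
    # overwrites a previous one.
    name = "backup-%s.tgz" % datetime.date.today().isoformat()
    path = os.path.join(dest_dir, name)
    with tarfile.open(path, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))
    return path
```

Run daily from cron this produces one `backup-2007-07-01.tgz`-style file per day; the ISO date format has the nice side effect that the tarballs sort chronologically by name.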

One solution is to automatically delete very old tarballs; this
works but is still a little bit dangerous, and you no longer have
the complete history of the project.

For such a task, git is an ideal tool.  It guarantees very
efficient storage and a complete project history, forever.

The setup

Suppose you have a folder with 4 files:

	A, B, C and D

From this folder we type:

	git init
	git add .

and we are ready.  Our "safe location" is the subdirectory `.git` that
has been created.  Now, every time we need to take a backup we type:

	git commit -a -m 'Backup my files again'

and let's call this command `git backup`.

Git stores its objects in the directories below `.git/objects`.
When we type git-backup here is what happens:

	For each file (in A, B, C, D)
		1. compress the data of the file
		2. save the compressed data to a file
		   named after the SHA digest of the
		   uncompressed data under `.git/objects`, 
		   unless such a file exists already!

So the first time we issue `git-backup`, four new "blob objects"
will be added to the git database (assuming that all the files
have different content!).
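The loop above can be sketched in Python.  Git computes the SHA digest over a small header (`blob <size>` plus a NUL byte) followed by the raw file data, and stores the zlib-compressed result under a path derived from the digest (the first two hex digits name a directory, the rest name the file):

```python
import hashlib
import os
import zlib

def store_blob(data, objects_dir):
    # Git's blob hash: SHA-1 of "blob <size>\0" followed by the data.
    header = b"blob %d\x00" % len(data)
    sha = hashlib.sha1(header + data).hexdigest()
    path = os.path.join(objects_dir, sha[:2], sha[2:])
    if not os.path.exists(path):   # unless such a file exists already!
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(header + data))
    return sha
```

Storing an empty file, for instance, always yields the well-known hash `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`.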

You can find these files by listing the directories in `.git/objects`.
The command:

	git-cat-file -p hash

will print the contents of a blob with the given hash (or just the
start of the hash if unambiguous).

At this point we learn that git stores *entire files* compressed
with zlib.  These are the blob objects.  Blob objects are immutable
and once such an object has been added under `.git` it can never
be removed!

Now if later we modify, say, file `A` and we issue git-backup,
the same procedure will be invoked.  But since the content of
the other three files hasn't changed, nothing will be added to
the database in their case.  Consequently only one new blob
object will be added (unless the new content of `A` is the same
as one of the other three files!).

Now suppose even later that we change `A` back to its original
content and we run `git-backup`.  What will happen next?  No
new blobs will be added to the object database.

With this scheme, all we have to do is periodically copy the
subfolder `.git/` to the "safe location".

Commit objects

Every time we run a `git-commit` command, a new "commit object"
will be created (and zero or more blobs, as we saw).  Each
commit is a piece of plain text that mentions:

	- the HASH of the parent commit
	- the name of the author and the time of the commit
	- the HASHES of the blobs of the current tree (actually the HASH of a tree object)

This text is compressed with zlib and stored in a file named
after the SHA digest of the uncompressed text.

So in our initial git-backup above, we should see that 6 objects
have been created in the git database:
	- 4 blob objects with the contents of the files (compressed),
	    (suppose with hashes starting with: fa0, fb0, fc0, fd0)
	- 1 tree object (suppose hash t0)
	- 1 commit object (suppose hash c0)

All these can be viewed with `git-cat-file`.
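A commit object is hashed the same way as a blob, only with a `commit` header.  The following is a simplified sketch (real commits also carry a `committer` line and timezone information; the names here are made up):

```python
import hashlib

def hash_commit(tree_sha, parent_sha, author, message):
    # Assemble the plain text of a (simplified) commit object.
    lines = ["tree %s" % tree_sha]
    if parent_sha is not None:        # the very first commit has no parent
        lines.append("parent %s" % parent_sha)
    lines.append("author %s" % author)
    text = "\n".join(lines) + "\n\n" + message + "\n"
    data = text.encode()
    header = b"commit %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()
```

Because the parent hash is part of the hashed text, every commit hash pins down the entire history behind it.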

Running `git-cat-file -p c0` on the hash of the commit will reveal
the hash of the tree.

Running `git-cat-file -p t0` on the hash of the tree will reveal the
hashes of the blobs and the corresponding file names, in this
case [ "A" --> fa0, "B" --> fb0, etc ].

So, knowing the hash id of the latest commit, we can view the
entire history of the project and extract the contents of any
file at any backup state.  This hash can be found by following
the file `.git/HEAD`: it usually contains a line such as
`ref: refs/heads/master`, and the file `.git/refs/heads/master`
contains the hash of the latest commit.
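Resolving `.git/HEAD` down to a commit hash can be sketched like this (a sketch for the common case; it only handles loose refs):

```python
import os

def head_commit(git_dir):
    # HEAD normally contains "ref: refs/heads/master"; the named
    # ref file then contains the commit hash.  A "detached" HEAD
    # contains the hash directly.
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if head.startswith("ref: "):
        with open(os.path.join(git_dir, head[len("ref: "):])) as f:
            return f.read().strip()
    return head
```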

Restoring backups

Suppose that a programmer forgot to initialize a local
variable in a C program; this caused "Undefined Behavior" which
resulted in the contents of the files "A, B, C, D" being
filled with random data!  Oh the horror!

If the subfolder `.git` is intact, the files can be fully restored.

By typing:

	git log

we will see a list of the hashes of the last commits.

Then, typing:

	git checkout c0

will make git:

	- open commit object c0 and find that the tree of
	  this commit is object t0.
	- open the tree object t0 and find that:
		- file "A" has blob "fa0", unzip
		  the object fa0 and save its contents
		  in "A"
		- do the same for the other files/blobs.
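Reading an object back out of the store is the reverse of writing it: decompress, then split off the `<type> <size>` header.  A sketch for loose objects:

```python
import os
import zlib

def read_object(sha, objects_dir):
    # Load and decompress the object, then peel off its header.
    path = os.path.join(objects_dir, sha[:2], sha[2:])
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    header, body = raw.split(b"\x00", 1)
    kind, size = header.split(b" ")
    assert int(size) == len(body)
    return kind.decode(), body
```

Restoring file "A" then amounts to calling this on blob `fa0` and writing the returned body back to disk.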

We were lucky the monitor didn't explode.

Version Control

The backup scheme we saw above can be used for "version control".

Version control basically means that when we are working on a project
we want to be able to go back in time and see what we changed.  The
reason is that sometimes we change things in order to make the project
better, but later discover that our brain wasn't working when we
made those changes and we did something stupid.  Being able to go back
in time is very important in this case because it allows us to *fully
restore* a previous working state.  Thus we are able to undo mistakes
and easily see what our intentions were when we changed something,
even if the changes are not well commented.

Note that since objects are never removed from the git database,
"undoing" a change does not move the project back in time.  It just
creates a new state where the offending changeset is reverted with
a reverse patch.

Having this project history also allows us to do bisection to detect
which changes caused a regression, etc.

Tools that can do version control include CVS, SVN and others.

One caveat in version control that programmers often fall into is
trying too hard to have a perfect versioning history.  In this
situation, the developer spends too much time trying to separate
commits into semantically well-described changes and to attach
well-written description messages.  Eventually, the energy
spent on maintaining the versioning is more than the energy
spent actually writing code!  Thus, the tool which was installed
in order to increase the productivity of the programmer becomes
an activity that wastes time, energy and money (the perfect is
the enemy of the good!).
Especially for projects in early development it may be preferable
to simply make daily commits before going to sleep.

Going Distributed

Version control systems have existed for more than 20 years and are
indeed a very useful tool for software developers.  However, the
challenge of the new millennium is the internet, open source software
and many different developers working in parallel simultaneously on
the same project, like ants.

The challenge in this case of distributed project maintenance is
very different, and although giving network access to a CVS server
may seem to be better than nothing, it is simply the wrong tool
for the job.

The most important thing in distributed project development is
to achieve synchronization between the developers: to ensure that
developers won't edit the same part of the project at the same
time, while still allowing developers to work without waiting
for others to finish their job.

The plot

Suppose you are the developer of the project with the files A, B, C and D.
After the initial commit, you edit all four files and make a second
commit.  So far the repository supposedly contains these objects:

	HEAD commit:	c1
	tree of c1:	t1
	files of t1:	fa1, fb1, fc1, fd1
	parent of c1:	c0
	tree of c0:	t0
	files of t0:	fa0, fb0, fc0, fd0

At this time, another developer, "developer Strobolovitch", wishes to
work on the project and make valuable contributions, and henceforth
the project becomes distributed.

First of all, developer Strobolovitch "clones" your repository.
That means that he downloads *all* the objects of your repository
to his local hard disk.

Then the developer does a "checkout" of the head commit (c1) and
you are both looking at the same files.

Now you work in parallel:

	On your local repository you edit file `A` and then
	commit.  Thus the following objects exist in your
	repository:
		HEAD commit:  c2
		parent of c2: c1
		tree of c2:   t2
		files of t2:  fa2, fb1, fc1, fd1

	On his workstation, developer Strobolovitch adds a
	new subsystem to file `B` and commits.  The following
	objects exist in his repository:
		HEAD commit:  c3
		parent of c3: c1
		tree of c3:   t3
		files of t3:  fa1, fb2, fc1, fd1

Now it is time to sync and create a new tree that contains all
the goodies that the two of you have implemented.

Because this is _your_ project and you are the git master,
developer Strobolovitch issues a "please sir pull" request.

Via the network you connect to the developer's workstation
and fetch objects from his head commit.  That is, git will
download the objects:

	c3, t3, fb2

The other objects aren't downloaded because git sees that
you have them already.
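The decision about what to download boils down to a set difference over object ids, which we can sketch with the objects from this story:

```python
def objects_to_fetch(remote_objects, local_objects):
    # Only the objects we don't already have need to travel
    # over the network.
    return set(remote_objects) - set(local_objects)

# Your repository after commit c2:
yours = {"c0", "t0", "fa0", "fb0", "fc0", "fd0",
         "c1", "t1", "fa1", "fb1", "fc1", "fd1",
         "c2", "t2", "fa2"}
# Strobolovitch's repository after commit c3:
his = {"c0", "t0", "fa0", "fb0", "fc0", "fd0",
      "c1", "t1", "fa1", "fb1", "fc1", "fd1",
      "c3", "t3", "fb2"}
```

Here `objects_to_fetch(his, yours)` yields exactly `{"c3", "t3", "fb2"}`.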

At this time, on your local repository you have two "branches":
one whose head is commit (c2) and one whose head is (c3).

Now, with (c2) checked out, you have to issue the command:

	git merge c3

"Merge" will create a new commit (with two parents) as follows:

	HEAD commit:    c4
	parents of c4:  c2, c3
	tree of c4:     t4
	files of t4:    fa2, fb2, fc1, fd1

The rest of the story is completed when the developer pulls
from you.  In order to do that he will only have to
fetch the objects:

		c4, c2, t4, t2, fa2

You are now both synced and have identical trees.  In other words
you are now both working on the next mini-version of the project.

Note that networking is required only on "sync".


In the previous scenario, merging did something seemingly magical.
It took the best from the two branches.  But how did it know which
blobs to pick?

This is called a "three-way merge" and it is based on the axiom
that "all modifications are improvements".  In order to do a 3-way
merge, the branches *need* to have a common ancestor.  And this is
why commit objects are so important in git.

In the previous case the common ancestor was commit (c1).
In branch (c2), file `A` had been modified to (fa2).
On the other hand, in branch (c3), file `A` had not been modified
with respect to the common ancestor, and therefore the merge
takes the improvement from branch (c2).

Merging even works for changes to the same file, as long as the
patches don't conflict.
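At the level of whole files, the three-way decision can be sketched as follows (`merge_file` is a hypothetical helper; real git additionally merges within files, hunk by hunk):

```python
def merge_file(base, ours, theirs):
    # Three-way merge of one file, given its content in the common
    # ancestor ("base") and in the two branches.
    if ours == theirs:
        return ours      # both sides agree (or neither changed it)
    if ours == base:
        return theirs    # only they changed it: their change wins
    if theirs == base:
        return ours      # only we changed it: our change wins
    raise ValueError("conflict: both sides changed the file")
```

With the blobs from the story: for file `A` the ancestor has fa1, we have fa2, they have fa1, so fa2 wins; for `B`, fb2 wins; `C` and `D` are unchanged on both sides.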

Naturally, one can easily imagine a case where "you" and the
other "developer" modify something in a way that results in a
conflict.  For example, both "you" and Strobolovitch change the
first line of file `A`.  What happens then?

This is defined as a "conflict".  The truth is that conflicts *are*
indeed very rare.

For a project like the kernel, with nearly 16,000 files, in any
100 patches it is highly unlikely that there will be two patches
that modify the same parts of the same files, or even the same
files.

And as long as developers "sync to the master" often, conflicts
become rarer.  [Actually the probability of a conflict increases
with the number of patches that have been applied to master since
our last sync.]

There is also the issue of subsystem maintainers.  Each
sub-maintainer is responsible for a part of the project, and
changes to that specific part first have to go through the
specific maintainer's repository.

Given the above, we can say that automatic 3-way merge works 99%
of the time.

There is still the possibility for a conflict.  In this case the
maintainer who tries to do the merge will be notified by git
that merge failed due to conflicts.  Two things can happen:

	- the maintainer fixes the conflicts by hand (git leaves
	  the files in a zombie state where one can see both alternatives)

	- the maintainer tells the developer that he cannot
	  apply his improvements because there are conflicts, and
	  developer Strobolovitch had better sync with the maintainer
	  and then apply his improvements by hand on the latest
	  snapshot, thank you.

In both cases, somebody has to resolve things manually.

Hierarchically distributed

For projects that are very big and have many developers, git is
used somewhat differently.  First of all, the project space is
divided into subsystems, each with its own maintainer.

For example suppose that subsystem `audio/` is maintained by
Maintainer John Bop.  Bop has cloned the master repository.
Developers who are interested in working on `audio/` join
the relevant mailing list and sync often to Bop's repository.

When a developer modifies something, he creates a patch and
sends it to the maintainer.  The maintainer applies the patch
and commits to his repository; everybody interested in the
subsystem syncs after a while (including the developer who
initiated the patch).

When a sufficient number of patches have accumulated, Bop
issues a "please pull sir" request and his branch is merged
into the official mainline system by the project's top-level
maintainer (or the maintainer of the containing subsystem).

Every once in a while, the developers of audio sync to the
mainline repository.  Because git is so efficient, they
will only have to fetch the objects that have been added by
other subsystems.

The end

Because each commit object mentions the hash of its parent commit(s),
it is possible to do 3-way merge, and consequently let the tool take
care of applying patches automatically.

This gives us the things we want in distributed project
development: developers working in parallel, as well as efficient
synchronization, without requiring network access in order to use
the tool.

In the end though, we saw that git is just a database of (immutable)
objects that can be used for backups.  Version control, partial
synchronization and consequently parallel, distributed, hierarchical
development of big open source projects over the internet are
possible uses of this database system.

For the rest, please read the official git tutorials, now that you
know more or less what the deal is.

Thank you for reading and don't forget me when you're rich,

	-- St.