Getting into SCM with git

Understanding git

There are various tutorials that explain git; most of them assume that
the reader is already familiar with other SCM tools, and in fact many
assume that the reader is an expert on distributed project management
(kernel hacker, etc.).

This short tutorial attempts to explain the git system to people
who have little or no experience with SCM and version control tools.

Git as a tool for backups

Suppose you have a project with 10,000 files and you want to take
periodic backups.  This task is not as simple as it seems.

One solution is to create a tar.gz archive of the entire thing and
store it in the "safe location".  However this is dangerous.
Suppose that you create one such tarball `backup.tgz` on Monday
and save it.  Then on Wednesday you accidentally delete 9,000
files, and at midnight the crontab creates a new tarball
and stores it in the "safe location" over the previous `backup.tgz`.
Files lost!

Experienced system administrators are aware of the above phenomenon,
and for that reason each tarball gets a new name that contains
the date when the tarball was created.  Thus, every day of the week
we have a new tarball created as `backup-01-07-2007.tgz`, etc.
This scheme is safe, but it has the disadvantage that it takes too
much space, since each tarball includes the entire project.
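The dated-tarball scheme above can be sketched in a few lines of Python (a sketch; the source and destination directories are whatever your setup uses):

```python
import datetime
import os
import tarfile

def dated_backup(src_dir, dest_dir):
    # Name the tarball after today's date, so a new backup never
    # overwrites a previous one.
    name = "backup-%s.tgz" % datetime.date.today().isoformat()
    path = os.path.join(dest_dir, name)
    with tarfile.open(path, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))
    return path
```

Run daily from cron this produces one `backup-2007-07-01.tgz`-style file per day; the ISO date format has the nice side effect that the tarballs sort chronologically by name.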

One solution is to automatically delete very old tarballs; this
works but is still a little bit dangerous, and you no longer have
the complete history of the project.

For such a task, git is an ideal tool.  It guarantees very
efficient storage and a complete project history, forever.

The setup

Suppose you have a folder with 4 files:

	A, B, C and D

From this folder we type:

	git init
	git add .

and we are ready.  Our "safe location" is the subdirectory `.git` that
has been created.  Now, every time we need to take a backup we type:

	git commit -a -m 'Backup my files again'

and let's call this command `git backup`.

Git stores its objects in the directories below `.git/objects`.
When we type git-backup here is what happens:

	For each file (in A, B, C, D)
		1. compress the data of the file
		2. save the compressed data to a file
		   named after the SHA digest of the
		   uncompressed data under `.git/objects`, 
		   unless such a file exists already!

So the first time we issue `git-backup`, four new "blob objects"
will be added to the git database (assuming that all the files
have different content!).
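The loop above can be sketched in Python.  Git computes the SHA digest over a small header (`blob <size>` plus a NUL byte) followed by the raw file data, and stores the zlib-compressed result under a path derived from the digest (the first two hex digits name a directory, the rest name the file):

```python
import hashlib
import os
import zlib

def store_blob(data, objects_dir):
    # Git's blob hash: SHA-1 of "blob <size>\0" followed by the data.
    header = b"blob %d\x00" % len(data)
    sha = hashlib.sha1(header + data).hexdigest()
    path = os.path.join(objects_dir, sha[:2], sha[2:])
    if not os.path.exists(path):   # unless such a file exists already!
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(header + data))
    return sha
```

Storing an empty file, for instance, always yields the well-known hash `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`.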

You can find these files by listing the directories in `.git/objects`.
The command:

	git-cat-file -p hash

will print the contents of a blob with the given hash (or just the
start of the hash if unambiguous).

At this point we learn that git stores *entire files* compressed
with zlib.  These are the blob objects.  Blob objects are immutable
and once such an object has been added under `.git` it can never
be removed!

Now if later we modify, say, file `A` and we issue git-backup,
the same procedure will be invoked.  But since the content of
the other three files hasn't changed, nothing will be added to
the database in their case.  Consequently only one new blob
object will be added (unless the new content of `A` is the same
as one of the other three files!).

Now suppose even later that we change `A` back to its original
content and we run `git-backup`.  What will happen next?  No
new blobs will be added to the object database.

With this scheme, all we have to do is periodically copy the
subfolder `.git/` to the "safe location".

Commit objects

Every time we run a `git-commit` command, a new "commit object"
will be created (and zero or more blobs, as we saw).  Each
commit is a piece of plain text that mentions:

	- the HASH of the parent commit
	- the name of the author and the time of the commit
	- the HASHES of the blobs of the current tree (actually the HASH of a tree object)

This text is compressed with zlib and stored in a file named
after the SHA digest of the uncompressed text.

So in our initial git-backup above, we should see that 6 objects
have been created in the git database:
	- 4 blob objects with the contents of the files (compressed),
	    (suppose with hashes starting with: fa0, fb0, fc0, fd0)
	- 1 tree object (suppose hash t0)
	- 1 commit object (suppose hash c0)

All these can be viewed with `git-cat-file`.
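A commit object is hashed the same way as a blob, only with a `commit` header.  The following is a simplified sketch (real commits also carry a `committer` line and timezone information; the names here are made up):

```python
import hashlib

def hash_commit(tree_sha, parent_sha, author, message):
    # Assemble the plain text of a (simplified) commit object.
    lines = ["tree %s" % tree_sha]
    if parent_sha is not None:        # the very first commit has no parent
        lines.append("parent %s" % parent_sha)
    lines.append("author %s" % author)
    text = "\n".join(lines) + "\n\n" + message + "\n"
    data = text.encode()
    header = b"commit %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()
```

Because the parent hash is part of the hashed text, every commit hash pins down the entire history behind it.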

Running `git-cat-file -p c0` on the hash of the commit will reveal
the hash of the tree.

Running `git-cat-file -p t0` on the hash of the tree will reveal the
hashes of the blobs and the corresponding file names, in this
case [ "A" --> fa0, "B" --> fb0, etc ].

So, knowing the hash id of the latest commit, we can view the
entire history of the project and extract the contents of any
file at any backup state.  This hash can be found by following
the file `.git/HEAD`: it usually contains a line such as
`ref: refs/heads/master`, and the file `.git/refs/heads/master`
contains the hash of the latest commit.
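Resolving `.git/HEAD` down to a commit hash can be sketched like this (a sketch for the common case; it only handles loose refs):

```python
import os

def head_commit(git_dir):
    # HEAD normally contains "ref: refs/heads/master"; the named
    # ref file then contains the commit hash.  A "detached" HEAD
    # contains the hash directly.
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if head.startswith("ref: "):
        with open(os.path.join(git_dir, head[len("ref: "):])) as f:
            return f.read().strip()
    return head
```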

Restoring backups

Suppose that a programmer forgot to initialize a local
variable in a C program; this caused "Undefined Behavior" which
resulted in the contents of the files "A, B, C, D" being
filled with random data!  Oh the horror!

If the subfolder `.git` is intact, the files can be fully restored.

By typing:

	git log

we will see a list of the hashes of the last commits.

Then, typing:

	git checkout c0

will make git:

	- open commit object c0 and find that the tree of
	  this commit is object t0.
	- open the tree object t0 and find that:
		- file "A" has blob "fa0", unzip
		  the object fa0 and save its contents
		  in "A"
		- do the same for the other files/blobs.
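Reading an object back out of the store is the reverse of writing it: decompress, then split off the `<type> <size>` header.  A sketch for loose objects:

```python
import os
import zlib

def read_object(sha, objects_dir):
    # Load and decompress the object, then peel off its header.
    path = os.path.join(objects_dir, sha[:2], sha[2:])
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    header, body = raw.split(b"\x00", 1)
    kind, size = header.split(b" ")
    assert int(size) == len(body)
    return kind.decode(), body
```

Restoring file "A" then amounts to calling this on blob `fa0` and writing the returned body back to disk.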

We were lucky the monitor didn't explode.

Version Control

The backup scheme we saw above can be used for "version control".

Version control basically means that when we are working on a project
we want to be able to go back in time and see what we changed.  The
reason is that sometimes we change things in order to make the project
better, but later discover that our brain wasn't working when we
made those changes and we did something stupid.  Being able to go back
in time is very important in this case because it allows us to *fully
restore* a previous working state.  Thus we are able to undo mistakes
and easily see what our intentions were when we changed something,
even if the changes are not well commented.

Note that since objects are never removed from the git database,
"undoing" a change does not move the project back in time.  It just
creates a new state where the offending changeset is reverted with
a reverse patch.

Having this project history also allows us to do bisection to detect
which changes caused a regression, etc.

Tools that can do version control include CVS, SVN and others.

One caveat in version control that programmers often fall into is
trying too hard to have a perfect versioning history.  In this
situation, the developer spends too much time trying to separate
commits into semantically well-described changes and to attach
well-written description messages.  Eventually, the energy
spent on maintaining the versioning is more than the energy
spent actually writing code!  Thus, the tool which was installed
in order to increase the productivity of the programmer becomes
an activity that wastes time, energy and money (the perfect is
the enemy of the good!).
Especially for projects in early development it may be preferable
to simply make daily commits before going to sleep.

Going Distributed

Version control systems have existed for more than 20 years and are
indeed a very useful tool for software developers.  However, the
challenge of the new millennium is the internet, open source software
and many different developers working in parallel simultaneously on
the same project, like ants.

The challenge in this case of distributed project maintenance is
very different, and although giving network access to a CVS server
may seem to be better than nothing, it is simply the wrong tool
for the job.

The most important thing in distributed project development is
to achieve synchronization between the developers: to ensure that
developers won't edit the same part of the project at the same
time, while still allowing developers to work without waiting
for others to finish their job.

The plot

Suppose you are the developer of the project with the files A, B, C and D.
After the initial commit, you edit all four files and make a second
commit.  So far the repository supposedly contains these objects:

	HEAD commit:	c1
	tree of c1:	t1
	files of t1:	fa1, fb1, fc1, fd1
	parent of c1:	c0
	tree of c0:	t0
	files of t0:	fa0, fb0, fc0, fd0

At this time, another developer, "developer Strobolovitch", wishes to
work on the project and make valuable contributions, and henceforth
the project becomes distributed.

First of all, developer Strobolovitch "clones" your repository.
That means that he downloads *all* the objects of your repository
to his local hard disk.

Then the developer does a "checkout" of the head commit (c1) and
you are both looking at the same files.

Now you work in parallel:

	On your local repository you edit file `A` and then
	commit.  Thus the following objects exist in your
	repository:
		HEAD commit:  c2
		parent of c2: c1
		tree of c2:   t2
		files of t2:  fa2, fb1, fc1, fd1

	On his workstation, developer Strobolovitch adds a
	new subsystem to file `B` and commits.  The following
	objects exist in his repository:
		HEAD commit:  c3
		parent of c3: c1
		tree of c3:   t3
		files of t3:  fa1, fb2, fc1, fd1

Now it is time to sync and create a new tree that contains all
the goodies that the two of you have implemented.

Because this is _your_ project and you are the git master,
developer Strobolovitch issues a "please sir pull" request.

Via the network you connect to the developer's workstation
and fetch objects from his head commit.  That is, git will
download the objects:

	c3, t3, fb2

The other objects aren't downloaded because git sees that
you have them already.
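The decision about what to download boils down to a set difference over object ids, which we can sketch with the objects from this story:

```python
def objects_to_fetch(remote_objects, local_objects):
    # Only the objects we don't already have need to travel
    # over the network.
    return set(remote_objects) - set(local_objects)

# Your repository after commit c2:
yours = {"c0", "t0", "fa0", "fb0", "fc0", "fd0",
         "c1", "t1", "fa1", "fb1", "fc1", "fd1",
         "c2", "t2", "fa2"}
# Strobolovitch's repository after commit c3:
his = {"c0", "t0", "fa0", "fb0", "fc0", "fd0",
      "c1", "t1", "fa1", "fb1", "fc1", "fd1",
      "c3", "t3", "fb2"}
```

Here `objects_to_fetch(his, yours)` yields exactly `{"c3", "t3", "fb2"}`.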

At this time, on your local repository you have two "branches":
one whose head is commit (c2) and one whose head is (c3).

Now, with (c2) checked out, you have to issue the command:

	git merge c3

"Merge" will create a new commit (with two parents) as follows:

	HEAD commit:    c4
	parents of c4:  c2, c3
	tree of c4:     t4
	files of t4:    fa2, fb2, fc1, fd1

The rest of the story is completed when the developer pulls
from you.  In order to do that he will only have to
fetch the objects:

		c4, c2, t4, t2, fa2

You are now both synced and have identical trees.  In other words
you are now both working on the next mini-version of the project.

Note that networking is required only on "sync".


In the previous scenario, merging did something seemingly magical.
It took the best from the two branches.  But how did it know which
blobs to pick?

This is called a "three-way merge" and it is based on the axiom
that "all modifications are improvements".  In order to do a 3-way
merge, the branches *need* to have a common ancestor.  And this is
why commit objects are so important in git.

In the previous case the common ancestor was commit (c1).
In branch (c2), file `A` had been modified to (fa2).
On the other hand, in branch (c3), file `A` had not been modified
with respect to the common ancestor, and therefore the merge
takes the improvement from branch (c2).

Merging even works for changes to the same file, as long as the
patches don't conflict.
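At the level of whole files, the three-way decision can be sketched as follows (`merge_file` is a hypothetical helper; real git additionally merges within files, hunk by hunk):

```python
def merge_file(base, ours, theirs):
    # Three-way merge of one file, given its content in the common
    # ancestor ("base") and in the two branches.
    if ours == theirs:
        return ours      # both sides agree (or neither changed it)
    if ours == base:
        return theirs    # only they changed it: their change wins
    if theirs == base:
        return ours      # only we changed it: our change wins
    raise ValueError("conflict: both sides changed the file")
```

With the blobs from the story: for file `A` the ancestor has fa1, we have fa2, they have fa1, so fa2 wins; for `B`, fb2 wins; `C` and `D` are unchanged on both sides.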

Naturally, one can easily imagine a case where "you" and the
other "developer" modify something in a way that results in a
conflict.  For example, both "you" and Strobolovitch change the
first line of file `A`.  What happens then?

This is defined as a "conflict".  The truth is that conflicts *are*
indeed very rare.

For a project like the kernel, with nearly 16,000 files, in any
100 patches it is highly unlikely that there will be two patches
that modify the same parts of the same files, or even the same
files.

And as long as developers "sync to the master" often, conflicts
become rarer.  [Actually the probability of a conflict increases
with the number of patches that have been applied to master since
our last sync.]

There is also the issue of subsystem maintainers.  Each
sub-maintainer is responsible for a part of the project, and
changes to that specific part first have to go through the
specific maintainer's repository.

Given the above, we can say that automatic 3-way merge works 99%
of the time.

There is still the possibility for a conflict.  In this case the
maintainer who tries to do the merge will be notified by git
that merge failed due to conflicts.  Two things can happen:

	- the maintainer fixes the conflicts by hand (git leaves
	  the files in a zombie state where one can see both alternatives)

	- the maintainer tells the developer that he cannot
	  apply his improvements because there are conflicts, and
	  developer Strobolovitch had better sync with the maintainer
	  and then apply his improvements by hand on the latest
	  snapshot, thank you.

In both cases, somebody has to resolve things manually.

Hierarchically distributed

For projects that are very big and have many developers, git is
used somewhat differently.  First of all, the project space is
divided into subsystems, each with its own maintainer.

For example suppose that subsystem `audio/` is maintained by
Maintainer John Bop.  Bop has cloned the master repository.
Developers who are interested in working on `audio/` join
the relevant mailing list and sync often to Bop's repository.

When a developer modifies something, he creates a patch and
sends it to the maintainer.  The maintainer applies the patch
and commits to his repository; everybody interested in the
subsystem syncs after a while (including the developer who
initiated the patch).

When a sufficient number of patches have accumulated, Bop
issues a "please pull sir" request and his branch is merged
into the official mainline system by the project's top-level
maintainer (or the maintainer of the containing subsystem).

Every once in a while, the developers of audio sync to the
mainline repository.  Because git is so efficient, they
will only have to fetch the objects that have been added by
other subsystems.

The end

Because each commit object mentions the hash of its parent commit(s),
it is possible to do 3-way merge, and consequently let the tool take
care of applying patches automatically.

This gives us the things we want in distributed project
development: developers working in parallel, as well as efficient
synchronization, without requiring network access in order to use
the tool.

In the end though, we saw that git is just a database of (immutable)
objects that can be used for backups.  Version control, partial
synchronization and consequently parallel, distributed, hierarchical
development of big open source projects over the internet are
possible uses of this database system.

For the rest, please read the official git tutorials, now that you
know more or less what the deal is.

Thank you for reading and don't forget me when you're rich,

	-- St.