Git: Better than a lap-dog to a slip of a girl

Basilica cistern in Istanbul

Git is the source code management system created by Linus Torvalds a couple of years ago to manage the Linux kernel source code (yes, git is a silly name; where was Ari Lemmke this time?). Git is now being used, or at least considered, by several open source projects other than the Linux kernel. So I decided to try it myself on some personal projects. Here is a short experience report.

My perception of git a couple of months ago was that, while git provides the heavyweight decentralized development facilities Linus needs to manage the kernel source, it did not appear easy to get started with it for everyday development. The early git documentation contributed to this impression, by focusing on the conceptual model used by git, rather than the things you need to know as you take your first steps. Furthermore, git reuses terminology from other popular SCM systems with somewhat different meanings (the nature of a git repository is quite different to that of a CVS/SVN repository, git checkout does something quite different to cvs checkout, etc.).

But after a month of using git, I've found that it works great for small projects! And it's actually quite easy to get started with it these days.

The documentation problem has mostly been addressed. There is a good tutorial. Once you know your way around the basic commands, you can explore their full functionality by reading the comprehensive man pages. And there is an active wiki containing lots of additional information and links to further documentation. There is room for improvement, of course, but all in all, the state of the documentation seems superior to that of the other SCMs that have sprung up in recent years.

The other thing that has made it much easier to get started is git's availability as an optional package on many Linux distributions. I just did yum install git on Fedora 7, but the equivalent should work on Ubuntu, Debian, etc.

Setting up a git repository for a new project is extremely easy. You just go into the directory containing your project, do git init to create the empty repository, and then commit the files (git add . ; git commit). That's it. Turning a bunch of files into a git repository could hardly be easier.
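The whole sequence, as a shell session (a throwaway temporary directory stands in for your project directory, and the commit message is a placeholder):

```shell
cd "$(mktemp -d)"                 # stand-in for your existing project directory
echo 'hello' > README             # a file to commit

git init                          # create an empty repository in ./.git
git config user.name "Your Name"  # only needed if not already set globally
git config user.email "you@example.com"

git add .                         # stage every file under the directory
git commit -m "Initial import"    # commit the staged files
```

After this, git log shows a single commit containing everything that was in the directory.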

The main hump in the git learning curve is understanding that a commit is a two-step process with git. First, you select the changes to be committed (git add). Conceptually, this copies the changed files into the staging area. Then you commit what's in the staging area into the current branch (git commit). This separation between the two steps makes it possible not only to commit a subset of the changed files, but to commit a subset of the changes made to a single file (git add has a mode where you select the desired change hunks from the diff). So the staging area is not simply a list of files from your working directory, but actually contains the file contents to be committed, which may be different from those in your working directory. This may all sound complicated and fiddly, but it provides a lot of flexibility (similar to that achievable by hand-editing diffs if you use diff and patch to manage changes). The simple case — committing a set of files as they stand in your working directory — is handled by the git commit -a command, which combines the add/commit steps.
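A minimal demonstration of this point: once a change is staged, later edits to the same file stay out of the next commit until they are staged too.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Example"; git config user.email "example@example.com"

echo one > file.txt
git add file.txt && git commit -q -m "first version"

echo two >> file.txt
git add file.txt                  # snapshot the current contents into the staging area
echo three >> file.txt            # this later edit is NOT staged

git diff --cached                 # staged vs HEAD: shows "two" but not "three"
git diff                          # working tree vs staged: shows only "three"
```

(The hunk-selection mode mentioned above is git add -p, which walks you through the diff one hunk at a time.)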

There is also a graphical tool, git-gui, included with git, which provides an alternative to the command line tools for managing commits. git-gui makes it easy to see what's in your staging area, how it compares with HEAD and your working directory, move changes back and forth (much more easily than with the git add text UI), and perform the commits. Although it's not fancy, I find git-gui extremely convenient, and now I'm using it to manage almost all my commits. Even in simple cases, it's nice to be able to review what you are about to commit as you write the commit message. Git comes with another graphical tool, gitk, for viewing and searching the repository history.

One of the headline features of git is advanced automated merging. So far, I have only made trivial use of branching and merging: Creating branches to hold more adventurous lines of development, and merging them back if they work out well (nothing that CVS can't handle). All of this works in an obvious fashion, and is covered in the tutorial.
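The branch-and-merge-back workflow described above looks like this (branch and file names are made up for illustration):

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Example"; git config user.email "example@example.com"
echo base > notes.txt; git add .; git commit -q -m "base"

main="$(git symbolic-ref --short HEAD)"   # whatever the default branch is called

git checkout -q -b experiment             # create and switch to a topic branch
echo idea >> notes.txt
git commit -q -a -m "try out an idea"

git checkout -q "$main"                   # back to the main line
git merge -q experiment                   # it worked out, so merge it back
```

Since nothing moved on the main branch in the meantime, the merge here is a trivial fast-forward; git only has to do real merging work when both branches have new commits.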

Another of git's advertised features is performance. Since operations on git repositories are local (except when you are pushing and pulling changes between repositories), it's naturally much faster than a remote centralized SCM. And my projects are very modest in size. But even with that taken into account, I was pleasantly surprised that everything happens with no perceptible delay at all (even with diff and patch on hard-linked source trees, I'm used to a slight delay). So with git it is painless to commit every few minutes; the main source of effort is writing the commit messages. If you are going to work on something slightly experimental for an hour or two, just create a branch for it, and commit into that as you go. The git repositories you use for development are always private (though perhaps linked to a published repository), so you don't have to worry about choosing a particularly descriptive or unique branch name.

Another consequence of git's performance: When you are using an unfamiliar feature that modifies the repository, and you are a little unsure if you fully understand the effect of the commands involved, it's very easy to simply clone the repository and then do a dry run on the clone.
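A sketch of that clone-and-dry-run technique (the history-editing command is just an arbitrary example):

```shell
src="$(mktemp -d)"                        # stand-in for your real repository
( cd "$src" && git init -q &&
  git config user.name "Example" && git config user.email "example@example.com" &&
  echo data > file.txt && git add . && git commit -q -m "real work" )

clone="$(mktemp -d)/scratch"
git clone -q "$src" "$clone"              # a full, independent copy

cd "$clone"
git commit --amend -q -m "rewritten"      # e.g. an unfamiliar history-editing command

git -C "$src" log --oneline               # the original repository is untouched
```

If the dry run goes wrong, you just delete the clone and try again.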

So my experience of git has been very positive. If it gains a critical mass of open-source projects using it, and developers familiar with it, it could go a long way. There are two main issues that I'm aware of that could hold up wider acceptance:

(Migrating from another SCM to git, or getting git to coexist with another SCM, seems well covered.)


Using the Blogger Data APIs to Fix Markup Errors

The Erg Chebbi dunes in Morocco

A few days ago, I clicked the Validate HTML link down there on the left for the first time in years, and discovered some carelessness in the HTML of my blog posts. I write these posts as HTML in Emacs before copying it across into the Blogger posting form. The Emacs HTML mode features syntax coloring that makes it easy to see when characters need escaping as character entity references. Except, that is, when those characters appear in attribute values — attribute values appear in a uniform shade of pink. The result is that when writing several posts, I copied URIs containing ampersands into href attributes, and didn't notice that the ampersands need replacing with &amp; entity references. And this is the issue that the HTML Validator flagged. Browsers are able to handle these attributes without problems, but this is not something I could leave alone. And fixing the problematic posts is a perfect opportunity to play with the Blogger Data APIs!

I started with the Python client libraries provided by Google, and working from the developer's guide and sample code, I was quickly able to come up with a fairly clean 50-line Python program to do the job. Modifying that program to correct other simple markup errors should be straightforward.
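Setting the GData plumbing aside, the core transformation is small enough to sketch with the standard library alone. This is a simplified stand-in for the actual program, assuming href attribute values are double-quoted:

```python
import re

# A bare ampersand is one not already starting an entity reference
# such as '&amp;', '&#38;' or '&#x26;'.
BARE_AMP = re.compile(r'&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#x[0-9a-fA-F]+);)')

def escape_ampersands_in_hrefs(html):
    """Replace bare '&' with '&amp;' inside href="..." attribute values."""
    def fix_attr(match):
        prefix, value, suffix = match.groups()
        return prefix + BARE_AMP.sub('&amp;', value) + suffix

    # Crude but adequate for hand-written posts: only touch href attributes.
    return re.sub(r'(href=")([^"]*)(")', fix_attr, html)

broken = '<a href="http://example.com/?a=1&b=2">link</a>'
print(escape_ampersands_in_hrefs(broken))
# <a href="http://example.com/?a=1&amp;b=2">link</a>
```

The negative lookahead makes the fix idempotent: running it over an already-correct post changes nothing, so it is safe to apply to every entry in the feed.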

Draft blog entries came in useful for testing this program. The draft entries are marked in the feed by the app:draft element from the Atom Publishing Protocol. So it was easy to restrict the program to update only draft entries, then create a draft entry containing my test cases, and review the changes made by the program to that draft entry through the Blogger UI. When I was ready to entrust it with my precious blog posts, I removed that draft check. This approach was quicker than creating a whole new test blog just to hold test entries.
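The draft check amounts to looking for app:draft inside app:control in each Atom entry. A sketch of that test on the raw entry XML (the namespace URI here is the one from the final Atom Publishing Protocol spec; Blogger's feeds at the time may have used an earlier draft namespace):

```python
import xml.etree.ElementTree as ET

APP_NS = 'http://www.w3.org/2007/app'  # Atom Publishing Protocol namespace

def is_draft(entry_xml):
    """Return True if an Atom entry carries <app:control><app:draft>yes</app:draft>."""
    entry = ET.fromstring(entry_xml)
    draft = entry.find('{%s}control/{%s}draft' % (APP_NS, APP_NS))
    return draft is not None and (draft.text or '').strip() == 'yes'
```

Entries without an app:control element are published as normal, so the absence of the element means "not a draft".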

I do have some small gripes with the Python GData client library. Google provides reasonable howto-style documentation describing how to manipulate Blogger data via the library. But when you move away from the examples in that documentation to do something a bit different, you need reference documentation to guide you. In particular, I was wondering whether my code needed to do anything to support the different entry content types from the Atom spec; the answer seems to be that Blogger currently supports only type='html', with the escaped HTML markup that entails. The Python code for the library does not provide ready answers, because it mostly consists of hundreds of lines of boilerplate to convert between the GData protocol XML structures and the corresponding Python objects (an opportunity for some metaprogramming, I wonder?). So there is a gap for more systematic documentation describing the correspondence between the Python API and the GData protocol. Such documentation doesn't need to be verbose: For example, the documentation for the Universal Feed Parser is very good at concisely defining how its Python objects correspond to the XML structure of RSS and Atom feeds.

A KartaMetro screenshot showing the Arbatskaya station complex

Back at the end of May, I went to Google Developer Day 2007 in London. I saw a lot of interesting stuff, which I didn't write about here because I was busy arranging the wedding, and well, I'm not in the habit of updating this blog at the best of times. So I owe a belated thanks to Google and their staff for providing a day of interesting talks and the surrounding hospitality for free.

There was something that caught my eye while I was there: Before each talk there was a looped video on the projection screen, showing off various applications that have been built on top of Google's APIs. And one of these applications involved Google Maps' streetmap of Moscow, overlaid with the metro stations and lines. The video cut between applications far too fast to see any details, or to note the URL of the application. But some weeks later I found it again: the site is KartaMetro.

If you know Moscow, it's fun to just browse the metro system. One great feature is the subterranean maps of the stations. The screenshot above shows the sprawling Arbatskaya/Biblioteka im. Lenina/Aleksandrovskiy Sad/Borovitskaya station complex, which turns out to be about 600 metres across (it feels a lot bigger when you are down there).

One problem KartaMetro has in common with many Google Maps-based sites is that it can feel quite slow even on fast machines, not because of the time spent talking to the server, but just due to the JavaScript grinding away (in particular, whatever Google Maps uses to draw lines in Firefox). The attractions of moving applications into the browser are obvious for users and developers, but the technology foundations are disappointing in many ways. It will be interesting to see whether the projects to overhaul the Firefox internals can demonstrate that these are implementation issues, and not fundamental flaws.
