Friday, April 12, 2013

Version Control

This is a nerdy post, relevant mainly to empirical researchers in the social sciences, though it may also interest anyone whose job involves creating tons of computer files to finish one project.

Matthew Gentzkow and Jesse M. Shapiro of Chicago Booth advocate the use of version control in empirical social science research (see Chapter 3 of their "Code and Data for the Social Sciences: A Practitioner’s Guide").

If you are new to the idea of version control, watch a series of videos from Software Carpentry.

The basic idea of Gentzkow and Shapiro is that empirical researchers in social science should think of writing data analysis scripts as developing software to be released to the public. We need to allow other researchers to replicate our empirical findings, so we should make all the code and datasets public once the paper is published. It's often the case, however, that by the time you publish the paper, your computer directories are cluttered with many files that are unnecessary for producing the final results. And cleaning them up often leaves you unable to replicate the final results that you obtained for the paper. Version control can prevent this problem.

However, it seems to me that the main benefit of version control is something else: to track the evolution of your thoughts on each empirical research project.

We empirical researchers often face a situation like this:

"Well, I need to analyze this particular thing. I think I did it a few months ago. Which files did I write for this purpose? I cannot find them in my computer."

So you have to start from scratch. A massive waste of time.

The branching function of version control (a great illustration can be found in Section 3.4 of Pro Git, written by Scott Chacon) helps us avoid this problem. Every time you try out a new way of analyzing the data, create a new branch (call it the test branch). Within the test branch, keep developing your code and committing it. If it turns out to be a bad idea, you can stop working on the test branch and go back to the "master" branch. This way, all the new files you committed for the failed idea disappear from your working directory. All the clutter is cleaned away. However, these files are preserved behind the scenes. If you later find the failed idea to be actually a good one, you can recover all the files you created in the test branch. Then, you can merge these files in the test branch with those in the master branch very easily.
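The branch workflow described above can be sketched in a few lines. This is only an illustration, not a recommended setup: it assumes the git command-line tool is installed and on your PATH, it drives git from Python through `subprocess`, and the repository location, the file names (`baseline.do`, `new_idea.do`), the commit messages, and the demo user identity are all made up for the example.

```python
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run a git command inside `repo` and return its standard output."""
    # The user name and email are placeholders so that commits work anywhere.
    cmd = ["git", "-c", "user.name=Demo", "-c", "user.email=demo@example.com", *args]
    result = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return result.stdout


def commit_file(repo, name, text, message):
    """Write a file, stage it, and commit it."""
    (repo / name).write_text(text)
    git(repo, "add", name)
    git(repo, "commit", "-m", message)


# A throwaway repository in a temporary directory (hypothetical project).
repo = Path(tempfile.mkdtemp())
git(repo, "init")
# Name the initial branch "master" regardless of the local git default.
git(repo, "symbolic-ref", "HEAD", "refs/heads/master")
commit_file(repo, "baseline.do", "* baseline regression\n", "Baseline analysis")

# Try a new idea on its own branch.
git(repo, "checkout", "-b", "test")
commit_file(repo, "new_idea.do", "* alternative specification\n", "Try new idea")

# Abandon the idea for now: its file leaves the working directory...
git(repo, "checkout", "master")
exists_on_master = (repo / "new_idea.do").exists()

# ...but it is still stored on the test branch.
files_on_test = git(repo, "ls-tree", "--name-only", "test").split()

# The idea turns out to be good after all: merge it back into master.
git(repo, "merge", "test")
exists_after_merge = (repo / "new_idea.do").exists()
```

After switching back to master, `new_idea.do` is gone from the directory but still listed on the test branch; after the merge it reappears alongside `baseline.do`.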

There are several systems of version control out there. Git appears to be the best one for branching. (And this article confirms my impression.) However, Git itself is Unix-style command-line software. Its user interface is not particularly friendly unless you are a computer programmer.

Among the wide range of graphical interfaces for Git (see the partial list provided by the official Git website), I find Gitbox the most intuitive. It's like an iPhone: you can use it without reading a manual. It runs on Mac OS. For Windows, I don't know which one is the best.

The only problem with Gitbox is that it does not visualize branches. Perhaps it is a good idea to use a second graphical interface just for visualizing branches. But it seems to me that none of the available software is very good at that.

There is one issue with Git per se. It's a "distributed" version control system. That is, you keep all the files on your local computer and, whenever appropriate, sync them with a remote server (a bit like Evernote). And all the previous versions of each file are stored on your local computer. This is fine if you only write ASCII (plain-text) files. It's not fine if you "version-control" binary files such as data and images, because binary files tend to be large and do not compress well across versions, so the stored history can grow quickly. If you use Git, therefore, it's a good idea to version-control only the scripts that run the statistical software. Data can be reproduced by running those "tracked" scripts each time.
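One way to keep data and images out of the history is a `.gitignore` file. The sketch below again assumes git is installed and is run from Python; the ignore patterns (`*.dta`, `*.png`) and the file names are hypothetical and should be adapted to whatever your own scripts produce.

```python
import subprocess
import tempfile
from pathlib import Path

# A throwaway repository in a temporary directory (hypothetical project).
repo = Path(tempfile.mkdtemp())
subprocess.run(["git", "init"], cwd=repo, check=True, capture_output=True)

# Assumed patterns: ignore Stata datasets and PNG figures, track everything else.
(repo / ".gitignore").write_text("*.dta\n*.png\n")

# One script (to be tracked) and one dataset it would generate (to be ignored).
(repo / "clean_data.do").write_text("* builds panel.dta from the raw files\n")
(repo / "panel.dta").write_bytes(b"\x00\x01 binary data")

# "git status --porcelain" prints one line per file git would consider adding;
# each line is a two-character status code, a space, then the path.
status = subprocess.run(
    ["git", "status", "--porcelain"],
    cwd=repo, check=True, capture_output=True, text=True,
).stdout
candidates = sorted(line[3:] for line in status.splitlines())
```

Here `candidates` contains the script and the `.gitignore` file itself, while `panel.dta` never shows up, so it cannot be committed by accident.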

As opposed to distributed systems, there are also centralized version control systems (such as Subversion), which keep track of file histories on a remote server. (See this article for a comparison of centralized and distributed systems.) The drawback of centralized version control is that branching takes time (because each time you create a new branch, you need to download every file from the remote server). If the main benefit of version control is branching, then a distributed system appears to be the way to go.

Another merit of version control is that it makes collaboration easier. It's an effective way to keep different people's edits to different parts of the project from piling up into conflicts that cannot be resolved immediately. For collaborative use of version control, however, your coauthors also need to know about version control (which is new to most people in social science) and agree on when to create a new branch and when to "commit" your work. (To commit means to record all the file changes you have made so that they can be tracked in the future.) That doesn't seem easy.

I'm still learning about version control. One thing that I have to figure out is how to use Dropbox together with version control. Freshmob and Sam Doidge suggest how to do it.


1 comment:

Anonymous said...

Why do you need to use Dropbox, if the point of Git is to be able to work on your "branch" locally? I could imagine needing dropbox to share "raw" data files (before any processing in .do or .r files), but other than that, what are you thinking about using Dropbox for?