At the halfway point of my ADA project I've been doing some reflection on my progress. One of the things I've been most pleased with is the computational aspect. It definitely could have been a nightmare: I've got a large amount of data to process and more coming in every day, temperamental third-party software to accommodate, and a lot of nonstandard analyses to code up. It's a minor miracle that I haven't had any real difficulties. I'd attribute my success to emulating the Unix philosophy as expressed below:

This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface. – Doug McIlroy

Let's break this down.

Write programs that do one thing and do it well

The pitfall to avoid here is the monolithic script file. I see this as a TA in my students' projects: the entire project lives in one file, from reading in the data and cleaning it to running some test analyses and making a few plots. For small-scale assignments this works, but it quickly falls apart for even slightly more complicated projects. Here are a few of the problems:

  • It's a nightmare to collaborate: I'm thinking mostly of people for whom the state of the art in collaboration is emailing the script file back and forth. You have to read1 the entire file to find out what changed, and then copy and paste in your changes. If you use a version-control tool like git this isn't as bad: you might have to do manual merges more often, but that's not usually a problem. Your commit logs, however, are terrible to walk through because you have to either trust the commit messages or actually look at the diffs.
  • It's hard to cache intermediate results: If you have a lot of preprocessing to do, you'll want to avoid redundant work, but the big script file will rerun that preprocessing every time. I've seen people write the results to a file and then comment out the relevant section of code, but that just seems like a sure recipe for headaches.
  • It's not language agnostic: This may or may not be a sticking point, but if you are familiar with several languages it's nice to be able to use the right tool for the job.
    • Want to compare your method against several standard classifiers? It's dead simple to write the relevant code in scikit-learn.
    • Need to decompress your raw data files? Just use your usual command-line tools.
    • Want to really impress your advisor? Code up a shiny app.
    • Have somebody else's matlab code? Waste an afternoon getting matlab installed and running so that eventually you can just "run it"2.

So if the monolithic script is a bad idea, what should we do instead? The obvious response is to write a bunch of small scripts which each do one thing: one script to clean the data, one to fit a model, a couple to make diagnostic plots, and a knitr file with code for the final figures and tables. This is immensely better because it solves all of the problems stated above.

  • It's easy to collaborate: You each work on different script files, and if you're so inclined you can still email a script back and forth and simply replace your local copy wholesale. For the more sophisticated users, your commit history becomes much more informative.
  • It's easy to cache intermediate results: Indeed, you must save intermediate results once you decouple the script that produces the data from the script that consumes it.
  • It's easy to use different languages: Everything is implemented separately, so the language doesn't matter.
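
To make this concrete, here is a sketch of what one of those single-purpose scripts might look like (the file names and the cleaning step are made up for illustration; the script just reads a raw CSV, drops incomplete rows, and writes a cleaned copy):

## code/clean-data.R: do one thing (clean the raw data) and do it well
args <- commandArgs(trailingOnly = TRUE)   # e.g. data/raw.csv data/cleaned.csv

raw <- read.csv(args[1], stringsAsFactors = FALSE)
cleaned <- raw[complete.cases(raw), ]      # drop rows with missing values

write.csv(cleaned, args[2], row.names = FALSE)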

Write programs to work together

One objection to the above organizational scheme is that it's an ordeal to keep everything up-to-date. Suppose you changed the data cleaning script: did you then rerun it and cache the up-to-date results? If not, your entire analysis is out of date. One thing you have to admit is that the monolithic script is easy to reproduce: just run it top to bottom.

My alternative is to use a build system. These were originally developed to automate the process of compiling source code; if you've ever built something from source you almost surely used make or one of its derivatives.

I use rake, which is written in Ruby and provides some features that make it more suitable for an analysis project than something like make. The basic idea is that you provide a dependency graph of your project: for each file you want to create, you list the files it depends on and the commands that create it. So if I want to create "figs/histogram.png" by running "code/plot-histogram.R" on the file "data/data-file.csv", I just write the following task in my Rakefile

file "figs/histogram.png" => ["code/plot-histogram.R", "data/data-file.csv"] do
  sh "Rscript code/plot-histogram.R data/data-file.csv figs/histogram.png"
end

Then I run "rake figs/histogram.png" and it'll try to build my plot through the following process.

  1. First it'll look at the dependencies, "code/plot-histogram.R" and "data/data-file.csv". If either of them is missing, rake will try to build the missing file by finding instructions elsewhere in the Rakefile. Thus if "data/data-file.csv" is an intermediate data file, rake will run the appropriate commands to create it (and any dependencies it has).
  2. Once both dependencies exist rake will check if the histogram exists.
    • If the histogram is missing then the "Rscript …" command will be run.
    • If the histogram exists but has an earlier timestamp3 than any one of its dependencies then the "Rscript …" command will be run.
    • If the histogram exists but has a later timestamp than both of its dependencies then nothing will happen because the output wouldn't have changed.

This saves an amazing amount of time. A full build of my project would take a couple of days, but I've only had to do that a few times (and that was in the early stages when I didn't have as much data, so it only took a couple of hours). Nowadays I only have a long build when I get a new installment of data, but I always have the peace of mind that everything is up-to-date and completely reproducible.

Write programs to handle text streams, because that is a universal interface

If we're going to have programs working together we need to have some way of sharing data. I'm not going to advocate a UNIX piping scheme here. But if we use something like rake we can create a data processing pipeline which passes along intermediate data files.

The typical approach is to just write CSV files; however, I often don't have data that is naturally expressed in tabular form (though it could be forced into that shape). I also try to keep things human readable, just so I can sanity check them.

Thus I like passing around data in JSON or HDF5 formats. Every language I use has a straightforward means to load either of these formats, so they're the most universal options I've found.
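
For example, in R the jsonlite package handles the round trip in a couple of lines (a sketch with made-up names; the equivalent code in another language is just as short):

library(jsonlite)

## Write an intermediate result as human-readable JSON...
results <- list(method = "lasso", lambda = 0.1, coefs = c(1.2, 0, -0.4))
writeLines(toJSON(results, pretty = TRUE, auto_unbox = TRUE), "results.json")

## ...and read it back here, or from a script in another language
results2 <- fromJSON("results.json")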

One unappreciated benefit of this approach is that you can create fake data. Suppose your collaborator is working on some MCMC code which will eventually output the posterior draws in "posterior.csv". If you want to make some visualizations based upon those draws, you don't have to put your work on hold indefinitely while the MCMC code gets written and run; you can just simulate some data and save it as "posterior.csv". Now you code up your visualizations and get everything looking exactly like you want it. Then when the MCMC code is done, "posterior.csv" will be rebuilt, which will trigger the rebuilding of your plot.
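
Faking the file can be as simple as the following (a sketch; the parameter names are invented, and the real draws will of course come from the MCMC code):

## Stand-in for the eventual MCMC output: 1000 draws of two parameters
fake_posterior <- data.frame(mu    = rnorm(1000, mean = 0, sd = 1),
                             sigma = rgamma(1000, shape = 2, rate = 2))
write.csv(fake_posterior, "posterior.csv", row.names = FALSE)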

I find this particularly helpful for my Chromebook which is woefully outmatched by my ADA project. But I can store a very small subset of the real data and quickly simulate any outputs so that I can work on the project without lugging around my real laptop.

Summary

If I had to concisely summarize the benefit of this organizational choice, it's that it allows me to focus exclusively on a well-defined task. If I thought about my entire project at once I'd quickly become overwhelmed, but I can easily write a script to make a certain plot. I can also easily write something that uses some third-party code to solve one of my problems. Or I can implement my own algorithm. Soon these small, well-defined, easy tasks will add up to a finished product. Or at least I hope so.

Footnotes:

1

Or diff it, but I suppose that having the technical acumen to think of using diff would imply having the technical acumen to avoid getting in this position in the first place.

2

For licensing reasons I have to remember to either be on the CMU network or connected to the VPN. Also, if I abort in the middle, the matlab process doesn't get killed and decides to eat up one of my processors until I manually terminate it. Add to that the fact that the source code is unreadable (who uses a(3) to refer to indexing?) and I don't know why anyone uses matlab.

3

I disliked this early on because making a cosmetic change, like adding comments to your data cleaning code, can lead to a complete rebuild of your project. The way I've found to minimize this is to check whether the output has changed before writing it. For example, in R I can structure things like this:

fname <- "perm.csv"

perm <- sample(1:100)

## Saves everytime
write(perm, fname)

## Saves only if the data changed
changed <- TRUE
if(file.exists(fname)) {
    old <- scan(fname)

    if(length(old) == length(perm)) {
        if(all(old == perm)) {
            changed <- FALSE
        }
    }
}

if(changed) {
    write(perm, fname)
}

Thus in the worst case you only have to rerun a single step.

If I have some spare time I should wrap these up in an R package. I haven't had any issues with HDF5, but text representations, especially of floating-point numbers, can get tricky.
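
A rough sketch of what such a helper might look like (write_if_changed, read_fn, and write_fn are names I'm inventing here, and the comparison would need more care for nested objects or HDF5 data):

write_if_changed <- function(x, fname, read_fn = scan, write_fn = write) {
    if(file.exists(fname)) {
        old <- tryCatch(read_fn(fname), error = function(e) NULL)
        ## all.equal() tolerates integer-versus-double differences and the
        ## small floating-point noise introduced by a text round trip
        if(!is.null(old) && isTRUE(all.equal(old, x))) {
            return(invisible(FALSE))   # unchanged: leave the timestamp alone
        }
    }
    write_fn(x, fname)
    invisible(TRUE)
}

## Equivalent to the example above
write_if_changed(perm, "perm.csv")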