
More on workflows in R

This blog offers some specific suggestions for improving your workflow in R by breaking your code up into smaller files so that you can work on each piece more efficiently.

The workflow blog dealt with general workflow issues as well as dynamic reports and reproducible research. In that blog, I presented a series of workflow schematics. The left-hand side of the schematic has “gather data” and “enter data” sections. I think it is widely accepted that a relational database, not a spreadsheet, is the best place for most data to be entered and stored. There are exceptions to this, I'm sure, but to my knowledge, for data entry forms, data control, queries, query forms, and reports, it is hard to beat a relational database. But a database is only the left-hand side of the workflow. What happens once you move to the right side of the schematic (manipulate data, analyze data, report)?
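To make that hand-off concrete, here is a minimal sketch of pulling entered data out of a relational database and into R, where the right side of the schematic begins. The SQLite file and table names are hypothetical, and it assumes the DBI and RSQLite packages are installed:

```r
library(DBI)

# Connect to the database where the data were entered and stored
# ("field-data.sqlite" and the "observations" table are hypothetical)
con <- dbConnect(RSQLite::SQLite(), "field-data.sqlite")

# Let the database handle the querying; R takes over from here
obs <- dbGetQuery(con, "SELECT * FROM observations")

dbDisconnect(con)
```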

In my quest for a nice, seamless workflow, I present the following, which I shamelessly ripped off from R-bloggers and StackOverflow. If the logic of my arguments does not convince you, I'll appeal to authority: this seems to be what folks like Hadley Wickham do (or some variation thereof).

First, the problem:

Admittedly, my *.R files are pretty small. They are only hundreds of lines long. But that still means a lot of data cleaning, merging of data frames, and data set-up in RMark (see other post) before I even begin my analysis. Then there is the output for the report and, finally, the report itself in R Markdown (as always, see Chris's excellent tutorial on this subject). The point is that a file can get real messy, real quick, and that is without trying to combine the analysis with the report. This means that, even with relatively small files, I still spend a lot of time scrolling around trying to find the code I need to work on, or trying to relate two different parts of the file (say, a function with the analysis). This creates frustration and confusion.

The solution:

Create four files for each project (a minimal sketch follows the list):

1. Load-clean.R - Load and clean your data, i.e., subset, remove missing values, merge data sets, etc. Basically, do everything to get the data ready for analysis.

2. Fun.R (or library.R) - This file does nothing but store the code for your functions.

3. Do.R - In this file, do the actual analysis - just use the function source() to bring in the data and functions.

4. Report.Rmd - Create the report: source in what you need from Do.R and write away. Import the csv files it produces to make tables and figures.
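Here is a minimal sketch of what the first three files might look like. The file, column, and function names are all hypothetical, and writing a csv is just one way to pass results along to the report:

```r
## load-clean.R - load and clean the data; get it analysis-ready
## (file and column names are hypothetical)
raw <- read.csv("data/raw-counts.csv")
dat <- subset(raw, !is.na(count))                           # remove missing values
dat <- merge(dat, read.csv("data/sites.csv"), by = "site")  # merge data sets

## fun.R - nothing but function definitions
mean_count_by_site <- function(d) {
  aggregate(count ~ site, data = d, FUN = mean)
}

## do.R - the actual analysis; source() brings in the data and functions
source("load-clean.R")
source("fun.R")
site_means <- mean_count_by_site(dat)
write.csv(site_means, "output/site-means.csv", row.names = FALSE)
```

Report.Rmd then never has to repeat the analysis: a chunk can simply read the csv back in, e.g. `knitr::kable(read.csv("output/site-means.csv"))`, and the prose gets written around it.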

At face value, like dplyr, this approach may seem like a small gain. But, as the above sites suggest, this approach means you don't have to reload data, it allows you to get reacquainted with a project easily after time away, and it forces a degree of organization. I would add that with several small, tidy files, each one is much easier to read and edit. And for those of you who write more code than I do, I suspect there is a direct relationship between the size of a file and the difficulty of working on it.

As suggested in the above sites, there are certainly variations on this general theme and the files can be collapsed as required. For example, a load-clean-do.R file may work fine for a given project.

And as a segue to the next blog, there is an additional step - version control. Changes to a file add up. I used to keep every change in the file I was working on (changes to graphs, etc.), and the code became almost unreadable, even with lots of annotation. Currently, I am still using file names to keep track of my different versions (e.g., file-name-v3.R). This is OK but has some serious limitations. The computer jocks all seem to be using Git or GitHub - I'm trying to learn that. Chris says that it is easy. Stay tuned.

