Project Management with RStudio and Github

This post is inspired by Professor Mallory’s presentation on Reproducible Research Practices for Economists. See the presentation version that I gave at a lab meeting.

Summary

  • Introduced project management practices for file management and version controls.

  • Principles for organized project management: keep raw data intact, use version controls, use progress management and save files in the relative path.

  • Introductions to R, RStudio, R markdown, Github

  • Steps to manage a project with RStudio and Github (added in Stata version)

Why do we need project management?

Graduate school is a project-based learning experience. The course works arms you with theoretical knowledge, but most of your practical skills are developed by doing research projects. A project is an entire process of defining research questions, collecting and cleaning data, running models, visualizing results and writing reports to present your work. The quality and quantity of projects that you accomplish is a good measure of your success in grad school.

However, project management is hard. To start with, the number of files you work with can be enormous. Instead of the loading a clean dataset in your statistics class and start trying out the models, a large portion of the time in doing projects is spent on putting your data in the right shape and form. To have a regression with 100 variables, the preparation stage alone can generate thousands of files. As a result, a mix of raw data, processed data, functions, scripts stored all in one place would look like this.

Messy project folder

The drawback of this “natural” style of managing project files is apparent. By storing the original and modified data all in one place, you risk possible contamination of the raw data. This is probably the biggest mistake you can make in managing projects considering the time and money spent collecting them. Another problem with messy project folders is the time wasted on searching for files, especially when you try to figure out Where to proceed after a vacation or switching back from another project. It would be even harder for your collaborator to understand where you are at and where he/she can start working on.

Even when you manage to produce some results and ready to present your work to your colleagues, it would just be the start of a different kind of nightmare. Tables, graphs, presentation slides, and paper drafts constitute an infinite combination of ways that you can present your results. Before long, your paper draft folder would begin to look like this.

Typical draft versions

I call this type of version control “Final_final_draft_V6.3” style, just to make fun of the meaningless version names. Even with dates added, the changes between versions are not clear. What you need is version notes, for example, “run model 0 with county fixed effects, generated figure 1 and table 2; Formatted draft in AJAE style”. I would recommend that you do so even when you are not using Github. It is just that the committing message in Github forces you to write something like the version note. The bigger problem is complete separation in your code and paper draft. Looking at a particular paper draft that you did two years ago, how difficult it is to find the exact code to recreate every single table and graph? Well, the time spent searching through your old code, or even rewriting them all over again, can be saved if you use version control tools. What’s more, if replicating your own work is hard, you can imagine the how your colleagues or students feel after you dump your project folder, saying “let me know if you have any questions.”

To summarize, a systematic way of organizing files, keeping records of changes is vital.

Principles for building an organized project

Before actually going into the nitty-gritty of using RStudio and Github, I want to talk about some general principles that apply to any programming language that you use.

Build good practices instead of taking the time to clean things up

You will NEVER find the time! There will always more urgent and vital tasks than the tedious organizing files, commenting code.

Progress management

It is important always to keep the big picture in mind. Even if you are doing tedious data cleaning work, you are one step closer towards something great.

To form the big picture and keep track on your progress, you should start a project by planning as the project lead. I discussed the importance of preparing in enhancing productivity and preventing procrastination in my other post .

To start, define the project objectives with measurable outcomes and specific deadlines. For example, a presentable paper draft in a year. To prepare for that, a conference presentation in six months, a paper proposal in one month, etc. After defining the big objectives, break them down into smaller tasks and even smaller jobs, then add them to your daily to-do lists. One great tool to keep track of progress and prioritize between projects is the Gantt Chart. Below is an excel template.

Gantt Chart template in Excel

Use “project” to manage your project.

Never set the working directory!

Whenever you open your project, the working directory is automatically set to where your project is. This means your code will not break when you work on a different computer.

Organized layout with relative path

Organzied project folder

This follows directly from the previous point about keeping the code intact when running on a different computer. You can always refer to your raw data as “data/raw/raw_data.csv,” without worrying about the exact location of the files. It also forces you to keep the records where they belong, when you save a processed data or generate a graph. Some minor points:

Separate inputs and outputs (drafts, graphs, tables)

Put raw data and metadata in the data directory, and files generated during cleanup and analysis in a results directory.

+ ggsave("output/graphs/myplot.png")

Separate raw and processed data

raw data should never be touched.

- read.csv(data,"data/raw/market_price.csv")

- write.csv(data,"data/clean/market_coordinates.csv")

Separate cleaning and analysis scripts

So that you don’t have to go through the cleaning process again and start directly from your processed data.

Use version control tools

Unlike the “Final_final_draft_V6.3” style of version control, version control tools is a time capsule for entire projects. Click on the magical button on your version history, the whole project goes back to that state. It would be easy to find what code generated what result. You can reproduce your old result without a sweat.

Version history

R, RStudio, Rmarkdown, Github

  • R: Open source statistical computing software, flexible data structure, multiple purposes.

  • RStudio: an open source development environment for R with a nice GUI.

  • Markdown/ RMarkdown: statistical analysis and results contained in documents

    • Easiest way to formulate math formulas ( Bye ! Latex compiling errors )

    • Formats: PDF, Word, HTML, Beamer

  • Github is a cloud-based repository for version control

Benefits of using RStudio and Github integration

  • A Github repository = R project

  • Easy to work on multiple devices

  • Click to commit and push in RStudio

Getting Started

  • Sign up for Github

Sign up Github Education Pack with your school email to get unlimited private repositories for free (worth $7/month )!

  • Create a repository

Create repository

Git setup

How to manage Project with RStudio

  • Create a local project from your Github repository Clone from Github repo

  • Commit local changes to the online repository, then click push

Commit

Click on the files that you want to upload and ignore the data that you want to keep locally.

Git Push

What if I use Stata?

Well, Stata also has a project environment, and all the principles of project management still apply. However, the integration with Github and markdown is not ideal.

Stata project environment.

As a workaround, I would recommend that you just store your Stata code and files in the same project folder by starting an R project. You can do version control on your Stata code, and Stata generated tables the same way you manage all your other files. I usually rely on R to do data cleaning because its data structure is more flexible and can deal with different types of raw data (spatial shapefiles, JSON, etc.). You can do the rest of the analysis in Stata to run panel regressions, IV regressions, time series analysis if you have more faith in the coefficients estimated by Stata.

Github as a technical CV

As you manage your project through Github, you are potentially building a projected-based CV. Your Github page fully demonstrates your end-to-end project management ability. By taking a look at your Github profile, a recruiter immediately know what programming language you are good at, what projects you have worked on and how popular they are. They can get a sense of how good you are at explaining your work in writing and visualization, how readable your code is and what methods you are familiar with. ( More on this in my other post on writing more readble R code .)

In the academic setting, if you published a paper in a journal, your work is more convincing if your fellow researchers can replicate your work quickly.

Github Profile

Ultimate goal

“Hey, can you send me the code that you did in your paper?”

Your answer would be “sure, go check out my Github page,” instead of “let me spend some time to clean things up.”

Related