Reproducible Quantitative Methods
Instructor Guide, Lesson 5
Topics and Resources
- Introduction to R
Start by working with your students to import data into RStudio, and show them how by live-coding. For reproducibility, it's best to download the data directly from the web, so that anyone with an internet connection can run your analyses from your script without having to point the code at a file saved somewhere on their own computer. For this reason, a good way to prepare for this lesson is to upload the course data to figshare or another web-based data repository. Students often find that simply getting the data into R is the hardest part, so once everyone has this working, make a big deal of it!
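As a starting point for the live-coding session, the import step might look like the sketch below. The URL and the helper name `read_course_data` are placeholders invented for illustration; substitute the direct-download link for your own course data. It assumes the data are a plain CSV file.

```r
# Hypothetical URL: replace with the direct-download link for your course data.
data_url <- "https://example.org/course-data/survey.csv"

# Read a CSV into a data frame. read.csv() accepts a URL as well as a
# local file path, so the same call works for both.
read_course_data <- function(source) {
  read.csv(source, stringsAsFactors = FALSE)
}

# In class, run these lines one at a time and talk through the output:
# survey <- read_course_data(data_url)
# str(survey)    # check column names and types
# head(survey)   # look at the first few rows
```

Because the script reads from the URL rather than a local path, a student can run it unchanged on their own machine.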
ProTip
A helpful hint from those that came before
Liven it up. Live coding is a teaching technique that is exactly what it sounds like: you code directly in front of your students. It works very well with novice coders because it humanizes the process. You can't 'just' do anything; you have to show all the steps. It slows you down, it shows you making mistakes, and it gives students an opportunity to suggest next steps. We recommend using this approach for the majority of your lessons involving coding.
- Data Cleaning in R
Once you have your data in R, the first thing you're going to want to do is make sure it's clean and behaving as you'd expect. For the issues with your data that you identified last week, script ways to correct them using R. What you do here depends heavily on the issues in your data, but the idea is to show the students that most issues you'd correct with OpenRefine can also be corrected in a scripted, reproducible way, directly fixing errors from the original data source. There are plenty of opportunities to use gsub to correct typos and similar errors. Here's a resource on the variety of things you can do this way. Here's some inspiration. Scripted data cleaning: it CAN be done!!
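A minimal sketch of scripted cleaning with gsub follows. The toy data frame, its column names, and the specific typos are invented for illustration; your own data will need different corrections.

```r
# Toy data standing in for the raw download (invented for this example).
raw <- data.frame(
  site  = c("Lake A", "lake a ", "Lake  B", "Lkae B"),
  count = c("12", "7", "n/a", "3"),
  stringsAsFactors = FALSE
)

clean <- raw                                        # leave the raw data untouched
clean$site <- trimws(clean$site)                    # drop leading/trailing spaces
clean$site <- gsub(" +", " ", clean$site)           # collapse repeated spaces
clean$site <- gsub("^lake", "Lake", clean$site)     # fix capitalisation of "lake"
clean$site <- gsub("Lkae", "Lake", clean$site, fixed = TRUE)  # fix a known typo
clean$site <- gsub(" a$", " A", clean$site)         # fix lower-case site letter

clean$count[clean$count == "n/a"] <- NA             # recode text "n/a" as missing
clean$count <- as.integer(clean$count)              # counts are now numeric

unique(clean$site)   # check which site names remain
```

Every correction is a line in the script, so the original file never has to be edited by hand and anyone can see exactly what was changed.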
ProTip
A helpful hint from those that came before
Motivation in the script. Students may be hesitant to take this approach with 'one-off' data, and that's understandable. It's up to you to show them key places where it matters. For example, if you're downloading weather station data directly for an analysis, these files are typically updated continuously, and the same things will need to be corrected every time the data are downloaded.
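One way to make this point in class is to wrap the corrections in a function and apply it to two successive downloads of the same file. The sketch below is illustrative only: the function name `clean_weather`, the column names, and the `-9999` missing-value code are assumptions, not features of any particular weather service.

```r
# Hypothetical cleaning step for a weather-station file.
clean_weather <- function(df) {
  # Replace underscores with spaces and trim stray whitespace in station names.
  df$station <- trimws(gsub("_", " ", df$station, fixed = TRUE))
  # Recode the assumed sentinel value -9999 as a true missing value.
  df$temp_c[df$temp_c == -9999] <- NA
  df
}

# Two toy "downloads" of the same continuously updated file.
download_monday <- data.frame(station = "North_Field ",
                              temp_c = -9999,
                              stringsAsFactors = FALSE)
download_friday <- data.frame(station = c("North_Field ", "North_Field "),
                              temp_c = c(-9999, 4.5),
                              stringsAsFactors = FALSE)

# The same scripted corrections apply to both, with no manual editing.
monday <- clean_weather(download_monday)
friday <- clean_weather(download_friday)
```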
- Version Control
Finally, when you're setting up RStudio, make sure that all students are on board with version control. An easy way to do version control with GitHub is through the RStudio interface. Here are some tips for setting that up.
Exercises
- GitHub and R set-up
Follow the tutorial linked above to set up GitHub in RStudio on all students' computers. Have every student attempt to create a project using version control through R. You *will* hit bugs.
ProTip
A helpful hint from those that came before
It's not a bug, it's a feature. Believe it or not, bugs help students learn about problem solving in computational environments. Support them through the debugging, but let them lead as much as possible. Common problems include firewall issues, operating system differences, git config problems, and git files installing in surprising places.
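When git seems to have installed somewhere unexpected, a quick first check is to ask R where (or whether) it can find the git executable. This is only a small diagnostic sketch, not a full troubleshooting procedure; it uses base R's `Sys.which()`, which returns an empty string when the program is not found.

```r
# Look for the git executable on the PATH that R can see.
git_path <- Sys.which("git")

if (nzchar(git_path)) {
  message("git found at: ", git_path)
} else {
  message("git is not on the PATH that R can see")
}
```

If nothing is found even though git is installed, the likely culprit is the PATH or the install location rather than git itself.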
Discussion
Reproducibility and replication
What makes a scientific experiment repeatable? Reproducible? Replicable? Explore these resources with your students:
• Replication frustration: what stops experiments being reliably repeated? - NB: Although this article presents some really interesting results from a cool project, it uses the terms reproducibility and replication interchangeably. WE DO NOT AGREE WITH THIS! See the article below.
• Reproducibility vs Replication - Note: The definitions here talk about people going out and replicating and reproducing, whereas we are more interested in the ability to replicate and reproduce.
Video
Scripts for Reproducible Research in R (R Tutorial 1.9) (6:21)
Questions
Why do we care so much about reproducibility?
How does scripting an analysis improve reproducibility?