Reproducible Quantitative Methods

Instructor Guide, Lesson 4

Cleaning up messy data / Identifying 'grey' data sources


Topics and Resources

  1. Introduction to data cleaning

    OpenRefine is a graphical tool for data cleaning. Data Carpentry has a tutorial that walks you through using OpenRefine on a messy dataset. We recommend working through it yourself in advance of the class, and consider demonstrating it on your own dataset when you present. Invite the students to explore OpenRefine on their own.

      • Hadley Wickham's Tidy data sets out the principles of "tidy" datasets and offers instruction for how to clean them in R.

      • The Quartz guide to bad data is a great resource for understanding the many, many ways data can go wrong, and offers readers suggestions for how to fix each problem.
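
    If you want a quick illustration of what "tidy" means before pointing students to the paper, the sketch below may help. It reshapes a small "wide" table (one column per year) into the tidy form, where each variable is a column and each observation is a row. The paper works in R; this is plain Python, and the table, the column names, and the to_tidy helper are all hypothetical, made up for the demonstration.

```python
# Hypothetical "wide" table: one row per site, one column per survey year.
# The values are invented for illustration only.
wide_rows = [
    {"site": "A", "2019": 12, "2020": 15},
    {"site": "B", "2019": 7, "2020": 9},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Reshape wide rows so that each output row holds one observation."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row
            tidy.append(
                {
                    id_column: row[id_column],
                    variable_name: column,  # the old column header becomes a value
                    value_name: value,
                }
            )
    return tidy


tidy_rows = to_tidy(wide_rows, "site", "year", "count")
for row in tidy_rows:
    print(row)
```

    Each site-year pair ends up on its own row, so the year becomes a value students can filter or group on rather than a column header.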


    A helpful hint from those who came before

    Yes, you may. The Quartz Guide is great because it clearly delineates where students will need to go back to the data creator or consult an expert. Sometimes a student needs ‘permission’ to ask for help, and this guide gives clear scenarios where they should.

  2. Grey data liberation

    Data is all around us, and we generate it constantly as we go about our daily business. Sometimes you need to think a little harder before you realize that something is actually data; for example, a lot of information is produced simply by people uploading photos to the internet. There is also a lot of classic research data (and literature) that never sees the light of day because its producers, for whatever reason, have not published it through traditional academic channels. You can find out a lot of cool things if you're willing to dig a bit for your data. One goal of this course, in addition to teaching students the skills they need to make their own work reproducible, is to provide a mechanism for liberating grey data.


  3. Personas

    Do a “persona” exercise designed to help students understand the goals and motivations of potential data donors. A persona is an imaginary data producer, and our goal is to get that person to share their data. Personas are used in business and software development to help designers understand and empathize with their users. Based on real-world observations of actual potential or current donors, sketch out this persona and identify their motivations.


Grey data liberation

Refer back to Simon Leather’s post on unused data that needs love.


Grey literature: ask your students how this concept applies to data.


Why is it desirable to see grey data published?

Why might a data producer not publish the data they've produced?

What sort of questions can we ask of grey data?

How do we convince producers of grey data to work with us?
