Data and Donuts: Data Cleaning with OpenRefine
Data Cleaning using OpenRefine
C. Tobin Magle, PhD
Nov. 9, 2016, 10:00-11:00 a.m.
Morgan Library Computer Classroom 175
*inspired by content from Data Carpentry
The research cycle
[Figure: cycle diagram with the labels Hypothesis, Experimental design, Raw data, Tidy Data, Results, Article, Data Management Plans, Cleaning, Analysis]
Tidy Data
1. Columns as variables
• Don’t combine multiple pieces of info in one column
2. Rows as observations
• One measured value
Demo: Clean Survey data
• Find a partner
• Download the data: http://tinyurl.com/zlfoat6
• Open up the data in a spreadsheet program
• Look at 2013 and 2014 tabs: create a new tab and reformat the data into one tidy spreadsheet.
• What columns do we need?
Open Refine
• Doesn’t modify original
• Tracks changes you made
• Easily reversible
• Complex clustering algorithms
Survey data
• Rows: observations of individual animals
• Columns: Variables that describe the animals
• Species, sex, date, location, etc
• Messy Data
• Misspellings
• White space
• Combined variables
Create a project
• Download the file: http://tinyurl.com/qjjqlby
Preview
Removing Whitespace
• Click the blue triangle to the left of the column header
• Edit cells
• Common transforms
• Remove leading and trailing whitespace
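For readers who want to see what this transform does to each cell, here is a minimal Python sketch of the same idea. It is not OpenRefine's own code, and the sample values are hypothetical stand-ins for the survey column.

```python
# Hypothetical sample values; the real column comes from the survey file.
scientific_names = [
    "  Ammospermophilus harrisi",
    "Dipodomys merriami  ",
    "Dipodomys merriami",
]


def trim_cells(cells):
    """Return a copy of the cells with leading and trailing whitespace removed."""
    return [cell.strip() for cell in cells]


# The original list is left untouched, much as OpenRefine leaves the source file alone.
cleaned = trim_cells(scientific_names)
print(cleaned)
```

After trimming, the two spellings of "Dipodomys merriami" that differed only by trailing spaces become identical values.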
Faceting
• ScientificName column
• Click down arrow: select text facets
• Look at possible values of the column on the left
• Edit the facets
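A text facet is essentially a count of each distinct value in a column, and editing a facet rewrites every cell that holds that value. The sketch below illustrates this in Python under the assumption that the column is a plain list of strings; the sample names and the helper functions `text_facet` and `edit_facet` are hypothetical.

```python
from collections import Counter

# Hypothetical sample of the ScientificName column, including one misspelling.
names = [
    "Dipodomys merriami",
    "Dipodomys merriami",
    "Dipodomys meriami",
    "Chaetodipus baileyi",
]


def text_facet(cells):
    """List each distinct value with its count, most frequent first."""
    return Counter(cells).most_common()


def edit_facet(cells, old, new):
    """Replace every cell equal to `old` with `new`, as editing a facet does."""
    return [new if cell == old else cell for cell in cells]


for value, count in text_facet(names):
    print(count, value)

fixed = edit_facet(names, "Dipodomys meriami", "Dipodomys merriami")
```

Rare values near the bottom of the facet list are often the misspellings worth checking first.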
Clustering
Select Clustering Algorithm
Merge and re-cluster
Repeat Merge and Re-cluster
Until there are no more clusters…
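To give a feel for how a clustering method can group near-duplicate values, here is a simplified Python sketch of a key-collision approach: each value is reduced to a normalized key, and values that share a key form a cluster. This is an illustration only, not OpenRefine's exact algorithm, and the sample values are hypothetical.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a value: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(cells):
    """Group distinct values that share a fingerprint; keep groups of two or more."""
    groups = defaultdict(set)
    for cell in cells:
        groups[fingerprint(cell)].add(cell)
    return [sorted(variants) for variants in groups.values() if len(variants) > 1]


def merge(cells, variants, chosen):
    """Rewrite every variant in a cluster to the chosen spelling."""
    variants = set(variants)
    return [chosen if cell in variants else cell for cell in cells]


# Hypothetical values that differ only in case and punctuation.
names = ["Dipodomys merriami", "dipodomys merriami", "Dipodomys merriami.", "Chaetodipus baileyi"]
clusters = cluster(names)
merged = merge(names, clusters[0], "Dipodomys merriami")
remaining = cluster(merged)  # empty once there are no more clusters
```

A key of this kind catches differences in case, punctuation, and word order, but not true misspellings such as a dropped letter; those need a different method, which is one reason to try more than one algorithm before stopping.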
Split
• Edit Column > Split
• Put space as separator
• Result: new columns
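The split step can be pictured as follows in Python. The sketch assumes each row is a dictionary and that the new columns are named after the original column plus a number; the sample rows and the helper `split_column` are hypothetical.

```python
# Hypothetical rows; only the column being split is shown.
rows = [
    {"ScientificName": "Dipodomys merriami"},
    {"ScientificName": "Chaetodipus baileyi"},
]


def split_column(rows, column, separator=" "):
    """Return new rows with numbered columns holding each piece of the split value."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        for index, part in enumerate(row[column].split(separator), start=1):
            new_row[f"{column} {index}"] = part
        result.append(new_row)
    return result


split_rows = split_column(rows, "ScientificName")
print(split_rows[0])
```

Splitting a combined value such as a genus and species name into two columns is what makes each column hold a single variable.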
Undo/Redo
• All your steps are saved!
• Click where it says Undo / Redo (left frame)
• Click on the step to revert to
• Result: the data revert to that step
Saving Scripts
• Export the steps for reuse
• In the Undo / Redo section, click Extract
• Select the steps you want to keep
• Save code as .txt file using a text editor
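The extracted steps are saved as JSON text, so the saved file can be inspected before it is reused. The Python sketch below assumes each step is an object with `op` and `description` fields; the sample text is illustrative only and is not copied from a real project, so check the field names against your own extracted file.

```python
import json

# Illustrative text in the style of an extracted history (assumed structure).
extracted = """
[
  {"op": "core/text-transform",
   "description": "Text transform on cells in column scientificName"},
  {"op": "core/column-split",
   "description": "Split column scientificName by separator"}
]
"""


def summarize_steps(script_text):
    """Return one short label per saved step, falling back to the op name."""
    steps = json.loads(script_text)
    return [step.get("description", step.get("op", "unknown step")) for step in steps]


for number, label in enumerate(summarize_steps(extracted), start=1):
    print(number, label)
```

Reading the labels is a quick way to confirm that the saved file contains only the steps you meant to keep.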
Applying Scripts
• Run the same steps on a similar document
• Click Apply
• Paste in the code
Saving and exporting a project
• Autosave feature
• Click 'Export' button (top right)
• Select 'Export project'
• Result: a compressed file that contains
• Data
• Cleaning steps
Importing a project
• Found in the menu where you create/open projects
• Loads data and history
Exporting data
• Go to 'Export' in the top right.
• Click on the file type you want to export the data in.
• 'Tab-separated values'
• 'Comma-separated values'
Subsetting data 2 ways: Facet
• Facet the species column
• Click on a facet
Subsetting data 2 ways: Text filter
• Example: Find all records collected in Hawaii
• Unstructured data: many facets contain “Hawaii”
• Text filter = “Hawaii”
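The difference between the two approaches can be sketched in Python: a facet selects rows whose value matches exactly, while a text filter selects rows whose value contains the search text anywhere. The records, the column names `species` and `locality`, and the case-insensitive matching are assumptions made for illustration.

```python
# Hypothetical records; "locality" stands in for an unstructured location column.
records = [
    {"species": "Dipodomys merriami", "locality": "Hawaii, Oahu"},
    {"species": "Chaetodipus baileyi", "locality": "near Hilo, HAWAII"},
    {"species": "Dipodomys merriami", "locality": "Arizona"},
]


def facet_subset(rows, column, value):
    """Keep rows whose cell equals the selected facet value exactly."""
    return [row for row in rows if row[column] == value]


def text_filter(rows, column, text):
    """Keep rows whose cell contains the text anywhere, ignoring case."""
    needle = text.lower()
    return [row for row in rows if needle in row[column].lower()]


by_species = facet_subset(records, "species", "Dipodomys merriami")
in_hawaii = text_filter(records, "locality", "Hawaii")
exact_hawaii = facet_subset(records, "locality", "Hawaii")  # no exact match
```

Here the exact match on "Hawaii" finds nothing because every locality value carries extra text, which is the situation where a text filter is the better tool.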
Reshaping data
Wide format = not tidy
• Both rows and columns are variables
• Column headers are values, not variable names
Tall format = Tidy
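As a rough illustration of the reshaping idea, the Python sketch below turns a wide table, where the column headers are years, into a tall one with a single observation per row. The table contents, the column names, and the helper `wide_to_tall` are hypothetical.

```python
# Hypothetical wide table: the headers "2013" and "2014" are values, not variable names.
wide = [
    {"species": "Dipodomys merriami", "2013": 12, "2014": 9},
    {"species": "Chaetodipus baileyi", "2013": 4, "2014": 7},
]


def wide_to_tall(rows, id_column, variable_name, value_name):
    """Emit one row per (id, former column header) pair."""
    tall = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row
            tall.append({id_column: row[id_column], variable_name: column, value_name: value})
    return tall


tall = wide_to_tall(wide, "species", "year", "count")
for row in tall:
    print(row)
```

Two wide rows with two value columns become four tall rows, each with the same three variables.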
Reshaping Lou’s data
Lou is a first-year graduate student working on a project in a biomedical research laboratory. He’s trying to decipher data left by a former postdoc as a starting point for his thesis project. For one year, the postdoc recorded weight daily and cytokine levels monthly from 16 mice. Half were infected with a parasite; half were treated with saline.
• Variables for weight: Date, mouse number, infection status, value
• Variables for cytokine levels: Date, mouse number, infection status, value
Weight data format
• Mouse # across the top
• Days as rows
• Month in file name
• Infection status in secondary column header
Tidying the mouse weight data
• Download data: http://tinyurl.com/hvna4mg
• Import April_weight.xls
• Transpose cells across columns into rows
• Split the mouse column on a space (“ ”)
• Edit/delete columns
• Export script, use on other spreadsheets
Import: ignore first line
Transpose cells across columns to rows
Split the mouse_number column
• Separate by space
• Result: 3 new columns
Delete/Change Column Names
Facet Treatment Column
Edit Facets
Tidy Data!!
Variables:
• Days
• Mouse_number
• Treatment
• Month in file name
Rows: weight from one mouse on one day
Can add month column and merge with other files in R! (next time?)
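The slide suggests doing this step in R. Purely as an illustration of the idea, here is a Python sketch that reads the month from the file name, adds it as a column, and stacks the tables. It assumes file names follow the pattern of April_weight.xls; the second file name, the `Weight` column, and all values are hypothetical.

```python
from pathlib import Path


def month_from_filename(filename):
    """Take the text before the first underscore, e.g. 'April' from 'April_weight.xls'."""
    return Path(filename).stem.split("_")[0]


def add_month(rows, filename):
    """Return copies of the rows with a Month column taken from the file name."""
    month = month_from_filename(filename)
    return [{**row, "Month": month} for row in rows]


def combine(tables_by_file):
    """Stack the tidy tables from several files into one list of rows."""
    combined = []
    for filename, rows in tables_by_file.items():
        combined.extend(add_month(rows, filename))
    return combined


# Hypothetical tidy rows, one weight from one mouse on one day.
april = [{"Days": 1, "Mouse_number": 1, "Treatment": "infected", "Weight": 20.1}]
may = [{"Days": 1, "Mouse_number": 1, "Treatment": "infected", "Weight": 21.4}]
all_rows = combine({"April_weight.xls": april, "May_weight.xls": may})
```

Once the month is a column rather than part of a file name, every variable for an observation lives in the row itself.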
Need help?
• Email: [email protected]
• Data Management Services website: http://lib.colostate.edu/services/data-management
• Data Carpentry: http://www.datacarpentry.org/
• OpenRefine Lesson: http://www.datacarpentry.org/OpenRefine-ecology-lesson/