Data and Donuts: Data cleaning with OpenRefine

Data Cleaning using OpenRefine C. Tobin Magle, PhD Nov. 9, 2016 10:00-11:00 a.m. Morgan Library Computer Classroom 175 *inspired by content from Data Carpentry

Upload: c-tobin-magle

Post on 09-Jan-2017


TRANSCRIPT

Page 1: Data and Donuts: Data cleaning with OpenRefine

Data Cleaning using OpenRefine

C. Tobin Magle, PhD
Nov. 9, 2016, 10:00-11:00 a.m.
Morgan Library Computer Classroom 175

*inspired by content from Data Carpentry

Page 2: Data and Donuts: Data cleaning with OpenRefine

The research cycle

[Figure: research cycle diagram with the labels Hypothesis, Experimental design, Raw data, Tidy Data, Results, Article, Data Management Plans, Cleaning, Analysis]

Page 3: Data and Donuts: Data cleaning with OpenRefine

Tidy Data

1. Columns as variables

• Don’t combine multiple pieces of info in one column

2. Rows as observations

• One measured value

Page 4: Data and Donuts: Data cleaning with OpenRefine

Demo: Clean Survey data

• Find a partner

• Download the data: http://tinyurl.com/zlfoat6

• Open up the data in a spreadsheet program

• Look at 2013 and 2014 tabs: create a new tab and reformat the data into one tidy spreadsheet.

• What columns do we need?

Page 5: Data and Donuts: Data cleaning with OpenRefine

OpenRefine

• Doesn’t modify original

• Tracks changes you made

• Easily reversible

• Complex clustering algorithms

Page 6: Data and Donuts: Data cleaning with OpenRefine

Survey data

• Rows: observations of individual animals

• Columns: Variables that describe the animals

• Species, sex, date, location, etc.

• Messy data

• Misspellings

• White space

• Combined variables

Page 7: Data and Donuts: Data cleaning with OpenRefine

Create a project

• Download the file: http://tinyurl.com/qjjqlby

Page 8: Data and Donuts: Data cleaning with OpenRefine

Preview

Page 9: Data and Donuts: Data cleaning with OpenRefine

Removing Whitespace

• Click the blue triangle to the left of the column header

• Edit cells

• Common transforms

• Remove leading and trailing whitespace
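
As a rough illustration of what this transform does to each cell, here is a minimal Python sketch. The cell values are made up for the example, and `trim` is a hypothetical helper name. In OpenRefine itself, this menu item appears in the history as a transform using the expression `value.trim()`.

```python
# Hypothetical cell values, as they might appear in a messy column.
cells = [
    "  Dipodomys merriami",
    "Dipodomys merriami  ",
    "Dipodomys  merriami",  # extra space in the middle
]


def trim(value: str) -> str:
    """Remove leading and trailing whitespace; inner spacing is left alone."""
    return value.strip()


trimmed = [trim(c) for c in cells]
for before, after in zip(cells, trimmed):
    print(repr(before), "->", repr(after))
```

Note that the third value keeps its double space: trimming only touches the ends of the cell, so inner spacing needs a separate transform.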


Page 10: Data and Donuts: Data cleaning with OpenRefine

Faceting

• ScientificName column

• Click down arrow: select text facets

• Look at possible values of the column on the left

• Edit the facets
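
Conceptually, a text facet is a count of each distinct value in a column, and editing a facet rewrites every cell that holds that value. The sketch below mimics that idea in plain Python. The species names are sample values chosen for illustration, and `text_facet` and `edit_facet` are hypothetical helper names, not OpenRefine functions.

```python
from collections import Counter

# Made-up column contents; the third entry carries a misspelling.
names = [
    "Dipodomys merriami",
    "Dipodomys merriami",
    "Dipodomys merriamii",
    "Chaetodipus penicillatus",
]


def text_facet(values):
    """Count how many cells hold each distinct value."""
    return Counter(values)


def edit_facet(values, old, new):
    """Rewrite every cell equal to `old`, as editing a facet entry would."""
    return [new if v == old else v for v in values]


facet = text_facet(names)
for value, count in sorted(facet.items()):
    print(f"{count:>3}  {value}")

fixed = edit_facet(names, "Dipodomys merriamii", "Dipodomys merriami")
fixed_facet = text_facet(fixed)
```

After the edit, the misspelled entry disappears from the facet and its rows are counted under the corrected name.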


Page 11: Data and Donuts: Data cleaning with OpenRefine

Clustering

Page 12: Data and Donuts: Data cleaning with OpenRefine

Select Clustering Algorithm

Page 13: Data and Donuts: Data cleaning with OpenRefine

Merge and re-cluster

Page 14: Data and Donuts: Data cleaning with OpenRefine

Repeat Merge and Re-cluster

Until there are no more clusters…
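
To give a feel for what a clustering algorithm does, here is a simplified Python sketch of key-collision clustering using a "fingerprint" key (lowercase, strip punctuation, sort the unique words). It is an approximation written for this example, not OpenRefine's actual implementation, and the sample values and function names are assumptions.

```python
import string
import unicodedata
from collections import Counter, defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key so near-duplicate spellings collide."""
    s = value.strip().lower()
    # Fold accented characters down to plain ASCII.
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    # Drop punctuation, then sort the unique tokens.
    s = s.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(s.split())))


def cluster(values):
    """Group distinct values that share a fingerprint (2+ variants only)."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return [sorted(g) for g in groups.values() if len(g) > 1]


def merge(values, members, chosen):
    """Replace every variant in a cluster with the chosen spelling."""
    members = set(members)
    return [chosen if v in members else v for v in values]


# Made-up column contents.
cleaned = [
    "Dipodomys merriami",
    "dipodomys merriami",
    "Merriami, Dipodomys",
    "Chaetodipus penicillatus",
]

# Merge and re-cluster until there are no more clusters.
clusters = cluster(cleaned)
while clusters:
    for members in clusters:
        counts = Counter(v for v in cleaned if v in members)
        chosen = counts.most_common(1)[0][0]
        cleaned = merge(cleaned, members, chosen)
    clusters = cluster(cleaned)

print(sorted(set(cleaned)))
```

In the tool you choose the merged spelling yourself; the sketch simply picks the most frequent variant so the loop can run unattended.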

Page 15: Data and Donuts: Data cleaning with OpenRefine

Split

• Edit Column > Split

• Put space as separator

• Result: new columns
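
The split step can be pictured as follows. This is a small Python sketch with made-up rows. The numbered column names ("scientificName 1", "scientificName 2") imitate how OpenRefine labels the new columns, and `split_column` is a hypothetical helper.

```python
# Made-up rows with genus and species combined in one column.
rows = [
    {"scientificName": "Dipodomys merriami"},
    {"scientificName": "Chaetodipus penicillatus"},
]


def split_column(rows, column, separator=" "):
    """Split one column into numbered new columns, one per piece."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay unchanged
        for i, part in enumerate(row[column].split(separator), start=1):
            new_row[f"{column} {i}"] = part
        result.append(new_row)
    return result


split_rows = split_column(rows, "scientificName")
for row in split_rows:
    print(row)
```

This sketch keeps the original column alongside the new ones; whether the tool keeps or removes it depends on the options chosen in the split dialog.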


Page 16: Data and Donuts: Data cleaning with OpenRefine

Undo/Redo

• All your steps are saved!

• Click where it says Undo / Redo (left frame)

• Click on the step to revert to

• Result: the data changes back to its state at that step.

Page 17: Data and Donuts: Data cleaning with OpenRefine

Saving Scripts

• Export the steps for reuse

• In the Undo / Redo section, click Extract

• Select the steps you want to keep

• Save code as .txt file using a text editor

Page 18: Data and Donuts: Data cleaning with OpenRefine

Applying Scripts

• Run the same steps on a similar document

• Click Apply

• Paste in the code
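
The extracted steps are JSON text, so the saved .txt file can be inspected before it is pasted into another project. The sketch below parses a shortened, illustrative history and lists its steps. The two entries are examples written for this purpose: a real export contains more fields per step, and the exact field set shown here is an assumption.

```python
import json

# Illustrative, shortened history. A real extract has more fields per step.
extracted = """
[
  {
    "op": "core/text-transform",
    "description": "Trim whitespace in column scientificName",
    "columnName": "scientificName",
    "expression": "value.trim()"
  },
  {
    "op": "core/column-split",
    "description": "Split column scientificName by separator",
    "columnName": "scientificName",
    "separator": " "
  }
]
"""

steps = json.loads(extracted)
for number, step in enumerate(steps, start=1):
    print(number, step["op"], "-", step["description"])
```

Because each step names the column it acts on, the script only runs cleanly on a document whose column names match.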

Page 19: Data and Donuts: Data cleaning with OpenRefine

Saving and exporting a project

• Autosave feature

• Click 'Export' button (top right)

• Select 'Export project'

• Result: a compressed file that contains

• Data

• Cleaning steps

Page 20: Data and Donuts: Data cleaning with OpenRefine

Importing a project

• Found in the menu where you create/open projects

• Loads data and history

Page 21: Data and Donuts: Data cleaning with OpenRefine

Exporting data

• Go to 'Export' in the top right.

• Click on the file type you want to export the data in.

• 'Tab-separated values'

• 'Comma-separated values'
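
The two export formats differ only in the character that separates the columns. A minimal Python sketch, using made-up rows and a hypothetical `export` helper, shows the same table written both ways:

```python
import csv
import io

# Made-up cleaned rows.
rows = [
    {"genus": "Dipodomys", "species": "merriami"},
    {"genus": "Chaetodipus", "species": "penicillatus"},
]


def export(rows, delimiter):
    """Write rows as delimited text and return it as a string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0].keys()),
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = export(rows, "\t")  # tab-separated values
csv_text = export(rows, ",")   # comma-separated values
print(tsv_text)
print(csv_text)
```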

Page 22: Data and Donuts: Data cleaning with OpenRefine

Subsetting data 2 ways: Facet

• Facet the species column

• Click on a facet

Page 23: Data and Donuts: Data cleaning with OpenRefine

Subsetting 2 ways: Text filter

• Example: Find all records collected in Hawaii

• Unstructured data: many facets contain “Hawaii”

• Text filter = “Hawaii”
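
The difference between the two approaches can be sketched in Python: selecting a facet keeps rows whose value matches exactly, while a text filter keeps rows whose value contains the search text. The records, column name, and helper functions below are invented for illustration, and the case-insensitive matching is an assumption about the filter's default behavior.

```python
# Made-up records with an unstructured locality column.
records = [
    {"id": 1, "locality": "Hawaii"},
    {"id": 2, "locality": "Oahu, Hawaii"},
    {"id": 3, "locality": "HAWAII: Maui"},
    {"id": 4, "locality": "Arizona"},
]


def select_facet(records, column, value):
    """Keep rows whose cell matches one facet value exactly."""
    return [r for r in records if r[column] == value]


def text_filter(records, column, text):
    """Keep rows whose cell contains the text (case-insensitive here)."""
    needle = text.lower()
    return [r for r in records if needle in r[column].lower()]


by_facet = select_facet(records, "locality", "Hawaii")
by_filter = text_filter(records, "locality", "Hawaii")
print(len(by_facet), "row(s) from the facet")
print(len(by_filter), "row(s) from the text filter")
```

With free-text columns like this one, the facet finds only the exact spelling, which is why the text filter suits the Hawaii example better.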

Page 24: Data and Donuts: Data cleaning with OpenRefine

Reshaping data

Wide format = not tidy

• Both rows and columns are variables

• Column headers are values, not variable names

Tall format = tidy

Page 25: Data and Donuts: Data cleaning with OpenRefine

Reshaping Lou’s data

Lou is a first-year graduate student working on a project in a biomedical research laboratory. He’s trying to decipher data left by a former postdoc as a start for his thesis project. For one year, the postdoc recorded weight daily and cytokine levels monthly from 16 mice. Half were infected with a parasite; half were treated with saline.

• Variables for weight: Date, mouse number, infection status, value

• Variables for cytokine levels: Date, mouse number, infection status, value

Page 26: Data and Donuts: Data cleaning with OpenRefine

Weight data format

• Mouse # across the top

• Days as rows

• Month in file name

• Infection status in secondary column header

Page 27: Data and Donuts: Data cleaning with OpenRefine

Tidying the mouse weight data

• Download data: http://tinyurl.com/hvna4mg

• Import April_weight.xls

• Transpose cells across columns into rows

• Split the mouse column on " " (a single space)

• Edit/delete columns

• Export script, use on other spreadsheets
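
The transpose and split steps above amount to a wide-to-tall reshape. Here is a minimal Python sketch of that idea. The weights and the column labels (such as "mouse 1 infected") are invented, since the real headers in April_weight.xls are not shown here, and `wide_to_tall` and `split_key` are hypothetical helpers.

```python
# Made-up wide table: one column per mouse, one row per day.
wide = [
    {"Day": "1", "mouse 1 infected": "20.1", "mouse 2 saline": "19.8"},
    {"Day": "2", "mouse 1 infected": "19.9", "mouse 2 saline": "20.0"},
]


def wide_to_tall(rows, id_column, key_name, value_name):
    """Turn every non-id column into its own row (key plus value)."""
    tall = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tall.append(
                {id_column: row[id_column], key_name: column, value_name: value}
            )
    return tall


def split_key(rows, column, names, separator=" "):
    """Split a combined column into named columns and drop the original."""
    result = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        new_row.update(zip(names, row[column].split(separator)))
        result.append(new_row)
    return result


tall = wide_to_tall(wide, "Day", "mouse", "weight")
tidy = split_key(tall, "mouse", ["label", "Mouse_number", "Treatment"])
for row in tidy:
    print(row)
```

Each output row now holds one weight from one mouse on one day; the leftover "label" column corresponds to the kind of column removed in the edit/delete step.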

Page 28: Data and Donuts: Data cleaning with OpenRefine

Import: ignore first line

Page 29: Data and Donuts: Data cleaning with OpenRefine

Transpose cells across columns to rows


Page 30: Data and Donuts: Data cleaning with OpenRefine

Split the mouse_number column

• Separate by space

• Result: 3 new columns

Page 31: Data and Donuts: Data cleaning with OpenRefine

Delete/Change Column Names

Page 32: Data and Donuts: Data cleaning with OpenRefine

Facet Treatment Column

Page 33: Data and Donuts: Data cleaning with OpenRefine

Edit Facets

Page 34: Data and Donuts: Data cleaning with OpenRefine

Tidy Data!!

Variables:

• Days

• Mouse_number

• Treatment

• Month in file name

Rows: weight from one mouse on one day

Can add month column and merge with other files in R! (next time?)

Page 35: Data and Donuts: Data cleaning with OpenRefine

Need help?

• Email: [email protected]

• Data Management Services website: http://lib.colostate.edu/services/data-management

• Data Carpentry: http://www.datacarpentry.org/

• OpenRefine Lesson: http://www.datacarpentry.org/OpenRefine-ecology-lesson/