Data and Donuts: Data Cleaning with OpenRefine

Post on 09-Jan-2017


Data Cleaning using OpenRefine

C. Tobin Magle, PhD

Nov. 9, 2016, 10:00-11:00 a.m.

Morgan Library Computer Classroom 175

*inspired by content from Data Carpentry

The research cycle

[Diagram labels: Hypothesis, Experimental design, Raw data, Tidy Data, Results, Article, Data Management Plans, Cleaning, Analysis]

Tidy Data

1. Columns as variables

• Don’t combine multiple pieces of info in one column

2. Rows as observations

• One measured value per row

Demo: Clean Survey data

• Find a partner

• Download the data: http://tinyurl.com/zlfoat6

• Open up the data in a spreadsheet program

• Look at 2013 and 2014 tabs: create a new tab and reformat the data into one tidy spreadsheet.

• What columns do we need?

OpenRefine

• Doesn’t modify original

• Tracks changes you made

• Easily reversible

• Complex clustering algorithms

Survey data

• Rows: observations of individual animals

• Columns: Variables that describe the animals

• Species, sex, date, location, etc

• Messy data:

• Misspellings

• White space

• Combined variables

Create a project

• Download the file: http://tinyurl.com/qjjqlby

Preview

Removing Whitespace

• Click the blue triangle to the left of the column header

• Edit cells

• Common transforms

• Remove leading and trailing whitespace
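The same clean-up can be sketched outside OpenRefine for comparison. This is a minimal Python sketch, assuming made-up sample values; like the common transform above, `strip()` touches only leading and trailing whitespace and leaves inner spaces alone.

```python
# Hypothetical values with stray spaces, as they might appear in a messy column.
raw_names = ["  Dipodomys merriami", "Chaetodipus baileyi  ", " Peromyscus eremicus "]

# str.strip() removes leading and trailing whitespace but keeps inner spaces.
clean_names = [name.strip() for name in raw_names]

for before, after in zip(raw_names, clean_names):
    print(repr(before), "->", repr(after))
```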


Faceting

• ScientificName column

• Click the down arrow and select Facet > Text facet

• Look at possible values of the column on the left

• Edit the facets
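Conceptually, a text facet is a count of each distinct value in a column. The sketch below illustrates that idea in Python with made-up values; it is not how OpenRefine implements facets.

```python
from collections import Counter

# Hypothetical column values, including one misspelling and one stray space.
scientific_names = [
    "Dipodomys merriami",
    "Dipodomys merriami",
    "Dipodomys meriami",
    "Chaetodipus baileyi",
    "Chaetodipus baileyi ",
]

# Count how often each distinct value occurs in the column.
facet = Counter(scientific_names)

for value, count in sorted(facet.items()):
    print(f"{count:>3}  {value!r}")
```

Values that differ only by a typo or a trailing space show up as separate entries, which is what makes them easy to spot and edit.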

1

2 3

4

Clustering

Select Clustering Algorithm

Merge and re-cluster

Repeat Merge and Re-cluster

Until there are no more clusters…
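To give a feel for what a clustering method does, here is a minimal Python sketch of a key-collision approach, loosely modeled on the "fingerprint" method described in OpenRefine's documentation. It is a simplification under stated assumptions, not OpenRefine's actual code; the function names and sample values are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Simplified key: lowercase, strip accents and punctuation, sort unique tokens."""
    text = value.strip().lower()
    # Decompose accented characters and drop anything that is not ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate, and sort the tokens.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a key; keep only groups with more than one spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


names = ["Dipodomys merriami", "dipodomys merriami ", "Merriami, Dipodomys", "Chaetodipus baileyi"]
clusters = cluster(names)
print(clusters)
```

A key-collision method like this catches differences in case, spacing, punctuation, and word order. It does not catch misspellings that change letters; those need a distance-based method, which is one reason to try more than one algorithm.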

Split

• Edit Column > Split

• Put space as separator

• Result: new columns
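The effect of splitting on a space can be sketched in Python as follows. The sample values and the new column names (`genus`, `species`) are assumptions for illustration.

```python
# Hypothetical combined values: two pieces of information in one column.
scientific_names = ["Dipodomys merriami", "Chaetodipus baileyi", "Peromyscus eremicus"]

# Split each value on the first space into two new columns.
rows = []
for name in scientific_names:
    genus, species = name.split(" ", 1)
    rows.append({"genus": genus, "species": species})

for row in rows:
    print(row)
```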


Undo/Redo

• All your steps are saved!

• Click where it says Undo / Redo (left frame)

• Click on the step to revert to

• Result: the data change back to that step.

Saving Scripts

• Export the steps for reuse

• In the Undo / Redo section, click Extract

• Select the steps you want to keep

• Save code as .txt file using a text editor
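The extracted steps are JSON text, so a saved script can be inspected before it is reused. The sketch below parses a shortened, hand-written stand-in for such a script; the field values are illustrative assumptions, and a real export contains more fields per step.

```python
import json

# Assumption: a shortened, hand-written stand-in for extracted steps.
# A real export contains more fields per step.
extracted = """
[
  {"op": "core/text-transform", "columnName": "scientificName",
   "expression": "value.trim()", "description": "Trim whitespace"},
  {"op": "core/column-split", "columnName": "scientificName",
   "separator": " ", "description": "Split column on space"}
]
"""

steps = json.loads(extracted)

# List each saved step so you can check what the script will do before reusing it.
summary = [(step["op"], step["columnName"]) for step in steps]
for op, column in summary:
    print(f"{op:22} on column {column}")
```

Because each step names the column it acts on, a script only works on another file if that file uses the same column names.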

Applying Scripts

• Run the same steps on a similar document

• Click Apply

• Paste in the code

Saving and exporting a project

• Autosave feature

• Click 'Export' button (top right)

• Select 'Export project'

• Result: a compressed file that contains

• Data

• Cleaning steps

Importing a project

• Found in the menu where you create/open projects

• Loads data and history

Exporting data

• Go to 'Export' in the top right.

• Click on the file type you want to export the data in.

• 'Tab-separated values'

• 'Comma-separated values'
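The two formats hold the same data and differ only in the character that separates fields. A minimal Python sketch, assuming made-up rows and a hypothetical helper named `to_delimited`:

```python
import csv
import io

# Hypothetical cleaned rows.
header = ["genus", "species", "state"]
rows = [["Dipodomys", "merriami", "Arizona"], ["Peromyscus", "eremicus", "New Mexico"]]


def to_delimited(delimiter: str) -> str:
    """Write the rows to a string using the given field delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_delimited(",")
tsv_text = to_delimited("\t")
print(csv_text)
print(tsv_text)
```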

Subsetting data 2 ways: Facet

• Facet the species column

• Click on a facet

Subsetting data 2 ways: Text filter

• Example: Find all records collected in Hawaii

• Unstructured data: many facets contain “Hawaii”

• Text filter = “Hawaii”
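The difference between the two approaches can be sketched in Python: a facet selects rows by one exact value, while a text filter selects rows that contain a search term anywhere in the cell. The records below are made up, and matching the term case-insensitively is an assumption of this sketch.

```python
# Hypothetical records with a free-text locality field.
records = [
    {"species": "Dipodomys merriami", "locality": "Portal, Arizona"},
    {"species": "Peromyscus eremicus", "locality": "Hawaii Volcanoes National Park"},
    {"species": "Dipodomys merriami", "locality": "Island of Hawaii, near Hilo"},
]

# Facet-style subset: keep rows whose column equals one exact value.
by_facet = [r for r in records if r["species"] == "Dipodomys merriami"]

# Text-filter-style subset: keep rows whose text contains the search term.
by_filter = [r for r in records if "hawaii" in r["locality"].lower()]

print(len(by_facet), "rows match the facet")
print(len(by_filter), "rows match the text filter")
```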

Reshaping data

Wide format = not tidy

Tall format = Tidy

• Both rows and columns are variables

• Column headers are values, not variable names
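Going from wide to tall turns each column header into a value in a new column. A minimal Python sketch, assuming a made-up table with hypothetical column names:

```python
# Hypothetical wide table: one row per day, one column per mouse.
wide = [
    {"day": 1, "mouse_1": 20.1, "mouse_2": 19.8},
    {"day": 2, "mouse_1": 20.3, "mouse_2": 19.9},
]

# Tall format: one row per (day, mouse) pair, with the old header as a value.
tall = [
    {"day": row["day"], "mouse": column, "weight": value}
    for row in wide
    for column, value in row.items()
    if column != "day"
]

for row in tall:
    print(row)
```

Two wide rows with two mouse columns become four tall rows, each holding a single weight.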

Reshaping Lou’s data

Lou is a first-year graduate student working on a project in a biomedical research laboratory. He’s trying to decipher data left by a former postdoc as a start for his thesis project. For one year, the postdoc recorded weight daily and cytokine levels monthly from 16 mice. Half were infected with a parasite, half were treated with saline.

• Variables for weight: Date, mouse number, infection status, value

• Variables for cytokine levels: Date, mouse number, infection status, value

Weight data format

• Mouse # across the top

• Days as rows

• Month in file name

• Infection status in secondary column header

Tidying the mouse weight data

• Download data: http://tinyurl.com/hvna4mg

• Import April_weight.xls

• Transpose cells across columns into rows

• Split the mouse column on a space (“ ”)

• Edit/delete columns

• Export script, use on other spreadsheets

Import: ignore first line

Transpose cells across columns to rows


Split the mouse_number column

• Separate by space

• Result: 3 new columns

Delete/Change Column Names

Facet Treatment Column

Edit Facets

Tidy Data!!

Variables:

• Days

• Mouse_number

• Treatment

• Month in file name

Rows: weight from one mouse on one day

Can add month column and merge with other files in R! (next time?)
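The slide suggests R for this step. To stay in one language, here is the idea sketched in Python instead, under these assumptions: the tidy rows are already loaded into memory, `May_weight.xls` is a hypothetical second file (only `April_weight.xls` is named in the workshop), and the weights are made up.

```python
# Assumption: each file's tidy rows are already loaded; only April_weight.xls
# is named in the workshop, May_weight.xls is a hypothetical second file.
tidy_by_file = {
    "April_weight.xls": [{"day": 1, "mouse_number": 1, "treatment": "infected", "weight": 20.1}],
    "May_weight.xls": [{"day": 1, "mouse_number": 1, "treatment": "infected", "weight": 21.0}],
}

combined = []
for file_name, rows in tidy_by_file.items():
    # The month is the part of the file name before the first underscore.
    month = file_name.split("_", 1)[0]
    for row in rows:
        combined.append({"month": month, **row})

for row in combined:
    print(row)
```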

Need help?

• Email: tobin.magle@colostate.edu

• Data Management Services website: http://lib.colostate.edu/services/data-management

• Data Carpentry: http://www.datacarpentry.org/

• OpenRefine Lesson: http://www.datacarpentry.org/OpenRefine-ecology-lesson/
