A Quick Tour of OpenRefine

Wrangling Data with OpenRefine, Tony Hirst, Computing and Communications, The Open University, @psychemedia

Upload: tony-hirst

Post on 14-Jul-2015


TRANSCRIPT

Wrangling Data with OpenRefine

Tony Hirst

Computing and Communications

The Open University

@psychemedia

“It’s … a great joy to learn a technique, because as soon as you learn it, you start thinking in it. When I learn a new technique my imaginative possibilities have expanded.”

Grayson Perry, Playing to the Gallery

Real data is often dirty and messy

Loading, Checking, Exploring, Cleaning

Reshaping, Annotating

Saving

This is a hands-on workshop, so fire up your laptops…

Loading – import source

Loading – file type

Loading – encoding

Exploring – text facets

Exploring – text facets

Exploring – text facets

Saving – customisation

Cleaning – text facets

Checking – no blanks

Cleaning – tidying columns

Cleaning – tidying numbers

Cleaning – tidying numbers

value.replace('£', '').replace(',', '')
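For readers who want to try the same tidy-up outside OpenRefine, here is a minimal Python sketch of what the expression does: strip the pound sign and the thousands separators so the text can be read as a number. The function name `tidy_number` and the sample value are illustrative assumptions, not part of the workshop material. In OpenRefine itself, the cleaned text would still need converting to a number afterwards (for example with `toNumber()`).

```python
def tidy_number(value: str) -> float:
    """Remove the currency symbol and thousands separators, then convert.

    Mirrors the two chained replace() calls on the slide; the float()
    conversion is an extra step added for this illustration.
    """
    cleaned = value.replace("£", "").replace(",", "")
    return float(cleaned.strip())


if __name__ == "__main__":
    # Hypothetical sample value, chosen only to exercise both replacements.
    print(tidy_number("£1,234.50"))
```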

Cleaning – tidying numbers

Cleaning – tidying numbers

Exploring – number facets

Exploring – filtering number ranges

Exploring – sorting columns

Cleaning – making dates

value.toDate('d/M/y')
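As a rough point of comparison, the same day/month/year parsing can be sketched in Python. The pattern letters differ: the slide uses a Java-style date pattern, while Python uses `strptime` directives. This sketch assumes four-digit years, and the function name `to_date` and the sample date are illustrative only.

```python
from datetime import date, datetime


def to_date(value: str) -> date:
    """Parse a day/month/year string such as '14/7/2015'.

    %d and %m accept one- or two-digit values when parsing; %Y expects
    a four-digit year, which is an assumption of this sketch.
    """
    return datetime.strptime(value.strip(), "%d/%m/%Y").date()


if __name__ == "__main__":
    print(to_date("14/7/2015").isoformat())
```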

Cleaning – making dates

http://bit.ly/javadateformat

Cleaning – making dates

Exploring – filtering date ranges

Cleaning – whitespace

Cleaning – whitespace

Cleaning – whitespace
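The whitespace clean-up can be illustrated with a short Python sketch: trim leading and trailing whitespace, then collapse any internal runs of spaces or tabs to a single space. This is an assumed equivalent of OpenRefine's common whitespace transforms, not a description of how OpenRefine implements them; `tidy_whitespace` is a hypothetical name.

```python
def tidy_whitespace(value: str) -> str:
    """Trim the ends and collapse consecutive whitespace to one space.

    str.split() with no arguments splits on any run of whitespace and
    discards empty pieces, so joining with a single space does both jobs.
    """
    return " ".join(value.split())


if __name__ == "__main__":
    print(repr(tidy_whitespace("  Acme   Widgets \t Ltd ")))
```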

Exploring – filter and facet

Cleaning – “ish-match”

Cleaning – cluster / make alike
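To give a feel for how near-duplicate values can be grouped, here is a minimal Python sketch of a key-collision approach in the spirit of a "fingerprint" method: each value is reduced to a normalised key, and values that share a key are proposed as a cluster. The slide does not say which clustering method was demonstrated, so treat this as one plausible illustration rather than OpenRefine's exact algorithm; the names and sample data are made up.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalised key: lowercase, ASCII-only, no punctuation,
    with unique tokens sorted alphabetically."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


if __name__ == "__main__":
    # Hypothetical company names for illustration only.
    names = ["Acme Widgets Ltd", "acme widgets ltd.", "Widgets, Acme LTD", "Other Co"]
    print(cluster(names))
```

Each proposed cluster still needs a human decision about which spelling to keep, which is the "make alike" step.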

Cleaning – good practice

We need a small dataset for the next example…

Annotating – reconciliation

https://opencorporates.com/reconcile

Annotating – reconciliation

Annotating – reconciliation

value.replace(/ LTD\.?/, ' LIMITED')
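The normalisation step can be tried outside OpenRefine with the same regular expression. This Python sketch expands " LTD" (with an optional trailing full stop) to " LIMITED", so that names are more likely to match the form a reconciliation service expects. The function name and sample names are illustrative assumptions.

```python
import re

# A space, then LTD, then an optional literal full stop.
LTD_PATTERN = re.compile(r" LTD\.?")


def normalise_company_name(value: str) -> str:
    """Expand the abbreviation ' LTD' or ' LTD.' to ' LIMITED'."""
    return LTD_PATTERN.sub(" LIMITED", value)


if __name__ == "__main__":
    # Hypothetical names, used only to show both forms of the abbreviation.
    for name in ["ACME WIDGETS LTD", "ACME WIDGETS LTD."]:
        print(normalise_company_name(name))
```

Note that the pattern is case-sensitive and is not anchored to the end of the name, so it assumes upper-case names in which " LTD" appears only as the company suffix.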

Cleaning – normalisation

Annotating – reconciled data

cell.recon.match.id

Annotating – reconciled data

cell.recon.match.name

Annotating – reconciled data

Annotating – reconciled data

https://api.opencorporates.com/companies/gb/00102498

Annotating – URL based data

'https://api.opencorporates.com'+value+'?sparse=true'
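The URL-building expression is plain string concatenation, sketched here in Python. It assumes the cell value is a path-style identifier such as `/companies/gb/00102498`, which is inferred from the example URL shown earlier in the deck; the function name `build_api_url` is hypothetical, and the sketch only constructs the URL without fetching it.

```python
API_ROOT = "https://api.opencorporates.com"


def build_api_url(value: str) -> str:
    """Join the API root, the path-style company identifier, and the
    sparse=true query parameter, as in the slide's expression."""
    return API_ROOT + value + "?sparse=true"


if __name__ == "__main__":
    print(build_api_url("/companies/gb/00102498"))
```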

Annotating – URL based data

Annotating – URL based data

JSON['results']

JSON['results']['company']

JSON['results']['company']['registered_address_in_full']

value.parseJson()['results']['company']['registered_address_in_full']
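The step-by-step drill-down into the JSON response can be mirrored in Python. The sample response below is a made-up placeholder that copies only the key names shown on the slides (`results`, `company`, `registered_address_in_full`); it is not real API output, and the address in it is invented.

```python
import json

# Placeholder response text: key names follow the slides, values are invented.
SAMPLE_RESPONSE = (
    '{"results": {"company": {"name": "EXAMPLE LIMITED", '
    '"registered_address_in_full": "1 Example Street, Exampletown, EX1 1EX"}}}'
)


def registered_address(raw_json: str) -> str:
    """Parse the response text and walk down to the full registered address."""
    data = json.loads(raw_json)
    return data["results"]["company"]["registered_address_in_full"]


if __name__ == "__main__":
    print(registered_address(SAMPLE_RESPONSE))
```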

Annotating – parsed JSON data

split(value, ',')

Annotating – parsed JSON data

split(value, ',')[-1]

split(value, ',')[-1].strip()
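The same three steps (split on commas, take the last piece, strip the surrounding whitespace) look like this in Python. The sketch assumes the last comma-separated part of the address is the piece of interest, such as a postcode; the function name and the sample address are invented for illustration.

```python
def last_address_part(value: str) -> str:
    """Split on commas, take the final element, and trim whitespace."""
    return value.split(",")[-1].strip()


if __name__ == "__main__":
    # Invented address, used only to show the three steps in sequence.
    print(last_address_part("1 Example Street, Exampletown, EX1 1EX"))
```

If an address contains no commas at all, the whole string is returned, so the result is only as reliable as the address formatting.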

Annotating – parsed JSON data

Saving – annotated data

Loading, Checking, Exploring, Cleaning

Reshaping, Annotating

Saving

Reuse – exporting your action list

Tutorials and walkthroughs

http://schoolofdata.org/handbook/recipes/cleaning-data-with-refine/

http://blog.ouseful.info/category/syndication/openrefine

Any questions? @psychemedia