Let your data shine… with OpenRefine Open Belgium 2016 OpenRefine workshop Brosens - Desmet

Upload: open-knowledge-belgium

Post on 12-Apr-2017


Page 1: Let your data shine... with OpenRefine

Let your data shine… with OpenRefine

Open Belgium 2016
OpenRefine workshop
Brosens - Desmet

Page 2: Let your data shine... with OpenRefine

What people say: tweets

@bartox: "Damn! Wish I had this 5 years ago! RT @swiertz nice tools ! Format & clean your data with Google Refine http://goo.gl/UniR6 #cleanup #tools"

@Musebrarian: "YIPEEEE! Google Refine works with OAI-PMH XML out of the box. This is going to make my life much easier."

@kb: "It’s kind of ridiculous how exciting I find this: https://code.google.com/p/google-refine/"

@litcritter: "I rarely feel the desire to kiss a corporation on the mouth, but Google Refine is making me come close http://goo.gl/8pvKB #datageek"

Page 3: Let your data shine... with OpenRefine

@LearonDalby: "I'm sold on #Google #Refine used it most of the day with "messy" data and managed to clean nearly all of it."

@roolio: "Today google #refine saved my afternoon. Every #data #hacker should try it"

@Salesient: "Google refine is awesome. Never before have I been home this early."

@Mayin: "Not only will it clean your data, Google Refine will slice, dice and put bows on your hairdo!http://bit.ly/cPGn1E Rocks data exploration."

@marklabedz: "Google Refine: Making interns unneccesary since 2010."

@naterkane: "i'm completely in love with Google Refine. fo' reals."

@LearonDalby: "Using #Google #Refine makes me happy. Even for the easy stuff."

@loranstefani: "Google Refine: love at first click"

@tracystan: "Google Refine is gonna change my life"

What people say: tweets

Page 4: Let your data shine... with OpenRefine

"Google Refine isn’t going to solve the problem of poor data availability, but for those who manage to gain access to existing records, it can be a powerful tool for transparency." Rebekah Heacock, co-director of the Technology for Transparency Network and a Project Coordinator at Harvard’s Berkman Center for Internet and Society - Sunlight Foundation, Tools for transparency: Google Refine.

"Google Refine is an immensely powerful tool for dealing with "messy" data, and it sports a myriad of advanced features for massaging and analyzing complex data sets" Dmitri Popov (Linux Magazine) - Use Google Refine to Massage Your Data

"For anyone who’s ever had to sort through messy data to try to turn up a meaningful treatment, and who hasn’t, this tool is a godsend." Michael Lines, SLAW - Google Refine 2.0

"Google Refine 2.0 will serve an excellent back-end for data visualization services. It has been well received by the Chicago Tribune and open-government data communities. Along with Google Squared, Refine 2.0 can create a powerful research tool." Chinmoy Kanjilal, Techie Buzz - Google Refine 2.0: Power Tools for Working With Data

What people say: blogs

Page 5: Let your data shine... with OpenRefine

● Formerly known as Google Refine, now OpenRefine
● Site: http://openrefine.org
● GitHub: https://github.com/OpenRefine
● Used for
  ○ Data cleaning (detect and correct anomalies)
  ○ Data transformation (change format, change datatype)
  ○ “Pimp” & “link” data (harvest & connect data from online databases)
● More powerful than a worksheet
● More visual than scripting

A free, open source, powerful tool for working with messy data

Page 6: Let your data shine... with OpenRefine

● Supported by a large community (lots of tutorials and plugins)
● Works quite well up to 100,000 rows of data
● Supports several file formats
● The original file is unaffected
● OpenRefine runs in a modern browser, but does not require an internet connection (except when you connect to services)

A free, open source, powerful tool for working with messy data

Page 7: Let your data shine... with OpenRefine

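|           | Other tools                                              | OpenRefine                                                                  |
|-----------|----------------------------------------------------------|-----------------------------------------------------------------------------|
| Worksheet | focus on cells; focus on import data & calculations      | focus on rows and columns; focus on exploring and transforming existing data |
| Scripting | data → script → output; focus on transformation of data  | all steps are visualized                                                    |
| Databases | focus on queries; you should know the data               | looks like a worksheet; data is always visible, facets show you choices     |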

OpenRefine vs other tools

Page 8: Let your data shine... with OpenRefine

| Distribution | Description | Authors |
|--------------|-------------|---------|
| LODRefine | LODRefine is actually OpenRefine with integrated extensions that make the transition from tabular data to Linked Data a bit easier. Integrated extensions are: RDF extension, DBpedia extension, Crowdsourcing extension, Stats extension | Sparkica |
| OpenDataRise | Tool to cleanse and semantify datasets from CKAN repositories. Based on OpenRefine. | Open Data in Trentino |
| p3-batchrefine | BatchRefine adds batch processing capabilities to OpenRefine and supports multiple back ends, including Spark | SpazioDati |
| SparkonRefine | RefineOnSpark is a driver program to run OpenRefine jobs on the Spark cluster | SpazioDati |
| Reconciliation-and-Matching-Framework | A framework to allow the matching of string entities using customised sets of transformations and matchers, plus a tool to produce the necessary configurations and another to expose them as OpenRefine reconciliation services. | RBGKew |

Tools working with OpenRefine

Page 9: Let your data shine... with OpenRefine

● Download OpenRefine from http://openrefine.org/download.html
● Launch OpenRefine
● Create a project
● Choose the file you want to clean (example dataset: Onderwijsaanbod in Vlaanderen, http://opendata.vlaanderen.be/dataset/onderwijsaanbod)

Hands on: install OpenRefine

Page 10: Let your data shine... with OpenRefine

● Check the preview and define parsing
  ○ Set character encoding (UTF-8)
  ○ Choose delimiter (\t ; , …)
  ○ Parse data as (CSV)
  ○ Parse first line as column header, ignore first … line(s), …

Hands on: importing data
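
To make the parsing options above concrete, here is a minimal Python sketch of the same choices (encoding, delimiter, header line, lines to skip). It is not how OpenRefine imports data; the function name `parse_table`, the sample bytes and the column names are made up for illustration.

```python
import csv

# Made-up sample standing in for a downloaded file; column names are hypothetical.
RAW_BYTES = "naam;gemeente\nSchool A;Gent\nSchool B;Brussel\n".encode("utf-8")


def parse_table(raw, encoding="utf-8", delimiter=";", skip_lines=0):
    """Decode the bytes, skip leading lines, and treat the next line as the header."""
    text = raw.decode(encoding)
    lines = text.splitlines()[skip_lines:]
    # DictReader uses the first remaining line as column names.
    return list(csv.DictReader(lines, delimiter=delimiter))


rows = parse_table(RAW_BYTES)
print(rows[0]["gemeente"])
```

Choosing the wrong delimiter here would put every line into a single column, which is the kind of problem the import preview lets you spot before creating the project.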

Page 11: Let your data shine... with OpenRefine

● Accessing information organized according to a faceted classification system
  ○ Creates an overview of the data
  ○ Allows targeted editing of your data
  ○ Allows specific filtering
  ○ Facet choices are available as tab-separated values (like pivot tables in Excel)

Hands on: faceting
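
The idea behind a text facet can be sketched in a few lines of Python: count how often each distinct value occurs in a column, then keep only the rows matching one choice. This is an illustration of the concept, not OpenRefine's implementation; the rows and the names `text_facet` and `filter_rows` are invented for the example.

```python
from collections import Counter

# Made-up rows; note the inconsistent spelling "gent".
rows = [
    {"gemeente": "Gent"},
    {"gemeente": "Brussel"},
    {"gemeente": "Gent"},
    {"gemeente": "gent"},
]


def text_facet(rows, column):
    """Count the occurrences of each distinct value in a column."""
    return Counter(row[column] for row in rows)


def filter_rows(rows, column, choice):
    """Keep only the rows whose value equals the selected facet choice."""
    return [row for row in rows if row[column] == choice]


facet = text_facet(rows, "gemeente")
# Print the choices as tab-separated values, one choice per line.
for value, count in facet.most_common():
    print(f"{value}\t{count}")
```

The lone "gent" with a count of 1 is exactly the kind of anomaly that the overview makes visible.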

Page 12: Let your data shine... with OpenRefine

● Clustering lets you automatically group and edit values that are different but similar

Hands on: clustering
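
One way to group "different but similar" values is key collision: reduce every value to a normalized key and group the values that share a key. The Python sketch below is loosely modelled on the fingerprint idea (trim, lowercase, drop accents and punctuation, sort the unique tokens). It is a simplified illustration, not OpenRefine's code, and the sample values are made up.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a normalized key so that near-duplicates collide."""
    text = value.strip().lower()
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Sort the unique tokens so that word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by fingerprint; keep groups with more than one distinct value."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


clusters = cluster(["Gent ", "gent", "GENT", "Liège", "Liege", "Brussel"])
print(clusters)
```

Each resulting group is a candidate for merging into a single spelling; the decision to merge stays with the user.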

Page 13: Let your data shine... with OpenRefine

● Common transforms:
  ○ trim leading and trailing whitespace
  ○ to title case; to date; to number
● Split & join multi-valued cells

Hands on: edit cells
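
As a rough illustration of what these cell edits do, the Python sketch below mimics trimming, title case, number conversion and splitting a multi-valued cell into rows. The function names and sample data are hypothetical, and copying the other columns into every new row is a simplification chosen for this sketch.

```python
def trim(value):
    """Remove leading and trailing whitespace."""
    return value.strip()


def to_title_case(value):
    """Capitalize the first letter of every word."""
    return value.title()


def to_number(value):
    """Return an int or float when possible; otherwise keep the original text."""
    try:
        return int(value)
    except ValueError:
        try:
            return float(value)
        except ValueError:
            return value


def split_multi_valued(rows, column, separator):
    """Turn one row with several values in a cell into one row per value."""
    result = []
    for row in rows:
        for part in row[column].split(separator):
            result.append({**row, column: part.strip()})
    return result


# Made-up example row with two values in one cell.
rows = [{"school": "A", "richting": "Latijn; Wiskunde"}]
split_rows = split_multi_valued(rows, "richting", ";")
print(split_rows)
```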

Page 14: Let your data shine... with OpenRefine

● Split columns (by separator or field length)
● Add columns (by fetching URLs or based on another column, using GREL)
● Move columns
● Remove columns
● Rename columns

Hands on: edit columns

Page 15: Let your data shine... with OpenRefine

● GREL (General Refine Expression Language, formerly Google Refine Expression Language)
  ○ Add columns based on another column
    ■ basic string modification
    ■ find and replace
    ■ string parsing and splitting
    ■ calling web services
  ○ Results are always visible in the preview

Hands on: scripting using GREL

Page 16: Let your data shine... with OpenRefine

● Add columns by fetching URL
  ■ find and replace
  ■ string parsing & splitting
  ■ add column based on column "straat": value + "%20" + cells["huisnummer"].value
  ■ Call the Google API (or OpenStreetMap or …): "https://maps.googleapis.com/maps/api/geocode/json?address=" + value + cells["huisnummer"].value + "&key=AIzaSyDY2Z6wehbIqIPrHIb9ljC62pwRqEHOous"
  ■ Parse JSON: value.parseJson()["results"][0]["geometry"]["location"]["lng"]

Hands on: georeferencing
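
The two GREL steps above (build a request URL, then pull a coordinate out of the JSON answer) can be sketched in Python as follows. No request is sent: `SAMPLE_RESPONSE` is a made-up, heavily trimmed answer that only mirrors the `results[0].geometry.location` path used in the slide, and the street, house number and `YOUR_API_KEY` are placeholders.

```python
import json
from urllib.parse import quote, urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(street, house_number, api_key):
    """Percent-encode the address and append it, with the key, to the endpoint."""
    params = {"address": f"{street} {house_number}", "key": api_key}
    # quote_via=quote encodes a space as %20 instead of "+".
    return f"{GEOCODE_ENDPOINT}?{urlencode(params, quote_via=quote)}"


def extract_lng_lat(response_text):
    """Return (lng, lat) of the first result, or None when there are no results."""
    data = json.loads(response_text)
    if not data.get("results"):
        return None
    location = data["results"][0]["geometry"]["location"]
    return location["lng"], location["lat"]


# Made-up response, trimmed to the fields used here.
SAMPLE_RESPONSE = '{"results": [{"geometry": {"location": {"lat": 51.05, "lng": 3.72}}}]}'

url = build_geocode_url("Veldstraat", "12", "YOUR_API_KEY")
print(url)
print(extract_lng_lat(SAMPLE_RESPONSE))
```

Checking for an empty result list matters in practice: addresses that cannot be geocoded would otherwise raise an error when indexing the first result.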

Page 17: Let your data shine... with OpenRefine

● Grouping concepts with an external service, e.g. taxonomic reconciliation
  ○ Example from the natural environment (biodiversity data)
    ■ add a reconciliation service (Reconcile, Start reconciling)
    ■ Let’s use Encyclopedia of Life
    ■ Select matches (Facet, Quick actions…)

Hands on: reconciling

Page 18: Let your data shine... with OpenRefine

● Grouping concepts with an external service, e.g. taxonomic reconciliation
  ○ Example from the natural environment (biodiversity data)
    ■ add an EOL ID column (GREL): cell.recon.match.id
    ■ create a URL based on the EOL ID
    ■ http://eol.org/pages/3465521

Hands on: reconciling

Page 19: Let your data shine... with OpenRefine

● Merge data from two projects: create a new column in one project by using the values of an existing column to look up rows in a matching column of the other project
  ○ cell.cross("datasetname.csv","scientificName").cells["order"].value[0]

Hands on: cross referencing
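
Conceptually, the cross expression is a lookup: take the value in the current row, find the rows in the other project with the same value in the key column, and copy a cell from the first match. A small Python sketch of that idea, with two made-up lists standing in for the two projects (the function names are hypothetical):

```python
# Stand-in for the other project ("datasetname.csv" in the slide); sample data only.
taxonomy = [
    {"scientificName": "Parus major", "order": "Passeriformes"},
    {"scientificName": "Anas platyrhynchos", "order": "Anseriformes"},
]

# Stand-in for the current project.
observations = [
    {"scientificName": "Anas platyrhynchos"},
    {"scientificName": "Unknown sp."},
]


def cross(other_rows, key_column, key_value):
    """Return every row of the other project whose key column equals the value."""
    return [row for row in other_rows if row[key_column] == key_value]


def add_order(rows, other_rows):
    """Add an 'order' column taken from the first matching row, or None."""
    for row in rows:
        matches = cross(other_rows, "scientificName", row["scientificName"])
        row["order"] = matches[0]["order"] if matches else None
    return rows


merged = add_order(observations, taxonomy)
print(merged)
```

Rows without a match end up with an empty value, so it is worth faceting on the new column afterwards to see how many lookups failed.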

Page 20: Let your data shine... with OpenRefine

● Extract and save parts of your operation history as JSON that you can apply to this or other projects in the future.

Hands on: Extract operation history
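
Because the extracted history is plain JSON, it can be inspected or trimmed with any scripting language before it is applied elsewhere. The Python sketch below keeps only the operations that touch one column. `HISTORY_JSON` is an illustrative guess at what such an export looks like: the field names (`op`, `columnName`, `expression`, ...) are assumptions and should be checked against a history extracted from your own project.

```python
import json

# Illustrative history; field names are assumptions, not a specification.
HISTORY_JSON = """
[
  {
    "op": "core/text-transform",
    "description": "Trim whitespace in column gemeente",
    "columnName": "gemeente",
    "expression": "value.trim()"
  },
  {
    "op": "core/column-rename",
    "description": "Rename column naam to name",
    "oldColumnName": "naam",
    "newColumnName": "name"
  }
]
"""


def operations_for_column(history, column):
    """Keep the operations that name the given column as their source column."""
    return [
        operation
        for operation in history
        if column in (operation.get("columnName"), operation.get("oldColumnName"))
    ]


history = json.loads(HISTORY_JSON)
selected = operations_for_column(history, "gemeente")
print(json.dumps(selected, indent=2))
```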

Page 21: Let your data shine... with OpenRefine

● https://github.com/OpenRefine/OpenRefine/wiki
● https://github.com/OpenRefine/OpenRefine/wiki/Recipes
● http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial
● ...

Hands on: further reading