data liberation - tony hirst

DATA LIBERATION Opening Up Data by Hook or by Crook - Data Scraping, Linkage and the Value of a Good Identifier Tony Hirst Department of Communication and Systems The Open University

Upload: incisiveevents

Post on 22-Jun-2015

TRANSCRIPT

Page 1: Data Liberation - Tony Hirst

DATA LIBERATION

Opening Up Data by Hook or by Crook - Data Scraping, Linkage and the Value of a Good Identifier

Tony Hirst

Department of Communication and Systems

The Open University

Page 2: Data Liberation - Tony Hirst

data NOT information

Craft by Vicky Hugheston

Page 3: Data Liberation - Tony Hirst

[Disruptive Innovation?]

Page 4: Data Liberation - Tony Hirst
Page 5: Data Liberation - Tony Hirst

“First” generation: data catalogues

Page 6: Data Liberation - Tony Hirst

Breathing life into data…

Page 7: Data Liberation - Tony Hirst

=importData("CSV_URL")

Google Sheets
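
Outside a spreadsheet, the same "pull a CSV file in from a URL" step might be sketched in Python as below. This is only an illustration using the standard library: `parse_csv` and `fetch_csv` are made-up names, the URL is whatever CSV address would have gone into the formula, and UTF-8 encoding is an assumption.

```python
import csv
import io
from urllib.request import urlopen


def parse_csv(text):
    """Turn CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_csv(url):
    """Download a CSV file and parse it (assumes UTF-8 encoded content)."""
    with urlopen(url) as response:
        return parse_csv(response.read().decode("utf-8"))
```

Calling `fetch_csv` with the address of a published CSV file should give one dict per data row, with every value left as a string.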

Page 8: Data Liberation - Tony Hirst

the spreadsheet becomes

A DATABASE
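
One way to picture "the spreadsheet becomes a database" is to load tabular data into a table and ask questions of it with a query. The sketch below does this with Python's built-in SQLite module rather than with a spreadsheet; the table name, column names and figures are invented for illustration.

```python
import csv
import io
import sqlite3


def load_table(conn, name, csv_text):
    """Create a table from CSV text; every column is stored as TEXT."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{col}" TEXT' for col in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{name}" ({columns})')
    conn.executemany(f'INSERT INTO "{name}" VALUES ({placeholders})', reader)


# Invented example data: spend by department.
conn = sqlite3.connect(":memory:")
load_table(conn, "spend", "dept,amount\nA,10\nB,5\nA,7\n")
totals = conn.execute(
    "SELECT dept, SUM(CAST(amount AS INTEGER)) FROM spend "
    "GROUP BY dept ORDER BY dept"
).fetchall()
```

Once the rows sit in a table, filtering, grouping and summing are a matter of writing a query rather than rearranging cells by hand.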

Page 9: Data Liberation - Tony Hirst

Google Charts

Visualisation API

Page 10: Data Liberation - Tony Hirst

Google Charts

Visualisation API

Page 11: Data Liberation - Tony Hirst

Google Charts

Visualisation API

Page 12: Data Liberation - Tony Hirst

“Second” generation: data management systems

Page 13: Data Liberation - Tony Hirst

DMS – Data Management System

Page 14: Data Liberation - Tony Hirst

BUT

Page 15: Data Liberation - Tony Hirst

There’s lots more data that’s locked up in web pages…

Page 16: Data Liberation - Tony Hirst

Scraping…

Page 17: Data Liberation - Tony Hirst
Page 18: Data Liberation - Tony Hirst

“grabbing web content in a machine readable format and then processing it for your own purposes”

Page 19: Data Liberation - Tony Hirst
Page 20: Data Liberation - Tony Hirst

DIY API

Page 21: Data Liberation - Tony Hirst
Page 22: Data Liberation - Tony Hirst

Original HTML web page

Accessible web page

Extract Information -> data

Page 23: Data Liberation - Tony Hirst

Recreating the database that was used to populate a (templated) page

Page 24: Data Liberation - Tony Hirst
Page 25: Data Liberation - Tony Hirst
Page 26: Data Liberation - Tony Hirst
Page 27: Data Liberation - Tony Hirst
Page 28: Data Liberation - Tony Hirst
Page 29: Data Liberation - Tony Hirst

Implied semantics

Page 30: Data Liberation - Tony Hirst

…quick’n’dirty

=importHTML("pageURL","table",N)
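
For comparison, a rough Python counterpart to the importHTML formula is sketched below using only the standard library. `TableExtractor` and `import_html_table` are illustrative names, and the sketch assumes simple, non-nested tables; real pages often need a more forgiving parser.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <table> on a page, row by row."""

    def __init__(self):
        super().__init__()
        self.tables = []   # list of tables; each table is a list of rows
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None


def import_html_table(html, n):
    """Return table number n, counting from 1 as the formula does."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[n - 1]
```

The result is a list of rows, each a list of cell strings, which can then be written out as CSV or loaded into a database.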

Page 31: Data Liberation - Tony Hirst
Page 32: Data Liberation - Tony Hirst
Page 33: Data Liberation - Tony Hirst
Page 34: Data Liberation - Tony Hirst
Page 35: Data Liberation - Tony Hirst
Page 36: Data Liberation - Tony Hirst
Page 37: Data Liberation - Tony Hirst

PDF scraping

Page 38: Data Liberation - Tony Hirst
Page 39: Data Liberation - Tony Hirst

Scrapers

Views

Scraper SQLite database

SQLite database Scraper
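
The labels above suggest a scraper that writes into an SQLite database, with views reading back out of it. A minimal sketch of that pattern, assuming a made-up three-column table called `scraped` and not tied to any particular scraping platform, might look like this:

```python
import sqlite3


def save_rows(conn, rows):
    """Store scraped (id, name, value) rows; a re-run replaces rows with the same id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scraped "
        "(id TEXT PRIMARY KEY, name TEXT, value REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO scraped VALUES (?, ?, ?)", rows)
    conn.commit()


def top_values(conn, limit):
    """A simple 'view' over the scraped data: the highest-valued rows."""
    return conn.execute(
        "SELECT name, value FROM scraped ORDER BY value DESC LIMIT ?", (limit,)
    ).fetchall()
```

Keying each row on a stable identifier means the scraper can be run repeatedly without piling up duplicate rows.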

Page 40: Data Liberation - Tony Hirst
Page 41: Data Liberation - Tony Hirst
Page 42: Data Liberation - Tony Hirst
Page 43: Data Liberation - Tony Hirst

Sometimes the data is spread across different files…

Page 44: Data Liberation - Tony Hirst
Page 45: Data Liberation - Tony Hirst

Row based aggregation
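
Row based aggregation can be pictured as stacking files that share the same columns on top of one another. The sketch below assumes each file is available as CSV text; the `source` column and the function name are illustrative additions, useful for remembering where each row came from.

```python
import csv
import io


def stack_rows(named_csv_texts):
    """Append the rows of several CSV texts, tagging each row with its source."""
    combined = []
    for source, text in named_csv_texts.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["source"] = source
            combined.append(dict(row))
    return combined
```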

Page 46: Data Liberation - Tony Hirst

Sometimes the data is spread across different websites…

Page 47: Data Liberation - Tony Hirst

…Normalisation…

Page 48: Data Liberation - Tony Hirst
Page 49: Data Liberation - Tony Hirst

Data Enrichment

Page 50: Data Liberation - Tony Hirst

Column Additions/Annotations
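
As a small illustration of adding a column, the sketch below derives a new value for each row from the columns it already has. The function name and the postcode example are assumptions made for the sake of the example.

```python
def add_column(rows, name, derive):
    """Return copies of the rows with one extra column computed from each row."""
    return [{**row, name: derive(row)} for row in rows]


# Invented example: annotate each row with the outward part of its postcode.
rows = [{"postcode": "MK7 6AA"}]
enriched = add_column(rows, "area", lambda row: row["postcode"].split()[0])
```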

Page 51: Data Liberation - Tony Hirst
Page 52: Data Liberation - Tony Hirst

Sometimes the data is split across different files…

Page 53: Data Liberation - Tony Hirst

Column based merge
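
A column based merge lines rows up on a shared key and sets their columns side by side. The sketch below is one simple way to do that with plain dicts; it keeps every row from the left-hand table and assumes the key is unique in the right-hand table. The ISBN-style column names in the usage note are made up.

```python
def merge_on(key, left_rows, right_rows):
    """Left-join two lists of dicts on a shared key column."""
    lookup = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        extra = lookup.get(row[key], {})  # no match: keep the left row as-is
        merged.append({**row, **extra})
    return merged
```

For example, merging a list of titles keyed on `isbn` with a list of prices keyed on `isbn` gives rows carrying both title and price wherever the identifiers agree.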

Page 54: Data Liberation - Tony Hirst
Page 55: Data Liberation - Tony Hirst

-> Data cleansing

Page 56: Data Liberation - Tony Hirst

Clustering…
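
One common clustering trick is to reduce each value to a normalised "fingerprint" and group the values whose fingerprints collide. The sketch below is loosely modelled on that key-collision idea; it is an illustration, not the implementation used by any particular tool, and the function names are made up.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Normalise a string so that near-duplicate spellings share one key."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")  # drop accents
    text = re.sub(r"[^\w\s]", "", text)                     # drop punctuation
    return " ".join(sorted(set(text.split())))              # sort unique tokens


def cluster(values):
    """Group values whose fingerprints collide; keep groups of 2+ distinct spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Each cluster is a candidate set of spellings to collapse into a single agreed form; a person still decides whether the members really do refer to the same thing.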

Page 57: Data Liberation - Tony Hirst

OpenRefine

http://mashe.hawksey.info/2012/11/mining-and-openrefineing-jiscmail-a-look-at-oer-discuss/

/via Martin Hawksey/@mhawksey

Page 58: Data Liberation - Tony Hirst

OpenRefine

Page 59: Data Liberation - Tony Hirst

OpenRefine

Page 60: Data Liberation - Tony Hirst

“Finessing” a common identifier

Page 61: Data Liberation - Tony Hirst

Common identifiers (common KEYS) make it MUCH easier to JOIN datasets by column

Page 62: Data Liberation - Tony Hirst

Book Title -> ISBN

Page 63: Data Liberation - Tony Hirst

I am “psychemedia” on Twitter, delicious, slideshare, flickr, etc etc

Page 64: Data Liberation - Tony Hirst
Page 65: Data Liberation - Tony Hirst

Reconciliation…
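
Reconciliation can be thought of as matching a messy name against a list of agreed names that each carry an identifier. The sketch below uses fuzzy string matching from Python's standard library as a stand-in for a proper reconciliation service; the function name, the 0.8 cutoff and the identifiers are all assumptions.

```python
import difflib


def reconcile(name, canonical, cutoff=0.8):
    """Return the identifier of the closest canonical name, or None if nothing is close."""
    lowered = {label.lower(): ident for label, ident in canonical.items()}
    matches = difflib.get_close_matches(
        name.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


# Hypothetical identifiers, invented for illustration.
canonical = {"The Open University": "ou", "University of Oxford": "ox"}
```

Here `reconcile("Open University", canonical)` picks out the identifier for "The Open University", while a name with no close match comes back as `None` so that it can be checked by hand.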

Page 66: Data Liberation - Tony Hirst

OpenRefine

Page 67: Data Liberation - Tony Hirst

OpenRefine

Page 68: Data Liberation - Tony Hirst

OpenRefine

Page 69: Data Liberation - Tony Hirst

OpenRefine

Page 70: Data Liberation - Tony Hirst
Page 71: Data Liberation - Tony Hirst

Linked Data™

Page 72: Data Liberation - Tony Hirst
Page 73: Data Liberation - Tony Hirst

So who speaks SPARQL?

Diners - Journal Canteen by avlxyz

Page 74: Data Liberation - Tony Hirst

You DON’T have to…

Page 75: Data Liberation - Tony Hirst

Just think about how one piece of data might be related to another through a common means of addressing them…

Page 76: Data Liberation - Tony Hirst

http://ouseful.info

@psychemedia