data liberation - tony hirst

Post on 22-Jun-2015

DATA LIBERATION

Opening Up Data by Hook or by Crook - Data Scraping, Linkage and the Value of a Good Identifier

Tony Hirst
Department of Communication and Systems
The Open University

data NOT information

Craft by Vicky Hugheston

[Disruptive Innovation?]

“First” generation: data catalogues

Breathing life into data…

=importData("CSV_URL")

Google Sheets

the spreadsheet becomes

A DATABASE
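A rough sketch of the same idea outside Google Sheets: once CSV data has been loaded, it can be queried like a database table. The sample rows, the table name `sheet` and the column names are made up for illustration; in practice the text would be fetched from the CSV URL rather than embedded.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for the file behind CSV_URL.
CSV_TEXT = """council,year,spend
Milton Keynes,2014,120
Milton Keynes,2015,150
Isle of Wight,2015,90
"""

def load_csv_into_sqlite(csv_text):
    """Load CSV text into an in-memory SQLite table called 'sheet'."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sheet (council TEXT, year INTEGER, spend INTEGER)")
    # Named placeholders are filled from each row dictionary.
    conn.executemany("INSERT INTO sheet VALUES (:council, :year, :spend)", rows)
    return conn

conn = load_csv_into_sqlite(CSV_TEXT)

# Ask a database-style question of what started life as a flat file.
totals = conn.execute(
    "SELECT council, SUM(spend) FROM sheet GROUP BY council ORDER BY council"
).fetchall()
print(totals)
```

The in-memory database keeps the sketch self-contained; pointing `sqlite3.connect` at a file path would make the table persistent.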

Google Charts

Visualisation API

“Second” generation: data management systems

DMS – Data Management System

BUT

There’s lots more data that’s locked up in web pages…

Scraping…

“grabbing web content in a machine readable format and then processing it for your own purposes”

DIY API

Original HTML web page

Accessible web page

Extract Information

-> data

Recreating the database that was used to populate a (templated) page

Implied semantics

…quick’n’dirty: =importHTML("pageURL","table",N)
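For comparison, a minimal Python sketch of what a table scrape of this kind involves, using only the standard library. The class `TableExtractor`, the function `import_html_table` and the sample page are hypothetical names and data for illustration. The sketch assumes simple, non-nested tables and is not how Google Sheets implements importHTML.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # row currently being built, if any
        self._cell = None  # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

def import_html_table(html, n):
    """Return the n-th table on the page (counting from 1) as rows of strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables[n - 1]

# Hypothetical page: a layout table followed by the data table we want.
PAGE = """
<html><body>
<table><tr><td>navigation</td></tr></table>
<table>
  <tr><th>Name</th><th>Votes</th></tr>
  <tr><td>Alice</td><td>10</td></tr>
  <tr><td>Bob</td><td>7</td></tr>
</table>
</body></html>
"""

rows = import_html_table(PAGE, 2)
print(rows)
```

The header row comes back as the first row, so the "implied semantics" of the page (which column means what) still has to be read off by the person doing the scraping.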

PDF scraping

Scrapers

Views

Scraper SQLite database

SQLite database Scraper

Sometimes the data is spread across different files…

Row based aggregation
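A small sketch of row based aggregation, assuming the files share the same column headings. The file names and contents are invented for illustration and held in a dictionary so that the example runs without touching the disk.

```python
import csv
import io

# Hypothetical contents of two files that share the same columns.
FILES = {
    "spend_2014.csv": "supplier,amount\nAcme,100\nBloggs,50\n",
    "spend_2015.csv": "supplier,amount\nAcme,120\n",
}

def aggregate_rows(files):
    """Stack the rows of several same-shaped CSV files into one list."""
    combined = []
    for name, text in files.items():
        for row in csv.DictReader(io.StringIO(text)):
            # Record where each row came from so nothing is lost by stacking.
            row["source_file"] = name
            combined.append(row)
    return combined

combined = aggregate_rows(FILES)
for row in combined:
    print(row)
```

Keeping a `source_file` column is a design choice rather than a requirement; it makes it possible to trace a row back to the file it was taken from.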

Sometimes the data is spread across different websites…

…Normalisation…

Data Enrichment

Column Additions/Annotations

Sometimes the data is split across different files…

Column based merge
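A sketch of a column based merge, assuming both datasets carry a shared key column. The function name `merge_on_key`, the `school_id` key and the sample records are hypothetical. The behaviour is that of a left join: every row of the first dataset is kept, whether or not a match is found.

```python
def merge_on_key(left, right, key):
    """Extend each row in `left` with the columns of the matching row in `right`."""
    # Index the right-hand dataset by its key for quick lookup.
    lookup = {row[key]: row for row in right}
    merged = []
    for row in left:
        extra = lookup.get(row[key], {})
        # Columns from `right` are added; on a name clash they take precedence.
        merged.append({**row, **extra})
    return merged

# Hypothetical datasets that share the "school_id" column.
schools = [
    {"school_id": "S1", "name": "Hilltop"},
    {"school_id": "S2", "name": "Riverside"},
]
results = [
    {"school_id": "S1", "pass_rate": "0.82"},
]

merged = merge_on_key(schools, results, "school_id")
for row in merged:
    print(row)
```

Rows with no match keep only their original columns, which makes gaps in the second dataset easy to spot.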

-> Data cleansing

Clustering…
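One way to picture clustering for data cleansing is key collision: reduce each value to a normalised key and group the values whose keys collide. The sketch below is loosely modelled on the "fingerprint" idea and is a simplified illustration, not OpenRefine's actual implementation; the function names and sample values are made up.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Reduce a string to a normalised key: ASCII, lower case, no punctuation, sorted unique words."""
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii")  # drop accents
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(text.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values that share a fingerprint; return only groups with more than one member."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical messy column values.
names = [
    "Open University",
    "open university ",
    "University, Open",
    "The Open University",
    "Oxford Brookes",
]

clusters = cluster(names)
print(clusters)
```

Note that "The Open University" is not pulled into the cluster, because the extra word changes the key; deciding which variants really are the same thing remains a human judgement.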

OpenRefine
http://mashe.hawksey.info/2012/11/mining-and-openrefineing-jiscmail-a-look-at-oer-discuss/

/via Martin Hawksey/@mhawksey

“Finessing” a common identifier

Common identifiers (common KEYS) make it MUCH easier to JOIN datasets by column

Book Title -> ISBN

I am “psychemedia” on Twitter, delicious, slideshare, flickr, etc., etc.

Reconciliation…

OpenRefine

Linked Data™

So who speaks SPARQL?

Diners - Journal Canteen by avlxyz

You DON’T have to….

Just think about how one piece of data might be related to another through a common means of addressing them…

http://ouseful.info

@psychemedia
