introduction to open data and other hypes

Introduction to OPEN DATA and other hypes J. Minguillón EIMT / UOC

Upload: julia-minguillon

Post on 16-Apr-2017

Page 1: Introduction to OPEN DATA and other hypes

Introduction to OPEN DATA

and other hypes

J. Minguillón

EIMT / UOC

Page 2: Introduction to OPEN DATA and other hypes

what is Open Data?

Page 3: Introduction to OPEN DATA and other hypes

what is Open?

Page 4: Introduction to OPEN DATA and other hypes

what is Data?

Page 5: Introduction to OPEN DATA and other hypes

plural of "datum" (thing given)

data is / data are

idea: the measure / amount / ... of something

Page 6: Introduction to OPEN DATA and other hypes

42

Page 7: Introduction to OPEN DATA and other hypes

42 what?

https://en.wikipedia.org/wiki/42_(disambiguation)

Page 8: Introduction to OPEN DATA and other hypes

forty-two

quaranta-dos

amane nambili

Page 9: Introduction to OPEN DATA and other hypes

representation

Page 10: Introduction to OPEN DATA and other hypes

integer?

base / radix?

units?
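
The radix question can be made concrete with a minimal Python sketch: the same two symbols, "42", denote different quantities depending on the base they are read in. The helper name `interpret` is illustrative and not part of the slides.

```python
def interpret(digits: str, radix: int) -> int:
    """Return the integer value of `digits` when read in the given radix."""
    return int(digits, radix)


# The same written symbols, read under four different radix assumptions.
for radix in (5, 8, 10, 16):
    print(f"'42' read in base {radix} is {interpret('42', radix)} in base 10")
```

Without the radix (metadata about the representation), the bare symbols are ambiguous.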

Page 11: Introduction to OPEN DATA and other hypes

D-I-K-W pyramid

Page 12: Introduction to OPEN DATA and other hypes

D: 42

I: Patient's body temperature (t) is 42 degrees

K: Fever with t > 42 can cause severe brain damage

W: never let t reach 42 degrees!

Page 13: Introduction to OPEN DATA and other hypes

t = 42 degrees?

Celsius: fever

Fahrenheit: cold body

Kelvin: cold body floating in outer space
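
A short sketch of why the unit matters: converting the same reading, 42, to Celsius under each of the three unit assumptions. The function name `to_celsius` and the one-letter unit codes are assumptions made for this illustration.

```python
def to_celsius(value: float, unit: str) -> float:
    """Convert a temperature reading to degrees Celsius.

    `unit` is one of "C" (Celsius), "F" (Fahrenheit) or "K" (Kelvin).
    """
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32) * 5 / 9
    if unit == "K":
        return value - 273.15
    raise ValueError(f"unknown unit: {unit!r}")


for unit in ("C", "F", "K"):
    print(f"42 {unit} = {to_celsius(42, unit):.2f} degrees Celsius")
```

The datum is identical in all three cases; only the unit, which is context, turns it into information.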

Page 14: Introduction to OPEN DATA and other hypes

data is not just numbers

Page 15: Introduction to OPEN DATA and other hypes

tables, documents

wikipedia: pages / articles

flickr: images

twitter: tweets

Page 16: Introduction to OPEN DATA and other hypes

internal structure

x

possible values

Page 17: Introduction to OPEN DATA and other hypes

basic types

structured

semi-structured

Page 18: Introduction to OPEN DATA and other hypes

basic types

integer, real, complex

vectors (RGB, ...)

characters, strings

Page 19: Introduction to OPEN DATA and other hypes

structured data

flat: 1D, 2D, 3D, ...

hierarchical: tweets

relations: graphs
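
The three shapes of structured data can be sketched with plain Python containers. All values and field names below are made up for illustration; in particular, the tweet-like record does not follow the real Twitter schema.

```python
# flat (2D): rows and columns, as in a CSV file or a spreadsheet
table = [
    ["sensor", "hour", "temperature"],
    ["s1", 9, 36.6],
    ["s1", 10, 42.0],
]

# hierarchical: a tweet-like record with nested fields (hypothetical names)
tweet = {
    "id": 1,
    "text": "42 what?",
    "user": {"name": "example_user", "followers": 10},
    "hashtags": ["opendata"],
}

# relations: a directed graph stored as an adjacency list
follows = {"ana": ["ben", "carla"], "ben": ["carla"], "carla": []}

print(table[2][2])            # a cell addressed by row and column
print(tweet["user"]["name"])  # a value reached by walking down the hierarchy
print(follows["ana"])         # the nodes linked from "ana"
```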

Page 20: Introduction to OPEN DATA and other hypes

semi-structured data

documents

HTML pages

Page 21: Introduction to OPEN DATA and other hypes

what is Open?

Page 22: Introduction to OPEN DATA and other hypes

openness as freedom

Page 23: Introduction to OPEN DATA and other hypes

5 Rs model

Reuse

Revise

Remix

Redistribute

Retain

Page 24: Introduction to OPEN DATA and other hypes

open vs free

https://theodi.org/blog/when-data-is-free-but-not-open

Page 25: Introduction to OPEN DATA and other hypes

open is a combination of

no technological barriers

no legal barriers

Page 26: Introduction to OPEN DATA and other hypes

technological barriers

Page 27: Introduction to OPEN DATA and other hypes

technological barriers

data must be

accessible

downloadable

manipulable

Page 28: Introduction to OPEN DATA and other hypes

the 5 star model

* not manipulable: pdf, tiff

** proprietary: doc, ppt, xls

*** open formats: txt, csv, json

**** accessible (link): xml, rdf

***** provide context: xml, rdf

http://5stardata.info/en/

Page 29: Introduction to OPEN DATA and other hypes

open data needs at least 3 stars

open formats

open software
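
Data in an open format can be manipulated with open software alone. As a minimal sketch, the following converts CSV text to JSON using only the Python standard library; the sample rows and the function name `csv_to_json` are invented for the example.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Parse CSV text (first row = header) and return it as a JSON array."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


# Made-up sample data, not taken from any real dataset.
sample = "name,count\nMarc,512\nJulia,498\n"
print(csv_to_json(sample))
```

Note that CSV carries no type information, so `count` comes out as a string unless it is converted explicitly.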

Page 30: Introduction to OPEN DATA and other hypes

linked data

Page 31: Introduction to OPEN DATA and other hypes

linked data

use URIs to name things

use HTTP to provide access

describe data using metadata

link to related data sources

readable by machines
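
A hedged sketch of these principles: things are named with URIs and described as subject / predicate / object triples, here serialized in an N-Triples-like form. The Wikidata entity URI and the `rdfs:label` URI are real; the `http://example.org/...` predicate is a placeholder invented for this example, and `to_ntriples` is an illustrative helper, not a library function.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BARCELONA = "http://www.wikidata.org/entity/Q1492"

# Each statement is a (subject, predicate, object) triple.
triples = [
    # metadata: a human-readable label, written as a language-tagged literal
    (BARCELONA, RDFS_LABEL, '"Barcelona"@en'),
    # a link to a related source (the predicate URI is a made-up placeholder)
    (BARCELONA, "http://example.org/vocab/describedBy",
     "https://en.wikipedia.org/wiki/Barcelona"),
]


def to_ntriples(statements) -> str:
    """Serialize triples, wrapping URIs in <...> and leaving literals as-is."""
    lines = []
    for subject, predicate, obj in statements:
        rendered = obj if obj.startswith('"') else f"<{obj}>"
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


print(to_ntriples(triples))
```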

Page 32: Introduction to OPEN DATA and other hypes

why linked data?

automatic web data extraction

data exchange / enrichment

construction of knowledge

semantic searches

Page 33: Introduction to OPEN DATA and other hypes

example: wikidata

municipalities surrounding Barcelona?

https://en.wikipedia.org/wiki/Barcelona

https://www.wikidata.org/wiki/Q1492
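
One possible way to ask this question of Wikidata is a SPARQL query against its public query service. The sketch below only builds the request URL and makes no network call. It assumes that property P47 ("shares border with") is an adequate stand-in for "surrounding", so the results are not guaranteed to be limited to municipalities.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"

# Q1492 is Barcelona (from the slide); P47 is assumed to capture "surrounding".
QUERY = """
SELECT ?place ?placeLabel WHERE {
  wd:Q1492 wdt:P47 ?place .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def build_request_url(query: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    params = urlencode({"query": query.strip(), "format": "json"})
    return f"{ENDPOINT}?{params}"


print(build_request_url(QUERY))
```

The resulting URL can be opened in a browser or fetched with any HTTP client.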

Page 34: Introduction to OPEN DATA and other hypes

"static" access

data is downloaded as a file

files are "pictures of the past"

not defined by final users

typical of data repositories

human oriented

http://dadesobertes.gencat.cat/en/cercador/detall-cataleg/?id=5

Page 35: Introduction to OPEN DATA and other hypes

"dynamic" access

data is downloaded as a stream

streams are "pictures of the present"

parametrized by final users (API)

typical of online services

machine oriented

Page 36: Introduction to OPEN DATA and other hypes

legal barriers

Page 37: Introduction to OPEN DATA and other hypes

legal barriers

reachable through Internet does not mean open

licenses

terms and conditions

EULAs

Page 38: Introduction to OPEN DATA and other hypes

licenses for open data

for datasets / databases

facts cannot be restricted...

...but collections can!

http://opendatacommons.org/licenses/

Page 39: Introduction to OPEN DATA and other hypes

terms and conditions

for web data

legal language

http://www.coca-colacompany.com/our-company/the-coca-cola-company-terms-of-use

Page 40: Introduction to OPEN DATA and other hypes

EULA

End-User License Agreement

for apps and online services

legal language

absurd!

https://www.eff.org/wp/dangerous-terms-users-guide-eulas

Page 41: Introduction to OPEN DATA and other hypes

ethical issues

privacy

security

transparency

Page 43: Introduction to OPEN DATA and other hypes

why open data?

Page 44: Introduction to OPEN DATA and other hypes

why not?

Page 45: Introduction to OPEN DATA and other hypes

data belongs to its producers

in most cases, users!

it promotes participation

it discovers additional value

"data is the new oil" (C. Humby)

"data is the new soil" (D. McCandless)

Page 46: Introduction to OPEN DATA and other hypes

data life-cycle

Page 47: Introduction to OPEN DATA and other hypes

data is ...

generated

stored / published

gathered / captured

preprocessed

analyzed

visualized

Page 48: Introduction to OPEN DATA and other hypes

data generation

by humans / sensors / services

anytime / anywhere

persistent / volatile

stored / published

Page 49: Introduction to OPEN DATA and other hypes

data gathering

from repositories

APIs

social networks

databases / logs

web scraping

humans (captcha)

Page 50: Introduction to OPEN DATA and other hypes

data preprocessing

filtering / selection

join (enrichment)

feature extraction

conversion

summarize / aggregate
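
Three of these preprocessing steps (filtering, join and aggregation) can be sketched in plain Python. The records, the store-to-city lookup and the function name `preprocess` are all invented for the example.

```python
from collections import defaultdict

# Made-up input: sales records plus a lookup table used to enrich them.
sales = [
    {"store": "A", "item": "milk", "units": 3},
    {"store": "A", "item": "bread", "units": 0},
    {"store": "B", "item": "milk", "units": 5},
]
stores = {"A": "Barcelona", "B": "Girona"}


def preprocess(records, store_city):
    """Filter empty sales, enrich with the city, and total units per city."""
    # filtering / selection: drop records with no units sold
    kept = [r for r in records if r["units"] > 0]
    # join (enrichment): attach the city that corresponds to each store
    enriched = [{**r, "city": store_city[r["store"]]} for r in kept]
    # summarize / aggregate: total units per city
    totals = defaultdict(int)
    for r in enriched:
        totals[r["city"]] += r["units"]
    return dict(totals)


print(preprocess(sales, stores))
```

For larger datasets the same steps are usually done with dedicated tools, but the logic is the same.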

Page 51: Introduction to OPEN DATA and other hypes

data analysis

statistical descriptors

inference

unsupervised (clustering)

supervised (classification)

variable relevance...
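
As a small illustration of the first item, statistical descriptors, the sketch below summarizes a list of values with Python's standard `statistics` module. The readings are made up and `describe` is an illustrative helper name.

```python
import statistics


def describe(values):
    """Return basic statistical descriptors for a list of numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Made-up body temperature readings, in degrees Celsius.
readings = [36.5, 36.8, 37.0, 38.2, 42.0]
for name, value in describe(readings).items():
    print(f"{name}: {value}")
```

The mean is pulled up by the single extreme reading while the median is not, which is why more than one descriptor is usually reported.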

Page 52: Introduction to OPEN DATA and other hypes

data visualization

visual analysis

summarization

reporting

dashboards

maps / graphs

interactivity

Page 53: Introduction to OPEN DATA and other hypes

big data

Page 54: Introduction to OPEN DATA and other hypes

big data

3 Vs

volume

variety

velocity

Page 55: Introduction to OPEN DATA and other hypes

volume is

the number of elements

sample / population size

Page 56: Introduction to OPEN DATA and other hypes

variety is

the number of different forms

dimensionality

Page 57: Introduction to OPEN DATA and other hypes

velocity is

how fast data is produced or changes

longitudinal

Page 58: Introduction to OPEN DATA and other hypes

other Vs

veracity

value

variability

visibility

...

Page 59: Introduction to OPEN DATA and other hypes

example: Wal-Mart

(2015) 37 million people shop at Wal-Mart every day

from a list of 140,000 items

who buys what when?

why?

Page 60: Introduction to OPEN DATA and other hypes

other big huge data players

amazon

VISA

telcos

facebook, twitter, ...

google

Page 61: Introduction to OPEN DATA and other hypes

big data also

uses multiple sources

deals with population, not samples

makes traditional methods obsolete

requires supercomputing / cloud

Page 62: Introduction to OPEN DATA and other hypes

example

include context data

customer loyalty cards

product interestingness (RFID)

CCTV cameras

social networks

...

Page 63: Introduction to OPEN DATA and other hypes

tools

(examples)

Page 64: Introduction to OPEN DATA and other hypes

"engineering" approach

solve this problem now with the available tools

no tool solves all problems

problems change, tools too

tools related to data life-cycle

Page 66: Introduction to OPEN DATA and other hypes

example: URL manipulation

IDESCAT

names of newborn children

parameters: year, sex, place

other: position, sort
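
URL manipulation amounts to generating one URL per combination of parameter values. The sketch below does this with the standard library. The base URL is a placeholder and not the real IDESCAT address, and the parameter values ("f", "m", "cat") are invented. Only the parameter names year, sex and place come from the slide.

```python
from itertools import product
from urllib.parse import urlencode

# Placeholder address: NOT the real IDESCAT endpoint.
BASE_URL = "https://example.org/names"


def build_urls(years, sexes, place):
    """Return one URL per (year, sex) combination for the given place."""
    urls = []
    for year, sex in product(years, sexes):
        query = urlencode({"year": year, "sex": sex, "place": place})
        urls.append(f"{BASE_URL}?{query}")
    return urls


for url in build_urls([2015, 2016], ["f", "m"], "cat"):
    print(url)
```

Each generated URL can then be handed to whatever tool does the actual download.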

Page 67: Introduction to OPEN DATA and other hypes

example: URL manipulation

use scrapy for data gathering

define desired fields

create list of URLs

identify XPATH (inspect)
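
The XPath step can be illustrated without scrapy. The sketch below swaps in Python's built-in `xml.etree.ElementTree`, which supports only a limited subset of XPath and needs well-formed markup, so it is a stand-in for the idea rather than the scrapy workflow itself. The page snippet, its class names and `extract_names` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet standing in for a downloaded results page.
PAGE = """<html><body><table>
<tr><td class="pos">1</td><td class="name">Marc</td></tr>
<tr><td class="pos">2</td><td class="name">Julia</td></tr>
</table></body></html>"""


def extract_names(page: str):
    """Return the text of every <td class="name"> cell, in document order."""
    root = ET.fromstring(page)
    # The path expression is what one would identify with the browser inspector.
    return [cell.text for cell in root.findall(".//td[@class='name']")]


print(extract_names(PAGE))
```

Real web pages are rarely well-formed XML, which is one reason to use a scraping library with a tolerant HTML parser instead.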

Page 68: Introduction to OPEN DATA and other hypes

data preprocessing

Mr. Data Converter

JSON online editor

OpenRefine

bash+awk, perl, python

Page 71: Introduction to OPEN DATA and other hypes

example

visualizing co-authorship at UOC

Page 72: Introduction to OPEN DATA and other hypes

data gathered from SCOPUS

unify author names, build graph

no analysis

visualize graph
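
A minimal sketch of the "unify author names, build graph" step, under the assumption that name variants differ only in case, accents and punctuation (real bibliographic data is usually messier). The author lists are placeholders, not SCOPUS records, and `unify` and `build_graph` are illustrative names.

```python
import unicodedata
from collections import Counter
from itertools import combinations


def unify(name: str) -> str:
    """Normalize a name: strip accents and punctuation, lower-case, trim."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = no_accents.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def build_graph(papers):
    """Return weighted edges {(author_a, author_b): number of shared papers}."""
    edges = Counter()
    for authors in papers:
        unique = sorted({unify(a) for a in authors})
        for pair in combinations(unique, 2):
            edges[pair] += 1
    return edges


# Placeholder author lists, one per paper.
papers = [
    ["Author A.", "Author B."],
    ["author a", "Author C."],
    ["Author B.", "Author A.", "Author C."],
]
for (a, b), weight in sorted(build_graph(papers).items()):
    print(f"{a} -- {b}: {weight}")
```

The resulting edge list can be exported to a graph visualization tool. Nodes are authors and edge weights count co-authored papers.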

Page 73: Introduction to OPEN DATA and other hypes

what knowledge can we extract from the visualization?

most prolific authors/departments

interdisciplinarity, connectors

internal publication policies

"lone rangers"

Page 74: Introduction to OPEN DATA and other hypes

what open data can we use to enrich the visualization?

from authors/departments

from papers/journals

...

Page 75: Introduction to OPEN DATA and other hypes

open data initiatives

Page 76: Introduction to OPEN DATA and other hypes

agenda oberta

civio

15mpedia

wheredoesmymoneygo?

...

Page 77: Introduction to OPEN DATA and other hypes

data sources

social networks

open data repositories

scraped web data

...

Page 78: Introduction to OPEN DATA and other hypes

examples

league of legends & twitter

smileys, weather & twitter

air tickets price fluctuations

barcelona & flickr

barcelona & bicing

US air traffic patterns

Page 79: Introduction to OPEN DATA and other hypes

project

Page 80: Introduction to OPEN DATA and other hypes

requirements

teams of 3-4 people

free topic using open data

proof-of-concept

final report

Page 81: Introduction to OPEN DATA and other hypes

report

summary and goals

data life-cycle description

tools and data used

results

legal and ethical issues

limitations and future work

bibliography and references

Page 82: Introduction to OPEN DATA and other hypes

calendar

today: team, topic, abstract

next session: work in class

online mentoring

deadline: 23/01/2017

Page 83: Introduction to OPEN DATA and other hypes

contact

jminguillona[at]uoc[dot]edu

@jminguillona

webpage

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.