introduction to open data and other hypes

Post on 16-Apr-2017

Introduction to OPEN DATA

and other hypes

J. Minguillón

EIMT / UOC

what is Open Data?

what is Open?

what is Data?

plural of "datum" (thing given)

data is / data are

idea: the measure / amount / ... of something

42

42 what?

https://en.wikipedia.org/wiki/42_(disambiguation)

forty-two

quaranta-dos

amane nambili

representation

integer?

base / radix?

units?
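The questions above can be made concrete with a minimal sketch: the same two characters, "42", stand for different quantities depending on the base assumed by the reader. The function name `interpret` and the list of candidate bases are illustrative assumptions, not part of the slides.

```python
def interpret(digits: str, bases=(5, 8, 10, 16)) -> dict:
    """Return the integer value of `digits` under each candidate base."""
    return {base: int(digits, base) for base in bases}


values = interpret("42")
for base, value in values.items():
    print(f"'42' read in base {base} equals {value} in decimal")
```

Without metadata stating the base (and the units), the reader has to guess which of these values was meant.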

D-I-K-W pyramid

D: 42

I: Patient's body temperature (t) is 42 degrees

K: Fever with t > 42 can cause severe brain damage

W: never let t reach 42 degrees!

t = 42 degrees?

Celsius: fever

Fahrenheit: cold body

Kelvin: cold body floating in outer space
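The three readings above can be compared by converting each one to the same scale. This is a minimal sketch using the standard conversion formulas; the function name `to_celsius` and the one-letter unit codes are assumptions made for the example.

```python
def to_celsius(value: float, unit: str) -> float:
    """Convert a temperature reading to degrees Celsius."""
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32) * 5 / 9
    if unit == "K":
        return value - 273.15
    raise ValueError(f"unknown unit: {unit}")


for unit in ("C", "F", "K"):
    print(f"42 {unit} = {to_celsius(42, unit):.2f} degrees Celsius")
```

The number is identical in all three cases; only the unit, which is context, turns it into information.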

data is not just numbers

tables, documents

wikipedia: pages / articles

flickr: images

twitter: tweets

internal structure

x

possible values

basic types

structured

semi-structured

basic types

integer, real, complex

vectors (RGB, ...)

characters, strings

structured data

flat: 1D, 2D, 3D, ...

hierarchical: tweets

relations: graphs
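The three kinds of structure listed above can be sketched with plain Python containers. All values and field names (including the tweet fields) are invented for illustration and do not reproduce any real service's format.

```python
# Flat (2D): a table as a list of rows; every row has the same columns.
table = [
    ["city", "population"],
    ["A", 100],
    ["B", 250],
]

# Hierarchical: a tweet-like record with nested fields (hypothetical names).
tweet = {
    "id": 1,
    "text": "hello open data",
    "user": {"name": "someone", "followers": 10},
    "hashtags": ["opendata"],
}

# Relations: a graph stored as an adjacency list (node -> set of neighbours).
graph = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a"},
}


def neighbours(g: dict, node: str) -> set:
    """Return the nodes directly connected to `node`."""
    return g.get(node, set())


print(table[1][1])                      # a cell addressed by row and column
print(tweet["user"]["name"])            # a value addressed by a path of keys
print(sorted(neighbours(graph, "a")))   # values addressed by a relation
```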

semi-structured data

documents

HTML pages

what is Open?

openness as freedom

5 Rs model

Reuse

Revise

Remix

Redistribute

Retain

open vs free

https://theodi.org/blog/when-data-is-free-but-not-open

open is a combination of

no technological barriers

no legal barriers

technological barriers

data must be

accessible

downloadable

manipulable

the 5 star model

* not manipulable: pdf, tiff

** proprietary: doc, ppt, xls

*** open formats: txt, csv, json

**** accessible (link): xml, rdf

***** provide context: xml, rdf

http://5stardata.info/en/

open data needs at least 3 stars

open formats

open software

linked data

use URIs to name things

use HTTP to provide access

describe data using metadata

link to related data sources

readable by machines

why linked data?

automatic web data extraction

data exchange / enrichment

construction of knowledge

semantic searches

example: wikidata

municipalities surrounding Barcelona?

https://en.wikipedia.org/wiki/Barcelona

https://www.wikidata.org/wiki/Q1492
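The question above can be phrased as a SPARQL query against the public Wikidata query service. The sketch below only builds the request URL and shows how a JSON answer would be read; it makes no network call. Q1492 comes from the slide; using property P47 ("shares border with") as a stand-in for "surrounding" is an assumption, as are the function names `build_url` and `labels`.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"

# Q1492 is Barcelona (from the slide); P47 is assumed to express "surrounding".
QUERY = """
SELECT ?place ?placeLabel WHERE {
  wd:Q1492 wdt:P47 ?place .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def build_url(query: str = QUERY) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def labels(result: dict) -> list:
    """Extract the place labels from a SPARQL JSON result document."""
    return [row["placeLabel"]["value"] for row in result["results"]["bindings"]]


print(build_url())
```

Because every entity and property is named by a URI, a program can ask this kind of question without parsing the human-oriented Wikipedia page.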

"static" access

data is downloaded as a file

files are "pictures of the past"

not defined by final users

typical of data repositories

human oriented

http://dadesobertes.gencat.cat/en/cercador/detall-cataleg/?id=5

"dynamic" access

data is downloaded as a stream

streams are "pictures of the present"

parametrized by final users (API)

typical of online services

machine oriented

legal barriers

reachable through the Internet does not mean open

licenses

terms and conditions

EULAs

licenses for open data

for datasets / databases

facts cannot be restricted...

...but collections can!

http://opendatacommons.org/licenses/

terms and conditions

for web data

legal language

http://www.coca-colacompany.com/our-company/the-coca-cola-company-terms-of-use

EULA

End-User License Agreement

for apps and online services

legal language

absurd!

https://www.eff.org/wp/dangerous-terms-users-guide-eulas

ethical issues

privacy

security

transparency

why open data?

why not?

data belongs to its producers

in most cases, users!

it promotes participation

it discovers additional value

"data is the new oil" (C. Humby)

"data is the new soil" (D. McCandless)

data life-cycle

data is ...

generated

stored / published

gathered / captured

preprocessed

analyzed

visualized

data generation

by humans / sensors / services

anytime / anywhere

persistent / volatile

stored / published

data gathering

from repositories

APIs

social networks

databases / logs

web scraping

humans (captcha)

data preprocessing

filtering / selection

join (enrichment)

feature extraction

conversion

summarize / aggregate
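Three of the steps above (filtering, join, aggregation) can be sketched with plain Python lists and dictionaries. The records, field names and shop identifiers are invented for illustration only.

```python
from collections import defaultdict

# Toy records, invented for illustration only.
sales = [
    {"shop": "s1", "item": "i1", "units": 3},
    {"shop": "s1", "item": "i2", "units": 0},
    {"shop": "s2", "item": "i1", "units": 5},
]
shops = {"s1": {"city": "X"}, "s2": {"city": "Y"}}

# Filtering / selection: keep only rows with at least one unit sold.
kept = [row for row in sales if row["units"] > 0]

# Join (enrichment): add the city of each shop from a second source.
enriched = [{**row, "city": shops[row["shop"]]["city"]} for row in kept]

# Summarize / aggregate: total units per city.
totals = defaultdict(int)
for row in enriched:
    totals[row["city"]] += row["units"]

print(dict(totals))
```

For larger datasets the same three steps are usually delegated to a dedicated tool rather than written by hand.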

data analysis

statistical descriptors

inference

unsupervised (clustering)

supervised (classification)

variable relevance

...
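As a small illustration of the first item only (statistical descriptors), the sketch below summarizes a list of numbers with Python's standard `statistics` module. The measurements are invented; inference, clustering and classification are not covered here.

```python
import statistics

# Toy measurements, invented for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "n": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.pstdev(values),  # population standard deviation
}
print(summary)
```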

data visualization

visual analysis

summarization

reporting

dashboards

maps / graphs

interactivity

big data

3 Vs

volume

variety

velocity

volume is the number of elements

sample / population size

variety is the number of different forms

dimensionality

velocity is how fast data is produced or changes

longitudinal

other Vs

veracity

value

variability

visibility

...

example: Wal-Mart

(2015) 37 million people shop at Wal-Mart every day from a list of 140,000 items

who buys what when?

why?

other big huge data players

amazon

VISA

telcos

facebook, twitter, ...

google

big data also

uses multiple sources

deals with population, not samples

makes traditional methods obsolete

requires supercomputing / cloud

example

include context data

customer loyalty cards

product interestingness (RFID)

CCTV cameras

social networks

...

tools (examples)

"engineering" approach

solve this problem now with the available tools

no tool solves all problems

problems change, tools too

tools related to data life-cycle

example: URL manipulation

IDESCAT

names of newborn children

parameters: year, sex, place

other: position, sort
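URL manipulation means generating one request URL per combination of parameter values. The sketch below shows the idea; the base URL is a placeholder (the real IDESCAT address is not given in the slides), and the parameter names and values are hypothetical.

```python
from itertools import product
from urllib.parse import urlencode

# Placeholder endpoint: the real IDESCAT URL is not given in the slides.
BASE_URL = "https://example.org/names"


def build_urls(years, sexes, places):
    """Return one URL per combination of year, sex and place."""
    urls = []
    for year, sex, place in product(years, sexes, places):
        params = {"year": year, "sex": sex, "place": place}
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


urls = build_urls(years=[2015, 2016], sexes=["f", "m"], places=["p1"])
for url in urls:
    print(url)
```

The resulting list can be handed to whatever tool downloads the pages.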

example: URL manipulation

use scrapy for data gathering

define desired fields

create list of URLs

identify XPATH (inspect)
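The three steps above can be illustrated without scrapy itself. This sketch swaps in the limited XPath subset of Python's standard `xml.etree.ElementTree` and runs on a small, invented, well-formed page held in a string. The markup, the table id and the field names are assumptions; real pages are often not well-formed XML, which is one reason to use a dedicated scraping tool.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup standing in for a downloaded page.
PAGE = """
<html><body>
  <table id="names">
    <tr><td class="position">1</td><td class="name">NAME-1</td></tr>
    <tr><td class="position">2</td><td class="name">NAME-2</td></tr>
  </table>
</body></html>
"""


def extract_rows(markup: str) -> list:
    """Return one dict per table row, holding the desired fields."""
    root = ET.fromstring(markup.strip())
    rows = []
    # XPath found by inspecting the page: rows of the table with id "names".
    for tr in root.findall(".//table[@id='names']/tr"):
        rows.append({
            "position": int(tr.find("td[@class='position']").text),
            "name": tr.find("td[@class='name']").text,
        })
    return rows


print(extract_rows(PAGE))
```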

data preprocessing

Mr. Data Converter

JSON online editor

OpenRefine

bash+awk, perl, python
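As an example of the python option, format conversion from CSV to JSON needs only the standard library. This is a minimal sketch; the function name `csv_to_json` and the sample input are invented for illustration.

```python
import csv
import io
import json


def csv_to_json(text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return json.dumps(rows, indent=2)


# Invented sample input.
sample = "name,year\nA,2015\nB,2016\n"
print(csv_to_json(sample))
```

Note that every value stays a string after conversion; deciding that `year` is an integer is a separate preprocessing step.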

example

visualizing co-authorship at UOC

data gathered from SCOPUS

unify author names, build graph

no analysis

visualize graph
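The "unify author names, build graph" step can be sketched as follows. The author lists are invented, and the normalization rule (trim, collapse spaces, lower-case) is a deliberately simple assumption; real bibliographic data also needs handling of initials, accents and name order.

```python
from collections import Counter
from itertools import combinations


def normalize(name: str) -> str:
    """Unify spelling variants: trim, collapse spaces, lower-case."""
    return " ".join(name.split()).lower()


def build_graph(papers):
    """Return (papers per author, Counter of co-author pairs)."""
    papers_per_author = Counter()
    edges = Counter()
    for authors in papers:
        unique = sorted({normalize(a) for a in authors})
        papers_per_author.update(unique)
        # Every pair of authors on the same paper adds weight to one edge.
        edges.update(combinations(unique, 2))
    return papers_per_author, edges


# Invented author lists, one per paper.
papers = [
    ["Author A", "author  b"],
    ["AUTHOR A", "Author B", "Author C"],
    ["Author D"],
]
papers_per_author, edges = build_graph(papers)
connected = {name for pair in edges for name in pair}
alone = sorted(set(papers_per_author) - connected)

print(edges.most_common(1))
print(alone)
```

The edge weights and the list of authors without co-authors are the raw material for the visualization.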

what knowledge can we extract from the visualization?

most prolific authors/departments

interdisciplinarity, connectors

internal publication policies

"lone rangers"

what open data can we use to enrich the visualization?

from authors/departments

from papers/journals

...

open data initiatives

agenda oberta

civio

15mpedia

wheredoesmymoneygo?

...

data sources

social networks

open data repositories

scraped web data

...

examples

league of legends & twitter

smileys, weather & twitter

air tickets price fluctuations

barcelona & flickr

barcelona & bicing

US air traffic patterns

project

requirements

teams of 3-4 people

free topic using open data

proof-of-concept

final report

report

summary and goals

data life-cycle description

tools and data used

results

legal and ethical issues

limitations and future work

bibliography and references

calendar

today: team, topic, abstract

next session: work in class

online mentoring

deadline: 23/01/2017

contact

jminguillona[at]uoc[dot]edu

@jminguillona

webpage

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
