Standardizing for Open Data
Ivan Herman, W3C
Open Data Week, Marseille, France, June 26, 2013
Slides at: http://www.w3.org/2013/Talks/0626-Marseille-IH/


DESCRIPTION

Plans of W3C in the area of standardization activities on Data on the Web

TRANSCRIPT

Page 1: Standardizing for Open Data

Standardizing for Open Data
Ivan Herman, W3C
Open Data Week, Marseille, France, June 26, 2013
Slides at: http://www.w3.org/2013/Talks/0626-Marseille-IH/

Page 2: Standardizing for Open Data

Data is everywhere on the Web!

- Public, private, behind enterprise firewalls
- Ranges from informal to highly curated
- Ranges from machine readable to human readable
  - HTML tables, Twitter feeds, local vocabularies, spreadsheets, …
- Expressed in diverse models
  - tree, graph, table, …
- Serialized in many ways
  - XML, CSV, RDF, PDF, HTML tables, microdata, …

Pages 3–7: Standardizing for Open Data

(image-only slides)

Page 8: Standardizing for Open Data

W3C's standardization focus was, traditionally, on Web-scale integration of data

- Some basic principles:
  - use of URIs everywhere (to uniquely identify things)
  - relate resources among one another (to connect things on the Web)
  - discover new relationships through inferences
- This is what the Semantic Web technologies are all about

Page 9: Standardizing for Open Data

We have a number of standards

- RDF 1.1, SPARQL 1.1, URI, JSON-LD, Turtle, RDFa, RDF/XML
- RDF: data model, links, basic assertions; different serializations
- SPARQL: querying data

A fairly stable set of technologies by now!
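
As a minimal illustration of how these pieces fit together, the following sketch assumes Python with the rdflib library (version 6 or later, where JSON-LD output is built in); the URIs and data are invented:

    from rdflib import Graph

    # A few assertions in Turtle: URIs identify things, properties relate them.
    turtle_data = """
    @prefix ex:   <http://example.org/> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    ex:marseille a ex:City ;
        foaf:name "Marseille" ;
        ex:locatedIn ex:france .
    """

    g = Graph()
    g.parse(data=turtle_data, format="turtle")

    # SPARQL: query the data.
    results = g.query("""
        PREFIX ex:   <http://example.org/>
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        SELECT ?name WHERE { ?city a ex:City ; foaf:name ?name . }
    """)
    for row in results:
        print(row.name)                      # -> Marseille

    # One data model, several serializations: the same graph again, as JSON-LD.
    print(g.serialize(format="json-ld"))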

Page 10: Standardizing for Open Data

We have a number of standards

- RDB2RDF, RDF 1.1, RDFS 1.1, SPARQL 1.1, OWL 2, URI, JSON-LD, Turtle, RDFa, RDF/XML
- RDF: data model, links, basic assertions; different serializations
- SPARQL: querying data
- RDFS: simple vocabularies
- OWL: complex vocabularies, ontologies
- RDB2RDF: databases to RDF

A fairly stable set of technologies by now!
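
The RDFS layer can be illustrated the same way: a tiny, invented vocabulary and a SPARQL 1.1 property-path query that surfaces subclass relationships that are only implied, again just a sketch with rdflib:

    from rdflib import Graph

    # A tiny, invented RDFS vocabulary.
    vocabulary = """
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/vocab#> .

    ex:Dataset     a rdfs:Class .
    ex:OpenDataset a rdfs:Class ; rdfs:subClassOf ex:Dataset .
    ex:CSVDataset  a rdfs:Class ; rdfs:subClassOf ex:OpenDataset .
    """

    g = Graph()
    g.parse(data=vocabulary, format="turtle")

    # The '+' property path follows rdfs:subClassOf transitively, so
    # ex:CSVDataset is reported as a subclass of ex:Dataset even though
    # that triple is never stated explicitly.
    query = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX ex:   <http://example.org/vocab#>
        SELECT ?cls WHERE { ?cls rdfs:subClassOf+ ex:Dataset . }
    """
    for row in g.query(query):
        print(row.cls)    # ex:OpenDataset, ex:CSVDataset

A full RDFS or OWL reasoner would derive such implied triples across the whole vocabulary; the property path above only emulates the subclass case.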

Page 11: Standardizing for Open Data

We have Linked Data principles

Page 12: Standardizing for Open Data

Integration is done in different ways

- Very roughly:
  - data is accessed directly as RDF and turned into something useful
    - relies on data being “preprocessed” and published as RDF
  - data is collected from different sources, integrated internally
    - using, say, a triple store (see the sketch below)
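
A rough sketch of that second approach, assuming rdflib and placeholder URLs; an in-memory graph stands in for a real triple store:

    from rdflib import Graph

    # Collect data from several sources into one graph; in practice this would
    # usually be a persistent triple store rather than an in-memory graph.
    g = Graph()
    for source in (
        "http://data.example.org/hospitals.ttl",   # placeholder URLs
        "http://data.example.org/regions.rdf",
    ):
        g.parse(source)   # rdflib picks the parser from the content type / suffix

    # Once merged, a single SPARQL query can span both datasets.
    query = """
        PREFIX ex: <http://data.example.org/ns#>
        SELECT ?hospital ?regionName WHERE {
            ?hospital ex:locatedIn ?region .
            ?region   ex:name      ?regionName .
        }
    """
    for row in g.query(query):
        print(row.hospital, row.regionName)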

Pages 13–14: Standardizing for Open Data

(image-only slides)

Page 15: Standardizing for Open Data

However…

- There is a price to pay: a relatively heavy ecosystem
  - many developers shy away from using RDF and related tools
- Not all applications need this!
  - data may be used directly, no need for integration concerns
  - the emphasis may be on easy production and manipulation of data with simple tools

Page 16: Standardizing for Open Data

Typical situation on the Web

- Data published in CSV, JSON, XML
- An application uses only 1–2 datasets; integration done by direct programming is straightforward (see the sketch below)
  - e.g., in a Web application
- Data is often very large; direct manipulation is more efficient
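
A sketch of that direct style using only the Python standard library; the file names and column names are invented:

    import csv
    import json

    # Dataset 1: a CSV file of measurements, keyed by an area code.
    with open("measurements.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Dataset 2: a JSON file mapping area codes to human-readable names.
    with open("areas.json") as f:
        area_names = json.load(f)        # e.g. {"13055": "Marseille", ...}

    # "Integration" is just a hand-written join over the raw data, no RDF involved.
    for row in rows:
        print(area_names.get(row["area_code"], "unknown"), row["value"])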

Page 17: Standardizing for Open Data

Non-RDF Data

- In some settings that data can be converted into RDF
- But, in many cases, it is not done
  - e.g., CSV data is way too big
  - RDF tooling may not be adequate for the task at hand
  - integration is not a major issue

Page 18: Standardizing for Open Data


Page 19: Standardizing for Open Data

What that application does…

- Gets the data published by the NHS
- Processes the data (e.g., through Hadoop)
- Integrates the result of the analysis with geographical data

I.e., the raw data is used without integration

Page 20: Standardizing for Open Data

The reality of data on the Web…

- It is still a fairly messy space out there :-(
  - many different formats are used
  - data is difficult to find
  - published data are messy, erroneous…
  - tools are complex, unfinished…

Page 21: Standardizing for Open Data

How do developers perceive this?

“When transportation agencies consider data integration, one pervasive notion is that the analysis of existing information needs and infrastructure, much less the organization of data into viable channels for integration, requires a monumental initial commitment of resources and staff. Resource-scarce agencies identify this perceived major upfront overhaul as ‘unachievable’ and ‘disruptive.’”
-- Data Integration Primer: Challenges to Data Integration, US Dept. of Transportation

Page 22: Standardizing for Open Data

One may look at the problem through different goggles

- Two alternatives come to the fore:
  1. provide tools, environments, etc., to help outsiders to publish Linked Data (in RDF) easily
     - a typical example is the Datalift project
  2. forget about RDF, Linked Data, etc., and concentrate on the raw data instead

Page 23: Standardizing for Open Data
Page 24: Standardizing for Open Data

But religions and cultures can coexist… :-)

Page 25: Standardizing for Open Data

Open Data on the Web Workshop

- Had a successful workshop in London, in April
  - around 100 participants
  - coming from different horizons: publishers and users of Linked Data, CSV, PDF, …

Page 26: Standardizing for Open Data

We also talked to our “stakeholders”

- Member organizations and companies
- Open Data Institute, Open Knowledge Foundation, Schema.org
- …

Page 27: Standardizing for Open Data

Some takeaways

- The Semantic Web community needs stability of the technology
  - do not add yet another technology block :-)
  - existing technologies should be maintained

Page 28: Standardizing for Open Data

Some takeaways

- Look at the more general space, too
  - importance of metadata
  - deal with non-RDF data formats
  - best practices are necessary to raise the quality of published data

Page 29: Standardizing for Open Data

We need to meet app developers where they are!

Page 30: Standardizing for Open Data

Metadata is of major importance

- Metadata describes the characteristics of the dataset (see the sketch below)
  - structure, datatypes used
  - access rights, licenses
  - provenance, authorship
  - etc.
- Vocabularies are also key for Linked Data
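
For example, such metadata might be written with the W3C Data Catalog (DCAT) and Dublin Core vocabularies; the sketch below describes a fictional dataset in Turtle and merely parses it with rdflib:

    from rdflib import Graph

    # A (fictional) dataset description: license, publisher, issue date and
    # distribution format, independent of the data's internal structure.
    metadata = """
    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix ex:   <http://data.example.org/> .

    ex:hospital-stats a dcat:Dataset ;
        dct:title     "Hospital statistics 2013" ;
        dct:publisher ex:ministry-of-health ;
        dct:license   <http://creativecommons.org/licenses/by/4.0/> ;
        dct:issued    "2013-06-01"^^<http://www.w3.org/2001/XMLSchema#date> ;
        dcat:distribution [
            dcat:downloadURL <http://data.example.org/hospital-stats.csv> ;
            dct:format "text/csv"
        ] .
    """

    g = Graph()
    g.parse(data=metadata, format="turtle")
    print(len(g), "metadata triples")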

Page 31: Standardizing for Open Data

Vocabulary Management Action

- Standard vocabularies are necessary to describe data
  - there are already some initiatives: W3C’s data cube, data catalog, PROV, schema.org, DCMI, …
- At the moment, it is a fairly chaotic world…
  - many, possibly overlapping vocabularies
  - difficult to locate the one that is needed
  - vocabularies may not be properly managed, maintained, versioned, provided persistence…

Page 32: Standardizing for Open Data

W3C’s plan:

- Provide a space whereby
  - communities can develop vocabularies
  - host vocabularies at W3C if requested
  - annotate vocabularies with a proper set of metadata terms
  - establish a vocabulary directory
- The exact structure is still being discussed: http://www.w3.org/2013/04/vocabs/

Page 33: Standardizing for Open Data
Page 34: Standardizing for Open Data

CSV on the Web

- Planned work areas:
  - metadata vocabulary to describe CSV data (see the sketch below)
    - structure, reference to access rights, annotations, etc.
  - methods to find the metadata
    - part of an HTTP header, special rows and columns, packaging formats…
  - mapping content to RDF, JSON, XML
- Possibly at a later phase:
  - API standards to access CSV data
Page 35: Standardizing for Open Data
Page 36: Standardizing for Open Data

Open Data Best Practices

- Document best practices for data publishers
  - management of persistence, versioning, URI design
  - use of core vocabularies (provenance, access control, ownership, annotations, …)
  - business models
- Specialized metadata vocabularies
  - quality description (quality of the data, update frequencies, correction policies, etc.)
  - description of data access APIs
  - …

Page 37: Standardizing for Open Data

Summary

- Data on the Web has many different facets
- We have concentrated on the integration aspects in the past years
- We have to take a more general view, look at other types of data published on the Web

Page 38: Standardizing for Open Data

In future…

- We should look at other formats, not only CSV
  - MARC, GIS, ABIF, …
- Better outreach to data publishing communities and organizations
  - WF, RDA, ODI, OKFN, …

Page 39: Standardizing for Open Data

Enjoy the event!