presentation adequate project: workshop on quality assessment and improvements in open data...

www.adequate.at

Workshop on Quality Assessment and Improvements on Open Data (Portals)

opendata.ch conference, 14.6.2016, 12:45 - 14:00 CEST
Lausanne, Casino de Montbenon, Allée Ernest-Ansermet 3

Slides published CC-BY AT 3.0

Jürgen Umbrich, Vienna University of Economics and Business
Johann Höchtl, Donau-Universität Krems
Martin Kaltenböck, Semantic Web Company

Upload: martin-kaltenboeck

Post on 14-Jan-2017



Agenda

Time: 20' incl. Q&A
Session: Welcome & Introduction
● WS Objectives, Agenda & WS Team
● Participants
● The ADEQUATe project: basics, objectives, status & outlook
Remarks: Martin Kaltenböck (SWC)

Time: 20' incl. Q&A
Session: Results of Requirements Elicitation, DQ Metrics and Interaction Items
● What do the users want?
● Which metrics are the most "important" ones? Which metrics specifically target openness?
● Why do data portals need quality interaction items with end users, and what do we plan to do in ADEQUATe?
Remarks: Johann Höchtl (DUK)

Time: 20' incl. Q&A
Session: Best Practice & the ADEQUATe OD Framework
● Data & CSV on the Web working group recommendations (W3C)
● AD Framework: architecture & components
Remarks: Jürgen Umbrich (WU)

Time: 15' open discussion
Session: Interactive & open discussion on DQ issues
● Requirements for DQ in Open Data
● What is in place or planned for DQ
Remarks: Moderated by the WS Team

FFG Project
http://www.adequate.at

The ADEQUATe project is funded by the Austrian Federal Ministry for Transport, Innovation and Technology under the RTI programme "ICT of the Future" and is administered by the Austrian Research Promotion Agency (FFG) [project number: 849982].

What is ADEQUATe?

ADEQUATe Open Data (Analytics & Data Enrichment to improve the QUAliTy of Open Data) builds on two observations:

● An increasing amount of Open Data is becoming available as an important resource for emerging businesses.
● The integration of such open, freely re-usable data sources into organisations' data warehouse and data management systems is seen as a key success factor for competitive advantage in a data-driven economy.

The project identifies crucial issues that have to be tackled to fully exploit the value of open data and to integrate it efficiently with other data sources:

● the overall quality issues with metadata and the data itself
● the lack of interoperability between data sources

The project's approach is to address these points at an early stage, when the open data is freshly provided by governmental organisations or others.


What is ADEQUATe?

✓ 3 Partners:
   1. Semantic Web Company
   2. Danube University Krems
   3. Vienna University of Economics and Business
✓ 30 months project duration, Oct. 2015 - March 2018
✓ 2 Use Case Partners: data.gv.at & opendataportal.at
✓ Objective: Improvement of Data Quality through:
   ○ Quality Assessment and Monitoring
   ○ Automatic Algorithms
   ○ Making use of Linked Data principles
   ○ Improvements of the data by the user (community)

Project Structure & Schedule

● WP1 - Requirements & Specification
● WP2 - Quality Improvement & Monitoring Framework
● WP3 - Algorithms & Tools for Quality Improvements
● WP4 - Data Linkage
● WP5 - Community driven Quality Improvements
● WP6 - Use Case Integration
● WP7 - Project Management & Dissemination

Outlook & Timing of Results

● M9 (06/2016): Quality metrics, Requirements
● M10 (07/2016): Architecture Blueprint
● M15 (12/2016): Quality monitoring framework, Data linkage
● M21 (06/2017): Quality improvements, Use case connection
● M30 (03/2018): Evaluation, Refinements, Improvements

Concrete Outputs & Outlook

✓ End of June 2016: 3 Deliverables
   ○ State of the Art
   ○ Requirements Elicitation
   ○ Quality Metrics
✓ End of July 2016: 1 Deliverable
   ○ Architecture Blueprint
   ○ All components specified
✓ End of 2016: ADEQUATe Framework - 1st release
   ○ Assessment & Monitoring Framework
   ○ Data Quality Algorithms & Tools
   ○ Linked Data Mechanisms
   ○ 1st set of user driven Mechanisms
✓ Early 2017: Dock onto ODP & data.gv.at

Requirements Elicitation, DQ Metrics and Interaction Items

Results of Requirements Elicitation

Contents and Formats

○ I would really prefer to have the data themselves consistent. [...] metadata does not match; standards regarding the representation of their content
○ It would be really great if we could shift somehow to UTF-8
○ meta data for CSV files were incomplete [...] header for CSV was missing
○ no static identifiers for objects in data sets. This in turn leads to problems if you want to track changes related to these objects over time

Communication

○ central communication point for exchanging experiences and issues
○ Meta data should be written in English language

Reliability

○ Servers are restarted every day [...] hosted data becomes unavailable

DQ metrics (1)

Completeness

● Metadata completeness: How many (mandatory) metadata keys have values?
● Table completeness: How many (CSV) cells have non-null values?

Timeliness

● Tau of Data: How "outdated" are datasets relative to the promised update frequency?
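The completeness and timeliness metrics on this slide can be sketched in a few lines of Python. This is only an illustration, not the ADEQUATe implementation: the function names, the set of null markers and the simple age-to-interval ratio used for timeliness are assumptions made for the example.

```python
import csv
import io
from datetime import date


def metadata_completeness(metadata, mandatory_keys):
    """Share of mandatory metadata keys that carry a non-empty value."""
    if not mandatory_keys:
        return 1.0
    filled = 0
    for key in mandatory_keys:
        value = metadata.get(key)
        if value is not None and str(value).strip():
            filled += 1
    return filled / len(mandatory_keys)


def table_completeness(csv_text, null_markers=("", "NA", "null")):
    """Share of data cells (header row excluded) that are not null markers."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    cells = [cell.strip() for row in rows[1:] for cell in row]
    if not cells:
        return 0.0
    return sum(cell not in null_markers for cell in cells) / len(cells)


def staleness(last_modified, promised_interval_days, today):
    """Age of the dataset divided by the promised update interval.

    Values above 1.0 mean that the promised update is overdue.
    (Assumed definition, chosen for illustration only.)
    """
    return (today - last_modified).days / promised_interval_days


if __name__ == "__main__":
    meta = {"title": "Exhibitions", "license": "", "publisher": None}
    print(metadata_completeness(meta, ["title", "license", "publisher"]))
    print(table_completeness("id,city\n1,\n2,Wien\n"))
    print(staleness(date(2016, 1, 1), 30, date(2016, 3, 1)))
```

All three functions return plain ratios, so they could be tracked per dataset over time; how the scores are weighted or thresholded is left open here.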

DQ metrics (2)

Machine readability

● Regularity of CSV files (CSV-Lint), RDF, ...
● Structural consistency: variations in the structure of CSV files

Openness

● Open formats: there is no well-defined definition of what constitutes an open format
● Open licenses: opendefinition.org seems to have them all covered

Persistence
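As a rough illustration of the structural consistency metric, the following Python sketch profiles how many columns each row of a CSV file has and reports the rows that deviate from the most common width. The function names are hypothetical and the check is far simpler than a full linter such as CSV-Lint.

```python
import csv
import io
from collections import Counter


def column_count_profile(csv_text, delimiter=","):
    """Count how many non-empty rows have each number of columns."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    return Counter(len(row) for row in reader if row)


def irregular_rows(csv_text, delimiter=","):
    """Return 1-based row numbers whose width differs from the most common width."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    profile = Counter(len(row) for row in rows if row)
    if not profile:
        return []
    expected = profile.most_common(1)[0][0]
    return [
        number
        for number, row in enumerate(rows, start=1)
        if row and len(row) != expected
    ]


if __name__ == "__main__":
    sample = "a,b,c\n1,2,3\n4,5\n"
    print(column_count_profile(sample))
    print(irregular_rows(sample))
```

A file whose profile contains more than one width is structurally inconsistent in this simple sense; blank rows are ignored rather than counted as irregular.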

DQ Metrics - Persistence?

ADEQUATe: 11 Dimensions & 46 Metrics

Contributors to DQ improvement

● Publishers
● Community
● Algorithms & Linked Data

Contributors to DQ Improvements (1/2)

● Providers
   ○ Correctness and Completeness of Data and Metadata
   ○ SLAs governing availability
   ○ Readiness for feedback, discussion and interaction
● Algorithms
   ○ Automated improvements
      ■ Availability checks and reporting
      ■ Missing information, outliers
      ■ Check of format (valid UTF-8?), size
      ■ Data format conversions: CSV → CSV on the Web specification
   ○ Semi-automated Improvements and Enhancements
      ■ Identification of related data sets
      ■ Mapping of (data) attributes, ...
● Interaction with the Data Community
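One of the automated checks listed above, "valid UTF-8?" together with a size check, could look roughly like the following Python sketch. The function name, the report keys and the default size limit are assumptions made for this example; fetching the file and reporting the result are left out.

```python
def check_payload(payload, max_bytes=50_000_000):
    """Report size and UTF-8 validity for the raw bytes of a downloaded resource."""
    report = {
        "size_bytes": len(payload),
        "too_large": len(payload) > max_bytes,
    }
    try:
        payload.decode("utf-8")
        report["valid_utf8"] = True
    except UnicodeDecodeError as error:
        report["valid_utf8"] = False
        # Byte offset of the first sequence that is not valid UTF-8.
        report["first_bad_offset"] = error.start
    return report


if __name__ == "__main__":
    print(check_payload("Höchtl".encode("utf-8")))
    print(check_payload("Höchtl".encode("latin-1")))
```

Reporting the offset of the first invalid byte gives a publisher a concrete place to look, which fits the "checks and reporting" idea on the slide.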

Interaction: Data Community

● Review the results of automated enhancements
   ○ Interlinking
   ○ format conversions
   ○ encodings
● Correct and report mistakes
● Data enrichment and transformations

Interaction: Data Community

https://open.wien.gv.at/site/riesenbaum-in-wien-entdeckt/#more-87184

Interaction: Forking: Identify - Improve - Share

[Figure: an example table (rows "1 47 11" and "2 48 15") and two forked copies, labelled 1 and 2; in copy 2 the value 48 is changed to 47.]


Making results tangible

Government Procurement Queries project
US Government contracts 2000 - 2016 (USAspending.gov)
https://github.com/antontarasenko/gpq/blob/master/notebooks/contracts_intro.ipynb

The ADEQUATe OD Framework & publishing CSVs for humans and machines

The ADEQUATe Framework

● The ADEQUATe framework offers:
   ○ quality assessment and monitoring
   ○ a set of data quality improvement algorithms
   ○ a set of algorithms to create and maintain a knowledge graph and to "link" data into this graph
      ■ Think of shared identifiers for addresses, companies, departments, parties, ...
   ○ community involvement (e.g., data editors, feedback loops, forking & merging)
● Main objectives:
   ○ all developed components will be Open Source (see the ADEQUATe GitHub repo)
   ○ components should be usable as standalone components
      ■ Use only what you need

The ADEQUATe Framework

● Core Components
   1. Data monitoring
   2. Knowledge Vault
   3. Quality Assessment
   4. Quality Improvement
   5. Data Linkage
   6. Community Improvement
   7. UI, API & User authentication

[Figure: architecture diagram. Labels: Users, Clients, UI, Public API catalog, Authentication / Load Balancing, Orchestration / API, (Meta)Data Monitor, Knowledge Vault, Quality Assessment, Quality Improvement, Linkage, Community Improvement, data.gv.at, ODP. Legend: RESTful API, Component, Data.]

W3C CSV on the Web & ADEQUATe

One core feature in ADEQUATe will be the use of the CSV on the Web metadata standard, which makes it possible to:

➢ describe CSV files
   ○ dialect & encoding used
   ○ table & column descriptions (with language tags)
   ○ data types and value ranges for columns
➢ add semantics to them
   ○ primary & foreign keys, URIs, entity types, ...
➢ validate CSV files against a predefined schema
➢ specify the transformation
   ○ CSV -> JSON or RDF

W3C CSV on the Web: Metadata standard

W3C CSV on the Web: Example (JSON-LD) 1/3

{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "url": "http://data.mumok.at/exhibition.csv",
  "dc:title": "Exhibitions for objects from the mumok collection",
  "dcat:keyword": ["art", "museum", "exhibition"],
  "dc:publisher": {
    "schema:name": "mumok - museum moderner kunst stiftung ludwig wien",
    "schema:url": {"@id": "http://www.mumok.at"}
  },
  "dc:license": {"@id": "https://creativecommons.org/licenses/by/3.0/at/legalcode"},
  "dc:modified": {"@value": "2015-07-04", "@type": "xsd:date"},
  …

W3C CSV on the Web: Example (JSON-LD) 2/3

  "dialect": {
    "encoding": "utf-8",
    "lineTerminators": ["\r\n", "\n"],
    "quoteChar": "\"",
    "doubleQuote": true,
    "skipRows": 0,
    "commentPrefix": "#",
    "header": true,
    "headerRowCount": 1,
    "delimiter": ",",
    "skipColumns": 0,
    "skipBlankRows": false,
    "skipInitialSpace": false,
    "trim": false
  },
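To show what a consumer can do with such a dialect description, here is a minimal Python sketch that feeds some of its settings into the standard csv module. The helper name is hypothetical, and the sketch is not a conforming CSV on the Web parser: it ignores "lineTerminators", "header" and "trim", and it splits the text into lines first, so quoted fields that contain line breaks are not supported.

```python
import csv

# Dialect values copied from the slide above.
DIALECT = {
    "encoding": "utf-8",
    "lineTerminators": ["\r\n", "\n"],
    "quoteChar": "\"",
    "doubleQuote": True,
    "skipRows": 0,
    "commentPrefix": "#",
    "header": True,
    "headerRowCount": 1,
    "delimiter": ",",
    "skipColumns": 0,
    "skipBlankRows": False,
    "skipInitialSpace": False,
    "trim": False,
}


def read_with_dialect(raw, dialect):
    """Split raw CSV bytes into (header_rows, data_rows) using dialect settings."""
    text = raw.decode(dialect.get("encoding", "utf-8"))
    lines = text.splitlines()[dialect.get("skipRows", 0):]
    prefix = dialect.get("commentPrefix")
    if prefix:
        lines = [line for line in lines if not line.startswith(prefix)]
    if dialect.get("skipBlankRows", False):
        lines = [line for line in lines if line.strip()]
    reader = csv.reader(
        lines,
        delimiter=dialect.get("delimiter", ","),
        quotechar=dialect.get("quoteChar", '"'),
        doublequote=dialect.get("doubleQuote", True),
        skipinitialspace=dialect.get("skipInitialSpace", False),
    )
    skip_columns = dialect.get("skipColumns", 0)
    rows = [row[skip_columns:] for row in reader]
    header_count = dialect.get("headerRowCount", 1)
    return rows[:header_count], rows[header_count:]


if __name__ == "__main__":
    raw = b'# exported 2015\r\nexhibition_id,city\r\n1,"Wien, AT"\r\n'
    header, data = read_with_dialect(raw, DIALECT)
    print(header)
    print(data)
```

The point of the dialect block is exactly this: a client no longer has to guess the delimiter, quoting rules or encoding, because the publisher states them.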

W3C CSV on the Web: Example (JSON-LD) 3/3

  "tableSchema": {
    "columns": [{
      "name": "exhibition_id",
      "titles": "Exhibition Identifier",
      "dc:description": "A unique identifier for the exhibition.",
      "datatype": "integer",
      "required": true
    }, {
      "name": "city",
      "titles": "City",
      "dc:description": "The city in which the exhibition took place (no language defined, mostly in German).",
      "datatype": "string"
    }
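The "validate CSV files against a predefined schema" idea can be illustrated with a small Python sketch that checks rows against the two column descriptions shown above. It is a simplification under stated assumptions, not a CSV on the Web validator: the function name and the sample data are made up, only "required" and the "integer" datatype are checked, and the CSV header is assumed to use the column "name" values.

```python
import csv
import io

# Column descriptions taken from the slide; everything else is omitted.
TABLE_SCHEMA = {
    "columns": [
        {"name": "exhibition_id", "datatype": "integer", "required": True},
        {"name": "city", "datatype": "string"},
    ]
}


def validate_rows(csv_text, table_schema):
    """Return (line_number, column_name, problem) tuples for invalid cells."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header is line 1, so the first data row is line 2.
    for line_number, row in enumerate(reader, start=2):
        for column in table_schema["columns"]:
            name = column["name"]
            value = (row.get(name) or "").strip()
            if not value:
                if column.get("required"):
                    problems.append((line_number, name, "missing required value"))
                continue
            if column.get("datatype") == "integer":
                try:
                    int(value)
                except ValueError:
                    problems.append((line_number, name, "not an integer"))
    return problems


if __name__ == "__main__":
    sample = "exhibition_id,city\n1,Wien\n,Graz\nabc,Linz\n"
    for problem in validate_rows(sample, TABLE_SCHEMA):
        print(problem)
```

Because the schema travels with the data, a portal could run a check of this kind automatically whenever a publisher uploads a new version of the file.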

W3C CSV on the Web: Discovery

● Registered content type: application/csvm+json
● 3 discovery mechanisms
   ○ File extension
      ■ http://data.mumok.at/exhibition.csv -> http://data.mumok.at/exhibition.csv-metadata.json
   ○ Well-known location
      ■ /.well-known/csvm
   ○ Link HTTP header

» curl -I http://data.mumok.at/exhibition.csv
HTTP/1.1 200 OK
Date: Thu, 26 Nov 2015 22:18:47 GMT
Server: Apache/2.2.22 (Debian)
…
Content-Length: 112723
Content-Type: text/csv; charset=utf-8; header=present
Link: </exhibition.csv-metadata.json>;rel=describedBy;type=application/csvm+json
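A client-side view of these discovery mechanisms might look like the following Python sketch. The function names are hypothetical and the Link header parsing is deliberately naive (it splits on commas, so it would break on link targets that contain commas); it only shows how the three locations from the slide relate to the CSV URL.

```python
from urllib.parse import urljoin


def default_metadata_locations(csv_url):
    """Locations derived from the CSV URL: the suffix rule and the well-known path."""
    return [
        csv_url + "-metadata.json",
        urljoin(csv_url, "/.well-known/csvm"),
    ]


def described_by(link_header, csv_url):
    """Return the absolute URL of the rel=describedBy link, or None if absent."""
    for link in link_header.split(","):
        parts = [part.strip() for part in link.split(";")]
        target = parts[0]
        if not (target.startswith("<") and target.endswith(">")):
            continue
        params = {}
        for part in parts[1:]:
            name, _, value = part.partition("=")
            params[name.strip().lower()] = value.strip().strip('"')
        # Compare relation types case-insensitively.
        if params.get("rel", "").lower() == "describedby":
            return urljoin(csv_url, target[1:-1])
    return None


if __name__ == "__main__":
    url = "http://data.mumok.at/exhibition.csv"
    header = "</exhibition.csv-metadata.json>;rel=describedBy;type=application/csvm+json"
    print(described_by(header, url))
    print(default_metadata_locations(url))
```

The header value used here is the one from the curl output above; no network request is made in the sketch.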

CSV on the Web Summary

● Don't publish CSV on the Web only for humans; publish it for machines as well
   ○ e.g., EXCEL exports
● RFC 4180
● Encoding
   ○ Use UTF-8, don't mix encodings
● File extension: .csv
● Content-type: text/csv

Optional, but a big improvement!

● Ideally, publish CSV metadata alongside your CSV file
● Avoid acronyms or encodings (e.g., sex=1,2,3)
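The first recommendations (RFC 4180, UTF-8) can be followed with very little code. The Python sketch below, with a hypothetical helper name and made-up sample rows, writes comma-separated records with CRLF line endings and doubled quotes, then encodes the result as UTF-8.

```python
import csv
import io


def to_rfc4180_bytes(header, rows):
    """Serialise rows as comma-separated, CRLF-terminated, UTF-8 encoded CSV."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(
        buffer,
        delimiter=",",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        doublequote=True,           # escape a quote by doubling it
        lineterminator="\r\n",
    )
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


if __name__ == "__main__":
    data = to_rfc4180_bytes(
        ["id", "city"],
        [[1, 'Wien "Mitte"'], [2, "Graz, AT"]],
    )
    print(data)
```

Serving the resulting bytes with the .csv extension and the text/csv content type covers the remaining mandatory points on the slide.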

CSV on the Web Summary

● CSV URLs
● CSVs link to other CSVs
● CSVs link to other resources
● RDF and JSON conversion


Contact

Jürgen Umbrich, Vienna University of Economics and Business
Juergen.umbrich @ wu.ac.at
Short CV: https://www.wu.ac.at/en/infobiz/team/umbrich/

Johann Höchtl, Donau-Universität Krems
Johann.hoechtl @ donau-uni.ac.at
Short CV: https://at.linkedin.com/in/johannhoechtl

Martin Kaltenböck, Semantic Web Company
Short CV: https://www.linkedin.com/in/martinkaltenboeck

http://adequate.at/
http://vienna.theodi.org