Experimental transformation of ABS data into Data Cube Vocabulary (DCV) format Why? How? What was learned?

Upload: alistair-hamilton

Post on 28-Jul-2015

TRANSCRIPT

Page 1: Experimental transformation of  ABS data into Data Cube Vocabulary (DCV) format  : Why, How and What was learned

Experimental transformation of ABS data into Data Cube Vocabulary (DCV) format

Why? How?

What was learned?

Page 2

Outline

I. Context
  – Transforming national & international statistical systems
  – Semantic Web / Linked Data meets Official Statistics
  – SemStats 2013
  – Parameters for the R&D project
II. Investigation of existing tools
III. Summary of the transformation process
IV. Lessons learned
V. Discussion

Page 3

2009 (Australia)

• The case for an international statistical innovation program: Transforming national and international statistics systems

• Future capabilities
  1. From static data products to “common information services”
  2. From publications to communication
  3. Support for transaction data flowing at a much higher volume
  4. Ability to rapidly incorporate new issues and views of data into standards and classifications
  5. ‘Rapid-response’ capability
  6. Connecting processes and passing metadata and data easily between them
  7. Analysing assemblies of data

Page 4

The Challenges

• Increasing cost & difficulty of acquiring survey data
• New sources & changing expectations
• Rapid changes in the environment
• Competition for skilled resources
• Diminishing budgets
• Riding the big data wave

Page 5

HLG

• High-Level Group for the Modernisation of Statistical Production and Services

• Comprises 10 heads of national and international statistical organisations
  – Gosse van der Veen (Netherlands) - Chairman
  – Brian Pink (Australia)
  – Eduardo Sojo Garza-Aldape (Mexico)
  – Enrico Giovannini (Italy)
  – Woo, Ki-Jong (Republic of Korea)
  – Irena Križman (Slovenia)
  – Katherine Wallman (United States)
  – Walter Radermacher (Eurostat)
  – Martine Durand (OECD)
  – Lidia Bratanova (UNECE)

The official statistics industry and its place in the wider information industry

From Strategy to implement the vision of the HLG (2012)

Page 6

Grouping the challenges

1. Product Challenge – Modernising Statistical Services
  • Designing and delivering new and better statistical outputs (products and services)

2. Process Challenge – Modernising Statistical Production
  • Developing and implementing new and better production processes and methods which are capable of delivering statistical outputs with
    i. reduced cost, and
    ii. greater flexibility.

Page 7

HLG Strategy

• Standards-based, collaborative modernisation of official statistics.

• Create an environment (eg “common architecture”) that facilitates collaborative development, sharing and reuse of
  – statistical business processes
  – statistical methods
  – IT components
  – data repositories

• Explicit role for
  – common conceptual frameworks, eg
    • GSIM (Generic Statistical Information Model)
  – and common implementation standards, eg
    • SDMX (Statistical Data and Metadata eXchange), working with
    • DDI (Data Documentation Initiative)

Page 8

ABS main data services support SDMX

• ABS.Stat Beta
  – Dissemination from predefined aggregate data cubes
    • eg Consumer Price Index
  – Featured at GovHack 2013
  – Based on OECD.Stat
    • Now used by OECD, IMF, UNESCO, European Commission, ABS, Statistics New Zealand, Statistics Italy
    • Further development through SIS Collaboration Community

• TableBuilder
  – Dissemination of on demand tabulations from microdata
    • Includes Population Census

Page 9

Harnessing the opportunities

• Global community around SDMX
  – intersects with SIS Collaboration Community

• Working on
  – SDMX to JSON (JavaScript Object Notation)
    • Making life easier for third party developers
      – No need to parse SDMX-ML
    • Object model similar to Data Cube Vocabulary (DCV)
    • Expected to be released for review in September
  – SDMX to Data Cube Vocabulary (DCV)
    • Much earlier stage within SIS Collaboration Community

Page 10

Layering standards on standards

• RDF Data Cube Vocabulary (DCV) developed under W3C

– designed for publishing multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts

– based upon the approach used by the SDMX ISO standard for statistical data exchange

– very general and can be used for other data sets such as survey data, spreadsheets and OLAP data cubes

Page 11

Use of DCV

• Usage within
  – data.gov.uk
  – Eurostat
  – Other institutions within the European Union via the EU’s Open Data Portal
    • eg European Environment Agency
  – Experimental use within data.gov.au

Page 12

Linked Data view on Official Statistics

• Official Statistics and the Practice of Data Fidelity
  – Official statistics are the “crown jewels” of a nation’s public data
  – Provide empirical evidence for policy making and economic research
  – Statistical offices are among the most “data-savvy” organisations in government
  – Handling of Statistical Data as Linked Data requires particular attention to maintain its integrity and fidelity

• Linked SDMX Data
  – Challenges
    • Automation of the transformation of data from high profile statistical organizations
    • Minimization of third-party interpretation of the source data and metadata and lossless transformations

Page 13

(Unofficial) view from Official Statistics

• Semantic Statistics opportunities include:
  – external application of statistical classifications, and other statistical concept schemes, as ontologies
  – simpler, more flexible and more powerful use of statistical data alongside other data
  – partnering more closely with other “data” communities

• Semantic Statistics issues and risks include:
  – ensuring production process is sustainable
  – ensuring semantics are identified consistently across all statistical outputs from a single agency
  – possible lack of rigour when defining and linking concepts to outputs from other sources
  – the possibility of “fuzzy” semantics leading to incorrect data analyses

Page 14

SemStats 2013

• Interest in “Semantic Statistics” is growing rapidly within Statistical and Semantic Web communities

• There are existing semantic web developments building on both SDMX and DDI

• SemStats 2013 provides a rare opportunity to interact with world experts while they’re in Australia

• We are interested in what entrants might create and demonstrate in regard to SemStats 2013 Challenge

Page 15

SemStats 2013 Challenge

• Provides Australian and French Census data in Data Cube Vocabulary (DCV) format

– Data is Geography x Sex x Age x “Activity” status

– Entrants are asked to demonstrate value from innovative application of semantic web technologies to the data.

Page 16

Aim when preparing Australian content

• use as an opportunity for practical learning

• start with SDMX-ML (not, eg, CSV) (if possible)
  – Plan A: SDMX-ML from TableBuilder

• use existing international tools for SDMX-ML to DCV transformations (if possible)

• do the work within the ABS (if possible)

• Plan B was to ask INSEE (Statistics France) to help us with the transformation

Page 17

Investigation

• Datalift
  – Supports multiple input types
  – Generic transformation
  – Supports dissemination to the web

• Mimas
  – XSLT based
  – Complicated

• Guillaume report
  – From INSEE
  – Highly tailored to the input data

Page 18

Datalift

• Free to use – source code also available
• Java web application
• Supports multiple input types
  – Semantic graphs
  – Relational databases
  – Files (CSV, XML, etc)
• Supports entire cycle
  – INSEE plan to use in future
• SDMX -> DCV plug-in in development

Page 19

Mimas

• Inflexible
  – XML input only
  – XML output only
• Cumbersome
  – Requires multiple intermediate conversions
• Inefficient for large volumes of data

Page 20

Guillaume Report

• INSEE short term solution
• Datalift was not mature enough
• Mimas identified as cumbersome and inefficient
• Opted to use Apache Jena for small Java application

Page 21

Technology Overview

• Census TableBuilder
  – Data extracted in SDMX and CSV
• Java
  – Apache Jena library
  – SDMX 2.0 XML beans
• Ontologies used
  – Simple Knowledge Organisation System (SKOS)
  – Data Cube Vocabulary (DCV)
• Turtle RDF syntax
  – Easy to read for humans and machines

Page 22

SDMX Extraction Tool Overview

• Reads in SDMX structure file
  – Uses SDMX 2.0 beans to parse file
• Disassembles XML to main components
  – Code lists
  – Concepts
  – Key Families
• Builds semantic model with Apache Jena
• Writes to file in Turtle syntax
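
The first two steps of this overview (read the structure file, disassemble it into code lists, concepts and key families) can be sketched as follows. The project itself used Java with SDMX 2.0 XML beans and Apache Jena; this is only a rough stand-in using the Python standard library. The sample structure file, its identifiers and the `disassemble` function are invented for illustration and are not taken from the ABS files.

```python
import xml.etree.ElementTree as ET

# SDMX-ML 2.0 namespaces for the message envelope and structure elements.
NS = {
    "message": "http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message",
    "structure": "http://www.SDMX.org/resources/SDMXML/schemas/v2_0/structure",
}

# Hypothetical, heavily trimmed structure file (illustration only).
SAMPLE = """<message:Structure
    xmlns:message="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message"
    xmlns:structure="http://www.SDMX.org/resources/SDMXML/schemas/v2_0/structure">
  <message:CodeLists>
    <structure:CodeList id="CL_SEX" agencyID="ABS">
      <structure:Name>Sex</structure:Name>
      <structure:Code value="1">
        <structure:Description>Male</structure:Description>
      </structure:Code>
      <structure:Code value="2">
        <structure:Description>Female</structure:Description>
      </structure:Code>
    </structure:CodeList>
  </message:CodeLists>
  <message:Concepts>
    <structure:Concept id="SEX" agencyID="ABS">
      <structure:Name>Sex</structure:Name>
    </structure:Concept>
    <structure:Concept id="OBS_VALUE" agencyID="ABS">
      <structure:Name>Persons</structure:Name>
    </structure:Concept>
  </message:Concepts>
  <message:KeyFamilies>
    <structure:KeyFamily id="CENSUS_DEMO" agencyID="ABS">
      <structure:Name>Census demo</structure:Name>
      <structure:Components>
        <structure:Dimension conceptRef="SEX" codelist="CL_SEX"/>
        <structure:PrimaryMeasure conceptRef="OBS_VALUE"/>
      </structure:Components>
    </structure:KeyFamily>
  </message:KeyFamilies>
</message:Structure>"""


def disassemble(xml_text):
    """Split a structure message into code lists, concepts and key families."""
    root = ET.fromstring(xml_text)

    # Code lists: {code list id: {code value: description}}
    code_lists = {
        cl.get("id"): {
            code.get("value"): code.findtext("structure:Description", namespaces=NS)
            for code in cl.findall("structure:Code", NS)
        }
        for cl in root.findall("message:CodeLists/structure:CodeList", NS)
    }

    # Concepts: {concept id: name}
    concepts = {
        c.get("id"): c.findtext("structure:Name", namespaces=NS)
        for c in root.iterfind(".//structure:Concept", NS)
    }

    # Key families: dimensions (concept, code list) plus the primary measure.
    key_families = {}
    for kf in root.findall("message:KeyFamilies/structure:KeyFamily", NS):
        comps = kf.find("structure:Components", NS)
        key_families[kf.get("id")] = {
            "dimensions": [
                (d.get("conceptRef"), d.get("codelist"))
                for d in comps.findall("structure:Dimension", NS)
            ],
            "measure": comps.find("structure:PrimaryMeasure", NS).get("conceptRef"),
        }
    return code_lists, concepts, key_families


code_lists, concepts, key_families = disassemble(SAMPLE)
```

The three dictionaries returned here correspond to the three "main components" named on the slide; building the semantic model and serialising it to Turtle are separate steps.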

Page 23

Code Lists

• Representation of a classification
  – Can be hierarchical or flat

Page 24

Code Schemes

Code scheme information

Code information

Codes

Code schemes

Generate SKOS concept scheme

Page 25

SKOS Concept Schemes

Unique identifier

Type

Parent category

Label

Classification/concept scheme

Code
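
The labels on this slide (unique identifier, type, parent category, label, classification/concept scheme, code) map naturally onto a URI, `rdf:type`, `skos:broader`, `skos:prefLabel`, `skos:inScheme` and `skos:notation`. The sketch below shows one way a code list could be written out as a SKOS concept scheme in Turtle. It is an illustration only: the base URI, the function name and the sample codes are assumptions, and the project's actual output (produced with Apache Jena) may have differed.

```python
BASE = "http://example.org/code/"  # hypothetical base URI, not an ABS namespace


def escape(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def code_list_to_skos(list_id, name, codes):
    """Render a code list as Turtle.

    codes: iterable of (code, label, parent_code_or_None); a parent code
    makes the list hierarchical, None keeps the entry at the top level.
    """
    scheme = f"<{BASE}{list_id}>"
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"{scheme} a skos:ConceptScheme ;",
        f'    skos:prefLabel "{escape(name)}"@en .',
    ]
    for code, label, parent in codes:
        props = [
            "a skos:Concept",                        # type
            f'skos:notation "{escape(code)}"',       # code
            f'skos:prefLabel "{escape(label)}"@en',  # label
            f"skos:inScheme {scheme}",               # classification/concept scheme
        ]
        if parent is not None:
            # parent category
            props.append(f"skos:broader <{BASE}{list_id}/{parent}>")
        lines.append("")
        # unique identifier = the subject URI
        lines.append(f"<{BASE}{list_id}/{code}> " + " ;\n    ".join(props) + " .")
    return "\n".join(lines) + "\n"


# Illustrative two-level geography; codes and labels are examples only.
print(code_list_to_skos(
    "CL_STATE", "State",
    [("0", "Australia", None), ("1", "New South Wales", "0")],
))
```

A flat code list is simply the case where every parent is `None`, so no `skos:broader` links are emitted.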

Page 26

Concepts & Components

• Links observations to their:
  – Classification
  – Concept

Page 27

Concept Schemes

Concept information

Concepts

Concept Schemes

Page 28

Components

Component information

Components

Key families

Create data structure definition

Page 29

Data Structure Definition

Can only be values of this type

List of codes to use

Concept the dimension is measuring

What the observation is measuring
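
The four annotations on this slide line up with `rdfs:range`, `qb:codeList`, `qb:concept` and `qb:measure` in the Data Cube Vocabulary. A rough sketch of generating a data structure definition from a key family's dimensions and measure is shown below. The `ex:` namespace, the property and code list names, and both functions are assumptions made for illustration.

```python
PREFIXES = """@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/census/> .
"""  # ex: is a hypothetical namespace


def dimension_ttl(name, concept, code_list):
    """One dimension property, annotated to match the slide's labels."""
    return (
        f"ex:{name} a rdf:Property, qb:DimensionProperty ;\n"
        "    rdfs:range skos:Concept ;  # can only be values of this type\n"
        f"    qb:codeList ex:{code_list} ;  # list of codes to use\n"
        f"    qb:concept ex:{concept} .  # concept the dimension is measuring\n"
    )


def dsd_ttl(dsd_id, dimensions, measure):
    """dimensions: list of (property name, concept name, code list name)."""
    parts = [PREFIXES]
    components = []
    for order, (name, concept, code_list) in enumerate(dimensions, start=1):
        parts.append(dimension_ttl(name, concept, code_list))
        components.append(
            f"    qb:component [ qb:dimension ex:{name} ; qb:order {order} ]"
        )
    # The measure property states what the observation is measuring.
    parts.append(f"ex:{measure} a rdf:Property, qb:MeasureProperty .\n")
    components.append(f"    qb:component [ qb:measure ex:{measure} ]")
    parts.append(
        f"ex:{dsd_id} a qb:DataStructureDefinition ;\n"
        + " ;\n".join(components)
        + " .\n"
    )
    return "\n".join(parts)


print(dsd_ttl(
    "dsd-census",
    [("sex", "conceptSex", "cl_sex"), ("age", "conceptAge", "cl_age")],
    "persons",
))
```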

Page 30

The Data - SDMX

• Series key – dimensions being measured
• Attributes – extra metadata about observation
• Obs – the value of the observation (i.e. people counted)
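
To make the three parts concrete, the sketch below reads a small SDMX-ML 2.0 generic-format fragment into plain Python records. It is an assumption that the generic format is representative of what was extracted; the fragment, its concept names and its values are all made up, and the project parsed its files with SDMX 2.0 XML beans in Java rather than with code like this.

```python
import xml.etree.ElementTree as ET

GENERIC = "http://www.SDMX.org/resources/SDMXML/schemas/v2_0/generic"
NS = {"generic": GENERIC}

# Made-up fragment in the SDMX-ML 2.0 generic data format (illustration only).
SAMPLE = f"""<generic:DataSet xmlns:generic="{GENERIC}">
  <generic:Series>
    <generic:SeriesKey>
      <generic:Value concept="REGION" value="1"/>
      <generic:Value concept="SEX" value="2"/>
    </generic:SeriesKey>
    <generic:Attributes>
      <generic:Value concept="UNIT" value="Persons"/>
    </generic:Attributes>
    <generic:Obs>
      <generic:Time>2011</generic:Time>
      <generic:ObsValue value="3500"/>
    </generic:Obs>
  </generic:Series>
</generic:DataSet>"""


def read_series(xml_text):
    """Return one record per observation with its series key and attributes."""
    root = ET.fromstring(xml_text)
    rows = []
    for series in root.findall("generic:Series", NS):
        # Series key: the dimensions being measured.
        key = {
            v.get("concept"): v.get("value")
            for v in series.findall("generic:SeriesKey/generic:Value", NS)
        }
        # Attributes: extra metadata about the observation.
        attributes = {
            v.get("concept"): v.get("value")
            for v in series.findall("generic:Attributes/generic:Value", NS)
        }
        # Obs: the observed value itself.
        for obs in series.findall("generic:Obs", NS):
            rows.append({
                "key": key,
                "attributes": attributes,
                "time": obs.findtext("generic:Time", namespaces=NS),
                "value": float(obs.find("generic:ObsValue", NS).get("value")),
            })
    return rows


rows = read_series(SAMPLE)
```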

Page 31

The Data - DCV

• More condensed – attributes attached to the dataset instead of the observation

Dimensions

Coded values

Observation value

Dataset observation is from
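
A minimal sketch of this more condensed shape: attributes are written once on the `qb:DataSet`, and each `qb:Observation` carries only its coded dimension values, the observed value and a link back to the dataset. All names under the `ex:` prefix, the sample values and both functions are hypothetical. In the Data Cube Vocabulary the attachment level of an attribute can be declared in the data structure definition with `qb:componentAttachment`, which this sketch leaves out.

```python
PREFIXES = (
    "@prefix qb: <http://purl.org/linked-data/cube#> .\n"
    "@prefix ex: <http://example.org/census/> .\n"  # hypothetical namespace
)


def dataset_ttl(dataset, dsd, attributes):
    """Dataset description; attributes are stated here once, not per observation."""
    lines = [f"ex:{dataset} a qb:DataSet ;", f"    qb:structure ex:{dsd}"]
    for prop, text in attributes.items():
        lines[-1] += " ;"
        lines.append(f'    ex:{prop} "{text}"')
    lines[-1] += " ."
    return "\n".join(lines)


def observation_ttl(obs_id, dataset, dimensions, measure, value):
    """One observation: dataset link, coded dimension values, observed value."""
    lines = [
        f"ex:{obs_id} a qb:Observation ;",
        f"    qb:dataSet ex:{dataset} ;",       # dataset the observation is from
    ]
    for prop, code in dimensions.items():
        lines.append(f"    ex:{prop} ex:{code} ;")  # dimension -> coded value
    lines.append(f"    ex:{measure} {value} .")     # observation value
    return "\n".join(lines)


ds = dataset_ttl("census2011", "dsd-census", {"unit": "Persons"})
obs = observation_ttl(
    "obs1", "census2011", {"region": "region-1", "sex": "sex-2"}, "persons", 3500
)
print(PREFIXES + "\n" + ds + "\n\n" + obs)
```

Because the unit is stated on the dataset, it does not need to be repeated on every observation, which is where the saving over the per-series SDMX attributes comes from.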

Page 32

Lessons Learned (1)

• Subject Matter Experts needed
  – What dimensions to use?
  – What attributes to use?
  – What concepts are we measuring?
• Current tools not yet mature
• Full validation of data complex
• Heavy resource usage for large data
  – Unable to process SA2 level data on 32-bit

Page 33

Lessons Learned (2)

• Conversion straightforward
  – Standards very similar
• Promotes reuse
  – Power comes from linking data
• Linked nature makes you think about what you are doing
  – E.g. How close is INSEE activity to ABS labour force status?

Page 34

Semantic Considerations

• How much, how soon, do we aim to harness opportunities for carrying more usable semantics in Data Cube Vocabulary?
  – Expected an external ontology for sex – but most are for Gender

• How close is “close enough” for semantic assertions in Linked Open Data?

• Aim for statistical harmonisation first (eg SDMX Cross Domain Concepts) then explore links to broader ontologies?

• Even data producers are not sure if Age is a common concept across ABS & INSEE (Statistics France).

• Risk of overselling the technical format before semantic payload is sorted?

Page 35

Laying the foundations

• The project confirmed that, in order to deliver more useable semantics in our outputs, on a sustainable basis, we need statistical data and metadata to be defined and managed on a consistent, standards aligned basis across the organisation, including
  – across all statistical subject matter domains (social, economic, environmental)
  – “end to end” (ie spanning design, collection, processing/integration, analysis and dissemination)

• We also need production processes to be automated & sustainable.

• This is one example of why ABS needs to “modernise statistical production” to reflect the changed world in which we operate and to offer new services that address new needs and expectations of users.

• In the 13/14 Budget Papers funding of $2.1 million was provided to develop a second pass business case for a major statistical infrastructure and business process reengineering project.

Page 36

Discussion