development of guidelines for publishing statistical data
Post on 20-Dec-2021
Development of guidelines for publishing statistical data as linked open data
MERGING STATISTICS AND GEOSPATIAL INFORMATION IN MEMBER STATES - POLAND
Mirosław Migacz
INSPIRE Conference 2016
Barcelona, 26 IX 16
Agenda
• project aims,
• introduction to linked open data,
• project timeline,
• project tasks,
• intranet site,
• from ontology to SPARQL endpoint.
Overall objective
Support decision-making processes by providing standardized, usable and open georeferenced statistical data.
What is linked open data?
• Internet – a collection of documents published online, each accessible at a Web location identified by a URL.
• These documents are mainly human-readable and cannot be understood by machines.
• Linked open data – data published in machine-readable formats, described and connected using Uniform Resource Identifiers (URIs), so that people and machines can collect the data and combine it for any purpose the licence permits.
source: https://joinup.ec.europa.eu/community/ods/description (CC 2.0)
Linked open data
• URI – for names
• RDF – to describe data
• SPARQL – to query for data
Uniform Resource Identifier (URI)
To “make a long story short”: an object identified by an internet address
A country, e.g. Belgium
http://publications.europa.eu/resource/authority/country/BEL
A dataset, e.g. Countries Named Authority List
http://publications.europa.eu/resource/authority/country/
In official statistics it can look like this:
http://teryt.stat.gov.pl/32/18/05/3 - gmina Węgorzyno
RDF and SPARQL
Resource Description Framework (RDF) is a framework for representing data and resources on the Web.
RDF breaks every piece of information down into triples:
• Subject – a resource, which may be identified with a URI.
• Predicate – a URI-identified reused specification of the relationship.
• Object – a resource or literal to which the subject is related.
http://example.org/place/Brussels is the capital of “Belgium”
OR
http://example.org/place/Brussels is the capital of http://example.org/place/Belgium
subject predicate object
SPARQL is a standardised language for querying RDF data.
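To make the triple structure concrete, here is a minimal Python sketch (standard library only) that stores the two Brussels statements as subject-predicate-object tuples, serialises them as N-Triples and answers one pattern query by hand. The predicate URI http://example.org/ontology/isCapitalOf and the helper names to_ntriples and match are assumptions made for illustration; a real deployment would use an RDF store and SPARQL rather than this hand-rolled matcher.

```python
# Triples as (subject, predicate, object) tuples. The predicate URI is a
# hypothetical name chosen for this sketch; it does not come from the slides.
BRUSSELS = "http://example.org/place/Brussels"
BELGIUM = "http://example.org/place/Belgium"
IS_CAPITAL_OF = "http://example.org/ontology/isCapitalOf"

triples = [
    (BRUSSELS, IS_CAPITAL_OF, BELGIUM),      # object is a resource (URI)
    (BRUSSELS, IS_CAPITAL_OF, '"Belgium"'),  # object is a literal
]


def to_ntriples(triple):
    """Serialise one triple as an N-Triples line (URIs in <>, literals as-is)."""
    def term(value):
        return value if value.startswith('"') else f"<{value}>"

    s, p, o = triple
    return f"{term(s)} {term(p)} {term(o)} ."


def match(store, s=None, p=None, o=None):
    """Return triples matching a pattern; None plays the role of a variable."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# The corresponding SPARQL query, shown for comparison only (not executed here).
SPARQL_QUERY = """
SELECT ?capital WHERE {
  ?capital <http://example.org/ontology/isCapitalOf> <http://example.org/place/Belgium> .
}
"""

capitals = [s for s, _, _ in match(triples, p=IS_CAPITAL_OF, o=BELGIUM)]
lines = [to_ntriples(t) for t in triples]
```

The pattern with a free subject corresponds to the ?capital variable in the SPARQL text: both ask "which resources are the capital of Belgium?".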
Five stars of linked open data
• 1 star: Make your stuff available on the Web (whatever format) under an open license.
• 2 stars: Make it available as structured data (e.g., Excel instead of image scan of a table).
• 3 stars: Use non-proprietary formats (e.g., CSV instead of Excel).
• 4 stars: Use URIs to denote things, so that people can point at your stuff.
• 5 stars: Link your data to other data to provide context.
Specific objectives
• identification of statistical units for which data can be published, with harmonization of their geometries for respective years
• building standardized URIs for statistical units
• identification and analysis of potential data sources
• plan for transformation of existing data sources into open formats
• creation of RDF metadata for data sources
• feasibility analysis for publishing linked open data
Stage I – until 4/10/2016
• identification of statistical units for which data can be published, with harmonization of their geometries for respective years
• building standardized URIs for statistical units
• identification and analysis of potential data sources: analysis of “openness” and georeferencing, verifying the need for geocoding
5 GUS-PK
2 GUS-DI
1 GUS-AZ
3 US Poznań
2 US Olsztyn
1 US Wrocław / OBDL J. Góra
Stage II – until 7/10/2017
• plan for transformation of existing data sources into open formats
• creation of RDF metadata for data sources
• feasibility analysis for publishing linked open data (building a SPARQL endpoint)
5 GUS-PK
1 GUS-AZ
3 US Poznań
2 US Olsztyn
1 US Wrocław / OBDL J. Góra
Identification of data sources
• Three major databases:
• Local Data Bank
• biggest set of statistical information available for a wide range of years
• updated monthly
• Demography Database
• integrated data source for state and structure of population, vital statistics and migrations
• Development monitoring system STRATEG
• a system for facilitating and monitoring the development policy
• key measures to monitor execution of strategies at local, regional, transregional and EU level.
Identification of data sources
• Other data sources:
• publications
• tables
• communiques
• announcements
• articles
Identification of data sources
• Metadata:
• thematic category,
• format (PDF, DOC, XLS, CSV),
• spatial reference (country, NUTS, LAU, functional areas, urban areas),
• temporal reference (years),
• presence of identifiers (TERYT, NTS, NUTS),
• update cycle.
Preliminary analysis of data sources
• Key aspects:
• openness
• redundancy of information
• popularity (based on view and download statistics)
• Inclusion / exclusion of the data source
Statistical units harmonization
• Basis:
• NTS (Nomenclature of Territorial Units for Statistical Purposes)
Name        NTS level  NUTS/LAU  Identifier
Region      1          NUTS 1    1.6
Voivodship  2          NUTS 2    2.6.22
Subregion   3          NUTS 3    3.6.22.40
Powiat      4          LAU 1     4.6.22.40.11
Gmina       5          LAU 2     5.6.22.40.11.01.1
Statistical units harmonization
• Input data:
• administrative boundaries since 2002 for LAU 2 (gmina), excluding 2007
• Harmonization process:
• structure standardization
• standardization of identifiers (creating NTS identifiers)
• aggregation to higher level units (LAU 1 -> NUTS 1)
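The aggregation step relies on knowing which higher-level unit each gmina belongs to. The Python sketch below assumes, based only on the identifier examples in the table above, that a higher-level NTS identifier is a prefix of the lower-level one with the leading level digit replaced. The helpers parent_id and group_by_parent are hypothetical, the second gmina identifier is invented, and merging the actual geometries would additionally need a GIS library, which is not shown.

```python
from collections import defaultdict

# Number of dot-separated components per NTS level, read off the identifier
# examples (level 4 "4.6.22.40.11" has 5 components). This prefix rule is an
# assumption inferred from those examples, not an official specification.
COMPONENTS = {1: 2, 2: 3, 3: 4, 4: 5, 5: 7}


def parent_id(nts_id: str, level: int) -> str:
    """Derive the identifier of the enclosing unit at a higher NTS level."""
    parts = nts_id.split(".")
    current = int(parts[0])
    if not 1 <= level <= current:
        raise ValueError(f"cannot derive level {level} from level {current}")
    return ".".join([str(level)] + parts[1:COMPONENTS[level]])


def group_by_parent(nts_ids, level: int) -> dict:
    """Group unit identifiers by their enclosing unit at the given level."""
    groups = defaultdict(list)
    for nts_id in nts_ids:
        groups[parent_id(nts_id, level)].append(nts_id)
    return dict(groups)


# The first identifier comes from the table; the second is invented.
gminas = ["5.6.22.40.11.01.1", "5.6.22.40.11.02.2"]
by_powiat = group_by_parent(gminas, level=4)
by_region = group_by_parent(gminas, level=1)
```

Each group lists the gminas whose boundaries would be merged to obtain the geometry of the enclosing unit.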
Statistical units harmonization
• Non-standard statistical units:
• functional areas
• urban areas
• Groups of NTS units
• Derived mostly from strategic documents
• Changes of geometries over time still to be determined
Statistical units URIs
• NTS as basic classification
Name        NTS level  NUTS/LAU  Identifier         URI (http://nts.stat.gov.pl/...)
Region      1          NUTS 1    1.6                …1/6
Voivodship  2          NUTS 2    2.6.22             …2/6/22
Subregion   3          NUTS 3    3.6.22.40          …3/6/22/40
Powiat      4          LAU 1     4.6.22.40.11       …4/6/22/40/11
Gmina       5          LAU 2     5.6.22.40.11.01.1  …5/6/22/40/11/01/1
http://nts.stat.gov.pl/5/6/22/40/11/01/1
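The mapping from identifier to URI is mechanical: the dots of the NTS identifier become path separators under the base address. A minimal Python sketch of that rule follows; the function names nts_uri and nts_identifier are hypothetical, and the validation (digits only) is an assumption based on the examples shown here.

```python
BASE_URI = "http://nts.stat.gov.pl/"


def nts_uri(identifier: str) -> str:
    """Turn a dotted NTS identifier into a slash-separated URI."""
    parts = identifier.split(".")
    # Assumption: every component is numeric, as in the examples on the slide.
    if not all(part.isdigit() for part in parts):
        raise ValueError(f"unexpected NTS identifier: {identifier!r}")
    return BASE_URI + "/".join(parts)


def nts_identifier(uri: str) -> str:
    """Inverse mapping: recover the dotted identifier from a URI."""
    if not uri.startswith(BASE_URI):
        raise ValueError(f"not an NTS URI: {uri!r}")
    return uri[len(BASE_URI):].strip("/").replace("/", ".")


gmina_uri = nts_uri("5.6.22.40.11.01.1")
round_trip = nts_identifier(gmina_uri)
```

Because the mapping is reversible, the URI alone is enough to recover the unit's level and its position in the hierarchy.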
Data transformation plan
• From ontology to SPARQL endpoint
• Decide what will be published as Open Data
• three major databases
• other data sources
• Create ontology
• Map to existing databases
• Export to RDF data store
• Publish on linked data server
• Workflow tested on STRATEG database
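The "map to existing databases" and "export to RDF" steps can be sketched in a few lines of Python. This is not how the project's tooling works internally; it is a hand-written stand-in that reads rows from an in-memory SQLite table and emits N-Triples. The table layout, the http://example.org/... vocabulary and observation URIs, and the indicator values are all hypothetical.

```python
import sqlite3

# Hypothetical source table standing in for one of the statistical databases.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE indicator (unit_id TEXT, year INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO indicator VALUES (?, ?, ?)",
    [("2.6.22", 2015, 10.5), ("2.6.22", 2016, 11.0)],  # invented values
)

UNIT_BASE = "http://nts.stat.gov.pl/"         # URI pattern used for NTS units
VOCAB = "http://example.org/vocab/"           # hypothetical vocabulary
OBS_BASE = "http://example.org/observation/"  # hypothetical observation URIs
XSD = "http://www.w3.org/2001/XMLSchema#"


def row_to_ntriples(unit_id, year, value):
    """Map one database row to three N-Triples statements."""
    obs = f"<{OBS_BASE}{unit_id}/{year}>"
    unit = f"<{UNIT_BASE}{unit_id.replace('.', '/')}>"
    return [
        f"{obs} <{VOCAB}refArea> {unit} .",
        f'{obs} <{VOCAB}refPeriod> "{year}"^^<{XSD}gYear> .',
        f'{obs} <{VOCAB}value> "{value}"^^<{XSD}double> .',
    ]


rows = conn.execute("SELECT unit_id, year, value FROM indicator ORDER BY year")
ntriples = [line for row in rows for line in row_to_ntriples(*row)]
```

Writing the resulting lines to a file gives a dump that a triple store can load, which corresponds to the export step in the list above.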
Ontology - methods and tools
• Ontop – a platform for querying databases as virtual RDF graphs using SPARQL
• SPARQL 1.0 support
• Provides an interface for ontology development
• Intuitive and powerful mapping language
• Supports free and commercial DBMS
• SPARQL endpoint
SPARQL endpoint tools for the web
• Apache Jena Fuseki
• Fuseki is a SPARQL server that accepts SPARQL queries over HTTP in a REST style.
• RDF data generated by Ontop is imported into Apache Jena
• Pubby
• A Linked Data Frontend for SPARQL Endpoints
• Pubby makes it easy to turn a SPARQL endpoint into a Linked Data server. It is implemented as a Java web application.
• Serves data at the given linked data URI
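Querying such an endpoint from a client amounts to an HTTP request carrying the SPARQL text. The Python sketch below builds (but does not send) a GET request following the SPARQL Protocol; the endpoint address http://localhost:3030/stat/sparql, including the dataset name "stat", is an assumption about a local test installation and not the project's actual server.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed local test endpoint; host, port and dataset name are hypothetical.
ENDPOINT = "http://localhost:3030/stat/sparql"


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL Protocol GET request without sending it."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
request = build_sparql_request(ENDPOINT, query)

# To run it against a live server:
#   import json
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       results = json.load(response)
```

Keeping request construction separate from sending makes the sketch usable without a running server; the commented lines show where the actual call would go.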
Further work
• Consulting on the designed workflow during a study visit at the Madrid University of Technology
• Setting up an internal test linked data server to implement web tools
• Creating ontologies and workflows for databases and other data sources
Summary – results so far
• Harmonized geometries for statistical units
• Identified data sources with comprehensive metadata
• Preliminary data transformation plan with tools tested
Poland’s data opening strategy
• launched this year
• aimed at opening data resources of government institutions in line with the five-star linked open data model
• the grant results (guidelines) are in line with the strategy
• increased probability of acquiring financing for a fully fledged implementation
INSPIRE Thematic Clusters
https://themes.jrc.ec.europa.eu – collaboration platform
Statistical Cluster:
statistical units
population distribution (demography)
human health and safety
Informal meeting of Cluster members during the coffee break (15:30-16:00)