Proposed Linked Data Migration Framework for Singapore Government Datasets
DESCRIPTION
Critical Inquiry Presentation on 'Designing a Linked Data Migrational Framework for Singapore Government Datasets'
TRANSCRIPT
Designing a Linked Data Migrational Framework for Singapore Government Data Sets
• Sesagiri Raamkumar Aravind • Thangavelu Muthu Kumaar • Kaleeswaran Sudarsan
MSc (KM) Critical Inquiry in Knowledge Management
AGENDA
• Basics of Linked Data
• data.gov.sg
• Purpose of this project
• Migrational Framework
• Eight Steps
• Use Cases
• Conclusion
Governments
Enterprises
Libraries & Museums
Social Media Data (Blogs, Facebook)
Business
Entertainment
OPPORTUNITY OF LINKING DATA ACROSS VARIOUS DOMAINS AND TYPES
Types of Data
• Factual Data
• Transactional Data
• Textual Data
• Spatial Data
• Multimedia
• Files & Database
Mr. Lee Kuan Yew: an exploration
Mr. Brendan Luyt's associated publication search
(Traditional Approach) (Linked Data Approach)
Others...
Linked Open Data Cloud (Web of Data)
IDA Singapore launched the Data.gov.sg portal and mGov@SG public services in June 2011
Data.gov.sg provides 5000+ public data sets from 50 government agencies
Purpose: building applications and conducting research using the data
Data.Gov.Sg
ABC Water Proj (R)
Agency Websites
Singstat publications
MINISTRIES
XLS
HTML
Accountant-General's Department
Accounting and Corporate Regulatory Authority
Agency For Science, Technology & Research
Attorney-General's Chambers
Building & Construction Authority
Central Narcotics Bureau
Central Provident Fund Board
Civil Aviation Authority of Singapore
Department of Statistics
Economic Development Board
Energy Market Authority
Health Sciences Authority
Housing & Development Board
Immigration & Checkpoints Authority
Infocomm Development Authority of Singapore
Inland Revenue Authority of Singapore
Institute of Technical Education
Intellectual Property Office of Singapore
JTC Corporation
Judiciary, Subordinate Courts
Judiciary, Supreme Court
Land Transport Authority
Majlis Ugama Islam Singapura
Maritime & Port Authority of Singapore
Monetary Authority of Singapore
Nanyang Polytechnic
National Environment Agency
National Heritage Board
National Library Board
National Parks Board
Ngee Ann Polytechnic
People's Association
Public Service Division
Public Transport Council
Public Utilities Board
Republic Polytechnic
Sentosa Development Corporation
Singapore Civil Defence Force
Singapore Customs
Singapore Land Authority
Singapore Police Force
Singapore Polytechnic
Singapore Sports Council
Singapore Workforce Development Agency
Spring Singapore
Temasek Polytechnic
Urban Redevelopment Authority
Ministry of Community Development, Youth & Sports
Ministry of Education
Ministry of Foreign Affairs
Ministry of Health
Ministry of Law –Community Mediation Unit
Ministry of Manpower
Ministry of Transport
Media Development Authority
C - Community
Cul - Culture
E - Environment
Emp - Employment
Edu - Education
H - Health
F - Family
R - Recreation
S - Sports
BFA Buildings (C)
Green Building (E)
Breast Screen (H)
Cervical Screen (H)
Healthier Dining (H)
Quit Centers (H)
Infocomm Access (C)
Silver Infocomm (C)
Wireless Hotspots (R)
Child Care (F)
Disability (F)
Elder Care (F)
Family (F)
Family Friendly Estab (F)
Student Care (F)
Comm Mediation Center (C)
After Death Facilities (E)
Funeral Parlours (E)
Dengue Cluster (H)
Hawker Center (E)
NEA Offices (E)
Recycling Bins (E)
Waste Disposal Site (E)
Waste Treatment (E)
Heritage Sites (Cul)
Monuments (Cul)
Museums (Cul)
Libraries (Cul)
Streets and Places (Cul)
CD Councils (C)
Community Clubs (C)
Constituency Offices (C)
Other Facilities (C)
Other Pan networks (C)
PA Headquarters (C)
Residents Committee (C)
Water Venture (C)
National Parks (R)
Skyrise Greenery (E)
Sports Clubs (S)
CET Centers (Emp)
WDA Service Points (Emp)
Kindergartens (Edu)
Get Token
Address Search
Agency Data Search
Static Map
Get Layer Info
Mashup
Get Related Data
Get Directions
Public Transportation
Reverse Geocode
Map-related APIs from various agencies
Traffic-related APIs from Land Transport Authority
Tourism-related APIs from the Singapore Tourism Board
Environment-related APIs from the National Environment Agency
Library-related data feeds & web services from National Library Board
[Figure: SG Government Data Eco System (DGS Eco System). Labels: SG DATA, TEXTUAL, SPATIAL, API, THEMES, OPERATIONS, CATEGORIES, UNSTRUCTURED DATA, STRUCTURED DATA, STATUTORY BOARDS]
Drawbacks of Existing Data Ecosystem
•Siloed architecture
•Absence of vocabulary standardization (common language)
•Multiple data consumption end points
•Steep learning curve for developers during application development process
•Absence of interlinking between data sets
Linked Data addresses the drawbacks identified above at multiple levels:
Data Storage - Can support distributed storage
Data Representation - Common format (RDF) for both data and metadata
Data Consumption - Via a single access point (SPARQL endpoint)
Data Interlinking - Use of ontologies (vocabularies)
IDA can layer Linked Data on top of its traditional systems instead of going for a complete overhaul
http://wheredoesmymoneygo.org/bubbletree-map.html#/~/grand-total--2010-
http://www.sgdi.gov.sg/
http://labs.data.gov.uk/gov-structure/departments/
UK Linked Data Implementation
RDF
Subject - Predicate - Object
Jurong belongs to the West Zone
Linked Data Representation Format
http://data.gov.sg/resource/area/Jurong_West
http://data.gov.sg/ontology/property/has_zone
http://data.gov.sg/resource/zone/West
Subject
Predicate
Object
http://www.w3.org/2003/01/geo/wgs84_pos#lat
http://www.w3.org/2003/01/geo/wgs84_pos#long
12.55550.21222
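As a minimal sketch of the subject-predicate-object format above, the Jurong West statement can be written out as one N-Triples line. The `Triple` type and `to_ntriples` helper are illustrative names introduced here, not part of the presentation, and the helper only handles triples whose three parts are all URIs.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement; all three parts are URIs in this sketch."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(triple: Triple) -> str:
    """Serialize a URI-only triple as a single N-Triples line."""
    return f"<{triple.subject}> <{triple.predicate}> <{triple.obj}> ."


# The URIs are the ones shown on the slide.
jurong = Triple(
    subject="http://data.gov.sg/resource/area/Jurong_West",
    predicate="http://data.gov.sg/ontology/property/has_zone",
    obj="http://data.gov.sg/resource/zone/West",
)

print(to_ntriples(jurong))
```

Literal values such as latitude and longitude would need a quoted, typed form instead of angle brackets, which this sketch leaves out.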
Why are we doing this project?
To prescribe a migrational framework for linked data for data.gov.sg (DGS) data sets
First hand view of the required migration activities
Issues anticipated at each step
Evaluation & Recommendation on Linked Data tools
To help IDA in understanding the benefits of Linked Data
Framework Formulation Process
• Based on a study of Linked Data migration research papers and cookbooks published by the World Wide Web Consortium (W3C)
• Analysis of Linked Data implementations in the UK, US and Brazil
• Evaluation of Linked Data tools with Singapore data sets, to recommend tools for each step of the framework
• Anticipating probable issues that could be faced during implementation
Datasets Used for Framework Evaluation
• URA Sites for Sales dataset (Urban Planning)
• DOS Population and Household Characteristics dataset (Population Demographics)
  - Age Pyramid of Resident Population
  - Old Age Support Ratio
Proposed Linked Data Migrational Framework for DGS
Specification Identification Analysis
Object Modeling
Ontology Modeling
URI Naming
RDF Creation
External Linking
Datasets Publication
Discovery & Exploitation
Re-use Create
S2R D2R A2R
Govt Agencies and IDA
Govt Agencies Domain Matter Experts
Ontology Modelers
IDA and Web Architects
Developers
Developers and Domain Experts
Developers
Web Architects
Objectives
Specifications
Project Duration
Dataset Prioritization
Dataset License Setting
Implementation Mode Selection
Roadmap
Architecture
Overview
Relational Model
Dataset Overview
Drawing Objects in Whiteboard
Conceptual View
Conceptual View
Public Vocabularies
Re-use of Existing Vocabularies
Creation of New Vocabularies
OWL, RDFS, RDF Vocabulary files
Resources Class and Properties
Visualization of URI mining process
URI Administration
URI Lifecycle
ER Model
Spreadsheets, DBMS, API
Conversion to RDF triples using Mapping files
RDF Triples
Government and external data sets
Linking based on Similarity Algorithms
Outbound Links
RDF Triples
Ontologies
SPARQL, API
Data Insertion
VoID Modeling
Data Retrieval
API to SPARQL conversion
VoID Triples
JSON data
Actual Data
Existing Apps
Gamification
Crowdsourcing
Catalog Registration
External Reference
New Apps
PROCESS
Resource Allocation (%), in the order extracted from the diagram: 10, 15, 15, 5, 20, 5, 15, 15
Step 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8
Specification Identification
Addressing security concerns with licenses.
• The Open Database License (ODbL)
• Open Data Commons Attribution License
• The Creative Commons Licenses
Linked Data only (just URI linking)
• 1st level
• Ideal for testing the URI lifecycle
• Decision on URI Administration: Centralized (DGS) vs. Decentralized (Agency)
Linked Data + RDF
• 2nd level
• Complete realization of Linked Data and Semantic Web standards
• Decision to use this mode can be taken after evaluation of POC
Linked Data for files only (URIs for files)
• Optional
• To improve the discovery of files in DGS through semantic annotation
Analysis
• Understand data.gov.sg database specifications (relational model & ER model)
• Seven issues identified at data storage and consumption level
Object Modeling
This is modeling without usage context.
* Requires normalization of the database model to third normal form (3NF)
Issues
• Possibility of applying high abstraction and high granularity to objects
Key Learning
• Ease in identifying the use of common objects across data sets
• Facilitates brainstorming of relationships between objects
Ontology Modeling
Takes the output conceptual diagram from Object Modeling as input.
Key Impetus
• Re-use of popular vocabularies (table below)
• Use of StdTrip methodology for arriving at ontologies for relational databases
Predicate/Vocabularies | Purpose
rdfs:label and skos:prefLabel | Naming things
Geonames | Model spatial data
VoID Description | Describe RDF schema or vocabulary
vCard | Describing address
RDF, RDFS | Model simple data
Use Case Problem Statement
Consider an industrial entrepreneur intending to buy a site from Urban Redevelopment Authority (URA)
Issues
• Conflicting vocabulary in data.gov.sg and OneMap
• Different levels of granularity in datasets (ex: Location in URA ‘Site for Sales’ dataset)
Ontology Modeling
Date fields, location fields and fields related to measurements in DGS have scope for vocabulary re-use
Vocabulary for the identified data sets (developed using Protege) with screenshots
List of vocabularies required for LOGD implementation
List of tools used for ontology modeling
OUTPUT • ALLOCATION PERCENTAGE • PERSONNEL INVOLVED
URI Naming
Assigning a Uniform Resource Identifier (URI) to every resource is analogous to assigning an IP address to every computer
Identified URI Administration Modes
1.) Maintained centrally in the DGS platform (resultant URIs will start with http://data.gov.sg/) - RECOMMENDED
2.) Maintained by individual agencies (resultant URIs will start with http://ura.gov.sg or http://sla.gov.sg)
3.) Maintained externally by third party platforms such as Kasabi (resultant URIs will start with http://data.kasabi.org)
TBox (ontology terms) | ABox (instances)
http://data.gov.sg/ontology/Ministry/ | http://data.gov.sg/ministry/MOH
http://data.gov.sg/ontology/Agency/ | http://data.gov.sg/agency/SLA
http://data.gov.sg/ontology/SiteLocation | http://data.gov.sg/location/pioneer_road_north
http://data.gov.sg/ontology/Race | http://data.gov.sg/race/chinese
Dataset ID | URAstaticfile001
Dataset | http://data.gov.sg/dataset/URAstaticfile001/
Class | http://data.gov.sg/terms/class/URAstaticfile001/sitesforsale
Property | http://data.gov.sg/terms/property/URAstaticfile001/time
Row 1 | http://data.gov.sg/dataset/URAstaticfile001/1
Row 1 - A generic column | http://data.gov.sg/dataset/URAstaticfile001/1/columnName
Dataset URIs
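The dataset URI pattern above is regular enough to generate. The following is a minimal sketch, assuming the centrally administered `http://data.gov.sg` base; the function names are hypothetical helpers introduced for illustration, and percent-encoding of name segments is an added assumption, not something the slides specify.

```python
from urllib.parse import quote

BASE = "http://data.gov.sg"  # centrally administered base, as recommended above


def _segment(value) -> str:
    """Percent-encode one path segment (an assumption, not from the slides)."""
    return quote(str(value), safe="")


def dataset_uri(dataset_id: str) -> str:
    return f"{BASE}/dataset/{_segment(dataset_id)}/"


def class_uri(dataset_id: str, class_name: str) -> str:
    return f"{BASE}/terms/class/{_segment(dataset_id)}/{_segment(class_name)}"


def property_uri(dataset_id: str, property_name: str) -> str:
    return f"{BASE}/terms/property/{_segment(dataset_id)}/{_segment(property_name)}"


def row_uri(dataset_id: str, row_number: int) -> str:
    return f"{BASE}/dataset/{_segment(dataset_id)}/{row_number}"


def cell_uri(dataset_id: str, row_number: int, column_name: str) -> str:
    return f"{row_uri(dataset_id, row_number)}/{_segment(column_name)}"


print(dataset_uri("URAstaticfile001"))
print(class_uri("URAstaticfile001", "sitesforsale"))
print(cell_uri("URAstaticfile001", 1, "columnName"))
```

Keeping URI construction in one place like this is one way to reduce the risk that different tools mint inconsistent URIs for the same resource.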
Issues
• Usage of different Linked Data tools can hamper URI naming
• Possibility of dead links
RDF Creation
Evaluated three tools, one for each mode of conversion: Google Refine, RDF Views and RDF Sponger
Issues
• Absence of notification about API outages can cause the system to return null or invalid results
• Google Refine doesn't create URIs for each row in the static file
• Changes to data.gov.sg tables or API output made without corresponding changes to the mapping files will affect RDF conversion
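To illustrate mapping-driven conversion and the last issue above, here is a minimal sketch in plain Python rather than any of the evaluated tools. The `MAPPING` dictionary stands in for a mapping file, and the column names and sample row are hypothetical. The converter mints a URI per row and fails loudly when a column has no mapping, instead of silently producing incomplete RDF.

```python
import csv
import io

BASE = "http://data.gov.sg"
DATASET_ID = "URAstaticfile001"

# Hypothetical mapping file content: column header -> property URI.
MAPPING = {
    "location": f"{BASE}/terms/property/{DATASET_ID}/location",
    "time": f"{BASE}/terms/property/{DATASET_ID}/time",
}


def escape_literal(value: str) -> str:
    """Escape characters that are not allowed raw inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(csv_text: str, mapping: dict) -> list:
    """Convert each CSV row into N-Triples lines, one per mapped column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    unmapped = sorted(set(columns) - set(mapping))
    if unmapped:
        # The source changed without a matching change to the mapping.
        raise ValueError(f"columns missing from mapping: {unmapped}")
    lines = []
    for row_number, row in enumerate(reader, start=1):
        subject = f"{BASE}/dataset/{DATASET_ID}/{row_number}"
        for column in columns:
            literal = escape_literal(row[column])
            lines.append(f'<{subject}> <{mapping[column]}> "{literal}" .')
    return lines


sample = "location,time\nPioneer Road North,2011\n"  # hypothetical data
for line in rows_to_ntriples(sample, MAPPING):
    print(line)
```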
External Linking
External Linking is connecting with other data sets in the web of data
Data.gov.sg
World Bank, CIA World Factbook, DBpedia, FAO, Geonames, Supreme Court, Flickr
<http://data.gov.sg/location/bugis> owl:sameAs <http://dbpedia.org/resource/Bugis>
<http://data.gov.sg/race/malay> owl:sameAs <http://dbpedia.org/resource/Malay_race>
Issues
• Outbound links to data sets outside IDA's purview can be risky
• Dead links are a real possibility when resource URIs change or systems go down
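The framework mentions linking based on similarity algorithms. As a rough sketch of the idea, and not of how SILK or LIMES work internally, the code below compares the last path segment of each local URI with candidate external URIs using Python's `difflib` and proposes an `owl:sameAs` link when the score clears a threshold. The threshold value and the candidate list are assumptions made for illustration.

```python
from difflib import SequenceMatcher

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def local_name(uri: str) -> str:
    """Return the last path segment of a URI, e.g. 'bugis'."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def similarity(uri_a: str, uri_b: str) -> float:
    """Score two URIs by the string similarity of their local names."""
    a = local_name(uri_a).replace("_", " ").lower()
    b = local_name(uri_b).replace("_", " ").lower()
    return SequenceMatcher(None, a, b).ratio()


def propose_links(local_uris, external_uris, threshold=0.9):
    """Propose (local, owl:sameAs, external) triples for close matches."""
    links = []
    for local in local_uris:
        best = max(external_uris, key=lambda ext: similarity(local, ext), default=None)
        if best is not None and similarity(local, best) >= threshold:
            links.append((local, OWL_SAME_AS, best))
    return links


# Illustrative candidates only; a real run would draw on the external dataset.
candidates = [
    "http://dbpedia.org/resource/Bugis",
    "http://dbpedia.org/resource/Jurong",
]
print(propose_links(["http://data.gov.sg/location/bugis"], candidates))
```

A name-only match is a weak signal, so proposed links would normally be reviewed by a domain expert before publication.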
Datasets Publication
Linked Data API call
http://data.gov.sg/lda/childcare/north
SPARQL Query
SELECT ?cc
WHERE {
  ?cc dgs:haszone dgs:north .
  ?cc dgs:facilitytype dgs:childcare .
}
LIMIT 100
Triple Store
LDA-SPARQL mapping file
Conversion from RDF to JSON
RDF Triples
http://data.gov.sg/facility/cc/name1
http://data.gov.sg/facility/cc/name2
http://data.gov.sg/facility/cc/name3
...
http://data.gov.sg/facility/cc/name100
JSON Output
Entry: name1
Entry: name2
...
Entry: name100
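The API-to-SPARQL conversion in this diagram can be sketched as two small functions: one turns a path such as `childcare/north` into the query shown above, the other turns the resulting URIs into JSON entries. This is an illustration of the idea rather than the Linked Data API implementation. The namespace bound to `dgs:` is an assumption, since the slides do not state it, and the function names are hypothetical.

```python
import json
import re

# Assumed namespace for the dgs: prefix; the slides do not give one.
DGS_NAMESPACE = "http://data.gov.sg/ontology/"

# Only simple names are accepted, so a path cannot inject SPARQL syntax.
SAFE_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def lda_path_to_sparql(path: str, limit: int = 100) -> str:
    """Translate '<facility type>/<zone>' into the SPARQL query from the slide."""
    facility_type, zone = path.strip("/").split("/")
    for part in (facility_type, zone):
        if not SAFE_NAME.match(part):
            raise ValueError(f"unsupported path segment: {part!r}")
    return (
        f"PREFIX dgs: <{DGS_NAMESPACE}>\n"
        "SELECT ?cc\n"
        "WHERE {\n"
        f"  ?cc dgs:haszone dgs:{zone} .\n"
        f"  ?cc dgs:facilitytype dgs:{facility_type} .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def uris_to_json(uris) -> str:
    """Render result URIs as the 'Entry: name' JSON output shown in the diagram."""
    entries = [{"Entry": uri.rstrip("/").rsplit("/", 1)[-1]} for uri in uris]
    return json.dumps(entries)


print(lda_path_to_sparql("childcare/north"))
print(uris_to_json(["http://data.gov.sg/facility/cc/name1"]))
```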
Issues
• Difficulty for application developers - SPARQL 1.0 does not support sub-queries, views, stored procedures, etc. (sub-queries were added in SPARQL 1.1)
• Inferencing is not possible with Linked Data API
• Security implementation with 3rd party Linked Data hosting platforms
Triple Store Metadata Publication
Linked Data API
Linked Data Hosting
Datasets Publication
Recommendations
• Linked Data hosting platforms are best suited for open-license datasets (ex: Singstat publications)
• Use of APIs for updating RDF triples instead of SPARQL Update documents
• Use of VoID generators for creating statistics triples
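As a sketch of what a VoID generator produces, the function below counts triples, distinct subjects, properties and distinct objects in a set of triples and emits them as VoID statistics in N-Triples. The dataset URI and sample triples are hypothetical; the `void:` property names are the ones defined by the VoID vocabulary.

```python
VOID = "http://rdfs.org/ns/void#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def void_statistics(dataset_uri: str, triples) -> list:
    """Describe a dataset with basic VoID statistics, as N-Triples lines."""
    triples = list(triples)
    stats = {
        "triples": len(triples),
        "distinctSubjects": len({s for s, _, _ in triples}),
        "properties": len({p for _, p, _ in triples}),
        "distinctObjects": len({o for _, _, o in triples}),
    }
    lines = [f"<{dataset_uri}> <{RDF_TYPE}> <{VOID}Dataset> ."]
    for name, count in stats.items():
        lines.append(f'<{dataset_uri}> <{VOID}{name}> "{count}"^^<{XSD_INTEGER}> .')
    return lines


# Hypothetical sample data following the child care example above.
ONT = "http://data.gov.sg/ontology/"
sample_triples = [
    ("http://data.gov.sg/facility/cc/name1", ONT + "haszone", ONT + "north"),
    ("http://data.gov.sg/facility/cc/name1", ONT + "facilitytype", ONT + "childcare"),
    ("http://data.gov.sg/facility/cc/name2", ONT + "haszone", ONT + "north"),
    ("http://data.gov.sg/facility/cc/name2", ONT + "facilitytype", ONT + "childcare"),
]

for line in void_statistics("http://data.gov.sg/dataset/childcare", sample_triples):
    print(line)
```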
Discovery & Exploitation
Key Theme
1.) Internal discovery within Singapore for local citizens
2.) External discovery for attracting usage of Singapore government data in international economic & political research and global issues (water scarcity, carbon footprint, etc.)
• Internal discovery can be improved by offering different end points (SPARQL, API, Apps, RDF dumps), running awareness programs on the availability of these data sets, and employing crowdsourcing and gamification techniques to enhance their visibility and utility
• External discovery is optional if IDA wishes to limit the DGS system to a Singapore-only purview. It can be initiated by registering the datasets in open government dataset portals (potential candidates are datasets with an open license)
Original data provided by URA
Possible because of the re-use of the common resource URI Pasir Ris across data sets
Similarly, location-based data from the OneMap API is retrieved for Pasir Ris
Interlinked Datasets Post-Migration
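The interlinking shown here depends on both data sets pointing at the same resource URI. A minimal sketch of that join follows; the Pasir Ris URI, the property names and the sample triples are all hypothetical, shaped after the URI patterns used earlier in the slides.

```python
# Hypothetical shared resource URI, following the location pattern used earlier.
PASIR_RIS = "http://data.gov.sg/location/pasir_ris"

# Hypothetical triples from two independently published data sets.
ura_triples = [
    ("http://data.gov.sg/dataset/URAstaticfile001/1",
     "http://data.gov.sg/terms/property/URAstaticfile001/location",
     PASIR_RIS),
]
onemap_triples = [
    ("http://data.gov.sg/facility/cc/name1",
     "http://data.gov.sg/ontology/property/has_location",
     PASIR_RIS),
    ("http://data.gov.sg/facility/cc/name2",
     "http://data.gov.sg/ontology/property/has_location",
     "http://data.gov.sg/location/bugis"),
]


def resources_about(uri: str, datasets: dict) -> dict:
    """For each data set, list the subjects whose object is the shared URI."""
    found = {}
    for name, triples in datasets.items():
        subjects = [s for s, _, o in triples if o == uri]
        if subjects:
            found[name] = subjects
    return found


print(resources_about(PASIR_RIS, {"URA": ura_triples, "OneMap": onemap_triples}))
```

No join key has to be negotiated between the agencies: re-using one URI for Pasir Ris is what makes the lookup work across both sources.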
Other Interesting Use Cases
Definitely not Science Fiction!
Q & A Engine that works on top of government linked data. Inspired by www.trueknowledge.com
Sense-Making
Question: Which recent year had a growth rate close to 50% for the majority of Singapore-based SMEs?
Step 1: Spot the resources in this query
DBpedia Spotlight does just that! (Semantic Information Extraction)
Which recent year had a growth rate close to 50% for majority of Singapore based SME
Step 2: Identify the relationship between the resources
SME is an instance of the Organization class
Organization class comes under Singapore country
Growth rate is a property of the Sales class
Year is a class by itself
Majority is a subset of the Group class
Step 3: Use NLP technique - Syntactic Analysis (Stanford Parser) followed by Focus Extraction for understanding the question
Syntactic parse tree is generated, followed by Access Pattern
Step 4: Look for RDF triples that meet the criteria
2010 is returned as the result!
Summary
Four in-person discussion sessions with IDA, NIIT and SLA
Analysis of Five data.gov.sg system specifications
Evaluation of Four existing Migration Frameworks
Prototyping with Six core Linked Data Tools
Dataset Publication: Virtuoso Universal Server, Linked Data API
External Linking: SILK, LIMES
RDF Creation: Google Refine, RDF Views, RDF Sponger
URI Naming: Pubby
Ontology Modeling: Protégé
Object Modeling: Concept Map
Summary
• Applicability of the framework to Singapore Government Data
• Issues identified in existing Data Eco System
• Recommended tools and best practices for each step
• Launchpad for SG Linked Data implementation
Final Thoughts…
• ROI is not a key metric for Linked Data implementation
• Benefits of moving to Linked Data are intangible and may not be immediately realizable
• Volume of work is huge compared to traditional systems