linked census data
![Page 1: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/1.jpg)
DANS is an institute of KNAW and NWO
Data Archiving and Networked Services
Linked Census Data: Semantics for Knowledge Discovery of the Past
Albert Meroño-Peñuela
01/03/2013
![Page 2: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/2.jpg)
Main goal: cross queries
?
![Page 3: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/3.jpg)
Main goal: requirements
• Schema flexibility: do not commit to a specific schema
• Linkage
– Internally (e.g. between tables), to make relations explicit
– Externally
  • Harmonization datasets (e.g. HISCO, AC)
  • Enriching datasets (e.g. labour strikes, book publications)
• Inference of new knowledge (e.g. ink_manufacturer(X) & ink_manufacturer ⊑ chemical |= chemical(X))
• Publication: as open data for researchers on the Web (through service architectures)
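The inference requirement can be illustrated with a small sketch. It assumes that ink_manufacturer is declared a subclass of chemical, so that any individual typed as ink_manufacturer is also a chemical. All names are hypothetical, and plain Python stands in for a real reasoner.

```python
# Minimal sketch of subclass-based inference, plain Python only.
# All names are illustrative assumptions, not part of the CEDAR code base.

# Asserted subclass axioms: child class -> parent class.
SUBCLASS_OF = {"ink_manufacturer": "chemical"}

# Asserted class memberships: (individual, class).
ASSERTED = {("X", "ink_manufacturer")}


def infer_types(asserted, subclass_of):
    """Return asserted memberships plus those implied by the subclass chain."""
    inferred = set(asserted)
    for individual, cls in asserted:
        seen = {cls}  # guards against cyclic subclass declarations
        parent = subclass_of.get(cls)
        while parent is not None and parent not in seen:
            inferred.add((individual, parent))
            seen.add(parent)
            parent = subclass_of.get(parent)
    return inferred


types = infer_types(ASSERTED, SUBCLASS_OF)
print(sorted(types))
```

In an RDF setting the same entailment would typically come from rdfs:subClassOf and a reasoner or a SPARQL property path rather than hand-written code.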
![Page 4: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/4.jpg)
Main goal: RDF datamodel
![Page 5: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/5.jpg)
CEDAR development cycle, iteration 1
• Gathering: only one file
• Conversion: TabLinker, small table size
• Querying: simple, ad-hoc SPARQL + trivial visualization
![Page 6: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/6.jpg)
Iteration 1: conversion
https://github.com/Data2Semantics/TabLinker
• Supervised Excel to RDF conversion
• Python feat. xlutils, xlrd, rdflib libs
• Intended for complex layouts that cannot be handled with automatic csv2rdf scripts
• Maps workbooks to the RDF Data Cube vocabulary
• Layout needs to be manually annotated
![Page 7: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/7.jpg)
Iteration 1: conversion
![Page 8: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/8.jpg)
Iteration 1: conversion
![Page 9: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/9.jpg)
Iteration 1: querying
PREFIX d2s: <http://www.data2semantics.org/core/>
PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/>
PREFIX ns2: <http://www.data2semantics.org/core/Eerste_gedeelte/Kom/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?place ?size WHERE {
  ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal ;
                            d2s:dimension d2sdata:M_ ;
                            ns2:Buiten_de_kom ?place ;
                            d2s:populationSize ?size ] .
  ?place skos:prefLabel "TOT"@nl .
} ORDER BY DESC(?size)
![Page 10: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/10.jpg)
Iteration 1: querying
http://cedar-project.nl/visualizing-sparql-query-results-on-the-census/
![Page 11: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/11.jpg)
Iteration 1: outcome
![Page 12: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/12.jpg)
CEDAR development cycle, iteration 2
• Gathering: arbitrary number of files
– But what do we have?
• Conversion: arbitrary table size, annotations
• Querying: SPARQL with mappings, top-level ontologies
![Page 13: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/13.jpg)
Iteration 2: gathering
Hey, what’s there?
Inventory of the dataset
• How many files do we have?
• How many tables/sheets?
• How many variables?
• How many annotations?
TabExtractor (Python feat. xlrd, Levenshtein libs)
https://github.com/CEDAR-project/TabExtractor
![Page 14: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/14.jpg)
Iteration 2: gathering
https://github.com/CEDAR-project/TabExtractor
https://www.dropbox.com/s/ah7lgmji2ofat3w/Census%20summary.xls
![Page 15: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/15.jpg)
Iteration 2: gathering
https://github.com/CEDAR-project/TabExtractor
https://www.dropbox.com/s/vw1rf4pp8g8sxn3/annotations-dump-translation.csv
![Page 16: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/16.jpg)
Iteration 2: gathering
| Year | File | Table | Row | Col | Author |
|------|------|-------|-----|-----|--------|
| 1899 | VT_1899_06_H5.xls | Utrecht | 155 | 3 | Vreugdenhil |
| 1899 | VT_1899_06_H5.xls | Utrecht | 805 | 3 | Vreugdenhil |
| 1930 | WT_1930_04_A-T2.xls | Tabel 2a | 0 | 0 | Helpdesk |
| 1930 | WT_1930_04_A-T2.xls | Tabel 2b | 0 | 0 | Th. Vreugdenhil |
| 1909 | VT_1909_01_T.xls | Tabel 1 | 10058 | 13 | DFS 7 |
| 1909 | VT_1909_01_T.xls | Tabel 1 | 3321 | 15 | ServiceProfs 001 |
| 1909 | VT_1909_01_T.xls | Tabel 1 | 11909 | 13 | DFS 7 |
| 1909 | VT_1909_01_T.xls | Tabel 1 | 12596 | 11 | DFS 8 |
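As a rough illustration of the inventory questions (how many files, tables and annotations), the sketch below counts them over the annotation records listed above. It is not TabExtractor's actual implementation; the record layout and function name are assumptions made for this example.

```python
from collections import Counter

# Annotation records copied from the listing above:
# (year, file, table, row, col, author)
ANNOTATIONS = [
    (1899, "VT_1899_06_H5.xls", "Utrecht", 155, 3, "Vreugdenhil"),
    (1899, "VT_1899_06_H5.xls", "Utrecht", 805, 3, "Vreugdenhil"),
    (1930, "WT_1930_04_A-T2.xls", "Tabel 2a", 0, 0, "Helpdesk"),
    (1930, "WT_1930_04_A-T2.xls", "Tabel 2b", 0, 0, "Th. Vreugdenhil"),
    (1909, "VT_1909_01_T.xls", "Tabel 1", 10058, 13, "DFS 7"),
    (1909, "VT_1909_01_T.xls", "Tabel 1", 3321, 15, "ServiceProfs 001"),
    (1909, "VT_1909_01_T.xls", "Tabel 1", 11909, 13, "DFS 7"),
    (1909, "VT_1909_01_T.xls", "Tabel 1", 12596, 11, "DFS 8"),
]


def summarize(records):
    """Count distinct files, distinct tables, and annotations per year."""
    files = {file for _, file, _, _, _, _ in records}
    tables = {(file, table) for _, file, table, _, _, _ in records}
    per_year = Counter(year for year, *_ in records)
    return {
        "files": len(files),
        "tables": len(tables),
        "annotations": len(records),
        "per_year": dict(per_year),
    }


summary = summarize(ANNOTATIONS)
print(summary)
```

A table is identified here by the pair (file, sheet name), since sheet names such as "Tabel 1" may recur across workbooks.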
![Page 17: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/17.jpg)
Iteration 2: gathering
• 507 Excel files
• 2,288 tables
• 33,283 annotated cells
– 10.95% numerical corrections
– 89.05% textual descriptions / anomalies
But TabExtractor ain’t a sexy thing…
• Bring metadata together
• Publish on the Web? Archive?
![Page 18: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/18.jpg)
Iteration 2: gathering
Subset of the dataset
• Miniproject 1
– 1889
– Occupational census
– Province Noord-Brabant
– 1 table
• Miniproject 2
– 1859, 1869, 1879, 1889
– Population census
– Province Noord-Brabant
– 4 tables
![Page 19: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/19.jpg)
Iteration 2: conversion
• Iteration 1 converted only Excel cells to RDF
• Some cells have annotations attached
– Value corrections: 5 → 8
– Explanations, descriptions: Number includes 2 people of unknown age
– Inconsistencies: Sum does not add up
• Iteration 2 produces proper named graphs for annotations
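The idea of keeping annotations in their own named graph can be sketched as follows. Quads are modelled as plain (subject, predicate, object, graph) tuples. Every URI, property name and cell identifier below is a hypothetical placeholder, not the vocabulary actually used by the project.

```python
# Hypothetical namespace; the real CEDAR/TabLinker URIs are not shown here.
BASE = "http://example.org/census/"
DATA_GRAPH = BASE + "graph/data"
ANNOTATION_GRAPH = BASE + "graph/annotations"


def cell_quads(cell_id, value, annotation=None):
    """Return (subject, predicate, object, graph) tuples for one cell.

    The cell value goes into the data graph; an attached annotation,
    if any, goes into a separate annotations graph.
    """
    cell = BASE + "cell/" + cell_id
    quads = [(cell, BASE + "value", value, DATA_GRAPH)]
    if annotation is not None:
        quads.append((cell, BASE + "annotation", annotation, ANNOTATION_GRAPH))
    return quads


# "C12" is a made-up cell identifier; the annotation text is from the slide.
quads = cell_quads("C12", 8, annotation="Sum does not add up")
for quad in quads:
    print(quad)
```

Separating the graphs means the raw cell values can be queried on their own, while annotations can be joined in only when they are needed.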
![Page 20: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/20.jpg)
Iteration 2: conversion
Annotations data model
![Page 21: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/21.jpg)
Iteration 2: conversion
Annotations data model
![Page 22: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/22.jpg)
Iteration 2: conversion
![Page 23: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/23.jpg)
Iteration 2: data quality
• Annotations can improve data quality
• Model has to be extended with actions
– If sum doesn’t add up → retrieve numbers from other tables/sources
– Appropriate vocabularies
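A possible first step towards such actions is detecting the "sum does not add up" case automatically. The sketch below compares a reported total with the sum of its parts; the function name and the returned fields are assumptions for illustration, and the follow-up action (retrieving numbers from other tables) is left out.

```python
def check_total(parts, reported_total):
    """Compare a reported total against the sum of its component cells."""
    computed = sum(parts)
    return {
        "computed": computed,
        "reported": reported_total,
        "consistent": computed == reported_total,
        "difference": reported_total - computed,
    }


ok = check_total([5, 8, 2], 15)
bad = check_total([5, 8, 2], 16)
print(ok)
print(bad)
```

A non-zero difference could then trigger the lookup in other tables or sources that the slide proposes.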
![Page 24: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/24.jpg)
Iteration 2: data quality
• Measure of data quality? Benford’s Law
– Data distributions in censuses meet Benford’s Law
– Demo available!
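Benford's Law states that the leading digit d of many naturally occurring numbers appears with probability log10(1 + 1/d). The sketch below shows one way such a check could be run on integer counts. It is not the demo mentioned on the slide; the deviation measure and the sample data (powers of two, which are known to follow the law closely) are choices made for this example.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit d: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit_shares(values):
    """Observed share of each leading digit among non-zero integer counts."""
    digits = [int(str(abs(v))[0]) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def mean_absolute_deviation(values):
    """Average absolute gap between observed and expected digit shares."""
    observed = leading_digit_shares(values)
    expected = benford_expected()
    return sum(abs(observed[d] - expected[d]) for d in range(1, 10)) / 9


# Stand-in data: powers of two instead of real census cells.
sample = [2 ** n for n in range(100)]
mad = mean_absolute_deviation(sample)
print(f"mean absolute deviation from Benford: {mad:.4f}")
```

On census tables, a large deviation would not prove an error, but it could point to tables worth inspecting more closely.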
![Page 25: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/25.jpg)
Iteration 2: querying
PREFIX d2s: <http://www.data2semantics.org/core/>
PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/>
PREFIX ns2: <http://www.data2semantics.org/core/Eerste_gedeelte/Kom/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?place ?size WHERE {
  ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal ;
                            d2s:dimension d2sdata:M_ ;
                            ns2:Buiten_de_kom ?place ;
                            d2s:populationSize ?size ] .
  ?place skos:prefLabel "TOT"@nl .
} ORDER BY DESC(?size)
![Page 26: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/26.jpg)
Iteration 2: querying
PREFIX d2s: <http://www.data2semantics.org/core/>
PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1879_10_H1_marked/NOORD-BRABANT/>
PREFIX ns2: <http://www.data2semantics.org/core/Kom-buiten-de-kom/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?place ?size WHERE {
  ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal ;
                            d2s:dimension d2sdata:M ;
                            ns2:Kom_Buiten_de_kom ?place ;
                            d2s:populationSize ?size ] .
  ?place skos:prefLabel "Totaal in de gemeente"@nl .
} ORDER BY DESC(?size)

PREFIX d2s: <http://www.data2semantics.org/core/>
PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/>
PREFIX ns2: <http://www.data2semantics.org/core/Eerste_gedeelte/Kom/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?place ?size WHERE {
  ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal ;
                            d2s:dimension d2sdata:M_ ;
                            ns2:Buiten_de_kom ?place ;
                            d2s:populationSize ?size ] .
  ?place skos:prefLabel "TOT"@nl .
} ORDER BY DESC(?size)
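The 1879 and 1889 queries share the same shape and differ only in namespaces, dimension names and the label used for totals. The sketch below makes that explicit by generating both from one template. The parameter names (`data_ns`, `kom_ns`, `second_dim`, `place_prop`, `total_label`) are invented for this example; the values are taken from the two queries.

```python
QUERY_TEMPLATE = """\
PREFIX d2s: <http://www.data2semantics.org/core/>
PREFIX d2sdata: <{data_ns}>
PREFIX ns2: <{kom_ns}>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?place ?size WHERE {{
  ?cell d2s:isObservation [ d2s:dimension d2sdata:Totaal ;
                            d2s:dimension d2sdata:{second_dim} ;
                            ns2:{place_prop} ?place ;
                            d2s:populationSize ?size ] .
  ?place skos:prefLabel "{total_label}"@nl .
}} ORDER BY DESC(?size)
"""

# Per-year differences, copied from the two queries on the slide.
YEAR_PARAMS = {
    1879: {
        "data_ns": "http://www.data2semantics.org/data/VT_1879_10_H1_marked/NOORD-BRABANT/",
        "kom_ns": "http://www.data2semantics.org/core/Kom-buiten-de-kom/",
        "second_dim": "M",
        "place_prop": "Kom_Buiten_de_kom",
        "total_label": "Totaal in de gemeente",
    },
    1889: {
        "data_ns": "http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/",
        "kom_ns": "http://www.data2semantics.org/core/Eerste_gedeelte/Kom/",
        "second_dim": "M_",
        "place_prop": "Buiten_de_kom",
        "total_label": "TOT",
    },
}


def build_query(year):
    """Fill the shared template with the year-specific names."""
    return QUERY_TEMPLATE.format(**YEAR_PARAMS[year])


print(build_query(1879))
```

Each new census year needs its own hand-maintained entry in this table, which is the kind of overhead that mappings to shared vocabularies are meant to remove.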
![Page 27: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/27.jpg)
Iteration 2: querying
![Page 28: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/28.jpg)
Iteration 2: querying
• Things to be mapped
– Occupations (HISCO)
– Municipalities (Amsterdamse Code)
– Housing types
– Religions
– Etc.
• Converted the HISCO and AC mappings to RDF (https://github.com/CEDAR-project/Harmonize)
– Linked to the RDF of the tables
![Page 29: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/29.jpg)
Iteration 2: linking HISCO
![Page 30: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/30.jpg)
Iteration 2: linking AC
![Page 31: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/31.jpg)
Iteration 2: linking
![Page 32: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/32.jpg)
Iteration 2: linking
• Issue: HISCO is too generic (top-down approach)
– Class 21110 too abstract: General Manager
– Visualization of SPARQL HISCO mappings
• Issue: AC works at the municipality level
– Other geographical harmonizations?
• Need for year-level ontologies
– Classification systems are different
• R script for a bottom-up approach: classification extractor (https://github.com/albertmeronyo/OccupationOntology)
– Automated removal of non-related cols and rows
– Introduction of redundancy (‘Id.’ values)
– Removal of totals
– Work in progress: ontology merging
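Two of the clean-up steps listed for the classification extractor can be sketched in a few lines. The original tool is an R script; the Python below is only an illustration. It assumes that "introduction of redundancy" means replacing each ‘Id.’ (idem) cell with the value above it, and that total rows can be recognised by their label. The function names and the sample occupation labels are made up.

```python
def expand_idem(column, marker="Id."):
    """Replace each idem marker with the most recent real value above it."""
    filled = []
    previous = None
    for value in column:
        if value == marker and previous is not None:
            filled.append(previous)
        else:
            filled.append(value)
            previous = value
    return filled


def drop_totals(rows, label_index=0, total_labels=("TOT", "Totaal")):
    """Remove rows whose label cell marks them as a total."""
    return [row for row in rows if row[label_index] not in total_labels]


# Made-up sample rows: (occupation label, count).
labels = expand_idem(["Schoenmakers", "Id.", "Bakkers", "Id."])
rows = drop_totals([("Schoenmakers", 12), ("Bakkers", 7), ("Totaal", 19)])
print(labels)
print(rows)
```

Filling in the ‘Id.’ cells first means every row carries its full classification label before totals are removed and before any later merging step.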
![Page 33: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/33.jpg)
Iteration 2: linking
Upper ontologies (HISCO, AC)
Year-dependent ontologies
![Page 34: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/34.jpg)
Iteration 2: linking
Upper ontologies (HISCO, AC)
Year-dependent ontologies
![Page 35: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/35.jpg)
Iteration 2: linking
Upper ontologies (HISCO, AC)
Year-dependent ontologies
? ?
![Page 36: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/36.jpg)
Concept drift
• Models drift over time
• Classes merge, split, change their properties (beroepenklassen, i.e. occupation classes)
• Still, some core meaning remains (shoemakers)
• Can we automatically identify and align drifted models?
? ?t1 t2 tn
![Page 37: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/37.jpg)
Conclusion: milestones
• Complete inventory of the dataset (w/ metadata generation)
• Translation to RDF
– Raw data
– Annotations
– Harmonization/linking
• Successful data quality experiments (Benford’s Law)
• Useful software
– TabLinker (Excel/CSV to RDF)
– TabExtractor (Excel/CSV metadata collector)
– Harmonize (HISCO/AC to Census linker)
– OccupationOntology (bottom-up occupation ontology extractor)
![Page 38: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/38.jpg)
Conclusion: future work
• Better software
– TabLinker: automate mark-up process
– TabExtractor: improve and publish inventory output
– Harmonize: improve HISCO/AC datamodels
– OccupationOntology: extend to housing types, religions, etc.
• Concept drift literature on drifting models (Kuukkanen 2008, Gonçalves et al. 2009, Shenghui et al. 2010)
• Semantic Web literature on modeling geographical change (Kauppinen 2010)
– Integrate with AC dataset?
• Link meaningful datasets with the census
– Labour strikes
– Book publications
– More?
![Page 39: Linked Census Data](https://reader034.vdocument.in/reader034/viewer/2022051612/54bc9ee64a7959777e8b4590/html5/thumbnails/39.jpg)
Data Archiving and Networked Services (DANS)
Anna van Saksenlaan 10 | 2593 HT Den Haag
P.O. Box 93067 | 2509 AB Den Haag
070 3446 484 | [email protected] | www.dans.knaw.nl
KVK 54667089 | DANS is an institute of KNAW and NWO
Thank you
http://www.cedar-project.nl