Post on 17-Jan-2017
Profiling Linked (Open) Data
Blerina Spahiu
Department of Computer Science, Systems and Communication, University of Milan - Bicocca
Supervisors: Andrea Maurino, Matteo Palmonari
Tutor: Prof. Flavio De Paoli
blerina.spahiu@disco.unimib.it
Outline
- The research background
- The research plan
- Preliminary results
- Conclusions and Future Work
Data profiling definition
- Where shall I begin, please your Majesty? - Begin at the beginning - the King said gravely.
Lewis Carroll in Alice’s Adventures in Wonderland
The process of evaluating data quality is called data profiling and typically involves gathering several aggregated data statistics which constitute the data profiling.
Encyclopedia of Database Systems, June 2014
Linked Open Data Cloud
- 1014 datasets
- 188 mio. triples
- 7 topical categories
- 80% of linking property is owl:sameAs
- 98.22% of datasets use RDF vocabulary
- 7.85% of the datasets provide licensing information
- Etc.
What types of resources are described in a data set? How are they described?
How well connected are the datasets in the LOD cloud? What is their topic/s?
Are data described as prescribed by the ontology?
Why do we need data profiling?
“... because prevention is better than curing”
- Data quality assessment
- Query optimization
- Ontology / Data integration
- Data analytics
- Complex schema discovery
- Topical discovery
- Data visualization
Data Profiling Tools Survey
State of the art

Roomba (Assaf et al., 2015)
  Goal: Generate descriptive dataset profiles
  Input: Query portal APIs for available metadata
  Output: Quality assessment of metadata
  Availability: Code on GitHub
  License: Open Source

LODStats (Auer et al., 2012)
  Goal: Comprehensive statistics about RDF
  Input: RDF
  Output: 32 statistical criteria on schema and data level
  Availability: Only the demo
  Tutorial: Demo

ExpLOD (Khatchadourian S. and Consens M. P., 2010)
  Goal: Supports exploring summaries of RDF usage and interlinking among datasets
  Input: RDF dataset, the BL (bisimulation label) schema and the neighborhoods to consider
  Output: Summaries can be viewed and explored in an interactive graphical way and can be exported in a variety of formats

RDFStats (Langegger A. and Wöß W., 2010)
  Goal: Generation of different statistics
  Input: RDF dataset
  Output: Histograms for value distributions, classes/properties/datatypes
  Automatization: Semi-automatic
  Availability: Yes
  License: Apache
  Tutorial: Yes

ProLOD++ (Böhm et al., 2010)
  Goal: Computes different profiling, mining or cleansing tasks
  Input: RDF dataset
  Output: Statistics about properties, classes etc. Information about uniqueness and keyness.
  Automatization: Automatic
  Availability: Demo
Profiling challenges
- The results of data profiling are computationally complex to discover
- Different and new data management architectures and frameworks have emerged
- Linked Open Data are heterogeneous data
  • Syntactic Heterogeneity (different formats, query languages)
  • Schematic Heterogeneity (different encoding schemas)
  • Semantic Heterogeneity (different vocabularies, semantic overlap of terms)
- Unified view of data profiling as a field
- Unifying framework for its tasks
Objectives
- Develop automatic approaches
- Generate new statistics and knowledge patterns to provide a dataset summary and inspect its quality
  • Apply data mining techniques to extract useful knowledge from large datasets
  • Implementation of different approaches for outlier detection
- Algorithms to overcome challenges to perform profiling in Linked Open Data
  • Parallel calculation of statistics and patterns extraction in LOD
  • Data mining techniques to deal with high dimensionality
- Topical information extraction and classification
- Developing a methodology on how to perform profiling tasks
  • A deep literature study to classify and formalize profiling tasks
Methodology Used
[Figure: diagram with the labels Schematic Heterogeneity, Semantic Heterogeneity, Syntactic Heterogeneity, Topical Discovery, Data Quality, Data Understanding]
Work already done (1): Profiling of Italian Public Administration websites
• Decree 33 and 150
Profiling of Italian PAs
Benchmark of PAs
• Geographical distribution (country wide)
• Type of PAs (region, municipality, county)
• Size (number of inhabitants)
Compliance Index
• Completeness
• Accuracy
• Timeliness
Profiling websites in terms of compliance
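As a rough illustration of how a compliance index over these three dimensions could be computed, the sketch below scores one website and averages the scores. The unweighted mean, the [0, 1] scale, and the function name `compliance_index` are assumptions made for this example only; the slides do not state how the dimensions are aggregated.

```python
def compliance_index(completeness: float, accuracy: float, timeliness: float) -> float:
    """Hypothetical aggregation: unweighted mean of three scores in [0, 1]."""
    scores = (completeness, accuracy, timeliness)
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {score}")
    return sum(scores) / len(scores)


# Example: a site that publishes everything required, half of it
# correctly, and none of it up to date.
index = compliance_index(completeness=1.0, accuracy=0.5, timeliness=0.0)
```

A weighted mean would be an equally plausible choice if some dimensions matter more for a given decree.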
“Data Profiling, the moment of truth”
The average index of compliance for the selected
• Italian Regions is 0.488 (50% have an index lower than the mean).
• Italian Provinces is 0.561 (more than 50% have an index lower than the mean).
• Italian Municipalities is 0.462 (more than 50% have an index lower than the mean).
Regions
• Veneto has the highest score (0.839)
• Campania has the lowest score (0.043)
Provinces
• Bergamo, in the Lombardia Region, has the highest score (0.759)
• Massa Carrara, in the Toscana Region, has the lowest score (0.266)
Municipalities
• Voghera (Lombardia Region) has the highest score (0.759)
• Ozegna (Piemonte Region) has the lowest score (0.164)
Work already done (2)
- Facilitating queries for the discovery of similar datasets
- Speeding up data searches
- Trends and best practices of a particular domain can be identified
To what extent can topical classification be automated?
Data Corpus and Feature Set
Category                  Datasets      %
Government                     183  18.05
Publications                    96   9.47
Life sciences                   83   8.19
User generated content          48   4.73
Cross domain                    41   4.04
Media                           22   2.17
Geographic                      21   2.07
Social Web                     520  51.28
Data corpus (1014 datasets) extracted in April 2014 from Schmachtenberg et al.
Feature sets (number of attributes in parentheses):
- Vocabulary Usage (1439)
- Class URIs (914)
- Property URIs (2333)
- Local Class Names (1041)
- Local Property Names (2493)
- Text from rdfs:label (1440)
- Top Level Domain (55)
- In and Out Degree (2)
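To make some of these feature sets concrete, here is a minimal sketch of how vocabulary usage, class URIs, property URIs and local names could be counted from a dataset's triples. It assumes triples are plain `(subject, predicate, object)` string tuples and that a vocabulary is identified by the namespace part of a URI (everything up to the last `#` or `/`). The function names are hypothetical and this is not the feature extraction code used in the study.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def split_uri(uri):
    """Split a URI into (namespace, local name) at the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/")) + 1
    return uri[:cut], uri[cut:]


def extract_features(triples):
    """Count vocabulary, class and property usage in one dataset."""
    names = ("vocabularies", "class_uris", "property_uris",
             "local_class_names", "local_property_names")
    feats = {name: Counter() for name in names}
    for _subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            # The object of rdf:type is a class.
            namespace, local = split_uri(obj)
            feats["class_uris"][obj] += 1
            feats["local_class_names"][local] += 1
        else:
            namespace, local = split_uri(predicate)
            feats["property_uris"][predicate] += 1
            feats["local_property_names"][local] += 1
        feats["vocabularies"][namespace] += 1
    return feats


triples = [
    ("http://ex.org/a", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    ("http://ex.org/a", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://ex.org/a", "http://purl.org/dc/terms/title", "Dr"),
]
features = extract_features(triples)
```

Each counter can then serve as one sparse feature vector per dataset, with one attribute per distinct term.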
Experimental Setup
Classification Approaches
- K-Nearest Neighbor
- J-48
- Naïve Bayes
Two normalization strategies
- Binary (bin)
- Relative term occurrences (rto)
Three sampling techniques
- No sampling
- Down sampling
- Up sampling
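The normalization and sampling options above can be sketched as follows. This is an illustration under stated assumptions, not the experimental code: `rto` is taken to mean a term's count divided by the total number of term occurrences in the dataset, down sampling is taken to shrink every class to the size of the smallest one, and up sampling to grow every class to the size of the largest one by drawing with replacement. All function names are hypothetical.

```python
import random
from collections import defaultdict


def binary(counts):
    """bin: 1 if the term occurs in the dataset, else 0."""
    return {term: 1 if n > 0 else 0 for term, n in counts.items()}


def relative_term_occurrence(counts):
    """rto: term count divided by the total number of term occurrences."""
    total = sum(counts.values())
    return {term: (n / total if total else 0.0) for term, n in counts.items()}


def resample(examples, mode, seed=0):
    """Rebalance (features, label) pairs; mode is 'none', 'down' or 'up'."""
    if mode not in ("none", "down", "up"):
        raise ValueError(f"unknown sampling mode: {mode}")
    if mode == "none" or not examples:
        return list(examples)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    sizes = [len(group) for group in by_label.values()]
    target = min(sizes) if mode == "down" else max(sizes)
    balanced = []
    for group in by_label.values():
        if mode == "down":
            # Keep a random subset of the class.
            balanced.extend(rng.sample(group, target))
        else:
            # Keep the class and add random duplicates up to the target size.
            balanced.extend(group + rng.choices(group, k=target - len(group)))
    return balanced


counts = {"foaf:name": 3, "dc:title": 1}
data = [(f"d{i}", "Government") for i in range(4)] + [("m0", "Media")]
down = resample(data, "down")
up = resample(data, "up")
```

With the heavily skewed class distribution of the corpus, the choice of sampling technique changes how much the majority class dominates training.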
Results on Combined Feature Sets
Our model reaches an accuracy of 81.62%
Confusion Matrix
Confusion of publications with government and life sciences
Confusion between user generated content and social networking
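For readers less familiar with the terms, the sketch below shows how accuracy and the most frequent confusions can be read off a confusion matrix. The labels and predictions are made-up toy data, not the results reported on the slide, and the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(gold, predicted):
    """Count (gold label, predicted label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


def accuracy(matrix):
    """Share of examples on the diagonal of the matrix."""
    total = sum(matrix.values())
    correct = sum(n for (g, p), n in matrix.items() if g == p)
    return correct / total if total else 0.0


def most_confused(matrix, k=3):
    """The k largest off-diagonal cells, i.e. the most frequent mistakes."""
    errors = [(pair, n) for pair, n in matrix.items() if pair[0] != pair[1]]
    return sorted(errors, key=lambda item: (-item[1], item[0]))[:k]


# Toy data for illustration only.
gold = ["publications", "publications", "government", "life sciences"]
predicted = ["government", "publications", "government", "life sciences"]
matrix = confusion_matrix(gold, predicted)
```

Sorting the off-diagonal cells is what surfaces category pairs such as the ones named above.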
Work already done (3)
- ABSTAT is a framework which can be used to summarize linked datasets and at the same time to provide statistics about them
  • The summary consists of Abstract Knowledge Patterns (AKPs) of the form <subjectType, predicate, objectType>
  • Can help users compare two datasets
  • Helps detect errors in the data, such as accuracy issues
- E.g., the AKP <dbo:Band, dbo:genre, dbo:Band>
  • The domain or the range is unspecified for 585 properties in the DBpedia Ontology
SubjectType Property ObjectType
http://dbpedia.org/ontology/Town http://dbpedia.org/ontology/governmentType http://dbpedia.org/ontology/Country
http://dbpedia.org/ontology/City http://dbpedia.org/ontology/governmentType http://dbpedia.org/ontology/Legistrature
http://dbpedia.org/ontology/Settlement http://dbpedia.org/ontology/governmentType http://dbpedia.org/ontology/Settlement
http://dbpedia.org/ontology/Country http://dbpedia.org/ontology/governmentType http://dbpedia.org/ontology/PoliticalParty
http://dbpedia.org/ontology/Village http://dbpedia.org/ontology/governmentType http://dbpedia.org/ontology/MilitaryConflict
http://dbpedia.org/ontology/Organization http://dbpedia.org/ontology/governmentType http://dbpedia.org/ontology/City
http://dbpedia.org/ontology/AdministrativeRegion http://dbpedia.org/ontology/governmentType
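Patterns like the ones listed above can be derived from instance data roughly as in the following sketch. It is a simplified illustration and does not reproduce ABSTAT's own type-selection logic: it pairs every asserted type of the subject with every asserted type of the object, and it uses the hypothetical placeholder `untyped` for resources or literals without an `rdf:type` assertion. Triples are assumed to be plain string tuples.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
UNTYPED = "untyped"  # hypothetical placeholder, not an ABSTAT term


def extract_akps(triples):
    """Count <subjectType, predicate, objectType> patterns in a dataset."""
    # First pass: collect the asserted types of every resource.
    types = {}
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            types.setdefault(subject, set()).add(obj)
    # Second pass: lift every non-typing triple to the type level.
    akps = Counter()
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            continue
        for subject_type in types.get(subject, {UNTYPED}):
            for object_type in types.get(obj, {UNTYPED}):
                akps[(subject_type, predicate, object_type)] += 1
    return akps


triples = [
    ("ex:Milan", RDF_TYPE, "dbo:City"),
    ("ex:Italy", RDF_TYPE, "dbo:Country"),
    ("ex:Milan", "dbo:country", "ex:Italy"),
    ("ex:Rome", "dbo:country", "ex:Italy"),
]
akps = extract_akps(triples)
```

Patterns with a low count or an implausible type combination, such as a governmentType whose object is a MilitaryConflict, are candidates for manual inspection.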
Evaluation Plan
- Where a gold standard exists, validate in terms of precision, recall and F-measure
- Difficulties in evaluating the validity of the proposed approach
  • How these statistics or summaries improve the performance of actual profiling tasks
  • Humans will evaluate the validity of the summarization in terms of relatedness and informativeness
  • Provide users with a list of statistics and ask which are most important for their use case
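For the gold-standard case, the three measures can be computed as in this minimal sketch. It assumes the gold standard and the system output are both sets of items (for example, extracted patterns) and uses the balanced F-measure (F1). The function name is hypothetical.

```python
def precision_recall_f1(gold: set, predicted: set):
    """Return (precision, recall, F1) of predicted items against a gold set."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 2 of 3 predictions are correct, 2 of 4 gold items are found.
p, r, f1 = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "x"})
```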
Conclusions and Future Work
- The topical classification approach yields an accuracy of 82%; future work: enrich it with other features such as the linkage coverage
  • Each dataset has only one topic; for some datasets multi-label classification can be appropriate
  • A classifier chain for the multi-label classification
  • Because of the heavy imbalance of the data, a two-stage classifier might help
- Enrich the ABSTAT framework with other statistics and apply it to unstructured data such as microdata.
- Investigate the trade-off between ABSTAT summarization to support dataset exploration and understanding.
Publications
• A. Maurino, B. Spahiu, C. Batini, G. Viscusi – Compliance with Open Government Data Policies: An empirical evaluation of Italian Local Public Administration. ECIS – eGOV Workshop at the Twenty-Second European Conference on Information Systems, Tel Aviv, 2014
• A. Maurino, B. Spahiu, C. Batini, G. Viscusi – Compliance with Open Government Data Policies: An empirical evaluation of Italian Local Public Administration. Information Polity Journal, p. 263-275, 2014
• M. Palmonari, A. Rula, R. Porrini, A. Maurino, B. Spahiu, V. Ferme – ABSTAT: Linked Data Summaries with Abstraction and STATistics – The Semantic Web: ESWC 2015, Portoroz, Slovenia, May 31 to June 4, 2015
• R. Meusel, B. Spahiu, C. Bizer, H. Paulheim – Towards Automatic Classification of LOD datasets – LDOW Workshop co-located with the 24th International World Wide Web Conference (WWW 2015), Firenze, May 19, 2015
• B. Spahiu – Profiling the Linked (Open) Data – Doctoral Consortium Call at ISWC 2015
• C. Xie, D. Ritze, B. Spahiu, H. Cai – Instance-based property matching in Linked Open Data Environment – Ontology Matching Workshop co-located with the 14th International Semantic Web Conference, 2015, Bethlehem, Pennsylvania, USA
Thank you for your attention!