Korea Advanced Institute of Science and Technology
Scalable Access and Process of Linked Open Data
A Semantic Cloud Generation Approach based on Linked Data for Efficient Semantic Annotation
In-Young Ko, August 16, 2012
Adapted from the material of Hangyu Ko, a Ph.D. student at WebEng Lab.
2012 International Asian Summer School on Linked Data (IASLOD 2012) August 13 – 17, 2012, KAIST, Daejeon, Korea
Contents
Introduction
  • Objectives and Motivations
Challenges
  • SPARQL Performance
  • Dynamic Access to Linked Data
  • Forming effective semantic clouds
Similarity-Link-based Semantic Cloud Generation
  • Similarity-Link Analysis for Concept Grouping
  • Centrality Measurement
  • Incremental Traversing and Grouping
Evaluation
  • Performance of Incremental Traversing
  • Candidate Reduction of Similarity-Link Analysis
Conclusion
2012.08.16
2
Copyright (c) In-Young Ko, KAIST
Objectives
To provide a semantic-cloud-based annotation scheme
  • Use semantic clouds as the primary interface
  • Easy to add semantic annotation in resource-constrained environments (e.g., smartphones and IPTVs)
To propose a framework for generating efficient semantic clouds
  • To allow users to intuitively recognize candidate concepts while resolving semantic ambiguity
  • To utilize Linked Data to dynamically generate semantic clouds
Semantic Annotation
Tagging with terms that map onto ontology classes
  • Enhanced information retrieval, e.g., resolving ambiguity in search between “Apple” the fruit and “Apple” the company
Types of semantic annotation
  • Manual semantic annotation
      Human annotation is often fraught with errors
      Knowledge acquisition bottleneck
  • Automatic semantic annotation
      Impossible to automatically identify and classify all entities in source documents with complete accuracy
  • Semi-automatic semantic annotation
      All existing semantic annotation systems rely on human intervention at some point in the annotation process
Usability and Scalability in Semantic Annotation
Problems of previous efforts on semantic annotation
  • Use terms from ontologies created by domain experts
      Do not provide sufficient options to cover various kinds of semantics
      Do not necessarily reflect newly created knowledge in an up-to-date manner
Motivating Scenario
Expected Results
Example: ‘Apple’
  • Additional relevant terms that do not contain the keyword, e.g., ‘iTunes’ and ‘Macintosh’ in the pink cloud
Apple Corp.
What is Linked Data?
Linked Data is large-scale and heterogeneous Semantic Web data
  • More than 31 billion RDF triples from 295 different datasets
Problems in Retrieving Linked Data
SPARQL
  • Need to know all the endpoints of datasets
  • Slow response time
Linked Data search engines (Sindice, SWSE, Falcons, etc.)
  • Only 0.6% ~ 30% of the results come from Linked Data datasets
  • Limited number of results: maximum 1000 URIs for each query
Billion Triple Challenge (BTC) dataset (2009 ~ 2011)
  • Aggregation and organization of dumps from Linked Data search engines
  • Only about 30% of triples belong to Linked Data datasets (984,611,067 / 3,173,563,606)
LDSpider
  • Need to know seed URIs
Challenges in Generating Semantic Clouds
Dynamic Access to Linked Data
  • Too many responses for each query
  • Not feasible to ask users to choose the most appropriate one
Forming effective semantic clouds for annotation
  • Relevant terms are grouped into a small number of clouds
  • The semantics of a cloud should be intuitively recognizable
  • Semantic ambiguity between clouds should be minimized
No.  Keyword     Triples
1.   Animal       67,054
2.   Apple        49,443
3.   California  149,785
4.   Cloud        63,939
5.   Music       256,352
6.   New York     17,291
7.   Sky         153,716
8.   Tiger        27,901
9.   Travel       65,871
10.  Wedding      22,624
Semantic Cloud Generation Framework
An incremental and iterative access method
Three phases of semantic cloud generation:
Semantic Cloud Generation Steps
1. Find Spotting Points
  • Find the initial set of RDF nodes by using a LOD search engine or BTC
  • Retrieve and group similar RDF concepts via SPARQL endpoints
  • Prioritize the nodes within a group by using centrality analysis
2. Select Links to Traverse
  • Consider popular vocabularies such as FOAF, DC, SKOS and SIOC
  • Selectively traverse the Linked Data graph based on user or task context
3. Generate Concept Clouds
  • Check semantic similarity to merge RDF nodes
  • Minimize semantic ambiguity to make clusters more distinguishable
  • Increase the number of hops and the number of common terms to cover more RDF nodes
[Flowchart: start, steps 1 to 3, decision “User Selects a Cloud” (Y / N), end]
Finding Spotting Points
Concept Search
  • Keyword-based search on relevant concepts
Similarity-Link Analysis
  • owl:sameAs parsing for grouping semantically identical concepts
  • skos:broader parsing for grouping semantically relevant concepts
Centrality Measurements
  • Importance of each node, connection of each node
Concept Search
Keyword-based Search
Triple pattern: Subject, Predicate, Object
Label predicates:
  http://www.w3.org/2000/01/rdf-schema#label
  http://www.w3.org/2004/02/skos/core#prefLabel
  http://purl.org/dc/elements/1.1/title
  http://purl.org/dc/terms/title
  http://sw.cyc.com/CycAnnotations_v1#label
  http://rdf.freebase.com/ns/type.object.name
  http://www.geonames.org/ontology#name
  http://www.w3.org/2004/02/skos/core#altLabel
Datasets: DBpedia, Freebase, ∙∙∙
  • Each data set uses a different ontology
Concept Retrieval
  • Collect the ‘Subject’ concepts
  • Retrieved concept lists: (Concept 1, Concept 2, ..., Concept n), (Concept 1, Concept 2, ..., Concept k)
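The keyword-based search above could be sketched as a single SPARQL query per endpoint that matches the keyword against any of the listed label predicates and collects the subjects. The function name `build_keyword_query`, the `limit` parameter, and the use of `FILTER ... IN` with `CONTAINS` (which assumes a SPARQL 1.1 endpoint) are illustrative assumptions, not the authors' implementation.

```python
# Label predicates listed on the slide.
LABEL_PREDICATES = [
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2004/02/skos/core#prefLabel",
    "http://purl.org/dc/elements/1.1/title",
    "http://purl.org/dc/terms/title",
    "http://sw.cyc.com/CycAnnotations_v1#label",
    "http://rdf.freebase.com/ns/type.object.name",
    "http://www.geonames.org/ontology#name",
    "http://www.w3.org/2004/02/skos/core#altLabel",
]


def build_keyword_query(keyword: str, limit: int = 100) -> str:
    """Return a SPARQL query selecting subjects whose label contains the keyword.

    Hypothetical helper: the query shape is an assumption for illustration.
    """
    # Escape backslashes and double quotes so the keyword is a valid string literal.
    safe = keyword.replace("\\", "\\\\").replace('"', '\\"')
    predicates = ", ".join(f"<{p}>" for p in LABEL_PREDICATES)
    return (
        "SELECT DISTINCT ?subject ?label WHERE {\n"
        "  ?subject ?predicate ?label .\n"
        f"  FILTER (?predicate IN ({predicates}))\n"
        f'  FILTER (CONTAINS(LCASE(STR(?label)), LCASE("{safe}")))\n'
        "}\n"
        f"LIMIT {limit}"
    )


print(build_keyword_query("Apple", limit=10))
```

Sending the same query to every endpoint keeps the lookup independent of each data set's own ontology, since only the label predicates need to be known in advance.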
Similarity-Link Analysis Model
[Diagram: a Concept has hasURI and hasLabel (Literal); hasInLinks and hasOutLinks to InLink and OutLink, each with “has # of Links” (Integer); hasSkosNarrower/Broader, hasOwlsameAs and hasSkosExactMatch (Literal); cardinalities shown are 0…n and 1…1]
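One way to read the model is as a record per concept. The sketch below is a hypothetical Python rendering: the class and field names are assumptions derived from the property names in the diagram, and the assignment of cardinalities (one URI and one label per concept, zero or more of every link type) is a guess, since the diagram layout did not survive extraction.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Concept:
    """One node of the similarity-link analysis model (field names are assumptions)."""

    uri: str    # hasURI, assumed 1…1
    label: str  # hasLabel, assumed 1…1
    in_links: List[str] = field(default_factory=list)          # hasInLinks
    out_links: List[str] = field(default_factory=list)         # hasOutLinks
    owl_same_as: List[str] = field(default_factory=list)       # hasOwlsameAs, 0…n
    skos_exact_match: List[str] = field(default_factory=list)  # hasSkosExactMatch, 0…n
    skos_broader: List[str] = field(default_factory=list)      # hasSkosNarrower/Broader, 0…n
    skos_narrower: List[str] = field(default_factory=list)

    @property
    def num_in_links(self) -> int:
        """'has # of Links' for incoming links."""
        return len(self.in_links)

    @property
    def num_out_links(self) -> int:
        """'has # of Links' for outgoing links."""
        return len(self.out_links)

    def similarity_links(self) -> List[str]:
        """All URIs reachable through a similarity link of any kind."""
        return (self.owl_same_as + self.skos_exact_match
                + self.skos_broader + self.skos_narrower)


# URIs come from the Apple example in this deck; the label text and the
# broader/narrower direction are illustrative assumptions.
apple_inc = Concept(
    uri="http://dbpedia.org/data/Category:Apple_Inc.",
    label="Apple Inc.",
    owl_same_as=["http://rdf.freebase.com/ns/m/0k8z"],
    skos_narrower=["http://dbpedia.org/data/Category:Apple_Inc._hardware"],
)
print(apple_inc.similarity_links())
```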
Similarity-Link Analysis – owl:sameAs
[Figure: concepts (C) connected by owl:sameAs links]
Similarity-Link Analysis – skos:broader
[Figure: concepts (C) connected by skos:broader links]
Similarity-Link Analysis – an example
<http://rdf.freebase.com/ns/m/0k8z>
  owl:sameAs
<http://dbpedia.org/data/Category:Apple_Inc.>
  skos:broader
<http://dbpedia.org/data/Category:Apple_Inc._hardware>
<http://dbpedia.org/data/Category:Apple_IIGS>
<http://dbpedia.org/data/Category:Apple_Lisa>
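Grouping concepts by their similarity links can be sketched with a union-find structure: every owl:sameAs or skos:broader pair merges two concepts into one group. This is a minimal illustration, not the authors' algorithm. Treating skos:broader as an undirected grouping link is a simplification, the pairing of the three narrower categories with Category:Apple_Inc. is an assumption about the figure, and the `example.org` URI is a made-up unrelated concept.

```python
from collections import defaultdict


def group_concepts(concepts, links):
    """Group concept URIs connected by similarity links (union-find)."""
    parent = {c: c for c in concepts}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        # Concepts that only appear in a link are added on the fly.
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for c in parent:
        groups[find(c)].add(c)
    return list(groups.values())


FREEBASE_APPLE = "http://rdf.freebase.com/ns/m/0k8z"
APPLE_INC = "http://dbpedia.org/data/Category:Apple_Inc."
HARDWARE = "http://dbpedia.org/data/Category:Apple_Inc._hardware"
APPLE_IIGS = "http://dbpedia.org/data/Category:Apple_IIGS"
APPLE_LISA = "http://dbpedia.org/data/Category:Apple_Lisa"
UNRELATED = "http://example.org/concept/Apple_pie"  # hypothetical, not from the slides

concepts = [FREEBASE_APPLE, APPLE_INC, HARDWARE, APPLE_IIGS, APPLE_LISA, UNRELATED]
links = [
    (FREEBASE_APPLE, APPLE_INC),  # owl:sameAs
    (HARDWARE, APPLE_INC),        # skos:broader (assumed pairing)
    (APPLE_IIGS, APPLE_INC),      # skos:broader (assumed pairing)
    (APPLE_LISA, APPLE_INC),      # skos:broader (assumed pairing)
]

groups = group_concepts(concepts, links)
for g in groups:
    print(sorted(g))
```

With these inputs the six candidate concepts collapse into two groups, which is the kind of candidate reduction the analysis aims for.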
Centrality Analysis
Connections to other concepts
  • Degree centrality
      The number of links incident upon a node
      Indegree (popularity), outdegree (gregariousness)
  • Eigenvector centrality
      Connections to high-scoring nodes contribute more than connections to low-scoring nodes
  • Katz centrality & PageRank
      Katz centrality: a generalization of degree centrality (counts all nodes connected through a path)
      PageRank: a variant of eigenvector centrality
Closeness to other concepts
  • Closeness centrality
      Farness = sum of distances to all other nodes
      Closeness = the inverse of the farness
  • Betweenness centrality
      A node with a high probability of occurring on a randomly chosen shortest path between two randomly chosen nodes has high betweenness
http://en.wikipedia.org/wiki/Centrality
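Two of the measures above, degree centrality and closeness centrality, can be computed with a few lines of plain Python. The sketch below assumes an undirected, unweighted graph and uses made-up node names; it illustrates the definitions only and says nothing about which measure the authors used for ranking.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def degree_centrality(adj):
    """Number of links incident upon each node."""
    return {node: len(neighbors) for node, neighbors in adj.items()}


def closeness_centrality(adj):
    """Inverse of farness, where farness is the sum of shortest-path distances.

    Only reachable nodes contribute, so the values are comparable
    only within one connected component.
    """
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        farness = sum(dist.values())
        result[source] = 1.0 / farness if farness > 0 else 0.0
    return result


# Hypothetical toy graph: A is linked to B, C and D; D is linked to E.
adj = build_adjacency([("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")])
degree = degree_centrality(adj)
closeness = closeness_centrality(adj)
print(degree)
print(closeness)
```

In this toy graph node A has both the highest degree (3) and the highest closeness (farness 5), so it would be prioritized within its group under either measure.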
Concept Groups
No.  Label                   Dataset
1    Apple@cs                Freebase
2    Apple_Corp.             Freebase
3    Sugared Apple           Freebase
4    R.W. Apple Jr.          DBpedia
5    Apple Pie               DBpedia
6    Apple_Pink              Freebase
7    Barton Brands           Freebase
8    Apple MacBook           Freebase
9    H.W. Longfellow         Freebase
10   Apple_Developer_Tools   Freebase
Keyword: ‘Apple’
Concept Clusters (1-hop traversal)
Incremental Traversal & Grouping
Problems of accessing Linked Data via SPARQL endpoints
  • Slow response time
  • Exponentially increasing number of concepts to traverse
Approach to solve the problems
  • Incrementally traverse (80% of relevant concepts can be retrieved in 2 hops)
  • Wait for the result from an endpoint only for the threshold time
[Figure: concepts reached at 0 hop, 1 hop and 2 hops]
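The hop-by-hop traversal can be sketched as a breadth-first expansion that hands back each hop's newly found concepts before going further, so clouds can be built from partial results. The function name, the `get_neighbors` callback and the toy graph are assumptions for illustration; in a real setting `get_neighbors` would query SPARQL endpoints and give up after the threshold time.

```python
def traverse_incrementally(start_nodes, get_neighbors, max_hops=2):
    """Yield (hop, newly_found_nodes) for hop 0 up to max_hops."""
    seen = set(start_nodes)
    frontier = set(start_nodes)
    yield 0, set(frontier)
    for hop in range(1, max_hops + 1):
        next_frontier = set()
        for node in frontier:
            for neighbor in get_neighbors(node):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.add(neighbor)
        if not next_frontier:
            break  # nothing new to expand
        yield hop, next_frontier
        frontier = next_frontier


# Hypothetical in-memory graph standing in for Linked Data lookups.
toy_graph = {
    "apple": ["apple_inc", "apple_pie"],
    "apple_inc": ["itunes", "macintosh"],
    "itunes": ["music_store"],
}


def neighbors(node):
    return toy_graph.get(node, [])


for hop, found in traverse_incrementally(["apple"], neighbors, max_hops=2):
    print(hop, sorted(found))
```

Because the function is a generator, a caller can stop consuming it as soon as the user selects a cloud, which avoids paying for deeper hops that are never needed.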
Evaluation
  • Performance of semantic cloud generation
  • Concept reduction ratio
  • User study
Data Preparation
  • CKAN data hub to obtain endpoints: 173 endpoints
  • Jena ARQ for executing SPARQL queries
  • Test keywords: top 30 tags in Flickr
Performance of Incremental Traversal
Threshold (5 seconds) for SPARQL queries
  • 153 endpoints out of 173 endpoints responded within the threshold
  • Coverage: 88.44%
[Figure: response time (msec.) per endpoint, with the 5000 msec threshold marked; axis marks at 20 and 153 endpoints]
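The coverage figure follows directly from the endpoint counts, and the threshold itself amounts to a simple filter over measured response times. The sketch below uses hypothetical endpoint URLs and timings; only the numbers 153, 173 and the 5000 msec threshold come from the slide.

```python
def endpoints_within_threshold(response_times_ms, threshold_ms=5000):
    """Keep endpoints that answered within the threshold (None means no answer)."""
    return [
        endpoint
        for endpoint, elapsed in response_times_ms.items()
        if elapsed is not None and elapsed <= threshold_ms
    ]


def coverage(responding, total):
    """Share of endpoints that responded in time, in percent."""
    return 100.0 * responding / total


# Hypothetical measurements for three made-up endpoints.
measured = {
    "http://example.org/sparql/a": 320,
    "http://example.org/sparql/b": 7400,
    "http://example.org/sparql/c": None,
}
print(endpoints_within_threshold(measured))

# Numbers from the slide: 153 of 173 endpoints within 5 seconds.
print(f"{coverage(153, 173):.2f}%")
```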
Performance of Incremental Traversing
Concept Reduction of Similarity-link Analysis
Keyword     # of Concepts  # of Concepts (sameAs)  # of Concepts (SKOS)  Reduction Ratio (%)
Newyork               895                     296                     0             33.07263
Animal               1524                     484                     1             31.82415
California           2911                     648                   175             28.27207
Wedding               164                      40                     0             24.39024
Music                8264                    1839                    16             22.44676
Sky                  2741                     242                   278             18.97118
Tiger                 459                      42                    41             18.08279
Apple                 877                     772                   729             16.87571
[Figure: reduction ratio (%) per keyword]
Avg. 14.25%
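The listed ratios are consistent with dividing the number of concepts grouped through owl:sameAs and SKOS links by the total number of concepts. That formula is inferred from the numbers rather than stated on the slide, so the sketch below should be read as a check of that reading. The ‘Apple’ row is left out because its extracted counts (877, 772, 729) do not reproduce its listed ratio.

```python
def reduction_ratio(total, same_as, skos):
    """Percentage of concepts grouped via owl:sameAs or SKOS links (inferred formula)."""
    return 100.0 * (same_as + skos) / total


# (total concepts, sameAs concepts, SKOS concepts, listed ratio) from the table.
rows = {
    "Newyork": (895, 296, 0, 33.07263),
    "Animal": (1524, 484, 1, 31.82415),
    "California": (2911, 648, 175, 28.27207),
    "Wedding": (164, 40, 0, 24.39024),
    "Music": (8264, 1839, 16, 22.44676),
    "Sky": (2741, 242, 278, 18.97118),
    "Tiger": (459, 42, 41, 18.08279),
}

for keyword, (total, same_as, skos, listed) in rows.items():
    computed = reduction_ratio(total, same_as, skos)
    print(f"{keyword:<11} computed={computed:.5f} listed={listed:.5f}")
```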
User Study
Top 30 popular tags from the Web (Flickr)
  Apple, Mouse, Tiger, Paris, Bank, Health, Web, Art, Nature, Park,
  Beach, California, Canon, Music, London, Travel, Wedding, Festival, Square, Party,
  Newyork, Water, Sky, Snow, Portrait, Nikon, Cloud, Green, Spring, Animal
Implementation in IPTV domain
[Figure: IPTV annotation flow with numbered steps 1 to 5; labels: Annotation Timing, Start Button, Keyword (User input), Cloud Generation, Selected Linked Data, Semantic Cloud]
Conclusion
Contributions
  • Efficient handling of large-scale Linked Data
  • Generating semantic clouds that enable users to
      Specify semantics by simply using keywords
      Intuitively recognize semantic options to annotate
      Easily resolve semantic ambiguity
Future Works
  • User studies to measure the usability of the proposed approach
      Considering semantically ambiguous situations
  • Empirical studies to decide the following
      Optimal number of spotting points
      Maximum number of hops to traverse
      Threshold value to decide the optimal set of SPARQL endpoints for initial generation
Questions?