Semantic Similarity Computation on the Web of Data
Jin Guang Zheng
Tetherless World Constellation, Computer Science Department, RPI
Outline
• Introduction
 – Research Problem
 – Historical Review
 – Contribution Overview
• Contribution I: Information Entropy and Weighted Similarity Model
 – Semantic Similarity Computation Intuitions
 – IEWS Model
• Contribution II: Semantic Similarity based Entity Matcher
 – Entity Matching Problem
 – System
• Contribution III: Semantic Similarity based Entity Linking Tool
 – Entity Linking Problem
 – System
• Evaluation
• Summary
Background
• Entity
 – A thing on the Web of Data that has a URL as identifier
 – E.g. Organization, Location, Person
 – http://dbpedia.org/resource/George_Washington
• Triple
 – Subject, Predicate (Property), Object
 – :George_Washington dbpediaProp:birthDate 1732
  • subject: :George_Washington
  • predicate: dbpediaProp:birthDate
  • object: 1732
 – You can read it as: George Washington’s birth date is 1732.
 – The object can also be a URL, which is itself described by another set of triples
  • :George_Washington dbpediaProp:birthPlace :Virginia
  • :Virginia dbpediaProp:area 42774 sq mi
 – You can read it as: George Washington’s birth place is Virginia. Virginia has an area of 42774 sq mi.
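The triples above can be made concrete with a minimal Python representation (a sketch only; a real system would use an RDF library such as rdflib):

```python
# A tiny triple store for the background example. Names with a leading ':'
# stand for URLs; plain values are literals.
triples = [
    (":George_Washington", "dbpediaProp:birthDate", "1732"),
    (":George_Washington", "dbpediaProp:birthPlace", ":Virginia"),
    (":Virginia", "dbpediaProp:area", "42774 sq mi"),
]

def describe(entity, triples):
    """All (predicate, object) pairs whose subject is the given entity."""
    return [(p, o) for s, p, o in triples if s == entity]

# An object that is itself a URL can be expanded with another lookup,
# e.g. :Virginia -> its area.
for p, o in describe(":George_Washington", triples):
    if o.startswith(":"):
        nested = describe(o, triples)
```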
Semantic Similarity
• Semantic similarity: how likely two things are to be semantically the same, based on the likeness of their semantic content.
 – Car, Automobile
 – http://dbpedia.org/resource/New_York_City, http://data.nytimes.com/N46020133052049607171 (New York City)
The Problem
• Entities on the Web of Data
 – Entities on the Web of Data come from different sources and have heterogeneous content
 – Some entities are similar to each other, or refer to the same real-world object
 – How can we know if any entities are similar to each other? How can we compute similarity scores among the entities on the Web of Data?
 – This limits possible applications: data integration, data aggregation, data clustering
The Problem
• Entity Matching
 – How to tell if two entities refer to the same real-world object/concept, and create a “same as” type of link automatically
 – Enables data integration and data interoperability
• Entity Recognition
 – How to find the “correct” entity from the Web of Data to annotate entity mentions in free text
 – Lets machines process text in a “smart” way
  • Knowing that “George Washington” refers to president George Washington, not George Washington University
Historical Review
• Semantic Similarity Computation
 – Ontology-based edge-counting method [8][2]
  • Similarity between words is computed by applying a function to the length of the path linking the words in an ontology.
 – Information-content based method [4][5][6][7]
  • Similarity between documents is computed by using a corpus to compute the amount of information they share.
 – Hybrid method [1][3]
  • Similarity between documents/words is computed using a combination of the above approaches.
These approaches compute similarity between documents and words, as opposed to entities.
Historical Review
• Semantic Similarity Based Entity Matching
 – Ontology Matching & Instance Matching
  • ASMOV computes “children”, “parent” and lexical similarity [9]
  • Duan et al. use “Jaccard” and “Edit distance” similarity and perform clustering [10]
  • User-configured information serves as a guide, and similarity is computed with the information provided by the user [11]
  • Rong et al. [12] extract literal information from the entities and represent this information as vectors
This thesis instead computes information entropy and learns the importance of the properties that describe the entities in similarity computation.
Historical Review
• Semantic Similarity Based Entity Recognition
 – Bagga et al. [13] use the Vector Space Model (VSM) to represent the context of the entity mention and use cosine similarity to suggest a possible annotation for the entity mention
 – Minkov et al. [14] and Jiang et al. [15] use graph-based algorithms to further the similarity computation
 – LINDEN [16] leverages information from Wikipedia and the taxonomy from the knowledge base to compute similarity between Wikipedia concepts and entity mentions to suggest annotations
These approaches compute similarity between entity mentions in free text and Wikipedia documents, as opposed to entity mentions in free text and entities on the Web of Data.
Challenges of Computing Similarity on the Web of Data
• Challenge III: Extra information is not necessarily meant to differentiate entities
 – http://dbpedia.org/resource/New_York_City (>100 triples) vs. http://data.nytimes.com/N46020133052049607171 (<20 triples)
• Challenge IV: The amount of Linked Open Data on the Web is already in the order of billions of entities and triples, and is still increasing
Advantages of Computing Similarity on the Web of Data
• Advantages:
 – Entities on the Web of Data are well-structured
 – There are typed links among the entities on the Web of Data
  • rdf:type, foaf:name, etc.
Overview of Contributions
• Contribution I: Information Entropy and Weighted Similarity Model (IEWS)
 – We developed a new semantic similarity computation model which is more suitable for similarity computation among entities on the Web of Data.
• Contribution II: Semantic Similarity based Entity Matcher
 – We developed a new Entity Matcher based on the IEWS Model which outperforms existing systems in terms of precision and recall.
• Contribution III: Semantic Similarity based Entity Linking Tool
 – We developed a new Entity Linking tool based on the IEWS Model.
Contribution I: Information Entropy and Weighted Similarity Model
Assumptions
• Assumption 1: The entities are described using the same language.
• Assumption 2: The descriptions of an entity are consistent.
• Assumption 3: Closed-world assumption: all descriptions of an entity are provided.
• Assumption 4: Entities that are similar to each other must have some literal content that is similar.
Intuitions
• Intuition 1: The similarity between entities A and B is related to their commonality and difference. The more commonality they share, the more similar they are. The more difference they have, the less similar they are.
• Pair 1:
 – :Entity1 rdf:type foaf:Person    :Entity2 rdf:type foaf:Person
 – :Entity1 ex:lives_in “NY”    :Entity2 ex:lives_in “NY”
• Pair 2:
 – :Entity1 rdf:type foaf:Person    :Entity2 rdf:type foaf:Person
 – :Entity1 ex:lives_in “NY”    :Entity2 ex:lives_in “RI”
• Sim(Pair 1) > Sim(Pair 2)
Intuitions
• Intuition 2: The commonality and difference between entities A and B are related to the amount of information that descriptions of A and B deliver. The more amount of information the descriptions deliver, the more it affects the similarity score.
• Given that SSN is a unique identifier and there are many people in the dataset
• Pair 1:
 – :Entity1 ex:SSN “123-45-6789”    :Entity2 ex:SSN “123-45-6789”
• Pair 2:
 – :Entity1 rdf:type foaf:Person    :Entity2 rdf:type foaf:Person
• Sim(Pair 1) > Sim(Pair 2)
Intuitions
• Intuition 3: The commonality and difference between entities A and B are related to the importance of their descriptions. The more important a description is, the more it affects the similarity score.
• Given that people can travel to many places and gender is a disjoint property
• Pair 1:
 – :Entity1 rdf:type foaf:Person    :Entity2 rdf:type foaf:Person
 – :Entity1 ex:travel_to “UK”    :Entity2 ex:travel_to “Canada”
• Pair 2:
 – :Entity1 rdf:type foaf:Person    :Entity2 rdf:type foaf:Person
 – :Entity1 ex:gender “female”    :Entity2 ex:gender “male”
• Sim(Pair 1) > Sim(Pair 2)
Intuitions
• Intuition 4: The similarity between entities A and B is in range of 0 to 1. 1 is reached when A and B are semantically the same. 0 is reached when A and B are semantically different.
Semantic Similarity Between Entities
• Given semantic similarity computation intuitions, how can we compute the similarity among the entities?
– Entities are described by sets of triples
– Similarity between entities can be computed by comparing their triples
Triple-wise Similarity
• Simpv computation process
• Both objects are strings
 – Apply the Jaccard similarity algorithm
• Object 1 is a URL, object 2 is a string
 – Extract the string content of object 1, then apply the lexical similarity algorithm
 – e.g. _:A _:category _:B ;  _:B _:label “country” ;  _:C _:category “country”
• Both objects are URLs
 – Get the semantic content of both objects and then compute their similarity
 – Stop traversing down if (IE > 0.9 || delta(IE) < 0.05)
 – Otherwise, use the last part of the URL and treat it as a string
• Different properties describe the same information
 – Perform property mapping
 – The schema that describes the entities is available (OWL, SKOS)
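For the string/string case, a minimal token-level Jaccard sketch (assuming whitespace tokenization, which the slides do not specify):

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0          # two empty descriptions: treat as identical
    return len(a & b) / len(a | b)

# Comparing two literal objects token-wise:
s1 = "New York City".lower().split()
s2 = "City of New York".lower().split()
print(jaccard(s1, s2))  # 3 shared tokens out of 4 distinct -> 0.75
```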
Information Entropy
• Intuition 2: The commonality and difference between entities A and B are related to the amount of information that descriptions of A and B deliver. The more amount of information the descriptions deliver, the more it affects the similarity score.
• Information Theory:– Information Entropy is a quantified measure of the uncertainty of the information
content -> quantified the expected amount of the information in a description
Information Entropy
Property        Possible values (number of occurrences)                  Expected value of information, IE: H(X)
rdf:type        foaf:Person (5)                                          0
foaf:name       “John Smith” (2), “Anne Jones” (1),
                “Mike Williams” (1), “Mary Miller” (1)                   0.827
ex:SSN          “123-45-0001” (1), “123-45-0002” (1), “123-45-0003” (1),
                “123-45-0004” (1), “123-45-0005” (1)                     1
ex:home_state   “NY” (2), “MA” (2), “RI” (1)                             0.655
ex:gender       “Male” (3), “Female” (2)                                 0.418
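The H(X) column is consistent with Shannon entropy normalized by log n, where n is the number of entities (here log base 5); a sketch under that assumption:

```python
from collections import Counter
from math import log

def normalized_entropy(values):
    """Shannon entropy of a property's value distribution, normalized by
    log(n) (n = number of values) so the result lies in [0, 1]."""
    counts = Counter(values)
    n = len(values)
    if len(counts) <= 1:
        return 0.0  # a single repeated value carries no information
    return -sum((c / n) * log(c / n, n) for c in counts.values())

# Reproducing the table rows (5 entities):
print(normalized_entropy(["foaf:Person"] * 5))                        # 0.0
print(normalized_entropy(["John Smith"] * 2 +
      ["Anne Jones", "Mike Williams", "Mary Miller"]))                # ~0.827
print(normalized_entropy(["123-45-000%d" % i for i in range(1, 6)]))  # ~1.0
print(normalized_entropy(["NY", "NY", "MA", "MA", "RI"]))             # ~0.655
print(normalized_entropy(["Male"] * 3 + ["Female"] * 2))              # ~0.418
```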
Information Entropy
• Joint Entropy
 – Given a set of triples, how much information is given by these triples
• Conditional Entropy
 – Given a triple A, how much additional information does triple B provide
• Chain Rule for Information Entropy
 – Use the chain rule to compute joint entropy
 – Scalability is a problem
• Approximate Information Entropy
 – Pick only the properties that have high information entropy
Importance of Property
• Intuition 3: The commonality and difference between entities A and B are related to the importance of their descriptions. The more important a description is, the more it affects the similarity score.
• Importance is different from information entropy:
 – Property ex:gender is an important description even though its entropy is low compared to other properties.
 – If its values are different, that is a strong indication that two entities are not the same.
• We can use a “weight” to describe the importance of a property
Weight Learning Problem
• Weight Learning Problem (WLP): given a training set T = {(δ1,δ1’,s1), (δ2,δ2’,s2), ..., (δn,δn’,sn)}, where δi and δi’ are two sets of triples describing the entities ei and ei’, and si is the similarity score between ei and ei’
 – Find a vector of weights for all properties that are used to describe all entities to be compared, so that each computed similarity score is as close to si as possible
Binary Classification Problem
• Binary Classification Problem (BCP): given a training set T = {(x1,y1), (x2,y2), ..., (xn,yn)}, where xi ∈ Rd and yi is drawn from the set of classification labels {-1, +1}
 – Find an optimal separating hyperplane W·Ф(x) + b = 0 that separates the xi correctly
Reduce WLP to BCP
• Defining W and y:
 – W: a vector of weights for all properties that are used to describe all entities to be compared
 – y: a set of classes that represents the level of similarity between entities e and e’
  • y = [low (simW <= 0.5), high (0.5 < simW)]
Reduce WLP to BCP
• We need to make sure the size of Ф(x) is the same as the size of W
• Ф(x): a vector of property-based similarities between entities e and e’
 – During the Simpv computation process, we obtain a vector of triple-wise similarities between two entities
 – A property can be used multiple times to describe an entity
  • _:Entity1 rdf:type _:Location ;  _:Entity1 rdf:type _:Place
  • Take the average of the triple-wise similarities to get the property-based similarity
 – For any property that is not used to describe entities e and e’, we assign 0
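Under this reduction, each training pair yields a feature vector Ф(x) of property-based similarities with a label (+1 similar, -1 dissimilar), and the learned weight vector W of a linear classifier supplies the property weights. The thesis uses an SVM; the sketch below substitutes a plain perceptron for brevity, and the properties and data are illustrative, not from the thesis:

```python
# Each training pair: (property-based similarity vector, label).
# Illustrative feature order: [rdf:type, ex:gender, ex:travel_to]
training = [
    ([1.0, 1.0, 0.0], +1),  # same type, same gender, different travel -> similar
    ([1.0, 0.0, 1.0], -1),  # same type, different gender -> dissimilar
    ([1.0, 1.0, 1.0], +1),
    ([0.0, 0.0, 0.0], -1),
]

def learn_weights(data, epochs=100, lr=0.1):
    """Perceptron stand-in for the SVM: returns the weight vector W and bias b."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified -> update toward the label
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

w, b = learn_weights(training)
# ex:gender ends up with a positive weight, since it separates
# the two classes in this toy data.
```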
Information Entropy and Weighted Similarity Model
Contribution II: IEWS Model based Entity Matcher
Entity Matching Problem
• Entity Matching
 – Given two sets of entities E and E’, decide if a “same as” type of link should be created between entity e in E and entity e’ in E’
 – Use semantic similarity as a metric to decide whether a “same as” type of link should be created
Entity Match
• Types of Entity Matching
 – Instance Matching:
  • Focuses on instance-level matching
  • Matches instance data that refer to the same real-world object
 – Ontology Matching:
  • Focuses on schema-level matching
  • Matches concepts and properties that are meant to describe the same idea
Entity Match System Flow
Blocking Algorithm
• Given two large sets of entities, pairwise similarity computation becomes too expensive.
 – Index entities: create blocks of entities that share the same keyword
 – Filter the index by removing a block if lw > lb (its size lw exceeds the limit lb)
Consider the following four entities and their corresponding LDs:
 w = {A,B,C,E,K,L}   x = {C,D,E,L}
 y = {B,K,E,L}   z = {A,B,L}
If lb = 2, then the remaining keywords and their corresponding blocks are:
 A : {w, z}   C : {w, x}   D : {x}   K : {w, y}
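Interpreting lw as a block's size and lb as the size limit, the slide's example can be reproduced with a keyword inverted index:

```python
from collections import defaultdict

def build_blocks(entities, lb):
    """Keyword blocking: index entities by keyword, then drop any block
    larger than lb (such keywords are too common to be discriminative)."""
    index = defaultdict(set)
    for name, keywords in entities.items():
        for kw in keywords:
            index[kw].add(name)
    return {kw: block for kw, block in index.items() if len(block) <= lb}

entities = {
    "w": {"A", "B", "C", "E", "K", "L"},
    "x": {"C", "D", "E", "L"},
    "y": {"B", "K", "E", "L"},
    "z": {"A", "B", "L"},
}
blocks = build_blocks(entities, lb=2)
# B, E and L occur in more than two entities, so their blocks are dropped:
# {'A': {'w','z'}, 'C': {'w','x'}, 'D': {'x'}, 'K': {'w','y'}}
```

Pairwise comparison is then restricted to entities that share a surviving block, which is what yields the computation reductions reported in the evaluation.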
Match Selection
• Based on the matching task, the final match can be selected using different configurations
 – Threshold-based: th > 0.9
 – For each entity, select the top matched entities
Contribution III: IEWS Model based Entity Linking Tool
Entity Recognition System Flow
[Diagram: entity mentions m1, m2, ..., mn and candidate entities from the knowledge base are fed into the IEWS Model, which produces a similarity matrix used to select the final match.]
Structure Representation of Entity Mentions
• Get a structured representation of entity mentions
 – “George Washington is the first president of the United States”
Entity1 rdfs:label “George Washington”
Entity1 ?p2 Entity2
Entity1 ?p3 Entity3
Entity2 rdfs:label “President”
Entity3 rdfs:label “The United States”
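One way the slide's representation could be assembled, assuming mention detection has already extracted the surface forms (the placeholder properties ?p2, ?p3 stand for as-yet-unknown relations, as on the slide):

```python
def mention_graph(mentions):
    """Build the structured representation from the slide: the first mention
    becomes the focus entity, linked to each co-occurring mention by a
    placeholder property, with rdfs:label triples for the surface forms."""
    focus, *context = mentions
    triples = [("Entity1", "rdfs:label", focus)]
    for i, m in enumerate(context, start=2):
        triples.append(("Entity1", f"?p{i}", f"Entity{i}"))
        triples.append((f"Entity{i}", "rdfs:label", m))
    return triples

triples = mention_graph(["George Washington", "President", "The United States"])
```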
The Knowledge Base
• Entity Base
 – We use entities from the Billion Triple Challenge 2009 to construct our entity base
 – Triples in BTC 2009 describe these entities
• Surface Form Base
 – Given a surface form of an entity, we need to know which entities are possible candidate entities, e.g. “Washington”
 – We collect a set of surface form data from BTC: rdfs:label, foaf:name, dbpedia:redirects, etc.
Candidate Entities
• Given an entity mention with surface form sf, which entities from the WOD are candidates?
 – We can't select all possible entities, due to the large entity base
 – Pre-rank entities using Link Frequency Analysis (similar to PageRank) + TF-IDF computation
 – The top 10 candidates for each entity mention are selected for further analysis
Entities with stronger relations are selected
• Final Similarity Computation
 – The final similarity scores between the entity representations constructed from free text and the candidate entities are computed using the IEWS Model
 – The information entropy of properties is computed using the BTC 2009 dataset
 – No weight learning is performed in this task
 – Only direct descriptions of the candidate entity are analyzed
Evaluation
• Overview
 – Human study
 – Applications (Entity Matcher, Entity Linking Tool) of the IEWS Model
 – Evaluated Weight Learning and Information Entropy (Intuitions 2 and 3)
 – Blocking algorithm
 – Study of the IE-based stop-traverse algorithm
• System
 – PC with 8 Intel Xeon processors of speed 2.40 GHz and 32 GB memory. Each processor has a 12M cache.
Evaluation
• Human Survey
 – Purpose: study how close the scores computed by the IEWS Model are to the scores given by humans
 – Metric: a high correlation between computed scores and human-evaluated scores indicates that the similarity scores computed by the model are accurate
Evaluation
• Evaluation Dataset Design
 – 1. Tests all semantic similarity computation intuitions, mainly focusing on intuitions II and III
  • SSN is stated to be unique (tests intuitions II and III)
  • gender is a disjoint property (tests intuition II)
  • All data are consistent
 – 2. Covers various challenges of real-world datasets on the Web of Data
  • Different properties describe the same information (Challenge I)
  • The same information is structured differently (Challenge II)
  • Extra information is not meant to differentiate entities (Challenge III)
Evaluation
Descriptions              Test Intuition II    Test Intuition III   Test Both Intuitions
# pairs in sample data    29 pairs             30 pairs             16 pairs

Descriptions              Different Property   Different Structure  Extra Triples
# pairs in sample data    31 pairs             23 pairs             39 pairs
Evaluation
• Conference Ontology Dataset
 – 99 training cases
 – 21 evaluation cases
• Systems
 – Compared with 20 systems
• Result
 – SEM+: 0.82, the highest among all systems
Comparing F-measure of the systems
Evaluation
• Instance Matching Dataset
 – 2839 possible matches
 – Training dataset created manually by randomly pairing unmatched entities (100 pairs) and randomly selecting 100 matched entity pairs
• Systems
 – Compared with 17 systems
• Result
 – SEM+: 0.785

Evaluation
• Instance Matching Dataset
 – Sandbox case as training set
 – 120 evaluation cases
• Systems
 – Compared with 4 systems
• Result
 – SEM+: 0.94
Comparing F-measure of the systems
Evaluation
              AgreementMaker       SERIMI               Zhishi.Links         SEM+
              Pre   F1    Rec      Pre   F1    Rec      Pre   F1    Rec      Pre   F1    Rec
People        0.98  0.88  0.80     0.94  0.96  0.94     0.97  0.97  0.97     0.99  0.99  0.99
Organization  0.84  0.74  0.67     0.89  0.92  0.87     0.90  0.91  0.93     0.95  0.95  0.95
Location      0.79  0.69  0.61     0.69  0.83  0.67     0.92  0.92  0.91     0.91  0.91  0.91

Dataset: OAEI NYTimes to DBpedia instance matching dataset.
NYTimes: 9943 entities, 335198 triples
DBpedia: 8862 entities, 4315062 triples
Evaluation
Purpose: study the impact of Weight Learning and Information Entropy
Dataset: Conference Ontology
Enabled Component       Precision   Recall   F1
Triple-Wise             0.71        0.71     0.71
Triple-Wise + WL        0.79        0.79     0.79
Triple-Wise + IE        0.76        0.76     0.76
Triple-Wise + WL + IE   0.82        0.82     0.82
Purpose: study the effect of blocking algorithm in IEWS Model
Dataset: OAEI NYTimes to DBpedia instance matching dataset.
NYTimes: 9943 entities, 335198 triples
DBpedia: 8862 entities, 4315062 triples
Metric: number of missing pairs in the same block; number of computations reduced.
Evaluation

Number of computations for different lb:
lb     Peop.      Org.      Loc.      Comb.
2      5257       2571      2613      8779
10     39998      26747     21310     75388
50     259776     177396    203442    567443
100    591918     362984    480197    1197011
full   24780483   5981460   3686400   88114866

Recall for different lb, with the number of correct pairs found in the same block:
lb     Peop.          Org.           Loc.           Comb.
2      0.65 (3243)    0.61 (1195)    0.64 (1241)    0.57 (4999)
10     0.92 (4596)    0.895 (1745)   0.9 (1725)     0.87 (7687)
50     0.995 (4951)   0.97 (1892)    0.954 (1827)   0.97 (8624)
100    0.996 (4958)   0.97 (1894)    0.96 (1838)    0.98 (8682)
full   1.0 (4977)     1.0 (1949)     1.0 (1916)     1.0 (8842)
Evaluation
Purpose: Study IE based stop traverse algorithm
Dataset: Real world dataset (NYTimes, DBpedia)
Evaluation
• Evaluation Environment
 – Dataset:
  • Cucerzan's dataset (a list of entity mentions from news articles and Wikipedia documents, with their correct links)
 – Metric:
  • Precision
 – Compared Methods:
  • Modified version of Cucerzan's algorithm (adapted for the WOD scenario)
  • Spotlight
  • Baseline method: VSM + frequency-based analysis
Evaluation
[Chart: precision of the compared methods on the News dataset and the Wikipedia dataset. VSM + FA: Vector Space Model with Link Frequency Analysis]
Contribution I Summary
• Contribution I: Information Entropy and Weighted Similarity Model
 – Human survey studies showed that the similarity scores computed by the IEWS Model are close to human intuitions
 – We showed that information entropy and weights are important in similarity computation
 – We showed that the IEWS Model computes accurate similarity scores via entity matching and entity linking applications
 – We showed that the IEWS Model can be applied to solve the entity linking and entity matching problems
Contribution II Summary
• Contribution II: IEWS Model Based Entity Matcher
 – SEM+ outperforms all other systems in terms of F-measure
 – Among all the entity matchers, SEM+ and LogMap are the only two systems we know of that achieve high F-measures in both instance matching and ontology matching
 – We showed that our blocking algorithm can improve similarity computation speed while maintaining high recall
Contribution III Summary
• Contribution III: IEWS Model based Entity Linking Tool
 – Our entity linking tool is the first such tool that performs on a billion-triple dataset
 – We showed that our entity linking tool has high accuracy
Reference
• [1] Pirro, G., Euzenat, J.: A Feature and Information Theoretic Framework for Semantic Similarity and Relatedness. International Semantic Web Conference (ISWC), 2010
• [2] Li, Y., McLean, D., Bandar, Z., O'Shea, J., Crockett, K.: Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE Transactions on Knowledge and Data Engineering, 2006
• [3] Li, Y., Bandar, Z.A., McLean, D.: An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources. IEEE Transactions on Knowledge and Data Engineering, 2003
• [4] Resnik, P.: Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 1999
• [5] Foltz, P., Kintsch, W., Landauer, T.: The Measurement of Textual Coherence with Latent Semantic Analysis. Discourse Processes, 1998
• [6] Landauer, T., Foltz, P., Laham, D.: An Introduction to Latent Semantic Analysis. Discourse Processes, 1998
• [7] Landauer, T., Laham, D., Rehder, B., Schreiner, M.: How Well Can Passage Meaning Be Derived without Using Word Order? A Comparison of Latent Semantic Analysis and Humans. 19th Annual Meeting of the Cognitive Science Society, 1997
• [8] Rada, R., Mili, H., Bicknell, E., Blettner, M.: Development and Application of a Metric on Semantic Nets. IEEE Transactions on Systems, Man and Cybernetics, 1989
• [9] Jean-Mary, Y.R., Shironoshita, E.P., Kabuka, M.R.: Ontology Matching with Semantic Verification. Journal of Web Semantics, 2009
• [10] Duan, S., Fokoue, A., Srinivas, K., Byrne, B.: A Clustering-based Approach to Ontology Alignment. ISWC, 2011
• [11] Volz, J., Bizer, C., Gaedke, M., Kobilarov, G.: Discovering and Maintaining Links on the Web of Data. In Proceedings of ISWC, 2009
• [12] Rong, S., Niu, X., Xiang, E.W., Wang, H., Yang, Q., Yu, Y.: A Machine Learning Approach for Instance Matching Based on Similarity Metrics. In Proceedings of ISWC, 2012
• [13] Bagga, A., Baldwin, B.: Entity-based Cross-document Coreferencing Using the Vector Space Model. In Proceedings of COLING, pages 79-85, 1998
• [14] Minkov, E., Cohen, W.W., Ng, A.Y.: Contextual Search and Name Disambiguation in Email Using Graphs. In Proceedings of SIGIR, pages 27-34, 2006
• [15] Jiang, L., Wang, J., An, N., Wang, S., Zhan, J., Li, L.: GRAPE: A Graph-based Framework for Disambiguating People Appearances in Web Search. In Proceedings of ICDM, pages 199-208, 2009
• [16] Shen, W., Wang, J., Luo, P., Wang, M.: LINDEN: Linking Named Entities with Knowledge Base via Semantic Knowledge. In Proceedings of WWW, 2012