Semantic Similarity Computation on the Web of Data
Jin Guang Zheng
Tetherless World Constellation, Computer Science Department, RPI
Outline
• Introduction
 – Research Problem
 – Historical Review
 – Contribution Overview
• Contribution I: Information Entropy and Weighted Similarity Model
 – Semantic Similarity Computation Intuitions
 – IEWS Model
• Contribution II: Semantic Similarity based Entity Matcher
 – Entity Matching Problem
 – System
• Contribution III: Semantic Similarity based Entity Linking Tool
 – Entity Linking Problem
 – System
• Evaluation
• Summary
Background
• Entity
 – A thing on the Web of Data that has a URL as identifier
 – E.g. Organization, Location, Person
 – http://dbpedia.org/resource/George_Washington
• Triple
 – Subject, Predicate (Property), Object
 – :George_Washington dbpediaProp:birthDate 1732
  • subject: :George_Washington
  • predicate: dbpediaProp:birthDate
  • object: 1732
 – You can read it as: George Washington’s birth date is 1732.
 – The object can also be a URL, which is itself described by another set of triples
  • :George_Washington dbpediaProp:birthPlace :Virginia
  • :Virginia dbpediaProp:area 42774 sq mi
 – You can read it as: George Washington’s birth place is Virginia. Virginia has an area of 42774 sq mi.
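The triples above can be made concrete with a minimal Python representation (a sketch only; a real system would use an RDF library such as rdflib):

```python
# A tiny triple store for the background example. Names with a leading ':'
# stand for URLs; plain values are literals.
triples = [
    (":George_Washington", "dbpediaProp:birthDate", "1732"),
    (":George_Washington", "dbpediaProp:birthPlace", ":Virginia"),
    (":Virginia", "dbpediaProp:area", "42774 sq mi"),
]

def describe(entity, triples):
    """All (predicate, object) pairs whose subject is the given entity."""
    return [(p, o) for s, p, o in triples if s == entity]

# An object that is itself a URL can be expanded with another lookup,
# e.g. :Virginia -> its area.
for p, o in describe(":George_Washington", triples):
    if o.startswith(":"):
        nested = describe(o, triples)
```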
Semantic Similarity
• Semantic similarity: how likely two things are to be semantically the same, based on the likeness of their semantic content.
 – Car, Automobile
 – http://dbpedia.org/resource/New_York_City, http://data.nytimes.com/N46020133052049607171 (New York City)
The Problem
• Entities on the Web of Data
 – Entities on the Web of Data come from different sources and have heterogeneous content
 – Some entities are similar to each other, or refer to the same real-world object
 – How can we know if any entities are similar to each other? How can we compute similarity scores among the entities on the Web of Data?
 – This limits possible applications: data integration, data aggregation, data clustering
The Problem
• Entity Matching
 – How to tell if two entities refer to the same real-world object/concept, and create a “same as” type of link automatically
 – Enables data integration and data interoperability
• Entity Recognition
 – How to find the “correct” entity from the Web of Data to annotate entity mentions in free text
 – Lets machines process text in a “smart” way
  • Knowing that “George Washington” refers to president George Washington, not George Washington University
Historical Review
• Semantic Similarity Computation
 – Ontology-based edge-counting method [8][2]
  • Similarity between words is computed by applying a function to the length of the path linking the words in an ontology.
 – Information-content based method [4][5][6][7]
  • Similarity between documents is computed by using a corpus to compute the amount of information they share.
 – Hybrid method [1][3]
  • Similarity between documents/words is computed using a combination of the above approaches.
These approaches compute similarity between documents and words, as opposed to entities.
Historical Review
• Semantic Similarity Based Entity Matching
 – Ontology Matching & Instance Matching
  • ASMOV computes “children”, “parent” and lexical similarity [9]
  • Duan et al. use “Jaccard” and “Edit distance” similarity and perform clustering [10]
  • User-configured information serves as a guide, and similarity is computed with the information provided by the user [11]
  • Rong et al. [12] extract literal information from the entities and represent this information as vectors
This thesis instead computes information entropy and learns the importance of the properties that describe the entities in similarity computation.
Historical Review
• Semantic Similarity Based Entity Recognition
 – Bagga et al. [13] use the Vector Space Model (VSM) to represent the context of the entity mention and use cosine similarity to suggest a possible annotation for the entity mention
 – Minkov et al. [14] and Jiang et al. [15] use graph-based algorithms to further the similarity computation
 – LINDEN [16] leverages information from Wikipedia and the taxonomy from the knowledge base to compute similarity between Wikipedia concepts and entity mentions to suggest annotations
These approaches compute similarity between entity mentions in free text and Wikipedia documents, as opposed to entity mentions in free text and entities on the Web of Data.
Challenges of Computing Similarity on the Web of Data
• Challenge III: Extra information is not necessarily meant to differentiate entities
 – http://dbpedia.org/resource/New_York_City (>100 triples) vs. http://data.nytimes.com/N46020133052049607171 (<20 triples)
• Challenge IV: The amount of Linked Open Data on the Web is already in the order of billions of entities and triples, and is still increasing
Advantages of Computing Similarity on the Web of Data
• Advantages:
 – Entities on the Web of Data are well-structured
 – There are typed links among the entities on the Web of Data
  • rdf:type, foaf:name, etc.
Overview of Contributions
• Contribution I: Information Entropy and Weighted Similarity Model (IEWS)
 – We developed a new semantic similarity computation model which is more suitable for similarity computation among entities on the Web of Data.
• Contribution II: Semantic Similarity based Entity Matcher
 – We developed a new Entity Matcher based on the IEWS Model which outperforms existing systems in terms of precision and recall.
• Contribution III: Semantic Similarity based Entity Linking Tool
 – We developed a new Entity Linking tool based on the IEWS Model.
Contribution I: Information Entropy and Weighted Similarity Model
Assumptions
• Assumption 1: The entities are described using the same language.
• Assumption 2: The descriptions of an entity are consistent.
• Assumption 3: Closed-world assumption: all descriptions of an entity are provided.
• Assumption 4: Entities that are similar to each other must have some literal content that is similar.
Intuitions
• Intuition 1: The similarity between entities A and B is related to their commonality and difference. The more commonality they share, the more similar they are. The more difference they have, the less similar they are.
• Pair 1:
 – :Entity1 rdf:type foaf:Person    :Entity2 rdf:type foaf:Person
 – :Entity1 ex:lives_in “NY”    :Entity2 ex:lives_in “NY”
• Pair 2:
 – :Entity1 rdf:type foaf:Person    :Entity2 rdf:type foaf:Person
 – :Entity1 ex:lives_in “NY”    :Entity2 ex:lives_in “RI”
• Sim(Pair 1) > Sim(Pair 2)
Intuitions
• Intuition 2: The commonality and difference between entities A and B are related to the amount of information that descriptions of A and B deliver. The more amount of information the descriptions deliver, the more it affects the similarity score.
• Given that SSN is a unique identifier and there are many people in the dataset
• Pair 1:
 – :Entity1 ex:SSN “123-45-6789”    :Entity2 ex:SSN “123-45-6789”
• Pair 2:
 – :Entity1 rdf:type foaf:Person    :Entity2 rdf:type foaf:Person
• Sim(Pair 1) > Sim(Pair 2)
Intuitions
• Intuition 3: The commonality and difference between entities A and B are related to the importance of their descriptions. The more important a description is, the more it affects the similarity score.
• Given that people can travel to many places and gender is a disjoint property
• Pair 1:
 – :Entity1 rdf:type foaf:Person    :Entity2 rdf:type foaf:Person
 – :Entity1 ex:travel_to “UK”    :Entity2 ex:travel_to “Canada”
• Pair 2:
 – :Entity1 rdf:type foaf:Person    :Entity2 rdf:type foaf:Person
 – :Entity1 ex:gender “female”    :Entity2 ex:gender “male”
• Sim(Pair 1) > Sim(Pair 2)
Intuitions
• Intuition 4: The similarity between entities A and B is in range of 0 to 1. 1 is reached when A and B are semantically the same. 0 is reached when A and B are semantically different.
Semantic Similarity Between Entities
• Given semantic similarity computation intuitions, how can we compute the similarity among the entities?
– Entities are described by sets of triples
– Similarity between entities can be computed by comparing their triples
Triple-wise Similarity
• Simpv computation process
• Both objects are strings
 – Apply the Jaccard similarity algorithm
• Object 1 is a URL, object 2 is a string
 – Extract the string content of object 1, then apply the lexical similarity algorithm
 – e.g. _:A _:category _:B ;  _:B _:label “country” ;  _:C _:category “country”
• Both objects are URLs
 – Get the semantic content of both objects and then compute their similarity
 – Stop traversing down if (IE > 0.9 || delta(IE) < 0.05)
 – Otherwise, use the last part of the URL and treat it as a string
• Different properties describe the same information
 – Perform property mapping
 – The schema that describes the entities is available (OWL, SKOS)
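For the string/string case, a minimal token-level Jaccard sketch (assuming whitespace tokenization, which the slides do not specify):

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0          # two empty descriptions: treat as identical
    return len(a & b) / len(a | b)

# Comparing two literal objects token-wise:
s1 = "New York City".lower().split()
s2 = "City of New York".lower().split()
print(jaccard(s1, s2))  # 3 shared tokens out of 4 distinct -> 0.75
```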
Information Entropy
• Intuition 2: The commonality and difference between entities A and B are related to the amount of information that descriptions of A and B deliver. The more amount of information the descriptions deliver, the more it affects the similarity score.
• Information Theory:– Information Entropy is a quantified measure of the uncertainty of the information
content -> quantified the expected amount of the information in a description
Information Entropy
Property        Possible values (number of occurrences)                  Expected value of information, IE: H(X)
rdf:type        foaf:Person (5)                                          0
foaf:name       “John Smith” (2), “Anne Jones” (1),
                “Mike Williams” (1), “Mary Miller” (1)                   0.827
ex:SSN          “123-45-0001” (1), “123-45-0002” (1), “123-45-0003” (1),
                “123-45-0004” (1), “123-45-0005” (1)                     1
ex:home_state   “NY” (2), “MA” (2), “RI” (1)                             0.655
ex:gender       “Male” (3), “Female” (2)                                 0.418
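The H(X) column is consistent with Shannon entropy normalized by log n, where n is the number of entities (here log base 5); a sketch under that assumption:

```python
from collections import Counter
from math import log

def normalized_entropy(values):
    """Shannon entropy of a property's value distribution, normalized by
    log(n) (n = number of values) so the result lies in [0, 1]."""
    counts = Counter(values)
    n = len(values)
    if len(counts) <= 1:
        return 0.0  # a single repeated value carries no information
    return -sum((c / n) * log(c / n, n) for c in counts.values())

# Reproducing the table rows (5 entities):
print(normalized_entropy(["foaf:Person"] * 5))                        # 0.0
print(normalized_entropy(["John Smith"] * 2 +
      ["Anne Jones", "Mike Williams", "Mary Miller"]))                # ~0.827
print(normalized_entropy(["123-45-000%d" % i for i in range(1, 6)]))  # ~1.0
print(normalized_entropy(["NY", "NY", "MA", "MA", "RI"]))             # ~0.655
print(normalized_entropy(["Male"] * 3 + ["Female"] * 2))              # ~0.418
```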
Information Entropy
• Joint Entropy
 – Given a set of triples, how much information is given by these triples
• Conditional Entropy
 – Given a triple A, how much additional information does triple B provide
• Chain Rule for Information Entropy
 – Use the chain rule to compute joint entropy
 – Scalability is a problem
• Approximate Information Entropy
 – Pick only the properties that have high information entropy
Importance of Property
• Intuition 3: The commonality and difference between entities A and B are related to the importance of their descriptions. The more important a description is, the more it affects the similarity score.
• Importance is different from information entropy:
 – Property ex:gender is an important description even though its entropy is low compared to other properties.
 – If its values are different, that is a strong indication that two entities are not the same.
• We can use a “weight” to describe the importance of a property
Weight Learning Problem
• Weight Learning Problem (WLP): given a training set T = {(δ1,δ1’,s1), (δ2,δ2’,s2), ..., (δn,δn’,sn)}, where δi and δi’ are two sets of triples describing the entities ei and ei’, and si is the similarity score between ei and ei’
 – Find a vector of weights for all properties that are used to describe all entities to be compared, so that each computed similarity score is as close to si as possible
Binary Classification Problem
• Binary Classification Problem (BCP): given a training set T = {(x1,y1), (x2,y2), ..., (xn,yn)}, where xi ∈ Rd and yi is drawn from the set of classification labels {-1, +1}
 – Find an optimal separating hyperplane W·Ф(x) + b = 0 that separates the xi correctly
Reduce WLP to BCP
• Defining W and y:
 – W: a vector of weights for all properties that are used to describe all entities to be compared
 – y: a set of classes that represents the level of similarity between entities e and e’
  • y = [low (simW <= 0.5), high (0.5 < simW)]
Reduce WLP to BCP
• We need to make sure the size of Ф(x) is the same as the size of W
• Ф(x): a vector of property-based similarities between entities e and e’
 – During the Simpv computation process, we obtain a vector of triple-wise similarities between two entities
 – A property can be used multiple times to describe an entity
  • _:Entity1 rdf:type _:Location ;  _:Entity1 rdf:type _:Place
  • Take the average of the triple-wise similarities to get the property-based similarity
 – For any property that is not used to describe entities e and e’, we assign 0
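Under this reduction, each training pair yields a feature vector Ф(x) of property-based similarities with a label (+1 similar, -1 dissimilar), and the learned weight vector W of a linear classifier supplies the property weights. The thesis uses an SVM; the sketch below substitutes a plain perceptron for brevity, and the properties and data are illustrative, not from the thesis:

```python
# Each training pair: (property-based similarity vector, label).
# Illustrative feature order: [rdf:type, ex:gender, ex:travel_to]
training = [
    ([1.0, 1.0, 0.0], +1),  # same type, same gender, different travel -> similar
    ([1.0, 0.0, 1.0], -1),  # same type, different gender -> dissimilar
    ([1.0, 1.0, 1.0], +1),
    ([0.0, 0.0, 0.0], -1),
]

def learn_weights(data, epochs=100, lr=0.1):
    """Perceptron stand-in for the SVM: returns the weight vector W and bias b."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified -> update toward the label
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

w, b = learn_weights(training)
# ex:gender ends up with a positive weight, since it separates
# the two classes in this toy data.
```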
Information Entropy and Weighted Similarity Model
Contribution II: IEWS Model based Entity Matcher
Entity Matching Problem
• Entity Matching
 – Given two sets of entities E and E’, decide if a “same as” type of link should be created between entity e in E and entity e’ in E’
 – Use semantic similarity as a metric to decide whether a “same as” type of link should be created
Entity Match
• Types of Entity Matching
 – Instance Matching:
  • Focuses on instance-level matching
  • Matches instance data that refer to the same real-world object
 – Ontology Matching:
  • Focuses on schema-level matching
  • Matches concepts and properties that are meant to describe the same idea
Entity Match System Flow
Blocking Algorithm
• Given two large sets of entities, pairwise similarity computation becomes too expensive.
 – Index entities: create blocks of entities that share the same keyword
 – Filter the index by removing a block if lw > lb (its size lw exceeds the limit lb)
Consider the following four entities and their corresponding LDs:
 w = {A,B,C,E,K,L}   x = {C,D,E,L}
 y = {B,K,E,L}   z = {A,B,L}
If lb = 2, then the remaining keywords and their corresponding blocks are:
 A : {w, z}   C : {w, x}   D : {x}   K : {w, y}
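Interpreting lw as a block's size and lb as the size limit, the slide's example can be reproduced with a keyword inverted index:

```python
from collections import defaultdict

def build_blocks(entities, lb):
    """Keyword blocking: index entities by keyword, then drop any block
    larger than lb (such keywords are too common to be discriminative)."""
    index = defaultdict(set)
    for name, keywords in entities.items():
        for kw in keywords:
            index[kw].add(name)
    return {kw: block for kw, block in index.items() if len(block) <= lb}

entities = {
    "w": {"A", "B", "C", "E", "K", "L"},
    "x": {"C", "D", "E", "L"},
    "y": {"B", "K", "E", "L"},
    "z": {"A", "B", "L"},
}
blocks = build_blocks(entities, lb=2)
# B, E and L occur in more than two entities, so their blocks are dropped:
# {'A': {'w','z'}, 'C': {'w','x'}, 'D': {'x'}, 'K': {'w','y'}}
```

Pairwise comparison is then restricted to entities that share a surviving block, which is what yields the computation reductions reported in the evaluation.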
Match Selection
• Based on the matching task, the final match can be selected using different configurations
 – Threshold-based: th > 0.9
 – For each entity, select the top matched entities
Contribution III: IEWS Model based Entity Linking Tool
Entity Recognition System Flow
[Diagram: entity mentions m1, m2, ..., mn and candidate entities from the knowledge base are fed into the IEWS Model, which produces a similarity matrix used to select the final match.]
Structure Representation of Entity Mentions
• Get a structured representation of entity mentions
 – “George Washington is the first president of the United States”
Entity1 rdfs:label “George Washington”
Entity1 ?p2 Entity2
Entity1 ?p3 Entity3
Entity2 rdfs:label “President”
Entity3 rdfs:label “The United States”
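One way the slide's representation could be assembled, assuming mention detection has already extracted the surface forms (the placeholder properties ?p2, ?p3 stand for as-yet-unknown relations, as on the slide):

```python
def mention_graph(mentions):
    """Build the structured representation from the slide: the first mention
    becomes the focus entity, linked to each co-occurring mention by a
    placeholder property, with rdfs:label triples for the surface forms."""
    focus, *context = mentions
    triples = [("Entity1", "rdfs:label", focus)]
    for i, m in enumerate(context, start=2):
        triples.append(("Entity1", f"?p{i}", f"Entity{i}"))
        triples.append((f"Entity{i}", "rdfs:label", m))
    return triples

triples = mention_graph(["George Washington", "President", "The United States"])
```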
The Knowledge Base
• Entity Base
 – We use entities from the Billion Triple Challenge 2009 to construct our entity base
 – Triples in BTC 2009 describe these entities
• Surface Form Base
 – Given a surface form of an entity, we need to know which entities are possible candidate entities, e.g. “Washington”
 – We collect a set of surface form data from BTC: rdfs:label, foaf:name, dbpedia:redirects, etc.
Candidate Entities
• Given an entity mention with surface form sf, which entities from the WOD are candidates?
 – We can't select all possible entities, due to the large entity base
 – Pre-rank entities using Link Frequency Analysis (similar to PageRank) + TF-IDF computation
 – The top 10 candidates for each entity mention are selected for further analysis
Entities with stronger relations are selected
• Final Similarity Computation
 – The final similarity scores between the entity representations constructed from free text and the candidate entities are computed using the IEWS Model
 – The information entropy of properties is computed using the BTC 2009 dataset
 – No weight learning is performed in this task
 – Only direct descriptions of the candidate entity are analyzed
Evaluation
• Overview
 – Human study
 – Applications (Entity Matcher, Entity Linking Tool) of the IEWS Model
 – Evaluated Weight Learning and Information Entropy (Intuitions 2 and 3)
 – Blocking algorithm
 – Study of the IE-based stop-traverse algorithm
• System
 – PC with 8 Intel Xeon processors of speed 2.40 GHz and 32 GB memory. Each processor has a 12M cache.
Evaluation
• Human Survey
 – Purpose: study how close the scores computed by the IEWS Model are to the scores given by humans
 – Metric: a high correlation between computed scores and human-evaluated scores indicates that the similarity scores computed by the model are accurate
Evaluation
• Evaluation Dataset Design
 – 1. Tests all semantic similarity computation intuitions, mainly focusing on intuitions II and III
  • SSN is stated to be unique (tests intuitions II and III)
  • gender is a disjoint property (tests intuition II)
  • All data are consistent
 – 2. Covers various challenges of real-world datasets on the Web of Data
  • Different properties describe the same information (Challenge I)
  • The same information is structured differently (Challenge II)
  • Extra information is not meant to differentiate entities (Challenge III)
Evaluation
Descriptions              Test Intuition II    Test Intuition III   Test Both Intuitions
# pairs in sample data    29 pairs             30 pairs             16 pairs

Descriptions              Different Property   Different Structure  Extra Triples
# pairs in sample data    31 pairs             23 pairs             39 pairs
Evaluation
• Conference Ontology Dataset
 – 99 training cases
 – 21 evaluation cases
• Systems
 – Compared with 20 systems
• Result
 – SEM+: 0.82, the highest among all systems
Comparing F-measure of the systems
Evaluation
• Instance Matching Dataset
 – 2839 possible matches
 – Training dataset created manually by randomly pairing unmatched entities (100 pairs) and randomly selecting 100 matched entity pairs
• Systems
 – Compared with 17 systems
• Result
 – SEM+: 0.785

Evaluation
• Instance Matching Dataset
 – Sandbox case as training set
 – 120 evaluation cases
• Systems
 – Compared with 4 systems
• Result
 – SEM+: 0.94
Comparing F-measure of the systems
Evaluation
              AgreementMaker       SERIMI               Zhishi.Links         SEM+
              Pre   F1    Rec      Pre   F1    Rec      Pre   F1    Rec      Pre   F1    Rec
People        0.98  0.88  0.80     0.94  0.96  0.94     0.97  0.97  0.97     0.99  0.99  0.99
Organization  0.84  0.74  0.67     0.89  0.92  0.87     0.90  0.91  0.93     0.95  0.95  0.95
Location      0.79  0.69  0.61     0.69  0.83  0.67     0.92  0.92  0.91     0.91  0.91  0.91

Dataset: OAEI NYTimes to DBpedia instance matching dataset.
NYTimes: 9943 entities, 335198 triples
DBpedia: 8862 entities, 4315062 triples
Evaluation
Purpose: study the impact of Weight Learning and Information Entropy
Dataset: Conference Ontology
Enabled Component       Precision   Recall   F1
Triple-Wise             0.71        0.71     0.71
Triple-Wise + WL        0.79        0.79     0.79
Triple-Wise + IE        0.76        0.76     0.76
Triple-Wise + WL + IE   0.82        0.82     0.82
Purpose: study the effect of blocking algorithm in IEWS Model
Dataset: OAEI NYTimes to DBpedia instance matching dataset.
NYTimes: 9943 entities, 335198 triples
DBpedia: 8862 entities, 4315062 triples
Metric: number of missing pairs in the same block; number of computations reduced.
Evaluation

Number of computations for different lb:
lb     Peop.      Org.      Loc.      Comb.
2      5257       2571      2613      8779
10     39998      26747     21310     75388
50     259776     177396    203442    567443
100    591918     362984    480197    1197011
full   24780483   5981460   3686400   88114866

Recall for different lb, with the number of correct pairs found in the same block:
lb     Peop.          Org.           Loc.           Comb.
2      0.65 (3243)    0.61 (1195)    0.64 (1241)    0.57 (4999)
10     0.92 (4596)    0.895 (1745)   0.9 (1725)     0.87 (7687)
50     0.995 (4951)   0.97 (1892)    0.954 (1827)   0.97 (8624)
100    0.996 (4958)   0.97 (1894)    0.96 (1838)    0.98 (8682)
full   1.0 (4977)     1.0 (1949)     1.0 (1916)     1.0 (8842)
Evaluation
Purpose: Study IE based stop traverse algorithm
Dataset: Real world dataset (NYTimes, DBpedia)
Evaluation
• Evaluation Environment
 – Dataset:
  • Cucerzan's dataset (a list of entity mentions from news articles and Wikipedia documents, with their correct links)
 – Metric:
  • Precision
 – Compared Methods:
  • Modified version of Cucerzan's algorithm (adapted for the WOD scenario)
  • Spotlight
  • Baseline method: VSM + frequency-based analysis
Evaluation
[Chart: precision of the compared methods on the News dataset and the Wikipedia dataset. VSM + FA: Vector Space Model with Link Frequency Analysis]
Contribution I Summary
• Contribution I: Information Entropy and Weighted Similarity Model
 – Human survey studies showed that the similarity scores computed by the IEWS Model are close to human intuitions
 – We showed that information entropy and weights are important in similarity computation
 – We showed that the IEWS Model computes accurate similarity scores via entity matching and entity linking applications
 – We showed that the IEWS Model can be applied to solve the entity linking and entity matching problems
Contribution II Summary
• Contribution II: IEWS Model Based Entity Matcher
 – SEM+ outperforms all other systems in terms of F-measure
 – Among all the entity matchers, SEM+ and LogMap are the only two systems we know of that achieve high F-measures in both instance matching and ontology matching
 – We showed that our blocking algorithm can improve similarity computation speed while maintaining high recall
Contribution III Summary
• Contribution III: IEWS Model based Entity Linking Tool
 – Our entity linking tool is the first such tool that performs on a billion-triple dataset
 – We showed that our entity linking tool has high accuracy
Reference
• [1] Pirro, G., Euzenat, J.: A Feature and Information Theoretic Framework for Semantic Similarity and Relatedness. International Semantic Web Conference (ISWC), 2010
• [2] Li, Y., McLean, D., Bandar, Z., O'Shea, J., Crockett, K.: Sentence Similarity Based on Semantic Nets and Corpus Statistics. IEEE Transactions on Knowledge and Data Engineering, 2006
• [3] Li, Y., Bandar, Z.A., McLean, D.: An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources. IEEE Transactions on Knowledge and Data Engineering, 2003
• [4] Resnik, P.: Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 1999
• [5] Foltz, P., Kintsch, W., Landauer, T.: The Measurement of Textual Coherence with Latent Semantic Analysis. Discourse Processes, 1998
• [6] Landauer, T., Foltz, P., Laham, D.: An Introduction to Latent Semantic Analysis. Discourse Processes, 1998
• [7] Landauer, T., Laham, D., Rehder, B., Schreiner, M.: How Well Can Passage Meaning Be Derived without Using Word Order? A Comparison of Latent Semantic Analysis and Humans. 19th Annual Meeting of the Cognitive Science Society, 1997
• [8] Rada, R., Mili, H., Bicknell, E., Blettner, M.: Development and Application of a Metric on Semantic Nets. IEEE Transactions on Systems, Man and Cybernetics, 1989
• [9] Jean-Mary, Y.R., Shironoshita, E.P., Kabuka, M.R.: Ontology Matching with Semantic Verification. Journal of Web Semantics, 2009
• [10] Duan, S., Fokoue, A., Srinivas, K., Byrne, B.: A Clustering-based Approach to Ontology Alignment. ISWC, 2011
• [11] Volz, J., Bizer, C., Gaedke, M., Kobilarov, G.: Discovering and Maintaining Links on the Web of Data. In Proceedings of ISWC, 2009
• [12] Rong, S., Niu, X., Xiang, E.W., Wang, H., Yang, Q., Yu, Y.: A Machine Learning Approach for Instance Matching Based on Similarity Metrics. In Proceedings of ISWC, 2012
• [13] Bagga, A., Baldwin, B.: Entity-based Cross-document Coreferencing Using the Vector Space Model. In Proceedings of COLING, pages 79-85, 1998
• [14] Minkov, E., Cohen, W.W., Ng, A.Y.: Contextual Search and Name Disambiguation in Email Using Graphs. In Proceedings of SIGIR, pages 27-34, 2006
• [15] Jiang, L., Wang, J., An, N., Wang, S., Zhan, J., Li, L.: GRAPE: A Graph-based Framework for Disambiguating People Appearances in Web Search. In Proceedings of ICDM, pages 199-208, 2009
• [16] Shen, W., Wang, J., Luo, P., Wang, M.: LINDEN: Linking Named Entities with Knowledge Base via Semantic Knowledge. In Proceedings of WWW, 2012