![Page 1: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/1.jpg)
Ontology-Driven Automatic Entity Disambiguation in Unstructured Text
Jed Hassell
![Page 2: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/2.jpg)
Introduction
► Most Web pages present no explicit semantic information about their data and objects.
► The Semantic Web aims to solve this problem by providing an underlying mechanism to add semantic metadata to content:
  - Ex: The entity “UGA” pointing to http://www.uga.edu
  - Using entity disambiguation
![Page 3: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/3.jpg)
Introduction
► We use background knowledge in the form of an ontology
► Our contributions are two-fold:
  - A novel method to disambiguate entities within unstructured text by using clues in the text and exploiting metadata from the ontology
  - An implementation of our method that uses a very large, real-world ontology to demonstrate effective entity disambiguation in the domain of Computer Science researchers
![Page 4: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/4.jpg)
Background
► Sesame Repository
  - Open source RDF repository
  - We chose Sesame, as opposed to Jena and BRAHMS, because it can store large amounts of information without depending on memory storage alone
  - We chose Sesame’s native mode because our dataset is typically too large to fit into memory, and the database option is too slow for update operations
![Page 5: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/5.jpg)
Dataset 1: DBLP Ontology
► DBLP is a website that contains bibliographic information for computer scientists, journals and proceedings:
  - 3,079,414 entities (447,121 are authors)
  - We used a SAX parser to parse the DBLP XML file that is available online
  - Created relationships such as “co-author”
  - Added information regarding affiliations
  - Added information regarding areas of interest
  - Added alternate spellings for international characters
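A minimal sketch of how a SAX handler could derive “co-author” relationships from DBLP-style XML. The record tag names (`article`, `inproceedings`), the `CoAuthorHandler` class and the inline sample are assumptions made for illustration; the slides do not describe the actual parser.

```python
import itertools
import xml.sax
from collections import defaultdict


class CoAuthorHandler(xml.sax.ContentHandler):
    """Collects co-author pairs from DBLP-style publication records."""

    # Record element names assumed for this sketch.
    RECORD_TAGS = {"article", "inproceedings"}

    def __init__(self):
        super().__init__()
        self.coauthors = defaultdict(set)  # author name -> set of co-author names
        self._authors = []                 # authors of the record being read
        self._in_author = False
        self._text = []

    def startElement(self, name, attrs):
        if name in self.RECORD_TAGS:
            self._authors = []
        elif name == "author":
            self._in_author = True
            self._text = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate them.
        if self._in_author:
            self._text.append(content)

    def endElement(self, name):
        if name == "author":
            self._authors.append("".join(self._text).strip())
            self._in_author = False
        elif name in self.RECORD_TAGS:
            # Every pair of authors on the same record is a co-author pair.
            for a, b in itertools.combinations(self._authors, 2):
                self.coauthors[a].add(b)
                self.coauthors[b].add(a)


# Hypothetical sample data, not taken from DBLP.
SAMPLE = b"""<dblp>
  <article><author>A. Author</author><author>B. Writer</author><title>T1</title></article>
  <inproceedings><author>A. Author</author><author>C. Scholar</author><title>T2</title></inproceedings>
</dblp>"""

handler = CoAuthorHandler()
xml.sax.parseString(SAMPLE, handler)
```

A streaming parser fits here because the full DBLP file is large; the handler keeps only the current record's authors plus the accumulated relationship map.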
![Page 6: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/6.jpg)
Dataset 2: DBWorld Posts
► DBWorld
  - Mailing list of information for upcoming conferences related to the databases field
  - Created an HTML scraper that downloads every post with “Call for Papers”, “Call for Participation” or “CFP” in its subject
  - Unstructured text
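The subject filter used by the scraper could look roughly like the sketch below. The function name `is_call_post` and the case-insensitive matching are assumptions; the slides only list the three phrases.

```python
import re

# The three subject phrases named on the slide; case-insensitivity is assumed.
SUBJECT_PATTERN = re.compile(
    r"call for papers|call for participation|\bcfp\b", re.IGNORECASE
)


def is_call_post(subject: str) -> bool:
    """Return True if a post subject looks like a call for papers or participation."""
    return SUBJECT_PATTERN.search(subject) is not None


subjects = [
    "CFP: Workshop on Semantic Data",
    "Call for Participation - Summer School",
    "Job opening: postdoctoral researcher",
]
selected = [s for s in subjects if is_call_post(s)]
```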
![Page 7: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/7.jpg)
Overview of System Architecture
![Page 8: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/8.jpg)
Approach
► Entity Names
  - Entity attribute that represents the name of the entity
  - Can contain more than one name
![Page 9: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/9.jpg)
Approach
► Text-proximity Relationships
  - Relationships whose values can be expected to appear in text proximity of the entity
  - Nearness measured in character spaces
![Page 10: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/10.jpg)
Approach
► Text Co-occurrence Relationships
  - Similar to text-proximity relationships except proximity is not relevant
![Page 11: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/11.jpg)
Approach
► Popular Entities
  - The intuition is to specify relationships that bias the choice of the right entity toward the most popular entity
  - This should be used with care, depending on the domain
  - DBLP ex: the number of papers the entity has authored
![Page 12: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/12.jpg)
Approach
► Semantic Relationships
  - Entities can be related to one another through their collaboration network
  - DBLP ex: Entities are related to one another through co-author relationships
![Page 13: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/13.jpg)
Algorithm
► The idea is to spot entity names in text and assign each potential match a confidence score
► This confidence score is adjusted as the algorithm progresses and represents the certainty that the spotted entity represents a particular object in the ontology
![Page 14: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/14.jpg)
Algorithm – Flow Chart
[Flow chart, part 1]
► Start: spot entity names
  - Found? If no, do nothing; if yes, initiate a confidence score and store it in Candidate Entities
  - More entities? If yes, continue spotting; if no, move on
► Spot text-proximity relationships
  - Found? If yes, adjust confidence score; if no, do nothing
  - More candidate entities? If yes, repeat; if no, move on
![Page 15: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/15.jpg)
Algorithm – Flow Chart
[Flow chart, part 2]
► Spot text co-occurrence relationships
  - Found? If yes, adjust confidence score; if no, do nothing
  - More candidate entities? If yes, repeat; if no, move on
► Adjust confidence score based on number of popular entity relationships
► Search for semantic relationships
  - Found? If yes, adjust confidence score; if no, no change
  - More candidate entities? If yes, repeat; if no, move on
► Candidate entity rise above threshold?
  - Yes (iterative step): repeat
  - No: End
![Page 16: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/16.jpg)
Algorithm
► Spotting Entity Names
  - Search the document for entity names within the ontology
  - Each entity in the ontology that matches a name found in the document becomes a candidate entity
  - Assign initial confidence scores for candidate entities based on these formulas:
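The spotting step can be sketched as a plain substring search over the known names. The `Candidate` record, the `spot_entity_names` function and the flat `initial_score` are hypothetical; in particular, the flat score is only a placeholder and does not reproduce the initial-confidence formulas from the slide.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    uri: str           # ontology entity the match may refer to
    name: str          # the name variant that was spotted
    offset: int        # character offset of the match in the document
    confidence: float  # adjusted by later steps


def spot_entity_names(text, names_by_uri, initial_score=0.5):
    """Return one Candidate per (entity, name occurrence) found in text.

    initial_score is a placeholder assumption, not the authors' formula.
    """
    candidates = []
    for uri, names in names_by_uri.items():
        for name in names:
            start = text.find(name)
            while start != -1:
                candidates.append(Candidate(uri, name, start, initial_score))
                start = text.find(name, start + 1)
    return candidates


# Two hypothetical ontology entities that share the same name.
text = "Program chair: John Smith (UGA). Contact John Smith."
names_by_uri = {"ex:1": ["John Smith", "J. Smith"], "ex:2": ["John Smith"]}
found = spot_entity_names(text, names_by_uri)
```

Both entities become candidates for each occurrence of the shared name; the later steps are what separate them.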
![Page 17: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/17.jpg)
Algorithm
► Spotting Literal Values of Text-proximity Relationships
  - Only consider relationships from candidate entities
  - Substantially increase confidence score if within proximity
  - Ex: Entity affiliation found next to entity name
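A possible shape for the text-proximity check, with nearness measured in characters around the spotted name. The `window` size and the `boost` amount are assumed values chosen for illustration; the slides do not state them.

```python
def proximity_boost(text, name_offset, name, literal, window=100, boost=0.3):
    """Return boost if literal occurs within window characters of the spotted name.

    window and boost are illustrative assumptions.
    """
    lo = max(0, name_offset - window)
    hi = min(len(text), name_offset + len(name) + window)
    return boost if literal in text[lo:hi] else 0.0


near = "Jane Doe, University of Georgia, will chair the workshop."
far = "Jane Doe" + " " * 500 + "University of Georgia"

near_boost = proximity_boost(near, 0, "Jane Doe", "University of Georgia")
far_boost = proximity_boost(far, 0, "Jane Doe", "University of Georgia")
```

Here the affiliation next to the name earns the boost, while the same affiliation 500 characters away does not.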
![Page 18: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/18.jpg)
Algorithm
► Spotting Literal Values of Text Co-occurrence Relationships
  - Only consider relationships from candidate entities
  - Increase confidence score if found within the document (location does not matter)
  - Ex: Entity’s areas of interest found in the document
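Co-occurrence differs from proximity only in that the whole document is searched. In this sketch the per-match increment `boost_per_match` and the case-insensitive comparison are assumptions.

```python
def cooccurrence_boost(text, literals, boost_per_match=0.1):
    """Return (total boost, matched literals) for literals found anywhere in text.

    boost_per_match is an illustrative assumption.
    """
    lowered = text.lower()
    matches = [lit for lit in literals if lit.lower() in lowered]
    return boost_per_match * len(matches), matches


document = "Topics include data mining and the Semantic Web."
areas_of_interest = ["Semantic Web", "Bioinformatics", "Data Mining"]
total, matched = cooccurrence_boost(document, areas_of_interest)
```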
![Page 19: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/19.jpg)
Algorithm
► Using Popular Entities
  - Slightly increase the confidence score of candidate entities based on the number of popular entity relationships
  - Valuable when used as a tie-breaker
  - Ex: Candidate entities with more than 15 publications receive a slight increase in their confidence score
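The tie-breaking role of popularity can be illustrated with the publication example from the slide. The threshold of 15 comes from the slide; the size of the increase (`boost=0.05`) and the entity identifiers are assumptions.

```python
def popularity_boost(publication_count, min_publications=15, boost=0.05):
    """Return a small boost for entities with more than min_publications papers.

    The boost size is an illustrative assumption.
    """
    return boost if publication_count > min_publications else 0.0


# Two hypothetical candidates tied after the earlier steps.
scores = {"ex:prolific": 0.5, "ex:occasional": 0.5}
publications = {"ex:prolific": 40, "ex:occasional": 3}

adjusted = {uri: s + popularity_boost(publications[uri]) for uri, s in scores.items()}
best = max(adjusted, key=adjusted.get)
```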
![Page 20: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/20.jpg)
Algorithm
► Using Semantic Relationships
  - Use relationships among entities to boost confidence scores of candidate entities
  - Each candidate entity with a confidence score above the threshold is analyzed for semantic relationships to other candidate entities. If a related candidate entity is found and is below the threshold, that entity’s confidence score is increased
![Page 21: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/21.jpg)
Algorithm
► If any candidate entity rises above the threshold, the process repeats until the algorithm stabilizes
► This is an iterative step and always converges
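One way to sketch the iterative semantic step: candidates above the threshold boost their related candidates, and any candidate that newly crosses the threshold triggers another pass. The `threshold` and `boost` values and the `propagate` function are assumptions. In this sketch, termination follows from each candidate acting as a source at most once; that is a property of the sketch and not a claim about the authors' implementation.

```python
def propagate(scores, related, threshold=0.9, boost=0.2):
    """Boost below-threshold candidates related to above-threshold ones.

    scores:  dict mapping candidate URI -> confidence
    related: dict mapping candidate URI -> set of related URIs (e.g. co-authors)
    threshold and boost are illustrative assumptions.
    """
    scores = dict(scores)  # leave the caller's dict untouched
    processed = set()      # candidates already used as a boosting source
    while True:
        newly_confident = [
            uri for uri, s in scores.items() if s > threshold and uri not in processed
        ]
        if not newly_confident:
            return scores  # stable: no candidate crossed the threshold this pass
        for uri in newly_confident:
            processed.add(uri)
            for other in related.get(uri, ()):
                if other in scores and scores[other] < threshold:
                    scores[other] = min(1.0, scores[other] + boost)


# Hypothetical chain of co-authors: a -> b -> c -> d.
initial = {"a": 0.95, "b": 0.75, "c": 0.75, "d": 0.3}
links = {"a": {"b"}, "b": {"c"}, "c": {"d"}}
final = propagate(initial, links)
```

In the example, confidence spreads from `a` to `b`, then from `b` to `c` on the next pass; `d` is boosted once but stays below the threshold, so the loop stops.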
![Page 22: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/22.jpg)
Output
► XML format
  - URI – the DBLP URL of the entity
  - Entity name
  - Confidence score
  - Character offset – the location of the entity in the document
► This is a generic output and can easily be converted for use in Microformats, RDFa, etc.
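The four output fields could be serialized along these lines. The element names (`entities`, `entity`, `uri`, `name`, `confidence`, `offset`) and the example URI are hypothetical; the slides list the fields but not the actual schema.

```python
import xml.etree.ElementTree as ET


def to_xml(entities):
    """Serialize disambiguated entities to an XML string (hypothetical schema)."""
    root = ET.Element("entities")
    for e in entities:
        node = ET.SubElement(root, "entity")
        ET.SubElement(node, "uri").text = e["uri"]
        ET.SubElement(node, "name").text = e["name"]
        ET.SubElement(node, "confidence").text = f"{e['confidence']:.2f}"
        ET.SubElement(node, "offset").text = str(e["offset"])
    return ET.tostring(root, encoding="unicode")


xml_output = to_xml([
    {
        "uri": "http://example.org/dblp/jane-doe",  # placeholder, not a real DBLP URL
        "name": "Jane Doe",
        "confidence": 0.95,
        "offset": 15,
    }
])
```

Keeping the character offset in the output is what makes a later conversion to inline markup such as Microformats or RDFa possible, since the annotation can be placed back at the exact position in the text.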
![Page 23: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/23.jpg)
Output
![Page 24: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/24.jpg)
Output - Microformat
![Page 25: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/25.jpg)
Evaluation: Gold Standard Set
► We evaluate our system using a gold standard set of documents
  - 20 manually disambiguated documents
  - Randomly chose 20 consecutive posts from DBWorld
  - We use precision and recall as the evaluation measures for our system
![Page 26: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/26.jpg)
Evaluation: Gold Standard Set
![Page 27: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/27.jpg)
Evaluation: Gold Standard Set
![Page 28: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/28.jpg)
Evaluation: Precision & Recall
► We define set A as the set of unique names identified using the disambiguated dataset
► We define set B as the set of entities found by our method
► The intersection of these sets represents the set of entities correctly identified by our method
![Page 29: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/29.jpg)
Evaluation: Precision & Recall
► Precision is the proportion of correctly disambiguated entities with regard to B
► Recall is the proportion of correctly disambiguated entities with regard to A
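With A as the gold standard set and B as the set found by the method, the two measures reduce to set arithmetic. The sketch below uses synthetic integer identifiers whose set sizes mirror the counts reported in the results table of this deck (602 correct, 620 found, 758 total); the identifiers themselves are made up.

```python
def precision_recall(gold, found):
    """Return (precision, recall) for a gold standard set A and a found set B."""
    correct = gold & found  # entities correctly identified by the method
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Synthetic sets: 758 gold entities, 620 found, 602 of them in both.
A = set(range(758))
B = set(range(602)) | set(range(1000, 1018))

precision, recall = precision_recall(A, B)
precision_pct = round(precision * 100, 1)
recall_pct = round(recall * 100, 1)
```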
![Page 30: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/30.jpg)
Evaluation: Results
► Precision and recall when compared to entire gold standard set:

| Correct Disambiguation | Found Entities | Total Entities | Precision | Recall |
|---|---|---|---|---|
| 602 | 620 | 758 | 97.1% | 79.4% |

► Precision and recall on a per document basis:
[Figure: “Precision and Recall” chart; x-axis: Documents (1 to 20); y-axis: Percentage (0 to 100); series: Recall, Precision]
![Page 31: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/31.jpg)
Related Work
► Semex:
  - Personal information management system that works with a user’s desktop
  - Takes advantage of a predictable structure
  - The results of disambiguated entities are propagated to other ambiguous entities, which can then be reconciled based on recently reconciled entities, much as in our work
![Page 32: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/32.jpg)
Related Work
► Kim:
  - An application that aims to perform automatic ontology population
  - Contains an entity recognition portion that uses natural language processors
  - Evaluations performed on human-annotated corpora
  - Missed a lot of entities and results had many false positives
![Page 33: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/33.jpg)
Conclusion
► Our method uses relationships between entities in the ontology to go beyond traditional syntactic-based disambiguation techniques
► This work is among the first to successfully use relationships for identifying entities in text without relying on the structure of the text
![Page 34: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text](https://reader036.vdocument.in/reader036/viewer/2022062423/56814499550346895db1423e/html5/thumbnails/34.jpg)
Thank you!