Semiautomatic Generation of Resilient Data-Extraction Ontologies
Yihong Ding
Data Extraction Group
Brigham Young University
Sponsored by NSF
Introduction
• Wrapper-driven data extraction
  – Pros: data-source-specified, high performance
  – Cons: lack of resiliency and scalability
• Ontology-driven data extraction
  – Pros: application-domain-specified, resilient and scalable
  – Cons: hard to create
• Objective
  – Generating data-extraction ontologies
Generation Architecture
[Figure: generation architecture. Stages: Knowledge Preparation, Application Specification, Domain Allocation (Concept Selection, Relation Retrieval, Constraint Discovery), Ontology Generation, Extraction Processing, Result Evaluation. Inputs and stores: Knowledge Sources, Integrated Knowledge Base, training documents and test documents (each pre-processed; clean records), Results Storage. Output: Data Extraction Ontology, with user interaction if necessary.]
Knowledge Base Construction
• Knowledge Sources
  – Mikrokosmos (K) Ontology
  – Data-Frame Library
  – Additional Lexicons
  – WordNet
• Integration of Knowledge Base
[Figure: the Data-Frame Library, the K Ontology, the Synonym Dictionary (WordNet), and the Lexicons are integrated into the KNOWLEDGE BASE.]
Application Specification
Record 1:
00 GrandAM SE, Sunfire Red, CD, AC, PW, PL
Great Condition, $10,800, Call 798-3446
Record 2:
02 Buick Century Custom, Pwr Seat, Nada Retail 13,695
Only $12,695. 221-1250
Record 3:
02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755
Record 4:
00 Buick Century Stk# HU7159 Green $9,319, 714-2200
To Apply By Phone, 1-877-228-9486, OREM Utah
Domain Allocation: concept selection
• Select concepts by string-matching against object values
• Resolve conflicts by context or semantic meaning
[Figure: in "02 Buick Century Pwr Seat, Nada Retail 13,695.", the value 13,695 matches both <Price> and <Mileage> in the Data Frame Library; the conflict is resolved by keyword identification ("retail").]
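The concept-selection step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DATA_FRAMES` table, its regular expressions, its keyword lists, and the 12-character context window are all hypothetical and do not come from the actual Data-Frame Library.

```python
import re

# Hypothetical data frames (NOT the real Data-Frame Library): each concept
# has a value pattern and a list of context keywords.
DATA_FRAMES = {
    "Price": {"value": re.compile(r"\d{1,3},\d{3}"), "keywords": ["retail", "$"]},
    "Mileage": {"value": re.compile(r"\d{1,3},\d{3}"), "keywords": ["miles", "mileage"]},
}


def candidate_concepts(record):
    """Map each matched span to the set of concepts whose pattern matches it."""
    candidates = {}
    for concept, frame in DATA_FRAMES.items():
        for match in frame["value"].finditer(record):
            candidates.setdefault(match.span(), set()).add(concept)
    return candidates


def resolve(record, span, concepts, window=12):
    """Resolve a conflict using keywords found just before the value."""
    if len(concepts) == 1:
        return next(iter(concepts))
    context = record[max(0, span[0] - window):span[0]].lower()
    hits = [c for c in sorted(concepts)
            if any(k in context for k in DATA_FRAMES[c]["keywords"])]
    # Exactly one concept supported by context wins; otherwise leave the
    # decision open (None), e.g. for user interaction.
    return hits[0] if len(hits) == 1 else None


def select_concepts(record):
    """Return {value string: selected concept or None} for one record."""
    return {record[s:e]: resolve(record, (s, e), concepts)
            for (s, e), concepts in candidate_concepts(record).items()}


print(select_concepts("02 Buick Century Pwr Seat, Nada Retail 13,695."))
```

In this sketch the value 13,695 matches both hypothetical patterns, and the nearby keyword "retail" tips the choice toward Price; a value with no supporting keyword stays unresolved.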
Domain Allocation: relationship retrieval
Record 1:
00 GrandAM SE, Sunfire Red, CD, AC, PW, PL
Great Condition, $10,800, Call 798-3446
Record 2:
02 Buick Century Custom, Pwr Seat, Nada Retail 13,695
Only $12,695. 221-1250
Record 3:
02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755
Record 4:
00 Buick Century Stk# HU7159 Green $9,319, 714-2200
To Apply By Phone, 1-877-228-9486, OREM Utah
• Find paths among selected concept nodes
• Retrieve cluster representing application domain
[Figure: concept nodes in the knowledge base: <AUTOMOBILE>, <MAKE>, <FEATURE>, <PRICE>, <PHONE>, <YEAR>, <TEMPORAL-UNIT>.]
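One plausible way to read "find paths among selected concept nodes" is as a shortest-path search over the knowledge-base graph, with the cluster taken as the union of all nodes on those paths. The sketch below assumes exactly that; the slides do not state which path algorithm is used, and the edge list, including the unrelated ANIMAL/SPECIES pair, is a hypothetical fragment for illustration only.

```python
from collections import deque
from itertools import combinations

# Hypothetical, undirected fragment of a knowledge-base graph.
EDGES = [
    ("AUTOMOBILE", "MAKE"),
    ("AUTOMOBILE", "FEATURE"),
    ("AUTOMOBILE", "PRICE"),
    ("AUTOMOBILE", "YEAR"),
    ("YEAR", "TEMPORAL-UNIT"),
    ("ANIMAL", "SPECIES"),  # unrelated concepts that should be left out
]


def build_graph(edges):
    """Build an adjacency map from a list of undirected edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list of a shortest path or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


def retrieve_cluster(graph, selected):
    """Union of the nodes on shortest paths between every pair of selected concepts."""
    cluster = set()
    for a, b in combinations(sorted(selected), 2):
        path = shortest_path(graph, a, b)
        if path:
            cluster.update(path)
    return cluster


graph = build_graph(EDGES)
cluster = retrieve_cluster(graph, {"MAKE", "FEATURE", "PRICE"})
print(sorted(cluster))
```

With these assumed edges, the three selected concepts are connected only through AUTOMOBILE, so that node is pulled into the cluster while YEAR, TEMPORAL-UNIT, ANIMAL, and SPECIES are not.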
Domain Allocation: constraint discovery
• Discover the number of participations for each object value
• Use the discovered values as participation constraints
02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755
00 Buick Century Stk# HU7159 Green $9,319, 714-2200
To Apply By Phone, 1-877-228-9486, OREM Utah
[Figure: <AUTOMOBILE> connected to <MAKE>, <FEATURE>, and <PRICE>, annotated with the participation constraints below.]
AUTOMOBILE [0:1] has MAKE [1:*]
AUTOMOBILE [0:*] has FEATURE [1:*]
AUTOMOBILE [0:1] has PRICE [1:1]
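As a rough illustration of constraint discovery, the sketch below counts how many values of each concept appear in each record and reports the observed minimum and maximum as a candidate participation constraint. The min/max rule is an assumption, since the slides do not say how the bounds are derived, and the per-record annotations are hypothetical hand labels for the two sample ads, so the output is not expected to reproduce the constraints shown on the slide.

```python
def participation_constraints(records):
    """Return {concept: (min_count, max_count)} observed across all records.

    `records` is a list of dicts mapping a concept name to the list of
    values found for it in one record (hypothetical pre-annotated input).
    """
    concepts = sorted({concept for record in records for concept in record})
    constraints = {}
    for concept in concepts:
        counts = [len(record.get(concept, [])) for record in records]
        constraints[concept] = (min(counts), max(counts))
    return constraints


def format_constraint(owner, concept, bounds):
    """Render a constraint in a bracketed [min:max] style."""
    low, high = bounds
    return f"{owner} [{low}:{high}] has {concept}"


# Hypothetical annotations of the two sample ads (labelled by hand).
annotated = [
    {"MAKE": ["Buick"], "FEATURE": ["green", "pwr seat"], "PRICE": ["$11,999"]},
    {"MAKE": ["Buick"], "FEATURE": ["Green"], "PRICE": ["$9,319"]},
]

constraints = participation_constraints(annotated)
for concept, bounds in constraints.items():
    print(format_constraint("AUTOMOBILE", concept, bounds))
```

Bounds observed on a small training set are only candidates; a user would still need to confirm or relax them during ontology tuning.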
Ontology Generation
• Initial ontology: automatically generated
• Updated ontology: user tuning
• Expectation
  – Rejecting existing elements is much easier than adding new ones
  – As little modification as possible
Evaluation and Results
• Evaluation
  – Compare: Generated vs. Expert-created
  – POG (Precision of Ontology Generation)
  – PROG (Pseudo-Recall of Ontology Generation)
  – EPROG (Effective-PROG)
• Results
  – Three testing domains: Apt-Rental, Used-Auto-Ads, Nation-Essence
  – Average POG less than 0.23
  – Lowest EPROG is around 0.70, highest is almost 1.0
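The slides name the metrics but do not define them. Purely as an assumption, the sketch below treats POG and PROG as ordinary set-based precision and recall of the generated ontology elements against the expert-created ones; EPROG is left out because nothing in the slides indicates how it is computed. The element names in the example are hypothetical.

```python
def pog(generated, expert):
    """Assumed definition: share of generated elements that the expert ontology also contains."""
    return len(generated & expert) / len(generated) if generated else 0.0


def prog(generated, expert):
    """Assumed definition: share of expert-ontology elements that were generated."""
    return len(generated & expert) / len(expert) if expert else 0.0


# Hypothetical element sets for illustration only.
generated = {"MAKE", "FEATURE", "PRICE", "PHONE", "TEMPORAL-UNIT"}
expert = {"MAKE", "FEATURE", "PRICE", "YEAR"}

print(f"POG  = {pog(generated, expert):.2f}")
print(f"PROG = {prog(generated, expert):.2f}")
```

Under this reading, a low POG paired with a high recall-style score would mean the generator proposes many extra elements but misses few, which fits the stated expectation that rejecting elements is easier than adding them.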
Conclusion
• Exploits existing knowledge
• Specifies application domain
• Allocates domain inside the knowledge base
• Generates a data-extraction ontology
• Shows effective recall of more than 70% on average