Semiautomatic Generation of Resilient Data-Extraction Ontologies

Yihong Ding

Data Extraction Group, Brigham Young University

Sponsored by NSF

Introduction

• Wrapper-driven data extraction
  – Pros: data-source-specific, high performance
  – Cons: lack of resiliency and scalability

• Ontology-driven data extraction
  – Pros: application-domain-specific, resilient and scalable
  – Cons: hard to create

• Objective
  – Generating data-extraction ontologies

Generation Architecture

[Figure: generation architecture. Recoverable labels: Knowledge Sources, Knowledge Preparation, Integrated Knowledge Base; training documents, pre-processing, clean records, Application Specification; Domain Allocation (Concept Selection, Relation Retrieval, Constraint Discovery); Ontology Generation, Data Extraction Ontology; test documents, pre-processing, Extraction Processing, Results Storage, Result Evaluation; "interact if necessary".]

Knowledge Base Construction

• Knowledge Sources
  – Mikrokosmos (K) Ontology
  – Data-Frame Library
  – Additional Lexicons
  – WordNet

• Integration of Knowledge Base

[Figure: the Data-Frame Library, the K Ontology, a Synonym Dictionary (WordNet), and Lexicons combined into the KNOWLEDGE BASE.]

Application Specification

Record 1:

00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call 798-3446

Record 2:

02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 Only $12,695. 221-1250

Record 3:

02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755

Record 4:

00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah

Domain Allocation: concept selection

• Select concepts by string-matching against object values
• Resolve conflicts by context or semantic meaning

[Figure: the record "02 Buick Century, Pwr Seat, Nada Retail 13,695." with candidate concepts <Price> and <Mileage> from the Data Frame Library; annotations: "retail", "by keyword identification".]
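The two bullets above can be illustrated with a minimal sketch. It assumes a tiny hand-written stand-in for a data-frame library: the `DATA_FRAMES` table, its regular expressions, its keyword lists, and the `select_concepts` function are all hypothetical and are not taken from the slides. Both concepts recognize the same number format, so the conflict is resolved by counting context keywords near the matched value.

```python
import re

# Hypothetical stand-in for a data-frame library (not the authors' library).
# Each concept has a value recognizer and a few context keywords.
DATA_FRAMES = {
    "Price": {"value": re.compile(r"\d{1,3},\d{3}"),
              "keywords": ("$", "retail", "only")},
    "Mileage": {"value": re.compile(r"\d{1,3},\d{3}"),
                "keywords": ("miles", "mileage")},
}


def select_concepts(record, window=12):
    """Map each matched value to one concept, or None if the conflict stays open."""
    # Step 1: string matching. Collect every concept whose recognizer matches a span.
    candidates = {}
    for concept, frame in DATA_FRAMES.items():
        for match in frame["value"].finditer(record):
            candidates.setdefault(match.span(), []).append(concept)

    # Step 2: conflict resolution. Score each candidate by nearby context keywords.
    selected = {}
    for (start, end), concepts in candidates.items():
        context = record[max(0, start - window):end + window].lower()
        scores = {c: sum(k in context for k in DATA_FRAMES[c]["keywords"])
                  for c in concepts}
        best = max(scores.values())
        winners = [c for c, s in scores.items() if s == best]
        # A tie is left unresolved (None) so that a user could decide.
        selected[record[start:end]] = winners[0] if len(winners) == 1 else None
    return selected


record = "02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 Only $12,695. 221-1250"
print(select_concepts(record))
```

Leaving ties as `None` mirrors the "interact if necessary" step of the architecture; a real system would likely use richer context rules than a fixed character window.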

Domain Allocation: relationship retrieval

Record 1:

00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call 798-3446

Record 2:

02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 Only $12,695. 221-1250

Record 3:

02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755

Record 4:

00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah

• Find paths among selected concept nodes
• Retrieve cluster representing application domain

[Figure: concept nodes shown for the car-ad records: <AUTOMOBILE>, <MAKE>, <FEATURE>, <PRICE>, <PHONE>, <YEAR>, <TEMPORAL-UNIT>.]
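A minimal sketch of path finding and cluster retrieval, assuming the knowledge base can be treated as an undirected graph of concept nodes. The edge list below is invented for illustration (it is not taken from the Mikrokosmos ontology), and `build_graph`, `shortest_path`, and `retrieve_cluster` are hypothetical helper names. Breadth-first search is used here as one simple way to find a path; the slides do not say which search the authors use.

```python
from collections import deque
from itertools import combinations

# Invented edges for illustration only; MONTH is an unrelated distractor node.
EDGES = [
    ("AUTOMOBILE", "MAKE"),
    ("AUTOMOBILE", "FEATURE"),
    ("AUTOMOBILE", "PRICE"),
    ("AUTOMOBILE", "PHONE"),
    ("AUTOMOBILE", "TEMPORAL-UNIT"),
    ("TEMPORAL-UNIT", "YEAR"),
    ("TEMPORAL-UNIT", "MONTH"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of node pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, source, target):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


def retrieve_cluster(graph, selected):
    """Union of the selected concepts and all nodes on paths between them."""
    cluster = set(selected)
    for a, b in combinations(sorted(selected), 2):
        path = shortest_path(graph, a, b)
        if path:
            cluster.update(path)
    return cluster


graph = build_graph(EDGES)
print(sorted(retrieve_cluster(graph, {"MAKE", "PRICE", "YEAR"})))
```

In this toy graph the intermediate nodes on the connecting paths join the cluster, while concepts that lie on no path between selected nodes stay out of it.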

Domain Allocation: constraint discovery

• Discover how many times each object value participates
• Specify the discovered counts as participation constraints

02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755

00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah

[Figure: <AUTOMOBILE> linked to <MAKE>, <FEATURE>, and <PRICE>.]

AUTOMOBILE [0:1] has MAKE [1:*]

AUTOMOBILE [0:*] has FEATURE [1:*]

AUTOMOBILE [0:1] has PRICE [1:1]
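One way to read the discovery step is to count, per record, how many values each concept contributes and then take the minimum and maximum over all records. The sketch below does this for the AUTOMOBILE side of each relationship only. The `RECOGNIZERS` patterns and the `discover_constraints` function are hypothetical, and reporting any maximum above 1 as `*` is an assumption made here, not something the slides state. With only three sample records the derived bounds need not equal the constraints listed above.

```python
import re

# Hypothetical recognizers for three concepts (illustration only).
RECOGNIZERS = {
    "MAKE": re.compile(r"\b(?:Buick|Pontiac|Ford)\b"),
    "FEATURE": re.compile(r"\b(?:red|green|cd|ac|pw|pl|pwr seat)\b", re.IGNORECASE),
    "PRICE": re.compile(r"\$\d{1,3},\d{3}"),
}

RECORDS = [
    "00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call 798-3446",
    "02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755",
    "00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, "
    "1-877-228-9486, OREM Utah",
]


def discover_constraints(records, recognizers):
    """Return {concept: "[min:max]"} from per-record occurrence counts."""
    constraints = {}
    for concept, pattern in recognizers.items():
        counts = [sum(1 for _ in pattern.finditer(record)) for record in records]
        low, high = min(counts), max(counts)
        # Assumption: more than one observed occurrence is generalized to "*".
        constraints[concept] = f"[{low}:{'*' if high > 1 else high}]"
    return constraints


for concept, bounds in discover_constraints(RECORDS, RECOGNIZERS).items():
    print(f"AUTOMOBILE {bounds} has {concept}")
```

Because the bounds come from observed counts, a small or unrepresentative training set can give a minimum that is too high, which is one reason a user-tuning pass afterwards is useful.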

Ontology Generation

• Initial ontology: automatically generated

• Updated ontology: user tuning

• Expectation
  – Rejecting existing components is much easier than adding new ones
  – As little modification as possible

Evaluation and Results

• Evaluation
  – Compare: generated vs. expert-created
  – POG (Precision of Ontology Generation)
  – PROG (Pseudo-Recall of Ontology Generation)
  – EPROG (Effective-PROG)

• Results
  – Three testing domains: Apt-Rental, Used-Auto-Ads, Nation-Essence
  – Average POG less than 0.23
  – Lowest EPROG is around 0.70, highest is almost 1.0
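The slides name these measures but do not define them. As a rough illustration of comparing a generated ontology with an expert-created one, the sketch below computes ordinary set-based precision and recall over ontology components. Treating POG as plain precision and PROG as plain recall is an assumption made here, EPROG is not sketched because its definition is not given, and the `precision_recall` function and the sample component names are hypothetical.

```python
def precision_recall(generated, expert):
    """Set-based precision and recall of generated components against expert ones."""
    generated, expert = set(generated), set(expert)
    correct = generated & expert
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(expert) if expert else 0.0
    return precision, recall


# Hypothetical component sets, for illustration only.
generated = {"AUTOMOBILE", "MAKE", "FEATURE", "PRICE", "TEMPORAL-UNIT", "PHONE"}
expert = {"AUTOMOBILE", "MAKE", "MODEL", "FEATURE", "PRICE", "PHONE"}

precision, recall = precision_recall(generated, expert)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this reading, a generator that proposes many extra components scores low on precision while still scoring high on recall, which fits the expectation that rejecting components is easier than adding them.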

Conclusion

• Exploits existing knowledge

• Specifies application domain

• Allocates domain inside the knowledge base

• Generates a data-extraction ontology

• Shows effective recall of more than 70% on average