Schema Matching and Data Extraction over HTML Tables

Post on 18-Jan-2016


Cui Tao

Data Extraction Research Group, Department of Computer Science

Brigham Young University

supported by NSF

Introduction

Many tables on the Web
How to integrate data stored in different tables?
Detect the table of interest
Form attribute-value pairs (adjust if necessary)
Do extraction
Infer mappings from extraction patterns

Problem: Detecting the Table of Interest

Problem: Different Schemas

Different source table schemas:
{Run #, Yr, Make, Model, Tran, Color, Dr}
{Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
{Vehicle, Distance, Price, Mileage}
{Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

Target database schema:
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

Problem: Attribute is Value

Problem: Attribute-Value is Value

Problem: Value is not Value

Problem: Factored Values

Problem: Split Values

Problem: Merged Values

Problem: Information Behind Links

Single-column table (formatted as list)

Table extending over several pages

Solution

Detect the table of interest
Form attribute-value pairs (adjust if necessary)
Do extraction
Infer mappings from extraction patterns

Solution: Detect the Table of Interest

‘Real’ table test
  Same number of values
  Table size
Attribute test
Density measure test
  density = (# of ontology-extracted values) / (total # of values in the table)
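The density measure above can be sketched in a few lines. This is only an illustration: the regular-expression recognizers stand in for the extraction ontology, the names `density`, `is_year`, `is_price`, and `is_make` are hypothetical, and counting only non-empty cells as "values" is an assumption.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical recognizers standing in for an extraction ontology.
def is_year(value: str) -> bool:
    return re.fullmatch(r"(19|20)\d{2}", value) is not None

def is_price(value: str) -> bool:
    return re.fullmatch(r"\$\d[\d,]*", value) is not None

def is_make(value: str) -> bool:
    return value.lower() in {"honda", "ford"}  # tiny illustrative lexicon

def density(table: List[List[str]],
            recognizers: Iterable[Callable[[str], bool]]) -> float:
    """Ratio of recognized values to all non-empty values in the table."""
    recs = list(recognizers)
    values = [cell.strip() for row in table for cell in row if cell.strip()]
    if not values:
        return 0.0
    extracted = sum(1 for v in values if any(r(v) for r in recs))
    return extracted / len(values)

# Made-up table: 7 non-empty values, 6 of them recognized.
example_table = [
    ["1995", "Honda", "$6300", "call now"],
    ["2000", "Ford", "$7,988", ""],
]
example_density = density(example_table, [is_year, is_price, is_make])
```

A table would then pass the density test when this ratio exceeds some threshold; the slides do not state the threshold, so none is assumed here.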

Solution: Remove Factoring

[Figure: example table; Year column values: 2001 (×3), 2000 (×6), 1999 (×2)]
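One way to picture removing factoring is to copy a factored value down into every row it applies to. The sketch below assumes the factored value appears once and the cells beneath it are empty; the function name `unfactor` and the sample rows are hypothetical, and the slides may handle other layouts (for example, a value shown as a group header).

```python
from typing import List

def unfactor(rows: List[List[str]], column: int) -> List[List[str]]:
    """Fill empty cells in `column` with the last non-empty value above them."""
    result = []
    current = ""
    for row in rows:
        filled = list(row)  # copy so the input rows are left unchanged
        if filled[column].strip():
            current = filled[column]
        else:
            filled[column] = current
        result.append(filled)
    return result

# Made-up rows: the year is written once per group of cars.
factored_rows = [
    ["2001", "Honda"],
    ["", "Ford"],
    ["2000", "Toyota"],
    ["", "Dodge"],
]
unfactored_rows = unfactor(factored_rows, column=0)
```

After this step every row carries its own Year value, so each row can be processed independently.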

Solution: Replace Boolean Values

Solution: Form Attribute-Value Pairs

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
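The pairs above can be produced by zipping the header with a row and replacing Boolean cells by the attribute name. This is a minimal sketch under stated assumptions: the marker sets, the function name `form_pairs`, and the Yes/No cell values in the sample row are hypothetical (the slide shows only the resulting pairs).

```python
from typing import List, Tuple

# Assumed spellings of Boolean cells; real tables may use other markers.
TRUE_MARKERS = {"yes", "y", "x", "true", "✓"}
FALSE_MARKERS = {"no", "n", "false"}

def form_pairs(header: List[str], row: List[str]) -> List[Tuple[str, str]]:
    """Build <attribute, value> pairs, turning a true Boolean into the attribute name."""
    pairs = []
    for attribute, value in zip(header, row):
        cleaned = value.strip()
        if not cleaned or cleaned.lower() in FALSE_MARKERS:
            continue  # an absent or false feature produces no pair
        if cleaned.lower() in TRUE_MARKERS:
            cleaned = attribute  # e.g. <Auto, Yes> becomes <Auto, Auto>
        pairs.append((attribute, cleaned))
    return pairs

header = ["Make", "Model", "Year", "Colour", "Price",
          "Auto", "Air Cond.", "AM/FM", "CD"]
row = ["Honda", "Civic EX", "1995", "White", "$6300",
       "Yes", "Yes", "Yes", "No"]
pairs = form_pairs(header, row)
```

Replacing the Boolean with the attribute name gives the extractor a value it can recognize as a feature, which a bare "Yes" would not.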

Solution: Adjust Attribute-Value Pairs

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>

Solution: Add Information Hidden Behind Links

Unstructured and semi-structured: concatenate

Single attribute-value pairs: pair them together

<Price, $7,988>, <Mileage, 63,168 miles>, <Body Type, Car>, <Body Style, 4 DR Sedan>, <Transmission, Automatic>, <Engine, 3.0 L V-6>, <Doors, 4>, <Fuel Type, Gas>, <Stock Number, 22764>, <VIN, 1FAFP52U2WA139879>

List: mark the beginning and the end

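The three ways of handling linked-page content can be sketched as three small helpers. Everything here is illustrative: the function names, the `LIST_BEGIN`/`LIST_END` marker strings, and the assumption that a linked page has already been split into text fragments, label/value cells, or list items are not taken from the slides.

```python
from typing import List, Tuple

# Hypothetical marker strings; the slides do not say which markers are used.
LIST_BEGIN = "<list>"
LIST_END = "</list>"

def concatenate_text(fragments: List[str]) -> str:
    """Unstructured or semi-structured content: join the text fragments."""
    return " ".join(f.strip() for f in fragments if f.strip())

def pair_up(cells: List[str]) -> List[Tuple[str, str]]:
    """Single attribute-value pairs: pair alternating label and value cells."""
    pairs = []
    for i in range(0, len(cells) - 1, 2):
        label = cells[i].strip().rstrip(":").strip()
        pairs.append((label, cells[i + 1].strip()))
    return pairs

def mark_list(items: List[str]) -> List[str]:
    """List content: mark the beginning and the end of the list."""
    return [LIST_BEGIN] + [item.strip() for item in items] + [LIST_END]

linked_pairs = pair_up(["Price:", "$7,988", "Mileage:", "63,168 miles"])
```

The pairs produced for a linked page can then be appended to the pairs of the table row that contains the link.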

Solution: Inferred Mapping Creation

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}


Each row is a car.

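Inferring a mapping from extraction patterns can be pictured as follows: if most values in a source column are recognized as instances of one target attribute, map that column to the attribute. This is a sketch, not the author's algorithm; the regular expressions stand in for ontology recognizers, and `infer_mappings`, the 0.5 threshold, and the sample header and rows are assumptions.

```python
import re
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical recognizers for three target-schema attributes.
RECOGNIZERS: Dict[str, Callable[[str], bool]] = {
    "Year": lambda v: re.fullmatch(r"(19|20)\d{2}", v) is not None,
    "Price": lambda v: re.fullmatch(r"\$\d[\d,]*", v) is not None,
    "Mileage": lambda v: re.fullmatch(r"\d[\d,]*\s*miles", v, re.IGNORECASE) is not None,
}

def infer_mappings(header: List[str], rows: List[List[str]],
                   threshold: float = 0.5) -> Dict[str, str]:
    """Map each source column to the target attribute that recognizes most of its values."""
    mappings: Dict[str, str] = {}
    for col, source_attr in enumerate(header):
        values = [r[col].strip() for r in rows if col < len(r) and r[col].strip()]
        if not values:
            continue
        hits: Counter = Counter()
        for value in values:
            for target_attr, recognizes in RECOGNIZERS.items():
                if recognizes(value):
                    hits[target_attr] += 1
        if hits:
            target_attr, count = hits.most_common(1)[0]
            if count / len(values) > threshold:
                mappings[source_attr] = target_attr
    return mappings

# Made-up source table; each row is a car.
source_header = ["Yr", "Make", "Cost"]
source_rows = [["1995", "Honda", "$6300"], ["2000", "Ford", "$7,988"]]
inferred = infer_mappings(source_header, source_rows)
```

Columns with no recognizer hits, such as `Make` in this toy setup, simply stay unmapped; a fuller ontology would supply recognizers for them as well.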

Experimental Results

Car Advertisement application domain

10 “training” tables
  100% of the 57 mappings (no false mappings)
  94.6% precision of the values in linked pages (5.4% false declarations)

50 test tables
  94.7% of the 300 mappings (no false mappings)
  On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision

Other Applications

Cell Phone Plan application domain
Soccer Player application domain

Contribution

Provides an approach to extract information automatically from HTML tables
Suggests a different way to solve the problem of schema matching
