Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure
David W. Embley, Cui Tao, Stephen W. Liddle, Brigham Young University. Funded by NSF.
![Page 1: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/1.jpg)
ER 2002, BYU Data Extraction Group
Automatically Extracting Ontologically Specified Data
from HTML Tables with Unknown Structure
David W. Embley, Cui Tao, Stephen W. Liddle
Brigham Young University
Funded by NSF
![Page 2: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/2.jpg)
Information Exchange: Source, Target
Information Extraction (leverage this …)
Schema Matching (… to do this)
![Page 3: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/3.jpg)
Information Extraction
![Page 4: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/4.jpg)
Extracting Pertinent Information from Documents
![Page 5: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/5.jpg)
A Conceptual-Modeling Solution
[Figure: conceptual model centered on Car, with "has" relationships to Year, Make, Model, Mileage, Price, and Feature, a "PhoneNr is for Car" relationship, and a "PhoneNr has Extension" relationship; the connections carry participation constraints such as 0..1, 0..*, and 1..*.]
![Page 6: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/6.jpg)
Car-Ads Ontology
Car [->object];
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
Year matches [4]
    constant { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d[^\d]"; substitute "^" -> "19"; },
    …
…
End;
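As a rough illustration of how the Year rule above might be applied, the following Python sketch runs the context expression first, pulls the two digits out of each context match, and prepends "19". The function name `extract_years` and the two-pass evaluation order are assumptions made for illustration; the slide does not show how the extraction engine actually evaluates these rules.

```python
import re

# Regular expressions copied from the Year rule in the Car-Ads ontology.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Return four-digit years for two-digit years in text (hypothetical helper)."""
    years = []
    for context_match in CONTEXT.finditer(text):
        # Look for the value to extract only inside the matched context.
        digits = EXTRACT.search(context_match.group(0))
        if digits:
            # substitute "^" -> "19": prepend the century to the extracted digits.
            years.append("19" + digits.group(0))
    return years


print(extract_years("'89 Subaru SW $1900"))
```

The context expression is what keeps the price `$1900` from contributing a year here: neither `19` nor `90` satisfies it, so only `'89` is picked up.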
![Page 7: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/7.jpg)
Recognition and Extraction

| Car | Year | Make | Model | Mileage | Price | PhoneNr |
|---|---|---|---|---|---|---|
| 0001 | 1989 | Subaru | SW | | $1900 | (336)835-8597 |
| 0002 | 1998 | | Elantra | | | (336)526-5444 |
| 0003 | 1994 | HONDA | ACCORD EX | 100K | | (336)526-1081 |

| Car | Feature |
|---|---|
| 0001 | Auto |
| 0001 | AC |
| 0002 | Black |
| 0002 | 4 door |
| 0002 | tinted windows |
| 0002 | Auto |
| 0002 | pb |
| 0002 | ps |
| 0002 | cruise |
| 0002 | am/fm |
| 0002 | cassette stereo |
| 0002 | a/c |
| 0003 | Auto |
| 0003 | jade green |
| 0003 | gold |
![Page 8: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/8.jpg)
Schema Matching for HTML Tables with Unknown Structure
![Page 9: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/9.jpg)
Table-Schema Matching (Basic Idea)
• Many Tables on the Web
• Ontology-Based Extraction
  – Works well for unstructured or semistructured data
  – What about structured data (tables)?
• Method
  – Form attribute-value pairs
  – Do extraction
  – Infer mappings from extraction patterns
![Page 10: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/10.jpg)
Problem: Different Schemas
Target Database Schema
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Different Source Table Schemas
– {Run #, Yr, Make, Model, Tran, Color, Dr}
– {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
– {Vehicle, Distance, Price, Mileage}
– {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
![Page 11: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/11.jpg)
Problem: Attribute is Value
![Page 12: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/12.jpg)
Problem: Attribute-Value is Value
![Page 13: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/13.jpg)
Problem: Value is not Value
![Page 14: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/14.jpg)
Problem: Implied Values
![Page 15: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/15.jpg)
Problem: Missing Attributes
![Page 16: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/16.jpg)
Problem: Compound Attributes
![Page 17: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/17.jpg)
Problem: Factored Values
![Page 18: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/18.jpg)
Problem: Split Values
![Page 19: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/19.jpg)
Problem: Merged Values
![Page 20: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/20.jpg)
Problem: Values not of Interest
![Page 21: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/21.jpg)
Problem: Information Behind Links
• Single-Column Table (formatted as list)
• Table extending over several pages
![Page 22: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/22.jpg)
Solution
• Form attribute-value pairs (adjust if necessary)
• Do extraction
• Infer mappings from extraction patterns
![Page 23: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/23.jpg)
Solution: Remove Internal Factoring
Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*
Unnest: μ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
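The unnesting idea can be sketched in Python as one flattening step per nesting level. The helper `unnest`, the keys `models` and `rows`, and most of the sample values are hypothetical; only the nesting pattern Make, (Model, (Year, Colour, Price, …)*)* comes from the slide, and the sketch does not claim to reproduce the order in which the μ operators are applied.

```python
def unnest(records, nested_key):
    """Flatten one nesting level: emit one record per element of record[nested_key]."""
    flat = []
    for record in records:
        # Attributes factored out at this level are copied onto every inner record.
        outer = {k: v for k, v in record.items() if k != nested_key}
        for inner in record[nested_key]:
            flat.append({**outer, **inner})
    return flat


# Hypothetical nested table following Make, (Model, (Year, Colour, Price)*)*
nested_table = [
    {"Make": "Honda", "models": [
        {"Model": "Civic EX", "rows": [
            {"Year": "1995", "Colour": "White", "Price": "$6300"},
        ]},
        {"Model": "Accord", "rows": [
            {"Year": "1994", "Colour": "Green", "Price": "$5500"},
            {"Year": "1996", "Colour": "Black", "Price": "$8900"},
        ]},
    ]},
]

# Two unnest steps, one for each level of factoring.
flat_table = unnest(unnest(nested_table, "models"), "rows")
for row in flat_table:
    print(row)
```

After both steps every row carries its own Make and Model, so each row can be treated as one car.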
![Page 24: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/24.jpg)
Solution: Replace Boolean Values
β Auto [Yes, ] β Air Cond [Yes, ] β AM/FM [Yes, ] β CD [Yes, ] Table
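One way to picture the β operator is as a per-column rewrite that turns a "Yes" into the column's own name so that the value itself becomes recognizable as a feature. The sketch below assumes that reading; the function `replace_boolean`, the "Yes"/blank encoding, and the sample row are illustrative assumptions rather than the authors' implementation.

```python
def replace_boolean(rows, attribute, true_value="Yes"):
    """Replace true_value in one column with the column name; blank out anything else."""
    result = []
    for row in rows:
        updated = dict(row)
        updated[attribute] = attribute if row.get(attribute) == true_value else ""
        result.append(updated)
    return result


# Hypothetical source row; the "Yes"/blank encoding is an assumption.
table = [{"Make": "Honda", "Model": "Civic EX", "Auto": "Yes",
          "Air Cond.": "Yes", "AM/FM": "Yes", "CD": ""}]

# Apply the rewrite once per Boolean column, mirroring the chain of β operators.
for column in ("Auto", "Air Cond.", "AM/FM", "CD"):
    table = replace_boolean(table, column)

print(table[0])
```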
![Page 25: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/25.jpg)
Solution: Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>, <CD, >
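Forming the pairs amounts to lining up each column heading with the cell in the same position. A minimal Python sketch, using the row shown on this slide and a hypothetical helper named `form_pairs`:

```python
def form_pairs(header, row):
    """Pair each column heading with the value in the same position of the row."""
    return list(zip(header, row))


header = ["Make", "Model", "Year", "Colour", "Price", "Auto", "Air Cond.", "AM/FM", "CD"]
# Boolean columns already hold the column name (or nothing) at this point.
row = ["Honda", "Civic EX", "1995", "White", "$6300", "Auto", "Air Cond.", "AM/FM", ""]

pairs = form_pairs(header, row)
print(pairs)
```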
![Page 26: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/26.jpg)
Solution: Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>
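Comparing the pairs before and after adjustment suggests two rules: a pair with an empty value is dropped, and a pair whose value merely repeats its attribute collapses to a single element. The Python sketch below encodes that reading; `adjust_pairs` is a hypothetical name and the two rules are inferred from the example on the slides, not stated there.

```python
def adjust_pairs(pairs):
    """Drop empty-valued pairs and collapse <A, A> to <A> (rules inferred from the slides)."""
    adjusted = []
    for attribute, value in pairs:
        if value == "":
            continue  # e.g. <CD, > carries no information
        if value == attribute:
            adjusted.append((attribute,))  # e.g. <Auto, Auto> becomes <Auto>
        else:
            adjusted.append((attribute, value))
    return adjusted


pairs = [("Make", "Honda"), ("Model", "Civic EX"), ("Year", "1995"),
         ("Colour", "White"), ("Price", "$6300"), ("Auto", "Auto"),
         ("Air Cond.", "Air Cond."), ("AM/FM", "AM/FM"), ("CD", "")]

adjusted = adjust_pairs(pairs)
print(adjusted)
```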
![Page 27: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/27.jpg)
Solution: Do Extraction
![Page 28: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/28.jpg)
Solution: Infer Mappings
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Each row is a car.
πModel μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
πMake μ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
πYear Table
Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g. for each Car).
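The note about OIDs can be made concrete with a small Python sketch: each mapping yields a set of values keyed by row OID, and records are assembled by grouping on that OID. The functions `project` and `join_on_oid`, the OID strings, and the sample rows are hypothetical and stand in for whatever the actual system uses.

```python
def project(table, source_attribute):
    """Return {oid: value} for one source column, skipping empty cells."""
    return {oid: row[source_attribute]
            for oid, row in table.items() if row.get(source_attribute)}


def join_on_oid(attribute_sets):
    """Combine per-attribute {oid: value} sets into one record per OID."""
    records = {}
    for target_attribute, values in attribute_sets.items():
        for oid, value in values.items():
            records.setdefault(oid, {})[target_attribute] = value
    return records


# Hypothetical flattened source table keyed by row OID (one row per car).
table = {
    "0001": {"Make": "Honda", "Model": "Civic EX", "Year": "1995", "Price": "$6300"},
    "0002": {"Make": "Honda", "Model": "Accord", "Year": "1994", "Price": ""},
}

attribute_sets = {attr: project(table, attr)
                  for attr in ("Make", "Model", "Year", "Price")}
cars = join_on_oid(attribute_sets)
print(cars)
```

Because every projected value keeps its row OID, no value matching is needed to reassemble a car record; a missing value simply leaves that attribute absent.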
![Page 29: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/29.jpg)
Solution: Do Extraction
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
πModel μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
![Page 30: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/30.jpg)
Solution: Do Extraction
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
πPrice Table
![Page 31: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/31.jpg)
Solution: Do Extraction
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
ρ Colour←Feature π Colour Table
∪ ρ Auto←Feature π Auto β Auto [Yes, ] Table
∪ ρ Air Cond.←Feature π Air Cond. β Air Cond. [Yes, ] Table
∪ ρ AM/FM←Feature π AM/FM β AM/FM [Yes, ] Table
∪ ρ CD←Feature π CD β CD [Yes, ] Table
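The Feature mapping is a union over several source columns, each projected and renamed to Feature. A minimal Python sketch of that union follows, assuming the Boolean columns have already been rewritten to hold the column name or an empty string; `rename_to_feature`, the OIDs, and the sample rows are hypothetical.

```python
def rename_to_feature(table, source_attribute):
    """Project one column and treat it as Feature, returning (oid, feature) pairs."""
    return {(oid, row[source_attribute])
            for oid, row in table.items() if row.get(source_attribute)}


# Hypothetical rows after Boolean replacement, keyed by row OID.
table = {
    "0001": {"Colour": "White", "Auto": "Auto", "Air Cond.": "Air Cond.",
             "AM/FM": "AM/FM", "CD": ""},
    "0002": {"Colour": "Red", "Auto": "", "Air Cond.": "Air Cond.",
             "AM/FM": "", "CD": "CD"},
}

# Union of the per-column projections, one term per source column.
features = set()
for column in ("Colour", "Auto", "Air Cond.", "AM/FM", "CD"):
    features |= rename_to_feature(table, column)

for oid, feature in sorted(features):
    print(oid, feature)
```

The result has the shape of the target relation {Car, Feature}: one tuple per car and feature, with empty cells contributing nothing.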
![Page 32: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/32.jpg)
Experiment
• Tables from 60 sites
• 10 “training” tables
• 50 test tables
• 357 mappings (from all 60 sites)
  – 172 direct mappings (same attribute and meaning)
  – 185 indirect mappings (29 attribute synonyms, 5 “Yes/No” columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)
![Page 33: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/33.jpg)
Results
• 10 “training” tables
  – 100% of the 57 mappings (no false mappings)
  – 94.6% of the values in linked pages (5.4% false declarations)
• 50 test tables
  – 94.7% of the 300 mappings (no false mappings)
  – On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision
• 16 missed mappings
  – 4 partial (not all unions included)
  – 6 non-U.S. car-ads (unrecognized makes and models)
  – 2 U.S. unrecognized makes and models
  – 3 prices (missing $ or found MSRP instead)
  – 1 mileage (mileages less than 1,000)
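The reported figures can be cross-checked with a few lines of Python. The sketch assumes that all 16 missed mappings come from the 50 test tables (the training tables report 100%) and that overall recall is computed over the training and test mappings combined; both are assumptions, and the variable names are illustrative.

```python
# Breakdown of missed mappings as listed on the slide.
missed = {
    "partial unions": 4,
    "non-U.S. makes and models": 6,
    "U.S. makes and models": 2,
    "prices": 3,
    "mileage": 1,
}
total_missed = sum(missed.values())

training_mappings = 57
test_mappings = 300
total_mappings = training_mappings + test_mappings

# Recall on the test tables, assuming every miss is a test-table mapping.
test_recall = (test_mappings - total_missed) / test_mappings
# Recall over training and test mappings together.
overall_recall = (total_mappings - total_missed) / total_mappings

print(f"missed: {total_missed}, total mappings: {total_mappings}")
print(f"test recall: {test_recall:.1%}, overall recall: {overall_recall:.1%}")
```

Under these assumptions the breakdown sums to the 16 missed mappings, the test-table recall matches the reported 94.7%, and the training and test mappings add up to the 357 mappings of the experiment.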
![Page 34: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure](https://reader035.vdocument.in/reader035/viewer/2022062423/56814eab550346895dbc5836/html5/thumbnails/34.jpg)
Conclusions
• Summary
  – Transformed schema-matching problem to extraction
  – Inferred semantic mappings
  – Discovered source-to-target mapping rules
• Evidence of Success
  – Tables (mappings): 95% (Recall); 100% (Precision)
  – Linked Text (value extraction): ~97% (Recall); ~86% (Precision)
• Future Work
  – Discover and exploit structure in linked text
  – Broaden table understanding
  – Integrate with current extraction tools
www.deg.byu.edu