From Tables To Frames

Aleksander Pivk 1,2, Philipp Cimiano 2, York Sure 2

1 Jozef Stefan Institute, Ljubljana, Slovenia
2 AIFB Institute, University of Karlsruhe, Karlsruhe

09.11.2004

The Third International Semantic Web Conference - ISWC 2004
November 07 – 11, 2004, Hiroshima, Japan


From Tables To Frames - ISWC 2004, Hiroshima, Japan

Outline

Motivation
Foundation: Table Model
Methodology
Evaluation
Conclusion
Future Work

Motivation

problem: well-known annotation bottleneck
solution: automatic metadata generation
goal: describe the semantics of tables in a model-theoretic way (F-Logic)
  tables with different structure but the same meaning should have the same representation
benefit: enables e.g. query answering
  all conferences where ‘prof. Studer’ is in the PC
  all tours to COUNTRY at DATE where price < AMOUNT

Foundation: Table Model

dimensions of the table model [Hurst’00]
  graphical (image processing)
  physical (inter-cell relative location)
  structural (organization of cells indicating their navigational relationship)
  functional (purpose of regions in terms of data access)
    two functional cell types: A-cell and I-cell
    two functional I-cell roles: data and access
  semantic (relation between cell content, structure and orientation)

a frame makes explicit
  the meaning of the cell contents (F-Logic concepts)
  the functional dimension of the table (method signature)
  the semantic dimension of the table (frame structure)

example:

Table model

LEGEND: A-cell, I-cell (access), I-cell (data)

Person  Type           Economic  Extended
Adult   Single Room    35.450    2.510
        Double Room    32.500    1.430
        Extra Bed      30.550    720
Child   Occupation     25.800    1.430
        No occupation  23.850    720
        Extra Bed      22.900    360

Simple Table Classes

1-Dimensional
[example table: columns Trip Code, Trip Duration (in days), Cost, Insurance; six rows, trip codes XM001, LM223, LM311, LM209, LM208, LM202]

2-Dimensional
[example table: flight status with columns Departure and Arrival and rows City, Scheduled, Actual, Gate/Terminal, Baggage Claim; cities Dallas/Ft Worth, TX (DFW) and Honolulu, HI (HNL)]

Complex Table Classes

1. Over-expanded labels
[example table: header Class/Extension, Economic, Extended; label PRICE; Adult rows Single Room, Double Room, Extra Bed; Child rows Occupation, No Occupation, Extra Bed]

2. Partition labels
[example table: Rate (%) of Regular and Float fixed deposits for terms from 3 Months to 3 Years]

3. Combination – running example

Methodology

the methodology instantiates the table model stepwise
main differences:
  do not consider the graphical component
  extend the semantic component

Steps of Methodology           Table Model
Input: HTML
1 Cleaning & Normalization     Physical
2 Structure Detection          Structure
3 Building of FTM              Function
4 Semantic Enriching of FTM    Semantic
Output: Frame

Cleaning & Normalization

construct an initial matrix structure from the DOM tree
cleaning: syntactic errors (CyberNeko HTML parser)
normalization: aligning the table and expanding cells that span multiple rows/columns (colspan, rowspan)

example:

original table (with spanning cells)

Tour Code             DP9LAX01AB
Valid                 01.05.04-31.09.04
Class/Price           Economic  Extended
Adult  Single Room    35.450    2.510
       Double Room    32.500    1.430
       Extra Bed      30.550    720
Child  Occupation     25.800    1.430
       No occupation  23.850    720
       Extra Bed      22.900    360

normalized table (matrix)

Tour Code    Tour Code      DP9LAX01AB    DP9LAX…
Valid        Valid          01.05.04-31…  01.05.04…
Class/Price  Class/Price    Economic      Extended
Adult        Single Room    35.450        2.510
Adult        Double Room    32.500        1.430
Adult        Extra Bed      30.550        720
Child        Occupation     25.800        1.430
Child        No occupation  23.850        720
Child        Extra Bed      22.900        360
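The normalization step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the input format (each cell given as a hypothetical `(text, rowspan, colspan)` triple) and the function name `expand_spans` are assumptions made for illustration, and HTML parsing is left out entirely. The sketch copies the text of every spanning cell into each matrix position it covers, which is what the normalized example table shows.

```python
def expand_spans(rows):
    """Expand spanning cells into a full matrix.

    rows: list of table rows; each cell is a (text, rowspan, colspan) triple.
    This input format is an assumption, not taken from the slides.
    """
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the cell text into every position the cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# A fragment of the running example, with spans written out by hand.
rows = [
    [("Tour Code", 1, 2), ("DP9LAX01AB", 1, 2)],
    [("Class/Price", 1, 2), ("Economic", 1, 1), ("Extended", 1, 1)],
    [("Adult", 2, 1), ("Single Room", 1, 1), ("35.450", 1, 1), ("2.510", 1, 1)],
    [("Double Room", 1, 1), ("32.500", 1, 1), ("1.430", 1, 1)],
]
matrix = expand_spans(rows)
```

Every row of `matrix` has four entries, so later steps can treat the table as a plain rectangular grid.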

Structure Detection

detecting table orientation:
  rely on similarity of cells (size, content, token types)
  intuition:
    if rows are similar, then orientation is vertical (top-to-bottom)
    if columns are similar, then orientation is horizontal (left-to-right)
initialize logical units and regions
  split table into LUs
  group same-sized, similar cells into regions within LUs
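The orientation intuition can be sketched in Python as follows. The slides do not specify the similarity measure, so this sketch substitutes a deliberately crude one: two cells count as similar when they share a coarse token type. The token categories and all function names are assumptions for illustration, and the input is assumed to be a rectangular matrix with at least two rows and two columns.

```python
import re


def token_type(text):
    """Coarse token typing; the three categories are assumptions."""
    if re.fullmatch(r"[\d.,]+", text):
        return "NUM"
    if re.fullmatch(r"[A-Za-z ]+", text):
        return "ALPHA"
    return "MIXED"


def similarity(cells_a, cells_b):
    """Fraction of positions at which both lines hold the same token type."""
    same = sum(token_type(a) == token_type(b) for a, b in zip(cells_a, cells_b))
    return same / len(cells_a)


def avg_adjacent_similarity(lines):
    """Average similarity of neighbouring lines (rows or columns)."""
    pairs = list(zip(lines, lines[1:]))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def detect_orientation(matrix):
    """Return 'vertical' if rows resemble each other more than columns do."""
    rows = matrix
    cols = [list(col) for col in zip(*matrix)]
    row_sim = avg_adjacent_similarity(rows)
    col_sim = avg_adjacent_similarity(cols)
    return "vertical" if row_sim >= col_sim else "horizontal"


body = [
    ["Adult", "Single Room", "35.450", "2.510"],
    ["Adult", "Double Room", "32.500", "1.430"],
    ["Child", "Occupation", "25.800", "1.430"],
]
transposed = [list(col) for col in zip(*body)]
```

In `body` every row has the same type pattern (text, text, number, number), so rows are more similar to each other than columns are and the sketch reports a vertical orientation; for `transposed` it reports horizontal.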

Discovery of Regions

[figure: normalized running-example table]

Tour Code    Tour Code      DP9LAX01AB           DP9LAX01AB
Valid        Valid          01.05.04 - 30.09.04  01.05.04 - 30.09.04
Class/Price  Class/Price    Economic             Extended
Adult        Single Room    35.450               2.510
Adult        Double Room    32.500               1.430
Adult        Extra Bed      30.550               720
Child        Occupation     25.800               1.430
Child        No Occupation  /                    720
Child        Extra Bed      22.900               360

Discovery of Regions

do while (distribution in LU not uniform)
  (uniformity: a logical unit consists of logical sub-units, where each sub-unit includes only regions of the same size and orientation)
  choose the best coherent region
    used to propagate and normalize the neighboring regions
  normalize logical sub-unit
    choose neighboring regions (i.e. only within the same rows for vertical orientation)

example: [figure: step-by-step region discovery on the normalized running-example table]
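To make the notion of a non-uniform logical unit more concrete, here is a small Python sketch of initial region formation for a vertically oriented table. It only groups vertically adjacent cells of the same coarse token type; the propagation and normalization loop described above is not implemented. The token categories and the function names are assumptions for illustration.

```python
import re
from itertools import groupby


def token_type(text):
    """Coarse token typing; the three categories are assumptions."""
    if re.fullmatch(r"[\d.,]+", text):
        return "NUM"
    if re.fullmatch(r"[A-Za-z ]+", text):
        return "ALPHA"
    return "OTHER"


def column_regions(column):
    """Split one column into runs of adjacent cells sharing a token type.

    Returns a list of (token type, first row index, last row index, cells).
    """
    regions = []
    start = 0
    for ttype, run in groupby(column, key=token_type):
        cells = list(run)
        regions.append((ttype, start, start + len(cells) - 1, cells))
        start += len(cells)
    return regions


# The two price columns of the running example (data rows only).
economic = ["35.450", "32.500", "30.550", "25.800", "/", "22.900"]
extended = ["2.510", "1.430", "720", "1.430", "720", "360"]
```

The `extended` column yields a single region, while the `/` cell splits `economic` into three regions of different sizes. A situation like this is where the best coherent region (here the neighbouring column) would be used to normalize its neighbours.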

Building FTM

functional table model (FTM)
  regions as nodes arranged in a tree
  properties of leaf nodes:
    are only regions consisting exclusively of I-cells
    are assigned their functional role (access, data)
    are assigned two semantic labels:
      label describing the content of the region (instances)
      label as a combination of a region label and parent A-cell node labels
  inner nodes are either regions consisting of A-cells or ‘connection’ nodes (e.g. root)
construction of FTM
  bottom-up approach (from lowest logical unit upwards)
  description through an example
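One possible way to represent such a tree in Python is sketched below. The class name `FTMNode`, its fields and the `kind` values are assumptions chosen for illustration; the slides describe the model only in prose and figures.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FTMNode:
    kind: str                      # "leaf", "inner" (A-cell region) or "connection"
    label: Optional[str] = None    # semantic label; None stands for an unset <label>
    role: Optional[str] = None     # "access" or "data"; meaningful for leaves only
    values: List[str] = field(default_factory=list)          # I-cell contents of a leaf
    children: List["FTMNode"] = field(default_factory=list)  # sub-nodes of inner nodes

    def leaves(self) -> List["FTMNode"]:
        """Collect all leaf nodes below this node, left to right."""
        if self.kind == "leaf":
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found


# A fragment of the running example: two A-cell regions, each with one leaf.
root = FTMNode("connection", label="Root", children=[
    FTMNode("inner", label="Tour Code", children=[
        FTMNode("leaf", role="data", values=["DP9LAX01AB"]),
    ]),
    FTMNode("inner", label="Valid", children=[
        FTMNode("leaf", role="data", values=["01.05.2004 - 30.09.2004"]),
    ]),
])
```

Keeping `label` and `role` optional mirrors the default values that leaves receive before the later steps fill them in.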

Building FTM

type of the (colored) logical unit = I-cells only
  regions are turned into leaves
  semantic labels and roles are set to a default value

[figure: four leaves created from the colored logical unit of the running-example table]
  leaf [<label>, <role>]: Adult; Adult; Adult; Child; Child; Child
  leaf [<label>, <role>]: Single Room; Double Room; Extra Bed; Occupation; No Occupat…; Extra Bed
  leaf [<label>, <role>]: 35,450; 32,500; 30,550; 25,800; /; 22,900
  leaf [<label>, <role>]: 2,510; 1,430; 720; 1,430; 720; 360

Building FTM

type of the (colored) logical unit = A-cells only
  regions turned into inner nodes and connected to appropriate sub-nodes (leaves)

[figure: the four leaves from the previous step, now connected to the inner nodes Class/Price, Economic and Extended]

Building FTM

type of the (colored) logical unit = special case
  close a subtree by inserting a ‘connection’ node, which reflects a logical separation in the table (transition from a LU with only A-cells to a LU with I-cells)
  assign functional roles to leaves within a connected sub-tree:
    functional role access is assigned to all consecutive leaves (from left) that together form a unique identifier (key)
    other leaves are assigned functional role data
  (possible) change of reading orientation in the new logical unit

[figure: sub-tree closed by a Connection Node above the inner nodes Class/Price, Economic and Extended]
  leaf [<label>, access]: Adult; Adult; Adult; Child; Child; Child
  leaf [<label>, access]: Single Room; Double Room; Extra Bed; Occupation; No Occupat…; Extra Bed
  leaf [<label>, data]: 35,450; 32,500; 30,550; 25,800; /; 22,900
  leaf [<label>, data]: 2,510; 1,430; 720; 1,430; 720; 360
  new leaves from the colored logical unit:
    leaf [<label>, <role>]: DP9LAX01AB
    leaf [<label>, <role>]: 01.05.2004 - 30.09.2004
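The role assignment rule for a multi-row sub-tree can be sketched in Python as below. The function name and the input format (one list of values per leaf, ordered left to right) are assumptions for illustration. The sketch reads "unique identifier" as "no two rows share the same combination of values", which is an interpretation rather than something the slides spell out, and it does not model single-valued leaves such as the tour code.

```python
def assign_roles(leaves):
    """Assign 'access' to the leftmost leaves that together form a key.

    leaves: list of value lists, one per leaf, left to right; all lists are
    assumed to have the same length (one value per table row).
    """
    n_rows = len(leaves[0])
    roles = []
    key_found = False
    for i in range(len(leaves)):
        if key_found:
            roles.append("data")
            continue
        roles.append("access")
        # Combine the values of leaves 0..i row by row and test uniqueness.
        prefix_rows = list(zip(*leaves[: i + 1]))
        if len(set(prefix_rows)) == n_rows:
            key_found = True
    return roles


leaves = [
    ["Adult", "Adult", "Adult", "Child", "Child", "Child"],
    ["Single Room", "Double Room", "Extra Bed", "Occupation", "No Occupation", "Extra Bed"],
    ["35,450", "32,500", "30,550", "25,800", "/", "22,900"],
    ["2,510", "1,430", "720", "1,430", "720", "360"],
]
roles = assign_roles(leaves)
```

The first leaf alone does not identify a row, but together with the second it does, so the two leftmost leaves become access leaves and the two price leaves become data leaves, as in the figure.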

Building FTM

type of the (colored) logical unit = A-cells only
  regions turned into inner nodes and connected to appropriate sub-nodes (leaves)
finally, connect all unconnected nodes to a root node

[figure: complete FTM]
  Root
    Tour Code: leaf [<label>, data]: DP9LAX01AB
    Valid: leaf [<label>, data]: 01.05.2004 - 30.09.2004
    Connection Node: inner nodes Class/Price, Economic, Extended with leaves [<label>, access], [<label>, access], [<label>, data], [<label>, data]

Building FTM

recapitulation of FTM:
  consider multiple-level sub-trees for merging
  conditions: same tree structure and at least one level of matching A-cells
  merging step:
    merge nodes at the same position and level (leaf and inner nodes)
    if merged inner nodes (A-cells) are not equal
      find a semantic label of a new merged node
      create a new leaf node (with A-cells as values)
      assign functional role of the new leaf to access

example:

before merging
  Root
    Morning
      Lecturer: leaf [data]: v1, v2
      Room: leaf [data]: v3, v4
    Evening
      Lecturer: leaf [data]: v5, v6
      Room: leaf [data]: v7, v8

after merging
  Root
    Time Period
      Lecturer: leaf [data]: v1, v5, v2, v6
      Room: leaf [data]: v3, v7, v4, v8
      leaf [Time Period, access]: Morning, Evening
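The merging step in the Morning/Evening example can be sketched in Python as follows. The nested-dictionary representation and the function name `merge_subtrees` are assumptions for illustration, the structural check is reduced to comparing child labels, and the label of the merged node (`Time Period`) is passed in by hand because finding it belongs to the semantic enrichment step.

```python
def merge_subtrees(subtrees, merged_label):
    """Merge sibling sub-trees that have the same structure.

    subtrees: dict mapping an A-cell label to {child label: list of values}.
    Returns the merged children and a new access leaf holding the A-cell labels.
    """
    names = list(subtrees)
    child_labels = list(subtrees[names[0]])
    # Simplified merging condition: every sub-tree has the same children.
    if any(list(subtrees[name]) != child_labels for name in names):
        raise ValueError("sub-trees differ in structure")
    merged = {}
    for child in child_labels:
        columns = [subtrees[name][child] for name in names]
        # Interleave values position by position: v1, v5, v2, v6, ...
        merged[child] = [value for group in zip(*columns) for value in group]
    # The differing A-cell labels become the values of a new access leaf.
    access_leaf = {"label": merged_label, "role": "access", "values": names}
    return merged, access_leaf


subtrees = {
    "Morning": {"Lecturer": ["v1", "v2"], "Room": ["v3", "v4"]},
    "Evening": {"Lecturer": ["v5", "v6"], "Room": ["v7", "v8"]},
}
merged, access_leaf = merge_subtrees(subtrees, "Time Period")
```

The interleaved order of the merged values follows the order shown in the example figure.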

Building FTM

[figure: merging applied to the running example]

before merging: Connection Node with inner nodes Class/Price, Economic, Extended
  leaf [<label>, access]: Adult; Adult; Adult; Child; Child; Child
  leaf [<label>, access]: Single Room; Double Room; Extra Bed; Occupation; No Occupat…; Extra Bed
  leaf [<label>, data]: 35,450; 32,500; 30,550; 25,800; /; 22,900
  leaf [<label>, data]: 2,510; 1,430; 720; 1,430; 720; 360

after merging: Connection Node with inner nodes Class, Price
  leaf [<label>, access]: Adult; Adult; Adult; Child; Child; Child
  leaf [<label>, access]: Single Room; Double Room; Extra Bed; Occupation; No Occupat…; Extra Bed
  leaf [<label>, data]: 35,450; 32,500; 30,550; 25,800; /; 22,900; 2,510; 1,430; 720; 1,430; 720; 360
  leaf [<label>, access]: Economic; Extended

Semantic Enriching of FTM

find semantic labels for regions by consulting:
  WordNet lexical ontology: use synsets to find hypernyms
  GoogleSets service: additional way to find synonyms
transformations of a region’s cell labels:
  punctuation removal
  stopword removal
  compute IDF (document is a cell) for each word, and filter out the ones with a value lower than the threshold
  select words that appear at the end of the labels (the nominal head of a nominal compound is at the end)
  query GoogleSets with the remaining words to filter out the ones that are not mutually similar
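The label transformations can be sketched in Python as follows. The stopword list, the default threshold and the function name are assumptions for illustration, IDF is computed as log(N / df) with each cell treated as one document, and the WordNet and GoogleSets lookups are left out, so the result is only the list of candidate head words that would be passed on to them.

```python
import math
import string

STOPWORDS = {"a", "an", "the", "of", "in", "and", "or"}  # assumed, not from the slides


def candidate_words(cell_labels, idf_threshold=0.0):
    """Return one candidate head word per cell label (where any word survives)."""
    # Replace punctuation with spaces so that "Class/Price" becomes two words.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cells = []
    for label in cell_labels:
        words = label.translate(table).lower().split()
        cells.append([w for w in words if w not in STOPWORDS])
    # Document frequency: in how many cells does each word occur?
    n_cells = len(cells)
    df = {}
    for words in cells:
        for w in set(words):
            df[w] = df.get(w, 0) + 1
    idf = {w: math.log(n_cells / count) for w, count in df.items()}
    candidates = []
    for words in cells:
        kept = [w for w in words if idf[w] >= idf_threshold]
        if kept:
            # The nominal head of a compound is taken to be its last word.
            candidates.append(kept[-1])
    return candidates


labels = ["Single Room", "Double Room", "Extra Bed"]
heads = candidate_words(labels)
```

With the default threshold of 0.0 nothing is filtered and the last word of each label is kept. A higher threshold removes words that occur in many cells, as the slide describes, and then the last remaining word of each label is used.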

Semantic Enriching of FTM

assign each leaf its semantic label that describes the content (instances) of the region

[figure: FTM with content labels assigned to leaves]
  Root
    Tour Code: leaf [<label>, data]: DP9LAX01AB
    Valid: leaf [Date, data]: 01.05.2004 - 30.09.2004
    Connection Node with inner nodes Class, Price
      leaf [Person, access]: Adult; Adult; Adult; Child; Child; Child
      leaf [Room, access]: Single Room; Double Room; Extra Bed; Occupation; No Occupat…; Extra Bed
      leaf [<label>, data]: 35,450; 32,500; 30,550; 25,800; /; 22,900; 2,510; 1,430; 720; 1,430; 720; 360
      leaf [Type, access]: Economic; Extended

Final FTM

(final) semantic labels of leaves:
  label is a combination of a region label and parent A-cell node labels

[figure: final FTM; each leaf shows its content label and its final label]
  Root
    Tour Code: leaf [<label>, data], final label Code: DP9LAX01AB
    Valid: leaf [Date, data], final label DateValid: 01.05.2004 - 30.09.2004
    Connection Node with inner nodes Class, Price
      leaf [Person, access], final label PersonClass: Adult; Adult; Adult; Child; Child; Child
      leaf [Room, access], final label RoomClass: Single Room; Double Room; Extra Bed; Occupation; No Occupat…; Extra Bed
      leaf [<label>, data], final label Price: 35,450; 32,500; 30,550; 25,800; /; 22,900; 2,510; 1,430; 720; 1,430; 720; 360
      leaf [Type, access], final label TypePrice: Economic; Extended

Map FTM to a Frame

a method is a tuple (name, range, set of parameters)
a frame is a pair (name, set of methods)
generation of a frame
  create a method m for every leaf node whose functional role is data
  parameters of m are all leaf nodes with functional role access that are located on the same level of m’s sub-tree or on m’s parent path towards the root node
  set the range of m according to the syntactic token type of its region
  names for parameters and methods are obtained from the final FTM

example:

Tour [ Code => ALPHANUMERIC;
       DateValid => DATE;
       Price (PersonClass, RoomClass, TypePrice) => LARGE_NUMBER ].
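The frame generation rules can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the tree is flattened into a list of sub-trees, every access leaf in a sub-tree is treated as a parameter of every data leaf in the same sub-tree (a coarser rule than the level and parent-path condition above), and the function name, the tuple format and the frame name passed in are assumptions.

```python
def build_frame(name, subtrees):
    """Render an F-Logic style frame signature as a string.

    subtrees: list of sub-trees; each sub-tree is a list of leaves given as
    (final label, functional role, range) with range None for access leaves.
    """
    methods = []
    for leaves in subtrees:
        # Access leaves of the sub-tree become the parameters of its methods.
        params = [label for label, role, _ in leaves if role == "access"]
        for label, role, value_range in leaves:
            if role != "data":
                continue
            signature = f"{label} ({', '.join(params)})" if params else label
            methods.append(f"{signature} => {value_range}")
    return f"{name} [ {'; '.join(methods)} ]."


final_ftm = [
    [("Code", "data", "ALPHANUMERIC")],
    [("DateValid", "data", "DATE")],
    [
        ("PersonClass", "access", None),
        ("RoomClass", "access", None),
        ("Price", "data", "LARGE_NUMBER"),
        ("TypePrice", "access", None),
    ],
]
frame = build_frame("Tour", final_ftm)
```

Applied to the labels and roles of the final FTM, the sketch reproduces the frame from the example up to whitespace.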

Evaluation

task:
  for each table compare the automatically generated frame against two manually created frames
  measure in terms of Precision, Recall and F-measure
dataset:
  consists of 21 tables: 3 tables for each simple table class (1D, 2D) and 5 tables for each complex table class
  tourism domain
annotators:
  14 subjects
  each subject had to annotate 3 tables, each belonging to a different table class (14x3 = 21x2 = 42)

Evaluation

performed along the following 4 functions:
  example: [m1 (X, Y) => INTEGER] vs. [method1 (X, YY, W) => INTEGER]
  syntactic correctness:
    how well the functional dimension of the table is captured (SynC=2/3)
  strict comparison:
    calculate how identical the nameM, rangeM, and PM identifiers of methods are (P=2/4, R=2/5)
  soft comparison:
    for soft matching we used a combination of TFIDF and the Jaro-Winkler string distance scheme [Cohen et al., 2003]
    calculate soft matching for identifiers of methods (P=3/4, R=3/5, where ‘Y’≈‘YY’)
  conceptual comparison:
    conceptually equivalent identifiers have been determined (e.g. ‘RegionType’=‘Region’=‘Location’)
    calculate conceptual matching for identifiers of methods (P=4/4, R=4/5, where ‘m1’≈‘method1’)
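The precision and recall figures of the example can be reproduced with a short Python sketch. The function names are assumptions, and the soft and conceptual matchers are hand-written stand-ins: the TFIDF and Jaro-Winkler scheme is not implemented, and the only equivalences encoded are the ones the slide states (‘Y’≈‘YY’ and ‘m1’≈‘method1’).

```python
from fractions import Fraction


def precision_recall(generated, reference, equivalent=lambda a, b: a == b):
    """Match each generated identifier to at most one reference identifier."""
    unmatched = list(reference)
    hits = 0
    for g in generated:
        for r in unmatched:
            if equivalent(g, r):
                unmatched.remove(r)
                hits += 1
                break
    return Fraction(hits, len(generated)), Fraction(hits, len(reference))


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return Fraction(0)
    return 2 * precision * recall / (precision + recall)


# Identifiers of [m1 (X, Y) => INTEGER] and [method1 (X, YY, W) => INTEGER].
generated = ["m1", "X", "Y", "INTEGER"]
reference = ["method1", "X", "YY", "W", "INTEGER"]


def soft(a, b):
    # Stand-in for soft string matching: only the pair named on the slide.
    return a == b or (a, b) == ("Y", "YY")


def conceptual(a, b):
    # Stand-in for conceptual equivalence: adds the pair named on the slide.
    return soft(a, b) or (a, b) == ("m1", "method1")


strict_pr = precision_recall(generated, reference)
soft_pr = precision_recall(generated, reference, soft)
conceptual_pr = precision_recall(generated, reference, conceptual)
```

Fractions are used so that the values can be compared exactly with the ratios on the slide; note that `Fraction(2, 4)` is stored in reduced form as 1/2.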

Evaluation

performed from 2 aspects:
  average: consider all frames
  maximum: choose only the best manually created frame for each generated frame
results:

Conclusion

shown that our methodology stepwise instantiates the underlying table model
experiments show that:
  from a conceptual point of view the system gets appropriate names for frames in almost 75% of cases
  it gets totally identical names in more than 50% of cases
we demonstrated and evaluated the successful automatic generation of frames from HTML tables

Future Work

generate one (most general) frame from multiple tables
reduction of complexity
population of ontologies with instances
show feasibility of approach in practical setting
use given ontology as background knowledge


TNX

Inter-annotator agreement

max (FX) = Fconceptual ≈ 60%
only 2 totally identical frames (2/21 = 9.52%)
only 5 identical frames from a conceptual view (5/21 = 23.81%)
  these 5 tables cover all 1D class tables and 2 (out of 3) 2D class tables
possible reasons for low agreement:
  the annotators did not follow the guidelines precisely
  the task itself is hard
  the annotation guidelines were not clear/detailed enough
actual results:

Example 1

Generated Frame:

Tour [ Name (Code) => TOKEN
       Price (Code) => CURRENCY
       Hotel (Code) => TOKEN
       Meal (Code) => TOKEN ]

Annotator 1:

Tour [ TourCode => ALPHANUMERIC
       TourName => TOKEN
       Price => CURRENCY
       Hotel => TOKEN
       Meal => TOKEN ]

Annotator 2:

TourCode [ TourName => TOKEN
           Price => CURRENCY
           Hotel => ALPHANUMERIC
           Meal => ALPHANUMERIC ]

Example 2

Generated Frame:

Trip [ Cost (TimePeriod) => CURRENCY
       Insurance (TimePeriod) => CURRENCY ]

Annotator 1:

Trip [ Cost (Duration) => CURRENCY
       Insurance (Duration) => CURRENCY ]

Annotator 2:

Trip [ Duration => ALPHANUMERIC
       DurationType => ALPHANUMERIC
       Cost => CURRENCY
       Insurance => CURRENCY ]

Example 3

Generated Frame:

Transportation [ Description (Transportation) => STRING
                 HalfDay (Transportation) => CURRENCY
                 FullDay (Transportation) => CURRENCY
                 HoursHakone (Transportation) => CURRENCY ]

Annotator 1:

Transportation [ Vehicle => ALPHANUMERIC
                 Seats => NUMBER
                 WheelChairs => NUMBER
                 JumpSeats => NUMBER
                 Baggage => NUMBER
                 Toilet => NUMBER
                 Duration (TourType) => NUMBER
                 Cost (TourType) => CURRENCY ]