From Tables To Frames
Aleksander Pivk (1,2), Philipp Cimiano (2), York Sure (2)
(1) Jozef Stefan Institute, Ljubljana, Slovenia
(2) AIFB Institute, University of Karlsruhe, Karlsruhe
09.11.2004
The Third International Semantic Web Conference - ISWC 2004
November 07 – 11, 2004, Hiroshima, Japan
From Tables To Frames - ISWC 2004, Hiroshima, Japan
Outline
Motivation
Foundation: Table Model
Methodology
Evaluation
Conclusion
Future Work
Motivation
problem: well-known annotation bottleneck
solution: automatic metadata generation
goal: describe the semantics of tables in a model-theoretic way (F-Logic)
  tables with different structure but the same meaning should have the same representation
benefit: enable e.g. query answering
  all conferences where ‘Prof. Studer’ is in the PC
  all tours to COUNTRY at DATE where price < AMOUNT
Foundation: Table Model
dimensions of table model [Hurst’00]
  graphical (image processing)
  physical (inter-cell relative location)
  structural (organization of cells indicating their navigational relationship)
  functional (purpose of regions in terms of data access)
    two functional cell types: A-cell and I-cell
    two functional I-cell roles: data and access
  semantic (relation between cell content, structure and orientation)
frame makes explicit
  the meaning of the cell contents (F-Logic concepts)
  the functional dimension of the table (method signature)
  the semantic dimension of the table (frame structure)
example:
Table model
LEGEND: A-cell, I-cell (access), I-cell (data)
Person  Type           Economic  Extended
Adult   Single Room    35.450    2.510
        Double Room    32.500    1.430
        Extra Bed      30.550    720
Child   Occupation     25.800    1.430
        No occupation  23.850    720
        Extra Bed      22.900    360
Simple Table Classes
220318046-60XM001
195227532-45LM223
120149523-31LM311
9094916-23LM209
707259-15LM208
50520-7LM202
InsuranceCost
Trip Duration (in days)
Trip Code
1-Dimensional
F2-Baggage Claim:
18A29Gate/Terminal:
Mar 16 -Not Available
Mar 16 -Not Available
Actual:
Mar 16 -4:20pm
Mar 16 -11:45am
Scheduled:
Honolulu, HI (HNL)
Dallas/Ft Worth,
TX (DFW)
City:
Arrival Departure
2-Dimensional
Complex Table Classes
Class/Extension                Economic  Extended
Adult   PRICE   Single Room    35.450    2.510
                Double Room    32.500    1.430
                Extra Bed      30.550    720
Child           Occupation     25.800    1.430
                No Occupation  23.850    720
                Extra Bed      22.900    360
1. Over-expanded labels
FloatRegularRate (%)
Fixed Deposit
Regular Fixed Deposit
4,354,353 Months
4,64,66 Months
4,74,79 Months
551 Year
5,055,052 Years
5,055,053 Years
3 Years
2 Years
1 Year
Rate (%)
5,1
5,1
5,05
Regular
5,1
5,1
5,05
Float
2. Partition labels
3. Combination – running example
Methodology
the methodology instantiates the table model stepwise
main differences: do not consider the graphical component; extend the semantic component
Input: HTML
Output: Frame
Steps of Methodology            Table Model
1 Cleaning & Normalization      Physical
2 Structure Detection           Structure
3 Building of FTM               Function
4 Semantic Enriching of FTM     Semantic
Cleaning & Norm.
construct an initial matrix structure from the DOM tree
cleaning: syntactic errors (CyberNeko HTML parser)
normalization: aligning the table and resolving cells that span multiple rows/columns (colspan, rowspan)
example:
before normalization:
Tour Code    DP9LAX01AB
Valid        01.05.04-31.09.04
Class/Price                  Economic  Extended
Adult        Single Room     35.450    2.510
             Double Room     32.500    1.430
             Extra Bed       30.550    720
Child        Occupation      25.800    1.430
             No occupation   23.850    720
             Extra Bed       22.900    360
after normalization:
Tour Code    Tour Code       DP9LAX01AB    DP9LAX…
Valid        Valid           01.05.04-31…  01.05.04…
Class/Price  Class/Price     Economic      Extended
Adult        Single Room     35.450        2.510
Adult        Double Room     32.500        1.430
Adult        Extra Bed       30.550        720
Child        Occupation      25.800        1.430
Child        No occupation   23.850        720
Child        Extra Bed       22.900        360
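The normalization step can be illustrated with a minimal sketch. It assumes the cleaned table is already available as rows of (text, colspan, rowspan) triples; this input format, the function name normalize and the span values in the sample are illustrative assumptions, not taken from the slides. Spanning cells are copied into every matrix slot they cover, as in the normalized table of the example.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, int, int]  # (text, colspan, rowspan); hypothetical input format


def normalize(rows: List[List[Cell]]) -> List[List[str]]:
    """Expand spanning cells so that every matrix slot holds a cell text."""
    grid: Dict[Tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip slots already filled by a rowspan coming from a row above.
            while (r, c) in grid:
                c += 1
            # Copy the cell text into every slot the cell spans.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Slots never covered by any cell are padded with an empty string.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Part of the running example; the span values are assumed for illustration.
rows = [
    [("Tour Code", 2, 1), ("DP9LAX01AB", 2, 1)],
    [("Class/Price", 2, 1), ("Economic", 1, 1), ("Extended", 1, 1)],
    [("Adult", 1, 3), ("Single Room", 1, 1), ("35.450", 1, 1), ("2.510", 1, 1)],
    [("Double Room", 1, 1), ("32.500", 1, 1), ("1.430", 1, 1)],
    [("Extra Bed", 1, 1), ("30.550", 1, 1), ("720", 1, 1)],
]
matrix = normalize(rows)
for line in matrix:
    print(line)
```

Every row of the resulting matrix has the same length, which is what the later steps rely on when they compare rows and columns.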
Structure Detection
detecting table orientation:
  rely on similarity of cells (size, content, token types)
  intuition:
    if rows are similar, then orientation is vertical (top-to-bottom)
    if columns are similar, then orientation is horizontal (left-to-right)
initialize logical units (LUs) and regions
  split table into LUs
  group same-sized, similar cells into regions within LUs
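The orientation intuition can be sketched as follows. This is a simplified illustration, not the authors' similarity measure: it compares token types only (size and content are ignored), and the token classes and function names are assumptions. The matrix needs at least two rows and two columns.

```python
import re
from typing import List


def token_type(cell: str) -> str:
    """Very coarse token typing; the classes are illustrative assumptions."""
    text = cell.strip()
    if not text:
        return "EMPTY"
    if re.fullmatch(r"[\d.,]+", text):
        return "NUMBER"
    if re.fullmatch(r"[A-Za-z /]+", text):
        return "WORDS"
    return "MIXED"


def similarity(a: List[str], b: List[str]) -> float:
    """Share of positions at which two cell sequences have the same token type."""
    same = sum(token_type(x) == token_type(y) for x, y in zip(a, b))
    return same / len(a)


def avg_adjacent_similarity(seqs: List[List[str]]) -> float:
    """Average similarity over all pairs of neighboring rows (or columns)."""
    pairs = list(zip(seqs, seqs[1:]))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def orientation(matrix: List[List[str]]) -> str:
    """Similar rows mean vertical orientation, similar columns horizontal."""
    columns = [list(col) for col in zip(*matrix)]
    row_sim = avg_adjacent_similarity(matrix)
    col_sim = avg_adjacent_similarity(columns)
    return "vertical" if row_sim >= col_sim else "horizontal"


body = [
    ["Adult", "Single Room", "35.450", "2.510"],
    ["Adult", "Double Room", "32.500", "1.430"],
    ["Child", "Occupation", "25.800", "1.430"],
]
transposed = [list(col) for col in zip(*body)]
print(orientation(body), orientation(transposed))
```

In the sample, every row has the same sequence of token types, while neighboring columns differ between the label columns and the price columns, so the rows count as the similar direction.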
Discovery of Regions
do while (distribution in LU not uniform)
  (explanation of uniformity: a logical unit consists of logical sub-units, where each sub-unit includes only regions of the same size and orientation)
  choose the best coherent region
    used to propagate and normalize the neighboring regions
  normalize logical sub-unit
    choose neighboring regions (i.e. only within the same rows for vertical orientation)
example:
[figure: region discovery illustrated on the normalized running-example table]
Building FTM
functional table model (FTM)
  regions as nodes arranged in a tree
properties of leaf nodes:
  are only regions consisting exclusively of I-cells
  are assigned their functional role (access, data)
  are assigned two semantic labels:
    label describing the content of the region (instances)
    label as a combination of a region label and the labels of parent A-cell nodes
inner nodes are either regions consisting of A-cells or ‘connection’ nodes (e.g. root)
construction of FTM
  bottom-up approach (from lowest logical unit upwards)
  description through an example
Building FTM
type of the (colored) logical unit = I-cells only
  regions are turned into leaves
  semantic labels and roles are set to a default value
resulting tree:
  leaves:
    [<label>, <role>] Adult | Adult | Adult | Child | Child | Child
    [<label>, <role>] Single Room | Double Room | Extra Bed | Occupation | No Occupat… | Extra Bed
    [<label>, <role>] 35,450 | 32,500 | 30,550 | 25,800 | /22,900
    [<label>, <role>] 2,510 | 1,430 | 720 | 1,430 | 720 | 360
Building FTM
type of the (colored) logical unit = A-cells only
  regions turned into inner nodes and connected to appropriate sub-nodes (leaves)
resulting tree:
  inner nodes (A-cells): Class/Price, Economic, Extended
  leaves:
    [<label>, <role>] Adult | Adult | Adult | Child | Child | Child
    [<label>, <role>] Single Room | Double Room | Extra Bed | Occupation | No Occupat… | Extra Bed
    [<label>, <role>] 35,450 | 32,500 | 30,550 | 25,800 | /22,900
    [<label>, <role>] 2,510 | 1,430 | 720 | 1,430 | 720 | 360
Building FTM
type of the (colored) logical unit = special case
  close a subtree by inserting a ‘connection’ node, which reflects a logical separation in the table (transition from an LU with only A-cells to an LU with I-cells)
  assign functional roles to leaves within a connected sub-tree:
    functional role access is assigned to all consecutive leaves (from left) that together form a unique identifier (key)
    all other leaves are assigned functional role data
  (possible) change of reading orientation in the new logical unit
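The role assignment can be sketched as a search for the shortest left prefix of leaves whose value combinations are unique. The sketch assumes each leaf is given as a list of cell values of equal length; the function name assign_roles is hypothetical. It also assumes that at least one leaf must remain a data leaf, so a sub-tree with a single leaf is assigned role data.

```python
from typing import List


def assign_roles(leaves: List[List[str]]) -> List[str]:
    """Give 'access' to the shortest left prefix of leaves that forms a key."""
    n_rows = len(leaves[0])
    # Assumption: the last leaf is never part of the key, so data remains.
    for k in range(1, len(leaves)):
        keys = {tuple(leaf[i] for leaf in leaves[:k]) for i in range(n_rows)}
        if len(keys) == n_rows:  # every row has a distinct key
            return ["access"] * k + ["data"] * (len(leaves) - k)
    # Assumption: without a unique prefix, all leaves are treated as data.
    return ["data"] * len(leaves)


leaves = [
    ["Adult", "Adult", "Adult", "Child", "Child", "Child"],
    ["Single Room", "Double Room", "Extra Bed",
     "Occupation", "No occupation", "Extra Bed"],
    ["35.450", "32.500", "30.550", "25.800", "23.850", "22.900"],
    ["2.510", "1.430", "720", "1.430", "720", "360"],
]
print(assign_roles(leaves))
print(assign_roles([["DP9LAX01AB"]]))
```

For the running example the first leaf alone is not unique (Adult and Child repeat), but the first two leaves together are, which gives two access leaves followed by two data leaves.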
resulting tree:
  Connection Node
  inner nodes (A-cells): Class/Price, Economic, Extended
  leaves:
    [<label>, access] Adult | Adult | Adult | Child | Child | Child
    [<label>, access] Single Room | Double Room | Extra Bed | Occupation | No Occupat… | Extra Bed
    [<label>, data] 35,450 | 32,500 | 30,550 | 25,800 | /22,900
    [<label>, data] 2,510 | 1,430 | 720 | 1,430 | 720 | 360
    [<label>, <role>] DP9LAX01AB
    [<label>, <role>] 01.05.2004 - 30.09.2004
Building FTM
type of the (colored) logical unit = A-cells only
  regions turned into inner nodes and connected to appropriate sub-nodes (leaves)
finally, connect all unconnected nodes to a root node
resulting tree:
  Root
  inner nodes (A-cells): Tour Code, Valid, Class/Price, Economic, Extended
  Connection Node
  leaves:
    [<label>, access] …
    [<label>, access] …
    [<label>, data] …
    [<label>, data] …
    [<label>, data] DP9LAX01AB
    [<label>, data] 01.05.2004 - 30.09.2004
Building FTM
recapitulation of FTM:
  consider multiple-level sub-trees for merging
  conditions: same tree structure and at least one level of matching A-cells
  merging step:
    merge nodes at the same position and level (leaf and inner nodes)
    if merged inner nodes (A-cells) are not equal
      find a semantic label of a new merged node
      create a new leaf node (with A-cells as values)
      assign functional role of the new leaf to access
example:
before merging:
  Root
    Morning
      Lecturer: v1, v2 /data
      Room: v3, v4 /data
    Evening
      Lecturer: v5, v6 /data
      Room: v7, v8 /data
after merging:
  Root
    Time Period
      Time Period: Morning, Evening /access
      Lecturer: v1, v5, v2, v6 /data
      Room: v3, v7, v4, v8 /data
Building FTM
before merging:
  Connection Node
  inner nodes (A-cells): Class/Price, Economic, Extended
  leaves:
    [<label>, access] Adult | Adult | Adult | Child | Child | Child
    [<label>, access] Single Room | Double Room | Extra Bed | Occupation | No Occupat… | Extra Bed
    [<label>, data] 35,450 | 32,500 | 30,550 | 25,800 | /22,900
    [<label>, data] 2,510 | 1,430 | 720 | 1,430 | 720 | 360
after merging:
  Connection Node
  inner nodes (A-cells): Class, Price
  leaves:
    [<label>, access] Adult | Adult | Adult | Child | Child | Child
    [<label>, access] Single Room | Double Room | Extra Bed | Occupation | No Occupat… | Extra Bed
    [<label>, access] Economic | Extended
    [<label>, data] 35,450 | 32,500 | 30,550 | 25,800 | /22,900 | 2,510 | 1,430 | 720 | 1,430 | 720 | 360
Semantic Enriching of FTM
find semantic labels for regions by consulting:
  WordNet lexical ontology: use synsets to find hypernyms
  GoogleSets service: additional way to find synonyms
transformations of a region’s cell labels:
  punctuation removal
  stopword removal
  compute IDF (a document is a cell) for each word, and filter out the ones with a value lower than the threshold
  select words that appear at the end of the labels (the nominal head of a nominal compound is at the end)
  query GoogleSets with the remaining words to filter out the ones that are not mutually similar
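The label transformations up to the head-word selection can be sketched with the standard library only. The stopword list, the default threshold and the function names are illustrative assumptions; the WordNet and GoogleSets lookups are not reproduced here, so the result is only the list of candidate head words that would be passed on to them.

```python
import math
import re
from collections import Counter
from typing import List

# Tiny illustrative stopword list (an assumption, not the one used by the authors).
STOPWORDS = {"a", "an", "the", "of", "for", "in", "no", "and", "or", "with"}


def tokens(label: str) -> List[str]:
    """Lowercase, replace punctuation by spaces, drop stopwords."""
    words = re.sub(r"[^\w\s]", " ", label.lower()).split()
    return [w for w in words if w not in STOPWORDS]


def head_words(cells: List[str], threshold: float = 0.0) -> List[str]:
    """Return the last surviving word of each cell label (its nominal head)."""
    docs = [tokens(c) for c in cells]          # one document per cell
    n_docs = len(docs)
    doc_freq = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n_docs / doc_freq[w]) for w in doc_freq}
    heads = []
    for d in docs:
        kept = [w for w in d if idf[w] >= threshold]  # drop low-IDF words
        if kept:
            heads.append(kept[-1])
    return heads


cells = ["Single Room", "Double Room", "Extra Bed",
         "Occupation", "No occupation", "Extra Bed"]
print(head_words(cells))
print(head_words(cells, threshold=1.5))
```

With the default threshold nothing is filtered and the heads are the last words of the labels; a higher threshold removes words that occur in several cells, which shows how strongly the choice of threshold shapes the candidates.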
Semantic Enriching of FTM
assign each leaf its semantic label that describes the content (instances) of the region
resulting tree:
  Root
  inner nodes (A-cells): Tour Code, Valid, Class, Price
  Connection Node
  leaves:
    [Person, access] Adult | Adult | Adult | Child | Child | Child
    [Room, access] Single Room | Double Room | Extra Bed | Occupation | No Occupat… | Extra Bed
    [Type, access] Economic | Extended
    [<label>, data] 35,450 | 32,500 | 30,550 | 25,800 | /22,900 | 2,510 | 1,430 | 720 | 1,430 | 720 | 360
    [<label>, data] DP9LAX01AB
    [Date, data] 01.05.2004 - 30.09.2004
Final FTM
(final) semantic labels of leaves:
  label is a combination of a region label and the labels of its parent A-cell nodes
resulting tree:
  Root
  inner nodes (A-cells): Tour Code, Valid, Class, Price
  Connection Node
  leaves (region label -> final label):
    [Person -> PersonClass, access] Adult | Adult | Adult | Child | Child | Child
    [Room -> RoomClass, access] Single Room | Double Room | Extra Bed | Occupation | No Occupat… | Extra Bed
    [Type -> TypePrice, access] Economic | Extended
    [<label> -> Price, data] 35,450 | 32,500 | 30,550 | 25,800 | /22,900 | 2,510 | 1,430 | 720 | 1,430 | 720 | 360
    [<label> -> Code, data] DP9LAX01AB
    [Date -> DateValid, data] 01.05.2004 - 30.09.2004
Map FTM to a Frame
method is a tuple
frame is a pair
generation of a frame:
  create a method m for every leaf node whose functional role is data
  parameters of m are all leaf nodes with functional role access that are located on the same level of m’s sub-tree or on m’s parent path towards the root node
  set the range of m according to the syntactic token type of its region
  names for parameters and methods are obtained from the final FTM
example:
Tour [ Code => ALPHANUMERIC;
       DateValid => DATE;
       Price (PersonClass, RoomClass, TypePrice) => LARGE_NUMBER].
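The mapping from the final FTM to a frame can be sketched as a tree walk. The Node class, the function names and the exact tree shape below are assumptions made for illustration. The parameter rule is interpreted in a simplified way: a data leaf takes as parameters all access leaves inside its nearest enclosing connection node (the root counts as one), without descending into nested connection nodes. The frame name is passed in by hand.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    role: Optional[str] = None        # "access" or "data" for leaves
    token_type: Optional[str] = None  # range of the method, for data leaves
    connection: bool = False          # True for the root and 'connection' nodes
    children: List["Node"] = field(default_factory=list)


def access_leaves(node: Node) -> List[Node]:
    """Access leaves below node, not entering nested connection nodes."""
    found: List[Node] = []
    for child in node.children:
        if child.connection:
            continue
        if not child.children and child.role == "access":
            found.append(child)
        else:
            found.extend(access_leaves(child))
    return found


def methods(node: Node, scope: Node, out: List[str]) -> List[str]:
    """Collect one method signature per data leaf, in tree order."""
    if node.connection:
        scope = node  # parameters are looked up inside this sub-tree
    if not node.children and node.role == "data":
        params = [p.label for p in access_leaves(scope)]
        name = f"{node.label} ({', '.join(params)})" if params else node.label
        out.append(f"{name} => {node.token_type}")
    for child in node.children:
        methods(child, scope, out)
    return out


def to_frame(name: str, root: Node) -> str:
    return f"{name} [ " + "; ".join(methods(root, root, [])) + "]."


# Final FTM of the running example; the edges are assumed for illustration.
root = Node("Root", connection=True, children=[
    Node("Tour Code", children=[Node("Code", "data", "ALPHANUMERIC")]),
    Node("Valid", children=[Node("DateValid", "data", "DATE")]),
    Node("Connection Node", connection=True, children=[
        Node("Class", children=[Node("PersonClass", "access"),
                                Node("RoomClass", "access")]),
        Node("Price", children=[Node("TypePrice", "access"),
                                Node("Price", "data", "LARGE_NUMBER")]),
    ]),
])
print(to_frame("Tour", root))
```

Under these assumptions the sketch reproduces the frame of the example: Code and DateValid have no access leaves in their scope and become parameterless methods, while Price takes the three access leaves below the connection node as parameters.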
Evaluation
task:
  for each table, compare the automatically generated frame against two manually created frames
  measure in terms of Precision, Recall and F-measure
dataset:
  consists of 21 tables: 3 tables for each simple table class (1D, 2D) and 5 tables for each complex table class
  tourism domain
annotators:
  14 subjects
  each subject had to annotate 3 tables, each belonging to a different table class (14x3 = 21x2 = 42)
Evaluation
performed along the following 4 functions
- example: [m1 (X, Y) => INTEGER] vs. [method1 (X, YY, W) => INTEGER]
syntactic correctness:
  how well the functional dimension of the table is captured (SynC=2/3)
strict comparison:
  calculate how identical the nameM, rangeM, and PM identifiers of methods are (P=2/4, R=2/5)
soft comparison:
  for soft matching we used a combination of TFIDF and the Jaro-Winkler string distance scheme [Cohen et al., 2003]
  calculate soft matching for identifiers of methods (P=3/4, R=3/5, where ‘Y’≈‘YY’)
conceptual comparison:
  conceptually equivalent identifiers have been determined (i.e. ‘RegionType’=‘Region’=‘Location’)
  calculate conceptual matching for identifiers of methods (P=4/4, R=4/5, where ‘m1’≈‘method1’)
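The precision and recall figures of the example can be retraced with a small sketch. Identifiers of the generated method are matched one-to-one against identifiers of the manual method. For the soft comparison the sketch swaps in difflib.SequenceMatcher with an assumed threshold of 0.6 as a stand-in; it is not the TFIDF and Jaro-Winkler combination used in the evaluation. The equivalence set for the conceptual comparison is written by hand.

```python
from difflib import SequenceMatcher
from fractions import Fraction
from typing import Callable, List, Tuple


def precision_recall(generated: List[str], reference: List[str],
                     match: Callable[[str, str], bool]) -> Tuple[Fraction, Fraction]:
    """Greedy one-to-one matching of generated against reference identifiers."""
    unmatched = list(reference)
    hits = 0
    for g in generated:
        for r in unmatched:
            if match(g, r):
                unmatched.remove(r)  # each reference identifier is used once
                hits += 1
                break
    return Fraction(hits, len(generated)), Fraction(hits, len(reference))


def f_measure(p: Fraction, r: Fraction) -> Fraction:
    """Harmonic mean of precision and recall."""
    return Fraction(0) if p + r == 0 else 2 * p * r / (p + r)


def strict(a: str, b: str) -> bool:
    return a == b


def soft(a: str, b: str) -> bool:
    # Stand-in similarity; the 0.6 threshold is an assumption.
    return a == b or SequenceMatcher(None, a, b).ratio() >= 0.6


EQUIVALENT = {frozenset(("m1", "method1"))}  # hand-made equivalence set


def conceptual(a: str, b: str) -> bool:
    return soft(a, b) or frozenset((a, b)) in EQUIVALENT


# [m1 (X, Y) => INTEGER] vs. [method1 (X, YY, W) => INTEGER]
generated = ["m1", "X", "Y", "INTEGER"]
reference = ["method1", "X", "YY", "W", "INTEGER"]

strict_pr = precision_recall(generated, reference, strict)          # P=2/4, R=2/5
soft_pr = precision_recall(generated, reference, soft)              # P=3/4, R=3/5
conceptual_pr = precision_recall(generated, reference, conceptual)  # P=4/4, R=4/5
print(strict_pr, soft_pr, conceptual_pr, f_measure(*strict_pr))
```

Fractions are used instead of floats so that the values can be compared directly with the ratios given in the example.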
Evaluation
performed from 2 aspects:
  average: consider all frames
  maximum: choose only the best manually created frame for each generated frame
results:
Conclusion
shown that our methodology stepwise instantiates the underlying table model
experiments show that:
  from a conceptual point of view, the system gets appropriate names for frames in almost 75% of cases
  it gets totally identical names in more than 50% of cases
we demonstrated and evaluated the successful automatic generation of frames from HTML tables
Future Work
generate one (most general) frame from multiple tables
  reduction of complexity
population of ontologies with instances
  show feasibility of approach in a practical setting
use a given ontology as background knowledge
TNX
Inter-annotator agreement
max(F_X) = F_conceptual ≈ 60%
only 2 totally identical frames (2/21=9.52%)
only 5 identical frames from a conceptual view (5/21=23.81%)
  these 5 tables cover all 1D class tables and 2 (out of 3) 2D class tables
possible reasons for low agreement:
  the annotators did not follow the guidelines precisely
  the task itself is hard
  the annotation guidelines were not clear/detailed enough
actual results:
Example 1
Generated Frame:
Tour [ Name (Code) => TOKEN
       Price (Code) => CURRENCY
       Hotel (Code) => TOKEN
       Meal (Code) => TOKEN ]
Annotator 1:
Tour [ TourCode => ALPHANUMERIC
       TourName => TOKEN
       Price => CURRENCY
       Hotel => TOKEN
       Meal => TOKEN ]
Annotator 2:
TourCode [ TourName => TOKEN
           Price => CURRENCY
           Hotel => ALPHANUMERIC
           Meal => ALPHANUMERIC ]
Example 2
Generated Frame:
Trip [ Cost (TimePeriod) => CURRENCY
       Insurance (TimePeriod) => CURRENCY ]
Annotator 1:
Trip [ Cost (Duration) => CURRENCY
       Insurance (Duration) => CURRENCY ]
Annotator 2:
Trip [ Duration => ALPHANUMERIC
       DurationType => ALPHANUMERIC
       Cost => CURRENCY
       Insurance => CURRENCY ]
Example 3
Generated Frame:
Transportation [ Description (Transportation) => STRING
                 HalfDay (Transportation) => CURRENCY
                 FullDay (Transportation) => CURRENCY
                 HoursHakone (Transportation) => CURRENCY ]
Annotator 1:
Transportation [ Vehicle => ALPHANUMERIC
                 Seats => NUMBER
                 WheelChairs => NUMBER
                 JumpSeats => NUMBER
                 Baggage => NUMBER
                 Toilet => NUMBER
                 Duration (TourType) => NUMBER
                 Cost (TourType) => CURRENCY ]