04/18/23 1

Table Structure Understanding

by Sibling Page Comparison

Cui Tao

Data Extraction Group

Department of Computer Science

Brigham Young University

Supported by NSF

Table Structure Understanding

Motivation
- Many documents contain tables
- Data extraction
- Data integration
- Ontology evolution

Solution
- Locate tables
- Locate table labels
- Locate table values
- Find label/value associations

(Gene Model, 1) = F18H3.5a
(Gene Model, 2) = F18H3.5b
...
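
The label/value associations listed above can be modeled as a mapping keyed by (label, record number). The following is a minimal sketch under that assumption, not the authors' implementation; the function name `associate` and the row layout are hypothetical, and only the two Gene Model values come from the slide.

```python
from typing import Dict, List, Tuple

# (label, record number) -> value, mirroring "(Gene Model, 1) = F18H3.5a"
Association = Dict[Tuple[str, int], str]


def associate(labels: List[str], rows: List[List[str]]) -> Association:
    """Pair every value with its column label and 1-based record number.

    Hypothetical helper: assumes each row lists its values in label order.
    """
    result: Association = {}
    for record_no, row in enumerate(rows, start=1):
        for label, value in zip(labels, row):
            result[(label, record_no)] = value
    return result


labels = ["Gene Model"]
rows = [["F18H3.5a"], ["F18H3.5b"]]
associations = associate(labels, rows)

for (label, record_no), value in associations.items():
    print(f"({label}, {record_no}) = {value}")
```

Keying on the pair keeps each value addressable by both its label and its record, which is the form the slide uses.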

Sibling Pages

Machine-generated output pages: user query results are laid out in a predefined page structure

Same web site ~ same structure
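
One plausible reading of sibling page comparison is that text repeated at the same position across sibling pages is likely a label, while text that changes is likely a value. The sketch below illustrates only that reading and is not taken from the slides; `classify_cells` is a hypothetical helper, the cells are assumed to be already aligned, and every cell text except "Gene Model" and "F18H3.5a" is made up for illustration.

```python
from typing import List, Tuple


def classify_cells(page_a: List[str], page_b: List[str]) -> List[Tuple[str, str]]:
    """Tag each cell of page_a as 'label' or 'value'.

    Assumption: the two lists hold the cell texts of two sibling pages in
    the same order, so position i in one page corresponds to position i in
    the other. Identical text is treated as a label, differing text as a value.
    """
    return [
        (cell_a, "label" if cell_a == cell_b else "value")
        for cell_a, cell_b in zip(page_a, page_b)
    ]


# Hypothetical cell texts from two pages generated by the same site.
page_a = ["Gene Model", "F18H3.5a", "Status", "confirmed"]
page_b = ["Gene Model", "C05D11.1", "Status", "predicted"]

tagged = classify_cells(page_a, page_b)
for text, kind in tagged:
    print(f"{kind:5s} {text}")
```

Real pages are not aligned cell by cell, which is why a tree-matching step is needed before a comparison like this can be made.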

Problems

- Data-rich area: discard the irrelevant parts
- Find table correspondences
- Find mappings between table cells
- Find structure patterns

HTML Table Components

Data Rich Area

Table Unnesting

DOM Tree
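
As a rough illustration of the DOM tree of an HTML table, the sketch below builds a small tree with Python's standard `html.parser` module. It is an assumption-laden toy, not the tooling used in the talk: the `Node` and `TreeBuilder` classes and the sample HTML are hypothetical, and void elements such as `<br>` get no special handling.

```python
from html.parser import HTMLParser
from typing import List, Optional


class Node:
    """One element of the tree: tag name, direct text, and child elements."""

    def __init__(self, tag: str, parent: Optional["Node"] = None) -> None:
        self.tag = tag
        self.parent = parent
        self.children: List["Node"] = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree from start tags, end tags, and text."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def dump(node: Node, depth: int = 0) -> List[str]:
    """Return one indented line per node, depth first."""
    line = "  " * depth + node.tag + (f": {node.text}" if node.text else "")
    lines = [line]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


html = "<table><tr><th>Gene Model</th><td>F18H3.5a</td></tr></table>"
builder = TreeBuilder()
builder.feed(html)
builder.close()
tree_lines = dump(builder.root)
print("\n".join(tree_lines))
```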

Simple Tree Matching

Simple Tree Matching (STM) [Yang91]
- Finds the maximum number of matching node pairs between two trees
- O(mn)

[Figure annotations: label, value]
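
The Simple Tree Matching recurrence can be sketched as a dynamic program over the ordered children of two nodes. This is a minimal version of the commonly described algorithm, offered as an illustration rather than the code used in the talk; the `Node` class and the two sample trees are hypothetical.

```python
from typing import List, Optional


class Node:
    """Ordered, labeled tree node (for example, an HTML tag name)."""

    def __init__(self, label: str, children: Optional[List["Node"]] = None) -> None:
        self.label = label
        self.children = children or []


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum matching of trees a and b.

    Roots with different labels contribute nothing. Otherwise the children
    are aligned in order by dynamic programming, and the root pair adds one.
    """
    if a.label != b.label:
        return 0
    rows, cols = len(a.children), len(b.children)
    best = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            best[i][j] = max(
                best[i][j - 1],             # skip b's j-th child
                best[i - 1][j],             # skip a's i-th child
                best[i - 1][j - 1] + pair,  # match the two subtrees
            )
    return best[rows][cols] + 1


# Hypothetical sibling tables: the second has one extra cell in its last row.
tree_a = Node("table", [
    Node("tr", [Node("th"), Node("td")]),
    Node("tr", [Node("th"), Node("td")]),
])
tree_b = Node("table", [
    Node("tr", [Node("th"), Node("td")]),
    Node("tr", [Node("th"), Node("td"), Node("td")]),
])

score = simple_tree_matching(tree_a, tree_b)
print(score)
```

In this example every node of `tree_a` finds a partner in `tree_b`, and the extra `td` in `tree_b` is left unmatched.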

Table Structure Pattern

Experimental Results

Initial test: general pattern extraction
- Molecular biology: 95.6%
- Car ads: 100%

Dynamic adjustment
- Unseen structures
- Structure variations
