03/27/22 1 Table Structure Understanding by Sibling Page Comparison Cui Tao Data Extraction Group Department of Computer Science Brigham Young University Supported by NSF



Table Structure Understanding

by Sibling Page Comparison

Cui Tao

Data Extraction Group

Department of Computer Science

Brigham Young University

Supported by NSF


Table Structure Understanding

Motivation
- Many documents contain tables
- Data extraction
- Data integration
- Ontology evolution

Solution
- Locate tables
- Locate table labels
- Locate table values
- Find label/value associations


Table Structure Understanding


Table Structure Understanding

(Gene Model, 1) = F18H3.5a

(Gene Model, 2) = F18H3.5b

...
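The label/value associations above can be illustrated with a short sketch. It assumes a flat HTML table whose first row holds the column labels, which is only one of the layouts the talk is concerned with. The class `TableRows`, the function `label_value_pairs`, and the sample markup are hypothetical names made up for illustration; only the two Gene Model values come from the slide.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each cell, row by row, from one flat HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def label_value_pairs(html):
    """Map (column label, row number) to the value found in that cell.

    Assumption: the first row contains the labels, every later row is a record.
    """
    parser = TableRows()
    parser.feed(html)
    labels, *records = parser.rows
    return {
        (label, row_no): value
        for row_no, record in enumerate(records, start=1)
        for label, value in zip(labels, record)
    }


# Hypothetical markup; only the two Gene Model values appear on the slide.
SAMPLE = """
<table>
  <tr><th>Gene Model</th></tr>
  <tr><td>F18H3.5a</td></tr>
  <tr><td>F18H3.5b</td></tr>
</table>
"""

pairs = label_value_pairs(SAMPLE)
print(pairs[("Gene Model", 1)])
print(pairs[("Gene Model", 2)])
```

Real pages rarely announce which cells are labels, so the first-row assumption is a stand-in for the label-finding step rather than a solution to it.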



Sibling Pages

Generated output pages: a user query returns results in a predefined page structure

Pages from the same web site have approximately the same structure


Problems

- Data-rich area: discard the irrelevant parts
- Find table correspondences
- Find mappings between table cells
- Find structure patterns
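One way to picture the cell-mapping step is the sketch below. It rests on an assumption that the slides do not state outright: when two sibling pages are compared, a cell whose text is the same on both pages behaves like a label, and a cell whose text differs behaves like a value. The function `classify_cells` is a hypothetical name, and the two small tables are invented apart from the Gene Model values.

```python
def classify_cells(table_a, table_b):
    """Mark each aligned cell as a label (same text) or a value (different text).

    Assumes both tables are lists of rows with identical shape, which stands
    in for the table-correspondence and cell-mapping steps listed above.
    """
    if len(table_a) != len(table_b):
        raise ValueError("tables must have the same number of rows")
    marked = []
    for row_a, row_b in zip(table_a, table_b):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same number of cells")
        marked.append(
            ["label" if a == b else "value" for a, b in zip(row_a, row_b)]
        )
    return marked


# Hypothetical sibling pages: the layout and the second row are invented.
page_1 = [["Gene Model", "F18H3.5a"], ["Status", "confirmed"]]
page_2 = [["Gene Model", "F18H3.5b"], ["Status", "confirmed"]]

marked = classify_cells(page_1, page_2)
print(marked)
```

In this toy input the second row's value is identical on both pages, so it is marked as a label. That shows why a comparison of only two pages can mislead and why the cells must first be mapped correctly.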


HTML Table Components


Data Rich Area


Table Unnesting


DOM Tree


Simple Tree Matching

Simple Tree Matching (STM) [Yang91]
- Finds the maximum number of matching node pairs between two trees
- Runs in O(mn) time

[Figure labels: Label, Value]
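A minimal sketch of Simple Tree Matching as it is commonly described: two nodes can match only if their labels are equal, and their children are then aligned in order with dynamic programming. The tuple representation `(label, children)` and the function name `simple_tree_matching` are assumptions made for illustration, not the implementation used in the talk.

```python
def simple_tree_matching(a, b):
    """Return the size of the largest order-preserving match between two trees.

    Each tree is a (label, children) tuple, where children is a list of trees.
    """
    label_a, children_a = a
    label_b, children_b = b
    if label_a != label_b:
        return 0  # roots differ, so nothing below them can be paired

    m, n = len(children_a), len(children_b)
    # best[i][j] = best match using the first i children of a and first j of b
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = best[i - 1][j - 1] + simple_tree_matching(
                children_a[i - 1], children_b[j - 1]
            )
            best[i][j] = max(best[i][j - 1], best[i - 1][j], paired)
    return best[m][n] + 1  # +1 counts the matched roots


# Hypothetical DOM fragments reduced to tag names.
tree_a = ("table", [("tr", [("td", []), ("td", [])]), ("tr", [("td", [])])])
tree_b = ("table", [("tr", [("td", [])]), ("tr", [("td", []), ("td", [])]), ("tr", [])])

score_same = simple_tree_matching(tree_a, tree_a)
score_diff = simple_tree_matching(tree_a, tree_b)
print(score_same, score_diff)
```

Matching a tree against itself returns its node count, which gives a natural upper bound for normalizing the score between sibling tables.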


Table Structure Pattern


Table Structure Pattern


Experimental Results

Initial test
- General pattern extraction
  - Molecular biology: 95.6%
  - Car ads: 100%
- Dynamic adjustment
  - Unseen structure
  - Structure variations