![Page 1: 6/17/20151 Table Structure Understanding by Sibling Page Comparison Cui Tao Data Extraction Group Department of Computer Science Brigham Young University](https://reader030.vdocument.in/reader030/viewer/2022032800/56649d455503460f94a227c8/html5/thumbnails/1.jpg)
04/18/23 1
Table Structure Understanding
by Sibling Page Comparison
Cui Tao
Data Extraction Group
Department of Computer Science
Brigham Young University
Supported by NSF
Table Structure Understanding
Motivation
- Many documents contain tables
- Data extraction
- Data integration
- Ontology evolution

Solution
- Locate tables
- Locate table labels
- Locate table values
- Find label/value associations
Table Structure Understanding
Label/value associations for the entries marked 1 and 2 in the slide image:
(Gene Model, 1) = F18H3.5a
(Gene Model, 2) = F18H3.5b
...
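The label/value associations above can be pictured as a mapping from a (label, record number) pair to a value. The following is a minimal sketch, not the presenter's implementation. It assumes a table whose first row holds the labels and whose later rows each hold one record. The function name `associate` is hypothetical, and the sample table contains only the two Gene Model values shown on the slide.

```python
from typing import Dict, List, Tuple


def associate(rows: List[List[str]]) -> Dict[Tuple[str, int], str]:
    """Map (label, record number) to the value found under that label.

    Assumption: the first row holds labels; every later row is one record.
    """
    labels, records = rows[0], rows[1:]
    pairs: Dict[Tuple[str, int], str] = {}
    for number, record in enumerate(records, start=1):
        for label, value in zip(labels, record):
            pairs[(label, number)] = value
    return pairs


# One-column table built from the values on the slide.
table = [["Gene Model"], ["F18H3.5a"], ["F18H3.5b"]]
for key, value in associate(table).items():
    print(key, "=", value)
```

Real tables may place labels in a column instead of a row, or nest them, which is why the label and value positions have to be discovered first.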
Sibling Pages
- Generated output pages: user query results rendered in a predefined page structure
- Same web site, so roughly the same structure
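Why sibling pages help can be shown with a small sketch. It rests on an assumption the slides do not spell out: text that stays the same across sibling pages is likely a label, and text that changes is likely a value. The function name `classify_cells` and the car-ad cell contents are hypothetical, and the cells are assumed to be aligned already, which real pages do not guarantee.

```python
from typing import List, Tuple


def classify_cells(page_a: List[str], page_b: List[str]) -> List[Tuple[str, str]]:
    """Tag each cell of page_a as 'label' or 'value'.

    Assumption: both pages list their cells in the same order, so cells
    at the same position correspond to each other.
    """
    tagged = []
    for cell_a, cell_b in zip(page_a, page_b):
        kind = "label" if cell_a == cell_b else "value"
        tagged.append((cell_a, kind))
    return tagged


# Hypothetical cell texts from two sibling car-ad pages.
page_1 = ["Make", "Honda", "Year", "2001"]
page_2 = ["Make", "Ford", "Year", "1998"]
for text, kind in classify_cells(page_1, page_2):
    print(f"{text}: {kind}")
```

A value that happens to be identical on both pages would be mislabeled by this sketch, so comparing more than two sibling pages is a natural refinement.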
Problems
- Data-rich area: discard the irrelevant parts
- Find table correspondences
- Find mappings between table cells
- Find structure patterns
HTML Table Components
Data Rich Area
Table Unnesting
DOM Tree
Simple Tree Matching
- Simple Tree Matching (STM) [Yang91]
- Finds the maximum number of matching node pairs between two trees
- Runs in O(mn) time
Figure annotations: label, value
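The standard formulation of Simple Tree Matching can be sketched as a recursive dynamic program over the ordered children of two nodes. This is an illustrative version, not the code behind the slides. The `Node` class and the tiny example trees are hypothetical, and nodes are compared by tag name only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the size of the maximum matching between two ordered trees."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them may match
    m, n = len(a.children), len(b.children)
    # best[i][j]: best matching using the first i children of a
    # and the first j children of b, preserving sibling order.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return best[m][n] + 1  # +1 counts the matched roots


# Hypothetical DOM fragments from two sibling pages.
page_a = Node("table", [Node("tr", [Node("td"), Node("td")]), Node("tr", [Node("td")])])
page_b = Node("table", [Node("tr", [Node("td"), Node("td")])])
print(simple_tree_matching(page_a, page_b))  # prints 4
```

Each pair of nodes at the same depth is compared at most once, which is where the O(mn) bound comes from.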
Table Structure Pattern
Experimental Results
Initial test: general pattern extraction
- Molecular biology: 95.6%
- Car ads: 100%

Dynamic adjustment
- Unseen structure
- Structure variations