Cui Tao PhD Dissertation Defense
![Page 1: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/1.jpg)
1
Cui Tao
PhD Dissertation Defense
Ontology Generation, Information Harvesting and Semantic Annotation for Machine-Generated Web Pages
![Page 2: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/2.jpg)
2
Motivation
• Birth date of my great-grandpa
• Price and mileage of red Nissans, 1990 or newer
• Protein and amino-acid information for gene cdk-4
• US states with property crime rates above 1%
![Page 3: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/3.jpg)
3
Search by Search Engine
![Page 4: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/4.jpg)
4
Search the Hidden Web
• The Hidden Web:
  – Hidden behind forms
  – Hard to query
“cdk-4”
![Page 5: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/5.jpg)
5
Query for Data
• The Hidden Web:
  – Hidden behind forms
  – Hard to query
Find the protein and amino-acid information for gene “cdk-4”
![Page 6: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/6.jpg)
6
A Web of Pages → A Web of Knowledge
• Web of Knowledge
  – Machine-“understandable”
  – Publicly accessible
  – Queryable by standard query languages
• Semantic annotation
  – Domain ontologies
  – Populated conceptual model
• Problems to resolve
  – How do we create ontologies?
  – How do we annotate pages for ontologies?
![Page 7: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/7.jpg)
Contributions of Dissertation Work
• Web of Pages → Web of Knowledge
  – Knowledge & meta-knowledge extraction
  – Reformulation as machine-“understandable” knowledge
• Automatic & semi-automatic solutions via:
  – Sibling tables (TISP/TISP++)
  – User-created forms (FOCIH)
7
![Page 8: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/8.jpg)
8
Automatic Annotation with TISP (Table Interpretation with Sibling Pages)
• Recognize tables (discard non-tables)
• Locate table labels
• Locate table values
• Find label/value associations
![Page 9: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/9.jpg)
9
Recognize Tables
Data Table
Layout Tables (discard)
Nested Data Tables
![Page 10: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/10.jpg)
10
Find Label/Value Associations
Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
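The association in the example above pairs a column label path with a row path and maps the pair to one cell value. A minimal sketch of that idea follows. The function name `associate`, the column label "Gene model", and the placeholder cell values are assumptions made for illustration; only the label prefix and the value WP:CE28918 come from the slide.

```python
def associate(prefix, column_labels, rows):
    """Map (column label path, row path) pairs to cell values.

    `prefix` is the label path of the enclosing (nested) table.
    """
    associations = {}
    for row_number, row in enumerate(rows, start=1):
        for label, value in zip(column_labels, row):
            key = (f"{prefix}.{label}", f"{prefix}.{row_number}")
            associations[key] = value
    return associations


# Hypothetical nested table: every cell except WP:CE28918 is a placeholder.
prefix = "Identification.Gene model(s)"
column_labels = ["Gene model", "Protein"]
rows = [
    ["<gene-model-1>", "<protein-1>"],
    ["<gene-model-2>", "WP:CE28918"],
]

associations = associate(prefix, column_labels, rows)
key = ("Identification.Gene model(s).Protein", "Identification.Gene model(s).2")
print(associations[key])
```

Keying on the full label path keeps values from different nested tables distinct even when they share a column name.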
![Page 11: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/11.jpg)
11
Interpretation Technique:Sibling Page Comparison
![Page 12: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/12.jpg)
12
Interpretation Technique:Sibling Page Comparison
Same
![Page 13: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/13.jpg)
13
Interpretation Technique:Sibling Page Comparison
Almost Same
![Page 14: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/14.jpg)
14
Interpretation Technique:Sibling Page Comparison
Different
Same
![Page 15: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/15.jpg)
15
Technique Details
• Unnest tables
• Match tables in sibling pages
  – “Perfect” match (table for layout: discard)
  – “Reasonable” match (sibling table)
• Determine & use table-structure pattern
  – Discover pattern
  – Pattern usage
  – Dynamic pattern adjustment
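The matching step can be illustrated with a rough sketch: two tables from sibling pages that are identical are treated as layout, and tables that share much of their content (the labels) but differ elsewhere (the values) are treated as sibling data tables. The similarity measure here (`difflib.SequenceMatcher` over cell text) and the 0.3 threshold are stand-in assumptions, not the measure used in the dissertation, and the gene names are placeholders.

```python
from difflib import SequenceMatcher


def table_similarity(table_a, table_b):
    """Similarity in [0, 1] between two tables given as lists of rows of cell text."""
    cells_a = [cell for row in table_a for cell in row]
    cells_b = [cell for row in table_b for cell in row]
    return SequenceMatcher(None, cells_a, cells_b).ratio()


def classify_match(table_a, table_b, reasonable=0.3):
    """Classify a pair of tables taken from two sibling pages."""
    score = table_similarity(table_a, table_b)
    if score == 1.0:
        return "layout"      # "perfect" match: same on both pages, discard
    if score >= reasonable:
        return "sibling"     # "reasonable" match: same labels, different values
    return "unmatched"


# Hypothetical tables from two sibling pages.
navigation = [["Home", "Search", "Help"]]
page_1 = [["Gene", "gene-A"], ["Species", "C. elegans"]]
page_2 = [["Gene", "gene-B"], ["Species", "C. elegans"]]

print(classify_match(navigation, navigation))
print(classify_match(page_1, page_2))
```

In a sibling pair, the cells that stay constant are candidate labels and the cells that vary are candidate values, which is what makes the comparison useful for the later pattern-discovery step.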
![Page 16: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/16.jpg)
16
Table Unnesting
![Page 17: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/17.jpg)
17
Table Structure Patterns
Regularity Expectations:
• (<tr><(td|th)> {L} <(td|th)> {V})^n
• <tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+
• …
Pattern combinations are also possible.
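As an illustration of how the two listed patterns could be used once discovered, the sketch below interprets a table that has already been parsed into rows of cell text. The function names and the sample rows are hypothetical; the code only shows the label/value reading each pattern implies.

```python
def interpret_label_value_rows(rows):
    """Pattern 1: each row is a label cell {L} followed by a value cell {V}."""
    return {label: value for label, value in rows}


def interpret_header_then_records(rows):
    """Pattern 2: first row holds n labels {L}; each later row holds n values {V}."""
    labels, *records = rows
    return [dict(zip(labels, record)) for record in records]


# Hypothetical tables, already reduced to cell text.
vertical = [["Make", "Nissan"], ["Year", "1990"]]
horizontal = [["Make", "Year"], ["Nissan", "1990"], ["Nissan", "1995"]]

print(interpret_label_value_rows(vertical))
print(interpret_header_then_records(horizontal))
```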
![Page 18: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/18.jpg)
18
Table Structure Patterns
<tr>(<(td|th)> {L})^n (<tr>(<(td|th)> {V})^n)+
![Page 19: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/19.jpg)
19
Pattern Usage
![Page 20: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/20.jpg)
20
Dynamic Pattern Adjustment
![Page 21: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/21.jpg)
21
TISP++
• Automatic ontology generation
• Automatic information annotation
![Page 22: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/22.jpg)
22
Ontology Generation – OSM
• Object set: table labels
  – Lexical: labels that associate with actual values
  – Non-lexical: labels that associate with other tables
• Relationship set: table nesting
• Constraints: updates based on observation
![Page 23: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/23.jpg)
23
Ontology Generation – OWL
• Object set: OWL class
• Relationship set: OWL object property
• Lexical object set:
  – OWL datatype property
  – Different annotation properties to keep track of the provenance
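One reading of this mapping can be sketched as a small generator that emits Turtle text: non-lexical labels become OWL classes, table nesting becomes object properties, and lexical labels become datatype properties. The function name, the `http://example.org/tisp#` namespace, the `has…` property naming, and the `xsd:string` range are all assumptions for illustration; provenance annotation properties are left out.

```python
def ontology_to_turtle(non_lexical, nesting, lexical, base="http://example.org/tisp#"):
    """Emit Turtle for classes, object properties, and datatype properties.

    non_lexical: labels that lead to other tables (become OWL classes)
    nesting:     (parent, child) pairs taken from table nesting
    lexical:     (owner, label) pairs whose labels hold actual values
    Names must already be valid Turtle local names (no spaces).
    """
    lines = [
        f"@prefix : <{base}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        "",
    ]
    for name in non_lexical:
        lines.append(f":{name} a owl:Class .")
    for parent, child in nesting:
        lines.append(
            f":has{child} a owl:ObjectProperty ; "
            f"rdfs:domain :{parent} ; rdfs:range :{child} ."
        )
    for owner, name in lexical:
        lines.append(
            f":{name} a owl:DatatypeProperty ; "
            f"rdfs:domain :{owner} ; rdfs:range xsd:string ."
        )
    return "\n".join(lines)


# Hypothetical labels, loosely modeled on the gene example.
turtle = ontology_to_turtle(
    non_lexical=["Identification", "GeneModel"],
    nesting=[("Identification", "GeneModel")],
    lexical=[("GeneModel", "Protein")],
)
print(turtle)
```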
![Page 24: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/24.jpg)
Generated Ontology
![Page 25: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/25.jpg)
Generated Ontology
![Page 26: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/26.jpg)
26
RDF Graph
![Page 27: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/27.jpg)
27
Query the Data
Find the protein and amino-acid information for gene “cdk-4”
![Page 28: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/28.jpg)
28
TISP Evaluation
• Applications
  – Commercial: car ads
  – Scientific: molecular biology
  – Geopolitical: US states and countries
• Data: > 2,000 tables in 35 sites
• Evaluation
  – Initial two sibling pages
    • Correct separation of data tables from layout tables?
    • Correct pattern recognition?
  – Remaining tables in site
    • Information properly extracted?
    • Able to detect and adjust for pattern variations?
![Page 29: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/29.jpg)
29
Experimental Results
• Table recognition: correctly discarded 157 of 158 layout tables
• Pattern recognition: correctly found 69 of 72 structure patterns
• Extraction and adjustments: 5 path adjustments and 34 label adjustments, all correct
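For readers who prefer rates, the counts above convert to percentages as follows. The percentages are derived here from the reported counts and do not appear on the slide.

```python
# Reported counts: (correct, total)
results = {
    "layout tables discarded": (157, 158),
    "structure patterns found": (69, 72),
}

rates = {name: correct / total for name, (correct, total) in results.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```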
![Page 30: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/30.jpg)
30
TISP++ Performance
• Performance depends on TISP
• TISP test set
  – Generates all ontologies correctly
  – Annotates all information in tables correctly
![Page 31: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/31.jpg)
31
Form-based Ontology Creation and Information Harvesting (FOCIH)
• Personalized ontology creation by form
  – General familiarity
  – Reasonable conceptual framework
  – Appropriate correspondence
    • Transformable to ontological descriptions
    • Capable of accepting source data
• Automated ontology creation
• Automated information harvesting
![Page 32: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/32.jpg)
32
Form Creation
![Page 33: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/33.jpg)
33
Created Sample Form
![Page 34: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/34.jpg)
34
Generated Ontology View
![Page 35: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/35.jpg)
35
Source-to-Form Mapping
![Page 36: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/36.jpg)
36
Source-to-Form Mapping
![Page 37: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/37.jpg)
37
Source-to-Form Mapping
![Page 38: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/38.jpg)
38
Source-to-Form Mapping
![Page 39: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/39.jpg)
39
Almost Ready to Harvest
• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Pattern recognition
  – Instance recognition
![Page 40: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/40.jpg)
40
Reading Path
![Page 41: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/41.jpg)
41
Pattern & Instance Recognition
![Page 42: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/42.jpg)
42
Pattern & Instance Recognition
![Page 43: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/43.jpg)
43
Pattern & Instance Recognition
regular expression for decimal number
left context
right context
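The three callouts suggest an extraction rule built from a value pattern bracketed by fixed left and right context. A minimal sketch of such a rule is below. The helper name `build_extractor`, the decimal pattern, and the sample text and contexts are assumptions for illustration, not the rules FOCIH actually generates.

```python
import re

# Assumed pattern for a decimal number such as 1234.5
DECIMAL = r"\d+\.\d+"


def build_extractor(left_context, value_pattern, right_context):
    """Compile a regex that captures value_pattern between two literal contexts."""
    return re.compile(
        re.escape(left_context) + "(" + value_pattern + ")" + re.escape(right_context)
    )


# Hypothetical source text and contexts.
extractor = build_extractor("Area: ", DECIMAL, " sq km")
match = extractor.search("Country profile. Area: 1234.5 sq km. Capital: ...")
value = match.group(1) if match else None
print(value)
```

Escaping the contexts with `re.escape` keeps punctuation in the surrounding text from being read as regex syntax.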
![Page 44: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/44.jpg)
44
Pattern & Instance Recognition
list pattern, delimiter is “,”
![Page 45: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/45.jpg)
45
Pattern & Instance Recognition
list pattern, delimiter is regular expression for percentage numbers and a comma
![Page 46: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/46.jpg)
46
Pattern & Instance Recognition
list pattern, delimiter is regular expression for percentage numbers and a comma
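Both list patterns mentioned on these slides, a plain comma delimiter and a delimiter made of a percentage number plus a comma, can be sketched with `re.split`. The delimiter expressions and the sample strings are assumptions for illustration.

```python
import re

COMMA = r","
# A percentage such as "60%" or "7.5%", optionally followed by a comma.
PERCENT_AND_COMMA = r"\s*\d+(?:\.\d+)?%\s*,?"


def split_list(text, delimiter_pattern):
    """Split text on a regex delimiter and drop empty items."""
    items = re.split(delimiter_pattern, text)
    return [item.strip() for item in items if item.strip()]


# Hypothetical list values.
plain = split_list("alpha, beta, gamma", COMMA)
with_percent = split_list("alpha 60%, beta 30%, gamma 10%", PERCENT_AND_COMMA)
print(plain)
print(with_percent)
```

Treating the percentage as part of the delimiter leaves only the item names, which matches a mapping where the user selects the names and not the figures.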
![Page 47: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/47.jpg)
47
Can Now Harvest
![Page 48: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/48.jpg)
48
Can Now Harvest
![Page 49: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/49.jpg)
49
Can Now Harvest
![Page 50: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/50.jpg)
50
Semantic Annotation
![Page 51: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/51.jpg)
51
Semantic Annotation
![Page 52: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/52.jpg)
52
Semantic Annotation
![Page 53: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/53.jpg)
53
Semantic Annotation
![Page 54: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/54.jpg)
54
Semantic Annotation
![Page 55: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/55.jpg)
55
Semantic Query
![Page 56: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/56.jpg)
56
FOCIH Performance
• Ontology creation
• Semantic annotation
  – Depends on TISP performance
  – Depends on pattern and instance recognition performance
![Page 57: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/57.jpg)
57
FOCIH Performance
• Pattern and instance recognition:
  – Works with highly regular data
  – Tested 71 mappings
  – 25 full-string values (25/25 correct)
  – 38 substring values (29/38 correct)
  – 8 list patterns (6/8 correct)
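A quick consistency check on these counts: the three categories add up to the 71 tested mappings, and 60 of them are correct. The overall rate is derived here and is not stated on the slide.

```python
# Reported counts per mapping type: (correct, tested)
mappings = {
    "full-string values": (25, 25),
    "substring values": (29, 38),
    "list patterns": (6, 8),
}

tested = sum(total for _, total in mappings.values())
correct = sum(right for right, _ in mappings.values())
overall = correct / tested

print(f"{correct}/{tested} correct ({overall:.1%})")
```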
![Page 58: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/58.jpg)
58
FOCIH Difficulties
![Page 59: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/59.jpg)
59
FOCIH Difficulties
![Page 60: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/60.jpg)
60
FOCIH Difficulties
No selection
![Page 61: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/61.jpg)
61
WoK via TISP
![Page 62: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/62.jpg)
62
WoK via TISP
![Page 63: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/63.jpg)
63
WoK via FOCIH
![Page 64: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/64.jpg)
64
WoK via FOCIH
![Page 65: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/65.jpg)
65
Contributions
• TISP: automatic sibling table interpretation
• TISP++:
  – Automatic ontology generation based on interpreted tables
  – Automatic semantic annotation for interpreted tables
• FOCIH:
  – Semi-automatic personalized ontology creation
  – Automatic personalized information harvesting and semantic annotation
• All together: contributes to turning the current web of pages into a web of knowledge
![Page 66: Cui Tao PhD Dissertation Defense](https://reader036.vdocument.in/reader036/viewer/2022062407/56812f08550346895d94a5f3/html5/thumbnails/66.jpg)
66
Future Work
• Sibling pages in addition to sibling tables
• Reverse engineer from ontologies to forms as a basis for information harvesting for already defined ontologies.