![Page 1: Information Extraction from Web Documents CS 652 Information Extraction and Integration Li Xu Yihong Ding](https://reader030.vdocument.in/reader030/viewer/2022033022/56649d485503460f94a23fa6/html5/thumbnails/1.jpg)
Information Extraction from Web Documents
CS 652 Information Extraction and Integration
Li Xu
Yihong Ding
IR and IE
- IR (Information Retrieval)
  - Retrieves relevant documents from collections
  - Information theory, probabilistic theory, and statistics
- IE (Information Extraction)
  - Extracts relevant information from documents
  - Machine learning, computational linguistics, and natural language processing
History of IE
- Large amount of both online and offline textual data
- Message Understanding Conference (MUC)
  - Quantitative evaluation of IE systems
  - Tasks
    - Latin American terrorism
    - Joint ventures
    - Microelectronics
    - Company management changes
Evaluation Metrics
- Precision
- Recall
- F-measure
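A minimal sketch of the three metrics, assuming the standard definitions: precision is the fraction of extracted items that are correct, recall is the fraction of correct items that were extracted, and the F-measure is their weighted harmonic mean. The function name `prf`, the `beta` parameter default, and the example sets are illustrative assumptions, not part of the course material.

```python
def prf(extracted, relevant, beta=1.0):
    """Return (precision, recall, F-measure) for two collections of items.

    extracted: items the IE system produced (assumed hashable)
    relevant:  items a human annotator marked as correct
    beta:      weight of recall relative to precision (1.0 = balanced)
    """
    extracted, relevant = set(extracted), set(relevant)
    correct = len(extracted & relevant)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Hypothetical system output versus a hypothetical answer key.
system_output = {"Carlos", "Parliament building", "Monday"}
answer_key = {"Carlos", "Parliament building"}
p, r, f = prf(system_output, answer_key)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Here two of the three extracted items are correct and nothing in the answer key is missed, so precision is lower than recall and the F-measure falls between them.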
Web Documents
- Unstructured (Free) Text
  - Regular sentences and paragraphs
  - Linguistic techniques, e.g., NLP
- Structured Text
  - Itemized information
  - Uniform syntactic clues, e.g., table understanding
- Semistructured Text
  - Ungrammatical, telegraphic (e.g., missing attributes, multi-value attributes, …)
  - Specialized programs, e.g., wrappers
Approaches to IE
- Knowledge Engineering
  - Grammars are constructed by hand
  - Domain patterns are discovered by human experts through introspection and inspection of a corpus
  - Much laborious tuning and “hill climbing”
- Machine Learning
  - Use statistical methods when possible
  - Learn rules from annotated corpora
  - Learn rules from interaction with user
Knowledge Engineering
- Advantages
  - With skill and experience, well-performing systems are not conceptually hard to develop.
  - The best-performing systems have been handcrafted.
- Disadvantages
  - Very laborious development process
  - Some changes to specifications can be hard to accommodate
  - Required expertise may not be available
Machine Learning
- Advantages
  - Domain portability is relatively straightforward
  - System expertise is not required for customization
  - “Data driven” rule acquisition ensures full coverage of examples
- Disadvantages
  - Training data may not exist, and may be very expensive to acquire
  - Large volume of training data may be required
  - Changes to specifications may require reannotation of large quantities of training data
Wrapper
- A specialized program that identifies data of interest and maps them to some suitable format (e.g., XML or relational tables)
- Challenge: recognizing the data of interest among many other uninteresting pieces of text
- Tasks
  - Source understanding
  - Data processing
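To make the wrapper idea concrete, here is a minimal hand-written sketch, assuming a hypothetical HTML page of car listings. The page content, the `ITEM` pattern, and the function name `wrap` are all invented for illustration. The pattern stands in for source understanding (what the data of interest looks like on this page), and the loop stands in for data processing (mapping each match to XML).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page: only the <li> entries are data of interest.
PAGE = """
<html><body>
<h1>Used cars</h1>
<p>Call us today for the best deals!</p>
<ul>
<li><b>Honda Civic</b> - $4,500</li>
<li><b>Ford Focus</b> - $3,200</li>
</ul>
<p>Copyright notice and other text we do not want.</p>
</body></html>
"""

# Source understanding: a pattern describing how one item appears on this page.
ITEM = re.compile(r"<li><b>(?P<model>[^<]+)</b> - \$(?P<price>[\d,]+)</li>")


def wrap(page):
    """Data processing: map every matched item to a <car> element."""
    root = ET.Element("cars")
    for match in ITEM.finditer(page):
        car = ET.SubElement(root, "car")
        ET.SubElement(car, "model").text = match.group("model")
        # Drop the thousands separator so the price is a plain number.
        ET.SubElement(car, "price").text = match.group("price").replace(",", "")
    return root


xml_text = ET.tostring(wrap(PAGE), encoding="unicode")
print(xml_text)
```

The surrounding heading and paragraphs never match the pattern, so they are ignored. The same pattern would break if the site changed its markup, which is why wrappers written this way are tied to page layout.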
Free Text
- AutoSlog
- LIEP
- PALKA
- HASTEN
- CRYSTAL
- Webfoot
- WHISK
AutoSlog [1993]
The Parliament building was bombed by Carlos.
LIEP [1995]
The Parliament building was bombed by Carlos.
PALKA [1995]
The Parliament building was bombed by Carlos.
HASTEN [1995]
The Parliament building was bombed by Carlos.
Egraphs(SemanticLabel, StructuralElement)
CRYSTAL [1995]
The Parliament building was bombed by Carlos.
CRYSTAL + Webfoot [1997]
WHISK [1999]
The Parliament building was bombed by Carlos.
WHISK Rule: *(PhyObj)*@passive *F ‘bombed’ * {PP ‘by’ *F (Person)}
Context-based patterns
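As a rough analogue of a context-based pattern, the sketch below expresses the same idea with an ordinary Python regular expression. This is not WHISK syntax and not how WHISK matches text: the word lists `PHYS_OBJ` and `PERSON` are tiny hand-made stand-ins for semantic classes, and the passive-voice check is reduced to the literal words "was" or "were". All names are assumptions made for this example.

```python
import re

# Hand-made stand-ins for semantic classes (assumptions for this example only).
PHYS_OBJ = ["Parliament building", "embassy"]
PERSON = ["Carlos", "Maria"]


def alternation(phrases):
    """Build a regex alternation that matches any of the given phrases literally."""
    return "|".join(re.escape(p) for p in phrases)


# Skip text, capture a physical object, require a passive 'bombed',
# then capture a person following the word 'by'.
RULE = re.compile(
    r".*?(?P<target>" + alternation(PHYS_OBJ) + r")"
    r".*?\b(?:was|were)\s+bombed\b"
    r".*?\bby\s+(?P<perpetrator>" + alternation(PERSON) + r")"
)


def extract(sentence):
    """Return the filled slots as a dict, or None if the pattern does not apply."""
    match = RULE.search(sentence)
    return match.groupdict() if match else None


print(extract("The Parliament building was bombed by Carlos."))
print(extract("Carlos bombed the embassy."))
```

The second sentence is active voice, so the pattern does not fire. A separate pattern would be needed for it, which illustrates why such rules are tied to the context in which the slots appear.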
Web Documents
- Semistructured and Unstructured
  - RAPIER (E. Califf, 1997)
  - SRV (D. Freitag, 1998)
  - WHISK (S. Soderland, 1998)
- Semistructured and Structured
  - WIEN (N. Kushmerick, 1997)
  - SoftMealy (C-H. Hsu, 1998)
  - STALKER (I. Muslea, S. Minton, C. Knoblock, 1998)
Inductive Learning
- Task
- Inductive Inference
- Learning Systems
  - Zero-order
  - First-order, e.g., Inductive Logic Programming (ILP)
RAPIER [1997]
- Inductive Logic Programming
- Extraction Rules
  - Syntactic information
  - Semantic information
- Advantage
  - Efficient learning (bottom-up)
- Drawback
  - Single-slot extraction
RAPIER Rule
SRV [1998]
- Relational Algorithm (top-down)
- Features
  - Simple features (e.g., length, character type, …)
  - Relational features (e.g., next-token, …)
- Advantages
  - Expressive rule representation
- Drawbacks
  - Single-slot rule generation
  - Large volume of training data
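The distinction between simple and relational features can be sketched as follows. This is only an illustration of the two kinds of feature, not SRV's actual feature set or learning algorithm: the feature names, the helper functions, and the hand-written rule are all assumptions.

```python
def simple_features(token):
    """Simple features: properties of a single token, looked at in isolation."""
    return {
        "length": len(token),
        "capitalized": token[:1].isupper(),
        "numeric": token.isdigit(),
    }


def next_token(tokens, i):
    """Relational feature: map a token position to the position that follows it."""
    return i + 1 if i + 1 < len(tokens) else None


tokens = "The Parliament building was bombed by Carlos".split()

# A hand-written rule in the spirit of a learned one:
# extract the token after 'by' when that token is capitalized.
hits = []
for i, token in enumerate(tokens):
    j = next_token(tokens, i)
    if token == "by" and j is not None and simple_features(tokens[j])["capitalized"]:
        hits.append(tokens[j])

print(hits)
```

The rule fills a single slot (the perpetrator), which matches the single-slot limitation listed above. A second slot would need its own rule.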
SRV Rule
WHISK [1998]
- Covering Algorithm (top-down)
- Advantages
  - Learns multi-slot extraction rules
  - Handles items to be extracted in various orders
  - Handles document types from free text to structured text
- Drawbacks
  - Must see all the permutations of items
  - Less expressive feature set
  - Needs a large volume of training data
WHISK Rule
WIEN [1997]
- Assumes items are always in a fixed, known order
- Introduces several types of wrappers
- Advantages
  - Fast to learn and extract
- Drawbacks
  - Cannot handle permutations and missing items
  - Must label entire pages
  - Does not use semantic classes
WIEN Rule
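A minimal sketch of how a left-right (LR) delimiter wrapper, the simplest of the wrapper types associated with WIEN, might be executed. Each attribute has a left and a right delimiter, and attributes are read in a fixed order. The delimiters here are written by hand, whereas WIEN induces them from labeled pages; the function name `lr_extract` and the sample page are assumptions made for this example.

```python
def lr_extract(page, delimiters):
    """Execute an LR-style wrapper.

    delimiters: one (left, right) string pair per attribute, in page order.
    Returns a list of tuples, one tuple per extracted record.
    """
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows  # no further record begins after pos
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows  # unterminated attribute: stop extracting
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


# Hypothetical page with two (country, code) records.
PAGE = (
    "<HTML><BODY>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "</BODY></HTML>"
)
rows = lr_extract(PAGE, [("<B>", "</B>"), ("<I>", "</I>")])
print(rows)
```

Because the attributes are scanned strictly in order, a record with a missing or reordered attribute would misalign every later value, which is the drawback noted on the WIEN slide.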
SoftMealy [1998]
- Learns a transducer
- Advantages
  - Learns order of items
  - Allows item permutations and missing items
  - Allows both the use of semantic classes and disjunctions
- Drawbacks
  - Must see all possible permutations
  - Cannot use delimiters that do not immediately precede and follow the relevant items
SoftMealy Rule
STALKER [1998,1999,2001]
- Hierarchical Information Extraction
- Embedded Catalog Tree (ECT) Formalism
- Advantages
  - Extracts nested data
  - Allows item permutations and missing items
  - Need not see all of the permutations
  - One hard-to-extract item does not affect others
- Drawbacks
  - Does not exploit item order
STALKER Rule
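STALKER-style rules are usually described in terms of landmarks that the extractor skips to. The sketch below illustrates that idea only: each item gets its own start rule (a sequence of landmarks) and an end landmark, so one item's rule does not depend on the others. It is a simplification under stated assumptions. The helper names and the sample text are invented, landmarks are plain strings, and the end of an item is found by a forward search rather than by a separately learned end rule.

```python
def skip_to(text, landmarks, start=0):
    """Consume text until each landmark has been matched in sequence.

    Returns the index just past the last landmark, or -1 if any is missing.
    """
    pos = start
    for landmark in landmarks:
        idx = text.find(landmark, pos)
        if idx == -1:
            return -1
        pos = idx + len(landmark)
    return pos


def extract_item(text, start_rule, end_landmark):
    """Extract the text between the end of start_rule and the next end_landmark."""
    begin = skip_to(text, start_rule)
    if begin == -1:
        return None
    end = text.find(end_landmark, begin)
    return text[begin:end] if end != -1 else None


# Hypothetical restaurant description.
TEXT = "Name: <b>Yala</b><p>Cuisine: Thai<p>Phone: <i>(800) 111-1717</i>"

phone = extract_item(TEXT, ["Phone:", "<i>"], "</i>")
name = extract_item(TEXT, ["Name:", "<b>"], "</b>")
fax = extract_item(TEXT, ["Fax:", "<i>"], "</i>")
print(name, phone, fax)
```

The phone rule works whether or not the name or cuisine is present and regardless of their order, and the missing fax item simply yields `None` without disturbing the other two.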
Web IE Tools (main technique used)
- Wrapper languages (TSIMMIS, WebOQL)
- HTML-aware (W4F, XWRAP, RoadRunner, Lixto)
- NLP-based (RAPIER, SRV, WHISK)
- Inductive learning (WIEN, SoftMealy, STALKER)
- Modeling-based (NoDoSE, DEByE)
- Ontology-based (BYU ontology)
Degree of Automation
- Trade-off: page-layout dependent
- RoadRunner
  - Assumes target pages were automatically generated from some data sources
  - The only fully automatic wrapper generator
- BYU ontology
  - Manually created with a graphical editing tool
  - Extraction process fully automatic
Support of Complex Objects
- Complex objects: nested objects, graphs, trees, complex tables, …
- Earlier tools, such as RAPIER, SRV, WHISK, and WIEN, do not support extraction of complex objects.
- BYU ontology: supports complex objects
Page Contents
- Semistructured data (table type, richly tagged)
- Semistructured text (text type, rarely tagged)
- NLP-based tools: text type only
- Other tools (except ontology-based): table type only
- BYU ontology: both types
Ease of Use
HTML-aware tools, easiest to use
Wrapper languages, hardest to use
Other tools, in the middle
Output
XML is the best output format for data sharing on the Web.
Support for Non-HTML Sources
- NLP-based and ontology-based tools: supported automatically
- Other tools: may support them, but need additional helpers such as syntactic and semantic analyzers
- BYU ontology: supported
Resilience and Adaptiveness
- Resilience: continuing to work properly when the target pages change
- Adaptiveness: working properly with pages from other sources in the same application domain
- Only the BYU ontology has both features.
Summary of Qualitative Analysis
Graphical Perspective of Qualitative Analysis
Columns: Name, Structure, Semi, Free, Single-slot, Multi-slot, Missing items, Permutations, Nested data, Resilient
WIEN: X X X
SoftMealy: X X X X X X*
STALKER: X X X * X X X
RAPIER: X X ? X X X ?
SRV: X X ? X X X ?
WHISK: X X X X X X X* ?
AutoSlog: X X X X
RoadRunner: X X X X X
BYU Onto: X X ? X X X X X X
X means the information extraction system has the capability; X* means the system has the capability as long as the training corpus contains the required training data; ? means the system has the capability to some degree; * means that the extraction pattern itself does not show the capability, but the overall system has it.
Problem of IE (unstructured documents)
[Figure: Information Extraction between Source and Target, across the levels Meaning, Knowledge, Information, Data]
Problem of IE (structured documents)
[Figure: Information Extraction between Source and Target, across the levels Meaning, Knowledge, Information, Data]
Problem of IE (semistructured documents)
[Figure: Information Extraction between Source and Target, across the levels Meaning, Knowledge, Information, Data]
Solution of IE (the Semantic Web)
[Figure: Information Extraction between Source and Target, across the levels Meaning, Knowledge, Information, Data]