Comparison of IE Approaches
Chia-Hui Chang
National Central University
Jan. 4, 2005
Introduction
• Abundant information on the Web
  – Static Web pages
  – Searchable databases: Deep Web
• Information Integration
  – Information for life
    • e.g. shopping agents, travel agents
  – Data for research purposes
    • e.g. bioinformatics, auction economy
Various IE Surveys
• Muslea
• Hsu and Dung
• Chang
• Kushmerick
• Laender
• Sarawagi
• Kuhlins and Tredwell
Related Work: Time
• MUC Approaches
  – AutoSlog [Riloff, 1993], LIEP [Huffman, 1996], PALKA [Kim, 1995], HASTEN [Krupka, 1995], and CRYSTAL [Soderland, 1995]
• Post-MUC Approaches
  – WHISK [Soderland, 1999], RAPIER [Califf, 1998], SRV [Freitag, 1998], WIEN [Kushmerick, 1997], SoftMealy [Hsu, 1998] and STALKER [Muslea, 1999]
Related Work: Automation Degree
• Hsu and Dung [1998]
  – Hand-crafted wrappers using general programming languages
  – Specially designed programming languages or tools
  – Heuristic-based wrappers
  – Wrapper induction (WI) approaches
Related Work: Automation Degree
• Chang and Kuo [2003]
  – Systems that need programmers
  – Systems that need annotation examples
  – Annotation-free systems
  – Semi-supervised systems
Related Work: Input and Extraction Rules
• Muslea [1999]
  – The first class performs IE from free text using extraction patterns that are mainly based on syntactic/semantic constraints.
  – The second class is wrapper induction systems, which rely on delimiter-based rules.
  – The third class also performs IE from online documents; however, its patterns are based on both delimiters and syntactic/semantic constraints.
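
To make the second class concrete, here is a minimal sketch of a delimiter-based rule, in the spirit of the left/right (LR) delimiter wrappers that wrapper induction systems learn. The delimiters are hand-written here rather than induced, and the function name `lr_extract` and the sample page are assumptions for illustration only.

```python
from typing import List, Tuple


def lr_extract(page: str, delimiters: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Extract records using one (left, right) delimiter pair per attribute.

    The page is scanned left to right; each attribute value is the text
    between its left delimiter and the next occurrence of its right delimiter.
    """
    records: List[Tuple[str, ...]] = []
    if not delimiters:
        return records
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records  # no further records on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records  # unterminated attribute: stop
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))


# Hypothetical page with two records of two attributes each.
SAMPLE_PAGE = "<li><b>Congo</b> <i>242</i></li><li><b>Egypt</b> <i>20</i></li>"
ROWS = lr_extract(SAMPLE_PAGE, [("<b>", "</b>"), ("<i>", "</i>")])
```

A rule of this kind has no way to skip a missing attribute or reorder attributes, which is one reason the output-side criteria later in the talk matter.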
Related Work: Extraction Rules
• Kushmerick [2003]
  – Finite-state tools (regular expressions)
  – Relational learning tools (logic rules)
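
As a small illustration of the finite-state family, the sketch below expresses an extraction rule as a regular expression using Python's standard `re` module. The pattern, the group names `country` and `code`, and the sample markup are assumptions made for this example, not rules taken from any surveyed tool.

```python
import re
from typing import Dict, List

# Hypothetical rule: a bold country name followed by an italic numeric code.
RULE = re.compile(r"<b>(?P<country>[^<]+)</b>\s*<i>(?P<code>\d+)</i>")


def extract(page: str) -> List[Dict[str, str]]:
    """Return one dict per non-overlapping match of RULE in the page."""
    return [match.groupdict() for match in RULE.finditer(page)]


SAMPLE = "<li><b>Congo</b> <i>242</i></li><li><b>Egypt</b> <i>20</i></li>"
RESULT = extract(SAMPLE)
```

A relational (logic-rule) learner would instead describe a slot through predicates over tokens and their neighbours; that style is not shown here.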
Related Work: Techniques
• Laender [2002]
  – Languages for wrapper development
  – HTML-aware tools
  – NLP-based tools
  – Wrapper induction tools (e.g., WIEN, SoftMealy and STALKER)
  – Modeling-based tools
  – Ontology-based tools
• New Criteria:
  – Degree of automation, support for complex objects, page contents, availability of a GUI, XML output, support for non-HTML sources, resilience and adaptiveness
Related Work: Output Targets
• Sarawagi [2002]
  – Record-level
  – Page-level
  – Site-level
Related Work: Usability
• Kuhlins and Tredwell [2002]
  – Commercial
  – Noncommercial
Three Dimensions
• Task Domain
  – Input (unstructured, semi-structured)
  – Output Targets (record-level, page-level, site-level)
• Automation Degree
  – Programmer-involved, learning-based or annotation-free approaches
• Techniques
  – Regular expression rules vs. Prolog-like logic rules
  – Deterministic finite-state transducers vs. probabilistic hidden Markov models
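
The deterministic side of the last contrast can be sketched as a tiny finite-state transducer over tag-level tokens. This is an illustrative toy, assuming a hand-written transition table; the state names, `TRANSITIONS`, `tokenize` and `transduce` are hypothetical and do not reproduce any surveyed system. A hidden Markov model would replace the fixed table with transition and emission probabilities.

```python
import re
from typing import Dict, List, Tuple

# (current state, input token) -> next state. Tokens that trigger a
# transition are consumed silently; any other token seen while inside an
# attribute state is emitted as part of that attribute.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("skip", "<b>"): "name",
    ("name", "</b>"): "skip",
    ("skip", "<i>"): "code",
    ("code", "</i>"): "skip",
}


def tokenize(page: str) -> List[str]:
    """Split a page into HTML tags and whitespace-separated words."""
    return re.findall(r"<[^>]+>|[^<\s]+", page)


def transduce(tokens: List[str]) -> List[Tuple[str, str]]:
    """Run the transducer and return (attribute, token) pairs."""
    state = "skip"
    output: List[Tuple[str, str]] = []
    for token in tokens:
        next_state = TRANSITIONS.get((state, token))
        if next_state is not None:
            state = next_state
        elif state != "skip":
            output.append((state, token))
    return output


TOKENS = tokenize("<b>Congo</b> <i>242</i>")
PAIRS = transduce(TOKENS)
```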
Classification by Automation Degree
• Manual
  – TSIMMIS, Minerva, WebOQL, W4F, XWrap
• Supervised
  – WIEN, STALKER, SoftMealy
• Semi-supervised
  – IEPAD, OLERA
• Unsupervised
  – DeLa, RoadRunner, EXALG
Task Domain: Input
Task Domain: Output
• Missing Attributes
• Multi-valued Attributes
• Multiple Permutations
• Nested Data Objects
• Various Templates for an Attribute
• Common Templates for Various Attributes
• Untokenized Attributes
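
Two of these output variations, missing attributes and multi-valued attributes, can be illustrated with a small data model. The sketch below is an assumption-laden example: the `Book` record, the `parse_books` helper and the `<li>/<b>/<i>/<u>` markup are all hypothetical and chosen only to show why an extractor's target schema needs optional and list-valued fields.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Book:
    title: str
    authors: List[str] = field(default_factory=list)  # multi-valued attribute
    price: Optional[str] = None                       # attribute that may be missing


def parse_books(page: str) -> List[Book]:
    """Parse one Book per <li> block, tolerating absent prices."""
    books: List[Book] = []
    for block in re.findall(r"<li>(.*?)</li>", page, flags=re.S):
        title = re.search(r"<b>(.*?)</b>", block)
        if title is None:
            continue  # not a record in this hypothetical layout
        authors = re.findall(r"<i>(.*?)</i>", block)
        price = re.search(r"<u>(.*?)</u>", block)
        books.append(Book(title.group(1), authors, price.group(1) if price else None))
    return books


SAMPLE = "<li><b>A</b><i>X</i><i>Y</i><u>$5</u></li><li><b>B</b><i>Z</i></li>"
BOOKS = parse_books(SAMPLE)
```

The first record carries two authors and a price; the second has one author and no price, so its `price` field stays `None`.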
Tools       PT      NHS  CP       EL            Nested     MA   MVA  MOA      FVF  UDA  UTA  SPA
Manual
Minerva     Semi-S  Yes  Yes      Record Level  Yes        Yes  Yes  Yes      Yes  No   Yes  Yes
TSIMMIS     Semi-S  Yes  Yes      Record Level  Yes        Yes  Yes  No       Yes  No   Yes  No
WebOQL      Semi-S  No   Yes      Record Level  Yes        Yes  Yes  Yes      Yes  No   No   No
W4F         Semi-S  No   Yes      Record Level  Yes        Yes  Yes  Yes      No   No   No   Yes
XWRAP       Semi-S  No   Yes      Record Level  Yes        Yes  Yes  No       No   No   No   Yes
Supervised
RAPIER      Free    Yes  Yes      Field Level   No         Yes  Yes  Yes      Yes  Yes  Yes  No
SRV         Free    Yes  Yes      Field Level   No         Yes  Yes  Yes      Yes  Yes  Yes  No
WHISK       Free    Yes  Yes      Record Level  No         Yes  Yes  Yes      Yes  Yes  Yes  No
NoDoSE      Semi-S  Yes  Yes      Record Level  Yes        Yes  Yes  Yes      No   No   No   No
DEByE       Semi-S  Yes  Yes      Record Level  Yes        Yes  Yes  Yes      No   No   No   No
WIEN        Semi-S  Yes  Yes      Record Level  No         No   No   No       No   No   No   No
STALKER     Semi-S  Yes  Yes      Record Level  Yes        Yes  Yes  Yes      Yes  No   Yes  Yes
SoftMealy   Semi-S  Yes  Yes      Record Level  MultiPass  Yes  Yes  Limited  Yes  No   Yes  Yes
Semi-Supervised
IEPAD       Semi-S  No   Limited  Record Level  Limited    Yes  Yes  Limited  Yes  No   Yes  Yes
OLERA       Semi-S  No   Limited  Record Level  Limited    Yes  Yes  Limited  Yes  No   Yes  Yes
Unsupervised
RoadRunner  Semi-S  No   Limited  Page Level    Yes        Yes  Yes  No       No   No   No   Yes
EXALG       Semi-S  Yes  Limited  Page Level    Yes        Yes  Yes  No       Yes  No   No   Yes
DeLa        Semi-S  No   Limited  Record Level  Yes        Yes  Yes  Limited  Yes  No   No   Yes
Automation Degree
• Page-fetching Support
• Annotation Requirement
• Output Support
• API Support
Tools       GUI Support  Page-Fetching Support  Output Support  Training Examples  API Support
Minerva     No           No                     XML             No                 Yes
TSIMMIS     No           No                     Text            No                 Yes
WebOQL      No           No                     Text            No                 Yes
W4F         Yes          Yes                    XML             Labeled            Yes
XWRAP       Yes          Yes                    XML             Labeled            Yes
RAPIER      No           No                     Text            Labeled            No
SRV         No           No                     Text            Labeled            No
WHISK       No           No                     Text            Labeled            No
NoDoSE      Yes          No                     XML, OEM        Labeled            Yes
DEByE       Yes          Yes                    XML, SQL DB     Labeled            Yes
WIEN        Yes          No                     Text            Labeled            Yes
STALKER     Yes          No                     Text            Labeled            Yes
SoftMealy   Yes          Yes                    XML, SQL DB     Labeled            Yes
IEPAD       Yes          No                     Text            Unlabeled          No
OLERA       Yes          No                     XML             Unlabeled          No
RoadRunner  No           Yes                    XML             Unlabeled          Yes
EXALG       No           No                     Text            Unlabeled          No
DeLa        No           Yes                    Text            Unlabeled          Yes
Technologies
• Scan passes
• Extraction rule types
• Learning algorithms
• Tokenization schemes
• Features used
Tools       Scan Pass  Extraction Rule Type  Features Used             Learning Algorithm                         Tokenization Scheme
Minerva     Single     Regular exp.          HTML tags/Literal words   None                                       Manually
TSIMMIS     Single     Regular exp.          HTML tags/Literal words   None                                       Manually
WebOQL      Single     Regular exp.          Hypertree                 None                                       Manually
W4F         Single     Regular exp.          DOM tree path addressing  None                                       Tag Level
XWRAP       Single     Context-Free          DOM tree                  None                                       Tag Level
RAPIER      Multiple   Logic rules           Syntactic/Semantic        ILP (bottom-up)                            Word Level
SRV         Multiple   Logic rules           Syntactic/Semantic        ILP (top-down)                             Word Level
WHISK       Single     Regular exp.          Syntactic/Semantic        Set covering (top-down)                    Word Level
NoDoSE      Single     Regular exp.          HTML tags/Literal words   Data Modeling                              Word Level
DEByE       Single     Regular exp.          HTML tags/Literal words   Data Modeling                              Word Level
WIEN        Single     Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                         Word Level
STALKER     Multiple   Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                         Word Level
SoftMealy   Both       Regular exp.          HTML tags/Literal words   Ad-hoc (bottom-up)                         Word Level
IEPAD       Single     Regular exp.          HTML tags                 Pattern Mining, String Alignment           Multi-Level
OLERA       Single     Regular exp.          HTML tags                 String Alignment                           Multi-Level
RoadRunner  Single     Regular exp.          HTML tags                 String Alignment                           Tag Level
EXALG       Single     Regular exp.          HTML tags/Literal words   Equivalent Class and Role Differentiation  Word Level
DeLa        Single     Regular exp.          HTML tags                 Pattern Mining                             Tag Level
Conclusion
• Criteria for evaluating IE systems from the task domain
• Comparison of IE systems across degrees of automation
• The use of various techniques in IE systems
Future Work
• Page Fetching
  – XWrap, W4F, WNDL
• Schema Mapping
  – Full information
  – Partial information
• Query Interface Integration
  – [He, Chang and Han, 2004]
References
• C.-H. Chang, M. Kayed, M. R. Girgis, K. Shaalan, Criteria for Evaluating Web Information Extraction Systems.