A Survey of Approaches on Mining the Structure from Unstructured Data
Dutch-Belgian Database Day 2009 (DBDBD 2009)
Nov. 30, 2009
Frederik [email protected]
Flavius [email protected]
Uzay [email protected]
Econometric Institute
Erasmus University Rotterdam
PO Box 1738, NL-3000 DR Rotterdam, the Netherlands
Introduction
• A lot of data is generated every day
• Difficult to find information that meets one’s needs
• There is a need to mine the structure of data as a first step towards understanding it
• Part of the effort to make the Web machine-understandable
• Solution: employ NLP techniques to extract knowledge from unstructured text written in natural language
Which Technique to Choose?
Statistics-Based NLP (1)
• Utilize statistics and mathematical models based on probability theory
• Refers to all non-symbolic and non-logical work on NLP, i.e., it encompasses all quantitative approaches to automated language processing, including:
– Probabilistic modeling
– Information theory
– Linear algebra
• Phrases extracted from text written in an arbitrary natural language are analyzed in order to find (statistical) relations
Statistics-Based NLP (2)
• Word-based:
– Statistics collection on words
– Frequency counting and ranking generation (e.g., TF-IDF)
– Collocations (cliff-hanger, eye candy, take care, profit announcement, etc.)
– Word Sense Disambiguation (WSD)
– Inference models: n-grams
– Clustering
• Grammar-based:
– Part-Of-Speech (POS) tagging
– Stochastic Context-Free Grammars (SCFG)
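
The frequency counting and ranking item in the word-based list can be illustrated with a minimal TF-IDF sketch. It assumes whitespace tokenization, relative term frequency, and the unsmoothed inverse document frequency log(N / df). The function names `tf_idf` and `rank_terms` and the toy documents are illustrative assumptions, not part of the talk.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def rank_terms(doc_scores):
    """Terms of one document, highest score first (ties broken alphabetically)."""
    return sorted(doc_scores, key=lambda term: (-doc_scores[term], term))


if __name__ == "__main__":
    # Toy corpus, purely illustrative.
    docs = ["the profit rose", "the profit fell", "the cat sat"]
    for doc, doc_scores in zip(docs, tf_idf(docs)):
        print(doc, "->", rank_terms(doc_scores))
```

A term that occurs in every document (here "the") scores zero, which is the behaviour that makes TF-IDF useful for ranking terms that are characteristic of a single document.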
Statistics-Based NLP (3)
• Advantages:
– Not knowledge-based, so they require neither linguistic resources nor expert knowledge
– Issues regarding leaky grammars, inconsistencies among humans, dialects, etc. are alleviated
• Disadvantages:
– Often need a large amount of data
– Do not deal with meaning explicitly, i.e., statistical methods discover relations in corpora without considering semantics
Statistics-Based NLP (4)
• Examples:
– (Bannard et al., 2003) discuss several techniques for using statistical models acquired from corpus data to infer the meaning of verb-particle constructions:
  • Collocation-like approach, frequency counting
  • Focus on mining relations between words
– (Taira and Soderland, 1999) implement a statistical natural language processor:
  • Based on resonance probabilities between word pairs
  • Uses word affinity knowledge from training sentences
  • Focus on acquiring knowledge from radiology reports
Pattern-Based NLP (1)
• Use linguistic patterns to extract data from texts
• Patterns can be:
– Predefined
– Discovered (learned)
• Knowledge used:
– Lexical knowledge
– Syntactic knowledge
– Semantic knowledge
Pattern-Based NLP (2)
• Lexico-syntactic patterns:
– Combine lexical and syntactic elements with regular expressions
– E.g., “{NNP, }* NNP{,}? and NNP {(announce | discuss)} collaboration {with NNP}?” mines a corpus for information on mergers and collaborations of companies and/or persons
• Lexico-semantic patterns:
– Enrich lexico-syntactic patterns through the addition of semantics
– Gazetteers (simple typing):
  • Use linguistic meaning of text
  • E.g., “[sub:company] announces collaboration with [obj:company]”
– Ontologies (complex typing):
  • Also include relationships
  • E.g., “[kb:Company] kb:collaborates [kb:Company]”
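
As a rough illustration of the lexico-syntactic example above, the pattern can be approximated with an ordinary regular expression over POS-tagged tokens. The sketch assumes each token is encoded as "word/TAG" with Penn Treebank tags, and that tagging has already been done (the sentence below is tagged by hand). The names `word`, `encode`, and `find_collaborations` are hypothetical helpers, not taken from any of the systems discussed.

```python
import re

# One token is encoded as "word/TAG"; NNP is the Penn Treebank tag for a
# singular proper noun. The lookahead stops NNP from also matching NNPS.
NNP = r"\S+/NNP(?!\S)"


def word(*alternatives):
    """Regex for one token whose word is any of the alternatives (any tag)."""
    return "(?:" + "|".join(map(re.escape, alternatives)) + r")/\S+"


# {NNP, }* NNP{,}? and NNP {(announce | discuss)} collaboration {with NNP}?
PATTERN = re.compile(
    r"(?<!\S)"                            # start at a token boundary
    + "(?:" + NNP + " ,/, )*"             # {NNP, }*
    + NNP + "(?: ,/,)?"                   # NNP{,}?
    + " " + word("and") + " " + NNP       # and NNP
    + " " + word("announce", "discuss")   # (announce | discuss)
    + " " + word("collaboration")         # collaboration
    + "(?: " + word("with") + " " + NNP + ")?"  # {with NNP}?
)


def encode(tagged):
    """Turn [(token, tag), ...] into the 'word/TAG word/TAG ...' string."""
    return " ".join(token + "/" + tag for token, tag in tagged)


def find_collaborations(tagged):
    """Return, per match, the proper nouns taking part in the collaboration."""
    results = []
    for match in PATTERN.finditer(encode(tagged)):
        tokens = match.group(0).split()
        results.append([t.rsplit("/", 1)[0] for t in tokens if t.endswith("/NNP")])
    return results


if __name__ == "__main__":
    # Hand-tagged, illustrative sentence.
    sentence = [
        ("Google", "NNP"), (",", ","), ("Apple", "NNP"), ("and", "CC"),
        ("Microsoft", "NNP"), ("announce", "VBP"), ("collaboration", "NN"),
        ("with", "IN"), ("IBM", "NNP"), (".", "."),
    ]
    print(find_collaborations(sentence))
```

A lexico-semantic variant would replace the NNP slots with a lookup in a gazetteer or ontology, so that only tokens typed as companies are accepted.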
Pattern-Based NLP (3)
• Advantages:
– Need less training data
– Complex expressions can be defined
– Results are easily interpretable
• Disadvantages:
– Lexical knowledge is required
– Prior expert/domain knowledge might be required (for lexico-semantic patterns)
– Defining and maintaining patterns is a cumbersome and non-trivial task
Pattern-Based NLP (4)
• Examples:
– CAFETIERE (Black et al., 2005):
  • Employs extraction rules defined at lexico-semantic level
  • Makes use of gazetteering
  • Knowledge is stored using Narrative Knowledge Representation Language (NKRL)
  • Knowledge base lacks reasoning support
  • Focus on extracting relations from corpora
– Hermes (Frasincar et al., 2009):
  • Patterns defined at lexico-semantic level
  • Makes use of ontologies and reasoning engines
  • Knowledge is based on an OWL domain ontology
  • Focus on the use of pattern-based NLP in building personalized news services
Hybrid NLP (1)
• Combine linguistic knowledge with statistical methods
• In practice, it is often difficult to stay within the boundaries of a single approach
• Thus, it is convenient to combine the best of both worlds:
– Bootstrapping lexical methods
– Compensating for a lack of expert knowledge by applying statistical methods
– Statistical methods that use some existing (lexical) knowledge
Hybrid NLP (2)
• Advantages:
– Solve problems related to scaling and required expert knowledge of pattern-based approaches
– Do not require as much data as statistical approaches
– Inherit some of the advantages of both statistical and pattern-based approaches
• Disadvantages:
– By combining different techniques, maintaining completeness and accuracy of the systems becomes more difficult
– Multidisciplinary aspects
– Inherit some of the disadvantages of both statistical and pattern-based approaches
Hybrid NLP (3)
• Examples:
– Corpus-Based Statistics-Oriented techniques (Su et al., 1996):
  • Mainly statistical learning techniques, guided by high-level linguistic constructs
  • Applications in POS tagging, semantic analysis of corpora, machine translation, annotation, etc.
  • Focus is on extracting inductive knowledge from corpora to support building large-scale NLP systems
– PANKOW (Cimiano et al., 2004):
  • Generates instances of lexico-syntactic patterns indicating a certain semantic or ontological relation
  • Counts number of occurrences of patterns
  • Statistical distribution of instances of these patterns constitutes the collective knowledge
  • Focus is on supporting annotation
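
The generate-and-count idea described for PANKOW can be sketched in a few lines. This is a simplified illustration, not the system's implementation: the three pattern templates, the naive "+s" pluralization, and counting over a tiny in-memory list of sentences are all assumptions made for the sake of a self-contained example, as are the function names.

```python
from collections import Counter

# Hypothetical pattern templates indicating an instance-of relation.
TEMPLATES = [
    "{concept}s such as {instance}",
    "{instance} is a {concept}",
    "{instance} and other {concept}s",
]


def count_pattern_instances(instance, concepts, corpus):
    """Count, per candidate concept, how often its pattern instances occur."""
    sentences = [sentence.lower() for sentence in corpus]
    counts = Counter()
    for concept in concepts:
        counts[concept] = 0
        for template in TEMPLATES:
            # Instantiate the pattern for this (instance, concept) pair.
            phrase = template.format(concept=concept, instance=instance).lower()
            counts[concept] += sum(s.count(phrase) for s in sentences)
    return counts


def best_concept(instance, concepts, corpus):
    """Pick the concept with the most pattern hits, or None without evidence."""
    counts = count_pattern_instances(instance, concepts, corpus)
    if not counts:
        return None
    concept, hits = counts.most_common(1)[0]
    return concept if hits > 0 else None


if __name__ == "__main__":
    # Toy corpus, purely illustrative.
    corpus = [
        "Rivers such as Rhine and Meuse cross the country.",
        "The Rhine is a river in Europe.",
        "We stayed at a place where the Rhine is a hotel name.",
    ]
    print(count_pattern_instances("Rhine", ["river", "hotel"], corpus))
    print(best_concept("Rhine", ["river", "hotel"], corpus))
```

The distribution of counts over the candidate concepts plays the role of the collective knowledge mentioned above; an annotation tool would propose the concept with the strongest evidence.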
Conclusions
• Three main approaches to NLP:
– Statistics-based
– Pattern-based
– Hybrid
• Which techniques to use for your NLP tasks? There is no single best approach, but consider these rough guidelines:
– Evaluate your problem, preferences, and available resources
– If you are less concerned with semantics and you assume that knowledge lies within statistical facts on a specific corpus, use a statistics-based approach
– If you are concerned with the semantics of discovered information, or you want to be able to easily explain and control the results, use a pattern-based approach
– If you need to bootstrap a pattern-based approach using statistics (e.g., insufficient knowledge available) or the other way around (e.g., need for a priori knowledge), use a hybrid approach
References
• C. Bannard, T. Baldwin, and A. Lascarides. A statistical approach to the semantics of verb-particles. In ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 65-72. Association for Computational Linguistics, 2003.
• W. J. Black, J. McNaught, A. Vasilakopoulos, K. Zervanou, B. Theodoulidis, and F. Rinaldi. CAFETIERE: Conceptual Annotations for Facts, Events, Terms, Individual Entities, and Relations. Technical Report TR-U4.3.1, Department of Computation, UMIST, Manchester, 2005.
• P. Cimiano, S. Handschuh, and S. Staab. Towards the Self-Annotating Web. In 13th International Conference on World Wide Web (WWW 2004), pages 462-471. ACM, 2004.
• F. Frasincar, J. Borsje, and L. Levering. A Semantic Web-Based Approach for Building Personalized News Services. International Journal of E-Business Research, 5(3):35-53, 2009.
• K.-Y. Su, T.-H. Chiang, and J.-S. Chang. An Overview of Corpus-Based Statistics-Oriented (CBSO) Techniques for Natural Language Processing. Computational Linguistics and Chinese Language Processing, 1(1):101-157, 1996.
• R. K. Taira and S. G. Soderland. A statistical natural language processor for medical reports. In AMIA Symposium 1999, pages 970-974. American Medical Informatics Association, 1999.