A Survey of Approaches on Mining the Structure from Unstructured Data

Dutch-Belgian Database Day 2009 (DBDBD 2009)


Nov. 30, 2009

Frederik [email protected]

Flavius [email protected]

Uzay [email protected]

Econometric Institute, Erasmus University Rotterdam

PO Box 1738, NL-3000 DR Rotterdam, the Netherlands

Introduction

• A lot of data is generated every day

• It is difficult to find information that meets one’s needs

• There is a need to mine the structure of data as a first step towards understanding it

• This is part of the effort to make the Web machine-understandable

• Solution: employ Natural Language Processing (NLP) techniques to extract knowledge from unstructured text written in natural language


Which Technique to Choose?


Statistics-Based NLP (1)

• Utilize statistics and mathematical models based on probability theory

• Refers to all non-symbolic and non-logical work on NLP, i.e., it encompasses all quantitative approaches to automated language processing, including:

– Probabilistic modeling

– Information theory

– Linear algebra

• Phrases extracted from text written in an arbitrary natural language are analyzed in order to find (statistical) relations
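
As an illustration of finding statistical relations between words, the sketch below scores adjacent word pairs by pointwise mutual information (PMI), a common association measure. The toy corpus, the `min_count` threshold, and the function name `pmi_scores` are assumptions made for this example and do not come from the surveyed work.

```python
import math
from collections import Counter


def pmi_scores(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI compares the observed pair probability with the probability
        # expected if the two words occurred independently.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical toy corpus; a real application would need far more text.
toy_corpus = [
    "the profit announcement surprised the market",
    "the market ignored the profit announcement",
    "analysts take care when the market moves",
    "investors take care of the details",
]
scores = pmi_scores(toy_corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Pairs such as ("take", "care") score higher than pairs involving a very frequent word such as "the", which is the behaviour one hopes for when looking for collocation candidates.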




• Word-based:

– Statistics collection on words

– Frequency counting and ranking generation (e.g., TF-IDF)

– Collocations (cliff-hanger, eye candy, take care, profit announcement, etc.)

– Word Sense Disambiguation (WSD)

– Inference models: n-grams

– Clustering

• Grammar-based:

– Part-Of-Speech (POS) tagging

– Stochastic Context-Free Grammars (SCFG)
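
The frequency counting and ranking mentioned above can be sketched with a small TF-IDF computation. This is one common variant (relative term frequency multiplied by the natural logarithm of the inverse document frequency); other weightings exist, and the three documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical mini-corpus.
docs = [
    "the company announced a profit",
    "the company announced a collaboration",
    "the weather was mild",
]
weights = tf_idf(docs)
# The highest-weighted term per document is the one that best
# distinguishes it from the rest of the corpus.
top_terms = [max(w, key=w.get) for w in weights]
```

A term that occurs in every document, such as "the", receives weight zero, whereas "profit" and "collaboration" rank highest in their respective documents.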




• Advantages:

– Not based on knowledge, thus they require neither linguistic resources nor expert knowledge

– Issues regarding leaking grammars, inconsistencies among humans, dialects, etc. are alleviated

• Disadvantages:

– Often need a large amount of data

– Approaches do not deal with meaning explicitly, i.e., statistical methods discover relations in corpora without considering semantics
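
The need for large amounts of data can be made concrete with a bigram model, one of the n-gram inference models listed earlier. The sketch below uses plain maximum-likelihood estimates without smoothing, which is a simplifying assumption; the two training sentences are invented.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Maximum-likelihood estimates of P(next word | word)."""
    follow = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark sentence boundaries.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            follow[prev][nxt] += 1
    return {
        prev: {w: c / sum(nexts.values()) for w, c in nexts.items()}
        for prev, nexts in follow.items()
    }


model = train_bigram_model([
    "the company announced a profit",
    "the company announced a collaboration",
])
seen = model["a"].get("profit", 0.0)   # pair observed in training
unseen = model["a"].get("loss", 0.0)   # plausible English, never observed
```

With so little training text, a perfectly ordinary continuation such as "a loss" gets probability zero. The model also only records which words follow which; it carries no notion of what "profit" or "loss" means.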




• Examples:

– (Bannard et al., 2003) discuss several techniques for using statistical models acquired from corpus data to infer the meaning of verb-particle constructions:

  • Collocation-like approach, frequency counting

  • Focus on mining relations between words

– (Taira and Soderland, 1999) implement a statistical natural language processor:

  • Based on resonance probabilities between word pairs

  • Uses word affinity knowledge from training sentences

  • Focus on acquiring knowledge from radiology reports



Pattern-Based NLP (1)

• Use linguistic patterns to extract data from texts

• Patterns can be:

– Predefined

– Discovered (learned)

• Knowledge used:

– Lexical knowledge

– Syntactic knowledge

– Semantic knowledge




• Lexico-syntactic patterns:

– Combine lexical and syntactic elements with regular expressions

– E.g., “{NNP, }* NNP{,}? and NNP {(announce | discuss)} collaboration {with NNP}?” mines a corpus for information on mergers and collaborations of companies and/or persons

• Lexico-semantic patterns:

– Enrich lexico-syntactic patterns through the addition of semantics

– Gazetteers (simple typing):

  • Use linguistic meaning of text

  • E.g., “[sub:company] announces collaboration with [obj:company]”

– Ontologies (complex typing):

  • Include also relationships

  • E.g., “[kb:Company] kb:collaborates [kb:Company]”
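
The lexico-syntactic example pattern above can be approximated with an ordinary regular expression over POS-tagged text. The sketch assumes the input has already been tagged and serialized as space-separated `word/TAG` tokens with Penn Treebank tags; that encoding, the company names, and the function name `find_collaborations` are assumptions for illustration.

```python
import re

# Regex rendering of:
#   {NNP, }* NNP{,}? and NNP {(announce | discuss)} collaboration {with NNP}?
PATTERN = re.compile(
    r"(?P<parties>(?:\w+/NNP ,/, )*\w+/NNP(?: ,/,)? and/CC \w+/NNP) "
    r"(?:announce|discuss)/VB[A-Z]? "
    r"collaboration/NN"
    r"(?: with/IN (?P<partner>\w+)/NNP)?"
)


def find_collaborations(tagged_sentence):
    """Return (coordinated parties, optional partner) for each match."""
    results = []
    for match in PATTERN.finditer(tagged_sentence):
        parties = re.findall(r"(\w+)/NNP", match.group("parties"))
        results.append((parties, match.group("partner")))
    return results


# Hand-tagged, hypothetical sentences.
tagged = ("Acme/NNP ,/, Globex/NNP and/CC Initech/NNP "
          "announce/VBP collaboration/NN with/IN Umbrella/NNP")
hits = find_collaborations(tagged)
no_hits = find_collaborations("Acme/NNP reports/VBZ profit/NN")
```

As in the pattern on the slide, only the literal verb forms "announce" and "discuss" are matched; covering inflected forms would require extending the expression or matching on lemmas.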




• Advantages:

– Need less training data

– Complex expressions can be defined

– Results are easily interpretable

• Disadvantages:

– Lexical knowledge is required

– Prior expert/domain knowledge might be required (for lexico-semantic patterns)

– Defining and maintaining patterns is a cumbersome and non-trivial task




• Examples:

– CAFETIERE (Black et al., 2005):

  • Employs extraction rules defined at lexico-semantic level

  • Makes use of gazetteering

  • Knowledge is stored using Narrative Knowledge Representation Language (NKRL)

  • Knowledge base lacks reasoning support

  • Focus on extracting relations from corpora

– Hermes (Frasincar et al., 2009):

  • Patterns defined at lexico-semantic level

  • Makes use of ontologies and reasoning engines

  • Knowledge is based on an OWL domain ontology

  • Focus on the use of pattern-based NLP in building personalized news services
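
To give a feel for gazetteer-based typing in a lexico-semantic pattern, the sketch below matches “[sub:company] announces collaboration with [obj:company]” against plain text. It is not code from CAFETIERE or Hermes; the gazetteer entries, the whitespace tokenization, and the function names are hypothetical.

```python
# Hypothetical gazetteer: a lookup list that assigns a simple type to names.
GAZETTEER = {
    "Acme": "company",
    "Globex": "company",
    "Rotterdam": "city",
}

TRIGGER = ["announces", "collaboration", "with"]


def annotate(sentence):
    """Pair each token with its gazetteer type (None if unlisted)."""
    return [(token, GAZETTEER.get(token)) for token in sentence.split()]


def match_collaboration(sentence):
    """Match '[sub:company] announces collaboration with [obj:company]'."""
    tokens = annotate(sentence)
    # Slide a five-token window: typed subject, three trigger words, typed object.
    for i in range(len(tokens) - 4):
        subj, *middle, obj = tokens[i:i + 5]
        if (subj[1] == "company" and obj[1] == "company"
                and [token for token, _ in middle] == TRIGGER):
            return {"sub": subj[0], "obj": obj[0]}
    return None


hit = match_collaboration("Acme announces collaboration with Globex")
miss = match_collaboration("Acme announces collaboration with Rotterdam")
```

The second sentence has the same surface form as the first but is rejected because "Rotterdam" is typed as a city. An ontology-based variant would additionally exploit relationships between types, which this sketch does not attempt.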



Hybrid NLP (1)

• Combine linguistic knowledge with statistical methods

• Usually, it appears to be difficult to stay within the boundaries of a single approach

• Thus, it is convenient to combine the best of both worlds:

– Bootstrapping lexical methods

– Solving lack of expert knowledge by applying statistical methods

– Statistical methods that use some present (lexical) knowledge



Hybrid NLP (2)

• Advantages:

– Solve problems related to scaling and required expert knowledge of pattern-based approaches

– Do not require as much data as statistical approaches

– Inherit some of the advantages of both statistical and pattern-based approaches

• Disadvantages:

– By combining different techniques, maintaining completeness and accuracy of the systems becomes more difficult

– Multidisciplinary aspects

– Inherit some of the disadvantages of both statistical and pattern-based approaches



Hybrid NLP (3)

• Examples:

– Corpus-Based Statistics-Oriented techniques (Su et al., 1996):

  • Mainly statistical learning techniques, guided by high-level linguistic constructs

  • Applications in POS tagging, semantic analysis of corpora, machine translation, annotation, etc.

  • Focus is on extracting inductive knowledge from corpora to support building large scale NLP systems

– PANKOW (Cimiano et al., 2004):

  • Generates instances of lexico-syntactic patterns indicating a certain semantic or ontological relation

  • Counts number of occurrences of patterns

  • Statistical distribution of instances of these patterns constitutes the collective knowledge

  • Focus is on supporting annotation
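
The generate-and-count idea described for PANKOW can be sketched as follows. This is a loose illustration and not the authors' implementation: the pattern templates, the naive "+s" pluralization, the in-memory toy corpus used for counting, and the function name `categorize` are all assumptions made to keep the example self-contained.

```python
from collections import Counter

# Hypothetical templates in the spirit of lexico-syntactic "is-a" patterns.
TEMPLATES = [
    "{concept}s such as {instance}",
    "{instance} is a {concept}",
    "{instance} and other {concept}s",
]


def categorize(instance, concepts, corpus):
    """Pick the concept whose pattern instances occur most often."""
    text = " ".join(corpus).lower()
    counts = Counter()
    for concept in concepts:
        for template in TEMPLATES:
            # Pattern knowledge generates the phrase ...
            phrase = template.format(concept=concept, instance=instance).lower()
            # ... and statistics decide how much evidence there is for it.
            counts[concept] += text.count(phrase)
    best, best_count = counts.most_common(1)[0]
    return (best if best_count > 0 else None), dict(counts)


# Hypothetical toy corpus.
toy_corpus = [
    "Towns such as Rotterdam attract many visitors.",
    "Rotterdam is a town with a large port.",
    "Rotterdam and other towns invest in public transport.",
    "The Rotterdam is a hotel ship moored in the harbour.",
]
label, evidence = categorize("Rotterdam", ["town", "river", "hotel"], toy_corpus)
```

The patterns contribute the linguistic knowledge and the counts contribute the statistics: "town" wins because three pattern instances support it, against one for "hotel" and none for "river".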



Conclusions

• Three main approaches to NLP:

– Statistics-based

– Pattern-based

– Hybrid

• Which techniques to use for your NLP tasks? There is no single best approach, but consider these rough guidelines:

– Evaluate your problem, preferences, and available resources

– If you are less concerned with semantics and you assume that knowledge lies within statistical facts on a specific corpus, use a statistics-based approach

– If you are concerned with the semantics of discovered information, or you want to be able to easily explain and control the results, use a pattern-based approach

– If you need to bootstrap a pattern-based approach using statistics (e.g., because insufficient knowledge is available) or the other way around (e.g., because a priori knowledge is needed), use a hybrid approach



References

• C. Bannard, T. Baldwin, and A. Lascarides. A statistical approach to the semantics of verb-particles. In ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 65-72. Association for Computational Linguistics, 2003.

• W. J. Black, J. McNaught, A. Vasilakopoulos, K. Zervanou, B. Theodoulidis, and F. Rinaldi. CAFETIERE: Conceptual Annotations for Facts, Events, Terms, Individual Entities, and Relations. Technical Report TR-U4.3.1, Department of Computation, UMIST, Manchester, 2005.

• P. Cimiano, S. Handschuh, and S. Staab. Towards the Self-Annotating Web. In 13th International Conference on World Wide Web (WWW 2004), pages 462-471. ACM, 2004.

• F. Frasincar, J. Borsje, and L. Levering. A Semantic Web-Based Approach for Building Personalized News Services. International Journal of E-Business Research, 5(3):35-53, 2009.

• K.-Y. Su, T.-H. Chiang, and J.-S. Chang. An Overview of Corpus-Based Statistics-Oriented (CBSO) Techniques for Natural Language Processing. Computational Linguistics and Chinese Language Processing, 1(1):101-157, 1996.

• R. K. Taira and S. G. Soderland. A statistical natural language processor for medical reports. In AMIA Symposium 1999, pages 970-974. American Medical Informatics Association, 1999.

