
Jordi Turmo, 2010 Adaptive Information Extraction

Summary

• Information Extraction Systems
• Multilinguality
  • Introduction
  • Language guessers
  • Machine Translators
  • Translingual architectures
  • Information integration in MIE systems
• Evaluation
• Adaptability


Introduction (Multilinguality)

• Multilingual IE (MIE) tasks: the textual information in the output templates must be presented in a language different from that of the input documents
• Typically:
  • input documents written in one language
  • output templates written in another one


Introduction (Multilinguality)

• Relatively little research in MIE
• LRE program in Europe
  • ECRAN, FACILE, AVENTINUS, SPARKLE, …
  • tools and components for IE in different languages
• TIDES program in USA
  • PROTEUS, RIPTIDES, CREST, …
  • fast machine translation and information access


Introduction (Multilinguality)

• Up to now, multilingual IE has been evaluated only for NE tasks. Two recent scenarios:
  • CoNLL 2002-2003: language-independent NE recognition
  • ACE 2007: Arabic input documents, English output NE mentions
• Fei Huang (2005). Multilingual NE Extraction and Translation from Text and Speech. PhD thesis.
• Open research line


Introduction (Multilinguality)

• Basic elements of MIE architectures:
  • language guessers
  • monolingual architectures
• Classical approaches:
  • use of Machine Translation with monolingual IE architectures
  • extension of monolingual architectures to translingual architectures


Language guessers (Multilinguality)

• Goal: identify the language of a document
• Linguistic approach:
  • based on a vocabulary of keywords for each language
  • idea: at least one word of a typical sentence written in a language should be included in the corresponding vocabulary
  • the vocabularies are built manually
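
The linguistic approach can be sketched roughly as follows. This is a minimal illustration, not the method of any particular system: the vocabularies, the language codes and the function name `guess_language` are all invented here, and real keyword lists would be much larger.

```python
# Hypothetical, hand-built keyword vocabularies (illustrative only).
VOCABULARIES = {
    "en": {"the", "and", "of", "is", "in"},
    "es": {"el", "la", "y", "de", "es"},
    "fr": {"le", "et", "les", "est", "des"},
}


def guess_language(text):
    """Return the language whose vocabulary covers most words, or None."""
    words = text.lower().split()
    # Count how many words of the text belong to each vocabulary.
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in VOCABULARIES.items()
    }
    best = max(scores, key=scores.get)
    # If no keyword at all was found, the language stays unknown.
    return best if scores[best] > 0 else None
```

A text containing none of the keywords is left unclassified, which is the weak point of a manually built vocabulary.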


Language guessers (Multilinguality)

• Stochastic approach:
  • most widely used
  • based on:
    • generating a frequency table of elements for each language
    • comparing the frequencies of the elements in the document with those in the table
    • elements = special characters, word sequences or character sequences (different approaches)
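
A minimal sketch of the stochastic approach, assuming character trigrams as the elements and a simple overlap score for the comparison. The training samples, the similarity measure and all function names are assumptions made for illustration; real frequency tables are built from large corpora.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Relative frequencies of the character n-grams in `text`."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams) or 1
    return {gram: count / total for gram, count in Counter(grams).items()}


def similarity(doc_profile, lang_profile):
    """Overlap of two profiles: shared frequency mass, n-gram by n-gram."""
    return sum(
        min(freq, lang_profile.get(gram, 0.0))
        for gram, freq in doc_profile.items()
    )


# Toy training samples (assumption: stand-ins for per-language corpora).
SAMPLES = {
    "en": "the president of the company will leave his job tomorrow night",
    "es": "el presidente de la empresa dejará su trabajo mañana por la noche",
}
# One frequency table per language.
TABLES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}


def guess_language(text):
    """Pick the language whose table best matches the document profile."""
    doc = ngram_profile(text)
    return max(TABLES, key=lambda lang: similarity(doc, TABLES[lang]))
```

With so little training text the tables are very sparse, so this only illustrates the mechanics of building and comparing frequency tables.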


Language guessers (Multilinguality)

• Stochastic approach:
  • Pros: good results (over 95% accuracy)
  • Cons: problematic for short texts ([Zhdanova, 02] copes with this problem)


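Machine translators (Multilinguality)

• A set of monolingual IE systems

[Figure: MIE architecture. Components: language guesser; monolingual IE systems IE(s1), IE(s2), ..., IE(sk); translators mt(s1,t), mt(s2,t), ..., mt(sk,t); output templates. Input in source language si, output in target language t.]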


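Machine translators (Multilinguality)

• Just one monolingual IE system

[Figure: MIE architecture. Components: language guesser; translators MT(s1,t’), MT(s2,t’), ..., MT(sk,t’); a single IE system IE(t’); translator mt(t’,t); output templates. Input in source language si, output in target language t.]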
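
The two MT-based architectures (a set of monolingual IE systems, or just one) can be sketched as function compositions. The data flow is read from the component names in the diagrams; the function names, the dictionary-based templates and the toy components are assumptions for illustration only.

```python
def mie_many_ie(doc, guess, ie_systems, mt, target):
    """One monolingual IE system per source language, then translation."""
    src = guess(doc)                         # language guesser
    template = ie_systems[src](doc)          # IE(s_i)
    return {slot: mt(value, src, target)     # mt(s_i, t)
            for slot, value in template.items()}


def mie_single_ie(doc, guess, ie_pivot, mt, pivot, target):
    """Translate the document to t', run a single IE system, translate to t."""
    src = guess(doc)                         # language guesser
    pivot_doc = mt(doc, src, pivot)          # MT(s_i, t')
    template = ie_pivot(pivot_doc)           # IE(t')
    return {slot: mt(value, pivot, target)   # mt(t', t)
            for slot, value in template.items()}


# Toy stand-ins (assumptions): real systems would plug in actual components.
def toy_guess(doc):
    return "es" if " el " in f" {doc.lower()} " else "en"


def toy_mt(text, src, tgt):
    # Marks the text instead of translating it.
    return text if src == tgt else f"[{src}->{tgt}] {text}"


def toy_ie(doc):
    # Pretends the last word of the document fills the "person" slot.
    return {"person": doc.split()[-1]}
```

In the first function only the short slot fillers are translated; in the second the whole document goes through MT before extraction, so translation errors can affect the extraction step.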


Translingual architectures (Multilinguality)

• Try to overcome the inefficiency of the MIE architectures based on MT
• Merging of IE and interlingua MT
  • Idea: when dealing with a particular domain, it is possible to build a language-independent conceptual model of the particular extraction scenario [Gaizauskas et al. 97]


Translingual architectures (Multilinguality)

• Each source language requires:
  • a different lexical preprocessor
  • a different syntactico-semantic parser
  • a different set of IE patterns (if the MIE system is based on pattern matching)
• Possible use of language-independent processors (e.g., NERC)


Translingual architectures (Multilinguality)

• Use of a language-independent ontology
  • The internal representation of the extracted information is language independent
• Use of soft techniques for NL generation
  • The output templates are generated using the lexicon of the target language
  • lexical choice problem!
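
A minimal sketch of the idea, assuming a tiny invented ontology: slot fillers are mapped to language-independent concept identifiers, and the output template is generated from the target-language lexicon. The concept names, lexicons and function names are hypothetical.

```python
# Hypothetical source lexicons: word -> language-independent concept.
SOURCE_LEXICON = {
    "es": {"presidente": "C_PRESIDENT", "empresa": "C_COMPANY"},
    "en": {"president": "C_PRESIDENT", "company": "C_COMPANY"},
}

# Hypothetical target lexicons: concept -> candidate words.
TARGET_LEXICON = {
    "en": {"C_PRESIDENT": ["president", "chairman"],
           "C_COMPANY": ["company", "firm"]},
    "es": {"C_PRESIDENT": ["presidente"],
           "C_COMPANY": ["empresa", "compañía"]},
}


def to_concepts(template, source_lang):
    """Map extracted slot fillers to language-independent concepts."""
    lexicon = SOURCE_LEXICON[source_lang]
    return {slot: lexicon[word] for slot, word in template.items()}


def generate(concepts, target_lang):
    """Fill the output template from the target-language lexicon.

    Taking the first candidate is a naive answer to the lexical
    choice problem; a real generator would use context to choose.
    """
    lexicon = TARGET_LEXICON[target_lang]
    return {slot: lexicon[concept][0] for slot, concept in concepts.items()}
```

Several target words compete for the same concept ("president" or "chairman"), which is where the lexical choice problem shows up.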


Translingual architectures (Multilinguality)

• M-LASIE system [Gaizauskas et al. 97]
  • Ad-hoc representation of the domain model
  • Lexicons mapped to concepts
  • Adding a new source language involves:
    • adding a new lexicon + mappings
    • adding a new tagger and parser
    • …


Translingual architectures (Multilinguality)

• M-TURBIO system [Turmo et al. 99]
  • EuroWordNet (EWN)
  • Sets of IE-patterns for each source language
  • Mappings from IE-patterns to ILIs (Inter-Lingual-Index records) in EWN
  • Adding a new source language involves:
    • adding new IE-patterns
    • adding a new tagger and parser
    • …


Information Integration in MIEs (Multilinguality)

• The most general architecture
  • Input documents in different source languages, not aligned
  • Output templates in different target languages
• Possible approaches:
  • MIE system + II system
  • MIE/II system


Information Integration in MIEs (Multilinguality)

• Pros:
  • Versatile
  • An instance may occur in just one document, written in a specific language
  • An instance can be easier to extract when expressed in one language than in another
    • better processors or resources
• Cons:
  • Problems inherent to II
    • inconsistent values, similar values, generalizations, …
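
To illustrate the kind of problem inherent to II, here is a minimal sketch that merges templates extracted from different documents about the same instance and reports slots with inconsistent values. The dictionary representation of templates and the function name `integrate` are assumptions; a real II system would also need to recognise similar values and generalizations.

```python
def integrate(templates):
    """Merge slot values from several templates about the same instance.

    Returns (merged, conflicts): `merged` keeps the first value seen for
    each slot, `conflicts` collects every slot with differing values.
    """
    merged, conflicts = {}, {}
    for template in templates:
        for slot, value in template.items():
            if slot not in merged:
                merged[slot] = value
            elif merged[slot] != value:
                # Remember all competing values for this slot.
                conflicts.setdefault(slot, {merged[slot]}).add(value)
    return merged, conflicts
```

Note that "J. Smith" and "John Smith" would be reported as a conflict here, although they are probably similar values for the same person; telling these cases apart is exactly the hard part.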


Summary

• Information Extraction Systems
• Multilinguality
• Evaluation
  • Introduction
  • Metrics
  • Data sets
• Adaptability


Introduction (Evaluation)

• The evaluation of the performance of an IE system depends on different factors:
  • The IE task: domain, language, document style, …
  • The user needs: software use, human use, just some clues about the relevant facts, the context in which they occur, …
• What does "correctly extracted" mean?
• What are the right metrics?
• What are the best data sets?


Introduction (Evaluation)

[Figure: the sentence "The president of ALP in Spain will leave his job tomorrow night" shown with alternative NP bracketings (two NPs vs. one NP), asking which of them counts as an exact extraction.]
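
The difference between exact and partial extraction can be made concrete with token spans. In this sketch the reference and system spans are assumptions chosen for illustration; the slide does not say which NP is the gold one.

```python
def exact_match(sys_span, ref_span):
    """Exact extraction: both boundaries coincide."""
    return sys_span == ref_span


def partial_match(sys_span, ref_span):
    """Partial extraction: the (start, end) token spans overlap.

    Spans are half-open, i.e. `end` is exclusive.
    """
    return sys_span[0] < ref_span[1] and ref_span[0] < sys_span[1]


tokens = "The president of ALP in Spain will leave his job tomorrow night".split()
reference = (0, 6)  # "The president of ALP in Spain" (assumed gold NP)
system = (0, 4)     # "The president of ALP" (assumed system output)
```

Under the exact criterion the system output above is wrong; under the partial criterion it is counted as correct.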


Metrics (Evaluation)

• Different evaluation frameworks take different views of what is correctly extracted:
  • MUC:
    • correct = partial extraction (up to MUC-5)
    • correct = exact extraction (MUC-6, MUC-7)
    • Recall, Precision and F (cf. Historical Framework)
  • PASCAL:
    • correct = exact extraction
    • Same metrics as in MUC-6
  • ACE:
    • correct = partial extraction (more sophisticated than MUC)
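
As a reminder of the metrics, a minimal sketch of Recall, Precision and F under the exact-extraction criterion, using the standard textbook definitions. Representing extractions as sets of hashable items (here token spans) and the function name are assumptions; the official scorers handle further cases.

```python
def precision_recall_f(system, reference, beta=1.0):
    """Precision, recall and F-measure under exact matching."""
    system, reference = set(system), set(reference)
    correct = len(system & reference)  # items extracted exactly
    precision = correct / len(system) if system else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f
```

With `beta=1.0` precision and recall are weighted equally; larger values of `beta` favour recall.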


Metrics (Evaluation)

ACE metric

Idea: how well does the information extracted by a system match that of the reference model?

• Given a system output, s, and a reference model, m, find the mapping between instances in s and instances in m that globally maximizes the function Value(s,m)


Metrics (Evaluation)

ACE metric

$$\mathrm{Value}(s,m) = \frac{\sum_i \mathrm{Value}(\mathit{sys\_token}_i)}{\sum_j \mathrm{Value}(\mathit{ref\_token}_j)}$$

token = instance extracted = [attributes, args or mentions]

$$\mathrm{Value}(\mathit{token}) = \mathrm{Element\_value}(\mathit{token}) \cdot \mathrm{Argument\_value}(\mathit{token})$$

• Penalties: unmapped attributes, unmapped arguments, wrong mappings
• Parameters: weights for penalties
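
A simplified numeric sketch of the formula above. It assumes the mapping between system and reference tokens has already been found and that the penalties are already reflected in the element and argument values; the numbers and function names are invented for illustration and this is not the official ACE scorer.

```python
def token_value(element_value, argument_value):
    """Value(token) = Element_value(token) * Argument_value(token)."""
    return element_value * argument_value


def ace_value(sys_tokens, ref_tokens):
    """Sum of system token values over sum of reference token values.

    Each token is given as an (element_value, argument_value) pair.
    """
    numerator = sum(token_value(e, a) for e, a in sys_tokens)
    denominator = sum(token_value(e, a) for e, a in ref_tokens)
    return numerator / denominator


# Hypothetical values: two system tokens, each penalised in one factor.
sys_tokens = [(1.0, 0.5), (0.5, 1.0)]
# Reference tokens get full value.
ref_tokens = [(1.0, 1.0), (1.0, 1.0)]
```

Here the system obtains half of the maximum value, since each of its two tokens is worth 0.5 against a reference value of 1.0 each.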


Metrics (Evaluation)

ACE metric

• Software for ACE evaluation and more information on ACE evaluation are available at

http://www.nist.gov/speech/tests/ace


Data sets (Evaluation)

• Ad-hoc
• State of the art (e.g., from MUC, ACE, PASCAL)

Each one is appropriate for evaluating different IE tasks, depending on different factors:

• Availability?
• Suitability?


Data sets: MUC (Evaluation)

• Sources:
  • Written free text (Newswire)
• MUC-6 and MUC-7 data sets
• Suitable tasks:
  • NE subtasks
  • Element Extraction tasks (template element, TE)
  • Event Extraction tasks (scenario template, ST)
  • Relation Extraction tasks are quite easy
• Language: English
• Available from LDC (Linguistic Data Consortium)
  • http://www.ldc.upenn.edu


Data sets: ACE (Evaluation)

• Sources:
  • Written free text (Newswires, Weblogs, Discussion Forums)
  • Free-text oral transcripts (Broadcast News, Telephone conversations)
• Suitable tasks (up to now):
  • NE subtasks (extended from MUC)
  • Relation Extraction tasks
  • Event Extraction tasks need more annotation efforts
• Language: English, Arabic, Chinese, Spanish, depending on the input source
• Available from LDC (Linguistic Data Consortium)
  • http://www.ldc.upenn.edu


Data sets: PASCAL (Evaluation)

• Sources:
  • Semi-structured documents (Seminar announcements, Corporate acquisitions, Legal sentences)
• Suitable tasks (up to now):
  • Element Extraction tasks
• Language: English, Italian
• Available from
  • http://nlp.shef.ac.uk/dot.kom/resources.html
• Similar sources in the RISE repository
  • http://www.isi.edu/info-agents/RISE/index.html