AQUAINT Phase II Six Month Workshop – October 2004

Fusing Rich Information Extracted from Multiple Media and Languages

to Generate Contextualized, Complex Answers

Vasileios Hatzivassiloglou, Kathleen R. McKeown, Dan Jurafsky, Wayne H. Ward, James H. Martin

Columbia University

Stanford University

University of Colorado at Boulder

University of Texas at Dallas


Phase II Vision

• Provide long, detailed, and complex answers

• Handle question types other than factual questions

• Develop a unified, extensible framework for treating such questions


Research Goals

• Develop new unified strategy for generating and piecing together complex answers

• Shallow semantic analysis annotates answer fragments, allowing answer filtering, comparison, and composition

• Extend analysis to multiple languages, media, and linked questions


Semantic Analysis

• Multiple levels

• Top level provides appropriate fillers for slots dependent on the question type

– Events (who? when? where? completed? conditional?)

– Opinions (target, holder, group, actual opinion predicate, time frame, polarity, strength)

– Definitions

– Biographies


Semantic Analysis Support

• Bottom level annotates text with general features that can be used to determine the higher level features

– Semantic roles (from semantic parser)

– Time expressions

– Lexical polarity and semantic strength values


Maximum Coverage of Information

• A new approach for formalizing the problem of information selection

• Input:

– Set of text units (e.g., sentences) that are potentially relevant to the answer

– Set of concepts that are desirable in the answer (e.g., representations of related events)

– Matrix showing which text unit covers which concepts

– Information weights assigned to each concept

– Costs assigned to each text unit


Example

• I(T1) = I(T2 & T3)

      C1   C2   C3   C4   C5
T1     1    1    0    1    1
T2     1    0    0    1    0
T3     1    0    1    1    1
T4     0    1    0    0    1
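As a check on the example, the short sketch below encodes the matrix and compares I(T1) with I(T2 & T3). It assumes that rows are text units and columns are concepts, and that every concept has weight 1; the slide states neither. The names `MATRIX`, `WEIGHTS`, `covered_concepts`, and `information` are illustrative only.

```python
# Assumption: rows are text units T1..T4, columns are concepts C1..C5.
CONCEPTS = ["C1", "C2", "C3", "C4", "C5"]
MATRIX = {
    "T1": [1, 1, 0, 1, 1],
    "T2": [1, 0, 0, 1, 0],
    "T3": [1, 0, 1, 1, 1],
    "T4": [0, 1, 0, 0, 1],
}
# Assumption: unit weights; the slide does not give concept weights.
WEIGHTS = {c: 1.0 for c in CONCEPTS}


def covered_concepts(units):
    """Return the set of concepts covered by at least one of the given text units."""
    covered = set()
    for unit in units:
        for concept, flag in zip(CONCEPTS, MATRIX[unit]):
            if flag:
                covered.add(concept)
    return covered


def information(units):
    """Total weight of the distinct concepts covered by the given text units."""
    return sum(WEIGHTS[c] for c in covered_concepts(units))


print(information(["T1"]))        # weight covered by T1 alone
print(information(["T2", "T3"]))  # weight covered by T2 and T3 together
```

Under these assumptions both calls give 4.0: T1 covers C1, C2, C4, C5, while T2 and T3 together cover C1, C3, C4, C5, so the totals match even though the covered sets differ.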


Benefits of the approach

• Formalization allows decoupling of the features (concepts) from the information selection algorithm

• Problem translates to well-known complexity theory problem (maximum set cover)

• Proof that under this model, this part of Q&A is NP-hard


But there is a silver lining…

• Efficient and effective greedy algorithm for Maximum Set Cover can be applied here

• Solution guaranteed to cover at least (1-1/e) ≈ 63% of the information in the ideal solution

• Evaluation over DUC data showed that this approach addresses redundancy effectively (see Filatova & Hatzivassiloglou, Coling 04)
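A minimal sketch of a greedy selection step of this kind is given below. It is not the authors' implementation: it assumes a simple rule (repeatedly take the affordable text unit with the largest newly covered weight per unit cost), and it reuses the example matrix with unit weights, unit costs, and a budget of 2, all of which are assumptions made for illustration. The classical (1-1/e) bound is for the unit-cost setting; the sketch makes no claim about arbitrary costs.

```python
from typing import Dict, List, Set


def greedy_max_coverage(
    covers: Dict[str, Set[str]],
    weights: Dict[str, float],
    costs: Dict[str, float],
    budget: float,
) -> List[str]:
    """Greedily pick text units by newly covered concept weight per unit cost."""
    selected: List[str] = []
    covered: Set[str] = set()
    spent = 0.0
    remaining = set(covers)
    while remaining:
        best, best_ratio = None, 0.0
        for unit in sorted(remaining):  # sorted for deterministic tie-breaking
            if spent + costs[unit] > budget:
                continue  # unit does not fit in the remaining budget
            gain = sum(weights[c] for c in covers[unit] - covered)
            ratio = gain / costs[unit]  # costs are assumed to be positive
            if ratio > best_ratio:
                best, best_ratio = unit, ratio
        if best is None:  # nothing affordable adds new information
            break
        selected.append(best)
        covered |= covers[best]
        spent += costs[best]
        remaining.remove(best)
    return selected


# Example matrix from the earlier slide, read as text unit -> covered concepts.
COVERS = {
    "T1": {"C1", "C2", "C4", "C5"},
    "T2": {"C1", "C4"},
    "T3": {"C1", "C3", "C4", "C5"},
    "T4": {"C2", "C5"},
}
WEIGHTS = {c: 1.0 for c in ["C1", "C2", "C3", "C4", "C5"]}  # assumed unit weights
COSTS = {t: 1.0 for t in COVERS}                            # assumed unit costs

selected = greedy_max_coverage(COVERS, WEIGHTS, COSTS, budget=2)
print(selected)
```

With these toy inputs the sketch picks T1 first (four new concepts, tie with T3 broken alphabetically) and then T3, which adds the one remaining concept C3; T2 and T4 would add nothing new, which is how redundancy is avoided.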


Definitional Questions

• Approach: Combine data-driven and knowledge-based methods

• The latter anticipate what “should” be in the definition (e.g., “X is a kind of Y”)

• System improvements

– Doubled predicate pattern coverage in 2004

– Increased system robustness

– Included rewriting of pronominal references


Learning Definitional Predicates

• Before, we used hand-annotated examples

• Now, we

– bootstrap from a few known patterns (X caused Y) signaling a given relationship to

– find many pairs for this relationship (attack/explosion, speeding/ticket)

– use statistical data to find new such relationships without the patterns
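The first two steps above can be illustrated with a toy sketch: a single seed pattern for "X caused Y" is run over a few sentences to harvest (X, Y) pairs. The regular expression, the sentences, and the name `harvest_pairs` are all invented for illustration and are not the project's patterns or data; the third step (finding new instances statistically, without the patterns) is not shown.

```python
import re

# Hypothetical seed pattern for the "X caused Y" relationship.
# Optional determiners (the, a, an) before X and Y are skipped.
SEED = re.compile(
    r"\b(?:the\s+|an?\s+)?(\w+)\s+caused\s+(?:the\s+|an?\s+)?(\w+)",
    re.IGNORECASE,
)

# Toy sentences invented for illustration; they are not from the project's data.
SENTENCES = [
    "The attack caused the explosion.",
    "Speeding caused a ticket.",
    "The committee met on Tuesday.",
]


def harvest_pairs(sentences):
    """Collect lowercased (X, Y) pairs matched by the seed pattern."""
    pairs = set()
    for sentence in sentences:
        for x, y in SEED.findall(sentence):
            pairs.add((x.lower(), y.lower()))
    return pairs


print(sorted(harvest_pairs(SENTENCES)))
```

A single-word capture like this is deliberately crude; a real system would work over phrases rather than single tokens.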


Extracting Definitions

• First place in “question-based” DUC 2004 definitions among 22 teams

Who is Sonia Gandhi?

Congress President Sonia Gandhi, who married into what was once India’s most powerful political family, is the first non-Indian since independence 50 years ago to lead the Congress. After Prime Minister Rajiv Gandhi was assassinated in 1991, Gandhi was persuaded by the Congress to succeed her husband to continue leading the party as the chief, but she refused. The BJP had shrugged off the influence of the 51-year-old Sonia Gandhi when she stepped into politics early this year, dismissing her as a “foreigner.” Sonia Gandhi is now an Indian citizen. Gandhi, who is 51, met her husband when she was an 18-year old student at Cambridge in London, the first time she was away from her native Italy.


New Work in Opinions

• Localize opinion to a specific predicate; add time and opinion holder attributes

• Use WordNet hypernym/hyponym relationships to propagate positive/negative polarity values at the word level

• Calculate measure of semantic strength

• Participated in recent opinion pilot


New Work in Events

• Tested event model (participants + connecting verb) as a possible set of information concepts

• Significant improvement over a word-based approach (tf*idf)

• Use clusters of related events to learn automatically which relationships are random and which are typical of an event type


Fusing Rich Information Extracted from Multiple Media and Languages to Generate Contextualized, Complex Answers

Project Status

Wayne Ward, James H. Martin, Kadri Hacioglu, Sameer Pradhan, Steven Bethard, Ying Chen, Benjamin Douglas

University of Colorado

Dan Jurafsky

Stanford University


Initial Focus

• Semantic Role Structure for QA

– Approaches complementary to Columbia

• Specific Work On

– Opinions

– Time Expressions

– Events

• Multi-Lingual Work

– English, Chinese, Arabic tools


Thematic Parse Accuracy

PropBank Data

           ID           Class   Combined
Gold       96 (97,96)   93      91 (91,90)
Charniak   87 (92,82)   92      81 (86,76)

TREC Data

           ID           Class   Combined
Charniak   73 (76,71)   84      63 (65,61)


Alternate Algorithms

• Dependency tree based

– Potentially more robust because of simpler path structures

– Different “view” from Minipar, based on rules not trained on TreeBank

• Chunking

– SVM chunks syntactic base phrases

– Second SVM classifies chunks with semantic roles


Semantic Parsing in Chinese

• Syntactic parser

– SVM POS tagger

– Retrained Collins parser

– Chinese Treebank 2.0

– Performance: P/R = 78.9/76.4

• Semantic parser

– PropBank Tags

– Features: Syntactic Path, Target, Phrasal Category

– Data: 1023 sentences as training set, 113 sentences test set

– Performance: P/R = 81.6/67.1


Opinion/Opinion_Holder

• Joint work with Columbia

• Opinion ID as supervised Machine Learning

• Answer “How does X feel about Y”

• Propositional opinions (prop arg of verb)

• Same SVM framework as general semantic tagger

• Annotated FrameNet and PropBank sentences

If [OH she] hadn’t known [O that he liked nothing about her] she might have mistaken that note in his voice for admiration


Opinion/Opinion_Holder

• Two different SVM architectures for Opinion

– Single classifier walk constituent tree (CxC)

– 2 stage: find propositions then classify op/non-op (PxP)

• Opinion and Opinion_Holder

               P    R    F
Opinion CxC   58   51   54
Opinion PxP   68   44   53
O/OH CxC      57   48   52
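As a consistency check on the table, the sketch below recomputes F from P and R. It assumes the F column is the balanced F-measure (harmonic mean of precision and recall), which the slide does not state; `f_measure` and `ROWS` are illustrative names.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) for each row of the table.
ROWS = {
    "Opinion CxC": (58, 51, 54),
    "Opinion PxP": (68, 44, 53),
    "O/OH CxC": (57, 48, 52),
}

for name, (p, r, reported_f) in ROWS.items():
    # Print the recomputed value next to the value reported on the slide.
    print(name, round(f_measure(p, r)), reported_f)
```

Under that assumption the recomputed values round to the reported ones for all three rows, which also shows the trade-off between the two architectures: PxP gains precision but loses enough recall that F ends up slightly lower than CxC.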


Time Expressions

• Recognize time expressions in English and Chinese

• SVM chunking and tagging problem

• Language independent representation

• Participated in TERN evaluation

That’s 30 percent more than [the same period [a year ago.]]


Time Expressions

           P    R    F
English   97   89   93
          91   83   87
Chinese   97   85   91
          84   74   79


Event Detection

• Train and test on TimeBank corpus

• Determine phrases describing events

• Chunk EVENT expressions in TimeBank

• Label with attribute

– REPORTING, PERCEPTION, ASPECTUAL, I_ACTION, I_STATE, STATE, OCCURRENCE


Arabic Work

• SVM based NLP tools for Arabic

• Tokenizer

• Part-Of-Speech tagger

• Syntactic base phrase chunker

• Trained on Arabic TreeBank


Arabic Work

               Acc    P    R    F
Tokenization     -   99   99   99
POS             96    -    -    -
Base Phrase      -   92   92   92


Next 18 months

• Complete opinion work

• Much more focus on events

• Processing audio documents

– Produce word lattice with ASR

– Use chunking tagger to parse word lattice

• Dialog

– Decomposition

– Clarification

– Follow-up


Thematic Role Tagging

• Assigning semantic labels to sentence elements.

• Elements are arguments of some predicate or participants in some event.

• [DATE In 1901] [PATIENT President William McKinley] was [PREDICATE shot] [AGENT by anarchist Leon Czolgosz] [LOCATION at the Pan-American Exposition]


Use of thematic tagging in QA

• Generating novel answers involving

– Opinions (believe, confirm, deny, negate)

– Events (Activities with a starting and ending point involving fixed participants)

– Causal questions

• Query: What effect does a prism have on light?

• Thematic Tagging: [RESULT What effect] does [CAUSE a prism] have on [THEME light]?

• Now search for a RESULT that has ‘prism’ as a CAUSE.
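The search step can be sketched as follows, assuming the tagger's output is available in the bracket notation used on the slides. The parsing regex, the candidate sentences, and the last-token rule for picking the cause word are all invented for illustration; this is not the project's retrieval component.

```python
import re

# Matches one flat "[LABEL text]" annotation; nested brackets are not handled.
ROLE = re.compile(r"\[([A-Z_]+)\s+([^\]]+)\]")


def parse_roles(tagged):
    """Map each role label in a bracket-annotated sentence to its text span."""
    return {label: span.strip() for label, span in ROLE.findall(tagged)}


def find_results(tagged_sentences, cause_word):
    """Return RESULT spans from sentences whose CAUSE span mentions cause_word."""
    hits = []
    for tagged in tagged_sentences:
        roles = parse_roles(tagged)
        cause_tokens = roles.get("CAUSE", "").lower().split()
        if cause_word.lower() in cause_tokens and "RESULT" in roles:
            hits.append(roles["RESULT"])
    return hits


QUERY = "[RESULT What effect] does [CAUSE a prism] have on [THEME light]?"

# Hypothetical tagged candidate sentences, invented for illustration.
CANDIDATES = [
    "[CAUSE A prism] [PREDICATE splits] [THEME white light] [RESULT into a spectrum of colors]",
    "[AGENT The student] [PREDICATE bought] [THEME a prism]",
]

query_roles = parse_roles(QUERY)
cause_word = query_roles["CAUSE"].split()[-1]  # crude head-word guess: last token
print(find_results(CANDIDATES, cause_word))
```

The second candidate mentions a prism but only as a THEME, so it is skipped; matching on roles rather than on words is what separates this from keyword retrieval.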