
Answer Extraction
Ling573 NLP Systems and Applications
May 19, 2011

Roadmap:
Noisy-channel question answering
Answer selection by reranking
Redundancy-based answer selection

Noisy Channel QA
Employed for speech recognition, POS tagging, MT, summarization, etc.
Intuition: the question is a noisy representation of the answer

Basic approach: given a corpus of (Q, S_A) pairs
Train P(Q | S_A)
Find the sentence S_i and answer A_ij that maximize P(Q | S_i, A_ij)
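
In symbols, the selection criterion above is (a restatement of the bullet, not quoted from the slides):

```latex
(\hat{S}, \hat{A}) \;=\; \operatorname*{arg\,max}_{S_i,\, A_{ij}} \; P(Q \mid S_i, A_{ij})
```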

QA Noisy Channel
A: Presley died of heart disease at Graceland in 1977, and ...
Q: When did Elvis Presley die?
Goal: align parts of the answer's parse tree to the question
Mark candidate answers; find the highest-probability answer

Approach
Alignment issue: answer sentences are longer than questions
Minimize the length gap: represent the answer as a mix of word / syntactic / semantic / NE units
Create a 'cut' through the parse tree:
Every word, or one of its ancestors, is in the cut
Only one element on the path from the root to each word
Example:
A: Presley died of heart disease at Graceland in 1977, and ...
Cut: Presley died PP PP in DATE, and ...
Q: When did Elvis Presley die?

Approach (Cont'd)
Assign one element of the cut to be the 'Answer'
Issue: the cut may still not be the same length as Q
Solution (typical MT): assign each element a fertility
0: delete the element; > 1: repeat it that many times
Replace answer-side words with question words based on the alignment
Permute the result to match the original question
Everything except the cut is computed with off-the-shelf MT code
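
Since the slides describe the channel as "typical MT" (fertility, word replacement, permutation) computed with off-the-shelf MT code, the scoring presumably follows the standard IBM-style decomposition; a hedged sketch, not the paper's exact formula:

```latex
P(Q \mid \mathrm{cut}) \;=\; \sum_{a} \; \prod_{i} n(\phi_i \mid c_i) \;\cdot\; \prod_{j} t(q_j \mid c_{a_j}) \;\cdot\; d(a)
```

Here n is the fertility model (how many question words each cut element c_i generates), t the lexical translation model, and d the distortion model over alignments a.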

Schematic: assume all cuts and answer guesses are equally likely

Training Sample Generation
Given question and answer sentences:
Parse the answer sentence
Create a cut such that:
Words appearing in both Q and A are preserved
The answer is reduced to an 'A_' syntactic/semantic class label
Nodes with no surface children are reduced to their syntactic class
All other nodes keep their surface form
Training data: 20K TREC QA pairs; 6.5K web question pairs
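
A minimal sketch of cut construction, assuming an NLTK-style constituency parse; the collapse conditions and the helper arguments question_words and answer_span are illustrative rather than the system's actual rules, and real matching would also use stemming:

```python
from nltk import Tree  # assumes an NLTK-style constituency parse

def make_cut(node, question_words, answer_span):
    """Return one cut element per root-to-leaf path of `node`."""
    # Bare word (leaf of the parse): keep its surface form.
    if not isinstance(node, Tree):
        return [node]
    leaves = node.leaves()
    # The constituent that exactly spans the answer becomes 'A_<label>'.
    if leaves == list(answer_span):
        return ["A_" + node.label()]
    keeps_q_word = any(w.lower() in question_words for w in leaves)
    contains_answer = all(w in leaves for w in answer_span)
    # Subtrees with no question words and no answer collapse to their class.
    if not keeps_q_word and not contains_answer:
        return [node.label()]
    # Otherwise descend into the children.
    return [piece for child in node
            for piece in make_cut(child, question_words, answer_span)]
```

Every word is covered by exactly one cut element (itself, or the single ancestor at which the recursion stopped), which is the property the slides require of a cut.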

Selecting Answers
For any candidate answer sentence:
Apply the same cut process
Generate all candidate answer nodes: the syntactic/semantic nodes in the tree
What makes a bad candidate answer? Stopwords and question words!
Create a cut with each answer candidate annotated; select the one the model assigns the highest probability (sketched below)
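
A compact sketch of this selection loop; candidate generation, cut construction, and the trained channel scorer are passed in as callables whose names and signatures are illustrative:

```python
import math

def select_answer(question, candidate_cuts, channel_logprob):
    """Pick the best answer under the noisy-channel model.

    candidate_cuts: iterable of (answer_node, cut) pairs, one per
        syntactic/semantic node of each retrieved sentence, with
        stopword and question-word candidates already removed.
    channel_logprob: callable returning log P(Q | cut) from the
        trained alignment model.
    """
    best_node, best_score = None, -math.inf
    for node, cut in candidate_cuts:
        score = channel_logprob(question, cut)
        if score > best_score:
            best_node, best_score = node, score
    return best_node
```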

Example Answer Cuts
Q: When did Elvis Presley die?
SA1: Presley died A_PP PP PP, and ...
SA2: Presley died PP A_PP PP, and ...
SA3: Presley died PP PP in A_DATE, and ...

Results: MRR 24.8%; correct answer in the top 5: 31.2%

Error Analysis
Component-specific errors:
Patterns: some question types work better with patterns, typically specific NE categories (NAM, LOC, ORG, ...); bad if 'vague'
Stats-based: no restriction on answer type, so the answer is frequently 'it'
Patterns and stats: 'blatant' errors
Both select 'bad' strings (especially pronouns) if they fit the position/pattern

Combining Units
Linear sum of weights? Problematic: it misses the components' different strengths and weaknesses
Learning! (of course): maximum-entropy (maxent) re-ranking with a linear model

Feature Functions
48 in total
Component-specific: scores and ranks from the different modules (patterns, stats, IR, even Q/A word overlap)
Redundancy-specific: number of times a candidate answer appears (log, sqrt)
Qtype-specific: some components are better for certain question types (type+mod features)
Blatant 'errors': no pronouns; for 'when' questions, not a day of the week (DoW)
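
A minimal sketch of maxent re-ranking over such feature vectors, using scikit-learn's LogisticRegression as the maximum-entropy learner; the feature layout is illustrative (the actual system used 48 hand-built features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each candidate answer is a feature vector, e.g.
# [pattern_score, stats_score, ir_rank, qa_overlap, log_redundancy, ...]
# labelled 1 if the candidate is a correct answer, else 0.

def train_reranker(feature_vectors, labels):
    # Binary logistic regression == maximum-entropy classifier.
    model = LogisticRegression(max_iter=1000)
    model.fit(np.asarray(feature_vectors), np.asarray(labels))
    return model

def rerank_candidates(model, candidates, feature_vectors):
    # Order candidates by the model's probability of correctness.
    probs = model.predict_proba(np.asarray(feature_vectors))[:, 1]
    ranked = sorted(zip(candidates, probs), key=lambda cp: cp[1], reverse=True)
    return [cand for cand, _ in ranked]
```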

Experiments
Per-module reranking: use the redundancy, qtype, and blatant-error features plus the features from that module
Combined reranking: all features (after feature selection, down to 31)
Patterns: exact answer in top 5: 35.6% -> 43.1%
Stats: exact answer in top 5: 31.2% -> 41%
Manual/knowledge-based system: 57%

Redundancy-based QA
AskMSR (2001, 2002); Aranea (Lin, 2007)

Redundancy-based QA
Systems exploit statistical regularity to find 'easy' answers to factoid questions on the Web

When did Alaska become a state?
(1) Alaska became a state on January 3, 1959.
(2) Alaska was admitted to the Union on January 3, 1959.

Who killed Abraham Lincoln?
(1) John Wilkes Booth killed Abraham Lincoln.
(2) John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln's life.

A fixed text collection may only contain (2), but the Web has almost any phrasing

Redundancy & Answers
How does redundancy help find answers?
Typical approach: answer type matching (e.g. NER), but this relies on large knowledge bases
Redundancy approach: the answer should be highly correlated with the query terms and present in many passages
Uses n-gram generation and processing
In 'easy' passages, simple string matching is effective

Redundancy Approaches
AskMSR (2001): lenient: 0.43, rank 6/36; strict: 0.35, rank 9/36
Aranea (2002, 2003): lenient: 45%, rank 5; strict: 30%, rank 6-8
Concordia (2007): strict: 25%, rank 5
Many systems incorporate some redundancy: answer validation, answer reranking
LCC: a huge knowledge-based system, yet redundancy still improved it

Redundancy-based Answer Extraction
Prior processing: question formulation (class 6), web search, retrieval of the top 100 snippets
N-gram stages: generation, voting, filtering, combining, scoring, reranking
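
A skeleton of how those stages compose; each stage is passed in as a callable (sketches of the individual stages follow the corresponding slides below), so the names here are placeholders rather than Aranea's actual API:

```python
def extract_answer(question, qtype, snippets, stages):
    """Chain the n-gram stages listed on the slide.

    `stages` maps stage names to callables; snippet retrieval and
    query formulation are assumed to happen upstream.
    """
    scores = stages["vote"](snippets)                     # generation & voting
    scores = {c: s for c, s in scores.items()
              if stages["filter"](c, question, qtype)}    # filtering
    scores = stages["combine"](scores)                    # unigram-sum combining
    scores = stages["score"](scores)                      # IDF-based rescoring
    return stages["rerank"](question, snippets, scores)   # final reranking
```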

N-gram Generation & Voting
N-gram generation from unique snippets: approximate chunking without syntax; all uni-, bi-, tri-, and tetra-grams
(Concordia added 5-grams after analyzing prior errors)
Score based on the source query: exact-phrase queries count 5x, others 1x
N-gram voting: collate n-grams; each n-gram receives the sum of the scores of its occurrences
What would be ranked highest? Specific, frequent strings: question terms and stopwords
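
A small sketch of generation and voting as described above; tokenization and the exact/inexact query bookkeeping are assumed to happen upstream:

```python
from collections import Counter

def generate_ngrams(tokens, max_n=4):
    """All uni- through tetra-grams of one snippet: token windows only,
    no syntax ('approximate chunking')."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def vote(snippets, max_n=4):
    """Collate n-grams across snippets; every occurrence adds the score
    of its source snippet (5 for exact-phrase queries, 1 otherwise).

    `snippets` is a list of (token_list, weight) pairs.
    """
    scores = Counter()
    for tokens, weight in snippets:
        for ngram in generate_ngrams(tokens, max_n):
            scores[ngram] += weight
    return scores
```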

N-gram Filtering
Throws out 'blatant' errors
Conservative or aggressive? Conservative: a filtering error cannot be recovered later
Question-type-neutral filters:
Exclude n-grams that begin or end with a stopword
Exclude n-grams that contain words from the question, except 'focus words' (e.g. units)
Question-type-specific filters:
'how far', 'how fast': exclude if there is no numeric token
'who', 'where': exclude if not NE-like (first and last words capitalized)

N-gram Filtering
Closed-class filters:
Exclude candidates that are not members of an enumerable list
E.g. 'what year' -> the answer must be an acceptable calendar year
Example after filtering: Who was the first person to run a sub-four-minute mile?
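
Sketches of the filters described above; the stopword list, question-type labels, and year range are illustrative choices, not the system's actual resources:

```python
import re

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "was", "did", "is"}

def passes_type_neutral(ngram, question_words, focus_words=frozenset()):
    """Drop n-grams that begin/end with a stopword or repeat question
    words (focus words such as measurement units are exempt)."""
    if ngram[0].lower() in STOPWORDS or ngram[-1].lower() in STOPWORDS:
        return False
    return not any(w.lower() in question_words and w.lower() not in focus_words
                   for w in ngram)

def passes_type_specific(ngram, qtype):
    """Two of the type-specific filters; `qtype` is a coarse label
    assumed to come from upstream question classification."""
    if qtype in ("how far", "how fast"):
        return any(re.search(r"\d", w) for w in ngram)   # need something numeric
    if qtype in ("who", "where"):
        return ngram[0][:1].isupper() and ngram[-1][:1].isupper()  # crude NE check
    return True

def passes_closed_class(ngram, qtype):
    """Closed-class example: 'what year' answers must look like a
    plausible calendar year (the range here is an arbitrary choice)."""
    if qtype == "what year":
        return (len(ngram) == 1 and ngram[0].isdigit()
                and 1000 <= int(ngram[0]) <= 2100)
    return True
```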

N-gram Filtering: impact of the different filters
Highly significant differences when run with subsets of the filters:
No filters: performance drops 70%
Type-neutral only: drops 15%
Type-neutral and type-specific: drops 5%

N-gram Combining
Does the current scoring favor longer or shorter spans?
E.g. 'Roger' vs. 'Bannister' vs. 'Roger Bannister' vs. 'Mr. ...'
'Bannister' probably scores highest: it occurs everywhere 'Roger Bannister' does, plus more
But good answers are generally longer (up to a point)
Update the score: S_c += Σ S_t, where t ranges over the unigrams in candidate c
Possible issue: bad units such as 'Roger Bannister was' (blocked by the filters)
Also, the update only increments scores, so long bad spans still end up lower
Improves performance significantly
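
A direct reading of that update as code, operating on the voting scores from the previous stage:

```python
def combine(scores):
    """S_c += sum of S_t over every unigram t in candidate c.

    `scores` maps n-gram tuples to voting scores; returns a new dict
    with the combined scores.
    """
    combined = dict(scores)
    for cand, s in scores.items():
        if len(cand) > 1:
            combined[cand] = s + sum(scores.get((tok,), 0.0) for tok in cand)
    return combined
```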

N-gram Scoring
Not all terms are created equal: good answers are usually highly specific; also disprefer non-units
Solution: IDF-based scoring: S_c = S_c * average_unigram_idf
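
A sketch of the IDF rescoring, assuming unigram document frequencies from some convenient collection (the `doc_freq` table and the smoothing are illustrative):

```python
import math

def idf_rescore(combined_scores, doc_freq, num_docs):
    """S_c = S_c * average unigram IDF, so specific candidates beat
    frequent generic ones."""
    def idf(tok):
        return math.log(num_docs / (1 + doc_freq.get(tok.lower(), 0)))

    return {cand: s * sum(idf(t) for t in cand) / len(cand)
            for cand, s in combined_scores.items()}
```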

N-gram Reranking
Promote the best answer candidates:
Filter out any answer that does not appear in at least two snippets
Use answer-type-specific forms to boost matching candidates (e.g. 'where' questions boost candidates containing 'city', 'state', ...)
Gives a small improvement, depending on answer type
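
A sketch of that final step; the cue lists and the boost factor are illustrative:

```python
def rerank_ngrams(question, snippets, scored, boost=2.0):
    """Drop candidates supported by fewer than two snippets and boost
    those matching simple answer-type cues.

    `snippets` is a list of token lists; `scored` maps n-gram tuples
    to scores from the previous stages.
    """
    cues = {"where": {"city", "state", "country"}}        # illustrative
    qword = question.lower().split()[0]
    reranked = {}
    for cand, score in scored.items():
        text = " ".join(cand).lower()
        support = sum(1 for snip in snippets
                      if text in " ".join(snip).lower())
        if support < 2:
            continue
        if any(tok.lower() in cues.get(qword, ()) for tok in cand):
            score *= boost
        reranked[cand] = score
    return reranked
```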

Summary
Redundancy-based approaches:
Leverage the scale of web search
Take advantage of the presence of 'easy' answers on the web
Exploit the statistical association of question and answer text
Increasingly adopted:
Good performers independently for QA
Provide significant improvements in other systems, especially for answer filtering
Do require some form of 'answer projection': mapping web information back to a TREC document

Aranea download: http://www.umiacs.umd.edu/~jimmylin/resources.html