Query Processing: Query Formulation
Ling573 NLP Systems and Applications
April 14, 2011

Roadmap
Motivation: Retrieval gaps
Query formulation: Question series
Query reformulation: AskMSR patterns
MULDER parse-based formulation
Classic query expansion
Semantic resources
Pseudo-relevance feedback

Retrieval Gaps
Goal: Based on the question, retrieve the documents/passages that best capture the answer.
Problem: Mismatches in lexical choice and sentence structure.
Q: How tall is Mt. Everest?
A: The height of Everest is...
Q: When did the first American president take office?
A: George Washington was inaugurated in...

Query Formulation
Goals:
Overcome lexical gaps & structural differences
To enhance basic retrieval matching
To improve target sentence identification
Issues & approaches:
Differences in word forms: morphological analysis
Differences in lexical choice: query expansion
Differences in structure

Query Formulation
Convert the question into a form suitable for IR; the strategy depends on the document collection.
Web (or similar large collection): stop-structure removal: delete function words, q-words, even low-content verbs.
Corporate sites (or similar smaller collections): query expansion. Can't count on document diversity to recover word variation, so add morphological variants, using WordNet as a thesaurus. Reformulate as declarative, rule-based: "Where is X located?" -> "X is located in". (A sketch of both strategies follows below.)
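To make the two strategies concrete, here is a minimal sketch, not the course's actual system: the stop-structure list and the single rewrite rule are illustrative assumptions.

```python
import re

# Illustrative stop-structure list: q-words, function words, low-content verbs.
STOP_STRUCTURE = {"who", "what", "when", "where", "why", "how",
                  "is", "are", "was", "were", "do", "does", "did",
                  "the", "a", "an", "of", "in"}

def web_query(question):
    """Web-scale strategy: delete function words, q-words, low-content verbs."""
    tokens = re.findall(r"\w+", question.lower())
    return " ".join(t for t in tokens if t not in STOP_STRUCTURE)

def declarative_rewrite(question):
    """Smaller-collection strategy: rule-based declarative reformulation,
    e.g. 'Where is X located?' -> 'X is located in'."""
    m = re.match(r"where is (.+?) located\??$", question.lower())
    if m:
        return f"{m.group(1)} is located in"
    return question  # no rule fired; fall back to the original question

print(web_query("How tall is Mt. Everest?"))                  # -> tall mt everest
print(declarative_rewrite("Where is the Louvre Museum located?"))
# -> the louvre museum is located in
```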
Question Series
TREC 2003-: Target: PERS, ORG, ...
Assessors create a series of questions about the target.
Intended to model interactive Q/A, but often stilted.
Introduces pronouns, anaphora.
Handling Question Series
Given a target and a series, how do we deal with reference?
Shallowest approach: Concatenation: add the target to the question.
Shallow approach: Replacement: replace all pronouns with the target.
Least shallow approach: Heuristic reference resolution.

Question Series Results
No clear winning strategy.
All questions are largely about the target, so no big win for anaphora resolution.
If using bag-of-words features in search, this works fine.
Replacement strategy can be problematic. E.g., Target = Nirvana:
"What is their biggest hit?"
"When was the band formed?"
Wouldn't replace "the band".
Most teams concatenate
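A minimal sketch of the concatenation and replacement strategies; the pronoun list is a hand-written illustrative assumption, and the sketch reproduces the failure above, since "the band" is not a pronoun.

```python
# Hand-written pronoun list; an illustrative assumption, not from the slides.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "his", "its", "their", "them"}

def concatenate(target, question):
    """Shallowest approach: add the target to the question."""
    return f"{target} {question}"

def replace_pronouns(target, question):
    """Shallow approach: replace all pronouns with the target."""
    return " ".join(target if tok.lower() in PRONOUNS else tok
                    for tok in question.split())

print(concatenate("Nirvana", "What is their biggest hit?"))
# -> Nirvana What is their biggest hit?
print(replace_pronouns("Nirvana", "What is their biggest hit?"))
# -> What is Nirvana biggest hit?
print(replace_pronouns("Nirvana", "When was the band formed?"))
# -> unchanged: "the band" is a definite description, not a pronoun
```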
AskMSR
Shallow Processing for QA (Dumais et al. 2002; Lin 2007)
Intuition
Redundancy is useful!
If similar strings appear in many candidate answers, they're likely to be the solution, even if we can't find obvious answer strings.
Q: How many times did Bjorn Borg win Wimbledon?
"Bjorn Borg blah blah blah Wimbledon blah 5 blah"
"Wimbledon blah blah blah Bjorn Borg blah 37 blah."
"blah Bjorn Borg blah blah 5 blah blah Wimbledon"
"5 blah blah Wimbledon blah blah Bjorn Borg."
Probably 5.

Query Reformulation
Identify the question type: e.g., Who, When, Where, ...
Create question-type-specific rewrite rules.
Hypothesis: The wording of the question is similar to that of the answer.
For "where" queries, move "is" to all possible positions:
"Where is the Louvre Museum located?" =>
"Is the Louvre Museum located"
"The is Louvre Museum located"
"The Louvre Museum is located", etc.
Create type-specific answer types (Person, Date, Loc).

Query Form Generation
3 query forms:
Initial baseline query
Exact reformulation: weighted 5 times higher.
Attempts to anticipate the location of the answer; extracts using surface patterns:
"When was the telephone invented?" -> "the telephone was invented ?x"
Generated by ~12 pattern-matching rules over terms and POS, e.g.:
wh-word did A verb B -> A verb+ed B ?x (general)
Where is A? -> A is located in ?x (specific)
Inexact reformulation: bag-of-words
(A sketch of this rewrite generation follows below.)
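A minimal sketch of AskMSR-style rewrite generation, assuming just two of the ~12 rules: the where-question "is"-movement and a "When was X invented" surface pattern. Following the slides, exact rewrites get weight 5 and the bag-of-words fallback gets weight 1.

```python
import re

def askmsr_rewrites(question):
    """Return (query, weight) pairs in the style of AskMSR."""
    rewrites = []
    q = question.rstrip("?")
    tokens = q.split()

    # "Where is X ...": move "is" into every later position (exact, weight 5).
    if len(tokens) > 2 and [t.lower() for t in tokens[:2]] == ["where", "is"]:
        rest = tokens[2:]
        for i in range(len(rest) + 1):
            rewrites.append((" ".join(rest[:i] + ["is"] + rest[i:]), 5))

    # "When was X invented" -> "X was invented ?x" (exact, weight 5).
    m = re.match(r"when was (.+) invented$", q.lower())
    if m:
        rewrites.append((f"{m.group(1)} was invented", 5))

    # Inexact fallback: bag of content words (weight 1).
    stop = {"who", "what", "when", "where", "why", "how", "is", "was", "the"}
    rewrites.append((" ".join(t for t in tokens if t.lower() not in stop), 1))
    return rewrites

for query, weight in askmsr_rewrites("Where is the Louvre Museum located?"):
    print(weight, query)
```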
Query Reformulation: Examples
Deeper Processing for Query Formulation
MULDER (Kwok, Etzioni, & Weld)
Converts the question to multiple search queries, in forms which match the target.
Varies the specificity of the query:
Most general: bag of keywords
Most specific: partial/full phrases
Employs full parsing augmented with morphology.

Syntax for Query Formulation
Parse-based transformations: applies transformational grammar rules to questions.
Example rules:
Subject-auxiliary movement: Q: "Who was the first American in space?" Alt: "was the first American"; "the first American in space was"
Subject-verb movement: "Who shot JFK?" => "shot JFK"
Etc.
Morphology-based transformation: verb conversion: do-aux + v-inf => conjugated verb. (A string-level sketch of these rules follows below.)
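MULDER applies these rules to full parse trees; the sketch below imitates two of the rules at the string level, with a tiny irregular-verb table standing in for the real morphological component. The patterns assume a one-word subject, which a parser-based system would not need.

```python
import re

# Toy irregular-verb table standing in for real morphological generation.
PAST = {"shoot": "shot", "win": "won", "take": "took"}

def subject_verb_movement(question):
    """'Who shot JFK?' => 'shot JFK'."""
    m = re.match(r"who (.+?)\??$", question.lower())
    return m.group(1) if m else None

def verb_conversion(question):
    """'When did A verb B?' => 'A verb+ed B'
    (do-aux + infinitive => conjugated verb)."""
    m = re.match(r"when did (\w+) (\w+) (.+?)\??$", question.lower())
    if not m:
        return None
    subj, verb, obj = m.groups()
    return f"{subj} {PAST.get(verb, verb + 'ed')} {obj}"

print(subject_verb_movement("Who shot JFK?"))          # -> shot jfk
print(verb_conversion("When did Nixon visit China?"))  # -> nixon visited china
```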
Machine Learning Approaches
Diverse approaches: assume annotated query logs, annotated question sets, matched query/snippet pairs.
Learn question paraphrases (MSRA):
Improve QA by mining question sites.
Improve search by generating alternate question forms.
Question reformulation as machine translation:
Given question logs and click-through snippets, train a machine learning model to transform Q -> A.

Query Expansion
Basic idea: Improve matching by adding words with similar meaning or a similar topic to the query.
Alternative strategies:
Use a fixed lexical resource, e.g., WordNet.
Use information from the document collection: pseudo-relevance feedback.

WordNet-Based Expansion
A mixed history in Information Retrieval settings: helped, hurt, or had no effect.
With long queries & long documents: no effect, or a bad one.
Some recent positive results on short queries, e.g., Fang 2008:
Contrasts different WordNet and thesaurus similarity measures.
Adds semantically similar terms to the query, with an additional weight factor based on the similarity score.

Similarity Measures
Definition similarity: S_def(t1, t2):
Word overlap between the glosses of all synsets of the two terms, divided by the total number of words in all the synsets' glosses.
Relation similarity:
Gets a value if the terms are synonyms, hypernyms, hyponyms, holonyms, or meronyms.
Term similarity score from Lin's thesaurus.
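A minimal sketch of the definition (gloss-overlap) similarity using NLTK's WordNet interface; the exact normalization in Fang 2008 may differ from this reading of the slide.

```python
# Requires: pip install nltk; then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def gloss_words(term):
    """All words from the glosses of all synsets of a term."""
    words = []
    for syn in wn.synsets(term):
        words.extend(syn.definition().lower().split())
    return words

def definition_similarity(t1, t2):
    """Gloss-word overlap, normalized by total gloss length."""
    w1, w2 = gloss_words(t1), gloss_words(t2)
    if not w1 or not w2:
        return 0.0
    overlap = len(set(w1) & set(w2))
    return overlap / (len(w1) + len(w2))

print(definition_similarity("cat", "dog"))
```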
Results
Definition similarity yields significant improvements.
Allows matching across POS.
More fine-grained weighting than binary relations.

Collection-based Query Expansion
Xu & Croft '97 (classic)
Thesaurus expansion is problematic: often ineffective.
Issues:
Coverage: many words, esp. NEs, are missing from WordNet.
Domain mismatch: fixed resources are general, or derived from some other domain, and may not match the current search collection: cat/dog vs. cat/more/ls.
Use collection-based evidence: global or local.
Global Analysis
Identifies word co-occurrence in the whole collection; applied to expand the current query.
Context can differentiate/group concepts.
Create an index of concepts:
Concepts = noun phrases (1-3 nouns long)
Representation: context: words in a fixed-length window, 1-3 sentences.
Each concept is indexed by its context words, as if it were a document.
Use the query to retrieve the 30 highest-ranked concepts; add them to the query.

Local Analysis
Aka local feedback, pseudo-relevance feedback:
Use the query to retrieve documents.
Select informative terms from the highly ranked documents.
Add those terms to the query.
Specifically: add the 50 most frequent terms and the 10 most frequent phrases (bigrams w/o stopwords); reweight terms.

Local Context Analysis
Mixes the two previous approaches:
Use the query to retrieve the top n passages (300 words).
Select the top m ranked concepts (noun sequences).
Add them to the query and reweight.
Relatively efficient; applies local search constraints. (A minimal sketch of the local-feedback loop follows below.)
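A minimal sketch of the local-feedback loop over a toy in-memory collection; the term-overlap scorer and the small cutoffs (top 3 documents, 5 expansion terms rather than 50) are simplifying assumptions.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was", "and", "to", "s"}

def tokens(text):
    return re.findall(r"\w+", text.lower())

def score(query_terms, doc):
    """Toy retrieval score: count of query-term occurrences."""
    doc_terms = tokens(doc)
    return sum(doc_terms.count(t) for t in query_terms)

def local_feedback(query, docs, top_docs=3, num_terms=5):
    """Pseudo-relevance feedback: expand the query with the most frequent
    terms from the top-ranked documents, then (in a real system) reweight."""
    q_terms = [t for t in tokens(query) if t not in STOPWORDS]
    ranked = sorted(docs, key=lambda d: score(q_terms, d), reverse=True)
    counts = Counter(t for d in ranked[:top_docs] for t in tokens(d)
                     if t not in STOPWORDS and t not in q_terms)
    return q_terms + [t for t, _ in counts.most_common(num_terms)]

docs = ["Everest's height is 8848 metres, the highest peak on Earth.",
        "The height of Everest was first measured in 1856.",
        "K2 is the second highest mountain in the world."]
print(local_feedback("How tall is Mt. Everest?", docs))
```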
Experimental Contrasts
Improvements over baseline:
Local Context Analysis: +23.5% (relative)
Local Analysis: +20.5%
Global Analysis: +7.8%
LCA is best and most stable across data sets: better term selection than global analysis.
All approaches have fairly high variance: they help some queries and hurt others.
Also sensitive to the number of terms added and the number of documents used.

Example expansions (Global Analysis, Local Analysis, LCA) for the query:
"What are the different techniques used to create self-induced hypnosis?"

Deliverable #2 Discussion
More training data available
Test data released
Requirements
Deliverable Reports