Natural Language Inference, Reading Comprehension and Deep Learning
Christopher Manning (@chrmanning • @stanfordnlp)
Stanford University, SIGIR 2016


Page 1

Natural Language Inference, Reading Comprehension and Deep Learning

Christopher Manning @chrmanning • @stanfordnlp

Stanford University, SIGIR 2016

Page 2

Machine Comprehension: Tested by question answering (Burges)

"A machine comprehends a passage of text if, for any question regarding that text that can be answered correctly by a majority of native speakers, that machine can provide a string which those speakers would agree both answers that question, and does not contain information irrelevant to that question."

Page 3

IR needs language understanding

• There were some things that kept IR and NLP apart
  • IR was heavily focused on efficiency and scale
  • NLP was way too focused on form rather than meaning

• Now there are compelling reasons for them to come together
  • Taking IR precision and recall to the next level
    • [car parts for sale]
    • Should match: Selling automobile and pickup engines, transmissions
    • Example from Jeff Dean's WSDM 2016 talk
  • Information retrieval / question answering in mobile contexts
    • Web snippets no longer cut it on a watch!

Page 4

Menu

1. Natural logic: A weak logic over human languages for inference

2. Distributed word representations

3. Deep, recursive neural network language understanding

Page 5

How can information retrieval be viewed more as theorem proving (than matching)?

Page 6

AI2 4th Grade Science Question Answering [Angeli, Nayak, & Manning, ACL 2016]

Our "knowledge":

Ovaries are the female part of the flower, which produces eggs that are needed for making seeds.

The question:

Which part of a plant produces the seeds?

The answer choices:

• the flower
• the leaves
• the stem
• the roots

Page 7

How can we represent and reason with broad-coverage knowledge?

1. Rigid-schema knowledge bases with well-defined logical inference

2. Open-domain knowledge bases (OpenIE) – no clear ontology or inference [Etzioni et al. 2007ff]

3. Human language text KB – no rigid schema, but with "natural logic" we can do formal inference over human language text [MacCartney and Manning 2008]

Page 8

Natural Language Inference [Dagan 2005, MacCartney & Manning 2009]

Does a piece of text follow from or contradict another?

Two senators received contributions engineered by lobbyist Jack Abramoff in return for political favors.
Jack Abramoff attempted to bribe two legislators.
→ Follows

Here, try to prove or refute according to a large text collection:
1. The flower of a plant produces the seeds
2. The leaves of a plant produces the seeds
3. The stem of a plant produces the seeds
4. The roots of a plant produces the seeds

Page 9

Text as Knowledge Base

Storing knowledge as text is easy!

Doing inferences over text might be hard

Don't want to run inference over every fact!

Don't want to store all the inferences!

Page 10

Inferences … on demand from a query … [Angeli and Manning 2014]

Page 11

… using text as the meaning representation

Page 12

Natural Logic: logical inference over text

We are doing logical inference:
The cat ate a mouse ⊨ ¬ No carnivores eat animals

We do it with natural logic:
If I mutate a sentence in this way, do I preserve its truth?
Post-Deal Iran Asks if U.S. Is Still 'Great Satan,' or Something Less ⊨ A Country Asks if U.S. Is Still 'Great Satan,' or Something Less

• A sound and complete weak logic [Icard and Moss 2014]
• Expressive for common human inferences*
• "Semantic" parsing is just syntactic parsing
• Tractable: polynomial-time entailment checking
• Plays nicely with lexical matching back-off methods

Page 13

#1. Common sense reasoning

Polarity in Natural Logic

We order phrases in partial orders
Simplest one: is-a-kind-of
Also: geographical containment, etc.

Polarity: in a certain context, is it valid to move up or down in this order?

Page 14

Example inferences

Quantifiers determine the polarity of phrases

Valid mutations consider polarity

Successful toy inference:
• All cats eat mice ⊨ All house cats consume rodents

Page 15

"Soft" Natural Logic

• We also want to make likely (but not certain) inferences
• Same motivation as Markov logic, probabilistic soft logic, etc.
• Each mutation edge template feature has a cost θ_i ≥ 0
• Cost of an edge is θ_i · f_i
• Cost of a path is θ · f
• Can learn parameters θ
• Inference is then graph search
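A minimal sketch of that last point: with non-negative edge costs θ · f, inference over mutations is just uniform-cost (Dijkstra-style) search from the query toward any sentence already in the knowledge base. The mutation generator and cost function here are placeholders, not the real NaturalLI edge templates.

```python
import heapq

def natural_logic_search(query, knowledge_base, mutations, edge_cost, max_cost=10.0):
    """Uniform-cost search over sentence mutations.

    query          -- the hypothesis sentence (string)
    knowledge_base -- set of sentences treated as known facts
    mutations      -- function: sentence -> iterable of (next_sentence, feature_vector)
    edge_cost      -- function: feature_vector -> non-negative float (theta . f)
    Returns (total_cost, matched_fact), or None if nothing is reachable under max_cost.
    """
    frontier = [(0.0, query)]
    best = {query: 0.0}
    while frontier:
        cost, sent = heapq.heappop(frontier)
        if sent in knowledge_base:
            return cost, sent                       # found a supporting premise
        if cost > best.get(sent, float("inf")) or cost > max_cost:
            continue                                # stale or too expensive
        for nxt, features in mutations(sent):
            new_cost = cost + edge_cost(features)
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier, (new_cost, nxt))
    return None
```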

Page 16

#2. Dealing with real sentences

Natural logic works with facts like these in the knowledge base:
Obama was born in Hawaii

But real-world sentences are complex and long:
Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he served as president of the Harvard Law Review.

Approach:
1. A classifier divides long sentences into entailed clauses
2. Natural logic inference can shorten these clauses

Page 17

Universal Dependencies (UD)
http://universaldependencies.github.io/docs/

A single level of typed dependency syntax that
(i) works for all human languages
(ii) gives a simple, human-friendly representation of sentences

Dependency syntax is better than a phrase-structure tree for machine interpretation – it's almost a semantic network

UD aims to be linguistically better across languages than earlier representations, such as CoNLL dependencies

Page 18

Generation of minimal clauses

1. Classification problem: given a dependency edge, does it introduce a clause?

2. Is it missing a controlled subject from the subject/object?

3. Shorten clauses while preserving validity, using natural logic!
   • All young rabbits drink milk ⊭ All rabbits drink milk

• OK: SJC, the Bay Area's third largest airport, often experiences delays due to weather.
• Often better: SJC often experiences delays.

Page 19

#3. Add a lexical alignment classifier

• Sometimes we can't quite make the inferences that we would like to make:

• We use a simple lexical match back-off classifier with features:
  • Matching words, mismatched words, unmatched words
• These always work pretty well
• This was the lesson of RTE evaluations, and perhaps of IR in general

Page 20

The full system

• We run our usual search over split-up, shortened clauses
• If we find a premise, great!
• If not, we use the lexical classifier as an evaluation function

• We work to do this quickly at scale
  • Visit 1M nodes/second; don't re-featurize, just delta
  • 32-byte search states (thanks Gabor!)

Page 21

Solving NY State 4th grade science (Allen AI Institute datasets)

Multiple choice questions from real 4th grade science exams:
Which activity is an example of a good health habit?
(A) Watching television (B) Smoking cigarettes (C) Eating candy (D) Exercising every day

In our corpus knowledge base:
• Plasma TV's can display up to 16 million colors ... great for watching TV ... also make a good screen.
• Not smoking or drinking alcohol is good for health, regardless of whether clothing is worn or not.
• Eating candy for diner is an example of a poor health habit.
• Healthy is exercising

Page 22

Solving 4th grade science (Allen AI NDMC)

System                                                  Dev    Test
KnowBot [Hixon et al. NAACL 2015]                        45      –
KnowBot (augmented with human in loop)                   57      –
IR baseline (Lucene)                                     49     42
NaturalLI                                                52     51
More data + IR baseline                                  62     58
More data + NaturalLI                                    65     61
NaturalLI + lexical classifier                           74     67
Aristo [Clark et al. 2016], 6 systems, even more data     –     71

Test set: New York Regents 4th Grade Science exam multiple-choice questions from AI2.
Training: Basic is Barron's study guide; "more data" is the SciText corpus from AI2.
Score: % correct

Page 23

Natural Logic

• Can we just use text as a knowledge base?
• Natural logic provides a useful, formal (weak) logic for textual inference
• Natural logic is easily combinable with lexical matching methods, including neural net methods
• The resulting system is useful for:
  • Common-sense reasoning
  • Question answering
  • Open information extraction
    • i.e., getting out relation triples from text

Page 24

Can information retrieval benefit from distributed representations of words?

Page 25

From symbolic to distributed representations

The vast majority of rule-based or statistical NLP and IR work regarded words as atomic symbols: hotel, conference, walk

In vector space terms, this is a vector with one 1 and a lot of zeroes

[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]

We now call this a "one-hot" representation.

Sec. 9.2.2

Page 26

From symbolic to distributed representations

Its problem:
• If a user searches for [Dell notebook battery size], we would like to match documents with "Dell laptop battery capacity"
• If a user searches for [Seattle motel], we would like to match documents containing "Seattle hotel"

But
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]ᵀ
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0

Our query and document vectors are orthogonal
There is no natural notion of similarity in a set of one-hot vectors
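A quick numeric check of that orthogonality claim (the vector positions here are arbitrary):

```python
import numpy as np

# One-hot vectors for "motel" and "hotel" over a 15-word vocabulary (positions arbitrary).
motel = np.zeros(15); motel[10] = 1.0
hotel = np.zeros(15); hotel[7] = 1.0

print(motel @ hotel)   # 0.0 -> orthogonal: no similarity signal at all
print(motel @ motel)   # 1.0 -> a word is only "similar" to itself
```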

Sec. 9.2.2

Page 27

Capturing similarity

There are many things you can do about similarity, many well known in IR

Query expansion with synonym dictionaries

Learning word similarities from large corpora

But a word representation that encodes similarity wins:
Fewer parameters to learn (per word, not per pair)

More sharing of statistics

More opportunities for multi-task learning

Page 28

Distributional similarity-based representations

You can get a lot of value by representing a word by means of its neighbors

"You shall know a word by the company it keeps" (J. R. Firth 1957: 11)

One of the most successful ideas of modern NLP:

  government debt problems turning into banking crises as has happened in
  saying that Europe needs unified banking regulation to replace the hodgepodge

These surrounding words will represent banking

Page 29

Basic idea of learning neural network word embeddings

We define some model that aims to predict a word based on other words in its context

Choose argmax_w  w · ((w_{j−1} + w_{j+1}) / 2)

which has a loss function, e.g.,

J = 1 − w_j · ((w_{j−1} + w_{j+1}) / 2)

We look at many samples from a big language corpus

We keep adjusting the vector representations of words to minimize this loss

(Unit-norm vectors)
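A toy numpy rendering of this objective, assuming unit-norm vectors and the simple loss above (real systems use softmax or negative-sampling objectives instead; all numbers here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, vocab = 8, 5
W = rng.normal(size=(vocab, dim))
W /= np.linalg.norm(W, axis=1, keepdims=True)       # keep word vectors unit norm

def loss_and_grad(j, context, W):
    """J = 1 - w_j . mean(context vectors); gradient with respect to w_j."""
    ctx = W[context].mean(axis=0)
    return 1.0 - W[j] @ ctx, -ctx

# One stochastic update for center word 2 with context words 1 and 3.
J_before, grad = loss_and_grad(2, [1, 3], W)
W[2] -= 0.1 * grad                                   # gradient step on the center vector
W[2] /= np.linalg.norm(W[2])                         # re-project to unit norm
J_after, _ = loss_and_grad(2, [1, 3], W)
print(round(J_before, 3), round(J_after, 3))         # loss typically decreases
```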

Page 30

With distributed, distributional representations, syntactic and semantic similarity is captured

currency = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]

Page 31

Distributional representations can solve the fragility of NLP tools

Standard NLP systems – here, the Stanford Parser – are incredibly fragile because of symbolic representations

Crazy sentential complement, such as for "likes [(being) crazy]"

Page 32

Distributional representations can capture the long tail of IR similarity

Google's RankBrain

Not necessarily as good for the head of the query distribution, but great for seeing similarity in the tail

3rd most important ranking signal (we're told ...)

Page 33

LSA (Latent Semantic Analysis) vs. word2vec

LSA: Count! models
• Factorize a (maybe weighted, often log-scaled) term-document (Deerwester et al. 1990) or word-context matrix (Schütze 1992) into UΣVᵀ
• Retain only k singular values, in order to generalize

[Cf. Baroni: Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL 2014]

Sec. 18.3
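A minimal sketch of the count-and-factorize recipe with numpy (toy co-occurrence matrix, log-scaled counts, rank-k truncation; all numbers are invented):

```python
import numpy as np

# Toy word-context co-occurrence counts (rows: words, columns: contexts).
counts = np.array([[10, 2, 0, 1],
                   [ 8, 1, 0, 2],
                   [ 0, 0, 9, 7],
                   [ 1, 0, 8, 6]], dtype=float)

X = np.log1p(counts)                        # log-scale the counts
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                       # retain only k singular values
word_vecs = U[:, :k] * S[:k]                # k-dimensional word representations

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cos(word_vecs[0], word_vecs[1]), 2))   # high: rows 0 and 1 co-occur alike
print(round(cos(word_vecs[0], word_vecs[2]), 2))   # low: very different context profile
```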

Page 34

LSA vs. word2vec

word2vec CBOW / Skip-gram: Predict!
[Mikolov et al. 2013]: Simple predict models for learning word vectors
• Train word vectors to try to either:
  • Predict a word given its bag-of-words context (CBOW); or
  • Predict a context word (position-independent) from the center word (Skip-gram)
• Update word vectors until they can do this prediction well

Sec. 18.3
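For reference, training such vectors is a few lines with gensim (parameter names below follow recent gensim 4.x releases; the corpus and settings are placeholders, not those used in the talk):

```python
from gensim.models import Word2Vec

# Tiny placeholder corpus; in practice this is billions of tokens.
sentences = [["government", "debt", "problems", "turning", "into", "banking", "crises"],
             ["europe", "needs", "unified", "banking", "regulation"]]

# sg=1 selects Skip-gram; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

print(model.wv.most_similar("banking", topn=3))   # nearest words by cosine similarity
```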

Page 35

word2vec encodes semantic components as linear relations

Page 36

COALS model (count-modified LSA) [Rohde, Gonnerman & Plaut, ms., 2005]

Page 37

Count based vs. direct prediction

Count based: LSA, HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert)
• Fast training
• Efficient usage of statistics
• Primarily used to capture word similarity
• May not use the best methods for scaling counts

Direct prediction: NNLM, HLBL, RNN, word2vec Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)
• Scales with corpus size
• Inefficient usage of statistics
• Can capture complex patterns beyond word similarity
• Generate improved performance on other tasks

Page 38

Encoding meaning in vector differences [Pennington, Socher, and Manning, EMNLP 2014]

Ratios of co-occurrence probabilities can encode meaning components

Crucial insight:

                            x = solid   x = gas   x = water   x = random
P(x | ice)                  large       small     large       small
P(x | steam)                small       large     large       small
P(x | ice) / P(x | steam)   large       small     ~1          ~1

Page 39

Encoding meaning in vector differences [Pennington, Socher, and Manning, EMNLP 2014]

Ratios of co-occurrence probabilities can encode meaning components

Crucial insight:

                            x = solid    x = gas      x = water    x = fashion
P(x | ice)                  1.9 × 10⁻⁴   6.6 × 10⁻⁵   3.0 × 10⁻³   1.7 × 10⁻⁵
P(x | steam)                2.2 × 10⁻⁵   7.8 × 10⁻⁴   2.2 × 10⁻³   1.8 × 10⁻⁵
P(x | ice) / P(x | steam)   8.9          8.5 × 10⁻²   1.36         0.96

Page 40

Encoding meaning in vector differences

Q: How can we capture ratios of co-occurrence probabilities as meaning components in a word vector space?

A: Log-bilinear model, with vector differences
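The equations behind that answer rendered as images on the slide; they are reconstructed here from the GloVe paper (Pennington et al., EMNLP 2014):

```latex
% Log-bilinear model: dot products approximate log co-occurrence probabilities
w_i \cdot w_j = \log P(i \mid j)

% ... so vector differences capture ratios of co-occurrence probabilities
w_x \cdot (w_a - w_b) = \log \frac{P(x \mid a)}{P(x \mid b)}

% GloVe's weighted least-squares objective over the co-occurrence matrix X
J = \sum_{i,j=1}^{V} f(X_{ij})\,\bigl(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^2
```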

Page 41

GloVe word similarities [Pennington et al., EMNLP 2014]

Nearest words to frog:
1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus

[Images: litoria, leptodactylidae, rana, eleutherodactylus]
http://nlp.stanford.edu/projects/glove/
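A sketch of how such neighbor lists can be computed from the pre-trained vectors distributed on that page (it assumes a downloaded file such as glove.6B.50d.txt; the path is a placeholder):

```python
import numpy as np

def load_glove(path):
    """Parse GloVe's plain-text format: one word per line, followed by its vector."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *nums = line.rstrip().split(" ")
            vecs[word] = np.array(nums, dtype=float)
    return vecs

def nearest(vecs, query, topn=7):
    """Rank all other words by cosine similarity to the query word."""
    q = vecs[query]
    q = q / np.linalg.norm(q)
    scored = ((w, (v / np.linalg.norm(v)) @ q) for w, v in vecs.items() if w != query)
    return sorted(scored, key=lambda x: -x[1])[:topn]

vecs = load_glove("glove.6B.50d.txt")   # placeholder path to pre-trained vectors
print(nearest(vecs, "frog"))
```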

Page 42

GloVe Visualizations

http://nlp.stanford.edu/projects/glove/

Page 43

GloVe Visualizations: Company - CEO

Page 44

Named Entity Recognition Performance

Model (trained on CoNLL)   CoNLL '03 dev   CoNLL '03 test   ACE2   MUC7
Categorical CRF                 91.0            85.4         77.4   73.4
SVD (log tf)                    90.5            84.8         73.6   71.5
HPCA                            92.6            88.7         81.7   80.7
C&W                             92.2            87.4         81.7   80.2
CBOW                            93.1            88.2         82.2   81.1
GloVe                           93.2            88.3         82.9   82.2

F1 score of a CRF trained on CoNLL 2003 English with 50-dimensional word vectors

Page 45

Word embeddings: Conclusion

GloVe translates meaningful relationships between word-word co-occurrence counts into linear relations in the word vector space

GloVe shows the connection between Count! work and Predict! work – appropriate scaling of counts gives the properties and performance of Predict! models

A lot of other important work in this line of research:
[Levy & Goldberg 2014] [Arora, Li, Liang, Ma & Risteski 2015] [Hashimoto, Alvarez-Melis & Jaakkola 2016]

Page 46

Can we use neural networks to understand, not just word similarities, but language meaning in general?

Page 47

Compositionality

Artificial intelligence requires being able to understand bigger things from knowing about smaller parts

Page 48

We need more than word embeddings!

How can we know when larger linguistic units are similar in meaning?

The snowboarder is leaping over the mogul
A person on a snowboard jumps into the air

People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements

Page 49

Beyond the bag of words: Sentiment detection

Is the tone of a piece of text positive, negative, or neutral?

• The sentiment is that sentiment is "easy"
• Detection accuracy for longer documents is ~90%, BUT

... loved ... great ... impressed ... marvelous ...

Page 50

Stanford Sentiment Treebank

• 215,154 phrases labeled in 11,855 sentences
• Can train and test compositions

http://nlp.stanford.edu:8080/sentiment/

Page 51

Tree-Structured Long Short-Term Memory Networks [Tai et al., ACL 2015]

Page 52

Tree-structured LSTM

Generalizes the sequential LSTM to trees with any branching factor
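For reference, the Child-Sum Tree-LSTM equations from Tai et al. (ACL 2015), reproduced here from the paper since the slide showed them as a figure (C(j) denotes the children of node j):

```latex
\tilde{h}_j = \sum_{k \in C(j)} h_k
i_j    = \sigma\bigl(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}\bigr)
f_{jk} = \sigma\bigl(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\bigr)
o_j    = \sigma\bigl(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}\bigr)
u_j    = \tanh\bigl(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}\bigr)
c_j    = i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k
h_j    = o_j \odot \tanh(c_j)
```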

Page 53

Positive/Negative Results on Treebank

[Bar chart: accuracy (75–95% scale) for BiNB, RNN, MV-RNN, RNTN, and TreeLSTM, comparing training with sentence labels only vs. training with the full treebank]

Page 54

Experimental Results on Treebank

• TreeRNNs can capture constructions like X but Y
• Biword Naïve Bayes is only 58% on these

Page 55

Stanford Natural Language Inference Corpus
http://nlp.stanford.edu/projects/snli/
570K Turker-judged pairs, based on an assumed picture

A man rides a bike on a snow covered road. / A man is outside. → ENTAILMENT

2 female babies eating chips. / Two female babies are enjoying chips. → NEUTRAL

A man in an apron shopping at a market. / A man in an apron is preparing dinner. → CONTRADICTION

Page 56

NLI with Tree-RNNs [Bowman, Angeli, Potts & Manning, EMNLP 2015]

Approach: We would like to work out the meaning of each sentence separately – a pure compositional model

Then we compare them with a NN & classify for inference

[Architecture diagram: learned word vectors → composition NN layers build up phrase vectors for "man outside" and "man in snow" → comparison NN layer(s) → softmax classifier → P(Entail) = 0.8]
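A schematic of the comparison-and-classify head on top of the two composed sentence vectors (a hedged sketch, not the exact architecture of Bowman et al. 2015; the layer sizes and the feature combination are illustrative choices):

```python
import torch
import torch.nn as nn

class NLIClassifier(nn.Module):
    """Compare two composed sentence vectors and classify entail/neutral/contradict."""
    def __init__(self, dim=300, hidden=200, classes=3):
        super().__init__()
        self.compare = nn.Sequential(nn.Linear(4 * dim, hidden), nn.Tanh())
        self.out = nn.Linear(hidden, classes)

    def forward(self, premise_vec, hypothesis_vec):
        # A common comparison feature set: both vectors, their difference, their product.
        feats = torch.cat([premise_vec, hypothesis_vec,
                           premise_vec - hypothesis_vec,
                           premise_vec * hypothesis_vec], dim=-1)
        return torch.softmax(self.out(self.compare(feats)), dim=-1)

# Placeholder sentence vectors standing in for the composed tree representations.
p, h = torch.randn(1, 300), torch.randn(1, 300)
print(NLIClassifier()(p, h))   # probabilities over {entail, neutral, contradict}
```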

Page 57

Tree recursive NNs (TreeRNNs)

Theoretically appealing

Very empirically competitive

But

Prohibitively slow

Usually require an external parser

Don't exploit the complementary linear structure of language

Page 58

A recurrent NN allows efficient batched computation on GPUs

Page 59

TreeRNN: Input-specific structure undermines batched computation

Page 60

The Shift-reduce Parser-Interpreter NN (SPINN) [Bowman, Gauthier et al. 2016]

Base model equivalent to a TreeRNN, but ... supports batched computation: 25× speedups

Plus: an effective new hybrid that combines linear and tree-structured context

Can stand alone without a parser

Page 61

Beginning observation: binary trees = transition sequences

SHIFT SHIFT REDUCE SHIFT SHIFT REDUCE REDUCE

SHIFT SHIFT SHIFT SHIFT REDUCE REDUCE REDUCE

SHIFT SHIFT SHIFT REDUCE SHIFT REDUCE REDUCE
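To see the equivalence, here is a tiny sketch that replays a transition sequence with a stack and a token buffer, recovering the binary tree it encodes (the token strings are made up):

```python
def build_tree(tokens, transitions):
    """Replay SHIFT/REDUCE transitions: SHIFT pushes the next token,
    REDUCE pops two subtrees and pushes their combination."""
    buffer = list(tokens)
    stack = []
    for op in transitions:
        if op == "SHIFT":
            stack.append(buffer.pop(0))
        else:                                  # REDUCE
            right, left = stack.pop(), stack.pop()
            stack.append((left, right))
    assert len(stack) == 1 and not buffer      # a valid sequence yields one tree
    return stack[0]

seq = ["SHIFT", "SHIFT", "REDUCE", "SHIFT", "SHIFT", "REDUCE", "REDUCE"]
print(build_tree(["the", "cat", "sat", "down"], seq))
# (('the', 'cat'), ('sat', 'down'))
```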

Page 62

The Shift-reduce Parser-Interpreter NN (SPINN)

Page 63

The Shift-reduce Parser-Interpreter NN (SPINN)

The model includes a sequence LSTM RNN
• This acts as a simple parser by predicting SHIFT or REDUCE
• It also gives left sequence context as input to composition

Page 64

Implementing the stack

• Naïve implementation: simulate the stacks in a batch with a fixed-size multidimensional array at each time step
  • Backpropagation requires that each intermediate stack be maintained in memory
  • ⇒ Large amount of data copying and movement required

• Efficient implementation
  • Have only one stack array for each example
  • At each time step, augment it with the current head of the stack
  • Keep a list of backpointers for REDUCE operations
  • Similar to zipper data structures employed elsewhere

Page 65

A thinner stack

Step   Array                 Backpointers
1      Spot                  1
2      sat                   1 2
3      down                  1 2 3
4      (sat down)            1 4
5      (Spot (sat down))     5
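A toy sketch of that thin-stack idea (illustrative only; the real SPINN implementation batches this on GPU and composes vectors rather than tuples): each step appends exactly one entry to a single append-only array, and the logical stack is recovered through backpointers.

```python
def run_thin_stack(tokens, transitions, compose=lambda l, r: (l, r)):
    """Append-only 'thin stack': the array grows by one entry per step;
    backpointers record which array cells form the current logical stack."""
    buffer = list(tokens)
    array = []                    # one entry appended per transition
    stack_ptrs = []               # indices into `array` for the current stack
    history = []                  # (newest entry, stack pointers) per step
    for op in transitions:
        if op == "SHIFT":
            array.append(buffer.pop(0))
        else:                     # REDUCE: combine the top two stack entries
            right = array[stack_ptrs.pop()]
            left = array[stack_ptrs.pop()]
            array.append(compose(left, right))
        stack_ptrs.append(len(array) - 1)
        history.append((array[-1], list(stack_ptrs)))
    return history

seq = ["SHIFT", "SHIFT", "SHIFT", "REDUCE", "REDUCE"]
for step, (entry, ptrs) in enumerate(run_thin_stack(["Spot", "sat", "down"], seq), 1):
    print(step, entry, ptrs)
# Mirrors the table above (indices here are 0-based); the last step holds
# ('Spot', ('sat', 'down')) with the stack pointing only at that final cell.
```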

Page 66

Page 67

Using SPINN for natural language inference

[Diagram: SPINN composes "(the cat) (sat down)" and "(the cat) (is angry)" from their shift-reduce derivations into sentence vectors o₁ and o₂, which are then compared for the inference decision]

Page 68

SNLI Results

Model                                                        % Accuracy (test set)
Feature-based classifier                                           78.2
Previous SOTA sentence encoder [Mou et al. 2016]                   82.1
LSTM RNN sequence model                                            80.6
Tree LSTM                                                          80.9
SPINN                                                              83.2
SOTA (sentence pair alignment model) [Parikh et al. 2016]          86.8

Page 69

Successes for SPINN over LSTM

Examples with negation
• P: The rhythmic gymnast completes her floor exercise at the competition.
• H: The gymnast cannot finish her exercise.

Long examples (>20 words)
• P: A man wearing glasses and a ragged costume is playing a Jaguar electric guitar and singing with the accompaniment of a drummer.
• H: A man with glasses and a disheveled outfit is playing a guitar and singing along with a drummer.

Page 70

Envoi

• There are very good reasons for wanting to represent meaning with distributed representations
  • So far, distributional learning has been most effective for this
  • But cf. [Young, Lai, Hodosh & Hockenmaier 2014] on denotational representations, using visual scenes

• However, we want not just word meanings, but also:
  • Meanings of larger units, calculated compositionally
  • The ability to do natural language inference

• The SPINN model is fast – close to recurrent networks!
  • Its hybrid sequence/tree structure is psychologically plausible and outperforms other sentence composition methods

Page 71

Final Thoughts

[Timeline figure: deep learning arriving in speech (2011), vision (2013), NLP (2015), and next IR – "You are here", between NLP and IR]

Page 72

Final Thoughts

I'm certain that deep learning will come to dominate SIGIR over the next couple of years ... just like speech, vision, and NLP before it. This is a good thing. Deep learning provides some powerful new techniques that are just being amazingly successful on many hard applied problems. However, we should realize that there is also currently a huge amount of hype about deep learning and artificial intelligence. We should not let a genuine enthusiasm for important and successful new techniques lead to irrational exuberance or a diminished appreciation of other approaches. Finally, despite the efforts of a number of people, in practice there has been a considerable division between the human language technology fields of IR, NLP, and speech. Partly this is due to organizational factors, and partly it is that at one time the subfields each had a very different focus. However, recent changes in emphasis – with IR people wanting to understand the user better and NLP people much more interested in meaning and context – mean that there are a lot of common interests, and I would encourage much more collaboration between NLP and IR in the next decade.