lec02 - Introduction to Deep Learning

Deep Learning and Lexical, Syntactic and Semantic Analysis. Wanxiang Che and Yue Zhang, 2016-10.


Page 1: lec02-Introduction to Deep Learning

Deep Learning and Lexical, Syntactic and Semantic Analysis

Wanxiang Che and Yue Zhang, 2016-10

Page 2: lec02-Introduction to Deep Learning

Part 2: Introduction to Deep Learning

Page 3: lec02-Introduction to Deep Learning

Part 2.1: Deep Learning Background

Page 4: lec02-Introduction to Deep Learning

What is Machine Learning?

• From Data to Knowledge

[Figure: a traditional program maps Input + Algorithm → Output; an ML program maps Input + Output → "Algorithm".]

Page 5: lec02-Introduction to Deep Learning

A Standard Example of ML

• The MNIST (Modified NIST) database of hand-written digit recognition
– Publicly available
– A huge amount is known about how well various ML methods do on it
– 60,000 + 10,000 hand-written digits (28x28 pixels each)

Page 6: lec02-Introduction to Deep Learning

Very hard to say what makes a 2

Page 7: lec02-Introduction to Deep Learning

Traditional Model (before 2012)

• Fixed/engineered features + trainable classifier
– Designing a feature extractor requires considerable effort by experts

[Figure: hand-engineered feature extractors such as SIFT, GIST, and Shape Context.]

Page 8: lec02-Introduction to Deep Learning

Deep Learning (after 2012)

• Learning Hierarchical Representations
• DEEP means more than one stage of non-linear feature transformation

Page 9: lec02-Introduction to Deep Learning

Deep Learning Architecture

Page 10: lec02-Introduction to Deep Learning

Deep Learning is Not New

• 1980s technology (Neural Networks)

Page 11: lec02-Introduction to Deep Learning

About Neural Networks

• Pros
– Simple to learn p(y|x)
– Results OK for shallow nets

• Cons
– Does not learn p(x)
– Trouble with >3 layers
– Overfits
– Slow to train

Page 12: lec02-Introduction to Deep Learning

Deep Learning Beats NN

• The same pros and cons as before:
– Simple to learn p(y|x); results OK for shallow nets
– Does not learn p(x); trouble with >3 layers; overfits; slow to train

• Deep learning addresses the cons with new techniques:
– Unsupervised feature learning: RBMs, DAEs
– Dropout, Maxout, Stochastic Pooling
– New activation functions (ReLU, ...) and gated mechanisms
– GPUs

Page 13: lec02-Introduction to Deep Learning

Results on MNIST

• Naïve Neural Network: 96.59%
• SVM (default settings for libsvm): 94.35%
• Optimal SVM [Andreas Mueller]: 98.56%
• The state of the art, a Convolutional NN (2013): 99.79%
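These numbers are as reported on the slide. Below is a minimal sketch of how the libsvm-default baseline might be reproduced with scikit-learn (which wraps libsvm); the dataset fetch and the 60,000/10,000 split are assumptions, not the original experimental setup.

```python
# Hedged sketch: an SVM baseline on MNIST with scikit-learn (not the original setup).
from sklearn.datasets import fetch_openml
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the 70,000 MNIST digits (28x28 pixels, flattened to 784 features).
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0                       # scale pixel values to [0, 1]
X_train, y_train = X[:60000], y[:60000]
X_test, y_test = X[60000:], y[60000:]

clf = SVC()                         # default RBF-kernel SVM, roughly the "default settings" baseline
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```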

Page 14: lec02-Introduction to Deep Learning

Deep Learning Wins

9. MICCAI 2013 Grand Challenge on Mitosis Detection
8. ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images
7. ISBI 2012 Brain Image Segmentation Challenge (with superhuman pixel error rate)
6. IJCNN 2011 Traffic Sign Recognition Competition (only our method achieved superhuman results)
5. ICDAR 2011 Offline Chinese Handwriting Competition
4. Online German Traffic Sign Recognition Contest
3. ICDAR 2009 Arabic Connected Handwriting Competition
2. ICDAR 2009 Handwritten Farsi/Arabic Character Recognition Competition
1. ICDAR 2009 French Connected Handwriting Competition

Compare the overview page on handwriting recognition:
• http://people.idsia.ch/~juergen/deeplearning.html

Page 15: lec02-Introduction to Deep Learning

Deep Learning for Speech Recognition

Page 16: lec02-Introduction to Deep Learning

Deep Learning for NLP

[Figure: Big Data (monolingual, multi-lingual, and multi-modal data) feeds deep learning models (Recurrent NN, Convolutional NN, Recursive NN), which build a Semantic Vector Space used by applications such as POS tagging, word segmentation, parsing, QA, MT, dialog, caption generation, and MCTest. The example words in the figure are 喜欢 "like" and 苹果 "apple".]

Page 17: lec02-Introduction to Deep Learning

Part 2.2: Feedforward Neural Networks

Page 18: lec02-Introduction to Deep Learning

The Traditional Paradigm for ML

1. Convert the raw input vector into a vector of feature activations
– Use hand-written programs based on common sense to define the features
2. Learn how to weight each of the feature activations to get a single scalar quantity
3. If this quantity is above some threshold, decide that the input vector is a positive example of the target class

Page 19: lec02-Introduction to Deep Learning

The Standard Perceptron Architecture

[Figure: input units → hand-coded programs → feature units → learned weights → decision unit. Example: for the input "IPhone is very good .", features such as good, very/good, very, ... receive weights 0.8, 0.9, 0.1, ..., and the decision unit checks whether the weighted sum exceeds a threshold (> 5?).]
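A minimal sketch of this architecture, assuming a tiny hand-coded n-gram feature extractor and toy training data; the features, sentences, and weights are illustrative, not the ones on the slide.

```python
# Hedged sketch of the perceptron architecture: hand-coded features + learned weights + threshold.
def extract_features(sentence):
    """Hand-coded feature extractor (the fixed, non-learned part)."""
    tokens = sentence.lower().split()
    feats = set(tokens)                                          # unigrams
    feats |= {a + "/" + b for a, b in zip(tokens, tokens[1:])}   # bigrams
    return feats

def train_perceptron(data, epochs=10):
    """Learn one weight per feature with the classic mistake-driven perceptron update."""
    weights = {}
    for _ in range(epochs):
        for sentence, label in data:                             # label is +1 or -1
            score = sum(weights.get(f, 0.0) for f in extract_features(sentence))
            if (1 if score > 0 else -1) != label:                # wrong prediction: update
                for f in extract_features(sentence):
                    weights[f] = weights.get(f, 0.0) + label
    return weights

# Illustrative toy data (not from the lecture).
data = [("IPhone is very good .", 1), ("the battery is very bad .", -1)]
w = train_perceptron(data)
print(sum(w.get(f, 0.0) for f in extract_features("IPhone is good .")))
```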

Page 20: lec02-Introduction to Deep Learning

The Limitations of Perceptrons

• The hand-coded features
– Have a great influence on performance
– Finding suitable features costs a lot of effort

• A linear classifier with a hyperplane
– Cannot separate non-linear data; for example, the XOR function cannot be learned by a single-layer perceptron

[Figure: the four XOR points (0,0), (0,1), (1,0), (1,1); the positive and negative cases cannot be separated by a hyperplane.]

Page 21: lec02-Introduction to Deep Learning

Learning with Non-linear Hidden Layers

Page 22: lec02-Introduction to Deep Learning

Feedforward Neural Networks

• The information is propagated from the inputs to the outputs
• Time has no role (NO cycle between outputs and inputs)
• Multi-layer Perceptron (MLP)
• Learning the weights of hidden units is equivalent to learning features
• Networks without hidden layers are very limited in the input-output mappings they can model
– More layers of linear units do not help: it is still linear
– Fixed output non-linearities are not enough

[Figure: an MLP with inputs x1, x2, ..., xn, a 1st hidden layer, a 2nd hidden layer, and an output layer.]
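A minimal NumPy sketch of the forward pass of such a network, with two tanh hidden layers and a softmax output; all layer sizes are arbitrary placeholders.

```python
# Hedged sketch: forward pass of a feedforward network (MLP) with two hidden layers.
import numpy as np

def softmax(z):
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(10)            # input vector x1..xn (n = 10 here)

# Randomly initialized parameters; in practice these are learned by SGD.
W1, b1 = rng.standard_normal((8, 10)) * 0.1, np.zeros(8)   # 1st hidden layer
W2, b2 = rng.standard_normal((6, 8)) * 0.1, np.zeros(6)    # 2nd hidden layer
W3, b3 = rng.standard_normal((3, 6)) * 0.1, np.zeros(3)    # output layer (3 classes)

h1 = np.tanh(W1 @ x + b1)              # non-linear hidden layers: this is where
h2 = np.tanh(W2 @ h1 + b2)             # the network "learns features"
y = softmax(W3 @ h2 + b3)
print(y, y.sum())
```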

Page 23: lec02-Introduction to Deep Learning

Multiple-Layer Neural Networks

• What are those hidden neurons doing?
– Maybe they represent outlines

Page 24: lec02-Introduction to Deep Learning

General Optimization (Learning) Algorithms

• Gradient Descent
• Stochastic Gradient Descent (SGD)
– Minibatch SGD (m > 1), Online GD (m = 1); see the sketch below
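A minimal sketch of the generic minibatch SGD loop; the least-squares objective is only there to make the gradient concrete, and all hyperparameters are arbitrary.

```python
# Hedged sketch of minibatch SGD; with m = 1 this reduces to online gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ w_true + 0.01 * rng.standard_normal(1000)

w = np.zeros(5)                         # parameters to learn
lr, m = 0.1, 32                         # learning rate and minibatch size

for epoch in range(20):
    perm = rng.permutation(len(X))      # shuffle each epoch
    for i in range(0, len(X), m):
        idx = perm[i:i + m]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient of mean squared error
        w -= lr * grad                  # SGD update
print(w)
```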

Page 25: lec02-Introduction to Deep Learning

Computational/Flow Graphs

• Describing mathematical expressions
• For example
– e = (a + b) * (b + 1)
• c = a + b, d = b + 1, e = c * d
– If a = 2, b = 1, then c = 3, d = 2, and e = 6

Page 26: lec02-Introduction to Deep Learning

Derivatives on Computational Graphs

Page 27: lec02-Introduction to Deep Learning

Computational Graph Backward Pass (Backpropagation)
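A minimal sketch of the forward pass and the backward (backpropagation) pass on the example graph e = (a + b)(b + 1) with a = 2, b = 1, applying the chain rule node by node.

```python
# Hedged sketch: forward pass and backpropagation on the graph e = (a + b) * (b + 1).
a, b = 2.0, 1.0

# Forward pass: evaluate each node.
c = a + b          # c = 3
d = b + 1          # d = 2
e = c * d          # e = 6

# Backward pass: propagate de/d(node) from the output back to the inputs.
de_de = 1.0
de_dc = de_de * d                   # e = c*d  ->  de/dc = d = 2
de_dd = de_de * c                   # de/dd = c = 3
de_da = de_dc * 1.0                 # c = a+b ->  dc/da = 1, so de/da = 2
de_db = de_dc * 1.0 + de_dd * 1.0   # b feeds both c and d, so gradients add: de/db = 5

print(e, de_da, de_db)              # 6.0 2.0 5.0
```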

Page 28: lec02-Introduction to Deep Learning

An FNN POS Tagger
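The slide gives only the title, so the following is a rough, hypothetical sketch of what a window-based feedforward POS tagger typically looks like (concatenate the embeddings of a word and its neighbours, one hidden layer, softmax over tags); the vocabulary, tag set, and sizes are all made up for illustration.

```python
# Hedged sketch of a window-based feedforward POS tagger (sizes and data are illustrative).
import numpy as np

rng = np.random.default_rng(0)
vocab = {"<pad>": 0, "I": 1, "love": 2, "NLP": 3}
tags = ["PRON", "VERB", "NOUN"]
emb_dim, hidden, window = 8, 16, 3      # 3-word window: previous, current, next

E = rng.standard_normal((len(vocab), emb_dim)) * 0.1         # word embeddings (learned in practice)
W1 = rng.standard_normal((hidden, window * emb_dim)) * 0.1   # hidden layer
W2 = rng.standard_normal((len(tags), hidden)) * 0.1          # output layer

def tag_scores(sentence, i):
    """Score tags for the i-th word from the embeddings of its window."""
    padded = ["<pad>"] + sentence + ["<pad>"]
    ids = [vocab[w] for w in padded[i:i + window]]
    x = np.concatenate([E[j] for j in ids])     # concatenated window embeddings
    h = np.tanh(W1 @ x)
    s = W2 @ h
    return np.exp(s - s.max()) / np.exp(s - s.max()).sum()   # softmax over tags

print(tags[int(np.argmax(tag_scores(["I", "love", "NLP"], 1)))])
```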

Page 29: lec02-Introduction to Deep Learning

Part 2.3: Word Embeddings

Page 30: lec02-Introduction to Deep Learning

Typical Approaches for Word Representation

• 1-hot representation (orthogonality)
– bag-of-words model

sun  → [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...]
star → [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...]

sim(star, sun) = 0

Page 31: lec02-Introduction to Deep Learning

Distributed Word Representation

• Each word is associated with a low-dimensional (compressed, 50-1000), dense (non-sparse), real-valued (continuous) vector, its word embedding
– Learning word vectors through supervised models
• Nature
– Semantic similarity as vector similarity

Page 32: lec02-Introduction to Deep Learning

How to Obtain Word Embeddings

Page 33: lec02-Introduction to Deep Learning

Neural Network Language Models

• Neural Network Language Models (NNLM)
– Feed-forward (Bengio et al. 2003)
• Maximum-likelihood estimation
• Back-propagation
• Input: (n − 1) word embeddings

Page 34: lec02-Introduction to Deep Learning

Predict Word Vector Directly

• SENNA (Collobert and Weston, 2008)
• word2vec (Mikolov et al. 2013)

Page 35: lec02-Introduction to Deep Learning

Word2vec: CBOW (Continuous Bag-of-Words)

• Add inputs from words within a short window to predict the current word
• The weights for different positions are shared
• Computationally much more efficient than a normal NNLM
• The hidden layer is just linear
• Each word is an embedding v(w)
• Each context is an embedding v'(c)

Page 36: lec02-Introduction to Deep Learning

Word2vec: Skip-Gram

• Predict the surrounding words using the current word
• Similar performance to CBOW
• Each word is an embedding v(w)
• Each context is an embedding v'(c)

Page 37: lec02-Introduction to Deep Learning

Word2vec Training

• SGD + backpropagation
• Most of the computational cost is a function of the size of the vocabulary (millions)
• Accelerating training
– Negative Sampling (Mikolov et al. 2013); see the sketch below
– Hierarchical Decomposition (Morin and Bengio 2005; Mnih and Hinton 2008; Mikolov et al. 2013)
– Graphics Processing Units (GPU)
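A minimal sketch of one SGD update for skip-gram with negative sampling; the vocabulary size, dimensions, learning rate, and uniform negative sampling are assumptions (word2vec actually samples negatives from a smoothed unigram distribution).

```python
# Hedged sketch: one SGD update of skip-gram with negative sampling.
import numpy as np

rng = np.random.default_rng(0)
V, dim, lr, k = 1000, 50, 0.025, 5             # vocab size, embedding size, learning rate, negatives
W_in = rng.standard_normal((V, dim)) * 0.01    # word embeddings v(w)
W_out = rng.standard_normal((V, dim)) * 0.01   # context embeddings v'(c)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(word, context):
    """Push v(word) towards its true context and away from k sampled negatives."""
    negatives = rng.integers(0, V, size=k)      # uniform sampling here (word2vec uses unigram^0.75)
    v = W_in[word]
    grad_v = np.zeros(dim)
    for c, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[c]
        g = sigmoid(v @ u) - label              # gradient of the logistic loss
        grad_v += g * u
        W_out[c] -= lr * g * v
    W_in[word] -= lr * grad_v

sgns_step(word=3, context=17)
print(W_in[3][:5])
```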

Page 38: lec02-Introduction to Deep Learning

Word Analogy
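A minimal sketch of the usual analogy computation (e.g. king − man + woman ≈ queen) by cosine similarity; the embedding matrix here is random, so it only illustrates the mechanics, not real analogies.

```python
# Hedged sketch: word analogies by vector arithmetic + cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman", "apple"]
E = rng.standard_normal((len(words), 50))        # pretend these are trained embeddings
E /= np.linalg.norm(E, axis=1, keepdims=True)    # unit-normalize rows
idx = {w: i for i, w in enumerate(words)}

def analogy(a, b, c):
    """Return the word d maximizing cos(v(b) - v(a) + v(c), v(d)), excluding a, b, c."""
    target = E[idx[b]] - E[idx[a]] + E[idx[c]]
    target /= np.linalg.norm(target)
    scores = E @ target                          # cosine similarity to every word
    for w in (a, b, c):
        scores[idx[w]] = -np.inf
    return words[int(np.argmax(scores))]

print(analogy("man", "king", "woman"))           # with real embeddings: "queen"
```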

Page 39: lec02-Introduction to Deep Learning

Part 2.4: Recurrent and Other Neural Networks

Page 40: lec02-Introduction to Deep Learning

Language Models

• A language model computes a probability for a sequence of words, $P(w_1, \dots, w_T)$, or predicts a probability for the next word, $P(w_{T+1} \mid w_1, \dots, w_T)$
• Useful for machine translation, speech recognition, and so on
• Word ordering
– P(the cat is small) > P(small the is cat)
• Word choice
– P(there are four cats) > P(there are for cats)

Page 41: lec02-Introduction to Deep Learning

Traditional Language Models

• An incorrect but necessary Markov assumption!
• Probability is usually conditioned on a window of n previous words
• $P(w_1, \dots, w_T) = \prod_{i=1}^{T} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$
• How to estimate the probabilities
– $p(w_2 \mid w_1) = \frac{\mathrm{count}(w_1, w_2)}{\mathrm{count}(w_1)}$,  $p(w_3 \mid w_1, w_2) = \frac{\mathrm{count}(w_1, w_2, w_3)}{\mathrm{count}(w_1, w_2)}$
• Performance improves by keeping counts for higher-order n-grams and doing smoothing, such as backoff (e.g. if a 4-gram is not found, try the 3-gram, etc.)
• Disadvantages
– There are A LOT of n-grams!
– Cannot see too long a history
– Example: P(坐/作了一整天的火车/作业), where choosing between 坐 "sit" and 作 "do" depends on whether the sentence ends with 火车 "train" or 作业 "homework"
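A minimal sketch of the count-based estimate above, p(w2 | w1) = count(w1, w2) / count(w1), on a toy corpus with no smoothing.

```python
# Hedged sketch: maximum-likelihood bigram probabilities from counts (no smoothing).
from collections import Counter

corpus = ["the cat is small", "the cat sat", "the dog is small"]  # toy data
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(w2, w1):
    """p(w2 | w1) = count(w1, w2) / count(w1); zero if the bigram was never seen."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(p("cat", "the"))   # 2/3
print(p("dog", "the"))   # 1/3
```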

Page 42: lec02-Introduction to Deep Learning

Recurrent Neural Networks (RNNs)

• Condition the neural network on all previous inputs
• RAM requirement only scales with the number of inputs

[Figure: an RNN unrolled over time steps t-1, t, t+1; inputs $x_{t-1}, x_t, x_{t+1}$ enter through $W_2$, the hidden states $h_{t-1}, h_t, h_{t+1}$ are connected through the shared recurrent matrix $W_1$, and each hidden state produces an output $y$ through $W_3$.]

Page 43: lec02-Introduction to Deep Learning

Recurrent Neural Networks (RNNs)

• At a single time step t
– $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
– $\hat{y}_t = \mathrm{softmax}(W_3 h_t)$

[Figure: a single RNN cell: $x_t$ and $h_{t-1}$ produce $h_t$ through $W_2$ and $W_1$; $h_t$ produces $\hat{y}_t$ through $W_3$.]
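A minimal NumPy sketch of these two equations run over a short input sequence; the dimensions and random parameters are arbitrary.

```python
# Hedged sketch: the vanilla RNN recurrence h_t = tanh(W1 h_{t-1} + W2 x_t), y_t = softmax(W3 h_t).
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, num_classes = 4, 8, 3
W1 = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
W2 = rng.standard_normal((hidden_dim, input_dim)) * 0.1
W3 = rng.standard_normal((num_classes, hidden_dim)) * 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = rng.standard_normal((5, input_dim))   # a sequence of 5 input vectors
h = np.zeros(hidden_dim)                   # h_0
for x_t in xs:
    h = np.tanh(W1 @ h + W2 @ x_t)         # the same W1, W2, W3 are reused at every time step
    y_t = softmax(W3 @ h)
print(y_t)
```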

Page 44: lec02-Introduction to Deep Learning

Training RNNs is Hard

• Ideally, inputs from many time steps ago can modify output y
• For example, with 2 time steps

[Figure: the RNN unrolled over time steps t-1, t, t+1, as on the previous slide.]

Page 45: lec02-Introduction to Deep Learning

Backpropagation Through Time (BPTT)

• The total error is the sum of the errors at each time step t
– $\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$
• $\frac{\partial E_t}{\partial W_3} = \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial W_3}$ is easy to calculate
• But calculating $\frac{\partial E_t}{\partial W_1} = \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial W_1}$ is hard (also for $W_2$)
– Because $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$ depends on $h_{t-1}$, which depends on $W_1$ and $h_{t-2}$, and so on
• So $\frac{\partial E_t}{\partial W_1} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W_1}$

[Figure: the single RNN cell from the previous slides.]

Page 46: lec02-Introduction to Deep Learning

The Vanishing Gradient Problem

• $\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W}$, with $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
• $\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} W_1 \, \mathrm{diag}[\tanh'(\cdots)]$
• $\left\lVert \frac{\partial h_t}{\partial h_{t-1}} \right\rVert \le \gamma \lVert W_1 \rVert \le \gamma \lambda_1$
– where $\gamma$ bounds $\lVert \mathrm{diag}[\tanh'(\cdots)] \rVert$ and $\lambda_1$ is the largest singular value of $W_1$
• $\left\lVert \frac{\partial h_t}{\partial h_k} \right\rVert \le (\gamma \lambda_1)^{t-k} \to 0$ if $\gamma \lambda_1 < 1$
• This can become very small or very large quickly → vanishing or exploding gradient
• Trick for the exploding gradient: the clipping trick (set a threshold); see the sketch below
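A minimal sketch of the clipping trick: rescale the gradient whenever its norm exceeds a threshold (the threshold value here is an arbitrary choice).

```python
# Hedged sketch: gradient norm clipping (the "clipping trick" for exploding gradients).
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """If ||grad|| exceeds the threshold, rescale it to have norm == threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, -40.0])        # an "exploding" gradient with norm 50
print(clip_gradient(g))            # rescaled to norm 5: [ 3. -4.]
```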

Page 47: lec02-Introduction to Deep Learning

A "Solution"

• Intuition
– Ensure $\gamma \lambda_1 \ge 1$ to prevent vanishing gradients
• So ...
– Proper initialization of the W matrices
– Use ReLU instead of tanh or sigmoid activation functions

Page 48: lec02-Introduction to Deep Learning

A Better "Solution"

• Recall the original transition equation
– $h_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
• We can instead update the state additively
– $u_t = \tanh(W_1 h_{t-1} + W_2 x_t)$
– $h_t = h_{t-1} + u_t$
– Then $\frac{\partial h_t}{\partial h_{t-1}} = 1 + \frac{\partial u_t}{\partial h_{t-1}} \ge 1$
• On the other hand
– $h_t = h_{t-1} + u_t = h_{t-2} + u_{t-1} + u_t = \cdots$

Page 49: lec02-Introduction to Deep Learning

A Better "Solution" (cont.)

• Interpolate between the old state and the new state ("choosing to forget")
– $f_t = \sigma(W_f x_t + U_f h_{t-1})$
– $h_t = f_t \odot h_{t-1} + (1 - f_t) \odot u_t$
• Introduce a separate input gate $i_t$
– $i_t = \sigma(W_i x_t + U_i h_{t-1})$
– $h_t = f_t \odot h_{t-1} + i_t \odot u_t$
• Selectively expose a memory cell $c_t$ with an output gate $o_t$
– $o_t = \sigma(W_o x_t + U_o h_{t-1})$
– $c_t = f_t \odot c_{t-1} + i_t \odot u_t$
– $h_t = o_t \odot \tanh(c_t)$

Page 50: lec02-Introduction to Deep Learning

Long Short-Term Memory (LSTM)

• Hochreiter & Schmidhuber, 1997
• LSTM = additive updates + gating

[Figure: the LSTM cell: $x_t$ and $h_{t-1}$ feed three sigmoid gates and a tanh candidate; the cell state $C_{t-1}$ is updated additively to $C_t$, and $h_t$ is the output gate times $\tanh(C_t)$.]
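A minimal NumPy sketch of a single LSTM step following the f, i, o gate equations and the additive cell-state update from the previous slide; dimensions and initialization are arbitrary.

```python
# Hedged sketch: one LSTM step, following f_t, i_t, o_t, c_t, h_t from the previous slide.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h = 4, 6
def mats():
    return rng.standard_normal((n_h, n_in)) * 0.1, rng.standard_normal((n_h, n_h)) * 0.1
(Wf, Uf), (Wi, Ui), (Wo, Uo), (Wu, Uu) = mats(), mats(), mats(), mats()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    f = sigmoid(Wf @ x_t + Uf @ h_prev)        # forget gate
    i = sigmoid(Wi @ x_t + Ui @ h_prev)        # input gate
    o = sigmoid(Wo @ x_t + Uo @ h_prev)        # output gate
    u = np.tanh(Wu @ x_t + Uu @ h_prev)        # new candidate content
    c = f * c_prev + i * u                     # additive cell-state update
    h = o * np.tanh(c)                         # selectively exposed memory
    return h, c

h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.standard_normal((3, n_in)):     # run over a short sequence
    h, c = lstm_step(x_t, h, c)
print(h)
```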

Page 51: lec02-Introduction to Deep Learning

Gated Recurrent Units, GRU (Cho et al. 2014)

• Main ideas
– Keep around memories to capture long-distance dependencies
– Allow error messages to flow at different strengths depending on the inputs
• Update gate
– Based on the current input and hidden state
– $z_t = \sigma(W_z x_t + U_z h_{t-1})$
• Reset gate
– Similar, but with different weights
– $r_t = \sigma(W_r x_t + U_r h_{t-1})$

Page 52: lec02-Introduction to Deep Learning

GRU

• New memory content
– $\tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1})$
• The update gate z controls how much of the past state should matter now
– If z is close to 1, we can copy information in that unit through many time steps → less vanishing gradient!
– If the reset gate r is close to 0, the unit ignores the previous memory and only stores the new input information → this allows the model to drop information that is irrelevant in the future
• Units with long-term dependencies have active update gates z
• Units with short-term dependencies often have very active reset gates r
• The final memory at time step t combines the current and previous time steps
– $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$

Page 53: lec02-Introduction to Deep Learning

LSTM vs. GRU

• No clear winner!
• Tuning hyperparameters like layer size is probably more important than picking the ideal architecture
• GRUs have fewer parameters and thus may train a bit faster or need less data to generalize
• If you have enough data, the greater expressive power of LSTMs may lead to better results

Page 54: lec02-Introduction to Deep Learning

More RNNs

• Bidirectional RNN
• Stacked Bidirectional RNN

Page 55: lec02-Introduction to Deep Learning

Tree-LSTMs

• Traditional sequential composition
• Tree-structured composition

Page 56: lec02-Introduction to Deep Learning

More Applications of RNNs

• Neural Machine Translation
• Handwriting Generation
• Image Caption Generation
• ...

Page 57: lec02-Introduction to Deep Learning

Neural Machine Translation

• RNN trained end-to-end: encoder-decoder

[Figure: an encoder RNN reads "I love you" into hidden states h1, h2, h3 (input weights W2, recurrent weights W1); a decoder RNN then generates the translation (我 ...) conditioned on the final encoder state. This single vector needs to capture the entire sentence!]

Page 58: lec02-Introduction to Deep Learning

Attention Mechanism: Scoring

• Bahdanau et al. 2015

[Figure: the same encoder-decoder, but at each decoding step the current decoder state $h_t$ is scored against every encoder state, $\alpha = \mathrm{score}(h_t, \bar{h}_s)$, giving weights such as 0.3, 0.6, 0.1; the weighted sum of encoder states forms a context vector c used to generate the next word.]
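A minimal sketch of this scoring step: score each encoder state against the current decoder state, softmax the scores into weights α, and form the context vector c as the weighted sum. A plain dot-product score is used here for simplicity; Bahdanau et al. (2015) use a small feedforward network.

```python
# Hedged sketch: attention weights and context vector over encoder states.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
H_enc = rng.standard_normal((3, 8))     # encoder states h_1..h_3 ("I", "love", "you")
h_dec = rng.standard_normal(8)          # current decoder state h_t

scores = H_enc @ h_dec                  # score(h_t, h_s); dot product here, an MLP in Bahdanau et al.
alpha = softmax(scores)                 # attention weights, e.g. [0.3, 0.6, 0.1] on the slide
c = alpha @ H_enc                       # context vector: weighted sum of encoder states
print(alpha, c.shape)
```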

Page 59: lec02-Introduction to Deep Learning

Convolutional Neural Network

[Figure: convolution and pooling layers, from CS231n: Convolutional Neural Networks for Visual Recognition.]

Page 60: lec02-Introduction to Deep Learning

CNN for NLP

Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification.
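A minimal sketch of this kind of sentence CNN: filters of a few widths slide over the word-embedding matrix, followed by ReLU and max-over-time pooling; all sizes are illustrative and the parameters are random.

```python
# Hedged sketch: 1-D convolution over word embeddings + max-over-time pooling.
import numpy as np

rng = np.random.default_rng(0)
sent_len, emb_dim, n_filters = 7, 50, 4
X = rng.standard_normal((sent_len, emb_dim))            # sentence as a matrix of word embeddings

features = []
for width in (2, 3, 4):                                  # filter widths (regions of 2, 3, 4 words)
    F = rng.standard_normal((n_filters, width, emb_dim)) * 0.1
    # Convolve each filter over every window of `width` consecutive words.
    conv = np.array([[np.sum(F[k] * X[i:i + width]) for i in range(sent_len - width + 1)]
                     for k in range(n_filters)])
    conv = np.maximum(conv, 0.0)                         # ReLU
    features.append(conv.max(axis=1))                    # max-over-time pooling per filter
feature_vec = np.concatenate(features)                   # fed to a softmax classifier in practice
print(feature_vec.shape)                                 # (12,) = 3 widths x 4 filters
```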

Page 61: lec02-Introduction to Deep Learning

Recursive Neural Network

Socher, R., Manning, C., & Ng, A. (2011). Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Network. NIPS.