
Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks

Kai Sheng Tai, Richard Socher, Christopher D. Manning

Presentation by: Reed Coke

Neural Nets in NLP

As we know, neural networks are taking NLP by storm.

Left: Visualization of Word Embeddings Right: Sentiment Analysis using RNTNs

Long Short-Term Memory (LSTM)

One particular type of network architecture has become the de facto way to model sentences: the long short-term memory network (Hochreiter and Schmidhuber, 1997).

LSTMs are a type of recurrent neural network that is good at remembering information over long periods of time within a sequence.

Cathy Finegan-Dollak recently gave a presentation about LSTMs and kindly allowed me to borrow many of her examples and images.

Long Short-Term Memory (LSTM)

1. "My dog [eat/eats] rawhide"

2. "My dog, who I rescued in 2009, [eat/eats] rawhide"

Why might sentence 2 be harder to handle than sentence 1?


Long-term dependencies

The network needs to remember that dog was singular even though there were five words in the way.

RNN: Deep Learning for Sequences

[Figure: an RNN unrolled over time steps 1 through t. At each step, the input xt and the previous hidden state ht-1 pass through the shared weights wx and wh and a sigmoid to produce ht; wy and a sigmoid then produce the prediction ŷt.]

Compare our predicted sequence (ŷ1, ŷ2, …, ŷt) to the correct sequence (y1, y2, …, yt).

Figure credit: Cathy Finegan-Dollak
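To make the diagram concrete, here is a minimal numpy sketch of the unrolled recurrence. This is a hypothetical illustration: the weight names follow the figure (wx, wh, wy), while the shapes and toy data are assumptions, not anything prescribed by the slides or the paper.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, wx, wh, wy, h0):
    """Unroll a vanilla RNN over a sequence of input vectors xs.

    Every step reuses the same weights, matching the diagram:
        h_t    = sigmoid(wx @ x_t + wh @ h_{t-1})
        yhat_t = sigmoid(wy @ h_t)
    """
    h = h0
    yhats = []
    for x in xs:
        h = sigmoid(wx @ x + wh @ h)      # new hidden state from input + previous state
        yhats.append(sigmoid(wy @ h))     # prediction at this time step
    return yhats, h

# Toy usage: a 3-step sequence of 4-dimensional inputs, 5-dimensional hidden state.
rng = np.random.default_rng(0)
xs = [rng.normal(size=4) for _ in range(3)]
wx, wh, wy = rng.normal(size=(5, 4)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))
yhats, h_final = rnn_forward(xs, wx, wh, wy, h0=np.zeros(5))

The predicted sequence yhats would then be compared against the correct sequence during training.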

RNN: Deep Learning for Sequences

[Figure: the same unrolled RNN reading "My dog, who I rescued in 2009, … rawhide." To predict "eats" rather than "eat", the network has to carry the fact that "dog" is singular across every intervening step; in the figure the prediction drifts and the RNN outputs "eat".]

Compare our predicted sequence to the correct sequence.

Figure credit: Cathy Finegan-Dollak

Modern LSTM Diagram

[Figure: one LSTM cell reading the input "My" with previous hidden state h0 and previous cell state c0. The input and h0 feed four units in parallel: a candidate value ĉ1 (weights wxc, whc), an input gate i1 (wxi, whi), a forget gate f1 (wxf, whf), and an output gate o1 (wxo, who). The input-gated candidate and the forget-gated c0 are summed to give the new cell state c1, which is squashed and scaled by o1 to give the new hidden state h1.]

The solution is to add gates.

i: "input gate", f: "forget gate", o: "output gate"

Figure credit: Cathy Finegan-Dollak
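For reference, the update that this diagram depicts is the standard LSTM transition, which the Tai, Socher & Manning paper writes roughly as follows (the diagram draws every nonlinearity as σ, while the paper uses tanh for the candidate and for the cell output):

\begin{aligned}
i_t &= \sigma\big(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}\big) \\
f_t &= \sigma\big(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}\big) \\
o_t &= \sigma\big(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}\big) \\
u_t &= \tanh\big(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)}\big) \\
c_t &= i_t \odot u_t + f_t \odot c_{t-1} \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}

Here u_t plays the role of the candidate ĉ in the figure, and ⊙ denotes elementwise multiplication.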

Modern LSTM Diagram

[Figure: the same LSTM cell diagram as above.]

The Input Gate:

The input gate takes values between 0 and 1, depending on the current input and the previous hidden state. The candidate value ĉ1 is then "discounted" (scaled) by that amount before being added to the cell state.

Figure credit: Cathy Finegan-Dollak

Modern LSTM Diagram

[Figure: the same LSTM cell diagram as above.]

The Forget Gate:

The forget gate determines how much of the previous cell state, c0, should be kept when forming the new cell state c1.

Figure credit: Cathy Finegan-Dollak

Modern LSTM Diagram

[Figure: the same LSTM cell diagram as above.]

The Output Gate:

The output gate modulates how much of the (squashed) cell state c1 is exposed as the new hidden state h1, based on the current input and the previous hidden state.

Figure credit: Cathy Finegan-Dollak
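Putting the three gates together, here is a minimal numpy sketch of one LSTM step, matching the equations above. The weight dictionaries, shapes, and toy dimensions are invented for illustration; they are not taken from the paper or from any particular library.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b are dicts keyed by 'i', 'f', 'o', 'u'."""
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])  # input gate: how much new information to let in
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])  # forget gate: how much of c_prev to keep
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])  # output gate: how much of the cell to expose
    u = np.tanh(W['u'] @ x + U['u'] @ h_prev + b['u'])  # candidate value (ĉ in the diagram)
    c = f * c_prev + i * u                              # keep some old memory, add some new
    h = o * np.tanh(c)                                  # expose part of the cell state
    return h, c

# Toy usage with made-up dimensions.
rng = np.random.default_rng(0)
dx, dh = 4, 5
W = {k: rng.normal(size=(dh, dx)) for k in 'ifou'}
U = {k: rng.normal(size=(dh, dh)) for k in 'ifou'}
b = {k: np.zeros(dh) for k in 'ifou'}
h1, c1 = lstm_step(rng.normal(size=dx), np.zeros(dh), np.zeros(dh), W, U, b)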

LSTMs and Beyond

LSTMs are enjoying wild success in language modeling and related sequence tasks (Filippova et al., 2015; Sutskever et al., 2014; Graves, 2013; Sundermeyer et al., 2010; Tang et al., 2015).

But is language really just a flat sequence of words?


If so, I kind of regret studying linguistics in college.

Tree-LSTMs

One of the things that makes language so interesting is its tree structure.

To exploit this structure while keeping the performance of an LSTM, we generalize the LSTM to arrive at Tree-LSTMs.

Instead of taking input from a single previous node, a Tree-LSTM takes input from all the children of a particular node. Today's paper discusses two variations: the Child-Sum Tree-LSTM and the N-ary Tree-LSTM.

Tree-LSTMs

[Figures: an ordinary Tree-LSTM and a Binary Tree-LSTM.]

All hail Cathy, who made these figures.

Binary Tree-LSTMs

All hail Cathy, who made these figures.

The Forget Gates:

A Binary Tree-LSTM can be used on any tree that has a maximum branching factor of 2.

f1 is the forget gate for all left children. f3 is the forget gate for all right children.

This scheme works well if the children are ordered in a predictable way.
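In the paper's N-ary Tree-LSTM formulation (N = 2 for the binary case shown here), this is achieved by giving each child position its own parameter matrices, so the forget gate for child k of node j looks roughly like:

\begin{aligned}
f_{jk} &= \sigma\Big(W^{(f)} x_j + \sum_{\ell=1}^{N} U^{(f)}_{k\ell}\, h_{j\ell} + b^{(f)}\Big), \qquad k = 1, \dots, N \\
c_j &= i_j \odot u_j + \sum_{\ell=1}^{N} f_{j\ell} \odot c_{j\ell}
\end{aligned}

with the input gate i_j, output gate o_j, and candidate u_j defined as in the sequential LSTM, except that each sums its own U_ℓ h_{jℓ} term over the N children. The separate U^{(f)}_{kℓ} matrices are what let the left-child and right-child forget gates (f1 and f3 in the figure) behave differently.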

Child-Sum Tree-LSTMs

The children of a Child-Sum Tree-LSTM node are unordered. Instead of a single ht-1, the hidden-state input that is combined with xt for the candidate ut and for the input gate is the sum of the h values of all the child nodes.

Each child also gets its own forget gate, computed from that child's hidden state; the children's cell states, scaled by their forget gates, are then summed into the cell state for the current node.
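As an illustration, here is a minimal numpy sketch of a single Child-Sum Tree-LSTM node in the spirit of the paper's equations. It is not the authors' implementation; the weight dictionaries reuse the assumed shapes from the LSTM sketch above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def child_sum_node(x, child_hs, child_cs, W, U, b):
    """One Child-Sum Tree-LSTM node.

    child_hs / child_cs: lists of the children's hidden and cell states
    W, U, b: dicts keyed by 'i', 'f', 'o', 'u' (as in the LSTM sketch above)
    """
    h_sum = np.sum(child_hs, axis=0) if child_hs else np.zeros_like(b['i'])
    i = sigmoid(W['i'] @ x + U['i'] @ h_sum + b['i'])   # input gate from the summed children
    o = sigmoid(W['o'] @ x + U['o'] @ h_sum + b['o'])   # output gate from the summed children
    u = np.tanh(W['u'] @ x + U['u'] @ h_sum + b['u'])   # candidate value
    # One forget gate per child, each computed from that child's own hidden state.
    fs = [sigmoid(W['f'] @ x + U['f'] @ h_k + b['f']) for h_k in child_hs]
    c = i * u + sum(f_k * c_k for f_k, c_k in zip(fs, child_cs))
    h = o * np.tanh(c)
    return h, c

A leaf node simply has empty child lists, so the summed hidden state is zero and the node reduces to an LSTM step with no history; internal nodes are computed bottom-up after all of their children.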

Summary

N-ary Tree-LSTMs

• Children are kept distinct

• Each of the N child slots has its own forget gate (e.g. the same gate for all left children)

• Children are ordered; the forget gate is determined by the ordering

• Also called Constituency Tree-LSTMs

• Restriction: the branching factor may not exceed N

Child-Sum Tree-LSTMs

• Children are lumped together

• Children are unordered, because they are all summed anyway.

• Also called Dependency Tree-LSTMs

Summary

Sentiment Classification

• Two subtasks: binary and fine-grained (1-5)

• Stanford Sentiment Treebank (Socher et al., 2013)

• The dataset includes parse trees

Semantic Relatedness

• Given two sentences, predict a relatedness score in [1, 5]

• 1 = least related, 5 = most related

• Sentences Involving Compositional Knowledge (SICK) dataset (Marelli et al., 2014)

• The final label for a sentence pair is the average of ratings from 10 annotators

Sentiment Classification Results

• The Constituency Tree-LSTM is trained on more data than the Dependency Tree-LSTM

• Continuing to train the GloVe vectors yields a noticeable improvement

• There is no mention of why CNN-multichannel performs so well on the binary task

Semantic Relatedness Results

• Supervision is provided only at the root node of the tree

• The maximum depth of a dependency tree is smaller than that of a binarized constituency tree

Example Code

Various LSTM examples:
• LSTM for sentiment analysis of tweets - deeplearning.net
• Character-level LSTM for sequence generation - Andrej Karpathy
• RNNLIB - Alex Graves
• LSTM with peepholes - Felix Gers

Tree-LSTMs:
• N-ary and Child-Sum Tree-LSTMs - Kai Sheng Tai

Other:
• GloVe vectors - Jeffrey Pennington
