Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks
Kai Sheng Tai, Richard Socher, Christopher D. Manning
Presentation by: Reed Coke


Page 1:

Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks

Kai Sheng Tai, Richard Socher, Christopher D. Manning

Presentation by: Reed Coke

Page 2:

Neural Nets in NLP

As we know, neural networks are taking NLP by storm.

Left: Visualization of Word Embeddings. Right: Sentiment Analysis using RNTNs (Recursive Neural Tensor Networks)

Page 3:

Long Short-Term Memory (LSTM)

One particular type of network architecture has become the de facto way to model sentences: the long short-term memory network (Hochreiter and Schmidhuber, 1997).

LSTMs are recurrent neural networks that are good at remembering information over long periods of time within a sequence.

Cathy Finegan-Dollak recently gave a presentation about LSTMs and kindly allowed me to borrow many of her examples and images.

Page 4:

Long Short-Term Memory (LSTM)

1. “My dog [eat/eats] rawhide”

2. “My dog, who I rescued in 2009, [eat/eats] rawhide”

Why might sentence 2 be harder to handle than sentence 1?

Page 5:

Long Short-Term Memory (LSTM)

1. “My dog [eat/eats] rawhide”

2. “My dog, who I rescued in 2009, [eat/eats] rawhide”

Why might sentence 2 be harder to handle than sentence 1?

Long-term dependencies

The network needs to remember that “dog” was singular even though there were five words (who, I, rescued, in, 2009) in the way.

Page 6:

RNN: Deep Learning for Sequences

[Figure: an unrolled recurrent neural network. At each step t, the input x_t and the previous hidden state h_{t-1} pass through weights w_x and w_h and a sigmoid to produce h_t; h_t passes through w_y to produce the prediction ŷ_t, which is compared against the true output y_t.]

Compare our predicted sequence to the correct sequence.

Figure credit: Cathy Finegan-Dollak
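To make the recurrence concrete, here is a minimal NumPy sketch of the vanilla RNN step the figure depicts. The weight names (w_x, w_h, w_y) follow the figure; the functions themselves are illustrative, not code from the paper or the presentation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, h_prev, w_x, w_h, w_y):
    """One step of a vanilla RNN: update the hidden state, then predict an output."""
    h_t = sigmoid(w_x @ x_t + w_h @ h_prev)   # new hidden state from current input + previous state
    y_hat = sigmoid(w_y @ h_t)                # prediction for this time step
    return h_t, y_hat

def rnn_forward(xs, h0, w_x, w_h, w_y):
    """Unroll over a sequence: the same weights are reused at every step, so any
    information about early inputs must survive in h_t to influence later predictions."""
    h, outputs = h0, []
    for x_t in xs:
        h, y_hat = rnn_step(x_t, h, w_x, w_h, w_y)
        outputs.append(y_hat)
    return outputs
```

The training signal comes from comparing each ŷ_t against the correct y_t, exactly as the figure suggests.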

Page 7:

RNN: Deep Learning for Sequences

[Figure: the same unrolled RNN run on “My dog, who I rescued in 2009, ...” — by the time the network reaches the verb, it predicts “eat” instead of “eats”, because the singular subject “dog” lies several steps back in the sequence.]

Compare our predicted sequence to the correct sequence.

Figure credit: Cathy Finegan-Dollak

Page 8:

Modern LSTM Diagram

[Figure: one LSTM cell. The input (“My”) and the previous hidden state h_0 feed four gated units with their own weights (w_xc/w_hc, w_xi/w_hi, w_xf/w_hf, w_xo/w_ho), producing the candidate cell state ĉ_1, the input gate i_1, the forget gate f_1, and the output gate o_1. These combine with the previous cell state c_0 to give the new cell state c_1 and hidden state h_1.]

The solution is to add gates.

i: “input gate”, f: “forget gate”, o: “output gate”

Figure credit: Cathy Finegan-Dollak

Page 9:

Modern LSTM Diagram

[Figure: the same LSTM cell diagram, highlighting the input gate i_1.]

The Input Gate:

The input gate takes a value between 0 and 1 depending on the input. The candidate cell state ĉ is then “discounted” by that amount before being added to the memory cell.

Figure credit: Cathy Finegan-Dollak

Page 10:

Modern LSTM Diagram

[Figure: the same LSTM cell diagram, highlighting the forget gate f_1.]

The Forget Gate:

The forget gate helps determine how much of the previous cell state, c_0, should be kept.

Figure credit: Cathy Finegan-Dollak

Page 11:

Modern LSTM Diagram

[Figure: the same LSTM cell diagram, highlighting the output gate o_1.]

The Output Gate:

The output gate modulates how much of the cell state is exposed as the hidden state h_1 (and hence the output), based on the input.

Figure credit: Cathy Finegan-Dollak
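Putting the three gate slides together, here is a minimal NumPy sketch of one LSTM step. The weight names mirror the diagram (w_xc/w_hc for the candidate, plus w_xi/w_hi, w_xf/w_hf, w_xo/w_ho for the gates), but this is an illustrative sketch rather than the presenter's or the paper's implementation, and it uses the common tanh nonlinearity for the candidate and the cell output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a standard sequential LSTM cell; p holds weights and biases (illustrative names)."""
    c_hat = np.tanh(p["w_xc"] @ x_t + p["w_hc"] @ h_prev + p["b_c"])  # candidate cell state
    i = sigmoid(p["w_xi"] @ x_t + p["w_hi"] @ h_prev + p["b_i"])      # input gate: how much new info to admit
    f = sigmoid(p["w_xf"] @ x_t + p["w_hf"] @ h_prev + p["b_f"])      # forget gate: how much old memory to keep
    o = sigmoid(p["w_xo"] @ x_t + p["w_ho"] @ h_prev + p["b_o"])      # output gate: how much memory to expose

    c_t = f * c_prev + i * c_hat   # memory cell: gated mix of old memory and new candidate
    h_t = o * np.tanh(c_t)         # hidden state: gated view of the memory cell
    return h_t, c_t
```

Because the forget gate can stay close to 1 across many steps, the fact that “dog” was singular can survive in c_t until the verb is finally reached.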

Page 12:

LSTMs and Beyond

LSTMs are experiencing wild success in language modeling (Filippova et al. 2015, Sutskever et al. 2014, Graves 2013, Sundermeyer et al. 2010, Tang et al. 2015).

But, is language really just a flat sequence of words?

Page 13:

LSTMs and Beyond

LSTMs are experiencing wild success in language modeling (Filippova et al. 2015, Sutskever et al. 2014, Graves 2013, Sundermeyer et al. 2010, Tang et al. 2015).

But, is language really just a flat sequence of words?

If so, I kind of regret studying linguistics in college.

Page 14:

Tree-LSTMs

One of the things that makes language so interesting is its tree structure.

To exploit this structure together with the performance of an LSTM, the authors generalize LSTMs to arrive at Tree-LSTMs.

Instead of taking input from a single previous node, Tree-LSTMs take input from all the children of a particular node. Today’s paper discusses two variations: the Child-Sum Tree-LSTM and the N-ary Tree-LSTM.

Page 15:

Tree-LSTMs

[Figures] Left: Ordinary Tree-LSTM. Right: Binary Tree-LSTM.

All hail Cathy, who made these figures.

Page 16:

Binary Tree-LSTMs

All hail Cathy, who made these figures.

The Forget Gates:

A Binary Tree-LSTM can be used on a tree that has a maximum branching factor of 2.

f1 is the forget gate for all left children. f3 is the forget gate for all right children.

This scheme works well if the children are ordered in a predictable way.
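As a concrete sketch of the scheme just described, here is one node of a Binary (N = 2) Tree-LSTM in NumPy, with a separate forget gate for the left and right child. All parameter names and the params dict are illustrative, and in the constituency setting the input x is typically nonzero only at leaf nodes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_tree_lstm_node(x, h_left, c_left, h_right, c_right, p):
    """One Binary Tree-LSTM node: every gate sees both children's hidden states,
    but the left and right child each get their own forget gate."""
    def unit(name, act):
        return act(p["W_" + name] @ x
                   + p["U_" + name + "_l"] @ h_left
                   + p["U_" + name + "_r"] @ h_right
                   + p["b_" + name])

    i = unit("i", sigmoid)         # input gate
    o = unit("o", sigmoid)         # output gate
    u = unit("u", np.tanh)         # candidate cell state
    f_left = unit("fl", sigmoid)   # forget gate applied to the left child's memory
    f_right = unit("fr", sigmoid)  # forget gate applied to the right child's memory

    c = i * u + f_left * c_left + f_right * c_right
    h = o * np.tanh(c)
    return h, c
```

Tying forget gates to child positions lets the node learn position-dependent behavior, for example keeping more of the head child’s memory in a binarized constituency tree.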

Page 17:

Child-Sum Tree-LSTMs

The children of a Child-Sum Tree-LSTM are unordered. Instead, the h_{t-1} value that gets passed into u_t along with x_t and sent to the input gate becomes the sum of the h values of all the child nodes.

Similarly, each child's memory cell state is passed through its own forget gate, and the results are summed into the memory cell state for x_t.
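Here is a minimal NumPy sketch of one Child-Sum Tree-LSTM node, following the description above: the gates see the sum of the children's hidden states, and each child's memory cell passes through its own forget gate before being summed into the node's cell. Parameter names and the params dict are illustrative, not the authors' released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def child_sum_tree_lstm_node(x, child_h, child_c, p):
    """One Child-Sum Tree-LSTM node.

    x        : input vector at this node (e.g. a word embedding)
    child_h  : list of child hidden states (empty at a leaf)
    child_c  : list of child memory cells, parallel to child_h
    p        : dict of weight matrices W_*, U_* and biases b_* (illustrative names)
    """
    # The gates see the *sum* of the children's hidden states, so child order is irrelevant.
    h_sum = np.sum(child_h, axis=0) if child_h else np.zeros_like(p["b_i"])

    i = sigmoid(p["W_i"] @ x + p["U_i"] @ h_sum + p["b_i"])   # input gate
    o = sigmoid(p["W_o"] @ x + p["U_o"] @ h_sum + p["b_o"])   # output gate
    u = np.tanh(p["W_u"] @ x + p["U_u"] @ h_sum + p["b_u"])   # candidate cell state

    # Each child gets its own forget gate, conditioned on that child's hidden state;
    # the gated child memory cells are summed into this node's memory cell.
    c = i * u
    for h_k, c_k in zip(child_h, child_c):
        f_k = sigmoid(p["W_f"] @ x + p["U_f"] @ h_k + p["b_f"])
        c = c + f_k * c_k

    h = o * np.tanh(c)
    return h, c
```

Because all children share the same forget-gate parameters and are summed, the node is insensitive to child order and to the number of children, which is what makes this variant a natural fit for dependency trees.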

Page 18:

Summary

N-ary Tree-LSTMs

• Children are discrete

• Each of N child slots has its own forget gate (e.g. same gate for all left children)

• Children are ordered; the forget gate is determined by the ordering

• Also called Constituency Tree-LSTMs

• Restriction: branching factor may not exceed N.

Child-Sum Tree-LSTMs

• Children are lumped together

• Children are unordered, because they are all summed anyway.

• Also called Dependency Tree-LSTMs

Page 19:

Summary

Sentiment Classification

• Two types: binary and 1-5

• Stanford Sentiment Treebank (Socher et al. 2013)

• Dataset includes parse trees

Semantic Relatedness

• Given two sentences, predict a relatedness score in [1, 5] (see the sketch of the scoring model after this slide)

• 1 = least related, 5 = most related

• Sentences Involving Compositional Knowledge dataset (Marelli et al. 2014)

• The final label of a sentence pair is the average of 10 annotators’ ratings
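For reference, here is a minimal NumPy sketch of the similarity model the paper places on top of the two Tree-LSTM sentence representations h_L and h_R: elementwise product and absolute difference features, a small hidden layer, a softmax over the K = 5 relatedness classes, and the expected class value as the predicted score. The weight names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def relatedness_score(h_l, h_r, p, K=5):
    """Predict a real-valued relatedness score in [1, K] from two sentence vectors;
    p holds the comparison-layer and softmax weights (illustrative names)."""
    h_mul = h_l * h_r            # elementwise product: agreement of corresponding features
    h_dist = np.abs(h_l - h_r)   # absolute difference: feature-wise distance

    h_s = sigmoid(p["W_mul"] @ h_mul + p["W_dist"] @ h_dist + p["b_h"])
    probs = softmax(p["W_p"] @ h_s + p["b_p"])   # distribution over the K relatedness classes

    r = np.arange(1, K + 1)      # class values 1..K
    return float(r @ probs)      # expected class value = predicted score
```

The regression target is built the same way in reverse: the gold score is encoded as a sparse distribution over the K classes, and the model is trained to match that distribution.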

Page 20:

Sentiment Classification Results

● The Constituency Tree-LSTM is trained on more data than the Dependency Tree-LSTM

● Continuing to train the GloVe vectors yields a noticeable improvement

● No mention of why CNN-multichannel performs so well on the binary task

Page 21:

Semantic Relatedness Results

● Supervision only at the root node

● The maximum depth of a dependency tree is smaller than that of a binarized constituency tree

Page 22:

Example Code

Various LSTM examples:
LSTM for sentiment analysis of tweets - deeplearning.net
Character-level LSTM for sequence generation - Andrej Karpathy
RNNLIB - Alex Graves
LSTM with peepholes - Felix Gers

Tree-LSTMs:
N-ary and Child-Sum Tree-LSTMs - Kai Sheng Tai

Other:
GloVe vectors - Jeffrey Pennington