Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks
Kai Sheng Tai, Richard Socher, Christopher D. Manning
Presentation by: Reed Coke


Page 1:

Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks

Kai Sheng Tai, Richard Socher, Christopher D. Manning

Presentation by: Reed Coke

Page 2:

Neural Nets in NLP

As we know, neural networks are taking NLP by storm.

Left: Visualization of Word Embeddings. Right: Sentiment Analysis using RNTNs (Recursive Neural Tensor Networks)

Page 3:

Long Short-Term Memory (LSTM)

One particular type of network architecture has become the de facto way to model sentences: the long short-term memory network (Hochreiter and Schmidhuber, 1997).

LSTMs are recurrent neural networks that are good at remembering information over long periods of time within a sequence.

Cathy Finegan-Dollak recently gave a presentation about LSTMs and kindly allowed me to borrow many of her examples and images.

Page 4:

Long Short-Term Memory (LSTM)

1. “My dog [eat/eats] rawhide”

2. “My dog, who I rescued in 2009, [eat/eats] rawhide”

Why might sentence 2 be harder to handle than sentence 1?

Page 5:

Long Short-Term Memory (LSTM)

1. “My dog [eat/eats] rawhide”

2. “My dog, who I rescued in 2009, [eat/eats] rawhide”

Why might sentence 2 be harder to handle than sentence 1?

Long-term dependencies

The network needs to remember that “dog” was singular even though there were five words (who, I, rescued, in, 2009) in the way.

Page 6:

RNN: Deep Learning for Sequences

[Figure: an unrolled recurrent neural network. At each step t, the input x_t and the previous hidden state h_{t-1} pass through weights w_x and w_h and a sigmoid to produce h_t; h_t passes through w_y to produce the prediction ŷ_t, which is compared against the true output y_t.]

Compare our predicted sequence to the correct sequence.

Figure credit: Cathy Finegan-Dollak
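To make the recurrence concrete, here is a minimal NumPy sketch of the vanilla RNN step the figure depicts. The weight names (w_x, w_h, w_y) follow the figure; the functions themselves are illustrative, not code from the paper or the presentation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, h_prev, w_x, w_h, w_y):
    """One step of a vanilla RNN: update the hidden state, then predict an output."""
    h_t = sigmoid(w_x @ x_t + w_h @ h_prev)   # new hidden state from current input + previous state
    y_hat = sigmoid(w_y @ h_t)                # prediction for this time step
    return h_t, y_hat

def rnn_forward(xs, h0, w_x, w_h, w_y):
    """Unroll over a sequence: the same weights are reused at every step, so any
    information about early inputs must survive in h_t to influence later predictions."""
    h, outputs = h0, []
    for x_t in xs:
        h, y_hat = rnn_step(x_t, h, w_x, w_h, w_y)
        outputs.append(y_hat)
    return outputs
```

The training signal comes from comparing each ŷ_t against the correct y_t, exactly as the figure suggests.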

Page 7:

RNN: Deep Learning for Sequences

[Figure: the same unrolled RNN run on “My dog, who I rescued in 2009, ...” — by the time the network reaches the verb, it predicts “eat” instead of “eats”, because the singular subject “dog” lies several steps back in the sequence.]

Compare our predicted sequence to the correct sequence.

Figure credit: Cathy Finegan-Dollak

Page 8:

Modern LSTM Diagram

[Figure: one LSTM cell. The input (“My”) and the previous hidden state h_0 feed four gated units with their own weights (w_xc/w_hc, w_xi/w_hi, w_xf/w_hf, w_xo/w_ho), producing the candidate cell state ĉ_1, the input gate i_1, the forget gate f_1, and the output gate o_1. These combine with the previous cell state c_0 to give the new cell state c_1 and hidden state h_1.]

The solution is to add gates.

i: “input gate”, f: “forget gate”, o: “output gate”

Figure credit: Cathy Finegan-Dollak

Page 9:

Modern LSTM Diagram

[Figure: the same LSTM cell diagram, highlighting the input gate i_1.]

The Input Gate:

The input gate takes a value between 0 and 1 depending on the input. The candidate cell state ĉ is then “discounted” by that amount before being added to the memory cell.

Figure credit: Cathy Finegan-Dollak

Page 10:

Modern LSTM Diagram

[Figure: the same LSTM cell diagram, highlighting the forget gate f_1.]

The Forget Gate:

The forget gate helps determine how much of the previous cell state, c_0, should be kept.

Figure credit: Cathy Finegan-Dollak

Page 11:

Modern LSTM Diagram

[Figure: the same LSTM cell diagram, highlighting the output gate o_1.]

The Output Gate:

The output gate modulates how much of the cell state is exposed as the hidden state h_1 (and hence the output), based on the input.

Figure credit: Cathy Finegan-Dollak
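Putting the three gate slides together, here is a minimal NumPy sketch of one LSTM step. The weight names mirror the diagram (w_xc/w_hc for the candidate, plus w_xi/w_hi, w_xf/w_hf, w_xo/w_ho for the gates), but this is an illustrative sketch rather than the presenter's or the paper's implementation, and it uses the common tanh nonlinearity for the candidate and the cell output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a standard sequential LSTM cell; p holds weights and biases (illustrative names)."""
    c_hat = np.tanh(p["w_xc"] @ x_t + p["w_hc"] @ h_prev + p["b_c"])  # candidate cell state
    i = sigmoid(p["w_xi"] @ x_t + p["w_hi"] @ h_prev + p["b_i"])      # input gate: how much new info to admit
    f = sigmoid(p["w_xf"] @ x_t + p["w_hf"] @ h_prev + p["b_f"])      # forget gate: how much old memory to keep
    o = sigmoid(p["w_xo"] @ x_t + p["w_ho"] @ h_prev + p["b_o"])      # output gate: how much memory to expose

    c_t = f * c_prev + i * c_hat   # memory cell: gated mix of old memory and new candidate
    h_t = o * np.tanh(c_t)         # hidden state: gated view of the memory cell
    return h_t, c_t
```

Because the forget gate can stay close to 1 across many steps, the fact that “dog” was singular can survive in c_t until the verb is finally reached.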

Page 12:

LSTMs and Beyond

LSTMs are experiencing wild success in language modeling (Filippova et al. 2015, Sutskever et al. 2014, Graves 2013, Sundermeyer et al. 2010, Tang et al. 2015).

But, is language really just a flat sequence of words?

Page 13:

LSTMs and Beyond

LSTMs are experiencing wild success in language modeling (Filippova et al. 2015, Sutskever et al. 2014, Graves 2013, Sundermeyer et al. 2010, Tang et al. 2015).

But, is language really just a flat sequence of words?

If so, I kind of regret studying linguistics in college.

Page 14:

Tree-LSTMs

One of the things that makes language so interesting is its tree structure.

To exploit this structure together with the performance of an LSTM, the authors generalize LSTMs to arrive at Tree-LSTMs.

Instead of taking input from a single previous node, Tree-LSTMs take input from all the children of a particular node. Today’s paper discusses two variations: the Child-Sum Tree-LSTM and the N-ary Tree-LSTM.

Page 15:

Tree-LSTMs

[Figures] Left: Ordinary Tree-LSTM. Right: Binary Tree-LSTM.

All hail Cathy, who made these figures.

Page 16:

Binary Tree-LSTMs

All hail Cathy, who made these figures.

The Forget Gates:

A Binary Tree-LSTM can be used on a tree that has a maximum branching factor of 2.

f1 is the forget gate for all left children. f3 is the forget gate for all right children.

This scheme works well if the children are ordered in a predictable way.
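As a concrete sketch of the scheme just described, here is one node of a Binary (N = 2) Tree-LSTM in NumPy, with a separate forget gate for the left and right child. All parameter names and the params dict are illustrative, and in the constituency setting the input x is typically nonzero only at leaf nodes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_tree_lstm_node(x, h_left, c_left, h_right, c_right, p):
    """One Binary Tree-LSTM node: every gate sees both children's hidden states,
    but the left and right child each get their own forget gate."""
    def unit(name, act):
        return act(p["W_" + name] @ x
                   + p["U_" + name + "_l"] @ h_left
                   + p["U_" + name + "_r"] @ h_right
                   + p["b_" + name])

    i = unit("i", sigmoid)         # input gate
    o = unit("o", sigmoid)         # output gate
    u = unit("u", np.tanh)         # candidate cell state
    f_left = unit("fl", sigmoid)   # forget gate applied to the left child's memory
    f_right = unit("fr", sigmoid)  # forget gate applied to the right child's memory

    c = i * u + f_left * c_left + f_right * c_right
    h = o * np.tanh(c)
    return h, c
```

Tying forget gates to child positions lets the node learn position-dependent behavior, for example keeping more of the head child’s memory in a binarized constituency tree.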

Page 17:

Child-Sum Tree-LSTMs

The children of a Child-Sum Tree-LSTM are unordered. Instead, the h_{t-1} value that gets passed into u_t along with x_t and sent to the input gate becomes the sum of the h values of all the child nodes.

Similarly, each child's memory cell state is passed through its own forget gate, and the results are summed into the memory cell state for x_t.
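Here is a minimal NumPy sketch of one Child-Sum Tree-LSTM node, following the description above: the gates see the sum of the children's hidden states, and each child's memory cell passes through its own forget gate before being summed into the node's cell. Parameter names and the params dict are illustrative, not the authors' released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def child_sum_tree_lstm_node(x, child_h, child_c, p):
    """One Child-Sum Tree-LSTM node.

    x        : input vector at this node (e.g. a word embedding)
    child_h  : list of child hidden states (empty at a leaf)
    child_c  : list of child memory cells, parallel to child_h
    p        : dict of weight matrices W_*, U_* and biases b_* (illustrative names)
    """
    # The gates see the *sum* of the children's hidden states, so child order is irrelevant.
    h_sum = np.sum(child_h, axis=0) if child_h else np.zeros_like(p["b_i"])

    i = sigmoid(p["W_i"] @ x + p["U_i"] @ h_sum + p["b_i"])   # input gate
    o = sigmoid(p["W_o"] @ x + p["U_o"] @ h_sum + p["b_o"])   # output gate
    u = np.tanh(p["W_u"] @ x + p["U_u"] @ h_sum + p["b_u"])   # candidate cell state

    # Each child gets its own forget gate, conditioned on that child's hidden state;
    # the gated child memory cells are summed into this node's memory cell.
    c = i * u
    for h_k, c_k in zip(child_h, child_c):
        f_k = sigmoid(p["W_f"] @ x + p["U_f"] @ h_k + p["b_f"])
        c = c + f_k * c_k

    h = o * np.tanh(c)
    return h, c
```

Because all children share the same forget-gate parameters and are summed, the node is insensitive to child order and to the number of children, which is what makes this variant a natural fit for dependency trees.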

Page 18:

Summary

N-ary Tree-LSTMs

• Children are discrete

• Each of N child slots has its own forget gate (e.g. same gate for all left children)

• Children are ordered; the forget gate is determined by the ordering

• Also called Constituency Tree-LSTMs

• Restriction: branching factor may not exceed N.

Child-Sum Tree-LSTMs

• Children are lumped together

• Children are unordered, because they are all summed anyway.

• Also called Dependency Tree-LSTMs

Page 19:

Summary

Sentiment Classification

• Two types: binary and 1-5

• Stanford Sentiment Treebank (Socher et al. 2013)

• Dataset includes parse trees

Semantic Relatedness

• Given two sentences, predict a relatedness score in [1, 5] (see the sketch of the scoring model after this slide)

• 1 = least related, 5 = most related

• Sentences Involving Compositional Knowledge dataset (Marelli et al. 2014)

• The final label of a sentence pair is the average of 10 annotators’ ratings
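For reference, here is a minimal NumPy sketch of the similarity model the paper places on top of the two Tree-LSTM sentence representations h_L and h_R: elementwise product and absolute difference features, a small hidden layer, a softmax over the K = 5 relatedness classes, and the expected class value as the predicted score. The weight names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def relatedness_score(h_l, h_r, p, K=5):
    """Predict a real-valued relatedness score in [1, K] from two sentence vectors;
    p holds the comparison-layer and softmax weights (illustrative names)."""
    h_mul = h_l * h_r            # elementwise product: agreement of corresponding features
    h_dist = np.abs(h_l - h_r)   # absolute difference: feature-wise distance

    h_s = sigmoid(p["W_mul"] @ h_mul + p["W_dist"] @ h_dist + p["b_h"])
    probs = softmax(p["W_p"] @ h_s + p["b_p"])   # distribution over the K relatedness classes

    r = np.arange(1, K + 1)      # class values 1..K
    return float(r @ probs)      # expected class value = predicted score
```

The regression target is built the same way in reverse: the gold score is encoded as a sparse distribution over the K classes, and the model is trained to match that distribution.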

Page 20:

Sentiment Classification Results

● The Constituency Tree-LSTM is trained on more data than the Dependency Tree-LSTM

● Continuing to train the GloVe vectors yields a noticeable improvement

● No mention of why CNN-multichannel performs so well on the binary task

Page 21:

Semantic Relatedness Results

● Supervision only at the root node

● The maximum depth of a dependency tree is smaller than that of a binarized constituency tree

Page 22:

Example Code

Various LSTM examples:
LSTM for sentiment analysis of tweets - deeplearning.net
Character-level LSTM for sequence generation - Andrej Karpathy
RNNLIB - Alex Graves
LSTM with peepholes - Felix Gers

Tree-LSTMs:
N-ary and Child-Sum Tree-LSTMs - Kai Sheng Tai

Other:
GloVe vectors - Jeffrey Pennington