COMP9444 Neural Networks and Deep Learning
8. Recurrent Networks
Textbook, Chapter 10
COMP9444 © Alan Blair, 2017-18
Outline
• Processing Temporal Sequences
• Sliding Window
• Recurrent Network Architectures
• Hidden Unit Dynamics
• Long Short Term Memory
Processing Temporal Sequences
There are many tasks which require a sequence of inputs to be processed rather than a single input.
• speech recognition
• time series prediction
• machine translation
• handwriting recognition
How can neural network models be adapted for these tasks?
Sliding Window
The simplest way to feed temporal input to a neural network is the "sliding window" approach, first used in the NetTalk system (Sejnowski & Rosenberg, 1987).
NetTalk Task
Given a sequence of 7 characters, predict the phonetic pronunciation of the middle character.
For this task, we need to know the characters on both sides.
For example, how are the vowels in these words pronounced?
pa pat pate paternal
mo mod mode modern
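As a concrete illustration, here is a minimal Python sketch (a hypothetical helper, not the original NetTalk code) of how a 7-character window can be slid over a word, padding the ends so that every character can sit in the centre position:

```python
# Minimal sketch of the "sliding window" input encoding: each training
# example is a 7-character window centred on the character whose phoneme
# is to be predicted.  Names and padding symbol are illustrative.

def windows(word, size=7, pad="_"):
    """Yield (window, centre_char) pairs for every position in the word."""
    half = size // 2
    padded = pad * half + word + pad * half
    for i in range(len(word)):
        yield padded[i:i + size], word[i]

for w, c in windows("paternal"):
    print(w, "->", c)
# e.g. '___pate' -> 'p', '__pater' -> 'a', ...
```

Each window then becomes one (fixed-size) input vector for an ordinary feedforward network.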
NetTalk Architecture
NetTalk
• NETtalk gained a lot of media attention at the time.
• Hooking it up to a speech synthesizer was very cute. In the early stages of training, it sounded like a babbling baby. When fully trained, it pronounced the words mostly correctly (but sounded somewhat robotic).
• Later studies on similar tasks have often found that a decision tree could produce equally good or better accuracy.
• This kind of approach can only learn short term dependencies, not the medium or long term dependencies that are required for some tasks.
Simple Recurrent Network (Elman, 1990)
• at each time step, the hidden layer activations are copied to a "context" layer
• the hidden layer receives connections from both the input and context layers
• the inputs are fed one at a time to the network; it uses the context layer to "remember" whatever information is required for it to produce the correct output (a minimal sketch follows below)
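A purely illustrative NumPy sketch of one Elman step, in which the context layer is simply a copy of the previous hidden activations; the layer sizes and weight names are assumptions, not taken from the slides:

```python
import numpy as np

# Illustrative sizes; the lecture does not specify these.
n_in, n_hid, n_out = 4, 8, 3
rng = np.random.default_rng(0)
W_in  = rng.normal(scale=0.1, size=(n_hid, n_in))   # input  -> hidden
W_ctx = rng.normal(scale=0.1, size=(n_hid, n_hid))  # context -> hidden
W_out = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden -> output

def srn_step(x, context):
    """One Elman step: hidden depends on the current input and the context."""
    hidden = np.tanh(W_in @ x + W_ctx @ context)
    output = W_out @ hidden
    return output, hidden          # hidden is copied into the context layer

context = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):   # a toy sequence of 5 input vectors
    y, context = srn_step(x, context)
```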
Back Propagation Through Time
• we can "unroll" a recurrent architecture into an equivalent feedforward architecture with shared weights
• applying backpropagation to the unrolled architecture is referred to as "backpropagation through time"
• we can backpropagate just one timestep, a fixed number of timesteps, or all the way back to the beginning of the sequence (sketched below)
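A hedged sketch of truncated backpropagation through time using PyTorch autograd (the framework choice and all names are mine, not the lecture's): the recurrence is unrolled as an ordinary feedforward computation that reuses the same weight tensors at every step, and detaching the hidden state at a chunk boundary stops gradients from flowing further back:

```python
import torch

n_in, n_hid, n_out = 4, 8, 3
W_in  = torch.randn(n_hid, n_in,  requires_grad=True)
W_ctx = torch.randn(n_hid, n_hid, requires_grad=True)
W_out = torch.randn(n_out, n_hid, requires_grad=True)

def bptt_chunk(xs, targets, h):
    """Unroll over one chunk of the sequence; the same weights are reused
    (shared) at every step, so gradients from all steps accumulate."""
    loss = 0.0
    for x, t in zip(xs, targets):
        h = torch.tanh(W_in @ x + W_ctx @ h)
        y = W_out @ h
        loss = loss + ((y - t) ** 2).sum()
    loss.backward()                  # backprop through the unrolled steps
    return h.detach()                # truncate: stop gradients at the boundary

h = torch.zeros(n_hid)
xs      = [torch.randn(n_in)  for _ in range(5)]
targets = [torch.randn(n_out) for _ in range(5)]
h = bptt_chunk(xs, targets, h)       # gradients now sit in W_in.grad, etc.
```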
Other Recurrent Network Architectures
• it is sometimes beneficial to add "shortcut" connections directly from input to output
• connections from output back to hidden have also been explored (sometimes called "Jordan Networks")
Second Order (or Gated) Networks
x^j_t = \tanh\Big( W^{j0}_{\sigma_t} + \sum_{k=1}^{d} W^{jk}_{\sigma_t}\, x^k_{t-1} \Big)

z = \tanh\Big( P^0 + \sum_{j=1}^{d} P^j\, x^j_n \Big)
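A minimal NumPy sketch of these updates, under the assumption that σ_t is an integer input symbol selecting which weight matrix is applied at each step, and that z is read off the final hidden state x_n as the Accept/Reject score (sizes and names are illustrative):

```python
import numpy as np

n_sym, d = 2, 5                        # alphabet size, number of hidden units
rng = np.random.default_rng(1)
# One (d x (d+1)) weight matrix per symbol: column 0 holds the biases W^{j0}.
W = rng.normal(scale=0.5, size=(n_sym, d, d + 1))
P = rng.normal(scale=0.5, size=d + 1)  # output weights; P[0] is the bias P^0

def recognise(symbols):
    """Scan a symbol sequence one step at a time, then return the score z."""
    x = np.zeros(d)                                 # x_0
    for s in symbols:                               # s plays the role of sigma_t
        x = np.tanh(W[s][:, 0] + W[s][:, 1:] @ x)   # second-order (gated) update
    return np.tanh(P[0] + P[1:] @ x)                # z from the final state x_n

print(recognise([0, 1, 0, 1]))                      # e.g. classify the string "abab"
```

The "gating" is that the current input symbol selects the weight matrix used for the hidden-state transition, which is what lets such networks emulate finite state automata.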
Task: Formal Language Recognition
[Table: example binary strings, one column labelled Accept and one labelled Reject]

Scan a sequence of characters one at a time, then classify the sequence as Accept or Reject.
Dynamical Recognizers
� gated network trained by BPTT� emulates exactly the behaviour of Finite State Automaton
Task: Formal Language Recognition
[Table: example binary strings, one column labelled Accept and one labelled Reject]

Scan a sequence of characters one at a time, then classify the sequence as Accept or Reject.
Dynamical Recognizers
� trained network emulates the behaviour of Finite State Automaton� training set must include short, medium and long examples
Phase Transition
Chomsky Hierarchy
Language                  Machine                     Example
Regular                   Finite State Automaton      a^n (n odd)
Context Free              Push Down Automaton         a^n b^n
Context Sensitive         Linear Bounded Automaton    a^n b^n c^n
Recursively Enumerable    Turing Machine              true QBF
Task: Formal Language Prediction
abaabbabaaabbbaaaabbbbabaabbaaaaabbbbb ...
Scan a sequence of characters one at a time, and try at each step to predict the next character in the sequence.
In some cases, the prediction is probabilistic.
For the a^n b^n task, the first b is not predictable, but subsequent b's and the initial a in the next subsequence are predictable.
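A small sketch of how such a prediction stream might be generated, assuming (as an illustration only) that the training sequence is formed by concatenating a^n b^n blocks with randomly chosen n; the flags mark the positions at which the next character is deterministic:

```python
import random

def anbn_stream(num_blocks=5, max_n=5, seed=0):
    """Concatenate a^n b^n blocks with random n, and flag, for each position,
    whether the NEXT character is deterministically predictable."""
    random.seed(seed)
    s, predictable = "", []
    for _ in range(num_blocks):
        n = random.randint(1, max_n)
        s += "a" * n + "b" * n
        for i in range(2 * n):
            # While only a's have been seen, the next character could be 'a'
            # or 'b' (the first b is not predictable).  From the first b on,
            # the remaining b's and the initial a of the next block are forced.
            predictable.append(i >= n)
    return s, predictable

s, flags = anbn_stream()
print(s)
print("".join("P" if f else "?" for f in flags))   # 'P' = predictable position
```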
Elman Network for predicting a^n b^n
[Figure: Elman network for the a^n b^n prediction task, with units labelled a and b]
Oscillating Solution for a^n b^n
Learning to Predict a^n b^n
• the network does not implement a Finite State Automaton, but instead uses two fixed points in activation space, one attracting and the other repelling (Wiles & Elman, 1995)
• networks trained only up to a^10 b^10 could generalize up to a^12 b^12
• training the weights by evolution is more stable than by backpropagation
• networks trained by evolution were sometimes monotonic rather than oscillating
Monotonic Solution for a^n b^n
Hidden Unit Analysis for a^n b^n
[Figures: hidden unit trajectory; fixed points and eigenvectors]
Counting by Spiralling
• for this task, a sequence is accepted if the numbers of a's and b's are equal
• the network counts up by spiralling inwards, and down by spiralling outwards
Hidden Unit Dynamics for a^n b^n c^n
• an SRN with 3 hidden units can learn to predict a^n b^n c^n by counting up and down simultaneously in different directions, thus producing a star shape
Partly Monotonic Solution for a^n b^n c^n
Long Range Dependencies
• Simple Recurrent Networks (SRNs) can learn medium-range dependencies but have difficulty learning long range dependencies
• Long Short Term Memory (LSTM) and Gated Recurrent Units (GRU) can learn long range dependencies better than SRNs
Long Short Term Memory
Two excellent Web resources for LSTM:
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
christianherta.de/lehre/dataScience/machineLearning/neuralNetworks/LSTM.php
Reber Grammar
Embedded Reber Grammar
Simple Recurrent Network
SRN – the context layer is combined directly with the input to produce the next hidden layer.
An SRN can learn the Reber Grammar, but not the Embedded Reber Grammar.
Long Short Term Memory
LSTM – the context layer is modulated by three gating mechanisms: forget gate, input gate and output gate.
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
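As a rough sketch of what those three gates do, here is a NumPy version of a commonly used formulation of the LSTM cell (the same one presented in the tutorial linked above; details such as peephole connections vary between variants, so treat the exact equations as an assumption rather than the lecture's definitive form):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step.  W maps the concatenation [h; x] to the four
    pre-activations (forget gate, input gate, candidate, output gate)."""
    hx = np.concatenate([h, x])
    f, i, g, o = np.split(W @ hx + b, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates lie in (0, 1)
    c = f * c + i * np.tanh(g)                     # forget old cell, write new
    h = o * np.tanh(c)                             # gated exposure of the cell
    return h, c

n_in, n_hid = 3, 4                                 # illustrative sizes
rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(4 * n_hid, n_hid + n_in))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(6, n_in)):               # scan a toy sequence
    h, c = lstm_step(x, h, c, W, b)
```

Because the cell state c is carried forward additively (scaled by the forget gate) rather than squashed through the recurrence at every step, gradients survive over many more timesteps than in an SRN.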
Gated Recurrent Unit
GRU is similar to LSTM but has only two gates instead of three.
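For comparison, a sketch of the standard GRU update with its two gates (reset and update); the slide only states that there are two gates, so the exact equations below are the commonly used ones and should be read as an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, Wz, Wr, Wh):
    """One GRU step with two gates: update gate z and reset gate r."""
    hx = np.concatenate([h, x])
    z = sigmoid(Wz @ hx)                       # how much of the old h to replace
    r = sigmoid(Wr @ hx)                       # how much of h the candidate sees
    h_tilde = np.tanh(Wh @ np.concatenate([r * h, x]))
    return (1 - z) * h + z * h_tilde           # interpolate old and new state

n_in, n_hid = 3, 4                             # illustrative sizes
rng = np.random.default_rng(3)
Wz = rng.normal(scale=0.1, size=(n_hid, n_hid + n_in))
Wr = rng.normal(scale=0.1, size=(n_hid, n_hid + n_in))
Wh = rng.normal(scale=0.1, size=(n_hid, n_hid + n_in))
h = np.zeros(n_hid)
for x in rng.normal(size=(6, n_in)):           # scan a toy sequence
    h = gru_step(x, h, Wz, Wr, Wh)
```

The GRU has no separate cell state: the hidden state itself is carried forward through the interpolation controlled by the update gate, which gives it LSTM-like behaviour with fewer parameters.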