
Applied Machine Learning
Lecture 12: Time series and sequential models

Selpi (selpi@chalmers.se)

These slides are a further development of Richard Johansson's slides.

March 3, 2020

Examples of tasks related to sequences

- sequence (sentiment/review) classification
- machine translation
- forecasting electricity load
- recognising activities from videos
- ...

overview

- how can we represent and predict sequential data using machine learning models?
  - use classical approaches
  - use tailored neural network approaches

sequence (review/sentiment) classification

"I RECOMMEND this movie!" → positive

sequence prediction (1)

G O T H E N B → U (predict the next letter)


sequence prediction (2)


sequence tagging

United Nations official Ekeus heads for Baghdad .

B-ORG I-ORG O B-PER O O B-LOC O

A C A T G G T C T G A A

N N C C C C C C C C C N


Machine translation

This is a very nice school building.

Detta är en mycket trevlig skolbyggnad.


Overview

sequence prediction: classical approaches

neural networks for sequential problems

Different types of RNNs

Use the "reduce problem" concept

- a reduction in machine learning means that we convert a complicated problem into (one or more) simpler problems
  - example: multiclass classification → several binary classification problems
- let's see how we can attack sequential problems as a series of classification or regression tasks

step-by-step prediction: general ideas

- break down the problem into a sequence of smaller decisions
  - think of it as a system that gradually consumes input and generates output
- use standard classifiers or regressors to guess the next step
  - use features from the input and from the history
  - we might need to constrain predictions to keep the output consistent

example: named entity tagging

- input: a sentence
- output: names in the sentence bracketed and labeled

[ORG United Nations] official [PER Ekeus] heads for [LOC Baghdad] .

converting brackets into tags

- typical solution: Beginning/Inside/Outside (BIO) coding (a small sketch follows below)

United Nations official Ekeus heads for Baghdad .
B-ORG  I-ORG   O        B-PER O     O   B-LOC   O

- see Ratinov and Roth: Design challenges and misconceptions in named entity recognition. CoNLL 2009.
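Below is a minimal sketch (not from the original slides) of how bracketed, labeled spans can be turned into BIO tags; the example sentence and spans are the ones above, everything else is illustrative.

def spans_to_bio(tokens, spans):
    """Convert labeled spans (start, end, label), end exclusive, into per-token BIO tags."""
    tags = ['O'] * len(tokens)
    for start, end, label in spans:
        tags[start] = 'B-' + label
        for i in range(start + 1, end):
            tags[i] = 'I-' + label
    return tags

tokens = ['United', 'Nations', 'official', 'Ekeus', 'heads', 'for', 'Baghdad', '.']
spans = [(0, 2, 'ORG'), (3, 4, 'PER'), (6, 7, 'LOC')]
print(spans_to_bio(tokens, spans))
# ['B-ORG', 'I-ORG', 'O', 'B-PER', 'O', 'O', 'B-LOC', 'O']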

training step-by-step systems: the basic approach

- how can we train the decision classifier in a step-by-step system?
- imitation learning: learn to imitate an "expert"
- naive approach to imitation learning:
  - force the system to generate the correct output
  - observe the states we pass along the way
  - use them as examples and train the classifier


example of features: named entity tagging

what happens at prediction time?

- how will we use this system for prediction?
- there is a difference between training and prediction:
  - when training, "history features" are based on the gold standard
  - when predicting, they are based on automatic predictions


code example: named entity tagging

see https://www.clips.uantwerpen.be/conll2003/ner/
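The linked notebook is not reproduced here; the following is a minimal sketch of the step-by-step idea using scikit-learn, with illustrative feature names and a toy training set (not the features or data used in the course). Note how training uses gold-standard history while prediction uses the model's own previous tags.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def token_features(tokens, i, prev_tag):
    # features from the input and from the history
    return {
        'word': tokens[i].lower(),
        'is_capitalized': tokens[i][0].isupper(),
        'prev_word': tokens[i - 1].lower() if i > 0 else '<s>',
        'prev_tag': prev_tag,          # "history feature"
    }

def training_examples(sentences):
    # naive imitation learning: histories are taken from the gold standard
    X, y = [], []
    for tokens, tags in sentences:
        prev_tag = '<s>'
        for i in range(len(tokens)):
            X.append(token_features(tokens, i, prev_tag))
            y.append(tags[i])
            prev_tag = tags[i]
    return X, y

def tag_sentence(model, tokens):
    # at prediction time, the history comes from our own predictions
    tags, prev_tag = [], '<s>'
    for i in range(len(tokens)):
        tag = model.predict([token_features(tokens, i, prev_tag)])[0]
        tags.append(tag)
        prev_tag = tag
    return tags

# toy training data in BIO format (the CoNLL-2003 data linked above has the same shape)
sentences = [(['United', 'Nations', 'official', 'Ekeus', 'heads', 'for', 'Baghdad', '.'],
              ['B-ORG', 'I-ORG', 'O', 'B-PER', 'O', 'O', 'B-LOC', 'O'])]
X, y = training_examples(sentences)
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(tag_sentence(model, sentences[0][0]))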

discussion

- how would you build a model to predict the electricity load demand an hour from now?

[plot: electricity load over time, values roughly between 7000 and 12000]

see e.g. Chang et al. (2001) EUNITE Network Competition: Electricity Load Forecasting

time series prediction

- a time series is a set of observations collected at specified times, normally at equal intervals (every 30 minutes, every month, etc.)
- one common use is to predict future values given the historical observed values (the predictive perspective); this is the focus in ML
- note that the temporal aspect is important
  - useful information comes not only from the input features, but also from the changes in input/output over time
  - making predictions is harder

Components of time series, in the classic statistics approach

- trend (increase/decrease; something that happens for some time and then disappears)
- seasonality (repeating pattern within a certain fixed period; e.g., sales are higher in December than in other months)
- irregularity (noise, cannot be predicted); e.g., an outbreak happens, sales of masks go up, and afterwards come down again
- cyclic (up/down movements that keep on repeating, but are harder to predict)

time series analysis software in Python

- StatsModels: https://www.statsmodels.org/stable/tsa.html

autoregressive models: AR(p)

- an autoregressive model AR(p) is defined by the equation

  Y_t = c + Σ_{i=1}^{p} a_i · Y_{t-i} + ε_t

  where Y_t is the time series, c and the a_i are constants, and ε_t is Gaussian noise

[plots: simulated series from an AR(2) model with a = [0.99, 0.9] and an AR(1) model with a = [-0.7]]
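For intuition, a short sketch (not from the slides) that simulates series from the AR(p) equation above; the coefficient values here are purely illustrative.

import numpy as np

def simulate_ar(a, n=200, c=0.0, noise_std=1.0, seed=0):
    # Y_t = c + sum_i a_i * Y_{t-i} + eps_t, with Gaussian noise eps_t
    rng = np.random.default_rng(seed)
    p = len(a)
    y = np.zeros(n + p)
    for t in range(p, n + p):
        y[t] = c + sum(a[i] * y[t - 1 - i] for i in range(p)) + rng.normal(0.0, noise_std)
    return y[p:]

y_ar2 = simulate_ar([0.5, 0.3])    # an AR(2) series (illustrative coefficients)
y_ar1 = simulate_ar([-0.7])        # an AR(1) series, oscillating around zero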

using an AR model for prediction

  Y_t ≈ c + Σ_{i=1}^{p} a_i · Y_{t-i}

- for our purposes, this type of model can be seen as a type of linear regression model
- the next step can be predicted from some previous steps (see the StatsModels sketch below)
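A minimal sketch of fitting such a model in practice, using the AutoReg class from the StatsModels library linked earlier; the series and the lag order are illustrative.

import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(0)
y = np.zeros(200)
for t in range(2, 200):                          # a toy AR(2)-like series
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()

model = AutoReg(y, lags=2).fit()                 # estimates c and a_1, a_2 by least squares
print(model.params)                              # intercept and lag coefficients
print(model.predict(start=len(y), end=len(y)))   # one-step-ahead forecast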

extensions to the basic autoregressive model

- AR models with exogenous variables (ARX):

  Y_t = c + Σ_{i=1}^{p} a_i · Y_{t-i} + Σ_{i=1}^{q} b_i · X_{t-i} + ε_t

  allow us to add more features, e.g. to model seasonality

- nonlinear models (NARX):

  Y_t = F(Y_{t-1}, ..., X_{t-i}, ...) + ε_t

  where F is any predictive model

Converting time series prediction/forecasting to supervised learning

- convert the data using the sliding window method (a sketch follows below)
  - one-step forecast
  - multi-step forecast
- If using ...
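A minimal sketch of the sliding-window conversion (assuming a 1-D NumPy series): each window of past values becomes a feature vector and the value `horizon` steps later becomes the target, after which any regressor (linear for AR-style models, nonlinear for NARX-style models) can be trained.

import numpy as np

def sliding_window(series, window=3, horizon=1):
    # X[i] holds `window` consecutive values; y[i] is the value `horizon` steps later
    X, y = [], []
    for t in range(window, len(series) - horizon + 1):
        X.append(series[t - window:t])
        y.append(series[t + horizon - 1])
    return np.array(X), np.array(y)

series = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X, y = sliding_window(series, window=3)
# X = [[1 2 3], [2 3 4], [3 4 5]],  y = [4 5 6]

For a multi-step forecast, one can either predict recursively (feeding predictions back in as inputs) or train a separate model per forecast horizon.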


Overview

sequence prediction: classical approaches

neural networks for sequential problems

Different types of RNNs

overview

- previously, we had an introduction to neural network classifiers
  - a key attraction of this type of model is the reduced need for feature engineering
- today: recurrent neural networks, which we apply to sequential data
  - classifying sentences or documents (e.g., predicting a positive/negative review)
  - outputting sequences, such as time series prediction
  - "sequence-to-sequence", such as machine translation

recap: feedforward NN

- a feedforward neural network or multilayer perceptron consists of connected layers of "classifiers"
  - the intermediate classifiers are called hidden units
  - the final classifier is called the output unit
- each hidden unit h_i computes its output based on its own weight vector w_{h_i}:

  h_i = f(w_{h_i} · x)

- and then the output is computed from the hidden units:

  y = f(w_o · h)

- the function f is called the activation

recap: representing words in NNs

- NN implementations tend to prefer short, dense vectors
- recall the way we code word features as one-hot sparse vectors:

  tomato → [0, 0, 1, 0, 0, ..., 0, 0, 0]
  carrot → [0, 0, 0, 0, 0, ..., 0, 1, 0]

- the solution: represent words with low-dimensional vectors (word embeddings), in such a way that words with similar meaning have similar vectors

  tomato → [0.10, -0.20, 0.45, 1.2, -0.92, 0.71, 0.05]
  carrot → [0.08, -0.21, 0.38, 1.3, -0.91, 0.82, 0.09]

- tools such as word2vec help us build these vectors

word embeddings

- word embeddings: instead of high-dimensional sparse vectors, low-dimensional dense vectors

  Gothenburg → [0.010, -0.20, ..., 0.15, 0.07, -0.23, ...]
  Stockholm  → [0.015, -0.05, ..., 0.11, 0.14, -0.12, ...]

- created as a by-product of training a neural network, or by separate pre-training on raw text
- this solution addresses the previously mentioned limitations:
  - low dimensionality, easy to plug into NNs
  - can encode some sort of "similarity"
  - to some extent, they mitigate the problem of rarity, if pre-trained

Using "contexts" to determine similarity

- the distributional hypothesis: the distribution of contexts in which a word appears is a good proxy for the meaning of that word. Assumption: two words probably have a similar "meaning" if they tend to appear in similar contexts
- example of a word-word co-occurrence matrix (a small sketch follows below), built from:
  - "I like NLP"
  - "I like deep learning"
  - "I enjoy flying"

[source]
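A small sketch (not from the slides) of building such a word-word co-occurrence matrix from the three example sentences, counting neighbours within a window of one word on each side.

from collections import Counter

sentences = [["I", "like", "NLP"],
             ["I", "like", "deep", "learning"],
             ["I", "enjoy", "flying"]]

window = 1
counts = Counter()
for sent in sentences:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[(word, sent[j])] += 1

vocab = sorted({w for sent in sentences for w in sent})
for word in vocab:
    print(f"{word:>10}", [counts[(word, c)] for c in vocab])
# e.g. the row for 'like' has count 2 for the context 'I' and 1 each for 'NLP' and 'deep'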


measuring the similarity between words: cosine

[source]
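As a reminder of the measure behind the slide: the cosine similarity of two vectors is their dot product divided by the product of their norms. A tiny sketch, reusing the dense vectors shown earlier:

import numpy as np

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

tomato = np.array([0.10, -0.20, 0.45, 1.2, -0.92, 0.71, 0.05])
carrot = np.array([0.08, -0.21, 0.38, 1.3, -0.91, 0.82, 0.09])
print(cosine_similarity(tomato, carrot))   # close to 1 for words with similar vectors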

word2vec: basic assumptions

- each word w comes from a word vocabulary W
- from a large collection of text, we collect contexts such as

  they drink coffee with milk

word2vec: "skip-gram with negative sampling"

- collect word-context pairs occurring in the text data:

  coffee: drink

- for each such pair, randomly generate a number of synthetic pairs:

  coffee: mechanic
  coffee: contrary
  coffee: persuade

- then fit the following model w.r.t. (V, V′):

  P(true pair | (w, c)) = 1 / (1 + exp(−V(w) · V′(c)))

  P(synthetic pair | (w, c)) = 1 − 1 / (1 + exp(−V(w) · V′(c)))

inspection of the model

- after training the embedding model, we can inspect the result using nearest-neighbor lists
- for illustration, vectors can be projected to two dimensions using methods such as t-SNE or PCA

[2-D projection: food words (pizza, sushi, falafel, spaghetti), music genres (rock, techno, funk, soul, punk, jazz), and computer hardware (router, touchpad, laptop, monitor) form separate clusters]

gensim example: training a model

import gensim

text_file_name = 'data/wikipedia.txt'

# read sentences line by line from a plain-text corpus (first 100,000 lines)
sentences = gensim.models.word2vec.LineSentence(text_file_name, limit=100000)

# train 20-dimensional embeddings (gensim 3.x API; in gensim >= 4, the parameter
# is vector_size and neighbours are queried via model.wv.most_similar)
model = gensim.models.Word2Vec(sentences, size=20, window=5,
                               min_count=5, workers=2)

...

model.most_similar('europe')

gensim example: using a trained model

from gensim.models import KeyedVectors

# pre-trained 300-dimensional vectors trained on Google News
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)

...

recap: sequential prediction

United Nations official Ekeus heads for Baghdad .
B-ORG  I-ORG   O        B-PER O     O   B-LOC   O

- a normal classifier is applied in each step: SVM, LR, decision tree, neural network, ...
- limitation: features are typically extracted from a small window!

recurrent NN for processing sequences

- recurrent NNs (RNNs) are applied in a step-by-step fashion to a sequential input such as a sentence
- RNNs use a state vector that represents what has happened previously
- after each step, a new state is computed


RNN example


using the RNN output

- the RNN output isn't that interesting on its own ...
- we can use it
  - in sequence classifiers
  - in sequence predictors
  - in sequence taggers
  - in translation

example: using the RNN output in a document classifier


example: using the RNN output in a part-of-speech tagger



what's in the box? simple RNN implementation

- the simplest type of RNN looks similar to a feedforward NN
  - the next state is computed like a hidden layer in a feedforward NN

    state_t = σ(W_h · (input ⊕ state_{t-1}))

  - ... and the output from the next state

    output = softmax(W_o · state_t)
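A minimal NumPy sketch of the equations above (dimensions and random weights are purely illustrative): at each step the new state is computed from the current input concatenated with the previous state, and an output is read off the state.

import numpy as np

rng = np.random.default_rng(0)
input_dim, state_dim, output_dim = 4, 8, 3
W_h = rng.normal(size=(state_dim, input_dim + state_dim))    # state weights
W_o = rng.normal(size=(output_dim, state_dim))               # output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def run_simple_rnn(inputs):
    state = np.zeros(state_dim)
    outputs = []
    for x in inputs:                                         # consume the sequence step by step
        state = sigmoid(W_h @ np.concatenate([x, state]))    # state_t = σ(W_h · (input ⊕ state_{t-1}))
        outputs.append(softmax(W_o @ state))                 # output = softmax(W_o · state_t)
    return outputs

sequence = [rng.normal(size=input_dim) for _ in range(5)]
print(run_simple_rnn(sequence)[-1])                          # output distribution after the last step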

simple RNNs in Keras

from keras.models import Sequential
from keras.layers import SimpleRNN

model = Sequential()
model.add(SimpleRNN(32))


training RNNs: backpropagation through time


Overview

sequence prediction: classical approaches

neural networks for sequential problems

Different types of RNNs

Many-to-many architectures (1)

- for input and output sequences of the same length

Many-to-many architectures (2)

- for input and output sequences of different lengths
- for example, machine translation: input is a sentence in one language, output is a sentence in another language

Many-to-one

- for example, for sentiment classification, input: a sequence, output: positive/negative; for video-based activity recognition, input: a sequence of video frames, output: the type of activity

One-to-many

- for example, for music generation, input: a genre, output: a sequence of notes


simple RNNs have a drawback

image borrowed from C. Olah's blog

long short-term memory: LSTM

- the long short-term memory is a type of RNN that is designed to handle long-term dependencies in a better way

image borrowed from C. Olah's blog

- its state consists of two parts: a short-term part and a long-term part (also known as the cell state)
- note that the exact implementation can differ slightly in the literature!

LSTM: short-term part and output

image borrowed from C. Olah's blog

- the short-term state is a bit similar to what we had in the simple RNN

LSTM: long-term part

image borrowed from C. Olah's blog

- the long-term state is controlled by a forget gate that decides if information should be passed along or discarded
- parts of the input and the short-term state can be added to the long-term state (a sketch of one step follows below)
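For concreteness, a NumPy sketch of one common formulation of a single LSTM step (as noted above, exact implementations differ slightly; biases are omitted here). x is the input, h the short-term state, c the long-term (cell) state, and the W matrices are the gate weights.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W_f, W_i, W_c, W_o):
    z = np.concatenate([x, h])
    f = sigmoid(W_f @ z)           # forget gate: how much of the long-term state to keep
    i = sigmoid(W_i @ z)           # input gate: how much new information to add
    c_tilde = np.tanh(W_c @ z)     # candidate values computed from input and short-term state
    c_new = f * c + i * c_tilde    # updated long-term (cell) state
    o = sigmoid(W_o @ z)           # output gate
    h_new = o * np.tanh(c_new)     # updated short-term state, also used as the output
    return h_new, c_new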

LSTMs in Keras

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Embedding
from keras.layers import LSTM

model = Sequential()
model.add(Embedding(max_features, output_dim=256))  # word indices -> 256-dim embeddings
model.add(LSTM(128))                                # LSTM with a 128-dimensional state
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))           # binary (e.g. positive/negative) output

model.compile(loss='binary_crossentropy',
              optimizer='adam')

model.fit(x_train, y_train, batch_size=16, epochs=10)

score = model.evaluate(x_test, y_test, batch_size=16)


stacked LSTMs

stacked LSTMs in Keras

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()

# returns a sequence of vectors of dimension 32
model.add(LSTM(32, return_sequences=True,
               input_shape=(timesteps, data_dim)))

# returns a sequence of vectors of dimension 32
model.add(LSTM(32, return_sequences=True))

# returns a single vector of dimension 32
model.add(LSTM(32))

model.add(Dense(10, activation='softmax'))


example: NER using LSTMs


recent developments in machine translation

[image by Chris Manning]


sequence-to-sequence learning for translation

Sutskever et al. (2014) Sequence to Sequence Learning with Neural Networks.


training data for machine translation systems


a final twist on RNNs: neural attention

recent developments (2018)

- we can use a pre-trained CNN for our own image recognition task, which gives better performance than building our own model from only our own training data
- can something similar be done for text?
- the ELMo model (by Allen AI) is a pre-trained LSTM model that can be "plugged" into other applications
  - https://allennlp.org/elmo
- the BERT model (by Google) uses a new architecture called the transformer
  - https://github.com/google-research/bert
- many other pre-trained models derived from BERT
- XLNet

Next lectures

- 6 March: guest lectures from two companies giving industrial aspects of machine learning
  - Solita
  - Ericsson
- 10 March: semi-supervised learning, co-training/multi-view learning, self-learning, recommender systems, and more on LSTM