TRANSCRIPT
Applied Machine Learning
Lecture 12: Time series and sequential models
Selpi ([email protected])
The slides are a further development of Richard Johansson's slides.
March 3, 2020
Examples of tasks related to sequences
- sequence (sentiment/review) classification
- machine translation
- forecasting electricity load
- recognising activities from videos
- ...
overview
- how can we represent and predict sequential data using machine learning models?
  - use classical approaches
  - use tailored neural network approaches
sequence (review/sentiment) classification
I RECOMMEND this movie!
↓
positive
sequence prediction (1)
G O T H E N B
↓
U
sequence prediction (2)
sequence tagging
United Nations official Ekeus heads for Baghdad .
↓
B-ORG I-ORG O B-PER O O B-LOC O
A C A T G G T C T G A A
↓
N N C C C C C C C C C N
Machine translation
This is a very nice school building.
↓
Detta är en mycket trevlig skolbyggnad.
Overview
sequence prediction: classical approaches
neural networks for sequential problems
Different types of RNNs
Using the "reduce the problem" concept
- a reduction in machine learning means that we convert a complicated problem into (one or more) simpler problems
- example: multiclass classification → several binary classification problems
- let's see how we can attack sequential problems as a series of classification or regression tasks
step-by-step prediction: general ideas
- break down the problem into a sequence of smaller decisions
- think of it as a system that gradually consumes input and generates output
- use standard classifiers or regressors to guess the next step
- use features from the input and from the history
- we might need to constrain predictions to keep the output consistent
example: named entity tagging
I input: a sentence
I output: names in the sentence bracketed and labeled
United Nations official Ekeus heads for Baghdad .
[     ORG    ]          [PER]           [ LOC ]
converting brackets into tags
- typical solution: Beginning/Inside/Outside (BIO) coding (a small conversion sketch follows after this slide)

United Nations official Ekeus heads for Baghdad .
B-ORG I-ORG O B-PER O O B-LOC O

- see Ratinov and Roth: Design challenges and misconceptions in named entity recognition. CoNLL 2009.
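A minimal sketch of the bracket-to-BIO conversion in Python, assuming the bracketed names are given as (start, end, label) token spans; the helper name to_bio is just for illustration:

def to_bio(tokens, spans):
    # convert labelled token spans, e.g. (0, 2, 'ORG') covering tokens 0-1, into BIO tags
    tags = ['O'] * len(tokens)
    for start, end, label in spans:
        tags[start] = 'B-' + label
        for i in range(start + 1, end):
            tags[i] = 'I-' + label
    return tags

tokens = ['United', 'Nations', 'official', 'Ekeus', 'heads', 'for', 'Baghdad', '.']
spans = [(0, 2, 'ORG'), (3, 4, 'PER'), (6, 7, 'LOC')]
print(to_bio(tokens, spans))
# ['B-ORG', 'I-ORG', 'O', 'B-PER', 'O', 'O', 'B-LOC', 'O']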
training step-by-step systems: the basic approach
- how can we train the decision classifier in a step-by-step system?
- imitation learning: learn to imitate an "expert"
- naive approach to imitation learning:
  - force the system to generate the correct output
  - observe the states we pass along the way
  - use them as examples and train the classifier
example of features: named entity tagging
what happens at prediction time?
- how will we use this system for prediction?
- there is a difference between training and prediction:
  - when training, "history features" are based on the gold standard
  - when predicting, they are based on automatic predictions
code example: named entity tagging
see https://www.clips.uantwerpen.be/conll2003/ner/
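The CoNLL data itself is not reproduced here, but the step-by-step approach can be sketched with scikit-learn: ordinary features from the input plus a "history" feature holding the previously assigned tag. The feature set and helper names below are illustrative assumptions, not the features from the linked code example:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(tokens, i, history):
    # features for position i: the current word, its shape, and the history
    word = tokens[i]
    return {
        'word': word.lower(),
        'is_capitalized': word[0].isupper(),
        'prev_word': tokens[i - 1].lower() if i > 0 else '<S>',
        'prev_tag': history[-1] if history else '<S>',
    }

def make_training_data(sentences, tag_sequences):
    # at training time, the history features come from the gold-standard tags
    X, y = [], []
    for tokens, tags in zip(sentences, tag_sequences):
        history = []
        for i in range(len(tokens)):
            X.append(features(tokens, i, history))
            y.append(tags[i])
            history.append(tags[i])
    return X, y

def predict_tags(model, tokens):
    # at prediction time, the history consists of our own previous predictions
    history = []
    for i in range(len(tokens)):
        tag = model.predict([features(tokens, i, history)])[0]
        history.append(tag)
    return history

# model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
# model.fit(*make_training_data(train_sentences, train_tags))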
discussion
- how would you build a model to predict the electricity load an hour from now?
[plot: electricity load time series, values roughly between 7000 and 12000]
see e.g. Chang et al. (2001) EUNITE Network Competition: Electricity Load Forecasting
time series prediction
- a time series is a set of observations collected at specified times, normally at equal intervals (every 30 minutes, every month, etc.)
- one usage: predicting future values given the historically observed values (the predictive perspective); this is the focus in ML
- note that the temporal aspect is important:
  - useful information comes not only from the input features, but also from the changes in input/output over time
  - this makes prediction harder
Components of time series, in classic statistics approach
- trend (an increase/decrease; something that happens for some time and then disappears)
- seasonality (a repeating pattern within a certain fixed period; e.g., sales are higher in December than in other months)
- irregularity (noise that cannot be predicted; e.g., an outbreak happens, sales of masks go up, and afterwards they come down again)
- cyclic (up/down movements that keep repeating, but are harder to predict)
time series analysis software in Python
- StatsModels: https://www.statsmodels.org/stable/tsa.html
autoregressive models: AR(p)
- an autoregressive model AR(p) is defined by the equation

  Y_t = c + \sum_{i=1}^{p} a_i Y_{t-i} + \varepsilon_t

  where Y_t is the time series, c and the a_i are constants, and \varepsilon_t is Gaussian noise
[plots: simulated example series from AR(2) with a = [0.99, 0.9] and AR(1) with a = [-0.7]]
using an AR model for prediction
  Y_t \approx c + \sum_{i=1}^{p} a_i Y_{t-i}

- for our purposes, this type of model can be seen as a type of linear regression model
- the next step can be predicted from some previous steps (a StatsModels sketch follows below)
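A minimal sketch, assuming StatsModels is available: fit an AR(2) model with the AutoReg class and produce a one-step-ahead forecast (the series here is synthetic):

import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# synthetic AR(2) series: Y_t = 0.5*Y_{t-1} + 0.3*Y_{t-2} + noise
rng = np.random.default_rng(0)
y = np.zeros(200)
for t in range(2, 200):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()

# fit an AR(2) model and forecast the next value
result = AutoReg(y, lags=2).fit()
forecast = result.predict(start=len(y), end=len(y))
print(result.params)   # estimated c, a_1, a_2
print(forecast)        # one-step-ahead prediction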
extensions to the basic autoregressive model
- AR models with exogenous variables (ARX):

  Y_t = c + \sum_{i=1}^{p} a_i Y_{t-i} + \sum_{i=1}^{q} b_i X_{t-i} + \varepsilon_t

  allow us to add more features, e.g. to model seasonality

- nonlinear models (NARX):

  Y_t = F(Y_{t-1}, \ldots, X_{t-i}, \ldots) + \varepsilon_t

  where F is any predictive model
Converting time series prediction/forecasting to supervised learning
- convert the data using the sliding window method (a small sketch follows below)
  - one-step forecast
  - multi-step forecast
- if using ...
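A minimal sketch of the sliding-window conversion for one-step forecasting: each window of the last few values becomes a feature vector, and the value that follows becomes the target, so any standard regressor can then be trained on (X, y):

import numpy as np

def sliding_window(series, window):
    # turn a 1-D series into a supervised problem:
    # X[i] = series[i : i+window], y[i] = series[i+window]
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(X), np.array(y)

series = np.array([10, 20, 30, 40, 50, 60])
X, y = sliding_window(series, window=3)
# X = [[10 20 30], [20 30 40], [30 40 50]]
# y = [40, 50, 60]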
Overview
sequence prediction: classical approaches
neural networks for sequential problems
Different types of RNNs
overview
- previously, we had an introduction to neural network classifiers
  - a key attraction of this type of model is the reduced need for feature engineering
- today: recurrent neural networks, which we apply to sequential data
  - classifying sentences or documents (e.g., predicting a positive/negative review)
  - outputting sequences, such as time series prediction
  - "sequence-to-sequence", such as machine translation
recap: feedforward NN
- a feedforward neural network or multilayer perceptron consists of connected layers of "classifiers"
  - the intermediate classifiers are called hidden units
  - the final classifier is called the output unit
- each hidden unit h_i computes its output based on its own weight vector w_{h_i}:

  h_i = f(w_{h_i} \cdot x)

- and then the output is computed from the hidden units:

  y = f(w_o \cdot h)

- the function f is called the activation (a small NumPy sketch follows below)
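A small NumPy sketch of these two equations, using a logistic sigmoid as the activation f and made-up weights:

import numpy as np

def f(z):
    # activation: logistic sigmoid
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])       # input features
W_h = np.array([[0.5, -0.3],   # one weight vector per hidden unit
                [0.1,  0.8]])
w_o = np.array([1.0, -1.0])    # output weights

h = f(W_h @ x)                 # h_i = f(w_{h_i} . x)
y = f(w_o @ h)                 # y = f(w_o . h)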
recap: representing words in NNs
- NN implementations tend to prefer short, dense vectors
- recall the way we code word features as one-hot sparse vectors:

  tomato → [0, 0, 1, 0, 0, ..., 0, 0, 0]
  carrot → [0, 0, 0, 0, 0, ..., 0, 1, 0]

- the solution: represent words with low-dimensional vectors (word embeddings), in a way so that words with similar meanings have similar vectors

  tomato → [0.10, -0.20, 0.45, 1.2, -0.92, 0.71, 0.05]
  carrot → [0.08, -0.21, 0.38, 1.3, -0.91, 0.82, 0.09]

- libraries such as word2vec help us build the vectors
word embeddings
- word embeddings: instead of high-dimensional sparse vectors, low-dimensional dense vectors

  Gothenburg → [0.010, -0.20, ..., 0.15, 0.07, -0.23, ...]
  Stockholm → [0.015, -0.05, ..., 0.11, 0.14, -0.12, ...]

- created as a by-product of training a neural network, or by separate pre-training using raw text
- this solution addresses the previously mentioned limitations:
  - low dimensionality, easy to plug into NNs
  - can encode some sort of "similarity"
  - to some extent, they mitigate the problem of rare words, if pre-trained
Using "contexts" to determine similarity
- the distributional hypothesis: the distribution of contexts in which a word appears is a good proxy for the meaning of that word; the assumption is that two words probably have a similar "meaning" if they tend to appear in similar contexts
- example of a word-word co-occurrence matrix:
  - "I like NLP"
  - "I like deep learning"
  - "I enjoy flying"
[source]
word2vec: basic assumptions
- each word w comes from a word vocabulary W
- from a large collection of text, we collect contexts such as

  they drink coffee with milk
word2vec: �skip-gram with negative sampling�
- collect word-context pairs occurring in the text data:

  coffee: drink

- for each such pair, randomly generate a number of synthetic pairs:

  coffee: mechanic
  coffee: contrary
  coffee: persuade

- then fit the following model with respect to (V, V'):

  P(true pair | (w, c)) = 1 / (1 + exp(-V(w) \cdot V'(c)))

  P(synthetic pair | (w, c)) = 1 - 1 / (1 + exp(-V(w) \cdot V'(c)))
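A small NumPy sketch of the two probabilities above, with toy vectors standing in for the embeddings V(w) and V'(c):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

V_w = np.array([0.2, -0.1, 0.4])   # embedding V(w) of the word, e.g. "coffee"
V_c = np.array([0.3,  0.5, 0.1])   # embedding V'(c) of the context, e.g. "drink"

p_true = sigmoid(V_w @ V_c)        # P(true pair | (w, c))
p_synthetic = 1.0 - p_true         # P(synthetic pair | (w, c))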
inspection of the model
- after training the embedding model, we can inspect the result using nearest-neighbour lists
- for illustration, vectors can be projected to two dimensions using methods such as t-SNE or PCA
[2-D projection showing clusters of similar words: foods (pizza, sushi, falafel, spaghetti), music genres (rock, techno, funk, soul, punk, jazz), and hardware (router, touchpad, laptop, monitor)]
gensim example: training a model
import gensim

# read sentences from a text file, one sentence per line
text_file_name = 'data/wikipedia.txt'
sentences = gensim.models.word2vec.LineSentence(text_file_name,
                                                limit=100000)

# train a 20-dimensional word2vec model
# (in gensim >= 4.0 the parameter 'size' is called 'vector_size')
model = gensim.models.Word2Vec(sentences, size=20, window=5,
                               min_count=5, workers=2)
...
model.wv.most_similar('europe')
gensim example: using a trained model
from gensim.models import KeyedVectors

# load pre-trained word vectors in word2vec binary format
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)
...
recap: sequential prediction
United Nations official Ekeus heads for Baghdad .
B-ORG I-ORG O B-PER O O B-LOC O

- a normal classifier is applied in each step: SVM, LR, decision tree, neural network, ...
- limitation: features are typically extracted from a small window!
recurrent NN for processing sequences
- recurrent NNs (RNNs) are applied in a step-by-step fashion to a sequential input such as a sentence
- RNNs use a state vector that represents what has happened previously
- after each step, a new state is computed
RNN example
using the RNN output
- the RNN output isn't that interesting on its own ...
- we can use it
  - in sequence classifiers
  - in sequence predictors
  - in sequence taggers
  - in translation
example: using the RNN output in a document classi�er
example: using the RNN output in a part-of-speech tagger
what's in the box? simple RNN implementation
- the simplest type of RNN looks similar to a feedforward NN
  - the next state is computed like a hidden layer in a feedforward NN

    state_t = \sigma(W_h \cdot (input \oplus state_{t-1}))

  - ... and the output from the new state:

    output_t = softmax(W_o \cdot state_t)
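A small NumPy sketch of this update rule, unrolled over a short sequence; the dimensions and weights are made up, tanh is used as the activation σ, and ⊕ is implemented as vector concatenation:

import numpy as np

rng = np.random.default_rng(0)
state_dim, input_dim = 4, 3
W_h = rng.normal(size=(state_dim, state_dim + input_dim))  # state weights
W_o = rng.normal(size=(2, state_dim))                      # output weights (2 classes)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

state = np.zeros(state_dim)
inputs = [rng.normal(size=input_dim) for _ in range(5)]    # a sequence of 5 input vectors

for x in inputs:
    # state_t = sigma(W_h . (input (+) state_{t-1}))
    state = np.tanh(W_h @ np.concatenate([x, state]))
    # output_t = softmax(W_o . state_t)
    output = softmax(W_o @ state)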
simple RNNs in Keras
from keras.models import Sequential
from keras.layers import SimpleRNN

model = Sequential()
model.add(SimpleRNN(32))
training RNNs: backpropagation through time
Overview
sequence prediction: classical approaches
neural networks for sequential problems
Different types of RNNs
Many-to-many architectures (1)
- for input and output sequences of the same length (a small Keras sketch follows below)
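A minimal Keras sketch of this architecture, with made-up dimensions: return_sequences=True makes the RNN emit one state per time step, and TimeDistributed applies the same softmax layer at every step, so the output sequence has the same length as the input:

from keras.models import Sequential
from keras.layers import SimpleRNN, TimeDistributed, Dense

timesteps, input_dim, n_classes = 20, 8, 5

model = Sequential()
# one 32-dimensional state per time step
model.add(SimpleRNN(32, return_sequences=True,
                    input_shape=(timesteps, input_dim)))
# the same output layer applied at every time step
model.add(TimeDistributed(Dense(n_classes, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam')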
Many-to-many architectures (2)
- for input and output sequences of different lengths
- for example, machine translation
Many-to-one
- for example, for sentiment classification, input: a sequence, output: positive/negative; for video-based activity recognition, input: a sequence of video frames, output: the type of activity
One-to-many
- for example, for music generation, input: a genre, output: a sequence of notes
simple RNNs have a drawback
image borrowed from C. Olah's blog
long short-term memory: LSTM
- the long short-term memory (LSTM) is a type of RNN that is designed to handle long-term dependencies in a better way

image borrowed from C. Olah's blog

- its state consists of two parts: a short-term part and a long-term part (also known as the cell state)
- note that the exact implementation can differ slightly in the literature!
LSTM: short-term part and output
image borrowed from C. Olah's blog
- the short-term state is a bit similar to what we had in the simple RNN
LSTM: long-term part
image borrowed from C. Olah's blog
- the long-term state is controlled by a forget gate that decides if information should be passed along or discarded
- parts of the input and the short-term state can be added to the long-term state (a sketch of the update equations follows below)
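For reference, one common formulation of the LSTM update (a sketch following the notation in C. Olah's blog; as noted above, exact implementations vary):

  f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)           (forget gate)
  i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)           (input gate)
  \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)    (candidate values)
  C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t        (long-term / cell state)
  o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)           (output gate)
  h_t = o_t \odot \tanh(C_t)                             (short-term state / output)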
LSTMs in Keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Embedding
from keras.layers import LSTM
model = Sequential()
model.add(Embedding(max_features, output_dim=256))  # map word indices to 256-dim vectors
model.add(LSTM(128))                                # the final LSTM state summarises the sequence
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))           # binary output, e.g. positive/negative review
model.compile(loss='binary_crossentropy',
optimizer='adam')
model.fit(x_train, y_train, batch_size=16, epochs=10)
score = model.evaluate(x_test, y_test, batch_size=16)
stacked LSTMs
stacked LSTMs in Keras
from keras.models import Sequential
from keras.layers import LSTM, Dense

# timesteps and data_dim describe the shape of the input sequences
model = Sequential()
# returns a sequence of vectors of dimension 32
model.add(LSTM(32, return_sequences=True,
input_shape=(timesteps, data_dim)))
# returns a sequence of vectors of dimension 32
model.add(LSTM(32, return_sequences=True))
# return a single vector of dimension 32
model.add(LSTM(32))
model.add(Dense(10, activation='softmax'))
sequence tagging based on LSTMs
[source]
example: NER using LSTMs
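A minimal Keras sketch of an LSTM-based tagger along these lines; the vocabulary size, tag count and sequence length are made-up placeholders, and real NER systems typically add more (character features, pre-trained embeddings, a CRF output layer):

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

vocab_size, embedding_dim, n_tags, max_len = 20000, 100, 9, 50

model = Sequential()
# map each token id to an embedding vector
model.add(Embedding(vocab_size, embedding_dim, input_length=max_len))
# a bidirectional LSTM that outputs one vector per token
model.add(Bidirectional(LSTM(64, return_sequences=True)))
# a softmax over the BIO tag set at every position
model.add(TimeDistributed(Dense(n_tags, activation='softmax')))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')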
recent developments in machine translation
[image by Chris Manning]
sequence-to-sequence learning for translation
Sutskever et al. (2014) Sequence to Sequence Learning with Neural Networks.
training data for machine translation systems
a �nal twist on RNNs: neural attention
translation with attention
[source]
example: visualizing attention
[source]
recent developments (2018)
- we can use a pre-trained CNN for our own image recognition task to get better performance, compared to building our own model using only our own training data
- can something similar be done for text?
- the ELMo model (by Allen AI) is a pre-trained LSTM model that can be "plugged" into other applications
  - https://allennlp.org/elmo
- the BERT model (by Google) uses a new architecture called the transformer
  - https://github.com/google-research/bert
- many other pre-trained models are derived from BERT
- XLNet
Next lectures
- 6 March: guest lectures from two companies giving industrial aspects of machine learning
  - Solita
  - Ericsson
- 10 March: semi-supervised learning, co-training/multi-view learning, self-learning, recommender systems, and more on LSTMs