
Page 1: Multimodal Machine Learning - piazza.com

Louis-Philippe Morency

Multimodal Machine Learning

Lecture 2.1: Basic Concepts – Neural Networks

* Original version co-developed with Tadas Baltrusaitis

Page 2: Multimodal Machine Learning - piazza.com

Lecture Objectives

▪ Unimodal basic representations
  ▪ Visual, language and acoustic modalities

▪ Data-driven machine learning
  ▪ Training, validation and testing
  ▪ Example: K-nearest neighbor

▪ Linear classification
  ▪ Score function
  ▪ Two loss functions (cross-entropy and hinge loss)

▪ Neural networks

▪ Course project team formation

Page 3: Multimodal Machine Learning - piazza.com

Multimodal Machine Learning

Verbal Vocal Visual

These challenges are non-exclusive.

Page 4: Multimodal Machine Learning - piazza.com

Unimodal Basic Representations

Page 5: Multimodal Machine Learning - piazza.com

Unimodal Classification – Visual Modality

Binary classification problem: Dog?

Input observation x_i: a color image, where each pixel is represented in ℝ^d (d is the number of colors; d = 3 for RGB)

Label y_i ∈ Y = {0, 1}

Page 6: Multimodal Machine Learning - piazza.com

Unimodal Classification – Visual Modality

Multi-class classification problem: Dog, or Cat, or Duck, or Pig, or Bird?

Input observation x_i: a color image, where each pixel is represented in ℝ^d (d is the number of colors; d = 3 for RGB)

Label y_i ∈ Y = {0, 1, 2, 3, …}

Page 7: Multimodal Machine Learning - piazza.com

Unimodal Classification – Visual Modality

Multi-label (or multi-task) classification problem: Dog? Cat? Duck? Pig? Bird? Puppy?

Input observation x_i: a color image, where each pixel is represented in ℝ^d (d is the number of colors; d = 3 for RGB)

Label vector y_i ∈ Y^m = {0, 1}^m

Page 8: Multimodal Machine Learning - piazza.com

Unimodal Classification – Visual Modality

Multi-label (or multi-task) regression problem: Weight? Height? Age? Distance? Happy?

Input observation x_i: a color image, where each pixel is represented in ℝ^d (d is the number of colors; d = 3 for RGB)

Label vector y_i ∈ Y^m = ℝ^m

Page 9: Multimodal Machine Learning - piazza.com

Unimodal Classification – Language Modality

(Written language, spoken language)

Input observation x_i: a "one-hot" vector (a single 1; all other entries are 0)

Dimensionality of x_i = number of words in the dictionary

Word-level classification:
▪ Sentiment? (positive or negative)
▪ Part-of-speech? (noun, verb, …)
▪ Named entity? (names of persons, …)

Page 10: Multimodal Machine Learning - piazza.com

Unimodal Classification – Language Modality

(Written language, spoken language)

Input observation x_i: a "bag-of-words" vector (which words occur in the document)

Dimensionality of x_i = number of words in the dictionary

Document-level classification:
▪ Sentiment? (positive or negative)
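To make the two text representations concrete, here is a minimal NumPy sketch; the tiny dictionary and example tokens are hypothetical, not from the lecture:

```python
import numpy as np

# Hypothetical toy dictionary (in practice this is the full vocabulary).
dictionary = ["good", "bad", "movie", "great", "boring"]
word_to_index = {w: i for i, w in enumerate(dictionary)}

def one_hot(word):
    """Word-level input x_i: a single 1 at the word's dictionary index."""
    x = np.zeros(len(dictionary))
    x[word_to_index[word]] = 1.0
    return x

def bag_of_words(tokens):
    """Document/utterance-level input x_i: indicator of which words occur."""
    x = np.zeros(len(dictionary))
    for t in tokens:
        if t in word_to_index:
            x[word_to_index[t]] = 1.0   # or += 1.0 for word counts
    return x

print(one_hot("movie"))                     # [0. 0. 1. 0. 0.]
print(bag_of_words("great movie".split()))  # [0. 0. 1. 1. 0.]
```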

Page 11: Multimodal Machine Learning - piazza.com

Unimodal Classification – Language Modality

(Written language, spoken language)

Input observation x_i: a "bag-of-words" vector (which words occur in the utterance)

Dimensionality of x_i = number of words in the dictionary

Utterance-level classification:
▪ Sentiment? (positive or negative)

Page 12: Multimodal Machine Learning - piazza.com

Unimodal Classification – Acoustic Modality

Digitized acoustic signal:
• Sampling rates: 8–96 kHz
• Bit depth: 8, 16 or 24 bits
• Time window size: 20 ms
• Offset: 10 ms

Input observation x_i: a spectrogram frame (vector of spectral energies for one time window)

Spoken word?

Page 13: Multimodal Machine Learning - piazza.com

Unimodal Classification – Acoustic Modality

Digitized acoustic signal:
• Sampling rates: 8–96 kHz
• Bit depth: 8, 16 or 24 bits
• Time window size: 20 ms
• Offset: 10 ms

Input observation x_i: a sequence of spectrogram frames

Spoken word? Voice quality? Emotion?
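As a rough illustration of how the 20 ms window and 10 ms offset yield one acoustic input vector per time step, here is a small NumPy sketch; the 16 kHz sample rate, Hann window, and synthetic sine signal are assumptions made for the example, not lecture specifics:

```python
import numpy as np

def spectrogram(signal, sample_rate=16000, window_ms=20, offset_ms=10):
    """Frame the signal into 20 ms windows every 10 ms and take magnitude spectra."""
    win = int(sample_rate * window_ms / 1000)   # samples per window
    hop = int(sample_rate * offset_ms / 1000)   # samples between window starts
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * np.hanning(win)
        frames.append(np.abs(np.fft.rfft(frame)))   # one spectral vector x_i
    return np.array(frames)   # shape: (num_frames, win // 2 + 1)

# One second of synthetic audio at 16 kHz (stand-in for a real recording).
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
S = spectrogram(audio)
print(S.shape)   # about 99 frames of 161 frequency bins each
```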

Page 14: Multimodal Machine Learning - piazza.com

Data-Driven Machine Learning

Page 15: Multimodal Machine Learning - piazza.com

Data-Driven Machine Learning

1. Dataset: Collection of labeled samples D = {x_i, y_i}

2. Training: Learn classifier on training set

3. Testing: Evaluate classifier on hold-out test set

[Dataset split: training set | test set]

Page 16: Multimodal Machine Learning - piazza.com

Simple Classifier?

Dataset: Dog, or Basket, or Kayak, or Traffic light?

Page 17: Multimodal Machine Learning - piazza.com

Simple Classifier: Nearest Neighbor

Training: Dog, or Basket, or Kayak, or Traffic light?

Page 18: Multimodal Machine Learning - piazza.com

Nearest Neighbor Classifier

▪ Non-parametric approaches – key ideas:
  ▪ "Let the data speak for themselves"
  ▪ "Predict new cases based on similar cases"
  ▪ "Use multiple local models instead of a single global model"

▪ What is the complexity of the NN classifier w.r.t. a training set of N images and a test set of M images?
  ▪ At training time? O(1)
  ▪ At test time? O(N) per test image

Page 19: Multimodal Machine Learning - piazza.com

Simple Classifier: Nearest Neighbor

Distance metrics:

L1 (Manhattan) distance:  d1(x1, x2) = Σ_j |x1,j − x2,j|

L2 (Euclidean) distance:  d2(x1, x2) = ( Σ_j (x1,j − x2,j)^2 )^(1/2)

Which distance metric to use? First hyper-parameter!
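A minimal sketch of the two distance metrics in NumPy; the example vectors are made up:

```python
import numpy as np

def l1_distance(x1, x2):
    """L1 (Manhattan): d1(x1, x2) = sum_j |x1_j - x2_j|"""
    return np.sum(np.abs(x1 - x2))

def l2_distance(x1, x2):
    """L2 (Euclidean): d2(x1, x2) = sqrt(sum_j (x1_j - x2_j)^2)"""
    return np.sqrt(np.sum((x1 - x2) ** 2))

a = np.array([0.0, 1.0, 2.0])
b = np.array([1.0, 1.0, 4.0])
print(l1_distance(a, b))   # 3.0
print(l2_distance(a, b))   # ~2.236
```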

Page 20: Multimodal Machine Learning - piazza.com

Definition of K-Nearest Neighbor

What value should we set K to? Second hyper-parameter!

[Figure: the same query point X classified with (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
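Putting the pieces together, here is a small K-nearest-neighbor sketch with K and the distance metric as the two hyper-parameters; the class name, toy data, and choice of L2 distance are illustrative assumptions:

```python
import numpy as np

class KNearestNeighbor:
    """Minimal k-NN classifier: O(1) training (memorize the data),
    O(N) distance computations per test example."""

    def __init__(self, k=1):
        self.k = k   # second hyper-parameter

    def train(self, X, y):
        self.X_train, self.y_train = X, y   # just store the training set

    def predict(self, x):
        # L2 distances from x to every training example
        # (the first hyper-parameter would be choosing L1 vs. L2 here).
        dists = np.sqrt(np.sum((self.X_train - x) ** 2, axis=1))
        nearest = np.argsort(dists)[:self.k]
        labels, counts = np.unique(self.y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]   # majority vote among the k neighbors

# Toy 2-D data: class 0 near the origin, class 1 near (5, 5).
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
knn = KNearestNeighbor(k=3)
knn.train(X, y)
print(knn.predict(np.array([0.5, 0.5])))   # 0
print(knn.predict(np.array([5.5, 5.0])))   # 1
```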

Page 21: Multimodal Machine Learning - piazza.com

Data-Driven Approach

1. Dataset: Collection of labeled samples D = {x_i, y_i}

2. Training: Learn classifier on training set

3. Validation: Select optimal hyper-parameters

4. Testing: Evaluate classifier on hold-out test set

[Dataset split: training data | validation data | test data]

Page 22: Multimodal Machine Learning - piazza.com

Evaluation methods (for validation and testing)

โ–ช Holdout set: The available data set D is divided into two disjoint subsets, โ–ช the training set Dtrain (for learning a model)

โ–ช the test set Dtest (for testing the model)

โ–ช Important: training set should not be used in testing and the test set should not be used in learning. โ–ช Unseen test set provides a unbiased estimate of accuracy.

โ–ช The test set is also called the holdout set. (the examples in the original data set D are all labeled with classes.)

โ–ช This method is mainly used when the data set D is large.

โ–ช Holdout methods can also be used for validation
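One possible holdout split in NumPy, assuming a random shuffle; the helper name and split fractions are hypothetical choices for the example:

```python
import numpy as np

def holdout_split(X, y, val_frac=0.1, test_frac=0.2, seed=0):
    """Shuffle the dataset and carve out disjoint train / validation / test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])

X = np.arange(20).reshape(10, 2).astype(float)
y = np.arange(10) % 2
train, val, test = holdout_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))   # 7 1 2
```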

Page 23: Multimodal Machine Learning - piazza.com

Evaluation methods (for validation and testing)

▪ n-fold cross-validation: the available data is partitioned into n equal-size disjoint subsets.

▪ Use each subset as the test set and combine the remaining n−1 subsets as the training set to learn a classifier.

▪ The procedure is run n times, which gives n accuracies.

▪ The final estimated accuracy of learning is the average of the n accuracies.

▪ 10-fold and 5-fold cross-validation are commonly used.

▪ This method is used when the available data is not large.
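A sketch of n-fold cross-validation under these assumptions; the train_and_score callback and the majority-vote baseline are placeholders for an actual classifier:

```python
import numpy as np

def cross_validate(X, y, train_and_score, n_folds=5, seed=0):
    """Each fold serves as the test set once; the remaining n-1 folds form the
    training set. Return the average of the n accuracies."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    accuracies = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        accuracies.append(train_and_score(X[train_idx], y[train_idx],
                                          X[test_idx], y[test_idx]))
    return np.mean(accuracies)

def majority_baseline(X_tr, y_tr, X_te, y_te):
    """Trivial stand-in classifier: always predict the majority training label."""
    labels, counts = np.unique(y_tr, return_counts=True)
    return np.mean(y_te == labels[np.argmax(counts)])

X = np.random.default_rng(1).normal(size=(40, 3))
y = np.array([0] * 25 + [1] * 15)
print(cross_validate(X, y, majority_baseline, n_folds=5))
```

Setting n_folds equal to the number of examples turns this into the leave-one-out cross-validation described on the next slide.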

Page 24: Multimodal Machine Learning - piazza.com

Evaluation methods (for validation and testing)

▪ Leave-one-out cross-validation: this method is used when the data set is very small.

▪ Each fold of the cross-validation has only a single test example, and all the rest of the data is used in training.

▪ If the original data has m examples, this is m-fold cross-validation.

▪ It is a special case of cross-validation.

Page 25: Multimodal Machine Learning - piazza.com

Linear Classification: Scores and Loss

Page 26: Multimodal Machine Learning - piazza.com

Linear Classification (e.g., neural network)

1. Define a (linear) score function

2. Define the loss function (possibly nonlinear)

3. Optimization

Input: image (size: 32*32*3)

Page 27: Multimodal Machine Learning - piazza.com

1) Score Function

What should be the prediction score for each label class (Dog? Cat? Duck? Pig? Bird?)

For a linear classifier:  f(x_i; W, b) = W x_i + b

▪ x_i: input observation (i-th element of the dataset) – the image (size 32*32*3) stretched into a [3072x1] column vector
▪ W: weights, [10x3072]
▪ b: bias vector, [10x1]
▪ W, b: the parameters of the classifier
▪ f(x_i; W, b): class scores, [10x1]
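A minimal sketch of this score function with the shapes from the slide; the random parameters simply stand in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)

x_i = rng.normal(size=(3072, 1))        # image stretched into a [3072 x 1] column
W = rng.normal(size=(10, 3072)) * 0.01  # weights, [10 x 3072]
b = np.zeros((10, 1))                   # bias vector, [10 x 1]

def score(x, W, b):
    """Linear score function f(x_i; W, b) = W x_i + b -> one score per class."""
    return W @ x + b

f = score(x_i, W, b)
print(f.shape)   # (10, 1): ten class scores
```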

Page 28: Multimodal Machine Learning - piazza.com

Interpreting a Linear Classifier

The planar decision surface in data-space for the simple linear discriminant function:

W x_i + b > 0

[Figure: the hyperplane f(x) = 0 with normal vector w; its distance from the origin is −b / ‖w‖]

Page 29: Multimodal Machine Learning - piazza.com

Some Notation Tricks – Multi-Label Classification

f(x_i; W, b) = W x_i + b   becomes   f(x_i; W) = W x_i

Weights [10x3072] times input [3072x1], plus bias [10x1]
  becomes   weights [10x3073] times input [3073x1]

▪ Add a "1" at the end of the input observation vector
▪ The bias vector becomes the last column of the weight matrix:

W = [W_1  W_2  …  W_b]

Page 30: Multimodal Machine Learning - piazza.com

Some Notation Tricks

General formulation of a linear classifier:  f(x_i; W, b)

"dog" linear classifier:  f(x_i; W, b)_dog   or   f_dog   or   f(x_i; W_dog, b_dog)

Linear classifier for label j:  f(x_i; W, b)_j   or   f_j   or   f(x_i; W_j, b_j)

Page 31: Multimodal Machine Learning - piazza.com

Interpreting Multiple Linear Classifiers

f(x_i; W_j, b_j) = W_j x_i + b_j

[Figure: CIFAR-10 object recognition dataset, with linear decision boundaries for f_car, f_deer and f_airplane]

Page 32: Multimodal Machine Learning - piazza.com

Linear Classification: 2) Loss Function

(or cost function or objective)

Input x_i: image (size: 32*32*3), with label y_i = 2 (dog)

Scores f(x_i; W):
  2 (dog):   98.7
  1 (cat):   45.6
  0 (duck): −12.3
  3 (pig):   12.2
  4 (bird): −45.3

Loss L_i = ?

Multi-class problem: how to assign only one number representing how "unhappy" we are about these scores?

The loss function quantifies the amount by which the prediction scores deviate from the actual values.

A first challenge: how to normalize the scores?

Page 33: Multimodal Machine Learning - piazza.com

First Loss Function: Cross-Entropy Loss

(or logistic loss)

Logistic function:  σ(f) = 1 / (1 + e^(−f))

➢ f is the score function

[Plot: σ(f) rises from 0 to 1, passing through 0.5 at f = 0]

Page 34: Multimodal Machine Learning - piazza.com

First Loss Function: Cross-Entropy Loss

(or logistic loss)

Logistic function:  σ(f) = 1 / (1 + e^(−f))

Logistic regression (two classes):

p(y_i = "dog" | x_i; w) = σ(w^T x_i)

(y_i = "dog" plays the role of the "true" class in this two-class problem)

➢ f is the score function

Page 35: Multimodal Machine Learning - piazza.com

First Loss Function: Cross-Entropy Loss

(or logistic loss)

Logistic function:  σ(f) = 1 / (1 + e^(−f))

Logistic regression (two classes):  p(y_i = "dog" | x_i; w) = σ(w^T x_i)

Softmax function (multiple classes):  p(y_i | x_i; W) = e^(f_{y_i}) / Σ_j e^(f_j)

Page 36: Multimodal Machine Learning - piazza.com

First Loss Function: Cross-Entropy Loss

(or logistic loss)

Cross-entropy loss:  L_i = −log( e^(f_{y_i}) / Σ_j e^(f_j) )

(the negative log of the softmax function)

Minimizing the negative log likelihood.
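A minimal NumPy sketch of the softmax and the cross-entropy loss, evaluated on the example scores from the earlier slide; the max-shift for numerical stability is an implementation detail not stated on the slides, and index 0 is taken to be the correct ("dog") class:

```python
import numpy as np

def softmax(f):
    """p(y | x; W) = e^(f_y) / sum_j e^(f_j), with a max-shift for stability."""
    f = f - np.max(f)
    e = np.exp(f)
    return e / np.sum(e)

def cross_entropy_loss(f, y_true):
    """L_i = -log( e^(f_{y_i}) / sum_j e^(f_j) ): negative log likelihood of the true class."""
    return -np.log(softmax(f)[y_true])

scores = np.array([98.7, 45.6, -12.3, 12.2, -45.3])  # index 0 holds the correct-class score
print(softmax(scores))                 # almost all probability mass on the correct class
print(cross_entropy_loss(scores, 0))   # very small loss when the true class has the top score
```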

Page 37: Multimodal Machine Learning - piazza.com

Second Loss Function: Hinge Loss

(or max-margin loss or Multi-class SVM loss)

L_i = Σ_{j ≠ y_i} max(0, f_j − f_{y_i} + margin)

▪ L_i: loss due to example i
▪ Σ_{j ≠ y_i}: sum over all incorrect labels
▪ f_{y_i} − f_j: difference between the correct class score and an incorrect class score

Page 38: Multimodal Machine Learning - piazza.com

Second Loss Function: Hinge Loss

(or max-margin loss or Multi-class SVM loss)

Example: [worked computation of the hinge loss on example scores; margin e.g. 10]
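A corresponding sketch of the hinge loss; the margin is exposed as a parameter, and treating index 0 as the correct class is an assumption for the example:

```python
import numpy as np

def hinge_loss(f, y_true, margin=1.0):
    """Multi-class SVM loss for one example:
    L_i = sum over incorrect labels j of max(0, f_j - f_{y_i} + margin)."""
    terms = np.maximum(0.0, f - f[y_true] + margin)
    terms[y_true] = 0.0   # do not sum over the correct label
    return np.sum(terms)

scores = np.array([98.7, 45.6, -12.3, 12.2, -45.3])  # index 0 = correct class here
print(hinge_loss(scores, y_true=0))   # 0.0: every incorrect score is below the margin
```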

Page 39: Multimodal Machine Learning - piazza.com


Two Loss Functions

Page 40: Multimodal Machine Learning - piazza.com

Basic Concepts: Neural Networks

Page 41: Multimodal Machine Learning - piazza.com

Neural Networks – inspiration

▪ Made up of artificial neurons

Page 42: Multimodal Machine Learning - piazza.com

Neural Networks – score function

▪ Made up of artificial neurons

▪ Linear function (dot product) followed by a nonlinear activation function

▪ Example: a Multi-Layer Perceptron

Page 43: Multimodal Machine Learning - piazza.com

Basic NN building block

▪ Weighted sum followed by an activation function:

Input → weighted sum (W x + b) → activation function → output

y = f(W x + b)

Page 44: Multimodal Machine Learning - piazza.com

Neural Networks – activation function

▪ Tanh – f(x) = tanh(x)

▪ Sigmoid – f(x) = (1 + e^(−x))^(−1)

▪ Linear – f(x) = ax + b

▪ ReLU (Rectified Linear Unit) – f(x) = max(0, x) ≈ log(1 + exp(x))
  ▪ Faster training – no gradient vanishing
  ▪ Induces sparsity
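For reference, the activation functions written out in NumPy, plus the softplus used above as the smooth approximation of ReLU; the sample inputs are arbitrary:

```python
import numpy as np

def linear(x, a=1.0, b=0.0):
    return a * x + b                  # f(x) = ax + b

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # f(x) = (1 + e^-x)^-1

def tanh(x):
    return np.tanh(x)                 # f(x) = tanh(x)

def relu(x):
    return np.maximum(0.0, x)         # f(x) = max(0, x)

def softplus(x):
    return np.log1p(np.exp(x))        # smooth approximation log(1 + e^x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (linear, sigmoid, tanh, relu, softplus):
    print(fn.__name__, np.round(fn(z), 3))
```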

Page 45: Multimodal Machine Learning - piazza.com

Multi-Layer Feedforward Network

Score function:

y_i = f(x_i) = f_{3;W3}( f_{2;W2}( f_{1;W1}(x_i) ) )

Activation functions (individual layers):

f_{1;W1}(x) = σ(W1 x + b1)
f_{2;W2}(x) = σ(W2 x + b2)
f_{3;W3}(x) = σ(W3 x + b3)

Loss function (e.g., Euclidean loss):

L_i = (f(x_i) − y_i)^2 = ( f_{3;W3}( f_{2;W2}( f_{1;W1}(x_i) ) ) − y_i )^2

Page 46: Multimodal Machine Learning - piazza.com

Neural Networks – inference and learning

▪ Inference (testing)
  ▪ Use the score function (y = f(x; W))
  ▪ Have a trained model (parameters W)

▪ Learning model parameters (training)
  ▪ Loss function (L)
  ▪ Gradient
  ▪ Optimization