Louis-Philippe Morency
Multimodal Machine Learning
Lecture 2.1: Basic Concepts – Neural Networks
* Original version co-developed with Tadas Baltrusaitis
Lecture Objectives
▪ Unimodal basic representations
  ▪ Visual, language and acoustic modalities
▪ Data-driven machine learning
  ▪ Training, validation and testing
  ▪ Example: K-nearest neighbor
▪ Linear classification
  ▪ Score function
  ▪ Two loss functions (cross-entropy and hinge loss)
▪ Neural networks
▪ Course project team formation
Multimodal Machine Learning
[Figure: the verbal, vocal and visual modalities]
These challenges are non-exclusive.
Unimodal Basic Representations
Unimodal Classification – Visual Modality
Binary classification problem: Dog?
Input observation x_i: a color image, where each pixel is represented in ℝ^d (d is the number of colors; d = 3 for RGB)
Label y_i ∈ Y = {0, 1}
Unimodal Classification – Visual Modality
Multi-class classification problem: Dog -or- Cat -or- Duck -or- Pig -or- Bird?
Input observation x_i: a color image, where each pixel is represented in ℝ^d (d = 3 for RGB)
Label y_i ∈ Y = {0, 1, 2, 3, …}
Unimodal Classification – Visual Modality
Multi-label (or multi-task) classification problem: Dog? Cat? Duck? Pig? Bird? Puppy?
Input observation x_i: a color image, where each pixel is represented in ℝ^d (d = 3 for RGB)
Label vector y_i ∈ Y^c = {0, 1}^c
Unimodal Classification – Visual Modality
Multi-label (or multi-task) regression problem: Weight? Height? Age? Distance? Happy?
Input observation x_i: a color image, where each pixel is represented in ℝ^d (d = 3 for RGB)
Label vector y_i ∈ Y^c = ℝ^c
Unimodal Classification – Language Modality
Written language or spoken language
Input observation x_i: a "one-hot" vector (a single 1, all other entries 0)
n_x = number of words in the dictionary
Word-level classification:
▪ Sentiment? (positive or negative)
▪ Part-of-speech? (noun, verb, …)
▪ Named entity? (names of person, …)
Unimodal Classification – Language Modality
Written language or spoken language
Input observation x_i: a "bag-of-words" vector (a 1 for every dictionary word that appears)
n_x = number of words in the dictionary
Document-level classification:
▪ Sentiment? (positive or negative)
Unimodal Classification – Language Modality
Written language or spoken language
Input observation x_i: a "bag-of-words" vector (a 1 for every dictionary word that appears)
n_x = number of words in the dictionary
Utterance-level classification:
▪ Sentiment? (positive or negative)
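As a concrete illustration of the one-hot and bag-of-words vectors above, here is a minimal numpy sketch; the toy dictionary, sentence, and function names are invented for this example:

```python
import numpy as np

# Toy dictionary; in practice n_x is the full dictionary size.
dictionary = ["cat", "dog", "the", "sat", "mat", "on"]
word_index = {w: i for i, w in enumerate(dictionary)}
n_x = len(dictionary)

def one_hot(word):
    """Word-level input: a single 1 at the word's dictionary index."""
    x = np.zeros(n_x)
    x[word_index[word]] = 1.0
    return x

def bag_of_words(tokens):
    """Document/utterance-level input: sum of one-hot vectors.
    np.minimum(..., 1) would give the binary variant shown above."""
    return sum(one_hot(w) for w in tokens)

print(one_hot("dog"))                                  # [0. 1. 0. 0. 0. 0.]
print(bag_of_words("the dog sat on the mat".split()))  # word counts
```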
Unimodal Classification – Acoustic Modality
Digitized acoustic signal:
▪ Sampling rates: 8–96 kHz
▪ Bit depth: 8, 16 or 24 bits
▪ Time window size: 20 ms
▪ Offset: 10 ms
Input observation x_i: one spectrogram column (one time window)
Spoken word?
Unimodal Classification – Acoustic Modality
Digitized acoustic signal:
▪ Sampling rates: 8–96 kHz
▪ Bit depth: 8, 16 or 24 bits
▪ Time window size: 20 ms
▪ Offset: 10 ms
Input observation x_i: a sequence of spectrogram columns
Spoken word? Voice quality? Emotion?
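The windowing parameters above (20 ms windows with a 10 ms offset) can be made concrete with a short numpy sketch; the 16 kHz sampling rate, the Hann window, and the function name are choices made for this illustration, not part of the lecture:

```python
import numpy as np

def spectrogram_frames(signal, sample_rate=16000, win_ms=20, hop_ms=10):
    """Slice a mono waveform into overlapping windows and return the
    magnitude spectrum of each window (one spectrogram column per frame)."""
    win = int(sample_rate * win_ms / 1000)   # samples per window (320 here)
    hop = int(sample_rate * hop_ms / 1000)   # samples per offset (160 here)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[t * hop : t * hop + win] for t in range(n_frames)])
    frames = frames * np.hanning(win)        # taper to reduce spectral leakage
    return np.abs(np.fft.rfft(frames, axis=1))

spec = spectrogram_frames(np.random.randn(16000))  # 1 s of synthetic audio
print(spec.shape)  # (99, 161): each row is one input observation x_i
```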
Data-Driven Machine Learning
1. Dataset: Collection of labeled samples D = {x_i, y_i}
2. Training: Learn the classifier on the training set
3. Testing: Evaluate the classifier on a hold-out test set
[Figure: the dataset split into a training set and a test set]
Simple Classifier?
Dataset: Basket -or- Dog -or- Kayak -or- Traffic light?
Simple Classifier: Nearest Neighbor
Training set: Basket -or- Dog -or- Kayak -or- Traffic light?
Nearest Neighbor Classifier
▪ Non-parametric approaches: key ideas
  ▪ "Let the data speak for themselves"
  ▪ "Predict new cases based on similar cases"
  ▪ "Use multiple local models instead of a single global model"
▪ What is the complexity of the NN classifier w.r.t. a training set of N images and a test set of M images?
  ▪ At training time? O(1)
  ▪ At test time? O(N)
Simple Classifier: Nearest Neighbor
Distance metrics:
L1 (Manhattan) distance: d_1(x_1, x_2) = Σ_p |x_1^p − x_2^p|
L2 (Euclidean) distance: d_2(x_1, x_2) = √( Σ_p (x_1^p − x_2^p)² )
Which distance metric to use? First hyper-parameter!
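A minimal sketch of the nearest-neighbor classifier with both distance metrics, assuming flattened image vectors in numpy; the function name and toy data are invented for illustration:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1, metric="l2"):
    """Predict the label of x by majority vote over its k nearest
    training examples. Training is O(1) (just store the data);
    each prediction is O(N) in the training-set size N."""
    if metric == "l1":
        dists = np.abs(X_train - x).sum(axis=1)            # Manhattan
    else:
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean
    nearest = np.argsort(dists)[:k]                        # k closest indices
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                       # majority vote

# Toy usage: five flattened 32x32x3 "images" with binary labels.
X = np.random.rand(5, 3072)
y = np.array([0, 1, 0, 1, 1])
print(knn_predict(X, y, X[2], k=3))
```

Both hyper-parameters, the distance metric and K (introduced on the next slide), appear as keyword arguments.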
Definition of K-Nearest Neighbor
What value should we set K to? Second hyper-parameter!
[Figure: decision regions for (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
Data-Driven Approach
1. Dataset: Collection of labeled samples D = {x_i, y_i}
2. Training: Learn the classifier on the training set
3. Validation: Select optimal hyper-parameters
4. Testing: Evaluate the classifier on a hold-out test set
[Figure: the dataset split into training data, validation data and test data]
Evaluation methods (for validation and testing)
▪ Holdout set: The available data set D is divided into two disjoint subsets:
  ▪ the training set D_train (for learning a model)
  ▪ the test set D_test (for testing the model)
▪ Important: the training set should not be used in testing, and the test set should not be used in learning.
  ▪ An unseen test set provides an unbiased estimate of accuracy.
▪ The test set is also called the holdout set. (The examples in the original data set D are all labeled with classes.)
▪ This method is mainly used when the data set D is large.
▪ Holdout methods can also be used for validation.
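A minimal sketch of the holdout split, assuming numpy arrays; the 80/20 split fraction and function name are arbitrary choices for illustration:

```python
import numpy as np

def holdout_split(X, y, test_frac=0.2, seed=0):
    """Divide (X, y) into disjoint train and test subsets at random."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))      # shuffle example indices
    n_test = int(len(X) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]
```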
Evaluation methods (for validation and testing)
▪ n-fold cross-validation: The available data is partitioned into n equal-size disjoint subsets.
▪ Use each subset as the test set and combine the remaining n−1 subsets as the training set to learn a classifier.
▪ The procedure is run n times, which gives n accuracies.
▪ The final estimated accuracy of learning is the average of the n accuracies.
▪ 10-fold and 5-fold cross-validation are commonly used.
▪ This method is used when the available data is not large.
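A sketch of n-fold cross-validation under the same assumptions; `train_and_eval` is a placeholder for any routine that trains on one split and returns accuracy on the other:

```python
import numpy as np

def cross_validate(X, y, train_and_eval, n_folds=5, seed=0):
    """Average accuracy over n train/test runs, one per fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    accs = []
    for i in range(n_folds):
        test = folds[i]                                    # fold i is the test set
        train = np.concatenate(folds[:i] + folds[i + 1:])  # other n-1 folds train
        accs.append(train_and_eval(X[train], y[train], X[test], y[test]))
    return float(np.mean(accs))  # final estimate: average of the n accuracies
```

Setting n_folds to the number of examples gives the leave-one-out variant described next.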
Evaluation methods (for validation and testing)
▪ Leave-one-out cross-validation: This method is used when the data set is very small.
▪ Each fold of the cross-validation has only a single test example, and all the rest of the data is used in training.
▪ If the original data has m examples, this is m-fold cross-validation.
▪ It is a special case of cross-validation.
Linear Classification: Scores and Loss
Linear Classification (e.g., neural network)
1. Define a (linear) score function
2. Define the loss function (possibly nonlinear)
3. Optimization
Input: an image (size: 32×32×3)
1) Score Function
What should be the prediction score for each label class? (Dog? Cat? Duck? Pig? Bird?)
For a linear classifier: f(x_i; W, b) = W x_i + b
▪ x_i: input observation, the ith element of the dataset; an image of size 32×32×3 flattened into a [3072×1] vector
▪ W: weights, [10×3072]
▪ b: bias vector, [10×1]
▪ Parameters (W, b) together: [10×3073]; class scores f: [10×1]
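A minimal numpy sketch of this score function with the dimensions from the slide; the random parameter values are placeholders for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(10, 3072))  # weights      [10 x 3072]
b = np.zeros(10)                             # bias vector  [10 x 1]
x_i = rng.random(3072)                       # flattened 32*32*3 image [3072 x 1]

scores = W @ x_i + b                         # class scores [10 x 1]
print(scores.argmax())                       # highest-scoring class
```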
Interpreting a Linear Classifier
The planar decision surface in data-space for the simple linear discriminant function: W x_i + b > 0
[Figure: the decision hyperplane y(x) = 0, with normal vector w and offset −b/‖w‖ from the origin; points with y(x) > 0 lie on the positive side]
Some Notation Tricks – Multi-Label Classification
f(x_i; W, b) = W x_i + b, with input [3072×1], weights [10×3072] and bias [10×1],
can be rewritten as f(x_i; W) = W x_i, with input [3073×1] and weights [10×3073]:
▪ Add a "1" at the end of the input observation vector
▪ The bias vector becomes the last column of the weight matrix
W = [w_1 w_2 … w_c]ᵀ, with one weight row per label
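A sketch of this bias trick in numpy, checking that folding b into W leaves the class scores unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 3072))
b = rng.normal(size=10)
x = rng.random(3072)

W_tilde = np.hstack([W, b[:, None]])  # [10 x 3073]: bias as the last column
x_tilde = np.append(x, 1.0)           # [3073 x 1]: a "1" appended to the input

assert np.allclose(W @ x + b, W_tilde @ x_tilde)  # identical class scores
```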
Some Notation Tricks
General formulation of a linear classifier: f(x_i; W, b)
The "dog" linear classifier: f(x_i; W, b)_dog = s_dog, or f(x_i; w_dog, b_dog)
The linear classifier for label j: f(x_i; W, b)_j = s_j, or f(x_i; w_j, b_j)
Interpreting Multiple Linear Classifiers
f(x_i; w_j, b_j) = w_j x_i + b_j
[Figure: weight vectors w_j learned on the CIFAR-10 object recognition dataset, visualized as class template images]
Linear Classification: 2) Loss Function
(or cost function or objective)
Given the score function f(x_i; W), an image x_i (size: 32×32×3) with label y_i = 2 (dog), and the scores:

  Class      Score
  2 (dog)     98.7
  1 (cat)     45.6
  0 (duck)   −12.3
  3 (pig)     12.2
  4 (bird)   −45.3

Loss L_i = ?
Multi-class problem: how to assign only one number representing how "unhappy" we are about these scores?
The loss function quantifies the amount by which the prediction scores deviate from the actual values.
A first challenge: how to normalize the scores?
First Loss Function: Cross-Entropy Loss
(or logistic loss)
Logistic function (applied to a score a): σ(a) = 1 / (1 + e^(−a))
[Figure: the logistic function rises from 0 to 1, with σ(0) = 0.5]
Logistic regression (two classes):
P(y_i = "dog" | x_i; w) = σ(w^T x_i), where "dog" = true for the two-class problem
Softmax function (multiple classes):
P(y_i | x_i; W) = e^(s_{y_i}) / Σ_j e^(s_j)
Cross-entropy loss:
L_i = −log( e^(s_{y_i}) / Σ_j e^(s_j) )
Minimizing the negative log likelihood.
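A sketch of the softmax and cross-entropy loss on the example scores from the loss-function slide; the max-subtraction step is a standard numerical-stability detail, not something stated on the slide:

```python
import numpy as np

def cross_entropy_loss(scores, y):
    """L_i = -log( exp(s_y) / sum_j exp(s_j) ), computed in log space."""
    shifted = scores - scores.max()  # subtracting a constant leaves softmax unchanged
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[y]

scores = np.array([-12.3, 45.6, 98.7, 12.2, -45.3])  # classes 0..4, dog = 2
print(cross_entropy_loss(scores, y=2))  # near 0: the dog score dominates
```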
Second Loss Function: Hinge Loss
(or max-margin loss or multi-class SVM loss)
L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)
▪ L_i: the loss due to example i
▪ Σ_{j ≠ y_i}: sum over all incorrect labels
▪ s_j − s_{y_i}: the difference between the incorrect class score and the correct class score, penalized whenever the correct class does not win by the margin
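A matching sketch of the hinge loss, assuming the margin of 1 used in the formula above:

```python
import numpy as np

def hinge_loss(scores, y, margin=1.0):
    """L_i = sum over incorrect labels j of max(0, s_j - s_y + margin)."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0  # the correct class is excluded from the sum
    return margins.sum()

scores = np.array([-12.3, 45.6, 98.7, 12.2, -45.3])  # classes 0..4, dog = 2
print(hinge_loss(scores, y=2))  # 0.0: dog beats every other class by more than 1
```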
Two Loss Functions
[Figure: comparison of the cross-entropy and hinge loss functions]
Basic Concepts: Neural Networks
Neural Networks – inspiration
▪ Made up of artificial neurons
Neural Networks – score function
▪ Made up of artificial neurons
▪ Linear function (dot product) followed by a nonlinear activation function
▪ Example: a Multi-Layer Perceptron
Basic NN building block
▪ Weighted sum followed by an activation function
  Weighted sum: W x + b
  Output: y = f(W x + b)
Neural Networks – activation function
▪ Tanh: f(x) = tanh(x)
▪ Sigmoid: f(x) = (1 + e^(−x))^(−1)
▪ Linear: f(x) = ax + b
▪ ReLU (Rectified Linear Units): f(x) = max(0, x) ≈ log(1 + exp(x))
  ▪ Faster training: no gradient vanishing
  ▪ Induces sparsity
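The activation functions listed above, written as numpy one-liners for reference; `softplus` is the smooth approximation of the ReLU mentioned on the slide:

```python
import numpy as np

def tanh(x):      return np.tanh(x)
def sigmoid(x):   return 1.0 / (1.0 + np.exp(-x))
def linear(x, a=1.0, b=0.0): return a * x + b
def relu(x):      return np.maximum(0.0, x)
def softplus(x):  return np.log1p(np.exp(x))  # ~ log(1 + exp(x))

x = np.linspace(-3, 3, 7)
print(relu(x))      # zeros for negative inputs: induces sparsity
print(softplus(x))  # smooth, close to relu away from zero
```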
Multi-Layer Feedforward Network
Score function:
ŷ_i = f(x_i) = f_{W3;b3}( f_{W2;b2}( f_{W1;b1}(x_i) ) )
Activation functions (individual layers):
f_{W1;b1}(x) = f(W1 x + b1)
f_{W2;b2}(x) = f(W2 x + b2)
f_{W3;b3}(x) = f(W3 x + b3)
Loss function (e.g., Euclidean loss):
L_i = (f(x_i) − y_i)² = ( f_{W3;b3}( f_{W2;b2}( f_{W1;b1}(x_i) ) ) − y_i )²
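A sketch of the three-layer score and loss functions above in numpy; the layer sizes and random parameters are arbitrary placeholders for a trained model:

```python
import numpy as np

def layer(W, b, x, f=np.tanh):
    """One building block: weighted sum W x + b followed by activation f."""
    return f(W @ x + b)

def mlp_forward(params, x):
    """Score function: f_{W3;b3}( f_{W2;b2}( f_{W1;b1}(x) ) )."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = layer(W1, b1, x)
    h2 = layer(W2, b2, h1)
    return layer(W3, b3, h2, f=lambda z: z)  # linear output layer

def euclidean_loss(y_pred, y):
    """L_i = (f(x_i) - y_i)^2."""
    return float(np.sum((y_pred - y) ** 2))

rng = np.random.default_rng(0)
params = [(rng.normal(size=(8, 4)), np.zeros(8)),   # W1, b1
          (rng.normal(size=(8, 8)), np.zeros(8)),   # W2, b2
          (rng.normal(size=(1, 8)), np.zeros(1))]   # W3, b3
print(euclidean_loss(mlp_forward(params, rng.random(4)), np.array([1.0])))
```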
Neural Networks – inference and learning
▪ Inference (testing)
  ▪ Use the score function (y = f(x; W))
  ▪ Have a trained model (parameters W)
▪ Learning model parameters (training)
  ▪ Loss function (L)
  ▪ Gradient
  ▪ Optimization