Lecture 24: Conclusion


Page 1: Lecture 24: Conclusion

Lecture 24: Conclusion

Andreas Wichert, Department of Computer Science and Engineering

Técnico Lisboa

Page 2: Lecture 24: Conclusion

• Example of ML: Decision Trees
• Linear and Nonlinear Regression
• Perceptron/Logistic Regression
• Backpropagation
• Learning Theory
• K-Means, EM-Clustering
• RBF-Networks, SVM
• Model Selection, MDL Principle
• Deep Learning
• Convolution NN
• RNN
• KL-Transform, PCA, ICA
• Autoencoders
• Feature Extraction
• KNN, Weighted Regression
• Ensemble Methods
• Bayesian Networks
• Boltzmann Machine

Page 3: Lecture 24: Conclusion

Example of ML: Decision Trees

Page 4: Lecture 24: Conclusion

Top-Down Induction of Decision Trees ID3

1. A ← the “best” decision attribute for the next node
2. Assign A as the decision attribute (= property) for the node
3. For each value of A, create a new descendant
4. Sort the training examples to the leaf nodes according to the attribute value of the branch
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes

Page 5: Lecture 24: Conclusion

Heuristic function: Shannon Entropy

• Shannon formalized these intuitions
• Given a universe of messages M = {m1, m2, ..., mn} and a probability p(mi) for the occurrence of each message, the information content (also called entropy) of M is given by

$$I(M) = -\sum_{i=1}^{n} p(m_i)\,\log_2 p(m_i)$$

Page 6: Lecture 24: Conclusion

• The gain from the property P is computed by subtracting the expected information E(P) from the total information:

$$E(P) = \sum_{i=1}^{n} \frac{|C_i|}{|C|}\, I(C_i)$$

$$\mathrm{gain}(P) = I(C) - E(P)$$
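As a small illustration (my own sketch, not code from the lecture), the entropy and gain computations can be written directly from the two formulas above; the toy attribute and labels are made up:

```python
import math
from collections import Counter

def entropy(labels):
    """I(C) = -sum_i p(c_i) * log2 p(c_i), estimated from class label counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, labels, attribute):
    """gain(P) = I(C) - E(P), with E(P) the |C_i|/|C|-weighted entropy of the partitions."""
    n = len(labels)
    partitions = {}
    for x, y in zip(examples, labels):           # partition the examples by the attribute value
        partitions.setdefault(x[attribute], []).append(y)
    expected = sum(len(s) / n * entropy(s) for s in partitions.values())
    return entropy(labels) - expected

# Hypothetical toy data: one attribute ("outlook") and binary class labels.
X = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": "rain"}]
y = ["no", "no", "yes", "yes"]
print(gain(X, y, "outlook"))  # 1.0 bit: the attribute separates the classes perfectly
```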

Page 7: Lecture 24: Conclusion

Linear and Nonlinear Regression

Page 8: Lecture 24: Conclusion

Linear Regression

Page 9: Lecture 24: Conclusion

Sum-of-squares error

Page 10: Lecture 24: Conclusion

Design Matrix

Page 11: Lecture 24: Conclusion
Page 12: Lecture 24: Conclusion

• The dimensions change, since they are no longer determined by the dimensionality D of the input vector x
• The number of basis functions is M − 1

Page 13: Lecture 24: Conclusion

Posterior Density

Page 14: Lecture 24: Conclusion

Relation between Regularised Least-Squares and MAP

Page 15: Lecture 24: Conclusion

Perceptron/Logistic Regression

Page 16: Lecture 24: Conclusion

Perceptron (1957)

• Linear threshold unit (LTU)

[Figure: linear threshold unit with inputs x1, x2, ..., xn, weights w1, w2, ..., wn, bias weight w0 with x0 = 1, a summation unit Σ, and output o]

McCulloch-Pitts model of a neuron (1943)

The “bias”, a constant term that does not depend on any input value

Page 17: Lecture 24: Conclusion

Linearly separable patterns

X0=1, bias...

Page 18: Lecture 24: Conclusion

Perceptron learning rule

• Consider linearly separable problems
• How do we find appropriate weights?
• Initialize the weight vector w to some small random values
• Look whether the output pattern o belongs to the desired class, i.e. has the desired value d
• η is called the learning rate, 0 < η ≤ 1

Δw = η · (d − o) · x
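As an illustration only (not the lecture's code), the rule Δw = η · (d − o) · x can be implemented in a few lines of NumPy; the AND data set and the learning rate below are made-up examples:

```python
import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=100):
    """Perceptron learning: delta_w = eta * (d - o) * x, with x0 = 1 as the bias input."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend x0 = 1
    w = np.random.uniform(-0.05, 0.05, X.shape[1])  # small random initial weights
    for _ in range(epochs):
        for x, target in zip(X, d):
            o = 1 if w @ x > 0 else 0               # linear threshold unit output
            w += eta * (target - o) * x             # weights change only when o != d
    return w

# Hypothetical linearly separable example: logical AND.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])
w = train_perceptron(X, d)
print([(1 if w @ np.r_[1, x] > 0 else 0) for x in X])  # expected [0, 0, 0, 1]
```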

Page 19: Lecture 24: Conclusion

• The update rule for gradient descent is given by

Page 20: Lecture 24: Conclusion

Linear Unit

Page 21: Lecture 24: Conclusion

Sigmoid Unit

Page 22: Lecture 24: Conclusion

Logistic Regression

Page 23: Lecture 24: Conclusion

Logistic Regression

Page 24: Lecture 24: Conclusion

Sigmoid Unit versus Logistic Regression

Page 25: Lecture 24: Conclusion

Linear Unit versus Logistic Regression

Page 26: Lecture 24: Conclusion

Backpropagation

Page 27: Lecture 24: Conclusion

Back-propagation

• The algorithm gives a prescription for changing the weights wij in any feed-forward network so as to learn a training set of input-output pairs {xk, yk}
• We consider a simple two-layer network (see the sketch below)
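A minimal sketch of such a two-layer network trained by backpropagation, assuming sigmoid units and a squared-error loss (the derivation on the following slides is not reproduced here; the XOR task and all hyperparameters are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical task: XOR, which a single perceptron cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)   # input -> hidden weights
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)   # hidden -> output weights
eta = 0.5

for _ in range(10000):
    # Forward pass
    H = sigmoid(X @ W1 + b1)          # hidden activations
    O = sigmoid(H @ W2 + b2)          # network output
    # Backward pass: propagate the error from the output to the hidden layer
    dO = (O - Y) * O * (1 - O)        # delta at the output (squared-error loss)
    dH = (dO @ W2.T) * H * (1 - H)    # delta at the hidden layer
    # Gradient-descent weight updates
    W2 -= eta * H.T @ dO;  b2 -= eta * dO.sum(axis=0)
    W1 -= eta * X.T @ dH;  b1 -= eta * dH.sum(axis=0)

print(np.round(O.ravel(), 2))  # typically approaches [0, 1, 1, 0]
```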

Page 28: Lecture 24: Conclusion
Page 29: Lecture 24: Conclusion
Page 30: Lecture 24: Conclusion
Page 31: Lecture 24: Conclusion
Page 32: Lecture 24: Conclusion
Page 33: Lecture 24: Conclusion
Page 34: Lecture 24: Conclusion
Page 35: Lecture 24: Conclusion

Learning Theory

Page 36: Lecture 24: Conclusion

Bias-Variance Dilemma

Page 37: Lecture 24: Conclusion

Bias

Page 38: Lecture 24: Conclusion

Variance

Page 39: Lecture 24: Conclusion

VC-dimension

• The VC-dimension of a hypothesis space H is the cardinality of the largest set S that can be shattered by H

• It can be shown that the VC dimension of linear decision surfaces in a d-dimensional space (i.e., the VC dimension of a perceptron with d inputs) is d + 1

• A perceptron in d dimensions has d + 1 parameters (including the bias). Through d + 1 linearly independent chosen points we can learn all dichotomies
• For d + 2 points in d dimensions, some vectors (at least two) can be represented as a linear combination of the others, so we cannot learn all dichotomies

Page 40: Lecture 24: Conclusion

K-Means, EM-Clustering

Page 41: Lecture 24: Conclusion

K-means Clustering

Page 42: Lecture 24: Conclusion

K-means Clustering

Page 43: Lecture 24: Conclusion

Algorithm: EM for Gaussian mixtures

Page 44: Lecture 24: Conclusion

EM for Gaussian mixtures

Page 45: Lecture 24: Conclusion
Page 46: Lecture 24: Conclusion

EM for Gaussian mixtures

Page 47: Lecture 24: Conclusion
Page 48: Lecture 24: Conclusion

EM for Gaussian mixtures

Page 49: Lecture 24: Conclusion
Page 50: Lecture 24: Conclusion

RBF-Networks, SVM

Page 51: Lecture 24: Conclusion

Interpolation Problem

Page 52: Lecture 24: Conclusion
Page 53: Lecture 24: Conclusion

Micchelli’s Theorem

Page 54: Lecture 24: Conclusion

Radial Basis Function Networks

Page 55: Lecture 24: Conclusion

Radial Basis Function Networks

Page 56: Lecture 24: Conclusion

Radial Basis Function Networks

Page 57: Lecture 24: Conclusion
Page 58: Lecture 24: Conclusion

Constructing new kernels by building them out of simpler kernels as building blocks

• f(·) is any function, q(·) is a polynomial with nonnegative coefficients
• A is a symmetric positive semidefinite matrix, xa and xb are variables (not necessarily disjoint) with x = (xa, xb), and ka and kb are valid kernel functions over their respective spaces
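The construction rules themselves are not reproduced in the transcript; the standard list of valid combinations (as given, for example, in Bishop's Pattern Recognition and Machine Learning) is:

$$
\begin{aligned}
k(\mathbf{x},\mathbf{x}') &= c\,k_1(\mathbf{x},\mathbf{x}'), \quad c > 0\\
k(\mathbf{x},\mathbf{x}') &= f(\mathbf{x})\,k_1(\mathbf{x},\mathbf{x}')\,f(\mathbf{x}')\\
k(\mathbf{x},\mathbf{x}') &= q\big(k_1(\mathbf{x},\mathbf{x}')\big)\\
k(\mathbf{x},\mathbf{x}') &= \exp\big(k_1(\mathbf{x},\mathbf{x}')\big)\\
k(\mathbf{x},\mathbf{x}') &= k_1(\mathbf{x},\mathbf{x}') + k_2(\mathbf{x},\mathbf{x}')\\
k(\mathbf{x},\mathbf{x}') &= k_1(\mathbf{x},\mathbf{x}')\,k_2(\mathbf{x},\mathbf{x}')\\
k(\mathbf{x},\mathbf{x}') &= \mathbf{x}^{\mathrm{T}} \mathbf{A}\,\mathbf{x}'\\
k(\mathbf{x},\mathbf{x}') &= k_a(\mathbf{x}_a,\mathbf{x}_a') + k_b(\mathbf{x}_b,\mathbf{x}_b')\\
k(\mathbf{x},\mathbf{x}') &= k_a(\mathbf{x}_a,\mathbf{x}_a')\,k_b(\mathbf{x}_b,\mathbf{x}_b')
\end{aligned}
$$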

Page 59: Lecture 24: Conclusion

Gaussian Kernel

• Since:

• The feature vector that corresponds to the Gaussian kernel has infinite dimensionality
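The omitted "Since:" step can be reconstructed from the standard argument: the Gaussian kernel factorizes, and the cross term expands into an infinite power series, so the implicit feature map has infinitely many components:

$$
k(\mathbf{x},\mathbf{x}') = \exp\!\left(-\frac{\lVert\mathbf{x}-\mathbf{x}'\rVert^2}{2\sigma^2}\right)
= \exp\!\left(-\frac{\mathbf{x}^{\mathrm{T}}\mathbf{x}}{2\sigma^2}\right)
  \exp\!\left(\frac{\mathbf{x}^{\mathrm{T}}\mathbf{x}'}{\sigma^2}\right)
  \exp\!\left(-\frac{\mathbf{x}'^{\mathrm{T}}\mathbf{x}'}{2\sigma^2}\right),
\qquad
\exp\!\left(\frac{\mathbf{x}^{\mathrm{T}}\mathbf{x}'}{\sigma^2}\right)
= \sum_{k=0}^{\infty}\frac{(\mathbf{x}^{\mathrm{T}}\mathbf{x}')^{k}}{\sigma^{2k}\,k!}
$$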

Page 60: Lecture 24: Conclusion
Page 61: Lecture 24: Conclusion
Page 62: Lecture 24: Conclusion

Good Decision Boundary: Margin Should Be Large

• The decision boundary should be as far away from the data of both classes as possible

• We should maximize the margin ρ

[Figure: Class 1 and Class 2 separated by a decision boundary with margin ρ]

Page 63: Lecture 24: Conclusion

Dual Problem

Page 64: Lecture 24: Conclusion

Dual Problem

Page 65: Lecture 24: Conclusion

Design of Support Vector Machine

Page 66: Lecture 24: Conclusion

Classify Data Points

Page 67: Lecture 24: Conclusion
Page 68: Lecture 24: Conclusion
Page 69: Lecture 24: Conclusion

Model Selection, MDL Principle

Page 70: Lecture 24: Conclusion

Attributes of the MDL Principle

• When we have two models that fit a given data sequence equally well, the MDL principle will pick the one that is the simplest in the sense that it allows the use of a shorter description of the data

• MDL principle implements a precise form of Occam’s razor, which states a preference for simple theories

• The MDL principle is a consistent model selection estimator in the sense that it converges to the true model order as the sample size increases

Page 71: Lecture 24: Conclusion

MDL and Regularization

Page 72: Lecture 24: Conclusion

Deep Learning

Page 73: Lecture 24: Conclusion

• It is assumed that an artificial neural network with several hidden layers is less likely to get stuck in a local minimum and that it is easier to find the right parameters, as demonstrated by empirical experiments

Page 74: Lecture 24: Conclusion

How to set network parameters

[Figure: handwritten-digit example. A 16 × 16 = 256 pixel image (ink → 1, no ink → 0) is fed as inputs x1, x2, ..., x256 into a network with softmax outputs y1, y2, ..., y10 (e.g. 0.1, 0.7, 0.2, ...). Set the network parameters θ = {W1, b1, W2, b2, ..., WL, bL} such that y1 has the maximum value when the input is a "1", y2 has the maximum value when the input is a "2", and so on.]

Page 75: Lecture 24: Conclusion

Rectified Linear Unit (ReLU)

• f(x) = max(0, x)
• Function defined as the positive part of its argument
• Does not saturate (in the + region)
• Very computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g. 6x)
• More biologically plausible
• But: not zero-centered output
• Non-differentiable at zero; however, it is differentiable everywhere else, and a value of 0 or 1 can be chosen for the derivative at zero

[Figures: the ReLU function and the derivative of the ReLU function]
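A tiny NumPy sketch of the function and the derivative convention mentioned above (my own illustration):

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), the positive part of the argument."""
    return np.maximum(0.0, x)

def relu_derivative(x):
    """1 for x > 0, 0 for x < 0; at x = 0 a value of 0 or 1 can be chosen (here: 0)."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```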

Page 76: Lecture 24: Conclusion

Batch Normalization

• During training time, a batch normalization layer does the following:

• Calculate the mean and variance of the layer's inputs

• Normalize the layer inputs using the previously calculated batch statistics

• Scale and shift in order to obtain the output of the layer

• γ and β are learned during training along with the original parameters of the network.
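A minimal training-time sketch of these three steps (my own illustration; it omits the running averages that a real batch-norm layer keeps for test time, and all shapes are made up):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch-normalize a mini-batch x of shape (batch, features) at training time."""
    mu = x.mean(axis=0)                    # 1. mean of the layer inputs over the batch
    var = x.var(axis=0)                    #    and their variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # 2. normalize with the batch statistics
    return gamma * x_hat + beta            # 3. scale and shift with learned gamma, beta

# Hypothetical mini-batch with 4 examples and 3 features.
x = np.random.randn(4, 3) * 5 + 2
gamma, beta = np.ones(3), np.zeros(3)
y = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # roughly zero mean, unit std
```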

Page 77: Lecture 24: Conclusion

l2 Regularization

Page 78: Lecture 24: Conclusion

l1 Regularization

Page 79: Lecture 24: Conclusion

l2 versus l1 Regularization

Page 80: Lecture 24: Conclusion

Regularization: Dropout

• In each forward pass, randomly set some neurons to zero (for one pass only)
• Probability of dropping is a hyperparameter; 0.5 is common
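A minimal sketch of the forward pass (my own illustration; this is the "inverted dropout" variant, which additionally rescales the surviving activations so that no correction is needed at test time):

```python
import numpy as np

def dropout_forward(activations, p_drop=0.5, train=True):
    """Zero each unit with probability p_drop during training; do nothing at test time."""
    if not train:
        return activations
    mask = (np.random.rand(*activations.shape) >= p_drop)
    return activations * mask / (1.0 - p_drop)  # rescale so the expected value is unchanged

h = np.ones((2, 8))
print(dropout_forward(h, p_drop=0.5))  # roughly half the units are zeroed per forward pass
```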

Page 81: Lecture 24: Conclusion

Convolution NN

Page 82: Lecture 24: Conclusion

• Convolutional Neural Networks

Page 83: Lecture 24: Conclusion
Page 84: Lecture 24: Conclusion

RNN

Page 85: Lecture 24: Conclusion

Recurrent Neural Networks

• Recurrent networks that produce an output at each time step and have recurrent connections between hidden units: the vanilla RNN or Elman RNN
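A minimal sketch of such a vanilla/Elman RNN forward pass (my own illustration; the dimensions and weight values are made up, and no training is shown):

```python
import numpy as np

def vanilla_rnn(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Elman-style RNN: at each time step, update the hidden state and emit an output."""
    h = np.zeros(W_hh.shape[0])
    outputs = []
    for x in x_seq:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # recurrent connection between hidden units
        outputs.append(W_hy @ h + b_y)          # an output is produced at every time step
    return outputs, h

# Hypothetical dimensions: 3-dimensional inputs, 5 hidden units, 2 outputs.
rng = np.random.default_rng(1)
W_xh, W_hh, W_hy = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))
b_h, b_y = np.zeros(5), np.zeros(2)
outputs, h_T = vanilla_rnn([rng.normal(size=3) for _ in range(4)], W_xh, W_hh, W_hy, b_h, b_y)
print(len(outputs), outputs[0].shape)  # 4 time steps, each with a 2-dimensional output
```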

Page 86: Lecture 24: Conclusion

Recurrent Hidden Units

Page 87: Lecture 24: Conclusion

KL-Transform, PCA, ICA

Page 88: Lecture 24: Conclusion

The Karhunen-Loève Transform

Page 89: Lecture 24: Conclusion
Page 90: Lecture 24: Conclusion
Page 91: Lecture 24: Conclusion

• The squares of the eigenvalues represent the variances along the eigenvectors. The eigenvalues corresponding to the covariance matrix of the data set Σ are
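A small NumPy sketch (my own, not from the slides) relating these quantities: the eigenvalues of the covariance matrix Σ are the variances along the eigenvectors, and they equal the squared singular values of the centered data matrix divided by N:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0, 0], [0, 1.0, 0], [0, 0, 0.2]])
Xc = X - X.mean(axis=0)                       # center the data
N = Xc.shape[0]

# Eigen-decomposition of the covariance matrix Sigma.
Sigma = Xc.T @ Xc / N
eigvals, eigvecs = np.linalg.eigh(Sigma)      # eigenvalues = variances along the eigenvectors

# Equivalent route via the SVD of the data matrix: squared singular values / N.
s = np.linalg.svd(Xc, compute_uv=False)
print(np.sort(eigvals)[::-1].round(3))
print((s ** 2 / N).round(3))                  # matches the covariance eigenvalues

# Project onto the leading eigenvectors (the KL transform / PCA).
components = eigvecs[:, ::-1][:, :2]          # two directions of largest variance
Z = Xc @ components
```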

Page 92: Lecture 24: Conclusion

The Blind Source Separation Problem

Page 93: Lecture 24: Conclusion

PCA vs ICA

Page 94: Lecture 24: Conclusion
Page 95: Lecture 24: Conclusion

Autoencoders

Page 96: Lecture 24: Conclusion

• Unsupervised learning
• Data: no labels!
• Goal: learn the structure of the data
• Traditionally, autoencoders were used for dimensionality reduction or feature learning.

Page 97: Lecture 24: Conclusion

Undercomplete AE

[Figure: autoencoder mapping the input x through weights w to a hidden code f(x), and through weights w′ back to a reconstruction x̂]

• The hidden layer is undercomplete if it is smaller than the input layer
• It compresses the input
• It compresses well only for the training distribution
• The hidden nodes will be good features for the training distribution, but bad for other types of input

Page 98: Lecture 24: Conclusion

• Autoencoders with nonlinear encoder functions f and nonlinear decoder functions g can thus learn a more powerful nonlinear generalization of PCA (later)

Page 99: Lecture 24: Conclusion
Page 100: Lecture 24: Conclusion

Feature Extraction

Page 101: Lecture 24: Conclusion
Page 102: Lecture 24: Conclusion

Edge detection

• Convert a 2D image into a set of curves
• Extracts salient features of the scene
• More compact than pixels

Page 103: Lecture 24: Conclusion

• The basic idea behind edge detection is to localize discontinuities of the intensity function in the image
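As an illustration of localizing intensity discontinuities (my own sketch, not from the lecture; the tiny image is made up and the Sobel kernels are one standard choice of gradient filters):

```python
import numpy as np

def filter2d(image, kernel):
    """Naive 'valid' sliding-window filtering (cross-correlation), enough for this example."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel kernels approximate the partial derivatives of the intensity function.
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

# Hypothetical image: dark left half, bright right half -> a vertical edge in the middle.
img = np.zeros((8, 8)); img[:, 4:] = 1.0
gx, gy = filter2d(img, sobel_x), filter2d(img, sobel_y)
magnitude = np.hypot(gx, gy)
print(magnitude.max(axis=0).round(1))  # large response at the intensity discontinuity
```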

Page 104: Lecture 24: Conclusion

• The right angle is the only place where the contour is curved, i.e. changes its direction
• At the point of extreme curvature, the information is concentrated
• Corners yield the greatest information
• More strongly curved points yield more information
• The information content of a contour is concentrated in the neighborhood of points where the absolute value of the curvature is a local maximum

Page 105: Lecture 24: Conclusion

KNN, Weighted Regression

Page 106: Lecture 24: Conclusion

K-Nearest Neighbor

• Training algorithm
• For each training example (x, f(x)), add the example to the list

• Classification algorithm
• Given a query instance xq to be classified
• Let x1, ..., xk be the k instances which are nearest to xq
• where d(a, b) = 1 if a = b, else d(a, b) = 0 (Kronecker function)
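The classification rule referred to above is the majority vote among the k nearest neighbours, i.e. the most common value of f(xi) weighted by the Kronecker counts. A small sketch (my own illustration, with made-up toy data):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    """k-NN: majority vote (Kronecker delta count) among the k nearest training examples."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance to x_q
    nearest = np.argsort(distances)[:k]                    # indices of the k nearest x_i
    votes = Counter(y_train[i] for i in nearest)           # sum_i d(v, f(x_i)) per class v
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two clusters with labels "a" and "b".
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = ["a", "a", "a", "b", "b", "b"]
print(knn_classify(X, y, np.array([0.4, 0.2])))  # -> "a"
print(knn_classify(X, y, np.array([5.2, 5.1])))  # -> "b"
```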

Page 107: Lecture 24: Conclusion

Continuous-valued target functions

• kNN can also approximate continuous-valued target functions
• Calculate the mean value of the k nearest training examples rather than their most common value

Page 108: Lecture 24: Conclusion

Distance Weighted

• For real valued functions

Page 109: Lecture 24: Conclusion

Ensemble Methods

Page 110: Lecture 24: Conclusion

Bootstrap a data set

• Sampling a dataset with replacement
• Define: the size of the sample and the number of repeats
• Example:
• (0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
• Randomly choose the first observation from the dataset
• sample = (0.2)
• This observation is returned to the dataset and we repeat this step 3 more times
• sample = (0.2, 0.1, 0.2, 0.6)
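A few lines of Python reproduce the procedure above (my own sketch; the random seed is arbitrary, so the concrete sample will differ from the one on the slide):

```python
import random

def bootstrap_sample(data, size):
    """Sampling with replacement: each drawn observation is returned before the next draw."""
    return [random.choice(data) for _ in range(size)]

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
random.seed(3)
samples = [bootstrap_sample(data, size=4) for _ in range(2)]  # two repeats of size 4
print(samples)  # observations may repeat within a sample, as in the slide's (0.2, 0.1, 0.2, 0.6)
```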

Page 111: Lecture 24: Conclusion

Bagging

Page 112: Lecture 24: Conclusion

Bagging

Page 113: Lecture 24: Conclusion

Bagging

Page 114: Lecture 24: Conclusion

Boosting

• Boosting is a powerful technique for combining multiple base (weak) classifiers to produce a form of committee whose performance can be significantly better than that of any of the base classifiers.

• Boosting can give good results even if the base classifiers have a performance that is only slightly better than random, and hence sometimes the base classifiers are known as weak learners.
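A compact sketch of boosting with decision stumps as weak learners (my own illustration of the idea, not the lecture's AdaBoost pseudocode; the 1-D data set is made up):

```python
import numpy as np

def train_stump(X, y, w):
    """Weak learner: the best single-feature threshold classifier under example weights w."""
    best = (None, None, 1, np.inf)                 # (feature, threshold, polarity, error)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, f] - thr) >= 0, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[3]:
                    best = (f, thr, pol, err)
    return best

def adaboost(X, y, rounds=10):
    """y in {-1, +1}; returns a weighted committee of decision stumps."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # start with uniform example weights
    committee = []
    for _ in range(rounds):
        f, thr, pol, err = train_stump(X, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)      # weight of this weak classifier
        pred = np.where(pol * (X[:, f] - thr) >= 0, 1, -1)
        w *= np.exp(-alpha * y * pred)             # upweight misclassified examples
        w /= w.sum()
        committee.append((alpha, f, thr, pol))
    return committee

def predict(committee, X):
    score = sum(a * np.where(p * (X[:, f] - t) >= 0, 1, -1) for a, f, t, p in committee)
    return np.sign(score)

# Hypothetical 1-D data that a single stump cannot fit, but a committee can.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1, 1])
print(predict(adaboost(X, y), X))  # matches y after a few boosting rounds
```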

Page 115: Lecture 24: Conclusion

AdaBoost

Page 116: Lecture 24: Conclusion

Algorithm: EM for Linear Regression Models

Page 117: Lecture 24: Conclusion
Page 118: Lecture 24: Conclusion

Bayesian Networks

Page 119: Lecture 24: Conclusion

Naive Bayes Classifier

• Assume a target function f: X → V, where each instance x is described by attributes a1, a2, ..., an

• Most probable value of f(x) is:
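The formula itself is not reproduced in the transcript; the standard Naive Bayes decision rule, under the conditional-independence assumption, is

$$v_{NB} = \operatorname*{argmax}_{v \in V}\; P(v)\prod_{i=1}^{n} P(a_i \mid v)$$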

Page 120: Lecture 24: Conclusion

Example

• In the example, there are four variables, namely Burglary (= x2), Earthquake (= x3), Alarm (= x1) and JohnCalls (= x4).
• The corresponding network topology reflects the following “causal” knowledge:
• A burglar can set the alarm off.
• An earthquake can set the alarm off.
• The alarm can cause John to call.
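With the topology described above (Burglary and Earthquake as parents of Alarm, and Alarm as the parent of JohnCalls), the joint distribution factorizes as

$$P(x_1, x_2, x_3, x_4) = P(x_2)\,P(x_3)\,P(x_1 \mid x_2, x_3)\,P(x_4 \mid x_1)$$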

Page 121: Lecture 24: Conclusion

Causality

Page 122: Lecture 24: Conclusion
Page 123: Lecture 24: Conclusion

Learning CPTs from Fully Observed Data

Page 124: Lecture 24: Conclusion

Expectation Maximization

Page 125: Lecture 24: Conclusion
Page 126: Lecture 24: Conclusion
Page 127: Lecture 24: Conclusion
Page 128: Lecture 24: Conclusion

Boltzmann Machine

Page 129: Lecture 24: Conclusion

Boltzmann Machine

Page 130: Lecture 24: Conclusion
Page 131: Lecture 24: Conclusion

• Assume the network is composed of n units. Usually the units are updated asynchronously, one at a time.

• For example, at each time step a random unit i is selected and updated, with bi being the bias. Unit i then turns on with a probability given by the sigmoid (logistic) function

Page 132: Lecture 24: Conclusion

Stochastic Dynamics

• The stochastic dynamics of Boltzmann machine can be described by Gibs sampling. • Suppose that the system is in a state x and we have chosen an

arbitrary coordinate i. • We can then ignore the actual state of the unit xi and ask for the

conditional probability
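The update equations are not reproduced in the transcript; a minimal sketch of the asynchronous dynamics, assuming the usual conditional p(x_i = 1 | rest) = sigmoid(sum_j w_ij x_j + b_i) with symmetric weights and zero diagonal:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(x, W, b, rng):
    """Pick a random unit i and turn it on with probability sigmoid(sum_j w_ij x_j + b_i)."""
    i = rng.integers(len(x))
    activation = W[i] @ x + b[i]   # net input; the diagonal of W is zero, so x_i itself does not contribute
    x[i] = 1 if rng.random() < sigmoid(activation) else 0
    return x

# Hypothetical small network: 5 binary units, symmetric weights, zero diagonal, zero biases.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 5)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = np.zeros(5)
x = rng.integers(0, 2, size=5)
for _ in range(1000):              # asynchronous updates, one unit at a time
    x = gibbs_step(x, W, b, rng)
print(x)
```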

Page 133: Lecture 24: Conclusion
Page 134: Lecture 24: Conclusion

Learning

• During the training phase of the network there are two phases to the operation of the Boltzmann machine:

• (1) Positive phase. In this phase, the network operates in its clamped condition under the direct influence of the training sample. The visible neurons are all clamped onto specific states determined by the environment.

• (2) Negative phase. In this second phase, the network is allowed to run freely, and therefore with no environmental input. The states of the units are determined randomly. The probability of finding it in any particular global state depends on the energy function

Page 135: Lecture 24: Conclusion
Page 136: Lecture 24: Conclusion
Page 137: Lecture 24: Conclusion
Page 138: Lecture 24: Conclusion
Page 139: Lecture 24: Conclusion

Harmonium - Restricted Boltzmann Machine

• A restricted Boltzmann machine (RBM) has only connections between visible and hidden units, to make inference and learning easier.
• It was initially invented under the name Harmonium by Paul Smolensky in 1986

Page 140: Lecture 24: Conclusion

A deep belief network

Page 141: Lecture 24: Conclusion

• The model learned to generate combinations of labels and images.

• To perform recognition we start with a random state of the label units and clamp the input image.

• Then we do an up-pass from the image followed by a few iterations of the top-level layers.

Page 142: Lecture 24: Conclusion

• Example of ML: Decision Trees
• Linear and Nonlinear Regression
• Perceptron/Logistic Regression
• Backpropagation
• Learning Theory
• K-Means, EM-Clustering
• RBF-Networks, SVM
• Model Selection, MDL Principle
• Deep Learning
• Convolution NN
• RNN
• KL-Transform, PCA, ICA
• Autoencoders
• Feature Extraction
• KNN, Weighted Regression
• Ensemble Methods
• Bayesian Networks
• Boltzmann Machine