Lecture 24: Conclusion
Andreas Wichert
Department of Computer Science and Engineering
Técnico Lisboa
• Example of ML: Decision Trees
• Linear and Nonlinear Regression
• Perceptron/Logistic Regression
• Backpropagation
• Learning Theory
• K-Means, EM-Clustering
• RBF-Networks, SVM
• Model Selection, MDL Principle
• Deep Learning
• Convolution NN
• RNN
• KL-Transform, PCA, ICA
• Autoencoders
• Feature Extraction
• KNN, Weighted Regression
• Ensemble Methods
• Bayesian Networks
• Boltzmann Machine
Example of ML: Decision Trees
Top-Down Induction of Decision Trees ID3
1. A ← the “best” decision attribute for the next node
2. Assign A as the decision attribute (= property) for the node
3. For each value of A, create a new descendant
4. Sort the training examples to the leaf nodes according to the attribute value of the branch
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes
Heuristic function: Shannon Entropy
• Shannon formalized these intuitions
• Given a universe of messages M = {m1, m2, ..., mn} and a probability p(mi) for the occurrence of each message, the information content (also called entropy) of a message M is given by
I(M) = − Σ_{i=1}^{n} p(mi) log2(p(mi))
• The gain from the property P is computed by subtracting the expected information to complete, E(P), from the total information
E(P) = Σ_{i=1}^{n} (|Ci| / |C|) I(Ci)
gain(P) = I(C) − E(P)
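As a minimal sketch of these formulas (the function names and the toy labels are illustrative, not from the lecture):

```python
import math

def entropy(labels):
    """Shannon entropy I(C) of a collection of class labels."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def gain(labels, attribute_values):
    """gain(P) = I(C) - E(P), where attribute_values[i] is the value of
    the candidate property P for example i."""
    n = len(labels)
    expected = 0.0
    for v in set(attribute_values):
        subset = [l for l, a in zip(labels, attribute_values) if a == v]
        expected += (len(subset) / n) * entropy(subset)   # E(P)
    return entropy(labels) - expected

# A property that splits the two classes perfectly gains 1 bit here
g = gain(['+', '+', '-', '-'], ['a', 'a', 'b', 'b'])
```

ID3 would evaluate `gain` for every remaining property and branch on the one with the largest value.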
Linear and Nonlinear Regression
Linear Regression
Sum-of-squares error
Design Matrix
• The dimensions of the design matrix are not determined by the dimensionality of the input vector x, which is D
• The number of basis functions is M − 1 (plus the constant φ0 = 1, giving M columns)
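A small sketch of this point about dimensions, using a polynomial basis on 1-D inputs (the toy data and function names are made up): the design matrix has N rows and M columns regardless of D.

```python
import numpy as np

def design_matrix(x, M):
    """Polynomial basis on 1-D inputs: columns x^0 (bias), x^1, ..., x^(M-1)."""
    return np.vstack([x**j for j in range(M)]).T

def fit_least_squares(x, t, M):
    """Minimise the sum-of-squares error via the normal equations:
    w = (Phi^T Phi)^(-1) Phi^T t."""
    Phi = design_matrix(x, M)            # shape (N, M), independent of D
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

x = np.array([0.0, 1.0, 2.0, 3.0])
t = 2.0 * x + 1.0                        # targets from the line t = 2x + 1
w = fit_least_squares(x, t, M=2)         # recovers (w0, w1) = (1, 2)
```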
Posterior Density
Relation between Regularised Least-Squares and MAP
Perceptron/Logistic Regression
Perceptron (1957)
• Linear threshold unit (LTU)
[Figure: linear threshold unit with inputs x1, ..., xn and weights w1, ..., wn, a bias input x0 = 1 with weight w0, a summation Σ, and output o]
McCulloch-Pitts model of a neuron (1943)
The “bias”, a constant term that does not depend on any input value
Linearly separable patterns
Perceptron learning rule
• Consider linearly separable problems
• How do we find appropriate weights?
• Initialize the weight vector w to some small random values
• Check whether the output pattern o belongs to the desired class, i.e. has the desired value d
• η is called the learning rate, with 0 < η ≤ 1

Δw = η ⋅ (d − o) ⋅ x

• The update rule for gradient descent is given by
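The rule above can be sketched as follows (the AND data set and the names are illustrative):

```python
import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=20):
    """Perceptron rule: w += eta * (d - o) * x, with x0 = 1 as bias input."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x, target in zip(Xb, d):
            o = 1 if w @ x > 0 else 0               # linear threshold unit
            w += eta * (target - o) * x             # update only on errors
    return w

# Linearly separable example: logical AND
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])
w = train_perceptron(X, d)
```

For a linearly separable problem such as AND, the perceptron convergence theorem guarantees that the loop stops making mistakes after finitely many updates.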
Linear Unit
Sigmoid Unit
Logistic Regression
Logistic Regression
Sigmoid Unit versus Logistic Regression
Linear Unit versus Logistic Regression
Backpropagation
Back-propagation
• The algorithm gives a prescription for changing the weights wij in any feed-forward network so as to learn a training set of input-output pairs {xk, yk}
• We consider a simple two-layer network
Learning Theory
Bias-Variance Dilemma
Bias
Variance
VC-dimension
• The VC-dimension of a hypothesis space H is the cardinality of the largest set S that can be shattered by H
• It can be shown that the VC dimension of linear decision surfaces in a d-dimensional space (i.e., the VC dimension of a perceptron with d inputs) is d + 1
• A perceptron in d dimensions has d + 1 parameters (including the bias); through d + 1 linearly independent points we can learn all dichotomies
• For d + 2 points in d dimensions, some vectors (at least two) are represented as a linear combination of the others, so we cannot learn all dichotomies
K-Means, EM-Clustering
K-means Clustering
Algorithm: EM for Gaussian mixtures
EM for Gaussian mixtures
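The k-means alternation (assign each point to its nearest center, then move each center to the mean of its points) can be sketched as follows; the toy data set is made up:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center becomes the mean of its points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

# Two well-separated pairs of points
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers, labels = kmeans(X, k=2)
```

EM for Gaussian mixtures generalizes this scheme: hard assignments become posterior responsibilities, and the update step re-estimates means, covariances, and mixing coefficients.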
RBF-Networks, SVM
Interpolation Problem
Micchelli’s Theorem
Radial Basis Function Networks
Radial Basis Function Networks
Radial Basis Function Networks
Constructing new kernels by building them out of simpler kernels as building blocks
• f(·) is any function, q(·) is a polynomial with nonnegative coefficients
• A is a symmetric positive semidefinite matrix, xa and xb are variables (not necessarily disjoint) with x = (xa, xb), and ka and kb are valid kernel functions over their respective spaces
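One of these construction rules, that a positive-weighted sum of valid kernels is again valid, can be checked numerically: the resulting Gram matrix must be symmetric positive semidefinite. A sketch with made-up data:

```python
import numpy as np

def k_lin(a, b):
    """Linear kernel, known to be valid."""
    return a @ b

def k_gauss(a, b, sigma=1.0):
    """Gaussian kernel, known to be valid."""
    return np.exp(-np.sum((a - b)**2) / (2 * sigma**2))

def gram(k, X):
    """Gram matrix K with K[i, j] = k(X[i], X[j])."""
    return np.array([[k(a, b) for b in X] for a in X])

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
# New kernel built from simpler building blocks: k = k_lin + 2 * k_gauss
K = gram(lambda a, b: k_lin(a, b) + 2.0 * k_gauss(a, b), X)
eigs = np.linalg.eigvalsh(K)   # all eigenvalues should be >= 0
```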
Gaussian Kernel
• Since the exponential in the Gaussian kernel can be expanded as an infinite power series,
• the feature vector that corresponds to the Gaussian kernel has infinite dimensionality
Good Decision Boundary: Margin Should Be Large
• The decision boundary should be as far away from the data of both classes as possible
• We should maximize the margin, 𝞺
[Figure: two classes separated by a linear decision boundary with margin 𝞺]
Dual Problem
Dual Problem
Design of Support Vector Machine
Classify Data Points
Model Selection, MDL Principle
Attributes of the MDL Principle
• When we have two models that fit a given data sequence equally well, the MDL principle will pick the one that is the simplest in the sense that it allows the use of a shorter description of the data
• MDL principle implements a precise form of Occam’s razor, which states a preference for simple theories
• The MDL principle is a consistent model selection estimator in the sense that it converges to the true model order as the sample size increases
MDL and Regularization
Deep Learning
• It is assumed that an artificial neural network with several hidden layers is less likely to get stuck in a local minimum, and that it is easier to find the right parameters, as demonstrated by empirical experiments
How to set network parameters
[Figure: a network classifying 16 × 16 = 256 pixel digit images. Each pixel is one input (ink → 1, no ink → 0); a softmax output layer has units y1, …, y10. Set the network parameters 𝜃 = W1, b1, W2, b2, ⋯, WL, bL such that for an input image of a “1” the output y1 has the maximum value, for a “2” y2 has the maximum value, and so on.]
Rectified Linear Unit (ReLU)
• f(x) = max(0, x): the function is defined as the positive part of its argument
• Does not saturate (in the positive region)
• Very computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g. 6×)
• More biologically plausible
• But: the output is not zero-centered
• Non-differentiable at zero; it is differentiable everywhere else, and at zero the derivative can be set to a value of 0 or 1

ReLU function

Derivative of the ReLU function
Batch Normalization
• During training time, a batch normalization layer does the following:
• Calculate the mean and variance of the layer's inputs
• Normalize the layer inputs using the previously calculated batch statistics
• Scale and shift in order to obtain the output of the layer
• γ and β are learned during training along with the original parameters of the network.
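The three steps can be sketched directly (the toy batch is made up, and γ and β are fixed here rather than learned):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for a batch of layer inputs x (N, D)."""
    mu = x.mean(axis=0)                    # 1) batch mean
    var = x.var(axis=0)                    # 1) batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # 2) normalize with batch statistics
    return gamma * x_hat + beta            # 3) scale and shift

x = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
y = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
# With gamma = 1 and beta = 0, each column of y has mean 0 and (near) unit std
```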
l2 Regularization
l1 Regularization
l2 versus l1 Regularization
Regularization: Dropout
• In each forward pass, randomly set some neurons to zero (for one pass only)
• The probability of dropping is a hyperparameter; 0.5 is common
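A sketch of the forward-pass mask; this uses the common “inverted dropout” variant, which additionally rescales the surviving activations so no correction is needed at test time (an assumption beyond the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_drop=0.5):
    """Zero each unit with probability p_drop; rescale the survivors by
    1 / (1 - p_drop) so the expected activation is unchanged."""
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask

h = np.ones(10000)
out = dropout_forward(h, p_drop=0.5)
# Roughly half the entries are zero; the mean stays close to 1
```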
Convolution NN
• Convolutional Neural Networks
RNN
Recurrent Neural Networks
• Recurrent networks that produce an output at each time step and have recurrent connections between hidden units are known as vanilla RNNs or Elman RNNs
Recurrent Hidden Units
KL-Transform, PCA, ICA
The Karhunen-Loève Transform
• The eigenvalues of the covariance matrix Σ of the data set represent the variances along the corresponding eigenvectors
The Blind Source Separation Problem
PCA vs ICA
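A sketch of the KL-transform/PCA via the covariance eigendecomposition (the synthetic, nearly one-dimensional data set is illustrative):

```python
import numpy as np

def pca(X, m):
    """Project centered data onto the top-m eigenvectors of the covariance
    matrix; the returned eigenvalues are the variances along them."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]            # sort descending
    return Xc @ vecs[:, order[:m]], vals[order[:m]]

rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=200)])  # almost 1-D
Z, variances = pca(X, m=2)
# Nearly all variance lies along the first principal component
```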
Autoencoders
• Unsupervised learning
• Data: no labels!
• Goal: learn the structure of the data
• Traditionally, autoencoders were used for dimensionality reduction or feature learning.
Undercomplete AE
[Figure: undercomplete autoencoder. The input x is encoded with weights w into the hidden code f(x) and decoded with weights w′ into the reconstruction x̂]
• The hidden layer is undercomplete if it is smaller than the input layer
• It compresses the input
• It compresses well only for the training distribution
• The hidden nodes will be good features for the training distribution, but bad for other types of input
• Autoencoders with nonlinear encoder functions f and nonlinear decoder functions g can thus learn a more powerful nonlinear generalization of PCA (later)
Feature Extraction
Edge detection
• Convert a 2D image into a set of curves
• Extracts salient features of the scene
• More compact than pixels
• The basic idea behind edge detection is to localize discontinuities of the intensity function in the image
• The right angle is the only place where the contour is curved, i.e. changes its direction
• At the point of extreme curvature, the information is concentrated
• Corners yield the greatest information
• More strongly curved points yield more information
• The information content of a contour is concentrated in the neighborhood of points where the absolute value of the curvature is a local maximum
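A minimal sketch of localizing intensity discontinuities with finite-difference gradients (the toy image is made up):

```python
import numpy as np

def edge_magnitude(img):
    """Approximate the intensity gradient with central differences and
    return its magnitude; large values mark discontinuities (edges)."""
    gx = np.zeros_like(img, dtype=float)
    gy = np.zeros_like(img, dtype=float)
    gx[:, 1:-1] = (img[:, 2:] - img[:, :-2]) / 2.0   # horizontal derivative
    gy[1:-1, :] = (img[2:, :] - img[:-2, :]) / 2.0   # vertical derivative
    return np.hypot(gx, gy)

# Toy image: dark left half, bright right half, i.e. one vertical edge
img = np.zeros((5, 8))
img[:, 4:] = 1.0
mag = edge_magnitude(img)
# mag is nonzero only in the two columns straddling the discontinuity
```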
KNN, Weighted Regression
K-Nearest Neighbor
• Training algorithm:
• For each training example (x, f(x)), add the example to the list
• Classification algorithm:
• Given a query instance xq to be classified
• Let x1, ..., xk be the k instances which are nearest to xq
• where d(a, b) = 1 if a = b, else d(a, b) = 0 (Kronecker function)
Continuous-valued target functions
• kNN approximating continuous-valued target functions
• Calculate the mean value of the k nearest training examples rather than their most common value
Distance Weighted
• For real-valued functions
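Both variants, the plain mean and the distance-weighted mean, can be sketched as follows (inverse-squared-distance weights are one common choice, an assumption here; the toy data is made up):

```python
import numpy as np

def knn_predict(Xtr, ytr, xq, k=3, weighted=False):
    """k-NN for continuous-valued targets: the mean (or distance-weighted
    mean) of the k training examples nearest to the query xq."""
    d = np.linalg.norm(Xtr - xq, axis=1)
    idx = np.argsort(d)[:k]                # indices of the k nearest
    if not weighted:
        return ytr[idx].mean()
    w = 1.0 / (d[idx]**2 + 1e-12)          # inverse squared distance
    return (w * ytr[idx]).sum() / w.sum()

Xtr = np.array([[0.0], [1.0], [2.0], [10.0]])
ytr = np.array([0.0, 1.0, 2.0, 10.0])
pred = knn_predict(Xtr, ytr, np.array([0.9]), k=3)
```

The weighted variant lets very close neighbors dominate, so it degrades gracefully as k grows.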
Ensemble Methods
Bootstrap a data set
• Sampling a dataset with replacement
• Define: the size of the sample and the number of repeats
• Example:
• (0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
• Randomly choose the first observation from the dataset: sample = (0.2)
• This observation is returned to the dataset and we repeat this step 3 more times
• sample = (0.2, 0.1, 0.2, 0.6)
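The example above can be sketched as follows (the seed is arbitrary, so the concrete sample will differ from the slide's):

```python
import random

def bootstrap_sample(data, size, seed=0):
    """Draw `size` observations from `data` with replacement."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(size)]

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
sample = bootstrap_sample(data, size=4)
# Every drawn value comes from `data`, and values may repeat
```

Bagging trains one base model on each such bootstrap sample and averages (or votes over) their predictions.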
Bagging
Boosting
• Boosting is a powerful technique for combining multiple base (weak) classifiers to produce a form of committee whose performance can be significantly better than that of any of the base classifiers.
• Boosting can give good results even if the base classifiers have a performance that is only slightly better than random, and hence sometimes the base classifiers are known as weak learners.
AdaBoost
Algorithm: EM for linear Regression Models
Bayesian Networks
Naive Bayes Classifier
• Assume a target function f: X → V, where each instance x is described by attributes a1, a2, ..., an
• Most probable value of f(x) is:
Example
• In the example, there are four variables, namely Burglary (= x2), Earthquake (= x3), Alarm (= x1) and JohnCalls (= x4)
• The corresponding network topology reflects the following “causal” knowledge:
• A burglar can set the alarm off
• An earthquake can set the alarm off
• The alarm can cause John to call
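A sketch of the joint distribution factorized along this topology; the CPT numbers below are illustrative placeholders, not values from the lecture:

```python
# Hypothetical conditional probability tables (CPTs)
p_burglary = 0.01                                    # P(x2)
p_earthquake = 0.02                                  # P(x3)
p_alarm = {                                          # P(x1 | x2, x3)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}
p_john = {True: 0.90, False: 0.05}                   # P(x4 | x1)

def joint(b, e, a, j):
    """Factorization along the network: P(x2) P(x3) P(x1|x2,x3) P(x4|x1)."""
    pb = p_burglary if b else 1 - p_burglary
    pe = p_earthquake if e else 1 - p_earthquake
    pa = p_alarm[(b, e)] if a else 1 - p_alarm[(b, e)]
    pj = p_john[a] if j else 1 - p_john[a]
    return pb * pe * pa * pj

# Probability that the alarm rings and John calls, with no burglary or quake
p = joint(False, False, True, True)
```

Summing `joint` over all sixteen truth assignments gives exactly 1, as it must for a valid distribution.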
Causality
Learning CPTs from Fully Observed Data
Expectation Maximization
Boltzmann Machine
Boltzmann Machine
• Assume the network is composed of n units. Usually the units are updated asynchronously, updating them one at a time.
• For example, at each time step a random unit i is selected and its total input zi = Σj wij xj + bi is computed, with bi being the bias. Unit i then turns on with a probability given by the sigmoid (logistic) function, p(xi = 1) = 1 / (1 + e^(−zi))
Stochastic Dynamics
• The stochastic dynamics of the Boltzmann machine can be described by Gibbs sampling
• Suppose that the system is in a state x and we have chosen an arbitrary coordinate i
• We can then ignore the actual state of the unit xi and ask for the conditional probability
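A sketch of this asynchronous Gibbs update (the 3-unit machine and its weights are made up; the diagonal of W is zero and W is symmetric, as required):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(x, W, b):
    """Pick a random unit i and turn it on with probability
    sigmoid(sum_j w_ij x_j + b_i)."""
    i = rng.integers(len(x))
    z = W[i] @ x + b[i]                    # total input to unit i
    x[i] = 1 if rng.random() < sigmoid(z) else 0
    return x

# Tiny 3-unit machine with symmetric weights and zero diagonal
W = np.array([[0.0, 1.0, -1.0],
              [1.0, 0.0, 0.5],
              [-1.0, 0.5, 0.0]])
b = np.zeros(3)
x = np.array([1, 0, 1])
for _ in range(100):                       # run the stochastic dynamics
    x = gibbs_step(x, W, b)
```

Run long enough, this chain samples states with probability given by the Boltzmann distribution over the energy function.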
Learning
• During the training phase of the network there are two phases to the operation of the Boltzmann machine:
• (1) Positive phase. In this phase, the network operates in its clamped condition under the direct influence of the training sample. The visible neurons are all clamped onto specific states determined by the environment.
• (2) Negative phase. In this second phase, the network is allowed to run freely, and therefore with no environmental input. The states of the units are determined randomly. The probability of finding it in any particular global state depends on the energy function.
Harmonium - Restricted Boltzmann Machine
• A restricted Boltzmann machine (RBM) has connections only between visible and hidden units, to make inference and learning easier
• It was initially invented under the name Harmonium by Paul Smolensky in 1986
A deep belief network
• The model learned to generate combinations of labels and images
• To perform recognition we start with a random state of the label units and clamp the input image
• Then we do an up-pass from the image, followed by a few iterations of the top-level layers
• Example of ML: Decision Trees
• Linear and Nonlinear Regression
• Perceptron/Logistic Regression
• Backpropagation
• Learning Theory
• K-Means, EM-Clustering
• RBF-Networks, SVM
• Model Selection, MDL Principle
• Deep Learning
• Convolution NN
• RNN
• KL-Transform, PCA, ICA
• Autoencoders
• Feature Extraction
• KNN, Weighted Regression
• Ensemble Methods
• Bayesian Networks
• Boltzmann Machine