Lecture 24: Conclusion
Andreas Wichert
Department of Computer Science and Engineering
Técnico Lisboa
• Example of ML: Decision Trees
• Linear and Nonlinear Regression
• Perceptron/Logistic Regression
• Backpropagation
• Learning Theory
• K-Means, EM-Clustering
• RBF-Networks, SVM
• Model Selection, MDL Principle
• Deep Learning
• Convolution NN
• RNN
• KL-Transform, PCA, ICA
• Autoencoders
• Feature Extraction
• KNN, Weighted Regression
• Ensemble Methods
• Bayesian Networks
• Boltzmann Machine
Example of ML: Decision Trees
Top-Down Induction of Decision Trees ID3
1. A ← the “best” decision attribute for the next node
2. Assign A as the decision attribute (= property) for the node
3. For each value of A, create a new descendant
4. Sort the training examples to the leaf nodes according to the attribute value of the branch
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes
Heuristic function: Shannon Entropy
• Shannon formalized these intuitions
• Given a universe of messages M = {m1, m2, ..., mn} and a probability p(mi) for the occurrence of each message, the information content (also called entropy) of a message M is given by
I(M) = − Σ_{i=1}^{n} p(mi) log2(p(mi))
• The gain from the property P is computed by subtracting the expected information to complete, E(P), from the total information
E(P) = Σ_{i=1}^{n} (|Ci| / |C|) I(Ci)
gain(P) = I(C) − E(P)
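As a minimal sketch of these formulas (the function names and the toy labels are illustrative, not from the lecture):

```python
import math

def entropy(labels):
    """Shannon entropy I(C) of a collection of class labels."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def gain(labels, attribute_values):
    """gain(P) = I(C) - E(P), where attribute_values[i] is the value of
    the candidate property P for example i."""
    n = len(labels)
    expected = 0.0
    for v in set(attribute_values):
        subset = [l for l, a in zip(labels, attribute_values) if a == v]
        expected += (len(subset) / n) * entropy(subset)   # E(P)
    return entropy(labels) - expected

# A property that splits the two classes perfectly gains 1 bit here
g = gain(['+', '+', '-', '-'], ['a', 'a', 'b', 'b'])
```

ID3 would evaluate `gain` for every remaining property and branch on the one with the largest value.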
Linear and Nonlinear Regression
Linear Regression
Sum-of-squares error
Design Matrix
• The dimensions of the design matrix are not determined by the dimensionality of the input vector x, which is D
• The number of basis functions is M − 1 (plus the constant φ0 = 1, giving M columns)
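A small sketch of this point about dimensions, using a polynomial basis on 1-D inputs (the toy data and function names are made up): the design matrix has N rows and M columns regardless of D.

```python
import numpy as np

def design_matrix(x, M):
    """Polynomial basis on 1-D inputs: columns x^0 (bias), x^1, ..., x^(M-1)."""
    return np.vstack([x**j for j in range(M)]).T

def fit_least_squares(x, t, M):
    """Minimise the sum-of-squares error via the normal equations:
    w = (Phi^T Phi)^(-1) Phi^T t."""
    Phi = design_matrix(x, M)            # shape (N, M), independent of D
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

x = np.array([0.0, 1.0, 2.0, 3.0])
t = 2.0 * x + 1.0                        # targets from the line t = 2x + 1
w = fit_least_squares(x, t, M=2)         # recovers (w0, w1) = (1, 2)
```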
Posterior Density
Relation between Regularised Least-Squares and MAP
Perceptron/Logistic Regression
Perceptron (1957)
• Linear threshold unit (LTU)
[Figure: linear threshold unit with inputs x1, ..., xn and weights w1, ..., wn, a bias input x0 = 1 with weight w0, a summation Σ, and output o]
McCulloch-Pitts model of a neuron (1943)
The “bias”, a constant term that does not depend on any input value
Linearly separable patterns
Perceptron learning rule
• Consider linearly separable problems
• How do we find appropriate weights?
• Initialize the weight vector w to some small random values
• Check whether the output pattern o belongs to the desired class, i.e. has the desired value d
• η is called the learning rate, with 0 < η ≤ 1

Δw = η ⋅ (d − o) ⋅ x

• The update rule for gradient descent is given by
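The rule above can be sketched as follows (the AND data set and the names are illustrative):

```python
import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=20):
    """Perceptron rule: w += eta * (d - o) * x, with x0 = 1 as bias input."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x, target in zip(Xb, d):
            o = 1 if w @ x > 0 else 0               # linear threshold unit
            w += eta * (target - o) * x             # update only on errors
    return w

# Linearly separable example: logical AND
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])
w = train_perceptron(X, d)
```

For a linearly separable problem such as AND, the perceptron convergence theorem guarantees that the loop stops making mistakes after finitely many updates.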
Linear Unit
Sigmoid Unit
Logistic Regression
Logistic Regression
Sigmoid Unit versus Logistic Regression
Linear Unit versus Logistic Regression
Backpropagation
Back-propagation
• The algorithm gives a prescription for changing the weights wij in any feed-forward network so as to learn a training set of input-output pairs {xk, yk}
• We consider a simple two-layer network
Learning Theory
Bias-Variance Dilemma
Bias
Variance
VC-dimension
• The VC-dimension of a hypothesis space H is the cardinality of the largest set S that can be shattered by H
• It can be shown that the VC dimension of linear decision surfaces in a d-dimensional space (i.e., the VC dimension of a perceptron with d inputs) is d + 1
• A perceptron in d dimensions has d + 1 parameters (including the bias); through d + 1 linearly independent points we can learn all dichotomies
• For d + 2 points in d dimensions, some vectors (at least two) are represented as a linear combination of the others, so we cannot learn all dichotomies
K-Means, EM-Clustering
K-means Clustering
Algorithm: EM for Gaussian mixtures
EM for Gaussian mixtures
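The k-means alternation (assign each point to its nearest center, then move each center to the mean of its points) can be sketched as follows; the toy data set is made up:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center becomes the mean of its points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

# Two well-separated pairs of points
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers, labels = kmeans(X, k=2)
```

EM for Gaussian mixtures generalizes this scheme: hard assignments become posterior responsibilities, and the update step re-estimates means, covariances, and mixing coefficients.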
RBF-Networks, SVM
Interpolation Problem
Micchelli’s Theorem
Radial Basis Function Networks
Radial Basis Function Networks
Radial Basis Function Networks
Constructing new kernels by building them out of simpler kernels as building blocks
• f(·) is any function, q(·) is a polynomial with nonnegative coefficients
• A is a symmetric positive semidefinite matrix, xa and xb are variables (not necessarily disjoint) with x = (xa, xb), and ka and kb are valid kernel functions over their respective spaces
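One of these construction rules, that a positive-weighted sum of valid kernels is again valid, can be checked numerically: the resulting Gram matrix must be symmetric positive semidefinite. A sketch with made-up data:

```python
import numpy as np

def k_lin(a, b):
    """Linear kernel, known to be valid."""
    return a @ b

def k_gauss(a, b, sigma=1.0):
    """Gaussian kernel, known to be valid."""
    return np.exp(-np.sum((a - b)**2) / (2 * sigma**2))

def gram(k, X):
    """Gram matrix K with K[i, j] = k(X[i], X[j])."""
    return np.array([[k(a, b) for b in X] for a in X])

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
# New kernel built from simpler building blocks: k = k_lin + 2 * k_gauss
K = gram(lambda a, b: k_lin(a, b) + 2.0 * k_gauss(a, b), X)
eigs = np.linalg.eigvalsh(K)   # all eigenvalues should be >= 0
```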
Gaussian Kernel
• Since the exponential in the Gaussian kernel can be expanded as an infinite power series,
• the feature vector that corresponds to the Gaussian kernel has infinite dimensionality
Good Decision Boundary: Margin Should Be Large
• The decision boundary should be as far away from the data of both classes as possible
• We should maximize the margin, 𝞺
[Figure: two classes separated by a linear decision boundary with margin 𝞺]
Dual Problem
Dual Problem
Design of Support Vector Machine
Classify Data Points
Model Selection, MDL Principle
Attributes of the MDL Principle
• When we have two models that fit a given data sequence equally well, the MDL principle will pick the one that is the simplest in the sense that it allows the use of a shorter description of the data
• MDL principle implements a precise form of Occam’s razor, which states a preference for simple theories
• The MDL principle is a consistent model selection estimator in the sense that it converges to the true model order as the sample size increases
MDL and Regularization
Deep Learning
• It is assumed that an artificial neural network with several hidden layers is less likely to get stuck in a local minimum, and that it is easier to find the right parameters, as demonstrated by empirical experiments
How to set network parameters
[Figure: a network classifying 16 × 16 = 256 pixel digit images. Each pixel is one input (ink → 1, no ink → 0); a softmax output layer has units y1, …, y10. Set the network parameters 𝜃 = W1, b1, W2, b2, ⋯, WL, bL such that for an input image of a “1” the output y1 has the maximum value, for a “2” y2 has the maximum value, and so on.]
Rectified Linear Unit (ReLU)
• f(x) = max(0, x): the function is defined as the positive part of its argument
• Does not saturate (in the positive region)
• Very computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g. 6×)
• More biologically plausible
• But: the output is not zero-centered
• Non-differentiable at zero; it is differentiable everywhere else, and at zero the derivative can be set to a value of 0 or 1

ReLU function

Derivative of the ReLU function
Batch Normalization
• During training time, a batch normalization layer does the following:
• Calculate the mean and variance of the layer's inputs
• Normalize the layer inputs using the previously calculated batch statistics
• Scale and shift in order to obtain the output of the layer
• γ and β are learned during training along with the original parameters of the network.
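The three steps can be sketched directly (the toy batch is made up, and γ and β are fixed here rather than learned):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for a batch of layer inputs x (N, D)."""
    mu = x.mean(axis=0)                    # 1) batch mean
    var = x.var(axis=0)                    # 1) batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # 2) normalize with batch statistics
    return gamma * x_hat + beta            # 3) scale and shift

x = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
y = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
# With gamma = 1 and beta = 0, each column of y has mean 0 and (near) unit std
```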
l2 Regularization
l1 Regularization
l2 versus l1 Regularization
Regularization: Dropout
• In each forward pass, randomly set some neurons to zero (for one pass only)
• The probability of dropping is a hyperparameter; 0.5 is common
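A sketch of the forward-pass mask; this uses the common “inverted dropout” variant, which additionally rescales the surviving activations so no correction is needed at test time (an assumption beyond the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_drop=0.5):
    """Zero each unit with probability p_drop; rescale the survivors by
    1 / (1 - p_drop) so the expected activation is unchanged."""
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask

h = np.ones(10000)
out = dropout_forward(h, p_drop=0.5)
# Roughly half the entries are zero; the mean stays close to 1
```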
Convolution NN
• Convolutional Neural Networks
RNN
Recurrent Neural Networks
• Recurrent networks that produce an output at each time step and have recurrent connections between hidden units are known as vanilla RNNs or Elman RNNs
Recurrent Hidden Units
KL-Transform, PCA, ICA
The Karhunen-Loève Transform
• The eigenvalues of the covariance matrix Σ of the data set represent the variances along the corresponding eigenvectors
The Blind Source Separation Problem
PCA vs ICA
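A sketch of the KL-transform/PCA via the covariance eigendecomposition (the synthetic, nearly one-dimensional data set is illustrative):

```python
import numpy as np

def pca(X, m):
    """Project centered data onto the top-m eigenvectors of the covariance
    matrix; the returned eigenvalues are the variances along them."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]            # sort descending
    return Xc @ vecs[:, order[:m]], vals[order[:m]]

rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=200)])  # almost 1-D
Z, variances = pca(X, m=2)
# Nearly all variance lies along the first principal component
```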
Autoencoders
• Unsupervised learning
• Data: no labels!
• Goal: learn the structure of the data
• Traditionally, autoencoders were used for dimensionality reduction or feature learning.
Undercomplete AE
[Figure: undercomplete autoencoder. The input x is encoded with weights w into the hidden code f(x) and decoded with weights w′ into the reconstruction x̂]
• The hidden layer is undercomplete if it is smaller than the input layer
• It compresses the input
• It compresses well only for the training distribution
• The hidden nodes will be good features for the training distribution, but bad for other types of input
• Autoencoders with nonlinear encoder functions f and nonlinear decoder functions g can thus learn a more powerful nonlinear generalization of PCA (later)
Feature Extraction
Edge detection
• Convert a 2D image into a set of curves
• Extracts salient features of the scene
• More compact than pixels
• The basic idea behind edge detection is to localize discontinuities of the intensity function in the image
• The right angle is the only place where the contour is curved, i.e. changes its direction
• At the point of extreme curvature, the information is concentrated
• Corners yield the greatest information
• More strongly curved points yield more information
• The information content of a contour is concentrated in the neighborhood of points where the absolute value of the curvature is a local maximum
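A minimal sketch of localizing intensity discontinuities with finite-difference gradients (the toy image is made up):

```python
import numpy as np

def edge_magnitude(img):
    """Approximate the intensity gradient with central differences and
    return its magnitude; large values mark discontinuities (edges)."""
    gx = np.zeros_like(img, dtype=float)
    gy = np.zeros_like(img, dtype=float)
    gx[:, 1:-1] = (img[:, 2:] - img[:, :-2]) / 2.0   # horizontal derivative
    gy[1:-1, :] = (img[2:, :] - img[:-2, :]) / 2.0   # vertical derivative
    return np.hypot(gx, gy)

# Toy image: dark left half, bright right half, i.e. one vertical edge
img = np.zeros((5, 8))
img[:, 4:] = 1.0
mag = edge_magnitude(img)
# mag is nonzero only in the two columns straddling the discontinuity
```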
KNN, Weighted Regression
K-Nearest Neighbor
• Training algorithm:
• For each training example (x, f(x)), add the example to the list
• Classification algorithm:
• Given a query instance xq to be classified
• Let x1, ..., xk be the k instances which are nearest to xq
• where d(a, b) = 1 if a = b, else d(a, b) = 0 (Kronecker function)
Continuous-valued target functions
• kNN approximating continuous-valued target functions
• Calculate the mean value of the k nearest training examples rather than their most common value
Distance Weighted
• For real-valued functions
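Both variants, the plain mean and the distance-weighted mean, can be sketched as follows (inverse-squared-distance weights are one common choice, an assumption here; the toy data is made up):

```python
import numpy as np

def knn_predict(Xtr, ytr, xq, k=3, weighted=False):
    """k-NN for continuous-valued targets: the mean (or distance-weighted
    mean) of the k training examples nearest to the query xq."""
    d = np.linalg.norm(Xtr - xq, axis=1)
    idx = np.argsort(d)[:k]                # indices of the k nearest
    if not weighted:
        return ytr[idx].mean()
    w = 1.0 / (d[idx]**2 + 1e-12)          # inverse squared distance
    return (w * ytr[idx]).sum() / w.sum()

Xtr = np.array([[0.0], [1.0], [2.0], [10.0]])
ytr = np.array([0.0, 1.0, 2.0, 10.0])
pred = knn_predict(Xtr, ytr, np.array([0.9]), k=3)
```

The weighted variant lets very close neighbors dominate, so it degrades gracefully as k grows.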
Ensemble Methods
Bootstrap a data set
• Sampling a dataset with replacement
• Define: the size of the sample and the number of repeats
• Example:
• (0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
• Randomly choose the first observation from the dataset: sample = (0.2)
• This observation is returned to the dataset and we repeat this step 3 more times
• sample = (0.2, 0.1, 0.2, 0.6)
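The example above can be sketched as follows (the seed is arbitrary, so the concrete sample will differ from the slide's):

```python
import random

def bootstrap_sample(data, size, seed=0):
    """Draw `size` observations from `data` with replacement."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(size)]

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
sample = bootstrap_sample(data, size=4)
# Every drawn value comes from `data`, and values may repeat
```

Bagging trains one base model on each such bootstrap sample and averages (or votes over) their predictions.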
Bagging
Boosting
• Boosting is a powerful technique for combining multiple base (weak) classifiers to produce a form of committee whose performance can be significantly better than that of any of the base classifiers.
• Boosting can give good results even if the base classifiers have a performance that is only slightly better than random, and hence sometimes the base classifiers are known as weak learners.
AdaBoost
Algorithm: EM for linear Regression Models
Bayesian Networks
Naive Bayes Classifier
• Assume a target function f: X → V, where each instance x is described by attributes a1, a2, ..., an
• Most probable value of f(x) is:
Example
• In the example, there are four variables, namely Burglary (= x2), Earthquake (= x3), Alarm (= x1) and JohnCalls (= x4)
• The corresponding network topology reflects the following “causal” knowledge:
• A burglar can set the alarm off
• An earthquake can set the alarm off
• The alarm can cause John to call
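A sketch of the joint distribution factorized along this topology; the CPT numbers below are illustrative placeholders, not values from the lecture:

```python
# Hypothetical conditional probability tables (CPTs)
p_burglary = 0.01                                    # P(x2)
p_earthquake = 0.02                                  # P(x3)
p_alarm = {                                          # P(x1 | x2, x3)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}
p_john = {True: 0.90, False: 0.05}                   # P(x4 | x1)

def joint(b, e, a, j):
    """Factorization along the network: P(x2) P(x3) P(x1|x2,x3) P(x4|x1)."""
    pb = p_burglary if b else 1 - p_burglary
    pe = p_earthquake if e else 1 - p_earthquake
    pa = p_alarm[(b, e)] if a else 1 - p_alarm[(b, e)]
    pj = p_john[a] if j else 1 - p_john[a]
    return pb * pe * pa * pj

# Probability that the alarm rings and John calls, with no burglary or quake
p = joint(False, False, True, True)
```

Summing `joint` over all sixteen truth assignments gives exactly 1, as it must for a valid distribution.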
Causality
Learning CPTs from Fully Observed Data
Expectation Maximization
Boltzmann Machine
Boltzmann Machine
• Assume the network is composed of n units. Usually the units are updated asynchronously, updating them one at a time.
• For example, at each time step a random unit i is selected and its total input zi = Σj wij xj + bi is computed, with bi being the bias. Unit i then turns on with a probability given by the sigmoid (logistic) function, p(xi = 1) = 1 / (1 + e^(−zi))
Stochastic Dynamics
• The stochastic dynamics of the Boltzmann machine can be described by Gibbs sampling
• Suppose that the system is in a state x and we have chosen an arbitrary coordinate i
• We can then ignore the actual state of the unit xi and ask for the conditional probability
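A sketch of this asynchronous Gibbs update (the 3-unit machine and its weights are made up; the diagonal of W is zero and W is symmetric, as required):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(x, W, b):
    """Pick a random unit i and turn it on with probability
    sigmoid(sum_j w_ij x_j + b_i)."""
    i = rng.integers(len(x))
    z = W[i] @ x + b[i]                    # total input to unit i
    x[i] = 1 if rng.random() < sigmoid(z) else 0
    return x

# Tiny 3-unit machine with symmetric weights and zero diagonal
W = np.array([[0.0, 1.0, -1.0],
              [1.0, 0.0, 0.5],
              [-1.0, 0.5, 0.0]])
b = np.zeros(3)
x = np.array([1, 0, 1])
for _ in range(100):                       # run the stochastic dynamics
    x = gibbs_step(x, W, b)
```

Run long enough, this chain samples states with probability given by the Boltzmann distribution over the energy function.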
Learning
• During the training phase of the network there are two phases to the operation of the Boltzmann machine:
• (1) Positive phase. In this phase, the network operates in its clamped condition under the direct influence of the training sample. The visible neurons are all clamped onto specific states determined by the environment.
• (2) Negative phase. In this second phase, the network is allowed to run freely, and therefore with no environmental input. The states of the units are determined randomly. The probability of finding it in any particular global state depends on the energy function.
Harmonium - Restricted Boltzmann Machine
• A restricted Boltzmann machine (RBM) has connections only between visible and hidden units, to make inference and learning easier
• It was initially invented under the name Harmonium by Paul Smolensky in 1986
A deep belief network
• The model learned to generate combinations of labels and images
• To perform recognition we start with a random state of the label units and clamp the input image
• Then we do an up-pass from the image, followed by a few iterations of the top-level layers
• Example of ML: Decision Trees
• Linear and Nonlinear Regression
• Perceptron/Logistic Regression
• Backpropagation
• Learning Theory
• K-Means, EM-Clustering
• RBF-Networks, SVM
• Model Selection, MDL Principle
• Deep Learning
• Convolution NN
• RNN
• KL-Transform, PCA, ICA
• Autoencoders
• Feature Extraction
• KNN, Weighted Regression
• Ensemble Methods
• Bayesian Networks
• Boltzmann Machine