non-parametric methods in machine learning


Page 1: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Non-parametric methods in machine learning

Daniil Ryabko

IDSIA/USI/SUPSI

Page 2: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Machine learning and artificial intelligence methods are based on a variety of mathematical disciplines.

One of them is probability theory. Some concepts of probability theory will be used today.

Page 4: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Non-parametric methods are about making inference and prediction while making as few assumptions as possible about the data. Consider an example.

I ask 20 friends when their cars were made. Then I am going to ask a 21st friend the same question. What is the probability that his car was made before 2000? Before 1990? Before 2006?

The problem is difficult to model beforehand. What is the underlying probability distribution? I have no idea.

Page 5: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

The data can look as follows.

Page 7: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

We are given a sample X1, . . . , Xn of independent identically distributed (i.i.d.) random variables generated according to a probability distribution P.

We are about to observe a new data point (random variable) X. For any given t we are interested in the probability P(X ≤ t).

In other words, we are interested in the function F(t) = P(X ≤ t). This function is called the distribution function.

Page 8: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Empirical Distribution Function

We will estimate the probabilities P(X ≤ t) by counting the fraction of the Xi that fall within the interval (−∞, t].

Let

Fn(t) = (1/n) ∑_{i=1}^n I_{(−∞,t]}(Xi),

where I_A is the indicator function of the set A: I_A(t) = 1 if t is in A, and 0 otherwise.

Page 9: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

So if t is 1998 then Fn(t) is 1/2.
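A minimal sketch of this estimate in Python. The sample below is made up for illustration (it is not the data from the figure), chosen so that Fn(1998) = 1/2 as in the example:

```python
def ecdf(sample, t):
    """Empirical distribution function: fraction of sample points <= t."""
    return sum(1 for x in sample if x <= t) / len(sample)

# Hypothetical years of manufacture for 20 cars (illustration only).
years = [1987, 1990, 1991, 1993, 1994, 1995, 1996, 1997, 1998, 1998,
         1999, 2000, 2001, 2002, 2003, 2004, 2004, 2005, 2005, 2006]

print(ecdf(years, 1998))   # 0.5, our estimate of P(X <= 1998)
print(ecdf(years, 1990))   # 0.1
print(ecdf(years, 2006))   # 1.0
```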

Page 10: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Thus we will use the empirical distribution function Fn(t) as an estimate of the real (unknown) distribution function F(t). Its plot could look like this:

Page 11: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Is Fn(t) a good estimate of F(t)? Well, it is quite good.

Proposition

For any t, Fn(t) converges to F(t) as the sample size n goes to infinity.

This follows from the law of large numbers. The (strong) law of large numbers says that if X1, . . . , Xn are independent identically distributed (i.i.d.) random variables with expectation E(X1), then

(1/n) ∑_{i=1}^n Xi → E(X1)

(almost surely). In our case, for any t, consider the random variables I_{(−∞,t]}(Xi). They are independent and identically distributed, so we have

Fn(t) = (1/n) ∑_{i=1}^n I_{(−∞,t]}(Xi) → E(I_{(−∞,t]}(X1)) = P(X ≤ t).

Page 12: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

So we know that as the sample size n grows to infinity, the estimate Fn(t) converges to the true (unknown) value F(t). This is an asymptotic result. Can we say something about finite sample sizes? We can.

Proposition

For any t, P(|Fn(t) − F(t)| > ε) ≤ 2e^{−nε²}.

Again, this fact simply follows from a more general inequality for i.i.d. random variables.

Page 13: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Thus, we take any year t1, estimate Fn(t1), and we know that with probability at least 1 − 2e^{−nε²} it differs from F(t1) by no more than ε. Then we take t2, t3, make new estimates, and get the same bound for each. But if for every single t the probability that |Fn(t) − F(t)| is greater than ε is less than δ = 2e^{−nε²}, it does not follow that the probability that for some t |Fn(t) − F(t)| is greater than ε is less than δ. In symbols, we have

sup_{t∈R} P(|Fn(t) − F(t)| > ε) ≤ δ,

but can we have

P(sup_{t∈R} |Fn(t) − F(t)| > ε) ≤ δ?

Why do we want this? Because we want to make inference about t1, t2, t3, all of them, from just the one data sample X1, . . . , Xn that we have! Think about this!

Page 14: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Glivenko-Cantelli Theorem

For just a finite number of values of t, say two of them, we can use the union bound:

P(|Fn(t1) − F(t1)| > ε or |Fn(t2) − F(t2)| > ε)
≤ P(|Fn(t1) − F(t1)| > ε) + P(|Fn(t2) − F(t2)| > ε) ≤ 2δ.

For k values of t the same trick gives kδ. But that is quite bad, and what about all infinitely many t at the same time? Luckily, we can have a single bound for all of them, but the proof is a bit too long for this lecture.

Proposition (Glivenko-Cantelli Theorem)

P(sup_{t∈R} |Fn(t) − F(t)| > ε) ≤ 8(n + 1)e^{−nε²/32}.

Page 15: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Other sets

So far we have seen how to estimate the probability of X ≤ t, for a given t, based on the sample X1, . . . , Xn. Can we estimate the probability of other sets? For example, I want to know the probability that the next friend I ask has a car made between 1989 and 1993. Or made either in 1991 or in 1999? Interestingly, we can have the same statements as in our first two propositions:

Proposition

For any event A, Pn(A) = (1/n) ∑_{i=1}^n I_A(Xi) converges to P(A) as the sample size n goes to infinity (where by Pn we denote the empirical probability).

and

Proposition

For any event A, P(|Pn(A) − P(A)| > ε) ≤ 2e^{−nε²}.

And they can be proven in the same way!

Page 16: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

First disappointment

But let us see that we cannot have any guarantee for all sets A together; that is, there are no nontrivial bounds for

P(sup_A |Pn(A) − P(A)| > ε).

Indeed, if we consider all sets A then we will also encounter a set that contains exactly the data points Xi and (almost) nothing else.

Page 17: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Formally:

Proposition

Let P be any continuous distribution (that is, P({t}) = 0 for any point t ∈ R), and let X1, . . . , Xn be generated i.i.d. according to P. Then for any ε < 1,

P(sup_A |Pn(A) − P(A)| > ε) = 1.

Proof. Let B be the set that consists exactly of the sample points: B = {X1, . . . , Xn}. Then Pn(B) = 1, while P(B) = 0, since P is continuous. Moreover,

P(sup_A |Pn(A) − P(A)| > ε) ≥ P(|Pn(B) − P(B)| > ε) = P(1 > ε) = 1.

Page 18: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Theoretical analysis gives guarantees of performance on arbitrary data that fits the assumptions. Typical theoretical results in machine learning/artificial intelligence include:

• Asymptotic performance guarantees.

• Finite-step performance guarantees, e.g. bounds on the probability of error.

• Uniform (convergence, bounds) results.

• Impossibility results.

and

• Computational complexity analysis.

After a thorough theoretical analysis, practical evaluation is often redundant :)

Page 21: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Computational complexity of empirical DF estimate

Let’s recall the setup: we are given X1, . . . , Xn. For any t, computing Fn(t) takes O(log n) comparisons, provided X1, . . . , Xn is sorted. To sort X1, . . . , Xn we need O(n log n) comparisons.

Online setting: X1, . . . , Xn are given and Xn+1 arrives. Putting Xn+1 into the sorted array requires not more than O(log n) comparisons.
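A small sketch of these computational points in Python, using the standard bisect module; the sample values are made up:

```python
import bisect

data = sorted([1995, 1987, 2003, 1999, 2001, 1998, 2005, 1991])   # O(n log n), once

def ecdf_sorted(sorted_data, t):
    # bisect_right finds the number of elements <= t with O(log n) comparisons
    return bisect.bisect_right(sorted_data, t) / len(sorted_data)

print(ecdf_sorted(data, 1999))     # 0.625

bisect.insort(data, 2000)          # online setting: insert X_{n+1} with O(log n) comparisons
print(ecdf_sorted(data, 1999))     # now 5/9
```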

Page 22: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Complexity issues

In general, the following complexity issues are considered in ML:

• Complexity of building a classifier from the sample (X1,Y1), . . . , (Xn,Yn).

• Complexity of building a classifier “online”: changing the classifier with the arrival of a new data point (Xn+1,Yn+1).

• Complexity of classifying a new data point X; that is, of evaluating ϕn(X).

Page 23: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Classification

Classification is one of the most important, most basic and most popular problems in machine learning and artificial intelligence.

We are given a sample (X1,Y1), . . . , (Xk,Yk) of examples, where each Xi is an object and each Yi is its label. Based on this sample, we have to construct a classifier: a rule that “predicts” the label Y given the object X.

Example: hand-written digit recognition. Here are four objects (images of hand-written digits), and their labels are 3, 0, 1, 4. An object here is a picture of 16x16 grayscale pixels, and a label is just a digit from the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

Page 24: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

USPS dataset

This example was from a classical benchmark dataset: USPS handwritten digits.

It consists of about 7000 training examples, based on which you can construct a classifier, and about 2000 test examples, on which you can test it and compare with the results others achieved on this dataset.

Page 25: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Other examples

Recognizing a disease, such as cancer, in a patient, based on various data, such as X-ray images, gene expression data, etc. Here an object is all this data together, while the label is just binary: 0 or 1, has cancer or not.

More examples: sex identification from a photo (M/F). Emotion recognition from a photo (Happy/Sad/Angry). Textual data: email spam recognition. Topic identification. And whatnot.

Page 26: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Formal model for classification

We are given a (training) sample (X1,Y1), . . . , (Xk,Yk), where the Xi are from X = R^d, called the object space, and the Yi are from a finite set Y, called the label space. We will assume just binary Y = {0, 1} (unless stated otherwise). The objects and labels (Xi,Yi) are generated i.i.d. by an (unknown) probability distribution P on R^d × Y.

We wish to construct a classifier ϕ such that the probability of error P(ϕ(X) ≠ Y) on a test pair (X,Y) is small.

In essence, ϕ should emulate (or estimate) the conditional distribution P(Y|X): the probability of a certain label given the object.

Page 27: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Histogram methods

Divide the object space into bins and classify a new object X according to the majority in the bin containing X.

Page 28: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Histogram rule: definition

Let An(x) denote the bin containing x in the nth partition.

ϕn(x) = 1 if #{i : Yi = 0, Xi ∈ An(x)} < #{i : Yi = 1, Xi ∈ An(x)}, and 0 otherwise.

How to define the bins? For example, they can be cubes of size hn:

∏_{j=1}^d [kj·hn, (kj + 1)·hn],     (1)

where the kj are integers.

Page 29: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Class Probability Estimation

Clearly, the histogram rule can also be used to estimate the probability of a given class label. Let η(x) = P(Y = 1|X = x) be the “true” (unknown) probability of the label Y = 1 given the object x.

We can define the histogram estimate

η̂n(x) = #{i : Yi = 1, Xi ∈ An(x)} / #{i : Xi ∈ An(x)}.     (2)

Page 30: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Plug-in rule

From a class probability estimate η̂n(x) we can always get a classifier by

ϕn(x) = 1 if η̂n(x) ≥ 1/2, and 0 otherwise.
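A minimal sketch of the cubic-bin histogram estimate (2) and the plug-in rule, assuming the data are given as lists of numeric vectors; the bin size h and the default value returned for an empty bin are choices of this sketch, not of the lecture:

```python
import numpy as np

def bin_index(x, h):
    """Index (k_1, ..., k_d) of the cubic bin of side h containing x, as in (1)."""
    return tuple(np.floor(np.asarray(x, dtype=float) / h).astype(int))

def histogram_probability(x, X, Y, h):
    """Histogram estimate (2) of eta(x) = P(Y = 1 | X = x)."""
    b = bin_index(x, h)
    labels_in_bin = [y for xi, y in zip(X, Y) if bin_index(xi, h) == b]
    if not labels_in_bin:
        return 0.5                      # arbitrary value for an empty bin
    return sum(labels_in_bin) / len(labels_in_bin)

def histogram_classifier(x, X, Y, h):
    """Plug-in rule: predict 1 iff the estimated probability is at least 1/2."""
    return 1 if histogram_probability(x, X, Y, h) >= 0.5 else 0

# Tiny usage example with 2-d objects and binary labels:
X = [(0.1, 0.2), (0.3, 0.1), (0.9, 0.8), (0.7, 0.9)]
Y = [0, 0, 1, 1]
print(histogram_classifier((0.8, 0.7), X, Y, h=0.5))   # 1
```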

Page 31: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

How to select the bin size

The bin sizes must be

• Big enough to contain a sufficient number of objects

• Small enough to be “local” estimates

In a certain sense, it is easy.

Theorem

For the histogram estimate (2) with “cubic” bins (1), if hn → 0 and n·hn^d → ∞ as n → ∞, then

E(|η̂n(X) − η(X)|) → 0.

So histogram methods are very good! Or are they? Maybe, but this result is only asymptotic. Finite-step results cannot be obtained.

Page 32: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Another advantage of histogram methods: after building a classifier the data can be discarded. Thus: computational efficiency for test data.

Disadvantages of histograms:

• Sharp class decision boundaries

• Data-independent bins are not robust

• Curse of dimensionality: the number of bins grows exponentially with the dimension d of the object space.

Page 33: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Kernel methods

Let’s look at a slightly different histogram rule. Let

k(x) = 1 if |x^(j)| ≤ 1/2 for all j = 1, . . . , d, and 0 otherwise     (3)

(here x^(1), . . . , x^(d) are the coordinate components of x). Then k can be used to count the number of data points within the cube of size h around x:

K = ∑_{i=1}^n k((x − Xi)/h).

So the class probability can be estimated as

η̂(x) = (1/K) ∑_{i=1}^n I_{Yi=1} k((x − Xi)/h).
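A sketch of this estimate with the box kernel (3); the Gaussian kernel of the next slide can be passed in its place. The value returned when no point falls in the window is a choice of this sketch:

```python
import numpy as np

def box_kernel(u):
    """Kernel (3): 1 if every coordinate of u lies in [-1/2, 1/2], else 0."""
    return 1.0 if np.all(np.abs(np.atleast_1d(u)) <= 0.5) else 0.0

def kernel_probability(x, X, Y, h, kernel=box_kernel):
    """Kernel estimate of eta(x): weighted fraction of label-1 points around x."""
    x = np.asarray(x, dtype=float)
    weights = np.array([kernel((x - np.asarray(xi, dtype=float)) / h) for xi in X])
    K = weights.sum()
    if K == 0:
        return 0.5                      # no points in the window; arbitrary default
    return float(np.dot(weights, Y) / K)
```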

Page 34: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

We can use other kernel functions k(·) to obtain, in particular, smoother class decision boundaries. For example, the Gaussian kernel:

k(x) = (1/(2π)^{d/2}) e^{−||x||²/2},

where as usual ||x||² = ∑_{j=1}^d (x^(j))².

Page 35: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Non-parametric methods II

Daniil Ryabko

IDSIA/USI/SUPSI

Page 36: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Probability concepts you need for this course

• Elementary definition of probability: axioms of probability.

• Events, random variables

• Independence of events, of r.v.

• Conditional probabilities, conditional independence

• Expectation, variance

• Distribution functions, densities

• Distributions: Bernoulli, Binomial, Geometric, Gaussian, Exponential

These can be found in any book on probability! Or in the “probability background” sections of ML books, e.g. Bishop.

Page 43: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Probability concepts you DO NOT need for this course

• Measure theory

• σ-algebras (sigma-algebras)

• Extension theorems (e.g. Caratheodory’s, Kolmogorov’s)

• Different convergence concepts

If you see these in your book you can safely skip them. Although understanding them will do no harm.

Page 44: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

I will be giving some background material on probability, but don’t rely solely on it.

Page 45: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Classification setup

Recall the classification problem: we are given a (training) sample (X1,Y1), . . . , (Xn,Yn), where the Xi are from the object space X = R^d, and the Yi are from the label space Y = {0, 1}. The objects and labels (Xi,Yi) are generated i.i.d. by an (unknown) probability distribution P on R^d × Y.

We wish to construct a classifier that classifies objects whose labels we don’t know.

Page 46: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

k-Nearest Neighbours

The k-nearest neighbours rule classifies an object X according to the majority vote among its k nearest neighbours.

To find the k nearest neighbours of X, simply sort all the distances ||Xi − X||, i = 1, . . . , n, and take the Xi corresponding to the k smallest. Let X(1)(x), . . . , X(k)(x) be the k nearest neighbours of x and Y(1)(x), . . . , Y(k)(x) their labels.

The class probability estimate is

η̂n(x) = (1/k) ∑_{i=1}^k I_{Y(i)(x)=1}.
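A minimal sketch, computing and sorting all n distances per query as described above; ties in distance are broken arbitrarily by the sort:

```python
import numpy as np

def knn_probability(x, X, Y, k):
    """k-NN estimate of eta(x): fraction of label-1 points among the k nearest Xi."""
    x = np.asarray(x, dtype=float)
    dists = [np.linalg.norm(x - np.asarray(xi, dtype=float)) for xi in X]
    nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbours
    return sum(Y[i] for i in nearest) / k

def knn_classifier(x, X, Y, k):
    """Majority vote among the k nearest neighbours."""
    return 1 if knn_probability(x, X, Y, k) >= 0.5 else 0
```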

Page 47: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

How to select k?

Proposition

If kn → ∞ but kn/n → 0, then the expected error of the kn-nearest neighbour label probability estimate goes to zero:

E(|η̂n(x) − η(x)|) → 0.

Here, as before, η(x) = P(Y = 1|X = x) is the probability of label 1 given the object x.

For example, one can select kn = √n or kn = log(n).

Page 48: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Bigger k gives smoother and more robust estimates; smaller k is more “opportunistic”.

However, for deterministic labels (η(x) ∈ {0, 1} for all x) even the 1-nearest neighbour rule gives consistent estimates:

Proposition

If η(x) = 0 or η(x) = 1 for all x, then for the 1-nearest neighbour rule we have

E(|η̂n(x) − η(x)|) → 0.

Page 49: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Overfitting

Why is 1-nearest neighbour not good in general, when labels are non-deterministic? Because of overfitting.

Overfitting occurs when your classifier fits the training data too well. We can always fit the training data well, but that is not what we want. We want to make predictions for future data.

Page 50: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Weighted nearest neighbours

The k-nearest neighbours rule “asks” the k nearest neighbours about their labels and treats the answers equally. Maybe it’s better to trust the nearer neighbours more?

Let win, i = 1, . . . , n, be some positive weights that sum to one, ∑_{i=1}^n win = 1, with the neighbours ordered by their distance to x. Define the weighted nearest neighbour rule:

η̂(x) = ∑_{i=1}^n win I_{Y(i)(x)=1}.

For example, for k-nearest neighbours we have win = 1/k for i = 1, . . . , k and win = 0 otherwise.

Page 51: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Probability background interleaver goes here

Page 52: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Density estimation

Histogram and nearest neighbours rules can be used for density estimation too. Forget about the labels Yi. Now we have a sample X1, . . . , Xn of i.i.d. r.v. generated according to some unknown distribution P that has a density f. This means that

P(A) = ∫_A f(x) dx

for every event A.

We want to estimate f (x).

Page 53: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Histogram density estimate

To estimate the density f(x) at a point x, take a small region A around x and use the estimate f̂(x) = P(A)/∆(A), where ∆(A) is the area of A.

But we don’t even know P(A)!

So we use our empirical probability P̂(A) from Lecture 1:

f̂n(x) = P̂(A)/∆(A) = (1/(n∆(A))) ∑_{i=1}^n I_A(Xi).

But how do we select the “small” region A?

It should be small enough to give a “local” estimate, while big enough to contain some data points.

This brings us right back to histogram rules.

Page 54: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

We break the space into bins. Let An(x) be the bin containing x. Then the histogram density estimate is

f̂n^{histogram}(x) = (1/(n∆(An(x)))) ∑_{i=1}^n I_{An(x)}(Xi).

In the case of the cubic histogram rule, ∆(An(x)) = hn^d, so the cubic histogram density estimate is

f̂n^{cubic histogram}(x) = (hn^{−d}/n) ∑_{i=1}^n I_{An(x)}(Xi),

where hn is the size of the cube. Finally, the kernel density estimate with kernel k(·) is

f̂n^{kernel}(x) = (hn^{−d}/n) ∑_{i=1}^n k((x − Xi)/hn).
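A sketch of the kernel density estimate with the box kernel (3) from Lecture 1, so it reduces to counting the points in a cube of side h around x and dividing by n·h^d:

```python
import numpy as np

def box_kernel(u):
    """1 if every coordinate of u lies in [-1/2, 1/2], else 0 (kernel (3))."""
    return 1.0 if np.all(np.abs(np.atleast_1d(u)) <= 0.5) else 0.0

def kernel_density(x, X, h, kernel=box_kernel):
    """f_hat(x) = (h^{-d} / n) * sum_i k((x - X_i) / h)."""
    X = [np.atleast_1d(np.asarray(xi, dtype=float)) for xi in X]
    x = np.atleast_1d(np.asarray(x, dtype=float))
    d, n = len(x), len(X)
    return sum(kernel((x - xi) / h) for xi in X) / (n * h ** d)
```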

Page 55: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Nearest neighbours for density estimation

The same idea can be used with nearest neighbours. Recall that we want to have a small region A for which we take f̂n(x) = (1/(n∆(A))) ∑_{i=1}^n I_A(Xi).

Define An,kn(x) as a region containing exactly the kn nearest neighbours of x, and obtain the k-nearest neighbours density estimate

f̂n^{k-NN}(x) = (1/(n∆(An,kn(x)))) ∑_{i=1}^n I_{An,kn(x)}(Xi) = kn/(n∆(An,kn(x))).

Page 56: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Trees

In general, a (binary) tree is something like this:

Figure: A binary tree. From Devroye et al.

Page 57: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Return to classification.

A decision tree is an arrangement of questions, or tests, resulting in a classification (or a class probability estimate).

In a decision tree each node is a subset of the object space, or a bin. To determine whether an object is in a subtree, a simple question is asked. Asking these questions, we move down the tree.

Once in a leaf, classify by majority vote, or build a class probability estimate by counting the fraction of labels of each class in the leaf.

The questions can be:

• Is x^(j) ≤ α? (where x^(j) is the j’th coordinate of x and α is a parameter). This leads to ordinary classification trees.

• Is ∑_{j=1}^d aj x^(j) ≤ α? (where the aj and α are parameters). Gives binary space partition trees.

• Is ||x − z|| ≤ α? (where z and α are parameters). Gives sphere trees.

Page 58: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Examples of resulting bins:

Figure: Partitioning by an ordinary decision tree. From Devroye et al.

Figure: Partitioning by a BSP and by a sphere tree. From Devroye et al.

Page 59: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

A very simple decision tree

Apparently the simplest tree possible: on the first level, split the first coordinate in the middle. On the second level, split the second coordinate in the middle. On each next level, split the next coordinate in the middle, going in rounds, until a certain depth k is reached.

Page 60: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

The Median Tree

On the first level, split at the median of the first coordinate. If the median is on a data point, then leave it out (ignore it; don’t send it down to the subtrees). On the next levels, proceed in the same fashion with the other coordinates.

If we go down k levels, each leaf will have between n/2^k − k and n/2^k data points in it. Such a tree is also called a balanced tree.
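A minimal sketch of such a median tree, assuming real-valued coordinates; the nested-dictionary representation and the handling of points tied with the median are choices of this sketch:

```python
import numpy as np

def median_tree(X, depth, level=0):
    """Split at the median of coordinate (level mod d), drop points lying exactly
    on the median, and recurse until `depth` levels have been built."""
    X = [np.asarray(x, dtype=float) for x in X]
    if level == depth or len(X) <= 1:
        return {"leaf": X}                       # a leaf stores the points in its bin
    d = len(X[0])
    j = level % d                                # split the coordinates in rounds
    m = sorted(x[j] for x in X)[len(X) // 2]     # median of the j-th coordinate
    left = [x for x in X if x[j] < m]
    right = [x for x in X if x[j] > m]           # points exactly at the median are left out
    return {"coord": j, "median": m,
            "left": median_tree(left, depth, level + 1),
            "right": median_tree(right, depth, level + 1)}
```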

Page 61: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

The trees in the above examples have the following property: only the objects are used for the splits, not the labels. Let An(x) denote the leaf (bin) containing x. For such trees we have:

Proposition

For a decision tree class probability estimate η̂n(x), if

i) only the objects are used for making the split decisions, not the labels,

ii) diam(An(x)) → 0 in probability,

iii) #{i : Xi ∈ An(x)} → ∞ in probability,

then

E(|η̂n(X) − η(X)|) → 0.

Page 62: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

More trees

k-d trees: first split on the first coordinate, at the value of the first data point. Next, split on the next coordinate at the next data point. Stop after k splits.

Or don’t stop at all, but then prune the tree, uniting nodes until each leaf has at least k data points in it.

Page 63: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Using labels too

Why use only the objects to construct the tree? Why not use the labels too?

In particular, one can select a split that minimizes the empirical error. For real-valued data this can lead to some problems... But some good trees can be found this way! We will see them later.

Page 64: Non-Parametric Methods in Machine Learning

Assignment 1

1 Geometric distribution

The geometric distribution on the set N = {1, 2, 3, 4, . . .} is given by P_G(k) = (1 − p)^{k−1} p for k ∈ N.

The Bernoulli distribution on the set B = {0, 1} with parameter p is given by P_B(0) = p. It is a “toss of a biased coin”.

Suppose we are making independent Bernoulli coin tosses with parameter p. Verify that the probability that the first occurrence of 0 in this series is exactly at step k ∈ N is given by the geometric distribution.

Find the expectation of the geometric distribution.

2 Nearest Neighbors

Download the data data.zip, build a k-nearest neighbors classifier based on the training data, and test it on the testing data.

Use different values of k, and report the results for each value of k (number of classification errors).

Each data file is in csv format (comma separated values): each line corresponds to one example, and reports a comma-separated list of its numerical attributes, followed by its class label.

For the USPS data (directory images), the attributes are 256 grayscale levels, normalized between −1 and 1, and the class label is one of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. This dataset contains 7291 training examples and 2007 test examples. Suggested values for k: 1, 3, 5, 7, 9, 11, 21, 31.

For the IRIS data (directory iris), the attributes are 4 lengths in cm, and the class label is one of 0, 1, 2 (see the file irisdescr). This dataset contains 100 training examples and 50 test examples. Suggested values for k: 1, 3, 5, 7, 9.

Submit working and portable code: use only standard libraries. The code should be a correct implementation of the proposed algorithm. Coding style, exception handling, etc. will not affect your grade. Still, the code should be clear and easily understandable. Use comments!

Along with the code, provide a short description of the experiments you carried out (the values you tried for k, and the corresponding errors). Feel free to try other values than the proposed ones. Discuss how the performance varies with the choice of k.
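A possible starting point for this assignment (a sketch, not the official solution), using only the standard library; the file names train.csv and test.csv are placeholders, since only the directory layout of data.zip is described above:

```python
import csv
import math
from collections import Counter

def load(path):
    """Each csv row: numerical attributes followed by an integer class label."""
    examples = []
    with open(path) as f:
        for row in csv.reader(f):
            if row:
                examples.append(([float(v) for v in row[:-1]], int(float(row[-1]))))
    return examples

def knn_label(x, training, k):
    """Label of x by majority vote among the k nearest training examples."""
    nearest = sorted(training, key=lambda ex: math.dist(x, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return max(votes, key=votes.get)            # ties broken arbitrarily

def error_count(train, test, k):
    return sum(1 for x, y in test if knn_label(x, train, k) != y)

if __name__ == "__main__":
    train, test = load("train.csv"), load("test.csv")   # placeholder file names
    for k in (1, 3, 5, 7, 9):
        print(k, "errors:", error_count(train, test, k))
```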


Page 65: Non-Parametric Methods in Machine Learning

Assignment 1 - Part 1

October 8, 2007

1. (20 points) As the Bernoulli trials are independent of each other, the joint probability of any particular outcome is the product of the probabilities of the outcomes of each single trial:

P(x1, x2, . . . , xk) = P(x1)P(x2) · · · P(xk) = ∏_{i=1}^k P(xi).     (1)

As P(xi = 0) = p, and obviously P(xi = 1) = 1 − p, we can evaluate the probability that the first 0 occurs at the k-th trial as

Pr{x1 = 1, x2 = 1, . . . , xk−1 = 1, xk = 0} = (1 − p)^{k−1} p.     (2)

Not mentioning the independence condition will cost you 5 points.

2. (30 points) The expectation of a discrete distribution over an infinite set can be evaluated as

E{k} = ∑_{i=1}^∞ i P(i).     (3)

In the case of the geometric distribution P(k) = q^{k−1} p (where q = 1 − p),

E{k} = ∑_{i=1}^∞ i p q^{i−1} = p ∑_{i=1}^∞ i q^{i−1}     (4)

(the i = 0 term would contribute 0·q^{−1} = 0, so it may be included or not). We can write the terms to be summed as derivatives:

= p ∑_{i=1}^∞ (d/dq) q^i.     (5)

We can now exchange the derivative and sum operators:

= p (d/dq) ∑_{i=1}^∞ q^i = p (d/dq) (1/(1 − q)) = p · 1/(1 − q)² = 1/p,     (6)

since ∑_{i=0}^∞ q^i = 1/(1 − q) for q ∈ (0, 1) (the constant i = 0 term does not affect the derivative).

Page 66: Non-Parametric Methods in Machine Learning

As an alternative, we can simply write

E{k} = p ∑_{i=1}^∞ i q^{i−1} = p ∑_{i=1}^∞ ∑_{j=i}^∞ q^{j−1}.     (7)

Each term of the outer sum is the sum of a geometric series starting from i:

∑_{j=i}^∞ q^{j−1} = q^{i−1}/(1 − q),     (8)

so the result is again

E{k} = (p/(1 − q)) ∑_{i=1}^∞ q^{i−1} = (p/(1 − q)) · (1/(1 − q)) = p/(1 − q)² = 1/p,     (9)

since 1 − q = p.

Page 67: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Parametric methods

Daniil Ryabko1

IDSIA/USI/SUPSI

1 Some slides are based on Andrew Moore’s http://www.cs.cmu.edu/~awm/tutorials and on Bishop’s ML book, Chapter 3

Page 68: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Maximum likelihood learning

A simple and very fundamental way of making inference is maximum likelihood estimation.

Suppose we have n i.i.d. Gaussian N(µ, σ²) variables x1, . . . , xn:

f_{N(µ,σ²)}(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}.

Suppose we know σ² and want to find out µ.

Which µ is most likely given the data? Which µ maximizes the likelihood f_{N(µ,σ²)}(x1, . . . , xn)?

Page 71: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

calculation

Since the likelihood of an i.i.d. Gaussian sample is, up to factors not depending on µ, exp(−∑_{i=1}^n (xi − µ)²/(2σ²)), maximizing it is the same as minimizing the sum of squares:

argmax_µ {f_µ(x1, . . . , xn)} = argmin_µ {∑_{i=1}^n (xi − µ)²}.

Taking the derivative ∂/∂µ and setting it to zero we find

µ_MLE = (1/n) ∑_{i=1}^n xi.

Which wasn’t too unexpected.

Page 75: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

MLE in general

Suppose we have a vector of parameters θ = (θ1, . . . , θk).

• Write down LL = log fθ(x1, . . . , xn) (LL for “log likelihood”).

• Differentiate it with respect to the parameters:

∂LL/∂θ = (∂LL/∂θ1, ∂LL/∂θ2, . . . , ∂LL/∂θk)

• Solve the set of simultaneous equations ∂LL/∂θ = 0.

• Check that you found a maximum (not a minimum or a saddle point), and check the boundaries if you have any.

Page 82: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

We have n i.i.d. Gaussian N(µ, σ²) variables x1, . . . , xn. Now we know neither µ nor σ². Proceed the same way.

log p_{µ,σ²}(x1, . . . , xn) = −n(log√(2π) + (1/2) log σ²) − (1/(2σ²)) ∑_{i=1}^n (xi − µ)²

∂LL/∂µ = (1/σ²) ∑_{i=1}^n (xi − µ)

∂LL/∂σ² = −n/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^n (xi − µ)²

Setting these to 0 and solving, we get

µ_MLE = (1/n) ∑_{i=1}^n xi,     σ²_MLE = (1/n) ∑_{i=1}^n (xi − µ_MLE)².
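A quick numerical sketch of these two estimates on synthetic data (the true values µ = 2, σ = 3 are of course made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)   # synthetic sample, true mu = 2, sigma^2 = 9

mu_mle = x.mean()                                 # (1/n) * sum x_i
sigma2_mle = ((x - mu_mle) ** 2).mean()           # (1/n) * sum (x_i - mu_mle)^2

print(mu_mle, sigma2_mle)                         # should be close to 2 and 9
```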

Page 83: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Quality of the estimate

How do we assess the quality of these estimates? Are they any good?

An estimate θ̂ of the parameter θ is called unbiased if

E θ̂ = θ.

Let’s check µ_MLE:

E µ_MLE = (1/n) E ∑_{i=1}^n xi = E x1 = µ,

hence µ_MLE is unbiased.

Page 85: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

But σ²_MLE = (1/n) ∑_{i=1}^n (xi − (1/n) ∑_{j=1}^n xj)² is biased! Check for n = 1 first. In general,

E σ²_MLE = (1 − 1/n) σ² ≠ σ².

However, we can easily correct it, defining

σ²_corrected = (n/(n − 1)) σ²_MLE = (1/(n − 1)) ∑_{i=1}^n (xi − µ_MLE)²,

which is unbiased.

Page 88: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Still, an unbiased estimate is not necessarily a good estimate. For example, µ̂ = x1 is an unbiased estimator of µ, but it’s not good.

Further ways to evaluate estimators θ̂ of a parameter θ:

• Mean squared error E(θ̂ − θ)²

• Asymptotic convergence: θ̂ → θ as n → ∞

• Confidence intervals

But we won’t consider these in detail now.

Page 91: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

MLE for multidimensional Gaussian

For the d-dimensional Gaussian

f_{N(µ,Σ)}(x) = (1/((2π)^{d/2} |Σ|^{1/2})) e^{−(1/2)(x−µ)ᵀ Σ^{−1} (x−µ)}

and a sample x1, . . . , xn we get

µ_MLE = (1/n) ∑_{i=1}^n xi

and

Σ_MLE = (1/n) ∑_{i=1}^n (xi − µ_MLE)(xi − µ_MLE)ᵀ.

Page 92: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Regression

We are given a sample (X1,Y1), . . . , (Xn,Yn) of i.i.d. r.v. generated according to some probability distribution P on X × R, where as before X = R^d.

That is, we have a sample of n pairs (Xi,Yi) where Xi is a vector and Yi is a number.

We want to estimate the label Y for unseen objects X .

Similar to classification, but the labels Y are reals.

Let’s go parametric now!

Page 94: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Linear regression

In linear regression in its simplest form, we assume

y_w(x) = w0 + w1 x^(1) + w2 x^(2) + · · · + wd x^(d),     (1)

where w0, w1, . . . , wd are parameters.

But this is too simple!

We introduce some basis functions

ϕ1(x), . . . , ϕ_{M−1}(x)

(x here is the whole vector x = (x^(1), . . . , x^(d))) and let

y_w(x) = w0 + w1 ϕ1(x) + w2 ϕ2(x) + · · · + w_{M−1} ϕ_{M−1}(x).

Note that the number of basis functions M − 1 doesn’t have to equal the dimension d.

Page 96: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Letting ϕ0(x) = 1 we can rewrite

y_w(x) = ∑_{j=0}^{M−1} wj ϕj(x),

or in vector notation

y_w(x) = wᵀϕ(x).

(A symbol written without indices denotes a vector.)

Page 97: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Basis functions

Let d = 1 and let’s look at different basis functions.

Polynomial basis functions: ϕk(x) = x^k (here the upper index is a power).

Gaussian basis functions: ϕk(x) = e^{−(x−µk)²/(2s²)}, where µk and s are parameters.

Sigmoidal basis functions: ϕk(x) = σ((x − µk)/s), where σ(a) = 1/(1 + e^{−a}).

However, for what follows it absolutely doesn’t matter what the basis functions are!
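Minimal sketches of the three families for d = 1; µ_k and s are free parameters:

```python
import numpy as np

def polynomial_basis(x, k):
    return x ** k

def gaussian_basis(x, mu_k, s):
    return np.exp(-(x - mu_k) ** 2 / (2 * s ** 2))

def sigmoidal_basis(x, mu_k, s):
    return 1.0 / (1.0 + np.exp(-(x - mu_k) / s))
```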

Page 99: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

So we have (in vector notation)

y_w(x) = wᵀϕ(x).

But this is not probabilistic at all! What is there to learn then?

So we add some Gaussian noise ε, with zero mean and variance σ². Now,

y = wᵀϕ(x) + ε.

So in our sample (X1,Y1), . . . , (Xn,Yn) we have

Yi = wᵀϕ(Xi) + εi,

where the εi are i.i.d. Gaussian N(0, σ²).

This means that Yi is a Gaussian with mean wᵀϕ(Xi) and variance σ². The mean depends on X.

Thus we assume that we know very well how the data is generated, except that we don’t know some parameters: w and possibly σ².

Page 102: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

In parametric methods, the main goal often shifts to estimating the values of the parameters. When these are revealed, arbitrary inference can be made.

Page 103: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Maximum likelihood least squares

The log likelihood is

LL = log f_{N(wᵀϕ(Xi),σ²)}((X1,Y1), . . . , (Xn,Yn)) = −n(log√(2π) + (1/2) log σ²) − (1/(2σ²)) ∑_{i=1}^n (Yi − wᵀϕ(Xi))².

So (as before) we will have to minimize ∑_{i=1}^n (Yi − wᵀϕ(Xi))², which is why this is called least squares.

Differentiating with respect to the vector w, we have to solve

∑_{i=1}^n (Yi − wᵀϕ(Xi)) ϕ(Xi)ᵀ = 0.

Page 104: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

∑_{i=1}^n Yi ϕ(Xi)ᵀ − wᵀ ∑_{i=1}^n ϕ(Xi)ϕ(Xi)ᵀ = 0;

solving for w we obtain

w_MLLS = (ΦᵀΦ)^{−1} Φᵀ Y,     (2)

where

Φ =
ϕ0(X1) ϕ1(X1) . . . ϕ_{M−1}(X1)
ϕ0(X2) ϕ1(X2) . . . ϕ_{M−1}(X2)
. . .
ϕ0(Xn) ϕ1(Xn) . . . ϕ_{M−1}(Xn)

is called the design matrix, and (ΦᵀΦ)^{−1}Φᵀ is the Moore-Penrose pseudo-inverse of Φ. Solving also for σ² we get

σ²_MLLS = (1/n) ∑_{i=1}^n (Yi − (w_MLLS)ᵀ ϕ(Xi))².
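A sketch of this solution, building the design matrix from a made-up polynomial basis and solving the least-squares problem with numpy's pseudo-inverse routine:

```python
import numpy as np

def design_matrix(X, basis_functions):
    """Phi[i, j] = phi_j(X_i); the list of basis functions includes phi_0(x) = 1."""
    return np.array([[phi(x) for phi in basis_functions] for x in X])

def ml_least_squares(X, Y, basis_functions):
    """w_MLLS = (Phi^T Phi)^{-1} Phi^T Y, computed via a pseudo-inverse solve."""
    Phi = design_matrix(X, basis_functions)
    w, *_ = np.linalg.lstsq(Phi, np.asarray(Y, dtype=float), rcond=None)
    return w

# Example with basis functions 1, x, x^2 on made-up one-dimensional data:
X = [0.0, 0.5, 1.0, 1.5, 2.0]
Y = [1.1, 1.8, 3.2, 5.1, 7.9]
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]
print(ml_least_squares(X, Y, basis))
```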

Page 105: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Bayes rule

Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.

Page 106: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

The Bayes rule

P(A|B) = P(B|A)P(A) / P(B),

or, for densities,

f(x|y) = f(y|x)f(x) / f(y).

If A can take finitely many values aj we can continue:

P(A|B) = P(B|A)P(A) / ∑_j P(B|A = aj)P(A = aj).

Analogously, for densities,

f(x|y) = f(y|x)f(x) / ∫ f(y|x)f(x) dx.

Page 107: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Bayesian inference

Bayesian inference typically proceeds as follows:

• Parameterize the problem

• Define the prior distribution on the parameters

• Estimate the posterior distribution of the parameters given the data

• Make inference

Look at the Bayes rule again:

P(A|B) = P(B|A)P(A) / P(B).

Here A is the “parameter” and B is the data.

• P(A) is the prior probability of A,

• P(A|B) is the posterior probability of A,

• P(B|A) is the likelihood of B given A,

• P(B) is the prior probability of B; it just acts as a normalizing factor.

Page 108: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

The Bayesian approach is very general.

It can solve any problem.

However, the quality of the solution depends on

• Parametrization

• The choice of prior

Page 109: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Bayesian inference for Gaussian

Recall the problem: we are given X1, . . . , Xn generated i.i.d. according to a Gaussian distribution N(µ, σ²) and we want to estimate the parameter µ, with σ² considered known.

A Bayesian approach is to maximize the posterior probability of the parameters given the data.

That is, find the µ that maximizes

f(µ|X1, . . . , Xn) = f(X1, . . . , Xn|µ) fprior(µ) / fprior(X1, . . . , Xn).

Here f(X1, . . . , Xn|µ) is just f_{N(µ,σ²)}(X1, . . . , Xn), while fprior(µ) we define somehow; we just choose whatever seems more suitable and more convenient for us. Note that since fprior(X1, . . . , Xn) does not depend on µ we can drop it. Thus we define

µ_MAP = argmax_µ f(X1, . . . , Xn|µ) fprior(µ).

Page 112: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

So let's choose the prior to be also Gaussian: fprior(µ) = N(0, α). To find the maximum we follow the same procedure: differentiate and solve.

LP = −n(log √(2π) + (1/2) log σ²) − (1/(2σ²)) ∑_{i=1}^n (xi − µ)² − log √(2πα) − (1/(2α)) µ²

Differentiating and solving we find

µMAP = (1/(n + σ²/α)) ∑_{i=1}^n Xi.
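Here is a minimal Python sketch of this estimator (not part of the slides; the toy data and the values of σ² and α are made up). It compares the MLE with the MAP estimate derived above, µMAP = (∑ Xi)/(n + σ²/α):

    import numpy as np

    rng = np.random.default_rng(0)
    sigma2 = 1.0    # known noise variance sigma^2
    alpha = 0.5     # prior variance: mu ~ N(0, alpha)
    X = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=20)   # toy data

    mu_mle = X.mean()                              # maximum likelihood estimate
    mu_map = X.sum() / (len(X) + sigma2 / alpha)   # MAP estimate derived above

    print(mu_mle, mu_map)   # the MAP estimate is shrunk towards the prior mean 0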


Bayesian linear regression

Recall the setup: we are given a sample (X1, Y1), . . . , (Xn, Yn) of i.i.d. r.v. generated according to the following rule

Yi = w^T ϕ(Xi) + εi,

where ϕ0(x) = 1, ϕ1(x), . . . , ϕ_{M−1}(x) are basis functions, and the εi are independent 0-mean Gaussians with variance σ² (assumed known).

We have to estimate w.

A Bayesian approach is to have a prior over w, and then maximize the posterior.

Here's our prior: w is distributed according to a (multivariate) Gaussian N(0, αI) (where I is the identity matrix). In other words, the components wj are independent Gaussians with zero mean and variance α.


LP = −(1/(2σ²)) ∑_{i=1}^n (Yi − w^T ϕ(Xi))² − (1/(2α)) w^T w + const

Maximizing this is known as regularized least squares. The solution here is

wMAPLS = (Φ^T Φ + λI)^{−1} Φ^T Y,

where λ = σ²/α.

Regularization helps to prevent overfitting: if we have too many basis functions then it is easy to overfit. Regularization may help here, but it introduces another parameter, α.
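A short Python sketch of regularized least squares under these assumptions (the polynomial basis functions and the toy data are arbitrary choices made just for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    n, M = 30, 6                                   # data points, basis functions
    X = rng.uniform(-1, 1, size=n)
    Y = np.sin(np.pi * X) + rng.normal(scale=0.3, size=n)   # toy targets

    # Basis functions phi_j(x) = x**j, j = 0..M-1 (an arbitrary choice here).
    Phi = np.vander(X, M, increasing=True)         # n x M design matrix

    sigma2, alpha = 0.3 ** 2, 1.0
    lam = sigma2 / alpha                           # lambda = sigma^2 / alpha

    # w_MAPLS = (Phi^T Phi + lambda I)^{-1} Phi^T Y
    w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ Y)
    print(w_map)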


Conjugate priors

Why did we choose a Gaussian prior for the distribution of the parameters of the Gaussian?

The answer “because we don't know any other distributions” is wrong.

The answer “because it simplifies the calculations” is correct.

But how did we know that it would simplify the calculations?


Look again at the posterior likelihood for the parameters

f(X1, . . . , Xn|µ) fprior(µ).

If we take a Gaussian N(µ0, σ0²) prior over µ, as we did, this takes the form

(1/(2πσ²)^{n/2}) e^{−(1/(2σ²)) ∑_{i=1}^n (Xi−µ)²} · (1/√(2πσ0²)) e^{−(1/(2σ0²)) (µ−µ0)²}.

You can check that, as a function of µ, this is proportional to another Gaussian N(µn, σn²),

(1/√(2πσn²)) e^{−(1/(2σn²)) (µ−µn)²},

with

µn = (σ²/(nσ0² + σ²)) µ0 + (nσ0²/(nσ0² + σ²)) µMLE

and

σn² = (1/σ0² + n/σ²)^{−1}.
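A small numeric sketch of these posterior formulas in Python (the toy data and the prior parameters are invented; it only illustrates that the posterior mean sits between the prior mean and the MLE, and that the posterior variance shrinks with n):

    import numpy as np

    rng = np.random.default_rng(2)
    sigma2 = 2.0                    # known data variance
    mu0, sigma02 = 0.0, 1.0         # prior N(mu0, sigma0^2) on mu
    X = rng.normal(loc=1.5, scale=np.sqrt(sigma2), size=50)
    n, mu_mle = len(X), X.mean()

    # Posterior parameters from the formulas above.
    mu_n = (sigma2 / (n * sigma02 + sigma2)) * mu0 \
         + (n * sigma02 / (n * sigma02 + sigma2)) * mu_mle
    sigma_n2 = 1.0 / (1.0 / sigma02 + n / sigma2)

    print(mu0, mu_n, mu_mle)   # mu_n lies between the prior mean and the MLE
    print(sigma_n2)            # the posterior variance shrinks as n grows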


So if we take a Gaussian prior distribution on the parameter µ of a Gaussian, we get a Gaussian posterior. Thus the Gaussian distribution is self-conjugate.

In general, a prior probability distribution f(θ) (of the parameter θ) is said to be conjugate to the likelihood f(x|θ) if the resulting posterior f(θ|x) is in the same family as f(θ).


Topics covered

• What is a maximum likelihood estimator

• MLE estimator for parameters of a Gaussian

• Linear regression with abstract basis functions

• Conditional probabilities, Bayes rule

• Again, the Bayes rule

• Maximum a posteriori estimators

• MAP estimators for the Gaussian

• MAP for linear regression (regularized least squares)

• Conjugate priors: Gaussian, Bernoulli.


Useful exercise 1

These exercises are not graded and there's no deadline. But they are very useful, especially if you think you may need to refresh some background material.

Exercise (Matrices and stuff)

In the linear regression problem, suppose that ϕj(x) = x(j), that is, the basis functions are degenerate and the linear regression is given by (1). Let also the number of data points n = 2 and d = 2. Work out the solution for wMLLS without vector/matrix notation, and check that in this case it is indeed the same as the solution given by the general formula (2) for wMLLS.


Useful exercise 3

Exercise (Different distributions and conjugates)

Fill in what's missing in the derivation of the conjugate Gaussian distribution. Make sure you know what a binomial distribution is and how it is related to the Bernoulli distribution. Find its mean. Check what's going on in the derivation of the conjugate for Bernoulli.


Basics of Bayesian Networks

Daniil Ryabko¹

IDSIA/USI/SUPSI

¹Some slides are based on Andrew Moore's http://www.cs.cmu.edu/~awm/tutorials


This lecture is simpler.


Discrete spaces

We are moving to a simpler classification task: discrete objects.

We are given a sample (X1, Y1), . . . , (Xn, Yn) generated i.i.d. according to P. The Yi are binary, but the Xi are also from a finite set X. We assume that each X is a d-dimensional vector of discrete features: X ∈ X = B^d, where B = {b1, . . . , bk} is a finite set.


Tables for the joint distribution

At first this problem is easy. Given the data, we can simply tabulate the estimate of the joint distribution P(x(1), . . . , x(d), Y).

x(1)  x(2)  y   # records
 0     0    0    2
 0     0    1    4
 0     1    0    5
 0     1    1    1
 1     0    0    3
 1     0    1    13
 1     1    0    9
 1     1    1    2


Dividing each count by the total number of records (39) gives the estimated probabilities P̂:

x(1)  x(2)  y   P̂
 0     0    0    2/39
 0     0    1    4/39
 0     1    0    5/39
 0     1    1    1/39
 1     0    0    3/39
 1     0    1    13/39
 1     1    0    9/39
 1     1    1    2/39


What is the probability P(Y = 1, X(1) = 0)? Just sum the matching rows.

How about P(Y = 1 | X(1) = 0, X(2) = 1)?

Or this: P(Y = 1 | X(2) = 0)?

In general, for n binary variables the table has 2^n rows. For n m-ary variables, m^n rows.
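These queries amount to summing the matching rows and, for conditional probabilities, dividing two such sums. A minimal Python sketch using the table above:

    # Estimated joint distribution from the table above (counts divided by 39).
    table = {
        (0, 0, 0): 2/39,  (0, 0, 1): 4/39,  (0, 1, 0): 5/39, (0, 1, 1): 1/39,
        (1, 0, 0): 3/39,  (1, 0, 1): 13/39, (1, 1, 0): 9/39, (1, 1, 1): 2/39,
    }

    def prob(x1=None, x2=None, y=None):
        """Sum the rows that match the given (partial) assignment."""
        return sum(p for (a, b, c), p in table.items()
                   if (x1 is None or a == x1)
                   and (x2 is None or b == x2)
                   and (y is None or c == y))

    print(prob(x1=0, y=1))                            # P(Y=1, X(1)=0) = 5/39
    print(prob(x1=0, x2=1, y=1) / prob(x1=0, x2=1))   # P(Y=1 | X(1)=0, X(2)=1) = 1/6
    print(prob(x2=0, y=1) / prob(x2=0))               # P(Y=1 | X(2)=0) = 17/22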


This is a table for a database with 3 binary attributes: sex (M/F), hours worked (less than or more than 40.5), and wealth (poor/rich).


So, this way we can get estimates for any probabilities!

But, in general, for n m-ary variables we have m^n rows. This requires quite a lot of storage space, but it is also quite hard to fill such a table with data. In fact, it's almost impossible to do this for about 40 attributes.

What can help us is (conditional) independence of some variables.


If two variables X and Y are independent we can restore all 4 rows of the table from P(X = 0) and P(Y = 0).


A more interesting case is conditional independence. First recall the conditional probability definition

P(A|B) = P(A, B) / P(B)

and the Bayes rule in this form

P(A|B) = P(B|A)P(A) / ∑_a P(B|A = a)P(A = a).

We will be using these all the time.


Conditional independence

A is conditionally independent of B given Y if

P(A|Y, B) = P(A|Y).

Check that this is symmetric: if A is conditionally independent of B given Y then B is conditionally independent of A given Y.

A (bit more) conditional Bayes rule:

P(A|B, X) = P(B|A, X) P(A|X) / P(B|X)


If x(1) and x(2) are conditionally independent given y then we only need P(x(1) = 1|y = 0), P(x(1) = 1|y = 1), P(x(2) = 1|y = 0), P(x(2) = 1|y = 1) and P(y = 1) to fill all the 8 rows of the table.

Check it!


Graph

[Diagram: a node y with arrows pointing to x(1), x(2), x(3).]

Graphically, we can show conditional independence of x(1), x(2), x(3) given y like this.


Naive Bayes

Assume that all features X(i) are independent given the label Y.

How do we make predictions in such a model? For a given X = X(1), . . . , X(d) we want argmax_a P(Y = a|X(1), . . . , X(d)).

argmax_a P(Y = a|X(1), . . . , X(d))
  = argmax_a P(X(1), . . . , X(d)|Y = a) P(Y = a) / P(X(1), . . . , X(d))
  = argmax_a P(X(1), . . . , X(d)|Y = a) P(Y = a)
  = argmax_a P(Y = a) ∏_{i=1}^d P(X(i)|Y = a)


So the Naive Bayes algorithm works as follows:

• estimate P(Y = a) for each a using frequencies;
• estimate P(X(i) = b|Y = a) for all i, a, b using frequencies;
• for a new object X = b(1), . . . , b(d) predict argmax_a P(Y = a) ∏_{i=1}^d P(X(i) = b(i)|Y = a).


How many frequencies do we have to count, if all sets are binary? 2 + 2d. If the size of the label space |Y| is m and each feature X(i) is from a k-ary set? m + (k − 1)md. This is much better than the exponential number required for the full tables!
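A minimal Python sketch of these three steps on a tiny made-up binary dataset (plain frequency estimates, with no smoothing, which a real implementation would probably want to add):

    from collections import Counter, defaultdict

    # Tiny invented dataset: each row is ((x(1), ..., x(d)), y).
    data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1),
            ((1, 1), 1), ((1, 0), 1), ((0, 0), 0)]

    # Step 1: estimate P(Y = a) by frequencies.
    n = len(data)
    p_y = {a: c / n for a, c in Counter(y for _, y in data).items()}

    # Step 2: estimate P(X(i) = b | Y = a) by frequencies.
    counts = defaultdict(Counter)            # (i, a) -> Counter over values b
    for x, y in data:
        for i, b in enumerate(x):
            counts[(i, y)][b] += 1
    p_x_given_y = {k: {b: c / sum(cnt.values()) for b, c in cnt.items()}
                   for k, cnt in counts.items()}

    # Step 3: predict argmax_a P(Y=a) * prod_i P(X(i)=b(i) | Y=a).
    def predict(x):
        def score(a):
            s = p_y[a]
            for i, b in enumerate(x):
                s *= p_x_given_y[(i, a)].get(b, 0.0)
            return s
        return max(p_y, key=score)

    print(predict((1, 1)))   # predicts 1 on this toy data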


What if we want probability estimates from our algorithm, not just predictions? Recall that

P(Y = a|X(1), . . . , X(d)) = P(Y = a) ∏_{i=1}^d P(X(i)|Y = a) / P(X(1), . . . , X(d))
  = P(Y = a) ∏_{i=1}^d P(X(i)|Y = a) / ∑_{a′} P(Y = a′, X(1), . . . , X(d))
  = P(Y = a) ∏_{i=1}^d P(X(i)|Y = a) / ∑_{a′} ( P(Y = a′) ∏_{i=1}^d P(X(i)|Y = a′) )


Independence does not imply conditional independence

Once again: independence does not imply conditional independence!

Think of an example of 3 binary random variables X, Y, Z such that X and Y are independent, but are not independent given Z.

Hint: maybe Z should be some function of X and Y.

Z = (X + Y) mod 2
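A quick Python sketch that checks this example by enumerating the joint distribution of two independent fair bits X, Y and Z = (X + Y) mod 2:

    from itertools import product

    # X and Y are independent fair coin flips; Z = (X + Y) mod 2.
    joint = {}
    for x, y in product([0, 1], repeat=2):
        joint[(x, y, (x + y) % 2)] = 0.25

    def p(**fix):
        keys = ('x', 'y', 'z')
        return sum(pr for vals, pr in joint.items()
                   if all(v == fix[k] for k, v in zip(keys, vals) if k in fix))

    # Marginally independent: P(X=1, Y=1) == P(X=1) P(Y=1)
    print(p(x=1, y=1), p(x=1) * p(y=1))                       # 0.25  0.25

    # But not independent given Z:
    print(p(x=1, y=1, z=0) / p(z=0))                          # 0.5
    print((p(x=1, z=0) / p(z=0)) * (p(y=1, z=0) / p(z=0)))    # 0.25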


Another type of relation is this: let the features x(1), x(2), x(3) be independent. Each of them provides information about Y.


[Diagram: nodes x(1), x(2), x(3), each with an arrow pointing into y.]

This type of dependence by itself does not give us a big advantage in the calculation of probabilities.

We need P(Y|x(1), x(2), x(3)) for all assignments of values to the variables.

But it gives us a nice picture!


The burglar example

• Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes.
• The Earth arguably doesn't care whether your house is currently being burgled.
• While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing.


In general, a Bayesian network (or a “belief network”) is a directed acyclic graph.

Each node is a variable.

Each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents.

For each variable X^k, we need to know the probability of each of its values given each assignment of values to all its parents, P(X^k|parents(X^k)).


Inference in Bayesian Networks

Let's see how we can compute the probability of any row of the table of the joint distribution with the Bayes net. (And you remember that if we know the joint distribution then we know everything.) Here I dropped the Y for convenience; we only have the X^i.

P(X^1, X^2, . . . , X^d) = P(X^d|X^1, . . . , X^{d−1}) P(X^1, . . . , X^{d−1})
  = P(X^d|parents(X^d)) P(X^1, . . . , X^{d−1})
  = P(X^d|parents(X^d)) P(X^{d−1}|X^1, . . . , X^{d−2}) P(X^1, . . . , X^{d−2})
  = · · · = ∏_{i=1}^d P(X^i|parents(X^i))

Let's compute something for some example.
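For instance, here is a Python sketch for a small net in the spirit of the burglar story; the structure (Burglar → Alarm ← Earthquake, Alarm → Call) and all the conditional probabilities are invented for illustration only:

    # All numbers below are made up; they are not from the slides.
    p_burglar = 0.01
    p_earthquake = 0.02
    p_alarm = {                       # P(Alarm = 1 | Burglar, Earthquake)
        (1, 1): 0.95, (1, 0): 0.90, (0, 1): 0.20, (0, 0): 0.01,
    }
    p_call = {1: 0.70, 0: 0.05}       # P(Call = 1 | Alarm)

    def joint(b, e, a, c):
        """P(B, E, A, C) = P(B) P(E) P(A | B, E) P(C | A)."""
        pb = p_burglar if b else 1 - p_burglar
        pe = p_earthquake if e else 1 - p_earthquake
        pa = p_alarm[(b, e)] if a else 1 - p_alarm[(b, e)]
        pc = p_call[a] if c else 1 - p_call[a]
        return pb * pe * pa * pc

    # One row of the joint table: no burglar, no earthquake, alarm rings, neighbor calls.
    print(joint(0, 0, 1, 1))          # 0.99 * 0.98 * 0.01 * 0.70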


Computational Complexity (Informally)

• To store a Bayes net we need capacity linear in the number of nodes, and only exponential in the maximal number of parents of any node.
• We can compute any row of the table of the joint in linear time.
• So we can compute probabilities of anything given anything!
• But...
• In general, computing probabilities like P(Y|E), where E is some set of nodes, is exponential...
• it can be exponential in the number of nodes.
• General querying of Bayes nets is NP-hard.


More pictures

[Diagram: a chain X → Y → Z.]

Here Z is conditionally independent of X given Y. By symmetry, X is also conditionally independent of Z given Y. Y is not conditionally independent of anything. Looks like what we modelled on our first graph.


What is independent of what on the graph?

• While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing.
• Then you hear on the radio that there was a small earthquake in your area. So perhaps it wasn't the burglar after all.
• The earthquake explains away the hypothetical burglar.
• But then it must not be the case that Burglar is independent of the Earthquake given the Phone call?
• Don't worry, it's not.


d-separation

If two nodes X and Y are d-separated by a set of nodes E then they are conditionally independent given E.

X and Y are d-separated by a set of nodes E if every undirected path p between X and Y is blocked by E, where a path p is blocked by E if either

• p contains a chain X → t → Y or a fork X ← t → Y such that t is in E, or
• p contains an inverted fork (collider) X → t ← Y such that t is not in E and none of the descendants of t is in E.


Let’s check whether the following independencies hold:

• Is C independent of D?

• Is C independent of D given A?

• Is C independent of D given A, B?

• Is C independent of D given A, B, J?

• Is C independent of D given A, B, E, J?


Bayes nets summarized

• When building a Bayes net we have to be sure that each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents.
• For each variable X^k, we need to know the probability of each of its values given each assignment of values to all its parents, P(X^k|parents(X^k)).
• After that, we can make inference...
• though making arbitrary inference is NP-hard.
• We can also look up what independencies are implied by the net.
• Bayes nets are used in applications where considerable expert knowledge is available, for example in medical applications. After a net is constructed using this knowledge, a lot of new inference can be made automatically; this includes predicting conditional and joint probabilities, and analyzing dependencies.


Overview: Universal versus specific methods

Daniil Ryabko

IDSIA/USI/SUPSI


Classification with real-valued features

We are given a (training) sample (X1, Y1), . . . , (Xn, Yn), where the Xi are from the object space X = R^d, and the Yi are from the label space Y = {0, 1}. The coordinate components x(1), . . . , x(d) of an object x are called (input) features. The objects and labels (Xi, Yi) are generated i.i.d. by an (unknown) probability distribution P on R^d × Y.

We wish to construct a classifier that classifies objects whose labels we don't know.


Non-parametric classification

Methods for non-parametric classification, such as histogram methods, k-nearest neighbours and many decision trees, classify new objects according to the majority vote in the small bin containing the object.

These methods are “universal” in the sense that they work (asymptotically) for any distribution P generating the examples; we have theorems like this:

Theorem. For any distribution P generating the examples the following is true. For the cubic histogram rule, if h_n → 0 and n·h_n^d → ∞ as n → ∞, then

E(|η̂(X) − η(X)|) → 0.


Curse of dimensionality

A histogram method that breaks each coordinate into k parts (that is, h = 1/k) in d dimensions will have k^d bins. We have to have enough data to make inference about k^d different values.

If the distribution generating the data is arbitrary, making inference about one of the bins doesn't tell us anything about the other bins.


Only Asymptotic

Even if we can fill k^d bins with data, can we fill (k + 1)^d? Which k is enough? The performance guarantees are only asymptotic.


To overcome the curse of dimensionality, one has to consider smaller families of distributions. For example, consider the distributions that have a density of some particular form.

A d-dimensional Gaussian distribution (with a diagonal covariance matrix) is defined by 2d parameters. We only have to estimate those; that is, we only have to have enough data to estimate 2d parameters.

Linear regression with M basis functions has M or M + 1 parameters (independent of the dimension).


Estimating the parameters of a d-dimensional Gaussian distribution is useful only if the real distribution generating the data is indeed (close to) Gaussian.

Thus we can choose between “universal” methods that work for any distribution and specific methods that work only in specific cases but are more data efficient.


Classification and inference with discrete-valued features

If our objects are from a finite space, we have a seemingly simpler task of classification with discrete features.

However, if an object x = (x(1), . . . , x(d)) is a d-dimensional binary vector, there are 2^d possible values for x.

Tabulating the full joint distribution (that is, counting the number of data points that have the form (x(1), . . . , x(d), y) for each combination of x(i) and y) requires estimation of 2^{d+1} values.

This is effectively the same as a cubic histogram method.


Bayesian networks

Bayesian networks allow us to reduce the dimensionality of the problem by considering distributions of a specific form.

For example, if the d binary features x(1), . . . , x(d) are conditionally independent given the label, we only have to estimate the 2d values P(x(i) = 0|y = 0), P(x(i) = 0|y = 1) to make a classification.

Again we have the same trade-off: universality versus data efficiency.


Real-valued or discrete inputs?

If you like some method that is designed for discrete inputs, but you have real-valued data, you can quantize it. Quantization into k bins: for x ∈ [0, 1] let [x] = t such that t/k ≤ x < (t + 1)/k.
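A tiny Python sketch of this quantization rule (the handling of the right endpoint x = hi, clipped into the last bin, is my own choice, not specified on the slide):

    def quantize(x, k, lo=0.0, hi=1.0):
        """Map x in [lo, hi] to a bin index t in {0, ..., k-1} (equal-width bins)."""
        t = int((x - lo) / (hi - lo) * k)
        return min(t, k - 1)          # clip the right endpoint into the last bin

    print([quantize(x, 4) for x in (0.0, 0.24, 0.25, 0.9, 1.0)])   # [0, 0, 1, 3, 3]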


• Universal methods are guaranteed to work for an arbitrary distribution.
• But only asymptotically.
• Even beyond the asymptotics they seem to require too much data: exponential in the number of dimensions.
• Specific (parametric) methods reduce the problem to estimation of a few parameters.
• In some cases, the number of parameters is only linear in the dimension.
• But they work only in the specific situations they are designed for.
• In general, you need some “domain knowledge” to be able to choose a good specific method.


Assignment 2

1 MLE and MAP

i) A distribution P on the 3-element set {a, b, c} is defined by two parameters p = P({a}) and q = P({b}). Construct a maximum likelihood estimator (MLE) for these two parameters.

ii) Construct a maximum a posteriori estimator (MAP) for the parameter p of the Bernoulli distribution, assuming a prior distribution over the parameter with density f(p) = 3p² for p ∈ [0, 1] and f(p) = 0 otherwise.

2 Naive Bayes

Implement a Naive Bayes classifier for the “Iris” dataset provided in the first assignment, using quantization for real-valued inputs. Quantize each input using k bins of equal size, with different values of k; report the results.

Implementation details:

– Quantize each feature between 0.0 and 8.0. Use a power of two for the number of bins: try k = 2, 4, 8, ..., 64.
– If for a test set point two or more labels have the same probability, pick one randomly.
– Your program should output the number of errors. Report these values for each k, and discuss the results. Please include a script that can be used to reproduce the experiment, or implement your experiment in the main file.


A tutorial on Principal Components Analysis

Lindsay I Smith

February 26, 2002


Chapter 1

Introduction

This tutorial is designed to give the reader an understanding of Principal Components Analysis (PCA). PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and is a common technique for finding patterns in data of high dimension.

Before getting to a description of PCA, this tutorial first introduces mathematical concepts that will be used in PCA. It covers standard deviation, covariance, eigenvectors and eigenvalues. This background knowledge is meant to make the PCA section very straightforward, but can be skipped if the concepts are already familiar.

There are examples all the way through this tutorial that are meant to illustrate the concepts being discussed. If further information is required, the mathematics textbook “Elementary Linear Algebra 5e” by Howard Anton, Publisher John Wiley & Sons Inc, ISBN 0-471-85223-6 is a good source of information regarding the mathematical background.


Chapter 2

Background Mathematics

Thissectionwill attemptto givesomeelementarybackgroundmathematicalskills thatwill be requiredto understandthe processof Principal ComponentsAnalysis. Thetopicsarecoveredindependentlyof eachother, andexamplesgiven.It is lessimportantto remembertheexactmechanicsof a mathematicaltechniquethanit is to understandthereasonwhy sucha techniquemaybeused,andwhattheresultof theoperationtellsusaboutourdata.Not all of thesetechniquesareusedin PCA,but theonesthatarenotexplicitly requireddo provide thegroundingon which themostimportanttechniquesarebased.

I have includeda sectionon Statisticswhich looks at distribution measurements,or, how the datais spreadout. The othersectionis on Matrix Algebraandlooks ateigenvectorsandeigenvalues,importantpropertiesof matricesthatarefundamentaltoPCA.

2.1 Statistics

Theentiresubjectof statisticsis basedaroundtheideathatyouhavethisbig setof data,andyou want to analysethat set in termsof the relationshipsbetweenthe individualpointsin thatdataset.I amgoingto look at a few of themeasuresyou cando on a setof data,andwhatthey tell youaboutthedataitself.

2.1.1 Standard Deviation

To understand standard deviation, we need a data set. Statisticians are usually concerned with taking a sample of a population. To use election polls as an example, the population is all the people in the country, whereas a sample is a subset of the population that the statisticians measure. The great thing about statistics is that by only measuring (in this case by doing a phone survey or similar) a sample of the population, you can work out what is most likely to be the measurement if you used the entire population. In this statistics section, I am going to assume that our data sets are samples

Page 220: Non-Parametric Methods in Machine Learning

of some bigger population. There is a reference later in this section pointing to more information about samples and populations.

Here’s an example set of numbers, which I will simply refer to with the symbol X. If I want to refer to an individual number in this data set, I will use subscripts on the symbol X to indicate a specific number. E.g. X₃ refers to the 3rd number in X, namely the number 4. Note that X₁ is the first number in the sequence, not X₀ like you may see in some textbooks. Also, the symbol n will be used to refer to the number of elements in the set X.

There are a number of things that we can calculate about a data set. For example, we can calculate the mean of the sample. I assume that the reader understands what the mean of a sample is, and will only give the formula:

X̄ = ( Σ_{i=1}^{n} X_i ) / n

Notice the symbol X̄ (said “X bar”) to indicate the mean of the set X. All this formula says is “Add up all the numbers and then divide by how many there are”.

Unfortunately, the mean doesn’t tell us a lot about the data except for a sort of middle point. For example, these two data sets have exactly the same mean (10), but are obviously quite different:

[0 8 12 20]   and   [8 9 11 12]

So what is different about these two sets? It is the spread of the data that is different. The Standard Deviation (SD) of a data set is a measure of how spread out the data is.

How do we calculate it? The English definition of the SD is: “The average distance from the mean of the data set to a point”. The way to calculate it is to compute the squares of the distance from each data point to the mean of the set, add them all up, divide by (n − 1), and take the positive square root. As a formula:

s = sqrt( Σ_{i=1}^{n} (X_i − X̄)² / (n − 1) )

Where s is the usual symbol for the standard deviation of a sample. I hear you asking “Why are you using (n − 1) and not n?”. Well, the answer is a bit complicated, but in general, if your data set is a sample data set, i.e. you have taken a subset of the real world (like surveying 500 people about the election), then you must use (n − 1) because it turns out that this gives you an answer that is closer to the standard deviation that would result if you had used the entire population, than if you’d used n. If, however, you are not calculating the standard deviation for a sample, but for an entire population, then you should divide by n instead of (n − 1). For further reading on this topic, the web page http://mathcentral.uregina.ca/RR/database/RR.09.95/weston2.html describes standard deviation in a similar way, and also provides an example experiment that shows the

Page 221: Non-Parametric Methods in Machine Learning

Set 1:

X     (X − X̄)   (X − X̄)²
0     −10       100
8     −2        4
12    2         4
20    10        100
Total                208
Divided by (n − 1)   69.333
Square root          8.3266

Set 2:

X     (X − X̄)   (X − X̄)²
8     −2        4
9     −1        1
11    1         1
12    2         4
Total                10
Divided by (n − 1)   3.333
Square root          1.8257

Table 2.1: Calculation of standard deviation

difference between each of the denominators. It also discusses the difference between samples and populations.

So, for our two data sets above, the calculations of standard deviation are in Table 2.1.

And so, as expected, the first set has a much larger standard deviation due to the fact that the data is much more spread out from the mean. Just as another example, the data set [10 10 10 10] also has a mean of 10, but its standard deviation is 0, because all the numbers are the same. None of them deviate from the mean.

2.1.2 Variance

Variance is another measure of the spread of data in a data set. In fact it is almost identical to the standard deviation. The formula is this:

s² = Σ_{i=1}^{n} (X_i − X̄)² / (n − 1)

Page 222: Non-Parametric Methods in Machine Learning

You will notice that this is simply the standard deviation squared, in both the symbol (s²) and the formula (there is no square root in the formula for variance). s² is the usual symbol for the variance of a sample. Both of these measurements are measures of the spread of the data. Standard deviation is the most common measure, but variance is also used. The reason why I have introduced variance in addition to standard deviation is to provide a solid platform from which the next section, covariance, can launch.
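As a quick sanity check of these formulas, the sample statistics of the two example sets above can be computed with numpy (ddof=1 selects the n − 1 denominator used here):

import numpy as np

set1 = np.array([0, 8, 12, 20])
set2 = np.array([8, 9, 11, 12])
for s in (set1, set2):
    print(s.mean(), s.var(ddof=1), s.std(ddof=1))
# set1: mean 10.0, variance 69.333, standard deviation 8.3266
# set2: mean 10.0, variance  3.333, standard deviation 1.8257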

Exercises

Find the mean, standard deviation, and variance for each of these data sets.

• [12 23 34 44 59 70 98]

• [12 15 25 27 32 88 99]

• [15 35 78 82 90 95 97]

2.1.3 Covariance

The last two measures we have looked at are purely 1-dimensional. Data sets like this could be: heights of all the people in the room, marks for the last COMP101 exam, etc. However many data sets have more than one dimension, and the aim of the statistical analysis of these data sets is usually to see if there is any relationship between the dimensions. For example, we might have as our data set both the height of all the students in a class, and the mark they received for that paper. We could then perform statistical analysis to see if the height of a student has any effect on their mark.

Standard deviation and variance only operate on 1 dimension, so that you could only calculate the standard deviation for each dimension of the data set independently of the other dimensions. However, it is useful to have a similar measure to find out how much the dimensions vary from the mean with respect to each other.

Covariance is such a measure. Covariance is always measured between 2 dimensions. If you calculate the covariance between one dimension and itself, you get the variance. So, if you had a 3-dimensional data set (x, y, z), then you could measure the covariance between the x and y dimensions, the x and z dimensions, and the y and z dimensions. Measuring the covariance between x and x, or y and y, or z and z would give you the variance of the x, y and z dimensions respectively.

The formula for covariance is very similar to the formula for variance. The formula for variance could also be written like this:

var(X) = Σ_{i=1}^{n} (X_i − X̄)(X_i − X̄) / (n − 1)

where I have simply expanded the square term to show both parts. So given that knowledge, here is the formula for covariance:

cov(X, Y) = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / (n − 1)

Page 223: Non-Parametric Methods in Machine Learning

Figure 2.1: A plot of the covariance data showing the positive relationship between the number of hours studied and the mark received

It is exactly the same except that in the second set of brackets, the X’s are replaced by Y’s. This says, in English, “For each data item, multiply the difference between the x value and the mean of x by the difference between the y value and the mean of y. Add all these up, and divide by (n − 1)”.

How does this work? Let’s use some example data. Imagine we have gone into the world and collected some 2-dimensional data; say, we have asked a bunch of students how many hours in total they spent studying COSC241, and the mark that they received. So we have two dimensions, the first is the H dimension, the hours studied, and the second is the M dimension, the mark received. Table 2.2 holds my imaginary data, and the calculation of cov(H, M), the covariance between the Hours of study done and the Mark received.

So what does it tell us? The exact value is not as important as its sign (i.e. positive or negative). If the value is positive, as it is here, then that indicates that both dimensions increase together, meaning that, in general, as the number of hours of study increased, so did the final mark.

If the value is negative, then as one dimension increases, the other decreases. If we had ended up with a negative covariance here, then that would have said the opposite, that as the number of hours of study increased the final mark decreased.

In the last case, if the covariance is zero, it indicates that the two dimensions are independent of each other.

The result that the mark given increases as the number of hours studied increases can be easily seen by drawing a graph of the data, as in Figure 2.1. However, the luxury of being able to visualize data is only available at 2 and 3 dimensions. Since the covariance value can be calculated between any 2 dimensions in a data set, this technique is often used to find relationships between dimensions in high-dimensional data sets where visualisation is difficult.

You might ask “is cov(X, Y) equal to cov(Y, X)?”. Well, a quick look at the formula for covariance tells us that yes, they are exactly the same, since the only difference between cov(X, Y) and cov(Y, X) is that (X_i − X̄)(Y_i − Ȳ) is replaced by (Y_i − Ȳ)(X_i − X̄). And since multiplication is commutative, which means that it doesn’t matter which way around I multiply two numbers, I always get the same number, these two equations give the same answer.

2.1.4 The covariance matrix

Recall that covariance is always measured between 2 dimensions. If we have a data set with more than 2 dimensions, there is more than one covariance measurement that can be calculated. For example, from a 3-dimensional data set (dimensions x, y, z) you could calculate cov(x, y), cov(x, z), and cov(y, z). In fact, for an n-dimensional data set, you can calculate n! / ((n − 2)! · 2) different covariance values.

Page 224: Non-Parametric Methods in Machine Learning

Data:

Hours (H)   Mark (M)
9           39
15          56
25          93
14          61
10          50
18          75
0           32
16          85
5           42
19          70
16          66
20          80

Totals      167   749
Averages    13.92 62.42

Covariance:

H    M    (H_i − H̄)  (M_i − M̄)  (H_i − H̄)(M_i − M̄)
9    39   −4.92      −23.42     115.23
15   56   1.08       −6.42      −6.93
25   93   11.08      30.58      338.83
14   61   0.08       −1.42      −0.11
10   50   −3.92      −12.42     48.69
18   75   4.08       12.58      51.33
0    32   −13.92     −30.42     423.45
16   85   2.08       22.58      46.97
5    42   −8.92      −20.42     182.15
19   70   5.08       7.58       38.51
16   66   2.08       3.58       7.45
20   80   6.08       17.58      106.89

Total     1149.89
Average   104.54

Table 2.2: 2-dimensional data set and covariance calculation
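As a quick check, the same covariance can be computed with numpy directly from the table’s data:

import numpy as np

H = np.array([9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20])
M = np.array([39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80])

cov_hm = ((H - H.mean()) * (M - M.mean())).sum() / (len(H) - 1)
print(cov_hm)            # about 104.5, matching the table
print(np.cov(H, M))      # the full 2x2 covariance matrix; entry [0, 1] is cov(H, M)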


Page 225: Non-Parametric Methods in Machine Learning

A useful way to get all the possible covariance values between all the different dimensions is to calculate them all and put them in a matrix. I assume in this tutorial that you are familiar with matrices, and how they can be defined. So, the definition for the covariance matrix for a set of data with n dimensions is:

C^(n×n) = (c_{i,j}),   where   c_{i,j} = cov(Dim_i, Dim_j),

where C^(n×n) is a matrix with n rows and n columns, and Dim_x is the x-th dimension.

All that this ugly looking formula says is that if you have an n-dimensional data set, then the matrix has n rows and columns (so it is square) and each entry in the matrix is the result of calculating the covariance between two separate dimensions. E.g. the entry on row 2, column 3, is the covariance value calculated between the 2nd dimension and the 3rd dimension.

An example. We’ll make up the covariance matrix for an imaginary 3-dimensional data set, using the usual dimensions x, y and z. Then, the covariance matrix has 3 rows and 3 columns, and the values are this:

C = [ cov(x, x)  cov(x, y)  cov(x, z) ]
    [ cov(y, x)  cov(y, y)  cov(y, z) ]
    [ cov(z, x)  cov(z, y)  cov(z, z) ]

Some points to note: Down the main diagonal, you see that the covariance value is between one of the dimensions and itself. These are the variances for that dimension. The other point is that since cov(a, b) = cov(b, a), the matrix is symmetrical about the main diagonal.

Exercises

Work out the covariance between the x and y dimensions in the following 2-dimensional data set, and describe what the result indicates about the data.

Item Number:  1   2   3   4   5
x:            10  39  19  23  28
y:            43  13  32  21  20

Calculate the covariance matrix for this 3-dimensional set of data.

Item Number:  1   2   3
x:            1   −1  4
y:            2   1   3
z:            1   3   −1

2.2 Matrix Algebra

This section serves to provide a background for the matrix algebra required in PCA. Specifically I will be looking at eigenvectors and eigenvalues of a given matrix. Again, I assume a basic knowledge of matrices.

Page 226: Non-Parametric Methods in Machine Learning

[ 2 3 ] × [ 1 ]  =  [ 11 ]
[ 2 1 ]   [ 3 ]     [ 5 ]

[ 2 3 ] × [ 3 ]  =  [ 12 ]  =  4 × [ 3 ]
[ 2 1 ]   [ 2 ]     [ 8 ]          [ 2 ]

Figure 2.2: Example of one non-eigenvector and one eigenvector

2 × [ 3 ]  =  [ 6 ]
    [ 2 ]     [ 4 ]

[ 2 3 ] × [ 6 ]  =  [ 24 ]  =  4 × [ 6 ]
[ 2 1 ]   [ 4 ]     [ 16 ]         [ 4 ]

Figure 2.3: Example of how a scaled eigenvector is still an eigenvector

2.2.1 Eigenvectors

As you know, you can multiply two matrices together, provided they are compatible sizes. Eigenvectors are a special case of this. Consider the two multiplications between a matrix and a vector in Figure 2.2.

In the first example, the resulting vector is not an integer multiple of the original vector, whereas in the second example, the result is exactly 4 times the vector we began with. Why is this? Well, the vector is a vector in 2-dimensional space. The vector (3, 2) (from the second example multiplication) represents an arrow pointing from the origin, (0, 0), to the point (3, 2). The other matrix, the square one, can be thought of as a transformation matrix. If you multiply this matrix on the left of a vector, the answer is another vector that is transformed from its original position.

It is the nature of the transformation that the eigenvectors arise from. Imagine a transformation matrix that, when multiplied on the left, reflected vectors in the line y = x. Then you can see that if there were a vector that lay on the line y = x, its reflection is itself. This vector (and all multiples of it, because it wouldn’t matter how long the vector was) would be an eigenvector of that transformation matrix.

What properties do these eigenvectors have? You should first know that eigenvectors can only be found for square matrices. And, not every square matrix has eigenvectors. And, given an n × n matrix that does have eigenvectors, there are n of them. Given a 3 × 3 matrix, there are 3 eigenvectors.

Another property of eigenvectors is that even if I scale the vector by some amount before I multiply it, I still get the same multiple of it as a result, as in Figure 2.3. This is because if you scale a vector by some amount, all you are doing is making it longer,

Page 227: Non-Parametric Methods in Machine Learning

not changingit’s direction. Lastly, all theeigenvectorsof a matrix areperpendicular,ie. at right anglesto eachother, nomatterhow many dimensionsyouhave. By theway,anotherwordfor perpendicular, in mathstalk, is orthogonal. This is importantbecauseit meansthat you canexpressthe datain termsof theseperpendiculareigenvectors,insteadof expressingthemin termsof the < and= axes.We will bedoingthis laterinthesectionon PCA.

Another important thing to know is that when mathematicians find eigenvectors, they like to find the eigenvectors whose length is exactly one. This is because, as you know, the length of a vector doesn’t affect whether it’s an eigenvector or not, whereas the direction does. So, in order to keep eigenvectors standard, whenever we find an eigenvector we usually scale it to make it have a length of 1, so that all eigenvectors have the same length. Here’s a demonstration from our example above.

(3, 2) is an eigenvector, and the length of that vector is

sqrt(3² + 2²) = sqrt(13)

so we divide the original vector by this much to make it have a length of 1:

(3, 2) / sqrt(13) = (3/sqrt(13), 2/sqrt(13))

How does one go about finding these mystical eigenvectors? Unfortunately, it’s only easy(ish) if you have a rather small matrix, like no bigger than about 3 × 3. After that, the usual way to find the eigenvectors is by some complicated iterative method which is beyond the scope of this tutorial (and this author). If you ever need to find the eigenvectors of a matrix in a program, just find a maths library that does it all for you. A useful maths package, called newmat, is available at http://webnz.com/robert/ .

Further information about eigenvectors in general, how to find them, and orthogonality, can be found in the textbook “Elementary Linear Algebra 5e” by Howard Anton, Publisher John Wiley & Sons Inc, ISBN 0-471-85223-6.
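For example, in numpy (a maths library of this kind), using the 2 × 2 transformation matrix of Figure 2.2 as reconstructed above:

import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])
values, vectors = np.linalg.eig(A)
print(values)    # the eigenvalues, e.g. 4 and -1
print(vectors)   # unit-length eigenvectors, one per column
# Check the defining property A v = lambda v for the first pair:
v = vectors[:, 0]
print(A @ v, values[0] * v)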

2.2.2 Eigenvalues

Eigenvalues are closely related to eigenvectors; in fact, we saw an eigenvalue in Figure 2.2. Notice how, in both those examples, the amount by which the original vector was scaled after multiplication by the square matrix was the same? In that example, the value was 4. 4 is the eigenvalue associated with that eigenvector. No matter what multiple of the eigenvector we took before we multiplied it by the square matrix, we would always get 4 times the scaled vector as our result (as in Figure 2.3).

So you can see that eigenvectors and eigenvalues always come in pairs. When you get a fancy programming library to calculate your eigenvectors for you, you usually get the eigenvalues as well.

Page 228: Non-Parametric Methods in Machine Learning

Exercises

For the following square matrix:

(a 3 × 3 matrix — its entries did not survive the extraction)

decide which, if any, of the following vectors are eigenvectors of that matrix and give the corresponding eigenvalue.

(candidate vectors — not recoverable from the extraction)

Page 229: Non-Parametric Methods in Machine Learning

Chapter 3

Principal Components Analysis

Finally we come to Principal Components Analysis (PCA). What is it? It is a way of identifying patterns in data, and expressing the data in such a way as to highlight their similarities and differences. Since patterns in data can be hard to find in data of high dimension, where the luxury of graphical representation is not available, PCA is a powerful tool for analysing data.

The other main advantage of PCA is that once you have found these patterns in the data, you can compress the data, i.e. reduce the number of dimensions, without much loss of information. This technique is used in image compression, as we will see in a later section.

This chapter will take you through the steps you need to perform a Principal Components Analysis on a set of data. I am not going to describe exactly why the technique works, but I will try to provide an explanation of what is happening at each point so that you can make informed decisions when you try to use this technique yourself.

3.1 Method

Step 1: Get some data

In my simple example, I am going to use my own made-up data set. It’s only got 2 dimensions, and the reason why I have chosen this is so that I can provide plots of the data to show what the PCA analysis is doing at each step.

The data I have used is found in Figure 3.1, along with a plot of that data.

Step 2: Subtract the mean

For PCA to work properly, you have to subtract the mean from each of the data dimensions. The mean subtracted is the average across each dimension. So, all the x values have x̄ (the mean of the x values of all the data points) subtracted, and all the y values have ȳ subtracted from them. This produces a data set whose mean is zero.

Page 230: Non-Parametric Methods in Machine Learning

Data =

x     y
2.5   2.4
0.5   0.7
2.2   2.9
1.9   2.2
3.1   3.0
2.3   2.7
2     1.6
1     1.1
1.5   1.6
1.1   0.9

DataAdjust =

x      y
.69    .49
-1.31  -1.21
.39    .99
.09    .29
1.29   1.09
.49    .79
.19    -.31
-.81   -.81
-.31   -.31
-.71   -1.01

Figure 3.1: PCA example data, original data on the left, data with the means subtracted on the right, and a plot of the data ("./PCAdata.dat")

Page 231: Non-Parametric Methods in Machine Learning

Step 3: Calculate the covariance matrix

This is done in exactly the same way as was discussed in section 2.1.4. Since the data is 2-dimensional, the covariance matrix will be 2 × 2. There are no surprises here, so I will just give you the result:

cov = [ .616555556   .615444444 ]
      [ .615444444   .716555556 ]

So, since the non-diagonal elements in this covariance matrix are positive, we should expect that both the x and y variables increase together.

Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix

Since the covariance matrix is square, we can calculate the eigenvectors and eigenvalues for this matrix. These are rather important, as they tell us useful information about our data. I will show you why soon. In the meantime, here are the eigenvectors and eigenvalues:

eigenvalues = [ .0490833989 ]
              [ 1.28402771  ]

eigenvectors = [ -.735178656   -.677873399 ]
               [  .677873399   -.735178656 ]

It is important to notice that these eigenvectors are both unit eigenvectors, i.e. their lengths are both 1. This is very important for PCA, but luckily, most maths packages, when asked for eigenvectors, will give you unit eigenvectors.

So what do they mean? If you look at the plot of the data in Figure 3.2 then you can see how the data has quite a strong pattern. As expected from the covariance matrix, the two variables do indeed increase together. On top of the data I have plotted both the eigenvectors as well. They appear as diagonal dotted lines on the plot. As stated in the eigenvector section, they are perpendicular to each other. But, more importantly, they provide us with information about the patterns in the data. See how one of the eigenvectors goes through the middle of the points, like drawing a line of best fit? That eigenvector is showing us how these two data sets are related along that line. The second eigenvector gives us the other, less important, pattern in the data: that all the points follow the main line, but are off to the side of the main line by some amount.

So, by this process of taking the eigenvectors of the covariance matrix, we have been able to extract lines that characterise the data. The rest of the steps involve transforming the data so that it is expressed in terms of these lines.

Step 5: Choosing components and forming a feature vector

Here is where the notion of data compression and reduced dimensionality comes into it. If you look at the eigenvectors and eigenvalues from the previous section, you

Page 232: Non-Parametric Methods in Machine Learning

Figure 3.2: A plot of the normalised data (mean subtracted) with the eigenvectors of the covariance matrix overlayed on top.

Page 233: Non-Parametric Methods in Machine Learning

will notice that the eigenvalues are quite different values. In fact, it turns out that the eigenvector with the highest eigenvalue is the principal component of the data set. In our example, the eigenvector with the largest eigenvalue was the one that pointed down the middle of the data. It is the most significant relationship between the data dimensions.

In general, once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives you the components in order of significance. Now, if you like, you can decide to ignore the components of lesser significance. You do lose some information, but if the eigenvalues are small, you don’t lose much. If you leave out some components, the final data set will have fewer dimensions than the original. To be precise, if you originally have n dimensions in your data, and so you calculate n eigenvectors and eigenvalues, and then you choose only the first p eigenvectors, then the final data set has only p dimensions.

What needs to be done now is that you need to form a feature vector, which is just a fancy name for a matrix of vectors. This is constructed by taking the eigenvectors that you want to keep from the list of eigenvectors, and forming a matrix with these eigenvectors in the columns:

FeatureVector = ( eig_1  eig_2  eig_3  ...  eig_n )

Given our example set of data, and the fact that we have 2 eigenvectors, we have two choices. We can either form a feature vector with both of the eigenvectors:

[ -.677873399   -.735178656 ]
[ -.735178656    .677873399 ]

or, we can choose to leave out the smaller, less significant component and only have a single column:

[ -.677873399 ]
[ -.735178656 ]

We shall see the result of each of these in the next section.

Step 6: Deriving the new data set

This is the final step in PCA, and is also the easiest. Once we have chosen the components (eigenvectors) that we wish to keep in our data and formed a feature vector, we simply take the transpose of the vector and multiply it on the left of the original data set, transposed:

FinalData = RowFeatureVector × RowDataAdjust,

where RowFeatureVector is the matrix with the eigenvectors in the columns transposed, so that the eigenvectors are now in the rows, with the most significant eigenvector at the top, and RowDataAdjust is the mean-adjusted data transposed, i.e. the data items are in each column, with each row holding a separate dimension. I’m sorry if this sudden transpose of all our data confuses you, but the equations from here on are

Page 234: Non-Parametric Methods in Machine Learning

easier if we take the transpose of the feature vector and the data first, rather than having a little T symbol above their names from now on.

FinalData is the final data set, with data items in columns, and dimensions along rows.

What will this give us? It will give us the original data solely in terms of the vectors we chose. Our original data set had two axes, x and y, so our data was in terms of them. It is possible to express data in terms of any two axes that you like. If these axes are perpendicular, then the expression is the most efficient. This was why it was important that eigenvectors are always perpendicular to each other. We have changed our data from being in terms of the axes x and y, and now they are in terms of our 2 eigenvectors. In the case where the new data set has reduced dimensionality, i.e. we have left some of the eigenvectors out, the new data is only in terms of the vectors that we decided to keep.

To show this on our data, I have done the final transformation with each of the possible feature vectors. I have taken the transpose of the result in each case to bring the data back to the nice table-like format. I have also plotted the final points to show how they relate to the components.

In the case of keeping both eigenvectors for the transformation, we get the data and the plot found in Figure 3.3. This plot is basically the original data, rotated so that the eigenvectors are the axes. This is understandable since we have lost no information in this decomposition.

The other transformation we can make is by taking only the eigenvector with the largest eigenvalue. The table of data resulting from that is found in Figure 3.4. As expected, it only has a single dimension. If you compare this data set with the one resulting from using both eigenvectors, you will notice that this data set is exactly the first column of the other. So, if you were to plot this data, it would be 1-dimensional, and would be points on a line in exactly the x positions of the points in the plot in Figure 3.3. We have effectively thrown away the whole other axis, which is the other eigenvector.

So what have we done here? Basically we have transformed our data so that it is expressed in terms of the patterns between them, where the patterns are the lines that most closely describe the relationships between the data. This is helpful because we have now classified our data point as a combination of the contributions from each of those lines. Initially we had the simple x and y axes. This is fine, but the x and y values of each data point don’t really tell us exactly how that point relates to the rest of the data. Now, the values of the data points tell us exactly where (i.e. above/below) the trend lines the data point sits. In the case of the transformation using both eigenvectors, we have simply altered the data so that it is in terms of those eigenvectors instead of the usual axes. But the single-eigenvector decomposition has removed the contribution due to the smaller eigenvector and left us with data that is only in terms of the other.
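For reference, here is a minimal numpy sketch of the whole procedure applied to the example data set of this chapter; it should reproduce the covariance matrix, the eigenvalues, and the transformed data quoted above, up to the sign of the eigenvectors:

import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

mean = data.mean(axis=0)
data_adjust = data - mean                       # Step 2: subtract the mean
cov = np.cov(data_adjust, rowvar=False)         # Step 3: ~[[.6166, .6154], [.6154, .7166]]
eig_values, eig_vectors = np.linalg.eigh(cov)   # Step 4: eigenvalues ~0.0491 and ~1.2840

order = np.argsort(eig_values)[::-1]            # Step 5: sort by eigenvalue, largest first
feature_vector = eig_vectors[:, order]          # keep both columns, or only the first for 1-D data

final_data = (feature_vector.T @ data_adjust.T).T   # Step 6: FinalData, one row per data item
print(final_data[:, 0])                         # the single-eigenvector data of Figure 3.4 (up to sign)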

3.1.1 Getting the old data back

Wanting to get the original data back is obviously of great concern if you are using the PCA transform for data compression (an example of which we will see in the next section). This content is taken from http://www.vision.auc.dk/sig/Teaching/Flerdim/Current/hotelling/hotelling.html

Page 235: Non-Parametric Methods in Machine Learning

TransformedData =

x              y
-.827970186    -.175115307
1.77758033     .142857227
-.992197494    .384374989
-.274210416    .130417207
-1.67580142    -.209498461
-.912949103    .175282444
.0991094375    -.349824698
1.14457216     .0464172582
.438046137     .0177646297
1.22382056     -.162675287

Figure 3.3: The table of data obtained by applying the PCA analysis using both eigenvectors, and a plot of the new data points.

Page 236: Non-Parametric Methods in Machine Learning

TransformedData (single eigenvector)

x
-.827970186
1.77758033
-.992197494
-.274210416
-1.67580142
-.912949103
.0991094375
1.14457216
.438046137
1.22382056

Figure 3.4: The data after transforming using only the most significant eigenvector

So, how do we get the original data back? Before we do that, remember that only if we took all the eigenvectors in our transformation will we get exactly the original data back. If we have reduced the number of eigenvectors in the final transformation, then the retrieved data has lost some information.

Recall that the final transform is this:

FinalData = RowFeatureVector × RowDataAdjust,

which can be turned around so that, to get the original data back,

RowDataAdjust = RowFeatureVector⁻¹ × FinalData

where RowFeatureVector⁻¹ is the inverse of RowFeatureVector. However, when we take all the eigenvectors in our feature vector, it turns out that the inverse of our feature vector is actually equal to the transpose of our feature vector. This is only true because the elements of the matrix are all the unit eigenvectors of our data set. This makes the return trip to our data easier, because the equation becomes

RowDataAdjust = RowFeatureVectorᵀ × FinalData

But, to get the actual original data back, we need to add on the mean of that original data (remember we subtracted it right at the start). So, for completeness,

RowOriginalData = (RowFeatureVectorᵀ × FinalData) + OriginalMean

This formula also applies when you do not have all the eigenvectors in the feature vector. So even when you leave out some eigenvectors, the above equation still makes the correct transform.
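Continuing the sketch given earlier (and reusing its variable names, which are only illustrative), the lossy reconstruction from a single eigenvector looks like this:

row_feature = feature_vector[:, :1].T           # keep only the most significant eigenvector
reduced = row_feature @ data_adjust.T           # 1-D FinalData, one value per data item

restored = (row_feature.T @ reduced).T + mean   # RowFeatureVector^T * FinalData, plus the mean
print(restored)                                 # points lying exactly on the principal trend line (Figure 3.5)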

I will not perform the data re-creation using the complete feature vector, because the result is exactly the data we started with. However, I will do it with the reduced feature vector to show you how information has been lost. Figure 3.5 shows this plot. Compare

Page 237: Non-Parametric Methods in Machine Learning

Figure 3.5: The reconstruction from the data that was derived using only a single eigenvector ("./lossyplusmean.dat")

it to the original data plot in Figure 3.1 and you will notice how, while the variation along the principal eigenvector (see Figure 3.2 for the eigenvectors overlayed on top of the mean-adjusted data) has been kept, the variation along the other component (the other eigenvector that we left out) has gone.

Exercises

• What do the eigenvectors of the covariance matrix give us?

• At what point in the PCA process can we decide to compress the data? What effect does this have?

• For an example of PCA and a graphical representation of the principal eigenvectors, research the topic ‘Eigenfaces’, which uses PCA to do facial recognition.

Page 238: Non-Parametric Methods in Machine Learning

Chapter 4

Application to Computer Vision

This chapter will outline the way that PCA is used in computer vision, first showing how images are usually represented, and then showing what PCA can allow us to do with those images. The information in this section regarding facial recognition comes from “Face Recognition: Eigenface, Elastic Matching, and Neural Nets”, Jun Zhang et al., Proceedings of the IEEE, Vol. 85, No. 9, September 1997. The representation information is taken from “Digital Image Processing”, Rafael C. Gonzalez and Paul Wintz, Addison-Wesley Publishing Company, 1987, which is also an excellent reference for further information on the K-L transform in general. The image compression information is taken from http://www.vision.auc.dk/sig/Teaching/Flerdim/Current/hotelling/hotelling.html, which also provides examples of image reconstruction using a varying number of eigenvectors.

4.1 Representation

When using these sorts of matrix techniques in computer vision, we must consider the representation of images. A square, N by N image can be expressed as an N²-dimensional vector

X = ( x_1  x_2  x_3  ...  x_{N²} )

where the rows of pixels in the image are placed one after the other to form a one-dimensional image. E.g. the first N elements (x_1 to x_N) will be the first row of the image, the next N elements are the next row, and so on. The values in the vector are the intensity values of the image, possibly a single greyscale value.

4.2 PCA to find patterns

Say we have 20 images. Each image is N pixels high by N pixels wide. For each image we can create an image vector as described in the representation section. We can then put all the images together in one big image-matrix like this:

Page 239: Non-Parametric Methods in Machine Learning

ImagesMatrix = [ ImageVec 1  ]
               [ ImageVec 2  ]
               [     ...     ]
               [ ImageVec 20 ]

which gives us a starting point for our PCA analysis. Once we have performed PCA, we have our original data in terms of the eigenvectors we found from the covariance matrix. Why is this useful? Say we want to do facial recognition, and so our original images were of people’s faces. Then, the problem is, given a new image, whose face from the original set is it? (Note that the new image is not one of the 20 we started with.) The way this is done in computer vision is to measure the difference between the new image and the original images, but not along the original axes — along the new axes derived from the PCA analysis.

It turns out that these axes work much better for recognising faces, because the PCA analysis has given us the original images in terms of the differences and similarities between them. The PCA analysis has identified the statistical patterns in the data.

Since all the vectors are N²-dimensional, we will get N² eigenvectors. In practice, we are able to leave out some of the less significant eigenvectors, and the recognition still performs well.

4.3 PCA for image compression

Using PCA for image compression is also known as the Hotelling, or Karhunen-Loève (KL), transform. If we have 20 images, each with N² pixels, we can form N² vectors, each with 20 dimensions. Each vector consists of all the intensity values from the same pixel from each picture. This is different from the previous example because before we had a vector for each image, and each item in that vector was a different pixel, whereas now we have a vector for each pixel, and each item in the vector is from a different image.

Now we perform the PCA on this set of data. We will get 20 eigenvectors because each vector is 20-dimensional. To compress the data, we can then choose to transform the data using only, say, 15 of the eigenvectors. This gives us a final data set with only 15 dimensions, which has saved us 1/4 of the space. However, when the original data is reproduced, the images have lost some of the information. This compression technique is said to be lossy because the decompressed image is not exactly the same as the original, generally worse.

Page 240: Non-Parametric Methods in Machine Learning

Appendix A

Implementation Code

This is code for use in Scilab, a freeware alternative to Matlab. I used this code to generate all the examples in the text. Apart from the first macro, all the rest were written by me.

// This macro taken from
// http://www.cs.montana.edu/~harkin/courses/cs530/scilab/macros/cov.sci
// No alterations made

// Return the covariance matrix of the data in x, where each column of x
// is one dimension of an n-dimensional data set. That is, x has x columns
// and m rows, and each row is one sample.
//
// For example, if x is three dimensional and there are 4 samples.
// x = [1 2 3;4 5 6;7 8 9;10 11 12]
// c = cov (x)

function [c]=cov (x)
  // Get the size of the array
  sizex=size(x);
  // Get the mean of each column
  meanx = mean (x, "r");
  // For each pair of variables, x1, x2, calculate
  // sum ((x1 - meanx1)(x2-meanx2))/(m-1)
  for var = 1:sizex(2),
    x1 = x(:,var);
    mx1 = meanx (var);
    for ct = var:sizex (2),
      x2 = x(:,ct);
      mx2 = meanx (ct);
      v = ((x1 - mx1)' * (x2 - mx2))/(sizex(1) - 1);


Page 241: Non-Parametric Methods in Machine Learning

      cv(var,ct) = v;
      cv(ct,var) = v;
      // do the lower part of c also.
    end,
  end,
  c=cv;

// This is a simple wrapper function to get just the eigenvectors
// since the system call returns 3 matrices
function [x]=justeigs (x)
  // This just returns the eigenvectors of the matrix
  [a, eig, b] = bdiag(x);
  x = eig;

// This function makes the transformation to the eigenspace for PCA
// parameters:
//   adjusteddata = mean-adjusted data set
//   eigenvectors = SORTED eigenvectors (by eigenvalue)
//   dimensions   = how many eigenvectors you wish to keep
//
// The first two parameters can come from the result of calling
// PCAprepare on your data.
// The last is up to you.
function [finaldata] = PCAtransform(adjusteddata,eigenvectors,dimensions)
  finaleigs = eigenvectors(:,1:dimensions);
  prefinaldata = finaleigs'*adjusteddata';
  finaldata = prefinaldata';

// This function does the preparation for PCA analysis
// It adjusts the data to subtract the mean, finds the covariance matrix,
// and finds normal eigenvectors of that covariance matrix.
// It returns 4 matrices
//   meanadjusted = the mean-adjusted data set
//   covmat       = the covariance matrix of the data
//   eigvalues    = the eigenvalues of the covariance matrix, IN SORTED ORDER
//   normaleigs   = the normalised eigenvectors of the covariance matrix,
//                  IN SORTED ORDER WITH RESPECT TO THEIR EIGENVALUES,
//                  for selection for the feature vector.


Page 242: Non-Parametric Methods in Machine Learning

//
// NOTE: This function cannot handle data sets that have any eigenvalues
// equal to zero. It's got something to do with the way that scilab treats
// the empty matrix and zeros.
//
function [meanadjusted,covmat,sorteigvalues,sortnormaleigs] = PCAprepare (data)
  // Calculates the mean adjusted matrix, only for 2 dimensional data
  means = mean(data,"r");
  meanadjusted = meanadjust(data);
  covmat = cov(meanadjusted);
  eigvalues = spec(covmat);
  normaleigs = justeigs(covmat);
  sorteigvalues = sorteigvectors(eigvalues',eigvalues');
  sortnormaleigs = sorteigvectors(eigvalues',normaleigs);

// This removes a specified column from a matrix
//   A = the matrix
//   n = the column number you wish to remove
function [columnremoved] = removecolumn(A,n)
  inputsize = size(A);
  numcols = inputsize(2);
  temp = A(:,1:(n-1));
  for var = 1:(numcols - n)
    temp(:,(n+var)-1) = A(:,(n+var));
  end,
  columnremoved = temp;

// This finds the column number that has the
// highest value in its first row.
function [column] = highestvalcolumn(A)
  inputsize = size(A);
  numcols = inputsize(2);
  maxval = A(1,1);
  maxcol = 1;
  for var = 2:numcols
    if A(1,var) > maxval
      maxval = A(1,var);
      maxcol = var;
    end,
  end,
  column = maxcol


Page 243: Non-Parametric Methods in Machine Learning

// This sorts a matrix of vectors, based on the values of
// another matrix
//
//   values  = the list of eigenvalues (1 per column)
//   vectors = the list of eigenvectors (1 per column)
//
// NOTE: The values should correspond to the vectors
// so that the value in column x corresponds to the vector
// in column x.
function [sortedvecs] = sorteigvectors(values,vectors)
  inputsize = size(values);
  numcols = inputsize(2);
  highcol = highestvalcolumn(values);
  sorted = vectors(:,highcol);
  remainvec = removecolumn(vectors,highcol);
  remainval = removecolumn(values,highcol);
  for var = 2:numcols
    highcol = highestvalcolumn(remainval);
    sorted(:,var) = remainvec(:,highcol);
    remainvec = removecolumn(remainvec,highcol);
    remainval = removecolumn(remainval,highcol);
  end,
  sortedvecs = sorted;

// This takes a set of data, and subtracts
// the column mean from each column.
function [meanadjusted] = meanadjust(Data)
  inputsize = size(Data);
  numcols = inputsize(2);
  means = mean(Data,"r");
  tmpmeanadjusted = Data(:,1) - means(:,1);
  for var = 2:numcols
    tmpmeanadjusted(:,var) = Data(:,var) - means(:,var);
  end,
  meanadjusted = tmpmeanadjusted


Page 244: Non-Parametric Methods in Machine Learning

Dimensionality Reduction:

PCA & LDA

Faustino Gomez

IDSIA

• Data can be (very) high-dimensional

Page 245: Non-Parametric Methods in Machine Learning

Why is high-dimensionality a problem?

• Computationally intractable

• Intrinsic dimensionality may be lower

• Redundant/irrelevant information

• Visualization and comprehensibility

Solution: extract “features” from high-dimensional

data; project data onto lower dimensional feature-space

What is Feature Extraction?

Instance x described by an n-dimensional feature vector (x_1, x_2, ..., x_n)

Feature Extraction: f(x_1, x_2, ..., x_n)

Extracted m-dimensional feature vector y = (y_1, y_2, ..., y_m), with m < n

Page 246: Non-Parametric Methods in Machine Learning

Dimensionality Reduction

Methods

• Unsupervised
  – Preserve statistical/structural properties (e.g. variance)
  – No class information

• Supervised
  – Uses class information
  – Maximize separability

Types of “learning”

Dimensionality Reduction Methods

• Principal Component Analysis (PCA)
  – linear, unsupervised, maximizes preserved variance

• Linear Discriminant Analysis (LDA)
  – linear, supervised, maximizes class separability

• Neural networks
  – non-linear, both supervised and unsupervised are possible

• Other methods
  – Independent Component Analysis, Self Organising Maps, Principal curves, Sammon’s mapping

Page 247: Non-Parametric Methods in Machine Learning

Finding structure in data

Principal Component Analysis

• Basic idea: find a new set of axes (basis) that concentrates the most variance in the fewest components (new axes).

• Project points onto just the “principal” components = fewer dimensions!

Page 248: Non-Parametric Methods in Machine Learning

Find a new basis that accounts for most of the variance

Overview of PCA Algorithm

1. Normalize data
2. Compute covariance matrix
3. Compute eigenvectors of covariance matrix
4. Eigenvectors are the components that define the new basis
5. Eigenvalues indicate the importance of each component
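A compact sketch of these five steps in numpy (the function name pca and its arguments are just for illustration; X is an (N, d) array of data points, K the number of components to keep):

import numpy as np

def pca(X, K):
    mu = X.mean(axis=0)
    Xc = X - mu                                   # 1. normalize (centre) the data
    C = np.cov(Xc, rowvar=False)                  # 2. d x d covariance matrix
    eig_values, eig_vectors = np.linalg.eigh(C)   # 3. eigenvectors of the covariance matrix
    order = np.argsort(eig_values)[::-1][:K]      # 4./5. keep the K components with largest eigenvalues
    W = eig_vectors[:, order]                     # the new basis, one component per column
    return Xc @ W, W, eig_values[order]           # projected data, components, their eigenvalues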

Page 249: Non-Parametric Methods in Machine Learning

Mathematical background

Let X be a set of N d-dimensional vectors, with i = 1..d indexing dimensions and j = 1..N indexing data points.

Mean of dimension i:

μ_i = ( Σ_{j=1}^{N} x_{ij} ) / N

Variance of dimension i:

s_i² = ( Σ_{j=1}^{N} (x_{ij} − μ_i)(x_{ij} − μ_i) ) / (N − 1)

Mathematical background

Variance:

var(X_i) = ( Σ_{j=1}^{N} (x_{ij} − μ_i)(x_{ij} − μ_i) ) / (N − 1)

Covariance:

cov(X_i, X_k) = ( Σ_{j=1}^{N} (x_{ij} − μ_i)(x_{kj} − μ_k) ) / (N − 1)

Page 250: Non-Parametric Methods in Machine Learning

Mathematical background

• Covariance is a pair-wise measure

• Given d dimensions, we have d × d covariances

Covariance matrix for d = 3:

[ cov(X1,X1)  cov(X1,X2)  cov(X1,X3) ]
[ cov(X2,X1)  cov(X2,X2)  cov(X2,X3) ]
[ cov(X3,X1)  cov(X3,X2)  cov(X3,X3) ]

Mathematical background

• A square d × d matrix will have d orthogonal eigenvectors

• The normalized eigenvectors are the “components” used in PCA

Page 251: Non-Parametric Methods in Machine Learning

PCA algorithm

Find a new basis for the data set.

Representation using the standard basis:

x = x_1·(1, 0, 0)ᵀ + x_2·(0, 1, 0)ᵀ + x_3·(0, 0, 1)ᵀ

Example:

x = (1, 2, 3)ᵀ = 1·(1, 0, 0)ᵀ + 2·(0, 1, 0)ᵀ + 3·(0, 0, 1)ᵀ

PCA algorithm

Find a new basis for the data set.

Representation using a different basis:

x = (1, 2, 3)ᵀ = 1·(0, 0, 1)ᵀ + 1·(0, 1, 1)ᵀ + 1·(1, 1, 1)ᵀ

and another (orthogonal):

x = (1, 2, 3)ᵀ = 1.5·(1, 1, 0)ᵀ + 3·(0, 0, 1)ᵀ − 0.5·(1, −1, 0)ᵀ

Page 252: Non-Parametric Methods in Machine Learning

PCA: original data set

N data points (rows), d dimensions (columns):

x^1 = [ x_1^1  x_2^1  x_3^1  x_4^1  x_5^1  ...  x_d^1 ]
x^2 = [ x_1^2  x_2^2  x_3^2  x_4^2  x_5^2  ...  x_d^2 ]
x^3 = [ x_1^3  x_2^3  x_3^3  x_4^3  x_5^3  ...  x_d^3 ]
  ...
x^N = [ x_1^N  x_2^N  x_3^N  x_4^N  x_5^N  ...  x_d^N ]

PCA

Step 1: normalize the data set

x̂^j = [ x̂_1^j  x̂_2^j  x̂_3^j  ...  x̂_d^j ],   where   x̂_i^j = x_i^j − μ_i   and   μ_i = (1/N) Σ_{j=1}^{N} x_i^j

Page 253: Non-Parametric Methods in Machine Learning

PCA algorithm

Step 2: compute the d-dimensional covariance matrix

cov(X_i, X_k) = ( Σ_{j=1}^{N} (x_{ij} − μ_i)(x_{kj} − μ_k) ) / (N − 1)   for all i, k

[ cov(X1,X1)  cov(X1,X2)  ...  cov(X1,Xd) ]
[ cov(X2,X1)  cov(X2,X2)  ...  cov(X2,Xd) ]
[    ...         ...      ...      ...    ]
[ cov(Xd,X1)  cov(Xd,X2)  ...  cov(Xd,Xd) ]

Step 3: compute the eigenvectors of the covariance matrix

PCA algorithm

Step 4: select the K eigenvectors with the largest eigenvalues. These will be the “principal components” or features, the new basis.

If K = d, then the data is just represented using the new basis.

If K < d, then information is lost, but since we ignore the least significant eigenvectors, we only eliminate the components with the least variance: visualization and compression.

Page 254: Non-Parametric Methods in Machine Learning

PCA compression

Eigenvalue spectrum: plot the eigenvalues λ_i sorted from largest to smallest, i = 1..N, and choose a cutoff K, keeping only the first K components.

Limitations of PCA

• The linear PCA mapping is only optimal for a linear reconstruction

• Possible non-linear relations are not used.

→ The dimensionality of the extracted feature vectors will be higher than the minimal required dimensionality (the intrinsic dimensionality)

Page 255: Non-Parametric Methods in Machine Learning

Linear Discriminant Analysis

Find a projection that:

- Minimizes distances within classes

- Maximizes distances between classes

Main goal is to optimize the extracted

features for the purpose of classification

Linear Discriminant Analysis

- Suppose we have C classes

- Let μ⃗_i be the mean vector of class i

- Let M_i be the number of samples in class i

Page 256: Non-Parametric Methods in Machine Learning

PCA vs LDA

Linear Discriminant Analysis: Scatter Matrices

Distances within classes (within-class scatter matrix), where i now refers to the class and j to the data point within class i:

S_w = Σ_{i=1}^{C} Σ_{j=1}^{M_i} (x_j − μ⃗_i)(x_j − μ⃗_i)ᵀ

Distances between classes (between-class scatter matrix):

S_b = Σ_{i=1}^{C} (μ⃗_i − μ⃗)(μ⃗_i − μ⃗)ᵀ

(μ⃗_i is the mean vector of class i; μ⃗ is the mean vector regardless of class)

Measures for class separability:

trace(S_b) / trace(S_w)   and   trace(S_w⁻¹ S_b)

Page 257: Non-Parametric Methods in Machine Learning

Linear Discriminant Analysis

LDA finds a linear projection onto the subspace spanned by the m largest eigenvectors of S_w⁻¹ S_b.

This is optimal for m + 1 linearly separable classes.

(Figure: two classes A and B in the (x1, x2) plane, comparing the PCA projection direction with the LDA projection directions f1 and f2.)
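As a rough illustration of these two steps (the scatter matrices from the previous slide and the eigenvectors of S_w⁻¹ S_b), here is a minimal numpy sketch; X, y and m are assumed inputs (data matrix, class labels, number of discriminant directions), and S_b is built in the unweighted form given on the previous slide:

import numpy as np

def lda(X, y, m):
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)          # within-class scatter
        diff = (mu_c - overall_mean).reshape(-1, 1)
        Sb += diff @ diff.T                        # between-class scatter
    # eigenvectors of Sw^{-1} Sb (may come back complex for numerical reasons; keep the real part)
    eig_values, eig_vectors = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eig_values.real)[::-1][:m]
    W = eig_vectors[:, order].real
    return X @ W                                   # data projected onto the m discriminant directions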

Page 258: Non-Parametric Methods in Machine Learning

Assignment 2 - Part 1

October 17, 2007

1 Maximum Likelihood (25 points)

Consider a sample of n trials, j of which being a's and k being b's. The number of c's will be n − j − k. As the probabilities of the possible symbols have to sum to 1, the probability of c will be 1 − p − q. As the events are independent, the probability of the sample can be expressed as the product of the probabilities of the single symbols. We can then express the likelihood of the parameters (p, q) given the data (i.e. the probability of the data given the parameters) as

L(x1, x2, ..., xn | p, q) = P(x1 | p, q) P(x2 | p, q) ... P(xn | p, q) = p^j q^k (1 − p − q)^(n − j − k)   (1)

We are asked to estimate the values of p and q according to the Maximum Likelihood criterion. This means that we have to pick the values of p and q that maximize (1); or, as the log function is strictly increasing, the values that maximize the logarithm of (1):

(p, q) = argmax_{(p,q)} L = argmax_{(p,q)} log L   (2)

Let us evaluate it:

log L = j log p + k log q + (n − j − k) log(1 − p − q)   (3)

In order to maximize it we will need to set its gradient to 0, and check which solution is the real maximum in case of multiple solutions.

∂ log L / ∂p = j/p − (n − j − k)/(1 − p − q) = 0   (4)

∂ log L / ∂q = k/q − (n − j − k)/(1 − p − q) = 0   (5)

Equations (4, 5) form a system with a single solution at (p, q) = (j/n, k/n), which is also intuitively correct if we compare it with the ML estimator for the Bernoulli distribution. The same solution can also be obtained in a much more complicated way by differentiating L directly.

Page 259: Non-Parametric Methods in Machine Learning

2 Maximum A Posteriori (25 points)

We have to estimate the value of p = Pr{0} in a Bernoulli distribution, using MAP. This means we have to maximize the posterior probability of p given the data:

p = argmax_p f(p | x1, ..., xn) = argmax_p f(x1, ..., xn | p) f(p) / f(x1, ..., xn)   (6)

  = argmax_p f(x1, ..., xn | p) f(p)   (7)

Note that the first passage is just the Bayes rule; in the second passage we drop the denominator, as it does not depend on p. We now have to evaluate the two remaining terms, and multiply them.

Consider a sample of n trials, k of which being 0's. The number of 1's will be n − k. As the probabilities of the possible symbols have to sum to 1, the probability of 1 will be 1 − p. As the events are independent, the probability of the sample can be expressed as the product of the probabilities of the single symbols. We can then express the probability of the sample given p (i.e. the likelihood of p given the data) as

f(x1, x2, ..., xn | p) = P(x1 | p) P(x2 | p) ... P(xn | p) = p^k (1 − p)^(n − k)   (8)

The second term is the prior over the parameter, which is proposed to be f(p) = 3p^2 for p ∈ [0, 1], and obviously 0 elsewhere (note that p has to be in [0, 1], so any prior we use has to be 0 outside this interval).

Multiplying the two terms gives

f(p | x1, x2, ..., xn) ∝ p^k (1 − p)^(n − k) · 3p^2 = 3 p^(k + 2) (1 − p)^(n − k)   (9)

We have to maximize (9) by setting its derivative equal to 0. Also in this case we first take the logarithm, as it makes the equation easier:

∂ log f(p | x1, ..., xn) / ∂p = ∂[ log 3 + (k + 2) log p + (n − k) log(1 − p) ] / ∂p = 0   (10)

(k + 2)/p − (n − k)/(1 − p) = 0   (11)

(1 − p)(k + 2) − p(n − k) = k + 2 − p(2 + n) = 0   (12)

which results in our MAP estimator being p = (k + 2)/(n + 2).
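A quick numerical sanity check of this result (with an assumed sample of n = 10 trials containing k = 6 zeros):

import numpy as np

n, k = 10, 6                                   # assumed example values, not part of the assignment
p = np.linspace(1e-6, 1 - 1e-6, 100001)
posterior = p**k * (1 - p)**(n - k) * 3 * p**2
print(p[np.argmax(posterior)], (k + 2) / (n + 2))   # both about 0.6667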


Page 260: Non-Parametric Methods in Machine Learning

Artificial Neural Networks

Part I

Faustino Gomez

IDSIA

Intelligent Systems, Fall 2007

What makes a system intelligent?

Page 261: Non-Parametric Methods in Machine Learning

Characteristics of “intelligent” systems

• Autonomy: can the system operate with minimal human intervention?

• Adaptivity: can the system adjust to changes in the environment?

• Uncertainty: can the system cope with incomplete information?

• Anticipation: does the system just react or does it also base its behavior on predictions?

• Generality: how specific is the task?

Ultimately we want to achieve true Artificial Intelligence

Brief History of AI

1950: Turing Test

1956: McCarthy coins term “artificial intelligence”

1957: Rosenblatt invents perceptron

1958: John McCarthy (MIT) invents Lisp language

1963: MIT received a $2.2 million grant from the newly created Advanced Research Projects Agency (later known as DARPA)

- General Problem Solver (GPS; Newell and Simon)
- Physical Symbol System hypothesis

Early years

Page 262: Non-Parametric Methods in Machine Learning

Brief History of AI (cont.)

1969: Shakey robot at

Stanford (natural language,

vision, control)

1972: Prolog language

1979: Expert Systems

invented, later used widely

1966: Eliza

Brief History of AI (cont.)

1974-80: First AI Winter

– Philosophical critique: “What computers can’t do” by Dreyfus, and Searle’s Chinese room

– Bold claims:

1965, H. A. Simon: "machines will be capable, within twenty years, of doing any work a man can do.”

1967, Marvin Minsky: "Within a generation ... the problem of creating 'artificial intelligence' will substantially be solved.”

– But insufficient progress, methods would not scale

– Belief that low-level abilities (perception, common-sense reasoning) would be easy. Wrong!

– Minsky-Papert book (1969) criticizes the Perceptron: NNs banished to a “dark age” (1967-1982)

Page 263: Non-Parametric Methods in Machine Learning

1980s: Paradigm Shift

• Rodney Brooks: “the world is its own best model”, “elephants don’t play chess”

- subsumption architecture

• AI too big: focus on subproblems

• Stuart Wilson: Animats approach

• Importance of Embodiment

• Ever increasing computation power

• Renewed interest in Neural Networks

Summary of AI paradigm shift

Good Old-fashioned AI New AI

Page 264: Non-Parametric Methods in Machine Learning

Neural Networks

• mathematical abstraction of biological nervous systems

• massively parallel distributed processors

• subsymbolic (no explicit symbols maintained)

• universal approximation

• can learn arbitrary mappings from examples

Training Set

!

x1 = x

1

1x2

1x3

1x4

1x5

1L x

n

1, d

1[ ]

x2 = x

1

2x2

2x3

2x4

2x5

2L x

n

2, d

2[ ]

x3 = x

1

3x2

3x3

3x4

3x5

3L x

n

3, d

3[ ]M

xN = x

1

Px2

Px3

Px4

Px5

PLx

n

P, d

P[ ]

!

example " [input pattern, target]

Page 265: Non-Parametric Methods in Machine Learning

Perceptron (Rosenblatt 1957)

f(x) = 1  if  Σ_{i=1}^{n} w_i x_i + b > 0,   else  0

Linear Classification

LDA uses statistics to determine projection plane

Page 266: Non-Parametric Methods in Machine Learning

Representing Lines

• How do we represent a line?

In general a hyperplane is defined by w · x = 0: points on one side (like x1 in the figure) have x · w positive, points on the other side (like x2) have x · w negative.

Page 267: Non-Parametric Methods in Machine Learning

Now classification is easy! But... how do we learn this mysterious model vector?

Perceptron Learning Algorithm

Input: a list of n training examples (x_0, d_0), ..., (x_n, d_n) where ∀i: d_i ∈ {+1, −1}
Output: classifying hyperplane w

Algorithm (η is the learning rate):
  Randomly initialize w;
  While w makes errors on the training set do
    for each (x_i, d_i) do
      let y_i = sign(w · x_i);
      if y_i ≠ d_i then w ← w + η d_i x_i;
    end
  end
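A direct Python sketch of this loop (function and variable names are illustrative); the four points used below are those of the example on the next slide:

import numpy as np

def train_perceptron(examples, eta=1/3, max_epochs=100):
    w = np.random.randn(len(examples[0][0]))      # random initialization
    for _ in range(max_epochs):
        errors = 0
        for x, d in examples:
            y = np.sign(w @ x)
            if y != d:                            # misclassified: move w towards d * x
                w = w + eta * d * x
                errors += 1
        if errors == 0:                           # training set classified correctly
            break
    return w

examples = [(np.array([0.5, 1.0]), 1), (np.array([1.0, 0.5]), 1),
            (np.array([-1.0, 0.5]), -1), (np.array([-1.0, 1.0]), -1)]
print(train_perceptron(examples))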

Page 268: Non-Parametric Methods in Machine Learning

A simple example: 4 linearly separable points

Class 1 (z = 1): (1/2, 1) and (1, 1/2)
Class −1 (z = −1): (−1, 1/2) and (−1, 1)

Initial weights: W(0) = (0, 1)

Page 269: Non-Parametric Methods in Machine Learning

Updating Weights

The upper left point (−1, 1/2) is wrongly classified.

x = (−1, 1/2),  d = −1,  η = 1/3,  w(0) = (0, 1)

w(1) ← w(0) + η d x
w(1) = (0, 1) + 1/3 · (−1) · (−1, 1/2) = (0, 1) + 1/3 · (1, −1/2) = (1/3, 5/6)

First correction: W(1) = (1/3, 5/6)

Page 270: Non-Parametric Methods in Machine Learning

Updating Weights, Ctd

The upper left point is still wrongly classified.

x = (−1, 1/2),  d = −1

w(2) ← w(1) + η d x
w(2) = (1/3, 5/6) + 1/3 · (−1) · (−1, 1/2) = (1/3, 5/6) + 1/3 · (1, −1/2) = (2/3, 2/3)

Second correction: W(2) = (2/3, 2/3)

Page 271: Non-Parametric Methods in Machine Learning

Adaptive Linear Elements (Adaline)

• By Widrow and Hoff (~1960)

  – Adaptive linear elements for signal processing

  – The same architecture as the perceptron but a different learning method: the delta rule, also called the Widrow-Hoff learning rule

  – Learning method: try to reduce the mean squared error (MSE) between the net input and the desired output

Adaline

• Delta rule

  – The squared error:

    E = (d − net)² = (d − Σ_i w_i x_i)²

    • Its value is determined by the weights w_i

  – Modify weights by a gradient descent approach

  – Obtain the partial derivative of the error with respect to each weight:

    ∂E/∂w_i = 2(net − d) · ∂net/∂w_i = 2(net − d) x_i

  – Change weights in the opposite direction of ∂E/∂w_i:

    Δw_i = η (d − Σ_i w_i x_i) x_i = η (d − net) x_i

    ((net − d) x_i increases the error; −(net − d) x_i decreases the error)

Page 272: Non-Parametric Methods in Machine Learning

Adaline Learning

• Delta rule in batch mode

  – Based on the mean squared error over all P samples:

    E = (1/P) Σ_{p=1}^{P} (d^p − net^p)²

  • E is again a function of w = (w_0, w_1, ..., w_n)

  • The gradient of E:

    ∂E/∂w_i = (2/P) Σ_{p=1}^{P} [ (d^p − net^p) · ∂/∂w_i (d^p − net^p) ]
            = −(2/P) Σ_{p=1}^{P} [ (d^p − net^p) · x_i^p ]

  • Therefore

    Δw_i = −η ∂E/∂w_i ∝ Σ_{p=1}^{P} [ (d^p − net^p) · x_i^p ]

Adaline Learning

• Notes:

  – Weights will be changed even if an input is already classified correctly

  – E monotonically decreases until the system reaches a state with (local) minimum E (a small change of any w_i will cause E to increase).

  – At a local minimum the partial derivative ∂E/∂w_i with respect to every weight equals 0, but E is not guaranteed to be zero (net_j ≠ d_j)
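A minimal numpy sketch of the batch-mode delta rule described above (X is an assumed P × n matrix of inputs, with a column of ones if a bias is wanted, and D the vector of P targets):

import numpy as np

def adaline_train(X, D, eta=0.01, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        net = X @ w                                  # linear outputs for all P samples
        grad = -(2.0 / len(X)) * ((D - net) @ X)     # dE/dw for E = (1/P) sum (d - net)^2
        w = w - eta * grad                           # move against the gradient
    return w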

Page 273: Non-Parametric Methods in Machine Learning

• Can a trained perceptron correctly classify patterns

not included in the training samples?

• Depends on the quality of training samples selected.

• Also to some extent depends on the learning rate

and initial weights

• How can we know the learning is ok?

– Reserve a few samples for testing.

• Much more on this later

Generalization


Problem: some functions are not linearly separable!

u1 u2 u1 XOR u2

0 0 0

0 1 1

1 0 1

1 1 0

XOR function

Since XOR (a simple function) could not be separated by a line

the perceptron is very limited in what kind of functions it can learn.

Funding for neural networks dried up for more than a decade after

Minsky and Papert book Perceptrons (1969).

Page 274: Non-Parametric Methods in Machine Learning

Resurgence of NNs

• Layered Networks

• Non-linear neurons

• 1974: Backpropagation (Werbos)

• 1986: developed further (Rumelhart,

Hinton, Williams)

Solving XOR with combined perceptrons

XOR can be composed of 'simpler' logical functions:

A xor B = (A or B) and not (A and B)

The last term simply removes the troublesome value.
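As a small illustration (not part of the original slides), the decomposition above can be wired up with three threshold units whose weights are picked by hand rather than learned:

    def step(z):
        return 1 if z >= 0 else 0

    def or_unit(a, b):  return step(a + b - 0.5)   # fires if a + b >= 0.5
    def and_unit(a, b): return step(a + b - 1.5)   # fires only if both inputs are 1

    def xor_unit(a, b):
        # output unit: (a or b) and not (a and b)
        return step(or_unit(a, b) - and_unit(a, b) - 0.5)

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_unit(a, b))   # reproduces the XOR truth table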

Page 275: Non-Parametric Methods in Machine Learning

Non-Linear Perceptron

f(x) = σ( Σᵢ₌₁ⁿ wᵢxᵢ + b ),   σ(net) = 1 / (1 + e^(−net))

Non-Linear Threshold:

σ(y) = 1 / (1 + e^(−ay))

applied to x·w + b (the weighted sum of inputs)

• can now benefit from combining perceptrons (neurons)

into layered networks

Page 276: Non-Parametric Methods in Machine Learning

Why must hidden units be non-linear?
• A multi-layer net with linear hidden layers is equivalent to a single-layer net:
– Because z1 and z2 are linear units: z1 = x1·v11 + x2·v21, z2 = x1·v12 + x2·v22
– y = z1·w1 + z2·w2 = x1·u1 + x2·u2, where u1 = (v11·w1 + v12·w2), u2 = (v21·w1 + v22·w2)
– Therefore the output y is still a linear combination of x1 and x2.

[Figure: two-input network with linear hidden units z1, z2 (weights v11, v12, v21, v22), output Y (weights w1, w2), threshold = 0.]

Types of decision regions by network structure:

Structure     | Types of decision regions
Single-Layer  | Half plane bounded by a hyperplane
Two-Layer     | Convex open or closed regions
Three-Layer   | Arbitrary (complexity limited by the number of nodes)

[Figure: for each structure, the original slide also illustrates the Exclusive-OR problem, classes with meshed regions (A/B), and the most general region shapes.]

Non linearly separable problems

Neural Networks – An Introduction Dr. Andrew Hunter

Page 277: Non-Parametric Methods in Machine Learning

Artificial Neural NetworksPart II

Faustino GomezIDSIA

Intelligent Systems, Fall 2007

Page 278: Non-Parametric Methods in Machine Learning

Delta Rule (gradient descent) for a linear neuron

• Calculate the squared error between output and target; its value is determined by the weights wᵢ
• Obtain the partial derivative of the error with respect to each weight
• Change the weights in the opposite direction of ∂E/∂wᵢ

E = (d − net)² = (d − Σᵢ wᵢxᵢ)²

∂E/∂wᵢ = 2(net − d)·∂net/∂wᵢ = (net − d)·xᵢ   (we can drop the 2, it doesn't change the direction!)

Δwᵢ = η(d − Σᵢ wᵢxᵢ)·xᵢ = η(d − net)·xᵢ

Moving along (net − d)xᵢ increases the error; moving along −(net − d)xᵢ decreases it.

Page 279: Non-Parametric Methods in Machine Learning

Error surface

[Figure: the error surface E as a function of the weights w1, w2, with the gradient components ∂E/∂w1, ∂E/∂w2 forming the gradient vector ∂E/∂w.]

Page 280: Non-Parametric Methods in Machine Learning

Non-Linear Neuron

f(x) = σ( Σᵢ₌₁ⁿ wᵢxᵢ + b ),   σ(net) = 1 / (1 + e^(−net))

Page 281: Non-Parametric Methods in Machine Learning

[Figure: a "soft" sigmoid decision boundary (outputs near .01 on one side and near .99 on the other) compared with the "hard" 0/1 perceptron decision boundary.]

Page 282: Non-Parametric Methods in Machine Learning

Delta Rule (gradient descent) for a non-linear neuron

• Calculate the squared error between output and target; same as the linear case, but now with the sigmoid function squashing net
• Obtain the partial derivative of the error with respect to each weight

E = (d − σ(net))² = (d − σ(Σᵢ wᵢxᵢ))²,   y = σ(net)

∂E/∂wᵢ = ∂(d − y)²/∂wᵢ = (y − d)·(∂y/∂net)·(∂net/∂wᵢ)

Page 283: Non-Parametric Methods in Machine Learning

Delta Rule (gradient descent) for a non-linear neuron

Recall the linear case:

∂E/∂wᵢ = (net − d)·∂net/∂wᵢ

But now we have to deal with the sigmoid:

∂E/∂wᵢ = (y − d)·(∂y/∂net)·(∂net/∂wᵢ),   where   ∂y/∂net = ∂σ(net)/∂net = σ(net)(1 − σ(net))

Page 284: Non-Parametric Methods in Machine Learning

Sigmoid and its derivative

[Figure: the sigmoid y = σ(net) and its derivative, plotted against net.]

Page 285: Non-Parametric Methods in Machine Learning

Linear vs. non-linear gradient

Linear case:

∂E/∂wᵢ = (net − d)·∂net/∂wᵢ = (net − d)·xᵢ

Non-linear case:

∂E/∂wᵢ = (y − d)·(∂y/∂net)·(∂net/∂wᵢ) = (y − d)·σ(net)(1 − σ(net))·xᵢ
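A short sketch of the two gradient formulas in code (assuming the squared error E = (y − d)² with the constant factor dropped, as in the slides; the example weights and inputs are made up):

    import numpy as np

    def sigmoid(net):
        return 1.0 / (1.0 + np.exp(-net))

    def grad_linear(w, x, d):
        net = np.dot(w, x)
        return (net - d) * x                    # dE/dw_i = (net - d) x_i

    def grad_sigmoid(w, x, d):
        net = np.dot(w, x)
        y = sigmoid(net)
        return (y - d) * y * (1.0 - y) * x      # dE/dw_i = (y - d) sigma'(net) x_i

    w = np.array([0.2, -0.1])
    x = np.array([1.0, 0.5])
    print(grad_linear(w, x, d=1.0), grad_sigmoid(w, x, d=1.0))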

Page 286: Non-Parametric Methods in Machine Learning

Multi-layer Perceptron (MLP)

Page 287: Non-Parametric Methods in Machine Learning

Backpropagation algorithm

Two steps:
1. Forward Pass: present a training input pattern to the network and activate the network to produce an output (can also be done in batch: present all patterns in succession)
2. Backward Pass: calculate the error gradient and update the weights, starting at the output layer and then going back

Page 288: Non-Parametric Methods in Machine Learning

Forward Pass

• Calculate the activation of each hidden node and store them
• Then calculate the activation of each output node

yⱼ = σ( Σᵢ₌₁ⁿ wᵢⱼ xᵢ + b )

yₖ = σ( Σⱼ₌₁ⁿ wⱼₖ yⱼ + b )

Page 289: Non-Parametric Methods in Machine Learning

Backward Pass for output node

General update (all nodes): Δw = η·yᵢ·δⱼ

When k is an output node:

δₖ = (dₖ − yₖ)·∂yₖ/∂netₖ

Δwⱼₖ = η·yⱼ·δₖ = η·yⱼ·(dₖ − yₖ)·∂yₖ/∂netₖ

Same as the non-linear perceptron! δ is the error term.

Page 290: Non-Parametric Methods in Machine Learning

Updating output layer weight

wⱼ₁ₖ₁ only affects the output yₖ₁:

Δwⱼ₁ₖ₁ = η·yⱼ₁·(d₁ − yₖ₁)·∂yₖ₁/∂netₖ₁

Page 291: Non-Parametric Methods in Machine Learning

Backward Pass for hidden node

General update (all nodes): Δw = η·yᵢ·δⱼ

When j is a hidden node:

δⱼ = ( Σₖ δₖ·wⱼₖ )·∂yⱼ/∂netⱼ

Δwᵢⱼ = η·yᵢ·δⱼ = η·yᵢ·( Σₖ δₖ·wⱼₖ )·∂yⱼ/∂netⱼ

If the network has one hidden layer, yᵢ is an input xᵢ.

Page 292: Non-Parametric Methods in Machine Learning

Updating hidden layer weight

Weight in input layer affects all outputs!

Page 293: Non-Parametric Methods in Machine Learning

Updating hidden layer weight

Δwᵢ₁ⱼ₁ = η·x₁·δⱼ₁ = η·x₁·( Σₖ δₖ·wⱼ₁ₖ )·∂yⱼ₁/∂netⱼ₁
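Putting the forward and backward passes together, here is a minimal sketch of one backpropagation step for a single-hidden-layer network with sigmoid units (biases are omitted for brevity, and the sizes, learning rate, and data are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, d, V, w, eta=0.5):
        """V: input->hidden weights (h x n), w: hidden->output weights (h,)."""
        # forward pass
        y_hidden = sigmoid(V @ x)                  # hidden activations y_j
        y_out = sigmoid(w @ y_hidden)              # output y_k
        # backward pass
        delta_out = (d - y_out) * y_out * (1 - y_out)               # delta_k
        delta_hidden = (delta_out * w) * y_hidden * (1 - y_hidden)  # delta_j
        # weight updates
        w += eta * delta_out * y_hidden            # Delta w_jk = eta * y_j * delta_k
        V += eta * np.outer(delta_hidden, x)       # Delta w_ij = eta * x_i * delta_j
        return y_out

    # Usage: a few updates on one pattern; y_out should drift toward the target d = 1.0
    rng = np.random.default_rng(1)
    V = rng.normal(scale=0.5, size=(3, 2)); w = rng.normal(scale=0.5, size=3)
    for _ in range(5):
        print(backprop_step(np.array([1.0, 0.0]), 1.0, V, w))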

Page 294: Non-Parametric Methods in Machine Learning

Artificial Neural NetworksPart III

Faustino GomezIDSIA

Intelligent Systems, Fall 2007

Page 295: Non-Parametric Methods in Machine Learning

Some observations about MLPs

• Each hidden layer implements a set of feature detectors
• Remember: backpropagation refers to the computation of the weight derivatives w.r.t. the error
• Gradient descent is one way that the weights can be adjusted using these derivatives

Page 296: Non-Parametric Methods in Machine Learning

Advantages of NNs

• Universal approximation
– A single hidden layer is sufficient!
• Efficient hardware implementation
• Noise tolerance
• Graceful degradation

Page 297: Non-Parametric Methods in Machine Learning

Open Issues with NNs

• The error surface may have many local minima
• Model selection: what is the best network topology for a given problem?
• Generalization: how can we ensure that a network captures the underlying function?
• Opacity: NNs are black boxes that are not directly human-interpretable

Page 298: Non-Parametric Methods in Machine Learning

Local Minima

Page 299: Non-Parametric Methods in Machine Learning

Local minima

• Most algorithms which have difficulties with simple tasks get much worse with more complex tasks
• Many dimensions make for many descent options
• Local minima are more common with very simple/toy problems, more rare with larger problems and larger nets
• Even if there are occasional minima problems, one could simply train multiple nets and pick the best
• Some algorithms add noise to the updates to escape minima

Page 300: Non-Parametric Methods in Machine Learning

Generalization

• Overfitting/underfitting
– How many nodes/layers?
• Validation and test sets
• Training set coverage and size

Page 301: Non-Parametric Methods in Machine Learning

Enhancements To Gradient Descent

• Momentum
– Adds a percentage of the last movement to the current movement:

Δw(t + 1) ← η·δ·y + α·Δw(t),   where α is the momentum term

Page 302: Non-Parametric Methods in Machine Learning

Momentum

The weight update maintains momentum in the direction it has been going:
• Faster in flat regions
• Could leap past minima
• Significant speed-up; a common value is α ≈ 0.9
• Effectively increases the learning rate in areas where the gradient is consistently the same sign

Δw(t + 1) ← η·δ·y + α·Δw(t)

Page 303: Non-Parametric Methods in Machine Learning

Annealing

• Adjust the learning rate during learning
• Start with a large learning rate (e.g. 0.5)
• Gradually decrease it as the error goes down
• Allows for big jumps at the beginning and fine tuning later, when close to a minimum

Page 304: Non-Parametric Methods in Machine Learning

NN Applications

• NETtalk (Sejnowski and Rosenberg 1987)

• Fraud detection (credit cards, loanapplications)

• Financial prediction• Compression• Continuous, non-linear control

Page 305: Non-Parametric Methods in Machine Learning

NETtalk

[Figure: the NETtalk network, mapping text via linguistic features to a speech synthesizer.]

Page 306: Non-Parametric Methods in Machine Learning

Data Compression

Network learns to output its input

Auto-encoder network

Page 307: Non-Parametric Methods in Machine Learning

Data Compression

The input representation is compressed in a smaller hidden layer: h hidden units with h << n inputs.

Auto-encoder network

Page 308: Non-Parametric Methods in Machine Learning

Recurrent Neural Networks

Intelligent Systems, Fall 2007

Faustino GomezIDSIA

Page 309: Non-Parametric Methods in Machine Learning

Sequence Learning

• Up to now we have looked at static mappings:

y = f(x(t)),   ∀t

where t, time, just imposes an ordering on the input patterns.

• It doesn't matter when x was presented to the network.

Page 310: Non-Parametric Methods in Machine Learning

Sequence Learning
• Now we look at sequential inputs, where the output y can depend on more than just the immediate input:

y = f(s(t)) = F(x(t), x(t − 1), ..., x(1))

where s(t) is the state, which can change every time a new input arrives:

s(t + 1) ← g(s(t), x(t))

• State provides memory, which in NNs is implemented by feedback or recurrent connections.

Page 311: Non-Parametric Methods in Machine Learning

Why study sequences?
• Many natural processes are inherently sequential
– Speech
– Vision
– Natural language
– DNA
• In robotics tasks, short-term memory can be essential for determining the state of the world due to limited sensor information

Page 312: Non-Parametric Methods in Machine Learning

Sequential Training Set

[[x¹(t₁), x¹(t₁ − 1), x¹(t₁ − 2), ..., x¹(1)], d¹]
[[x²(t₂), x²(t₂ − 1), x²(t₂ − 2), ..., x²(1)], d²]
[[x³(t₃), x³(t₃ − 1), x³(t₃ − 2), ..., x³(1)], d³]
...
[[xᴺ(t_N), xᴺ(t_N − 1), xᴺ(t_N − 2), ..., xᴺ(1)], dᴺ]

where tᵢ is the length of sequence i, and x and d are vectors.

example ↦ [[input sequence], target]
input sequence ↦ [x(t), x(t − 1), x(t − 2), ..., x(1)]

Page 313: Non-Parametric Methods in Machine Learning

Recurrent Non-Linear Neuron

Page 314: Non-Parametric Methods in Machine Learning

Recurrent Non-Linear Neuron

yₖ = σ( Σᵢ₌₁^I wᵢₖ xᵢ  (input connections)  +  Σⱼ₌₁^H wⱼₖ yⱼ  (recurrent connections)  +  b )

where I is the number of inputs to neuron yₖ and H is the number of hidden units in the same layer as yₖ.

Now the output of a unit depends on both the input and also the output from other neurons.
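A minimal sketch of this recurrent unit run over a sequence (the weight matrices, sizes, and inputs are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def srn_forward(sequence, W_in, W_rec, b):
        """Run a simple recurrent layer over a sequence of input vectors."""
        h = np.zeros(W_rec.shape[0])               # initial context is all zeros
        outputs = []
        for x in sequence:
            h = sigmoid(W_in @ x + W_rec @ h + b)  # input + recurrent connections
            outputs.append(h.copy())
        return outputs

    rng = np.random.default_rng(0)
    W_in, W_rec, b = rng.normal(size=(4, 2)), rng.normal(size=(4, 4)), np.zeros(4)
    seq = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
    for h in srn_forward(seq, W_in, W_rec, b):
        print(h)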

Page 315: Non-Parametric Methods in Machine Learning

Simple Recurrent Network (SRN)

Page 316: Non-Parametric Methods in Machine Learning

Simple Recurrent Network (SRN)

Hidden units now have state or memory which is dependent on all previous inputs

Page 317: Non-Parametric Methods in Machine Learning

SRN: another way to look at it

The context units are just the hidden-layer activation from the previous time step.

Page 318: Non-Parametric Methods in Machine Learning

Fully Connected RNN

Can approximate any differentiable trajectory. Same as the SRN but without the output layer.

Page 319: Non-Parametric Methods in Machine Learning

Train with truncated backpropagation

• Train the same way as an MLP
• But treat the activation from the previous time-step as just another set of inputs (the context units)
• The network can now learn to map the same external inputs to different outputs due to "context"

Page 320: Non-Parametric Methods in Machine Learning

Backpropagation Through Time

• Just like backpropagation, but the network is "unfolded" spatially for each time-step in the input sequence
• For an n-step sequence, we get a network with n layers
• Each layer has the same weights
• The error at the output is propagated back through all layers

Page 321: Non-Parametric Methods in Machine Learning

Backpropagation Through Time

[Figure: the unfolded network. At the top, Input → State/Hidden → Output (via the output weights); below it, State/Hidden(t−1), State/Hidden(t−2), State/Hidden(t−3), each with its own Input, connected by the recurrent weights. The input weights and recurrent weights are the same at every step (the weight-sharing applies to the SRN only), and the error is propagated further back through the stack.]

Page 322: Non-Parametric Methods in Machine Learning

Simple example: XOR with delay

• The network learns to perform XOR of two or more inputs,
• But instead of outputting the XOR of the current inputs,
• It outputs the XOR of the input it saw n steps ago:

f(x(t)) = XOR(x(t − n))

Page 323: Non-Parametric Methods in Machine Learning

Delayed XOR

Page 324: Non-Parametric Methods in Machine Learning

Vanishing error gradient
• Although RNNs can represent arbitrary sequential behavior, they are not easy to train when the output depends on some input more than around 10 time-steps in the past
• The error gradient becomes very small, so the weights cannot be adjusted to respond to events far in the past
• Might as well use an MLP with an input layer n time-steps wide... if you know n in advance!
• Solutions:
– Long Short-Term Memory
– Neuroevolution (later in the semester)

Page 325: Non-Parametric Methods in Machine Learning

Long Short-Term Memory (LSTM)

LSTM Cell

• LSTM nets have memory cells with a linear state S that keeps error flowing back in time and is controlled by 3 gates
• The input gate (Gi) controls what information enters the state
• The output gate (Go) controls what information leaves the state to other cells
• The forget gate (Gf) allows the cell to forget the state when it is no longer needed

Page 326: Non-Parametric Methods in Machine Learning

Intelligent Systems Midterm

Fall 2007

Question 1 (15 points)

The geometric distribution P over the set of natural numbers N = {1, 2, ...} is given by P(k) = (1 − p)^(k−1) p for k ∈ N. Derive the Maximum Likelihood Estimator (MLE) for the parameter p.

Solution

See the example at page 15, Lecture 3. The ML estimator is

p̂ = arg maxₚ L(k₁, ..., kₙ | p) = arg maxₚ log L(k₁, ..., kₙ | p)   (1)

The likelihood of p in light of an arbitrary sample (k₁, ..., kₙ) is

L(k₁, ..., kₙ | p) = ∏ᵢ₌₁ⁿ P(kᵢ | p) = ∏ᵢ₌₁ⁿ [(1 − p)^(kᵢ−1) p],   (2)

whose logarithm is

Page 327: Non-Parametric Methods in Machine Learning

LL(k₁, ..., kₙ | p) = log ∏ᵢ₌₁ⁿ [(1 − p)^(kᵢ−1) p]   (3)
= Σᵢ₌₁ⁿ log[(1 − p)^(kᵢ−1) p]   (4)
= Σᵢ₌₁ⁿ [(kᵢ − 1) log(1 − p) + log p]   (5)
= [Σᵢ₌₁ⁿ kᵢ − n] log(1 − p) + n log p.   (6)

The value of p which maximizes (6) can be found by evaluating its derivative with respect to p and setting it to 0:

d LL(k₁, ..., kₙ | p) / dp = −[Σᵢ₌₁ⁿ kᵢ − n] / (1 − p) + n / p = 0   (7)

([−Σᵢ₌₁ⁿ kᵢ + n] p + n(1 − p)) / ((1 − p) p) = 0   (8)

−p Σᵢ₌₁ⁿ kᵢ + np + n − np = 0.   (9)

Solving for p we obtain the ML estimator

p̂ = n / Σᵢ₌₁ⁿ kᵢ.   (10)
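As a quick numeric sanity check of (10) (not part of the exam solution), simulated geometric data gives an estimate close to the true p:

    import numpy as np

    rng = np.random.default_rng(42)
    p_true = 0.3
    k = rng.geometric(p_true, size=10000)   # support k = 1, 2, ... as in the question
    p_hat = len(k) / k.sum()
    print(p_hat)                            # close to 0.3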

Page 328: Non-Parametric Methods in Machine Learning

Question 2 (15 points)

Find the Maximum A Posteriori (MAP) estimator for the parameter λ of the exponential distribution

f(x | λ) = λ e^(−λx)

with prior f(λ) = e^(−λ).

Solution

See the example at page 45 of Lecture 3. The MAP estimate of λ is obtained by maximizing its posterior probability given the data. Using the Bayes rule we can write:

λ̂ = arg max_λ f(λ | x₁, ..., xₙ) = arg max_λ [ f(x₁, ..., xₙ | λ) f(λ) / f(x₁, ..., xₙ) ]   (11)

or, as the denominator does not depend on λ,

λ̂ = arg max_λ f(x₁, ..., xₙ | λ) f(λ)   (12)

Taking the logarithm:

λ̂ = arg max_λ [ log f(x₁, ..., xₙ | λ) + log f(λ) ]   (13)

The likelihood of λ in light of an arbitrary sample (x₁, ..., xₙ) is

f(x₁, ..., xₙ | λ) = ∏ᵢ₌₁ⁿ f(xᵢ | λ) = ∏ᵢ₌₁ⁿ (λ e^(−λxᵢ))   (14)

The logarithm of (14) is

log f(x₁, ..., xₙ | λ) = log ∏ᵢ₌₁ⁿ (λ e^(−λxᵢ)) = Σᵢ₌₁ⁿ log(λ e^(−λxᵢ)) = n log λ − λ Σᵢ₌₁ⁿ xᵢ   (15)–(16)

while log f(λ) = log(e^(−λ)) = −λ.

Page 329: Non-Parametric Methods in Machine Learning

The value of λ which maximizes (13) can be found by evaluating its derivative with respect to λ and setting it to 0:

d[ log f(x₁, ..., xₙ | λ) + log f(λ) ] / dλ = d[ n log λ − λ Σᵢ₌₁ⁿ xᵢ − λ ] / dλ = n/λ − Σᵢ₌₁ⁿ xᵢ − 1 = 0.   (17)–(19)

Solving for λ we obtain the MAP estimator

λ̂ = n / ( Σᵢ₌₁ⁿ xᵢ + 1 )   (20)

Question 3 (10 points)

Derive the Bayes rule.

Solution

The probability of an event A given an event B (p. 14 of Lecture 4) is defined as

P(A | B) = P(A, B) / P(B)   (21)

which implies

P(A, B) = P(A | B) P(B)   (22)

Symmetrically, one can obtain

P(A, B) = P(B | A) P(A)   (23)

Substituting (23) in (21) we obtain the Bayes rule

P(A | B) = P(B | A) P(A) / P(B)   (24)

4

Page 330: Non-Parametric Methods in Machine Learning

Question 4 (5 points)

Consider the classification problem with discrete objects and binary labels Y ∈ {0, 1}; each object consists of d discrete components (features) (X(1), ..., X(d)). Suppose all components X(i) are independent given the label. Assume that we know the (unconditional) class probability P(Y = a), and the probabilities P(X(i) | Y = a) of each value of each single component given each class label. How would you assign a label a to a given object (X(1), ..., X(d))?

Solution

With Naive Bayes (p. 26, Lecture 4), assign the label a with maximum posterior probability:

a = arg maxₐ [ P(Y = a) ∏ᵢ₌₁^d P(X(i) | Y = a) ]   (25)

Question 5 (5 points)

Are the points (3, 2,−6, 0, 2) and (4,−1, 5,−3, 2) classified to be in thesame class or in different classes by the hyperplane represented by the vector(1,−1, 0, 1, 1)?

Solution

YES, the points are in the same class. The sign of the dot product between a point (vector) and the vector representing the hyperplane tells you on which side of the hyperplane the data point lies. So if the dot products for both points have the same sign, the points are in the same class; otherwise, they are not.

(3, 2, −6, 0, 2) · (1, −1, 0, 1, 1) = 3 + (−2) + 0 + 0 + 2 = 3

(4, −1, 5, −3, 2) · (1, −1, 0, 1, 1) = 4 + 1 + 0 + (−3) + 2 = 4
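The same check in code (a small illustrative sketch using numpy):

    import numpy as np

    w = np.array([1, -1, 0, 1, 1])
    p1 = np.array([3, 2, -6, 0, 2])
    p2 = np.array([4, -1, 5, -3, 2])
    print(np.dot(p1, w), np.dot(p2, w))                        # 3 and 4
    print(np.sign(np.dot(p1, w)) == np.sign(np.dot(p2, w)))    # True: same side, same class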

5

Page 331: Non-Parametric Methods in Machine Learning

Question 6 (5 points)

What is the difference between supervised and unsupervised learning?

Solution

In supervised learning the examples in the training set have targets, which tell the learning algorithm what the correct output of the learner should be when it sees each training input. In the case of classification tasks the targets are class labels; for function approximation tasks (regression) the targets are real-valued vectors.

In unsupervised learning there are no targets, just the data. Here we want to learn the underlying structure of the data, e.g. how the data is distributed, the number of clusters, etc.

Question 7 (5 points)

What do the eigenvalues tell us about the "components" generated by PCA?

Solution

The eigenvalues tell us the relative "importance" of each of the components (which are the eigenvectors of the covariance matrix of the data). The first principal component has the largest eigenvalue and therefore captures the most variance in the data. The second principal component has the second largest eigenvalue, and so on.
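A small illustrative sketch (not part of the exam solution): the eigenvalues of the covariance matrix of a stretched Gaussian cloud reflect the variance captured by each component.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])   # stretched cloud
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending order
    print(eigvals[::-1])                        # roughly 9 and 0.25
    print(eigvals[::-1] / eigvals.sum())        # fraction of variance per component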

6

Page 332: Non-Parametric Methods in Machine Learning

Question 8 (10 points)

List at least two factors that affect the ability of a neural network to generalize after training.

Solution

1. Network architecture (topology): how many layers and how many units in each layer. With too many parameters (weights) the network might overfit the training set; that is, it obtains very low error on the training set but does not generalize to the testing set. With too few parameters the network might underfit the training set.

2. Training set. The way in which the training set samples the input space can affect generalization on the test set. For instance, if the training set is very sparse (small in relation to the dimensionality of the input space), then the network will have more freedom in how it interpolates between the training set examples when generalizing to examples in the test set. If the input space is sampled densely, then training will take longer and may require a larger network, but generalization should be better.

Other factors, such as the initial (random) weights and the learning rate, can also affect generalization.

Question 9 (5 points)

What is the key drawback of the perceptron?

Solution

It can only solve linearly separable problems.

7

Page 333: Non-Parametric Methods in Machine Learning

Question 10 (5 points)

What are local minima, and why are they a problem?

Solution

Local minima are points on the error surface where the derivative is zero without being the global minimum. They are a problem because if we are following the gradient in order to reduce the error (as we do in normal backpropagation), the gradient goes to zero at a local minimum, so we cannot reduce the error further even though there are points on the error surface with lower error. In order to get out of a local minimum we have to go "uphill", against the local gradient (see slide 5 in part 3 of the NN lectures).

Question 11 (10 points)

Why is the derivative of the error with respect to a particular weight in an MLP calculated differently if the weight is in the output layer versus any other layer?

Solution

This is because a weight in the output layer only affects the error through the output unit it is connected to. That is, any change to a weight in the output layer will only affect the error of the network by affecting the output of the unit to which it is connected.

A weight in a "previous" layer affects the error more indirectly. Assuming a two-layer network (input layer and output layer), a change in a weight in the input layer (i.e. a weight connecting an input unit to a hidden unit) will affect the output of the hidden unit it is connected to. This hidden unit is then connected to ALL of the output units. Therefore, the derivative of the error with respect to that weight must take into account the error term δₖ for each of the k output units (see slide 16 in part 2 of the NN lectures).

8

Page 334: Non-Parametric Methods in Machine Learning

Question 12 (5 points)

What do the "context units" in a recurrent neural network represent?

Solution

The context units represent the hidden layer activation from the previous time-step. When the first element in a sequence is input to a recurrent neural network, the activation of the hidden layer (i.e. the output of the hidden layer units) is copied to the context units and becomes part of the input to the RNN when the second element in the sequence is processed. Likewise, the hidden layer activation caused by the second element being input (along with its "context" units) provides the "context" for the third element in the sequence, and so on (see slide 10 in part 4 of the NN lectures).

Question 13 (5 points)

Would a recurrent neural network in principle be able to learn to distinguish between these two sequences:

a b c d e f g
b b c d e f g

What about:

a b c d e f g
a c b d e f g

Solution

In both cases the answer is YES. In the first case the sequences differ in the first element; in the second case two elements are swapped. However, all that matters is that the sequences are different. In both cases, the fact that the sequences are not identical is enough for the RNN to distinguish them, in principle.

9

Page 335: Non-Parametric Methods in Machine Learning

Support Vector MachinesPart I (Linear SVMs)

Intelligent Systems, Fall 2007

Faustino GomezIDSIA

Page 336: Non-Parametric Methods in Machine Learning

SVMs: Basic Idea
• Map the original, possibly linearly inseparable data into a higher-dimensional feature space:

x → Φ(x)   (for the linear case: Φ(x) = x)

where the original data is hopefully linearly separable
• Find the "best" hyperplane in the feature space using a linear classifier

Page 337: Non-Parametric Methods in Machine Learning

[Figure: original data x (left) mapped into the feature space Φ(x) (right).]

Page 338: Non-Parametric Methods in Machine Learning

Perceptron Revisited

Linear classifier:  y(x) = sign(w·x + b)

The hyperplane w·x + b = 0 separates the regions w·x + b > 0 and w·x + b < 0.

Page 339: Non-Parametric Methods in Machine Learning

• All of the above hyperplanes correctly classify the training set
• Which hyperplane is the best, i.e. which generalizes the best?
• One possibility: the hyperplane with the maximum margin

Page 340: Non-Parametric Methods in Machine Learning

Statistical Learning:Capacity and VC dimension

• To guarantee an upper bound on the generalization error, the capacity of the learned functions must be controlled.
– too much capacity: overfitting
– too little capacity: underfitting
• The Vapnik-Chervonenkis (VC) dimension is one of the most popular measures of capacity.

Page 341: Non-Parametric Methods in Machine Learning

VC Dimension

• A classification model f with some parameter vector w is said to shatter a set of data points if, for all assignments of labels to those points, there exists a w such that f classifies all points correctly.
• For a model f, the VC dimension h is the maximum number of points that can be arranged so that f shatters them.

Page 342: Non-Parametric Methods in Machine Learning

A line can shatter 3 points: VC dimension = 3

VC Dimension of a line

Page 343: Non-Parametric Methods in Machine Learning

A line cannot shatter four points.

There is no way to arrange the 4 points such that a line can correctly classify all possible labelings.

Page 344: Non-Parametric Methods in Machine Learning

Structural risk minimization

• A function that (1) minimizes the empirical risk (i.e. the training set error) and (2) has low VC dimension will generalize well regardless of the dimensionality of the input space.

With probability (1 − δ) (Vapnik, 1995, "Structural Minimization Principle"):

err_true ≤ err_train + √[ ( VC·(log(2n/VC) + 1) − log(δ/4) ) / n ]

Page 345: Non-Parametric Methods in Machine Learning

Margin of separation andoptimal hyperplane

• Vapnik has shown that maximizing the margin of separation between the classes is equivalent to minimizing the VC dimension.
• The optimal hyperplane is the one giving the largest margin of separation between the classes.

Page 346: Non-Parametric Methods in Machine Learning

Definition of Margin
• Distance from a data point to the boundary:

r = (w·x + b) / ||w||

• Data points closest to the boundary are called support vectors
• The margin d is the perpendicular distance of the closest point to the hyperplane

Page 347: Non-Parametric Methods in Machine Learning

Distance from a vector to the hyperplane

y(x) / ||w|| = (w·x + b) / ||w|| = r

[Figure: the hyperplane y = 0 separates Class 1 (y > 0) from Class 2 (y < 0); w is normal to the hyperplane, its offset from the origin along w is −b/||w||, and a point x decomposes into its projection x⊥ onto the hyperplane plus r·w/||w||.]

Page 348: Non-Parametric Methods in Machine Learning

Write x = x⊥ + r·w/||w||, where x⊥ lies on the hyperplane. Then

w·x + b = w·(x⊥ + r·w/||w||) + b = (w·x⊥ + b) + r·||w||²/||w|| = 0 + r·||w||

so y(x) = r·||w|| and

r = y(x) / ||w||

The margin is the r of the closest point to the hyperplane.

Page 349: Non-Parametric Methods in Machine Learning

Maximizing the Margin

Since dᵢ ∈ {−1, +1} (points are either in one class or the other), there exists a set of parameters w such that:

y(xᵢ) > 0 for dᵢ = +1 and y(xᵢ) < 0 for dᵢ = −1, so that dᵢ·y(xᵢ) > 0 for all data points.

That is, w correctly classifies all points. So the distances we want to maximize are:

dᵢ·y(xᵢ) / ||w|| = dᵢ·(w·xᵢ + b) / ||w||

Page 350: Non-Parametric Methods in Machine Learning

Maximum Margin Classification

• Maximize the minimum distance from the hyperplane:

argmax_{w,b} { (1/||w||) · minᵢ [ dᵢ·(w·xᵢ + b) ] }

Page 351: Non-Parametric Methods in Machine Learning

Canonical Hyperplanes
• r does not change if we multiply w and b by some number k:

d·y(x)/||w|| = d·(w·x + b)/||w|| = d·(kw·x + kb)/||kw||   ∀k ∈ ℝ

• Therefore, we can set d·(w·x + b) = 1 for the point closest to the hyperplane.
• So that all points satisfy the constraints:

dᵢ·(w·xᵢ + b) ≥ 1   ∀i

Page 352: Non-Parametric Methods in Machine Learning

How do we find the maximum margin hyperplane?

• Maximize 1/||w||, which is the same as minimizing ||w||²
• Subject to the linear constraints:

dᵢ·(w·xᵢ + b) ≥ 1   ∀i

• Recall dᵢ·y(xᵢ)/||w|| and that d·(w·x + b) = 1 for the closest points, known as support vectors

This is a quadratic optimization problem.

Page 353: Non-Parametric Methods in Machine Learning

Support Vector MachinesPart II (Non-Linear SVMs)

Intelligent Systems, Fall 2007

Faustino GomezIDSIA

Page 354: Non-Parametric Methods in Machine Learning

SVMs: Basic Idea
• Map the original, possibly linearly inseparable data into a higher-dimensional feature space:

x → Φ(x)   (for the linear case: Φ(x) = x)

where the original data is hopefully linearly separable
• Find the "best" hyperplane in the feature space using a linear classifier

Page 355: Non-Parametric Methods in Machine Learning

[Figure: original data x (left) mapped into the feature space Φ(x) (right).]

Page 356: Non-Parametric Methods in Machine Learning

Observations
• The solution for the perceptron is a linear combination of the training points:

w = Σᵢ αᵢ dᵢ xᵢ,   αᵢ ≥ 0

• Only uses informative points (mistake driven)
• The coefficient of a point in the combination reflects its "difficulty"

Page 357: Non-Parametric Methods in Machine Learning

Perceptron Learning Algorithm

Input: a list of n training examples (x₁, d₁) ... (xₙ, dₙ), where ∀i: dᵢ ∈ {+1, −1}
Output: classifying hyperplane w

Algorithm:
  Randomly initialize w;
  while w makes errors on the training set do
    for each (xᵢ, dᵢ) do
      let yᵢ = sign(w · xᵢ);
      if yᵢ ≠ dᵢ then w ← w + η·dᵢ·xᵢ;
    end
  end

The resulting weight vector can be written as w = Σᵢ αᵢ dᵢ xᵢ with αᵢ ≥ 0.

Page 358: Non-Parametric Methods in Machine Learning

Dual Representation

• The decision function can be rewritten as follows:

w = Σᵢ αᵢ dᵢ xᵢ

f(x) = w·x + b = Σᵢ αᵢ dᵢ (xᵢ·x) + b

• The data only appears within dot products!

Page 359: Non-Parametric Methods in Machine Learning

Non-Linear SVMs

• So far we have seen only the linear case, Φ(x) = x, where the feature space is the same as the input (data) space
• For data that is not linearly separable we need to map to a richer feature space
• where the data can be correctly classified with a line

Page 360: Non-Parametric Methods in Machine Learning

Non-linear SVMs
• Datasets that are linearly separable work fine.
• But what do we do if the dataset is just too hard?
• If we map to a higher-dimensional space, the data is now linearly separable.

[Figure: points on a 1-D axis x that cannot be separated by a threshold become linearly separable after mapping into a 2-D space (x, y).]

Page 361: Non-Parametric Methods in Machine Learning

Non-linear SVMs: another example

Φ: x → φ(x)

Page 362: Non-Parametric Methods in Machine Learning

Implicit Mapping to Feature Space

Problem: working in high-dimensional feature spaces can be computationally intractable (very large vectors).

Solution: use kernels
• solves the computational problem of working with high dimensions
• makes even infinite dimensions possible

Page 363: Non-Parametric Methods in Machine Learning

Kernel Induced Feature Spaces

• Add the feature space mapping to the dual representation:

f(x) = Σᵢ αᵢ dᵢ K(xᵢ, x) + b

• A kernel is a function that returns the dot product between two images in feature space:

K(x, y) = Φ(x)·Φ(y)   (before, this was simply x·y)

where K(·, ·) is the kernel function.

Page 364: Non-Parametric Methods in Machine Learning

Kernel Induced Feature Spaces

• For the linear case it's just the dot product of the original data:

x·y = K(x, y) = Φ(x)·Φ(y)

• For the non-linear case Φ(x) can be much more complex and map to even infinite-dimensional spaces
• Kernel trick: we don't need to know Φ(x)! We don't need to represent the feature space explicitly!

Page 365: Non-Parametric Methods in Machine Learning

Examples of Kernel Functions
• Linear:  K(x, y) = x·y
• Polynomial of power d:  K(x, y) = (x·y)^d
• Gaussian:  K(x, y) = exp( −||x − y||² / 2σ² )
• Sigmoid:  K(x, y) = tanh( β₀·x·y + β₁ )

Page 366: Non-Parametric Methods in Machine Learning

Example: polynomial kernel

K(x, y) = (x·y)^d

Page 367: Non-Parametric Methods in Machine Learning

Polynomial Kernel (cont.)

Given two data points x = (x₁, x₂) and y = (y₁, y₂), for d = 2:

K(x, y) = (x₁y₁ + x₂y₂)²
        = x₁²y₁² + x₂²y₂² + 2·x₁y₁x₂y₂
        = (x₁², x₂², √2·x₁x₂) · (y₁², y₂², √2·y₁y₂)
        = Φ(x)·Φ(y)

Page 368: Non-Parametric Methods in Machine Learning

SVM architecture

[Figure: the input (data) vector (x₁, x₂, ..., x_m) feeds a kernel layer computing K(x, xᵢ) for each support vector xᵢ; a linear output neuron with bias b combines them:]

f(x) = Σᵢ αᵢ dᵢ K(xᵢ, x) + b

Page 369: Non-Parametric Methods in Machine Learning

Some Issues
• Choice of kernel
– A Gaussian or polynomial kernel is the default
– If ineffective, more elaborate kernels are needed
– Domain experts can give assistance in formulating appropriate similarity measures
• Choice of kernel parameters
– e.g. σ in the Gaussian kernel
– σ is the distance between the closest points with different classifications
– In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters.

Page 370: Non-Parametric Methods in Machine Learning

Reinforcement LearningFaustino Gomez

IDSIA

Intelligent Systems, Fall 2007

Page 371: Non-Parametric Methods in Machine Learning

Supervised Learning revisited

What if we don't know good targets d for our input samples?

input space: x ∈ X
output space: d ∈ D
find f: X → D using a training set of examples {(xᵢ, dᵢ)}, i = 1..N

Page 372: Non-Parametric Methods in Machine Learning

Sequential Decision Tasks

[Figure: the agent observes the state of the environment and selects an action.]

The agent sees the state of the environment at each time-step and must select the best action to achieve a goal.
Problem: we don't know what the correct actions (targets) are beforehand, so we can't learn from examples!

Page 373: Non-Parametric Methods in Machine Learning

Sequential Decision Tasks

[Figure: the agent observes the state and selects an action.]

Decision sequence:

sₜ --aₜ--> sₜ₊₁ --aₜ₊₁--> sₜ₊₂ --aₜ₊₂--> sₜ₊₃ --aₜ₊₃--> ...

sₜ: state of the environment at time t
aₜ: action taken by the agent at time t after seeing sₜ

Page 374: Non-Parametric Methods in Machine Learning

Examples of Sequential Decision Tasks

• Autonomous robotics
• Controlling chemical processes
• Network routing
• Game playing
• Stock trading

Page 375: Non-Parametric Methods in Machine Learning

Example: Pole balancingbenchmark

Page 376: Non-Parametric Methods in Machine Learning

Reinforcement Learning Problem

[Figure: the agent now also receives a reward signal from the environment.]

Now the agent receives a reward or reinforcement signal that gives some indication of whether its behavior is "good" or "bad".

Page 377: Non-Parametric Methods in Machine Learning

Reinforcement Learning Problem

Now we have:

sₜ --aₜ--> sₜ₊₁ (reward rₜ₊₁) --aₜ₊₁--> sₜ₊₂ (rₜ₊₂) --aₜ₊₂--> sₜ₊₃ (rₜ₊₃) --aₜ₊₃--> ...

Page 378: Non-Parametric Methods in Machine Learning

Reinforcement Learning Problem
The goal is to learn a policy that maximizes the reward r over the long term:

Rₜ = Σₖ₌₀^T γᵏ·rₜ₊ₖ₊₁,   0 ≤ γ ≤ 1

where γ is the discount rate. γ = 1 means that all rewards received matter equally; γ < 1 means that rewards further in the future are less important.

A policy is the agent's function that maps states to actions:

π(sₜ) → aₜ   (note: this is the deterministic case)

For each state the agent encounters, the policy tells it what action to take. The best policy is the one that selects, in each state, the action that leads to the highest long-term reward.

Page 379: Non-Parametric Methods in Machine Learning

Why are Reinforcement Learning Problems Hard to Solve?

• Not just trying to learn known behavior and then generalizing from it
• Have to discover behavior from scratch
• Only have a scalar reinforcement to guide learning
• Reinforcement may be infrequent
• Credit assignment problem
– How much credit should each action in the sequence of actions get for the outcome?

Page 380: Non-Parametric Methods in Machine Learning

Solving Reinforcement Problems

• If the problem can be formulated as a Markov Decision Process,
• we can use a value function to represent how "good" each state is in terms of providing reward
• Use various methods to learn the value function
– Dynamic Programming
– Temporal Difference Learning (e.g. Q-learning)

Page 381: Non-Parametric Methods in Machine Learning

Markov Decision Processes

a finite set of states: s ∈ S   (also known as the state space)
a finite set of actions: a ∈ A   (also known as the action space)
state transition probabilities: P^a_{ss'} = Pr{ sₜ₊₁ = s' | sₜ = s, aₜ = a }
reward function: R^a_{ss'} = E{ rₜ₊₁ | sₜ = s, aₜ = a }
a policy: π(s, a) = Pr{ aₜ = a | sₜ = s }

...and the Markov property must hold.

Page 382: Non-Parametric Methods in Machine Learning

Markov Decision Processes

(Same definitions as above; the state-transition probabilities P^a_{ss'} and the reward function R^a_{ss'} together constitute the model of the environment.)

Page 383: Non-Parametric Methods in Machine Learning

The Markov Property

This just means that the probability of the next state and reward only depends on the immediately preceding state and action:

P^a_{ss'} = Pr{ sₜ₊₁ = s', rₜ₊₁ = r | sₜ, aₜ, rₜ, sₜ₋₁, aₜ₋₁, rₜ₋₁, ..., s₀, a₀, r₀ }
          = Pr{ sₜ₊₁ = s', rₜ₊₁ = r | sₜ, aₜ }

It doesn't matter what happened before that!

Page 384: Non-Parametric Methods in Machine Learning

Example transition matrix

Each action a has its own transition matrix P^a.

[Figure: a 4-state example; the entry in row s and column s' gives the probability P^a_{ss'} of going from state s to state s' if the action is taken, and each row sums to 1.0. For instance, P^a_{s3 s4} = 0.6: there is a 60% chance of going to state 4 when action a is taken in state 3.]

Page 385: Non-Parametric Methods in Machine Learning

State transitions (cont.)

[Figure: the same transition matrix drawn as a transition graph; all edges leaving a state add up to 1.0.]

Page 386: Non-Parametric Methods in Machine Learning

Agent Policy

• The policy implements the agent's behavior
• In general, the policy will be stochastic; it gives the probability of taking action a in state s:

π(s, a) = Pr{ aₜ = a | sₜ = s }

• Often the policy is deterministic and we can write:

π(s) → a

i.e. for each state the policy says which action to use.

Page 387: Non-Parametric Methods in Machine Learning

Value Functions

The value function tells the agent how "good" it is to be in a given state:

V^π(s) = E_π{ Rₜ | sₜ = s },   where   Rₜ = Σₖ₌₀^T γᵏ·rₜ₊ₖ₊₁,  0 ≤ γ ≤ 1

(π is the agent's policy.) This says how much reward the agent can expect to receive in the future if it continues with its policy from state s.

Page 388: Non-Parametric Methods in Machine Learning

Value Functions

V^π(s) = E_π{ Rₜ | sₜ = s }
       = E_π{ Σₖ₌₀^T γᵏ·rₜ₊ₖ₊₁ | sₜ = s }
       = Σₐ π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·E_π{ Σₖ₌₀^T γᵏ·rₜ₊ₖ₊₂ | sₜ₊₁ = s' } ]
       = Σₐ π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·V^π(s') ]      (the Bellman equation)

The last bracket contains the value of the possible next state.

Page 389: Non-Parametric Methods in Machine Learning

Reinforcement LearningPart II (Policy Iteration)

Faustino GomezIDSIA

Intelligent Systems, Fall 2007

Page 390: Non-Parametric Methods in Machine Learning

More about R

Rₜ = Σₖ₌₀^T γᵏ·rₜ₊ₖ₊₁,   0 ≤ γ ≤ 1

R is also known as the return. It is how much reward the agent will receive from time t into the future.

• If γ is close to 0, the agent cares more about selecting actions that maximize immediate reward: shortsighted
• If γ is close to 1, the agent takes future rewards into account more strongly: farsighted

Page 391: Non-Parametric Methods in Machine Learning

Action-Value function

• Same as the value function, but gives a value for each action in state s
• Q(s, a) is the expected future reward if we take action a in state s and then continue selecting actions according to the policy π:

Q^π(s, a) = E_π{ Rₜ | sₜ = s, aₜ = a }
          = E_π{ Σₖ₌₀^T γᵏ·rₜ₊ₖ₊₁ | sₜ = s, aₜ = a }
          = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·E_π{ Σₖ₌₀^T γᵏ·rₜ₊ₖ₊₂ | sₜ₊₁ = s' } ]
          = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·V^π(s') ]

Page 392: Non-Parametric Methods in Machine Learning

Value-function vs. Q-function

V^π(s) = Σₐ π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·V^π(s') ]

Q^π(s, a) = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·V^π(s') ]

The Q-function implements one step of "lookahead": it caches the value of taking each action in a given state.

Page 393: Non-Parametric Methods in Machine Learning

The agent now consists of two components:
1. Value function (Q-function)
2. Policy

A policy can be computed from the values.

Page 394: Non-Parametric Methods in Machine Learning

How to compute a policy from Q?
• Greedy policy: select the action in each state with the highest value:

π(s) = argmaxₐ Q(s, a)

• ε-greedy policy: select the greedy action with probability 1 − ε and some other, random action with probability ε (this will be useful later)
• Stochastic policy: use the action values to select actions probabilistically (more on this later)

Page 395: Non-Parametric Methods in Machine Learning

Optimal Value functions

• The optimal value function V*(s) is the value function of the policy that generates the highest values for all states:

V*(s) = max_π V^π(s),   for all s ∈ S

• Likewise for the Q-function:

Q*(s, a) = max_π Q^π(s, a),   for all s ∈ S

Page 396: Non-Parametric Methods in Machine Learning

Dynamic Programming Methods: Policy Iteration

Two steps:
1. Policy Evaluation: compute the value function of the current policy
2. Policy Improvement: improve the policy with respect to the computed value function

Page 397: Non-Parametric Methods in Machine Learning

Policy Evaluation

• Use the Bellman equation as an update rule
• "Sweep" through the state space, computing the value V(s) for each state using the values of each of the possible next states s':

V^π(s) ← Σₐ π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·V^π(s') ]

Page 398: Non-Parametric Methods in Machine Learning

Policy Evaluation algorithm

[Figure: the policy evaluation algorithm, repeatedly applying the value backup above to every state; if the change in V is small enough, stop.]
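A minimal sketch of iterative policy evaluation on a tiny, made-up MDP (the transition probabilities, rewards, and uniform policy below are illustrative assumptions, not numbers from the slides):

    import numpy as np

    n_states, n_actions, gamma = 2, 2, 0.9
    P = np.array([[[0.8, 0.2], [0.1, 0.9]],      # P[a][s][s']
                  [[0.5, 0.5], [0.3, 0.7]]])
    R = np.array([[[1.0, 0.0], [0.0, 2.0]],      # R[a][s][s']
                  [[0.0, 0.5], [1.0, 0.0]]])
    pi = np.full((n_states, n_actions), 0.5)     # uniform random policy pi(s, a)

    V = np.zeros(n_states)
    for _ in range(200):                         # sweep until (approximately) converged
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                V_new[s] += pi[s, a] * np.sum(P[a, s] * (R[a, s] + gamma * V))
        if np.max(np.abs(V_new - V)) < 1e-8:
            break
        V = V_new
    print(V)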

Page 399: Non-Parametric Methods in Machine Learning

Value backups

[Figure: backup diagram — from state s, each action a is taken with probability π(s, a), and each next state s' follows with probability P^a_{ss'}.]

Page 400: Non-Parametric Methods in Machine Learning

Value backup example

4 states: S = {s1, s2, s3, s4}; 3 actions: A = {a1, a2, a3}.

[Figure: backup diagram for s1 — the branches π(s1, a1), π(s1, a2), π(s1, a3) lead to the transitions P^{a1}_{s1 s1}, P^{a1}_{s1 s2}, P^{a2}_{s1 s2}, P^{a2}_{s1 s3}, P^{a3}_{s1 s2}, P^{a3}_{s1 s4}, each ending in a next-state value γV(s').]

Note: in this case, each action only leads to some of the other states. That is, in some cases P^a_{s1 s'} = 0, e.g. P^{a1}_{s1 s3}, P^{a1}_{s1 s4}, and P^{a2}_{s1 s4}, among others.

Page 401: Non-Parametric Methods in Machine Learning

Value Backup example (cont.)

For action a1 in state s1, compute the bracketed term for each possible next state and add them:

P^{a1}_{s1 s1}·(γV(s'1) + R^{a1}_{s1 s1}) + P^{a1}_{s1 s2}·(γV(s'2) + R^{a1}_{s1 s2})

Do the same for the other actions (a2 and a3), weight each action's sum by π(s1, a), and add them all up:

V^π(s1) = Σₐ π(s1, a) Σ_{s'} P^a_{s1 s'} ( γV(s') + R^a_{s1 s'} )

i.e. compute the Bellman equation for state s1. Now do the same for the other states.

Page 408: Non-Parametric Methods in Machine Learning

Policy Improvement

Update the current policy to be greedy with respect to the value function:

For each s ∈ S:

π'(s) ← arg maxₐ Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·V^π(s') ]

Page 409: Non-Parametric Methods in Machine Learning

Combining the two steps

Cycle through policy evaluation and policy improvement:

evaluation:   V → V^π
improvement:  π → greedy(V)

π₀ --E--> V^{π₀} --I--> π₁ --E--> V^{π₁} --I--> π₂ --E--> ... --I--> π* --E--> V*

This eventually converges to the optimal policy!

Page 410: Non-Parametric Methods in Machine Learning

Policy IterationAlgorithm:

Page 411: Non-Parametric Methods in Machine Learning

Grid World example

S = {1, 2, ..., 14}
A = {up, down, left, right}

[Figure: a grid with shaded goal states; the agent can't leave the goal states, i.e. they are terminal.]

Page 412: Non-Parametric Methods in Machine Learning

Grid world example (cont.)

Page 413: Non-Parametric Methods in Machine Learning

Grid World example (cont.)

Page 414: Non-Parametric Methods in Machine Learning

Dynamic Programming Observations

Policy Iteration is guaranteed to converge for finite MDPs, but...

Problem: what if we don't know the model, P^a_{ss'} and R^a_{ss'}?   (next lecture)

Page 415: Non-Parametric Methods in Machine Learning

Reinforcement LearningPart III

Faustino GomezIDSIA

Intelligent Systems, Fall 2007

Page 416: Non-Parametric Methods in Machine Learning

Value Iteration

• Combine one sweep of Policy Evaluation with Policy Improvement
• Don't wait for evaluation to converge before switching to improvement
• Also converges to the optimal value function:

V_{k+1}(s) ← maxₐ Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·V_k(s') ]

Page 417: Non-Parametric Methods in Machine Learning

Value Iteration Algorithm

Page 418: Non-Parametric Methods in Machine Learning

What if we don’t know the model?

• This is the more general case, where the agent can only observe actual state transitions caused by its own actions
• We cannot use Dynamic Programming directly
• It turns out we can still converge to the optimal value function without the model!

Page 419: Non-Parametric Methods in Machine Learning

Temporal Difference Methods (TD)

• Use the difference between the value of the current state and of the next visited state to update the current state
• We don't need the values of all next states, which is good because in the real world we can only visit one at a time:

V(sₜ) ← V(sₜ) + α·[ rₜ₊₁ + γ·V(sₜ₊₁) − V(sₜ) ]

Page 420: Non-Parametric Methods in Machine Learning

TD vs. DP

TD:  V(sₜ) ← V(sₜ) + α·[ rₜ₊₁ + γ·V(sₜ₊₁) − V(sₜ) ]

DP:  V(s) ← Σₐ π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·V(s') ]

• In Dynamic Programming (e.g. Policy Iteration) we have to take expectations over possible actions and next states
• TD uses the current estimate of V at sₜ₊₁ to compute the new value for sₜ
• Note: there is no reference to time in DP

Page 421: Non-Parametric Methods in Machine Learning

TD Control: Q-learning

• Uses the Q-value of the "best" action in the next state
• A version of TD using the Q-function
• Because we have a value for each action, it can be used for on-line learning/control:

Q(sₜ, aₜ) ← Q(sₜ, aₜ) + α·[ rₜ₊₁ + γ·maxₐ Q(sₜ₊₁, a) − Q(sₜ, aₜ) ]

Page 422: Non-Parametric Methods in Machine Learning

Q-learning (cont.)

Q(sₜ, aₜ) ← Q(sₜ, aₜ) + α·[ ( rₜ₊₁ + γ·maxₐ Q(sₜ₊₁, a) ) − Q(sₜ, aₜ) ]

The bracketed quantity rₜ₊₁ + γ·maxₐ Q(sₜ₊₁, a) is the target, and α is the learning rate. Move the current Q-value of s and a toward the target by an amount determined by the learning rate (we've seen this before).

Page 423: Non-Parametric Methods in Machine Learning

Q-learning Algorithm
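The algorithm itself is shown as a figure in the original slides; the following is a minimal sketch of the tabular Q-learning loop with an ε-greedy policy (the environment interface env.reset() / env.step(a) → (s', r, done) and all parameter values are assumptions for illustration, not definitions from the slides):

    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=500,
                   alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = np.zeros((n_states, n_actions))
        rng = np.random.default_rng(0)
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy action selection (explore vs. exploit)
                a = rng.integers(n_actions) if rng.random() < epsilon else np.argmax(Q[s])
                s_next, r, done = env.step(a)
                # move Q(s,a) toward the target r + gamma * max_a' Q(s',a')
                target = r + gamma * np.max(Q[s_next])
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q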

Page 424: Non-Parametric Methods in Machine Learning

Exploration vs. Exploitation

• Exploitation: take actions already found to be good in each state, to maximize reward
• Exploration: take a chance on actions that may have lower value in order to learn more, and maybe find the true best action to exploit later

We need to balance the two!

Page 425: Non-Parametric Methods in Machine Learning

Balancing exploitation/exploration
• ε-greedy policy: select the greedy action with probability 1 − ε (exploit) and some other, random action with probability ε (explore)
• Stochastic policy: use the action values to select actions probabilistically, e.g. soft max:

π(s, b) = e^{Q(s,b)/τ} / Σₐ e^{Q(s,a)/τ},   where τ > 0 is the temperature

High temperatures increase exploration by making the policy more random; lower temperatures increase exploitation by making the policy more greedy.
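A tiny sketch of the soft max rule (the Q-values below are illustrative):

    import numpy as np

    def softmax_policy(q_values, tau=1.0):
        prefs = np.exp(np.asarray(q_values) / tau)
        return prefs / prefs.sum()

    q = [1.0, 2.0, 0.5]
    print(softmax_policy(q, tau=1.0))    # low temperature: highest-valued action dominates
    print(softmax_policy(q, tau=10.0))   # high temperature: nearly uniform (more exploration)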

Page 426: Non-Parametric Methods in Machine Learning

Limitations of Standard RL Methods

• Continuous state/action spaces
– Cannot represent the value function with a table
– Need some kind of function approximator to represent V
– Loss of convergence guarantees
• Partial observability
– The agent no longer sees the underlying state
– From the agent's perspective the Markov property does not hold
– Need to estimate the underlying state

Page 427: Non-Parametric Methods in Machine Learning

What if the state space is very large or continuous?

• Cannot represent the value function with a table
• Need some kind of function approximator to represent V or Q
• Loss of convergence guarantees

Page 428: Non-Parametric Methods in Machine Learning

Coarse Coding Approximation

• Each circle is a "receptive field"
• Each field has a learnable weight
• The state activates all receptive fields it falls within
• V(s) or Q(s, a) are computed by summing the weights of the activated fields (i.e. a linear approximator)

Page 429: Non-Parametric Methods in Machine Learning

Coarse Coding

Page 430: Non-Parametric Methods in Machine Learning

Tile Coding Approximation

Page 431: Non-Parametric Methods in Machine Learning

Tile Coding

• Each axis corresponds to one state variable
• Each tile has a learnable weight
• The state activates one tile in each tiling
• V or Q are computed by adding the weights of all activated tiles

Page 432: Non-Parametric Methods in Machine Learning

Representing Q(s,a) with an NN

[Figure: a feedforward network taking sₜ and aₜ as inputs and outputting Q(sₜ, aₜ).]

Page 433: Non-Parametric Methods in Machine Learning

Training the NN Q-function

Training is performed on-line using the Q-values from the agent's state transitions. For Q-learning:

input:  sₜ, aₜ
target: rₜ + γ·maxₐ Q(sₜ₊₁, a)

Page 434: Non-Parametric Methods in Machine Learning

What if the agent can't completely "see" the state?

• The environment is said to be partially observable
• The agent only receives an "observation" of the state, provided by its sensory system:

Ω(sₜ) → oₜ,   where o ∈ O, sₜ ↦ oₜ, and |O| << |S|

O is the set of possible observations, which is usually smaller than S. Think of Ω as the agent's sensory system.

Page 435: Non-Parametric Methods in Machine Learning

Perceptual Aliasing

• Different states can look the same or similar:

Ω(sᵢ) = Ω(sⱼ) = oₖ,   where i ≠ j

• If the action that is best for sᵢ is not good for sⱼ, we have a problem, because the agent will take the same action in both.

Page 436: Non-Parametric Methods in Machine Learning

RL under Partial Observability

• Now, from the agent's perspective, previous inputs (the observations) are important
• The current input is not enough to determine what state the environment is in!

Page 437: Non-Parametric Methods in Machine Learning

RL under Partial Observability

The agent now needs some way to determine the underlying state; this means we need memory.

One solution: use an RNN to represent the V or Q-function.

Page 438: Non-Parametric Methods in Machine Learning

Collective and Swarm Intelligence

Gianni Di Caro

IDSIA -USI/SUPSI

Page 439: Non-Parametric Methods in Machine Learning

Road map

■ Generalities on Collective/Swarm Intelligence (SI)

■ Main characteristics of SI design

■ Cellular Automata: the simplest and earliest example of SI

■ Particle Swarm Optimization: a state-of-the-art SI framework for continuous optimization

■ Generalities on Ant algorithms for adaptive task allocation, and Ant Colony Optimization, a state-of-the-art SI framework for combinatorial optimization and network routing.

Page 440: Non-Parametric Methods in Machine Learning

Let’s start with some examples . . .

■ Collective behaviors and problem solution in natural systems

✦ Vertebrates: swarming, flocking, herding, schooling

✦ Social insects (ants, termites, bees, wasps): nest building,foraging, assembly, sorting,. . .

Page 441: Non-Parametric Methods in Machine Learning

Fish schooling

( c©CORO, CalTech)

Page 442: Non-Parametric Methods in Machine Learning

Birds flocking in V-formation

( c©CORO, Caltech)

Page 443: Non-Parametric Methods in Machine Learning

Termites’ nest

( c©Masson)

Page 444: Non-Parametric Methods in Machine Learning

Honeybee comb

( c©S. Camazine)

Page 445: Non-Parametric Methods in Machine Learning

Swarm of killer bees

Page 446: Non-Parametric Methods in Machine Learning

Cooperation in ant colonies

Ant chain ( c©S. Camazine) Ant wall ( c©S. Camazine)

Page 447: Non-Parametric Methods in Machine Learning

Wasps’ nest

( c©G. Theraulaz)

Page 448: Non-Parametric Methods in Machine Learning

Ants and bees at work

■ Ants: leaf-cutting, breeding, chaining

■ Ants: Food catering

■ Ants: Learning the shortest path between food and nest

■ Bees: waggle dance to recruit workers

Page 449: Non-Parametric Methods in Machine Learning

What do all these behaviors have in common?

■ Distributed society of autonomous individuals/agents

■ Control is fully distributed among the agents

■ Communications among the individuals are localized

■ Interaction rules and information processing seem to be simple: minimalist agent capabilities

■ System-level behaviors appear to transcend the behavioral repertoire of the single agent

■ The overall response of the system features:✦ Robustness

✦ Adaptivity

✦ Scalability

Page 450: Non-Parametric Methods in Machine Learning

Bottom-up vs. top-down design

■ Ontogenetic and phylogenetic evolution has (necessarily) followed such a bottom-up approach (grassroots) to design systems:
✦ Instantiation of the basic units (atoms, cells, organs, organisms, individuals, ...) composing the system, letting them (self-)organize to generate more complex/organized system-level behaviors and/or structures
✦ Population + interaction protocols are "more important" than the single modules
■ On the other hand, from an engineering point of view we can also choose a top-down approach:
✦ Acquisition of comprehensive knowledge about the problem/system to deal with, analysis, decomposition, definition of a possibly optimal strategy

Page 451: Non-Parametric Methods in Machine Learning

Swarm intelligence (SI): a definition

■ SI refers to the bottom-up design of distributed systems that display forms of useful and/or interesting behavior at the global level as a result of the actions of a number of (relatively simple) composing units interacting with one another and with their environment at the local level.

■ A relatively new and quite successful computational design paradigm. It finds its main roots in the work developed at the beginning of the '90s on algorithms inspired by the behaviors of social insects (mainly ants). IDSIA has played a main role in laying the foundations and in the application of SI.

■ SI design has been applied to a wide variety of problems in optimization, telecommunications, robotics, and complex-system modeling. Many state-of-the-art implementations.

Page 452: Non-Parametric Methods in Machine Learning

Challenges of SI design

■ Given a task/problem to deal with, a number of design choices:
1. Characteristics/skills of the agents
2. Size of the population (related to choice 1)
3. Neighborhood definition
4. Interaction protocols and information to exchange
5. Where the information is updated (agent, channel, environment)
6. Use or not of randomness
7. Synchronous or asynchronous activities and interactions
8. ...
■ Lots of parameters
■ Predictability is an issue

Is a top-down approach better?

Page 453: Non-Parametric Methods in Machine Learning

Let’s focus on neighborhood & communication

■ Point-to-point: antennation, trophallaxis (food or liquid exchange), mandibular contact, direct visual contact, chemical contact, hardwired direct connections (neurons, cells), unicast radio contact

■ Limited-range broadcast: the signal propagates to some limited extent throughout the environment and/or is made available for a rather short time (e.g., use of the lateral line in fishes to detect water waves, generic visual detection, radio broadcast)

■ Indirect: two individuals interact indirectly when one of them modifies the environment and the other responds asynchronously to the modified environment at a later time. This is called stigmergy [Grassé, 1959] (e.g., pheromone laying/following, post-it notes, the web)

Page 454: Non-Parametric Methods in Machine Learning

SI algorithmic frameworks (and relatives)

■ Stigmergy has led to Ant Algorithms and in particular to Ant Colony Optimization (ACO) [Dorigo & Di Caro, 1999], which is based on the shortest path finding abilities of ant colonies.

■ Also Cultural Algorithms [Reynolds, 1994] are population-based algorithms relying on stigmergy. They are derived from processes of cultural evolution and exchange in societies.

■ Broadcast-like communication related to schooling and flocking behaviors has inspired Particle Swarm Optimization [Kennedy & Eberhart, 2001].

■ Point-to-point communication is used in Hopfield neural networks [Hopfield, 1982], derived from the brain's structure and behavior.

■ Point-to-point and neighbor broadcast is at the basis of Cellular Automata [Wolfram, 1984].

■ Genetic algorithms, artificial immune systems, . . .

Page 455: Non-Parametric Methods in Machine Learning

Frameworks that we will discuss

1. Cellular Automata (CA) (Gossip Algorithms?)

2. Particle Swarm Optimization (PSO)

3. Stigmergy-based algorithms

4. Ant Colony Optimization (ACO)

Page 456: Non-Parametric Methods in Machine Learning

Cellular Automata (CA)

Page 457: Non-Parametric Methods in Machine Learning

CA: general definitions

■ A set of M automata (cells) a_i, i = 1, . . . , M: finite state machines with a specified number of states S = {s_1, s_2, . . . , s_k}

■ The set of cells has an interconnection topology. The neighborhood of a cell is a function N which associates to a cell a_i an ordered set of n neighbors, N(a_i) = {a_i^1, a_i^2, . . . , a_i^n}

■ A local state-transition function F : S^n → S that depends on the current state s_i(t) of the cell and on the states of its n neighbors

■ At discrete time-steps (and either synchronously or asynchronously) each automaton gets the state from its neighbors and changes its state accordingly: s_i(t + 1) = F(s_i(t), s(a_i^1), s(a_i^2), . . . , s(a_i^n))

■ In the simplest cases (which are most amenable to analysis), the topology is a regular 1D or 2D lattice, the units are either boolean, S = {0, 1}, or have very few states, N corresponds to the physically closest cells, and n is between 3 and 8.

A boolean 1D CA is just a string of binary digits, while a 2D one is a grid (matrix) of binary digits.

Page 458: Non-Parametric Methods in Machine Learning

CA: dynamic system view

■ The set of the M cells can be seen at each time-step as an M-dimensional vector a defined in the state domain. The CA evolves as an M-dimensional discrete-time dynamic system: a(t + 1) = F(a(t))

■ Let's consider the simple but enlightening case where: a_i(t + 1) = F(a_{i−1}(t), a_i(t), a_{i+1}(t))

■ Analogous to a time-discrete 1D partial differential equation.

■ Boundary conditions (torus): a_0 = a_M, and a_{M+1} = a_1.

■ “Only” 2^M possible configurations.

Page 459: Non-Parametric Methods in Machine Learning

CA: Given the rule what’s the behavior?

■ Time evolution of the cell vector: attractor points, oscillatory behaviors, emergence of spatial regularities, dependence on initial conditions, dependence on perturbations, . . .

■ Fixed point: a* = F(a*). From a time evolution point of view, a fixed point exists if, given that a(0) = a*, a(t) = a* ∀t

■ A fixed point is asymptotically stable if the above relation holds for all initial conditions in a specified neighborhood of a(0)

■ For two close initial conditions a(0) and a(0) + ε, after k iterations, the configurations F^k(a(0)) and F^k(a(0) + ε) can be different. Lyapunov exponent λ: ε e^{kλ} = |F^k(a(0) + ε) − F^k(a(0))|.

■ For k → ∞, ε → 0, e^{λ(a(0))} = divergence speed between the two initial conditions. λ > 0 implies dependence on initial conditions, λ < 0 implies convergence.

Page 460: Non-Parametric Methods in Machine Learning

CA: Wolfram notation for rules

■ Wolfram code is the name used for the method of enumerating elementary cellular automaton rules used by Stephen Wolfram in his seminal work on CAs.

■ For the simplest case of {0, 1} states and n = 3 the code works as follows. All the possible neighborhood configurations are written as the following list: 111 110 101 100 011 010 001 000. A transition rule associates to each one of these bit triples a boolean value a ∈ {0, 1} that represents the new state. Therefore, a generic rule can be written as:

111  110  101  100  011  010  001  000

a_111  a_110  a_101  a_100  a_011  a_010  a_001  a_000

The bit-array (a_111 a_110 a_101 a_100 a_011 a_010 a_001 a_000) can be read as a decimal number, which names the specific rule.

■ For instance, the bit-array 00011110 identifies rule 30

■ The method can be extended to any state set and neighborhood size n
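As an aside (not part of the original slides), a minimal Python sketch of how a Wolfram rule number can be decoded into its transition table and iterated on a ring of cells; rule 30, shown on the next slide, is used as the example.

import random  # not needed here, shown only for consistency with later sketches

def rule_table(rule_number):
    # Map each 3-bit neighborhood (left, center, right) to the new cell state.
    bits = [(rule_number >> k) & 1 for k in range(8)]   # bit k = new state for neighborhood k
    return {(n >> 2 & 1, n >> 1 & 1, n & 1): bits[n] for n in range(8)}

def step(cells, table):
    # One synchronous update with periodic (torus) boundary conditions.
    m = len(cells)
    return [table[(cells[(i - 1) % m], cells[i], cells[(i + 1) % m])] for i in range(m)]

if __name__ == "__main__":
    table = rule_table(30)            # rule 30 from the next slide
    cells = [0] * 31
    cells[15] = 1                     # single seed cell
    for _ in range(15):
        print("".join("#" if c else "." for c in cells))
        cells = step(cells, table)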

Page 461: Non-Parametric Methods in Machine Learning

CA: Wolfram’s rule 30

Linear CA, {0, 1} states, F(a_{i−1}, a_i, a_{i+1}) is a boolean function of 3 bits (n = 3):

Neighborhood state: 111 110 101 100 011 010 001 000

New cell state: 0 0 0 1 1 1 1 0

Page 462: Non-Parametric Methods in Machine Learning

CA: Wolfram’s rule 184

Linear CA, {0, 1} states, F(a_{i−1}, a_i, a_{i+1}) is a boolean function of 3 bits (n = 3):

Neighborhood state: 111 110 101 100 011 010 001 000

New cell state: 1 0 1 1 1 0 0 0

Page 463: Non-Parametric Methods in Machine Learning

CA: Wolfram’s rule 110

Linear CA, {0, 1} states, F(a_{i−1}, a_i, a_{i+1}) is a boolean function of 3 bits (n = 3):

Neighborhood state: 111 110 101 100 011 010 001 000

New cell state: 0 1 1 0 1 1 1 0

Page 464: Non-Parametric Methods in Machine Learning

CA: Wolfram classification

■ Class 1: After a few steps the system reaches a homogeneous configuration independent of the initial conditions

■ Class 2: After a few steps they show simple spatio-temporal configurations made of separate regions which are either constant or periodic. The general structure of the arising configurations is relatively independent of the initial conditions.

■ Class 3: For certain subsets of initial conditions they show chaotic behavior (no periodic structures). In many cases, for initial configurations with all cells at zero but one set to one, a self-similar behavior arises.

■ Class 4: Strong dependence on initial conditions, highly complex, irregular, and moving structures.

■ All have two Lyapunov exponents measuring the propagation of the information on the initial conditions in both directions. λ = 0 for classes 1 and 2, λ > 0 for class 3, and λ > 0, λ → 0 for class 4.

Page 465: Non-Parametric Methods in Machine Learning

CA: Given the behavior, find the rule!

■ This is called the inverse problem

■ It’s “useful” to let the CA carry out computations we are interested in (e.g., analogous to setting/learning the weights of the connections in a neural network)

■ A widely studied example is the density/majority/parity problem: Find a transition rule that, given an initial state of a CA with an odd number of cells, and a finite number T of max iterations to run, will result in an “all zero” state (a(T) = 0) if a(0) contains a majority of cells at state 0, or in an “all one” state otherwise.

■ The CA becomes a parallel computer that detects the relative density of 0 or 1 symbols in a configuration

■ The simplest rule (switch to the same state of the majority of my neighbors) does not always work! Many different rules have been studied. For 1D automata with 149 cells the best recognition rate for large test sets of random initial configurations is ≈ 83%.

Page 466: Non-Parametric Methods in Machine Learning

CA: Further readings and study

■ Golly, a cross-platform program to directly experiment with the dynamics generated by different rules: http://golly.sourceforge.net/

■ A collection of Java applets to study the behavior of CAs: http://germain.umemat.maine.edu/faculty/hiebeler/java/CA/CellularAutomata.html

■ A Java applet to observe and enjoy the amazing behavior and capabilities of 2D CAs: http://www.mirekw.com/ca/mjcell/mjcell.html

■ Another Java applet showing a very large set of 2D patterns of mathematical, physical, biological, and social interest: http://germain.umemat.maine.edu/faculty/hiebeler/java/CA/CellularAutomata.html

■ A comprehensive (55 pages) and accessible summary on CAs: http://citeseer.ist.psu.edu/delorme98introduction.html

Page 467: Non-Parametric Methods in Machine Learning

Summary and conclusions

■ The basic principles of SI design have been discussed

■ It looks deceptively easy and simple to design SI algorithms. On the other hand, the study of CAs has pointed out how hard it might be to deal with and understand these systems.

■ However, on Wednesday we will see that the basic principles of SI, properly integrated with solid heuristic knowledge, can allow a rather straightforward design of much more complex SI systems capable of solving incredibly complex tasks!

Page 468: Non-Parametric Methods in Machine Learning

Collective and Swarm Intelligence (2)

Gianni Di Caro, [email protected]

IDSIA - USI/SUPSI

Page 469: Non-Parametric Methods in Machine Learning

Road map

■ Generalities on Collective/Swarm Intelligence (SI)

■ Main characteristics of SI design

■ Bottom-up vs. Top-down design approaches

■ Cellular Automata: the simplest and earliest example of SI

■ Particle Swarm Optimization: state-of-the-art SI framework for continuous optimization

■ Short discussion on (these topics will be covered in depth in next year's course on Heuristics):

✦ Ant algorithms for adaptive task allocation and coordination

✦ Ant Colony Optimization: state-of-the-art SI framework for combinatorial optimization and network routing.

Page 470: Non-Parametric Methods in Machine Learning

The principles of SI design. . . (1)

■ A population of units/agents/modules with internal state and autonomous behavior

■ Embedded in some physical or abstract environment, or represented without any reference to an external environment

■ Each unit has non-linear interactions with other units

■ Interactions are localized according to the selected definition of neighborhood and of communication modality and range

■ Interaction protocols and agent information processing are relatively simple

Page 471: Non-Parametric Methods in Machine Learning

The principles of SI design. . . (2)

■ Overall system control is not centralized but fully distributed: it results from agent interactions and communications

■ The goal is to exploit repeated large-scale interactions rather than the “intelligence” of the single modules

■ System-level behaviors transcend the behavioral repertoire of the single agent and are “more” than the sum of their capabilities

■ In other words: we are investigating how to design artificial distributed complex systems that can be used for: modeling, pattern recognition, optimization, control, coordination, . . .

■ Hint: taking inspiration from nature can be effective. . .

Page 472: Non-Parametric Methods in Machine Learning

Let’s restart from CAs. . .

[Figure: a 1D CA with a radius-2 neighborhood; the new state of cell i at time t+1 is F(s_{i−2}, s_{i−1}, s_i, s_{i+1}, s_{i+2}), illustrated on a row of binary cells for t = 0, 1, 2.]

■ For instance, each cell switches to 1 if there are enough cells in state 1 in its neighborhood (a small sketch of this rule follows this slide):

F(s_i) = 1 if s_{i−2} + s_{i−1} + s_i + s_{i+1} + s_{i+2} > 2, and 0 otherwise

■ Fixed topology, broadcast range = 2 in the example
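A minimal Python sketch of this threshold rule, assuming a ring of cells and synchronous updates as in the slide; the initial state used below is an arbitrary illustration.

def threshold_step(cells):
    # A cell switches to 1 when more than 2 of the 5 cells in its radius-2
    # neighborhood (itself included) are in state 1; torus boundary conditions.
    m = len(cells)
    out = []
    for i in range(m):
        total = sum(cells[(i + d) % m] for d in (-2, -1, 0, 1, 2))
        out.append(1 if total > 2 else 0)
    return out

state = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]   # illustrative initial configuration
for t in range(5):
    print(t, state)
    state = threshold_step(state)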

Page 473: Non-Parametric Methods in Machine Learning

Local vs. global communication

■ Each individual has local non-linear interactions with its neighbors . . . but in practice every cell depends on the state of all the other cells → the CA has to be considered as a single complex system

■ Neighbors ≈ individuals that exert an influence, individuals that we trust, . . .

■ “Individuals” can be biological cells, people, molecules or aggregates of molecules, birds, . . . (play with the CA Java applet!)

Page 474: Non-Parametric Methods in Machine Learning

Particle Swarm Optimization (PSO)

Page 475: Non-Parametric Methods in Machine Learning

PSO: natural/social background (1)

■ Early work on the simulation of bird flocking aimed at understanding the underlying rules of bird flocking [Reynolds, 1987] and roosting behavior [Heppner & Grenander, 1990]

■ The notion of change in human social behavior/psychology is seen as analogous to change in spatial position in birds

■ Rules assumed to be simple and based on social behavior: sharing of information and reciprocal respect of the occupancy of physical space

■ Social sharing of information among conspecifics seems to offer an evolutionary advantage

Page 476: Non-Parametric Methods in Machine Learning

PSO: natural/social background (2)

■ Initial simulation work [Eberhart & Kennedy, 1995]

■ A population of N >> 1 agents is initialized on a toroidal 2D pixel grid with random positions and velocities, (x_i, v_i), i = 1, . . . , N

■ At each iteration loop, each agent determines its new speed vector according to that of its nearest neighbors

■ A random component is used in order to avoid fully unanimous, unchanging flocking

■ Roosting behavior: acts like a dynamic force that attracts the swarm to land on a specific location. The roost could be the equivalent of the optimum in a search space!

Page 477: Non-Parametric Methods in Machine Learning

PSO: natural/social background (3)

■ Birds explore the environment in search for food

■ Agents = solution hunters that socially share knowledge while they move across a solution space

■ An agent that has found a “good” point leads its neighbors there

■ . . . and eventually all the agents “flock” toward the best point in the solution space

Page 478: Non-Parametric Methods in Machine Learning

PSO: the particles and the task

■ Optimization of continuous functions f(x) : R^n → R

■ For convex functions gradient methods can be effectively used, but for non-convex ones . . .

■ An agent is an n-dimensional particle moving over the function's domain

■ A particle p has an internal state consisting of {x, v, x_pbest, N(p)} and makes use of a simple rule to update its velocity and position

Page 479: Non-Parametric Methods in Machine Learning

PSO: pseudo-code

procedure Particle_Swarm_Optimization_for_Minimization()
    foreach particle p ∈ ParticleSet do
        (x, v) ← random_init_of_position_and_velocity();
        N(p) ← random_selection_of_the_neighbor_set();
        x_pbest ← x;  f(x_gbest) ← ∞;
    end foreach
    while (¬ stopping criterion)
        foreach particle p ∈ ParticleSet do
            x_lbest ← get_best_so_far_position_from_neighbors(N(p));
            Δ_individual ← x_pbest − x;
            Δ_social ← x_lbest − x;
            (r1, r2) ← random_uniform();
            v ← ω v + w1 r1 ∘ Δ_individual + w2 r2 ∘ Δ_social;
            x ← x + v;
            if f(x) < f(x_pbest) then x_pbest ← x;
            if f(x) < f(x_gbest) then x_gbest ← x;
        end foreach
    end while
    return f(x_gbest);
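For illustration only, a compact Python version of the gbest variant of this loop; the inertia weight, acceleration coefficients, velocity clamp, and the sphere objective are illustrative assumptions, not values taken from the slides.

import random

def pso(f, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5,
        lo=-5.0, hi=5.0, vmax=1.0):
    # Random initialization of positions, velocities, and personal bests.
    X = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    V = [[random.uniform(-vmax, vmax) for _ in range(dim)] for _ in range(n_particles)]
    P = [x[:] for x in X]
    Pf = [f(x) for x in X]
    g = min(range(n_particles), key=lambda i: Pf[i])
    gbest, gbest_f = P[g][:], Pf[g]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # Velocity update: inertia + cognitive pull + social pull.
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (P[i][d] - X[i][d])
                           + c2 * r2 * (gbest[d] - X[i][d]))
                V[i][d] = max(-vmax, min(vmax, V[i][d]))   # clamp velocity
                X[i][d] += V[i][d]
            fx = f(X[i])
            if fx < Pf[i]:                # update personal best
                P[i], Pf[i] = X[i][:], fx
                if fx < gbest_f:          # update global best
                    gbest, gbest_f = X[i][:], fx
    return gbest, gbest_f

if __name__ == "__main__":
    sphere = lambda x: sum(v * v for v in x)
    best, best_f = pso(sphere, dim=5)
    print(best_f)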

Page 480: Non-Parametric Methods in Machine Learning

PSO vs. Gradient-based search

■ Each local minimum x_min of a continuous function can be associated with an attraction basin A(x_min). Starting from any point located in the attraction basin, gradient-based (local search) methods can provide some guarantees to reach x_min.

■ In the general case, search must be iterated over different basins to increase the probability of hitting the global minimum

■ PSO (and, in general, SI-based methods) is based on global exploration and search. A number of particles concurrently move and jump across the landscape and probe different regions.

■ At the beginning, search mainly relies on exploration.

■ Search can be locally intensified in a region by concentrating more particles there.

■ Depending on the design, all the particles can eventually converge on searching around the best region found, or a certain level of global exploration can be maintained in the swarm.

Page 481: Non-Parametric Methods in Machine Learning

PSO: different alternatives

■ Velocity update is the core formula, which we can rewrite as:

v ← ω v + φ1 U(0, 1) ∘ Δ_individual + φ2 U(0, 1) ∘ Δ_social

■ ω is an inertia coefficient, while φ1 and φ2 are the “cognitive” and “social” acceleration coefficients

■ The canonical PSO makes use of a constriction factor χ:

v ← χ (v + φ1 U(0, 1) ∘ Δ_individual + φ2 U(0, 1) ∘ Δ_social),

χ = 2k / |2 − φ − √(φ² − 4φ)|,  where φ = φ1 + φ2

■ The value of v is usually clamped to [−v_max, v_max]

■ Two main classes based on the use of lbest vs. gbest: multiple vs. one attractor, exploration vs. exploitation

■ A number of different tricks (some based on solid theory). . .
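A small sketch of the constriction-coefficient computation above; the usual choice φ1 = φ2 = 2.05 (so that φ = φ1 + φ2 > 4) and k = 1 is an assumption taken from common practice, not from the slides.

import math
import random

def constriction(phi1=2.05, phi2=2.05, k=1.0):
    # chi = 2k / |2 - phi - sqrt(phi^2 - 4 phi)|, valid for phi = phi1 + phi2 > 4.
    phi = phi1 + phi2
    return 2.0 * k / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))

def update_velocity(v, x, pbest, lbest, phi1=2.05, phi2=2.05):
    # Canonical (constriction) velocity update, one random factor per dimension.
    chi = constriction(phi1, phi2)
    return [chi * (v[d]
                   + phi1 * random.random() * (pbest[d] - x[d])
                   + phi2 * random.random() * (lbest[d] - x[d]))
            for d in range(len(v))]

print(constriction())   # approximately 0.7298, the commonly reported value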

Page 482: Non-Parametric Methods in Machine Learning

PSO: 1D example, one particle, gbest

t    x      xbest
0    20.00  10.00
1    18.21  18.21
2    16.43  16.43
3    14.64  16.43
4    13.24  16.43
5    12.03  16.43
6    11.06  11.06
7    10.09  11.06
8    9.71   11.06
9    8.85   11.06
10   9.14   11.06
11   10.13  11.06

[Figure: plot of the particle position X and best position B over time; the remaining plotted values and axis ticks from the figure are omitted.]

Page 483: Non-Parametric Methods in Machine Learning

PSO: dynamics of a single particle

■ Individual trajectories are “weak”. The figure shows the trajectory of a single particle in the swarm and the effect on it of the communication of a new best from its social network

■ Optimization is a function of interparticle interactions

Page 484: Non-Parametric Methods in Machine Learning

PSO: distribution of the sampled points

■ The points sampled by a swarm particle are distributed as a symmetric bell-shaped curve with the mean equal to the average of the previous bests and standard deviation equal to their difference

Page 485: Non-Parametric Methods in Machine Learning

PSO at work

■ Java applets to visualize and play with PSO:

http://gecco.org.chemie.uni-frankfurt.de/PsoVis/applet.html

http://www.projectcomputing.com/resources/psovis/

■ The main Internet repository for material related to PSO:

http://www.swarmintelligence.org/

Page 486: Non-Parametric Methods in Machine Learning

PSO: applications and performance

■ A number of optimization problems: benchmark test functions, real-world functions, neural network learning, shortest paths in networks, . . .

■ Comparisons with other heuristics such as Genetic Algorithms usually show that PSO has comparable or better effectiveness and better computational performance

■ Has some convergence properties

■ Still lacking a comprehensive comparison with more “standard” approaches

Page 487: Non-Parametric Methods in Machine Learning

PSO performance: popular test functions

[Figure: surface plots of the Sphere, Rastrigin, Rosenbrock, and Griewank test functions]
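For reference, commonly used forms of these four benchmark functions (conventions vary slightly across papers); all have a global minimum value of 0.

import math

def sphere(x):
    return sum(v * v for v in x)

def rastrigin(x):
    return 10 * len(x) + sum(v * v - 10 * math.cos(2 * math.pi * v) for v in x)

def rosenbrock(x):
    return sum(100 * (x[i + 1] - x[i] ** 2) ** 2 + (1 - x[i]) ** 2
               for i in range(len(x) - 1))

def griewank(x):
    s = sum(v * v for v in x) / 4000.0
    p = math.prod(math.cos(v / math.sqrt(i + 1)) for i, v in enumerate(x))  # Python 3.8+
    return s - p + 1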

Page 488: Non-Parametric Methods in Machine Learning

PSO performance: lbest vs. gbest

Page 489: Non-Parametric Methods in Machine Learning

PSO: performance variability for different variants

Page 490: Non-Parametric Methods in Machine Learning

Stigmergy and Ant-inspired algorithms

Note: The lecture given in the classroom only addressed the general aspects of the Ant Colony Optimization framework. The slides that follow contain some additional information which is only provided for the sake of completeness.

Page 491: Non-Parametric Methods in Machine Learning

Stigmergy and Ant-inspired algorithms

■ Stigmergy is at the core of most of the amazing collective behaviors exhibited by ant/termite colonies (nest building, division of labor, structure formation, cooperative transport)

■ Grassé (1959) introduced this term to explain nest building in termite societies

■ Goss, Aron, Deneubourg, and Pasteels (1989) showed how stigmergy allows ant colonies to find shortest paths between their nest and sources of food

■ These mechanisms have been reverse engineered to give rise to a multitude of ant colony inspired algorithms based on stigmergic communication and control

Page 492: Non-Parametric Methods in Machine Learning

Pheromone laying-attraction is the key

■ While walking, the ants lay on the ground a volatile chemical substance, called pheromone

■ Pheromone distribution modifies the environment (the way it is perceived), creating a sort of attractive potential field for the ants

■ This is useful for retracing the way back, for mass recruitment, for labor division and coordination, to find shortest paths. . .

Page 493: Non-Parametric Methods in Machine Learning

Stigmergy and stigmergic variables

■ Stigmergy means any form of indirect communication among a set of possibly concurrent and distributed agents which happens through acts of local modification of the environment and local sensing of the outcomes of these modifications

■ The local environment's variables whose values determine in turn the characteristics of the agents' response are called stigmergic variables

■ Stigmergic communication and the presence of stigmergic variables can give rise to self-organized global behaviors

■ Blackboard/post-it style of asynchronous communication

Page 494: Non-Parametric Methods in Machine Learning

Examples of stigmergic variables (1)

✢ Leading to diverging behavior at the group level:

■ The height of a pile of dirty dishes floating in the sink

■ Nests’ energy level in robot activation [Krieger and Billeter, 1998]

■ Level of customer demand in adaptive task allocation of pick-up postmen [Bonabeau et al., 1997]. A distributed set of postmen have to decide whether or not to serve the pick-up request from a customer. The “stimulus” intensity s depends on the distance from the customer and the amount of items already in charge:

T_θ(s) = s^n / (s^n + θ^n)

where θ is the response threshold of the postman, and T_θ(s) is the probability that he/she will pick up the item (a small sketch of this rule follows this list).

■ Diverging stigmergy is a general model to design adaptive task allocation strategies in distributed systems
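A tiny sketch of the response-threshold rule above; the exponent n = 2 and the θ^n term follow the usual Bonabeau et al. formulation and are assumptions here, not values given on the slide.

import random

def pickup_probability(stimulus, theta, n=2):
    # T_theta(s) = s^n / (s^n + theta^n): low thresholds respond to weak stimuli.
    return stimulus ** n / (stimulus ** n + theta ** n)

def decide_pickup(stimulus, theta):
    # Stochastic decision: pick up the item with probability T_theta(s).
    return random.random() < pickup_probability(stimulus, theta)

print(pickup_probability(stimulus=2.0, theta=1.0))   # 0.8: low-threshold postman
print(pickup_probability(stimulus=2.0, theta=4.0))   # 0.2: high-threshold postman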

Page 495: Non-Parametric Methods in Machine Learning

Examples of stigmergic variables (2)

✢ Leading to converging behavior at the group level:

■ Intensity of pheromone trails in ant foraging: pheromone is deposited at higher rates on shortest paths connecting nest and sources of food. The presence of higher intensity of pheromone attracts subsequent ants, eventually converging the majority of the ants in the colony on moving along the shortest path.

■ This pheromone-mediated shortest-path behavior, properly reverse-engineered, has provided the core inspiration for the design of the Ant Colony Optimization metaheuristic.

Page 496: Non-Parametric Methods in Machine Learning

Shortest path behavior in ant colonies

■ While walking, at each step a routing decision is issued. Directions locally marked by higher pheromone intensity are preferred according to some probabilistic rule:

[Figure: the stochastic decision rule π(τ, η) combines the pheromone τ, the terrain morphology η, and possibly other local information.]

■ This basic pheromone laying-following behavior is the main ingredient to allow the colony to converge on the shortest path between the nest and a source of food

Page 497: Non-Parametric Methods in Machine Learning

Ant colonies: Pheromone and shortest paths

[Figure: four snapshots (t = 0, 1, 2, 3) of pheromone intensity on the paths between Nest and Food, shown with a pheromone intensity scale.]

Page 498: Non-Parametric Methods in Machine Learning

Ant colonies in a more complex discrete world

[Figure: a discrete world with a Nest, a Food source, multiple decision nodes, and a pheromone intensity scale.]

✢ Multiple decision nodes

✢ A path is constructed through a sequence of decisions

✢ Decisions must be taken on the basis of local information only

✢ A traveling cost is associated to node transitions

✢ Pheromone intensity locally encodes decision goodness as collectively estimated by the repeated path sampling

Page 499: Non-Parametric Methods in Machine Learning

Ant colonies: Ingredients for shortest paths

■ A number of concurrent autonomous (simple?) agents (ants)

■ Forward-backward path following/sampling

■ Multiple paths are tried out and implicitly evaluated

■ Local/stigmergic laying and sensing of pheromone

■ Stochastic step-by-step decisions biased by pheromone and other local aspects (e.g., terrain, visibility)

■ Positive feedback effect (local reinforcement of good decisions)

■ Persistence (exploitation) and evaporation (exploration) of the pheromone field

■ Iteration over time of the path sampling actions

■ . . . always convergence onto the shortest path?

Page 500: Non-Parametric Methods in Machine Learning

What pheromone represents in abstract terms?

■ Pheromone trails act as a sort of distributed, dynamic, and collective memory of the colony, a repository of all the most recent foraging experiences of the ants in the colony

■ Pheromone trails encode the value of goodness of a local move as collectively learned from the generated paths (solutions)

■ By locally laying pheromone, the ants modify the environment that they have visited, and in turn take a decision biased by the presence/strength of these modifications (stigmergy)

■ Circular relationship: pheromone trails modify the environment → locally bias the ants' decisions → modify the environment

[Figure: circular relationship between paths and pheromone: the outcomes of path construction (π) are used to modify the pheromone distribution (τ), and the pheromone distribution in turn biases path construction.]

Page 501: Non-Parametric Methods in Machine Learning

A meta-strategy for shortest path problems

■ By reverse engineering ant colonies' shortest path behavior we get an effective metaheuristic, ACO, to solve shortest path problems

■ . . . in a possibly fully distributed and adaptive way

■ . . . and shortest path models are very general models for combinatorial optimization and decision problems!

■ Note that in PSO and in CA it is the state of the agents that evolves over time; in ACO and, more in general, in stigmergy-based systems, it is the local state of the environment that evolves

Page 502: Non-Parametric Methods in Machine Learning

ACO: general architecture

procedure ACO_metaheuristic()
    while (¬ stopping criterion)
        schedule_activities
            ant_agents_construct_solutions_using_pheromone();
            pheromone_updating();
            daemon_actions();  (optional)
        end schedule_activities
    end while
    return best solution generated;

Page 503: Non-Parametric Methods in Machine Learning

ACO: A solution construction approach

■ A solution to the combinatorial problem at hand is constructed by starting from an empty solution and adding a new solution component at each construction step

■ A set of decision nodes is defined: at each node a decision is issued concerning the new component to add to the solution being constructed

■ A decision node encodes the information which is used about the partial solution (the past decisions/experience) to take a feasible and optimized decision concerning the new component to add

■ Example: Car traffic. Each crossroad is a decision node, a “solution” is a path from origin to destination, a path is a sequence of decisions. The case of packet routing in networks is analogous.

Page 504: Non-Parametric Methods in Machine Learning

ACO: Pheromone and heuristic variables

[Figure: a graph from Source to Destination with decision nodes 1-9; each node i stores pheromone values τ_ij and heuristic values η_ij for its outgoing arcs, shown with a pheromone intensity scale.]

■ Each decision node i holds an array of pheromone variables: τ_i = [τ_ij] ∈ R, ∀ j ∈ N(i) → learned through path sampling

■ τ_ij = q(j|i): learning estimate of the quality/goodness/utility of moving to next node j conditionally on being in i

■ Each decision node i also holds an array of heuristic variables: η_i = [η_ij] ∈ R, ∀ j ∈ N(i) → resulting from other sources/measures

■ η_ij is also an estimate of q(j|i), but it comes from a process or a priori knowledge not related to the ant actions (e.g., node-to-node distance)

Page 505: Non-Parametric Methods in Machine Learning

ACO: Path sampling by ant agents

[Figure: an ant constructs a path from Source to Destination across the decision nodes 1-9.]

✢ Each ant is an autonomous agent that proposes a solution to the problem by constructing a path P_{s→d}

✢ There might be one or more ants concurrently active at the same time. Ants do not need synchronization

✢ A stochastic decision policy selects node transitions (an illustrative form of this rule is sketched below):

π_ε(i, j; τ_i, η_i)
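As an illustration of π_ε, a sketch of the classic random-proportional rule p(j|i) ∝ τ_ij^α · η_ij^β over the feasible neighbors; the exponents α and β are assumed design parameters, and the dictionaries keyed by (i, j) arcs are an assumed data layout, not taken from the slides.

import random

def choose_next(i, feasible, tau, eta, alpha=1.0, beta=2.0):
    # tau, eta: dicts keyed by (i, j) arcs; select j with probability
    # proportional to tau[(i, j)]^alpha * eta[(i, j)]^beta.
    weights = [(j, (tau[(i, j)] ** alpha) * (eta[(i, j)] ** beta)) for j in feasible]
    total = sum(w for _, w in weights)
    r = random.uniform(0.0, total)
    acc = 0.0
    for j, w in weights:
        acc += w
        if r <= acc:
            return j
    return weights[-1][0]   # guard against floating-point round-off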

Page 506: Non-Parametric Methods in Machine Learning

Pheromone updating

■ Ants update pheromone online step-by-step → implicit path evaluation based on traveling time and rate of updates

■ The ants' way is inefficient and risky.

■ A better way is to update step-by-step but offline:

✦ Complete the path

✦ Evaluate and select whether to update pheromone or not

✦ “Retrace” the path and assign credit, i.e., reinforce the goodness value of the issued decisions (pheromone variables)

✦ Total path cost can be used as the reinforcement signal

■ An evaporation mechanism can be applied to decrease pheromone intensity and favor exploration: τ_ij ← ρ τ_ij, ρ ∈ [0, 1].
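A sketch of this offline update in Python: evaporate all pheromone variables, then retrace an evaluated path and reinforce its arcs. Depositing an amount proportional to 1/path_cost is a common choice and an assumption here, as is the dictionary keyed by (i, j) arcs.

def update_pheromone(tau, path, path_cost, rho=0.9):
    # Evaporation: tau_ij <- rho * tau_ij on every arc (favors exploration).
    for arc in tau:
        tau[arc] *= rho
    # Retrace the path and assign credit to every decision on it.
    deposit = 1.0 / path_cost
    for i, j in zip(path[:-1], path[1:]):
        tau[(i, j)] += deposit
    return tau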

Page 507: Non-Parametric Methods in Machine Learning

Designing an ACO algorithm

■ Representation of the problem → pheromone model τ

■ Heuristic variables ~η

■ Ant-routing table A

■ Stochastic decision policy πǫ

■ Solution evaluation J(s)

■ Policies for pheromone updating

■ Scheduling of the ants

■ Daemon components

■ Pheromone initialization, constants, . . .

Page 508: Non-Parametric Methods in Machine Learning

Applications and performance

■ Traveling salesman: state-of-the-art / good performance
■ Quadratic assignment: good / state-of-the-art
■ Scheduling: state-of-the-art / good performance
■ Vehicle routing: state-of-the-art / good performance
■ Sequential ordering: state-of-the-art performance
■ Shortest common supersequence: good results
■ Graph coloring and frequency assignment: good results
■ Bin packing: state-of-the-art performance
■ Constraint satisfaction: good performance
■ Multi-knapsack: poor performance
■ Timetabling: good performance
■ Optical network routing: promising performance
■ Set covering and partitioning: good performance
■ Parallel implementations and models: good parallelization efficiency

■ Routing in telecommunications networks: state-of-the-art performance

Page 509: Non-Parametric Methods in Machine Learning

Evolutionary Computation, Part I

Faustino Gomez, IDSIA

Intelligent Systems, Fall 2007

Page 510: Non-Parametric Methods in Machine Learning

Ontogenetic vs. Phylogenetic Learning

• Single-agent learning
  – Agent modifies its structure (parameters) to adapt while it interacts with the environment

• Multi-agent learning
  – Multiple agents modify their structure to adapt while interacting with the environment and each other
  – Potentially solve the task together

• Evolutionary Computation
  – Population of candidate solutions (e.g. agents) learn collectively by natural selection, but not individually (usually)

Ontogenetic: single-agent and multi-agent learning. Phylogenetic: evolutionary computation.

Page 511: Non-Parametric Methods in Machine Learning

Evolutionary Computation (EC)

Basic idea:

• Instead of a single solution, use a population of candidate solutions

• Search the space of solutions in parallel

• Evaluate candidates and assign a fitness score

• Generate a new population from the most “fit” candidates that is hopefully better than the previous population

Page 512: Non-Parametric Methods in Machine Learning

The Evolutionary Cycle

[Figure: the evolutionary cycle: Population → Selection → Parents → Recombination → Mutation → Offspring → Evaluation → Replacement → back into the Population]

Page 513: Non-Parametric Methods in Machine Learning

Branches of EC

• Genetic Algorithms (Holland 1975)
• Evolution Strategies (Schwefel 1977)
• Evolutionary Programming (Fogel 1966)
• Genetic Programming (Koza 1989)

Page 514: Non-Parametric Methods in Machine Learning

Genetic Algorithms (GAs)

• Inspired by the Darwinian principle of natural selection

• Different from other EC methods primarily due to the emphasis on sexual reproduction (crossover)

• GAs search the problem space by trying to correctly combine genetic building blocks from different individuals in the population

Page 515: Non-Parametric Methods in Machine Learning

GA: Basic Procedure

1. Initialize a random population of candidate solutions
2. Evaluate solutions on the problem and assign a fitness score
3. Select some solutions for mating
4. Recombine: create new solutions from the selected ones by exchanging structure
5. IF good solution not found: Goto 2

The cycle from 2 to 5 is known as a generation
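A compact Python sketch of steps 1-5; the one-max fitness, the truncation-style mating selection, and the rates are illustrative assumptions (the selection schemes discussed in the next slides can be plugged in instead).

import random

def genetic_algorithm(fitness, n_genes, pop_size=50, generations=100,
                      crossover_rate=0.9, mutation_rate=0.01):
    # 1. Initialize a random binary population.
    pop = [[random.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):                    # one loop = one generation
        # 2-3. Evaluate and select mates (truncation selection, for brevity).
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]
        # 4. Recombine (1-point crossover) and mutate (bit-flip).
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            if random.random() < crossover_rate:
                cut = random.randint(1, n_genes - 1)
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            child = [1 - g if random.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness)

best = genetic_algorithm(fitness=sum, n_genes=20)   # one-max: maximize the number of 1s
print(best, sum(best))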

Page 516: Non-Parametric Methods in Machine Learning

Fitness Landscape

Search in parallel to maximize fitness

Page 517: Non-Parametric Methods in Machine Learning

GA Terminology

• Solutions are encoded in strings called chromosomes

• Each chromosome consists of some number of genes

• Each gene can take a value, or allele, from some specified alphabet, e.g.
  – Binary {0,1}
  – Real numbers (infinite alphabet)

Page 518: Non-Parametric Methods in Machine Learning

Genotype Encoding

• Binary encoding

• Real values

[Figure: a chromosome drawn as a string of genes, with binary and real-valued examples]

Page 519: Non-Parametric Methods in Machine Learning

Mapping Genotypes to Phenotypes

• Genotypes can represent any kind of structure or phenotype in the problem space (environment)

• Before evaluating a genotype it must be mapped into its phenotype

• Once the phenotype is created, it can be evaluated in the environment

Page 520: Non-Parametric Methods in Machine Learning

GA: Evaluation

Map the genotype to its phenotype and evaluate the phenotype in the environment to assign a fitness

Page 521: Non-Parametric Methods in Machine Learning

Selection: fitness proportional

1. Calculate a genotype's probability of being selected in proportion to its fitness:

p_i = f_i / Σ_j f_j

2. Then select some number of genotypes for mating according to the probabilities

Genotypes that are more fit are more likely to be selected
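A small roulette-wheel sketch of this fitness-proportional rule (fitness values are assumed to be positive).

import random

def fitness_proportional_select(population, fitnesses, n_mates):
    # p_i = f_i / sum_j f_j, then sample the mating pool from these probabilities.
    total = sum(fitnesses)
    probs = [f / total for f in fitnesses]
    mates = []
    for _ in range(n_mates):
        r, acc = random.random(), 0.0
        for individual, p in zip(population, probs):
            acc += p
            if r <= acc:
                mates.append(individual)
                break
        else:
            mates.append(population[-1])   # guard against floating-point round-off
    return mates

pool = fitness_proportional_select(["A", "B", "C"], [1.0, 3.0, 6.0], n_mates=4)
print(pool)   # "C" is selected most often, "A" least often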

Page 522: Non-Parametric Methods in Machine Learning

Selection: linear ranking

1. Sort the genotypes by fitness
2. Compute the probability of being selected by

p_i = 2 − SP + 2 (SP − 1) (rank(i) − 1) / (N − 1)

3. Then select some number of genotypes for mating according to the probabilities

where SP is the selective pressure in [1.0, 2.0], and rank denotes the genotype's position in the sorted population; rank(1) is most fit, rank(N) is least fit

Page 523: Non-Parametric Methods in Machine Learning

Selection: Tournament

1. Let T be the tournament size (between 2 and the size of the population, N)
2. Select T genotypes at random from the population and take the most fit as the tournament “winner”
3. Put the winner in the mating pool
4. Goto 1 until we have enough genotypes in the mating pool

Larger values of T increase selective pressure
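A short sketch of tournament selection as described above; T = 3 and the one-max fitness are illustrative choices.

import random

def tournament_select(population, fitness, pool_size, T=3):
    # Repeatedly sample T genotypes at random and keep the fittest one.
    pool = []
    while len(pool) < pool_size:
        contestants = random.sample(population, T)
        pool.append(max(contestants, key=fitness))
    return pool

pop = [[random.randint(0, 1) for _ in range(10)] for _ in range(20)]
mating_pool = tournament_select(pop, fitness=sum, pool_size=10, T=3)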

Page 524: Non-Parametric Methods in Machine Learning

Selective Pressure and Diversity

• Similar to the exploitation/exploration tradeoff

• We want selective pressure to be high enough to direct search towards a good solution,

• but not so high that we converge too soon

• Diversity should be high so that we are less likely to “miss” a good solution,

• but not so high that we don't converge (convergence allows search to concentrate near a good solution)

Page 525: Non-Parametric Methods in Machine Learning

Reproduction

• Generate new individuals (search points) by mixing or altering the genotypes of the selected members of the population

• Crossover: select alleles from two parent chromosomes to form two children (recombination-type genetic operator)

• Mutation: randomly perturb some of the alleles of a parent

Page 526: Non-Parametric Methods in Machine Learning

Genetic Operators: 1-point crossover

Select a random crossover point and exchange substrings

Page 527: Non-Parametric Methods in Machine Learning

Genetic Operators: 2-point crossover

Select 2 random crossover points and exchange substrings

Page 528: Non-Parametric Methods in Machine Learning

Mutation: binary

[Figure: parent and child bit strings]

Randomly flip a bit with probability equal to the mutation rate

Page 529: Non-Parametric Methods in Machine Learning

Mutation: real coding

[Figure: parent and child real-valued chromosomes]

• Replace an allele with a new real number with probability equal to the mutation rate
• or, add noise to the allele, e.g. from a Gaussian distribution
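Possible implementations of the crossover and mutation operators from the last slides; the rates and the Gaussian standard deviation are illustrative assumptions.

import random

def one_point_crossover(a, b):
    # Exchange the tails of the two parents after a random cut point.
    cut = random.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def two_point_crossover(a, b):
    # Exchange the middle segments between two random cut points.
    i, j = sorted(random.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def bitflip_mutation(chromosome, rate=0.01):
    # Binary coding: flip each bit with probability equal to the mutation rate.
    return [1 - g if random.random() < rate else g for g in chromosome]

def gaussian_mutation(chromosome, rate=0.1, sigma=0.1):
    # Real coding: add Gaussian noise to each allele with the given probability.
    return [g + random.gauss(0.0, sigma) if random.random() < rate else g
            for g in chromosome]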

Page 530: Non-Parametric Methods in Machine Learning

GA: Behavior

Initial population: candidate solutions are distributed throughout the search space

[Figure: population plotted over the genotype/phenotype landscape]

Page 531: Non-Parametric Methods in Machine Learning


After some number of generations…

Page 532: Non-Parametric Methods in Machine Learning


After some more generations…

Page 533: Non-Parametric Methods in Machine Learning


Some more generations…

Page 534: Non-Parametric Methods in Machine Learning

Convergence


All individuals look almost the same after many generations of recombination (crossover)

Page 535: Non-Parametric Methods in Machine Learning

Premature Convergence

Population converges too quickly and misses global peak


Page 536: Non-Parametric Methods in Machine Learning

How do we avoid premature convergence?

• Use a larger population
  – More genetic material
  – Takes longer to converge, but requires more evaluations

• Reduce selective pressure
  – Makes the search less greedy
  – Could take much longer

• Increase mutation
  – Adds diversity but could disrupt building blocks

• Niching
  – Force genotypes to stay spread out into niches by penalizing fitness when bunched together

Page 537: Non-Parametric Methods in Machine Learning

Advantages of GAs

• Less sensitive to local minima than non-population based methods

• Little domain knowledge required

• Can cope with high dimensionality

• Can be implemented in parallel because individual evaluations are independent

Page 538: Non-Parametric Methods in Machine Learning

Example: Function optimisation

- very frequent problem in many domains

- more examples in the tutorial

Page 539: Non-Parametric Methods in Machine Learning

Solution

Page 540: Non-Parametric Methods in Machine Learning

Evolution Strategies (ES)

• Genes contain real values

• Mutation by adding values drawn from a Gaussian distribution

• Strategy parameters evolve

• Originally population size = 1

• Possible crossover: average of the genes of both parents

Page 541: Non-Parametric Methods in Machine Learning

ES vs. GA

                     GA           ES
Representation       binary       real
Mutation             Bit-flip     Gaussian
Parent selection     various      uniform random
What is evolved      parameters   parameters and mutation step sizes

Note: these are historical differences, not all true today!

Page 542: Non-Parametric Methods in Machine Learning

ES: basic procedure

1. Initialize parents and evaluate them
2. Create some offspring by perturbing parents with Gaussian noise according to the parent's mutation parameters
3. Evaluate offspring
4. Select new parents from the offspring and possibly the old parents
5. IF good solution not found Goto 2

Page 543: Non-Parametric Methods in Machine Learning

ES: Genotype Encoding

• Chromosomes contain not only problem parameters (as in GAs) but also strategy parameters

• Strategy parameters determine the amount of mutation (standard deviation of the Gaussian) that is applied to the corresponding problem parameter

(x_1, . . . , x_n, σ_1, . . . , σ_n)   where x_1, . . . , x_n are the problem parameters and σ_1, . . . , σ_n the strategy parameters

Page 544: Non-Parametric Methods in Machine Learning

(μ, λ) / (μ + λ) Notation

• μ is the number of parents
• λ is the number of offspring
• “,”: select new parents for the next generation from the offspring only
• “+”: select them from the old parents and the offspring

(μ, λ) / (μ + λ) Notation examples

• (1+1)-ES: 1 parent and 1 offspring, the new parent is selected from the old parent and the offspring

• (1, λ)-ES: 1 parent and λ offspring, the new parent is selected only from the offspring (the best offspring replaces the parent)

Page 546: Non-Parametric Methods in Machine Learning

(μ, λ) / (μ + λ) Algorithm

1. Generate λ offspring by selecting at random λ chromosomes from the μ parents and mutating them according to their strategy parameters

2. If “+”, select the μ new parents for the next generation by taking the most fit chromosomes from both the old parents and the offspring
   OR, if “,”, replace the old parents with the μ most fit chromosomes from the offspring only

3. IF solution not found Goto 1
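A sketch of this (μ +, λ) scheme with self-adapted per-parameter step sizes; the sphere objective, the initialization range, and the log-normal self-adaptation constant are illustrative assumptions, not values from the slides.

import math
import random

def evolution_strategy(f, dim, mu=5, lam=20, generations=200, plus=True):
    tau = 1.0 / math.sqrt(2.0 * dim)   # self-adaptation learning rate (assumption)
    # Each individual is (problem parameters, strategy parameters).
    parents = [([random.uniform(-5.0, 5.0) for _ in range(dim)], [1.0] * dim)
               for _ in range(mu)]
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            x, sigmas = random.choice(parents)               # step 1: pick a random parent
            new_sigmas = [s * math.exp(random.gauss(0.0, tau)) for s in sigmas]
            new_x = [xi + random.gauss(0.0, s) for xi, s in zip(x, new_sigmas)]
            offspring.append((new_x, new_sigmas))
        pool = offspring + parents if plus else offspring    # step 2: "+" vs "," selection
        pool.sort(key=lambda ind: f(ind[0]))                 # minimization
        parents = pool[:mu]
    return parents[0]

best_x, best_sigmas = evolution_strategy(lambda x: sum(v * v for v in x), dim=5)
print(sum(v * v for v in best_x))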

Page 547: Non-Parametric Methods in Machine Learning

Historical example: the jet nozzle experiment

Initial shape

Final shape

Task: to optimize the shape of a jet nozzle
Approach: random mutations to the shape + selection

H.-P. Schwefel

Used (1+1)-ES