non-parametric methods in machine learning


Page 1: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Non-parametric methods in machine learning

Daniil Ryabko

IDSIA/USI/SUPSI

Page 2: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Machine learning and artificial intelligence methods are based on a variety of mathematical disciplines.

One of them is probability theory. Some concepts of probability theory will be used today.

Page 4: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Non-parametric methods are about making inference and prediction while making as few assumptions as possible about the data. Consider an example.

I ask 20 friends when their cars were made. Then I am going to ask a 21st friend the same question. What is the probability that his car was made before 2000? Before 1990? Before 2006?

The problem is difficult to model beforehand. What is the underlying probability distribution? I have no idea.

Page 5: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

The data can look as follows.

Page 7: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

We are given a sample X1, . . . , Xn of independent identically distributed (i.i.d.) random variables generated according to a probability distribution P.

We are about to observe a new data point (random variable) X. For any given t we are interested in the probability P(X ≤ t).

In other words, we are interested in the function F(t) = P(X ≤ t). This function is called the distribution function.

Page 8: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Empirical Distribution Function

We will estimate the probabilities P(X ≤ t) by counting the fraction of the Xi that fall within the interval (−∞, t].

Let

Fn(t) = (1/n) ∑_{i=1}^n I_{(−∞,t]}(Xi),

where I_A is the indicator function of the set A: I_A(t) = 1 if t is in A, and 0 otherwise.

Page 9: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

So if t is 1998 then Fn(t) is 1/2.
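A minimal sketch of this estimate in Python. The sample below is made up for illustration (it is not the data from the figure), chosen so that Fn(1998) = 1/2 as in the example:

```python
def ecdf(sample, t):
    """Empirical distribution function: fraction of sample points <= t."""
    return sum(1 for x in sample if x <= t) / len(sample)

# Hypothetical years of manufacture for 20 cars (illustration only).
years = [1987, 1990, 1991, 1993, 1994, 1995, 1996, 1997, 1998, 1998,
         1999, 2000, 2001, 2002, 2003, 2004, 2004, 2005, 2005, 2006]

print(ecdf(years, 1998))   # 0.5, our estimate of P(X <= 1998)
print(ecdf(years, 1990))   # 0.1
print(ecdf(years, 2006))   # 1.0
```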

Page 10: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Thus we will use the empirical distribution function Fn(t) as an estimate of the real (unknown) distribution function F(t). Its plot could look like this:

Page 11: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Is Fn(t) a good estimate of F(t)? Well, it is quite good.

Proposition

For any t, Fn(t) converges to F(t) as the sample size n goes to infinity.

This follows from the law of large numbers. The (strong) law of large numbers says that if X1, . . . , Xn are independent identically distributed (i.i.d.) random variables with expectation E(X1), then

(1/n) ∑_{i=1}^n Xi → E(X1)

(almost surely). In our case, for any t, consider the random variables I_{(−∞,t]}(Xi). They are independent and identically distributed, so we have

Fn(t) = (1/n) ∑_{i=1}^n I_{(−∞,t]}(Xi) → E(I_{(−∞,t]}(X1)) = P(X ≤ t).

Page 12: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

So we know that as the sample size n grows to infinity, the estimate Fn(t) converges to the true (unknown) value F(t). This is an asymptotic result. Can we say something about finite sample sizes? We can.

Proposition

For any t, P(|Fn(t) − F(t)| > ε) ≤ 2e^{−nε²}.

Again, this fact simply follows from a more general inequality for i.i.d. random variables.

Page 13: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Thus, we take any year t1, estimate Fn(t1), and we know that with probability at least 1 − 2e^{−nε²} it differs from F(t1) by no more than ε. Then we take t2, t3, make new estimates, and get the same bound for each. But if for every single t the probability that |Fn(t) − F(t)| is greater than ε is less than δ = 2e^{−nε²}, it does not follow that the probability that for some t |Fn(t) − F(t)| is greater than ε is less than δ. In symbols, we have

sup_{t∈R} P(|Fn(t) − F(t)| > ε) ≤ δ,

but can we have

P(sup_{t∈R} |Fn(t) − F(t)| > ε) ≤ δ?

Why do we want this? Because we want to make inference about t1, t2, t3, all of them, from just the one data sample X1, . . . , Xn that we have! Think about this!

Page 14: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Glivenko-Cantelli Theorem

For just a finite number of values of t, say two of them, we can use the union bound:

P(|Fn(t1) − F(t1)| > ε or |Fn(t2) − F(t2)| > ε)
≤ P(|Fn(t1) − F(t1)| > ε) + P(|Fn(t2) − F(t2)| > ε) ≤ 2δ.

For k values of t the same trick gives kδ. But that is quite bad, and what about all infinitely many t at the same time? Luckily, we can have a single bound for all of them, but the proof is a bit too long for this lecture.

Proposition (Glivenko-Cantelli Theorem)

P(sup_{t∈R} |Fn(t) − F(t)| > ε) ≤ 8(n + 1)e^{−nε²/32}.

Page 15: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Other sets

So far we have seen how to estimate the probability of X ≤ t, for a given t, based on the sample X1, . . . , Xn. Can we estimate the probability of other sets? For example, I want to know the probability that the next friend I ask has a car made between 1989 and 1993. Or made either in 1991 or in 1999? Interestingly, we can have the same statements as in our first two propositions:

Proposition

For any event A, Pn(A) = (1/n) ∑_{i=1}^n I_A(Xi) converges to P(A) as the sample size n goes to infinity (where by Pn we denote the empirical probability).

and

Proposition

For any event A, P(|Pn(A) − P(A)| > ε) ≤ 2e^{−nε²}.

And they can be proven in the same way!

Page 16: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

First disappointment

But let us see that we cannot have any guarantee for all sets A together; that is, there are no nontrivial bounds for

P(sup_A |Pn(A) − P(A)| > ε).

Indeed, if we consider all sets A then we will also encounter a set that contains exactly the data points Xi and (almost) nothing else.

Page 17: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Formally:

Proposition

Let P be any continuous distribution (that is, P({t}) = 0 for any point t ∈ R), and let X1, . . . , Xn be generated i.i.d. according to P. Then for any ε < 1,

P(sup_A |Pn(A) − P(A)| > ε) = 1.

Proof. Let B be the set that consists exactly of the sample points: B = {X1, . . . , Xn}. Then Pn(B) = 1, while P(B) = 0, since P is continuous. Moreover,

P(sup_A |Pn(A) − P(A)| > ε) ≥ P(|Pn(B) − P(B)| > ε) = P(1 > ε) = 1.

Page 18: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Theoretical analysis gives guarantees of performance on arbitrary data that fits the assumptions. Typical theoretical results in machine learning/artificial intelligence include:

• Asymptotic performance guarantees.

• Finite-step performance guarantees, e.g. bounds on the probability of error.

• Uniform (convergence, bounds) results.

• Impossibility results.

and

• Computational complexity analysis.

After a thorough theoretical analysis, practical evaluation is often redundant :)

Page 21: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Computational complexity of empirical DF estimate

Let’s recall the setup: we are given X1, . . . , Xn. For any t, computing Fn(t) takes O(log n) comparisons, provided X1, . . . , Xn is sorted. To sort X1, . . . , Xn we need O(n log n) comparisons.

Online setting: X1, . . . , Xn are given and Xn+1 arrives. Putting Xn+1 into the sorted array requires not more than O(log n) comparisons.
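A small sketch of these computational points in Python, using the standard bisect module; the sample values are made up:

```python
import bisect

data = sorted([1995, 1987, 2003, 1999, 2001, 1998, 2005, 1991])   # O(n log n), once

def ecdf_sorted(sorted_data, t):
    # bisect_right finds the number of elements <= t with O(log n) comparisons
    return bisect.bisect_right(sorted_data, t) / len(sorted_data)

print(ecdf_sorted(data, 1999))     # 0.625

bisect.insort(data, 2000)          # online setting: insert X_{n+1} with O(log n) comparisons
print(ecdf_sorted(data, 1999))     # now 5/9
```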

Page 22: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Complexity issues

In general, the following complexity issues are considered in ML:

• Complexity of building a classifier from the sample (X1,Y1), . . . , (Xn,Yn).

• Complexity of building a classifier “online”: changing the classifier with the arrival of a new data point (Xn+1,Yn+1).

• Complexity of classifying a new data point X; that is, of evaluating ϕn(X).

Page 23: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Classification

Classification is one of the most important, most basic and most popular problems in machine learning and artificial intelligence.

We are given a sample (X1,Y1), . . . , (Xk,Yk) of examples, where each Xi is an object and each Yi is its label. Based on this sample, we have to construct a classifier: a rule that “predicts” the label Y given the object X.

Example: hand-written digit recognition. Here are four objects (images of hand-written digits), and their labels are 3, 0, 1, 4. An object here is a picture of 16x16 grayscale pixels, and a label is just a digit from the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

Page 24: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

USPS dataset

This example was from a classical benchmark dataset: USPS handwritten digits.

It consists of about 7000 training examples, based on which you can construct a classifier, and about 2000 test examples, on which you can test it and compare with the results others achieved on this dataset.

Page 25: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Other examples

Recognizing a disease, such as cancer, in a patient, based on various data, such as X-ray images, gene expression data, etc. Here an object is all this data together, while the label is just binary: 0 or 1, has cancer or not.

More examples: sex identification from a photo (M/F). Emotion recognition from a photo (Happy/Sad/Angry). Textual data: email spam recognition. Topic identification. And whatnot.

Page 26: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Formal model for classification

We are given a (training) sample (X1,Y1), . . . , (Xk,Yk), where the Xi are from X = R^d, called the object space, and the Yi are from a finite set Y, called the label space. We will assume just binary Y = {0, 1} (unless stated otherwise). The objects and labels (Xi,Yi) are generated i.i.d. by an (unknown) probability distribution P on R^d × Y.

We wish to construct a classifier ϕ such that the probability of error P(ϕ(X) ≠ Y) on a test pair (X,Y) is small.

In essence, ϕ should emulate (or estimate) the conditional distribution P(Y|X): the probability of a certain label given the object.

Page 27: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Histogram methods

Divide the object space into bins and classify a new object X according to the majority in the bin containing X.

Page 28: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Histogram rule: definition

Let An(x) denote the bin containing x in the nth partition.

ϕn(x) = 1 if #{i : Yi = 0, Xi ∈ An(x)} < #{i : Yi = 1, Xi ∈ An(x)}, and 0 otherwise.

How to define the bins? For example, they can be cubes of size hn:

∏_{j=1}^d [kj·hn, (kj + 1)·hn],     (1)

where the kj are integers.

Page 29: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Class Probability Estimation

Clearly, the histogram rule can also be used to estimate the probability of a given class label. Let η(x) = P(Y = 1|X = x) be the “true” (unknown) probability of the label Y = 1 given the object x.

We can define the histogram estimate

η̂n(x) = #{i : Yi = 1, Xi ∈ An(x)} / #{i : Xi ∈ An(x)}.     (2)

Page 30: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Plug-in rule

From a class probability estimate η̂n(x) we can always get a classifier by

ϕn(x) = 1 if η̂n(x) ≥ 1/2, and 0 otherwise.
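A minimal sketch of the cubic-bin histogram estimate (2) and the plug-in rule, assuming the data are given as lists of numeric vectors; the bin size h and the default value returned for an empty bin are choices of this sketch, not of the lecture:

```python
import numpy as np

def bin_index(x, h):
    """Index (k_1, ..., k_d) of the cubic bin of side h containing x, as in (1)."""
    return tuple(np.floor(np.asarray(x, dtype=float) / h).astype(int))

def histogram_probability(x, X, Y, h):
    """Histogram estimate (2) of eta(x) = P(Y = 1 | X = x)."""
    b = bin_index(x, h)
    labels_in_bin = [y for xi, y in zip(X, Y) if bin_index(xi, h) == b]
    if not labels_in_bin:
        return 0.5                      # arbitrary value for an empty bin
    return sum(labels_in_bin) / len(labels_in_bin)

def histogram_classifier(x, X, Y, h):
    """Plug-in rule: predict 1 iff the estimated probability is at least 1/2."""
    return 1 if histogram_probability(x, X, Y, h) >= 0.5 else 0

# Tiny usage example with 2-d objects and binary labels:
X = [(0.1, 0.2), (0.3, 0.1), (0.9, 0.8), (0.7, 0.9)]
Y = [0, 0, 1, 1]
print(histogram_classifier((0.8, 0.7), X, Y, h=0.5))   # 1
```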

Page 31: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

How to select the bin size

The bin sizes must be

• Big enough to contain a sufficient number of objects

• Small enough to be “local” estimates

In a certain sense, it is easy.

Theorem

For the histogram estimate (2) with “cubic” bins (1), if hn → 0 and n·hn^d → ∞ as n → ∞, then

E(|η̂n(X) − η(X)|) → 0.

So histogram methods are very good! Or are they? Maybe, but this result is only asymptotic. Finite-step results cannot be obtained.

Page 32: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Another advantage of histogram methods: after building a classifier the data can be discarded. Thus: computational efficiency for test data.

Disadvantages of histograms:

• Sharp class decision boundaries

• Data-independent bins are not robust

• Curse of dimensionality: the number of bins grows exponentially with the dimension d of the object space.

Page 33: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

Kernel methods

Let’s look at a slightly different histogram rule. Let

k(x) = 1 if |x^(j)| ≤ 1/2 for all j = 1, . . . , d, and 0 otherwise     (3)

(here x^(1), . . . , x^(d) are the coordinate components of x). Then k can be used to count the number of data points within the cube of size h around x:

K = ∑_{i=1}^n k((x − Xi)/h).

So the class probability can be estimated as

η̂(x) = (1/K) ∑_{i=1}^n I_{Yi=1} k((x − Xi)/h).
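A sketch of this estimate with the box kernel (3); the Gaussian kernel of the next slide can be passed in its place. The value returned when no point falls in the window is a choice of this sketch:

```python
import numpy as np

def box_kernel(u):
    """Kernel (3): 1 if every coordinate of u lies in [-1/2, 1/2], else 0."""
    return 1.0 if np.all(np.abs(np.atleast_1d(u)) <= 0.5) else 0.0

def kernel_probability(x, X, Y, h, kernel=box_kernel):
    """Kernel estimate of eta(x): weighted fraction of label-1 points around x."""
    x = np.asarray(x, dtype=float)
    weights = np.array([kernel((x - np.asarray(xi, dtype=float)) / h) for xi in X])
    K = weights.sum()
    if K == 0:
        return 0.5                      # no points in the window; arbitrary default
    return float(np.dot(weights, Y) / K)
```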

Page 34: Non-Parametric Methods in Machine Learning

Introduction Empirical Distribution Function Why Theory? Computational issues Classification

We can use other kernel functions k(·) to obtain, in particular, smoother class decision boundaries. For example, the Gaussian kernel:

k(x) = (1/(2π)^{d/2}) e^{−||x||²/2},

where as usual ||x||² = ∑_{j=1}^d (x^(j))².

Page 35: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Non-parametric methods II

Daniil Ryabko

IDSIA/USI/SUPSI

Page 36: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Probability concepts you need for this course

• Elementary definition of probability: axioms of probability.

• Events, random variables

• Independence of events, of r.v.

• Conditional probabilities, conditional independence

• Expectation, variance

• Distribution functions, densities

• Distributions: Bernoulli, Binomial, Geometric, Gaussian, Exponential

These can be found in any book on probability! Or in the “probability background” sections of ML books, e.g. Bishop.

Page 43: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Probability concepts you DO NOT need for this course

• Measure theory

• σ-algebras (sigma-algebras)

• Extension theorems (e.g. Caratheodory’s, Kolmogorov’s)

• Different convergence concepts

If you see these in your book you can safely skip them. Although understanding them will do no harm.

Page 44: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

I will be giving some background material on probability, but don’t rely solely on it.

Page 45: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Classification setup

Recall the classification problem: we are given a (training) sample (X1,Y1), . . . , (Xn,Yn), where the Xi are from the object space X = R^d, and the Yi are from the label space Y = {0, 1}. The objects and labels (Xi,Yi) are generated i.i.d. by an (unknown) probability distribution P on R^d × Y.

We wish to construct a classifier that classifies objects whose labels we don’t know.

Page 46: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

k-Nearest Neighbours

The k-nearest neighbours rule classifies an object X according to the majority vote among its k nearest neighbours.

To find the k nearest neighbours of X, simply sort all the distances ||Xi − X||, i = 1, . . . , n, and take the Xi corresponding to the k smallest. Let X(1)(x), . . . , X(k)(x) be the k nearest neighbours of x and Y(1)(x), . . . , Y(k)(x) their labels.

The class probability estimate is

η̂n(x) = (1/k) ∑_{i=1}^k I_{Y(i)(x)=1}.
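A minimal sketch, computing and sorting all n distances per query as described above; ties in distance are broken arbitrarily by the sort:

```python
import numpy as np

def knn_probability(x, X, Y, k):
    """k-NN estimate of eta(x): fraction of label-1 points among the k nearest Xi."""
    x = np.asarray(x, dtype=float)
    dists = [np.linalg.norm(x - np.asarray(xi, dtype=float)) for xi in X]
    nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbours
    return sum(Y[i] for i in nearest) / k

def knn_classifier(x, X, Y, k):
    """Majority vote among the k nearest neighbours."""
    return 1 if knn_probability(x, X, Y, k) >= 0.5 else 0
```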

Page 47: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

How to select k?

Proposition

If kn → ∞ but kn/n → 0, then the expected error of the kn-nearest neighbour label probability estimate goes to zero:

E(|η̂n(x) − η(x)|) → 0.

Here, as before, η(x) = P(Y = 1|X = x) is the probability of label 1 given the object x.

For example, one can select kn = √n or kn = log(n).

Page 48: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Bigger k gives smoother and more robust estimates; smaller k is more “opportunistic”.

However, for deterministic labels (η(x) ∈ {0, 1} for all x) even the 1-nearest neighbour rule gives consistent estimates:

Proposition

If η(x) = 0 or η(x) = 1 for all x, then for the 1-nearest neighbour rule we have

E(|η̂n(x) − η(x)|) → 0.

Page 49: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Overfitting

Why is 1-nearest neighbour not good in general, when labels are non-deterministic? Because of overfitting.

Overfitting occurs when your classifier fits the training data too well. We can always fit the training data well, but that is not what we want. We want to make predictions for future data.

Page 50: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Weighted nearest neighbours

The k-nearest neighbours rule “asks” the k nearest neighbours about their labels and treats the answers equally. Maybe it’s better to trust the nearer neighbours more?

Let win, i = 1, . . . , n, be some positive weights that sum to one, ∑_{i=1}^n win = 1, with the neighbours ordered by their distance to x. Define the weighted nearest neighbour rule:

η̂(x) = ∑_{i=1}^n win I_{Y(i)(x)=1}.

For example, for k-nearest neighbours we have win = 1/k for i = 1, . . . , k and win = 0 otherwise.

Page 51: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Probability background interleaver goes here

Page 52: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Density estimation

Histogram and nearest neighbours rules can be used for density estimation too. Forget about the labels Yi. Now we have a sample X1, . . . , Xn of i.i.d. r.v. generated according to some unknown distribution P that has a density f. This means that

P(A) = ∫_A f(x) dx

for every event A.

We want to estimate f (x).

Page 53: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Histogram density estimate

To estimate the density f(x) at a point x, take a small region A around x and use the estimate f̂(x) = P(A)/∆(A), where ∆(A) is the area of A.

But we don’t even know P(A)!

So we use our empirical probability P̂(A) from Lecture 1:

f̂n(x) = P̂(A)/∆(A) = (1/(n∆(A))) ∑_{i=1}^n I_A(Xi).

But how do we select the “small” region A?

It should be small enough to give a “local” estimate, while big enough to contain some data points.

This brings us right back to histogram rules.

Page 54: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

We break the space into bins. Let An(x) be the bin containing x. Then the histogram density estimate is

f̂n^{histogram}(x) = (1/(n∆(An(x)))) ∑_{i=1}^n I_{An(x)}(Xi).

In the case of the cubic histogram rule, ∆(An(x)) = hn^d, so the cubic histogram density estimate is

f̂n^{cubic histogram}(x) = (hn^{−d}/n) ∑_{i=1}^n I_{An(x)}(Xi),

where hn is the size of the cube. Finally, the kernel density estimate with kernel k(·) is

f̂n^{kernel}(x) = (hn^{−d}/n) ∑_{i=1}^n k((x − Xi)/hn).
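A sketch of the kernel density estimate with the box kernel (3) from Lecture 1, so it reduces to counting the points in a cube of side h around x and dividing by n·h^d:

```python
import numpy as np

def box_kernel(u):
    """1 if every coordinate of u lies in [-1/2, 1/2], else 0 (kernel (3))."""
    return 1.0 if np.all(np.abs(np.atleast_1d(u)) <= 0.5) else 0.0

def kernel_density(x, X, h, kernel=box_kernel):
    """f_hat(x) = (h^{-d} / n) * sum_i k((x - X_i) / h)."""
    X = [np.atleast_1d(np.asarray(xi, dtype=float)) for xi in X]
    x = np.atleast_1d(np.asarray(x, dtype=float))
    d, n = len(x), len(X)
    return sum(kernel((x - xi) / h) for xi in X) / (n * h ** d)
```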

Page 55: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Nearest neighbours for density estimation

The same idea can be used with nearest neighbours. Recall that we want to have a small region A for which we take f̂n(x) = (1/(n∆(A))) ∑_{i=1}^n I_A(Xi).

Define An,kn(x) as a region containing exactly the kn nearest neighbours of x, and obtain the k-nearest neighbours density estimate

f̂n^{k-NN}(x) = (1/(n∆(An,kn(x)))) ∑_{i=1}^n I_{An,kn(x)}(Xi) = kn/(n∆(An,kn(x))).

Page 56: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Trees

In general, a (binary) tree is something like this:

Figure: A binary tree. From Devroye et al.

Page 57: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Return to classification.

A decision tree is an arrangement of questions, or tests, resulting in a classification (or a class probability estimate).

In a decision tree each node is a subset of the object space, or a bin. To determine whether an object is in a subtree, a simple question is asked. Asking these questions, we move down the tree.

Once in a leaf, classify by majority vote, or build a class probability estimate by counting the fraction of labels of each class in the leaf.

The questions can be:

• Is x^(j) ≤ α? (where x^(j) is the j’th coordinate of x and α is a parameter). This leads to ordinary classification trees.

• Is ∑_{j=1}^d aj x^(j) ≤ α? (where the aj and α are parameters). Gives binary space partition trees.

• Is ||x − z|| ≤ α? (where z and α are parameters). Gives sphere trees.

Page 58: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Examples of resulting bins:

Figure: Partitioning by an ordinary decision tree. From Devroye et al.

Figure: Partitioning by a BSP and by a sphere tree. From Devroye et al.

Page 59: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

A very simple decision tree

Apparently the simplest tree possible: on the first level, split the first coordinate in the middle. On the second level, split the second coordinate in the middle. On each next level, split the next coordinate in the middle, going in rounds, until a certain depth k is reached.

Page 60: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

The Median Tree

On the first level, split at the median of the first coordinate. If the median is on a data point, then leave it out (ignore it; don’t send it down to the subtrees). On the next levels, proceed in the same fashion with the other coordinates.

If we go down k levels, each leaf will have between n/2^k − k and n/2^k data points in it. Such a tree is also called a balanced tree.
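A minimal sketch of such a median tree, assuming real-valued coordinates; the nested-dictionary representation and the handling of points tied with the median are choices of this sketch:

```python
import numpy as np

def median_tree(X, depth, level=0):
    """Split at the median of coordinate (level mod d), drop points lying exactly
    on the median, and recurse until `depth` levels have been built."""
    X = [np.asarray(x, dtype=float) for x in X]
    if level == depth or len(X) <= 1:
        return {"leaf": X}                       # a leaf stores the points in its bin
    d = len(X[0])
    j = level % d                                # split the coordinates in rounds
    m = sorted(x[j] for x in X)[len(X) // 2]     # median of the j-th coordinate
    left = [x for x in X if x[j] < m]
    right = [x for x in X if x[j] > m]           # points exactly at the median are left out
    return {"coord": j, "median": m,
            "left": median_tree(left, depth, level + 1),
            "right": median_tree(right, depth, level + 1)}
```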

Page 61: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

The trees in the above examples have the following property: only the objects are used for the splits, not the labels. Let An(x) denote the leaf (bin) containing x. For such trees we have:

Proposition

For a decision tree class probability estimate η̂n(x), if

i) only the objects are used for making the split decisions, not the labels,

ii) diam(An(x)) → 0 in probability,

iii) #{i : Xi ∈ An(x)} → ∞ in probability,

then

E(|η̂n(X) − η(X)|) → 0.

Page 62: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

More trees

k-d trees: first split on the first coordinate, at the value of the first data point. Next, split on the next coordinate at the next data point. Stop after k splits.

Or don’t stop at all, but then prune the tree, uniting nodes until each leaf has at least k data points in it.

Page 63: Non-Parametric Methods in Machine Learning

Nearest Neighbours Density estimation Decision Trees

Using labels too

Why use only the objects to construct the tree? Why not use the labels too?

In particular, one can select a split that minimizes the empirical error. For real-valued data this can lead to some problems... But some good trees can be found this way! We will see them later.

Page 64: Non-Parametric Methods in Machine Learning

Assignment 1

1 Geometric distribution

The geometric distribution on the set N = {1, 2, 3, 4, . . .} is given by P_G(k) = (1 − p)^{k−1} p for k ∈ N.

The Bernoulli distribution on the set B = {0, 1} with parameter p is given by P_B(0) = p. It is a “toss of a biased coin”.

Suppose we are making independent Bernoulli coin tosses with parameter p. Verify that the probability that the first occurrence of 0 in this series is exactly at step k ∈ N is given by the geometric distribution.

Find the expectation of the geometric distribution.

2 Nearest Neighbors

Download the data data.zip, build a k-nearest neighbors classifier based on the training data, and test it on the testing data.

Use different values of k, and report the results for each value of k (number of classification errors).

Each data file is in csv format (comma separated values): each line corresponds to one example, and reports a comma-separated list of its numerical attributes, followed by its class label.

For the USPS data (directory images), the attributes are 256 grayscale levels, normalized between −1 and 1, and the class label is one of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. This dataset contains 7291 training examples and 2007 test examples. Suggested values for k: 1, 3, 5, 7, 9, 11, 21, 31.

For the IRIS data (directory iris), the attributes are 4 lengths in cm, and the class label is one of 0, 1, 2 (see the file irisdescr). This dataset contains 100 training examples and 50 test examples. Suggested values for k: 1, 3, 5, 7, 9.

Submit working and portable code: use only standard libraries. The code should be a correct implementation of the proposed algorithm. Coding style, exception handling, etc. will not affect your grade. Still, the code should be clear and easily understandable. Use comments!

Along with the code, provide a short description of the experiments you carried out (the values you tried for k, and the corresponding errors). Feel free to try other values than the proposed ones. Discuss how the performance varies with the choice of k.
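A possible starting point for this assignment (a sketch, not the official solution), using only the standard library; the file names train.csv and test.csv are placeholders, since only the directory layout of data.zip is described above:

```python
import csv
import math
from collections import Counter

def load(path):
    """Each csv row: numerical attributes followed by an integer class label."""
    examples = []
    with open(path) as f:
        for row in csv.reader(f):
            if row:
                examples.append(([float(v) for v in row[:-1]], int(float(row[-1]))))
    return examples

def knn_label(x, training, k):
    """Label of x by majority vote among the k nearest training examples."""
    nearest = sorted(training, key=lambda ex: math.dist(x, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return max(votes, key=votes.get)            # ties broken arbitrarily

def error_count(train, test, k):
    return sum(1 for x, y in test if knn_label(x, train, k) != y)

if __name__ == "__main__":
    train, test = load("train.csv"), load("test.csv")   # placeholder file names
    for k in (1, 3, 5, 7, 9):
        print(k, "errors:", error_count(train, test, k))
```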


Page 65: Non-Parametric Methods in Machine Learning

Assignment 1 - Part 1

October 8, 2007

1. (20 points) As the Bernoulli trials are independent of each other, the joint probability of any particular outcome is the product of the probabilities of the outcomes of each single trial:

P(x1, x2, . . . , xk) = P(x1)P(x2) · · · P(xk) = ∏_{i=1}^k P(xi).     (1)

As P(xi = 0) = p, and obviously P(xi = 1) = 1 − p, we can evaluate the probability that the first 0 occurs at the k-th trial as

Pr{x1 = 1, x2 = 1, . . . , xk−1 = 1, xk = 0} = (1 − p)^{k−1} p.     (2)

Not mentioning the independence condition will cost you 5 points.

2. (30 points) The expectation of a discrete distribution over an infinite set can be evaluated as

E{k} = ∑_{i=1}^∞ i P(i).     (3)

In the case of the geometric distribution P(k) = q^{k−1} p (where q = 1 − p),

E{k} = ∑_{i=1}^∞ i p q^{i−1} = p ∑_{i=1}^∞ i q^{i−1}     (4)

(the i = 0 term would contribute 0·q^{−1} = 0, so it may be included or not). We can write the terms to be summed as derivatives:

= p ∑_{i=1}^∞ (d/dq) q^i.     (5)

We can now exchange the derivative and sum operators:

= p (d/dq) ∑_{i=1}^∞ q^i = p (d/dq) (1/(1 − q)) = p · 1/(1 − q)² = 1/p,     (6)

since ∑_{i=0}^∞ q^i = 1/(1 − q) for q ∈ (0, 1) (the constant i = 0 term does not affect the derivative).

Page 66: Non-Parametric Methods in Machine Learning

As an alternative, we can simply write

E{k} = p ∑_{i=1}^∞ i q^{i−1} = p ∑_{i=1}^∞ ∑_{j=i}^∞ q^{j−1}.     (7)

Each term of the outer sum is the sum of a geometric series starting from i:

∑_{j=i}^∞ q^{j−1} = q^{i−1}/(1 − q),     (8)

so the result is again

E{k} = (p/(1 − q)) ∑_{i=1}^∞ q^{i−1} = (p/(1 − q)) · (1/(1 − q)) = p/(1 − q)² = 1/p,     (9)

since 1 − q = p.

Page 67: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Parametric methods

Daniil Ryabko1

IDSIA/USI/SUPSI

1 Some slides are based on Andrew Moore’s http://www.cs.cmu.edu/~awm/tutorials and on Bishop’s ML book, Chapter 3

Page 68: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Maximum likelihood learning

A simple and very fundamental way of making inference is maximum likelihood estimation.

Suppose we have n i.i.d. Gaussian N(µ, σ²) variables x1, . . . , xn:

f_{N(µ,σ²)}(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}.

Suppose we know σ² and want to find out µ.

Which µ is most likely given the data? Which µ maximizes the likelihood f_{N(µ,σ²)}(x1, . . . , xn)?

Page 71: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

calculation

Since the likelihood of an i.i.d. Gaussian sample is, up to factors not depending on µ, exp(−∑_{i=1}^n (xi − µ)²/(2σ²)), maximizing it is the same as minimizing the sum of squares:

argmax_µ {f_µ(x1, . . . , xn)} = argmin_µ {∑_{i=1}^n (xi − µ)²}.

Taking the derivative ∂/∂µ and setting it to zero we find

µ_MLE = (1/n) ∑_{i=1}^n xi.

Which wasn’t too unexpected.

Page 75: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

MLE in general

Suppose we have a vector of parameters θ = (θ1, . . . , θk).

• Write down LL = log fθ(x1, . . . , xn) (LL for “log likelihood”).

• Differentiate it with respect to the parameters:

∂LL/∂θ = (∂LL/∂θ1, ∂LL/∂θ2, . . . , ∂LL/∂θk)

• Solve the set of simultaneous equations ∂LL/∂θ = 0.

• Check that you found a maximum (not a minimum or a saddle point), and check the boundaries if you have any.

Page 82: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

We have n i.i.d. Gaussian N(µ, σ²) variables x1, . . . , xn. Now we know neither µ nor σ². Proceed the same way.

log p_{µ,σ²}(x1, . . . , xn) = −n(log√(2π) + (1/2) log σ²) − (1/(2σ²)) ∑_{i=1}^n (xi − µ)²

∂LL/∂µ = (1/σ²) ∑_{i=1}^n (xi − µ)

∂LL/∂σ² = −n/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^n (xi − µ)²

Setting these to 0 and solving, we get

µ_MLE = (1/n) ∑_{i=1}^n xi,     σ²_MLE = (1/n) ∑_{i=1}^n (xi − µ_MLE)².
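A quick numerical sketch of these two estimates on synthetic data (the true values µ = 2, σ = 3 are of course made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)   # synthetic sample, true mu = 2, sigma^2 = 9

mu_mle = x.mean()                                 # (1/n) * sum x_i
sigma2_mle = ((x - mu_mle) ** 2).mean()           # (1/n) * sum (x_i - mu_mle)^2

print(mu_mle, sigma2_mle)                         # should be close to 2 and 9
```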

Page 83: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Quality of the estimate

How do we assess the quality of these estimates? Are they any good?

An estimate θ̂ of the parameter θ is called unbiased if

E θ̂ = θ.

Let’s check µ_MLE:

E µ_MLE = (1/n) E ∑_{i=1}^n xi = E x1 = µ,

hence µ_MLE is unbiased.

Page 85: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

But σ²_MLE = (1/n) ∑_{i=1}^n (xi − (1/n) ∑_{j=1}^n xj)² is biased! Check for n = 1 first. In general,

E σ²_MLE = (1 − 1/n) σ² ≠ σ².

However, we can easily correct it, defining

σ²_corrected = (n/(n − 1)) σ²_MLE = (1/(n − 1)) ∑_{i=1}^n (xi − µ_MLE)²,

which is unbiased.

Page 88: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Still, an unbiased estimate is not necessarily a good estimate. For example, µ̂ = x1 is an unbiased estimator of µ, but it’s not good.

Further ways to evaluate estimators θ̂ of a parameter θ:

• Mean squared error E(θ̂ − θ)²

• Asymptotic convergence: θ̂ → θ as n → ∞

• Confidence intervals

But we won’t consider these in detail now.

Page 91: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

MLE for multidimensional Gaussian

For the d-dimensional Gaussian

f_{N(µ,Σ)}(x) = (1/((2π)^{d/2} |Σ|^{1/2})) e^{−(1/2)(x−µ)ᵀ Σ^{−1} (x−µ)}

and a sample x1, . . . , xn we get

µ_MLE = (1/n) ∑_{i=1}^n xi

and

Σ_MLE = (1/n) ∑_{i=1}^n (xi − µ_MLE)(xi − µ_MLE)ᵀ.

Page 92: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Regression

We are given a sample (X1,Y1), . . . , (Xn,Yn) of i.i.d. r.v. generated according to some probability distribution P on X × R, where as before X = R^d.

That is, we have a sample of n pairs (Xi,Yi) where Xi is a vector and Yi is a number.

We want to estimate the label Y for unseen objects X .

Similar to classification, but the labels Y are reals.

Let’s go parametric now!

Page 94: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Linear regression

In linear regression in its simplest form, we assume

y_w(x) = w0 + w1 x^(1) + w2 x^(2) + · · · + wd x^(d),     (1)

where w0, w1, . . . , wd are parameters.

But this is too simple!

We introduce some basis functions

ϕ1(x), . . . , ϕ_{M−1}(x)

(x here is the whole vector x = (x^(1), . . . , x^(d))) and let

y_w(x) = w0 + w1 ϕ1(x) + w2 ϕ2(x) + · · · + w_{M−1} ϕ_{M−1}(x).

Note that the number of basis functions M − 1 doesn’t have to equal the dimension d.

Page 96: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Letting ϕ0(x) = 1 we can rewrite

y_w(x) = ∑_{j=0}^{M−1} wj ϕj(x),

or in vector notation

y_w(x) = wᵀϕ(x).

(A symbol written without indices denotes a vector.)

Page 97: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Basis functions

Let d = 1 and let’s look at different basis functions.

Polynomial basis functions: ϕk(x) = x^k (here the upper index is a power).

Gaussian basis functions: ϕk(x) = e^{−(x−µk)²/(2s²)}, where µk and s are parameters.

Sigmoidal basis functions: ϕk(x) = σ((x − µk)/s), where σ(a) = 1/(1 + e^{−a}).

However, for what follows it absolutely doesn’t matter what the basis functions are!
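Minimal sketches of the three families for d = 1; µ_k and s are free parameters:

```python
import numpy as np

def polynomial_basis(x, k):
    return x ** k

def gaussian_basis(x, mu_k, s):
    return np.exp(-(x - mu_k) ** 2 / (2 * s ** 2))

def sigmoidal_basis(x, mu_k, s):
    return 1.0 / (1.0 + np.exp(-(x - mu_k) / s))
```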

Page 99: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

So we have (in vector notation)

y_w(x) = wᵀϕ(x).

But this is not probabilistic at all! What is there to learn then?

So we add some Gaussian noise ε, with zero mean and variance σ². Now,

y = wᵀϕ(x) + ε.

So in our sample (X1,Y1), . . . , (Xn,Yn) we have

Yi = wᵀϕ(Xi) + εi,

where the εi are i.i.d. Gaussian N(0, σ²).

This means that Yi is a Gaussian with mean wᵀϕ(Xi) and variance σ². The mean depends on X.

Thus we assume that we know very well how the data is generated, except that we don’t know some parameters: w and possibly σ².

Page 102: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

In parametric methods, the main goal often shifts to estimating the values of the parameters. When these are revealed, arbitrary inference can be made.

Page 103: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Maximum likelihood least squares

The log likelihood is

LL = log f_{N(wᵀϕ(Xi),σ²)}((X1,Y1), . . . , (Xn,Yn)) = −n(log√(2π) + (1/2) log σ²) − (1/(2σ²)) ∑_{i=1}^n (Yi − wᵀϕ(Xi))².

So (as before) we will have to minimize ∑_{i=1}^n (Yi − wᵀϕ(Xi))², which is why this is called least squares.

Differentiating with respect to the vector w, we have to solve

∑_{i=1}^n (Yi − wᵀϕ(Xi)) ϕ(Xi)ᵀ = 0.

Page 104: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

∑_{i=1}^n Yi ϕ(Xi)ᵀ − wᵀ ∑_{i=1}^n ϕ(Xi)ϕ(Xi)ᵀ = 0;

solving for w we obtain

w_MLLS = (ΦᵀΦ)^{−1} Φᵀ Y,     (2)

where

Φ =
ϕ0(X1) ϕ1(X1) . . . ϕ_{M−1}(X1)
ϕ0(X2) ϕ1(X2) . . . ϕ_{M−1}(X2)
. . .
ϕ0(Xn) ϕ1(Xn) . . . ϕ_{M−1}(Xn)

is called the design matrix, and (ΦᵀΦ)^{−1}Φᵀ is the Moore-Penrose pseudo-inverse of Φ. Solving also for σ² we get

σ²_MLLS = (1/n) ∑_{i=1}^n (Yi − (w_MLLS)ᵀ ϕ(Xi))².
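A sketch of this solution, building the design matrix from a made-up polynomial basis and solving the least-squares problem with numpy's pseudo-inverse routine:

```python
import numpy as np

def design_matrix(X, basis_functions):
    """Phi[i, j] = phi_j(X_i); the list of basis functions includes phi_0(x) = 1."""
    return np.array([[phi(x) for phi in basis_functions] for x in X])

def ml_least_squares(X, Y, basis_functions):
    """w_MLLS = (Phi^T Phi)^{-1} Phi^T Y, computed via a pseudo-inverse solve."""
    Phi = design_matrix(X, basis_functions)
    w, *_ = np.linalg.lstsq(Phi, np.asarray(Y, dtype=float), rcond=None)
    return w

# Example with basis functions 1, x, x^2 on made-up one-dimensional data:
X = [0.0, 0.5, 1.0, 1.5, 2.0]
Y = [1.1, 1.8, 3.2, 5.1, 7.9]
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]
print(ml_least_squares(X, Y, basis))
```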

Page 105: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Bayes rule

Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.

Page 106: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

The Bayes rule

P(A|B) = P(B|A)P(A) / P(B),

or, for densities,

f(x|y) = f(y|x)f(x) / f(y).

If A can take finitely many values aj we can continue:

P(A|B) = P(B|A)P(A) / ∑_j P(B|A = aj)P(A = aj).

Analogously, for densities,

f(x|y) = f(y|x)f(x) / ∫ f(y|x)f(x) dx.

Page 107: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Bayesian inference

Bayesian inference typically proceeds as follows:

• Parameterize the problem

• Define the prior distribution on the parameters

• Estimate the posterior distribution of the parameters given the data

• Make inference

Look at the Bayes rule again:

P(A|B) = P(B|A)P(A) / P(B).

Here A is the “parameter” and B is the data.

• P(A) is the prior probability of A,

• P(A|B) is the posterior probability of A,

• P(B|A) is the likelihood of B given A,

• P(B) is the prior probability of B; it just acts as a normalizing factor.

Page 108: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

The Bayesian approach is very general.

It can solve any problem.

However, the quality of the solution depends on

• Parametrization

• The choice of prior

Page 109: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

Bayesian inference for Gaussian

Recall the problem: we are given X1, . . . , Xn generated i.i.d. according to a Gaussian distribution N(µ, σ²) and we want to estimate the parameter µ, with σ² considered known.

A Bayesian approach is to maximize the posterior probability of the parameters given the data.

That is, find the µ that maximizes

f(µ|X1, . . . , Xn) = f(X1, . . . , Xn|µ) fprior(µ) / fprior(X1, . . . , Xn).

Here f(X1, . . . , Xn|µ) is just f_{N(µ,σ²)}(X1, . . . , Xn), while fprior(µ) we define somehow; we just choose whatever seems more suitable and more convenient for us. Note that since fprior(X1, . . . , Xn) does not depend on µ we can drop it. Thus we define

µ_MAP = argmax_µ f(X1, . . . , Xn|µ) fprior(µ).

Page 112: Non-Parametric Methods in Machine Learning

Maximum likelihood Linear regression Bayes rule Bayesian regression Conjugate priors Conclusion and exercises

So let's choose the prior to be also Gaussian: fprior(µ) = N(0, α). To find the maximum we follow the same procedure: differentiate and solve.

LP = −n(log √(2π) + (1/2) log σ²) − (1/(2σ²)) ∑_{i=1}^n (xi − µ)² − log √(2πα) − (1/(2α)) µ²

Differentiating and solving we find

µMAP = (1/(n + σ²/α)) ∑_{i=1}^n Xi.
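Here is a minimal Python sketch of this estimator (not part of the slides; the toy data and the values of σ² and α are made up). It compares the MLE with the MAP estimate derived above, µMAP = (∑ Xi)/(n + σ²/α):

    import numpy as np

    rng = np.random.default_rng(0)
    sigma2 = 1.0    # known noise variance sigma^2
    alpha = 0.5     # prior variance: mu ~ N(0, alpha)
    X = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=20)   # toy data

    mu_mle = X.mean()                              # maximum likelihood estimate
    mu_map = X.sum() / (len(X) + sigma2 / alpha)   # MAP estimate derived above

    print(mu_mle, mu_map)   # the MAP estimate is shrunk towards the prior mean 0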


Bayesian linear regression

Recall the setup: we are given a sample (X1, Y1), . . . , (Xn, Yn) of i.i.d. r.v. generated according to the following rule

Yi = w^T ϕ(Xi) + εi,

where ϕ0(x) = 1, ϕ1(x), . . . , ϕ_{M−1}(x) are basis functions, and the εi are independent 0-mean Gaussians with variance σ² (assumed known).

We have to estimate w.

A Bayesian approach is to have a prior over w, and then maximize the posterior.

Here's our prior: w is distributed according to a (multivariate) Gaussian N(0, αI) (where I is the identity matrix). In other words, the components wj are independent Gaussians with zero mean and variance α.


LP = −(1/(2σ²)) ∑_{i=1}^n (Yi − w^T ϕ(Xi))² − (1/(2α)) w^T w + const

Maximizing this is known as regularized least squares. The solution here is

wMAPLS = (Φ^T Φ + λI)^{−1} Φ^T Y,

where λ = σ²/α.

Regularization helps to prevent overfitting: if we have too many basis functions then it is easy to overfit. Regularization may help here, but it introduces another parameter, α.
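A short Python sketch of regularized least squares under these assumptions (the polynomial basis functions and the toy data are arbitrary choices made just for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    n, M = 30, 6                                   # data points, basis functions
    X = rng.uniform(-1, 1, size=n)
    Y = np.sin(np.pi * X) + rng.normal(scale=0.3, size=n)   # toy targets

    # Basis functions phi_j(x) = x**j, j = 0..M-1 (an arbitrary choice here).
    Phi = np.vander(X, M, increasing=True)         # n x M design matrix

    sigma2, alpha = 0.3 ** 2, 1.0
    lam = sigma2 / alpha                           # lambda = sigma^2 / alpha

    # w_MAPLS = (Phi^T Phi + lambda I)^{-1} Phi^T Y
    w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ Y)
    print(w_map)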


Conjugate priors

Why did we choose a Gaussian prior for the distribution of the parameters of the Gaussian?

The answer “because we don't know any other distributions” is wrong.

The answer “because it simplifies the calculations” is correct.

But how did we know that it would simplify the calculations?


Look again at the posterior likelihood for the parameters

f(X1, . . . , Xn|µ) fprior(µ).

If we take a Gaussian N(µ0, σ0²) prior over µ, as we did, this takes the form

(1/(2πσ²)^{n/2}) e^{−(1/(2σ²)) ∑_{i=1}^n (Xi−µ)²} · (1/√(2πσ0²)) e^{−(1/(2σ0²)) (µ−µ0)²}.

You can check that, as a function of µ, this is proportional to another Gaussian N(µn, σn²),

(1/√(2πσn²)) e^{−(1/(2σn²)) (µ−µn)²},

with

µn = (σ²/(nσ0² + σ²)) µ0 + (nσ0²/(nσ0² + σ²)) µMLE

and

σn² = (1/σ0² + n/σ²)^{−1}.
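A small numeric sketch of these posterior formulas in Python (the toy data and the prior parameters are invented; it only illustrates that the posterior mean sits between the prior mean and the MLE, and that the posterior variance shrinks with n):

    import numpy as np

    rng = np.random.default_rng(2)
    sigma2 = 2.0                    # known data variance
    mu0, sigma02 = 0.0, 1.0         # prior N(mu0, sigma0^2) on mu
    X = rng.normal(loc=1.5, scale=np.sqrt(sigma2), size=50)
    n, mu_mle = len(X), X.mean()

    # Posterior parameters from the formulas above.
    mu_n = (sigma2 / (n * sigma02 + sigma2)) * mu0 \
         + (n * sigma02 / (n * sigma02 + sigma2)) * mu_mle
    sigma_n2 = 1.0 / (1.0 / sigma02 + n / sigma2)

    print(mu0, mu_n, mu_mle)   # mu_n lies between the prior mean and the MLE
    print(sigma_n2)            # the posterior variance shrinks as n grows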


So if we take a Gaussian prior distribution on the parameter µ of a Gaussian, we get a Gaussian posterior. Thus the Gaussian distribution is self-conjugate.

In general, a prior probability distribution f(θ) (of the parameter θ) is said to be conjugate to the likelihood f(x|θ) if the resulting posterior f(θ|x) is in the same family as f(θ).


Topics covered

• What is a maximum likelihood estimator

• MLE estimator for parameters of a Gaussian

• Linear regression with abstract basis functions

• Conditional probabilities, Bayes rule

• Again, the Bayes rule

• Maximum a posteriori estimators

• MAP estimators for the Gaussian

• MAP for linear regression (regularized least squares)

• Conjugate priors: Gaussian, Bernoulli.


Useful exercise 1

These exercises are not graded and there's no deadline. But they are very useful, especially if you think you may need to refresh some background material.

Exercise (Matrices and stuff)

In the linear regression problem, suppose that ϕj(x) = x(j), that is, the basis functions are degenerate and the linear regression is given by (1). Let also the number of data points n = 2 and d = 2. Work out the solution for wMLLS without vector/matrix notation, and check that in this case it is indeed the same as the solution given by the general formula (2) for wMLLS.


Useful exercise 3

Exercise (Different distributions and conjugates)

Fill in what's missing in the derivation of the conjugate Gaussian distribution. Make sure you know what a binomial distribution is and how it is related to the Bernoulli distribution. Find its mean. Check what's going on in the derivation of the conjugate for Bernoulli.


Basics of Bayesian Networks

Daniil Ryabko¹

IDSIA/USI/SUPSI

¹Some slides are based on Andrew Moore's http://www.cs.cmu.edu/~awm/tutorials


This lecture is simpler.


Discrete spaces

We are moving to a simpler classification task: discrete objects.

We are given a sample (X1, Y1), . . . , (Xn, Yn) generated i.i.d. according to P. The Yi are binary, but the Xi are also from a finite set X. We assume that each X is a d-dimensional vector of discrete features: X ∈ X = B^d, where B = {b1, . . . , bk} is a finite set.


Tables for the joint distribution

At first this problem is easy. Given the data, we can simply tabulate the estimate of the joint distribution P(x(1), . . . , x(d), Y).

x(1)  x(2)  y   # records
 0     0    0    2
 0     0    1    4
 0     1    0    5
 0     1    1    1
 1     0    0    3
 1     0    1    13
 1     1    0    9
 1     1    1    2


Dividing each count by the total number of records (39) gives the estimated probabilities P̂:

x(1)  x(2)  y   P̂
 0     0    0    2/39
 0     0    1    4/39
 0     1    0    5/39
 0     1    1    1/39
 1     0    0    3/39
 1     0    1    13/39
 1     1    0    9/39
 1     1    1    2/39


What is the probability P(Y = 1, X(1) = 0)? Just sum the matching rows.

How about P(Y = 1 | X(1) = 0, X(2) = 1)?

Or this: P(Y = 1 | X(2) = 0)?

In general, for n binary variables the table has 2^n rows. For n m-ary variables, m^n rows.
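These queries amount to summing the matching rows and, for conditional probabilities, dividing two such sums. A minimal Python sketch using the table above:

    # Estimated joint distribution from the table above (counts divided by 39).
    table = {
        (0, 0, 0): 2/39,  (0, 0, 1): 4/39,  (0, 1, 0): 5/39, (0, 1, 1): 1/39,
        (1, 0, 0): 3/39,  (1, 0, 1): 13/39, (1, 1, 0): 9/39, (1, 1, 1): 2/39,
    }

    def prob(x1=None, x2=None, y=None):
        """Sum the rows that match the given (partial) assignment."""
        return sum(p for (a, b, c), p in table.items()
                   if (x1 is None or a == x1)
                   and (x2 is None or b == x2)
                   and (y is None or c == y))

    print(prob(x1=0, y=1))                            # P(Y=1, X(1)=0) = 5/39
    print(prob(x1=0, x2=1, y=1) / prob(x1=0, x2=1))   # P(Y=1 | X(1)=0, X(2)=1) = 1/6
    print(prob(x2=0, y=1) / prob(x2=0))               # P(Y=1 | X(2)=0) = 17/22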


This is a table for a database with 3 binary attributes: sex (M/F), hours worked (less than or more than 40.5), and wealth (poor/rich).


So, this way we can get estimates for any probabilities!

But, in general, for n m-ary variables we have m^n rows. This requires quite a lot of storage space, but it is also quite hard to fill such a table with data. In fact, it's almost impossible to do this for about 40 attributes.

What can help us is (conditional) independence of some variables.


If two variables X and Y are independent we can restore all 4 rows of the table from P(X = 0) and P(Y = 0).


A more interesting case is conditional independence. First recall the conditional probability definition

P(A|B) = P(A, B) / P(B)

and the Bayes rule in this form

P(A|B) = P(B|A)P(A) / ∑_a P(B|A = a)P(A = a).

We will be using these all the time.


Conditional independence

A is conditionally independent of B given Y if

P(A|Y, B) = P(A|Y).

Check that this is symmetric: if A is conditionally independent of B given Y then B is conditionally independent of A given Y.

A (bit more) conditional Bayes rule:

P(A|B, X) = P(B|A, X) P(A|X) / P(B|X)


If x(1) and x(2) are conditionally independent given y then we only need P(x(1) = 1|y = 0), P(x(1) = 1|y = 1), P(x(2) = 1|y = 0), P(x(2) = 1|y = 1) and P(y = 1) to fill all the 8 rows of the table.

Check it!


Graph

[Diagram: a node y with arrows pointing to x(1), x(2), x(3).]

Graphically, we can show conditional independence of x(1), x(2), x(3) given y like this.


Naive Bayes

Assume that all features X(i) are independent given the label Y.

How do we make predictions in such a model? For a given X = X(1), . . . , X(d) we want argmax_a P(Y = a|X(1), . . . , X(d)).

argmax_a P(Y = a|X(1), . . . , X(d))
  = argmax_a P(X(1), . . . , X(d)|Y = a) P(Y = a) / P(X(1), . . . , X(d))
  = argmax_a P(X(1), . . . , X(d)|Y = a) P(Y = a)
  = argmax_a P(Y = a) ∏_{i=1}^d P(X(i)|Y = a)


So the Naive Bayes algorithm works as follows:

• estimate P(Y = a) for each a using frequencies;
• estimate P(X(i) = b|Y = a) for all i, a, b using frequencies;
• for a new object X = b(1), . . . , b(d) predict argmax_a P(Y = a) ∏_{i=1}^d P(X(i) = b(i)|Y = a).


How many frequencies do we have to count, if all sets are binary? 2 + 2d. If the size of the label space |Y| is m and each feature X(i) is from a k-ary set? m + (k − 1)md. This is much better than the exponential number required for the full tables!
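A minimal Python sketch of these three steps on a tiny made-up binary dataset (plain frequency estimates, with no smoothing, which a real implementation would probably want to add):

    from collections import Counter, defaultdict

    # Tiny invented dataset: each row is ((x(1), ..., x(d)), y).
    data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1),
            ((1, 1), 1), ((1, 0), 1), ((0, 0), 0)]

    # Step 1: estimate P(Y = a) by frequencies.
    n = len(data)
    p_y = {a: c / n for a, c in Counter(y for _, y in data).items()}

    # Step 2: estimate P(X(i) = b | Y = a) by frequencies.
    counts = defaultdict(Counter)            # (i, a) -> Counter over values b
    for x, y in data:
        for i, b in enumerate(x):
            counts[(i, y)][b] += 1
    p_x_given_y = {k: {b: c / sum(cnt.values()) for b, c in cnt.items()}
                   for k, cnt in counts.items()}

    # Step 3: predict argmax_a P(Y=a) * prod_i P(X(i)=b(i) | Y=a).
    def predict(x):
        def score(a):
            s = p_y[a]
            for i, b in enumerate(x):
                s *= p_x_given_y[(i, a)].get(b, 0.0)
            return s
        return max(p_y, key=score)

    print(predict((1, 1)))   # predicts 1 on this toy data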


What if we want probability estimates from our algorithm, not just predictions? Recall that

P(Y = a|X(1), . . . , X(d)) = P(Y = a) ∏_{i=1}^d P(X(i)|Y = a) / P(X(1), . . . , X(d))
  = P(Y = a) ∏_{i=1}^d P(X(i)|Y = a) / ∑_{a′} P(Y = a′, X(1), . . . , X(d))
  = P(Y = a) ∏_{i=1}^d P(X(i)|Y = a) / ∑_{a′} ( P(Y = a′) ∏_{i=1}^d P(X(i)|Y = a′) )


Independence does not imply conditional independence

Once again: independence does not imply conditional independence!

Think of an example of 3 binary random variables X, Y, Z such that X and Y are independent, but are not independent given Z.

Hint: maybe Z should be some function of X and Y.

Z = (X + Y) mod 2
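A quick Python sketch that checks this example by enumerating the joint distribution of two independent fair bits X, Y and Z = (X + Y) mod 2:

    from itertools import product

    # X and Y are independent fair coin flips; Z = (X + Y) mod 2.
    joint = {}
    for x, y in product([0, 1], repeat=2):
        joint[(x, y, (x + y) % 2)] = 0.25

    def p(**fix):
        keys = ('x', 'y', 'z')
        return sum(pr for vals, pr in joint.items()
                   if all(v == fix[k] for k, v in zip(keys, vals) if k in fix))

    # Marginally independent: P(X=1, Y=1) == P(X=1) P(Y=1)
    print(p(x=1, y=1), p(x=1) * p(y=1))                       # 0.25  0.25

    # But not independent given Z:
    print(p(x=1, y=1, z=0) / p(z=0))                          # 0.5
    print((p(x=1, z=0) / p(z=0)) * (p(y=1, z=0) / p(z=0)))    # 0.25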


Another type of relation is this: let the features x(1), x(2), x(3) be independent. Each of them provides information about Y.


[Diagram: nodes x(1), x(2), x(3), each with an arrow pointing into y.]

This type of dependence by itself does not give us a big advantage in the calculation of probabilities.

We need P(Y|x(1), x(2), x(3)) for all assignments of values to the variables.

But it gives us a nice picture!


The burglar example

• Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes.
• The Earth arguably doesn't care whether your house is currently being burgled.
• While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing.


In general, a Bayesian network (or a “belief network”) is a directed acyclic graph.

Each node is a variable.

Each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents.

For each variable X^k, we need to know the probability of each of its values given each assignment of values to all its parents, P(X^k|parents(X^k)).


Inference in Bayesian Networks

Let's see how we can compute the probability of any row of the table of the joint distribution with the Bayes net. (And you remember that if we know the joint distribution then we know everything.) Here I dropped the Y for convenience; we only have the X^i.

P(X^1, X^2, . . . , X^d) = P(X^d|X^1, . . . , X^{d−1}) P(X^1, . . . , X^{d−1})
  = P(X^d|parents(X^d)) P(X^1, . . . , X^{d−1})
  = P(X^d|parents(X^d)) P(X^{d−1}|X^1, . . . , X^{d−2}) P(X^1, . . . , X^{d−2})
  = · · · = ∏_{i=1}^d P(X^i|parents(X^i))

Let's compute something for some example.
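For instance, here is a Python sketch for a small net in the spirit of the burglar story; the structure (Burglar → Alarm ← Earthquake, Alarm → Call) and all the conditional probabilities are invented for illustration only:

    # All numbers below are made up; they are not from the slides.
    p_burglar = 0.01
    p_earthquake = 0.02
    p_alarm = {                       # P(Alarm = 1 | Burglar, Earthquake)
        (1, 1): 0.95, (1, 0): 0.90, (0, 1): 0.20, (0, 0): 0.01,
    }
    p_call = {1: 0.70, 0: 0.05}       # P(Call = 1 | Alarm)

    def joint(b, e, a, c):
        """P(B, E, A, C) = P(B) P(E) P(A | B, E) P(C | A)."""
        pb = p_burglar if b else 1 - p_burglar
        pe = p_earthquake if e else 1 - p_earthquake
        pa = p_alarm[(b, e)] if a else 1 - p_alarm[(b, e)]
        pc = p_call[a] if c else 1 - p_call[a]
        return pb * pe * pa * pc

    # One row of the joint table: no burglar, no earthquake, alarm rings, neighbor calls.
    print(joint(0, 0, 1, 1))          # 0.99 * 0.98 * 0.01 * 0.70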


Computational Complexity (Informally)

• To store a Bayes net we need capacity linear in the number of nodes, and only exponential in the maximal number of parents of any node.
• We can compute any row of the table of the joint in linear time.
• So we can compute probabilities of anything given anything!
• But...
• In general, computing probabilities like P(Y|E), where E is some set of nodes, is exponential...
• it can be exponential in the number of nodes.
• General querying of Bayes nets is NP-hard.


More pictures

[Diagram: a chain X → Y → Z.]

Here Z is conditionally independent of X given Y. By symmetry, X is also conditionally independent of Z given Y. Y is not conditionally independent of anything. Looks like what we modelled on our first graph.


What is independent of what on the graph?

• While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing.
• Then you hear on the radio that there was a small earthquake in your area. So perhaps it wasn't the burglar after all.
• The earthquake explains away the hypothetical burglar.
• But then it must not be the case that Burglar is independent of the Earthquake given the Phone call?
• Don't worry, it's not.


d-separation

If two nodes X and Y are d-separated by a set of nodes E then they are conditionally independent given E.

X and Y are d-separated by a set of nodes E if every undirected path p between X and Y is blocked by E, where a path p is blocked by E if either

• p contains a chain X → t → Y or a fork X ← t → Y such that t is in E, or
• p contains an inverted fork (collider) X → t ← Y such that t is not in E and none of the descendants of t is in E.


Let’s check whether the following independencies hold:

• Is C independent of D?

• Is C independent of D given A?

• Is C independent of D given A, B?

• Is C independent of D given A, B, J?

• Is C independent of D given A, B, E, J?


Bayes nets summarized

• When building a Bayes net we have to be sure that each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents.
• For each variable X^k, we need to know the probability of each of its values given each assignment of values to all its parents, P(X^k|parents(X^k)).
• After that, we can make inference...
• though making arbitrary inference is NP-hard.
• We can also look up what independencies are implied by the net.
• Bayes nets are used in applications where considerable expert knowledge is available, for example in medical applications. After a net is constructed using this knowledge, a lot of new inference can be made automatically; this includes predicting conditional and joint probabilities, and analyzing dependencies.


Overview: Universal versus specific methods

Daniil Ryabko

IDSIA/USI/SUPSI


Classification with real-valued features

We are given a (training) sample (X1, Y1), . . . , (Xn, Yn), where the Xi are from the object space X = R^d, and the Yi are from the label space Y = {0, 1}. The coordinate components x(1), . . . , x(d) of an object x are called (input) features. The objects and labels (Xi, Yi) are generated i.i.d. by an (unknown) probability distribution P on R^d × Y.

We wish to construct a classifier that classifies objects whose labels we don't know.


Non-parametric classification

Methods for non-parametric classification, such as histogram methods, k-nearest neighbours and many decision trees, classify new objects according to the majority vote in the small bin containing the object.

These methods are “universal” in the sense that they work (asymptotically) for any distribution P generating the examples; we have theorems like this:

Theorem. For any distribution P generating the examples the following is true. For the cubic histogram rule, if h_n → 0 and n·h_n^d → ∞ as n → ∞, then

E(|η̂(X) − η(X)|) → 0.


Curse of dimensionality

A histogram method that breaks each coordinate into k parts (that is, h = 1/k) in d dimensions will have k^d bins. We have to have enough data to make inference about k^d different values.

If the distribution generating the data is arbitrary, making inference about one of the bins doesn't tell us anything about the other bins.


Only Asymptotic

Even if we can fill k^d bins with data, can we fill (k + 1)^d? Which k is enough? The performance guarantees are only asymptotic.


To overcome the curse of dimensionality, one has to consider smaller families of distributions. For example, consider the distributions that have a density of some particular form.

A d-dimensional Gaussian distribution (with a diagonal covariance matrix) is defined by 2d parameters. We only have to estimate those; that is, we only have to have enough data to estimate 2d parameters.

Linear regression with M basis functions has M or M + 1 parameters (independent of the dimension).


Estimating the parameters of a d-dimensional Gaussian distribution is useful only if the real distribution generating the data is indeed (close to) Gaussian.

Thus we can choose between “universal” methods that work for any distribution and specific methods that work only in specific cases but are more data efficient.


Classification and inference with discrete-valued features

If our objects are from a finite space, we have a seemingly simpler task of classification with discrete features.

However, if an object x = (x(1), . . . , x(d)) is a d-dimensional binary vector, there are 2^d possible values for x.

Tabulating the full joint distribution (that is, counting the number of data points that have the form (x(1), . . . , x(d), y) for each combination of x(i) and y) requires estimation of 2^{d+1} values.

This is effectively the same as a cubic histogram method.


Bayesian networks

Bayesian networks allow us to reduce the dimensionality of the problem by considering distributions of a specific form.

For example, if the d binary features x(1), . . . , x(d) are conditionally independent given the label, we only have to estimate the 2d values P(x(i) = 0|y = 0), P(x(i) = 0|y = 1) to make a classification.

Again we have the same trade-off: universality versus data efficiency.


Real-valued or discrete inputs?

If you like some method that is designed for discrete inputs, but you have real-valued data, you can quantize it. Quantization into k bins: for x ∈ [0, 1] let [x] = t such that t/k ≤ x < (t + 1)/k.
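A tiny Python sketch of this quantization rule (the handling of the right endpoint x = hi, clipped into the last bin, is my own choice, not specified on the slide):

    def quantize(x, k, lo=0.0, hi=1.0):
        """Map x in [lo, hi] to a bin index t in {0, ..., k-1} (equal-width bins)."""
        t = int((x - lo) / (hi - lo) * k)
        return min(t, k - 1)          # clip the right endpoint into the last bin

    print([quantize(x, 4) for x in (0.0, 0.24, 0.25, 0.9, 1.0)])   # [0, 0, 1, 3, 3]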


• Universal methods are guaranteed to work for an arbitrary distribution.
• But only asymptotically.
• Even beyond the asymptotics they seem to require too much data: exponential in the number of dimensions.
• Specific (parametric) methods reduce the problem to estimation of a few parameters.
• In some cases, the number of parameters is only linear in the dimension.
• But they work only in the specific situations they are designed for.
• In general, you need some “domain knowledge” to be able to choose a good specific method.


Assignment 2

1 MLE and MAP

i) A distribution P on the 3-element set {a, b, c} is defined by two parameters p = P({a}) and q = P({b}). Construct a maximum likelihood estimator (MLE) for these two parameters.

ii) Construct a maximum a posteriori estimator (MAP) for the parameter p of the Bernoulli distribution, assuming a prior distribution over the parameter with density f(p) = 3p² for p ∈ [0, 1] and f(p) = 0 otherwise.

2 Naive Bayes

Implement a Naive Bayes classifier for the “Iris” dataset provided in the first assignment, using quantization for real-valued inputs. Quantize each input using k bins of equal size, with different values of k; report the results.

Implementation details:

– Quantize each feature between 0.0 and 8.0. Use a power of two for the number of bins: try k = 2, 4, 8, ..., 64.
– If for a test set point two or more labels have the same probability, pick one randomly.
– Your program should output the number of errors. Report these values for each k, and discuss the results. Please include a script that can be used to reproduce the experiment, or implement your experiment in the main file.


A tutorial on Principal Components Analysis

Lindsay I Smith

February 26, 2002


Chapter 1

Introduction

This tutorial is designed to give the reader an understanding of Principal Components Analysis (PCA). PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and is a common technique for finding patterns in data of high dimension.

Before getting to a description of PCA, this tutorial first introduces mathematical concepts that will be used in PCA. It covers standard deviation, covariance, eigenvectors and eigenvalues. This background knowledge is meant to make the PCA section very straightforward, but can be skipped if the concepts are already familiar.

There are examples all the way through this tutorial that are meant to illustrate the concepts being discussed. If further information is required, the mathematics textbook “Elementary Linear Algebra 5e” by Howard Anton, Publisher John Wiley & Sons Inc, ISBN 0-471-85223-6 is a good source of information regarding the mathematical background.


Chapter 2

Background Mathematics

Thissectionwill attemptto givesomeelementarybackgroundmathematicalskills thatwill be requiredto understandthe processof Principal ComponentsAnalysis. Thetopicsarecoveredindependentlyof eachother, andexamplesgiven.It is lessimportantto remembertheexactmechanicsof a mathematicaltechniquethanit is to understandthereasonwhy sucha techniquemaybeused,andwhattheresultof theoperationtellsusaboutourdata.Not all of thesetechniquesareusedin PCA,but theonesthatarenotexplicitly requireddo provide thegroundingon which themostimportanttechniquesarebased.

I have includeda sectionon Statisticswhich looks at distribution measurements,or, how the datais spreadout. The othersectionis on Matrix Algebraandlooks ateigenvectorsandeigenvalues,importantpropertiesof matricesthatarefundamentaltoPCA.

2.1 Statistics

Theentiresubjectof statisticsis basedaroundtheideathatyouhavethisbig setof data,andyou want to analysethat set in termsof the relationshipsbetweenthe individualpointsin thatdataset.I amgoingto look at a few of themeasuresyou cando on a setof data,andwhatthey tell youaboutthedataitself.

2.1.1 Standard Deviation

To understand standard deviation, we need a data set. Statisticians are usually concerned with taking a sample of a population. To use election polls as an example, the population is all the people in the country, whereas a sample is a subset of the population that the statisticians measure. The great thing about statistics is that by only measuring (in this case by doing a phone survey or similar) a sample of the population, you can work out what is most likely to be the measurement if you used the entire population. In this statistics section, I am going to assume that our data sets are samples

Page 220: Non-Parametric Methods in Machine Learning

of some bigger population. There is a reference later in this section pointing to more information about samples and populations.

Here’s an example set of numbers, which I will simply refer to with the symbol X. If I want to refer to an individual number in this data set, I will use subscripts on the symbol X to indicate a specific number. E.g. X₃ refers to the 3rd number in X, namely the number 4. Note that X₁ is the first number in the sequence, not X₀ like you may see in some textbooks. Also, the symbol n will be used to refer to the number of elements in the set X.

There are a number of things that we can calculate about a data set. For example, we can calculate the mean of the sample. I assume that the reader understands what the mean of a sample is, and will only give the formula:

X̄ = ( Σ_{i=1}^{n} X_i ) / n

Notice the symbol X̄ (said “X bar”) to indicate the mean of the set X. All this formula says is “Add up all the numbers and then divide by how many there are”.

Unfortunately, the mean doesn’t tell us a lot about the data except for a sort of middle point. For example, these two data sets have exactly the same mean (10), but are obviously quite different:

[0 8 12 20]   and   [8 9 11 12]

So what is different about these two sets? It is the spread of the data that is different. The Standard Deviation (SD) of a data set is a measure of how spread out the data is.

How do we calculate it? The English definition of the SD is: “The average distance from the mean of the data set to a point”. The way to calculate it is to compute the squares of the distance from each data point to the mean of the set, add them all up, divide by (n − 1), and take the positive square root. As a formula:

s = sqrt( Σ_{i=1}^{n} (X_i − X̄)² / (n − 1) )

Where s is the usual symbol for the standard deviation of a sample. I hear you asking “Why are you using (n − 1) and not n?”. Well, the answer is a bit complicated, but in general, if your data set is a sample data set, i.e. you have taken a subset of the real world (like surveying 500 people about the election), then you must use (n − 1) because it turns out that this gives you an answer that is closer to the standard deviation that would result if you had used the entire population, than if you’d used n. If, however, you are not calculating the standard deviation for a sample, but for an entire population, then you should divide by n instead of (n − 1). For further reading on this topic, the web page http://mathcentral.uregina.ca/RR/database/RR.09.95/weston2.html describes standard deviation in a similar way, and also provides an example experiment that shows the

Page 221: Non-Parametric Methods in Machine Learning

Set 1:

X     (X − X̄)   (X − X̄)²
0     −10       100
8     −2        4
12    2         4
20    10        100
Total                208
Divided by (n − 1)   69.333
Square root          8.3266

Set 2:

X     (X − X̄)   (X − X̄)²
8     −2        4
9     −1        1
11    1         1
12    2         4
Total                10
Divided by (n − 1)   3.333
Square root          1.8257

Table 2.1: Calculation of standard deviation

difference between each of the denominators. It also discusses the difference between samples and populations.

So, for our two data sets above, the calculations of standard deviation are in Table 2.1.

And so, as expected, the first set has a much larger standard deviation due to the fact that the data is much more spread out from the mean. Just as another example, the data set [10 10 10 10] also has a mean of 10, but its standard deviation is 0, because all the numbers are the same. None of them deviate from the mean.

2.1.2 Variance

Variance is another measure of the spread of data in a data set. In fact it is almost identical to the standard deviation. The formula is this:

s² = Σ_{i=1}^{n} (X_i − X̄)² / (n − 1)

Page 222: Non-Parametric Methods in Machine Learning

You will notice that this is simply the standard deviation squared, in both the symbol (s²) and the formula (there is no square root in the formula for variance). s² is the usual symbol for the variance of a sample. Both of these measurements are measures of the spread of the data. Standard deviation is the most common measure, but variance is also used. The reason why I have introduced variance in addition to standard deviation is to provide a solid platform from which the next section, covariance, can launch.
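As a quick sanity check of these formulas, the sample statistics of the two example sets above can be computed with numpy (ddof=1 selects the n − 1 denominator used here):

import numpy as np

set1 = np.array([0, 8, 12, 20])
set2 = np.array([8, 9, 11, 12])
for s in (set1, set2):
    print(s.mean(), s.var(ddof=1), s.std(ddof=1))
# set1: mean 10.0, variance 69.333, standard deviation 8.3266
# set2: mean 10.0, variance  3.333, standard deviation 1.8257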

Exercises

Find the mean, standard deviation, and variance for each of these data sets.

• [12 23 34 44 59 70 98]

• [12 15 25 27 32 88 99]

• [15 35 78 82 90 95 97]

2.1.3 Covariance

The last two measures we have looked at are purely 1-dimensional. Data sets like this could be: heights of all the people in the room, marks for the last COMP101 exam, etc. However many data sets have more than one dimension, and the aim of the statistical analysis of these data sets is usually to see if there is any relationship between the dimensions. For example, we might have as our data set both the height of all the students in a class, and the mark they received for that paper. We could then perform statistical analysis to see if the height of a student has any effect on their mark.

Standard deviation and variance only operate on 1 dimension, so that you could only calculate the standard deviation for each dimension of the data set independently of the other dimensions. However, it is useful to have a similar measure to find out how much the dimensions vary from the mean with respect to each other.

Covariance is such a measure. Covariance is always measured between 2 dimensions. If you calculate the covariance between one dimension and itself, you get the variance. So, if you had a 3-dimensional data set (x, y, z), then you could measure the covariance between the x and y dimensions, the x and z dimensions, and the y and z dimensions. Measuring the covariance between x and x, or y and y, or z and z would give you the variance of the x, y and z dimensions respectively.

The formula for covariance is very similar to the formula for variance. The formula for variance could also be written like this:

var(X) = Σ_{i=1}^{n} (X_i − X̄)(X_i − X̄) / (n − 1)

where I have simply expanded the square term to show both parts. So given that knowledge, here is the formula for covariance:

cov(X, Y) = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / (n − 1)

Page 223: Non-Parametric Methods in Machine Learning

Figure 2.1: A plot of the covariance data showing the positive relationship between the number of hours studied and the mark received

It is exactly the same except that in the second set of brackets, the X’s are replaced by Y’s. This says, in English, “For each data item, multiply the difference between the x value and the mean of x by the difference between the y value and the mean of y. Add all these up, and divide by (n − 1)”.

How does this work? Let’s use some example data. Imagine we have gone into the world and collected some 2-dimensional data; say, we have asked a bunch of students how many hours in total they spent studying COSC241, and the mark that they received. So we have two dimensions, the first is the H dimension, the hours studied, and the second is the M dimension, the mark received. Table 2.2 holds my imaginary data, and the calculation of cov(H, M), the covariance between the Hours of study done and the Mark received.

So what does it tell us? The exact value is not as important as its sign (i.e. positive or negative). If the value is positive, as it is here, then that indicates that both dimensions increase together, meaning that, in general, as the number of hours of study increased, so did the final mark.

If the value is negative, then as one dimension increases, the other decreases. If we had ended up with a negative covariance here, then that would have said the opposite, that as the number of hours of study increased the final mark decreased.

In the last case, if the covariance is zero, it indicates that the two dimensions are independent of each other.

The result that the mark given increases as the number of hours studied increases can be easily seen by drawing a graph of the data, as in Figure 2.1. However, the luxury of being able to visualize data is only available at 2 and 3 dimensions. Since the covariance value can be calculated between any 2 dimensions in a data set, this technique is often used to find relationships between dimensions in high-dimensional data sets where visualisation is difficult.

You might ask “is cov(X, Y) equal to cov(Y, X)?”. Well, a quick look at the formula for covariance tells us that yes, they are exactly the same, since the only difference between cov(X, Y) and cov(Y, X) is that (X_i − X̄)(Y_i − Ȳ) is replaced by (Y_i − Ȳ)(X_i − X̄). And since multiplication is commutative, which means that it doesn’t matter which way around I multiply two numbers, I always get the same number, these two equations give the same answer.

2.1.4 The covariance matrix

Recall that covariance is always measured between 2 dimensions. If we have a data set with more than 2 dimensions, there is more than one covariance measurement that can be calculated. For example, from a 3-dimensional data set (dimensions x, y, z) you could calculate cov(x, y), cov(x, z), and cov(y, z). In fact, for an n-dimensional data set, you can calculate n! / ((n − 2)! · 2) different covariance values.

Page 224: Non-Parametric Methods in Machine Learning

Data:

Hours (H)   Mark (M)
9           39
15          56
25          93
14          61
10          50
18          75
0           32
16          85
5           42
19          70
16          66
20          80

Totals      167   749
Averages    13.92 62.42

Covariance:

H    M    (H_i − H̄)  (M_i − M̄)  (H_i − H̄)(M_i − M̄)
9    39   −4.92      −23.42     115.23
15   56   1.08       −6.42      −6.93
25   93   11.08      30.58      338.83
14   61   0.08       −1.42      −0.11
10   50   −3.92      −12.42     48.69
18   75   4.08       12.58      51.33
0    32   −13.92     −30.42     423.45
16   85   2.08       22.58      46.97
5    42   −8.92      −20.42     182.15
19   70   5.08       7.58       38.51
16   66   2.08       3.58       7.45
20   80   6.08       17.58      106.89

Total     1149.89
Average   104.54

Table 2.2: 2-dimensional data set and covariance calculation
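As a quick check, the same covariance can be computed with numpy directly from the table’s data:

import numpy as np

H = np.array([9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20])
M = np.array([39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80])

cov_hm = ((H - H.mean()) * (M - M.mean())).sum() / (len(H) - 1)
print(cov_hm)            # about 104.5, matching the table
print(np.cov(H, M))      # the full 2x2 covariance matrix; entry [0, 1] is cov(H, M)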


Page 225: Non-Parametric Methods in Machine Learning

A useful way to get all the possible covariance values between all the different dimensions is to calculate them all and put them in a matrix. I assume in this tutorial that you are familiar with matrices, and how they can be defined. So, the definition for the covariance matrix for a set of data with n dimensions is:

C^(n×n) = (c_{i,j}),   where   c_{i,j} = cov(Dim_i, Dim_j),

where C^(n×n) is a matrix with n rows and n columns, and Dim_x is the x-th dimension.

All that this ugly looking formula says is that if you have an n-dimensional data set, then the matrix has n rows and columns (so it is square) and each entry in the matrix is the result of calculating the covariance between two separate dimensions. E.g. the entry on row 2, column 3, is the covariance value calculated between the 2nd dimension and the 3rd dimension.

An example. We’ll make up the covariance matrix for an imaginary 3-dimensional data set, using the usual dimensions x, y and z. Then, the covariance matrix has 3 rows and 3 columns, and the values are this:

C = [ cov(x, x)  cov(x, y)  cov(x, z) ]
    [ cov(y, x)  cov(y, y)  cov(y, z) ]
    [ cov(z, x)  cov(z, y)  cov(z, z) ]

Some points to note: Down the main diagonal, you see that the covariance value is between one of the dimensions and itself. These are the variances for that dimension. The other point is that since cov(a, b) = cov(b, a), the matrix is symmetrical about the main diagonal.

Exercises

Work out the covariance between the x and y dimensions in the following 2-dimensional data set, and describe what the result indicates about the data.

Item Number:  1   2   3   4   5
x:            10  39  19  23  28
y:            43  13  32  21  20

Calculate the covariance matrix for this 3-dimensional set of data.

Item Number:  1   2   3
x:            1   −1  4
y:            2   1   3
z:            1   3   −1

2.2 Matrix Algebra

This section serves to provide a background for the matrix algebra required in PCA. Specifically I will be looking at eigenvectors and eigenvalues of a given matrix. Again, I assume a basic knowledge of matrices.

Page 226: Non-Parametric Methods in Machine Learning

[ 2 3 ] × [ 1 ]  =  [ 11 ]
[ 2 1 ]   [ 3 ]     [ 5 ]

[ 2 3 ] × [ 3 ]  =  [ 12 ]  =  4 × [ 3 ]
[ 2 1 ]   [ 2 ]     [ 8 ]          [ 2 ]

Figure 2.2: Example of one non-eigenvector and one eigenvector

2 × [ 3 ]  =  [ 6 ]
    [ 2 ]     [ 4 ]

[ 2 3 ] × [ 6 ]  =  [ 24 ]  =  4 × [ 6 ]
[ 2 1 ]   [ 4 ]     [ 16 ]         [ 4 ]

Figure 2.3: Example of how a scaled eigenvector is still an eigenvector

2.2.1 Eigenvectors

As you know, you can multiply two matrices together, provided they are compatible sizes. Eigenvectors are a special case of this. Consider the two multiplications between a matrix and a vector in Figure 2.2.

In the first example, the resulting vector is not an integer multiple of the original vector, whereas in the second example, the result is exactly 4 times the vector we began with. Why is this? Well, the vector is a vector in 2-dimensional space. The vector (3, 2) (from the second example multiplication) represents an arrow pointing from the origin, (0, 0), to the point (3, 2). The other matrix, the square one, can be thought of as a transformation matrix. If you multiply this matrix on the left of a vector, the answer is another vector that is transformed from its original position.

It is the nature of the transformation that the eigenvectors arise from. Imagine a transformation matrix that, when multiplied on the left, reflected vectors in the line y = x. Then you can see that if there were a vector that lay on the line y = x, its reflection is itself. This vector (and all multiples of it, because it wouldn’t matter how long the vector was) would be an eigenvector of that transformation matrix.

What properties do these eigenvectors have? You should first know that eigenvectors can only be found for square matrices. And, not every square matrix has eigenvectors. And, given an n × n matrix that does have eigenvectors, there are n of them. Given a 3 × 3 matrix, there are 3 eigenvectors.

Another property of eigenvectors is that even if I scale the vector by some amount before I multiply it, I still get the same multiple of it as a result, as in Figure 2.3. This is because if you scale a vector by some amount, all you are doing is making it longer,

Page 227: Non-Parametric Methods in Machine Learning

not changingit’s direction. Lastly, all theeigenvectorsof a matrix areperpendicular,ie. at right anglesto eachother, nomatterhow many dimensionsyouhave. By theway,anotherwordfor perpendicular, in mathstalk, is orthogonal. This is importantbecauseit meansthat you canexpressthe datain termsof theseperpendiculareigenvectors,insteadof expressingthemin termsof the < and= axes.We will bedoingthis laterinthesectionon PCA.

Another important thing to know is that when mathematicians find eigenvectors, they like to find the eigenvectors whose length is exactly one. This is because, as you know, the length of a vector doesn’t affect whether it’s an eigenvector or not, whereas the direction does. So, in order to keep eigenvectors standard, whenever we find an eigenvector we usually scale it to make it have a length of 1, so that all eigenvectors have the same length. Here’s a demonstration from our example above.

(3, 2) is an eigenvector, and the length of that vector is

sqrt(3² + 2²) = sqrt(13)

so we divide the original vector by this much to make it have a length of 1:

(3, 2) / sqrt(13) = (3/sqrt(13), 2/sqrt(13))

How does one go about finding these mystical eigenvectors? Unfortunately, it’s only easy(ish) if you have a rather small matrix, like no bigger than about 3 × 3. After that, the usual way to find the eigenvectors is by some complicated iterative method which is beyond the scope of this tutorial (and this author). If you ever need to find the eigenvectors of a matrix in a program, just find a maths library that does it all for you. A useful maths package, called newmat, is available at http://webnz.com/robert/ .

Further information about eigenvectors in general, how to find them, and orthogonality, can be found in the textbook “Elementary Linear Algebra 5e” by Howard Anton, Publisher John Wiley & Sons Inc, ISBN 0-471-85223-6.
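For example, in numpy (a maths library of this kind), using the 2 × 2 transformation matrix of Figure 2.2 as reconstructed above:

import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])
values, vectors = np.linalg.eig(A)
print(values)    # the eigenvalues, e.g. 4 and -1
print(vectors)   # unit-length eigenvectors, one per column
# Check the defining property A v = lambda v for the first pair:
v = vectors[:, 0]
print(A @ v, values[0] * v)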

2.2.2 Eigenvalues

Eigenvalues are closely related to eigenvectors; in fact, we saw an eigenvalue in Figure 2.2. Notice how, in both those examples, the amount by which the original vector was scaled after multiplication by the square matrix was the same? In that example, the value was 4. 4 is the eigenvalue associated with that eigenvector. No matter what multiple of the eigenvector we took before we multiplied it by the square matrix, we would always get 4 times the scaled vector as our result (as in Figure 2.3).

So you can see that eigenvectors and eigenvalues always come in pairs. When you get a fancy programming library to calculate your eigenvectors for you, you usually get the eigenvalues as well.

Page 228: Non-Parametric Methods in Machine Learning

Exercises

For the following square matrix:

(a 3 × 3 matrix — its entries did not survive the extraction)

decide which, if any, of the following vectors are eigenvectors of that matrix and give the corresponding eigenvalue.

(candidate vectors — not recoverable from the extraction)

Page 229: Non-Parametric Methods in Machine Learning

Chapter 3

Principal Components Analysis

Finally we come to Principal Components Analysis (PCA). What is it? It is a way of identifying patterns in data, and expressing the data in such a way as to highlight their similarities and differences. Since patterns in data can be hard to find in data of high dimension, where the luxury of graphical representation is not available, PCA is a powerful tool for analysing data.

The other main advantage of PCA is that once you have found these patterns in the data, you can compress the data, i.e. reduce the number of dimensions, without much loss of information. This technique is used in image compression, as we will see in a later section.

This chapter will take you through the steps you need to perform a Principal Components Analysis on a set of data. I am not going to describe exactly why the technique works, but I will try to provide an explanation of what is happening at each point so that you can make informed decisions when you try to use this technique yourself.

3.1 Method

Step 1: Get some data

In my simple example, I am going to use my own made-up data set. It’s only got 2 dimensions, and the reason why I have chosen this is so that I can provide plots of the data to show what the PCA analysis is doing at each step.

The data I have used is found in Figure 3.1, along with a plot of that data.

Step 2: Subtract the mean

For PCA to work properly, you have to subtract the mean from each of the data dimensions. The mean subtracted is the average across each dimension. So, all the x values have x̄ (the mean of the x values of all the data points) subtracted, and all the y values have ȳ subtracted from them. This produces a data set whose mean is zero.

Page 230: Non-Parametric Methods in Machine Learning

Data =

x     y
2.5   2.4
0.5   0.7
2.2   2.9
1.9   2.2
3.1   3.0
2.3   2.7
2     1.6
1     1.1
1.5   1.6
1.1   0.9

DataAdjust =

x      y
.69    .49
-1.31  -1.21
.39    .99
.09    .29
1.29   1.09
.49    .79
.19    -.31
-.81   -.81
-.31   -.31
-.71   -1.01

Figure 3.1: PCA example data, original data on the left, data with the means subtracted on the right, and a plot of the data ("./PCAdata.dat")

Page 231: Non-Parametric Methods in Machine Learning

Step 3: Calculate the covariance matrix

This is done in exactly the same way as was discussed in section 2.1.4. Since the data is 2-dimensional, the covariance matrix will be 2 × 2. There are no surprises here, so I will just give you the result:

cov = [ .616555556   .615444444 ]
      [ .615444444   .716555556 ]

So, since the non-diagonal elements in this covariance matrix are positive, we should expect that both the x and y variables increase together.

Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix

Since the covariance matrix is square, we can calculate the eigenvectors and eigenvalues for this matrix. These are rather important, as they tell us useful information about our data. I will show you why soon. In the meantime, here are the eigenvectors and eigenvalues:

eigenvalues = [ .0490833989 ]
              [ 1.28402771  ]

eigenvectors = [ -.735178656   -.677873399 ]
               [  .677873399   -.735178656 ]

It is important to notice that these eigenvectors are both unit eigenvectors, i.e. their lengths are both 1. This is very important for PCA, but luckily, most maths packages, when asked for eigenvectors, will give you unit eigenvectors.

So what do they mean? If you look at the plot of the data in Figure 3.2 then you can see how the data has quite a strong pattern. As expected from the covariance matrix, the two variables do indeed increase together. On top of the data I have plotted both the eigenvectors as well. They appear as diagonal dotted lines on the plot. As stated in the eigenvector section, they are perpendicular to each other. But, more importantly, they provide us with information about the patterns in the data. See how one of the eigenvectors goes through the middle of the points, like drawing a line of best fit? That eigenvector is showing us how these two data sets are related along that line. The second eigenvector gives us the other, less important, pattern in the data: that all the points follow the main line, but are off to the side of the main line by some amount.

So, by this process of taking the eigenvectors of the covariance matrix, we have been able to extract lines that characterise the data. The rest of the steps involve transforming the data so that it is expressed in terms of these lines.

Step 5: Choosing components and forming a feature vector

Here is where the notion of data compression and reduced dimensionality comes into it. If you look at the eigenvectors and eigenvalues from the previous section, you

Page 232: Non-Parametric Methods in Machine Learning

Figure 3.2: A plot of the normalised data (mean subtracted) with the eigenvectors of the covariance matrix overlayed on top.

Page 233: Non-Parametric Methods in Machine Learning

will notice that the eigenvalues are quite different values. In fact, it turns out that the eigenvector with the highest eigenvalue is the principal component of the data set. In our example, the eigenvector with the largest eigenvalue was the one that pointed down the middle of the data. It is the most significant relationship between the data dimensions.

In general, once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives you the components in order of significance. Now, if you like, you can decide to ignore the components of lesser significance. You do lose some information, but if the eigenvalues are small, you don’t lose much. If you leave out some components, the final data set will have fewer dimensions than the original. To be precise, if you originally have n dimensions in your data, and so you calculate n eigenvectors and eigenvalues, and then you choose only the first p eigenvectors, then the final data set has only p dimensions.

What needs to be done now is that you need to form a feature vector, which is just a fancy name for a matrix of vectors. This is constructed by taking the eigenvectors that you want to keep from the list of eigenvectors, and forming a matrix with these eigenvectors in the columns:

FeatureVector = ( eig_1  eig_2  eig_3  ...  eig_n )

Given our example set of data, and the fact that we have 2 eigenvectors, we have two choices. We can either form a feature vector with both of the eigenvectors:

[ -.677873399   -.735178656 ]
[ -.735178656    .677873399 ]

or, we can choose to leave out the smaller, less significant component and only have a single column:

[ -.677873399 ]
[ -.735178656 ]

We shall see the result of each of these in the next section.

Step 6: Deriving the new data set

This is the final step in PCA, and is also the easiest. Once we have chosen the components (eigenvectors) that we wish to keep in our data and formed a feature vector, we simply take the transpose of the vector and multiply it on the left of the original data set, transposed:

FinalData = RowFeatureVector × RowDataAdjust,

where RowFeatureVector is the matrix with the eigenvectors in the columns transposed, so that the eigenvectors are now in the rows, with the most significant eigenvector at the top, and RowDataAdjust is the mean-adjusted data transposed, i.e. the data items are in each column, with each row holding a separate dimension. I’m sorry if this sudden transpose of all our data confuses you, but the equations from here on are

Page 234: Non-Parametric Methods in Machine Learning

easier if we take the transpose of the feature vector and the data first, rather than having a little T symbol above their names from now on.

FinalData is the final data set, with data items in columns, and dimensions along rows.

What will this give us? It will give us the original data solely in terms of the vectors we chose. Our original data set had two axes, x and y, so our data was in terms of them. It is possible to express data in terms of any two axes that you like. If these axes are perpendicular, then the expression is the most efficient. This was why it was important that eigenvectors are always perpendicular to each other. We have changed our data from being in terms of the axes x and y, and now they are in terms of our 2 eigenvectors. In the case where the new data set has reduced dimensionality, i.e. we have left some of the eigenvectors out, the new data is only in terms of the vectors that we decided to keep.

To show this on our data, I have done the final transformation with each of the possible feature vectors. I have taken the transpose of the result in each case to bring the data back to the nice table-like format. I have also plotted the final points to show how they relate to the components.

In the case of keeping both eigenvectors for the transformation, we get the data and the plot found in Figure 3.3. This plot is basically the original data, rotated so that the eigenvectors are the axes. This is understandable since we have lost no information in this decomposition.

The other transformation we can make is by taking only the eigenvector with the largest eigenvalue. The table of data resulting from that is found in Figure 3.4. As expected, it only has a single dimension. If you compare this data set with the one resulting from using both eigenvectors, you will notice that this data set is exactly the first column of the other. So, if you were to plot this data, it would be 1-dimensional, and would be points on a line in exactly the x positions of the points in the plot in Figure 3.3. We have effectively thrown away the whole other axis, which is the other eigenvector.

So what have we done here? Basically we have transformed our data so that it is expressed in terms of the patterns between them, where the patterns are the lines that most closely describe the relationships between the data. This is helpful because we have now classified our data point as a combination of the contributions from each of those lines. Initially we had the simple x and y axes. This is fine, but the x and y values of each data point don’t really tell us exactly how that point relates to the rest of the data. Now, the values of the data points tell us exactly where (i.e. above/below) the trend lines the data point sits. In the case of the transformation using both eigenvectors, we have simply altered the data so that it is in terms of those eigenvectors instead of the usual axes. But the single-eigenvector decomposition has removed the contribution due to the smaller eigenvector and left us with data that is only in terms of the other.
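For reference, here is a minimal numpy sketch of the whole procedure applied to the example data set of this chapter; it should reproduce the covariance matrix, the eigenvalues, and the transformed data quoted above, up to the sign of the eigenvectors:

import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

mean = data.mean(axis=0)
data_adjust = data - mean                       # Step 2: subtract the mean
cov = np.cov(data_adjust, rowvar=False)         # Step 3: ~[[.6166, .6154], [.6154, .7166]]
eig_values, eig_vectors = np.linalg.eigh(cov)   # Step 4: eigenvalues ~0.0491 and ~1.2840

order = np.argsort(eig_values)[::-1]            # Step 5: sort by eigenvalue, largest first
feature_vector = eig_vectors[:, order]          # keep both columns, or only the first for 1-D data

final_data = (feature_vector.T @ data_adjust.T).T   # Step 6: FinalData, one row per data item
print(final_data[:, 0])                         # the single-eigenvector data of Figure 3.4 (up to sign)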

3.1.1 Getting the old data back

Wanting to get the original data back is obviously of great concern if you are using the PCA transform for data compression (an example of which we will see in the next section). This content is taken from http://www.vision.auc.dk/sig/Teaching/Flerdim/Current/hotelling/hotelling.html

Page 235: Non-Parametric Methods in Machine Learning

TransformedData =

x              y
-.827970186    -.175115307
1.77758033     .142857227
-.992197494    .384374989
-.274210416    .130417207
-1.67580142    -.209498461
-.912949103    .175282444
.0991094375    -.349824698
1.14457216     .0464172582
.438046137     .0177646297
1.22382056     -.162675287

Figure 3.3: The table of data obtained by applying the PCA analysis using both eigenvectors, and a plot of the new data points.

Page 236: Non-Parametric Methods in Machine Learning

TransformedData (single eigenvector)

x
-.827970186
1.77758033
-.992197494
-.274210416
-1.67580142
-.912949103
.0991094375
1.14457216
.438046137
1.22382056

Figure 3.4: The data after transforming using only the most significant eigenvector

So, how do we get the original data back? Before we do that, remember that only if we took all the eigenvectors in our transformation will we get exactly the original data back. If we have reduced the number of eigenvectors in the final transformation, then the retrieved data has lost some information.

Recall that the final transform is this:

FinalData = RowFeatureVector × RowDataAdjust,

which can be turned around so that, to get the original data back,

RowDataAdjust = RowFeatureVector⁻¹ × FinalData

where RowFeatureVector⁻¹ is the inverse of RowFeatureVector. However, when we take all the eigenvectors in our feature vector, it turns out that the inverse of our feature vector is actually equal to the transpose of our feature vector. This is only true because the elements of the matrix are all the unit eigenvectors of our data set. This makes the return trip to our data easier, because the equation becomes

RowDataAdjust = RowFeatureVectorᵀ × FinalData

But, to get the actual original data back, we need to add on the mean of that original data (remember we subtracted it right at the start). So, for completeness,

RowOriginalData = (RowFeatureVectorᵀ × FinalData) + OriginalMean

This formula also applies when you do not have all the eigenvectors in the feature vector. So even when you leave out some eigenvectors, the above equation still makes the correct transform.
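Continuing the sketch given earlier (and reusing its variable names, which are only illustrative), the lossy reconstruction from a single eigenvector looks like this:

row_feature = feature_vector[:, :1].T           # keep only the most significant eigenvector
reduced = row_feature @ data_adjust.T           # 1-D FinalData, one value per data item

restored = (row_feature.T @ reduced).T + mean   # RowFeatureVector^T * FinalData, plus the mean
print(restored)                                 # points lying exactly on the principal trend line (Figure 3.5)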

I will not perform the data re-creation using the complete feature vector, because the result is exactly the data we started with. However, I will do it with the reduced feature vector to show you how information has been lost. Figure 3.5 shows this plot. Compare

Page 237: Non-Parametric Methods in Machine Learning

Figure 3.5: The reconstruction from the data that was derived using only a single eigenvector ("./lossyplusmean.dat")

it to the original data plot in Figure 3.1 and you will notice how, while the variation along the principal eigenvector (see Figure 3.2 for the eigenvectors overlayed on top of the mean-adjusted data) has been kept, the variation along the other component (the other eigenvector that we left out) has gone.

Exercises

• What do the eigenvectors of the covariance matrix give us?

• At what point in the PCA process can we decide to compress the data? What effect does this have?

• For an example of PCA and a graphical representation of the principal eigenvectors, research the topic ‘Eigenfaces’, which uses PCA to do facial recognition.

Page 238: Non-Parametric Methods in Machine Learning

Chapter 4

Application to Computer Vision

This chapter will outline the way that PCA is used in computer vision, first showing how images are usually represented, and then showing what PCA can allow us to do with those images. The information in this section regarding facial recognition comes from “Face Recognition: Eigenface, Elastic Matching, and Neural Nets”, Jun Zhang et al., Proceedings of the IEEE, Vol. 85, No. 9, September 1997. The representation information is taken from “Digital Image Processing”, Rafael C. Gonzalez and Paul Wintz, Addison-Wesley Publishing Company, 1987, which is also an excellent reference for further information on the K-L transform in general. The image compression information is taken from http://www.vision.auc.dk/sig/Teaching/Flerdim/Current/hotelling/hotelling.html, which also provides examples of image reconstruction using a varying number of eigenvectors.

4.1 Representation

When using these sorts of matrix techniques in computer vision, we must consider the representation of images. A square, N by N image can be expressed as an N²-dimensional vector

X = ( x_1  x_2  x_3  ...  x_{N²} )

where the rows of pixels in the image are placed one after the other to form a one-dimensional image. E.g. the first N elements (x_1 to x_N) will be the first row of the image, the next N elements are the next row, and so on. The values in the vector are the intensity values of the image, possibly a single greyscale value.

4.2 PCA to find patterns

Say we have 20 images. Each image is N pixels high by N pixels wide. For each image we can create an image vector as described in the representation section. We can then put all the images together in one big image-matrix like this:

Page 239: Non-Parametric Methods in Machine Learning

ImagesMatrix = [ ImageVec 1  ]
               [ ImageVec 2  ]
               [     ...     ]
               [ ImageVec 20 ]

which gives us a starting point for our PCA analysis. Once we have performed PCA, we have our original data in terms of the eigenvectors we found from the covariance matrix. Why is this useful? Say we want to do facial recognition, and so our original images were of people’s faces. Then, the problem is, given a new image, whose face from the original set is it? (Note that the new image is not one of the 20 we started with.) The way this is done in computer vision is to measure the difference between the new image and the original images, but not along the original axes — along the new axes derived from the PCA analysis.

It turns out that these axes work much better for recognising faces, because the PCA analysis has given us the original images in terms of the differences and similarities between them. The PCA analysis has identified the statistical patterns in the data.

Since all the vectors are N²-dimensional, we will get N² eigenvectors. In practice, we are able to leave out some of the less significant eigenvectors, and the recognition still performs well.

4.3 PCA for image compression

Using PCA for image compression is also known as the Hotelling, or Karhunen-Loève (KL), transform. If we have 20 images, each with N² pixels, we can form N² vectors, each with 20 dimensions. Each vector consists of all the intensity values from the same pixel from each picture. This is different from the previous example because before we had a vector for each image, and each item in that vector was a different pixel, whereas now we have a vector for each pixel, and each item in the vector is from a different image.

Now we perform the PCA on this set of data. We will get 20 eigenvectors because each vector is 20-dimensional. To compress the data, we can then choose to transform the data using only, say, 15 of the eigenvectors. This gives us a final data set with only 15 dimensions, which has saved us 1/4 of the space. However, when the original data is reproduced, the images have lost some of the information. This compression technique is said to be lossy because the decompressed image is not exactly the same as the original, generally worse.

Page 240: Non-Parametric Methods in Machine Learning

Appendix A

Implementation Code

This is code for use in Scilab, a freeware alternative to Matlab. I used this code to generate all the examples in the text. Apart from the first macro, all the rest were written by me.

// This macro taken from
// http://www.cs.montana.edu/~harkin/courses/cs530/scilab/macros/cov.sci
// No alterations made

// Return the covariance matrix of the data in x, where each column of x
// is one dimension of an n-dimensional data set. That is, x has x columns
// and m rows, and each row is one sample.
//
// For example, if x is three dimensional and there are 4 samples.
// x = [1 2 3;4 5 6;7 8 9;10 11 12]
// c = cov (x)

function [c]=cov (x)
  // Get the size of the array
  sizex=size(x);
  // Get the mean of each column
  meanx = mean (x, "r");
  // For each pair of variables, x1, x2, calculate
  // sum ((x1 - meanx1)(x2-meanx2))/(m-1)
  for var = 1:sizex(2),
    x1 = x(:,var);
    mx1 = meanx (var);
    for ct = var:sizex (2),
      x2 = x(:,ct);
      mx2 = meanx (ct);
      v = ((x1 - mx1)' * (x2 - mx2))/(sizex(1) - 1);


Page 241: Non-Parametric Methods in Machine Learning

      cv(var,ct) = v;
      cv(ct,var) = v;
      // do the lower part of c also.
    end,
  end,
  c=cv;

// This is a simple wrapper function to get just the eigenvectors
// since the system call returns 3 matrices
function [x]=justeigs (x)
  // This just returns the eigenvectors of the matrix
  [a, eig, b] = bdiag(x);
  x = eig;

// This function makes the transformation to the eigenspace for PCA
// parameters:
//   adjusteddata = mean-adjusted data set
//   eigenvectors = SORTED eigenvectors (by eigenvalue)
//   dimensions   = how many eigenvectors you wish to keep
//
// The first two parameters can come from the result of calling
// PCAprepare on your data.
// The last is up to you.
function [finaldata] = PCAtransform(adjusteddata,eigenvectors,dimensions)
  finaleigs = eigenvectors(:,1:dimensions);
  prefinaldata = finaleigs'*adjusteddata';
  finaldata = prefinaldata';

// This function does the preparation for PCA analysis
// It adjusts the data to subtract the mean, finds the covariance matrix,
// and finds normal eigenvectors of that covariance matrix.
// It returns 4 matrices
//   meanadjusted = the mean-adjusted data set
//   covmat       = the covariance matrix of the data
//   eigvalues    = the eigenvalues of the covariance matrix, IN SORTED ORDER
//   normaleigs   = the normalised eigenvectors of the covariance matrix,
//                  IN SORTED ORDER WITH RESPECT TO THEIR EIGENVALUES,
//                  for selection for the feature vector.


Page 242: Non-Parametric Methods in Machine Learning

//
// NOTE: This function cannot handle data sets that have any eigenvalues
// equal to zero. It's got something to do with the way that scilab treats
// the empty matrix and zeros.
//
function [meanadjusted,covmat,sorteigvalues,sortnormaleigs] = PCAprepare (data)
  // Calculates the mean adjusted matrix, only for 2 dimensional data
  means = mean(data,"r");
  meanadjusted = meanadjust(data);
  covmat = cov(meanadjusted);
  eigvalues = spec(covmat);
  normaleigs = justeigs(covmat);
  sorteigvalues = sorteigvectors(eigvalues',eigvalues');
  sortnormaleigs = sorteigvectors(eigvalues',normaleigs);

// This removes a specified column from a matrix
//   A = the matrix
//   n = the column number you wish to remove
function [columnremoved] = removecolumn(A,n)
  inputsize = size(A);
  numcols = inputsize(2);
  temp = A(:,1:(n-1));
  for var = 1:(numcols - n)
    temp(:,(n+var)-1) = A(:,(n+var));
  end,
  columnremoved = temp;

// This finds the column number that has the
// highest value in its first row.
function [column] = highestvalcolumn(A)
  inputsize = size(A);
  numcols = inputsize(2);
  maxval = A(1,1);
  maxcol = 1;
  for var = 2:numcols
    if A(1,var) > maxval
      maxval = A(1,var);
      maxcol = var;
    end,
  end,
  column = maxcol


Page 243: Non-Parametric Methods in Machine Learning

// This sorts a matrix of vectors, based on the values of
// another matrix
//
//   values  = the list of eigenvalues (1 per column)
//   vectors = the list of eigenvectors (1 per column)
//
// NOTE: The values should correspond to the vectors
// so that the value in column x corresponds to the vector
// in column x.
function [sortedvecs] = sorteigvectors(values,vectors)
  inputsize = size(values);
  numcols = inputsize(2);
  highcol = highestvalcolumn(values);
  sorted = vectors(:,highcol);
  remainvec = removecolumn(vectors,highcol);
  remainval = removecolumn(values,highcol);
  for var = 2:numcols
    highcol = highestvalcolumn(remainval);
    sorted(:,var) = remainvec(:,highcol);
    remainvec = removecolumn(remainvec,highcol);
    remainval = removecolumn(remainval,highcol);
  end,
  sortedvecs = sorted;

// This takes a set of data, and subtracts
// the column mean from each column.
function [meanadjusted] = meanadjust(Data)
  inputsize = size(Data);
  numcols = inputsize(2);
  means = mean(Data,"r");
  tmpmeanadjusted = Data(:,1) - means(:,1);
  for var = 2:numcols
    tmpmeanadjusted(:,var) = Data(:,var) - means(:,var);
  end,
  meanadjusted = tmpmeanadjusted


Page 244: Non-Parametric Methods in Machine Learning

Dimensionality Reduction:

PCA & LDA

Faustino Gomez

IDSIA

• Data can be (very) high-dimensional

Page 245: Non-Parametric Methods in Machine Learning

Why is high-dimensionality a problem?

• Computationally intractable

• Intrinsic dimensionality may be lower

• Redundant/irrelevant information

• Visualization and comprehensibility

Solution: extract “features” from high-dimensional

data; project data onto lower dimensional feature-space

What is Feature Extraction?

Instance x described by an n-dimensional feature vector (x_1, x_2, ..., x_n)

Feature Extraction: f(x_1, x_2, ..., x_n)

Extracted m-dimensional feature vector y = (y_1, y_2, ..., y_m), with m < n

Page 246: Non-Parametric Methods in Machine Learning

Dimensionality Reduction

Methods

• Unsupervised
  – Preserve statistical/structural properties (e.g. variance)
  – No class information

• Supervised
  – Uses class information
  – Maximize separability

Types of “learning”

Dimensionality Reduction Methods

• Principal Component Analysis (PCA)
  – linear, unsupervised, maximizes preserved variance

• Linear Discriminant Analysis (LDA)
  – linear, supervised, maximizes class separability

• Neural networks
  – non-linear, both supervised and unsupervised are possible

• Other methods
  – Independent Component Analysis, Self Organising Maps, Principal curves, Sammon’s mapping

Page 247: Non-Parametric Methods in Machine Learning

Finding structure in data

Principal Component Analysis

• Basic idea: find a new set of axes (basis) that concentrates the most variance in the fewest components (new axes).

• Project points onto just the “principal” components = fewer dimensions!

Page 248: Non-Parametric Methods in Machine Learning

Find a new basis that accounts for most of the variance

Overview of PCA Algorithm

1. Normalize data
2. Compute covariance matrix
3. Compute eigenvectors of covariance matrix
4. Eigenvectors are the components that define the new basis
5. Eigenvalues indicate the importance of each component
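A compact sketch of these five steps in numpy (the function name pca and its arguments are just for illustration; X is an (N, d) array of data points, K the number of components to keep):

import numpy as np

def pca(X, K):
    mu = X.mean(axis=0)
    Xc = X - mu                                   # 1. normalize (centre) the data
    C = np.cov(Xc, rowvar=False)                  # 2. d x d covariance matrix
    eig_values, eig_vectors = np.linalg.eigh(C)   # 3. eigenvectors of the covariance matrix
    order = np.argsort(eig_values)[::-1][:K]      # 4./5. keep the K components with largest eigenvalues
    W = eig_vectors[:, order]                     # the new basis, one component per column
    return Xc @ W, W, eig_values[order]           # projected data, components, their eigenvalues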

Page 249: Non-Parametric Methods in Machine Learning

Mathematical background

Let X be a set of N d-dimensional vectors, with i = 1..d indexing dimensions and j = 1..N indexing data points.

Mean of dimension i:

μ_i = ( Σ_{j=1}^{N} x_{ij} ) / N

Variance of dimension i:

s_i² = ( Σ_{j=1}^{N} (x_{ij} − μ_i)(x_{ij} − μ_i) ) / (N − 1)

Mathematical background

Variance:

var(X_i) = ( Σ_{j=1}^{N} (x_{ij} − μ_i)(x_{ij} − μ_i) ) / (N − 1)

Covariance:

cov(X_i, X_k) = ( Σ_{j=1}^{N} (x_{ij} − μ_i)(x_{kj} − μ_k) ) / (N − 1)

Page 250: Non-Parametric Methods in Machine Learning

Mathematical background

• Covariance is a pair-wise measure

• Given d dimensions, we have d × d covariances

Covariance matrix for d = 3:

[ cov(X1,X1)  cov(X1,X2)  cov(X1,X3) ]
[ cov(X2,X1)  cov(X2,X2)  cov(X2,X3) ]
[ cov(X3,X1)  cov(X3,X2)  cov(X3,X3) ]

Mathematical background

• A square d × d matrix will have d orthogonal eigenvectors

• The normalized eigenvectors are the “components” used in PCA

Page 251: Non-Parametric Methods in Machine Learning

PCA algorithm

Find a new basis for the data set.

Representation using the standard basis:

x = x_1·(1, 0, 0)ᵀ + x_2·(0, 1, 0)ᵀ + x_3·(0, 0, 1)ᵀ

Example:

x = (1, 2, 3)ᵀ = 1·(1, 0, 0)ᵀ + 2·(0, 1, 0)ᵀ + 3·(0, 0, 1)ᵀ

PCA algorithm

Find a new basis for the data set.

Representation using a different basis:

x = (1, 2, 3)ᵀ = 1·(0, 0, 1)ᵀ + 1·(0, 1, 1)ᵀ + 1·(1, 1, 1)ᵀ

and another (orthogonal):

x = (1, 2, 3)ᵀ = 1.5·(1, 1, 0)ᵀ + 3·(0, 0, 1)ᵀ − 0.5·(1, −1, 0)ᵀ

Page 252: Non-Parametric Methods in Machine Learning

PCA: original data set

N data points (rows), d dimensions (columns):

x^1 = [ x_1^1  x_2^1  x_3^1  x_4^1  x_5^1  ...  x_d^1 ]
x^2 = [ x_1^2  x_2^2  x_3^2  x_4^2  x_5^2  ...  x_d^2 ]
x^3 = [ x_1^3  x_2^3  x_3^3  x_4^3  x_5^3  ...  x_d^3 ]
  ...
x^N = [ x_1^N  x_2^N  x_3^N  x_4^N  x_5^N  ...  x_d^N ]

PCA

Step 1: normalize the data set

x̂^j = [ x̂_1^j  x̂_2^j  x̂_3^j  ...  x̂_d^j ],   where   x̂_i^j = x_i^j − μ_i   and   μ_i = (1/N) Σ_{j=1}^{N} x_i^j

Page 253: Non-Parametric Methods in Machine Learning

PCA algorithm

Step 2: compute the d-dimensional covariance matrix

cov(X_i, X_k) = ( Σ_{j=1}^{N} (x_{ij} − μ_i)(x_{kj} − μ_k) ) / (N − 1)   for all i, k

[ cov(X1,X1)  cov(X1,X2)  ...  cov(X1,Xd) ]
[ cov(X2,X1)  cov(X2,X2)  ...  cov(X2,Xd) ]
[    ...         ...      ...      ...    ]
[ cov(Xd,X1)  cov(Xd,X2)  ...  cov(Xd,Xd) ]

Step 3: compute the eigenvectors of the covariance matrix

PCA algorithm

Step 4: select the K eigenvectors with the largest eigenvalues. These will be the “principal components” or features, the new basis.

If K = d, then the data is just represented using the new basis.

If K < d, then information is lost, but since we ignore the least significant eigenvectors, we only eliminate the components with the least variance: visualization and compression.

Page 254: Non-Parametric Methods in Machine Learning

PCA compression

Eigenvalue spectrum: plot the eigenvalues λ_i sorted from largest to smallest, i = 1..N, and choose a cutoff K, keeping only the first K components.

Limitations of PCA

• The linear PCA mapping is only optimal for a linear reconstruction

• Possible non-linear relations are not used.

→ The dimensionality of the extracted feature vectors will be higher than the minimal required dimensionality (the intrinsic dimensionality)

Page 255: Non-Parametric Methods in Machine Learning

Linear Discriminant Analysis

Find a projection that:

- Minimizes distances within classes

- Maximizes distances between classes

Main goal is to optimize the extracted

features for the purpose of classification

Linear Discriminant Analysis

- Suppose we have C classes

- Let μ⃗_i be the mean vector of class i

- Let M_i be the number of samples in class i

Page 256: Non-Parametric Methods in Machine Learning

PCA vs LDA

Linear Discriminant Analysis: Scatter Matrices

Distances within classes (within-class scatter matrix), where i now refers to the class and j to the data point within class i:

S_w = Σ_{i=1}^{C} Σ_{j=1}^{M_i} (x_j − μ⃗_i)(x_j − μ⃗_i)ᵀ

Distances between classes (between-class scatter matrix):

S_b = Σ_{i=1}^{C} (μ⃗_i − μ⃗)(μ⃗_i − μ⃗)ᵀ

(μ⃗_i is the mean vector of class i; μ⃗ is the mean vector regardless of class)

Measures for class separability:

trace(S_b) / trace(S_w)   and   trace(S_w⁻¹ S_b)

Page 257: Non-Parametric Methods in Machine Learning

Linear Discriminant Analysis

LDA finds a linear projection onto the subspace spanned by the m largest eigenvectors of S_w⁻¹ S_b.

This is optimal for m + 1 linearly separable classes.

(Figure: two classes A and B in the (x1, x2) plane, comparing the PCA projection direction with the LDA projection directions f1 and f2.)
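As a rough illustration of these two steps (the scatter matrices from the previous slide and the eigenvectors of S_w⁻¹ S_b), here is a minimal numpy sketch; X, y and m are assumed inputs (data matrix, class labels, number of discriminant directions), and S_b is built in the unweighted form given on the previous slide:

import numpy as np

def lda(X, y, m):
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)          # within-class scatter
        diff = (mu_c - overall_mean).reshape(-1, 1)
        Sb += diff @ diff.T                        # between-class scatter
    # eigenvectors of Sw^{-1} Sb (may come back complex for numerical reasons; keep the real part)
    eig_values, eig_vectors = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eig_values.real)[::-1][:m]
    W = eig_vectors[:, order].real
    return X @ W                                   # data projected onto the m discriminant directions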

Page 258: Non-Parametric Methods in Machine Learning

Assignment 2 - Part 1

October 17, 2007

1 Maximum Likelihood (25 points)

Consider a sample of n trials, j of which being a's and k being b's. The number of c's will be n − j − k. As the probabilities of the possible symbols have to sum to 1, the probability of c will be 1 − p − q. As the events are independent, the probability of the sample can be expressed as the product of the probabilities of the single symbols. We can then express the likelihood of the parameters (p, q) given the data (i.e. the probability of the data given the parameters) as

L(x1, x2, ..., xn | p, q) = P(x1 | p, q) P(x2 | p, q) ... P(xn | p, q) = p^j q^k (1 − p − q)^(n − j − k)   (1)

We are asked to estimate the values of p and q according to the Maximum Likelihood criterion. This means that we have to pick the values of p and q that maximize (1); or, as the log function is strictly increasing, the values that maximize the logarithm of (1):

(p, q) = argmax_{(p,q)} L = argmax_{(p,q)} log L   (2)

Let us evaluate it:

log L = j log p + k log q + (n − j − k) log(1 − p − q)   (3)

In order to maximize it we will need to set its gradient to 0, and check which solution is the real maximum in case of multiple solutions.

∂ log L / ∂p = j/p − (n − j − k)/(1 − p − q) = 0   (4)

∂ log L / ∂q = k/q − (n − j − k)/(1 − p − q) = 0   (5)

Equations (4, 5) form a system with a single solution at (p, q) = (j/n, k/n), which is also intuitively correct if we compare it with the ML estimator for the Bernoulli distribution. The same solution can also be obtained in a much more complicated way by differentiating L directly.

Page 259: Non-Parametric Methods in Machine Learning

2 Maximum A Posteriori (25 points)

We have to estimate the value of p = Pr{0} in a Bernoulli distribution, using MAP. This means we have to maximize the posterior probability of p given the data:

p = argmax_p f(p | x1, ..., xn) = argmax_p f(x1, ..., xn | p) f(p) / f(x1, ..., xn)   (6)

  = argmax_p f(x1, ..., xn | p) f(p)   (7)

Note that the first passage is just the Bayes rule; in the second passage we drop the denominator, as it does not depend on p. We now have to evaluate the two remaining terms, and multiply them.

Consider a sample of n trials, k of which being 0's. The number of 1's will be n − k. As the probabilities of the possible symbols have to sum to 1, the probability of 1 will be 1 − p. As the events are independent, the probability of the sample can be expressed as the product of the probabilities of the single symbols. We can then express the probability of the sample given p (i.e. the likelihood of p given the data) as

f(x1, x2, ..., xn | p) = P(x1 | p) P(x2 | p) ... P(xn | p) = p^k (1 − p)^(n − k)   (8)

The second term is the prior over the parameter, which is proposed to be f(p) = 3p^2 for p ∈ [0, 1], and obviously 0 elsewhere (note that p has to be in [0, 1], so any prior we use has to be 0 outside this interval).

Multiplying the two terms gives

f(p | x1, x2, ..., xn) ∝ p^k (1 − p)^(n − k) · 3p^2 = 3 p^(k + 2) (1 − p)^(n − k)   (9)

We have to maximize (9) by setting its derivative equal to 0. Also in this case we first take the logarithm, as it makes the equation easier:

∂ log f(p | x1, ..., xn) / ∂p = ∂[ log 3 + (k + 2) log p + (n − k) log(1 − p) ] / ∂p = 0   (10)

(k + 2)/p − (n − k)/(1 − p) = 0   (11)

(1 − p)(k + 2) − p(n − k) = k + 2 − p(2 + n) = 0   (12)

which results in our MAP estimator being p = (k + 2)/(n + 2).
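A quick numerical sanity check of this result (with an assumed sample of n = 10 trials containing k = 6 zeros):

import numpy as np

n, k = 10, 6                                   # assumed example values, not part of the assignment
p = np.linspace(1e-6, 1 - 1e-6, 100001)
posterior = p**k * (1 - p)**(n - k) * 3 * p**2
print(p[np.argmax(posterior)], (k + 2) / (n + 2))   # both about 0.6667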


Page 260: Non-Parametric Methods in Machine Learning

Artificial Neural Networks

Part I

Faustino Gomez

IDSIA

Intelligent Systems, Fall 2007

What makes a system intelligent?

Page 261: Non-Parametric Methods in Machine Learning

Characteristics of “intelligent” systems

• Autonomy: can the system operate with minimal human intervention?

• Adaptivity: can the system adjust to changes in the environment?

• Uncertainty: can the system cope with incomplete information?

• Anticipation: does the system just react or does it also base its behavior on predictions?

• Generality: how specific is the task?

Ultimately we want to achieve true Artificial Intelligence

Brief History of AI

1950: Turing Test

1956: McCarthy coins term “artificial intelligence”

1957: Rosenblatt invents perceptron

1958: John McCarthy (MIT) invents Lisp language

1963: MIT received a $2.2 million grant from the newly created Advanced Research Projects Agency (later known as DARPA)

- General Problem Solver (GPS; Newell and Simon)
- Physical Symbol System hypothesis

Early years

Page 262: Non-Parametric Methods in Machine Learning

Brief History of AI (cont.)

1969: Shakey robot at

Stanford (natural language,

vision, control)

1972: Prolog language

1979: Expert Systems

invented, later used widely

1966: Eliza

Brief History of AI (cont.)

1974-80: First AI Winter

– Philosophical critique: “What computers can’t do” by Dreyfus, and Searle’s Chinese room

– Bold claims:

1965, H. A. Simon: "machines will be capable, within twenty years, of doing any work a man can do.”

1967, Marvin Minsky: "Within a generation ... the problem of creating 'artificial intelligence' will substantially be solved.”

– But insufficient progress, methods would not scale

– Belief that low-level abilities (perception, common-sense reasoning) would be easy. Wrong!

– Minsky-Papert book (1969) criticizes the Perceptron: NNs banished to a “dark age” (1967-1982)

Page 263: Non-Parametric Methods in Machine Learning

1980s: Paradigm Shift

• Rodney Brooks: “the world is its own best model”, “elephants don’t play chess”

- subsumption architecture

• AI too big: focus on subproblems

• Stuart Wilson: Animats approach

• Importance of Embodiment

• Ever increasing computation power

• Renewed interest in Neural Networks

Summary of AI paradigm shift

Good Old-fashioned AI New AI

Page 264: Non-Parametric Methods in Machine Learning

Neural Networks

• mathematical abstraction of biological nervous systems

• massively parallel distributed processors

• subsymbolic (no explicit symbols maintained)

• universal approximation

• can learn arbitrary mappings from examples

Training Set

!

x1 = x

1

1x2

1x3

1x4

1x5

1L x

n

1, d

1[ ]

x2 = x

1

2x2

2x3

2x4

2x5

2L x

n

2, d

2[ ]

x3 = x

1

3x2

3x3

3x4

3x5

3L x

n

3, d

3[ ]M

xN = x

1

Px2

Px3

Px4

Px5

PLx

n

P, d

P[ ]

!

example " [input pattern, target]

Page 265: Non-Parametric Methods in Machine Learning

Perceptron (Rosenblatt 1957)

f(x) = 1  if  Σ_{i=1}^{n} w_i x_i + b > 0,   else  0

Linear Classification

LDA uses statistics to determine projection plane

Page 266: Non-Parametric Methods in Machine Learning

Representing Lines

• How do we represent a line?

In general a hyperplane is defined by w · x = 0: points on one side (like x1 in the figure) have x · w positive, points on the other side (like x2) have x · w negative.

Page 267: Non-Parametric Methods in Machine Learning

Now classification is easy! But... how do we learn this mysterious model vector?

Perceptron Learning Algorithm

Input: a list of n training examples (x_0, d_0), ..., (x_n, d_n) where ∀i: d_i ∈ {+1, −1}
Output: classifying hyperplane w

Algorithm (η is the learning rate):
  Randomly initialize w;
  While w makes errors on the training set do
    for each (x_i, d_i) do
      let y_i = sign(w · x_i);
      if y_i ≠ d_i then w ← w + η d_i x_i;
    end
  end
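A direct Python sketch of this loop (function and variable names are illustrative); the four points used below are those of the example on the next slide:

import numpy as np

def train_perceptron(examples, eta=1/3, max_epochs=100):
    w = np.random.randn(len(examples[0][0]))      # random initialization
    for _ in range(max_epochs):
        errors = 0
        for x, d in examples:
            y = np.sign(w @ x)
            if y != d:                            # misclassified: move w towards d * x
                w = w + eta * d * x
                errors += 1
        if errors == 0:                           # training set classified correctly
            break
    return w

examples = [(np.array([0.5, 1.0]), 1), (np.array([1.0, 0.5]), 1),
            (np.array([-1.0, 0.5]), -1), (np.array([-1.0, 1.0]), -1)]
print(train_perceptron(examples))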

Page 268: Non-Parametric Methods in Machine Learning

A simple example: 4 linearly separable points

Class 1 (z = 1): (1/2, 1) and (1, 1/2)
Class −1 (z = −1): (−1, 1/2) and (−1, 1)

Initial weights: W(0) = (0, 1)

Page 269: Non-Parametric Methods in Machine Learning

Updating Weights

The upper left point (−1, 1/2) is wrongly classified.

x = (−1, 1/2),  d = −1,  η = 1/3,  w(0) = (0, 1)

w(1) ← w(0) + η d x
w(1) = (0, 1) + 1/3 · (−1) · (−1, 1/2) = (0, 1) + 1/3 · (1, −1/2) = (1/3, 5/6)

First correction: W(1) = (1/3, 5/6)

Page 270: Non-Parametric Methods in Machine Learning

Updating Weights, Ctd

The upper left point is still wrongly classified.

x = (−1, 1/2),  d = −1

w(2) ← w(1) + η d x
w(2) = (1/3, 5/6) + 1/3 · (−1) · (−1, 1/2) = (1/3, 5/6) + 1/3 · (1, −1/2) = (2/3, 2/3)

Second correction: W(2) = (2/3, 2/3)

Page 271: Non-Parametric Methods in Machine Learning

Adaptive Linear Elements (Adaline)

• By Widrow and Hoff (~1960)

  – Adaptive linear elements for signal processing

  – The same architecture as the perceptron but a different learning method: the delta rule, also called the Widrow-Hoff learning rule

  – Learning method: try to reduce the mean squared error (MSE) between the net input and the desired output

Adaline

• Delta rule

  – The squared error:

    E = (d − net)² = (d − Σ_i w_i x_i)²

    • Its value is determined by the weights w_i

  – Modify weights by a gradient descent approach

  – Obtain the partial derivative of the error with respect to each weight:

    ∂E/∂w_i = 2(net − d) · ∂net/∂w_i = 2(net − d) x_i

  – Change weights in the opposite direction of ∂E/∂w_i:

    Δw_i = η (d − Σ_i w_i x_i) x_i = η (d − net) x_i

    ((net − d) x_i increases the error; −(net − d) x_i decreases the error)

Page 272: Non-Parametric Methods in Machine Learning

Adaline Learning

• Delta rule in batch mode

  – Based on the mean squared error over all P samples:

    E = (1/P) Σ_{p=1}^{P} (d^p − net^p)²

  • E is again a function of w = (w_0, w_1, ..., w_n)

  • The gradient of E:

    ∂E/∂w_i = (2/P) Σ_{p=1}^{P} [ (d^p − net^p) · ∂/∂w_i (d^p − net^p) ]
            = −(2/P) Σ_{p=1}^{P} [ (d^p − net^p) · x_i^p ]

  • Therefore

    Δw_i = −η ∂E/∂w_i ∝ Σ_{p=1}^{P} [ (d^p − net^p) · x_i^p ]

Adaline Learning

• Notes:

  – Weights will be changed even if an input is already classified correctly

  – E monotonically decreases until the system reaches a state with (local) minimum E (a small change of any w_i will cause E to increase).

  – At a local minimum the partial derivative ∂E/∂w_i with respect to every weight equals 0, but E is not guaranteed to be zero (net_j ≠ d_j)
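A minimal numpy sketch of the batch-mode delta rule described above (X is an assumed P × n matrix of inputs, with a column of ones if a bias is wanted, and D the vector of P targets):

import numpy as np

def adaline_train(X, D, eta=0.01, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        net = X @ w                                  # linear outputs for all P samples
        grad = -(2.0 / len(X)) * ((D - net) @ X)     # dE/dw for E = (1/P) sum (d - net)^2
        w = w - eta * grad                           # move against the gradient
    return w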

Page 273: Non-Parametric Methods in Machine Learning

• Can a trained perceptron correctly classify patterns

not included in the training samples?

• Depends on the quality of training samples selected.

• Also to some extent depends on the learning rate

and initial weights

• How can we know the learning is ok?

– Reserve a few samples for testing.

• Much more on this later

Generalization


Problem: some functions are not linearly separable!

u1 u2 u1 XOR u2

0 0 0

0 1 1

1 0 1

1 1 0

XOR function

Since XOR (a simple function) could not be separated by a line

the perceptron is very limited in what kind of functions it can learn.

Funding for neural networks dried up for more than a decade after

Minsky and Papert book Perceptrons (1969).

Page 274: Non-Parametric Methods in Machine Learning

Resurgence of NNs

• Layered Networks

• Non-linear neurons

• 1974: Backpropagation (Werbos)

• 1986: developed further (Rumelhart,

Hinton, Williams)

Solving XOR with combined perceptrons

XOR can be composed of 'simpler' logical functions:

A xor B = (A or B) and not (A and B)

The last term simply removes the troublesome value.
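As a small illustration (not part of the original slides), the decomposition above can be wired up with three threshold units whose weights are picked by hand rather than learned:

    def step(z):
        return 1 if z >= 0 else 0

    def or_unit(a, b):  return step(a + b - 0.5)   # fires if a + b >= 0.5
    def and_unit(a, b): return step(a + b - 1.5)   # fires only if both inputs are 1

    def xor_unit(a, b):
        # output unit: (a or b) and not (a and b)
        return step(or_unit(a, b) - and_unit(a, b) - 0.5)

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_unit(a, b))   # reproduces the XOR truth table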

Page 275: Non-Parametric Methods in Machine Learning

Non-Linear Perceptron

f(x) = σ( Σᵢ₌₁ⁿ wᵢxᵢ + b ),   σ(net) = 1 / (1 + e^(−net))

Non-Linear Threshold:

σ(y) = 1 / (1 + e^(−ay))

applied to x·w + b (the weighted sum of inputs)

• can now benefit from combining perceptrons (neurons)

into layered networks

Page 276: Non-Parametric Methods in Machine Learning

Why must hidden units be non-linear?
• A multi-layer net with linear hidden layers is equivalent to a single-layer net:
– Because z1 and z2 are linear units: z1 = x1·v11 + x2·v21, z2 = x1·v12 + x2·v22
– y = z1·w1 + z2·w2 = x1·u1 + x2·u2, where u1 = (v11·w1 + v12·w2), u2 = (v21·w1 + v22·w2)
– Therefore the output y is still a linear combination of x1 and x2.

[Figure: two-input network with linear hidden units z1, z2 (weights v11, v12, v21, v22), output Y (weights w1, w2), threshold = 0.]

Types of decision regions by network structure:

Structure     | Types of decision regions
Single-Layer  | Half plane bounded by a hyperplane
Two-Layer     | Convex open or closed regions
Three-Layer   | Arbitrary (complexity limited by the number of nodes)

[Figure: for each structure, the original slide also illustrates the Exclusive-OR problem, classes with meshed regions (A/B), and the most general region shapes.]

Non linearly separable problems

Neural Networks – An Introduction Dr. Andrew Hunter

Page 277: Non-Parametric Methods in Machine Learning

Artificial Neural NetworksPart II

Faustino GomezIDSIA

Intelligent Systems, Fall 2007

Page 278: Non-Parametric Methods in Machine Learning

Delta Rule (gradient descent) for a linear neuron

• Calculate the squared error between output and target; its value is determined by the weights wᵢ
• Obtain the partial derivative of the error with respect to each weight
• Change the weights in the opposite direction of ∂E/∂wᵢ

E = (d − net)² = (d − Σᵢ wᵢxᵢ)²

∂E/∂wᵢ = 2(net − d)·∂net/∂wᵢ = (net − d)·xᵢ   (we can drop the 2, it doesn't change the direction!)

Δwᵢ = η(d − Σᵢ wᵢxᵢ)·xᵢ = η(d − net)·xᵢ

Moving along (net − d)xᵢ increases the error; moving along −(net − d)xᵢ decreases it.

Page 279: Non-Parametric Methods in Machine Learning

Error surface

[Figure: the error surface E as a function of the weights w1, w2, with the gradient components ∂E/∂w1, ∂E/∂w2 forming the gradient vector ∂E/∂w.]

Page 280: Non-Parametric Methods in Machine Learning

Non-Linear Neuron

f(x) = σ( Σᵢ₌₁ⁿ wᵢxᵢ + b ),   σ(net) = 1 / (1 + e^(−net))

Page 281: Non-Parametric Methods in Machine Learning

[Figure: a "soft" sigmoid decision boundary (outputs near .01 on one side and near .99 on the other) compared with the "hard" 0/1 perceptron decision boundary.]

Page 282: Non-Parametric Methods in Machine Learning

Delta Rule (gradient descent) for a non-linear neuron

• Calculate the squared error between output and target; same as the linear case, but now with the sigmoid function squashing net
• Obtain the partial derivative of the error with respect to each weight

E = (d − σ(net))² = (d − σ(Σᵢ wᵢxᵢ))²,   y = σ(net)

∂E/∂wᵢ = ∂(d − y)²/∂wᵢ = (y − d)·(∂y/∂net)·(∂net/∂wᵢ)

Page 283: Non-Parametric Methods in Machine Learning

Delta Rule (gradient descent) for a non-linear neuron

Recall the linear case:

∂E/∂wᵢ = (net − d)·∂net/∂wᵢ

But now we have to deal with the sigmoid:

∂E/∂wᵢ = (y − d)·(∂y/∂net)·(∂net/∂wᵢ),   where   ∂y/∂net = ∂σ(net)/∂net = σ(net)(1 − σ(net))

Page 284: Non-Parametric Methods in Machine Learning

Sigmoid and its derivative

[Figure: the sigmoid y = σ(net) and its derivative, plotted against net.]

Page 285: Non-Parametric Methods in Machine Learning

Linear vs. non-linear gradient

Linear case:

∂E/∂wᵢ = (net − d)·∂net/∂wᵢ = (net − d)·xᵢ

Non-linear case:

∂E/∂wᵢ = (y − d)·(∂y/∂net)·(∂net/∂wᵢ) = (y − d)·σ(net)(1 − σ(net))·xᵢ
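A short sketch of the two gradient formulas in code (assuming the squared error E = (y − d)² with the constant factor dropped, as in the slides; the example weights and inputs are made up):

    import numpy as np

    def sigmoid(net):
        return 1.0 / (1.0 + np.exp(-net))

    def grad_linear(w, x, d):
        net = np.dot(w, x)
        return (net - d) * x                    # dE/dw_i = (net - d) x_i

    def grad_sigmoid(w, x, d):
        net = np.dot(w, x)
        y = sigmoid(net)
        return (y - d) * y * (1.0 - y) * x      # dE/dw_i = (y - d) sigma'(net) x_i

    w = np.array([0.2, -0.1])
    x = np.array([1.0, 0.5])
    print(grad_linear(w, x, d=1.0), grad_sigmoid(w, x, d=1.0))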

Page 286: Non-Parametric Methods in Machine Learning

Multi-layer Perceptron (MLP)

Page 287: Non-Parametric Methods in Machine Learning

Backpropagation algorithm

Two steps:
1. Forward Pass: present a training input pattern to the network and activate the network to produce an output (can also be done in batch: present all patterns in succession)
2. Backward Pass: calculate the error gradient and update the weights, starting at the output layer and then going back

Page 288: Non-Parametric Methods in Machine Learning

Forward Pass

• Calculate the activation of each hidden node and store them
• Then calculate the activation of each output node

yⱼ = σ( Σᵢ₌₁ⁿ wᵢⱼ xᵢ + b )

yₖ = σ( Σⱼ₌₁ⁿ wⱼₖ yⱼ + b )

Page 289: Non-Parametric Methods in Machine Learning

Backward Pass for output node

General update (all nodes): Δw = η·yᵢ·δⱼ

When k is an output node:

δₖ = (dₖ − yₖ)·∂yₖ/∂netₖ

Δwⱼₖ = η·yⱼ·δₖ = η·yⱼ·(dₖ − yₖ)·∂yₖ/∂netₖ

Same as the non-linear perceptron! δ is the error term.

Page 290: Non-Parametric Methods in Machine Learning

Updating output layer weight

wⱼ₁ₖ₁ only affects the output yₖ₁:

Δwⱼ₁ₖ₁ = η·yⱼ₁·(d₁ − yₖ₁)·∂yₖ₁/∂netₖ₁

Page 291: Non-Parametric Methods in Machine Learning

Backward Pass for hidden node

General update (all nodes): Δw = η·yᵢ·δⱼ

When j is a hidden node:

δⱼ = ( Σₖ δₖ·wⱼₖ )·∂yⱼ/∂netⱼ

Δwᵢⱼ = η·yᵢ·δⱼ = η·yᵢ·( Σₖ δₖ·wⱼₖ )·∂yⱼ/∂netⱼ

If the network has one hidden layer, yᵢ is an input xᵢ.

Page 292: Non-Parametric Methods in Machine Learning

Updating hidden layer weight

Weight in input layer affects all outputs!

Page 293: Non-Parametric Methods in Machine Learning

Updating hidden layer weight

Δwᵢ₁ⱼ₁ = η·x₁·δⱼ₁ = η·x₁·( Σₖ δₖ·wⱼ₁ₖ )·∂yⱼ₁/∂netⱼ₁
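Putting the forward and backward passes together, here is a minimal sketch of one backpropagation step for a single-hidden-layer network with sigmoid units (biases are omitted for brevity, and the sizes, learning rate, and data are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, d, V, w, eta=0.5):
        """V: input->hidden weights (h x n), w: hidden->output weights (h,)."""
        # forward pass
        y_hidden = sigmoid(V @ x)                  # hidden activations y_j
        y_out = sigmoid(w @ y_hidden)              # output y_k
        # backward pass
        delta_out = (d - y_out) * y_out * (1 - y_out)               # delta_k
        delta_hidden = (delta_out * w) * y_hidden * (1 - y_hidden)  # delta_j
        # weight updates
        w += eta * delta_out * y_hidden            # Delta w_jk = eta * y_j * delta_k
        V += eta * np.outer(delta_hidden, x)       # Delta w_ij = eta * x_i * delta_j
        return y_out

    # Usage: a few updates on one pattern; y_out should drift toward the target d = 1.0
    rng = np.random.default_rng(1)
    V = rng.normal(scale=0.5, size=(3, 2)); w = rng.normal(scale=0.5, size=3)
    for _ in range(5):
        print(backprop_step(np.array([1.0, 0.0]), 1.0, V, w))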

Page 294: Non-Parametric Methods in Machine Learning

Artificial Neural NetworksPart III

Faustino GomezIDSIA

Intelligent Systems, Fall 2007

Page 295: Non-Parametric Methods in Machine Learning

Some observations about MLPs

• Each hidden layer implements a set of feature detectors
• Remember: backpropagation refers to the computation of the weight derivatives w.r.t. the error
• Gradient descent is one way that the weights can be adjusted using these derivatives

Page 296: Non-Parametric Methods in Machine Learning

Advantages of NNs

• Universal approximation
– A single hidden layer is sufficient!
• Efficient hardware implementation
• Noise tolerance
• Graceful degradation

Page 297: Non-Parametric Methods in Machine Learning

Open Issues with NNs

• The error surface may have many local minima
• Model selection: what is the best network topology for a given problem?
• Generalization: how can we ensure that a network captures the underlying function?
• Opacity: NNs are black boxes that are not directly human-interpretable

Page 298: Non-Parametric Methods in Machine Learning

Local Minima

Page 299: Non-Parametric Methods in Machine Learning

Local minima

• Most algorithms which have difficulties with simple tasks get much worse with more complex tasks
• Many dimensions make for many descent options
• Local minima are more common with very simple/toy problems, more rare with larger problems and larger nets
• Even if there are occasional minima problems, one could simply train multiple nets and pick the best
• Some algorithms add noise to the updates to escape minima

Page 300: Non-Parametric Methods in Machine Learning

Generalization

• Overfitting/underfitting
– How many nodes/layers?
• Validation and test sets
• Training set coverage and size

Page 301: Non-Parametric Methods in Machine Learning

Enhancements To Gradient Descent

• Momentum
– Adds a percentage of the last movement to the current movement:

Δw(t + 1) ← η·δ·y + α·Δw(t),   where α is the momentum term

Page 302: Non-Parametric Methods in Machine Learning

Momentum

The weight update maintains momentum in the direction it has been going:
• Faster in flat regions
• Could leap past minima
• Significant speed-up; a common value is α ≈ 0.9
• Effectively increases the learning rate in areas where the gradient is consistently the same sign

Δw(t + 1) ← η·δ·y + α·Δw(t)

Page 303: Non-Parametric Methods in Machine Learning

Annealing

• Adjust the learning rate during learning
• Start with a large learning rate (e.g. 0.5)
• Gradually decrease it as the error goes down
• Allows for big jumps at the beginning and fine tuning later, when close to a minimum

Page 304: Non-Parametric Methods in Machine Learning

NN Applications

• NETtalk (Sejnowski and Rosenberg 1987)

• Fraud detection (credit cards, loanapplications)

• Financial prediction• Compression• Continuous, non-linear control

Page 305: Non-Parametric Methods in Machine Learning

NETtalk

[Figure: the NETtalk network, mapping text via linguistic features to a speech synthesizer.]

Page 306: Non-Parametric Methods in Machine Learning

Data Compression

Network learns to output its input

Auto-encoder network

Page 307: Non-Parametric Methods in Machine Learning

Data Compression

The input representation is compressed in a smaller hidden layer: h hidden units with h << n inputs.

Auto-encoder network

Page 308: Non-Parametric Methods in Machine Learning

Recurrent Neural Networks

Intelligent Systems, Fall 2007

Faustino GomezIDSIA

Page 309: Non-Parametric Methods in Machine Learning

Sequence Learning

• Up to now we have looked at static mappings:

y = f(x(t)),   ∀t

where t, time, just imposes an ordering on the input patterns.

• It doesn't matter when x was presented to the network.

Page 310: Non-Parametric Methods in Machine Learning

Sequence Learning
• Now we look at sequential inputs, where the output y can depend on more than just the immediate input:

y = f(s(t)) = F(x(t), x(t − 1), ..., x(1))

where s(t) is the state, which can change every time a new input arrives:

s(t + 1) ← g(s(t), x(t))

• State provides memory, which in NNs is implemented by feedback or recurrent connections.

Page 311: Non-Parametric Methods in Machine Learning

Why study sequences?
• Many natural processes are inherently sequential
– Speech
– Vision
– Natural language
– DNA
• In robotics tasks, short-term memory can be essential for determining the state of the world due to limited sensor information

Page 312: Non-Parametric Methods in Machine Learning

Sequential Training Set

[[x¹(t₁), x¹(t₁ − 1), x¹(t₁ − 2), ..., x¹(1)], d¹]
[[x²(t₂), x²(t₂ − 1), x²(t₂ − 2), ..., x²(1)], d²]
[[x³(t₃), x³(t₃ − 1), x³(t₃ − 2), ..., x³(1)], d³]
...
[[xᴺ(t_N), xᴺ(t_N − 1), xᴺ(t_N − 2), ..., xᴺ(1)], dᴺ]

where tᵢ is the length of sequence i, and x and d are vectors.

example ↦ [[input sequence], target]
input sequence ↦ [x(t), x(t − 1), x(t − 2), ..., x(1)]

Page 313: Non-Parametric Methods in Machine Learning

Recurrent Non-Linear Neuron

Page 314: Non-Parametric Methods in Machine Learning

Recurrent Non-Linear Neuron

yₖ = σ( Σᵢ₌₁^I wᵢₖ xᵢ  (input connections)  +  Σⱼ₌₁^H wⱼₖ yⱼ  (recurrent connections)  +  b )

where I is the number of inputs to neuron yₖ and H is the number of hidden units in the same layer as yₖ.

Now the output of a unit depends on both the input and also the output from other neurons.
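A minimal sketch of this recurrent unit run over a sequence (the weight matrices, sizes, and inputs are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def srn_forward(sequence, W_in, W_rec, b):
        """Run a simple recurrent layer over a sequence of input vectors."""
        h = np.zeros(W_rec.shape[0])               # initial context is all zeros
        outputs = []
        for x in sequence:
            h = sigmoid(W_in @ x + W_rec @ h + b)  # input + recurrent connections
            outputs.append(h.copy())
        return outputs

    rng = np.random.default_rng(0)
    W_in, W_rec, b = rng.normal(size=(4, 2)), rng.normal(size=(4, 4)), np.zeros(4)
    seq = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
    for h in srn_forward(seq, W_in, W_rec, b):
        print(h)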

Page 315: Non-Parametric Methods in Machine Learning

Simple Recurrent Network (SRN)

Page 316: Non-Parametric Methods in Machine Learning

Simple Recurrent Network (SRN)

Hidden units now have state or memory which is dependent on all previous inputs

Page 317: Non-Parametric Methods in Machine Learning

SRN: another way to look at it

The context units are just the hidden-layer activation from the previous time step.

Page 318: Non-Parametric Methods in Machine Learning

Fully Connected RNN

Can approximate any differentiable trajectory. Same as the SRN but without the output layer.

Page 319: Non-Parametric Methods in Machine Learning

Train with truncated backpropagation

• Train the same way as an MLP
• But treat the activation from the previous time-step as just another set of inputs (the context units)
• The network can now learn to map the same external inputs to different outputs due to "context"

Page 320: Non-Parametric Methods in Machine Learning

Backpropagation Through Time

• Just like backpropagation, but the network is "unfolded" spatially for each time-step in the input sequence
• For an n-step sequence, we get a network with n layers
• Each layer has the same weights
• The error at the output is propagated back through all layers

Page 321: Non-Parametric Methods in Machine Learning

Backpropagation Through Time

[Figure: the unfolded network. At the top, Input → State/Hidden → Output (via the output weights); below it, State/Hidden(t−1), State/Hidden(t−2), State/Hidden(t−3), each with its own Input, connected by the recurrent weights. The input weights and recurrent weights are the same at every step (the weight-sharing applies to the SRN only), and the error is propagated further back through the stack.]

Page 322: Non-Parametric Methods in Machine Learning

Simple example: XOR with delay

• The network learns to perform XOR of two or more inputs,
• But instead of outputting the XOR of the current inputs,
• It outputs the XOR of the input it saw n steps ago:

f(x(t)) = XOR(x(t − n))

Page 323: Non-Parametric Methods in Machine Learning

Delayed XOR

Page 324: Non-Parametric Methods in Machine Learning

Vanishing error gradient
• Although RNNs can represent arbitrary sequential behavior, they are not easy to train when the output depends on some input more than around 10 time-steps in the past
• The error gradient becomes very small, so the weights cannot be adjusted to respond to events far in the past
• Might as well use an MLP with an input layer n time-steps wide... if you know n in advance!
• Solutions:
– Long Short-Term Memory
– Neuroevolution (later in the semester)

Page 325: Non-Parametric Methods in Machine Learning

Long Short-Term Memory (LSTM)

LSTM Cell

• LSTM nets have memory cells with a linear state S that keeps error flowing back in time and is controlled by 3 gates
• The input gate (Gi) controls what information enters the state
• The output gate (Go) controls what information leaves the state to other cells
• The forget gate (Gf) allows the cell to forget the state when it is no longer needed

Page 326: Non-Parametric Methods in Machine Learning

Intelligent Systems Midterm

Fall 2007

Question 1 (15 points)

The geometric distribution P over the set of natural numbers N = {1, 2, ...} is given by P(k) = (1 − p)^(k−1) p for k ∈ N. Derive the Maximum Likelihood Estimator (MLE) for the parameter p.

Solution

See the example at page 15, Lecture 3. The ML estimator is

p̂ = arg maxₚ L(k₁, ..., kₙ | p) = arg maxₚ log L(k₁, ..., kₙ | p)   (1)

The likelihood of p in light of an arbitrary sample (k₁, ..., kₙ) is

L(k₁, ..., kₙ | p) = ∏ᵢ₌₁ⁿ P(kᵢ | p) = ∏ᵢ₌₁ⁿ [(1 − p)^(kᵢ−1) p],   (2)

whose logarithm is

Page 327: Non-Parametric Methods in Machine Learning

LL(k₁, ..., kₙ | p) = log ∏ᵢ₌₁ⁿ [(1 − p)^(kᵢ−1) p]   (3)
= Σᵢ₌₁ⁿ log[(1 − p)^(kᵢ−1) p]   (4)
= Σᵢ₌₁ⁿ [(kᵢ − 1) log(1 − p) + log p]   (5)
= [Σᵢ₌₁ⁿ kᵢ − n] log(1 − p) + n log p.   (6)

The value of p which maximizes (6) can be found by evaluating its derivative with respect to p and setting it to 0:

d LL(k₁, ..., kₙ | p) / dp = −[Σᵢ₌₁ⁿ kᵢ − n] / (1 − p) + n / p = 0   (7)

([−Σᵢ₌₁ⁿ kᵢ + n] p + n(1 − p)) / ((1 − p) p) = 0   (8)

−p Σᵢ₌₁ⁿ kᵢ + np + n − np = 0.   (9)

Solving for p we obtain the ML estimator

p̂ = n / Σᵢ₌₁ⁿ kᵢ.   (10)
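As a quick numeric sanity check of (10) (not part of the exam solution), simulated geometric data gives an estimate close to the true p:

    import numpy as np

    rng = np.random.default_rng(42)
    p_true = 0.3
    k = rng.geometric(p_true, size=10000)   # support k = 1, 2, ... as in the question
    p_hat = len(k) / k.sum()
    print(p_hat)                            # close to 0.3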

Page 328: Non-Parametric Methods in Machine Learning

Question 2 (15 points)

Find the Maximum A Posteriori (MAP) estimator for the parameter λ of the exponential distribution

f(x | λ) = λ e^(−λx)

with prior f(λ) = e^(−λ).

Solution

See the example at page 45 of Lecture 3. The MAP estimate of λ is obtained by maximizing its posterior probability given the data. Using the Bayes rule we can write:

λ̂ = arg max_λ f(λ | x₁, ..., xₙ) = arg max_λ [ f(x₁, ..., xₙ | λ) f(λ) / f(x₁, ..., xₙ) ]   (11)

or, as the denominator does not depend on λ,

λ̂ = arg max_λ f(x₁, ..., xₙ | λ) f(λ)   (12)

Taking the logarithm:

λ̂ = arg max_λ [ log f(x₁, ..., xₙ | λ) + log f(λ) ]   (13)

The likelihood of λ in light of an arbitrary sample (x₁, ..., xₙ) is

f(x₁, ..., xₙ | λ) = ∏ᵢ₌₁ⁿ f(xᵢ | λ) = ∏ᵢ₌₁ⁿ (λ e^(−λxᵢ))   (14)

The logarithm of (14) is

log f(x₁, ..., xₙ | λ) = log ∏ᵢ₌₁ⁿ (λ e^(−λxᵢ)) = Σᵢ₌₁ⁿ log(λ e^(−λxᵢ)) = n log λ − λ Σᵢ₌₁ⁿ xᵢ   (15)–(16)

while log f(λ) = log(e^(−λ)) = −λ.

Page 329: Non-Parametric Methods in Machine Learning

The value of λ which maximizes (13) can be found by evaluating its derivative with respect to λ and setting it to 0:

d[ log f(x₁, ..., xₙ | λ) + log f(λ) ] / dλ = d[ n log λ − λ Σᵢ₌₁ⁿ xᵢ − λ ] / dλ = n/λ − Σᵢ₌₁ⁿ xᵢ − 1 = 0.   (17)–(19)

Solving for λ we obtain the MAP estimator

λ̂ = n / ( Σᵢ₌₁ⁿ xᵢ + 1 )   (20)

Question 3 (10 points)

Derive the Bayes rule.

Solution

The probability of an event A given an event B (p. 14 of Lecture 4) is defined as

P(A | B) = P(A, B) / P(B)   (21)

which implies

P(A, B) = P(A | B) P(B)   (22)

Symmetrically, one can obtain

P(A, B) = P(B | A) P(A)   (23)

Substituting (23) in (21) we obtain the Bayes rule

P(A | B) = P(B | A) P(A) / P(B)   (24)

4

Page 330: Non-Parametric Methods in Machine Learning

Question 4 (5 points)

Consider the classification problem with discrete objects and binary labels Y ∈ {0, 1}; each object consists of d discrete components (features) (X(1), ..., X(d)). Suppose all components X(i) are independent given the label. Assume that we know the (unconditional) class probability P(Y = a), and the probabilities P(X(i) | Y = a) of each value of each single component given each class label. How would you assign a label a to a given object (X(1), ..., X(d))?

Solution

With Naive Bayes (p. 26, Lecture 4), assign the label a with maximum posterior probability:

a = arg maxₐ [ P(Y = a) ∏ᵢ₌₁^d P(X(i) | Y = a) ]   (25)

Question 5 (5 points)

Are the points (3, 2,−6, 0, 2) and (4,−1, 5,−3, 2) classified to be in thesame class or in different classes by the hyperplane represented by the vector(1,−1, 0, 1, 1)?

Solution

YES, the points are in the same class. The sign of the dot product between a point (vector) and the vector representing the hyperplane tells you on which side of the hyperplane the data point lies. So if the dot products for both points have the same sign, the points are in the same class; otherwise, they are not.

(3, 2, −6, 0, 2) · (1, −1, 0, 1, 1) = 3 + (−2) + 0 + 0 + 2 = 3

(4, −1, 5, −3, 2) · (1, −1, 0, 1, 1) = 4 + 1 + 0 + (−3) + 2 = 4
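The same check in code (a small illustrative sketch using numpy):

    import numpy as np

    w = np.array([1, -1, 0, 1, 1])
    p1 = np.array([3, 2, -6, 0, 2])
    p2 = np.array([4, -1, 5, -3, 2])
    print(np.dot(p1, w), np.dot(p2, w))                        # 3 and 4
    print(np.sign(np.dot(p1, w)) == np.sign(np.dot(p2, w)))    # True: same side, same class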

5

Page 331: Non-Parametric Methods in Machine Learning

Question 6 (5 points)

What is the difference between supervised and unsupervised learning?

Solution

In supervised learning the examples in the training set have targets, which tell the learning algorithm what the correct output of the learner should be when it sees each training input. In the case of classification tasks the targets are class labels; for function approximation tasks (regression) the targets are real-valued vectors.

In unsupervised learning there are no targets, just the data. Here we want to learn the underlying structure of the data, e.g. how the data is distributed, the number of clusters, etc.

Question 7 (5 points)

What do the eigenvalues tell us about the "components" generated by PCA?

Solution

The eigenvalues tell us the relative "importance" of each of the components (which are the eigenvectors of the covariance matrix of the data). The first principal component has the largest eigenvalue and therefore captures the most variance in the data. The second principal component has the second largest eigenvalue, and so on.
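A small illustrative sketch (not part of the exam solution): the eigenvalues of the covariance matrix of a stretched Gaussian cloud reflect the variance captured by each component.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])   # stretched cloud
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending order
    print(eigvals[::-1])                        # roughly 9 and 0.25
    print(eigvals[::-1] / eigvals.sum())        # fraction of variance per component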

6

Page 332: Non-Parametric Methods in Machine Learning

Question 8 (10 points)

List at least two factors that affect the ability of a neural network to generalize after training.

Solution

1. Network architecture (topology): how many layers and how many units in each layer. With too many parameters (weights) the network might overfit the training set; that is, it obtains very low error on the training set but does not generalize to the testing set. With too few parameters the network might underfit the training set.

2. Training set. The way in which the training set samples the input space can affect generalization on the test set. For instance, if the training set is very sparse (small in relation to the dimensionality of the input space), then the network will have more freedom in how it interpolates between the training set examples when generalizing to examples in the test set. If the input space is sampled densely, then training will take longer and may require a larger network, but generalization should be better.

Other factors, such as the initial (random) weights and the learning rate, can also affect generalization.

Question 9 (5 points)

What is the key drawback of the perceptron?

Solution

It can only solve linearly separable problems.

7

Page 333: Non-Parametric Methods in Machine Learning

Question 10 (5 points)

What are local minima, and why are they a problem?

Solution

Local minima are points on the error surface where the derivative is zero without being the global minimum. They are a problem because if we are following the gradient in order to reduce the error (as we do in normal backpropagation), the gradient goes to zero at a local minimum, so we cannot reduce the error further even though there are points on the error surface with lower error. In order to get out of a local minimum we have to go "uphill", against the local gradient (see slide 5 in part 3 of the NN lectures).

Question 11 (10 points)

Why is the derivative of the error with respect to a particular weight in an MLP calculated differently if the weight is in the output layer versus any other layer?

Solution

This is because a weight in the output layer only affects the error through the output unit it is connected to. That is, any change to a weight in the output layer will only affect the error of the network by affecting the output of the unit to which it is connected.

A weight in a "previous" layer affects the error more indirectly. Assuming a two-layer network (input layer and output layer), a change in a weight in the input layer (i.e. a weight connecting an input unit to a hidden unit) will affect the output of the hidden unit it is connected to. This hidden unit is then connected to ALL of the output units. Therefore, the derivative of the error with respect to that weight must take into account the error term δₖ for each of the k output units (see slide 16 in part 2 of the NN lectures).

8

Page 334: Non-Parametric Methods in Machine Learning

Question 12 (5 points)

What do the "context units" in a recurrent neural network represent?

Solution

The context units represent the hidden layer activation from the previous time-step. When the first element in a sequence is input to a recurrent neural network, the activation of the hidden layer (i.e. the output of the hidden layer units) is copied to the context units and becomes part of the input to the RNN when the second element in the sequence is processed. Likewise, the hidden layer activation caused by the second element being input (along with its "context" units) provides the "context" for the third element in the sequence, and so on (see slide 10 in part 4 of the NN lectures).

Question 13 (5 points)

Would a recurrent neural network in principle be able to learn to distinguish between these two sequences:

a b c d e f g
b b c d e f g

What about:

a b c d e f g
a c b d e f g

Solution

In both cases the answer is YES. In the first case the sequences differ in the first element; in the second case two elements are swapped. However, all that matters is that the sequences are different. In both cases, the fact that the sequences are not identical is enough for the RNN to distinguish them, in principle.

9

Page 335: Non-Parametric Methods in Machine Learning

Support Vector MachinesPart I (Linear SVMs)

Intelligent Systems, Fall 2007

Faustino GomezIDSIA

Page 336: Non-Parametric Methods in Machine Learning

SVMs: Basic Idea
• Map the original, possibly linearly inseparable data into a higher-dimensional feature space:

x → Φ(x)   (for the linear case: Φ(x) = x)

where the original data is hopefully linearly separable
• Find the "best" hyperplane in the feature space using a linear classifier

Page 337: Non-Parametric Methods in Machine Learning

[Figure: original data x (left) mapped into the feature space Φ(x) (right).]

Page 338: Non-Parametric Methods in Machine Learning

Perceptron Revisited

Linear classifier:  y(x) = sign(w·x + b)

The hyperplane w·x + b = 0 separates the regions w·x + b > 0 and w·x + b < 0.

Page 339: Non-Parametric Methods in Machine Learning

• All of the above hyperplanes correctly classify the training set
• Which hyperplane is the best, i.e. which generalizes the best?
• One possibility: the hyperplane with the maximum margin

Page 340: Non-Parametric Methods in Machine Learning

Statistical Learning:Capacity and VC dimension

• To guarantee an upper bound on the generalization error, the capacity of the learned functions must be controlled.
– too much capacity: overfitting
– too little capacity: underfitting
• The Vapnik-Chervonenkis (VC) dimension is one of the most popular measures of capacity.

Page 341: Non-Parametric Methods in Machine Learning

VC Dimension

• A classification model f with some parameter vector w is said to shatter a set of data points if, for all assignments of labels to those points, there exists a w such that f classifies all points correctly.
• For a model f, the VC dimension h is the maximum number of points that can be arranged so that f shatters them.

Page 342: Non-Parametric Methods in Machine Learning

A line can shatter 3 points: VC dimension = 3

VC Dimension of a line

Page 343: Non-Parametric Methods in Machine Learning

A line cannot shatter four points.

There is no way to arrange the 4 points such that a line can correctly classify all possible labelings.

Page 344: Non-Parametric Methods in Machine Learning

Structural risk minimization

• A function that (1) minimizes the empirical risk (i.e. the training set error) and (2) has low VC dimension will generalize well regardless of the dimensionality of the input space.

With probability (1 − δ) (Vapnik, 1995, "Structural Minimization Principle"):

err_true ≤ err_train + √[ ( VC·(log(2n/VC) + 1) − log(δ/4) ) / n ]

Page 345: Non-Parametric Methods in Machine Learning

Margin of separation andoptimal hyperplane

• Vapnik has shown that maximizing the margin of separation between the classes is equivalent to minimizing the VC dimension.
• The optimal hyperplane is the one giving the largest margin of separation between the classes.

Page 346: Non-Parametric Methods in Machine Learning

Definition of Margin
• Distance from a data point to the boundary:

r = (w·x + b) / ||w||

• Data points closest to the boundary are called support vectors
• The margin d is the perpendicular distance of the closest point to the hyperplane

Page 347: Non-Parametric Methods in Machine Learning

Distance from a vector to the hyperplane

y(x) / ||w|| = (w·x + b) / ||w|| = r

[Figure: the hyperplane y = 0 separates Class 1 (y > 0) from Class 2 (y < 0); w is normal to the hyperplane, its offset from the origin along w is −b/||w||, and a point x decomposes into its projection x⊥ onto the hyperplane plus r·w/||w||.]

Page 348: Non-Parametric Methods in Machine Learning

Write x = x⊥ + r·w/||w||, where x⊥ lies on the hyperplane. Then

w·x + b = w·(x⊥ + r·w/||w||) + b = (w·x⊥ + b) + r·||w||²/||w|| = 0 + r·||w||

so y(x) = r·||w|| and

r = y(x) / ||w||

The margin is the r of the closest point to the hyperplane.

Page 349: Non-Parametric Methods in Machine Learning

Maximizing the Margin

Since dᵢ ∈ {−1, +1} (points are either in one class or the other), there exists a set of parameters w such that:

y(xᵢ) > 0 for dᵢ = +1 and y(xᵢ) < 0 for dᵢ = −1, so that dᵢ·y(xᵢ) > 0 for all data points.

That is, w correctly classifies all points. So the distances we want to maximize are:

dᵢ·y(xᵢ) / ||w|| = dᵢ·(w·xᵢ + b) / ||w||

Page 350: Non-Parametric Methods in Machine Learning

Maximum Margin Classification

• Maximize the minimum distance from the hyperplane:

argmax_{w,b} { (1/||w||) · minᵢ [ dᵢ·(w·xᵢ + b) ] }

Page 351: Non-Parametric Methods in Machine Learning

Canonical Hyperplanes
• r does not change if we multiply w and b by some number k:

d·y(x)/||w|| = d·(w·x + b)/||w|| = d·(kw·x + kb)/||kw||   ∀k ∈ ℝ

• Therefore, we can set d·(w·x + b) = 1 for the point closest to the hyperplane.
• So that all points satisfy the constraints:

dᵢ·(w·xᵢ + b) ≥ 1   ∀i

Page 352: Non-Parametric Methods in Machine Learning

How do we find the maximum margin hyperplane?

• Maximize 1/||w||, which is the same as minimizing ||w||²
• Subject to the linear constraints:

dᵢ·(w·xᵢ + b) ≥ 1   ∀i

• Recall dᵢ·y(xᵢ)/||w|| and that d·(w·x + b) = 1 for the closest points, known as support vectors

This is a quadratic optimization problem.

Page 353: Non-Parametric Methods in Machine Learning

Support Vector MachinesPart II (Non-Linear SVMs)

Intelligent Systems, Fall 2007

Faustino GomezIDSIA

Page 354: Non-Parametric Methods in Machine Learning

SVMs: Basic Idea
• Map the original, possibly linearly inseparable data into a higher-dimensional feature space:

x → Φ(x)   (for the linear case: Φ(x) = x)

where the original data is hopefully linearly separable
• Find the "best" hyperplane in the feature space using a linear classifier

Page 355: Non-Parametric Methods in Machine Learning

[Figure: original data x (left) mapped into the feature space Φ(x) (right).]

Page 356: Non-Parametric Methods in Machine Learning

Observations
• The solution for the perceptron is a linear combination of the training points:

w = Σᵢ αᵢ dᵢ xᵢ,   αᵢ ≥ 0

• Only uses informative points (mistake driven)
• The coefficient of a point in the combination reflects its "difficulty"

Page 357: Non-Parametric Methods in Machine Learning

Perceptron Learning Algorithm

Input: a list of n training examples (x₁, d₁) ... (xₙ, dₙ), where ∀i: dᵢ ∈ {+1, −1}
Output: classifying hyperplane w

Algorithm:
  Randomly initialize w;
  while w makes errors on the training set do
    for each (xᵢ, dᵢ) do
      let yᵢ = sign(w · xᵢ);
      if yᵢ ≠ dᵢ then w ← w + η·dᵢ·xᵢ;
    end
  end

The resulting weight vector can be written as w = Σᵢ αᵢ dᵢ xᵢ with αᵢ ≥ 0.

Page 358: Non-Parametric Methods in Machine Learning

Dual Representation

• The decision function can be rewritten as follows:

w = Σᵢ αᵢ dᵢ xᵢ

f(x) = w·x + b = Σᵢ αᵢ dᵢ (xᵢ·x) + b

• The data only appears within dot products!

Page 359: Non-Parametric Methods in Machine Learning

Non-Linear SVMs

• So far we have seen only the linear case, Φ(x) = x, where the feature space is the same as the input (data) space
• For data that is not linearly separable we need to map to a richer feature space
• where the data can be correctly classified with a line

Page 360: Non-Parametric Methods in Machine Learning

Non-linear SVMs
• Datasets that are linearly separable work fine.
• But what do we do if the dataset is just too hard?
• If we map to a higher-dimensional space, the data is now linearly separable.

[Figure: points on a 1-D axis x that cannot be separated by a threshold become linearly separable after mapping into a 2-D space (x, y).]

Page 361: Non-Parametric Methods in Machine Learning

Non-linear SVMs: another example

Φ: x → φ(x)

Page 362: Non-Parametric Methods in Machine Learning

Implicit Mapping to Feature Space

Problem: working in high-dimensional feature spaces can be computationally intractable (very large vectors).

Solution: use kernels
• solves the computational problem of working with high dimensions
• makes even infinite dimensions possible

Page 363: Non-Parametric Methods in Machine Learning

Kernel Induced Feature Spaces

• Add the feature space mapping to the dual representation:

f(x) = Σᵢ αᵢ dᵢ K(xᵢ, x) + b

• A kernel is a function that returns the dot product between two images in feature space:

K(x, y) = Φ(x)·Φ(y)   (before, this was simply x·y)

where K(·, ·) is the kernel function.

Page 364: Non-Parametric Methods in Machine Learning

Kernel Induced Feature Spaces

• For the linear case it's just the dot product of the original data:

x·y = K(x, y) = Φ(x)·Φ(y)

• For the non-linear case Φ(x) can be much more complex and map to even infinite-dimensional spaces
• Kernel trick: we don't need to know Φ(x)! We don't need to represent the feature space explicitly!

Page 365: Non-Parametric Methods in Machine Learning

Examples of Kernel Functions
• Linear:  K(x, y) = x·y
• Polynomial of power d:  K(x, y) = (x·y)^d
• Gaussian:  K(x, y) = exp( −||x − y||² / 2σ² )
• Sigmoid:  K(x, y) = tanh( β₀·x·y + β₁ )

Page 366: Non-Parametric Methods in Machine Learning

Example: polynomial kernel

K(x, y) = (x·y)^d

Page 367: Non-Parametric Methods in Machine Learning

Polynomial Kernel (cont.)

Given two data points x = (x₁, x₂) and y = (y₁, y₂), for d = 2:

K(x, y) = (x₁y₁ + x₂y₂)²
        = x₁²y₁² + x₂²y₂² + 2·x₁y₁x₂y₂
        = (x₁², x₂², √2·x₁x₂) · (y₁², y₂², √2·y₁y₂)
        = Φ(x)·Φ(y)

Page 368: Non-Parametric Methods in Machine Learning

SVM architecture

[Figure: the input (data) vector (x₁, x₂, ..., x_m) feeds a kernel layer computing K(x, xᵢ) for each support vector xᵢ; a linear output neuron with bias b combines them:]

f(x) = Σᵢ αᵢ dᵢ K(xᵢ, x) + b

Page 369: Non-Parametric Methods in Machine Learning

Some Issues
• Choice of kernel
– A Gaussian or polynomial kernel is the default
– If ineffective, more elaborate kernels are needed
– Domain experts can give assistance in formulating appropriate similarity measures
• Choice of kernel parameters
– e.g. σ in the Gaussian kernel
– σ is the distance between the closest points with different classifications
– In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters.

Page 370: Non-Parametric Methods in Machine Learning

Reinforcement LearningFaustino Gomez

IDSIA

Intelligent Systems, Fall 2007

Page 371: Non-Parametric Methods in Machine Learning

Supervised Learning revisited

What if we don't know good targets d for our input samples?

input space: x ∈ X
output space: d ∈ D
find f: X → D using a training set of examples {(xᵢ, dᵢ)}, i = 1..N

Page 372: Non-Parametric Methods in Machine Learning

Sequential Decision Tasks

[Figure: the agent observes the state of the environment and selects an action.]

The agent sees the state of the environment at each time-step and must select the best action to achieve a goal.
Problem: we don't know what the correct actions (targets) are beforehand, so we can't learn from examples!

Page 373: Non-Parametric Methods in Machine Learning

Sequential Decision Tasks

[Figure: the agent observes the state and selects an action.]

Decision sequence:

sₜ --aₜ--> sₜ₊₁ --aₜ₊₁--> sₜ₊₂ --aₜ₊₂--> sₜ₊₃ --aₜ₊₃--> ...

sₜ: state of the environment at time t
aₜ: action taken by the agent at time t after seeing sₜ

Page 374: Non-Parametric Methods in Machine Learning

Examples of Sequential Decision Tasks

• Autonomous robotics
• Controlling chemical processes
• Network routing
• Game playing
• Stock trading

Page 375: Non-Parametric Methods in Machine Learning

Example: Pole balancingbenchmark

Page 376: Non-Parametric Methods in Machine Learning

Reinforcement Learning Problem

[Figure: the agent now also receives a reward signal from the environment.]

Now the agent receives a reward or reinforcement signal that gives some indication of whether its behavior is "good" or "bad".

Page 377: Non-Parametric Methods in Machine Learning

Reinforcement Learning Problem

Now we have:

sₜ --aₜ--> sₜ₊₁ (reward rₜ₊₁) --aₜ₊₁--> sₜ₊₂ (rₜ₊₂) --aₜ₊₂--> sₜ₊₃ (rₜ₊₃) --aₜ₊₃--> ...

Page 378: Non-Parametric Methods in Machine Learning

Reinforcement Learning Problem
The goal is to learn a policy that maximizes the reward r over the long term:

Rₜ = Σₖ₌₀^T γᵏ·rₜ₊ₖ₊₁,   0 ≤ γ ≤ 1

where γ is the discount rate. γ = 1 means that all rewards received matter equally; γ < 1 means that rewards further in the future are less important.

A policy is the agent's function that maps states to actions:

π(sₜ) → aₜ   (note: this is the deterministic case)

For each state the agent encounters, the policy tells it what action to take. The best policy is the one that selects, in each state, the action that leads to the highest long-term reward.

Page 379: Non-Parametric Methods in Machine Learning

Why are Reinforcement Learning Problems Hard to Solve?

• Not just trying to learn known behavior and then generalizing from it
• Have to discover behavior from scratch
• Only have a scalar reinforcement to guide learning
• Reinforcement may be infrequent
• Credit assignment problem
– How much credit should each action in the sequence of actions get for the outcome?

Page 380: Non-Parametric Methods in Machine Learning

Solving Reinforcement Problems

• If the problem can be formulated as a Markov Decision Process,
• we can use a value function to represent how "good" each state is in terms of providing reward
• Use various methods to learn the value function
– Dynamic Programming
– Temporal Difference Learning (e.g. Q-learning)

Page 381: Non-Parametric Methods in Machine Learning

Markov Decision Processes

a finite set of states: s ∈ S   (also known as the state space)
a finite set of actions: a ∈ A   (also known as the action space)
state transition probabilities: P^a_{ss'} = Pr{ sₜ₊₁ = s' | sₜ = s, aₜ = a }
reward function: R^a_{ss'} = E{ rₜ₊₁ | sₜ = s, aₜ = a }
a policy: π(s, a) = Pr{ aₜ = a | sₜ = s }

...and the Markov property must hold.

Page 382: Non-Parametric Methods in Machine Learning

Markov Decision Processes

(Same definitions as above; the state-transition probabilities P^a_{ss'} and the reward function R^a_{ss'} together constitute the model of the environment.)

Page 383: Non-Parametric Methods in Machine Learning

The Markov Property

This just means that the probability of the next state and reward only depends on the immediately preceding state and action:

P^a_{ss'} = Pr{ sₜ₊₁ = s', rₜ₊₁ = r | sₜ, aₜ, rₜ, sₜ₋₁, aₜ₋₁, rₜ₋₁, ..., s₀, a₀, r₀ }
          = Pr{ sₜ₊₁ = s', rₜ₊₁ = r | sₜ, aₜ }

It doesn't matter what happened before that!

Page 384: Non-Parametric Methods in Machine Learning

Example transition matrix

Each action a has its own transition matrix P^a.

[Figure: a 4-state example; the entry in row s and column s' gives the probability P^a_{ss'} of going from state s to state s' if the action is taken, and each row sums to 1.0. For instance, P^a_{s3 s4} = 0.6: there is a 60% chance of going to state 4 when action a is taken in state 3.]

Page 385: Non-Parametric Methods in Machine Learning

State transitions (cont.)

[Figure: the same transition matrix drawn as a transition graph; all edges leaving a state add up to 1.0.]

Page 386: Non-Parametric Methods in Machine Learning

Agent Policy

• The policy implements the agent's behavior
• In general, the policy will be stochastic; it gives the probability of taking action a in state s:

π(s, a) = Pr{ aₜ = a | sₜ = s }

• Often the policy is deterministic and we can write:

π(s) → a

i.e. for each state the policy says which action to use.

Page 387: Non-Parametric Methods in Machine Learning

Value Functions

The value function tells the agent how "good" it is to be in a given state:

V^π(s) = E_π{ Rₜ | sₜ = s },   where   Rₜ = Σₖ₌₀^T γᵏ·rₜ₊ₖ₊₁,  0 ≤ γ ≤ 1

(π is the agent's policy.) This says how much reward the agent can expect to receive in the future if it continues with its policy from state s.

Page 388: Non-Parametric Methods in Machine Learning

Value Functions

V^π(s) = E_π{ Rₜ | sₜ = s }
       = E_π{ Σₖ₌₀^T γᵏ·rₜ₊ₖ₊₁ | sₜ = s }
       = Σₐ π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·E_π{ Σₖ₌₀^T γᵏ·rₜ₊ₖ₊₂ | sₜ₊₁ = s' } ]
       = Σₐ π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·V^π(s') ]      (the Bellman equation)

The last bracket contains the value of the possible next state.

Page 389: Non-Parametric Methods in Machine Learning

Reinforcement LearningPart II (Policy Iteration)

Faustino GomezIDSIA

Intelligent Systems, Fall 2007

Page 390: Non-Parametric Methods in Machine Learning

More about R

Rₜ = Σₖ₌₀^T γᵏ·rₜ₊ₖ₊₁,   0 ≤ γ ≤ 1

R is also known as the return. It is how much reward the agent will receive from time t into the future.

• If γ is close to 0, the agent cares more about selecting actions that maximize immediate reward: shortsighted
• If γ is close to 1, the agent takes future rewards into account more strongly: farsighted

Page 391: Non-Parametric Methods in Machine Learning

Action-Value function

• Same as the value function, but gives a value for each action in state s
• Q(s, a) is the expected future reward if we take action a in state s and then continue selecting actions according to the policy π:

Q^π(s, a) = E_π{ Rₜ | sₜ = s, aₜ = a }
          = E_π{ Σₖ₌₀^T γᵏ·rₜ₊ₖ₊₁ | sₜ = s, aₜ = a }
          = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·E_π{ Σₖ₌₀^T γᵏ·rₜ₊ₖ₊₂ | sₜ₊₁ = s' } ]
          = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·V^π(s') ]

Page 392: Non-Parametric Methods in Machine Learning

Value-function vs. Q-function

V^π(s) = Σₐ π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·V^π(s') ]

Q^π(s, a) = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·V^π(s') ]

The Q-function implements one step of "lookahead": it caches the value of taking each action in a given state.

Page 393: Non-Parametric Methods in Machine Learning

The agent now consists of two components:
1. Value function (Q-function)
2. Policy

A policy can be computed from the values.

Page 394: Non-Parametric Methods in Machine Learning

How to compute a policy from Q?
• Greedy policy: select the action in each state with the highest value:

π(s) = argmaxₐ Q(s, a)

• ε-greedy policy: select the greedy action with probability 1 − ε and some other, random action with probability ε (this will be useful later)
• Stochastic policy: use the action values to select actions probabilistically (more on this later)

Page 395: Non-Parametric Methods in Machine Learning

Optimal Value functions

• The optimal value function V*(s) is the value function of the policy that generates the highest values for all states:

V*(s) = max_π V^π(s),   for all s ∈ S

• Likewise for the Q-function:

Q*(s, a) = max_π Q^π(s, a),   for all s ∈ S

Page 396: Non-Parametric Methods in Machine Learning

Dynamic Programming Methods: Policy Iteration

Two steps:
1. Policy Evaluation: compute the value function of the current policy
2. Policy Improvement: improve the policy with respect to the computed value function

Page 397: Non-Parametric Methods in Machine Learning

Policy Evaluation

• Use the Bellman equation as an update rule
• "Sweep" through the state space, computing the value V(s) for each state using the values of each of the possible next states s':

V^π(s) ← Σₐ π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·V^π(s') ]

Page 398: Non-Parametric Methods in Machine Learning

Policy Evaluation algorithm

[Figure: the policy evaluation algorithm, repeatedly applying the value backup above to every state; if the change in V is small enough, stop.]
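A minimal sketch of iterative policy evaluation on a tiny, made-up MDP (the transition probabilities, rewards, and uniform policy below are illustrative assumptions, not numbers from the slides):

    import numpy as np

    n_states, n_actions, gamma = 2, 2, 0.9
    P = np.array([[[0.8, 0.2], [0.1, 0.9]],      # P[a][s][s']
                  [[0.5, 0.5], [0.3, 0.7]]])
    R = np.array([[[1.0, 0.0], [0.0, 2.0]],      # R[a][s][s']
                  [[0.0, 0.5], [1.0, 0.0]]])
    pi = np.full((n_states, n_actions), 0.5)     # uniform random policy pi(s, a)

    V = np.zeros(n_states)
    for _ in range(200):                         # sweep until (approximately) converged
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                V_new[s] += pi[s, a] * np.sum(P[a, s] * (R[a, s] + gamma * V))
        if np.max(np.abs(V_new - V)) < 1e-8:
            break
        V = V_new
    print(V)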

Page 399: Non-Parametric Methods in Machine Learning

Value backups

[Figure: backup diagram — from state s, each action a is taken with probability π(s, a), and each next state s' follows with probability P^a_{ss'}.]

Page 400: Non-Parametric Methods in Machine Learning

Value backup example

4 states: S = {s1, s2, s3, s4}; 3 actions: A = {a1, a2, a3}.

[Figure: backup diagram for s1 — the branches π(s1, a1), π(s1, a2), π(s1, a3) lead to the transitions P^{a1}_{s1 s1}, P^{a1}_{s1 s2}, P^{a2}_{s1 s2}, P^{a2}_{s1 s3}, P^{a3}_{s1 s2}, P^{a3}_{s1 s4}, each ending in a next-state value γV(s').]

Note: in this case, each action only leads to some of the other states. That is, in some cases P^a_{s1 s'} = 0, e.g. P^{a1}_{s1 s3}, P^{a1}_{s1 s4}, and P^{a2}_{s1 s4}, among others.

Page 401: Non-Parametric Methods in Machine Learning

Value Backup example (cont.)

For action a1 in state s1, compute the bracketed term for each possible next state and add them:

P^{a1}_{s1 s1}·(γV(s'1) + R^{a1}_{s1 s1}) + P^{a1}_{s1 s2}·(γV(s'2) + R^{a1}_{s1 s2})

Do the same for the other actions (a2 and a3), weight each action's sum by π(s1, a), and add them all up:

V^π(s1) = Σₐ π(s1, a) Σ_{s'} P^a_{s1 s'} ( γV(s') + R^a_{s1 s'} )

i.e. compute the Bellman equation for state s1. Now do the same for the other states.

Page 408: Non-Parametric Methods in Machine Learning

Policy Improvement

Update the current policy to be greedy with respect to the value function:

For each s ∈ S:

π'(s) ← arg maxₐ Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·V^π(s') ]

Page 409: Non-Parametric Methods in Machine Learning

Combining the two steps

Cycle through policy evaluation and policy improvement:

evaluation:   V → V^π
improvement:  π → greedy(V)

π₀ --E--> V^{π₀} --I--> π₁ --E--> V^{π₁} --I--> π₂ --E--> ... --I--> π* --E--> V*

This eventually converges to the optimal policy!

Page 410: Non-Parametric Methods in Machine Learning

Policy IterationAlgorithm:

Page 411: Non-Parametric Methods in Machine Learning

Grid World example

S = {1, 2, ..., 14}
A = {up, down, left, right}

[Figure: a grid with shaded goal states; the agent can't leave the goal states, i.e. they are terminal.]

Page 412: Non-Parametric Methods in Machine Learning

Grid world example (cont.)

Page 413: Non-Parametric Methods in Machine Learning

Grid World example (cont.)

Page 414: Non-Parametric Methods in Machine Learning

Dynamic Programming Observations

Policy Iteration is guaranteed to converge for finite MDPs, but...

Problem: what if we don't know the model, P^a_{ss'} and R^a_{ss'}?   (next lecture)

Page 415: Non-Parametric Methods in Machine Learning

Reinforcement LearningPart III

Faustino GomezIDSIA

Intelligent Systems, Fall 2007

Page 416: Non-Parametric Methods in Machine Learning

Value Iteration

• Combine one sweep of Policy Evaluation with Policy Improvement
• Don't wait for evaluation to converge before switching to improvement
• Also converges to the optimal value function:

V_{k+1}(s) ← maxₐ Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·V_k(s') ]

Page 417: Non-Parametric Methods in Machine Learning

Value Iteration Algorithm

Page 418: Non-Parametric Methods in Machine Learning

What if we don’t know the model?

• This is the more general case, where the agent can only observe actual state transitions caused by its own actions
• We cannot use Dynamic Programming directly
• It turns out we can still converge to the optimal value function without the model!

Page 419: Non-Parametric Methods in Machine Learning

Temporal Difference Methods (TD)

• Use the difference between the value of the current state and of the next visited state to update the current state
• We don't need the values of all next states, which is good because in the real world we can only visit one at a time:

V(sₜ) ← V(sₜ) + α·[ rₜ₊₁ + γ·V(sₜ₊₁) − V(sₜ) ]

Page 420: Non-Parametric Methods in Machine Learning

TD vs. DP

TD:  V(sₜ) ← V(sₜ) + α·[ rₜ₊₁ + γ·V(sₜ₊₁) − V(sₜ) ]

DP:  V(s) ← Σₐ π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ·V(s') ]

• In Dynamic Programming (e.g. Policy Iteration) we have to take expectations over possible actions and next states
• TD uses the current estimate of V at sₜ₊₁ to compute the new value for sₜ
• Note: there is no reference to time in DP

Page 421: Non-Parametric Methods in Machine Learning

TD Control: Q-learning

• Uses the Q-value of the "best" action in the next state
• A version of TD using the Q-function
• Because we have a value for each action, it can be used for on-line learning/control:

Q(sₜ, aₜ) ← Q(sₜ, aₜ) + α·[ rₜ₊₁ + γ·maxₐ Q(sₜ₊₁, a) − Q(sₜ, aₜ) ]

Page 422: Non-Parametric Methods in Machine Learning

Q-learning (cont.)

Q(sₜ, aₜ) ← Q(sₜ, aₜ) + α·[ ( rₜ₊₁ + γ·maxₐ Q(sₜ₊₁, a) ) − Q(sₜ, aₜ) ]

The bracketed quantity rₜ₊₁ + γ·maxₐ Q(sₜ₊₁, a) is the target, and α is the learning rate. Move the current Q-value of s and a toward the target by an amount determined by the learning rate (we've seen this before).

Page 423: Non-Parametric Methods in Machine Learning

Q-learning Algorithm
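The algorithm itself is shown as a figure in the original slides; the following is a minimal sketch of the tabular Q-learning loop with an ε-greedy policy (the environment interface env.reset() / env.step(a) → (s', r, done) and all parameter values are assumptions for illustration, not definitions from the slides):

    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=500,
                   alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = np.zeros((n_states, n_actions))
        rng = np.random.default_rng(0)
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy action selection (explore vs. exploit)
                a = rng.integers(n_actions) if rng.random() < epsilon else np.argmax(Q[s])
                s_next, r, done = env.step(a)
                # move Q(s,a) toward the target r + gamma * max_a' Q(s',a')
                target = r + gamma * np.max(Q[s_next])
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q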

Page 424: Non-Parametric Methods in Machine Learning

Exploration vs. Exploitation

• Exploitation: take actions already found to be good in each state, to maximize reward
• Exploration: take a chance on actions that may have lower value in order to learn more, and maybe find the true best action to exploit later

We need to balance the two!

Page 425: Non-Parametric Methods in Machine Learning

Balancing exploitation/exploration
• ε-greedy policy: select the greedy action with probability 1 − ε (exploit) and some other, random action with probability ε (explore)
• Stochastic policy: use the action values to select actions probabilistically, e.g. soft max:

π(s, b) = e^{Q(s,b)/τ} / Σₐ e^{Q(s,a)/τ},   where τ > 0 is the temperature

High temperatures increase exploration by making the policy more random; lower temperatures increase exploitation by making the policy more greedy.
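A tiny sketch of the soft max rule (the Q-values below are illustrative):

    import numpy as np

    def softmax_policy(q_values, tau=1.0):
        prefs = np.exp(np.asarray(q_values) / tau)
        return prefs / prefs.sum()

    q = [1.0, 2.0, 0.5]
    print(softmax_policy(q, tau=1.0))    # low temperature: highest-valued action dominates
    print(softmax_policy(q, tau=10.0))   # high temperature: nearly uniform (more exploration)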

Page 426: Non-Parametric Methods in Machine Learning

Limitations of Standard RL Methods

• Continuous state/action spaces
– Cannot represent the value function with a table
– Need some kind of function approximator to represent V
– Loss of convergence guarantees
• Partial observability
– The agent no longer sees the underlying state
– From the agent's perspective the Markov property does not hold
– Need to estimate the underlying state

Page 427: Non-Parametric Methods in Machine Learning

What if the state space is very large or continuous?

• Cannot represent the value function with a table
• Need some kind of function approximator to represent V or Q
• Loss of convergence guarantees

Page 428: Non-Parametric Methods in Machine Learning

Coarse Coding Approximation

• Each circle is a "receptive field"
• Each field has a learnable weight
• The state activates all receptive fields it falls within
• V(s) or Q(s, a) are computed by summing the weights of the activated fields (i.e. a linear approximator)

Page 429: Non-Parametric Methods in Machine Learning

Coarse Coding

Page 430: Non-Parametric Methods in Machine Learning

Tile Coding Approximation

Page 431: Non-Parametric Methods in Machine Learning

Tile Coding

• Each axis corresponds to one state variable
• Each tile has a learnable weight
• The state activates one tile in each tiling
• V or Q are computed by adding the weights of all activated tiles

Page 432: Non-Parametric Methods in Machine Learning

Representing Q(s,a) with an NN

[Figure: a feedforward network taking sₜ and aₜ as inputs and outputting Q(sₜ, aₜ).]

Page 433: Non-Parametric Methods in Machine Learning

Training the NN Q-function

Training is performed on-line using the Q-values from the agent's state transitions. For Q-learning:

input:  sₜ, aₜ
target: rₜ + γ·maxₐ Q(sₜ₊₁, a)

Page 434: Non-Parametric Methods in Machine Learning

What if the agent can't completely "see" the state?

• The environment is said to be partially observable
• The agent only receives an "observation" of the state, provided by its sensory system:

Ω(sₜ) → oₜ,   where o ∈ O, sₜ ↦ oₜ, and |O| << |S|

O is the set of possible observations, which is usually smaller than S. Think of Ω as the agent's sensory system.

Page 435: Non-Parametric Methods in Machine Learning

Perceptual Aliasing

• Different states can look the same or similar:

Ω(sᵢ) = Ω(sⱼ) = oₖ,   where i ≠ j

• If the action that is best for sᵢ is not good for sⱼ, we have a problem, because the agent will take the same action in both.

Page 436: Non-Parametric Methods in Machine Learning

RL under Partial Observability

• Now, from the agent's perspective, previous inputs (the observations) are important
• The current input is not enough to determine what state the environment is in!

Page 437: Non-Parametric Methods in Machine Learning

RL under Partial Observability

The agent now needs some way to determine the underlying state; this means we need memory.

One solution: use an RNN to represent the V or Q-function.

Page 438: Non-Parametric Methods in Machine Learning

Collective and Swarm Intelligence

Gianni Di Caro

IDSIA -USI/SUPSI

Page 439: Non-Parametric Methods in Machine Learning

Road map

■ Generalities on Collective/Swarm Intelligence (SI)

■ Main characteristics of SI design

■ Cellular Automata: the simplest and earliest example of SI

■ Particle Swarm Optimization: a state-of-the-art SI framework for continuous optimization

■ Generalities on Ant algorithms for adaptive task allocation, and Ant Colony Optimization, a state-of-the-art SI framework for combinatorial optimization and network routing.

Page 440: Non-Parametric Methods in Machine Learning

Let’s start with some examples . . .

■ Collective behaviors and problem solution in natural systems

✦ Vertebrates: swarming, flocking, herding, schooling

✦ Social insects (ants, termites, bees, wasps): nest building,foraging, assembly, sorting,. . .

Page 441: Non-Parametric Methods in Machine Learning

Fish schooling

( c©CORO, CalTech)

Page 442: Non-Parametric Methods in Machine Learning

Birds flocking in V-formation

( c©CORO, Caltech)

Page 443: Non-Parametric Methods in Machine Learning

Termites’ nest

( c©Masson)

Page 444: Non-Parametric Methods in Machine Learning

Honeybee comb

( c©S. Camazine)

Page 445: Non-Parametric Methods in Machine Learning

Swarm of killer bees

Page 446: Non-Parametric Methods in Machine Learning

Cooperation in ant colonies

Ant chain ( c©S. Camazine) Ant wall ( c©S. Camazine)

Page 447: Non-Parametric Methods in Machine Learning

Wasps’ nest

( c©G. Theraulaz)

Page 448: Non-Parametric Methods in Machine Learning

Ants and bees at work

■ Ants: leaf-cutting, breeding, chaining

■ Ants: Food catering

■ Ants: Learning the shortest path between food and nest

■ Bees: waggle dance to recruit workers

Page 449: Non-Parametric Methods in Machine Learning

What do all these behaviors have in common?

■ Distributed society of autonomous individuals/agents

■ Control is fully distributed among the agents

■ Communications among the individuals are localized

■ Interaction rules and information processing seem to be simple: minimalist agent capabilities

■ System-level behaviors appear to transcend the behavioral repertoire of the single agent

■ The overall response of the system features:✦ Robustness

✦ Adaptivity

✦ Scalability

Page 450: Non-Parametric Methods in Machine Learning

Bottom-up vs. top-down design

■ Ontogenetic and phylogenetic evolution has (necessarily) followed such a bottom-up approach (grassroots) to design systems:
✦ Instantiation of the basic units (atoms, cells, organs, organisms, individuals, ...) composing the system, letting them (self-)organize to generate more complex/organized system-level behaviors and/or structures
✦ Population + interaction protocols are "more important" than the single modules
■ On the other hand, from an engineering point of view we can also choose a top-down approach:
✦ Acquisition of comprehensive knowledge about the problem/system to deal with, analysis, decomposition, definition of a possibly optimal strategy

Page 451: Non-Parametric Methods in Machine Learning

Swarm intelligence (SI): a definition

■ SI refers to the bottom-up design of distributed systems that display forms of useful and/or interesting behavior at the global level as a result of the actions of a number of (relatively simple) composing units interacting with one another and with their environment at the local level.

■ A relatively new and quite successful computational design paradigm. It finds its main roots in the work developed at the beginning of the '90s on algorithms inspired by the behaviors of social insects (mainly ants). IDSIA has played a main role in laying the foundations and in the application of SI.

■ SI design has been applied to a wide variety of problems in optimization, telecommunications, robotics, and complex-system modeling. Many state-of-the-art implementations.

Page 452: Non-Parametric Methods in Machine Learning

Challenges of SI design

■ Given a task/problem to deal with, a number of design choices:
1. Characteristics/skills of the agents
2. Size of the population (related to choice 1)
3. Neighborhood definition
4. Interaction protocols and information to exchange
5. Where the information is updated (agent, channel, environment)
6. Use or not of randomness
7. Synchronous or asynchronous activities and interactions
8. ...
■ Lots of parameters
■ Predictability is an issue

Is a top-down approach better?

Page 453: Non-Parametric Methods in Machine Learning

Let’s focus on neighborhood & communication

■ Point-to-point: antennation, trophallaxis (food or liquid exchange), mandibular contact, direct visual contact, chemical contact, hardwired direct connections (neurons, cells), unicast radio contact

■ Limited-range broadcast: the signal propagates to some limited extent throughout the environment and/or is made available for a rather short time (e.g., use of the lateral line in fishes to detect water waves, generic visual detection, radio broadcast)

■ Indirect: two individuals interact indirectly when one of them modifies the environment and the other responds asynchronously to the modified environment at a later time. This is called stigmergy [Grassé, 1959] (e.g., pheromone laying/following, post-it notes, the web)

Page 454: Non-Parametric Methods in Machine Learning

SI algorithmic frameworks (and relatives)

■ Stigmergy has led to Ant Algorithms and in particular to Ant Colony Optimization (ACO) [Dorigo & Di Caro, 1999], which is based on the shortest path finding abilities of ant colonies.

■ Also Cultural Algorithms [Reynolds, 1994] are population-based algorithms relying on stigmergy. They are derived from processes of cultural evolution and exchange in societies.

■ Broadcast-like communication related to schooling and flocking behaviors has inspired Particle Swarm Optimization [Kennedy & Eberhart, 2001].

■ Point-to-point communication is used in Hopfield neural networks [Hopfield, 1982], derived from the brain's structure and behavior.

■ Point-to-point and neighbor broadcast is at the basis of Cellular Automata [Wolfram, 1984].

■ Genetic algorithms, artificial immune systems, . . .

Page 455: Non-Parametric Methods in Machine Learning

Frameworks that we will discuss

1. Cellular Automata (CA) (Gossip Algorithms?)

2. Particle Swarm Optimization (PSO)

3. Stigmergy-based algorithms

4. Ant Colony Optimization (ACO)

Page 456: Non-Parametric Methods in Machine Learning

Cellular Automata (CA)

Page 457: Non-Parametric Methods in Machine Learning

CA: general definitions

■ A set of M automata (cells) a_i, i = 1, . . . , M: finite state machines with a specified number of states S = {s_1, s_2, . . . , s_k}

■ The set of cells has an interconnection topology. The neighborhood of a cell is a function N which associates to a cell a_i an ordered set of n neighbors, N(a_i) = {a_i^1, a_i^2, . . . , a_i^n}

■ A local state-transition function F : S^n → S that depends on the current state s_i(t) of the cell and on the states of its n neighbors

■ At discrete time-steps (and either synchronously or asynchronously) each automaton gets the state from its neighbors and changes its state accordingly: s_i(t + 1) = F(s_i(t), s(a_i^1), s(a_i^2), . . . , s(a_i^n))

■ In the simplest cases (which are most amenable to analysis), the topology is a regular 1D or 2D lattice, the units are either boolean, S = {0, 1}, or have very few states, N corresponds to the physically closest cells, and n is between 3 and 8.

A boolean 1D CA is just a string of binary digits, while a 2D one is a grid (matrix) of binary digits.

Page 458: Non-Parametric Methods in Machine Learning

CA: dynamic system view

■ The set of the M cells can be seen at each time-step as an M-dimensional vector a defined in the state domain. The CA evolves as an M-dimensional discrete-time dynamic system: a(t + 1) = F(a(t))

■ Let's consider the simple but enlightening case where: a_i(t + 1) = F(a_{i−1}(t), a_i(t), a_{i+1}(t))

■ Analogous to a time-discrete 1D partial differential equation.

■ Boundary conditions (torus): a_0 = a_M, and a_{M+1} = a_1.

■ “Only” 2^M possible configurations.

Page 459: Non-Parametric Methods in Machine Learning

CA: Given the rule what’s the behavior?

■ Time evolution of the cell vector: attractor points, oscillatory behaviors, emergence of spatial regularities, dependence on initial conditions, dependence on perturbations, . . .

■ Fixed point: a* = F(a*). From a time evolution point of view, a fixed point exists if, given that a(0) = a*, a(t) = a* ∀t

■ A fixed point is asymptotically stable if the above relation holds for all initial conditions in a specified neighborhood of a(0)

■ For two close initial conditions a(0) and a(0) + ε, after k iterations, the configurations F^k(a(0)) and F^k(a(0) + ε) can be different. Lyapunov exponent λ: ε e^{kλ} = |F^k(a(0) + ε) − F^k(a(0))|.

■ For k → ∞, ε → 0, e^{λ(a(0))} = divergence speed between the two initial conditions. λ > 0 implies dependence on initial conditions, λ < 0 implies convergence.

Page 460: Non-Parametric Methods in Machine Learning

CA: Wolfram notation for rules

■ Wolfram code is the name used for the method of enumerating elementary cellular automaton rules used by Stephen Wolfram in his seminal work on CAs.

■ For the simplest case of {0, 1} states and n = 3 the code works as follows. All the possible neighborhood configurations are written as the following list: 111 110 101 100 011 010 001 000. A transition rule associates to each one of these bit triples a boolean value a ∈ {0, 1} that represents the new state. Therefore, a generic rule can be written as:

111  110  101  100  011  010  001  000

a_111  a_110  a_101  a_100  a_011  a_010  a_001  a_000

The bit-array (a_111 a_110 a_101 a_100 a_011 a_010 a_001 a_000) can be read as a decimal number, which names the specific rule.

■ For instance, the bit-array 00011110 identifies rule 30

■ The method can be extended to any state set and neighborhood size n
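As an aside (not part of the original slides), a minimal Python sketch of how a Wolfram rule number can be decoded into its transition table and iterated on a ring of cells; rule 30, shown on the next slide, is used as the example.

import random  # not needed here, shown only for consistency with later sketches

def rule_table(rule_number):
    # Map each 3-bit neighborhood (left, center, right) to the new cell state.
    bits = [(rule_number >> k) & 1 for k in range(8)]   # bit k = new state for neighborhood k
    return {(n >> 2 & 1, n >> 1 & 1, n & 1): bits[n] for n in range(8)}

def step(cells, table):
    # One synchronous update with periodic (torus) boundary conditions.
    m = len(cells)
    return [table[(cells[(i - 1) % m], cells[i], cells[(i + 1) % m])] for i in range(m)]

if __name__ == "__main__":
    table = rule_table(30)            # rule 30 from the next slide
    cells = [0] * 31
    cells[15] = 1                     # single seed cell
    for _ in range(15):
        print("".join("#" if c else "." for c in cells))
        cells = step(cells, table)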

Page 461: Non-Parametric Methods in Machine Learning

CA: Wolfram’s rule 30

Linear CA, {0, 1} states, F(a_{i−1}, a_i, a_{i+1}) is a boolean function of 3 bits (n = 3):

Neighborhood state: 111 110 101 100 011 010 001 000

New cell state: 0 0 0 1 1 1 1 0

Page 462: Non-Parametric Methods in Machine Learning

CA: Wolfram’s rule 184

Linear CA, {0, 1} states, F(a_{i−1}, a_i, a_{i+1}) is a boolean function of 3 bits (n = 3):

Neighborhood state: 111 110 101 100 011 010 001 000

New cell state: 1 0 1 1 1 0 0 0

Page 463: Non-Parametric Methods in Machine Learning

CA: Wolfram’s rule 110

Linear CA, {0, 1} states, F(a_{i−1}, a_i, a_{i+1}) is a boolean function of 3 bits (n = 3):

Neighborhood state: 111 110 101 100 011 010 001 000

New cell state: 0 1 1 0 1 1 1 0

Page 464: Non-Parametric Methods in Machine Learning

CA: Wolfram classification

■ Class 1: After a few steps the system reaches a homogeneous configuration independent of the initial conditions

■ Class 2: After a few steps they show simple spatio-temporal configurations made of separate regions which are either constant or periodic. The general structure of the arising configurations is relatively independent of the initial conditions.

■ Class 3: For certain subsets of initial conditions they show chaotic behavior (no periodic structures). In many cases, for initial configurations with all cells at zero but one set to one, a self-similar behavior arises.

■ Class 4: Strong dependence on initial conditions, highly complex, irregular, and moving structures.

■ All have two Lyapunov exponents measuring the propagation of the information on the initial conditions in both directions. λ = 0 for classes 1 and 2, λ > 0 for class 3, and λ > 0, λ → 0 for class 4.

Page 465: Non-Parametric Methods in Machine Learning

CA: Given the behavior, find the rule!

■ This is called the inverse problem

■ It’s “useful” to let the CA carry out computations we are interested in (e.g., analogous to setting/learning the weights of the connections in a neural network)

■ A widely studied example is the density/majority/parity problem: Find a transition rule that, given an initial state of a CA with an odd number of cells, and a finite number T of max iterations to run, will result in an “all zero” state (a(T) = 0) if a(0) contains a majority of cells at state 0, or in an “all one” state otherwise.

■ The CA becomes a parallel computer that detects the relative density of 0 or 1 symbols in a configuration

■ The simplest rule (switch to the same state of the majority of my neighbors) does not always work! Many different rules have been studied. For 1D automata with 149 cells the best recognition rate for large test sets of random initial configurations is ≈ 83%.

Page 466: Non-Parametric Methods in Machine Learning

CA: Further readings and study

■ Golly, a cross-platform program to directly experiment with the dynamics generated by different rules: http://golly.sourceforge.net/

■ A collection of Java applets to study the behavior of CAs: http://germain.umemat.maine.edu/faculty/hiebeler/java/CA/CellularAutomata.html

■ A Java applet to observe and enjoy the amazing behavior and capabilities of 2D CAs: http://www.mirekw.com/ca/mjcell/mjcell.html

■ Another Java applet showing a very large set of 2D patterns of mathematical, physical, biological, and social interest: http://germain.umemat.maine.edu/faculty/hiebeler/java/CA/CellularAutomata.html

■ A comprehensive (55 pages) and accessible summary on CAs: http://citeseer.ist.psu.edu/delorme98introduction.html

Page 467: Non-Parametric Methods in Machine Learning

Summary and conclusions

■ The basic principles of SI design have been discussed

■ It looks deceptively easy and simple to design SI algorithms. On the other hand, the study of CAs has pointed out how hard it might be to deal with and understand these systems.

■ However, on Wednesday we will see that the basic principles of SI, properly integrated with solid heuristic knowledge, can allow a rather straightforward design of much more complex SI systems capable of solving incredibly complex tasks!

Page 468: Non-Parametric Methods in Machine Learning

Collective and Swarm Intelligence (2)

Gianni Di Caro, [email protected]

IDSIA - USI/SUPSI

Page 469: Non-Parametric Methods in Machine Learning

Road map

■ Generalities on Collective/Swarm Intelligence (SI)

■ Main characteristics of SI design

■ Bottom-up vs. Top-down design approaches

■ Cellular Automata: the simplest and earliest example of SI

■ Particle Swarm Optimization: state-of-the-art SI framework for continuous optimization

■ Short discussion on (these topics will be covered in depth in next year's course on Heuristics):

✦ Ant algorithms for adaptive task allocation and coordination

✦ Ant Colony Optimization: state-of-the-art SI framework for combinatorial optimization and network routing.

Page 470: Non-Parametric Methods in Machine Learning

The principles of SI design. . . (1)

■ A population of units/agents/modules with internal state and autonomous behavior

■ Embedded in some physical or abstract environment, or represented without any reference to an external environment

■ Each unit has non-linear interactions with other units

■ Interactions are localized according to the selected definition of neighborhood and of communication modality and range

■ Interaction protocols and agent information processing are relatively simple

Page 471: Non-Parametric Methods in Machine Learning

The principles of SI design. . . (2)

■ Overall system control is not centralized but fully distributed: it results from agent interactions and communications

■ The goal is to exploit repeated large-scale interactions rather than the “intelligence” of the single modules

■ System-level behaviors transcend the behavioral repertoire of the single agent and are “more” than the sum of their capabilities

■ In other words: we are investigating how to design artificial distributed complex systems that can be used for: modeling, pattern recognition, optimization, control, coordination, . . .

■ Hint: taking inspiration from nature can be effective. . .

Page 472: Non-Parametric Methods in Machine Learning

Let’s restart from CAs. . .

[Figure: a 1D CA with a radius-2 neighborhood; the new state of cell i at time t+1 is F(s_{i−2}, s_{i−1}, s_i, s_{i+1}, s_{i+2}), illustrated on a row of binary cells for t = 0, 1, 2.]

■ For instance, each cell switches to 1 if there are enough cells in state 1 in its neighborhood (a small sketch of this rule follows this slide):

F(s_i) = 1 if s_{i−2} + s_{i−1} + s_i + s_{i+1} + s_{i+2} > 2, and 0 otherwise

■ Fixed topology, broadcast range = 2 in the example
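A minimal Python sketch of this threshold rule, assuming a ring of cells and synchronous updates as in the slide; the initial state used below is an arbitrary illustration.

def threshold_step(cells):
    # A cell switches to 1 when more than 2 of the 5 cells in its radius-2
    # neighborhood (itself included) are in state 1; torus boundary conditions.
    m = len(cells)
    out = []
    for i in range(m):
        total = sum(cells[(i + d) % m] for d in (-2, -1, 0, 1, 2))
        out.append(1 if total > 2 else 0)
    return out

state = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]   # illustrative initial configuration
for t in range(5):
    print(t, state)
    state = threshold_step(state)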

Page 473: Non-Parametric Methods in Machine Learning

Local vs. global communication

■ Each individual has local non-linear interactions with its neighbors . . . but in practice every cell depends on the state of all the other cells → the CA has to be considered as a single complex system

■ Neighbors ≈ individuals that exert an influence, individuals that we trust, . . .

■ “Individuals” can be biological cells, people, molecules or aggregates of molecules, birds, . . . (play with the CA Java applet!)

Page 474: Non-Parametric Methods in Machine Learning

Particle Swarm Optimization (PSO)

Page 475: Non-Parametric Methods in Machine Learning

PSO: natural/social background (1)

■ Early work on the simulation of bird flocking aimed at understanding the underlying rules of bird flocking [Reynolds, 1987] and roosting behavior [Heppner & Grenander, 1990]

■ The notion of change in human social behavior/psychology is seen as analogous to change in spatial position in birds

■ Rules assumed to be simple and based on social behavior: sharing of information and reciprocal respect of the occupancy of physical space

■ Social sharing of information among conspecifics seems to offer an evolutionary advantage

Page 476: Non-Parametric Methods in Machine Learning

PSO: natural/social background (2)

■ Initial simulation work [Eberhart & Kennedy, 1995]

■ A population of N >> 1 agents is initialized on a toroidal 2D pixel grid with random positions and velocities, (x_i, v_i), i = 1, . . . , N

■ At each iteration loop, each agent determines its new speed vector according to that of its nearest neighbors

■ A random component is used in order to avoid fully unanimous, unchanging flocking

■ Roosting behavior: acts like a dynamic force that attracts the swarm to land on a specific location. The roost could be the equivalent of the optimum in a search space!

Page 477: Non-Parametric Methods in Machine Learning

PSO: natural/social background (3)

■ Birds explore the environment in search for food

■ Agents = solution hunters that socially share knowledge while they move across a solution space

■ An agent that has found a “good” point leads its neighbors there

■ . . . and eventually all the agents “flock” toward the best point in the solution space

Page 478: Non-Parametric Methods in Machine Learning

PSO: the particles and the task

■ Optimization of continuous functions f(x) : R^n → R

■ For convex functions gradient methods can be effectively used, but for non-convex ones . . .

■ An agent is an n-dimensional particle moving over the function's domain

■ A particle p has an internal state consisting of {x, v, x_pbest, N(p)} and makes use of a simple rule to update its velocity and position

Page 479: Non-Parametric Methods in Machine Learning

PSO: pseudo-code

procedure Particle_Swarm_Optimization_for_Minimization()
    foreach particle p ∈ ParticleSet do
        (x, v) ← random_init_of_position_and_velocity();
        N(p) ← random_selection_of_the_neighbor_set();
        x_pbest ← x;  f(x_gbest) ← ∞;
    end foreach
    while (¬ stopping criterion)
        foreach particle p ∈ ParticleSet do
            x_lbest ← get_best_so_far_position_from_neighbors(N(p));
            Δ_individual ← x_pbest − x;
            Δ_social ← x_lbest − x;
            (r1, r2) ← random_uniform();
            v ← ω v + w1 r1 ∘ Δ_individual + w2 r2 ∘ Δ_social;
            x ← x + v;
            if f(x) < f(x_pbest) then x_pbest ← x;
            if f(x) < f(x_gbest) then x_gbest ← x;
        end foreach
    end while
    return f(x_gbest);
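For illustration only, a compact Python version of the gbest variant of this loop; the inertia weight, acceleration coefficients, velocity clamp, and the sphere objective are illustrative assumptions, not values taken from the slides.

import random

def pso(f, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5,
        lo=-5.0, hi=5.0, vmax=1.0):
    # Random initialization of positions, velocities, and personal bests.
    X = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    V = [[random.uniform(-vmax, vmax) for _ in range(dim)] for _ in range(n_particles)]
    P = [x[:] for x in X]
    Pf = [f(x) for x in X]
    g = min(range(n_particles), key=lambda i: Pf[i])
    gbest, gbest_f = P[g][:], Pf[g]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # Velocity update: inertia + cognitive pull + social pull.
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (P[i][d] - X[i][d])
                           + c2 * r2 * (gbest[d] - X[i][d]))
                V[i][d] = max(-vmax, min(vmax, V[i][d]))   # clamp velocity
                X[i][d] += V[i][d]
            fx = f(X[i])
            if fx < Pf[i]:                # update personal best
                P[i], Pf[i] = X[i][:], fx
                if fx < gbest_f:          # update global best
                    gbest, gbest_f = X[i][:], fx
    return gbest, gbest_f

if __name__ == "__main__":
    sphere = lambda x: sum(v * v for v in x)
    best, best_f = pso(sphere, dim=5)
    print(best_f)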

Page 480: Non-Parametric Methods in Machine Learning

PSO vs. Gradient-based search

■ Each local minimum x_min of a continuous function can be associated with an attraction basin A(x_min). Starting from any point located in the attraction basin, gradient-based (local search) methods can provide some guarantees to reach x_min.

■ In the general case, search must be iterated over different basins to increase the probability of hitting the global minimum

■ PSO (and, in general, SI-based methods) is based on global exploration and search. A number of particles concurrently move and jump across the landscape and probe different regions.

■ At the beginning, search mainly relies on exploration.

■ Search can be locally intensified in a region by concentrating more particles there.

■ Depending on the design, all the particles can eventually converge on searching around the best region found, or a certain level of global exploration can be maintained in the swarm.

Page 481: Non-Parametric Methods in Machine Learning

PSO: different alternatives

■ Velocity update is the core formula, which we can rewrite as:

v ← ω v + φ1 U(0, 1) ∘ Δ_individual + φ2 U(0, 1) ∘ Δ_social

■ ω is an inertia coefficient, while φ1 and φ2 are the “cognitive” and “social” acceleration coefficients

■ The canonical PSO makes use of a constriction factor χ:

v ← χ (v + φ1 U(0, 1) ∘ Δ_individual + φ2 U(0, 1) ∘ Δ_social),

χ = 2k / |2 − φ − √(φ² − 4φ)|,  where φ = φ1 + φ2

■ The value of v is usually clamped to [−v_max, v_max]

■ Two main classes based on the use of lbest vs. gbest: multiple vs. one attractor, exploration vs. exploitation

■ A number of different tricks (some based on solid theory). . .
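A small sketch of the constriction-coefficient computation above; the usual choice φ1 = φ2 = 2.05 (so that φ = φ1 + φ2 > 4) and k = 1 is an assumption taken from common practice, not from the slides.

import math
import random

def constriction(phi1=2.05, phi2=2.05, k=1.0):
    # chi = 2k / |2 - phi - sqrt(phi^2 - 4 phi)|, valid for phi = phi1 + phi2 > 4.
    phi = phi1 + phi2
    return 2.0 * k / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))

def update_velocity(v, x, pbest, lbest, phi1=2.05, phi2=2.05):
    # Canonical (constriction) velocity update, one random factor per dimension.
    chi = constriction(phi1, phi2)
    return [chi * (v[d]
                   + phi1 * random.random() * (pbest[d] - x[d])
                   + phi2 * random.random() * (lbest[d] - x[d]))
            for d in range(len(v))]

print(constriction())   # approximately 0.7298, the commonly reported value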

Page 482: Non-Parametric Methods in Machine Learning

PSO: 1D example, one particle, gbest

t    x      xbest
0    20.00  10.00
1    18.21  18.21
2    16.43  16.43
3    14.64  16.43
4    13.24  16.43
5    12.03  16.43
6    11.06  11.06
7    10.09  11.06
8    9.71   11.06
9    8.85   11.06
10   9.14   11.06
11   10.13  11.06

[Figure: plot of the particle position X and best position B over time; the remaining plotted values and axis ticks from the figure are omitted.]

Page 483: Non-Parametric Methods in Machine Learning

PSO: dynamics of a single particle

■ Individual trajectories are “weak”. The figure shows the trajectory of a single particle in the swarm and the effect on it of the communication of a new best from its social network

■ Optimization is a function of interparticle interactions

Page 484: Non-Parametric Methods in Machine Learning

PSO: distribution of the sampled points

■ The points sampled by a swarm particle are distributed as a symmetric bell-shaped curve with the mean equal to the average of the previous bests and standard deviation equal to their difference

Page 485: Non-Parametric Methods in Machine Learning

PSO at work

■ Java applets to visualize and play with PSO:

http://gecco.org.chemie.uni-frankfurt.de/PsoVis/applet.html

http://www.projectcomputing.com/resources/psovis/

■ The main Internet repository for material related to PSO:

http://www.swarmintelligence.org/

Page 486: Non-Parametric Methods in Machine Learning

PSO: applications and performance

■ A number of optimization problems: benchmark test functions, real-world functions, neural network learning, shortest paths in networks, . . .

■ Comparisons with other heuristics such as Genetic Algorithms usually show that PSO has comparable or better effectiveness and better computational performance

■ Has some convergence properties

■ Still lacking a comprehensive comparison with more “standard” approaches

Page 487: Non-Parametric Methods in Machine Learning

PSO performance: popular test functions

[Figure: surface plots of the Sphere, Rastrigin, Rosenbrock, and Griewank test functions]
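For reference, commonly used forms of these four benchmark functions (conventions vary slightly across papers); all have a global minimum value of 0.

import math

def sphere(x):
    return sum(v * v for v in x)

def rastrigin(x):
    return 10 * len(x) + sum(v * v - 10 * math.cos(2 * math.pi * v) for v in x)

def rosenbrock(x):
    return sum(100 * (x[i + 1] - x[i] ** 2) ** 2 + (1 - x[i]) ** 2
               for i in range(len(x) - 1))

def griewank(x):
    s = sum(v * v for v in x) / 4000.0
    p = math.prod(math.cos(v / math.sqrt(i + 1)) for i, v in enumerate(x))  # Python 3.8+
    return s - p + 1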

Page 488: Non-Parametric Methods in Machine Learning

PSO performance: lbest vs. gbest

Page 489: Non-Parametric Methods in Machine Learning

PSO: performance variability for different variants

Page 490: Non-Parametric Methods in Machine Learning

Stigmergy and Ant-inspired algorithms

Note: The lecture given in the classroom only addressed the general aspects of the Ant Colony Optimization framework. The slides that follow contain some additional information which is only provided for the sake of completeness.

Page 491: Non-Parametric Methods in Machine Learning

Stigmergy and Ant-inspired algorithms

■ Stigmergy is at the core of most of the amazing collective behaviors exhibited by ant/termite colonies (nest building, division of labor, structure formation, cooperative transport)

■ Grassé (1959) introduced this term to explain nest building in termite societies

■ Goss, Aron, Deneubourg, and Pasteels (1989) showed how stigmergy allows ant colonies to find shortest paths between their nest and sources of food

■ These mechanisms have been reverse engineered to give rise to a multitude of ant colony inspired algorithms based on stigmergic communication and control

Page 492: Non-Parametric Methods in Machine Learning

Pheromone laying-attraction is the key

■ While walking, the ants lay on the ground a volatile chemical substance, called pheromone

■ Pheromone distribution modifies the environment (the way it is perceived), creating a sort of attractive potential field for the ants

■ This is useful for retracing the way back, for mass recruitment, for labor division and coordination, to find shortest paths. . .

Page 493: Non-Parametric Methods in Machine Learning

Stigmergy and stigmergic variables

■ Stigmergy means any form of indirect communication among a set of possibly concurrent and distributed agents which happens through acts of local modification of the environment and local sensing of the outcomes of these modifications

■ The local environment's variables whose values determine in turn the characteristics of the agents' response are called stigmergic variables

■ Stigmergic communication and the presence of stigmergic variables can give rise to self-organized global behaviors

■ Blackboard/post-it style of asynchronous communication

Page 494: Non-Parametric Methods in Machine Learning

Examples of stigmergic variables (1)

✢ Leading to diverging behavior at the group level:

■ The height of a pile of dirty dishes floating in the sink

■ Nests’ energy level in robot activation [Krieger and Billeter, 1998]

■ Level of customer demand in adaptive task allocation of pick-up postmen [Bonabeau et al., 1997]. A distributed set of postmen have to decide whether or not to serve the pick-up request from a customer. The “stimulus” intensity s depends on the distance from the customer and the amount of items already in charge:

T_θ(s) = s^n / (s^n + θ^n)

where θ is the response threshold of the postman, and T_θ(s) is the probability that he/she will pick up the item (a small sketch of this rule follows this list).

■ Diverging stigmergy is a general model to design adaptive task allocation strategies in distributed systems
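A tiny sketch of the response-threshold rule above; the exponent n = 2 and the θ^n term follow the usual Bonabeau et al. formulation and are assumptions here, not values given on the slide.

import random

def pickup_probability(stimulus, theta, n=2):
    # T_theta(s) = s^n / (s^n + theta^n): low thresholds respond to weak stimuli.
    return stimulus ** n / (stimulus ** n + theta ** n)

def decide_pickup(stimulus, theta):
    # Stochastic decision: pick up the item with probability T_theta(s).
    return random.random() < pickup_probability(stimulus, theta)

print(pickup_probability(stimulus=2.0, theta=1.0))   # 0.8: low-threshold postman
print(pickup_probability(stimulus=2.0, theta=4.0))   # 0.2: high-threshold postman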

Page 495: Non-Parametric Methods in Machine Learning

Examples of stigmergic variables (2)

✢ Leading to converging behavior at the group level:

■ Intensity of pheromone trails in ant foraging: pheromone is deposited at higher rates on shortest paths connecting nest and sources of food. The presence of higher intensity of pheromone attracts subsequent ants, eventually converging the majority of the ants in the colony on moving along the shortest path.

■ This pheromone-mediated shortest-path behavior, properly reverse-engineered, has provided the core inspiration for the design of the Ant Colony Optimization metaheuristic.

Page 496: Non-Parametric Methods in Machine Learning

Shortest path behavior in ant colonies

■ While walking, at each step a routing decision is issued. Directions locally marked by higher pheromone intensity are preferred according to some probabilistic rule:

[Figure: the stochastic decision rule π(τ, η) combines the pheromone τ, the terrain morphology η, and possibly other local information.]

■ This basic pheromone laying-following behavior is the main ingredient to allow the colony to converge on the shortest path between the nest and a source of food

Page 497: Non-Parametric Methods in Machine Learning

Ant colonies: Pheromone and shortest paths

[Figure: four snapshots (t = 0, 1, 2, 3) of pheromone intensity on the paths between Nest and Food, shown with a pheromone intensity scale.]

Page 498: Non-Parametric Methods in Machine Learning

Ant colonies in a more complex discrete world

[Figure: a discrete world with a Nest, a Food source, multiple decision nodes, and a pheromone intensity scale.]

✢ Multiple decision nodes

✢ A path is constructed through a sequence of decisions

✢ Decisions must be taken on the basis of local information only

✢ A traveling cost is associated to node transitions

✢ Pheromone intensity locally encodes decision goodness as collectively estimated by the repeated path sampling

Page 499: Non-Parametric Methods in Machine Learning

Ant colonies: Ingredients for shortest paths

■ A number of concurrent autonomous (simple?) agents (ants)

■ Forward-backward path following/sampling

■ Multiple paths are tried out and implicitly evaluated

■ Local/stigmergic laying and sensing of pheromone

■ Stochastic step-by-step decisions biased by pheromone and other local aspects (e.g., terrain, visibility)

■ Positive feedback effect (local reinforcement of good decisions)

■ Persistence (exploitation) and evaporation (exploration) of the pheromone field

■ Iteration over time of the path sampling actions

■ . . . always convergence onto the shortest path?

Page 500: Non-Parametric Methods in Machine Learning

What pheromone represents in abstract terms?

■ Pheromone trails act as a sort of distributed, dynamic, and collective memory of the colony, a repository of all the most recent foraging experiences of the ants in the colony

■ Pheromone trails encode the value of goodness of a local move as collectively learned from the generated paths (solutions)

■ By locally laying pheromone, the ants modify the environment that they have visited, and in turn take a decision biased by the presence/strength of these modifications (stigmergy)

■ Circular relationship: pheromone trails modify the environment → locally bias the ants' decisions → modify the environment

[Figure: circular relationship between paths and pheromone: the outcomes of path construction (π) are used to modify the pheromone distribution (τ), and the pheromone distribution in turn biases path construction.]

Page 501: Non-Parametric Methods in Machine Learning

A meta-strategy for shortest path problems

■ By reverse engineering ant colonies' shortest path behavior we get an effective metaheuristic, ACO, to solve shortest path problems

■ . . . in a possibly fully distributed and adaptive way

■ . . . and shortest path models are very general models for combinatorial optimization and decision problems!

■ Note that in PSO and in CA it is the state of the agents that evolves over time; in ACO and, more in general, in stigmergy-based systems, it is the local state of the environment that evolves

Page 502: Non-Parametric Methods in Machine Learning

ACO: general architecture

procedure ACO_metaheuristic()
    while (¬ stopping criterion)
        schedule_activities
            ant_agents_construct_solutions_using_pheromone();
            pheromone_updating();
            daemon_actions();  (optional)
        end schedule_activities
    end while
    return best solution generated;

Page 503: Non-Parametric Methods in Machine Learning

ACO: A solution construction approach

■ A solution to the combinatorial problem at hand is constructed by starting from an empty solution and adding a new solution component at each construction step

■ A set of decision nodes is defined: at each node a decision is issued concerning the new component to add to the solution being constructed

■ A decision node encodes the information which is used about the partial solution (the past decisions/experience) to take a feasible and optimized decision concerning the new component to add

■ Example: Car traffic. Each crossroad is a decision node, a “solution” is a path from origin to destination, a path is a sequence of decisions. The case of packet routing in networks is analogous.

Page 504: Non-Parametric Methods in Machine Learning

ACO: Pheromone and heuristic variables

[Figure: a graph from Source to Destination with decision nodes 1-9; each node i stores pheromone values τ_ij and heuristic values η_ij for its outgoing arcs, shown with a pheromone intensity scale.]

■ Each decision node i holds an array of pheromone variables: τ_i = [τ_ij] ∈ R, ∀ j ∈ N(i) → learned through path sampling

■ τ_ij = q(j|i): learning estimate of the quality/goodness/utility of moving to next node j conditionally on being in i

■ Each decision node i also holds an array of heuristic variables: η_i = [η_ij] ∈ R, ∀ j ∈ N(i) → resulting from other sources/measures

■ η_ij is also an estimate of q(j|i), but it comes from a process or a priori knowledge not related to the ant actions (e.g., node-to-node distance)

Page 505: Non-Parametric Methods in Machine Learning

ACO: Path sampling by ant agents

[Figure: an ant constructs a path from Source to Destination across the decision nodes 1-9.]

✢ Each ant is an autonomous agent that proposes a solution to the problem by constructing a path P_{s→d}

✢ There might be one or more ants concurrently active at the same time. Ants do not need synchronization

✢ A stochastic decision policy selects node transitions (an illustrative form of this rule is sketched below):

π_ε(i, j; τ_i, η_i)
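As an illustration of π_ε, a sketch of the classic random-proportional rule p(j|i) ∝ τ_ij^α · η_ij^β over the feasible neighbors; the exponents α and β are assumed design parameters, and the dictionaries keyed by (i, j) arcs are an assumed data layout, not taken from the slides.

import random

def choose_next(i, feasible, tau, eta, alpha=1.0, beta=2.0):
    # tau, eta: dicts keyed by (i, j) arcs; select j with probability
    # proportional to tau[(i, j)]^alpha * eta[(i, j)]^beta.
    weights = [(j, (tau[(i, j)] ** alpha) * (eta[(i, j)] ** beta)) for j in feasible]
    total = sum(w for _, w in weights)
    r = random.uniform(0.0, total)
    acc = 0.0
    for j, w in weights:
        acc += w
        if r <= acc:
            return j
    return weights[-1][0]   # guard against floating-point round-off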

Page 506: Non-Parametric Methods in Machine Learning

Pheromone updating

■ Ants update pheromone online step-by-step → implicit path evaluation based on traveling time and rate of updates

■ The ants' way is inefficient and risky.

■ A better way is to update step-by-step but offline:

✦ Complete the path

✦ Evaluate and select whether to update pheromone or not

✦ “Retrace” the path and assign credit, i.e., reinforce the goodness value of the issued decisions (pheromone variables)

✦ Total path cost can be used as the reinforcement signal

■ An evaporation mechanism can be applied to decrease pheromone intensity and favor exploration: τ_ij ← ρ τ_ij, ρ ∈ [0, 1].
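A sketch of this offline update in Python: evaporate all pheromone variables, then retrace an evaluated path and reinforce its arcs. Depositing an amount proportional to 1/path_cost is a common choice and an assumption here, as is the dictionary keyed by (i, j) arcs.

def update_pheromone(tau, path, path_cost, rho=0.9):
    # Evaporation: tau_ij <- rho * tau_ij on every arc (favors exploration).
    for arc in tau:
        tau[arc] *= rho
    # Retrace the path and assign credit to every decision on it.
    deposit = 1.0 / path_cost
    for i, j in zip(path[:-1], path[1:]):
        tau[(i, j)] += deposit
    return tau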

Page 507: Non-Parametric Methods in Machine Learning

Designing an ACO algorithm

■ Representation of the problem → pheromone model τ

■ Heuristic variables ~η

■ Ant-routing table A

■ Stochastic decision policy πǫ

■ Solution evaluation J(s)

■ Policies for pheromone updating

■ Scheduling of the ants

■ Daemon components

■ Pheromone initialization, constants, . . .

Page 508: Non-Parametric Methods in Machine Learning

Applications and performance

■ Traveling salesman: state-of-the-art / good performance
■ Quadratic assignment: good / state-of-the-art
■ Scheduling: state-of-the-art / good performance
■ Vehicle routing: state-of-the-art / good performance
■ Sequential ordering: state-of-the-art performance
■ Shortest common supersequence: good results
■ Graph coloring and frequency assignment: good results
■ Bin packing: state-of-the-art performance
■ Constraint satisfaction: good performance
■ Multi-knapsack: poor performance
■ Timetabling: good performance
■ Optical network routing: promising performance
■ Set covering and partitioning: good performance
■ Parallel implementations and models: good parallelization efficiency

■ Routing in telecommunications networks: state-of-the-art performance

Page 509: Non-Parametric Methods in Machine Learning

Evolutionary Computation, Part I

Faustino Gomez, IDSIA

Intelligent Systems, Fall 2007

Page 510: Non-Parametric Methods in Machine Learning

Ontogenetic vs. Phylogenetic Learning

• Single-agent learning
  – Agent modifies its structure (parameters) to adapt while it interacts with the environment

• Multi-agent learning
  – Multiple agents modify their structure to adapt while interacting with the environment and each other
  – Potentially solve the task together

• Evolutionary Computation
  – Population of candidate solutions (e.g. agents) learn collectively by natural selection, but not individually (usually)

Ontogenetic: single-agent and multi-agent learning. Phylogenetic: evolutionary computation.

Page 511: Non-Parametric Methods in Machine Learning

Evolutionary Computation (EC)

Basic idea:

• Instead of a single solution, use a population of candidate solutions

• Search the space of solutions in parallel

• Evaluate candidates and assign a fitness score

• Generate a new population from the most “fit” candidates that is hopefully better than the previous population

Page 512: Non-Parametric Methods in Machine Learning

The Evolutionary Cycle

[Figure: the evolutionary cycle: Population → Selection → Parents → Recombination → Mutation → Offspring → Evaluation → Replacement → back into the Population]

Page 513: Non-Parametric Methods in Machine Learning

Branches of EC

• Genetic Algorithms (Holland 1975)
• Evolution Strategies (Schwefel 1977)
• Evolutionary Programming (Fogel 1966)
• Genetic Programming (Koza 1989)

Page 514: Non-Parametric Methods in Machine Learning

Genetic Algorithms (GAs)

• Inspired by the Darwinian principle of natural selection

• Different from other EC methods primarily due to the emphasis on sexual reproduction (crossover)

• GAs search the problem space by trying to correctly combine genetic building blocks from different individuals in the population

Page 515: Non-Parametric Methods in Machine Learning

GA: Basic Procedure

1. Initialize a random population of candidate solutions
2. Evaluate solutions on the problem and assign a fitness score
3. Select some solutions for mating
4. Recombine: create new solutions from the selected ones by exchanging structure
5. IF good solution not found: Goto 2

The cycle from 2 to 5 is known as a generation
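A compact Python sketch of steps 1-5; the one-max fitness, the truncation-style mating selection, and the rates are illustrative assumptions (the selection schemes discussed in the next slides can be plugged in instead).

import random

def genetic_algorithm(fitness, n_genes, pop_size=50, generations=100,
                      crossover_rate=0.9, mutation_rate=0.01):
    # 1. Initialize a random binary population.
    pop = [[random.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):                    # one loop = one generation
        # 2-3. Evaluate and select mates (truncation selection, for brevity).
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]
        # 4. Recombine (1-point crossover) and mutate (bit-flip).
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            if random.random() < crossover_rate:
                cut = random.randint(1, n_genes - 1)
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            child = [1 - g if random.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness)

best = genetic_algorithm(fitness=sum, n_genes=20)   # one-max: maximize the number of 1s
print(best, sum(best))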

Page 516: Non-Parametric Methods in Machine Learning

Fitness Landscape

Search in parallel to maximize fitness

Page 517: Non-Parametric Methods in Machine Learning

GA Terminology

• Solutions are encoded in strings called chromosomes

• Each chromosome consists of some number of genes

• Each gene can take a value, or allele, from some specified alphabet, e.g.
  – Binary {0,1}
  – Real numbers (infinite alphabet)

Page 518: Non-Parametric Methods in Machine Learning

Genotype Encoding

• Binary encoding

• Real values

[Figure: a chromosome drawn as a string of genes, with binary and real-valued examples]

Page 519: Non-Parametric Methods in Machine Learning

Mapping Genotypes to Phenotypes

• Genotypes can represent any kind of structure or phenotype in the problem space (environment)

• Before evaluating a genotype it must be mapped into its phenotype

• Once the phenotype is created, it can be evaluated in the environment

Page 520: Non-Parametric Methods in Machine Learning

GA: Evaluation

Map the genotype to its phenotype and evaluate the phenotype in the environment to assign a fitness

Page 521: Non-Parametric Methods in Machine Learning

Selection: fitness proportional

1. Calculate a genotype's probability of being selected in proportion to its fitness:

p_i = f_i / Σ_j f_j

2. Then select some number of genotypes for mating according to the probabilities

Genotypes that are more fit are more likely to be selected
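A small roulette-wheel sketch of this fitness-proportional rule (fitness values are assumed to be positive).

import random

def fitness_proportional_select(population, fitnesses, n_mates):
    # p_i = f_i / sum_j f_j, then sample the mating pool from these probabilities.
    total = sum(fitnesses)
    probs = [f / total for f in fitnesses]
    mates = []
    for _ in range(n_mates):
        r, acc = random.random(), 0.0
        for individual, p in zip(population, probs):
            acc += p
            if r <= acc:
                mates.append(individual)
                break
        else:
            mates.append(population[-1])   # guard against floating-point round-off
    return mates

pool = fitness_proportional_select(["A", "B", "C"], [1.0, 3.0, 6.0], n_mates=4)
print(pool)   # "C" is selected most often, "A" least often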

Page 522: Non-Parametric Methods in Machine Learning

Selection: linear ranking

1. Sort the genotypes by fitness
2. Compute the probability of being selected by

p_i = 2 − SP + 2 (SP − 1) (rank(i) − 1) / (N − 1)

3. Then select some number of genotypes for mating according to the probabilities

where SP is the selective pressure in [1.0, 2.0], and rank denotes the genotype's position in the sorted population; rank(1) is most fit, rank(N) is least fit

Page 523: Non-Parametric Methods in Machine Learning

Selection: Tournament

1. Let T be the tournament size (between 2 and the size of the population, N)
2. Select T genotypes at random from the population and take the most fit as the tournament “winner”
3. Put the winner in the mating pool
4. Goto 1 until we have enough genotypes in the mating pool

Larger values of T increase selective pressure
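A short sketch of tournament selection as described above; T = 3 and the one-max fitness are illustrative choices.

import random

def tournament_select(population, fitness, pool_size, T=3):
    # Repeatedly sample T genotypes at random and keep the fittest one.
    pool = []
    while len(pool) < pool_size:
        contestants = random.sample(population, T)
        pool.append(max(contestants, key=fitness))
    return pool

pop = [[random.randint(0, 1) for _ in range(10)] for _ in range(20)]
mating_pool = tournament_select(pop, fitness=sum, pool_size=10, T=3)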

Page 524: Non-Parametric Methods in Machine Learning

Selective Pressure and Diversity

• Similar to the exploitation/exploration tradeoff

• We want selective pressure to be high enough to direct search towards a good solution,

• but not so high that we converge too soon

• Diversity should be high so that we are less likely to “miss” a good solution,

• but not so high that we don't converge (convergence allows search to concentrate near a good solution)

Page 525: Non-Parametric Methods in Machine Learning

Reproduction

• Generate new individuals (search points) by mixing or altering the genotypes of the selected members of the population

• Crossover: select alleles from two parent chromosomes to form two children (recombination-type genetic operator)

• Mutation: randomly perturb some of the alleles of a parent

Page 526: Non-Parametric Methods in Machine Learning

Genetic Operators: 1-point crossover

Select a random crossover point and exchange substrings

Page 527: Non-Parametric Methods in Machine Learning

Genetic Operators: 2-point crossover

Select 2 random crossover points and exchange substrings

Page 528: Non-Parametric Methods in Machine Learning

Mutation: binary

[Figure: parent and child bit strings]

Randomly flip a bit with probability equal to the mutation rate

Page 529: Non-Parametric Methods in Machine Learning

Mutation: real coding

[Figure: parent and child real-valued chromosomes]

• Replace an allele with a new real number with probability equal to the mutation rate
• or, add noise to the allele, e.g. from a Gaussian distribution
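Possible implementations of the crossover and mutation operators from the last slides; the rates and the Gaussian standard deviation are illustrative assumptions.

import random

def one_point_crossover(a, b):
    # Exchange the tails of the two parents after a random cut point.
    cut = random.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def two_point_crossover(a, b):
    # Exchange the middle segments between two random cut points.
    i, j = sorted(random.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def bitflip_mutation(chromosome, rate=0.01):
    # Binary coding: flip each bit with probability equal to the mutation rate.
    return [1 - g if random.random() < rate else g for g in chromosome]

def gaussian_mutation(chromosome, rate=0.1, sigma=0.1):
    # Real coding: add Gaussian noise to each allele with the given probability.
    return [g + random.gauss(0.0, sigma) if random.random() < rate else g
            for g in chromosome]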

Page 530: Non-Parametric Methods in Machine Learning

GA: Behavior

Initial population: candidate solutions are distributed throughout the search space

[Figure: population plotted over the genotype/phenotype landscape]

Page 531: Non-Parametric Methods in Machine Learning


After some number of generations…

Page 532: Non-Parametric Methods in Machine Learning


After some more generations…

Page 533: Non-Parametric Methods in Machine Learning


Some more generations…

Page 534: Non-Parametric Methods in Machine Learning

Convergence


All individuals look almost the same after many generations of recombination (crossover)

Page 535: Non-Parametric Methods in Machine Learning

Premature Convergence

Population converges too quickly and misses global peak


Page 536: Non-Parametric Methods in Machine Learning

How do we avoid premature convergence?

• Use a larger population
  – More genetic material
  – Takes longer to converge, but requires more evaluations

• Reduce selective pressure
  – Makes the search less greedy
  – Could take much longer

• Increase mutation
  – Adds diversity but could disrupt building blocks

• Niching
  – Force genotypes to stay spread out into niches by penalizing fitness when bunched together

Page 537: Non-Parametric Methods in Machine Learning

Advantages of GAs

• Less sensitive to local minima than non-population based methods

• Little domain knowledge required

• Can cope with high dimensionality

• Can be implemented in parallel because individual evaluations are independent

Page 538: Non-Parametric Methods in Machine Learning

Example: Function optimisation

- very frequent problem in many domains

- more examples in the tutorial

Page 539: Non-Parametric Methods in Machine Learning

Solution

Page 540: Non-Parametric Methods in Machine Learning

Evolution Strategies (ES)

• Genes contain real values

• Mutation by adding values drawn from a Gaussian distribution

• Strategy parameters evolve

• Originally population size = 1

• Possible crossover: average of the genes of both parents

Page 541: Non-Parametric Methods in Machine Learning

ES vs. GA

                     GA           ES
Representation       binary       real
Mutation             Bit-flip     Gaussian
Parent selection     various      uniform random
What is evolved      parameters   parameters and mutation step sizes

Note: these are historical differences, not all true today!

Page 542: Non-Parametric Methods in Machine Learning

ES: basic procedure

1. Initialize parents and evaluate them
2. Create some offspring by perturbing parents with Gaussian noise according to the parent's mutation parameters
3. Evaluate offspring
4. Select new parents from the offspring and possibly the old parents
5. IF good solution not found Goto 2

Page 543: Non-Parametric Methods in Machine Learning

ES: Genotype Encoding

• Chromosomes contain not only problem parameters (as in GAs) but also strategy parameters

• Strategy parameters determine the amount of mutation (standard deviation of the Gaussian) that is applied to the corresponding problem parameter

(x_1, . . . , x_n, σ_1, . . . , σ_n)   where x_1, . . . , x_n are the problem parameters and σ_1, . . . , σ_n the strategy parameters

Page 544: Non-Parametric Methods in Machine Learning

(μ, λ) / (μ + λ) Notation

• μ is the number of parents
• λ is the number of offspring
• “,”: select new parents for the next generation from the offspring only
• “+”: select them from the old parents and the offspring

(μ, λ) / (μ + λ) Notation examples

• (1+1)-ES: 1 parent and 1 offspring, the new parent is selected from the old parent and the offspring

• (1, λ)-ES: 1 parent and λ offspring, the new parent is selected only from the offspring (the best offspring replaces the parent)

Page 546: Non-Parametric Methods in Machine Learning

(μ, λ) / (μ + λ) Algorithm

1. Generate λ offspring by selecting at random λ chromosomes from the μ parents and mutating them according to their strategy parameters

2. If “+”, select the μ new parents for the next generation by taking the most fit chromosomes from both the old parents and the offspring
   OR, if “,”, replace the old parents with the μ most fit chromosomes from the offspring only

3. IF solution not found Goto 1
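A sketch of this (μ +, λ) scheme with self-adapted per-parameter step sizes; the sphere objective, the initialization range, and the log-normal self-adaptation constant are illustrative assumptions, not values from the slides.

import math
import random

def evolution_strategy(f, dim, mu=5, lam=20, generations=200, plus=True):
    tau = 1.0 / math.sqrt(2.0 * dim)   # self-adaptation learning rate (assumption)
    # Each individual is (problem parameters, strategy parameters).
    parents = [([random.uniform(-5.0, 5.0) for _ in range(dim)], [1.0] * dim)
               for _ in range(mu)]
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            x, sigmas = random.choice(parents)               # step 1: pick a random parent
            new_sigmas = [s * math.exp(random.gauss(0.0, tau)) for s in sigmas]
            new_x = [xi + random.gauss(0.0, s) for xi, s in zip(x, new_sigmas)]
            offspring.append((new_x, new_sigmas))
        pool = offspring + parents if plus else offspring    # step 2: "+" vs "," selection
        pool.sort(key=lambda ind: f(ind[0]))                 # minimization
        parents = pool[:mu]
    return parents[0]

best_x, best_sigmas = evolution_strategy(lambda x: sum(v * v for v in x), dim=5)
print(sum(v * v for v in best_x))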

Page 547: Non-Parametric Methods in Machine Learning

Historical example: the jet nozzle experiment

Initial shape

Final shape

Task: to optimize the shape of a jet nozzle
Approach: random mutations to the shape + selection

H.-P. Schwefel

Used (1+1)-ES