Linear discriminant functions
Andrea Passerini, [email protected]
Machine Learning
Discriminative learning
Discriminative vs generative
- Generative learning assumes knowledge of the distribution governing the data
- Discriminative learning focuses on directly modeling the discriminant function
- E.g. for classification, directly modeling decision boundaries (rather than inferring them from the modelled data distributions)
Discriminative learning
PROS
- When data are complex, modeling their distribution can be very difficult
- If data discrimination is the goal, modeling the data distribution is not needed
- Focuses parameters (and thus the use of available training examples) on the desired goal

CONS
- The learned model is less flexible in its usage
- It does not allow performing arbitrary inference tasks
- E.g. it is not possible to efficiently generate new data from a certain class
Linear discriminant functions
Description
f(x) = w^T x + w_0
- The discriminant function is a linear combination of example features
- w_0 is called bias or threshold
- It is the simplest possible discriminant function
- Depending on the complexity of the task and amount of data, it can be the best option available (at least, it is the first to try)
Linear binary classifier
Description
f(x) = sign(w^T x + w_0)
- It is obtained by taking the sign of the linear function
- The decision boundary (f(x) = 0) is a hyperplane (H)
- The weight vector w is orthogonal to the decision hyperplane:

  ∀ x, x' : f(x) = f(x') = 0
  w^T x + w_0 − w^T x' − w_0 = 0
  w^T (x − x') = 0
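As an illustration (not from the slides), a minimal numpy sketch of a linear binary classifier; the values of w and w_0 are hypothetical, and learning them is covered later in the lecture:

```python
import numpy as np

# Minimal sketch of a linear binary classifier (illustrative values only).
w = np.array([2.0, -1.0])    # weight vector, orthogonal to the hyperplane
w0 = 0.5                     # bias / threshold

def f(x):
    return w @ x + w0        # linear discriminant f(x) = w^T x + w0

def predict(x):
    return np.sign(f(x))     # linear binary classifier

# Two points on the decision boundary (f(x) = 0):
x1 = np.array([0.0, 0.5])
x2 = np.array([1.0, 2.5])
print(w @ (x1 - x2))                   # 0.0: w is orthogonal to x1 - x2
print(predict(np.array([1.0, 0.0])))   # 1.0
```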
Linear binary classifier
Functional margin
- The value f(x) of the function for a certain point x is called the functional margin
- It can be seen as a confidence in the prediction
Geometric margin
- The distance from x to the hyperplane is called the geometric margin:

  r^x = f(x) / ||w||

- It is a normalized version of the functional margin
- The distance from the origin to the hyperplane is:

  r^0 = f(0) / ||w|| = w_0 / ||w||
Linear binary classifier
Geometric margin (cont)
- A point can be expressed as its projection on H plus its distance to H times the unit vector in that direction:

  x = x_p + r^x (w / ||w||)
Linear binary classifier
Geometric margin (cont)
Then, as f(x_p) = 0:

  f(x) = w^T x + w_0
       = w^T (x_p + r^x w/||w||) + w_0
       = w^T x_p + w_0 + r^x w^T w / ||w||
       = f(x_p) + r^x ||w||
       = r^x ||w||

  ⇒ r^x = f(x) / ||w||
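A quick numeric check of these margin identities (a sketch; the specific w, w_0 and x are made up):

```python
import numpy as np

# Numeric check of the functional/geometric margin identities above.
w = np.array([3.0, 4.0])
w0 = 1.0
x = np.array([2.0, 1.0])

f_x = w @ x + w0                       # functional margin f(x) = 11
r_x = f_x / np.linalg.norm(w)          # geometric margin r^x = 11 / 5
x_p = x - r_x * w / np.linalg.norm(w)  # projection of x onto the hyperplane

print(w @ x_p + w0)   # ~0: x_p lies on the hyperplane, f(x_p) = 0
print(r_x)            # 2.2: signed distance from x to the hyperplane
```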
Biological motivation
Human Brain
- Composed of a densely interconnected network of neurons
- A neuron is made of:
  - soma: a central body containing the nucleus
  - dendrites: a set of filaments departing from the body
  - axon: a longer filament (up to 100 times the body diameter)
  - synapses: connections between dendrites and axons from other neurons
Biological motivation
Human Brain
- Electrochemical reactions allow signals to propagate along neurons via axons, synapses and dendrites
- Synapses can either excite or inhibit a neuron's potential
- Once a neuron's potential exceeds a certain threshold, a signal is generated and transmitted along the axon
Perceptron
Single neuron architecture
f(x) = sign(w^T x + w_0)

- Linear combination of input features
- Threshold activation function
Perceptron
Representational power
- Linearly separable sets of examples
- E.g. primitive boolean functions (AND, OR, NAND, NOT)
- ⇒ any logic formula can be represented by a network of two levels of perceptrons (in disjunctive or conjunctive normal form), as sketched below

Problem
- Non-linearly separable sets of examples cannot be separated
- Representing complex logic formulas can require a number of perceptrons exponential in the size of the input
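A minimal sketch of perceptrons computing primitive boolean functions; the weights are hand-picked, one valid choice among many (not learned):

```python
import numpy as np

# Hand-picked perceptrons computing primitive boolean functions.
def perceptron(w, w0):
    return lambda x: int(w @ x + w0 > 0)

AND  = perceptron(np.array([1.0, 1.0]), -1.5)
OR   = perceptron(np.array([1.0, 1.0]), -0.5)
NAND = perceptron(np.array([-1.0, -1.0]), 1.5)

# XOR is not linearly separable, but a two-level network represents it:
XOR = lambda x: AND(np.array([OR(x), NAND(x)], dtype=float))

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array(x, dtype=float)
    print(x, AND(x), OR(x), NAND(x), XOR(x))
```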
Parameter learning
Error minimization
- We need a function of the parameters to be optimized (as in maximum likelihood estimation for probability distributions)
- A reasonable choice is a measure of the error on the training set D (i.e. the loss ℓ):

  E(w; D) = ∑_{(x,y)∈D} ℓ(y, f(x))

- This raises the problem of overfitting the training data (less severe for linear classifiers; we will discuss it later)
Parameter learning
Gradient descent
1. Initialize w (e.g. w = 0)
2. Iterate until the gradient is approximately zero:

   w ← w − η ∇E(w; D)

Note
- η is called the learning rate and controls the amount of movement at each gradient step
- The algorithm is guaranteed to converge to a local optimum of E(w; D) (for small enough η)
- Too low an η implies slow convergence
- Techniques exist to adaptively modify η
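A minimal batch gradient descent sketch of the loop above; the function names, tolerance, and iteration cap are assumptions, not from the slides:

```python
import numpy as np

# Generic batch gradient descent: follow -grad E until the gradient ~ zero.
def gradient_descent(grad_E, w_init, eta=0.1, tol=1e-6, max_iter=10000):
    w = w_init.copy()
    for _ in range(max_iter):
        g = grad_E(w)
        if np.linalg.norm(g) < tol:   # stop when gradient is approx. zero
            break
        w = w - eta * g               # w <- w - eta * grad E(w; D)
    return w

# Toy example: E(w) = ||w - 3||^2 has gradient 2(w - 3) and minimum at w = 3.
print(gradient_descent(lambda w: 2 * (w - 3.0), np.zeros(2)))   # ~[3. 3.]
```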
Parameter learning
Problem
- The misclassification loss is piecewise constant, hence a poor candidate for gradient descent

Perceptron training rule

  E(w; D) = ∑_{(x,y)∈D_E} −y f(x)

- D_E is the set of current training errors, i.e. the examples for which:

  y f(x) ≤ 0

- The error is the sum of the functional margins of incorrectly classified examples (each term −y f(x) is non-negative for such examples)
Parameter learning
Perceptron training rule
- We focus on the unbiased perceptron (i.e. w_0 = 0) for simplicity

  ∇E(w; D) = ∇ ∑_{(x,y)∈D_E} −y f(x)
           = ∇ ∑_{(x,y)∈D_E} −y (w^T x)
           = ∑_{(x,y)∈D_E} −y x

- The amount of update is:

  −η ∇E(w; D) = η ∑_{(x,y)∈D_E} y x
Perceptron learning
Stochastic perceptron training rule
1. Initialize weights randomly
2. Iterate until all examples are correctly classified:
   - For each misclassified training example (x, y), update each weight:

     w_i ← w_i + η y x_i

Note on stochasticity
- We make a gradient step for each training error (rather than on their sum as in batch learning)
- Each gradient step is very fast
- Stochasticity can sometimes help to avoid local minima, being guided by a different gradient for each training example (which won't have the same local minima in general)
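A minimal implementation sketch of this rule (unbiased, i.e. w_0 = 0); the data are made up and assumed linearly separable, and the max_epochs safeguard is an addition, not part of the slides:

```python
import numpy as np

# Stochastic perceptron training rule (unbiased perceptron).
def perceptron_train(X, y, eta=1.0, max_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])   # 1. initialize weights randomly
    for _ in range(max_epochs):       # 2. iterate until no errors remain
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:    # training error: y f(x) <= 0
                w += eta * yi * xi    # w_i <- w_i + eta * y * x_i
                errors += 1
        if errors == 0:
            return w
    return w

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(perceptron_train(X, y))         # a separating weight vector
```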
Perceptron learning

[sequence of figure slides]
Perceptron regression
Exact solution
- Let X ∈ ℝ^{n×d} be the input training matrix (i.e. X = [x_1 · · · x_n]^T for n = |D| and d = |x|)
- Let y ∈ ℝ^n be the output training vector (i.e. y_i is the output for example x_i)
- Regression learning could be stated as a set of linear equations:

  X w = y

- Giving as solution:

  w = X^{−1} y
Perceptron regression
Problem
- Matrix X is rectangular, usually with more rows than columns
- The system of equations is overdetermined (more equations than unknowns)
- No exact solution typically exists
Perceptron regression
Mean squared error (MSE)
- Resort to loss minimization
- The standard loss for regression is the mean squared error:

  E(w; D) = ∑_{(x,y)∈D} (y − f(x))^2 = (y − Xw)^T (y − Xw)

- A closed form solution exists
- It can always also be solved by gradient descent (which can be faster)
- It can also be used as a classification loss
Perceptron regression
Closed form solution

  ∇E(w; D) = ∇ ∑_{(x,y)∈D} (y − w^T x)^2
           = ∑_{(x,y)∈D} 2 (y − w^T x)(−x)
           = −2 X^T (y − Xw) = 0
  ⇒ X^T y = X^T X w
  ⇒ w = (X^T X)^{−1} X^T y
Perceptron regression
w = (X^T X)^{−1} X^T y

Note
- (X^T X)^{−1} X^T is called the pseudoinverse of X
- If X is square and nonsingular, inverse and pseudoinverse coincide and the MSE solution corresponds to the exact one
- The pseudoinverse always exists → an approximate solution can always be found (see the sketch below)
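A sketch of the MSE solution on synthetic data (the data are made up for illustration); numpy's np.linalg.pinv computes the Moore-Penrose pseudoinverse:

```python
import numpy as np

# MSE solution on synthetic regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # n = 20 examples, d = 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=20)  # noisy linear targets

w_mse = np.linalg.inv(X.T @ X) @ X.T @ y     # w = (X^T X)^{-1} X^T y
w_pinv = np.linalg.pinv(X) @ y               # Moore-Penrose pseudoinverse of X
print(w_mse)                                  # ~ w_true
print(np.allclose(w_mse, w_pinv))             # True: same solution here
```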
Perceptron regression
Gradient descent

  ∂E/∂w_i = ∂/∂w_i (1/2) ∑_{(x,y)∈D} (y − f(x))^2
          = (1/2) ∑_{(x,y)∈D} ∂/∂w_i (y − f(x))^2
          = (1/2) ∑_{(x,y)∈D} 2 (y − f(x)) ∂/∂w_i (y − w^T x)
          = ∑_{(x,y)∈D} (y − f(x))(−x_i)

(the 1/2 factor is a convention that cancels the 2 coming from differentiation)
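A minimal batch gradient descent sketch for this loss, written in vector form; the data, step size, and epoch count are hypothetical:

```python
import numpy as np

# Batch gradient descent on the (halved) squared error, matching the
# derivation above: dE/dw = -sum_(x,y) (y - w^T x) x = -X^T (y - Xw).
def mse_gradient_descent(X, y, eta=0.01, epochs=1000):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        residuals = y - X @ w        # (y - f(x)) for every example
        w += eta * X.T @ residuals   # gradient step: w <- w - eta * dE/dw
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, -1.0, 0.0, 1.0])
print(mse_gradient_descent(X, y))    # ~ least-squares solution [1, -1]
```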
Multiclass classification
One-vs-all
- Learn one binary classifier for each class:
  - positive examples are the examples of the class
  - negative examples are the examples of all other classes
- Predict a new example in the class with maximum functional margin (a prediction sketch follows)
- Decision boundaries for which f_i(x) = f_j(x) are pieces of hyperplanes:

  w_i^T x = w_j^T x
  (w_i − w_j)^T x = 0
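A minimal one-vs-all prediction sketch; the weight matrix W (one row per class) and biases are hypothetical, standing in for the trained per-class classifiers:

```python
import numpy as np

# One-vs-all: predict the class with maximum functional margin.
def ova_predict(W, w0, x):
    margins = W @ x + w0          # f_i(x) = w_i^T x + w0_i for every class i
    return int(np.argmax(margins))

W = np.array([[1.0, 0.0],         # class 0
              [0.0, 1.0],         # class 1
              [-1.0, -1.0]])      # class 2
w0 = np.zeros(3)
print(ova_predict(W, w0, np.array([2.0, 0.5])))   # 0
```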
Multiclass classification
All-pairs
- Learn one binary classifier for each pair of classes:
  - positive examples from one class
  - negative examples from the other
- Predict a new example in the class winning the largest number of pairwise classifications (a voting sketch follows)
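A minimal voting sketch for all-pairs prediction; the pairwise (w, w_0) classifiers are hand-picked for illustration, not learned:

```python
import numpy as np

# All-pairs: one binary classifier per class pair (i, j); a positive sign
# votes for class i, a negative sign for class j; most votes wins.
classifiers = {
    (0, 1): (np.array([1.0, -1.0]), 0.0),
    (0, 2): (np.array([1.0, 1.0]), 0.0),
    (1, 2): (np.array([0.0, 1.0]), 0.0),
}

def all_pairs_predict(classifiers, n_classes, x):
    votes = np.zeros(n_classes, dtype=int)
    for (i, j), (w, w0) in classifiers.items():
        winner = i if w @ x + w0 > 0 else j   # pairwise decision
        votes[winner] += 1
    return int(np.argmax(votes))

print(all_pairs_predict(classifiers, 3, np.array([2.0, 1.0])))   # 0
```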
Generative linear classifiers
Gaussian distributions
- Linear decision boundaries are obtained when the covariance is shared among classes (Σ_i = Σ)
Naive Bayes classifier
  f_i(x) = P(x|y_i) P(y_i)
         = ( ∏_{j=1}^{|x|} ∏_{k=1}^{K} θ_{k y_i}^{z_k(x[j])} ) |D_i|/|D|
         = ( ∏_{k=1}^{K} θ_{k y_i}^{N_{kx}} ) |D_i|/|D|

where N_{kx} is the number of times feature k (e.g. a word) appears in x
Generative linear classifiers
Naive Bayes classifier (cont)

  log f_i(x) = ∑_{k=1}^{K} N_{kx} log θ_{k y_i} + log(|D_i|/|D|)
             = w^T x′ + w_0

  with:
  x′ = [N_{1x} · · · N_{Kx}]^T
  w = [log θ_{1 y_i} · · · log θ_{K y_i}]^T

- Naive Bayes is thus a log-linear model (as are Gaussian distributions with shared Σ); the sketch below checks this numerically
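A small numeric check (a sketch with made-up θ values, class prior, and counts) that the product form of the Naive Bayes score and its linear rewriting agree:

```python
import numpy as np

# Multinomial Naive Bayes scoring as a linear function of the counts x'.
theta = np.array([0.5, 0.3, 0.2])     # theta_{k, y_i}: per-feature probabilities
prior = 0.4                           # |D_i| / |D|
x_counts = np.array([3.0, 1.0, 0.0])  # x' = [N_1x ... N_Kx]^T

log_f = x_counts @ np.log(theta) + np.log(prior)    # w^T x' + w0
direct = np.log(np.prod(theta ** x_counts) * prior) # log of the product form
print(np.isclose(log_f, direct))      # True: Naive Bayes is log-linear
```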