
CMPUT 551 Fall 2011 Machine Learning
Lecture Notes

Junfeng Wen
Department of Computing Science
University of Alberta

Last Update: February 5, 2017


Preface

These are my lecture notes from CMPUT 551 Fall 2011, Machine Learning, at the University of Alberta, taught by Prof. Dale Schuurmans.

This course mainly attempted to organize existing algorithms from an optimization perspective. To understand these notes thoroughly, you should be familiar with basic linear algebra as well as fundamental optimization. (They will come back as we proceed, hopefully.)

In these notes, there are two kinds of blocks that provide more information on the idea or concept being discussed. The first kind introduces examples:

An example
This is an example.

    The second kind provides more detailed information:

    Further information

    Some further details.

Readers may want to skip some of this content.

    Notations

• Bold letters (e.g., a) denote vectors,¹ while normal letters (e.g., a) denote scalars.

• a ⪰ b means each component of a is greater than or equal to the corresponding component of b. In particular, a ⪰ 0 means each component of a is non-negative.

• a⊤ denotes the transpose of a.

• Aij denotes the entry of matrix A in the ith row and jth column. Ai: denotes the ith row of A (as a row vector) and A:j denotes the jth column of A (as a column vector). A ⪰ 0 means matrix A is positive semi-definite unless specified otherwise.

¹In these notes, all vectors are column vectors by default unless specified otherwise.


Contents

1 Linear Framework
  1.1 General Framework
  1.2 Data Representation and Error Function
  1.3 Solving Optimization
  1.4 Geometry Interpretation
  1.5 Convexity/Robustness Tradeoff
  1.6 Beyond Linear Model

2 Regularization and Kernel
  2.1 Learning with Regularization
  2.2 The Representer Theorem
  2.3 Kernel
  2.4 Operations that Preserve PSD
  2.5 Feature Selection: L1 Regularization

3 Output Transformation
  3.1 Transform Function
  3.2 Matching Loss
  3.3 Surrogate Loss for Classification

4 Support Vector Machine
  4.1 Large Margin Classification
  4.2 Duality
  4.3 Application of Duality: SVM

5 Large Output
  5.1 Multivariate Prediction
    5.1.1 Feature Expansion
    5.1.2 Kernelization
    5.1.3 Feature Selection: Sparsity
    5.1.4 Output Transformation and Matching Loss
    5.1.5 Surrogate Loss for Classification
  5.2 Structured Output

6 Unsupervised Learning
  6.1 Dimensionality Reduction
  6.2 Hard Clustering
  6.3 Kernelization
    6.3.1 Kernel Principal Component Analysis
    6.3.2 Kernelized Clustering
    6.3.3 Normalized Kernel Methods

Chapter 1

    Linear Framework

    1.1 General Framework

Most machine learning algorithms attempt to learn a function (mapping) from given examples:

(x₁, y₁), …, (x_t, y_t)  →  Learner  →  h : X ↦ Y

Figure 1.1: Learning procedure.

where x_i comes from a domain X and y_i lies in a range Y for all i ∈ {1, 2, · · · , t}. For example, our input x_i can be an image, and the output y_i ∈ {0, 1} can be an indicator of whether there is a dog in the image. In most cases, we define an error function err(ŷ, y) to measure the difference between the true output y and our prediction ŷ = h(x) for some prediction function h(·). A generic learning algorithm usually finds the function h(·) via empirical error minimization on the training set: given a hypothesis space H and t training examples (x_i, y_i), we solve

min_{h∈H} ∑_{i=1}^t err(h(x_i), y_i)

    1.2 Data Representation and Error Function

As an example that we will further investigate in the following chapters, assume that x_i ∈ Rⁿ and y_i ∈ R. For instance, a 16×16 gray-scale image can be vectorized into a 256-dimensional input. Accordingly, our hypothesis space can also be parametrized, that is,

H = {h_w | w ∈ Rⁿ, h_w(x) = w⊤x}.

Note that for this linear model, the offset (bias) is omitted. Instead of h_{w,b}(x) = w⊤x + b, we simply augment the input x_i by appending a 1 to each data point: [x_i; 1]. The offset is then encoded in the last element of w. Therefore, the offset is ignored without loss of generality hereafter. With the above assumptions, we can represent the input as a matrix X ∈ R^{t×n}, whose rows are data points, and the output as a vector y ∈ Rᵗ.

As for the error function, the following are commonly used in the machine learning literature.

L1 error: err(ŷ, y) = ‖ŷ − y‖₁
L2 error: err(ŷ, y) = ‖ŷ − y‖₂²
Lp error: err(ŷ, y) = ‖ŷ − y‖_p^p,  0 ≤ p ≤ ∞

(Recall that the L1 norm of a vector a is the sum of its absolute values |a_j|; general information about the Lp norm can be found online.¹)

If y ∈ [0, 1] is a probability, typically we can use the cross entropy to measure the error:

err(ŷ, y) = y ln(y/ŷ) + (1 − y) ln((1 − y)/(1 − ŷ))

¹http://en.wikipedia.org/wiki/Lp_space


    1.3 Solving Optimization

Now that we see the learning problem as an optimization, when is training efficient? If the error function err(ŷ, y) is convex in ŷ, we can reliably identify the global minimizer h∗. Simply setting the derivative to zero leads to the globally optimal solution in most cases. If the error function is non-convex in ŷ, then we have trouble finding the best prediction function.

Now let's investigate two concrete examples whose solutions are not so obvious.

L1 error minimization

min_w ∑_{i=1}^t |w⊤x_i − y_i|

The above formulation tries to reduce the residuals between the true outputs y_i and the predicted outputs ŷ_i = w⊤x_i. We introduce a residual vector δ ⪰ 0, whose ith component represents the absolute value of the difference between y_i and ŷ_i. Here ⪰ means element-wise greater than or equal to. Then the minimization becomes

min_{w,δ} δ⊤1
s.t. δ ⪰ 0
     Xw ⪯ y + δ
     Xw ⪰ y − δ.

This new formulation is a linear programᵃ and can be solved efficiently.

ᵃhttp://en.wikipedia.org/wiki/Linear_programming

L∞ error minimization

min_w max_i |w⊤x_i − y_i|

Recall that the L∞-norm of a vector is its component with maximum absolute value. Again, we can introduce a residual variable δ ≥ 0 and transform the original problem to

min_{w,δ} δ
s.t. δ ≥ 0
     w⊤x_i ≤ y_i + δ,  i = 1, · · · , t
     w⊤x_i ≥ y_i − δ,  i = 1, · · · , t

Again, this is a linear program.
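Both reductions can be checked numerically. A minimal sketch using scipy.optimize.linprog (the helper names fit_l1 and fit_linf are mine, not from the course):

```python
import numpy as np
from scipy.optimize import linprog

def fit_l1(X, y):
    """Solve min_w sum_i |x_i'w - y_i| as the LP above.

    Variables are z = [w; delta] with w free and delta >= 0.
    """
    t, n = X.shape
    c = np.concatenate([np.zeros(n), np.ones(t)])     # minimize 1'delta
    # Xw - delta <= y and -Xw - delta <= -y encode |Xw - y| <= delta.
    A_ub = np.block([[X, -np.eye(t)], [-X, -np.eye(t)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * n + [(0, None)] * t
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n]

def fit_linf(X, y):
    """Solve min_w max_i |x_i'w - y_i|; a single scalar slack delta."""
    t, n = X.shape
    c = np.concatenate([np.zeros(n), [1.0]])
    ones = np.ones((t, 1))
    A_ub = np.block([[X, -ones], [-X, -ones]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * n + [(0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n]
```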

    1.4 Geometry Interpretation

The geometric interpretation of the linear model ŷ = Xw is shown in Figure 1.2. Our prediction ŷ lies in the column span of the matrix X. Using the notation X:i for the ith column of X, ŷ is in fact a linear combination of the X:i, with the w_i as coefficients. The true output y can be decomposed as y = ŷ + y⊥. Minimizing the L2 squared error is equivalent to minimizing the “length” of y⊥. Therefore, the best bet is to choose ŷ such that y⊥ is perpendicular to the column space of X.

    1.5 Convexity/Robustness Tradeoff

Figure 1.3 shows the Lp error as a function of our prediction ŷ for different values of p. If p ≥ 1, the error function is convex in ŷ; if p > 1, the error function is smooth and differentiable. When p goes below 1, the error function is no longer convex, but grows gradually more slowly as the prediction moves further from the true output y. In the limit as p reaches 0, the error is zero only when the prediction matches the output perfectly, and 1 otherwise.



[Figure: the prediction ŷ shown as the projection of y onto the plane spanned by the columns X:1 and X:2, with y = ŷ + y⊥.]

Figure 1.2: Geometry of the linear model.

[Four panels plot the Lp error against the prediction ŷ, for p = 5, p = 2, p = 1 and p = 0.1.]

Figure 1.3: Error function with different p values.

Error functions with smaller p are more robust. To see this, Figure 1.4 shows some training points generated from y = x·0 + ε, ε ∼ N(0, 1²), plus one single outlier far away from the rest. Here N(µ, σ²) is the Normal distribution with mean µ and variance σ². We fit three linear models ŷ = w·x with the L1, L2 and L∞ error functions. Clearly, the L∞ line chases the outlier significantly, followed by the L2 line, and then the L1 line, which basically ignores the outlier.

[Scatter plot titled “Robust Learning”: training points, one outlier, and the fitted L1, L2 and L∞ lines.]

Figure 1.4: Learning with an outlier, under different error functions.
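A short sketch reproducing this experiment, assuming the fit_l1 and fit_linf helpers from the Section 1.3 sketch are in scope (the L2 fit is ordinary least squares):

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data from y = 0*x + eps, plus one far-away outlier.
x = rng.uniform(-2, 2, size=30)
y = rng.normal(0.0, 1.0, size=30)
x = np.append(x, 1.5)
y = np.append(y, -30.0)                        # the outlier

X = x[:, None]                                 # t x 1 design matrix, no offset
w_l2 = np.linalg.lstsq(X, y, rcond=None)[0]    # L2: chases the outlier somewhat
w_l1 = fit_l1(X, y)                            # L1: essentially ignores it
w_inf = fit_linf(X, y)                         # L-infinity: chases it hardest
print(w_l1, w_l2, w_inf)
```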

Let's discuss the phenomenon in Figure 1.4. A large p in Lp means the error grows rapidly as the prediction ŷ gets further away from the true output y, which makes the learned model more likely to chase the outlier, since the outlier creates a huge error. On the other hand, a small p means the error of one single outlier is insignificant (even though its prediction is far away from its truth; see Figure 1.3(d)). As a result, the model can robustly ignore the outlier.

If robustness is so helpful and prevents fitting noise, why not just focus on error functions with small p? There is no such thing as a free lunch. Error functions with p smaller than 1 are not convex and are difficult to minimize. Therefore, we need to trade off convexity against robustness. Table 1.1 summarizes the pros and cons of convex and robust errors.


Table 1.1: Pros and cons of convex versus robust errors.

Error type   | Pros                   | Cons
Convex error | Efficient optimization | Sensitive to noise
Robust error | Slow growth, robust    | Difficult to optimize

Robust error functions
Example 1:

err_σ(ŷ, y) = (ŷ − y)² / (σ² + (ŷ − y)²),  σ > 0

Example 2:

err(ŷ, y) = min(1, (ŷ − y)²)

Figure 1.5 illustrates how they behave as ŷ gets further away from y.

[Three panels plot the robust errors against ŷ: Example 1 with σ = 1, Example 1 with σ = 2, and Example 2.]

Figure 1.5: Robust error functions.

    1.6 Beyond Linear Model

One may wonder why we focus exclusively on the linear model ŷ = w⊤x rather than more complicated ones. In fact, the linear model can be extended to a non-linear one by expanding the input x. That is, we can construct some functions ϕ_i(x) to expand the input. The input matrix X is transformed into Φ, a feature-expanded representation, whose ith column is the feature constructed from ϕ_i(·). We predict on x by ŷ = ∑_i w_i ϕ_i(x), and we optimize

w = argmin_w ‖Φw − y‖  (1.1)

The following examples show how to use feature expansion to construct a non-linear model.

Polynomial curve fitting
Given {(x₁, y₁), (x₂, y₂), · · · , (x_t, y_t)}, where both x_i and y_i are in R, we construct the expanded feature representation as

Φ = [ 1   x₁   ···   x₁^d
      ⋮    ⋮    ⋱     ⋮
      1   x_t  ···   x_t^d ]  ∈ R^{t×(d+1)},

where d is the degree of the polynomial curve we are fitting. With d = t − 1 and (1.1), we can find the coefficients w_i of the perfect curve that passes through all t points precisely.
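A minimal NumPy sketch of this construction (variable names are mine):

```python
import numpy as np

# Sample points; a degree t-1 polynomial can interpolate all t of them.
x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([2.0, 0.5, -1.0, 3.0])
d = len(x) - 1

Phi = np.vander(x, d + 1, increasing=True)   # columns: 1, x, ..., x^d
w = np.linalg.solve(Phi, y)                  # square system: exact interpolation
assert np.allclose(Phi @ w, y)
```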


Neural network (feed-forward perceptrons)

[Diagram: input nodes x₁, …, x_n feeding hidden nodes ϕ₁(x), …, ϕ_l(x).]

As shown in the figure above, the neural network consists of one input layer, whose nodes represent the single features of the input x ∈ Rⁿ; one hidden layer, whose nodes represent expanded (hidden) features ϕ_j(x) of the input; and one output layer, which is a linear combination of the hidden features: ŷ = ∑_{j=1}^l w_j ϕ_j(x). Typically, a neural network learning procedure constructs the weights w_j and the hidden features ϕ_j(x) simultaneously.

How should we construct such features, or basis functions? One intuitive idea is that the new representation should reflect the similarity between inputs: similar inputs should have similar features. Alternatively, we can simply choose a “similarity function”. The following are two examples of such similarity functions.

Gaussian similarity (a radial basis function)

κ(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²))

1. If we use this similarity function as the feature function and choose some fixed instances µ₁, · · · , µ_l ∈ X, the new representation is

   x ↦ ϕ(x) = [κ(x, µ₁), · · · , κ(x, µ_l)]⊤.

2. If we place the centres µ_j on the inputs x_i themselves, then the new data matrix Φ is Φ_{i,j} = κ(x_i, x_j). Note that this Φ is symmetric, since κ(x_i, x_j) = κ(x_j, x_i). Moreover, if Φ is invertible, the solution to the learning problem (1.1) with L2 error is w∗ = argmin_w ‖Φw − y‖₂² = Φ⁻¹y.
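A minimal sketch of the second construction, placing the centres on the inputs (function and variable names are mine):

```python
import numpy as np

def rbf_features(X, centres, sigma):
    """Map each row of X to [kappa(x, mu_1), ..., kappa(x, mu_l)]."""
    # Squared distances between every input and every centre.
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

# Centres on the inputs themselves give a symmetric Phi, and (1.1) with
# L2 error is solved exactly whenever Phi is invertible.
X = np.random.default_rng(1).normal(size=(5, 2))
y = np.arange(5.0)
Phi = rbf_features(X, X, sigma=1.0)
w = np.linalg.solve(Phi, y)       # w* = Phi^{-1} y
assert np.allclose(Phi @ w, y)
```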

k-nearest neighbour similarity

κ(x_i, x_j) = 1 if x_i ∈ N_k(x_j), and 0 otherwise,

where N_k(x_j) is the set of the k nearest (closest) neighbours of x_j. Note that this similarity function is not symmetric.


Chapter 2

    Regularization and Kernel

    2.1 Learning with Regularization

As one may notice, too many features may lead to over-fitting, while too few features may lead to under-fitting. For example, given t distinct 2-dimensional data points (x_i, y_i) with different x values, we can always find a polynomial curve of degree t−1 that perfectly fits all the points, at the price of a highly quirky curve and poor generalization. To avoid such a situation, we have two strategies:

• Regularization – smooth the weights w. Concretely, we add a regularizer R(w) to our learning problem

  min_w ‖Φw − y‖ + βR(w)

  where β ≥ 0 is a parameter controlling the degree of regularization. R(w) = ‖w‖₁ and R(w) = ½‖w‖₂² are widely used in the machine learning literature.

• Feature selection – select a subset of the expanded features for training. We will see later that feature selection can be achieved with R(w) = ‖w‖₁ regularization.

L2 loss with L2 regularization.

min_w ‖Xw − y‖₂² + β‖w‖₂²

We can solve the above optimization by setting the derivative w.r.t. w to zero:

X⊤(Xw − y) + βw = 0  =⇒  (X⊤X + βI_n)w = X⊤y

where I_n is the identity matrix of size n × n, with ones on the diagonal and zeros everywhere else. Note that (X⊤X + βI_n) is invertibleᵃ as long as β > 0. Therefore, the solution is

w∗ = (X⊤X + βI_n)⁻¹X⊤y

ᵃThe real symmetric matrix X⊤X can be eigen-decomposed as X⊤X = Q⊤ΛQ, where Q is an orthogonal matrix consisting of eigenvectors of X⊤X, and Λ is diagonal, with the corresponding eigenvalues (greater than or equal to 0) on the diagonal. Then X⊤X + βI_n = Q⊤(Λ + βI_n)Q. All elements on the diagonal of Λ + βI_n are greater than 0 as long as β > 0. Thus X⊤X + βI_n is always invertible, with (X⊤X + βI_n)⁻¹ = Q⊤(Λ + βI_n)⁻¹Q.
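A minimal sketch of this closed form (names are mine):

```python
import numpy as np

def ridge(X, y, beta):
    """Closed-form L2-regularized least squares: solve (X'X + beta I)w = X'y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + beta * np.eye(n), X.T @ y)

# Example: beta > 0 keeps the system well-posed even with correlated features.
X = np.random.default_rng(2).normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0])
w = ridge(X, y, beta=0.1)
```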

    2.2 The Representer Theorem

The representer theorem¹ is one of the most important theorems in machine learning. It states an important property of L2 regularization.

Theorem 1. For any loss function L (> −∞, i.e., bounded from below), the solution

w∗ = argmin_w L(Xw, y) + β‖w‖₂²  (2.1)

always satisfies w∗ = X⊤α for some α ∈ Rᵗ.

¹The representer theorem here is simplified to lighten the mathematical burden. Visit http://en.wikipedia.org/wiki/Representer_theorem for a complete and formal statement.


Proof. Any w can be decomposed as w = w₀ + w₁, where w₁ ∈ RowSpace(X), that is, w₁ = X⊤α for some α ∈ Rᵗ, and w₀ ⊥ RowSpace(X) so that Xw₀ = 0. First, because Xw = X(w₀ + w₁) = Xw₁, the w₀ component cannot affect the loss function L. Second, since ‖w‖₂² = ‖w₀‖₂² + ‖w₁‖₂², to minimize β‖w‖₂² we simply set w₀ to zero. Therefore, the optimal w∗ is w₁∗, which is equal to X⊤α for some α ∈ Rᵗ. □

With the representer theorem, we obtain an equivalent formulation of the learning problem:

min_α L(XX⊤α, y) + βα⊤XX⊤α

To predict the output of a new point x₀, instead of w⊤x₀, we compute the prediction as ŷ = α⊤Xx₀. Furthermore, it turns out that we do not care about the inputs themselves; what really matters is the inner product x_i⊤x_j. With the notation K = XX⊤ and k₀ = Xx₀, the learning problem (2.1) becomes

min_α L(Kα, y) + βα⊤Kα

and at test time, ŷ = k₀⊤α. Recall that in the previous chapter we used a new feature vector ϕ(x) for training. By the representer theorem, we no longer need the feature vectors explicitly if we have access to their inner products ϕ(x_i)⊤ϕ(x_j) directly. This technique of replacing inner products is sometimes called the kernel trick.
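For the L2 loss this adjoint problem has the closed-form solution (K + βI)α = y; a minimal kernel ridge sketch under that assumption (names are mine):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_ridge_fit(K, y, beta):
    # For L2 loss, minimizing L(K alpha, y) + beta alpha' K alpha
    # is achieved by solving (K + beta I) alpha = y.
    return np.linalg.solve(K + beta * np.eye(len(y)), y)

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

K = rbf_kernel(X, X)
alpha = kernel_ridge_fit(K, y, beta=0.1)

x0 = np.zeros((1, 2))
k0 = rbf_kernel(X, x0)[:, 0]      # k0 plays the role of Xx0 in kernel form
y_hat = k0 @ alpha                # test-time prediction k0' alpha
```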

    2.3 Kernel

    Now let us take a side step and learn the definitions of kernel operator and kernel matrix.

Definition 1. A kernel operator k(·, ·) is a symmetric function k : X × X ↦ R such that for any finite subset x₁, x₂, · · · , x_t ∈ X,

K = [ k(x₁, x₁)  ···  k(x₁, x_t)
      ⋮           ⋱    ⋮
      k(x_t, x₁) ···  k(x_t, x_t) ]

is positive semidefinite (PSD). Equivalently, k(x_i, x_j) = ⟨ϕ(x_i), ϕ(x_j)⟩, the inner product of ϕ(x_i) and ϕ(x_j) for some feature representation ϕ(·).

Definition 2. A kernel matrix K is a matrix such that K = ΦΦ⊤ for some matrix Φ.

Theorem 2. The following statements are equivalent:

1. K is a kernel matrix;

2. K is positive semidefinite;

3. the eigenvalues of K are non-negative.

Proof. 1 ⇒ 2: for all a,

a⊤Ka = a⊤ΦΦ⊤a = v⊤v ≥ 0,  with v = Φ⊤a.

2 ⇒ 3: trivial, by definition.
3 ⇒ 1: let K = QΛQ⊤ be the eigen-decomposition of K. Since all the eigenvalues of K are non-negative, Φ = QΛ^{1/2} exists and K = ΦΦ⊤. □

    2.4 Operations that Preserve PSD

Given the expanded feature function ϕ(·), we can always construct a kernel matrix by K = ΦΦ⊤. However, the converse is not as easy: ϕ(·) is not always easily accessible given a kernel matrix K. We can find ϕ(·) easily if the kernel is built from the PSD-preserving operations listed in Table 2.1.


Table 2.1: Operations that preserve PSD, given x, y ∈ X.

Kernel modification                        | Corresponding feature modification
k̃(x, y) = k(x, y) + c,  c ≥ 0             | ϕ̃(x) = [ϕ(x); √c]
k̃(x, y) = k(x, y)/√(k(x, x) k(y, y))      | ϕ̃(x) = ϕ(x)/√⟨ϕ(x), ϕ(x)⟩
k̃(x, y) = k₁(x, y) + k₂(x, y)             | ϕ̃(x) = [ϕ₁(x); ϕ₂(x)]
k̃(x, y) = a·k(x, y),  a > 0               | ϕ̃(x) = √a·ϕ(x)
k̃(x, y) = k₁(x, y)·k₂(x, y)               | ϕ̃(x) = [· · ·, ϕ₁ᵢ(x)ϕ₂ⱼ(x), · · ·] (feature cross product)
k̃(x, y) = k₂(ϕ₁(x), ϕ₁(y))                | ϕ̃(x) = ϕ₂(ϕ₁(x))

    2.5 Feature Selection: L1 Regularization

So far we have seen the benefits of L2 regularization (R(w) = ½‖w‖₂² leads to the kernel trick). Now let us talk about L1 regularization (R(w) = ‖w‖₁).

Consider the problem of feature selection: given l features ϕ_i(x), i = 1, · · · , l, choose a subset of features to minimize the loss. This problem is difficult because we have 2ˡ subsets of features in total; it is intractable to enumerate all the possible subsets. Generally, finding a subset of bounded size that minimizes training loss is NP-hard.

The problem of feature selection can be formulated using the L0 regularizer:

min_w ∑_i L(x_i⊤w, y_i) + β‖w‖₀,  β ≥ 0,

where ‖w‖₀ is the number of non-zero elements of w. Again, this is intractable, so we turn to a convex relaxation: L1 regularization. That is,

min_w ∑_i L(x_i⊤w, y_i) + β‖w‖₁,  β ≥ 0.  (2.2)

Recall that the L1 norm of w is convex. Moreover, as long as the loss function is convex in w, the whole objective remains convex in w, and is thus solvable using subgradient descent.

Subgradient
Suppose f(·) is a convex function. If it is differentiable, we have

f(x) ≥ f(x₀) + (x − x₀)⊤∇f(x₀),

where ∇f(x₀) is the gradient of f at x₀. A subgradient at x₀ is any d₀ such that

f(x) ≥ f(x₀) + (x − x₀)⊤d₀.

If f(·) is differentiable at x₀, then d₀ is unique and d₀ = ∇f(x₀). However, if f(·) is not differentiable at x₀, then d₀ is not unique (see the figure below).

[Figure: a convex curve with the unique tangent at a smooth point and several supporting lines at a kink.]

Gradient and (non-unique) subgradient.

Let F(w) be our objective:

F(w) = L(w) + β‖w‖₁ = ∑_i L(x_i⊤w, y_i) + β‖w‖₁.

Consider a descent step from w^(k), the parameters at the kth iteration. The partial derivative w.r.t. w_j^(k) is¹

∂F/∂w_j^(k) = ∂L/∂w_j^(k) + β·sign(w_j^(k)) = ∑_i L′(x_i⊤w^(k), y_i) X_ij + β·sign(w_j^(k)),  if w_j^(k) ≠ 0.  (2.3)

Then we iterate:

w_j^(k+1) = w_j^(k) − η ∂F/∂w_j^(k)

with η > 0 as the step size.

What if w_j^(k) = 0? Suppose w_j^(k) = 0 and look at Eq. (2.3). If |∂L/∂w_j^(k)| = |∑_i L′(x_i⊤w^(k), y_i) X_ij| < β, then w_j remains at 0; otherwise we can reduce the objective by moving w_j away from 0, in the direction of −∂L/∂w_j^(k). To see this, assume w_j^(k) = 0 and consider the difference between two consecutive objective values when w_j moves away from 0. For a smooth loss (where the following approximation is valid in a small neighbourhood of w^(k)),

L(w^(k+1)) ≈ L(w^(k)) + ∇L(w^(k))⊤(w^(k+1) − w^(k))
           = L(w^(k)) − η ∇L(w^(k))⊤∇L(w^(k))    when w^(k+1) = w^(k) − η∇L(w^(k))
           = L(w^(k)) − η‖∇L(w^(k))‖₂²

We can see that the contribution of the jth component to the difference in loss is (approximately) −η(∂L/∂w_j^(k))². For the regularizer, the difference is βη|∂L/∂w_j^(k)| (changing from w_j^(k) = 0 to w_j^(k+1) = −η ∂L/∂w_j^(k)). So the total contribution to the difference in objectives is

−η(∂L/∂w_j^(k))² + βη|∂L/∂w_j^(k)|.

• If |∂L/∂w_j^(k)| < β, this quantity is positive: moving w_j away from 0 would increase the objective.

• If |∂L/∂w_j^(k)| > β, this quantity is negative: moving w_j in the direction of −∂L/∂w_j^(k) reduces the objective.

Therefore, a larger β leads to more zeros in w. This sparse solution implicitly selects features, since w_j = 0 means the jth feature is unused.
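A minimal sketch of this coordinate-wise subgradient rule for the L2 loss (names are mine; in practice one would typically use proximal methods instead):

```python
import numpy as np

def l1_subgradient_step(w, X, y, beta, eta):
    """One step of the coordinate-wise rule above, for the L2 loss
    L(z, y) = (z - y)^2, whose derivative is L'(z, y) = 2(z - y)."""
    grad_L = 2 * X.T @ (X @ w - y)          # dL/dw_j for every j
    w_new = w.copy()
    for j in range(len(w)):
        if w[j] != 0:
            # Eq. (2.3): loss gradient plus beta * sign(w_j)
            w_new[j] = w[j] - eta * (grad_L[j] + beta * np.sign(w[j]))
        elif abs(grad_L[j]) > beta:
            # |dL/dw_j| > beta: move w_j off 0 in the direction -dL/dw_j;
            # the regularizer contributes beta * sign of the new w_j.
            w_new[j] = -eta * (grad_L[j] - beta * np.sign(grad_L[j]))
        # else: |dL/dw_j| < beta, so w_j stays at 0
    return w_new
```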

If the loss function L(·, ·) is convex in the first argument, then the L1-regularized problem (2.2) can be solved efficiently. To solve the problem, we introduce a slack variable ξ ∈ Rⁿ and reformulate the problem as

min_{w,ξ} ∑_i L(x_i⊤w, y_i) + β1_n⊤ξ
s.t. ξ ⪰ w
     ξ ⪰ −w

For example, if L(ŷ, y) = (ŷ − y)², the above optimization is a quadratic program; if L(ŷ, y) = |ŷ − y|, it is a linear program.

¹We write ∂F/∂w_j evaluated at w_j = w_j^(k) as ∂F/∂w_j^(k).

Chapter 3

    Output Transformation

    3.1 Transform Function

So far we have talked about the regression task, where the output y ∈ R is a real number. In some cases, we need a special type of output. For instance, we may need a non-negative output y ≥ 0, a probability output y ∈ [0, 1], or a class indicator y ∈ {−1, +1}. For these special outputs, we can choose a transform function f : R ↦ Y such that the range of f(·) is Y, the desired output domain.

Formally speaking, let ẑ = w⊤x be the pre-prediction and ŷ = f(ẑ) be the post-prediction.

• For y ≥ 0: f₁(ẑ) = e^ẑ. That is, ŷ = e^{w⊤x}.

• For y ∈ [0, 1]: f₂(ẑ) = σ(ẑ) = (1 + e^{−ẑ})⁻¹, where σ(·) is the so-called sigmoid function in the literature. So ŷ = (1 + e^{−w⊤x})⁻¹.

• For y ∈ {−1, +1}: f₃(ẑ) = sign(ẑ), the sign function. So ŷ = sign(w⊤x). Technically, when ẑ = 0, ŷ becomes 0, so we need to specify this particular case as ŷ = −1 or ŷ = +1.

    Figure 3.1 shows the above transform functions.

[Plot of the transform functions f₁(z) (exponential), f₂(z) (sigmoid) and f₃(z) (sign) against z.]

Figure 3.1: Transform function examples.

    3.2 Matching Loss

Once we set up the function transforming the pre-prediction into our desired domain, we can perform training as usual. However, we must adjust the loss function accordingly. If a specific f(·) is combined with an arbitrary loss, exponentially many local minima may be created¹. This means an arbitrary combination will not produce a convex optimization. To solve the training problem effectively, we need to specify a matching loss for the transform function f(·).

¹Peter Auer, Mark Herbster, Manfred K. Warmuth, et al. Exponentially many local minima for single neurons. Advances in Neural Information Processing Systems, pages 316–322, 1996.


Given any differentiable and strictly increasing function f(·), define the matching loss L(ŷ, y) as

L(ŷ, y) = L(f(ẑ), f(z))
        = ∫_z^ẑ (f(u) − f(z)) du
        = F(u)|_z^ẑ − f(z)u|_z^ẑ
        = F(ẑ) − F(z) − f(z)(ẑ − z)   (3.1)

where F′(·) = f(·). Note that since F(·) is strictly convex (F″(u) = f′(u) > 0 because f(·) is strictly increasing), L(ŷ, y) is strictly convex in ẑ, and hence in w. Because of convexity,

F(ẑ) ≥ F(z) + f(z)(ẑ − z)  =⇒  L(ŷ, y) ≥ 0.

Moreover,

L(f(ẑ), f(z)) = 0 ⇐⇒ ẑ = z.

Eq. (3.1) is also known as the Bregman divergence D_F(ẑ, z). Now let us look at our previous examples.

• Identity transform: y = f(z) = z, F(z) = ½z².

  L(f(ẑ), f(z)) = F(ẑ) − F(z) − f(z)(ẑ − z)
               = ½ẑ² − ½z² − z(ẑ − z)
               = ½ẑ² − ẑz + ½z²
               = ½(ẑ − z)²
               = ½(ŷ − y)²  (squared loss, L2 loss).

• Exponential transform: y = f₁(z) = e^z, F₁(z) = e^z.

  L(f(ẑ), f(z)) = F(ẑ) − F(z) − f(z)(ẑ − z)
               = e^ẑ − e^z − e^z(ẑ − z)
               = y ln(y/ŷ) + ŷ − y  (un-normalized entropy).

• Sigmoid transform: y = f₂(z) = σ(z), f₂⁻¹(y) = ln(y/(1 − y)), F₂(z) = ln(e^z + 1).

  L(f(ẑ), f(z)) = F(ẑ) − F(z) − f(z)(ẑ − z)
               = ln(e^ẑ + 1) − ln(e^z + 1) − σ(z)(ẑ − z)
               = y ln(y/ŷ) + (1 − y) ln((1 − y)/(1 − ŷ))  (cross entropy).
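A minimal sketch verifying two of these matching losses numerically (names are mine):

```python
import numpy as np

def matching_loss(F, f, z_hat, z):
    """Bregman divergence D_F(z_hat, z) = F(z_hat) - F(z) - f(z)(z_hat - z)."""
    return F(z_hat) - F(z) - f(z) * (z_hat - z)

z_hat, z = 1.3, 0.4

# Identity transform: recovers half the squared loss.
sq = matching_loss(lambda u: 0.5 * u**2, lambda u: u, z_hat, z)
assert np.isclose(sq, 0.5 * (z_hat - z) ** 2)

# Sigmoid transform: recovers the cross entropy between y and y_hat.
sigma = lambda u: 1 / (1 + np.exp(-u))
ce = matching_loss(lambda u: np.log(np.exp(u) + 1), sigma, z_hat, z)
y, y_hat = sigma(z), sigma(z_hat)
assert np.isclose(ce, y * np.log(y / y_hat)
                      + (1 - y) * np.log((1 - y) / (1 - y_hat)))
```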

Sadly, we do not have “the” matching loss for the sign(·) transform, because it is neither differentiable nor strictly increasing. One intuitive alternative is to count how many instances we misclassify, using an indicator loss function:

L(ŷ, y) = 1(ŷ ≠ y)  (3.2)

where 1(A) is one when A is true. Assume ŷ = sign(w⊤x − b) (previously we ignored b by augmenting x); then the problem becomes

min_{w,b} ∑_{i=1}^t 1(ŷ_i ≠ y_i)
s.t. ŷ = sign(Xw − b1_t)

It is not surprising that the above problem is not tractable. However, if the dataset is linearly separable, we can run a linear program to solve it, by introducing a slack variable δ ∈ R:

max_{w,b,δ} δ
s.t. y_i(w⊤x_i − b) ≥ δ ∀i
     δ ≤ 1.


The last constraint (δ ≤ 1) is necessary because otherwise we could amplify w, b and δ simultaneously and make the objective value go to infinity. Note that the problem is solvable even when the dataset is not linearly separable. At the attained solution, the dataset is

• linearly separable if and only if δ = 1;

• not linearly separable if and only if δ < 0.

Another linear programming formulation
Introduce a slack variable ξ ∈ Rᵗ:

min_{w,b,ξ} ξ⊤1_t
s.t. y_i(w⊤x_i − b) ≥ 1 − ξ_i ∀i
     ξ ⪰ 0_t.

After solving the problem, the dataset is

• linearly separable if and only if ξ = 0_t, the vector of size t with all zeros;

• not linearly separable if and only if ∃i, ξ_i > 0.

    3.3 Surrogate Loss for Classification

To overcome the difficulty of the non-continuous sign transform, one can apply two standard approaches. The first is to approximate the sign function with the tanh function; the second is to approximate the loss with a surrogate loss.

The tanh function is defined as

ŷ = tanh(ẑ) = (e^ẑ − e^{−ẑ}) / (e^ẑ + e^{−ẑ}).

Luckily, this replacement recovers differentiability and the strictly increasing property, so we can construct its matching loss and hence a convex objective. However, the transformed output ŷ lies in the range [−1, 1], a “soft” classification, rather than in {−1, +1}, the “hard” classification. It does not automatically guarantee a good approximation of the minimum classification error.

The second approach is to retain the sign transform but use a surrogate loss. The problem is: how do we find an effective approximation of the indicator loss (3.2) that is efficiently computable? To illustrate alternative approximations, let's first define a new concept, the margin.

Definition 3. The margin m is the product of the true output (label) y and the pre-prediction ẑ. That is, m = yẑ.

Note that m > 0 indicates sign(ẑ) = y, so the instance is correctly classified, while m < 0 indicates sign(ẑ) ≠ y, so the instance is misclassified. Therefore, we can try surrogate losses based on m. For example, the squared loss:

L(m) = (w⊤x − y)²
     = (ẑ − y)²
     = ẑ² − 2ẑy + y²
     = m² − 2m + 1
     = (m − 1)².

The derivation uses the fact that y² = 1 for all instances. Now we can see the severe drawback of this loss: it penalizes the model when the outcome is too correct (m > 1), which does not make sense in the context of classification (ŷ = sign(w⊤x)). Therefore, here we list some more reasonable choices of surrogate loss; their illustrative plots are shown in Figure 3.2.

• Exponential loss: L(m) = e^{−m}. Obviously, this surrogate loss is convex in ẑ, so training is efficient. However, it is unbounded and grows exponentially when the model misclassifies an instance (m < 0); as a result, it is not robust.


• Binomial deviance: L(m) = ln(1 + e^{−m}). It is more robust than the exponential loss because as m → −∞ the slope approaches −1. However, it is not an upper bound on the indicator loss; see Figure 3.2.

• Hinge loss: L(m) = max(0, 1 − m). This loss is piece-wise linear, so we can minimize it using linear programming. Moreover, the growth rate is fixed (−1) for m < 1. This surrogate loss is widely adopted in the classification literature.

• Robust loss: L(m) = 1 − tanh(m). As shown in Figure 3.2, this surrogate loss is bounded, so the model does not chase outliers much and is very robust. Unfortunately, unlike the previous surrogate losses, this one is not convex in ẑ.

[Plot of the indicator, squared, exponential, binomial deviance, hinge and robust losses against the margin m.]

Figure 3.2: Surrogate losses.
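Minimal definitions of these margin losses, e.g. for plotting (a sketch; names are mine):

```python
import numpy as np

surrogates = {
    "indicator":         lambda m: (m < 0).astype(float),
    "squared":           lambda m: (m - 1) ** 2,
    "exponential":       lambda m: np.exp(-m),
    "binomial deviance": lambda m: np.log1p(np.exp(-m)),
    "hinge":             lambda m: np.maximum(0.0, 1.0 - m),
    "robust":            lambda m: 1.0 - np.tanh(m),
}

m = np.linspace(-2, 2, 5)
for name, loss in surrogates.items():
    print(f"{name:18s}", np.round(loss(m), 3))
```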

Given a convex surrogate loss, efficient minimization is guaranteed, and extensions such as feature expansion, kernelization (L2 regularization) and feature selection (L1 regularization) are retained.

Chapter 4

    Support Vector Machine

    4.1 Large Margin Classification

In this section, we discuss how to effectively find a classifier from a geometric point of view. As a first example, consider a classification dataset that is linearly separable. There are probably many linear classifiers ((w, b) pairs) consistent with the dataset, as shown in Figure 4.1.

Figure 4.1: Linearly separable dataset, where squares and circles are from different classes and the dashed lines are possible linear classifiers consistent with the dataset.

Are some of them better than the others? Intuitively, the one that maximizes the margin probably generalizes better; that is, the linear classifier that maximizes the minimum distance between the data points and the decision hyperplane, as shown in Figure 4.2.

Figure 4.2: Large margin linear classifier. Both H1 and H2 are consistent with the dataset, but H2 has a larger margin than H1.

To maximize such a margin, we first need to know how to compute it efficiently. Recall that the classification hyperplane is defined as {x | w⊤x − b = 0}. The distance between a point x_i and the hyperplane is

δ_i = |w⊤x_i − b| / ‖w‖₂.

Recall that y_i ∈ {−1, +1} for binary classification, so we can get rid of the absolute value:

δ_i = y_i(w⊤x_i − b) / ‖w‖₂.


Therefore, formally speaking, to find the maximum margin classifier we solve the following problem:

max_{w,b} min_i δ_i
⇐⇒ max_{w,b,γ} γ  s.t. y_i(w⊤x_i − b)/‖w‖₂ ≥ γ ∀i
⇐⇒ max_{w,b,γ} γ  s.t. y_i(w⊤x_i − b) ≥ γ‖w‖₂ ∀i,

where γ serves as a universal (for all i) lower bound on the margin. Note that the last optimization is under-constrained: we can rescale w, b and γ simultaneously so that the objective value goes to infinity. Hence, we impose one more constraint, γ = 1/‖w‖₂, and optimize

max_{w,b} 1/‖w‖₂  s.t. y_i(w⊤x_i − b) ≥ 1, ∀i
⇐⇒ min_{w,b} ½‖w‖₂²  s.t. y_i(w⊤x_i − b) ≥ 1, ∀i,

which is a quadratic program.

The above derivation assumes that the dataset is linearly separable. Actually, after solving the quadratic program, we will find that only a few data points are critical in determining the hyperplane's parameters. These points are called support vectors, since they are closest to the hyperplane and hence “support” it.

Definition 4. Support vectors are the subset of training points that are closest to the optimal hyperplane defined by w, b.

Mathematically, the support vectors are the points that satisfy the equality in the constraints: y_i(w⊤x_i − b) = 1. In fact, removing all data other than the support vectors does not change the optimal solution. The resultant classifier is called the support vector machine (SVM).

What if the dataset is not linearly separable? In this case, there is no linear classifier consistent with the dataset, and the previous quadratic program is infeasible (no solution can be found). A standard (heuristic) approach is to introduce a slack variable ξ_i for each margin constraint and, at the same time, maximize the margin at the price of the sum of slacks. Formally,

min_{w,b,ξ} (β/2)‖w‖₂² + ξ⊤1_t
s.t. y_i(w⊤x_i − b) ≥ 1 − ξ_i, ∀i
     ξ ⪰ 0_t   (4.1)

β > 0 is a tuning trade-off parameter. ξ_i allows the classifier to “give up” on some noisy points and introduces a “cost” for such slackness in the objective (see Figure 4.3). This is called the soft support vector machine.

    Figure 4.3: Soft SVM with slack variable.

One interesting fact: if we treat (β/2)‖w‖₂² as a regularizer, problem (4.1) is equivalent to using the hinge loss as a surrogate loss with L2 regularization. The geometric approach leads to the same solution as discussed in the previous chapter.
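A minimal subgradient-descent sketch of this hinge-loss view of (4.1) (a heuristic alternative to solving the QP exactly; names are mine):

```python
import numpy as np

def soft_svm_subgradient(X, y, beta, eta=0.001, iters=5000):
    """Minimize (beta/2)||w||^2 + sum_i max(0, 1 - y_i (w'x_i - b))
    by plain subgradient descent."""
    t, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        active = y * (X @ w - b) < 1       # points with non-zero hinge loss
        # Active hinge terms contribute -y_i x_i (w.r.t. w) and +y_i (w.r.t. b).
        grad_w = beta * w - X[active].T @ y[active]
        grad_b = y[active].sum()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b
```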


    4.2 Duality

In this section, we revisit the concept of duality in optimization. Consider a constrained optimization problem:

min_w l(w)
s.t. Aw ⪰ b
     Cw = d,

which is called the primal problem. Its Lagrangian is

L(w, λ, µ) = l(w) + λ⊤(b − Aw) + µ⊤(d − Cw)

where λ ⪰ 0_t and µ (unconstrained) are Lagrange multipliers for constraint violations. The dual function is defined as

D(λ, µ) = min_w L(w, λ, µ) = min_w l(w) + λ⊤(b − Aw) + µ⊤(d − Cw)

The dual function has two important properties:

• D(λ, µ) is always concave in λ, µ, because a minimum of linear functions (in λ and µ) is concave.

• D(λ, µ) provides a lower bound on the optimal objective of the primal problem.

Let

p∗ = min_w l(w)  s.t. Aw ⪰ b, Cw = d
w∗ = argmin_w l(w)  s.t. Aw ⪰ b, Cw = d.

The following theorem states weak duality:

Theorem 3. For any λ ⪰ 0_t and µ, we have D(λ, µ) ≤ p∗.

Proof. Fix any λ ⪰ 0_t, µ:

D(λ, µ) = min_w l(w) + λ⊤(b − Aw) + µ⊤(d − Cw)
        ≤ l(w∗) + λ⊤(b − Aw∗) + µ⊤(d − Cw∗)
        ≤ l(w∗) = p∗

For the first inequality, w∗ is a particular choice of w, so its value in the Lagrangian is at least the minimum over w. For the second inequality, note that w∗ is feasible, so Aw∗ ⪰ b and Cw∗ = d; as a result, λ⊤(b − Aw∗) ≤ 0 (since λ ⪰ 0_t) and µ⊤(d − Cw∗) = 0. □

Now we define the dual problem (in contrast to the primal problem) as

max_{λ⪰0_t, µ} D(λ, µ)
⇐⇒ max_{λ⪰0_t, µ} min_w l(w) + λ⊤(b − Aw) + µ⊤(d − Cw).

Sometimes we can solve the inner minimization over w analytically to get a closed form for D(λ, µ). Notice that since D(λ, µ) is concave, the outer maximization is efficiently solvable, even when the primal problem is not convex.

Let

d∗ = max_{λ⪰0_t, µ} D(λ, µ)
(λ∗, µ∗) = argmax_{λ⪰0_t, µ} D(λ, µ).

The following theorem states strong duality:

Theorem 4 (ref). If l(w) is convex in w and the constraints Aw ⪰ b, Cw = d are strictly feasible (i.e., there exists w₀ satisfying both constraints), then d∗ = p∗.


Strong duality guarantees that the optimal objective value of the dual problem is identical to that of the primal problem. The efficiently computable dual problem is therefore a bridge to finding the solution of the primal problem. The remaining question is how to recover w∗ from λ∗ and µ∗. The following theorem provides an effective approach.

Theorem 5. If strong duality holds, then

max_{λ⪰0_t, µ} min_w L(w, λ, µ) = min_w max_{λ⪰0_t, µ} L(w, λ, µ).

Proof. We first prove a lemma. Note that for any feasible w (i.e., Aw ⪰ b and Cw = d), we have

l(w) ≥ max_{λ⪰0_t, µ} l(w) + λ⊤(b − Aw) + µ⊤(d − Cw) = max_{λ⪰0_t, µ} L(w, λ, µ).  (4.2)

This is because, for feasible w and any λ ⪰ 0_t, the term λ⊤(b − Aw) is non-positive while µ⊤(d − Cw) = 0. If w∗ is feasible and minimizes l(w), then Eq. (4.2) implies

l(w∗) ≥ max_{λ⪰0_t, µ} L(w∗, λ, µ) ≥ min_w max_{λ⪰0_t, µ} L(w, λ, µ).

This inequality means

min_{w: Aw⪰b, Cw=d} l(w) ≥ min_{w: Aw⪰b, Cw=d} max_{λ⪰0_t, µ} L(w, λ, µ) ≥ min_w max_{λ⪰0_t, µ} L(w, λ, µ).  (4.3)

Now we are ready to prove the theorem:

min_w max_{λ⪰0_t, µ} L(w, λ, µ) ≥ max_{λ⪰0_t, µ} min_w L(w, λ, µ)   (by the max-min inequality¹)
                               = min_{w: Aw⪰b, Cw=d} l(w)           (by strong duality)   (4.4)
                               ≥ min_w max_{λ⪰0_t, µ} L(w, λ, µ)    (by the lemma (4.3))

Because the LHS (left-hand side) is the same as the last row, the inequalities in between must be equalities. This proves the claim. □

Let w∗, λ∗, µ∗ be the optimal solutions of the primal/dual problems. The theorem above implies the complementary slackness property:

• if λ∗_i > 0, then A_i:w∗ = b_i;

• if A_i:w∗ > b_i, then λ∗_i = 0.

To see this, note that:

l(w∗) = max_{λ⪰0_t, µ} min_w L(w, λ, µ)   (from Eq. (4.4))
      = D(λ∗, µ∗)                          (definition of (λ∗, µ∗))
      = min_w L(w, λ∗, µ∗)                 (definition of D)
      ≤ L(w∗, λ∗, µ∗)                      (w∗ may not attain the min over w)
      = l(w∗) + λ∗⊤(b − Aw∗) + µ∗⊤(d − Cw∗)
      ≤ l(w∗)

The last inequality holds because λ∗ ⪰ 0_t and Aw∗ ⪰ b, so λ∗⊤(b − Aw∗) must be non-positive. However, the LHS and the last row are again the same, so equality holds in between. In particular, λ∗_i(b_i − A_i:w∗) = 0 for each i, and complementary slackness follows.

Now, back to our question of finding w∗ given λ∗ and µ∗. With the help of complementary slackness, we can find w∗ by solving Ãw∗ = b̃, Cw∗ = d, where Ã, b̃ are built from the constraints with λ∗_i > 0.

¹https://en.wikipedia.org/wiki/Max-min_inequality


    4.3 Application of Duality: SVM

Recall the soft SVM optimization (hinge loss with L2 regularizer):

min_{w,b} (β/2)‖w‖₂² + ∑_{i=1}^t max(0, 1 − y_i(w⊤x_i − b))

which, in matrix form, is equivalent to

min_{w,b,ξ} (β/2)‖w‖₂² + ξ⊤1_t
s.t. ξ ⪰ 1_t − ∆(y)(Xw − b1_t)
     ξ ⪰ 0_t,

where ∆(y) is the diagonal matrix with y on its diagonal. Introduce the Lagrangian

L(w, b, ξ, λ, µ) = (β/2)w⊤w + ξ⊤1_t + µ⊤(0_t − ξ) + λ⊤[1_t − ∆(y)(Xw − b1_t) − ξ]
                 = (β/2)w⊤w + ξ⊤(1_t − µ − λ) + λ⊤[1_t − ∆(y)(Xw − b1_t)],

where λ ⪰ 0_t, µ ⪰ 0_t are Lagrange multipliers for constraint violations. Now we find the dual problem analytically. Given fixed λ and µ, consider the minimization over w, b, ξ. We first eliminate ξ:

∂L/∂ξ = 1_t − µ − λ = 0
=⇒ µ = 1_t − λ
=⇒ λ ⪯ 1_t   (as µ ⪰ 0_t)

This eliminates ξ and µ simultaneously:

L(w, b, λ) = (β/2)w⊤w + λ⊤[1_t − ∆(y)(Xw − b1_t)],   0_t ⪯ λ ⪯ 1_t.

Next, we eliminate b:

∂L/∂b = λ⊤y = 0
=⇒ L(w, λ) = (β/2)w⊤w + λ⊤1_t − λ⊤∆(y)Xw,   0_t ⪯ λ ⪯ 1_t, λ⊤y = 0.

Finally, we eliminate w:

∂L/∂w = βw − X⊤∆(y)λ = 0
=⇒ w = (1/β)X⊤∆(y)λ
=⇒ L(λ) = λ⊤1_t − (1/(2β))λ⊤∆(y)XX⊤∆(y)λ,   0_t ⪯ λ ⪯ 1_t, λ⊤y = 0.

Therefore, the dual problem of the regularized hinge loss is

max_λ λ⊤1_t − (1/(2β))λ⊤∆(y)XX⊤∆(y)λ
s.t. 0_t ⪯ λ ⪯ 1_t,  λ⊤y = 0.   (4.5)

This is a quadratic program. Once we find the optimal λ∗, the remaining issue is to recover the classifier (w∗, b∗) from it. w∗ is easy to compute:

w∗ = (1/β)X⊤∆(y)λ∗.

As for b∗, complementary slackness is required. For any 0 < λ∗_i < 1, we have

λ∗_i > 0 =⇒ 1 − y_i(X_i:w∗ − b∗) − ξ∗_i = 0,
λ∗_i < 1 =⇒ µ∗_i > 0 =⇒ ξ∗_i = 0.


Therefore,

y_i(X_i:w∗ − b∗) = 1
=⇒ b∗ = X_i:w∗ − y_i = (1/β)X_i:X⊤∆(y)λ∗ − y_i.

For a test point x,

ŷ = sign(w∗⊤x − b∗) = sign((1/β)x⊤X⊤∆(y)λ∗ − b∗)

Note that the above classification rule and problem (4.5) can be kernelized: all we need is the inner product x_i⊤x_j.
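A minimal numerical sketch of solving (4.5) and recovering (w∗, b∗), using scipy.optimize.minimize with SLSQP as a general-purpose stand-in for a dedicated QP solver (names are mine):

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y, beta):
    """Solve the dual QP (4.5) and recover (w*, b*); y has entries in {-1, +1}."""
    t = X.shape[0]
    G = (y[:, None] * (X @ X.T)) * y[None, :]      # Delta(y) X X' Delta(y)
    # Negate the concave dual objective so we can minimize it.
    fun = lambda lam: -(lam.sum() - lam @ G @ lam / (2 * beta))
    jac = lambda lam: -(np.ones(t) - G @ lam / beta)
    cons = ({'type': 'eq', 'fun': lambda lam: lam @ y, 'jac': lambda lam: y},)
    res = minimize(fun, np.zeros(t), jac=jac, method='SLSQP',
                   bounds=[(0.0, 1.0)] * t, constraints=cons)
    lam = res.x
    w = X.T @ (y * lam) / beta                     # w* = (1/beta) X' Delta(y) lam
    sv = (lam > 1e-6) & (lam < 1 - 1e-6)           # points with 0 < lam_i < 1
    b = np.mean(X[sv] @ w - y[sv]) if sv.any() else 0.0
    return w, b

# Tiny separable example
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-1.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_dual(X, y, beta=1.0)
print(np.sign(X @ w - b))                          # should match y
```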

Chapter 5

    Large Output

    5.1 Multivariate Prediction

So far we have talked about single-output prediction, where y_i ∈ R is one-dimensional. What if the target is a vector of multiple values, say y_i ∈ R^k? Because of the multiple outputs, the linear model w becomes a matrix W ∈ R^{n×k}, whose rows correspond to features and whose columns correspond to outputs. That is, y_i⊤ = x_i⊤W. The matrix form is

Ŷ = XW

where X ∈ R^{t×n} and Ŷ ∈ R^{t×k}.

Square loss with multiple outputs

min_W ∑_{i=1}^t ∑_{j=1}^k (X_i:W_:j − Y_ij)²
⇐⇒ min_W tr[(XW − Y)⊤(XW − Y)]
⇐⇒ min_W tr(W⊤X⊤XW) − 2tr(Y⊤XW) + tr(Y⊤Y).

Setting the derivative with respect to W to zero:

dObj/dW = 2X⊤XW − 2X⊤Y = 0 =⇒ X⊤XW = X⊤Y.

When X has full column rank (i.e., the features are not correlated, with t > n), X⊤X is invertible. Then

W∗ = (X⊤X)⁻¹X⊤Y.

(In Matlab, this is one line of code: W = (X′X)\(X′Y).)
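The NumPy equivalent, assuming X⊤X is invertible (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                # t x n inputs
Y = rng.normal(size=(100, 3))                # t x k multivariate targets

# Solve X'X W = X'Y directly instead of forming the inverse.
W = np.linalg.solve(X.T @ X, X.T @ Y)        # n x k weight matrix
Y_hat = X @ W
```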

Trace of a matrix
Here we list some properties of the matrix trace (assuming compatible sizes):

• tr(A) = tr(A⊤);
• tr(A⊤B) = ∑_{ij} A_ij B_ij;
• tr(αA) = α tr(A), α ∈ R;
• tr(ABC) = tr(CAB) = tr(BCA).

Derivativesᵃ of the matrix trace:

• d/dW tr(C⊤W) = C;
• d/dW tr(W⊤AW) = (A + A⊤)W.

ᵃOne easy way to check the correctness of a derivative is to see whether it has the same size as the argument.

Now we revisit, for the multivariate case, the techniques we developed before:


    1. feature expansion;

    2. L2 regularization → kernelization;

    3. L1 regularization → sparsity;

    4. output transformation and matching loss;

    5. surrogate loss for classification.

    5.1.1 Feature Expansion

Feature expansion of the data points is the same as before: construct the expanded representation Φ ∈ R^{t×l}, where l is the number of features. Accordingly, W ∈ R^{l×k}, and

    Ŷ = ΦW.

    5.1.2 Kernelization

Analogous to the L2 regularizer, for a matrix W we use R(W) = ∑_{ij} W_ij² = tr(W⊤W) = ‖W‖_F², where ‖·‖_F is the Frobenius norm. The representer theorem still holds: for

min_W (β/2)tr(W⊤W) + L(XW, Y),

the optimal W∗ = X⊤A for some A ∈ R^{t×k}. Therefore, the adjoint form is

min_A (β/2)tr(A⊤XX⊤A) + L(XX⊤A, Y).

To predict the output of a new point x,

ŷ⊤ = x⊤W∗ = x⊤X⊤A∗.

    5.1.3 Feature Selection: Sparsity

For feature selection, we want the rows of W to be zero (remember that rows of W correspond to features, while columns of W correspond to outputs). Can we choose a matrix norm similar to the L1 regularizer in the single-output formulation? Yes, we can. However, first we need to investigate several matrix norms and check their properties.

Matrix Norms

The definition of a matrix norm ‖·‖ requires three properties:

• ‖αX‖ = |α|‖X‖, α ∈ R;
• ‖X + Y‖ ≤ ‖X‖ + ‖Y‖;
• ‖X‖ = 0 ⇐⇒ X = 0.

Convexity of the matrix norm
The definition of a matrix norm guarantees that a norm must be convex. For 0 ≤ α ≤ 1,

‖αX + (1 − α)Y‖ ≤ ‖αX‖ + ‖(1 − α)Y‖ = |α|‖X‖ + |1 − α|‖Y‖ = α‖X‖ + (1 − α)‖Y‖.

Therefore, using a matrix norm as the regularizer maintains convexity as long as the loss L(·, ·) is convex.

Here we introduce three types of norms: the vector-induced norm, the Schatten p-norm and the block norm.


• Vector-induced norm. Given a vector norm ‖x‖_p for x ∈ Rⁿ, define

  ‖W‖_p = max_{x≠0} ‖Wx‖_p / ‖x‖_p.

  Vector-induced norm examples

  ‖W‖₁ = max_j ‖W_:j‖₁ is the maximum column absolute sum.
  ‖W‖_∞ = max_i ‖W_i:‖₁ is the maximum row absolute sum.
  ‖W‖₂ = σ_max(W) is the largest singular value of W.

• Schatten p-norm. The Schatten p-norm is based on the singular values σ(W) of a matrix. It is defined as ‖σ(W)‖_p, the vector p-norm of the vector of singular values.

  Schatten p-norm examples

  Trace norm: ‖W‖_tr = ‖σ(W)‖₁.
  Spectral norm: ‖W‖_sp = ‖σ(W)‖_∞ = σ_max(W) = ‖W‖₂.
  Frobenius norm: ‖W‖_F = ‖σ(W)‖₂ = √tr(W⊤W) = √(∑_{ij} W_ij²).

• Block norm. It is defined as

  ‖W‖_{p,q} = (∑_i (∑_j |W_ij|^q)^{p/q})^{1/p},

  which looks complicated. In fact, for W ∈ R^{n×m}, let v ∈ Rⁿ have entries v_i = ‖W_i:‖_q; then ‖W‖_{p,q} = ‖v‖_p. That is, ‖W‖_{p,q} first computes the q-norm of each row, then the p-norm of the resulting column vector.

  Block norm examples

  ‖W‖_{2,1} = ∑_i ‖W_i:‖₂;
  ‖W‖_{2,2} = ‖W‖_F.

To illustrate the properties of these norms, let's look at some concrete examples. Let

U = [1 1; 0 0],   V = [1 0; 1 0],   W = [1 0; 0 1].

Table 5.1 summarizes the values of the various norms.

Now, coming back to our feature selection problem: we want sparse features, so we want a norm that penalizes matrices with sparse rows less. Because U has the sparsest rows compared with V and W, the ‖·‖_{2,1} norm is our choice. Therefore, for feature selection with multivariate output, we optimize

min_W (β/2)‖W‖_{2,1}² + L(XW, Y).


Table 5.1: Matrix norm concrete examples.

Norm      | U   | V   | W
‖·‖₁      | 1   | 2   | 1
‖·‖_∞     | 2   | 1   | 1
‖·‖₂      | √2  | √2  | 1
‖·‖_tr    | √2  | √2  | 2
‖·‖_F     | √2  | √2  | √2
‖·‖_{2,1} | √2  | 2   | 2
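A minimal sketch verifying Table 5.1 with NumPy (names are mine):

```python
import numpy as np

U = np.array([[1.0, 1.0], [0.0, 0.0]])
V = np.array([[1.0, 0.0], [1.0, 0.0]])
W = np.array([[1.0, 0.0], [0.0, 1.0]])

for name, M in [("U", U), ("V", V), ("W", W)]:
    print(name,
          np.linalg.norm(M, 1),             # induced 1-norm: max column abs sum
          np.linalg.norm(M, np.inf),        # induced inf-norm: max row abs sum
          np.linalg.norm(M, 2),             # spectral norm: largest singular value
          np.linalg.norm(M, 'nuc'),         # trace (nuclear) norm
          np.linalg.norm(M, 'fro'),         # Frobenius norm
          np.linalg.norm(M, axis=1).sum())  # block (2,1) norm: sum of row 2-norms
```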

As a side note, sometimes we want to reduce not only the number of features but also the number of outputs; say, we want to eliminate some of the outputs in y_i for conciseness. Formally, we want W with sparse rows and columns, so a norm that penalizes matrices with sparse rows and columns less is favoured. The trace norm ‖·‖_tr is our choice according to Table 5.1, because U and V are penalized less compared to W. Then we optimize

min_W (β/2)‖W‖_tr² + L(XW, Y).

    5.1.4 Output Transformation and Matching Loss

For output transformation, we again introduce a transform function f(·) whose range is the desired output domain Y. If ẑ⊤ = x⊤W is the pre-prediction, then ŷ = f(ẑ) is the post-prediction.

• For y ⪰ 0_k: f(z) = e^z;

• For y ⪰ 0_k with y⊤1_k = 1: f(z) = e^z/(1_k⊤e^z), which is sometimes called the “softmax” transform in the machine learning literature;

• For y = 1_c = [0, · · · , 0, 1, 0, · · · , 0]⊤, a class indicator with 1 in the cth location and 0s everywhere else: f(z) = indmax(z), which puts 1 in the maximum entry and 0s everywhere else.

For the matching loss, we first pick a strictly convex function F : R^k ↦ R such that f(z) = ∇F(z). Then define the loss function as

L(ŷ, y) = L(f(ẑ), f(z)) = F(ẑ) − F(z) − f(z)⊤(ẑ − z)

Because of strict convexity, we have

F(ẑ) ≥ F(z) + f(z)⊤(ẑ − z).

Hence, L(ŷ, y) ≥ 0 and

L(ŷ, y) = 0 ⇐⇒ f(ẑ) = f(z).

For the previous examples:

• y ⪰ 0_k: f(z) = e^z, F(z) = 1_k⊤e^z, z = ln(y) element-wise. So,

  L(ŷ, y) = F(ẑ) − F(z) − f(z)⊤(ẑ − z)
          = 1_k⊤e^ẑ − 1_k⊤e^z − (e^z)⊤(ẑ − z)
          = 1_k⊤ŷ − 1_k⊤y − y⊤(ln ŷ − ln y)
          = y⊤(ln y − ln ŷ) + 1_k⊤(ŷ − y),

  which is known as the un-normalized entropy.


• y ⪰ 0_k with y⊤1_k = 1: f(z) = e^z/(1_k⊤e^z), F(z) = ln(1_k⊤e^z). So,

  L(ŷ, y) = F(ẑ) − F(z) − f(z)⊤(ẑ − z)
          = ln(1_k⊤e^ẑ) − ln(1_k⊤e^z) − (e^z/(1_k⊤e^z))⊤(ẑ − z)
          = (e^z/(1_k⊤e^z))⊤ [1_k ln(1_k⊤e^ẑ) − 1_k ln(1_k⊤e^z) − ln e^ẑ + ln e^z]
          = (e^z/(1_k⊤e^z))⊤ [ln(e^z/(1_k⊤e^z)) − ln(e^ẑ/(1_k⊤e^ẑ))]
          = y⊤(ln y − ln ŷ),

  which is known as the KL-divergence.

    5.1.5 Surrogate Loss for Classification

The last piece of the multivariate output story is choosing a reasonable surrogate loss for the classification transform f(z) = indmax(z). The ideal loss is the indicator loss

L(ŷ, y) = 1(y ≠ ŷ),

but it leads to intractability. Here we list some surrogate losses for multivariate classification.

• Multinomial deviance (logistic margin loss).

  L(ẑ, y) = ln(1_k⊤e^ẑ) − y⊤ẑ.  (5.1)

• Multiclass SVM loss (analogous to the hinge loss in the univariate case).

  L(ẑ, y) = max(1_k − y + ẑ − y⊤ẑ1_k),  (5.2)

  where 1_k − y is the misclassification-error term, ẑ − y⊤ẑ1_k is the margin term, and the max(·) operation takes the maximum over the k entries of the given vector.

Therefore, the multiclass SVM is

min_W (β/2)‖W‖_F² + ∑_i max(1_k⊤ − Y_i: + X_i:W − X_i:W Y_i:⊤1_k⊤)

⇐⇒ min_{W,ξ} (β/2)‖W‖_F² + 1_t⊤ξ
    s.t. ξ1_k⊤ ⪰ 1_t1_k⊤ − Y + XW − δ(XWY⊤)1_k⊤

where δ(A) is the column vector of the diagonal of A (the opposite of ∆(a)), and the ⪰ in the constraint means entry-wise greater than or equal to rather than the PSD notation. This is a quadratic program and can be kernelized because of ‖W‖_F². At test time, given x₀,

ŷ₀⊤ = indmax(x₀⊤W)

    5.2 Structured Output

In many real-world scenarios, the number of classes is really large, say exponential. One illustrative example is optical character recognition (OCR).

Optical character recognition: exponential number of classes
Consider the simple hand-written phrase “the red apple”, given as images. Generally speaking, there are 26 possibilities for each letter, so a simple phrase like this has 26^(3+3+5) possible “classes”, one of which is “the rcd qpple”. This set is too large to handle without further thought. One solution is to decouple the phrase into individual letters. Indeed, this approach simplifies the problem, but it ignores the strong correlation, or context. For example, given “r d”, “e” is much more likely than “c”; “a” is much more likely than “q” given “ pple”.

When the number of classes, |C| = k, is really large, we have to exploit the structure of the output. We first assume that the output label decouples as c_i = (c_{i1}, c_{i2}, · · · , c_{il}) ∈ C′^l instead of c_i ∈ C, where C′ is the candidate set for each individual c_{ij} and l is the length of the structure. In the OCR example, or in speech recognition, this assumption is reasonable, since an OCR image is composed of sequential letter images and speech consists of decoupled phonemes. Then we give the model a decoupled representation, i.e.,

ϕ̃(x_i, c_i) = ∑_{j=1}^{l−1} ϕ(x_i, c_{ij}, c_{i(j+1)}).

In the OCR example, ϕ(x_i, c_{ij}, c_{i(j+1)}) can be an image (vectorized pixel values) of two consecutive letters. This implies

ẑ_i = w̃⊤ϕ̃(x_i, c_i) = ∑_{j=1}^{l−1} w⊤ϕ(x_i, c_{ij}, c_{i(j+1)}).

This decoupled approach maintains (at least partially) the context information and, more importantly, creates tractability, as we will see soon. Consider the multivariate surrogate losses introduced in the previous section: multinomial deviance (Eq. (5.1)) and the multiclass SVM loss (Eq. (5.2)). The main difficulty in computing the multinomial deviance is summing an exponential number of values; for the multiclass SVM, it is finding the maximum over exponentially many values.

• Case of multinomial deviance. Computing 1_k⊤e^ẑᵢ = ∑ e^ẑᵢ is hard when k, the number of classes, is really large. However, with the decoupled formulation, we have

  ∑ e^ẑᵢ = ∑_{c_{i1},··· ,c_{il}} exp(∑_{j=1}^{l−1} w⊤ϕ(x_i, c_{ij}, c_{i(j+1)}))
         = ∑_{c_{i1},··· ,c_{il}} ∏_{j=1}^{l−1} exp(w⊤ϕ(x_i, c_{ij}, c_{i(j+1)})),

  which can be calculated using the “sum-product” algorithm.

“Sum-product” algorithm
Consider the following example:

p = ∑_{c₁,c₂,c₃,c₄,c₅} f₄(c₄, c₅) f₃(c₃, c₄) f₂(c₂, c₃) f₁(c₁, c₂)

where {c₁, c₂, c₃, c₄, c₅} is for purposes of illustration; in general there may be many more c_i. To compute this efficiently, we push the summations inward:

p = ∑_{c₅} ∑_{c₄} f₄(c₄, c₅) ∑_{c₃} f₃(c₃, c₄) ∑_{c₂} f₂(c₂, c₃) ∑_{c₁} f₁(c₁, c₂),

computing the messages m₂(c₂) = ∑_{c₁} f₁(c₁, c₂), then m₃(c₃), m₄(c₄) and m₅(c₅), from the inside out. By doing so, the computation is reduced from O(|C′|^l) to O(l|C′|²).

• Case of multiclass SVM. To compute max ẑᵢ, we have

  max ẑᵢ = max_{c_{i1},··· ,c_{il}} ∑_{j=1}^{l−1} w⊤ϕ(x_i, c_{ij}, c_{i(j+1)}),

  which can be solved efficiently using the “max-sum” algorithm.


    “Max-sum” algorithmConsider the following example:

    m = max{c1,c2,c3,c4,c5}

    f4(c4, c5) + f3(c3, c4) + f2(c2, c3) + f1(c1, c2)

    where {c1, c2, c3, c4, c5} is for purpose of illustration. In fact, there may bemany ci. To efficiently compute this, we push the max operation inward,

$$m = \max_{c_5} \underbrace{\max_{c_4} f_4(c_4, c_5) + \underbrace{\max_{c_3} f_3(c_3, c_4) + \underbrace{\max_{c_2} f_2(c_2, c_3) + \underbrace{\max_{c_1} f_1(c_1, c_2)}_{m_2(c_2)}}_{m_3(c_3)}}_{m_4(c_4)}}_{m_5(c_5)}.$$

By doing so, the computation is reduced from $O(|\mathcal{C}'|^l)$ to $O(l|\mathcal{C}'|^2)$.
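Analogously, here is a minimal sketch of the chain max-sum recursion (this is the classic Viterbi-style dynamic program). The pairwise scores $f_j(a,b)$ are again assumed to be given as arrays; backpointers recover one maximizing chain. Names are ours, for illustration.

```python
import numpy as np

def chain_max_sum(scores):
    """Maximize the sum of pairwise scores over all label chains.

    scores[j][a, b] holds f_{j+1}(a, b).  Returns the best total score
    and one maximizing chain, in O(l |C'|^2) time.
    """
    m = np.zeros(scores[0].shape[0])    # best prefix score ending in each state
    back = []
    for S in scores:
        cand = m[:, None] + S           # extend best prefix ending at a by edge (a, b)
        back.append(cand.argmax(axis=0))
        m = cand.max(axis=0)
    path = [int(m.argmax())]            # best final state, then trace back
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return float(m.max()), path[::-1]
```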

These structured learning techniques have been generalized from sequential structures to trees and graphs, i.e., to any structure on which the sum-product or max-sum recursions apply.


Chapter 6

Unsupervised Learning

In supervised learning, the label information ($Y \in \mathcal{Y}$) is given at training time. In unsupervised learning, the labels are removed, i.e., we know $X \in \mathcal{X}$ but do not know $Y \in \mathcal{Y}$. What if we still want to find a mapping $h: \mathcal{X} \mapsto \mathcal{Y}$ for a specific $\mathcal{Y}$? For example, we may want to learn a projection matrix $W \in \mathbb{R}^{n \times k}$ that projects $X \in \mathbb{R}^{t \times n}$ to $Y \in \mathbb{R}^{t \times k}$. Here we list some well-studied unsupervised learning scenarios:

• Dimensionality reduction: $Y \in \mathbb{R}^{t \times k}$, where $0 < k < n$.

• Hard clustering (unsupervised classification): $Y \in \{0,1\}^{t \times k}$, such that $Y\mathbf{1}_k = \mathbf{1}_t$.

• Soft clustering: $Y \in \mathbb{R}^{t \times k}$, such that $Y\mathbf{1}_k = \mathbf{1}_t$.

    6.1 Dimensionality Reduction

Before we proceed, let us first investigate a reverse formulation of the regular supervised least squares problem. In the regular least squares formulation, given $X$ and $Y$,

$$\min_{W \in \mathbb{R}^{n \times k}} \frac{1}{2}\|XW - Y\|_F^2.$$

It is not difficult to solve this, and we can show that $W^* = (X^\top X)^{-1}X^\top Y$ if $X^\top X$ has full rank. Define $X^+ = (X^\top X)^{-1}X^\top$, which is called the left Moore-Penrose pseudoinverse; then $W^* = X^+ Y$. Let the singular value decomposition (SVD) of $X$ be $X = Q_X \Sigma_X V_X^\top$, where $\Sigma_X$ is a diagonal matrix of singular values and $Q_X$ and $V_X$ have orthonormal columns, i.e., $Q_X^\top Q_X = I_{\mathrm{rank}(X)}$ and $V_X^\top V_X = I_{\mathrm{rank}(X)}$; then we have $X^+ = V_X \Sigma_X^{-1} Q_X^\top$. Now consider the reverse formulation of the least squares problem: given $X$ and $Y$,

$$\min_{U \in \mathbb{R}^{k \times n}} \frac{1}{2}\|X - YU\|_F^2.$$

The optimal reverse projection is given by $U^* = Y^+ X$. Notice that at the optimal $W$ and $U$ for the above problems, we have

$$X^\top X W = X^\top Y = U^\top Y^\top Y \implies W = (X^\top X)^{-1} U^\top (Y^\top Y), \quad U = (Y^\top Y)^{-1} W^\top (X^\top X).$$

Therefore, given $X$ and $Y$, we can always compute either of $W$ and $U$ from the other.

Now back to unsupervised learning, where $Y$ is unknown. The main idea is to guess a $Y$ while optimizing $W$. Consider

$$\min_{Y, W} \frac{1}{2}\|XW - Y\|_F^2.$$

This formulation is meaningless without constraints, since we can always find a trivial solution, for example $W = 0$, $Y = 0$, that attains zero loss. However, when it comes to its reverse formulation, we have

$$\min_{Y, U} \frac{1}{2}\|X - YU\|_F^2. \tag{6.1}$$


    Fix Y , we know that U∗ = Y +X. Plug in the reverse formulation, we have

    minY

    1

    2‖X − Y Y +X‖2F

    =⇒ minY

    1

    2tr[(It − Y Y +)XX>(It − Y Y +)>

    ]=⇒ min

    Y

    1

    2tr[(It − Y Y +)K

    ]=⇒ max

    Ytr(Y Y +K),

    (6.2)

where $K = XX^\top$ is the kernel matrix. The second step above is based on the fact that $YY^+$ is symmetric, and $\mathrm{tr}(YY^+XX^\top YY^+) = \mathrm{tr}(YY^+YY^+XX^\top) = \mathrm{tr}(YY^+XX^\top)$ since $YY^+$ is idempotent. Let the SVD of $Y$ be $Y = Q_Y \Sigma_Y V_Y^\top$; then $Y^+ = V_Y \Sigma_Y^{-1} Q_Y^\top$ and $YY^+ = Q_Y Q_Y^\top$. With this notation, the reverse formulation becomes

$$\begin{aligned}
& \max_{Q_Y} \mathrm{tr}(Q_Y Q_Y^\top K), \quad \text{s.t. } Q_Y^\top Q_Y = I_{\mathrm{rank}(Y)} \\
\implies\; & \max_{Q_Y} \mathrm{tr}(Q_Y^\top K Q_Y), \quad \text{s.t. } Q_Y^\top Q_Y = I_{\mathrm{rank}(Y)} \\
\implies\; & \max_{q_i} \sum_{i=1}^{\mathrm{rank}(Y)} q_i^\top K q_i,
\end{aligned}$$

where the $q_i$ are mutually orthogonal with unit length. Assuming $\mathrm{rank}(Y) = k$, it is intuitive that the objective is maximized by setting the $q_i$ to the top $k$ eigenvectors of $K$; by top $k$, we mean the $k$ eigenvectors corresponding to the $k$ largest eigenvalues. We denote the matrix of top $k$ eigenvectors of $K$ by $Q_{X;1:k}$. If $K = XX^\top$ with $X = Q_X \Sigma_X V_X^\top$, this is equivalent to setting the $q_i$ to the top $k$ left singular vectors of $X$.

To conclude, the optimal solutions are given by $Y^* = Q_{X;1:k}$ and $U^* = Q_{X;1:k}^\top X$. The original data matrix $X$ is approximated by $X \approx Q_{X;1:k}Q_{X;1:k}^\top X$. This technique is known as principal component analysis (PCA).

One may notice that to deal with new data points, we still need the regular projection matrix $W$. Thanks to the relation between $W$ and $U$, we can reconstruct $W$ by setting

$$W = (X^\top X)^{-1} U^\top (Y^\top Y) = X^+ Q_{X;1:k} = V_{X;1:k}\Sigma_{X;1:k}^{-1}.$$

At this point, we have achieved dimensionality reduction.
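As a concrete illustration, here is a minimal numpy sketch of the PCA solution just derived, computing $Y^* = Q_{X;1:k}$, $U^* = Q_{X;1:k}^\top X$ and the forward projection $W = V_{X;1:k}\Sigma_{X;1:k}^{-1}$ from a thin SVD. We assume the data is already centered (see the box below), and $k \le \mathrm{rank}(X)$; the function name is ours.

```python
import numpy as np

def pca(X, k):
    """PCA via the reverse least squares formulation.

    Returns Y = Q_{X;1:k} (new representation), U (reverse projection,
    so Y @ U approximates X) and W (forward projection, so X @ W == Y).
    Assumes X is already centered, Eq. (6.3).
    """
    Q, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = Q diag(s) V^T
    Y = Q[:, :k]               # top-k left singular vectors of X
    U = Y.T @ X                # U* = Q_{X;1:k}^T X
    W = Vt[:k].T / s[:k]       # W = V_{X;1:k} Sigma_{X;1:k}^{-1}
    return Y, U, W
```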

PCA and data centering
PCA is also derived from a different perspective in statistics, as preserving maximum variance in the new feature space. In order to capture covariance information, the dataset must be centered before the eigenvalue decomposition:

$$x_i \leftarrow x_i - \frac{1}{t}\sum_{j=1}^{t} x_j. \tag{6.3}$$

Data centering is beneficial in general, since we mainly want to preserve the relative information between data points rather than the shift of the whole set in the $\mathcal{X}$ space.

    6.2 Hard Clustering

Next we discuss the clustering problem, where $Y \in \{0,1\}^{t \times k}$ and $Y\mathbf{1}_k = \mathbf{1}_t$. The problem is to find $W$ and $Y$ simultaneously, with the cluster assignment rule $\hat{y}^\top = \mathrm{indmax}(x^\top W)$. The regular supervised problem with the square surrogate loss is

$$\min_{W,\, Y \in \mathcal{Y}} \frac{1}{2}\|XW - Y\|_F^2,$$

where $\mathcal{Y} = \{Y \in \{0,1\}^{t \times k} : Y\mathbf{1}_k = \mathbf{1}_t\}$. Its reverse formulation is

$$\min_{U,\, Y \in \mathcal{Y}} \frac{1}{2}\|X - YU\|_F^2.$$


• Given a fixed $Y$, we already know that the optimal $U$ is given by $U^* = Y^+X = (Y^\top Y)^{-1}Y^\top X$. Note that $Y^\top Y = \Delta([t_{c_1}, t_{c_2}, \cdots, t_{c_k}]^\top)$, where $t_{c_j}$ is the number of points in the $j$th cluster $c_j$, i.e., $Y^\top Y$ is a diagonal matrix with the cluster sizes on the diagonal. Also note that $Y^\top X = [\sum_{x \in c_1} x, \sum_{x \in c_2} x, \cdots, \sum_{x \in c_k} x]^\top$. Combining, $(Y^\top Y)^{-1}Y^\top X = Y^+X = [m_1, m_2, \cdots, m_k]^\top$, where $m_j$ is the mean of the points in cluster $c_j$; that is, $U^* = Y^+X$ is the matrix whose rows are the cluster means.

• If $U$ is fixed, how do we solve for $Y$? Notice that

$$\|X - YU\|_F^2 = \sum_i \|X_{i:} - Y_{i:}U\|_2^2.$$

Since $Y_{i:} = [0, 0, \cdots, 1, \cdots, 0]$ has a single one indicating that the $i$th point belongs to the $j$th cluster, the best strategy is clearly to assign the $i$th data point to the cluster $j$ that minimizes the distance between $X_{i:}$ and the cluster mean $Y_{i:}U = U_{j:}$.

From another point of view, define the squared (Euclidean) distance matrix $D \in \mathbb{R}^{t \times k}$ between the data points $X$ and the cluster centers $U$ (compare the one-dimensional counterpart $(x-u)^2 = x^2 + u^2 - 2xu$):

$$D(X, U) = \delta(XX^\top)\mathbf{1}_k^\top + \mathbf{1}_t\delta^\top(UU^\top) - 2XU^\top.$$

Then $Y^*_{i:}$ equals $[0, 0, \cdots, 1, \cdots, 0]$, where the single one is in the $j$th entry and $j = \mathrm{argmin}_j D_{ij}$.

Therefore, to solve the problem (the reverse clustering formulation with the square surrogate loss), we have a two-step optimization procedure:

• Fix $Y$, solve for $U$: the matrix of cluster means.

• Fix $U$, solve for $Y$: the indicator matrix assigning each point to the closest cluster (based on the distance to the cluster means).

This procedure is the well-known k-means clustering algorithm. (And we immediately know its drawback, because of the square loss: it is not robust and chases outliers.)
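A minimal numpy sketch of this alternating procedure; initializing the centers from randomly chosen data points is one common choice, and the empty-cluster guard is our own addition.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Alternately minimize (1/2)||X - Y U||_F^2 over Y (indicators) and U (means)."""
    rng = np.random.default_rng(seed)
    U = X[rng.choice(len(X), size=k, replace=False)]   # initial centers
    for _ in range(iters):
        # Fix U: squared distances D = d(XX^T)1^T + 1 d(UU^T)^T - 2 X U^T.
        D = (X**2).sum(1)[:, None] + (U**2).sum(1)[None, :] - 2 * X @ U.T
        y = D.argmin(axis=1)                           # assignment, i.e., the matrix Y
        # Fix Y: each center becomes its cluster mean (U* = Y^+ X).
        U_new = np.array([X[y == j].mean(axis=0) if np.any(y == j) else U[j]
                          for j in range(k)])
        if np.allclose(U_new, U):
            break                                      # converged to a local optimum
        U = U_new
    return y, U
```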

As a side note, the solution to problem (6.1) is globally optimal, because after plugging in $U^* = Y^+X$ we can still efficiently find a globally optimal $Y$ for problem (6.2). In the case of hard clustering, however, we cannot efficiently find the optimal $Y$ for

$$\min_{Y \in \mathcal{Y}} \frac{1}{2}\|X - Y(Y^\top Y)^{-1}Y^\top X\|_F^2$$

after plugging in $U^* = Y^+X = (Y^\top Y)^{-1}Y^\top X$. So k-means is a local optimization procedure, minimizing over $U$ and $Y$ alternately. In general, we cannot solve the clustering problem efficiently, not even for the square loss, due to the non-continuous output space $\mathcal{Y}$.

    6.3 Kernelization

    6.3.1 Kernel Principal Component Analysis

Recall the supervised formulation that leads to kernelization:

$$\min_W \frac{\beta}{2}\|W\|_F^2 + \frac{1}{2}\|XW - Y\|_F^2 \iff \min_A \frac{\beta}{2}\mathrm{tr}(A^\top K A) + \frac{1}{2}\|KA - Y\|_F^2, \tag{6.4}$$

which is based on $W^* = X^\top A^*$ and the representer theorem. Setting the derivative to zero,

$$\begin{aligned}
\frac{d\,\mathrm{Obj}}{dA} &= \beta KA + K(KA - Y) = 0 \\
\iff\; & K(K + \beta I_t)A = KY \\
\impliedby\; & (K + \beta I_t)A = Y \\
\iff\; & A = (K + \beta I_t)^{-1}Y.
\end{aligned} \tag{6.5}$$


The reverse formulation is

$$\min_U \frac{1}{2}\|X - YU\|_F^2.$$

It is easy to see that $U^* = Y^+X = BX$ for some particular $B \in \mathbb{R}^{k \times t}$. Therefore,

$$\min_U \frac{1}{2}\|X - YU\|_F^2 \iff \min_B \frac{1}{2}\|X - YBX\|_F^2 \iff \min_B \frac{1}{2}\mathrm{tr}\left((I_t - YB)K(I_t - YB)^\top\right).$$

We take the derivative:

$$\begin{aligned}
\frac{d\,\mathrm{Obj}}{dB} &= Y^\top(YB - I_t)K = 0 \\
\iff\; & Y^\top Y B K = Y^\top K \\
\impliedby\; & Y^\top Y B = Y^\top \\
\iff\; & B = (Y^\top Y)^{-1}Y^\top = Y^+.
\end{aligned} \tag{6.6}$$

Here we may assume that $Y^\top Y$ is invertible ($\mathrm{rank}(Y) = k$, $k < t$). From Eq.(6.5) and Eq.(6.6), we know that

$$(K + \beta I_t)A = Y = B^\top Y^\top Y \implies A = (K + \beta I_t)^{-1}B^\top Y^\top Y. \tag{6.7}$$

Thus we also have a relation between $A$ and $B$, given $X$ and $Y$.

Now come back to the unsupervised case, where $K$ is given but $Y$ is not. As before, minimizing over $Y$ and $A$ simultaneously in Eq.(6.4) is meaningless, because we can find trivial solutions that achieve zero loss. However, the reverse form is readily usable:

$$\min_{B, Y} \frac{1}{2}\mathrm{tr}\left((I_t - YB)K(I_t - YB)^\top\right).$$

With $B^* = Y^+$, we have

$$\min_Y \frac{1}{2}\mathrm{tr}\left((I_t - YY^+)K(I_t - YY^+)^\top\right).$$

The rest of the derivation is identical to problem (6.2), and the optimal $Y$ is given by $Y = Q_{X;1:k}$, the top $k$ eigenvectors of $K$. Here $K$ is the kernel matrix and does not necessarily equal $XX^\top$. This method is referred to as kernel principal component analysis (KPCA). By comparison, PCA always takes the top $k$ left singular vectors of $X$; if $K = XX^\top$, then KPCA is identical to PCA.
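In code, KPCA is just an eigendecomposition of the kernel matrix. A minimal sketch, assuming $K$ has been centered if desired (see the kernel centering note in Section 6.3.3):

```python
import numpy as np

def kpca(K, k):
    """Kernel PCA: Y* is the top-k eigenvectors of the kernel matrix K."""
    w, V = np.linalg.eigh(K)       # eigenvalues in ascending order
    return V[:, ::-1][:, :k]       # reorder to take the top-k eigenvectors
```

With the linear kernel `K = X @ X.T` on centered data, this spans the same subspace as the `pca` sketch above.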

    6.3.2 Kernelized Clustering

In the clustering setting, the output space is constrained to $\mathcal{Y} = \{Y \in \{0,1\}^{t \times k} : Y\mathbf{1}_k = \mathbf{1}_t\}$. With the square surrogate loss, we have

$$\min_{B,\, Y \in \mathcal{Y}} \frac{1}{2}\mathrm{tr}\left((I_t - YB)K(I_t - YB)^\top\right).$$

Unsurprisingly, this problem is NP-hard due to the non-continuous output space, but we can always alternate between $Y$ and $B$ to obtain a local solution. We already know that the optimal $B$ given $Y$ is $B = Y^+$. With $B$ fixed, how do we compute $Y$?

Let us, for a moment, convert the kernel back to the original $X$ representation:

$$(I_t - YB)K(I_t - YB)^\top = (X - YBX)(X - YBX)^\top.$$

Let $U = BX \in \mathbb{R}^{k \times n}$, and consider the matrix $D \in \mathbb{R}^{t \times k}$ such that

$$D(X, U) = \delta(XX^\top)\mathbf{1}_k^\top + \mathbf{1}_t\delta^\top(UU^\top) - 2XU^\top = \delta(K)\mathbf{1}_k^\top + \mathbf{1}_t\delta^\top(BKB^\top) - 2KB^\top.$$


This is analogous to the Euclidean distance matrix between each point $X_{i:}$ and the cluster centers $U_{j:}$. Now, given $B$, we assign $Y_{i:}$ to be $[0, 0, \cdots, 1, \cdots, 0]$, where the single one is in the $j$th entry and $j = \mathrm{argmin}_j D_{ij}$. This is called kernel k-means clustering.

One benefit of having both the regular and reverse formulations is that we can recover the regular solution $A$ from the relation between $A$ and $B$. With Eq.(6.7), we can extend our model to newly observed data, even for the clustering problem. That is, given $x_0$, its cluster index is computed as

$$\hat{y}_0^\top = \mathrm{indmax}(x_0^\top W) = \mathrm{indmax}(x_0^\top X^\top A) = \mathrm{indmax}(k_0^\top A),$$

where $k_0^\top = \phi(x_0)^\top\Phi^\top$ is the vector of inner products between the (feature-extended) new point $x_0$ and the (feature-extended) training set $\Phi$.
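Assuming $A$ has been recovered via Eq.(6.7), assigning a new point is then a one-liner; here $k_0$ is its vector of kernel values against the training set, and the function name is ours.

```python
import numpy as np

def assign_new(k0, A):
    """Cluster index of a new point: y_hat = indmax(k0^T A)."""
    return int(np.argmax(k0 @ A))
```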

    6.3.3 Normalized Kernel Methods

Here we introduce an optimization scheme that utilizes normalized kernels. Assume the kernel matrix $K$ is doubly non-negative, i.e., $K_{ij} \geq 0$ for all $i, j$ and $K \succeq 0$. Let

$$\Lambda = \Delta(K\mathbf{1}_t), \quad \tilde{K} = \Lambda^{-\frac{1}{2}}K\Lambda^{-\frac{1}{2}}, \quad \tilde{A} = \Lambda^{-\frac{1}{2}}A, \quad \tilde{Y} = \Lambda^{\frac{1}{2}}Y, \quad \tilde{B} = B\Lambda^{\frac{1}{2}}.$$

Supervised learning. Compared to (6.4), consider a different (supervised) formulation:

$$\min_A \frac{\beta}{2}\mathrm{tr}\big(A^\top \underbrace{\Lambda^{-1}K\Lambda^{-1}}_{\text{rescaled kernel}} A\big) + \frac{1}{2}\mathrm{tr}\big(\underbrace{\Lambda}_{\text{instance weight}}(\underbrace{\Lambda^{-1}K\Lambda^{-1}A}_{\text{prediction of }Y\text{ from rescaled kernel}} - Y)(\Lambda^{-1}K\Lambda^{-1}A - Y)^\top\big). \tag{6.8}$$

The motivation of this optimization is to minimize the square training loss for the rescaled kernel $\Lambda^{-1}K\Lambda^{-1}$, where each entry is normalized by the importance weights $\Lambda = \Delta(K\mathbf{1}_t)$. Moreover, the square loss itself is also rescaled according to the instance importance $\Lambda$. It is equivalent to

$$\min_{\tilde{A}} \frac{\beta}{2}\mathrm{tr}(\tilde{A}^\top\tilde{K}\tilde{A}) + \frac{1}{2}\mathrm{tr}\left((\tilde{K}\tilde{A} - \tilde{Y})(\tilde{K}\tilde{A} - \tilde{Y})^\top\right). \tag{6.9}$$

After taking the derivative w.r.t. $\tilde{A}$ and setting it to zero, we have

$$A^* = \Lambda^{\frac{1}{2}}(\tilde{K} + \beta I_t)^{-1}\Lambda^{\frac{1}{2}}Y. \tag{6.10}$$

Recall from (6.8) that we predict the output $Y$ by

$$Y \approx \Lambda^{-1}K\Lambda^{-1}A^* = \Lambda^{-1}XX^\top\Lambda^{-1}A^* = \Lambda^{-1}XW^*.$$

This can be seen as rescaling the input $X$ and then predicting via the projection $W^*$. For a new data point $x$, define $\lambda = x^\top X^\top\mathbf{1}_t = k^\top\mathbf{1}_t$ as its normalization constant. To predict its label,

$$\hat{y} = \lambda^{-1}x^\top W^* = \frac{k^\top\Lambda^{-1}A^*}{k^\top\mathbf{1}_t}.$$
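A minimal numpy sketch of this normalized supervised scheme, computing $A^*$ from Eq.(6.10) and predicting a new point as above. We assume $Y$ is given as a $t \times k$ target matrix; the function names are ours, for illustration.

```python
import numpy as np

def normalized_fit(K, Y, beta):
    """A* = Lam^{1/2} (K_tilde + beta I)^{-1} Lam^{1/2} Y, Eq. (6.10)."""
    lam = K.sum(axis=1)                        # Lambda = Delta(K 1_t), as a vector
    s = np.sqrt(lam)
    K_tilde = K / s[:, None] / s[None, :]      # normalized kernel
    A = s[:, None] * np.linalg.solve(K_tilde + beta * np.eye(len(K)),
                                     s[:, None] * Y)
    return A, lam

def normalized_predict(k0, A, lam):
    """y_hat = k0^T Lam^{-1} A / (k0^T 1_t) for a new point's kernel vector k0."""
    return (k0 / lam) @ A / k0.sum()
```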

Unsupervised regression (dimensionality reduction): normalized PCA. The reverse formulation is

$$\min_{\tilde{B},\tilde{Y}} \frac{1}{2}\mathrm{tr}\left((I_t - \tilde{Y}\tilde{B})\tilde{K}(I_t - \tilde{Y}\tilde{B})^\top\right) \implies \min_{\tilde{Y}} \frac{1}{2}\mathrm{tr}\left((I_t - \tilde{Y}\tilde{Y}^+)\tilde{K}(I_t - \tilde{Y}\tilde{Y}^+)^\top\right), \tag{6.11}$$

with

$$\tilde{B}^* = \tilde{Y}^+, \quad \text{i.e.,} \quad B^* = (Y^\top\Lambda Y)^{-1}Y^\top. \tag{6.12}$$


Similar to KPCA, the optimal $\tilde{Y}^*$ consists of the top $k$ eigenvectors of $\tilde{K}$. Combining (6.12) and (6.10), we have

$$\Lambda^{-\frac{1}{2}}(\tilde{K} + \beta I_t)\Lambda^{-\frac{1}{2}}A = Y = B^\top(Y^\top\Lambda Y) \implies A = \Lambda^{\frac{1}{2}}(\tilde{K} + \beta I_t)^{-1}\Lambda^{\frac{1}{2}}B^\top(Y^\top\Lambda Y). \tag{6.13}$$

Again, this relation allows us to make predictions for new data points, as in the previous methods.

Side notes about normalized PCA
• Top eigenvector. In terms of the original (un-normalized) $Y$, we can show that $Y_{:1} = \mathbf{1}_t$; that is, the first component of the new representation $y_i$ is always 1 for every point $i$. To see this, let $\lambda^{1/2} = \Lambda^{1/2}\mathbf{1}_t$; then

$$\tilde{K}\lambda^{1/2} = \Lambda^{-1/2}K\Lambda^{-1/2}\Lambda^{1/2}\mathbf{1}_t = \Lambda^{-1/2}K\mathbf{1}_t = \Lambda^{-1/2}\Lambda\mathbf{1}_t = \Lambda^{1/2}\mathbf{1}_t = \lambda^{1/2},$$

where the third equality uses $\Lambda = \Delta(K\mathbf{1}_t) \Rightarrow \Lambda\mathbf{1}_t = K\mathbf{1}_t$. This means $\lambda^{1/2}$ is always an eigenvector of $\tilde{K}$ with eigenvalue 1. It is known that the maximum eigenvalue of $\tilde{K}$ is 1; therefore $\lambda^{1/2}$ is the top eigenvector of $\tilde{K}$, and

$$Y_{:1} = \Lambda^{-1/2}\tilde{Y}_{:1} = \Lambda^{-1/2}\lambda^{1/2} = \mathbf{1}_t.$$

So we usually ignore the first component of $Y$.

• Kernel centering. We already know how to center a dataset when we have an explicit feature representation (Eq.(6.3)). But when we are only given the kernel matrix $K$, how do we center the dataset in its implicit feature space? Notice that the centering operation Eq.(6.3) is equivalent, in matrix form, to

$$X \leftarrow \left(I_t - \frac{1}{t}\mathbf{1}_t\mathbf{1}_t^\top\right)X.$$

$H = I_t - \frac{1}{t}\mathbf{1}_t\mathbf{1}_t^\top$ is known as the centering matrix. In the linear kernel case, $K = XX^\top$, so we can center it by $K \leftarrow HKH$. In fact, even in non-linear cases, $K = \Phi\Phi^\top$ for some (possibly infinite dimensional) representation $\Phi$, and $K \leftarrow HKH$ remains valid for such implicit centering; a small sketch follows this box.
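A minimal sketch of implicit centering via the centering matrix $H$:

```python
import numpy as np

def center_kernel(K):
    """K <- H K H with H = I_t - (1/t) 1_t 1_t^T (implicit feature centering)."""
    t = K.shape[0]
    H = np.eye(t) - np.ones((t, t)) / t
    return H @ K @ H
```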

Unsupervised classification (clustering): normalized k-means, normalized graph cut. We can expand (6.11):

$$\min_{\tilde{B},\, \tilde{Y} \in \tilde{\mathcal{Y}}} \frac{1}{2}\mathrm{tr}\left((I_t - \tilde{Y}\tilde{B})\tilde{K}(I_t - \tilde{Y}\tilde{B})^\top\right) \iff \min_{B,\, Y \in \mathcal{Y}} \frac{1}{2}\mathrm{tr}\left(\Lambda(\Lambda^{-1} - YB)K(\Lambda^{-1} - YB)^\top\right),$$

where $\mathcal{Y} = \{Y \in \{0,1\}^{t \times k} : Y\mathbf{1}_k = \mathbf{1}_t\}$ in the clustering setting. This problem is NP-hard. To recover a local solution, we can alternate between $B$ and $Y$ as we did in Section 6.2:

• Given $Y$, set $B^* = (Y^\top\Lambda Y)^{-1}Y^\top$.

• Given $B$, compute the matrix of rescaled squared distances

$$D(\Lambda^{-1}X, BX) = \delta(\Lambda^{-1}XX^\top\Lambda^{-1})\mathbf{1}_k^\top + \mathbf{1}_t\delta(BXX^\top B^\top)^\top - 2\Lambda^{-1}XX^\top B^\top = \delta(\Lambda^{-1}K\Lambda^{-1})\mathbf{1}_k^\top + \mathbf{1}_t\delta(BKB^\top)^\top - 2\Lambda^{-1}KB^\top.$$

Then $Y^*_{i:}$ equals $[0, 0, \cdots, 1, \cdots, 0]$, where the single one is in the $j$th entry and $j = \mathrm{argmin}_j D_{ij}$ (see the sketch after this list).

This method is known as normalized graph cut clustering. The standard technique (6.13) can be applied to assign new data points to clusters.
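A minimal sketch of this normalized alternation; as with kernel k-means, only $K$ is required (the degree vector plays the role of $\Lambda$), and the empty-cluster guard and function name are our own.

```python
import numpy as np

def normalized_cut_cluster(K, k, iters=100, seed=0):
    """Alternate B = (Y^T Lam Y)^{-1} Y^T with reassignment from the
    rescaled distance matrix D(Lam^{-1} X, B X)."""
    rng = np.random.default_rng(seed)
    lam = K.sum(axis=1)                             # Lambda = Delta(K 1_t)
    y = rng.integers(k, size=K.shape[0])
    for _ in range(iters):
        Y = np.eye(k)[y]
        w = Y.T @ lam                               # diagonal of Y^T Lam Y
        B = (Y / np.where(w > 0, w, 1.0)).T         # B = (Y^T Lam Y)^{-1} Y^T
        D = ((np.diag(K) / lam**2)[:, None]
             + np.diag(B @ K @ B.T)[None, :]
             - 2 * (K / lam[:, None]) @ B.T)
        y_new = D.argmin(axis=1)
        if np.array_equal(y_new, y):
            break
        y = y_new
    return y
```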


Normalized cut (N-cut) from a different perspective
Normalized cut was also introduced in the context of graph cuts. Suppose we have an “affinity” matrix $K$, where $K_{ij}$ represents the similarity of object $i$ and object $j$ (in image segmentation, for example, this can be the similarity between two pixels). Suppose also that $K \succeq 0$ (in many cases this can be satisfied by choosing a proper similarity measure). Viewing objects as nodes and similarities as edge weights, we may want to cut this graph into $k$ disjoint components. One intuition is to partition the objects to minimize the normalized cost:

$$\min_{Y \in \mathcal{Y}} \sum_j \frac{\mathrm{cut}(\mathrm{class}_j, \overline{\mathrm{class}_j})}{\mathrm{vol}(\mathrm{class}_j)},$$

where

• $\mathcal{Y} = \{Y \in \{0,1\}^{t \times k} : Y\mathbf{1}_k = \mathbf{1}_t\}$;

• $\mathrm{cut}(\mathrm{class}_j, \overline{\mathrm{class}_j})$ is the cost of cutting class $j$ from the rest of the graph,

$$\mathrm{cut}(\mathrm{class}_j, \overline{\mathrm{class}_j}) = \frac{1}{2}\sum_{i,l:\, Y_{ij} \neq Y_{lj}} K_{il},$$

i.e., the total weight of the edges being cut;

• $\mathrm{vol}(\mathrm{class}_j)$ is the volume of class $j$,

$$\mathrm{vol}(\mathrm{class}_j) = \sum_{l,i:\, Y_{ij}=1} K_{il},$$

i.e., the total edge weight incident to class $j$.

Note that

$$\begin{aligned}
\mathrm{cut}(\mathrm{class}_j, \overline{\mathrm{class}_j}) &= \frac{1}{2}\sum_{i,l:\, Y_{ij} \neq Y_{lj}} K_{il} = \frac{1}{2}\sum_{i,l} K_{il}(Y_{ij} - Y_{lj})^2 \\
&= \frac{1}{2}\sum_{i,l} K_{il}(Y_{ij}^2 + Y_{lj}^2 - 2Y_{ij}Y_{lj}) \\
&= Y_{:j}^\top(\Delta(K\mathbf{1}_t) - K)Y_{:j} = Y_{:j}^\top(\Lambda - K)Y_{:j}, \\
\mathrm{vol}(\mathrm{class}_j) &= \sum_{l,i:\, Y_{ij}=1} K_{il} = \sum_{i,l} K_{il}Y_{ij}^2 = \sum_i\Big(\sum_l K_{il}\Big)Y_{ij}^2 \\
&= Y_{:j}^\top\Delta(K\mathbf{1}_t)Y_{:j} = Y_{:j}^\top\Lambda Y_{:j}.
\end{aligned}$$

Therefore, the problem becomes

$$\begin{aligned}
& \min_{Y \in \mathcal{Y}} \mathrm{tr}\left((Y^\top\Lambda Y)^{-1}Y^\top(\Lambda - K)Y\right) \\
\iff\; & \min_{Y \in \mathcal{Y}} \mathrm{tr}\left(I - Y(Y^\top\Lambda Y)^{-1}Y^\top K\right) \\
\iff\; & \min_{Y \in \mathcal{Y}} \mathrm{tr}\left(\Lambda^{-1}K - Y(Y^\top\Lambda Y)^{-1}Y^\top K\right) \\
\iff\; & \min_{Y \in \mathcal{Y}} \mathrm{tr}\left(\Lambda(\Lambda^{-1} - Y(Y^\top\Lambda Y)^{-1}Y^\top)K(\Lambda^{-1} - Y(Y^\top\Lambda Y)^{-1}Y^\top)^\top\right) \\
\iff\; & \min_{Y \in \mathcal{Y}} \min_B \mathrm{tr}\left(\Lambda(\Lambda^{-1} - YB)K(\Lambda^{-1} - YB)^\top\right),
\end{aligned}$$

    which is equivalent to what we have seen before.
