
CS 189 Introduction to Machine Learning
Spring 2018 HW5

1 Getting Started

Read through this page carefully. You may typeset your homework in LaTeX or submit neatly handwritten/scanned solutions. Please start each question on a new page. Deliverables:

1. Submit a PDF of your writeup, with an appendix for your code, to the assignment on Gradescope, “HW5 Write-Up”. If there are graphs, include those graphs in the correct sections. Do not simply reference your appendix.

2. If there is code, submit all code needed to reproduce your results, “HW5 Code”.

3. If there is a test set, submit your test set evaluation results, “HW5 Test Set”.

After you’ve submitted your homework, watch out for the self-grade form.

(a) Who else did you work with on this homework? In case of course events, just describe the group. How did you work on this homework? Any comments about the homework?

(b) Please copy the following statement and sign next to it. We just want to make it extra clear so that no one inadvertently cheats.

I certify that all solutions are entirely in my words and that I have not looked at another student’s solutions. I have credited all external sources in this write-up.


This homework is due Friday, February 23rd at 10pm.

2 Total Least Squares

In most of the models we have looked at so far, we’ve accounted for noise in the observed y measurement and adjusted accordingly. However, in the real world it could easily be that our feature matrix X of data is also corrupted or noisy. Total least squares is a way to account for this. Whereas previously we were minimizing the y distance from the data point to our predicted line because we had assumed the features were definitively accurate, now we are minimizing the entire distance from the data point to our predicted line. In this problem we will explore the mathematical intuition for the TLS formula. We will then apply the formula to adjusting the lighting of an image which contains noise in its feature matrix due to inaccurate assumptions we make about the image, such as the image being a perfect sphere.

Let X and y be the observed measurements. Recall that in the least squares problem, we want to solve for w in min_w ‖Xw − y‖. We measure the error as the difference between Xw and y, which can be viewed as adding an error term ε_y such that the equation Xw = y + ε_y has a solution:

min_{ε_y, w} ‖ε_y‖²,   subject to   Xw = y + ε_y        (1)

Although this optimization formulation allows for errors in the measurements of y, it does not allow for errors in the feature matrix X that is measured from the data. In this problem, we will explore a method called total least squares that allows for both error in the matrix X and error in the vector y, represented by ε_X and ε_y, respectively. For convenience, we absorb the negative sign into ε_y and ε_X and define the true measurements y_true and X_true like so:

y_true = y + ε_y        (2)
X_true = X + ε_X        (3)

Specifically, the total least squares problem is to find the solution for w in the following minimization problem:

min_{ε_y, ε_X, w} ‖[ε_X, ε_y]‖²_F ,   subject to   (X + ε_X)w = y + ε_y        (4)

where the matrix [ε_X, ε_y] is the concatenation of the columns of ε_X with the column vector ε_y. Notice that the minimization is over w because it is a free parameter, and it does not necessarily have to be in the objective function. Intuitively, this equation is finding the smallest perturbation to the matrix of data points X and the outputs y such that the linear model can be solved exactly. The constraint in the minimization problem can be rewritten as


[X + ε_X, y + ε_y] \begin{bmatrix} w \\ -1 \end{bmatrix} = 0        (5)

(a) Let the matrix X ∈ R^{n×d} and y ∈ R^n, and note that ε_X ∈ R^{n×d} and ε_y ∈ R^n. Assuming that n > d and rank(X + ε_X) = d, explain why rank([X + ε_X, y + ε_y]) = d.

(b) For the solution w to be unique, the matrix [X + ε_X, y + ε_y] must have exactly d linearly independent columns. Since this matrix has d + 1 columns in total, it must be rank-deficient by 1. Recall that the Eckart-Young-Mirsky Theorem tells us that the closest lower-rank matrix in the Frobenius norm is obtained by discarding the smallest singular values. Therefore, the matrix [X + ε_X, y + ε_y] that minimizes

‖[ε_X, ε_y]‖²_F = ‖[X_true, y_true] − [X, y]‖²_F

is given by

[X + ε_X, y + ε_y] = U \begin{bmatrix} Σ_d & 0 \\ 0 & 0 \end{bmatrix} V^⊤

where [X, y] = UΣV^⊤ and Σ_d is the diagonal matrix of the d largest singular values of [X, y].

Suppose we express the SVD of [X,y] in terms of submatrices and vectors:

[X, y] = \begin{bmatrix} U_xx & u_xy \\ u_yx^⊤ & u_yy \end{bmatrix} \begin{bmatrix} Σ_d & 0 \\ 0 & σ_{d+1} \end{bmatrix} \begin{bmatrix} V_xx & v_xy \\ v_yx^⊤ & v_yy \end{bmatrix}^⊤

where the vectors u_xy ∈ R^d and v_xy ∈ R^d are the first d elements of the (d+1)-th column of U and V respectively, u_yy and v_yy are the (d+1)-th element of the (d+1)-th column of U and V respectively, U_xx ∈ R^{d×d} and V_xx ∈ R^{d×d} are the d × d top-left submatrices of U and V respectively, and σ_{d+1} is the (d+1)-th singular value of [X, y].

Using this information show that

[ε_X, ε_y] = − \begin{bmatrix} u_xy \\ u_yy \end{bmatrix} σ_{d+1} \begin{bmatrix} v_xy \\ v_yy \end{bmatrix}^⊤

(c) Using the result from the previous part and the fact that v_yy is not 0 (see notes on Total Least Squares), find a nonzero solution to [X + ε_X, y + ε_y] \begin{bmatrix} w \\ -1 \end{bmatrix} = 0 and thus solve for w in Equation (5).

HINT: Looking at the last column of the product [X, y]V may or may not be useful for this problem, depending on how you solve it.


Figure 1: Tennis ball pasted on top of an image of St. Peter’s Basilica without lighting adjustment (left) and with lighting adjustment (right)

(d) From the previous part, you can see that \begin{bmatrix} w \\ -1 \end{bmatrix} is a right-singular vector of [X, y]. Show that

(X^⊤X − σ_{d+1}² I) w = X^⊤y        (14)

(e) In this problem, we will use total least squares to approximately learn the lighting in a photograph, which we can then use to paste new objects into the image while still maintaining the realism of the image. You will be estimating the lighting coefficients for the interior of St. Peter’s Basilica, and you will then use these coefficients to change the lighting of an image of a tennis ball so that it can be pasted into the image. In Figure 1, we show the result of pasting the tennis ball into the image without adjusting the lighting on the ball. The ball looks too bright for the scene and does not look like it would fit in with other objects in the image.

To convincingly add a tennis ball to an image, we need to apply the appropriate lighting from the environment onto the added ball. To start, we will represent environment lighting as a spherical function f(n), where n is a 3-dimensional unit vector (‖n‖₂ = 1) and f outputs a 3-dimensional color vector, with one component each for red, green, and blue light intensities. Because f(n) is a spherical function, the input n must correspond to a point on a sphere. The function f(n) represents the total incoming light from the direction n in the scene. The lighting function of a spherical object f(n) can be approximated by the first 9 spherical harmonic basis functions.

The first 9 unnormalized spherical harmonic basis functions are given by:

L1 = 1

L2 = y

L3 = x

L4 = z

L5 = xy

L6 = yz

L7 = 3z² − 1

L8 = xz

L9 = x² − y²

where n = [x, y, z]^⊤. The lighting function can then be approximated as

f(n) ≈ ∑_{i=1}^{9} γ_i L_i(n)

Stacking the n sample directions, this can be written in matrix form as

\begin{bmatrix} — f(n_1) — \\ — f(n_2) — \\ \vdots \\ — f(n_n) — \end{bmatrix}_{n×3} = \begin{bmatrix} L_1(n_1) & L_2(n_1) & \cdots & L_9(n_1) \\ L_1(n_2) & L_2(n_2) & \cdots & L_9(n_2) \\ \vdots & \vdots & & \vdots \\ L_1(n_n) & L_2(n_n) & \cdots & L_9(n_n) \end{bmatrix}_{n×9} \begin{bmatrix} — γ_1 — \\ — γ_2 — \\ \vdots \\ — γ_9 — \end{bmatrix}_{9×3}

where L_i(n) is the i-th basis function from the list above.

Figure 2: Image of a spherical mirror inside of St. Peter’s Basilica

The function of incoming light f(n) can be measured by photographing a spherical mirror placed in the scene of interest. In this case, we provide you with an image of the sphere, as seen in Figure 2. In the code provided, there is a function extractNormals(img) that will extract the training pairs (n_i, f(n_i)) from the image. An example using this function is in the code.

Use the spherical harmonic basis functions to create a 9-dimensional feature vector for each sample. Use this to formulate an ordinary least squares problem and solve for the unknown coefficients γ_i. Report the estimated values for γ_i and include a visualization of the approximation using the provided code. The code provided will load the images, extract the training data, relight the tennis ball with incorrect coefficients, and save the results. Your task is to compute the basis functions and solve the least squares problem to provide the code with the correct coefficients. To run the starter code, you will need to use Python with numpy and scipy. Because the resulting data set is large, we reduce it in the code by taking every 50th entry in the data. This is done for you in the starter code, but you can try using the entire data set or reducing it by a different amount.
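
As a rough sketch of this step (not the starter code's actual interface), the fit might look like the following; normals and colors are hypothetical names for the (n_i, f(n_i)) pairs returned by extractNormals, and the basis functions follow the list above.

import numpy as np

def sh_features(normals):
    """Evaluate the 9 unnormalized spherical harmonic basis functions
    (L1 through L9 above) at each row of an (m, 3) array of unit normals."""
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack([np.ones_like(x), y, x, z,
                     x * y, y * z, 3 * z**2 - 1, x * z, x**2 - y**2], axis=1)

# Hypothetical usage with the training pairs from extractNormals(img):
# A = sh_features(normals)                             # (m, 9) design matrix
# gamma, *_ = np.linalg.lstsq(A, colors, rcond=None)   # (9, 3) coefficient matrix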

(f) When we extract from the data the direction n to compute (n_i, f(n_i)), we make some approximations about how the light is captured on the image. We also assume that the spherical mirror is a perfect sphere, but in reality, there will always be small imperfections. Thus, our measurement for n contains some error, which makes this an ideal problem to apply total least squares. Solve this problem with total least squares by allowing perturbations in the matrix of basis functions. Report the estimated values for γ_i and include a visualization of the approximation. The output image will be visibly wrong, and we’ll explore how to fix this problem in the next part. Your implementation may only use the SVD and the matrix inverse functions from the linear algebra library in numpy, as well as np.linalg.solve.
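
A minimal sketch of one way to do this, assuming the same hypothetical A and colors arrays as in the sketch above and solving each color channel separately with the closed form from part (d); it uses only the SVD and np.linalg.solve.

import numpy as np

def tls_solve(A, b):
    """Total least squares for one output column b, via the closed form
    (A^T A - sigma_{d+1}^2 I) w = A^T b, where sigma_{d+1} is the smallest
    singular value of the augmented matrix [A, b]."""
    d = A.shape[1]
    sigma = np.linalg.svd(np.hstack([A, b[:, None]]), compute_uv=False)
    return np.linalg.solve(A.T @ A - sigma[d] ** 2 * np.eye(d), A.T @ b)

# Hypothetical usage, one color channel at a time:
# gamma_tls = np.column_stack([tls_solve(A, colors[:, c]) for c in range(3)])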

(g) In the previous part, you should have noticed that the visualization is drastically different from the one generated using least squares. Recall that in total least squares we are minimizing ‖[ε_X, ε_y]‖²_F. Intuitively, to minimize the Frobenius norm of components of both the inputs and outputs, the inputs and outputs should be on the same scale. However, this is not the case here. Color values in an image will typically be in [0, 255], but the original image had a much larger range. We compressed the range to a smaller scale using tone mapping, but the effect of the compression is that relatively bright areas of the image become less bright. As a compromise, we scaled the image colors down to a maximum color value of 384 instead of 255. Thus, the inputs here are all unit vectors, and the outputs are 3-dimensional vectors where each value is in [0, 384]. Propose a value by which to scale the outputs f(n_i) such that the values of the inputs and outputs are roughly on the same scale. Solve this scaled total least squares problem, report the computed spherical harmonic coefficients, and provide a rendering of the relit sphere.


3 PCA and Random Projections

In this question, we revisit the task of dimensionality reduction. Dimensionality reduction is useful for several purposes, including, but not restricted to, visualization, storage, and faster computation. While reducing dimension is useful, it is reasonable to demand that such reductions preserve some properties of the original data. Often, certain geometric properties like distances and inner products are important for performing certain machine learning tasks. As a result, we may want to perform dimensionality reduction while ensuring that we approximately maintain the pairwise distances and inner products.

While you have already seen many properties of PCA so far, in this question we investigate whether random projections are a good idea for dimensionality reduction. A few advantages of random projections over PCA are: (1) PCA is expensive when the underlying dimension is high and the number of principal components is also large (note, however, that there are several very fast algorithms dedicated to doing PCA); (2) PCA requires you to have access to the feature matrix for performing computations. The second requirement of PCA is a bottleneck when you want to take only a low-dimensional measurement of very high-dimensional data, e.g., in fMRI and in compressed sensing. In such cases, one needs to design a projection scheme before seeing the data. We now turn to a concrete setting to study a few properties of PCA and random projections.

Suppose you are given n points x_1, . . . , x_n in R^d. Define the n × d matrix

X = \begin{bmatrix} x_1^⊤ \\ \vdots \\ x_n^⊤ \end{bmatrix}

where each row of the matrix represents one of the given points. In this problem, we will consider a few low-dimensional embeddings ψ : R^d → R^k that map vectors from R^d to R^k.

Let X = UΣV^⊤ denote the singular value decomposition of the matrix X. Assume that n ≥ d and let σ_1 ≥ σ_2 ≥ . . . ≥ σ_d denote the singular values of X.

Let v_1, v_2, . . . , v_d denote the columns of the matrix V. We now consider the following k-dimensional PCA embedding: ψ_PCA(x) = (v_1^⊤x, . . . , v_k^⊤x)^⊤. Note that this embedding projects a d-dimensional vector onto the linear span of the set {v_1, . . . , v_k} and that v_i^⊤x denotes the i-th coordinate of the projected vector in the new space.
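
As a small illustration (the function name and the use of numpy's SVD are choices made here, not part of the assignment), ψ_PCA applied to every data point can be computed as:

import numpy as np

def pca_embed(X, k):
    """psi_PCA applied to every row of X: project onto the span of the
    top-k right-singular vectors v_1, ..., v_k."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)   # X = U Sigma V^T, rows of Vt are v_i^T
    return X @ Vt[:k].T                                # i-th row is (v_1^T x_i, ..., v_k^T x_i)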

We begin with a few matrix algebra relationships and use them to investigate certain mathematical properties of PCA and random projections in the first few parts, and then see them in action on a synthetic dataset in the later parts.

Notation: The symbol [n] stands for the set {1, . . . , n}.

(a) What is the ij-th entry of the matrices XX^⊤ and X^⊤X? Express the matrix XX^⊤ in terms of U and Σ, and express the matrix X^⊤X in terms of Σ and V.

(b) Show that ψ_PCA(x_i)^⊤ ψ_PCA(x_j) = x_i^⊤ V_k V_k^⊤ x_j, where V_k = [v_1 v_2 . . . v_k].

Also show that V_k V_k^⊤ = V I_k V^⊤, where the matrix I_k denotes a d × d diagonal matrix with the first k diagonal entries equal to 1 and all other entries equal to zero.


(c) Suppose that we know the first k singular values are the dominant singular values. In particular, we are given that

(∑_{i=1}^{k} σ_i⁴) / (∑_{i=1}^{d} σ_i⁴) ≥ 1 − ε,

for some ε ∈ (0, 1). Then show that the PCA projection onto the first k right-singular vectors preserves the inner products on average:

(1 / ∑_{i=1}^{n} ∑_{j=1}^{n} (x_i^⊤x_j)²) · ∑_{i=1}^{n} ∑_{j=1}^{n} | x_i^⊤x_j − ψ_PCA(x_i)^⊤ψ_PCA(x_j) |² ≤ ε.        (6)

Thus, we find that if there are dominant singular values, PCA projection can preserve the inner products on average.

Hint: Using the previous two parts and the definition of the Frobenius norm might be useful.

(d) Now consider a different embedding ψ : R^d → R^k which preserves all pairwise distances and norms up to a multiplicative factor, that is,

(1 − ε)‖x_i‖² ≤ ‖ψ(x_i)‖² ≤ (1 + ε)‖x_i‖²   for all i ∈ [n], and        (7)
(1 − ε)‖x_i − x_j‖² ≤ ‖ψ(x_i) − ψ(x_j)‖² ≤ (1 + ε)‖x_i − x_j‖²   for all i, j ∈ [n],        (8)

where 0 < ε ≪ 1 is a small scalar. Further assume that ‖x_i‖ ≤ 1 for all i ∈ [n]. Show that the embedding ψ satisfying equations (7) and (8) preserves each pairwise inner product:

|ψ(x_i)^⊤ψ(x_j) − x_i^⊤x_j| ≤ Cε,   for all i, j ∈ [n],        (9)

for some constant C. Thus, we find that if an embedding approximately preserves distances and norms up to a small multiplicative factor, and the points have bounded norms, then inner products are also approximately preserved up to an additive factor.

Hint: You may use the Cauchy-Schwarz inequality.

(e) Now we consider the random projection using a Gaussian matrix as introduced in section. In the next few parts, we work towards proving that if the dimension of the projection is moderately big, then with high probability, the random projection approximately preserves norms and pairwise distances as described in equations (7) and (8).

Consider the random matrix J ∈ R^{k×d} with each of its entries being i.i.d. N(0, 1), and consider the map ψ_J : R^d → R^k such that ψ_J(x) = (1/√k) Jx. Show that for any fixed non-zero vector u, the random variable

‖ψ_J(u)‖² / ‖u‖²

can be written as

(1/k) ∑_{i=1}^{k} Z_i²

where the Z_i’s are i.i.d. N(0, 1) random variables.
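
For concreteness, a minimal numpy sketch of the map ψ_J applied to every row of a data matrix (the function name and seeding are choices made here):

import numpy as np

def random_embed(X, k, seed=0):
    """Gaussian random projection psi_J(x) = (1/sqrt(k)) J x, applied to each row of X."""
    rng = np.random.default_rng(seed)
    J = rng.standard_normal((k, X.shape[1]))   # entries of J are i.i.d. N(0, 1)
    return (X @ J.T) / np.sqrt(k)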


(f) For i.i.d. Z_i ∼ N(0, 1), we have the following probability bound:

P[ (1/k) ∑_{i=1}^{k} Z_i² ∉ (1 − t, 1 + t) ] ≤ 2e^{−kt²/8},   for all t ∈ (0, 1).

Note that this bound suggests that ∑_{i=1}^{k} Z_i² ≈ ∑_{i=1}^{k} E[Z_i²] = k with high probability. In other words, a sum of squares of Gaussian random variables concentrates around its mean with high probability. Using this bound and the previous part, now show that if k ≥ (16/ε²) log(n/δ), then

P[ for all i, j ∈ [n], i ≠ j:  ‖ψ_J(x_i) − ψ_J(x_j)‖² / ‖x_i − x_j‖² ∈ (1 − ε, 1 + ε) ] ≥ 1 − δ.

That is, show that for k large enough, with high probability the random projection ψ_J approximately preserves the pairwise distances. Using this result, we can conclude that random projection serves as a good tool for dimensionality reduction if we project to a large enough number of dimensions. This result is popularly known as the Johnson-Lindenstrauss Lemma.

Hint 1: The following bound (a powerful technique in its own right) might be useful: for a set of events A_ij, we have

P[∩_{i,j} A_ij] = 1 − P[(∩_{i,j} A_ij)^c] = 1 − P[∪_{i,j} A_ij^c] ≥ 1 − ∑_{i,j} P[A_ij^c].

You may define the event

A_ij = { ‖ψ_J(x_i) − ψ_J(x_j)‖² / ‖x_i − x_j‖² ∈ (1 − ε, 1 + ε) }

and use the union bound above.

(g) Suppose there are two clusters of points S_1 = {u_1, . . . , u_n} and S_2 = {v_1, . . . , v_m} which are far apart, i.e., we have

d²(S_1, S_2) = min_{u∈S_1, v∈S_2} ‖u − v‖² ≥ γ.

Then, using the previous part, show that the random projection ψ_J also approximately maintains the distance between the two clusters if k is large enough, that is, with high probability

d²(ψ_J(S_1), ψ_J(S_2)) = min_{u∈S_1, v∈S_2} ‖ψ_J(u) − ψ_J(v)‖² ≥ (1 − ε)γ   if k ≥ (C/ε²) log(m + n)

for some constant C. Note that such a property can help in several machine learning tasks. For example, if the clusters of features for different labels were far apart in the original dimension, then this problem shows that even after randomly projecting them, the clusters will remain far enough apart, and a machine learning model may perform well even with the projected data. We now turn to visualizing some of these conclusions on a synthetic dataset.


(h) In the next few parts, we visualize the effect of PCA projections and random projections on a classification task. You are given 3 datasets in the data folder and starter code.

Use the starter code to load the three datasets one by one. Note that there are two unique values in y. Visualize the features X of these datasets using (1) the top-2 PCA components, and (2) 2-dimensional random projections. Use the code to project the features to 2 dimensions and then scatter plot the 2 features with a different color for each class. Note that you will obtain 2 plots for each dataset (total 6 plots for this part). Do you observe a difference in PCA vs. random projections? Do you see a trend in the three datasets?
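
A possible shape for these plots is sketched below; X and y are placeholders for one dataset loaded by the starter code, and the projections are computed directly rather than through the starter code's own functions.

import numpy as np
import matplotlib.pyplot as plt

# Placeholder data; replace with a dataset loaded by the starter code.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = (X[:, 0] > 0).astype(int)

# Top-2 PCA components and a 2-dimensional Gaussian random projection.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Z_pca = X @ Vt[:2].T
J = rng.standard_normal((2, X.shape[1]))
Z_rnd = (X @ J.T) / np.sqrt(2)

for name, Z in [("PCA", Z_pca), ("Random", Z_rnd)]:
    plt.figure()
    for label in np.unique(y):
        mask = (y == label)
        plt.scatter(Z[mask, 0], Z[mask, 1], s=8, label=str(label))
    plt.title(name + " projection to 2 dimensions")
    plt.legend()
plt.show()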

(i) For each dataset, we will now fit a linear model on different projections of the features to perform classification. The code for fitting a linear model with projected features and predicting a label for a given feature is given to you. Use these functions and write code that does prediction in the following way: (1) use the top-k PCA features to obtain one set of results, and (2) use a k-dimensional random projection to obtain the second set of results (take the average accuracy over 10 random projections for smooth curves). Use the projection functions given in the starter code to select these features. You should vary k from 1 to d, where d is the dimension of each feature x_i. Plot the accuracy for PCA and random projection as a function of k. Comment on the observations on these accuracies as a function of k and across different datasets.
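
One hedged way to structure this sweep is sketched below; the starter code supplies its own fit/predict and projection functions, so the least-squares classifier here is only a stand-in, and the train/test array names are placeholders.

import numpy as np

def linear_accuracy(Z_tr, y_tr, Z_te, y_te):
    """Stand-in for the starter code's linear model: least squares on +/-1
    labels, then classify by the sign of the linear score."""
    classes = np.unique(y_tr)
    t = np.where(y_tr == classes[0], -1.0, 1.0)
    w, *_ = np.linalg.lstsq(Z_tr, t, rcond=None)
    pred = np.where(Z_te @ w < 0, classes[0], classes[1])
    return np.mean(pred == y_te)

# Hypothetical sweep over k, reusing random_embed from the earlier sketch:
# d = X_tr.shape[1]
# _, _, Vt = np.linalg.svd(X_tr, full_matrices=False)
# acc_pca = [linear_accuracy(X_tr @ Vt[:k].T, y_tr, X_te @ Vt[:k].T, y_te)
#            for k in range(1, d + 1)]
# acc_rnd = [np.mean([linear_accuracy(random_embed(X_tr, k, s), y_tr,
#                                     random_embed(X_te, k, s), y_te)
#                     for s in range(10)])        # average over 10 projections
#            for k in range(1, d + 1)]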

(j) Now plot the singular values of the feature matrices of the three datasets. Do you observe a pattern across the three datasets? Does it help to explain the performance of PCA observed in the previous parts?
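
A small sketch of such a plot for one dataset (X is again a placeholder for a loaded feature matrix):

import numpy as np
import matplotlib.pyplot as plt

X = np.random.default_rng(0).standard_normal((200, 10))   # placeholder for a loaded dataset
sigma = np.linalg.svd(X, compute_uv=False)                 # singular values, largest first
plt.plot(np.arange(1, sigma.size + 1), sigma, marker="o")
plt.xlabel("index i")
plt.ylabel("singular value")
plt.show()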

4 Fruits and Veggies!

The goal of this problem is to help the class build a dataset for image classification, which will be used later in the course to classify fruits and vegetables. Please pick ten of the following fruits:

1. Apple

2. Banana

3. Oranges

4. Grapes

5. Strawberry

6. Peach

7. Cherry

8. Nectarine

9. Mango


10. Pear

11. Plum

12. Watermelon

13. Pineapple

and ten of the following veggies:

1. Spinach

2. Celery

3. Potato (not sweet potato)

4. Bell Peppers

5. Tomato

6. Cabbage

7. Radish

8. Broccoli

9. Cauliflower

10. Carrot

11. Eggplant

12. Garlic

13. Ginger

Take two pictures of each specific fruit and veggie, for a total of 20 fruit pictures and 20 veggie pictures, against any background such that the fruit is centered in the image and the object takes up approximately a third of the image in area; see below for examples. Save these pictures as .png files. Do not save the pictures as .jpg files, since these are lower quality and will interfere with the results of your future coding project. Place all the images in a folder titled data. Each image should be titled [fruit_name]_[number].png or [veggie_name]_[number].png where [number] ∈ {0, 1}. Ex: apple_0.png, apple_1.png, banana_0.png, etc. (the ordering is irrelevant). Please also include a file titled rich_labels.txt which contains entries on new lines prefixed by the file name [fruit_name]_[number] or [veggie_name]_[number], followed by a description of the image (maximum of eight words) with only a space delimiter in between. Ex: apple_0 one fuji red apple on wood table. To turn in the folder, compress it to a .zip file and upload it to Gradescope.

Figure 3: Example of properly centered images of four bananas against a table (left) and one orange on a tree (right)

Please keep in mind that data is an integral part of Machine Learning. A large part of research in this field relies heavily on the integrity of data in order to run algorithms. It is, therefore, vital that your data is in the proper format and is accompanied by the correct labeling, not only for your grade on this section, but for the integrity of your data.

Note that if your compressed file is over 100 MB, you will need to downsample the images. You can use the following functions from skimage to help.

from skimage.io import imread, imsave
from skimage.transform import resize
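
For example, a sketch of halving one image's resolution before zipping (the file name follows the naming convention above; img_as_ubyte converts the resized float image back to 8-bit for saving):

from skimage import img_as_ubyte
from skimage.io import imread, imsave
from skimage.transform import resize

img = imread("data/apple_0.png")
small = resize(img, (img.shape[0] // 2, img.shape[1] // 2), anti_aliasing=True)
imsave("data/apple_0.png", img_as_ubyte(small))   # overwrite with the smaller copy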

5 Your Own Question

Write your own question, and provide a thorough solution.

Writing your own problems is a very important way to really learn the material. The famous “Bloom’s Taxonomy” lists the levels of learning as: Remember, Understand, Apply, Analyze, Evaluate, and Create. Using what you know to create is the top level. We rarely ask you any HW questions about the lowest level of straight-up remembering, expecting you to be able to do that yourself (e.g., make yourself flashcards). But we don’t want the same to be true about the highest level.

As a practical matter, having some practice at trying to create problems helps you study for exams much better than simply counting on solving existing practice problems. This is because thinking about how to create an interesting problem forces you to really look at the material from the perspective of those who are going to create the exams.

Besides, this is fun. If you want to make a boring problem, go ahead. That is your prerogative. But it is more fun to really engage with the material, discover something interesting, and then come up with a problem that walks others down a journey that lets them share your discovery. You don’t have to achieve this every week. But unless you try every week, it probably won’t happen ever.

HW5, ©UC Berkeley CS 189, Spring 2018. All Rights Reserved. This may not be publicly shared without explicit permission.