TRANSCRIPT
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 1
BTRY6520/STSCI6520: Fall 2012
Computationally Intensive Statistical Methods
Instructor: Ping Li
Department of Statistical Science
Cornell University
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 2
General Information
• Lectures : Mon, Wed 2:55pm - 4:10pm, Hollister Hall 362
• Instructor : Ping Li, [email protected], Comstock Hall 1192
Office Hours: After class, Computing Labs, and by appointment.
• TA: No TA for this course
• Textbook : No textbook for this course
• Homework
– About 5-7 homework assignments.
– No late homework will be accepted.
– Before computing your overall homework grade, the assignment with the
lowest grade (if ≥ 40%) will be dropped.
– It is the students’ responsibility to keep copies of the submitted homework.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 3
• Course grading :
1. Homework: 80%
2. Class Participation: 20%
• Computing : All the homework assignments will involve programming in Matlab.
Students who register for the class should be willing to learn Matlab. The
programming assignments will be graded on correctness, efficiency (to an
extent), and demos. Questions will be asked during the (one-to-one) demos.
• Labs : In addition to regular lectures, several computing labs will be
provided, usually in the evenings. Note that, as there is no TA, this is
additional time the instructor will offer to help students. Due to conflicts
with several conferences, a small number of regular lectures will be canceled.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 4
Course Material
• Matlab programming.
• Basic numerical optimization.
• Contingency table estimation.
• Linear regression and logistic regression.
• Clustering algorithms.
• Random projection algorithms and applications.
• Hashing algorithms and applications.
• Other topics on modern statistical computing, if time permits.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 5
Prerequisite
This is a first-year Statistics Ph.D. course. Students are expected to be
well-prepared: probability theory, mathematical statistics, some programming
experience, basic numerical optimization, etc.
Nevertheless, the instructor is happy to accommodate motivated graduate
students who are willing to quickly learn the prerequisite material. The
instructor will often review relevant material, at a fairly fast pace.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 6
Contingency Table Estimations
Original contingency table: cells N11, N12, N21, N22. Sample contingency table: cells n11, n12, n21, n22.
Suppose we only observe the sample contingency table. How can we estimate the
original table, if N = N11 + N12 + N21 + N22 is known?
(Almost) equivalently, how can we estimate πij = Nij/N?
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 7
An Example of Contingency Table
The task is to estimate how many times two words (e.g., Cornell and University)
co-occur in all the Web pages (over 10 billion).
2-by-2 table (rows: Word 1 / No Word 1; columns: Word 2 / No Word 2) with cells N11, N12, N21, N22.
N11: number of documents containing both word 1 and word 2.
N22: number of documents containing neither word 1 nor word 2.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 8
Google Pagehits
Google tells the user the number of Web pages containing the input query word(s).
Pagehits obtained by typing the following queries into Google (numbers can change):
• a : 25,270,000,000 pages (a surrogate for N , the total # of pages).
• Cornell : 99,600,000 pages. (N11 + N12)
• University : 2,700,000,000 pages. (N11 + N21)
• Cornell University : 31,800,000 pages. (N11)
2-by-2 table (rows: Word 1 / No Word 1; columns: Word 2 / No Word 2) with cells N11, N12, N21, N22.
How much do we believe these numbers?
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 9
Suppose there are in total n = 10^7 word items.
It is easy to store 10^7 numbers (how many documents each word occurs in), but it
would be difficult to store a matrix of 10^7 × 10^7 numbers (how many documents
a pair of words co-occur in).
Even if storing 10^7 × 10^7 is not a problem (it is Google), it is much more difficult
to store 10^7 × 10^7 × 10^7 numbers, for 3-way co-occurrences (e.g., Cornell,
University, Statistics).
Even if we can store 3-way or 4-way co-occurrences, most of the items will be so
rare that they will almost never be used.
Therefore, it is realistic to believe that the counts for individual words are exact,
but the numbers of co-occurrences may have to be estimated, e.g., from samples.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 10
Estimating Contingency Tables by MLE of Multinomial Sampling
Original contingency table: cell probabilities π11, π12, π21, π22. Sample contingency table: counts n11, n12, n21, n22.
Observations: (n11, n12, n21, n22), n = n11 + n12 + n21 + n22.
Parameters (π11, π12, π21, π22), (π11 + π12 + π21 + π22 = 1)
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 11
The likelihood
n!/(n11! n12! n21! n22!) × π11^n11 π12^n12 π21^n21 π22^n22

The log likelihood
l = log[ n!/(n11! n12! n21! n22!) ] + n11 log π11 + n12 log π12 + n21 log π21 + n22 log π22

We can choose to write π22 = 1 − π11 − π12 − π21.

Finding the maximum (setting first derivatives to zero)
∂l/∂π11 = n11/π11 − n22/(1 − π11 − π12 − π21) = 0,
⟹ n11/π11 = n22/π22, or written as π11/π22 = n11/n22.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 12
Similarly
n11/π11 = n12/π12 = n21/π21 = n22/π22.

Therefore
π11 = n11 λ, π12 = n12 λ, π21 = n21 λ, π22 = n22 λ.

Summing up all the terms
1 = π11 + π12 + π21 + π22 = [n11 + n12 + n21 + n22] λ = nλ
yields λ = 1/n.

The MLE solution is
π11 = n11/n, π12 = n12/n, π21 = n21/n, π22 = n22/n.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 13
Finding the MLE Solution by Lagrange Multiplier
MLE as a constrained optimization:
argmax over (π11, π12, π21, π22) of n11 log π11 + n12 log π12 + n21 log π21 + n22 log π22
subject to: π11 + π12 + π21 + π22 = 1

The unconstrained optimization problem:
argmax over (π11, π12, π21, π22) of
L = n11 log π11 + n12 log π12 + n21 log π21 + n22 log π22 − λ (π11 + π12 + π21 + π22 − 1)

Finding the optimum: ∂L/∂z = 0 for z ∈ {π11, π12, π21, π22, λ}:
n11/π11 − λ = 0, and similarly n12/π12 = n21/π21 = n22/π22 = λ, with π11 + π12 + π21 + π22 = 1.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 14
Maximum Likelihood Estimation (MLE)
Observations xi, i = 1 to n, are i.i.d. samples from a distribution with probability
density function fX (x; θ1, θ2, ..., θk),
where θj , j = 1 to k, are parameters to be estimated.
The maximum likelihood estimator seeks the θ that maximizes the joint likelihood
θ̂ = argmax_θ ∏_{i=1}^n fX(xi; θ)

Or, equivalently, that maximizes the log joint likelihood
θ̂ = argmax_θ ∑_{i=1}^n log fX(xi; θ)

This is a convex optimization problem if fX is concave or if −log fX is convex (i.e., fX is log-concave).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 15
An Example: Normal Distribution
If X ∼ N(µ, σ²), then fX(x; µ, σ²) = (1/(√(2π)σ)) e^{−(x−µ)²/(2σ²)}
Fix σ² = 1, x = 0.

[Plots: fX(x; µ, σ²) (left) and log fX(x; µ, σ²) (right) as functions of µ.]
As a function of µ, fX is not concave, but −log fX is convex, i.e., a unique MLE solution exists.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 16
Another Example of Exact MLE Solution
Given n i.i.d. samples, xi ∼ N(µ, σ2), i = 1 to n.
l(x1, x2, ..., xn; µ, σ²) = ∑_{i=1}^n log fX(xi; µ, σ²)
= −(1/(2σ²)) ∑_{i=1}^n (xi − µ)² − (n/2) log(2πσ²)

∂l/∂µ = (1/(2σ²)) · 2 ∑_{i=1}^n (xi − µ) = 0 ⟹ µ̂ = (1/n) ∑_{i=1}^n xi

∂l/∂σ² = (1/(2σ⁴)) ∑_{i=1}^n (xi − µ)² − n/(2σ²) = 0 ⟹ σ̂² = (1/n) ∑_{i=1}^n (xi − µ̂)².
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 17
Convex Functions
A function f(x) is convex if the second derivative f ′′(x) ≥ 0.
[Plots: f(x) = x² (left) and f(x) = x log(x) (right).]
f(x) = x² ⟹ f''(x) = 2 > 0, i.e., f(x) = x² is convex for all x.
f(x) = x log x ⟹ f''(x) = 1/x, i.e., f(x) = x log x is convex if x > 0.
Both are widely used in statistics and data mining as loss functions,
⟹ computationally tractable algorithms: least squares, logistic regression.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 18
Left panel: f(x) = (1/√(2π)) e^{−x²/2} is −log convex: ∂²[−log f(x)]/∂x² = 1 > 0.

[Plots: the left panel for the single normal density, the right panel for the mixture of two normals.]

Right panel: a mixture of normals is not −log convex:
f(x) = (1/√(2π)) e^{−x²/2} + (1/(10√(2π))) e^{−(x−10)²/200}
The mixture of normals is an extremely useful model in statistics.
In general, only a local minimum can be obtained.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 19
Steepest Descent
[Figure: successive iterates x0, x1, x2, x3, x4 moving downhill along the curve y = f(x).]
Procedure:
Start with an initial guess x0.
Compute x1 = x0 −∆f ′(x0), where ∆ is the step size.
Continue the process xt+1 = xt −∆f ′(xt).
Until some criterion is met, e.g., f(xt+1) ≈ f(xt)
The meaning of “steepest” is more clear in the two-dimensional situation.
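A minimal Matlab sketch of the procedure above (the choice of f(x) = x², the step size, and the stopping tolerance are illustrative assumptions, not part of the slides):

    % Minimal sketch of steepest descent for a one-dimensional f(x).
    % Here f(x) = x^2, so fprime(x) = 2x; Delta is the step size.
    fprime = @(x) 2*x;       % derivative of f(x) = x^2
    Delta  = 0.1;            % step size
    x      = 5;              % initial guess x0
    for t = 1:100
        xnew = x - Delta*fprime(x);         % x_{t+1} = x_t - Delta*f'(x_t)
        if abs(xnew - x) < 1e-8, break; end % stop when the iterates stabilize
        x = xnew;
    end
    x    % approximately 0, the minimizer of x^2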
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 20
An Example of Steepest Descent: f(x) = x2
f(x) = x2. The minimum is attained at x = 0, f ′(x) = 2x.
The steepest descent iteration formula xt+1 = xt −∆f ′(xt) = xt − 2∆xt.
[Plot: xt versus iteration for step sizes ∆ = 0.1, 0.45, and 0.9.]
Choosing the step size ∆ is important (even when f(x) is convex).
Too small ∆ =⇒ slow convergence, i.e., many iterations,
Too large ∆ =⇒ oscillations, i.e., also many iterations.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 21
Steepest Descent in Practice
Steepest descent is one of the most widely used techniques in real-world applications
• It is extremely simple
• It only requires knowing the first derivative
• It is numerically stable (for above reasons)
• For real applications, it is often affordable to use very small ∆
• In machine learning, ∆ is often called learning rate
• It is used in Neural Nets and Gradient Boosting (MART)
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 22
Newton’s Method
Recall the goal is to find the x∗ to minimize f(x).
If f(x) is convex, it is equivalent to finding the x∗ such that f ′(x∗) = 0.
Let h(x) = f ′(x). Take Taylor expansion about the optimum solution x∗:
h(x∗) = h(x) + (x∗ − x)h′(x) + “negligible” higher order terms
Because h(x∗) = f ′(x∗) = 0, we know approximately
0 ≈ h(x) + (x∗ − x)h′(x) ⟹ x∗ ≈ x − h(x)/h′(x)
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 23
The procedure of Newton’s Method
Start with an initial guess x0
Update x1 = x0 − f′(x0)/f′′(x0)
Repeat xt+1 = xt − f′(xt)/f′′(xt)
Until some stopping criterion is reached, e.g., xt+1 ≈ xt.
An example: f(x) = (x− c)2. f ′(x) = 2(x− c), f ′′(x) = 2.
x1 = x0 − f′(x0)/f′′(x0) ⟹ x1 = x0 − 2(x0 − c)/2 = c
But we already know that x = c minimizes f(x) = (x− c)2.
Newton’s method may find the minimum solution using only one step.
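A minimal Matlab sketch of the procedure (I use f(x) = x log x, the example on the next slide; the starting value and tolerance are my own choices):

    % Minimal sketch of Newton's method for minimizing f(x) = x*log(x),
    % whose minimizer is x = exp(-1); f'(x) = log(x)+1, f''(x) = 1/x.
    fp  = @(x) log(x) + 1;    % first derivative
    fpp = @(x) 1./x;          % second derivative
    x   = 0.5;                % initial guess (this example needs 0 < x0 < 1)
    for t = 1:50
        xnew = x - fp(x)/fpp(x);            % x_{t+1} = x_t - f'(x_t)/f''(x_t)
        if abs(xnew - x) < 1e-10, break; end
        x = xnew;
    end
    x    % approximately exp(-1) = 0.3679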
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 24
An Example of Newton’s Method: f(x) = x log x
f′(x) = log x + 1, f′′(x) = 1/x. Update: xt+1 = xt − (log xt + 1)/(1/xt) = xt − xt(log xt + 1)
[Plot: xt versus iteration for x0 = 0.5, x0 = 10^{−10}, and x0 = 1 − 10^{−10}.]
When x0 is close to optimum solution, the convergence is very fast
When x0 is far from the optimum, the convergence is slow initially
When x0 is badly chosen, no convergence. This example requires 0 < x0 < 1.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 25
Steepest Descent for f(x) = x log x
f ′(x) = log x + 1, xt+1 = xt −∆(log xt + 1)
[Plot: xt versus iteration for (x0 = 0, ∆ = 0.1), (x0 = 10, ∆ = 0.1), and (x0 = 10, ∆ = 0.9).]
Regardless of x0, convergence is guaranteed if f(x) is convex.
May be oscillating if step size ∆ is too large
Convergence is slow near the optimum solution.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 26
General Comments on Numerical Optimization
Numerical optimization is tricky, even for convex problems.
Multivariate optimization is much trickier!
Whenever possible, try to avoid intensive numerical optimization,
even maybe at the cost of losing some accuracy.
An example :
Iterative Proportional Scaling for contingency table with known margins
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 27
Contingency Table with Margin Constraints
Original contingency table: N11, N12, N21, N22. Sample contingency table: n11, n12, n21, n22.
Margins: M1 = N11 + N12, M2 = N11 + N21.
Margins are much easier to count exactly than interactions.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 28
An Example of Contingency Tables with Known Margins
Term-by-document matrix: n = 10^6 words and m = 10^10 (Web) documents.
Cell xij = 1 if word i appears in document j; xij = 0 otherwise.
[Figure: the n × m 0/1 term-by-document matrix, and the 2-by-2 co-occurrence table for Word 1 and Word 2 with cells N11, N12, N21, N22.]
N11: number of documents containing both word 1 and word 2.
N22: number of documents containing neither word 1 nor word 2.
Margins (M1 = N11 + N12, M2 = N11 + N21) for all rows cost nm: easy!
Interactions (N11, N12, N21, N22) for all pairs cost n(n − 1)m/2: difficult!
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 29
To avoid storing all pairwise contingency tables (n(n − 1)/2 pairs in total), one
strategy is to sample a fraction (k) of the columns of the original (term-doc) data
matrix and build sample contingency tables on demand, from which one can
estimate the original contingency tables.
However, we observe that the margins (the total number of ones in each row) can
be easily counted. This naturally leads to the conjecture that one might (often
considerably) improve the estimation accuracy by taking advantage of the known
margins. The next question is how.
Two approaches :
1. Maximum likelihood estimator (MLE): accurate but fairly complicated.
2. Iterative proportional scaling (IPS): simple but usually not as accurate.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 30
An Example of IPS for 2 by 2 Tables
Sample contingency table: n11, n12, n21, n22.
The steps of IPS
(1) Modify the counts to satisfy the row margins.
(2) Modify the counts to satisfy the column margins.
(3) Iterate until some stopping criterion is met.
An example: n11 = 30, n12 = 5, n21 = 10, n22 = 10, D = 600.
M1 = N11 + N12 = 400, M2 = N11 + N21 = 300.
In the first iteration: N11 ← M1 · n11/(n11 + n12) = 400 × 30/35 = 342.8571.
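A minimal Matlab sketch of steps (1)-(3), using the numbers above (the stopping tolerance and iteration cap are my own choices):

    % Minimal sketch of IPS for a 2-by-2 table, using the numbers above.
    T  = [30 5; 10 10];            % sample counts [n11 n12; n21 n22]
    N  = 600;                      % total original count (D above)
    M1 = 400;  M2 = 300;           % known margins M1 = N11+N12, M2 = N11+N21
    rowM = [M1; N - M1];           % target row sums
    colM = [M2, N - M2];           % target column sums
    for iter = 1:50
        Told = T;
        r = rowM ./ sum(T,2);  T = T .* [r r];     % (1) match row margins
        c = colM ./ sum(T,1);  T = T .* [c; c];    % (2) match column margins
        if sum(abs(T(:) - Told(:))) < 1e-10, break; end  % (3) stopping criterion
    end
    T    % should approach the values shown in the iteration tables that follow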
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 31
Iteration 1
342.8571 57.1429
100.0000 100.0000
232.2581 109.0909
67.7419 190.9091
Iteration 2
272.1649 127.8351
52.3810 147.6190
251.5807 139.2265
48.4193 160.7735
Iteration 3
257.4985 142.5015
46.2916 153.7084
254.2860 144.3248
45.7140 155.6752
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 32
Iteration 4
255.1722 144.8278
45.3987 154.6013
254.6875 145.1039
45.3125 154.8961
Iteration 5
254.8204 145.1796
45.2653 154.7347
254.7477 145.2211
45.2523 154.7789
Iteration 6
254.7676 145.2324
45.2453 154.7547
254.7567 145.2386
45.2433 154.7614
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 33
Error = |current step - previous step counts|, sum over four cells.
[Plot: absolute error (log scale) versus IPS iteration; the error drops to roughly 10^{−12} within about 20 iterations.]
IPS converges fast, and it always converges.
But how good are the estimates? My general observation is that IPS is very good
for 2-by-2 tables and that the accuracy (compared to the MLE) decreases as the table
size increases.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 34
The MLE for 2 by 2 Table with Known Margins
Total samples : n = n11 + n12 + n21 + n22
Total original counts : N = N11 + N12 + N21 + N22, i.e., πij = Nij/N .
Sample contingency table: n11, n12, n21, n22. Original contingency table: N11, N12, N21, N22.
Margins: M1 = N11 + N12, M2 = N11 + N21.
If margins M1 and M2 are known, then only need to estimate N11.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 35
The likelihood
∝ (N11/N)^n11 (N12/N)^n12 (N21/N)^n21 (N22/N)^n22

The log likelihood
n11 log(N11/N) + n12 log(N12/N) + n21 log(N21/N) + n22 log(N22/N)
= n11 log(N11/N) + n12 log((M1 − N11)/N) + n21 log((M2 − N11)/N) + n22 log((N − M1 − M2 + N11)/N)

The MLE equation
n11/N11 − n12/(M1 − N11) − n21/(M2 − N11) + n22/(N − M1 − M2 + N11) = 0,
which is a cubic equation and can be solved either analytically or numerically.
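A minimal Matlab sketch of the numerical route, using fzero on the MLE equation (the counts and margins reuse the earlier IPS example purely as an assumed illustration):

    % Minimal sketch: solve the MLE equation for N11 numerically with fzero.
    % Counts and margins reuse the earlier IPS example (an assumption here).
    n11 = 30; n12 = 5; n21 = 10; n22 = 10;
    N = 600; M1 = 400; M2 = 300;
    score = @(N11) n11./N11 - n12./(M1 - N11) - n21./(M2 - N11) ...
                  + n22./(N - M1 - M2 + N11);
    % N11 must lie strictly between max(0, M1+M2-N) and min(M1, M2).
    lo = max(0, M1 + M2 - N) + 1e-6;
    hi = min(M1, M2) - 1e-6;
    N11_mle = fzero(score, [lo, hi])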
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 36
Error Analysis
To assess the quality of an estimator θ̂ of θ, it is common to use bias, variance,
and MSE (mean square error):

Bias: E(θ̂) − θ
Var: E(θ̂ − E(θ̂))² = E(θ̂²) − E²(θ̂)
MSE: E(θ̂ − θ)² = Var + Bias²

The last equality is the bias-variance decomposition. For unbiased estimators,
it is desirable to have as small a variance as possible. As the sample size increases,
the MLE (under certain conditions) becomes unbiased and achieves the smallest
variance. Therefore, the MLE is often a desirable estimator. However, in some
cases, biased estimators may achieve smaller MSE than the MLE.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 37
The Expectations and Variances of Common Distributions
The derivations of variances are not required in this course. Nevertheless, it is
useful to know the expectations and variances of common distributions.
• Binomial : X ∼ Binomial(n, p), E(X) = np, Var(X) = np(1 − p).
• Normal : X ∼ N(µ, σ²), E(X) = µ, Var(X) = σ².
• Chi-square : X ∼ χ²(k), E(X) = k, Var(X) = 2k.
• Exponential : X ∼ exp(λ), E(X) = 1/λ, Var(X) = 1/λ².
• Poisson : X ∼ Pois(λ), E(X) = λ, Var(X) = λ.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 38
Multinomial Distribution
The multinomial is a natural extension of the binomial distribution. For example,
the 2 by 2 contingency table is often assumed to follow the multinomial distribution.

Consider c cells and denote the observations by (n1, n2, ..., nc), which follow a
c-cell multinomial distribution with the underlying probabilities (π1, π2, ..., πc)
(with ∑_{i=1}^c πi = 1). Denote n = ∑_{i=1}^c ni. We write
(n1, n2, ..., nc) ∼ Multinomial(n, π1, π2, ..., πc)

The expectations are (for i = 1 to c and i ≠ j)
E(ni) = nπi, Var(ni) = nπi(1 − πi), Cov(ni, nj) = −nπiπj.
Note that the cells are negatively correlated.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 39
Variances of the 2 by 2 Contingency Table Estimates
Using previous notation, the MLE estimator of N11 is
N̂11 = (n11/n) N, with (n11, n12, n21, n22) ∼ Multinomial(n, π11, π12, π21, π22)

Using the general equalities about the expectations,
E(aX) = aE(X), Var(aX) = a² Var(X),
we know
E(N̂11) = (N/n) E(n11) = (N/n) nπ11 = Nπ11 = N11
Var(N̂11) = (N²/n²) Var(n11) = (N²/n²) nπ11(1 − π11) = (N²/n) π11(1 − π11)
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 40
The Asymptotic Variance of the MLE Using Margins
When the margins are known: M1 = N11 + N12, M2 = N11 + N21.

The MLE equation
n11/N11 − n12/(M1 − N11) − n21/(M2 − N11) + n22/(N − M1 − M2 + N11) = 0.

The asymptotic variance of the solution, denoted by N̂11,M, can be shown to be
Var(N̂11,M) = (N/n) / [ 1/N11 + 1/N12 + 1/N21 + 1/N22 ]

which is smaller than the variance of the MLE without using margins.
———-
What about the variance of IPS? There is no closed-form answer, and the estimates are
usually biased.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 41
Logistic Regression
Logistic regression is one of the most widely used statistical tools for predicting
categorical outcomes.

General setup for binary logistic regression

n observations: {xi, yi}, i = 1 to n. xi can be a vector.
yi ∈ {0, 1}. For example, “1” = “YES” and “0” = “NO”.

Define
p(xi) = Pr(yi = 1|xi) = π(xi),
i.e., Pr(yi = 0|xi) = 1 − p(xi).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 42
The major assumption of logistic regression
log[p(xi)/(1 − p(xi))] = β0 + β1 xi,1 + ... + βp xi,p = ∑_{j=0}^p βj xi,j.

Here, we treat xi,0 = 1. We can also use vector notation to write
log[p(xi; β)/(1 − p(xi; β))] = xiβ.

Here, we view xi as a row vector and β as a column vector.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 43
The model in vector notation
p(xi; β) = e^{xiβ}/(1 + e^{xiβ}),   1 − p(xi; β) = 1/(1 + e^{xiβ})

Log likelihood for the ith observation:
li(β|xi) = (1 − yi) log[1 − p(xi; β)] + yi log p(xi; β)
= log p(xi; β) if yi = 1, and log[1 − p(xi; β)] if yi = 0.

To understand this, consider a binomial with only one sample, binomial(1, p(xi))
(i.e., Bernoulli). When yi = 1, the log likelihood is log p(xi) and when yi = 0,
the log likelihood is log(1 − p(xi)). These two formulas can be written as one.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 44
Joint log likelihood for n observations:

l(β|x1, ..., xn) = ∑_{i=1}^n li(β|xi)
= ∑_{i=1}^n (1 − yi) log[1 − p(xi; β)] + yi log p(xi; β)
= ∑_{i=1}^n yi log[p(xi; β)/(1 − p(xi; β))] + log[1 − p(xi; β)]
= ∑_{i=1}^n yi xiβ − log(1 + e^{xiβ})
The remaining task is to solve the optimization problem by MLE.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 45
Logistic Regression with Only One Variable
Basic assumption
logit(π(xi)) = log[p(xi; β)/(1 − p(xi; β))] = β0 + β1 xi

Joint log likelihood
l(β|x1, ..., xn) = ∑_{i=1}^n [ yi(β0 + β1 xi) − log(1 + e^{β0 + β1 xi}) ]

Next, we solve the optimization problem of maximizing the joint likelihood, given
the data.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 46
First derivatives
∂l(β)/∂β0 = ∑_{i=1}^n [yi − p(xi)],   ∂l(β)/∂β1 = ∑_{i=1}^n xi [yi − p(xi)]

Second derivatives
∂²l(β)/∂β0² = −∑_{i=1}^n p(xi)(1 − p(xi))
∂²l(β)/∂β1² = −∑_{i=1}^n xi² p(xi)(1 − p(xi))
∂²l(β)/∂β0∂β1 = −∑_{i=1}^n xi p(xi)(1 − p(xi))
Solve the MLE by Newton’s Method or steepest descent (two-dim problem).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 47
Logistic Regression without Intercept ( β0 = 0)
The simplified model
logit(π(xi)) = log[p(xi)/(1 − p(xi))] = βxi

Equivalently,
p(xi) = e^{βxi}/(1 + e^{βxi}) = π(xi),   1 − p(xi) = 1/(1 + e^{βxi})

Joint log likelihood for n observations:
l(β|x1, ..., xn) = ∑_{i=1}^n [ xi yi β − log(1 + e^{βxi}) ]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 48
First derivative
l′(β) = ∑_{i=1}^n xi [yi − p(xi)]

Second derivative
l′′(β) = −∑_{i=1}^n xi² p(xi)(1 − p(xi))

Newton's Method updating formula
βt+1 = βt − l′(βt)/l′′(βt)

Steepest descent (in fact ascent) updating formula
βt+1 = βt + ∆ l′(βt)
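A minimal Matlab sketch of both updates for this no-intercept model; the data are from the numerical example on the next slide, while the starting value, step size, and iteration counts are my own assumptions:

    % Minimal sketch: fit logit(p(x)) = beta*x by Newton's method and by
    % steepest ascent; data are from the numerical example that follows.
    x = [8 14 -7 6 5 6 -5 1 0 -17]';
    y = [1 1 0 0 1 0 1 0 0 0]';
    p = @(b) 1 ./ (1 + exp(-b*x));             % p(x_i) for a scalar beta
    % Newton's method
    b = 0.2;                                    % starting point (assumed)
    for t = 1:50
        g = sum(x .* (y - p(b)));               % l'(beta)
        H = -sum(x.^2 .* p(b) .* (1 - p(b)));   % l''(beta)
        bnew = b - g/H;                         % Newton step
        if abs(bnew - b) < 1e-10, break; end
        b = bnew;
    end
    b_newton = b
    % Steepest ascent with step size Delta
    b = 0.2;  Delta = 0.015;                    % assumed values
    for t = 1:200
        b = b + Delta * sum(x .* (y - p(b)));   % beta_{t+1} = beta_t + Delta*l'(beta_t)
    end
    b_ascent = b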
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 49
A Numerical Example of Logistic Regression
Data
x = {8, 14, −7, 6, 5, 6, −5, 1, 0, −17}, y = {1, 1, 0, 0, 1, 0, 1, 0, 0, 0}.

[Plot: the log likelihood function l(β) over β ∈ [−1, 1].]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 50
[Plots: βt versus iteration for Newton's Method and Steepest Descent;
left: ∆ = 0.015, β0 = 0.2; right: ∆ = 0.02, β0 = 0.32.]
Steepest descent is quite sensitive to the step size ∆.
Too large ∆ leads to oscillation.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 51
[Plots: βt versus iteration for Newton's Method and Steepest Descent;
left: ∆ = 0.02, β0 = 0.33; right: ∆ = 0.02, β0 = 10.]
Newton's Method is sensitive to the starting point β0; it may not converge at all.
The starting point (mostly) only affects the computing time of steepest descent.
——————
In general, with multiple variables, we need to use the matrix formulation, which in
fact is easier to implement in matlab.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 52
Newton's Method for Logistic Regression with β0 and β1
Analogous to the one-variable case, the Newton update formula is

βnew = βold − [ (∂²l(β)/∂β∂βᵀ)^{−1} ∂l(β)/∂β ], evaluated at βold,

where β = (β0, β1)ᵀ, and

∂l(β)/∂β = [ ∑_{i=1}^n (yi − p(xi)) ; ∑_{i=1}^n xi (yi − p(xi)) ]
= [1 x1; 1 x2; ...; 1 xn]ᵀ [y1 − p(x1); y2 − p(x2); ...; yn − p(xn)] = Xᵀ(y − p)
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 53
∂²l(β)/∂β∂βᵀ = [ −∑ p(xi)(1 − p(xi)), −∑ xi p(xi)(1 − p(xi)) ; −∑ xi p(xi)(1 − p(xi)), −∑ xi² p(xi)(1 − p(xi)) ] = −XᵀWX

W = diag( p(x1)(1 − p(x1)), p(x2)(1 − p(x2)), ..., p(xn)(1 − p(xn)) )
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 54
Multivariate Logistic Regression Solution in Matrix Form
Newton's update formula
βnew = βold − [ (∂²l(β)/∂β∂βᵀ)^{−1} ∂l(β)/∂β ], evaluated at βold,

where, in matrix form,
∂l(β)/∂β = ∑_{i=1}^n xiᵀ (yi − p(xi; β)) = Xᵀ(y − p)
∂²l(β)/∂β∂βᵀ = −∑_{i=1}^n xiᵀ xi p(xi; β)(1 − p(xi; β)) = −XᵀWX

We can write the update formula in matrix form:
βnew = [XᵀWX]^{−1} XᵀW z,   z = Xβold + W^{−1}(y − p)
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 55
X = [ 1 x1,1 x1,2 ... x1,p ; 1 x2,1 x2,2 ... x2,p ; ... ; 1 xn,1 xn,2 ... xn,p ] ∈ R^{n×(p+1)}

W = diag( p1(1 − p1), p2(1 − p2), ..., pn(1 − pn) ) ∈ R^{n×n}

where pi = p(xi; βold).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 56
Derivation
βnew = βold − [ (∂²l(β)/∂β∂βᵀ)^{−1} ∂l(β)/∂β ], evaluated at βold
= βold + [XᵀWX]^{−1} Xᵀ(y − p)
= [XᵀWX]^{−1} [XᵀWX] βold + [XᵀWX]^{−1} Xᵀ(y − p)
= [XᵀWX]^{−1} XᵀW ( Xβold + W^{−1}(y − p) )
= [XᵀWX]^{−1} XᵀW z

Note that [XᵀWX]^{−1} XᵀW z looks a lot like (weighted) least squares.
Two major practical issues:
• The inverse may not (and usually does not) exist, especially with large datasets.
• Newton update steps may be too aggressive and lead to divergence.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 57
Fitting Logistic Regression with a Learning Rate
At time t, update the coefficient vector:
βt = β(t−1) + ν [XᵀWX]^{−1} Xᵀ(y − p), evaluated at t − 1,

where
W = diag[pi(1 − pi)], i = 1 to n.
The magic parameter ν can be viewed as the learning rate to help make sure that
the procedure converges. Practically, it is often set to be ν = 0.1.
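A minimal Matlab sketch of this damped (IRLS-style) update; X is assumed to be the n × (p+1) data matrix with a leading column of ones and y the 0/1 response vector, and the iteration count is arbitrary:

    % Minimal sketch of the damped Newton update for binary logistic regression.
    % X (n by p+1, first column all ones) and y (n by 1, entries 0/1) assumed given.
    nu   = 0.1;                                   % learning rate
    beta = zeros(size(X,2), 1);                   % start at beta = 0
    for t = 1:100
        p = 1 ./ (1 + exp(-X*beta));              % current probabilities p_i
        W = diag(p .* (1 - p));                   % W = diag[p_i(1-p_i)]
        beta = beta + nu * ((X'*W*X) \ (X'*(y - p)));   % damped Newton step
    end

The backslash solve is used instead of forming the inverse explicitly; that is a standard numerical choice, not something the slides prescribe.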
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 58
Revisit The Simple Example with Only One β
Data
x = {8, 14, −7, 6, 5, 6, −5, 1, 0, −17}, y = {1, 1, 0, 0, 1, 0, 1, 0, 0, 0}.

[Plot: the log likelihood function l(β) over β ∈ [−1, 1].]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 59
Newton’s Method with Learning Rate ν = 1
[Plot: βt versus iteration with ν = 1, for β0 = 0.2, 0.32, and 0.33.]

When the initial β0 = 0.32, the method converges. When β0 = 0.33, it does not
converge.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 60
Newton’s Method with Learning Rate ν = 0.1
[Plot: βt versus iteration with ν = 0.1, for β0 = 0.2, 0.32, and 0.33.]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 61
Fitting Logistic Regression With Regularization
The almost correct update formula:
βt = β(t−1) + ν [XᵀWX + λI]^{−1} Xᵀ(y − p), evaluated at t − 1.

Adding the regularization parameter λ usually improves the numerical stability
and sometimes may even result in better test errors.

There are also good statistical interpretations.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 62
Fitting Logistic Regression With Regularization
The update formula:
βt = β(t−1) + ν [XᵀWX + λI]^{−1} [Xᵀ(y − p) − λβ], evaluated at t − 1.

To understand the formula, consider the following modified (regularized) likelihood
function:
l(β) = ∑_{i=1}^n [ yi log pi + (1 − yi) log(1 − pi) ] − (λ/2) ∑_{j=0}^p βj²
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 63
Newton’s Method with No Regularization λ = 0 (ν = 0.1)
[Plot: βt versus iteration, ν = 0.1, λ = 0, for β0 = 1, 5, and 10.]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 64
Newton’s Method with Regularization λ = 1 (ν = 0.1)
[Plot: βt versus iteration, ν = 0.1, λ = 1, for β0 = 1, 5, and 10.]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 65
Newton’s Method with Regularization λ = 1 (ν = 0.1)
[Plot: βt versus iteration, ν = 0.1, λ = 1, for β0 = 10, 20, and 50.]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 66
Crab Data Analysis (from Agresti’s Book)
Color (C) Spine (S) Width (W, cm) Weight (Wt, Kg) # Satellites (Sa)
2 3 28.3 3.05 8
3 3 22.5 1.55 0
1 1 26.0 2.30 9
3 3 24.8 2.10 0
3 3 26.0 2.60 4
2 3 23.8 2.10 0
1 1 26.5 2.35 0
3 2 24.7 1.90 0
...
It is natural to view color as a (nominal) categorical variable and weight and width
as numerical variables. The distinction, however, is often not clear in practice.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 67
Logistic regression for Sa classification using width only
y = 1 if Sa > 0, y = 0 if Sa = 0. Only one variable x = W. The task is to
compute Pr(y = 1|x) and classify the data using a simple classification rule:
ŷi = 1 if pi > 0.5.
Using our own Matlab code, the fitted model is
p(xi) = e^{−12.3108 + 0.497 xi} / (1 + e^{−12.3108 + 0.497 xi})

If we choose not to include the intercept term, the fitted model becomes
p(xi) = e^{0.02458 xi} / (1 + e^{0.02458 xi})
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 68
Training mis-classification errors
[Plot: training mis-classification error (%) versus iteration, for λ = 0, 1, and 5.]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 69
Training log likelihood
[Plot: training log likelihood versus iteration, for λ = 0, 1, and 5.]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 70
Logistic regression for Sa classification using S, W, and Wt
Using our own Matlab code, the fitted model is
p(S, W, Wt) = e^{−9.4684 + 0.0495 S + 0.3054 W + 0.8447 Wt} / (1 + e^{−9.4684 + 0.0495 S + 0.3054 W + 0.8447 Wt})
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 71
Training mis-classification errors
[Plot: training mis-classification error (%) versus iteration, for λ = 0, 1, and 5.]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 72
Training log likelihood
[Plot: training log likelihood versus iteration, for λ = 0, 1, and 5.]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 73
Multi-Class Logistic Regression
Data: {xi, yi}, i = 1 to n, with the data matrix in R^{n×p} and yi ∈ {0, 1, 2, ..., K − 1}.

Probability model
pi,k = Pr(yi = k | xi), k = 0, 1, ..., K − 1,
with ∑_{k=0}^{K−1} pi,k = 1 (only K − 1 degrees of freedom).

Label assignment
ŷi | xi = argmax_k pi,k
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 74
Multi-Class Example: USPS ZipCode Recognition
[Figure: sample 16 × 16 handwritten digit images (0 through 9) written by three different people.]
This task can be cast (simplified) as a K = 10-class classification problem.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 75
Multinomial Logit Probability Model
pi,k = e^{Fi,k} / ∑_{s=0}^{K−1} e^{Fi,s}

where Fi,k = Fk(xi) is the function to be learned from the data.

Linear logistic regression: Fi,k = Fk(xi) = xi βk.
Note that βk = [βk,0, βk,1, ..., βk,p]ᵀ.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 76
Multinomial Maximum Likelihood
Multinomial likelihood: suppose yi = k,
Lik ∝ pi,0^0 × ... × pi,k^1 × ... × pi,K−1^0 = pi,k

Log likelihood:
li = log pi,k, if yi = k

Total log likelihood in a double-summation form:
l(β) = ∑_{i=1}^n li = ∑_{i=1}^n ∑_{k=0}^{K−1} ri,k log pi,k

ri,k = 1 if yi = k, and 0 otherwise.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 77
Derivatives of Multi-Class Log-likelihood
The first derivative:
∂li/∂Fi,k = ri,k − pi,k

Proof:
∂pi,k/∂Fi,k = ( [∑_{s=0}^{K−1} e^{Fi,s}] e^{Fi,k} − e^{2Fi,k} ) / [∑_{s=0}^{K−1} e^{Fi,s}]² = pi,k (1 − pi,k)

∂pi,k/∂Fi,t = −e^{Fi,k} e^{Fi,t} / [∑_{s=0}^{K−1} e^{Fi,s}]² = −pi,k pi,t
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 78
∂li/∂Fi,k = ∑_{s=0}^{K−1} ri,s (1/pi,s) ∂pi,s/∂Fi,k
= ri,k (1/pi,k) pi,k(1 − pi,k) + ∑_{s≠k} ri,s (1/pi,s) ∂pi,s/∂Fi,k
= ri,k(1 − pi,k) − ∑_{s≠k} ri,s pi,k = ri,k − ∑_{s=0}^{K−1} ri,s pi,k = ri,k − pi,k

The second derivatives:
∂²li/∂Fi,k² = −pi,k (1 − pi,k),   ∂²li/∂Fi,k∂Fi,s = −pi,k pi,s
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 79
Multi-class logistic regression can be fairly complicated. Here, we introduce a
simpler approach, which does not seem to appear explicitly in common textbooks.

Conceptually, we fit K binary classification problems (one vs. rest) at each
iteration. That is, at each iteration, we update βk separately for each class. At the
end of each iteration, we jointly update the probabilities pi,k = e^{xi βk} / ∑_{s=0}^{K−1} e^{xi βs}.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 80
A Simple Implementation for Multi-Class Logistic Regression

At time t, update each coefficient vector:
βk(t) = βk(t−1) + ν [XᵀWkX]^{−1} Xᵀ(rk − pk), evaluated at t − 1,

where
rk = [r1,k, r2,k, ..., rn,k]ᵀ
pk = [p1,k, p2,k, ..., pn,k]ᵀ
Wk = diag[pi,k(1 − pi,k)], i = 1 to n

Then update pk, Wk for the next iteration.

Again, the magic parameter ν can be viewed as the learning rate to help make
sure that the procedure converges. Practically, it is often set to be ν = 0.1.
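A minimal Matlab sketch of this one-vs-rest scheme; X (n × (p+1), leading column of ones) and y (labels 0,...,K−1) are assumed given, the iteration count is arbitrary, and the regularization λ of the next slides is omitted here:

    % Minimal sketch of the simple multi-class updating scheme above.
    [n, d] = size(X);  K = max(y) + 1;  nu = 0.1;
    B = zeros(d, K);                           % column k holds beta_k
    R = zeros(n, K);                           % indicator matrix r_{i,k}
    R(sub2ind([n K], (1:n)', y + 1)) = 1;
    for t = 1:100
        F = X*B;                               % F_{i,k} = x_i * beta_k
        P = exp(F) ./ repmat(sum(exp(F), 2), 1, K);   % p_{i,k}
        for k = 1:K                            % update each beta_k separately
            w  = P(:,k) .* (1 - P(:,k));
            Wk = diag(w);
            B(:,k) = B(:,k) + nu * ((X'*Wk*X) \ (X'*(R(:,k) - P(:,k))));
        end
    end
    [~, pred] = max(X*B, [], 2);  pred = pred - 1;    % predicted labels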
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 81
Logistic Regression With L2 Regularization
Total log likelihood in a double-summation form:
l(β) = ∑_{i=1}^n ∑_{k=0}^{K−1} ri,k log pi,k − (λ/2) ∑_{k=0}^{K−1} ∑_{j=0}^{d} βk,j²

ri,k = 1 if yi = k, and 0 otherwise.

Let g(β) = (λ/2) ∑_{k=0}^{K−1} ∑_{j=0}^{d} βk,j². Then
∂g(β)/∂βk,j = λ βk,j,   ∂²g(β)/∂βk,j² = λ
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 82
At time t, the updating formula becomes
βk(t) = βk(t−1) + ν [XᵀWkX + λI]^{−1} [Xᵀ(rk − pk) − λβk], evaluated at t − 1.

L2 regularization sometimes improves the numerical stability and sometimes
may even result in better test errors.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 83
Logistic Regression Results on Zip Code Data
Zip code data: 7291 training examples in 256 dimensions. 2007 test examples.
[Plot: training mis-classification error (%) versus iteration, for λ = 0, 5, and 10.]
With no regularization (λ = 0), numerical problems may occur.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 84
[Plot: testing mis-classification error (%) versus iteration, for λ = 0, 5, and 10.]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 85
[Plot: training log likelihood versus iteration, for λ = 5 and 10.]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 86
Another Example on Letter ( K = 26) Recognition
Letter dataset: 2000 training samples in 16 dimensions. 18000 testing samples.
[Plot: training error rate versus iteration, for λ = 0, 1, and 10 (letter-train data, ν = 0.1).]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 87
[Plot: testing error rate versus iteration, for λ = 0, 1, and 10 (letter-test data, ν = 0.1).]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 88
[Plot: training log-likelihood versus iteration, for λ = 0, 1, and 10 (letter-train data, ν = 0.1).]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 89
Revisit Crab Data as a Multi-Class Problem
Color (C) Spine (S) Width (W, cm) Weight (Wt, Kg) # Satellites (Sa)
2 3 28.3 3.05 8
3 3 22.5 1.55 0
1 1 26.0 2.30 9
3 3 24.8 2.10 0
3 3 26.0 2.60 4
2 3 23.8 2.10 0
1 1 26.5 2.35 0
3 2 24.7 1.90 0
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 90
[Histogram: frequency of the crab satellite counts (Sa).]
It appears reasonable to treat this as a binary classification problem, given the
counts distribution and the number of samples. Nevertheless, it might still be interesting to
consider it as a multi-class problem.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 91
We consider a 6-class (0 to 5) classification problem by grouping all samples with
counts ≥ 5 as class 5. We use 3 variables (S, W, Wt).

[Plot: training mis-classification error (%) versus iteration, λ = 0, ν = 0.1.]
Compared to the binary-classification problem, it seems the mis-classification
error is much higher. Why?
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 92
Some thoughts
• Multi-class problems are usually (but not always) more difficult.
• For binary classification, an error rate of 50% is very bad because a random
guess can achieve that. For a K-class problem, the error rate of random
guessing would be 1 − 1/K (5/6 in this example). So the results may
actually not be too bad.
• Multi-class models are more complex (in that they require more parameters)
and need more data samples. The crab dataset is very small.
• This problem may actually be ordinal classification instead of nominal, for
biological reasons.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 93
Dealing with Nominal Categorical Variables
It might be reasonable to consider “Color (C)” as a nominal categorical variable.
Then how can we include it in our logistic regression model?

The trick is simple. Suppose the color variable takes four different values. We add
four binary variables (i.e., variables taking values only in {0, 1}). For one particular sample,
only one of the four variables will take the value 1.

This is basically the same trick as expanding the y in multi-class logistic
regression.
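A minimal Matlab sketch of this dummy-coding trick; C is an assumed n × 1 vector of color codes in {1, 2, 3, 4}, and the variable names in the final comment (S, W, Wt) are assumptions about how the crab columns are stored:

    % Minimal sketch of expanding a nominal variable into binary (dummy) columns.
    % C is assumed to be an n-by-1 vector of color codes taking values 1,2,3,4.
    n = length(C);
    D = zeros(n, 4);                       % one column per color
    D(sub2ind([n 4], (1:n)', C)) = 1;      % exactly one 1 per row
    % D can then be appended to the other predictors, e.g. Xfull = [D S W Wt];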
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 94
Adding Color as Four Binary Variables
C1 C2 C3 C4 S W Wt Sa
0 1 0 0 3 28.3 3.05 8
0 0 1 0 3 22.5 1.55 0
1 0 0 0 1 26.0 2.30 9
0 0 1 0 3 24.8 2.10 0
0 0 1 0 3 26.0 2.60 4
0 1 0 0 3 23.8 2.10 0
1 0 0 0 1 26.5 2.35 0
0 0 1 0 2 24.7 1.90 0
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 95
Adding the color variable noticeably reduced the (binary) classification error.
Colors Not Included vs. Colors Included:
[Plots: training mis-classification error (%) versus iteration, λ = 1e−10, ν = 0.1, without and with the color variables.]
Here, to minimize the effect of regularization, only λ = 10^{−10} is used, just
enough to ensure numerical stability.

Logistic regression does not directly minimize mis-classification errors. The log
likelihood probably better illustrates the effect of adding the color variable.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 96
Adding the color variable noticeably improved the log likelihood.
Colors Not Included vs. Colors Included:
[Plots: training log likelihood versus iteration, λ = 1e−10, ν = 0.1, without and with the color variables.]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 97
Adding Pairwise (Interaction) Variables
Feature expansion is a common trick to boost the performance. For example,
(x1, x2, x3, ..., xp) ⟹ (x1, x2, x3, ..., xp, x1², x1x2, ..., x1xp, x2², x2x3, ..., x2xp, ..., xp²)

In other words, the original p variables can be expanded to
p + p(p + 1)/2 variables.
The expansion often helps, but not always. In general, when the number of
examples n is large, feature expansion usually helps.
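A minimal Matlab sketch of this pairwise expansion (X is assumed to be the n × p predictor matrix without the intercept column; growing the matrix in a loop is kept for clarity, not efficiency):

    % Minimal sketch of pairwise (interaction) feature expansion.
    [n, p] = size(X);
    Xexp = X;
    for j = 1:p
        for k = j:p
            Xexp = [Xexp, X(:,j) .* X(:,k)];   % adds x_j*x_k (includes x_j^2 when k = j)
        end
    end
    % Xexp now has p + p*(p+1)/2 columns.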
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 98
Adding Pairwise Interactions on Crab Data
Adding all pairwise (interaction) variables only helps slightly in terms of the log
likelihood (red denotes using only the original variables).
[Plots: training mis-classification error (%) and training log likelihood versus iteration, ν = 0.1, with λ = 1e−8 or 1e−10.]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 99
Simplify Label Assignments
Recall label assignment in logistic regression:
ŷi | xi = argmax_k pi,k

and the probability model of logistic regression:
pi,k = e^{xi βk} / ∑_{s=0}^{K−1} e^{xi βs}

It is equivalent to assign labels directly by
ŷi | xi = argmax_k xi βk
This raises an interesting question: maybe we don’t need a probability model for
the purpose of classification? For example, a linear regression may be sufficient?
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 100
Linear Regression and Its Applications in Classification
Both linear regression and logistic regression are examples of
Generalized Linear Models (GLM) .
We first review linear regression and then discuss how to use it for (multi-class)
classification.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 101
Review Linear Regression
Given data {xi, yi}, i = 1 to n, where xi is a p-dimensional vector and yi is a scalar
(not limited to categories).

We again construct the data matrix
X = [ 1 x1,1 x1,2 ... x1,p ; 1 x2,1 x2,2 ... x2,p ; ... ; 1 xn,1 xn,2 ... xn,p ],   y = [y1; y2; ...; yn]

The data model is
y = Xβ
β (a vector of length p + 1) is obtained by minimizing the mean square errors
(equivalent to maximizing the joint likelihood under the normal distribution model).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 102
Linear Regression Estimation by Least Square
The idea is to minimize the mean square errors
MSE(β) = ∑_{i=1}^n |yi − xiβ|² = (Y − Xβ)ᵀ(Y − Xβ)

We can find the optimal β by setting the first derivative to zero:
∂MSE(β)/∂β ∝ Xᵀ(Y − Xβ) = 0
⟹ XᵀY = XᵀXβ
⟹ β̂ = (XᵀX)^{−1}XᵀY
Don’t worry much about how to do matrix derivatives. The trick is to view this
simply as a scalar derivative but we need to manipulate the order (and add
transposes) to get the dimensions correct.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 103
Ridge Regression
Similar to l2-regularized logistic regression, we can add a regularization
parameter
β̂ = (XᵀX + λI)^{−1}XᵀY
which is known as ridge regression .
Adding regularization not only improves the numerical stability but also often
increases the test accuracy.
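A minimal Matlab sketch of both estimates; X (n × (p+1) with a leading column of ones) and y are assumed given, and the value of λ is an arbitrary illustration:

    % Minimal sketch of least squares and ridge regression estimates.
    beta_ls    = (X'*X) \ (X'*y);                          % ordinary least squares
    lambda     = 1;                                        % assumed regularization value
    beta_ridge = (X'*X + lambda*eye(size(X,2))) \ (X'*y);  % ridge regression
    yhat       = X * beta_ridge;                           % fitted values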
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 104
Linear Regression for Classification
For binary classification, i.e., yi ∈ {0, 1}, we can simply treat yi as a numerical
response and fit a linear regression. To obtain the classification result, we can
simply use ŷ = 0.5 as the classification threshold.
Multi-class classification (with K classes) is more interesting. We can use exactly
the same trick as in multi-class logistic regression by first expanding the yi into a
vector of length K with only one entry being 1 and then fitting K binary linear
regressions simultaneously and using the location of the maximum fitted value as
the class label prediction. Since you have completed the homework in multi-class
logistic regression, this idea should be straightforward now. Also see sample
code.
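A minimal Matlab sketch of the multi-class idea just described (this is an assumed illustration, not the course's sample code); X with a leading column of ones and labels y in 0,...,K−1 are assumed given, and the tiny ridge term is only for numerical stability:

    % Minimal sketch of multi-class classification by K simultaneous linear regressions.
    [n, d] = size(X);  K = max(y) + 1;  lambda = 1e-10;    % tiny ridge term (assumed)
    Y = zeros(n, K);                                       % expanded 0/1 response matrix
    Y(sub2ind([n K], (1:n)', y + 1)) = 1;
    B = (X'*X + lambda*eye(d)) \ (X'*Y);                   % K regressions at once
    [~, pred] = max(X*B, [], 2);  pred = pred - 1;         % class with largest fitted value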
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 105
Mis-Classification Errors on Zipcode Data
[Plot: train and test mis-classification error (%) versus the regularization parameter λ.]
• This is essentially the first iteration of multi-class logistic regression. Clearly,
the results are not as good as logistic regression with many iterations.
• Adding regularization (λ) slightly increases the training errors but decreases
the testing errors over a certain range.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 106
Linear Regression Classification on Crab Data
Binary classification. 50% of the data points are used for training and the rest for
testing. Three models are compared:
• Model using S, W, and Wt.
• Model using the above three as well as colors.
• Model using all four plus all pairwise interactions.
Both linear regression and logistic regression are tried. For logistic
regression, we use ν = 0.1 and only report the errors at the 100th iteration.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 107
Spine, Width, and Weight:
[Plot: mis-classification error (%) versus λ, for Linear Regression and Logistic Regression.]
Linear regression and logistic regression produce almost the same results.
Regularization does not appear to be helpful in this example.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 108
Colors Included:
[Plot: mis-classification error (%) versus λ, for Linear Regression and Logistic Regression.]
Linear regression seems to be even slightly better
Regularization still does not appear to be helpful.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 109
Colors and Pairwise Interactions Included:
[Plot: mis-classification error (%) versus λ, for Linear Regression and Logistic Regression.]
Now logistic regression seems to be slightly better
Regularization really helps.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 110
Limitations of Using Linear Regression for Classification
• For many datasets, the classification accuracies of using linear regressions
are actually quite similar to using logistic regressions, especially when the
datasets are “not so good.”
• However, for many “good” datasets (such as zip code data), logistic
regressions may have some noticeable advantages.
• Linear regression does not (directly) provide a probabilistic interpretation of
the classification results, which may be needed in many applications, for
example, learning to rank using classification.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 111
Poisson Log-Linear Model
Revisit the crab data. It appears very natural to model the Sa counts as a
Poisson random variable, which may be parameterized by a linear model.
Color (C) Spine (S) Width (W, cm) Weight (Wt, Kg) # Satellites (Sa)
2 3 28.3 3.05 8
3 3 22.5 1.55 0
1 1 26.0 2.30 9
3 3 24.8 2.10 0
3 3 26.0 2.60 4
2 3 23.8 2.10 0
1 1 26.5 2.35 0
3 2 24.7 1.90 0
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 112
Poisson Distribution
Denote Y ∼ Poisson(µ). The probability mass function (PMF) is
Pr(Y = y) = e^{−µ} µ^y / y!,   y = 0, 1, 2, ...

E(Y) = µ, Var(Y) = µ

One drawback of the Poisson model is that its variance is the same as the mean,
which often contradicts real data observations.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 113
Fitting Poisson Distribution
Given n observations, yi, i = 1 to n, the MLE of µ is simply the sample mean:
µ̂ = (1/n) ∑_{i=1}^n yi

Observed counts vs. fitted counts:
[Histograms: observed frequency of Sa (left) and the frequency fitted by a single Poisson (right).]
No need to perform any test. It is obviously not a good fit.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 114
Linear Regression for Predicting Counts
Maybe we can simply model
yi ∼ N(µi, σ²)
µi = xiβ = β0 + xi,1β1 + ... + xi,pβp

i.e., µi is the mean of a normal distribution N(µi, σ²).

This way, we can easily predict the counts by
β̂ = (XᵀX)^{−1} Xᵀy,   ŷ = Xβ̂
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 115
Histograms of the counts predicted by linear regression using only the
width. The sum of square errors (SE) is
SE = ∑_{i=1}^n (yi − ŷi)² = 1.5079 × 10³

[Histogram: counts predicted by linear regression.]
Clearly, linear regression can not possibly be the best approach.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 116
Poisson Regression Model
Assumption:
yi ∼ Poisson(µi)
log µi = xiβ = β0 + xi,1β1 + ... + xi,pβp
Note that this is very different from assuming that the logarithms of the counts
follow a linear regression model. Why?
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 117
[Histogram: counts predicted by Poisson regression.]
Clearly, this looks better than the histogram from linear regression.
However, the square error SE = ∑_{i=1}^n (yi − ŷi)² = 1.5373 × 10³ is actually
larger than the SE from linear regression. Why is it not too surprising?
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 118
Comparing Fitted Counts
y, ŷ by linear regression, ŷ by Poisson regression
8.0000 3.9344 3.8103
0 0.9916 1.4714
9.0000 2.7674 2.6127
0 2.1586 2.1459
4.0000 2.7674 2.6127
0 1.6512 1.8212
0 3.0211 2.8361
0 2.1079 2.1110
0 1.6005 1.7916
0 2.5645 2.4468
Need to see more rows to understand the differences...
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 119
Now we use 3 variables, S, W, Wt, to fit linear regression and Poisson regression.
[Histograms: counts predicted by linear regression (left) and by Poisson regression (right).]
Histograms by Poisson Regression
Clearly, Poisson regression looks better, although SE values are 1.4696× 103
and 1.5343× 103, respectively, for linear regression and Poisson regression.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 120
Fitting Poisson Regression
Log likelihood:
li = −µi + yi log µi = −e^{xiβ} + yi xiβ

First derivative:
∂li/∂β = (yi − µi) xiᵀ
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 121
Given n observations, the log likelihood is l = ∑_{i=1}^n li.

First derivative (matrix form):
∂l/∂β = Xᵀ(y − µ)

Second derivative (matrix form):
∂²l/∂β∂βᵀ = −XᵀWX

where W is the diagonal matrix with entries µi.
They look very similar to logistic regression.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 122
Newton’s Method for Solving Poisson Regression Model
βnew = βold − [ (∂²l(β)/∂β∂βᵀ)^{−1} ∂l(β)/∂β ], evaluated at βold

βt = β(t−1) + ν [XᵀWX]^{−1} [Xᵀ(y − µ)], evaluated at t − 1,

where again ν (e.g., 0.1) is a shrinkage parameter which helps the numerical
stability.
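A minimal Matlab sketch of this damped Newton update for Poisson regression; X (n × (p+1) with a leading column of ones) and the count vector y are assumed given, and the iteration count is arbitrary:

    % Minimal sketch of the damped Newton update for Poisson regression.
    nu   = 0.1;
    beta = zeros(size(X,2), 1);
    for t = 1:100
        mu   = exp(X*beta);                               % mu_i = exp(x_i*beta)
        W    = diag(mu);                                  % W is diagonal with entries mu_i
        beta = beta + nu * ((X'*W*X) \ (X'*(y - mu)));    % update formula above
    end
    muhat = exp(X*beta);                                  % fitted counts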
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 123
Why “Log Linear”?
Poisson model without log:
yi ∼ Poisson(µi)
µi = xiβ = β0 + xi,1β1 + ... + xi,pβp

Its log likelihood and first derivative (assuming only one β) are:
li = −µi + yi log µi = −xiβ + yi log(xiβ)
∂li/∂β = −xi + yi xi/(xiβ)

Considering the second derivatives and more than one β, using this model is
almost like “looking for trouble.” There is also another obvious issue with this
model. What is it?

The reason for the name “log linear” will become clearer under the GLM framework.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 124
Summary of Models
Given a dataset {xi, yi}, i = 1 to n, so far, we have seen three different models:

• Linear regression (−∞ < yi < ∞):
yi ∼ N(µi, σ²), µi = xiβ

• Poisson regression (yi ∈ {0, 1, 2, ...}):
yi ∼ Poisson(µi), log µi = xiβ

• Binary logistic regression (yi ∈ {0, 1}):
yi ∼ Binomial(pi), log[pi/(1 − pi)] = xiβ
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 125
Quotes from George E. P. Box
• Essentially, all models are wrong, but some are useful.
• Remember that all models are wrong; the practical question is how
wrong do they have to be to not be useful.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 126
Generalized Linear Models (GLM)
All the models we have seen so far belong to the family of generalized linear
models (GLM). In general, a GLM consists of three components:
• The random component: yi ∼ f(yi; θi), where
f(yi; θi) = a(θi) b(yi) e^{yi Q(θi)}

• The systematic component: ηi = xiβ = ∑_{j=0}^p xi,j βj.
(This may be replaced by a more flexible model.)

• The link function: ηi = g(µi), where µi = E(yi).
g(µ) is a monotonic function. If g(µ) = µ, it is called the “identity link”.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 127
Revisit Poisson Log Linear Model Under GLM
For GLM,
yi ∼ f(yi; θi) = a(θi) × b(yi) × e^{yi Q(θi)}

In this case, θi = µi,
f(yi) = e^{−µi} µi^{yi} / yi! = [e^{−µi}] [1/yi!] [e^{yi log µi}]

Therefore,
a(µi) = e^{−µi}, b(yi) = 1/yi!, Q(µi) = log µi

And the link function
g(µi) = Q(θi) = log µi = xiβ

This is called a canonical link.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 128
Revisit Binary Logistic Model Under GLM
For GLM,
yi ∼ f(yi; θi) = a(θi) × b(yi) × e^{yi Q(θi)}

In this case, θi = pi,
f(yi) = pi^{yi} (1 − pi)^{1−yi} = [(1 − pi)] [1] [e^{yi log(pi/(1−pi))}]

Therefore,
a(pi) = 1 − pi, b(yi) = 1, Q(pi) = log[pi/(1 − pi)]

And the link function
g(pi) = Q(θi) = log[pi/(1 − pi)] = xiβ

This is again a canonical link.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 129
Revisit Linear Regression Model Under GLM (with σ2 = 1)
For GLM,
yi ∼ f(yi; θi) = a(θi) × b(yi) × e^{yi Q(θi)}

In this case, θi = µi (and σ² = 1 by assumption; the constant 1/√(2π) is dropped, as it could be absorbed into b(yi)),
f(yi) = e^{−(yi − µi)²/2} = [e^{−µi²/2}] [e^{−yi²/2}] [e^{yi µi}]

Therefore,
a(µi) = e^{−µi²/2}, b(yi) = e^{−yi²/2}, Q(µi) = µi

And the link function
g(µi) = Q(θi) = µi = xiβ

This is again a canonical link, and in fact an identity link.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 130
Statistical Inference
After we have fitted a GLM (e.g., logistic regression) and estimated the
coefficients β, we can ask many questions, such as
• Which βj is more important?
• Is βj significantly different from 0?
• What is the (joint) distribution of β?
To understand these questions, it is crucial to learn some theory of the MLE,
because fitting a GLM is finding the MLE for a particular distribution.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 131
Revisit the Maximum Likelihood Estimation (MLE)
Observations xi, i = 1 to n, are i.i.d. samples from a distribution with probability
density function fX (x; θ1, θ2, ..., θk),
where θj , j = 1 to k, are parameters to be estimated.
The maximum likelihood estimator seeks the θ to maximize the joint likelihood
θ̂ = argmax_θ ∏_{i=1}^n fX(xi; θ)

Or, equivalently, to maximize the log joint likelihood
θ̂ = argmax_θ ∑_{i=1}^n log fX(xi; θ) = argmax_θ l(θ; x)

where l(θ; x) = ∑_{i=1}^n log fX(xi; θ) is the joint log likelihood function.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 132
Large Sample Theory for MLE
Large sample theory says that, as n → ∞, θ̂ is asymptotically unbiased and normal:
θ̂ ∼ N( θ, 1/(n I(θ)) ), approximately.

I(θ) is the Fisher information of θ:
I(θ) = −E[ ∂²/∂θ² log f(X|θ) ] = −E(l′′(θ))

Note that it is also true that
I(θ) = E(l′(θ))²
but you don't have to worry about the proof.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 133
Intuition About the Asymptotic Distributions & Variances of MLE
The MLE θ̂ is the solution to the MLE equation l′(θ̂) = 0.

The Taylor expansion around the true θ:
l′(θ̂) ≈ l′(θ) + (θ̂ − θ) l′′(θ)

Let l′(θ̂) = 0 (because θ̂ is the MLE solution):
(θ̂ − θ) ≈ −l′(θ)/l′′(θ)

We know that
E(−l′′(θ)) = n I(θ) = E(l′(θ))²,
E(l′(θ)) = 0. (Read the next slide if interested in the proof.)
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 134
(Don’t worry about this slide if you are not interested.)
l′(θ) = ∑_{i=1}^n ∂ log f(xi)/∂θ = ∑_{i=1}^n [∂f(xi)/∂θ] / f(xi)

E(l′(θ)) = ∑_{i=1}^n E( ∂ log f(xi)/∂θ ) = n E( [∂f(x)/∂θ] / f(x) ) = 0

because
E( [∂f(x)/∂θ] / f(x) ) = ∫ ([∂f(x)/∂θ] / f(x)) f(x) dx = ∫ ∂f(x)/∂θ dx = (∂/∂θ) ∫ f(x) dx = 0
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 135
The heuristic trick is to approximate
θ − θ ≈ l′(θ)
−l′′(θ)≈ l′(θ)
E(−l′′(θ))=
l′(θ)
nI(θ)
Therefore,
E(θ − θ) ≈ E(l′(θ))
nI(θ)= 0
V ar(θ) ≈ E(θ − θ)2 ≈ E
(
l′(θ)
nI(θ)
)2
=nI(θ)
n2I2(θ)=
1
nI(θ)
This is why intuitively, we know that θ ∼ N(
θ, 1nI(θ)
)
.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 136
Example: Normal Distribution
Given n i.i.d. samples xi ∼ N(µ, σ²), i = 1 to n,
log fX(x; µ, σ²) = −(x − µ)²/(2σ²) − (1/2) log(2πσ²)
∂² log fX(x; µ, σ²)/∂µ² = −1/σ²  ⟹  I(µ) = 1/σ²
Therefore, the MLE µ̂ will have asymptotic variance 1/(n I(µ)) = σ²/n. But in this case, we already know that
µ̂ = (1/n) ∑_{i=1}^n xi ∼ N(µ, σ²/n)
In other words, the "asymptotic" variance of the MLE is in fact exact in this case.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 137
Example: Binomial Distribution
x ∼ Binomial(p, n):  Pr(x = k) = (n choose k) p^k (1 − p)^{n−k}
Log likelihood and Fisher Information:
l(p) = k log p + (n − k) log(1 − p)
l′(p) = k/p − (n − k)/(1 − p)  ⟹  MLE p̂ = k/n
l″(p) = −k/p² − (n − k)/(1 − p)²
I(p) = −E(l″(p)) = np/p² + (n − np)/(1 − p)² = n/(p(1 − p))
That is, the asymptotic variance of the MLE p̂ is p(1 − p)/n, which again is in fact the exact variance.
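A minimal Matlab sketch (not from the slides; toy parameters) to check this numerically: the empirical variance of p̂ = k/n over repeated experiments should be close to p(1−p)/n.
% A minimal sketch (toy parameters): the empirical variance of the MLE
% p_hat = k/n should match p*(1-p)/n, the value predicted by 1/I(p).
p = 0.3; n = 100; trials = 100000;
k = sum(rand(n, trials) < p, 1);          % number of successes in each experiment
p_hat = k / n;
fprintf('empirical var %.6f vs p(1-p)/n = %.6f\n', var(p_hat), p*(1-p)/n);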
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 138
Example: Contingency Table with Known Margins
n = n11 + n12 + n21 + n22,  N = N11 + N12 + N21 + N22
[Figure: the original 2×2 table (N11, N12, N21, N22) and the sample 2×2 table (n11, n12, n21, n22)]
Margins: M1 = N11 + N12 and M2 = N11 + N21 are known.
The (asymptotic) variance of the MLE (for N11) is
Var(N̂11,MLE) = (N/n) / [ 1/N11 + 1/(M1 − N11) + 1/(M2 − N11) + 1/(N − M1 − M2 + N11) ]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 139
Derivation: The log likelihood is
l(N11) = n11 log(N11/N) + n12 log((M1 − N11)/N) + n21 log((M2 − N11)/N) + n22 log((N − M1 − M2 + N11)/N)
The MLE solution satisfies
l′(N11) = n11/N11 − n12/(M1 − N11) − n21/(M2 − N11) + n22/(N − M1 − M2 + N11) = 0
The second derivative is
l″(N11) = −n11/N11² − n12/N12² − n21/N21² − n22/N22²
(writing N12 = M1 − N11, N21 = M2 − N11, N22 = N − M1 − M2 + N11).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 140
The Fisher Information is thus
I(N11) = E(−l″(N11)) = E(n11)/N11² + E(n12)/N12² + E(n21)/N21² + E(n22)/N22²
= (n/N) [ 1/N11 + 1/N12 + 1/N21 + 1/N22 ]
Recall
E(n11) = n N11/N,  E(n12) = n N12/N,  E(n21) = n N21/N,  E(n22) = n N22/N.
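A minimal Matlab sketch (toy table, not from the slides): sample the table, solve l′(N11) = 0 with fzero, and compare the empirical variance of the MLE with the formula above. It assumes all four sampled counts are positive, which holds with overwhelming probability for these numbers.
% A minimal sketch (toy table): empirical vs theoretical variance of the MLE of N11.
N11=200; N12=300; N21=100; N22=400;  N = N11+N12+N21+N22;
M1 = N11+N12;  M2 = N11+N21;                               % known margins
pi_true = [N11 N12 N21 N22]/N;  n = 500;  trials = 2000;
est = zeros(trials,1);
for t = 1:trials
    c = histcounts(rand(n,1), [0 cumsum(pi_true(1:3)) 1]); % sampled n11, n12, n21, n22
    dl = @(x) c(1)/x - c(2)/(M1-x) - c(3)/(M2-x) + c(4)/(N-M1-M2+x);
    est(t) = fzero(dl, [0.5, min(M1,M2)-0.5]);             % solve l'(N11) = 0
end
fprintf('empirical var %.1f vs theory %.1f\n', var(est), ...
        (N/n)/(1/N11 + 1/N12 + 1/N21 + 1/N22));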
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 141
Asymptotic Covariance Matrix
More generally, suppose there is more than one parameter, θ = (θ1, θ2, ..., θp). The Fisher Information Matrix is defined as
I(θ) = E( −∂²l(θ)/∂θi∂θj )
and the asymptotic covariance matrix is
Cov(θ̂) = I^{-1}(θ)
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 142
Review Binary Logistic Regression Derivatives
Newton's update formula:
βnew = βold − [ (∂²l(β)/∂β∂β^T)^{-1} ∂l(β)/∂β ] evaluated at βold,
where, in matrix form,
∂l(β)/∂β = ∑_{i=1}^n xi (yi − p(xi; β)) = X^T(y − p)
∂²l(β)/∂β∂β^T = −∑_{i=1}^n xi xi^T p(xi; β)(1 − p(xi; β)) = −X^T W X
where W = diag{p(xi)(1 − p(xi))}.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 143
Fisher Information and Covariance for Logistic Regression
Suppose the Newton iteration has reached the optimal solution (very important). Then
I(β) = E(X^T W X) = X^T W X
and the asymptotic covariance matrix is
Cov(β̂) = I^{-1}(β) = [X^T W X]^{-1}
In other words, the MLE estimates β̂ of the binary logistic regression parameters are asymptotically jointly normal:
β̂ ∼ N(β, [X^T W X]^{-1})
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 144
A Simple Test for Logistic Regression Coefficients
At convergence, the coefficients of logistic regression satisfy
β̂ ∼ N(β, [X^T W X]^{-1})
We can test each coefficient separately because, asymptotically,
β̂j ∼ N(βj, [X^T W X]^{-1}_jj),
which allows us to use normal probability functions to compute the p-values.
Two caveats: (1) We need the "true" W, which is replaced by the estimated W at the last iteration. (2) We still have to specify the true βj for the test. In general, it makes sense to test H0: βj = 0.
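A minimal Matlab sketch of this Wald test (X, y, and beta_hat are assumed to come from a converged Newton fit, as above; they are not defined on the slides):
% A minimal sketch: Wald z-tests for H0: beta_j = 0, using the asymptotic
% covariance [X'WX]^{-1} evaluated at the converged beta_hat.
p    = 1 ./ (1 + exp(-X*beta_hat));       % fitted probabilities at convergence
W    = diag(p .* (1 - p));                % estimated W from the last iteration
C    = inv(X' * W * X);                   % asymptotic covariance of beta_hat
se   = sqrt(diag(C));                     % standard errors
z    = beta_hat ./ se;                    % z-statistics
pval = erfc(abs(z)/sqrt(2));              % two-sided p-values, 2*(1 - Phi(|z|))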
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 145
GLM with R
> data = read.table("d:\\class\\6030Spring12\\fig\\crab.txt");
> model = glm((data[,5]==0) ~ data$V2 + data$V3 + data$V4, family='binomial');
> summary(model)
Call:
glm(formula = (data[, 5] == 0) ~ data$V2 + data$V3 + data$V4,
    family = "binomial")
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.7120 -0.8948 -0.5242  1.0431  2.0833
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  9.46885    3.56974   2.653  0.00799 **
data$V2     -0.04952    0.22094  -0.224  0.82267
data$V3     -0.30540    0.18220  -1.676  0.09370 .
data$V4     -0.84479    0.67369  -1.254  0.20985
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 146
Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 192.84 on 169 degrees of freedom
AIC: 200.84
Number of Fisher Scoring iterations: 4
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 147
R Resources
Download R executable from
http://www.r-project.org/
After launching R, type help(glm) (or ?glm) for the help page on glm.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 148
Validating the Asymptotic Theory Using Crab Data
We use 3 variables (Width, Weight, Spine, plus the intercept) from the crab data
for building the binary logistic regression model for predicting Pr (Sa > 0).
Instead of using the original labels, we generate the “true” β and sample the
labels from the generated β.
function TestLogitCrab;
load crab.txt;                                   % crab data, one row per specimen
X = crab(:,1:end-1); X(:,1) = 1;                 % predictors; first column set to 1 (intercept)
be_true = [-10,0.05,0.3,0.8]' + randn(4,1)*0.1;  % generate the "true" beta
The true β is fixed once generated. Once β is known, we can easily compute
p(xi) = Pr(yi = 1) = e^{xi β} / (1 + e^{xi β})
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 149
Once β is fixed, we can compute p and sample the labels from Bernoulli(p(xi)) for each xi.
We then fit the binary logistic regression using the original xi and the generated yi to obtain β̂, which will be quite close to, but not identical to, the "true" β.
We then repeat the sampling procedure to create another set of labels and another β̂.
By repeating this procedure 1000 times, we will be able to assess the distribution of β̂.
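A minimal Matlab sketch of this repeated-sampling loop (X and be_true come from the snippet above; fit_logit is a hypothetical placeholder for any logistic-regression fitting routine, e.g., the Newton iteration written earlier in the course):
% A minimal sketch of the repeated-sampling validation described above.
p_true = 1 ./ (1 + exp(-X*be_true));
B = zeros(1000, length(be_true));
for rep = 1:1000
    y = double(rand(size(p_true)) < p_true);   % labels sampled from Bernoulli(p(x_i))
    B(rep, :) = fit_logit(X, y)';              % refit beta_hat on the new labels
end
emp_cov = cov(B);                              % empirical covariance of beta_hat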
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 150
[Figure: empirical vs. theoretical MSE of each coefficient over the sampling iterations; true β0 = −10.0371, β1 = −0.033621, β2 = 0.24113, β3 = 0.97415]
The MSEs of all β̂j converge as the number of iterations increases. However, they deviate from the "true" variances predicted by [X^T W X]^{-1}, most likely because our sample size n = 173 is too small for the large-sample theory to be accurate.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 151
Experiments on the Zipcode data
[Figure: sample zipcode digit images]
Conjecture: If we display the p-values from the z-test, we might be able to see
some images similar to digits.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 152
Displaying the p-values as images
[Figure: p-values displayed as images, digits 0–9]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 153
Displaying 1 − p-values as images
[Figure: 1 − p-values displayed as images, digits 0–9]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 154
Displaying only the top (smallest) 50 p-values as images
[Figure: top 50 smallest p-values displayed as images, digits 0–9]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 155
Plausible Interpretations: The asymptotic theory says
β̂ ∼ N(β, [X^T W X]^{-1})
Using only the marginal (diagonal) information,
β̂j ∼ N(βj, [X^T W X]^{-1}_jj),
may result in a serious loss of information. In particular, when the variables are highly correlated, as in this dataset, it is not realistic to expect the marginal information alone to be sufficient.
In other words, for the zipcode data, many pixels "work together" to provide strong discriminating power. This is the power of team work.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 156
Multi-Class Ordinal Logistic Regression
For zip-code recognition, it is natural to treat each class (0 to 9) equally, because
in general there are indeed no orders among them (unless you are doing specific
studies in which the zip code information reveals physical locations.).
In many applications, however, there are natural orders among the class labels.
For example, in the crab data, it might be reasonable to treat #Sa as ordinal because it reflects the growth process. The variable "Spine condition" may also be ordinal.
Another example is the Webpage relevance ranking. A page with a rank of
“perfect” (4) is certainly more important than a page of “bad” (0).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 157
Practical Strategies
• For binary classification, it does not matter.
• In many cases, we can just ignore the orders.
• We can fit K binary logistic regressions by grouping the data according to whether the labels are larger than L:
Pr(Label > L)
from which one can compute the individual class probabilities:
Pr(Label = L) = Pr(Label > L − 1) − Pr(Label > L)
One drawback is that for some data points the fitted class probabilities may be smaller than 0 after subtraction. But if you have lots of data, this method is often quite effective in practice, for example, in our previous work on ranking web pages. Do read the slides on ranking if you are interested.
• More sophisticated models...
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 158
Methods for Modern Massive Data Sets (MMDS)
1. Normal Random Projections
2. Cauchy Random Projections
3. Stable Random Projections
4. Random Projections for Computing Higher-Order Distances
5. Skewed Stable Random Projections
6. Tentative: Sparse Signal Recovery
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 159
An Introduction to Random Projections
Many applications require a data matrix: A ∈ R^{n×D}
For example, the term-by-document matrix may contain n = 10^10 documents (web pages) and D = 10^6 single words, or D = 10^12 double words (bi-gram model), or D = 10^18 triple words (tri-gram model).
Many matrix operations boil down to computing how close (or how far) two rows (or columns) of the matrix are. For example, linear least squares: (A^T A)^{-1} A^T y.
Challenges: The matrix may be too large to store, or computing A^T A is too expensive.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 160
Random Projections: Replace A by B = A × R
A R = B
R ∈ R^{D×k}: a random matrix with i.i.d. entries sampled from N(0, 1).
B ∈ R^{n×k}: projected matrix, also random.
k is very small (e.g., k = 50 ∼ 100), but n and D are very large.
B approximately preserves the Euclidean distances and dot products between any two rows of A. In particular, (1/k) E(BB^T) = AA^T.
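A minimal Matlab sketch (toy sizes, not from the slides) of this construction:
% A minimal sketch: B = A*R with i.i.d. N(0,1) entries in R;
% (1/k)*B*B' approximates A*A'.
n = 100; D = 10000; k = 50;
A = randn(n, D);                 % toy data matrix
R = randn(D, k);                 % random projection matrix
B = A * R;                       % projected matrix, n x k
err = norm(A*A' - (B*B')/k, 'fro') / norm(A*A', 'fro');
fprintf('relative Frobenius error: %.3f\n', err);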
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 161
Consider the first two rows of A: u1, u2 ∈ R^D,
u1 = (u1,1, u1,2, ..., u1,i, ..., u1,D),   u2 = (u2,1, u2,2, ..., u2,i, ..., u2,D),
and the first two rows of B: v1, v2 ∈ R^k,
v1 = (v1,1, v1,2, ..., v1,j, ..., v1,k),   v2 = (v2,1, v2,2, ..., v2,j, ..., v2,k),
v1 = R^T u1,  v2 = R^T u2.
R = {rij}, i = 1 to D and j = 1 to k, with rij ∼ N(0, 1).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 162
v1 = R^T u1, v2 = R^T u2.  R = {rij}, i = 1 to D and j = 1 to k.
v1,j = ∑_{i=1}^D rij u1,i,   v2,j = ∑_{i=1}^D rij u2,i,
v1,j − v2,j = ∑_{i=1}^D rij [u1,i − u2,i]
The squared Euclidean norm of u1: ∑_{i=1}^D |u1,i|².
The squared Euclidean norm of v1: ∑_{j=1}^k |v1,j|².
The squared Euclidean distance between u1 and u2: ∑_{i=1}^D |u1,i − u2,i|².
The squared Euclidean distance between v1 and v2: ∑_{j=1}^k |v1,j − v2,j|².
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 163
What are we hoping for?
• (1/k) ∑_{j=1}^k |v1,j|² ≈ ∑_{i=1}^D |u1,i|², as close as possible.
• (1/k) ∑_{j=1}^k |v1,j − v2,j|² ≈ ∑_{i=1}^D |u1,i − u2,i|², as close as possible.
• k should be as small as possible, for a specified level of accuracy.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 164
Unbiased Estimators of d and m1, m2
We need a good estimator: unbiased and with small variance.
Note that the estimation problem is essentially the same for d as for m1 (and m2). Thus, we can focus on estimating m1.
By random projections, we have k i.i.d. samples (why?)
v1,j = ∑_{i=1}^D rij u1,i,  j = 1, 2, ..., k
Because rij ∼ N(0, 1), we can develop estimators and analyze their properties using normal and χ² distributions. But we can also solve the problem without using normals.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 165
Unbiased Estimator of m1
v1,j = ∑_{i=1}^D rij u1,i,  j = 1, 2, ..., k,  (rij ∼ N(0, 1))
To get started, let's first look at the moments:
E(v1,j) = E( ∑_{i=1}^D rij u1,i ) = ∑_{i=1}^D E(rij) u1,i = 0
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 166
E(v1,j²) = E[ ∑_{i=1}^D rij u1,i ]²
= E[ ∑_{i=1}^D rij² u1,i² + ∑_{i≠i′} rij u1,i ri′j u1,i′ ]
= ∑_{i=1}^D E(rij²) u1,i² + ∑_{i≠i′} E(rij ri′j) u1,i u1,i′
= ∑_{i=1}^D u1,i² + 0 = m1
Great! m1 is exactly what we are after.
Since we have k i.i.d. samples v1,j, we can simply average them to estimate m1.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 167
An unbiased estimator of the (squared) Euclidean norm m1 = ∑_{i=1}^D |u1,i|²:
m̂1 = (1/k) ∑_{j=1}^k |v1,j|²,
E(m̂1) = (1/k) ∑_{j=1}^k E(|v1,j|²) = (1/k) ∑_{j=1}^k m1 = m1
We need to analyze its variance to assess its accuracy.
Recall, our goal is to use k (the number of projections) as small as possible.
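A minimal Matlab sketch (toy sizes, not from the slides) checking unbiasedness empirically; its variance should be close to 2 m1²/k, the formula derived on the next slides.
% A minimal sketch: m1_hat = mean(v1.^2) is unbiased for m1.
D = 1000; k = 50; trials = 2000;
u1 = randn(1, D);  m1 = sum(u1.^2);
est = zeros(trials, 1);
for t = 1:trials
    v1 = u1 * randn(D, k);        % v1(j) = sum_i r_ij * u1(i)
    est(t) = mean(v1.^2);         % m1_hat
end
fprintf('mean %.1f (m1 = %.1f), var %.1f (2*m1^2/k = %.1f)\n', ...
        mean(est), m1, var(est), 2*m1^2/k);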
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 168
Var(m̂1) = (1/k²) ∑_{j=1}^k Var(|v1,j|²) = (1/k) Var(|v1,j|²)
= (1/k) [ E(|v1,j|⁴) − E²(|v1,j|²) ]
= (1/k) [ E( ∑_{i=1}^D rij u1,i )⁴ − m1² ]
We can compute E( ∑_{i=1}^D rij u1,i )⁴ directly, but it would be much easier if we take advantage of the χ² distribution.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 169
χ² Distribution
If X ∼ N(0, 1), then Y = X² has a chi-square distribution with one degree of freedom, denoted by χ²_1.
If Xj, j = 1 to k, are i.i.d. N(0, 1), then Y = ∑_{j=1}^k Xj² follows a chi-square distribution with k degrees of freedom, denoted by χ²_k.
If Y ∼ χ²_k, then E(Y) = k and Var(Y) = 2k.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 170
Recall, after random projections,
v1,j = ∑_{i=1}^D rij u1,i,  j = 1, 2, ..., k,  rij ∼ N(0, 1)
Therefore, v1,j also has a normal distribution:
v1,j ∼ N( 0, ∑_{i=1}^D |u1,i|² ) = N(0, m1)
Equivalently, v1,j/√m1 ∼ N(0, 1). Therefore,
[ v1,j/√m1 ]² = v1,j²/m1 ∼ χ²_1,  Var(v1,j²/m1) = 2,  Var(v1,j²) = 2 m1²
Now we can figure out the variance formula for random projections.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 171
Var(m̂1) = (1/k) Var(|v1,j|²) = 2 m1²/k
Implication:
Var(m̂1)/m1² = 2/k, independent of m1
√( Var(m̂1)/m1² ) is known as the coefficient of variation.
We have solved the variance using χ²_1.
We can actually figure out the distribution of m̂1 using χ²_k.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 172
m̂1 = (1/k) ∑_{j=1}^k |v1,j|²,  v1,j ∼ N(0, m1)
Because the v1,j's are i.i.d., we know
k m̂1/m1 = ∑_{j=1}^k ( v1,j/√m1 )² ∼ χ²_k  (why?)
This will be useful for analyzing the error bound using probability inequalities.
We can also write down the moments of m̂1 directly using χ²_k.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 173
Recall, if Y ∼ χ²_k, then E(Y) = k and Var(Y) = 2k.
⟹ E(k m̂1/m1) = k,  Var(k m̂1/m1) = 2k,
⟹ Var(m̂1) = 2k · m1²/k² = 2 m1²/k
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 174
An unbiased estimator of the (squared) Euclidean distance d = ∑_{i=1}^D |u1,i − u2,i|²:
d̂ = (1/k) ∑_{j=1}^k |v1,j − v2,j|²,   k d̂/d ∼ χ²_k,   Var(d̂) = 2d²/k.
These can be derived exactly the same way as we analyzed the estimator of m1.
Note that for d̂,
Var(d̂)/d² = 2/k, independent of d,
meaning that the relative errors are pre-determined by k, a huge advantage.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 176
More probability problems
• What is the error probability P(|d̂ − d| ≥ εd)?
• How large should k be?
• What about the inner (dot) product a = ∑_{i=1}^D u1,i u2,i?
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 176
An unbiased estimator of the inner product a = ∑_{i=1}^D u1,i u2,i:
â = (1/k) ∑_{j=1}^k v1,j v2,j,
E(â) = a,
Var(â) = (m1 m2 + a²)/k
Proof:
v1,j v2,j = [ ∑_{i=1}^D u1,i rij ] [ ∑_{i=1}^D u2,i rij ]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 177
v1,j v2,j = [ ∑_{i=1}^D u1,i rij ] [ ∑_{i=1}^D u2,i rij ]
= ∑_{i=1}^D u1,i u2,i rij² + ∑_{i≠i′} u1,i u2,i′ rij ri′j
⟹ E(v1,j v2,j) = ∑_{i=1}^D u1,i u2,i E[rij²] + ∑_{i≠i′} u1,i u2,i′ E[rij ri′j]
= ∑_{i=1}^D u1,i u2,i · 1 + ∑_{i≠i′} u1,i u2,i′ · 0
= ∑_{i=1}^D u1,i u2,i = a
This proves the unbiasedness.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 178
We first derive the variance of â using a complicated brute-force method; then we show a much simpler method using conditional expectation.
[v1,j v2,j]² = [ ∑_{i=1}^D u1,i u2,i rij² + ∑_{i≠i′} u1,i u2,i′ rij ri′j ]²
= [ ∑_{i=1}^D u1,i u2,i rij² ]² + [ ∑_{i≠i′} u1,i u2,i′ rij ri′j ]² + ...
= ∑_{i=1}^D [u1,i u2,i]² rij⁴ + 2 ∑_{i≠i′} u1,i u2,i u1,i′ u2,i′ [rij ri′j]² + ∑_{i≠i′} [u1,i u2,i′]² [rij ri′j]² + ...
Why can we ignore the rest of the terms (after taking expectations)?
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 179
Why can we ignore the rest of the terms (after taking expectations)?
Recall rij ∼ N(0, 1) i.i.d.:
E(rij) = 0,  E(rij²) = 1,  E(rij ri′j) = E(rij)E(ri′j) = 0,
E(rij³) = 0,  E(rij⁴) = 3,  E(rij² ri′j) = E(rij²)E(ri′j) = 0
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 180
Therefore,
E[v1,j v2,j]² = ∑_{i=1}^D 3[u1,i u2,i]² + 2 ∑_{i≠i′} u1,i u2,i u1,i′ u2,i′ + ∑_{i≠i′} [u1,i u2,i′]²
But
a² = [ ∑_{i=1}^D u1,i u2,i ]² = ∑_{i=1}^D [u1,i u2,i]² + ∑_{i≠i′} u1,i u2,i u1,i′ u2,i′
m1 m2 = [ ∑_{i=1}^D |u1,i|² ][ ∑_{i=1}^D |u2,i|² ] = ∑_{i=1}^D [u1,i u2,i]² + ∑_{i≠i′} [u1,i u2,i′]²
Therefore,
E[v1,j v2,j]² = m1 m2 + 2a²,   Var[v1,j v2,j] = m1 m2 + a²
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 181
An unbiased estimator of the inner product a = ∑_{i=1}^D u1,i u2,i:
â = (1/k) ∑_{j=1}^k v1,j v2,j,  E(â) = a,
Var(â) = (m1 m2 + a²)/k
The coefficient of variation:
√( Var(â)/a² ) = √( (m1 m2 + a²)/a² · 1/k ), not independent of a.
When two vectors u1 and u2 are almost orthogonal, a ≈ 0,
⟹ coefficient of variation ≈ ∞
⟹ random projections may not be good for estimating inner products.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 182
The joint distribution of v1,j = ∑_{i=1}^D u1,i rij and v2,j = ∑_{i=1}^D u2,i rij:
E(v1,j) = 0,  Var(v1,j) = ∑_{i=1}^D |u1,i|² = m1
E(v2,j) = 0,  Var(v2,j) = ∑_{i=1}^D |u2,i|² = m2
Cov(v1,j, v2,j) = E(v1,j v2,j) − E(v1,j)E(v2,j) = a
v1,j and v2,j are jointly normal (bivariate normal):
(v1,j, v2,j)^T ∼ N( µ = (0, 0)^T,  Σ = [m1  a; a  m2] )
(What if we know m1 and m2 exactly? For example, by one scan of the data matrix.)
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 184
Review: Bivariate Normal Distribution
The random variables X and Y have a bivariate normal distribution if, for constants µx, µy, σx > 0, σy > 0, −1 < ρ < 1, their joint density function is given, for all −∞ < x, y < ∞, by
f(x, y) = 1/(2π σx σy √(1−ρ²)) · exp{ −1/(2(1−ρ²)) [ (x−µx)²/σx² + (y−µy)²/σy² − 2ρ(x−µx)(y−µy)/(σx σy) ] }
If X and Y are independent, then ρ = 0, and
f(x, y) = 1/(2π σx σy) · exp{ −(1/2) [ (x−µx)²/σx² + (y−µy)²/σy² ] }
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 184
Denote that X and Y are jointly normal:
(X, Y)^T ∼ N( µ = (µx, µy)^T,  Σ = [σx²  ρσxσy; ρσxσy  σy²] )
X and Y are marginally normal:
X ∼ N(µx, σx²),  Y ∼ N(µy, σy²)
X and Y are also conditionally normal:
X|Y ∼ N( µx + ρ(y − µy)σx/σy,  (1 − ρ²)σx² )
Y|X ∼ N( µy + ρ(x − µx)σy/σx,  (1 − ρ²)σy² )
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 185
Bivariate Normal and Random Projections
A R = B
v1 and v2, the first two rows of B, have k entries:
v1,j = ∑_{i=1}^D u1,i rij and v2,j = ∑_{i=1}^D u2,i rij.
v1,j and v2,j are bivariate normal:
(v1,j, v2,j)^T ∼ N( µ = (0, 0)^T,  Σ = [m1  a; a  m2] )
m1 = ∑_{i=1}^D |u1,i|²,  m2 = ∑_{i=1}^D |u2,i|²,  a = ∑_{i=1}^D u1,i u2,i
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 186
Simplify calculations using conditional normality
v1,j | v2,j ∼ N( (a/m2) v2,j,  (m1 m2 − a²)/m2 )
E(v1,j v2,j)² = E( E(v1,j² v2,j² | v2,j) ) = E( v2,j² E(v1,j² | v2,j) )
= E( v2,j² ( (m1 m2 − a²)/m2 + ( (a/m2) v2,j )² ) )
= m2 (m1 m2 − a²)/m2 + 3 m2² a²/m2²
= m1 m2 + 2a².
The unbiased estimator â = (1/k) ∑_{j=1}^k v1,j v2,j has variance
Var(â) = (1/k)(m1 m2 + a²)
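A minimal Matlab sketch (toy vectors, not from the slides) checking this variance formula empirically:
% A minimal sketch: empirical check of Var(a_hat) = (m1*m2 + a^2)/k for the
% simple inner-product estimator a_hat = mean(v1 .* v2).
D = 1000; k = 50; trials = 2000;
u1 = randn(1, D);  u2 = 0.5*u1 + randn(1, D);      % a correlated pair of vectors
m1 = sum(u1.^2);  m2 = sum(u2.^2);  a = u1*u2';
est = zeros(trials, 1);
for t = 1:trials
    R = randn(D, k);
    est(t) = mean((u1*R) .* (u2*R));
end
fprintf('mean %.1f (a = %.1f), var %.1f (theory %.1f)\n', ...
        mean(est), a, var(est), (m1*m2 + a^2)/k);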
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 187
Review: Moment Generating Function (MGF)
Definition: For a random variable X, its moment generating function (MGF) is defined as
MX(t) = E[e^{tX}] = ∑_x p(x) e^{tx} if X is discrete;  ∫_{−∞}^∞ e^{tx} f(x) dx if X is continuous.
The MGF MX(t) uniquely determines the distribution of X.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 188
Review: MGF of the Normal
Suppose X ∼ N(0, 1), i.e., fX(x) = (1/√(2π)) e^{−x²/2}.
MX(t) = ∫_{−∞}^∞ e^{tx} (1/√(2π)) e^{−x²/2} dx
= ∫_{−∞}^∞ (1/√(2π)) e^{−x²/2 + tx} dx
= ∫_{−∞}^∞ (1/√(2π)) e^{−(x² − 2tx + t² − t²)/2} dx
= e^{t²/2} ∫_{−∞}^∞ (1/√(2π)) e^{−(x−t)²/2} dx
= e^{t²/2}
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 189
Suppose Y ∼ N(µ, σ²). Write Y = σX + µ, where X ∼ N(0, 1).
MY(t) = E[e^{tY}] = E[e^{µt + σtX}] = e^{µt} E[e^{σtX}]
We can view σt as another t′:
MY(t) = e^{µt} MX(σt) = e^{µt} · e^{σ²t²/2} = e^{µt + (σ²/2)t²}
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 190
Review: MGF of the Chi-Square
If Xj, j = 1 to k, are i.i.d. N(0, 1), then Y = ∑_{j=1}^k Xj² ∼ χ²_k, a chi-square distribution with k degrees of freedom.
By the independence of the Xj,
MY(t) = E[e^{Yt}] = E[e^{t ∑_{j=1}^k Xj²}] = ∏_{j=1}^k E[e^{t Xj²}] = ( E[e^{t Xj²}] )^k
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 191
E[e^{t Xj²}] = ∫_{−∞}^∞ e^{tx²} (1/√(2π)) e^{−x²/2} dx
= ∫_{−∞}^∞ (1/√(2π)) e^{−x²/2 + tx²} dx
= ∫_{−∞}^∞ (1/√(2π)) e^{−x²(1−2t)/2} dx
= ∫_{−∞}^∞ (1/√(2π)) e^{−x²/(2σ²)} dx,  ( σ² = 1/(1 − 2t) )
= σ ∫_{−∞}^∞ (1/(√(2π)σ)) e^{−x²/(2σ²)} dx = σ = 1/(1 − 2t)^{1/2}
MY(t) = ( E[e^{t Xj²}] )^k = 1/(1 − 2t)^{k/2},  (t < 1/2)
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 192
MGF for Random Projections
In random projections, the unbiased estimator d̂ = (1/k) ∑_{j=1}^k |v1,j − v2,j|² satisfies
k d̂/d = ∑_{j=1}^k |v1,j − v2,j|²/d ∼ χ²_k
Q: What is the MGF of d̂?
Solution:
M_d̂(t) = E(e^{d̂ t}) = E( e^{[k d̂/d][dt/k]} ) = ( 1 − 2dt/k )^{−k/2},
where 2dt/k < 1, i.e., t < k/(2d).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 193
Review: Moments and MGF
MX(t) = E[e^{tX}] ⟹ M′X(t) = E[X e^{tX}] ⟹ M^{(n)}_X(t) = E[X^n e^{tX}]
Setting t = 0:
E[X^n] = M^{(n)}_X(0)
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 194
Example: X ∼ χ²_k, MX(t) = 1/(1 − 2t)^{k/2}.
M′(t) = (−k/2)(1 − 2t)^{−k/2−1}(−2) = k(1 − 2t)^{−k/2−1}
M″(t) = k(−k/2 − 1)(1 − 2t)^{−k/2−2}(−2) = k(k + 2)(1 − 2t)^{−k/2−2}
Therefore,
E(X) = M′(0) = k,  E(X²) = M″(0) = k² + 2k,
Var(X) = (k² + 2k) − k² = 2k.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 195
MGF and Moments of â in Random Projections
The unbiased estimator of the inner product: â = (1/k) ∑_{j=1}^k v1,j v2,j.
Using conditional expectation:
v1,j | v2,j ∼ N( (a/m2) v2,j,  (m1 m2 − a²)/m2 ),   v2,j ∼ N(0, m2)
For simplicity, let
x = v1,j,  y = v2,j,  µ = (a/m2) v2,j = (a/m2) y,  σ² = (m1 m2 − a²)/m2
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 196
E( exp(v1,j v2,j t) ) = E( exp(xyt) ) = E( E( exp(xyt) | y ) )
Using the MGF of x | y ∼ N(µ, σ²):
E( exp(xyt) | y ) = e^{µyt + (σ²/2)(yt)²}
E( E( exp(xyt) | y ) ) = E( e^{µyt + (σ²/2)(yt)²} )
µyt + (σ²/2)(yt)² = y² ( (a/m2)t + (σ²/2)t² )
Since y ∼ N(0, m2), we know y²/m2 ∼ χ²_1.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 197
Using the MGF of χ²_1, we obtain
E( e^{µyt + (σ²/2)(yt)²} ) = E( e^{ (y²/m2) · m2( (a/m2)t + (σ²/2)t² ) } )
= ( 1 − 2 m2( (a/m2)t + (σ²/2)t² ) )^{−1/2}
= ( 1 − 2at − (m1 m2 − a²)t² )^{−1/2}.
By independence,
M_â(t) = ( 1 − 2at/k − (m1 m2 − a²)t²/k² )^{−k/2}.
Now, we can use this MGF to calculate moments of â.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 198
M_â(t) = ( 1 − 2at/k − (m1 m2 − a²)t²/k² )^{−k/2},
M_â′(t) = (−k/2) [ ( 1 − 2at/k − (m1 m2 − a²)t²/k² )^{−k/2−1} ] × ( −2a/k − (m1 m2 − a²)·2t/k² )
The term in [...] will not matter (it equals 1) after letting t = 0.
Therefore,
E(â) = M_â′(0) = (−k/2)(−2a/k) = a
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 199
Following a similar procedure, we can obtain
Var(â) = (m1 m2 + a²)/k,
E(â − a)³ = (2a/k²)(3 m1 m2 + a²)
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 200
Tail Probabilities
The tail probability P(X > t) is extremely important.
For example, in random projections,
P( |d̂ − d| ≥ εd )
tells us the probability that the difference (error) between the estimated Euclidean distance d̂ and the true distance d exceeds an ε fraction of the true distance d.
Q: Is it just the cumulative distribution function (CDF)?
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 201
Tail Probability Inequalities (Bounds)
P (X > t) ≤ ???
Reasons to study tail probability bounds:
• Even if the distribution of X is known, evaluating P (X > t) often requires
numerical methods.
• Often the exact distribution of X is unknown. Instead, we may know the
moments (mean, variance, MGF, etc).
• Theoretical reasons. For example, studying how fast the error decreases.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 202
Several Tail Probability Inequalities (Bounds)
• Markov’s Inequality .
Only use the first moment. Most basic.
• Chebyshev’s Inequality .
Only use the second moment.
• Chernoff’s Inequality .
Use the MGF. Most accurate and popular among theorists.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 203
Markov's Inequality: Theorem A in Section 4.1
If X is a random variable with P(X ≥ 0) = 1, and for which E(X) exists, then
P(X ≥ t) ≤ E(X)/t
Proof: Assume X is continuous with probability density f(x).
E(X) = ∫_0^∞ x f(x) dx ≥ ∫_t^∞ x f(x) dx ≥ ∫_t^∞ t f(x) dx = t P(X ≥ t)
See the textbook for the proof assuming X is discrete.
Many extremely useful bounds can be obtained from Markov's inequality.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 204
Markov's inequality: P(X ≥ t) ≤ E(X)/t. If t = k E(X), then
P(X ≥ t) = P(X ≥ k E(X)) ≤ 1/k
The error decreases at the rate of 1/k, which is too slow.
The original Markov's inequality only utilizes the first moment (hence its inaccuracy).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 205
Chebyshev's Inequality
Let X be a random variable with mean µ and variance σ². Then for any t > 0,
P(|X − µ| ≥ t) ≤ σ²/t²
Proof: Let Y = (X − µ)² = |X − µ|² and w = t². Then
P(Y ≥ w) ≤ E(Y)/w = E(X − µ)²/w = σ²/w
Note that |X − µ|² ≥ t² ⟺ |X − µ| ≥ t. Therefore,
P(|X − µ| ≥ t) = P(|X − µ|² ≥ t²) ≤ σ²/t²
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 206
Chebyshev's inequality: P(|X − µ| ≥ t) ≤ σ²/t². If t = kσ, then
P(|X − µ| ≥ kσ) ≤ 1/k²
The error decreases at the rate of 1/k², which is faster than 1/k.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 207
Chernoff's Inequality
If X is a random variable with finite MGF MX(t), then for any ε > 0,
P(X ≥ ε) ≤ e^{−tε} MX(t), for all t > 0
P(X ≤ ε) ≤ e^{−tε} MX(t), for all t < 0
Application: One can choose the t that minimizes the upper bound. This usually leads to accurate probability bounds, which decrease exponentially fast.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 208
Proof: Use Markov's Inequality.
For t > 0, because X ≥ ε ⟺ e^{tX} ≥ e^{tε} (monotone transformation),
P(X ≥ ε) = P( e^{tX} ≥ e^{tε} ) ≤ E[e^{tX}]/e^{tε} = e^{−tε} MX(t)
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 209
Tail Bounds of Normal Random Variables
X ∼ N(µ, σ²). Assume µ > 0. We need P(|X − µ| ≥ εµ) ≤ ??
Chebyshev's inequality:
P(|X − µ| ≥ εµ) ≤ σ²/(ε²µ²) = (1/ε²)(σ²/µ²)
The bound is not good enough, decreasing only at the rate of 1/ε².
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 210
Tail Bounds of the Normal Using Chernoff's Inequality
Right tail bound P(X − µ ≥ εµ): for any t > 0,
P(X − µ ≥ εµ) = P(X ≥ (1+ε)µ)
≤ e^{−t(1+ε)µ} MX(t)
= e^{−t(1+ε)µ} e^{µt + σ²t²/2}
= e^{−t(1+ε)µ + µt + σ²t²/2}
= e^{−tεµ + σ²t²/2}
What's next? Since the inequality holds for any t > 0, we can choose the t that minimizes the upper bound.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 211
Right tail bound P(X − µ ≥ εµ): choose t = t* to minimize g(t) = −tεµ + σ²t²/2.
g′(t) = −εµ + σ²t = 0 ⟹ t* = µε/σ² ⟹ g(t*) = −(ε²/2)(µ²/σ²)
Therefore,
P(X − µ ≥ εµ) ≤ e^{−(ε²/2)(µ²/σ²)},
decreasing at the rate of e^{−ε²}.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 212
Left tail bound P(X − µ ≤ −εµ): for any t < 0,
P(X − µ ≤ −εµ) = P(X ≤ (1−ε)µ)
≤ e^{−t(1−ε)µ} MX(t)
= e^{−t(1−ε)µ} e^{µt + σ²t²/2}
= e^{tεµ + σ²t²/2}
Choose t = t* = −µε/σ² to minimize tεµ + σ²t²/2. Therefore,
P(X − µ ≤ −εµ) ≤ e^{−(ε²/2)(µ²/σ²)}
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 213
Combining the left and right tail bounds:
P(|X − µ| ≥ εµ) = P(X − µ ≥ εµ) + P(X − µ ≤ −εµ) ≤ 2 e^{−(ε²/2)(µ²/σ²)}
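A minimal Matlab sketch (toy µ and σ, not from the slides) comparing the exact normal tail with the Chebyshev and Chernoff bounds above:
% A minimal sketch: exact normal tail P(|X - mu| >= e*mu) vs the two bounds.
mu = 1; sigma = 0.5; ev = 0.2:0.2:1.0;             % ev plays the role of epsilon
exact     = erfc((ev*mu)/(sigma*sqrt(2)));         % = 2*(1 - Phi(e*mu/sigma))
chebyshev = min(1, sigma^2 ./ (ev.^2 * mu^2));
chernoff  = 2 * exp(-(ev.^2/2) * (mu^2/sigma^2));
disp([ev' exact' chebyshev' chernoff']);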
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 214
Sample Size Selection Using Tail Bounds
Xi ∼ N(µ, σ²), i.i.d., i = 1 to k.
An unbiased estimator of µ is µ̂:
µ̂ = (1/k) ∑_{i=1}^k Xi,  µ̂ ∼ N(µ, σ²/k)
Choose k such that
P(|µ̂ − µ| ≥ εµ) ≤ δ
We already know P(|µ̂ − µ| ≥ εµ) ≤ 2 e^{−(ε²/2) µ²/(σ²/k)}.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 215
It suffices to select k such that
2 e^{−(ε²/2) kµ²/σ²} ≤ δ
⟹ e^{−(ε²/2) kµ²/σ²} ≤ δ/2
⟹ −(ε²/2)(kµ²/σ²) ≤ log(δ/2)
⟹ (ε²/2)(kµ²/σ²) ≥ −log(δ/2)
⟹ k ≥ [−log(δ/2)] (2/ε²)(σ²/µ²)
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 216
Suppose Xi ∼ N(µ, σ²), i = 1 to k, i.i.d. Then µ̂ = (1/k) ∑_{i=1}^k Xi is an unbiased estimator of µ. If the sample size k satisfies
k ≥ [log(2/δ)] (2/ε²)(σ²/µ²),
then with probability at least 1 − δ, the estimated µ̂ is within a 1 ± ε factor of the true µ, i.e., |µ̂ − µ| ≤ εµ.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 217
What affects the sample size k? (A small numerical sketch follows this list.)
k ≥ [log(2/δ)] (2/ε²)(σ²/µ²)
• δ: level of significance. Lower δ → more significant → larger k.
• σ²/µ²: noise/signal ratio. Higher σ²/µ² → larger k.
• ε: accuracy. Lower ε → more accurate → larger k.
• The evaluation criterion. For example, |µ̂ − µ| ≤ εµ, or |µ̂ − µ| ≤ ε?
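A minimal Matlab sketch (toy numbers, not from the slides) of the sample-size rule above:
% A minimal sketch: k >= log(2/delta) * (2/eps^2) * (sigma^2/mu^2).
delta = 0.05; eps_acc = 0.1; mu = 2; sigma = 3;
k = ceil(log(2/delta) * (2/eps_acc^2) * (sigma^2/mu^2));
fprintf('need k >= %d samples\n', k);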
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 218
Tail Bounds and Random Projections
Recall that in random projections, k d̂/d ∼ χ²_k.
P(|d̂ − d| ≥ εd) = P(|k d̂/d − k| ≥ εk) = P(k d̂/d ≥ (1+ε)k) + P(k d̂/d ≤ (1−ε)k)
P(k d̂/d ≥ (1+ε)k) ≤ e^{−t(1+ε)k} (1 − 2t)^{−k/2}
= exp( −t(1+ε)k − (k/2) log(1 − 2t) )
= exp( −k [ t(1+ε) + log(1 − 2t)/2 ] ),
which is minimized at the t such that [ t(1+ε) + log(1 − 2t)/2 ]′ = 0
⟹ (1+ε) − 1/(1 − 2t) = 0 ⟹ 1 − 2t = 1/(1+ε) ⟹ t = ε/(2(1+ε))
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 219
Therefore, the right tail bound is
P(k d̂/d ≥ (1+ε)k) ≤ exp( −k [ (ε/(2(1+ε)))(1+ε) − log(1+ε)/2 ] )
= exp( −(k/2)[ ε − log(1+ε) ] )
= exp( −(k/2)[ ε²/2 − ε³/3 + ... ] )
Similarly, we can obtain the left tail bound:
P(k d̂/d ≤ (1−ε)k) ≤ exp( −(k/2)[ −ε − log(1−ε) ] ) = exp( −(k/2)[ ε²/2 + ε³/3 + ... ] )
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 220
Therefore,
P(|d̂ − d| ≥ εd) = P(k d̂/d ≥ (1+ε)k) + P(k d̂/d ≤ (1−ε)k)
≤ exp( −(k/2)[ ε²/2 − ε³/3 + ... ] ) + exp( −(k/2)[ ε²/2 + ε³/3 + ... ] )
≤ 2 exp( −(k/2)[ ε²/2 − ε³/3 + ... ] ),
which means that, in order for P(|d̂ − d| ≥ εd) ≤ δ, it suffices to let
k ≥ 2 log(2/δ) / ( ε²/2 − ε³/3 )
Normally, ε is small. Hence we can simply say k = O(1/ε²).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 221
Improving Random Projections Using Marginal Information
Recall, the projected data v1,j and v2,j are bivariate normal:
(v1,j, v2,j)^T ∼ N( µ = (0, 0)^T,  Σ = [m1  a; a  m2] )
The observation is that one does not need to estimate m1 and m2, because they can be computed exactly with a linear scan of the data.
In fact, when a² ≈ m1m2 (i.e., the two original vectors are almost identical), the estimator â = (1/k) ∑_{j=1}^k v1,j v2,j becomes very bad (the variance is maximized). In this case, one can first estimate d and then infer a because d = m1 + m2 − 2a. The question is whether we can find a systematic strategy to improve the estimates of a and d in all situations. We can resort to the MLE.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 222
The MLE Results
The MLE, denoted by âMLE, is the solution to a cubic equation:
â³ − â²(v1^T v2)/k + â( −m1m2 + m1‖v2‖²/k + m2‖v1‖²/k ) − m1m2 v1^T v2/k = 0.
E(âMLE − a) = O(k⁻²),
E( (âMLE − a)³ ) = −2a(3m1m2 + a²)(m1m2 − a²)³ / ( k²(m1m2 + a²)³ ) + O(k⁻³),
Var(âMLE) = (1/k)(m1m2 − a²)²/(m1m2 + a²) + (1/k²)·4(m1m2 − a²)⁴/(m1m2 + a²)⁴ · m1m2 + O(k⁻³).
However, the cubic MLE equation admits multiple real roots with a small probability when the sample size k is small.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 223
One can show that
Pr(multiple real roots) = Pr( P²(11 − Q²/4 − 4Q + P²) + (Q − 1)³ ≤ 0 ),
where P = v1^T v2/(k√(m1m2)),  Q = ‖v1‖²/(k m1) + ‖v2‖²/(k m2).
This probability is (crudely) bounded by
Pr(multiple real roots) ≤ e^{−0.0085k} + e^{−0.0966k}.
When a = m1 = m2, this probability can be (sharply) bounded by
Pr(multiple real roots | a = m1 = m2) ≤ e^{−1.5328k} + e^{−0.4672k}.
We can also use simulations to compute the probability.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 224
[Figure: probability of multiple real roots versus sample size k (log scale on the vertical axis), for a′ = 0, 0.5, 1, together with the upper bound for a′ = 1]
Simulations show that Pr(multiple real roots) decreases exponentially fast with respect to increasing sample size k (notice the log scale on the vertical axis). After k ≥ 8, the probability that the cubic MLE equation admits multiple roots becomes so small (≤ 1%) that it can be safely ignored in practice. Here a′ = a/√(m1m2).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 225
Derivation of the MLE Equation
First, we can write down the joint likelihood function for {v1,j, v2,j}_{j=1}^k:
lik( {v1,j, v2,j}_{j=1}^k ) ∝ |Σ|^{−k/2} exp( −(1/2) ∑_{j=1}^k [v1,j  v2,j] Σ^{−1} [v1,j; v2,j] ),
where (assuming m1m2 ≠ a² to avoid triviality)
|Σ| = m1m2 − a²,   Σ^{−1} = 1/(m1m2 − a²) [m2  −a; −a  m1],
which allows us to express the log likelihood function l(a) as
l(a) = −(k/2) log(m1m2 − a²) − ∑_{j=1}^k ( v1,j² m2 − 2 v1,j v2,j a + v2,j² m1 ) / ( 2(m1m2 − a²) ).
Setting l′(a) to zero, we obtain âMLE, the solution to a cubic equation.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 226
The large sample theory tells us that âMLE is asymptotically unbiased and converges weakly to a normal random variable N( a, Var(âMLE) = 1/I(a) ), where I(a), the expected Fisher Information, is I(a) = −E(l″(a)). Some algebra will show that
I(a) = k(m1m2 + a²)/(m1m2 − a²)²,   Var(âMLE) = (1/k)(m1m2 − a²)²/(m1m2 + a²) + O(1/k²)
Higher-order terms can be obtained by more careful expansions. The bias is
E(âMLE − a) = −( E(l‴(a)) + 2I′(a) )/( 2I²(a) ) + O(k⁻²),
which is often called the "Bartlett correction." Some algebra shows that this estimator does not have an O(k⁻¹) bias.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 227
The third central moment is
E(âMLE − a)³ = ( −3I′(a) − E(l‴(a)) )/I³(a) + O(k⁻³) = −2a(3m1m2 + a²)(m1m2 − a²)³/( k²(m1m2 + a²)³ ) + O(k⁻³).
The O(k⁻²) term of the variance, denoted by V2^c, can be written as
V2^c = (1/I³(a)) ( E(l″(a))² − I²(a) − ∂( E(l‴(a)) + 2I′(a) )/∂a )
+ (1/(2I⁴(a))) ( 10(I′(a))² − E(l‴(a))( E(l‴(a)) − 4I′(a) ) )
= (4/k²) (m1m2 − a²)⁴/(m1m2 + a²)⁴ · m1m2,
after some truly grueling algebra.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 228
Sign Random Projections
Normal random projections would be just an interesting idea with little industry impact if sign random projections had not been discovered. This is because industry data are often binary and sparse, for which other algorithms such as minwise hashing can be more suitable.
Instead of storing each projected sample using (e.g.) 64 bits, we can simply store the sign (i.e., 1 bit). Interestingly, the collision probability has a closed form:
T = Pr( sign(v1,j) = sign(v2,j) ) = 1 − θ/π,   cos θ = a/√(m1m2)
From k i.i.d. samples, one can estimate T, then θ, and then a, again assuming m1 and m2 are known.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 229
âSign = cos(θ̂) √(m1m2).
By the Delta method, âSign is asymptotically unbiased with asymptotic variance
Var(âSign) = Var(θ̂) sin²(θ) m1m2 = ( θ(π − θ)/k ) sin²(θ) m1m2,
because
Var(θ̂) = (π²/k)(1 − θ/π)(θ/π) = θ(π − θ)/k.
Regular random projections store real numbers (e.g., 64 bits). At the same number of projections (i.e., the same k), sign random projections will obviously have larger variances. If the variance is inflated only by a factor of (e.g.) 4, sign random projections would still be preferable because we could increase k to (e.g.) 4k, achieving the same accuracy while the storage cost remains lower than for regular random projections.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 230
Define
VSign = Var(âSign)/Var(âMLE) = θ(π − θ) sin²(θ) m1m2 / ( (m1m2 − a²)²/(m1m2 + a²) ) = θ(π − θ)(1 + cos²(θ)) / sin²(θ),
which is symmetric about θ = π/2, monotonically decreasing on (0, π/2], with minimum π²/4 ≈ 2.47 attained at θ = π/2.
[Figure: variance ratio VSign versus θ (in units of π)]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 231
When the data points are nearly uncorrelated (θ close to π/2), sign random projections should have good performance.
However, some applications such as duplicate detection are interested in data points that are close to each other, and hence sign random projections may cause relatively large errors. In that case, we are better off using regular normal random projections with marginal information.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 232
Proof of the Sign Random Projection Collision Probability
g(ρ) = ∫_0^∞ ∫_0^∞ f(x, y) dx dy = ∫_0^∞ ∫_0^∞ 1/(2π√(1−ρ²)) e^{−(x² − 2ρxy + y²)/(2(1−ρ²))} dx dy
= ∫_0^∞ 1/(2π√(1−ρ²)) e^{−y²/(2(1−ρ²))} dy ∫_0^∞ e^{−(x² − 2ρxy)/(2(1−ρ²))} dx
= ∫_0^∞ 1/(2π√(1−ρ²)) e^{−y²/(2(1−ρ²))} dy ∫_0^∞ e^{−(x − yρ)²/(2(1−ρ²))} e^{y²ρ²/(2(1−ρ²))} dx
= ∫_0^∞ 1/(2π√(1−ρ²)) e^{−y²/2} dy ∫_0^∞ e^{−(x − yρ)²/(2(1−ρ²))} dx
= ∫_0^∞ 1/(2π√(1−ρ²)) e^{−y²/2} dy ∫_{−yρ/√(1−ρ²)}^∞ e^{−t²/2} √(1−ρ²) dt
= ∫_0^∞ φ(y) Φ( yρ/√(1−ρ²) ) dy
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 233
∂g(ρ)/∂ρ = ∫_0^∞ φ(y) φ( yρ/√(1−ρ²) ) · y/(1−ρ²)^{3/2} dy
= 1/(1−ρ²)^{1/2} ∫_0^∞ (1/(2π)) e^{−y²/(2(1−ρ²))} d( y²/(2(1−ρ²)) )
= 1/(1−ρ²)^{1/2} · 1/(2π)
Note that g(0) = 1/4. Hence
g(ρ) = ∫_0^ρ 1/(1−ρ²)^{1/2} · 1/(2π) dρ + 1/4
= (1/(2π)) sin⁻¹(ρ) + 1/4
= (1/(2π))(π/2 − θ) + 1/4 = 1/2 − θ/(2π)
This proves the desired probability T = 2g(ρ) = 1 − θ/π. Here ρ = a/√(m1m2).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 234
Comparing Random Projections with Simple Random Sampling
Suppose we randomly sample k elements from u1 and denote the samples by s1, s2, ..., sk. Then an unbiased estimator of m1 = ∑_{i=1}^D u1,i² is
m̂1,s = (D/k) ∑_{j=1}^k sj²,
E(m̂1,s) = D E(sj²) = D ( ∑_{i=1}^D u1,i² )/D = m1
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 235
The variance is (assuming k ≪ D):
Var(m̂1,s) = (D²/k)( E(sj⁴) − E²(sj²) ) = (D²/k) [ ∑_{i=1}^D u1,i⁴/D − ( ∑_{i=1}^D u1,i²/D )² ],
which can be dominated by the fourth-order moment of the data.
Recall that the variance of the random projection estimator is only related to the second-order moment. When the data are heavy-tailed, simple random sampling will have much worse performance compared to random projections.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 236
Summary of Normal Random Projections
Random Projections : Replace A by B = A×R
A R = B
• An elegant method, interesting (elementary) probability exercise. Suitable for
approximating Euclidean distances in massive, dense, and heavy-tailed
(some entries are excessively large) data matrices.
• It does not take advantage of data sparsity.
• It has guaranteed performance when estimating the l2 distance.
• The straightforward estimator for the inner product can be quite unsatisfactory.
• An MLE can improve the estimates by using marginal information.
• What is often used in industry is the sign (1-bit) random projections.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 237
Normality Assumption is Not Necessary
A R = B
R ∈ RD×k: a random matrix, with i.i.d. entries sampled from N(0, 1).
B ∈ Rn×k : projected matrix, also random.
B approximately preserves the Euclidean distances and dot products between any two rows of A. In particular, (1/k) E(BB^T) = AA^T.
However, we do not really need the normal distribution for sampling the projection matrix. In fact, any zero-mean distribution with finite variance should work (central limit theorem).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 238
Sparse Random Projection Matrix
Instead of N(0, 1), we can sample the entries of R i.i.d. from
rij = √s × { 1 with prob. 1/(2s);  0 with prob. 1 − 1/s;  −1 with prob. 1/(2s) }
Here √s is just for convenience. With this choice,
E(rij) = 0,  E(rij²) = 1,  E(rij⁴) = s,  E(|rij³|) = √s,
E(rij ri′j′) = 0,  E(rij² ri′j′) = 0 when i ≠ i′ or j ≠ j′.
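A minimal Matlab sketch (toy sizes, not from the slides) that generates this sparse projection matrix:
% A minimal sketch: entries sqrt(s)*{+1, 0, -1} with probs {1/(2s), 1-1/s, 1/(2s)}.
D = 10000; k = 50; s = sqrt(D);                    % an aggressive choice of s
U = rand(D, k);
R = sqrt(s) * ((U < 1/(2*s)) - (U > 1 - 1/(2*s))); % +sqrt(s), 0, or -sqrt(s)
fprintf('fraction of nonzeros %.4f (expected 1/s = %.4f)\n', nnz(R)/numel(R), 1/s);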
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 239
We still use the same unbiased estimators as in normal random projections, i.e.,
m̂1 = (1/k) ∑_{j=1}^k |v1,j|²,
d̂ = (1/k) ∑_{j=1}^k |v1,j − v2,j|²,
â = (1/k) ∑_{j=1}^k v1,j v2,j
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 240
Their variances can be proved to be
Var(m̂1) = (1/k) ( 2m1² + (s − 3) ∑_{i=1}^D u1,i⁴ ),
Var(d̂) = (1/k) ( 2d² + (s − 3) ∑_{i=1}^D (u1,i − u2,i)⁴ ),
Var(â) = (1/k) ( m1m2 + a² + (s − 3) ∑_{i=1}^D u1,i² u2,i² ).
Interestingly, when s < 3, the variances are strictly smaller than the variances of normal random projections, regardless of the data.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 241
Even when s is large, for example s = √D, the variances may not increase much unless the fourth moments of the data are too large. To see this,
(s − 3) ∑_{i=1}^D u1,i⁴ / m1² = ((s − 3)/D) · ( ∑_{i=1}^D u1,i⁴/D ) / ( m1/D )²,
which can be written as O((s − 3)/D) if the data can be assumed to have finite fourth moments. When D is very large (as in practice), we can choose s to be very large as long as (s − 3)/D is relatively small.
For example, s = √D is often a good choice if D is truly large and the data are not too heavy-tailed.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 242
Proof of the Variances
It suffices to study â. (Why?) Note that
a² = ( ∑_{i=1}^D u1,i u2,i )² = ∑_{i=1}^D u1,i² u2,i² + 2 ∑_{i<i′} u1,i u2,i u1,i′ u2,i′.
v1,j v2,j = ∑_{i=1}^D rij² u1,i u2,i + ∑_{i≠i′} rij u1,i ri′j u2,i′,
E(v1,j v2,j) = ∑_{i=1}^D E(rij²) u1,i u2,i + ∑_{i≠i′} E(rij) u1,i E(ri′j) u2,i′ = ∑_{i=1}^D u1,i u2,i
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 243
v1,j² v2,j² = [ ∑_{i=1}^D rij² u1,i u2,i + ∑_{i≠i′} rij u1,i ri′j u2,i′ ]²
= ∑_{i=1}^D rij⁴ u1,i² u2,i²
+ 2 ∑_{i<i′} rij² u1,i u2,i ri′j² u1,i′ u2,i′
+ ( ∑_{i≠i′} rij u1,i ri′j u2,i′ )²
+ 2 ∑_{i=1}^D rij² u1,i u2,i ∑_{i≠i′} rij u2,i ri′j u1,i′,
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 244
E( v1,j² v2,j² ) = s ∑_{i=1}^D u1,i² u2,i² + 4 ∑_{i<i′} u1,i u2,i u1,i′ u2,i′ + ∑_{i≠i′} u1,i² u2,i′²
= (s − 2) ∑_{i=1}^D u1,i² u2,i² + ∑_{i≠i′} u1,i² u2,i′² + 2a²
= m1m2 + (s − 3) ∑_{i=1}^D u1,i² u2,i² + 2a²,
Var(â) = (1/k) ( m1m2 + a² + (s − 3) ∑_{i=1}^D u1,i² u2,i² ).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 245
Discussions
• Sparse random projections have significant advantages because (i) random number generation is much simpler; (ii) matrix multiplication is essentially avoided; (iii) storing the projection matrix is much less costly, etc.
• Note that the analysis only needs E(rij) = E(rij³) = 0, E(rij²) = 1, and E(rij⁴) = s. The exact distribution of rij is actually irrelevant. Therefore, the analysis is much more general than just this particular choice of random projection matrix.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 246
Cauchy Random Projections
A R = B
R ∈ RD×k: a random matrix with i.i.d. entries sampled from standard Cauchy
C(0, 1). B ∈ Rn×k : projected matrix, also random.
It turns out B contains sufficient information to estimate the original l1 distances in A. Consider the first two rows of A; then d = d1 = ∑_{i=1}^D |u1,i − u2,i|.
The l1 distance is often believed to provide more "robust" results (e.g., for clustering and classification) than the l2 distance (but we should keep in mind that for many datasets the l1 distance is not a good choice).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 247
Normal Random Projections Cannot Estimate the l1 Distance
Recall, if rij ∼ N(0, 1), then v1,j ∼ N( 0, ∑_{i=1}^D |u1,i|² ), which does not contain information about the l1 norm.
If rij is sampled from another distribution, as long as E(rij) = 0 and E(rij²) < ∞, by the Central Limit Theorem (CLT) we always obtain approximately normal projected data.
Therefore, in order to "avoid" the CLT, we should sample from distributions which do not have bounded variance, or even a bounded mean. The Cauchy distribution is a well-known example.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 248
Review: Cauchy Distribution
A Cauchy random variable z ∼ C(0, γ) has density
f(z) = (γ/π) · 1/(z² + γ²),  γ > 0, −∞ < z < ∞,
and characteristic function
E( exp(√−1 zt) ) = exp(−γ|t|).
Consider z1, z2, ..., zD i.i.d. C(0, γ), and any constants c1, c2, ..., cD. Then
E( exp( √−1 t ∑_{i=1}^D ci zi ) ) = exp( −γ ∑_{i=1}^D |ci| |t| ),
which means the weighted sum ∑_{i=1}^D ci zi ∼ C( 0, γ ∑_{i=1}^D |ci| ). This is the foundation of Cauchy random projections.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 249
Parameter Estimation Problem in Cauchy Random Projections
In Cauchy random projections, we let rij ∼ C(0, 1), and v1,j = ∑_{i=1}^D u1,i rij, v2,j = ∑_{i=1}^D u2,i rij. Therefore,
v1,j ∼ C( 0, ∑_{i=1}^D |u1,i| ),   v2,j ∼ C( 0, ∑_{i=1}^D |u2,i| ),
xj = v1,j − v2,j ∼ C( 0, ∑_{i=1}^D |u1,i − u2,i| ).
That is, the task boils down to estimating the scale parameter, which happens to be the l1 distance d = ‖u1 − u2‖1 = ∑_{i=1}^D |u1,i − u2,i| in the original space.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 250
Three Types of Estimators
Given k i.i.d. samples xj ∼ C(0, d), the task is to estimate d. We know the
sample mean does not work, because E|x| =∞. We study three types of
estimators:
1. Bias-corrected Sample Median Estimator
2. Bias-corrected Geometric Mean Estimator
3. Bias-corrected MLE
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 251
The Bias-corrected Sample Median Estimator
d̂me,c = d̂me / bme,
where
d̂me = median(|xj|, j = 1, 2, ..., k),
bme = ∫_0^1 ( (2m+1)!/(m!)² ) tan(πt/2) (t − t²)^m dt,  k = 2m + 1.
Here, for convenience, we only consider k = 2m + 1, m = 1, 2, 3, ...
Note that bme can be numerically evaluated and tabulated for each k.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 252
Some properties of d̂me,c:
• E(d̂me,c) = d, i.e., d̂me,c is unbiased.
• When k ≥ 5, the variance of d̂me,c is
Var(d̂me,c) = d² [ ( (m!)²/(2m+1)! ) ∫_0^1 tan²(πt/2)(t − t²)^m dt / ( ∫_0^1 tan(πt/2)(t − t²)^m dt )² − 1 ],
and Var(d̂me,c) = ∞ if k = 3.
• As k → ∞, d̂me,c converges to a normal in distribution:
√k (d̂me,c − d) ⇒ N( 0, π²d²/4 ).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 253
The bias correction factor bme can be numerically evaluated and tabulated as a function of k = 2m + 1. After k > 50, the bias is negligible.
[Figure: bias correction factor bme versus sample size k]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 254
The Bias-corrected Geometric Mean Estimator
The bias-corrected geometric mean estimator is defined as
d̂gm,c = cos^k( π/(2k) ) ∏_{j=1}^k |xj|^{1/k},  k > 1
Useful properties of d̂gm,c include:
• It is unbiased, i.e., E(d̂gm,c) = d.
• Its variance is (for k > 2)
Var(d̂gm,c) = d² ( cos^{2k}(π/(2k)) / cos^k(π/k) − 1 ) = (π²/4)(d²/k) + (π⁴/32)(d²/k²) + O(1/k³).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 255
• For 0 ≤ ε ≤ 1, its tail bounds can be represented in exponential form:
Pr( d̂gm,c − d > εd ) ≤ exp( −k ε²/(8(1+ε)) )
Pr( d̂gm,c − d < −εd ) ≤ exp( −k ε²/(8(1+ε)) ),  k ≥ π²/(1.5ε)
• These exponential tail bounds yield an analog of the Johnson–Lindenstrauss (JL) Lemma for dimension reduction in l1:
If k ≥ 8(2 log n − log δ)/( ε²/(1+ε) ) ≥ π²/(1.5ε), then with probability at least 1 − δ, one can recover the original l1 distance between any pair of data points (among all n data points) within a 1 ± ε (0 ≤ ε ≤ 1) factor of the truth, using d̂gm,c, i.e., |d̂gm,c − d| ≤ εd. (A small Matlab sketch of this estimator follows below.)
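A minimal Matlab sketch (toy vectors, not from the slides) of Cauchy random projections with the bias-corrected geometric-mean estimator:
% A minimal sketch: estimate the l1 distance d = sum(abs(u1 - u2)).
D = 5000; k = 101;
u1 = rand(1, D); u2 = rand(1, D);
d  = sum(abs(u1 - u2));                            % true l1 distance
R  = tan(pi * (rand(D, k) - 0.5));                 % i.i.d. standard Cauchy entries
x  = (u1 - u2) * R;                                % k i.i.d. C(0, d) samples
d_gm = cos(pi/(2*k))^k * exp(mean(log(abs(x))));   % bias-corrected geometric mean
fprintf('true %.1f, estimate %.1f\n', d, d_gm);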
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 256
Fractional Moments of the Cauchy
Assume x ∼ C(0, d). Then
E( |x|^λ ) = d^λ / cos(λπ/2),  |λ| < 1
Proof:
E( |x|^λ ) = (2d/π) ∫_0^∞ y^λ/(y² + d²) dy = (d^λ/π) ∫_0^∞ y^{(λ−1)/2}/(y + 1) dy = d^λ/cos(λπ/2),
with the help of integral tables.
By letting λ = 1/k, we obtain the geometric mean estimator.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 257
The Bias-corrected Maximum Likelihood Estimator
The bias-corrected maximum likelihood estimator (MLE) is
d̂MLE,c = d̂MLE ( 1 − 1/k ),
where d̂MLE solves the nonlinear MLE equation
−k/d̂MLE + ∑_{j=1}^k 2 d̂MLE/( xj² + d̂MLE² ) = 0.
Some properties of d̂MLE,c:
• It is nearly unbiased: E(d̂MLE,c) = d + O(1/k²).
• Its asymptotic variance is
Var(d̂MLE,c) = 2d²/k + 3d²/k² + O(1/k³),
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 258
i.e., Var(d̂MLE,c)/Var(d̂me,c) → 8/π² and Var(d̂MLE,c)/Var(d̂gm,c) → 8/π² as k → ∞ (8/π² ≈ 80%).
• Its distribution can be accurately approximated by an inverse Gaussian, at least in the small-deviation range. The inverse Gaussian approximation suggests the following approximate tail bound:
Pr( |d̂MLE,c − d| ≥ εd ) ≲ 2 exp( −( ε²/(1+ε) ) / ( 2(2/k + 3/k²) ) ),  0 ≤ ε ≤ 1,
which has been verified by simulations down to the tail probability ≥ 10⁻¹⁰ range. (A Matlab sketch for solving the MLE equation follows below.)
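A minimal Matlab sketch (assumes x holds the k i.i.d. C(0, d) samples, e.g., from the Cauchy-projection sketch above):
% A minimal sketch: solve the MLE equation with fzero, then apply (1 - 1/k).
k = length(x);
score = @(dd) -k/dd + sum(2*dd ./ (x.^2 + dd^2));  % the MLE equation above
d_mle   = fzero(score, median(abs(x)));            % sample median as starting point
d_mle_c = d_mle * (1 - 1/k);                       % bias-corrected MLE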
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 259
Sampling from the Cauchy and Equivalent Distributions
A Cauchy random variable Z ∼ C(0, 1) has density
fZ(z) = (1/π) · 1/(z² + 1),  −∞ < z < ∞
There are various ways to sample from C(0, 1):
• If X, Y ∼ N(0, 1) i.i.d., then Z = X/Y ∼ C(0, 1).
• If U ∼ U(0, 1), then Z = tan( π[U − 0.5] ) ∼ C(0, 1).
• We can use 1/U to approximate a Cauchy, as they are asymptotically equivalent in the tails. Note that fZ(z) = (1/π)·1/(z² + 1) ≈ (1/π)·1/z² for large z; in other words, Pr(Z > z) ≈ (1/π)(1/z) for large z, i.e., the tail behaves like an η-Pareto distribution with η = 1. If Z ∼ η-Pareto, then Pr(Z > z) = 1/z^η, and E(|Z|^λ) < ∞ only if λ < η.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 260
Very Sparse Cauchy Random Projections
In practice, we can sample rij from the following very sparse distribution:
rij = { 1/U1 with prob. β/2;  0 with prob. 1 − β;  −1/U2 with prob. β/2 },
where U1 and U2 are independent uniform random variables in (0, 1).
One can show by Fourier transform that, as D → ∞,
( ∑_{i=1}^D ci rij ) / ( β ∑_{i=1}^D |ci| )  ⟹  C(0, π/2),
which, for convenience, is written as
∑_{i=1}^D ci rij  ⟹  C( 0, (π/2) β ∑_{i=1}^D |ci| ).
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 261
Numerical Experiments on Synthetic Data
We simulate data from an η-Pareto distribution with η = 1.1 (i.e., highly heavy-tailed data), and then apply very sparse stable random projections with β = 0.05 (i.e., a 20-fold speedup) to estimate the l1 norm of the data.
The mean square errors (MSE), computed from 10⁶ simulations for each k and D, are plotted against the sample size k for each D = 100, 500, 1000, and 5000.
We compute the empirical variances from 10⁶ simulations for every k and D. The "theoretic" curve is the theoretical variance assuming the data are exactly (instead of asymptotically) Cauchy. When D = 100, the performance is poor, but as soon as D ≥ 500, very sparse Cauchy random projections produce results very similar to regular Cauchy random projections.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 262
[Figure: MSE versus sample size k for Pareto data (η = 1.1, β = 0.05); curves: Theoretic, D = 100, 500, 1000, 5000]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 263
Next, we simulate data from an η-Pareto distribution, Pη, for η = 1.5 and η = 2.0. More aggressively, we let β = D^{−0.6} and D^{−0.75}. And we will see that D does not have to be very large for the asymptotic theory to work well.
[Figure: MSE versus sample size k for Pareto data (η = 1.5, β = D^{−0.6}); curves: Theoretic, D = 100, 500, 1000]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 264
[Figure: MSE versus sample size k for Pareto data (η = 2, β = D^{−0.75}); curves: Theoretic, D = 100, 500, 1000]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 265
Experiments on Web Crawl Data
We apply very sparse stable random projections on some Web crawl data. We
pick two pairs of words, THIS-HAVE, and SCHOOL-PROGRAM.
For each word (vector), the ith entry (i = 1 to D = 65536) is the number of
occurrences of this word in the ith Web page. It is well-known that the word
frequency data are highly heavy-tailed and highly sparse.
For each pair, we estimate the l1 distance using very sparse stable random
projections with β = 0.1, 0.01, and 0.001. For the pair THIS-HAVE, even when
β = 0.001, the results are indistinguishable from what we would obtain by exact
stable random projections. For the pair SCHOOL-PROGRAM, when β = 0.01,
the results are good. However, when β = 0.001, we see larger errors. This is because the data are both sparse and highly heavy-tailed.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 266
[Figure: MSE versus sample size k for the word pair THIS–HAVE; curves: Theoretic, β = 0.1, 0.01, 0.001]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 267
[Figure: MSE versus sample size k for the word pair SCHOOL–PROGRAM; curves: Theoretic, β = 0.1, 0.01, 0.001]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 268
Classifying Cancers in Microarray Data
Usually the purpose of computing distances is to support subsequent tasks such as clustering, classification, information retrieval, etc. Here we consider the task of classifying diseases using the Harvard microarray dataset. The original dataset contains 176 samples (specimens) in 12600 gene dimensions.
We conduct both Cauchy random projections and very sparse stable random projections (β = 0.1, 0.01, and 0.001) and classify the specimens using a 5-nearest-neighbor classifier based on the estimated l1 distances (using the geometric mean estimator).
We observe (i) that stable random projections can achieve similar classification accuracy using about 100 projections (as opposed to the original D = 12600 dimensions); (ii) that very sparse stable random projections work well when β = 0.1 and 0.01. Even with β = 0.001, the classification results are only slightly worse.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 269
[Figure: misclassification errors vs. sample size k on the Harvard microarray data, mean (left) and standard deviation (right); curves: Cauchy, β = 0.1, β = 0.01, β = 0.001.]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 270
Choice of the Random Projection Matrix R
The distribution of the entries of R depends on which pairwise lp distance (between rows of the data matrix) is of interest (a sampling sketch for the general case follows the list).
• If p = 2 (Euclidean distance), sample the entries of R from the normal distribution, or from any normal-like distribution with finite variance.
• If p = 1 (Manhattan distance), sample the entries of R from the Cauchy distribution.
• If p = 0 (Hamming distance), sample the entries of R from a p-stable distribution with p ≈ 0.
• For general 0 < p ≤ 2, sample the entries of R from a p-stable distribution.
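For general p, symmetric p-stable variates can be generated with the Chambers–Mallows–Stuck method. Below is a minimal Matlab sketch; the function name stable_rnd is a placeholder, and for p = 1 the formula reduces to tan(θ), i.e., a Cauchy variate.

% Sketch: i.i.d. symmetric p-stable entries via Chambers-Mallows-Stuck.
% Save as stable_rnd.m.  theta ~ Uniform(-pi/2, pi/2), w ~ Exp(1).
function r = stable_rnd(p, D, k)
    theta = pi * (rand(D, k) - 0.5);
    w     = -log(rand(D, k));
    r = sin(p*theta) ./ cos(theta).^(1/p) .* ...
        (cos((1-p)*theta) ./ w).^((1-p)/p);
end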
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 271
Stable Random Projections (SRP)
Given two D-dimensional vectors u1, u2 ∈ R^D, let v1 = R^T u1 and v2 = R^T u2. Then
$$ v_{1,j} - v_{2,j} = \sum_{i=1}^{D} (u_{1,i} - u_{2,i})\, r_{ij} \sim S\!\left(p,\; d_p = \sum_{i=1}^{D} |u_{1,i} - u_{2,i}|^p\right). $$
Thus, if we only need the distance dp, we can perform this projection k times and
estimate the scale parameter from the resultant k i.i.d. stable samples.
SRP essentially boils down to a statistical estimation problem. The main open problem is how to estimate dp when p > 2 (e.g., for kurtosis- and skewness-type statistics).
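A minimal Matlab sketch of the SRP pipeline, assuming the hypothetical stable_rnd sampler sketched earlier: project both vectors with the same matrix R, and the entrywise differences of the projections are then k i.i.d. S(p, d_p) samples from which d_p can be estimated.

% Sketch: stable random projections of two vectors u1, u2 in R^D.
p = 1; D = 10000; k = 200;
u1 = rand(D, 1); u2 = rand(D, 1);   % example data vectors
R  = stable_rnd(p, D, k);           % D x k matrix of i.i.d. p-stable entries
v1 = R' * u1;  v2 = R' * u2;        % k x 1 projected vectors
w  = v1 - v2;                       % k i.i.d. S(p, d_p) samples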
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 272
Applications of Random Projections
Numerous applications:
• Data visualization, e.g., multi-dimensional scaling (MDS) requires a pairwise similarity matrix.
• Machine learning, e.g., support vector machines (SVM) require a kernel/distance matrix.
• Information retrieval, e.g., filtering nearly duplicate documents (often measured by distance).
• Databases, e.g., estimating join sizes (dot products) for optimizing query execution.
• Dynamic data stream computations, e.g., estimating summary statistics for visualizing/detecting anomalies in real time.
Advantages (over sampling methods)
• Guaranteed accuracies in many cases, even on heavy-tailed data.
Disadvantages
• Limited to 0 < p ≤ 2 (although recently we extended it to p = 4, 6, 8...)
• One projection only for one p (i.e., not one-sketch-for-all).
• It does not take advantage of data sparsity.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 273
Impact of the Choice of p
Experiments on classification using m-nearest neighbors and the lp distance.
[Figure: classification error rate (%) vs. norm p on the Mnist (left) and Letter (right) datasets, for m = 1, 5, 10.]
lp distance: $d_p = \sum_{i=1}^{D} |x_i - y_i|^p$ (the horizontal axis "Norm (p)" in the figures is this p).
Interestingly, in many data sets the lowest classification errors occur at p ≥ 4, but p-stable random projections are normally limited to 0 < p ≤ 2.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 274
[Figure: classification error rate (%) vs. norm p on the Mnist10k, Zipcode, Realsim, and Gisette datasets, for m = 1, 5, 10.]
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 275
Efficient Algorithms for Estimating lp Distances ( p > 2)
Difficult to approximate lp distances if p > 2: $d_p = \sum_{i=1}^{D} |x_i - y_i|^p$.
• When 0 < p ≤ 2, the space (sample size) needed is O(1/ε²) = O(1) if we treat ε as a constant. Note that the result is independent of D.
• When p > 2, the space (sample size) needed is Ω(D^{1−2/p}).
• No practical algorithms were known to well approximate lp distances for
general p > 2. Even CRS (Conditional Random Sampling, another line of
work we will study later) does not work well when p is large.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 276
Simple Random Sampling
Randomly sample k columns from the data matrix, to compute the l4 distances.
Consider, for example, two rows in A, denoted by x and y. Denote the sampled
k entries by xj , yj , j = 1 to k. The estimator, denoted by d(4),S , is
$$ d_{(4),S} = \frac{D}{k} \sum_{j=1}^{k} |x_j - y_j|^4 $$
$$ \mathrm{Var}\left(d_{(4),S}\right) = \frac{D^2}{k^2}\, k \left[ E\!\left(|x_j - y_j|^8\right) - E^2\!\left(|x_j - y_j|^4\right) \right] = \frac{D}{k} \left[ \sum_{i=1}^{D} |x_i - y_i|^8 - \frac{\left( \sum_{i=1}^{D} |x_i - y_i|^4 \right)^2}{D} \right]. $$
In the worst case, the variance is dominated by the 8th order terms. Thus,
random sampling can have very large errors.
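A minimal Matlab sketch of this estimator, sampling the k coordinates uniformly with replacement (the with/without-replacement detail is an illustrative assumption):

% Sketch: simple random sampling estimator of the l4 distance.
D = 10000; k = 100;
x = rand(1, D); y = rand(1, D);                 % example data rows
idx  = randi(D, 1, k);                          % k sampled coordinates
d4_S = (D/k) * sum(abs(x(idx) - y(idx)).^4);    % estimate
d4   = sum(abs(x - y).^4);                      % exact value, for comparison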
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 277
Conditional Random Sampling for L4 Distance
Conditional Random Sampling (CRS) was recently proposed for sampling from
sparse data. The details will soon be explained. Denote the CRS estimate by
d(4),CRS . The variance is
$$ \mathrm{Var}\left(d_{(4),CRS}\right) \approx \frac{\max\{|x|_0,\, |y|_0\}}{D} \times \mathrm{Var}\left(d_{(4),S}\right), $$
where |x|0 and |y|0 are the numbers of non-zeros in vectors x and y.
While the variance of CRS is reduced substantially by taking advantage of the
data sparsity, it is nevertheless still dominated by the 8th order terms.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 278
A Solution Based on (Normal) Random Projections for p = 4
Based on a (retrospectively) simple trick:
$$ d_{(4)} = \sum_{i=1}^{D} |x_i - y_i|^4 = \sum_{i=1}^{D} x_i^4 + \sum_{i=1}^{D} y_i^4 + 6 \sum_{i=1}^{D} x_i^2 y_i^2 - 4 \sum_{i=1}^{D} x_i^3 y_i - 4 \sum_{i=1}^{D} x_i y_i^3. $$
• $\sum_{i=1}^{D} x_i^4$ and $\sum_{i=1}^{D} y_i^4$ may be computed exactly by one scan of the data.
• $\sum_{i=1}^{D} x_i^2 y_i^2$, $\sum_{i=1}^{D} x_i^3 y_i$, and $\sum_{i=1}^{D} x_i y_i^3$ are inner products at different orders and can be approximated by (normal) random projections.
• For example, we apply normal random projections to the vectors with entries $x_i^2$ and $y_i^2$ to estimate the inner product $\sum_{i=1}^{D} x_i^2 y_i^2$. This is certainly possible. Then we apply random projections to $x_i^3$ and $y_i$ to estimate $\sum_{i=1}^{D} x_i^3 y_i$, etc.
• The question is: should we use just one projection matrix, or should we use three?
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 279
Efficient Algorithm for Estimating l4 Distance (Three Projections)
Apply three projection matrices R^{(1)}, R^{(2)}, R^{(3)} ∈ R^{D×k} with i.i.d. N(0, 1) entries to the vectors x, y ∈ R^D, to generate six vectors:
$$ u_{1,j} = \sum_{i=1}^{D} x_i\, r^{(1)}_{ij}, \quad u_{2,j} = \sum_{i=1}^{D} x_i^2\, r^{(2)}_{ij}, \quad u_{3,j} = \sum_{i=1}^{D} x_i^3\, r^{(3)}_{ij}, $$
$$ v_{1,j} = \sum_{i=1}^{D} y_i\, r^{(3)}_{ij}, \quad v_{2,j} = \sum_{i=1}^{D} y_i^2\, r^{(2)}_{ij}, \quad v_{3,j} = \sum_{i=1}^{D} y_i^3\, r^{(1)}_{ij}. $$
We have an unbiased estimator, denoted by d(4),3p:
$$ d_{(4),3p} = \sum_{i=1}^{D} x_i^4 + \sum_{i=1}^{D} y_i^4 + \frac{1}{k}\left( 6\, u_2^T v_2 - 4\, u_3^T v_1 - 4\, u_1^T v_3 \right). $$
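A minimal Matlab sketch of d(4),3p with illustrative data; note the pairing of the matrices (u1 and v3 share R(1), u2 and v2 share R(2), u3 and v1 share R(3)), exactly as in the definitions above.

% Sketch: three-projection estimator of the l4 distance.
D = 10000; k = 200;
x = rand(D, 1); y = rand(D, 1);                          % example data
R1 = randn(D, k); R2 = randn(D, k); R3 = randn(D, k);    % i.i.d. N(0,1) entries
u1 = R1' * x;       u2 = R2' * (x.^2);   u3 = R3' * (x.^3);
v1 = R3' * y;       v2 = R2' * (y.^2);   v3 = R1' * (y.^3);
d4_3p = sum(x.^4) + sum(y.^4) + (6*(u2'*v2) - 4*(u3'*v1) - 4*(u1'*v3)) / k;
d4    = sum(abs(x - y).^4);                              % exact value, for comparison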
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 280
$$ d_{(4),3p} = \sum_{i=1}^{D} x_i^4 + \sum_{i=1}^{D} y_i^4 + \frac{1}{k}\left( 6\, u_2^T v_2 - 4\, u_3^T v_1 - 4\, u_1^T v_3 \right) $$
The variance is basically the addition of the three variances from normal random projections (recall the formula $\frac{m_1 m_2 + a^2}{k}$):
$$ \mathrm{Var}\left(d_{(4),3p}\right) = \frac{36}{k}\left[ \sum_{i=1}^{D} x_i^4 \sum_{i=1}^{D} y_i^4 + \left( \sum_{i=1}^{D} x_i^2 y_i^2 \right)^2 \right] + \frac{16}{k}\left[ \sum_{i=1}^{D} x_i^6 \sum_{i=1}^{D} y_i^2 + \left( \sum_{i=1}^{D} x_i^3 y_i \right)^2 \right] + \frac{16}{k}\left[ \sum_{i=1}^{D} x_i^2 \sum_{i=1}^{D} y_i^6 + \left( \sum_{i=1}^{D} x_i y_i^3 \right)^2 \right]. $$
Note that the highest-order term is only 6th, not 8th.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 281
Efficient Algorithm for Estimating l4 Distance (One Projection)
Apply only one projection matrix R ∈ R^{D×k} with i.i.d. entries r_{ij} ∼ N(0, 1) to the vectors x, y ∈ R^D, to generate six vectors:
$$ u_{1,j} = \sum_{i=1}^{D} x_i\, r_{ij}, \quad u_{2,j} = \sum_{i=1}^{D} x_i^2\, r_{ij}, \quad u_{3,j} = \sum_{i=1}^{D} x_i^3\, r_{ij}, $$
$$ v_{1,j} = \sum_{i=1}^{D} y_i\, r_{ij}, \quad v_{2,j} = \sum_{i=1}^{D} y_i^2\, r_{ij}, \quad v_{3,j} = \sum_{i=1}^{D} y_i^3\, r_{ij}. $$
A simple unbiased estimator of $d_{(4)} = \sum_{i=1}^{D} |x_i - y_i|^4$ is
$$ d_{(4),1p} = \sum_{i=1}^{D} x_i^4 + \sum_{i=1}^{D} y_i^4 + \frac{1}{k}\left( 6\, u_2^T v_2 - 4\, u_3^T v_1 - 4\, u_1^T v_3 \right). $$
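The corresponding Matlab sketch for d(4),1p simply reuses a single projection matrix for all six vectors:

% Sketch: one-projection estimator of the l4 distance.
D = 10000; k = 200;
x = rand(D, 1); y = rand(D, 1);     % example data
R  = randn(D, k);                   % a single matrix of i.i.d. N(0,1) entries
u1 = R' * x;  u2 = R' * (x.^2);  u3 = R' * (x.^3);
v1 = R' * y;  v2 = R' * (y.^2);  v3 = R' * (y.^3);
d4_1p = sum(x.^4) + sum(y.^4) + (6*(u2'*v2) - 4*(u3'*v1) - 4*(u1'*v3)) / k;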
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 282
The variance computations become more difficult because of the correlations.
$$ \mathrm{Var}\left(d_{(4),1p}\right) = \mathrm{Var}\left(d_{(4),3p}\right) + \Delta_{1p} $$
$$ = \frac{36}{k}\left[ \sum_{i=1}^{D} x_i^4 \sum_{i=1}^{D} y_i^4 + \left( \sum_{i=1}^{D} x_i^2 y_i^2 \right)^2 \right] + \frac{16}{k}\left[ \sum_{i=1}^{D} x_i^6 \sum_{i=1}^{D} y_i^2 + \left( \sum_{i=1}^{D} x_i^3 y_i \right)^2 \right] + \frac{16}{k}\left[ \sum_{i=1}^{D} x_i^2 \sum_{i=1}^{D} y_i^6 + \left( \sum_{i=1}^{D} x_i y_i^3 \right)^2 \right] + \Delta_{1p}, $$
where
$$ \Delta_{1p} = -\frac{48}{k}\left( \sum_{i=1}^{D} x_i^5 \sum_{i=1}^{D} y_i^3 + \sum_{i=1}^{D} x_i^2 y_i \sum_{i=1}^{D} x_i^3 y_i^2 \right) - \frac{48}{k}\left( \sum_{i=1}^{D} x_i^3 \sum_{i=1}^{D} y_i^5 + \sum_{i=1}^{D} x_i y_i^2 \sum_{i=1}^{D} x_i^2 y_i^3 \right) + \frac{32}{k}\left( \sum_{i=1}^{D} x_i^4 \sum_{i=1}^{D} y_i^4 + \sum_{i=1}^{D} x_i y_i \sum_{i=1}^{D} x_i^3 y_i^3 \right). $$
When the data are positive (often the case in practice), it is usually true that
∆1p < 0, i.e., correlation helps.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 283
Improving Estimates Using Marginal Information
Neither d(4),3p or d(4),1p perform well when the data are highly correlated, i.e.,
xi ≈ yi Recall that for normal random projections, we can use marginal
information and MLE to improve the estimates.
For simplicity, we assume that we use three projection matrices. In this case, we
can estimate d(4) by d(4),3p,m, where
d(4),3p,m =D∑
i=1
x4i +
D∑
i=1
y4i + 6a2,2 − 4a3,1 − 4a1,3,
and a2,2, a3,1, a1,3, are respectively, the solutions to the following three cubic
equations:
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 284
$$ a_{2,2}^3 - \frac{a_{2,2}^2}{k} u_2^T v_2 - \frac{1}{k} \sum_{i=1}^{D} x_i^4 \sum_{i=1}^{D} y_i^4 \; u_2^T v_2 - a_{2,2} \left( \sum_{i=1}^{D} x_i^4 \sum_{i=1}^{D} y_i^4 \right) + \frac{a_{2,2}}{k} \left( \sum_{i=1}^{D} x_i^4 \|v_2\|^2 + \sum_{i=1}^{D} y_i^4 \|u_2\|^2 \right) = 0. $$
$$ a_{3,1}^3 - \frac{a_{3,1}^2}{k} u_3^T v_1 - \frac{1}{k} \sum_{i=1}^{D} x_i^6 \sum_{i=1}^{D} y_i^2 \; u_3^T v_1 - a_{3,1} \left( \sum_{i=1}^{D} x_i^6 \sum_{i=1}^{D} y_i^2 \right) + \frac{a_{3,1}}{k} \left( \sum_{i=1}^{D} x_i^6 \|v_1\|^2 + \sum_{i=1}^{D} y_i^2 \|u_3\|^2 \right) = 0. $$
$$ a_{1,3}^3 - \frac{a_{1,3}^2}{k} u_1^T v_3 - \frac{1}{k} \sum_{i=1}^{D} x_i^2 \sum_{i=1}^{D} y_i^6 \; u_1^T v_3 - a_{1,3} \left( \sum_{i=1}^{D} x_i^2 \sum_{i=1}^{D} y_i^6 \right) + \frac{a_{1,3}}{k} \left( \sum_{i=1}^{D} x_i^2 \|v_3\|^2 + \sum_{i=1}^{D} y_i^6 \|u_1\|^2 \right) = 0. $$
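Each cubic can be solved numerically, e.g., with Matlab's roots(). The sketch below handles the first equation, continuing the three-projection sketch above (x, y, u2, v2, k as defined there); picking the real root closest to the plain inner-product estimate is an illustrative heuristic, not necessarily the selection rule intended in the lecture.

% Sketch: solve the cubic for a_{2,2} (coefficients read off the equation above).
m1 = sum(x.^4);  m2 = sum(y.^4);                       % exactly computable margins
c  = [1, -(u2'*v2)/k, ...
      -m1*m2 + (m1*norm(v2)^2 + m2*norm(u2)^2)/k, ...
      -m1*m2*(u2'*v2)/k];
r  = roots(c);                                         % three candidate roots
r  = r(abs(imag(r)) < 1e-8);                           % keep the real ones
[~, j] = min(abs(real(r) - (u2'*v2)/k));               % heuristic root selection
a22 = real(r(j));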
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 285
Asymptotically (as k → ∞), the variance would be
$$ \mathrm{Var}\left(d_{(4),3p,m}\right) = 36\,\mathrm{Var}(a_{2,2}) + 16\,\mathrm{Var}(a_{3,1}) + 16\,\mathrm{Var}(a_{1,3}) $$
$$ = \frac{36}{k} \frac{\left( \sum_{i=1}^{D} x_i^4 \sum_{i=1}^{D} y_i^4 - \left( \sum_{i=1}^{D} x_i^2 y_i^2 \right)^2 \right)^2}{\sum_{i=1}^{D} x_i^4 \sum_{i=1}^{D} y_i^4 + \left( \sum_{i=1}^{D} x_i^2 y_i^2 \right)^2} + \frac{16}{k} \frac{\left( \sum_{i=1}^{D} x_i^6 \sum_{i=1}^{D} y_i^2 - \left( \sum_{i=1}^{D} x_i^3 y_i \right)^2 \right)^2}{\sum_{i=1}^{D} x_i^6 \sum_{i=1}^{D} y_i^2 + \left( \sum_{i=1}^{D} x_i^3 y_i \right)^2} + \frac{16}{k} \frac{\left( \sum_{i=1}^{D} x_i^2 \sum_{i=1}^{D} y_i^6 - \left( \sum_{i=1}^{D} x_i y_i^3 \right)^2 \right)^2}{\sum_{i=1}^{D} x_i^2 \sum_{i=1}^{D} y_i^6 + \left( \sum_{i=1}^{D} x_i y_i^3 \right)^2} + O\!\left( \frac{1}{k^2} \right). $$
But when x_i = y_i, we do not obtain zero variance. This is disappointing. Even more disappointingly, d(4),1p,m (i.e., one projection and using margins) does not help here either and will not achieve zero variance when x_i = y_i.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 286
A Good Estimator for Highly Correlated Data (i.e., xi ≈ yi)
Instead of using the exact values, we can also estimate $\sum_{i=1}^{D} x_i^4$ and $\sum_{i=1}^{D} y_i^4$. This is counter-intuitive, but it actually works well when x_i ≈ y_i: a nice example of utilizing noise cancellations.
$$ d_{(4),1p,I} = \frac{1}{k}\left( u_2^T u_2 + v_2^T v_2 + 6\, u_2^T v_2 - 4\, u_3^T v_1 - 4\, u_1^T v_3 \right) $$
One can see that when x = y (i.e., u = v), d(4),1p,I = 0 always. The variance analysis, though, requires good patience, because there are many cross-terms.
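In Matlab, reusing the six one-projection vectors from the earlier sketch, the estimator is a one-liner: the only change from d(4),1p is that the exact margins Σ x_i^4 and Σ y_i^4 are replaced by their projected estimates u2'*u2/k and v2'*v2/k.

% Sketch: the estimator d_(4),1p,I (u1..u3, v1..v3, k as in the one-projection sketch).
d4_1p_I = ((u2'*u2) + (v2'*v2) + 6*(u2'*v2) - 4*(u3'*v1) - 4*(u1'*v3)) / k;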
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 287
$$ \mathrm{Var}\left(d_{(4),1p,I}\right) = \mathrm{Var}\left(d_{(4),1p}\right) + \Delta_I, $$
$$ \Delta_I = \frac{36}{k}\left( \sum_{i=1}^{D} x_i^2 y_i^2 \right)^2 + \frac{34}{k}\left[ \left( \sum_{i=1}^{D} x_i^4 \right)^2 + \left( \sum_{i=1}^{D} y_i^4 \right)^2 \right] + \frac{32}{k} \sum_{i=1}^{D} x_i y_i^3 \sum_{i=1}^{D} x_i^3 y_i - \frac{32}{k} \sum_{i=1}^{D} x_i^4 \sum_{i=1}^{D} x_i^3 y_i $$
$$ - \frac{72}{k}\left[ \sum_{i=1}^{D} x_i^4 \sum_{i=1}^{D} x_i^2 y_i^2 + \sum_{i=1}^{D} y_i^4 \sum_{i=1}^{D} x_i^2 y_i^2 \right] - \frac{32}{k}\left[ \sum_{i=1}^{D} y_i^4 \sum_{i=1}^{D} x_i^3 y_i + \sum_{i=1}^{D} x_i^4 \sum_{i=1}^{D} x_i y_i^3 + \sum_{i=1}^{D} y_i^4 \sum_{i=1}^{D} x_i y_i^3 \right] $$
$$ - \frac{48}{k}\left[ \sum_{i=1}^{D} x_i^3 \sum_{i=1}^{D} x_i^5 + \sum_{i=1}^{D} x_i^2 y_i \sum_{i=1}^{D} x_i^2 y_i^3 \right] - \frac{48}{k}\left[ \sum_{i=1}^{D} y_i^3 \sum_{i=1}^{D} y_i^5 + \sum_{i=1}^{D} x_i y_i^2 \sum_{i=1}^{D} x_i^3 y_i^2 \right] $$
$$ + \frac{48}{k}\left[ \sum_{i=1}^{D} x_i^3 \sum_{i=1}^{D} x_i^3 y_i^2 + \sum_{i=1}^{D} x_i^5 \sum_{i=1}^{D} x_i y_i^2 + \sum_{i=1}^{D} y_i^3 \sum_{i=1}^{D} x_i^2 y_i^3 \right] + \frac{48}{k}\left[ \sum_{i=1}^{D} y_i^5 \sum_{i=1}^{D} x_i^2 y_i + \sum_{i=1}^{D} x_i^5 \sum_{i=1}^{D} x_i^2 y_i + \sum_{i=1}^{D} y_i^3 \sum_{i=1}^{D} x_i^3 y_i^2 \right] $$
$$ + \frac{48}{k}\left[ \sum_{i=1}^{D} x_i^3 \sum_{i=1}^{D} x_i^2 y_i^3 + \sum_{i=1}^{D} y_i^5 \sum_{i=1}^{D} x_i y_i^2 \right] - \frac{32}{k}\left[ \sum_{i=1}^{D} x_i^6 \sum_{i=1}^{D} x_i y_i + \sum_{i=1}^{D} y_i^2 \sum_{i=1}^{D} x_i^3 y_i^3 \right] - \frac{32}{k}\left[ \sum_{i=1}^{D} x_i^2 \sum_{i=1}^{D} x_i^3 y_i^3 + \sum_{i=1}^{D} y_i^6 \sum_{i=1}^{D} x_i y_i \right] $$
$$ + \frac{16}{k}\left[ \sum_{i=1}^{D} x_i^2 \sum_{i=1}^{D} x_i^6 + \sum_{i=1}^{D} y_i^2 \sum_{i=1}^{D} y_i^6 + 2 \sum_{i=1}^{D} x_i y_i \sum_{i=1}^{D} x_i^3 y_i^3 \right]. $$
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 288
Comparisons with Random Sampling and CRS
[Figure: normalized MSE vs. k for the word pairs KONG–HONG (top) and OF–AND (bottom). Left panels: Random Sampling vs. CRS (Empirical and Theoretic). Right panels: the one-projection estimators 1p, 1p,m, 1p,I, and Theoretic.]
• Random sampling has very large errors. CRS helps significantly.
• Even CRS is substantially less accurate than our random projection method
for this task. Note that lower MSE is better on the figures.
• For these two cases, our d(4),1p,I is substantially better than our d(4),1p
estimator. In fact, for HONG-KONG, random sampling is even better than
d(4),1p. This is because OF-AND and HONG-KONG are two highly
correlated pairs, especially HONG-KONG.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 289
Nearest Neighbor Classification
We applied random projections to estimate the l4 distances for m-nearest
neighbor classification. We compared the average classification error rates with the errors obtained using the original l4 distances (horizontal lines).
[Figure: mean error rate (%) vs. k on Gisette (m = 1, 10, 20) and Realsim (m = 1, 10, 20); curves: No margin, Margin, and Original (horizontal line).]
With the number of projections k > 500, using projected data achieved similar
(in some cases even better) accuracies compared to using the original data.
BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 290
Conclusion
• (Symmetric) Stable Random Projection (SRP) is a very effective technique for
efficiently computing the lp distances with 0 < p ≤ 2.
• Depending on the dataset, the optimal p value varies. Interestingly, the optimal p is often greater than 2, a regime in which the distance is difficult to approximate.
• Estimating the lp distances for 0 < p ≤ 2 is now (almost) a standard technique.
• There is a recent method to efficiently compute the lp distances for p = 4, 6, 8, .... The algorithm is very simple, but the analysis is complicated.
• Main open problems: (i) to develop methods for p = 3, 5, 7, ..., especially p = 3; (ii) to extend the methods to data streams.