

BTRY6520/STSCI6520: Fall 2012

Computationally Intensive Statistical Methods

Instructor: Ping Li

Department of Statistical Science

Cornell University


General Information

• Lectures : Mon, Wed 2:55pm - 4:10pm, Hollister Hall 362

• Instructor : Ping Li, [email protected], Comstock Hall 1192

Office Hours: After class, Computing Labs, and by appointment.

• TA: No TA for this course

• Textbook : No textbook for this course

• Homework

– About 5-7 homework assignments.

– No late homework will be accepted.

– Before computing your overall homework grade, the assignment with the lowest grade (if ≥ 40%) will be dropped.

– It is the students’ responsibility to keep copies of the submitted homework.


• Course grading :

1. Homework: 80%

2. Class Participation: 20%

• Computing: All the homework assignments will be programmed in Matlab. Students who register for the class should be willing to learn Matlab. The programming assignments will be graded on correctness, efficiency (to an extent), and demos; questions will be asked during the (one-to-one) demos.

• Labs: In addition to regular lectures, several computing labs will be provided, usually in the evenings. Note that, as there is no TA, this is additional time the instructor will offer to help students. Due to conflicts with several conferences, a small number of regular lectures will be canceled.


Course Material

• Matlab programming.

• Basic numerical optimization.

• Contingency table estimation.

• Linear regression and logistic regression.

• Clustering algorithms.

• Random projection algorithms and applications.

• Hashing algorithms and applications.

• Other topics on modern statistical computing, if time permits.


Prerequisite

This is a first-year Statistics Ph.D. course. Students are expected to be well-prepared: probability theory, mathematical statistics, some programming experience, basic numerical optimization, etc.

Nevertheless, the instructor is happy to accommodate motivated graduate students who are willing to quickly learn the prerequisite material. The instructor will often review relevant material, at a fairly fast pace.


Contingency Table Estimations

Original Contingency Table          Sample Contingency Table

    N11   N12                           n11   n12
    N21   N22                           n21   n22

Suppose we only observe the sample contingency table, how can we estimate the original table, if N = N11 + N12 + N21 + N22 is known?

(Almost) equivalently, how can we estimate πij = Nij/N?


An Example of Contingency Table

The task is to estimate how many times two words (e.g., Cornell and University)

co-occur in all the Web pages (over 10 billion).

                 Word 2      No Word 2
  Word 1          N11           N12
  No Word 1       N21           N22

N11: number of documents containing both word 1 and word 2.

N22: number of documents containing neither word 1 nor word 2.


Google Pagehits

Google tells the user the number of Web pages containing the input query word(s).

Pagehits by typing the following queries in Google (numbers can change):

• a : 25,270,000,000 pages (a surrogate for N , the total # of pages).

• Cornell : 99,600,000 pages. (N11 + N12)

• University : 2,700,000,000 pages. (N11 + N21)

• Cornell University : 31,800,000 pages. (N11)

                 Word 2      No Word 2
  Word 1          N11           N12
  No Word 1       N21           N22

How much do we believe these numbers?


Suppose there are in total n = 10^7 word items.

It is easy to store 10^7 numbers (how many documents each word occurs in), but it would be difficult to store a matrix of 10^7 × 10^7 numbers (how many documents a pair of words co-occur in).

Even if storing 10^7 × 10^7 is not a problem (it is Google), it is much more difficult to store 10^7 × 10^7 × 10^7 numbers, for 3-way co-occurrences (e.g., Cornell, University, Statistics).

Even if we can store 3-way or 4-way co-occurrences, most of the items will be so rare that they will almost never be used.

Therefore, it is realistic to believe that the counts for individual words are exact, but the numbers of co-occurrences may be estimated, e.g., from some samples.


Estimating Contingency Tables by MLE of Multinomial Sampling

Original Contingency Table          Sample Contingency Table

    π11   π12                           n11   n12
    π21   π22                           n21   n22

Observations: (n11, n12, n21, n22), n = n11 + n12 + n21 + n22.

Parameters (π11, π12, π21, π22), (π11 + π12 + π21 + π22 = 1)


The likelihood

n!/(n11! n12! n21! n22!) × π11^n11 π12^n12 π21^n21 π22^n22

The log likelihood

l = log [n!/(n11! n12! n21! n22!)] + n11 log π11 + n12 log π12 + n21 log π21 + n22 log π22

We can choose to write π22 = 1 − π11 − π12 − π21.

Finding the maximum (setting first derivatives to zero):

∂l/∂π11 = n11/π11 − n22/(1 − π11 − π12 − π21) = 0,

=⇒ n11/π11 = n22/π22, or written as π11/π22 = n11/n22


Similarly,

n11/π11 = n12/π12 = n21/π21 = n22/π22.

Therefore

π11 = n11 λ, π12 = n12 λ, π21 = n21 λ, π22 = n22 λ.

Summing up all the terms,

1 = π11 + π12 + π21 + π22 = [n11 + n12 + n21 + n22] λ = nλ

yields λ = 1/n.

The MLE solution is

π11 = n11/n, π12 = n12/n, π21 = n21/n, π22 = n22/n.
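A minimal Matlab sketch of this closed-form estimate (the sample counts and the total N used below are hypothetical):

  % MLE of the cell probabilities under multinomial sampling.
  n = [30 5; 10 10];        % sample contingency table [n11 n12; n21 n22] (hypothetical)
  pihat = n / sum(n(:));    % MLE: pihat_ij = n_ij / n
  Nhat  = 600 * pihat;      % estimated original table if the total N (here 600) is known
  disp(pihat); disp(Nhat);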


Finding the MLE Solution by Lagrange Multiplier

MLE as a constrained optimization:

argmax over (π11, π12, π21, π22) of  n11 log π11 + n12 log π12 + n21 log π21 + n22 log π22
subject to: π11 + π12 + π21 + π22 = 1

The unconstrained optimization problem:

argmax over (π11, π12, π21, π22) of
L = n11 log π11 + n12 log π12 + n21 log π21 + n22 log π22 − λ (π11 + π12 + π21 + π22 − 1)

Finding the optimum: ∂L/∂z = 0 for z ∈ {π11, π12, π21, π22, λ}:

n11/π11 − λ = 0,  n12/π12 = n21/π21 = n22/π22 = λ,  π11 + π12 + π21 + π22 = 1


Maximum Likelihood Estimation (MLE)

Observations xi, i = 1 to n, are i.i.d. samples from a distribution with probability

density function fX (x; θ1, θ2, ..., θk),

where θj , j = 1 to k, are parameters to be estimated.

The maximum likelihood estimator seeks the θ that maximizes the joint likelihood

θ̂ = argmax_θ ∏_{i=1}^n fX(xi; θ)

or, equivalently, the log joint likelihood

θ̂ = argmax_θ ∑_{i=1}^n log fX(xi; θ)

This is a convex optimization problem if fX is concave or −log-convex (i.e., − log fX is convex).


An Example: Normal Distribution

If X ∼ N(µ, σ²), then fX(x; µ, σ²) = 1/(√(2π) σ) · e^{−(x−µ)²/(2σ²)}.

Fix σ² = 1 and x = 0, and view fX(x; µ, σ²) and log fX(x; µ, σ²) as functions of µ.

[Two plots over µ ∈ [−2, 2]: left, fX(x; µ, σ²); right, log fX(x; µ, σ²).]

It is not concave (as a function of µ), but it is −log-convex, i.e., a unique MLE solution exists.


Another Example of Exact MLE Solution

Given n i.i.d. samples, xi ∼ N(µ, σ²), i = 1 to n.

l(x1, x2, ..., xn; µ, σ²) = ∑_{i=1}^n log fX(xi; µ, σ²)
= −1/(2σ²) ∑_{i=1}^n (xi − µ)² − (n/2) log(2πσ²)

∂l/∂µ = 1/(2σ²) · 2 ∑_{i=1}^n (xi − µ) = 0 =⇒ µ̂ = (1/n) ∑_{i=1}^n xi

∂l/∂σ² = 1/(2σ⁴) ∑_{i=1}^n (xi − µ)² − n/(2σ²) = 0 =⇒ σ̂² = (1/n) ∑_{i=1}^n (xi − µ̂)².
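These closed-form estimates are easy to check numerically; a minimal Matlab sketch with simulated data (the true µ and σ² below are hypothetical):

  % Verify the closed-form MLE for a normal sample.
  rng(0);                               % reproducible simulation (hypothetical seed)
  x = 2 + 3*randn(1000, 1);             % simulated data with mu = 2, sigma^2 = 9
  mu_hat     = mean(x);                 % MLE of mu: the sample mean
  sigma2_hat = mean((x - mu_hat).^2);   % MLE of sigma^2 (divides by n, not n-1)
  fprintf('mu_hat = %.3f, sigma2_hat = %.3f\n', mu_hat, sigma2_hat);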


Convex Functions

A function f(x) is convex if the second derivative f ′′(x) ≥ 0.

[Plots: f(x) = x² on [−2, 2] and f(x) = x log(x) on (0, 2].]

f(x) = x² =⇒ f″(x) = 2 > 0, i.e., f(x) = x² is convex for all x.

f(x) = x log x =⇒ f″(x) = 1/x, i.e., f(x) = x log x is convex for x > 0.

Both are widely used in statistics and data mining as loss functions,
=⇒ computationally tractable algorithms: least squares, logistic regression.


Left panel: f(x) = 1/√(2π) · e^{−x²/2} is −log-convex: ∂²[− log f(x)]/∂x² = 1 > 0.

Right panel: a mixture of normals is not −log-convex:

f(x) = 1/√(2π) · e^{−x²/2} + 1/(√(2π)·10) · e^{−(x−10)²/200}

[Plots of − log f(x) for the two densities.]

The mixture of normals is an extremely useful model in statistics.

In general, only a local minimum can be obtained.


Steepest Descent

[Sketch: iterates x0, x1, x2, x3, x4 moving downhill along the curve y = f(x).]

Procedure:

Start with an initial guess x0.

Compute x1 = x0 −∆f ′(x0), where ∆ is the step size.

Continue the process xt+1 = xt −∆f ′(xt).

Until some criterion is met, e.g., f(xt+1) ≈ f(xt)

The meaning of “steepest” is more clear in the two-dimensional situation.


An Example of Steepest Descent: f(x) = x2

f(x) = x2. The minimum is attained at x = 0, f ′(x) = 2x.

The steepest descent iteration formula xt+1 = xt −∆f ′(xt) = xt − 2∆xt.

[Plot: xt vs. iteration for step sizes ∆ = 0.1, ∆ = 0.45, and ∆ = 0.9.]

Choosing the step size ∆ is important (even when f(x) is convex).

Too small ∆ =⇒ slow convergence, i.e., many iterations,

Too large ∆ =⇒ oscillations, i.e., also many iterations.
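A minimal Matlab sketch reproducing this iteration for a hypothetical starting point x0 = 10:

  % Steepest descent on f(x) = x^2, with f'(x) = 2x.
  x0 = 10;                              % starting point (hypothetical)
  for Delta = [0.1 0.45 0.9]            % step sizes from the plot above
      x = x0;
      for t = 1:20
          x = x - Delta * 2 * x;        % x_{t+1} = x_t - Delta * f'(x_t)
      end
      fprintf('Delta = %.2f: x after 20 iterations = %g\n', Delta, x);
  end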


Steepest Descent in Practice

Steepest descent is one of the most widely used techniques in the real world.

• It is extremely simple

• It only requires knowing the first derivative

• It is numerically stable (for above reasons)

• For real applications, it is often affordable to use very small ∆

• In machine learning, ∆ is often called learning rate

• It is used in Neural Nets and Gradient Boosting (MART)


Newton’s Method

Recall the goal is to find the x∗ to minimize f(x).

If f(x) is convex, it is equivalent to finding the x∗ such that f ′(x∗) = 0.

Let h(x) = f ′(x). Take Taylor expansion about the optimum solution x∗:

h(x∗) = h(x) + (x∗ − x)h′(x) + “negligible” higher order terms

Because h(x∗) = f ′(x∗) = 0, we know approximately

0 ≈ h(x) + (x∗ − x) h′(x) =⇒ x∗ ≈ x − h(x)/h′(x)


The procedure of Newton’s Method

Start with an initial guess x0

Update x1 = x0 − f′(x0)/f″(x0)

Repeat xt+1 = xt − f′(xt)/f″(xt)

Until some stopping criterion is reached, e.g., xt+1 ≈ xt.

An example: f(x) = (x − c)². f′(x) = 2(x − c), f″(x) = 2.

x1 = x0 − f′(x0)/f″(x0) = x0 − 2(x0 − c)/2 = c

But we already know that x = c minimizes f(x) = (x− c)2.

Newton’s method may find the minimum solution using only one step.


An Example of Newton’s Method: f(x) = x log x

f′(x) = log x + 1, f″(x) = 1/x.

xt+1 = xt − (log xt + 1)/(1/xt) = xt − xt (log xt + 1)

[Plot: xt vs. iteration for starting points x0 = 0.5, x0 = 10^−10, and x0 = 1 − 10^−10.]

When x0 is close to optimum solution, the convergence is very fast

When x0 is far from the optimum, the convergence is slow initially

When x0 is badly chosen, no convergence. This example requires 0 < x0 < 1.
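A minimal Matlab sketch of this Newton iteration (the minimizer of x log x is x = 1/e ≈ 0.3679):

  % Newton's method for f(x) = x*log(x): f'(x) = log(x) + 1, f''(x) = 1/x.
  x = 0.5;                               % starting point, must lie in (0, 1)
  for t = 1:10
      x = x - (log(x) + 1) / (1 / x);    % x_{t+1} = x_t - f'(x_t)/f''(x_t)
  end
  fprintf('Newton estimate: %.6f (true minimizer 1/e = %.6f)\n', x, exp(-1));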


Steepest Descent for f(x) = x log x

f ′(x) = log x + 1, xt+1 = xt −∆(log xt + 1)

[Plot: xt vs. iteration for (x0 = 0, ∆ = 0.1), (x0 = 10, ∆ = 0.1), and (x0 = 10, ∆ = 0.9).]

Regardless of x0, convergence is guaranteed if f(x) is convex.

May be oscillating if step size ∆ is too large

Convergence is slow near the optimum solution.


General Comments on Numerical Optimization

Numerical optimization is tricky, even for convex problems!

Multivariate optimization is much trickier!

Whenever possible, try to avoid intensive numerical optimization, perhaps even at the cost of losing some accuracy.

An example: Iterative Proportional Scaling (IPS) for contingency tables with known margins.


Contingency Table with Margin Constraints

Original Contingency Table          Sample Contingency Table

    N11   N12                           n11   n12
    N21   N22                           n21   n22

Margins: M1 = N11 + N12, M2 = N11 + N21.

Margins are much easier to count exactly than interactions.


An Example of Contingency Tables with Known Margins

Term-by-document matrix: n = 10^6 words and m = 10^10 (Web) documents.

Cell xij = 1 if word i appears in document j; xij = 0 otherwise.

[Figure: the n × m 0/1 term-by-document matrix, and the 2 × 2 co-occurrence table with rows Word 1 / No Word 1, columns Word 2 / No Word 2, and cells N11, N12, N21, N22.]

N11: number of documents containing both word 1 and word 2.

N22: number of documents containing neither word 1 nor word 2.

Margins (M1 = N11 + N12, M2 = N11 + N21) for all rows cost nm: easy!

Interactions (N11, N12, N21, N22) for all pairs cost n(n − 1)m/2: difficult!


To avoid storing all pairwise contingency tables (n(n − 1)/2 pairs in total), one strategy is to sample a fraction (k) of the columns of the original (term-doc) data matrix and build sample contingency tables on demand, from which one can estimate the original contingency tables.

However, we observe that the margins (the total number of ones in each row) can be easily counted. This naturally leads to the conjecture that one might (often considerably) improve the estimation accuracy by taking advantage of the known margins. The next question is how.

Two approaches:

1. Maximum likelihood estimator (MLE): accurate but fairly complicated.

2. Iterative proportional scaling (IPS): simple but usually not as accurate.


An Example of IPS for 2 by 2 Tables

Sample contingency table:

    n11   n12
    n21   n22

The steps of IPS

(1) Modify the counts to satisfy the row margins.

(2) Modify the counts to satisfy the column margins.

(3) Iterate until some stopping criterion is met.

An example: n11 = 30, n12 = 5, n21 = 10, n22 = 10, D = 600.

M1 = N11 + N12 = 400, M2 = N11 + N21 = 300.

In the first iteration: N11 ← [M1/(n11 + n12)] × n11 = (400/35) × 30 = 342.8571.


Iteration 1
  After row scaling:       342.8571    57.1429
                           100.0000   100.0000
  After column scaling:    232.2581   109.0909
                            67.7419   190.9091

Iteration 2
  After row scaling:       272.1649   127.8351
                            52.3810   147.6190
  After column scaling:    251.5807   139.2265
                            48.4193   160.7735

Iteration 3
  After row scaling:       257.4985   142.5015
                            46.2916   153.7084
  After column scaling:    254.2860   144.3248
                            45.7140   155.6752

Iteration 4
  After row scaling:       255.1722   144.8278
                            45.3987   154.6013
  After column scaling:    254.6875   145.1039
                            45.3125   154.8961

Iteration 5
  After row scaling:       254.8204   145.1796
                            45.2653   154.7347
  After column scaling:    254.7477   145.2211
                            45.2523   154.7789

Iteration 6
  After row scaling:       254.7676   145.2324
                            45.2453   154.7547
  After column scaling:    254.7567   145.2386
                            45.2433   154.7614


Error = |current-step counts − previous-step counts|, summed over the four cells.

[Plot: absolute error (log scale, 10^−12 to 10^2) vs. iteration (0 to 20).]

IPS converges fast, and it always converges.

But how good are the estimates? My general observation is that it is very good for 2 by 2 tables, and the accuracy decreases (compared to the MLE) as the table size increases.
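A minimal Matlab sketch of IPS for this 2 × 2 example (same counts and margins as above; the target row sums are (M1, D − M1) and the target column sums are (M2, D − M2)):

  % Iterative proportional scaling (IPS) for a 2 x 2 table with known margins.
  n  = [30 5; 10 10];              % sample contingency table
  D  = 600;  M1 = 400;  M2 = 300;  % total count and known margins
  rowM = [M1; D - M1];             % target row sums
  colM = [M2, D - M2];             % target column sums
  T = n;                           % start from the sample counts
  for iter = 1:20
      r = rowM ./ sum(T, 2);  T = T .* [r r];    % (1) rescale rows to the row margins
      c = colM ./ sum(T, 1);  T = T .* [c; c];   % (2) rescale columns to the column margins
  end
  disp(T)    % close to the Iteration 6 table above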


The MLE for 2 by 2 Table with Known Margins

Total samples : n = n11 + n12 + n21 + n22

Total original counts : N = N11 + N12 + N21 + N22, i.e., πij = Nij/N .

Sample Contingency Table            Original Contingency Table

    n11   n12                           N11   N12
    n21   n22                           N21   N22

Margins: M1 = N11 + N12, M2 = N11 + N21.

If margins M1 and M2 are known, then only need to estimate N11.


The likelihood

∝ (N11/N)^n11 (N12/N)^n12 (N21/N)^n21 (N22/N)^n22

The log likelihood

n11 log(N11/N) + n12 log(N12/N) + n21 log(N21/N) + n22 log(N22/N)
= n11 log(N11/N) + n12 log((M1 − N11)/N) + n21 log((M2 − N11)/N) + n22 log((N − M1 − M2 + N11)/N)

The MLE equation

n11/N11 − n12/(M1 − N11) − n21/(M2 − N11) + n22/(N − M1 − M2 + N11) = 0,

which is a cubic equation (in N11) and can be solved either analytically or numerically.
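A minimal Matlab sketch that solves this equation numerically with fzero, using the counts and margins from the earlier IPS example:

  % Numerically solve the margin-constrained MLE equation for N11.
  n11 = 30; n12 = 5; n21 = 10; n22 = 10;      % sample counts
  N = 600;  M1 = 400;  M2 = 300;              % total and known margins
  g = @(N11) n11./N11 - n12./(M1 - N11) - n21./(M2 - N11) ...
           + n22./(N - M1 - M2 + N11);
  % N11 must lie strictly inside (max(0, M1+M2-N), min(M1, M2)).
  lo = max(0, M1 + M2 - N) + 1e-6;
  hi = min(M1, M2) - 1e-6;
  N11_mle = fzero(g, [lo, hi]);
  fprintf('MLE of N11 with known margins: %.4f\n', N11_mle);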


Error Analysis

To assess the quality of an estimator θ̂ of θ, it is common to use bias, variance, and MSE (mean square error):

Bias: E(θ̂) − θ

Var: E(θ̂ − E(θ̂))² = E(θ̂²) − E²(θ̂)

MSE: E(θ̂ − θ)² = Var + Bias²

The last equality is known as the bias-variance trade-off. For unbiased estimators, it is desirable to have as small a variance as possible. As the sample size increases, the MLE (under certain conditions) becomes unbiased and achieves the smallest variance. Therefore, the MLE is often a desirable estimator. However, in some cases, biased estimators may achieve smaller MSE than the MLE.


The Expectations and Variances of Common Distributions

The derivations of variances are not required in this course. Nevertheless, it is

useful to know the expectations and variances of common distributions.

• Binomial: X ∼ Binomial(n, p), E(X) = np, Var(X) = np(1 − p).

• Normal: X ∼ N(µ, σ²), E(X) = µ, Var(X) = σ².

• Chi-square: X ∼ χ²(k), E(X) = k, Var(X) = 2k.

• Exponential: X ∼ exp(λ), E(X) = 1/λ, Var(X) = 1/λ².

• Poisson: X ∼ Pois(λ), E(X) = λ, Var(X) = λ.


Multinomial Distribution

The multinomial is a natural extension of the binomial distribution. For example, the 2 by 2 contingency table is often assumed to follow a multinomial distribution.

Consider c cells and denote the observations by (n1, n2, ..., nc), which follow a c-cell multinomial distribution with underlying probabilities (π1, π2, ..., πc) (with ∑_{i=1}^c πi = 1). Denote n = ∑_{i=1}^c ni. We write

(n1, n2, ..., nc) ∼ Multinomial(n, π1, π2, ..., πc)

The expectations are (for i = 1 to c and i ≠ j)

E(ni) = nπi,  Var(ni) = nπi(1 − πi),  Cov(ni, nj) = −nπiπj.

Note that the cells are negatively correlated.


Variances of the 2 by 2 Contingency Table Estimates

Using the previous notation, the MLE estimator of N11 is

N̂11 = (n11/n) N,   (n11, n12, n21, n22) ∼ Multinomial(n, π11, π12, π21, π22)

Using the general equalities about expectations

E(aX) = aE(X),  Var(aX) = a²Var(X),

we know

E(N̂11) = (N/n) E(n11) = (N/n) nπ11 = Nπ11 = N11

Var(N̂11) = (N²/n²) Var(n11) = (N²/n²) nπ11(1 − π11) = (N²/n) π11(1 − π11)


The Asymptotic Variance of the MLE Using Margins

When the margins are known: M1 = N11 + N12, M2 = N11 + N21.

The MLE equation

n11/N11 − n12/(M1 − N11) − n21/(M2 − N11) + n22/(N − M1 − M2 + N11) = 0.

The asymptotic variance of the solution, denoted by N̂11,M, can be shown to be

Var(N̂11,M) = (N/n) · 1 / (1/N11 + 1/N12 + 1/N21 + 1/N22),

which is smaller than the variance of the MLE without using the margins.

What about the variance of IPS? There is no closed-form answer, and the estimates are usually biased.


Logistic Regression

Logistic regression is one of the most widely used statistical tools for predicting categorical outcomes.

General setup for binary logistic regression

n observations: (xi, yi), i = 1 to n; xi can be a vector.

yi ∈ {0, 1}. For example, "1" = "YES" and "0" = "NO".

Define

p(xi) = Pr(yi = 1|xi) = π(xi),

i.e., Pr(yi = 0|xi) = 1 − p(xi).


The major assumption of logistic regression

log [p(xi)/(1 − p(xi))] = β0 + β1 xi,1 + ... + βp xi,p = ∑_{j=0}^p βj xi,j.

Here, we treat xi,0 = 1. We can also use vector notation and write

log [p(xi; β)/(1 − p(xi; β))] = xi β.

Here, we view xi as a row vector and β as a column vector.


The model in vector notation

p(xi; β) = e^{xiβ}/(1 + e^{xiβ}),   1 − p(xi; β) = 1/(1 + e^{xiβ}).

Log likelihood for the ith observation:

li(β|xi) = (1 − yi) log[1 − p(xi; β)] + yi log p(xi; β)
         = log p(xi; β)          if yi = 1
         = log[1 − p(xi; β)]     if yi = 0

To understand this, consider a binomial with only one sample, Binomial(1, p(xi)) (i.e., Bernoulli). When yi = 1, the log likelihood is log p(xi), and when yi = 0, the log likelihood is log(1 − p(xi)). These two formulas can be written as one.


Joint log likelihood for n observations:

l(β|x1, ..., xn) = ∑_{i=1}^n li(β|xi)
= ∑_{i=1}^n (1 − yi) log[1 − p(xi; β)] + yi log p(xi; β)
= ∑_{i=1}^n yi log [p(xi; β)/(1 − p(xi; β))] + log[1 − p(xi; β)]
= ∑_{i=1}^n yi xi β − log(1 + e^{xiβ})

The remaining task is to solve this optimization problem for the MLE.


Logistic Regression with Only One Variable

Basic assumption

logit(π(xi)) = log [p(xi; β)/(1 − p(xi; β))] = β0 + β1 xi

Joint log likelihood

l(β|x1, ..., xn) = ∑_{i=1}^n [ yi (β0 + β1 xi) − log(1 + e^{β0 + β1 xi}) ]

Next, we solve the optimization problem of maximizing the joint likelihood, given the data.


First derivatives

∂l(β)/∂β0 = ∑_{i=1}^n (yi − p(xi)),   ∂l(β)/∂β1 = ∑_{i=1}^n xi (yi − p(xi))

Second derivatives

∂²l(β)/∂β0² = −∑_{i=1}^n p(xi)(1 − p(xi)),

∂²l(β)/∂β1² = −∑_{i=1}^n xi² p(xi)(1 − p(xi)),

∂²l(β)/∂β0∂β1 = −∑_{i=1}^n xi p(xi)(1 − p(xi))

Solve the MLE by Newton's Method or steepest descent (a two-dimensional problem).


Logistic Regression without Intercept (β0 = 0)

The simplified model

logit(π(xi)) = log [p(xi)/(1 − p(xi))] = β xi

Equivalently,

p(xi) = e^{βxi}/(1 + e^{βxi}) = π(xi),   1 − p(xi) = 1/(1 + e^{βxi})

Joint log likelihood for n observations:

l(β|x1, ..., xn) = ∑_{i=1}^n [ xi yi β − log(1 + e^{βxi}) ]


First derivative

l′(β) = ∑_{i=1}^n xi (yi − p(xi))

Second derivative

l″(β) = −∑_{i=1}^n xi² p(xi)(1 − p(xi))

Newton's Method updating formula

βt+1 = βt − l′(βt)/l″(βt)

Steepest descent (in fact ascent) updating formula

βt+1 = βt + ∆ l′(βt)


A Numerical Example of Logistic Regression

Data

x = {8, 14, −7, 6, 5, 6, −5, 1, 0, −17}
y = {1, 1, 0, 0, 1, 0, 1, 0, 0, 0}

Log likelihood function

[Plot: l(β) for β ∈ [−1, 1].]
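A minimal Matlab sketch fitting the no-intercept model to this data by Newton's method and by steepest ascent (the starting point and the step size are hypothetical choices):

  % Logistic regression without intercept on the example data.
  x = [8 14 -7 6 5 6 -5 1 0 -17]';
  y = [1 1  0 0 1 0  1 0 0   0]';

  % Newton's method: beta <- beta - l'(beta)/l''(beta)
  beta = 0.2;                                  % starting point (hypothetical)
  for t = 1:20
      p  = 1 ./ (1 + exp(-beta * x));          % p(x_i)
      g  = sum(x .* (y - p));                  % l'(beta)
      H  = -sum(x.^2 .* p .* (1 - p));         % l''(beta)
      beta = beta - g / H;
  end
  fprintf('Newton estimate of beta: %.4f\n', beta);

  % Steepest ascent: beta <- beta + Delta * l'(beta)
  beta = 0.2;  Delta = 0.015;                  % hypothetical step size
  for t = 1:200
      p = 1 ./ (1 + exp(-beta * x));
      beta = beta + Delta * sum(x .* (y - p));
  end
  fprintf('Steepest ascent estimate of beta: %.4f\n', beta);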


[Two plots: βt vs. iteration for Newton's Method and steepest descent; left: ∆ = 0.015, β0 = 0.2; right: ∆ = 0.02, β0 = 0.32.]

Steepest descent is quite sensitive to the step size ∆.

Too large ∆ leads to oscillation.


[Two plots: βt vs. iteration for Newton's Method and steepest descent; left: ∆ = 0.02, β0 = 0.33; right: ∆ = 0.02, β0 = 10.]

Newton’s Method is sensitive to the starting point β0. May not converge at all.

The starting point (mostly) only affects computing time of steepest descent.

——————

In general, with multiple variables, we need to use the matrix formulation, which in

fact is easier to implement in matlab.


Newon’s Method for Logistic Regression with β0 and β1

Analogous to the one variable case, the Newton’s update formula is

βnew = βold −[

(

∂2l(β)

∂β∂βT

)−1∂l(β)

∂β

]

βold

where β =

β0

β1

,

∂l(β)

∂β=

[

∑n

i=1 yi − p(xi)∑n

i=1 xi (yi − p(xi))

]

=

1 x1

1 x2

...

1 xn

T

y1 − p(x1)

y2 − p(x2)

...

yn − p(xn)

= XT (y − p)


∂²l(β)/∂β∂β^T = [ −∑_{i=1}^n p(xi)(1 − p(xi))       −∑_{i=1}^n xi p(xi)(1 − p(xi)) ;
                  −∑_{i=1}^n xi p(xi)(1 − p(xi))    −∑_{i=1}^n xi² p(xi)(1 − p(xi)) ]
               = −X^T W X

W = diag( p(x1)(1 − p(x1)), p(x2)(1 − p(x2)), ..., p(xn)(1 − p(xn)) )


Multivariate Logistic Regression Solution in Matrix Form

Newton's update formula

β_new = β_old − [ (∂²l(β)/∂β∂β^T)^{−1} ∂l(β)/∂β ]|_{β_old}

where, in matrix form,

∂l(β)/∂β = ∑_{i=1}^n xi^T (yi − p(xi; β)) = X^T (y − p)

∂²l(β)/∂β∂β^T = −∑_{i=1}^n xi^T xi p(xi; β)(1 − p(xi; β)) = −X^T W X

We can write the update formula in matrix form:

β_new = [X^T W X]^{−1} X^T W z,   z = X β_old + W^{−1}(y − p)


X = [ 1  x1,1  x1,2  ...  x1,p
      1  x2,1  x2,2  ...  x2,p
      ...
      1  xn,1  xn,2  ...  xn,p ]  ∈ R^{n×(p+1)}

W = diag( p1(1 − p1), p2(1 − p2), ..., pn(1 − pn) )  ∈ R^{n×n}

where pi = p(xi; β_old).


Derivation

β_new = β_old − [ (∂²l(β)/∂β∂β^T)^{−1} ∂l(β)/∂β ]|_{β_old}
      = β_old + [X^T W X]^{−1} X^T (y − p)
      = [X^T W X]^{−1} [X^T W X] β_old + [X^T W X]^{−1} X^T (y − p)
      = [X^T W X]^{−1} X^T W ( X β_old + W^{−1}(y − p) )
      = [X^T W X]^{−1} X^T W z

Note that [X^T W X]^{−1} X^T W z looks a lot like (weighted) least squares.

Two major practical issues:

• The inverse may not (usually does not) exist, especially with large datasets.

• Newton update steps may be too aggressive and lead to divergence.


Fitting Logistic Regression with a Learning Rate

At time t, update the coefficient vector:

βt = β(t−1) + ν [X^T W X]^{−1} X^T (y − p) |_{t−1}

where

W = diag[ pi(1 − pi) ], i = 1 to n.

The magic parameter ν can be viewed as the learning rate; it helps make sure that the procedure converges. In practice, it is often set to ν = 0.1.
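A minimal Matlab sketch of this damped-Newton (IRLS-style) update; X (n × (p+1), with a leading column of ones), y (n × 1, labels in {0, 1}), and the number of iterations are assumed to be set up beforehand:

  % Logistic regression fit with learning rate nu (damped Newton / IRLS).
  nu   = 0.1;                                   % learning rate
  beta = zeros(size(X, 2), 1);                  % initial coefficients
  for t = 1:100
      p    = 1 ./ (1 + exp(-X * beta));         % fitted probabilities p_i
      W    = diag(p .* (1 - p));                % W = diag[p_i (1 - p_i)]
      beta = beta + nu * ((X' * W * X) \ (X' * (y - p)));   % damped Newton step
  end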


Revisit The Simple Example with Only One β

Data

x = {8, 14, −7, 6, 5, 6, −5, 1, 0, −17}
y = {1, 1, 0, 0, 1, 0, 1, 0, 0, 0}

Log likelihood function

[Plot: l(β) for β ∈ [−1, 1].]


Newton’s Method with Learning Rate ν = 1

[Plot: βt vs. iteration, ν = 1, for starting points β0 = 0.2, 0.32, 0.33.]

When the initial β0 = 0.32, the method converges. When β0 = 0.33, it does not converge.


Newton’s Method with Learning Rate ν = 0.1

[Plot: βt vs. iteration, ν = 0.1, for starting points β0 = 0.2, 0.32, 0.33.]


Fitting Logistic Regression With Regularization

The almost correct update formula:

βt = β(t−1) + ν [X^T W X + λI]^{−1} X^T (y − p) |_{t−1}

Adding the regularization parameter λ usually improves the numerical stability and sometimes may even result in better test errors.

There are also good statistical interpretations.


Fitting Logistic Regression With Regularization

The update formula:

βt = β(t−1) + ν [X^T W X + λI]^{−1} [X^T (y − p) − λβ] |_{t−1}

To understand the formula, consider the following modified (regularized) likelihood function:

l(β) = ∑_{i=1}^n [ yi log pi + (1 − yi) log(1 − pi) ] − (λ/2) ∑_{j=0}^p βj²
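In the Matlab sketch above, the regularized step is a one-line modification (λ here is a hypothetical value; X, y, beta as before):

  % Damped Newton step with L2 regularization.
  lambda = 1;   nu = 0.1;                       % hypothetical settings
  p    = 1 ./ (1 + exp(-X * beta));
  W    = diag(p .* (1 - p));
  beta = beta + nu * ((X' * W * X + lambda * eye(size(X, 2))) \ ...
                      (X' * (y - p) - lambda * beta));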


Newton’s Method with No Regularization λ = 0 (ν = 0.1)

[Plot: βt vs. iteration, ν = 0.1, λ = 0, for starting points β0 = 1, 5, 10.]


Newton’s Method with Regularization λ = 1 (ν = 0.1)

[Plot: βt vs. iteration, ν = 0.1, λ = 1, for starting points β0 = 1, 5, 10.]


Newton’s Method with Regularization λ = 1 (ν = 0.1)

[Plot: βt vs. iteration, ν = 0.1, λ = 1, for starting points β0 = 10, 20, 50.]


Crab Data Analysis (from Agresti’s Book)

Color (C)   Spine (S)   Width (W, cm)   Weight (Wt, kg)   # Satellites (Sa)
    2           3            28.3             3.05                8
    3           3            22.5             1.55                0
    1           1            26.0             2.30                9
    3           3            24.8             2.10                0
    3           3            26.0             2.60                4
    2           3            23.8             2.10                0
    1           1            26.5             2.35                0
    3           2            24.7             1.90                0
   ...

It is natural to view color as a (nominal) categorical variable and weight and width as numerical variables. The distinction, however, is often not clear in practice.


Logistic regression for Sa classification using width only

y = 1 if Sa > 0, y = 0 if Sa = 0. Only one variable, x = W. The task is to compute Pr(y = 1|x) and classify the data using a simple classification rule:

ŷi = 1 if pi > 0.5

Using our own Matlab code, the fitted model is

p(xi) = e^{−12.3108 + 0.497 xi} / (1 + e^{−12.3108 + 0.497 xi})

If we choose not to include the intercept term, the fitted model becomes

p(xi) = e^{0.02458 xi} / (1 + e^{0.02458 xi})
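A minimal Matlab sketch of the prediction step, using the intercept model above (W here is the vector of crab widths, assumed to be in the workspace):

  % Predicted probabilities and 0.5-threshold classification (width-only model).
  b0 = -12.3108;  b1 = 0.497;              % fitted coefficients reported above
  p    = 1 ./ (1 + exp(-(b0 + b1 * W)));   % Pr(y = 1 | width)
  yhat = (p > 0.5);                        % simple classification rule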


Training mis-classification errors

[Plot: training mis-classification error (%) vs. iteration, for λ = 0, 1, 5.]


Training log likelihood

[Plot: training log likelihood vs. iteration, for λ = 0, 1, 5.]


Logistic regression for Sa classification using S, W, and Wt

Using our own Matlab code, the fitted model is

p(S, W, Wt) = e^{−9.4684 + 0.0495 S + 0.3054 W + 0.8447 Wt} / (1 + e^{−9.4684 + 0.0495 S + 0.3054 W + 0.8447 Wt})


Training mis-classification errors

[Plot: training mis-classification error (%) vs. iteration, for λ = 0, 1, 5.]


Training log likelihood

[Plot: training log likelihood vs. iteration, for λ = 0, 1, 5.]


Multi-Class Logistic Regression

Data: {xi, yi}, i = 1 to n (so the data matrix X ∈ R^{n×p}), yi ∈ {0, 1, 2, ..., K − 1}.

Probability model

pi,k = Pr{yi = k | xi}, k = 0, 1, ..., K − 1,

∑_{k=0}^{K−1} pi,k = 1 (only K − 1 degrees of freedom).

Label assignment

ŷi | xi = argmax_k pi,k


Multi-Class Example: USPS ZipCode Recognition

Person 1: [five 16 × 16 handwritten digit images]

Person 2: [five 16 × 16 handwritten digit images]

Person 3: [five 16 × 16 handwritten digit images]

This task can be cast (simplified) as a K = 10-class classification problem.


Multinomial Logit Probability Model

pi,k = e^{Fi,k} / ∑_{s=0}^{K−1} e^{Fi,s}

where Fi,k = Fk(xi) is the function to be learned from the data.

Linear logistic regression: Fi,k = Fk(xi) = xi βk

Note that βk = [βk,0, βk,1, ..., βk,p]^T


Multinomial Maximum Likelihood

Multinomial likelihood: suppose yi = k, then

Lik ∝ pi,0^0 × ... × pi,k^1 × ... × pi,K−1^0 = pi,k

Log likelihood:

li = log pi,k, if yi = k

Total log-likelihood in a double summation form:

l(β) = ∑_{i=1}^n li = ∑_{i=1}^n ∑_{k=0}^{K−1} ri,k log pi,k,

ri,k = 1 if yi = k, and 0 otherwise.


Derivatives of Multi-Class Log-likelihood

The first derivative:

∂li/∂Fi,k = ri,k − pi,k

Proof:

∂pi,k/∂Fi,k = ( [∑_{s=0}^{K−1} e^{Fi,s}] e^{Fi,k} − e^{2Fi,k} ) / [∑_{s=0}^{K−1} e^{Fi,s}]² = pi,k (1 − pi,k)

∂pi,k/∂Fi,t = −e^{Fi,k} e^{Fi,t} / [∑_{s=0}^{K−1} e^{Fi,s}]² = −pi,k pi,t


∂li/∂Fi,k = ∑_{s=0}^{K−1} ri,s (1/pi,s) ∂pi,s/∂Fi,k
          = ri,k (1/pi,k) pi,k(1 − pi,k) + ∑_{s≠k} ri,s (1/pi,s)(−pi,s pi,k)
          = ri,k(1 − pi,k) − ∑_{s≠k} ri,s pi,k
          = ri,k − ∑_{s=0}^{K−1} ri,s pi,k = ri,k − pi,k

The second derivatives:

∂²li/∂Fi,k² = −pi,k (1 − pi,k),   ∂²li/∂Fi,k∂Fi,s = −pi,k pi,s


Multi-class logistic regression can be fairly complicated. Here, we introduce a simpler approach, which does not seem to explicitly appear in common textbooks.

Conceptually, we fit K binary classification problems (one vs. rest) at each iteration. That is, at each iteration, we update βk separately for each class. At the end of each iteration, we jointly update the probabilities pi,k = e^{xiβk} / ∑_{s=0}^{K−1} e^{xiβs}.


A Simple Implementation for Multi-Class Logistic Regression

At time t, update each coefficient vector:

βk^t = βk^(t−1) + ν [X^T Wk X]^{−1} X^T (rk − pk) |_{t−1}

where

rk = [r1,k, r2,k, ..., rn,k]^T
pk = [p1,k, p2,k, ..., pn,k]^T
Wk = diag[ pi,k(1 − pi,k) ], i = 1 to n

Then update pk, Wk for the next iteration.

Again, the magic parameter ν can be viewed as the learning rate; it helps make sure that the procedure converges. In practice, it is often set to ν = 0.1.
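A minimal Matlab sketch of one such iteration, under these assumptions: X is n × (p+1) with a leading column of ones, y is n × 1 with labels 0, ..., K−1, and B is a (p+1) × K matrix whose column k holds the coefficients of class k−1 (e.g., initialized to zeros):

  % One iteration of the per-class (one-vs-rest style) multi-class update.
  nu = 0.1;
  F  = X * B;                                % F(i,k) = x_i * beta_{k-1}
  F  = F - max(F, [], 2);                    % subtract row max for numerical stability
  P  = exp(F) ./ sum(exp(F), 2);             % multinomial logit probabilities p_{i,k}
  R  = double(y == (0:K-1));                 % indicators r_{i,k}  (n x K)
  for k = 1:K
      Wk = diag(P(:,k) .* (1 - P(:,k)));     % W_k = diag[p_{i,k}(1 - p_{i,k})]
      B(:,k) = B(:,k) + nu * ((X' * Wk * X) \ (X' * (R(:,k) - P(:,k))));
  end
  % Recompute P (and hence the W_k) from the updated B before the next iteration.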


Logistic Regression With L2 Regularization

Total log-likelihood in a double summation form:

l(β) = ∑_{i=1}^n ∑_{k=0}^{K−1} ri,k log pi,k − (λ/2) ∑_{k=0}^{K−1} ∑_{j=0}^d βk,j²

ri,k = 1 if yi = k, and 0 otherwise

Let g(β) = (λ/2) ∑_{k=0}^{K−1} ∑_{j=0}^d βk,j². Then

∂g(β)/∂βk,j = λ βk,j,   ∂²g(β)/∂βk,j² = λ


At time t, the updating formula becomes

βk^t = βk^(t−1) + ν [X^T Wk X + λI]^{−1} [X^T (rk − pk) − λβk] |_{t−1}

L2 regularization sometimes improves the numerical stability and sometimes may even result in better test errors.


Logistic Regression Results on Zip Code Data

Zip code data: 7291 training examples in 256 dimensions. 2007 test examples.

[Plot: training mis-classification error (%) vs. iteration, for λ = 0, 5, 10.]

With no regularization (λ = 0), numerical problems may occur.


[Plot: testing mis-classification error (%) vs. iteration, for λ = 0, 5, 10.]


[Plot: training log likelihood vs. iteration, for λ = 5, 10.]


Another Example on Letter (K = 26) Recognition

Letter dataset: 2000 training samples in 16 dimensions. 18000 testing samples.

[Plot: training error rate vs. iteration, for λ = 0, 1, 10 (ν = 0.1).]


[Plot: testing error rate vs. iteration, for λ = 0, 1, 10 (ν = 0.1).]


[Plot: training log-likelihood vs. iteration, for λ = 0, 1, 10 (ν = 0.1).]


Revisit Crab Data as a Multi-Class Problem

Color (C)   Spine (S)   Width (W, cm)   Weight (Wt, kg)   # Satellites (Sa)
    2           3            28.3             3.05                8
    3           3            22.5             1.55                0
    1           1            26.0             2.30                9
    3           3            24.8             2.10                0
    3           3            26.0             2.60                4
    2           3            23.8             2.10                0
    1           1            26.5             2.35                0
    3           2            24.7             1.90                0


[Histogram: frequency vs. number of crab satellites (Sa).]

It appears reasonable to treat this as a binary classification problem, given the counts distribution and the number of samples. Nevertheless, it might still be interesting to consider it as a multi-class problem.


We consider a 6-class (0 to 5) classification problem by grouping all samples with counts ≥ 5 into class 5, using 3 variables (S, W, Wt).

[Figure: Train mis-classification error (%) vs. iteration, λ = 0, ν = 0.1.]

Compared to the binary-classification problem, it seems the mis-classification

error is much higher. Why?

Page 92: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 92

Some thoughts

• Multi-class problems are usually (but not always) more difficult.

• For binary classification, an error rate of 50% is very bad because a random guess can achieve that. For a K-class problem, the error rate of random guessing would be 1 − 1/K (5/6 in this example). So the results may actually not be too bad.

• Multi-class models are more complex (in that they require more parameters) and need more data samples. The crab dataset is very small.

• This problem may actually be ordinal classification instead of nominal, for biological reasons.

Page 93: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 93

Dealing with Nominal Categorical Variables

It might be reasonable to consider “Color (C)” as a nominal categorical variable. Then how can we include it in our logistic regression model?

The trick is simple. Suppose the color variable takes four different values. We add four binary variables (i.e., variables taking values only in {0, 1}). For any particular sample, exactly one of the four variables takes the value 1.

This is basically the same trick as expanding the y in multi-class logistic regression. A small MATLAB sketch is given below.
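A minimal MATLAB sketch of this dummy coding (the names C, X0, and Xnew are hypothetical; C is assumed to hold integer color codes 1–4):

% Dummy (one-hot) coding of a nominal variable with four levels.
% C: n-by-1 vector of integer color codes in {1,2,3,4} (assumed).
n = length(C);
Cdummy = zeros(n, 4);
for level = 1:4
    Cdummy(:, level) = (C == level);   % 1 if the sample has this color, 0 otherwise
end
Xnew = [Cdummy, X0];                   % append to the other predictors X0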

Page 94: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 94

Adding Color as Four Binary Variables

C1 C2 C3 C4 S W Wt Sa

0 1 0 0 3 28.3 3.05 8

0 0 1 0 3 22.5 1.55 0

1 0 0 0 1 26.0 2.30 9

0 0 1 0 3 24.8 2.10 0

0 0 1 0 3 26.0 2.60 4

0 1 0 0 3 23.8 2.10 0

1 0 0 0 1 26.5 2.35 0

0 0 1 0 2 24.7 1.90 0

Page 95: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 95

Adding the color variable noticeably reduced the (binary) classification error.

Colors Not Included vs. Colors Included

[Figures: Train mis-classification error (%) vs. iteration, λ = 1e−10, ν = 0.1; left panel without colors, right panel with colors.]

Here, to minimize the effect of regularization, only λ = 10^−10 is used, just enough to ensure numerical stability.

Logistic regression does not directly minimize the mis-classification error. The log likelihood probably better illustrates the effect of adding the color variable.

Page 96: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 96

Adding the color variable noticeably improved the log likelihood.

Colors Not Included vs. Colors Included

[Figures: Train log likelihood vs. iteration, λ = 1e−10, ν = 0.1; left panel without colors, right panel with colors.]

Page 97: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 97

Adding Pairwise (Interaction) Variables

Feature expansion is a common trick to boost the performance. For example,

(x_1, x_2, x_3, ..., x_p) =⇒ (x_1, x_2, x_3, ..., x_p, x_1^2, x_1x_2, ..., x_1x_p, x_2^2, x_2x_3, ..., x_2x_p, ..., x_p^2)

In other words, the original p variables can be expanded to

p + p(p + 1)/2 variables

The expansion often helps, but not always. In general, when the number of examples n is large, feature expansion usually helps.
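A small MATLAB sketch of one way to carry out this expansion (X is assumed to be the n × p matrix of original variables; the column ordering is arbitrary):

% Expand X (n x p) with all squares and pairwise products: p + p(p+1)/2 new columns.
[n, p] = size(X);
Xpair = zeros(n, p*(p+1)/2);
col = 0;
for i = 1:p
    for j = i:p
        col = col + 1;
        Xpair(:, col) = X(:, i) .* X(:, j);   % x_i^2 when j == i, x_i*x_j otherwise
    end
end
Xexpanded = [X, Xpair];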

Page 98: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 98

Adding Pairwise Interactions on Crab Data

Adding all pairwise (interaction) variables only helps slightly in terms of the log likelihood (red denotes using only the original variables).

[Figures: Train mis-classification error (%) and train log likelihood vs. iteration, ν = 0.1, for λ = 1e−8 and λ = 1e−10.]

Page 99: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 99

Simplify Label Assignments

Recall label assignment in logistic regression:

y_i | x_i = argmax_k p_{i,k}

and the probability model of logistic regression:

p_{i,k} = e^{x_i β_k} / ∑_{s=0}^{K−1} e^{x_i β_s}

It is equivalent to assign labels directly by

y_i | x_i = argmax_k x_i β_k

This raises an interesting question: maybe we don't need a probability model for the purpose of classification? For example, a linear regression may be sufficient?
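In MATLAB, assuming the K coefficient vectors β_k are stored as the columns of a (p+1) × K matrix B and X is the design matrix, this assignment rule is just a row-wise argmax (a sketch, not the homework code):

% Assign each sample to the class with the largest linear score x_i * beta_k.
scores = X * B;                  % n x K matrix whose (i,k) entry is x_i * beta_k
[~, yhat] = max(scores, [], 2);  % column index of the row maximum
yhat = yhat - 1;                 % shift to labels 0, ..., K-1 if the classes are coded that way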

Page 100: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 100

Linear Regression and Its Applications in Classification

Both linear regression and logistic regression are examples of

Generalized Linear Models (GLM) .

We first review linear regression and then discuss how to use it for (multi-class)

classification.

Page 101: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 101

Review Linear Regression

Given data {x_i, y_i}_{i=1}^n, where x_i is a p-dimensional vector and y_i is a scalar (not restricted to be categorical).

We again construct the data matrix

X =
[ 1  x_{1,1}  x_{1,2}  ...  x_{1,p}
  1  x_{2,1}  x_{2,2}  ...  x_{2,p}
  ...
  1  x_{n,1}  x_{n,2}  ...  x_{n,p} ],
y = [y_1, y_2, ..., y_n]^T

The data model is

y = X β

β (a vector of length p + 1) is obtained by minimizing the mean square error (equivalent to maximizing the joint likelihood under the normal distribution model).

Page 102: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 102

Linear Regression Estimation by Least Squares

The idea is to minimize the mean square error

MSE(β) = ∑_{i=1}^n |y_i − x_i β|^2 = (Y − Xβ)^T (Y − Xβ)

We can find the optimal β by setting the first derivative to zero:

∂MSE(β)/∂β = −2 X^T (Y − Xβ) = 0

=⇒ X^T Y = X^T X β

=⇒ β = (X^T X)^{−1} X^T Y

Don't worry much about how to do matrix derivatives. The trick is to view this simply as a scalar derivative, but we need to manipulate the order (and add transposes) to get the dimensions correct.
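A minimal MATLAB sketch of this solution (X is assumed to already contain the column of ones):

% Least-squares estimate of beta and the fitted values.
beta_hat = (X' * X) \ (X' * y);   % solves the normal equations X'X beta = X'y
% Numerically, beta_hat = X \ y is usually preferred over forming X'X explicitly.
yfit = X * beta_hat;
mse  = mean((y - yfit).^2);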

Page 103: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 103

Ridge Regression

Similar to l2-regularized logistic regression, we can add a regularization

parameter

β = (X^T X + λI)^{−1} X^T Y

which is known as ridge regression .

Adding regularization not only improves the numerical stability but also often

increases the test accuracy.
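A one-line MATLAB sketch (λ is an illustrative value; X is assumed to include the intercept column):

% Ridge regression for a given regularization parameter lambda.
lambda = 1;
p1 = size(X, 2);                                 % number of columns (p + 1 with intercept)
beta_ridge = (X' * X + lambda * eye(p1)) \ (X' * y);

In practice the intercept is sometimes left unpenalized; the simple form above matches the formula on this slide.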

Page 104: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 104

Linear Regression for Classification

For binary classification, i.e., y_i ∈ {0, 1}, we can simply treat y_i as a numerical response and fit a linear regression. To obtain the classification result, we can simply use ŷ = 0.5 as the classification threshold.

Multi-class classification (with K classes) is more interesting. We can use exactly the same trick as in multi-class logistic regression: first expand each y_i into a vector of length K with only one entry equal to 1, then fit K binary linear regressions simultaneously, and use the location of the maximum fitted value as the class label prediction. Since you have completed the homework on multi-class logistic regression, this idea should be straightforward now. Also see the sample code and the sketch below.
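A sketch of the multi-class version in MATLAB, assuming labels y take values 0, …, K−1 and X includes the intercept column (λ is a tiny ridge term for numerical stability, as in the slides):

% Expand y into an n x K indicator matrix and fit K linear regressions at once.
K = max(y) + 1;
n = length(y);
Y = zeros(n, K);
Y(sub2ind([n, K], (1:n)', y + 1)) = 1;                  % exactly one entry of each row is 1
lambda = 1e-10;
Beta = (X' * X + lambda * eye(size(X, 2))) \ (X' * Y);  % (p+1) x K coefficients
[~, yhat] = max(X * Beta, [], 2);                       % location of the maximum fitted value
yhat = yhat - 1;
err = mean(yhat ~= y);                                  % mis-classification error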

Page 105: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 105

Mis-Classification Errors on Zipcode Data

[Figure: Mis-classification error (%) vs. λ on the zipcode data, train and test curves.]

• This is essentially the first iteration of multi-class logistic regression. Clearly, the results are not as good as logistic regression with many iterations.

• Adding regularization (λ) slightly increases the training errors but decreases the testing errors over a certain range.

Page 106: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 106

Linear Regression Classification on Crab Data

Binary classification. 50% of the data points are used for training and the rest for

testing. Three models are compared:

• Model using S, W, and Wt.

• Model using the above three as well as colors.

• Model using all four plus all pairwise interactions.

Both linear regression and logistic regression are tried. For logistic regression, we use ν = 0.1 and only report the errors at the 100th iteration.

Page 107: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 107

[Figure: Mis-classification error (%) vs. λ using Spine, Width, and Weight; linear regression vs. logistic regression.]

Linear regression and logistic regression produce almost the same results.

Regularization does not appear to be helpful in this example.

Page 108: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 108

[Figure: Mis-classification error (%) vs. λ with colors included; linear regression vs. logistic regression.]

Linear regression seems to be even slightly better. Regularization still does not appear to be helpful.

Page 109: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 109

[Figure: Mis-classification error (%) vs. λ with colors and pairwise interactions included; linear regression vs. logistic regression.]

Now logistic regression seems to be slightly better. Regularization really helps.

Page 110: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 110

Limitations of Using Linear Regression for Classification

• For many datasets, the classification accuracies of using linear regressions

are actually quite similar to using logistic regressions, especially when the

datasets are “not so good.”

• However, for many “good” datasets (such as zip code data), logistic

regressions may have some noticeable advantages.

• Linear regression does not (directly) provide a probabilistic interpretation of the classification results, which may be needed in many applications, for example, learning to rank using classification.

Page 111: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 111

Poisson Log-Linear Model

Revisit the crab data. It appears very natural to model the Sa counts as a Poisson random variable, which may be parameterized by a linear model.

Color (C) Spine (S) Width (W, cm) Weight (Wt, kg) # Satellites (Sa)

2 3 28.3 3.05 8

3 3 22.5 1.55 0

1 1 26.0 2.30 9

3 3 24.8 2.10 0

3 3 26.0 2.60 4

2 3 23.8 2.10 0

1 1 26.5 2.35 0

3 2 24.7 1.90 0

Page 112: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 112

Poisson Distribution

Denote Y ∼ Poisson(µ). The probability mass function (PMF) is

Pr (Y = y) = e^{−µ} µ^y / y!,   y = 0, 1, 2, ...

E(Y) = µ,   Var(Y) = µ

One drawback of the Poisson model is that its variance is the same as its mean, which often contradicts real data observations.

Page 113: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 113

Fitting Poisson Distribution

Given n observations y_i, i = 1 to n, the MLE of µ is simply the sample mean:

µ̂ = (1/n) ∑_{i=1}^n y_i

[Figures: Observed counts vs. fitted counts; histograms of observed frequency and fitted frequency of Sa.]

No need to perform any test. It is obviously not a good fit.

Page 114: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 114

Linear Regression for Predicting Counts

Maybe we can simply model

y_i ∼ N(µ_i, σ^2),   µ_i = x_i β = β_0 + x_{i,1} β_1 + ... + x_{i,p} β_p

i.e., µ_i is the mean of a normal distribution N(µ_i, σ^2).

This way, we can easily predict the counts by

β = (X^T X)^{−1} X^T y,   ŷ = X β

Page 115: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 115

Histograms of the predicted counts from linear regression using only the width. The sum of squared errors (SE) is

SE = ∑_{i=1}^n (y_i − ŷ_i)^2 = 1.5079 × 10^3

[Figure: Histogram of counts predicted by linear regression (frequency vs. Sa).]

Clearly, linear regression cannot possibly be the best approach.

Page 116: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 116

Poisson Regression Model

Assumption:

y_i ∼ Poisson(µ_i),   log µ_i = x_i β = β_0 + x_{i,1} β_1 + ... + x_{i,p} β_p

Note that this is very different from assuming that the logarithms of the counts

follow a linear regression model. Why?

Page 117: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 117

[Figure: Histogram of counts predicted by Poisson regression (frequency vs. Sa).]

Clearly, this looks better than the histogram from linear regression.

However, the squared error SE = ∑_{i=1}^n (y_i − ŷ_i)^2 = 1.5373 × 10^3 is actually larger than the SE from linear regression. Why is it not too surprising?

Page 118: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 118

Comparing Fitted Counts

y     ŷ (Linear)     ŷ (Poisson)

8.0000 3.9344 3.8103

0 0.9916 1.4714

9.0000 2.7674 2.6127

0 2.1586 2.1459

4.0000 2.7674 2.6127

0 1.6512 1.8212

0 3.0211 2.8361

0 2.1079 2.1110

0 1.6005 1.7916

0 2.5645 2.4468

Need to see more rows to understand the differences...

Page 119: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 119

Now we use 3 variables, S, W, Wt, to fit linear regression and Poisson regression.

[Figures: Histograms of counts predicted by linear regression and by Poisson regression (frequency vs. Sa).]

Clearly, Poisson regression looks better, although the SE values are 1.4696 × 10^3 and 1.5343 × 10^3, respectively, for linear regression and Poisson regression.

Page 120: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 120

Fitting Poisson Regression

Log Likelihood:

l_i = −µ_i + y_i log µ_i = −e^{x_i β} + y_i x_i β

First Derivatives:

∂l_i/∂β = (y_i − µ_i) x_i^T

Page 121: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 121

Given n observations, the log likelihood is l = ∑_{i=1}^n l_i.

First Derivatives (matrix form):

∂l/∂β = X^T (y − µ)

Second Derivatives (matrix form):

∂^2 l / ∂β∂β^T = −X^T W X

where W is the diagonal matrix with entries µ_i.

They look very similar to logistic regression.

Page 122: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 122

Newton’s Method for Solving Poisson Regression Model

β_new = β_old − [ (∂^2 l(β)/∂β∂β^T)^{−1} ∂l(β)/∂β ]_{β_old}

β_t = β_{t−1} + ν [X^T W X]^{−1} [X^T (y − µ)]_{t−1}

where again ν (e.g., 0.1) is a shrinkage parameter which helps the numerical stability.
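A minimal MATLAB sketch of this iteration (X is assumed to contain an intercept column and y the counts; ν and the number of iterations are illustrative):

% Newton's method with shrinkage nu for Poisson regression.
nu = 0.1;  maxIter = 100;
beta = zeros(size(X, 2), 1);              % start from beta = 0
for t = 1:maxIter
    mu = exp(X * beta);                   % current means mu_i = exp(x_i * beta)
    W  = diag(mu);                        % diagonal weight matrix
    beta = beta + nu * ((X' * W * X) \ (X' * (y - mu)));
end
loglik = sum(-mu + y .* log(mu));         % log likelihood, up to the -log(y_i!) constant
% For large n, X' * W * X can be formed more cheaply as X' * bsxfun(@times, mu, X).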

Page 123: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 123

Why “Log Linear”?

Poisson Model Without Log:

y_i ∼ Poisson(µ_i),   µ_i = x_i β = β_0 + x_{i,1} β_1 + ... + x_{i,p} β_p

Its log likelihood and first derivative (assuming only one β) are:

l_i = −µ_i + y_i log µ_i = −x_i β + y_i log (x_i β)

∂l_i/∂β = −x_i + y_i x_i / (x_i β)

Considering the second derivatives and more than one β, using this model is almost like “looking for trouble.” There is also another obvious issue with this model. What is it?

The reason for “Log Linear” will become clearer under the GLM framework.

Page 124: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 124

Summary of Models

Given a dataset {x_i, y_i}_{i=1}^n, so far we have seen three different models:

• Linear Regression, −∞ < y_i < ∞:

y_i ∼ N(µ_i, σ^2),   µ_i = x_i β

• Poisson Regression, y_i ∈ {0, 1, 2, ...}:

y_i ∼ Poisson(µ_i),   log µ_i = x_i β

• Binary Logistic Regression, y_i ∈ {0, 1}:

y_i ∼ Binomial(p_i),   log (p_i / (1 − p_i)) = x_i β

Page 125: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 125

Quotes from George E. P. Box

• Essentially, all models are wrong, but some are useful.

• Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.

Page 126: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 126

Generalized Linear Models (GLM)

All the models we have seen so far belong to the family of generalized linear models (GLM). In general, a GLM consists of three components:

• The random component y_i ∼ f(y_i; θ_i), where

f(y_i; θ_i) = a(θ_i) b(y_i) e^{y_i Q(θ_i)}

• The systematic component η_i = x_i β = ∑_{j=0}^p x_{i,j} β_j.

(This may be replaced by a more flexible model.)

• The link function η_i = g(µ_i), where µ_i = E(y_i).

g(µ) is a monotonic function. If g(µ) = µ, it is called the “identity link”.

Page 127: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 127

Revisit Poisson Log Linear Model Under GLM

For GLM,

y_i ∼ f(y_i; θ_i) = a(θ_i) × b(y_i) × e^{y_i Q(θ_i)}

In this case, θ_i = µ_i,

f(y_i) = e^{−µ_i} µ_i^{y_i} / y_i! = [e^{−µ_i}] [1/y_i!] [e^{y_i log µ_i}]

Therefore,

a(µ_i) = e^{−µ_i},   b(y_i) = 1/y_i!,   Q(µ_i) = log µ_i

And the link function

g(µ_i) = Q(θ_i) = log µ_i = x_i β

This is called a canonical link.

Page 128: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 128

Revisit Binary Logistic Model Under GLM

For GLM,

y_i ∼ f(y_i; θ_i) = a(θ_i) × b(y_i) × e^{y_i Q(θ_i)}

In this case, θ_i = p_i,

f(y_i) = p_i^{y_i} (1 − p_i)^{1−y_i} = [(1 − p_i)] [1] [e^{y_i log (p_i/(1−p_i))}]

Therefore,

a(p_i) = 1 − p_i,   b(y_i) = 1,   Q(p_i) = log (p_i / (1 − p_i))

And the link function

g(p_i) = Q(θ_i) = log (p_i / (1 − p_i)) = x_i β

This is again a canonical link.

Page 129: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 129

Revisit Linear Regression Model Under GLM (with σ^2 = 1)

For GLM,

y_i ∼ f(y_i; θ_i) = a(θ_i) × b(y_i) × e^{y_i Q(θ_i)}

In this case, θ_i = µ_i (and σ^2 = 1 by assumption). Dropping the constant 1/√(2π), which could be absorbed into b(y_i),

f(y_i) = e^{−(y_i − µ_i)^2 / 2} = [e^{−µ_i^2 / 2}] [e^{−y_i^2 / 2}] [e^{y_i µ_i}]

Therefore,

a(µ_i) = e^{−µ_i^2 / 2},   b(y_i) = e^{−y_i^2 / 2},   Q(µ_i) = µ_i

And the link function

g(µ_i) = Q(θ_i) = µ_i = x_i β

This is again a canonical link and is in fact the identity link.

Page 130: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 130

Statistical Inference

After we have fitted a GLM (e.g., logistic regression) and estimated the

coefficients β, we can ask many questions, such as

• Which βj is more important?

• Is βj significantly different from 0?

• What is the (joint) distribution of β?

To understand these questions, it is crucial to learn some theory of the MLE,

because fitting a GLM is finding the MLE for a particular distribution.

Page 131: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 131

Revisit the Maximum Likelihood Estimation (MLE)

Observations x_i, i = 1 to n, are i.i.d. samples from a distribution with probability density function f_X(x; θ_1, θ_2, ..., θ_k), where θ_j, j = 1 to k, are parameters to be estimated.

The maximum likelihood estimator seeks the θ to maximize the joint likelihood

θ̂ = argmax_θ ∏_{i=1}^n f_X(x_i; θ)

Or, equivalently, to maximize the log joint likelihood

θ̂ = argmax_θ ∑_{i=1}^n log f_X(x_i; θ) = argmax_θ l(θ; x)

where l(θ; x) = ∑_{i=1}^n log f_X(x_i; θ) is the joint log likelihood function.

Page 132: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 132

Large Sample Theory for MLE

Large sample theory says that, as n → ∞, θ̂ is asymptotically unbiased and normal:

θ̂ ∼ N(θ, 1/(n I(θ))),   approximately

I(θ) is the Fisher Information of θ (per observation):

I(θ) = −E[ ∂^2/∂θ^2 log f(X | θ) ]

so that, for the joint log likelihood l, E(−l''(θ)) = n I(θ).

Note that it is also true that

I(θ) = E[ (∂/∂θ log f(X | θ))^2 ]

but you don't have to worry about the proof.

Page 133: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 133

Intuition About the Asymptotic Distributions & Variances o f MLE

The MLE θ is the solution to the MLE equation l′(θ) = 0.

The Taylor expansion around the true θ

l′(θ) ≈ l′(θ) + (θ − θ)l′′(θ)

Let l′(θ) = 0 (because θ is the MLE solution)

(θ − θ) ≈ − l′(θ)

l′′(θ)

We know that

E(−l′′(θ)) = nI(θ) = E(l′(θ))2,

E(l′(θ)) = 0. (Read the next slide if interested in the proof)

Page 134: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 134

(Don’t worry about this slide if you are not interested.)

l'(θ) = ∑_{i=1}^n ∂ log f(x_i)/∂θ = ∑_{i=1}^n [∂f(x_i)/∂θ] / f(x_i)

E(l'(θ)) = ∑_{i=1}^n E( ∂ log f(x_i)/∂θ ) = n E( [∂f(x)/∂θ] / f(x) ) = 0

because

E( [∂f(x)/∂θ] / f(x) ) = ∫ ([∂f(x)/∂θ] / f(x)) f(x) dx = ∫ ∂f(x)/∂θ dx = ∂/∂θ ∫ f(x) dx = 0

Page 135: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 135

The heuristic trick is to approximate

θ̂ − θ ≈ l'(θ) / (−l''(θ)) ≈ l'(θ) / E(−l''(θ)) = l'(θ) / (n I(θ))

Therefore,

E(θ̂ − θ) ≈ E(l'(θ)) / (n I(θ)) = 0

Var(θ̂) ≈ E(θ̂ − θ)^2 ≈ E[ (l'(θ) / (n I(θ)))^2 ] = n I(θ) / (n^2 I^2(θ)) = 1/(n I(θ))

This is why, intuitively, we know that θ̂ ∼ N(θ, 1/(n I(θ))).

Page 136: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 136

Example: Normal Distribution

Given n i.i.d. samples, x_i ∼ N(µ, σ^2), i = 1 to n.

log f_X(x; µ, σ^2) = −(x − µ)^2 / (2σ^2) − (1/2) log(2πσ^2)

∂^2 log f_X(x; µ, σ^2) / ∂µ^2 = −1/σ^2   =⇒   I(µ) = 1/σ^2

Therefore, the MLE µ̂ will have asymptotic variance 1/(n I(µ)) = σ^2/n. But in this case, we already know that

µ̂ = (1/n) ∑_{i=1}^n x_i ∼ N(µ, σ^2/n)

In other words, the “asymptotic” variance of the MLE is in fact exact in this case.

Page 137: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 137

Example: Binomial Distribution

x ∼ Binomial(p, n):   Pr (x = k) = C(n, k) p^k (1 − p)^{n−k}

Log likelihood and Fisher Information:

l(p) = k log p + (n − k) log(1 − p)

l'(p) = k/p − (n − k)/(1 − p)   =⇒   MLE p̂ = k/n

l''(p) = −k/p^2 − (n − k)/(1 − p)^2

I(p) = −E(l''(p)) = np/p^2 + (n − np)/(1 − p)^2 = n / (p(1 − p))

That is, the asymptotic variance of the MLE p̂ is p(1 − p)/n, which is in fact again the exact variance.
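A quick MATLAB check of this claim by simulation (n, p, and the number of repetitions are arbitrary illustrative values):

% Compare the empirical variance of p_hat = k/n with p(1-p)/n.
n = 100;  p = 0.3;  reps = 100000;
k = sum(rand(n, reps) < p, 1)';     % reps draws of Binomial(n, p) without any toolbox
p_hat = k / n;
[var(p_hat), p * (1 - p) / n]       % the two numbers should be very close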

Page 138: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 138

Example: Contingency Table with Known Margins

n = n11 + n12 + n21 + n22,   N = N11 + N12 + N21 + N22

[Diagram: two 2×2 tables, one with sample counts n11, n12, n21, n22 and one with population counts N11, N12, N21, N22.]

Margins: M1 = N11 + N12 and M2 = N11 + N21 are known.

The (asymptotic) variance of the MLE (for N11) is

Var(N̂11,MLE) = (N/n) / [ 1/N11 + 1/(M1 − N11) + 1/(M2 − N11) + 1/(N − M1 − M2 + N11) ]

Page 139: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 139

Derivation: The log likelihood is

l(N11) = n11 log(N11/N) + n12 log((M1 − N11)/N) + n21 log((M2 − N11)/N) + n22 log((N − M1 − M2 + N11)/N)

The MLE solution is

l'(N11) = n11/N11 − n12/(M1 − N11) − n21/(M2 − N11) + n22/(N − M1 − M2 + N11) = 0

The second derivative is

l''(N11) = −n11/N11^2 − n12/N12^2 − n21/N21^2 − n22/N22^2

Page 140: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 140

The Fisher Information is thus

I(N11) = E(−l''(N11)) = E(n11)/N11^2 + E(n12)/N12^2 + E(n21)/N21^2 + E(n22)/N22^2

= (n/N) [ 1/N11 + 1/N12 + 1/N21 + 1/N22 ]

Recall

E(n11) = n N11/N,   E(n12) = n N12/N,   E(n21) = n N21/N,   E(n22) = n N22/N

Page 141: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 141

Asymptotic Covariance Matrix

More generally, suppose there is more than one parameter, θ = {θ1, θ2, ..., θp}. The Fisher Information Matrix is defined as

I(θ) = E( −∂^2 l(θ) / ∂θ_i ∂θ_j )

And the asymptotic covariance matrix is

Cov(θ̂) = I^{−1}(θ)

Page 142: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 142

Review Binary Logistic Regression Derivatives

Newton’s update formula:

β_new = β_old − [ (∂^2 l(β)/∂β∂β^T)^{−1} ∂l(β)/∂β ]_{β_old}

where, in matrix form,

∂l(β)/∂β = ∑_{i=1}^n x_i^T (y_i − p(x_i; β)) = X^T (y − p)

∂^2 l(β)/∂β∂β^T = −∑_{i=1}^n x_i^T x_i p(x_i; β)(1 − p(x_i; β)) = −X^T W X

where W = diag{p(x_i)(1 − p(x_i))}.

Page 143: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 143

Fisher Information and Covariance for Logistic Regression

Suppose Newton’s iteration has reached the optimal solution (very important). Then

I(β) = E(X^T W X) = X^T W X

And the asymptotic covariance matrix is

Cov(β̂) = I^{−1}(β) = [X^T W X]^{−1}

In other words, the MLE estimate β̂ of the binary logistic regression parameters is asymptotically jointly normal:

β̂ ∼ N(β, [X^T W X]^{−1})

Page 144: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 144

A Simple Test for Logistic Regression Coefficients

At convergence, the coefficients of logistic regression satisfy

β̂ ∼ N(β, [X^T W X]^{−1})

We can just test each coefficient separately because, asymptotically,

β̂_j ∼ N(β_j, [X^T W X]^{−1}_{jj})

which allows us to use normal probability functions to compute the p-values.

Two caveats: (1) We need the “true” W, which is replaced by the estimated W at the last iteration. (2) We still have to specify the true β_j for the test. In general, it makes sense to test H0 : β_j = 0.
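A sketch of this test in MATLAB, assuming beta and the fitted probabilities p are taken from the last Newton iteration of the binary logistic regression:

% Wald-type z-test for each coefficient, H0: beta_j = 0.
W = diag(p .* (1 - p));             % estimated W at the last iteration
CovBeta = inv(X' * W * X);          % estimated asymptotic covariance of beta_hat
se = sqrt(diag(CovBeta));           % standard errors
z = beta ./ se;                     % z statistics for H0: beta_j = 0
pval = erfc(abs(z) / sqrt(2));      % two-sided p-values; equals 2*(1 - normcdf(abs(z)))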

Page 145: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 145

GLM with R

> data= read.table("d:\\class\\6030Spring12\\fig\\cra b.txt");

> model = glm((data[,5]==0)˜data$V2+data$V3+data$V4,fa mily=’binomial’);

> summary(model)

Call:

glm(formula = (data[, 5] == 0) ˜ data$V2 + data$V3 + data$V4,

family = "binomial")

Deviance Residuals:

Min 1Q Median 3Q Max

-1.7120 -0.8948 -0.5242 1.0431 2.0833

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 9.46885 3.56974 2.653 0.00799 **data$V2 -0.04952 0.22094 -0.224 0.82267

data$V3 -0.30540 0.18220 -1.676 0.09370 .

data$V4 -0.84479 0.67369 -1.254 0.20985

---

Signif. codes: 0 ’ *** ’ 0.001 ’ ** ’ 0.01 ’ * ’ 0.05 ’.’ 0.1 ’ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Page 146: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 146

Null deviance: 225.76 on 172 degrees of freedom

Residual deviance: 192.84 on 169 degrees of freedom

AIC: 200.84

Number of Fisher Scoring iterations: 4

Page 147: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 147

R Resources

Download R executable from

http://www.r-project.org/

After launching R, type “help(glm)” (or “?glm”) to see the help page for glm.

Page 148: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 148

Validating the Asymptotic Theory Using Crab Data

We use 3 variables (Width, Weight, Spine, plus the intercept) from the crab data

for building the binary logistic regression model for predicting Pr (Sa > 0).

Instead of using the original labels, we generate the “true” β and sample the

labels from the generated β.

function TestLogitCrab
% Load the crab data and build the design matrix [1, S, W, Wt].
load crab.txt;
X = crab(:, 1:end-1);      % columns C, S, W, Wt
X(:, 1) = 1;               % replace the color column with the intercept
% Generate and fix a "true" beta (intercept, S, W, Wt).
be_true = [-10, 0.05, 0.3, 0.8]' + randn(4, 1) * 0.1;

The true β is fixed once generated. Once β is known, we can easily compute

p(x_i) = Pr (y_i = 1) = e^{x_i β} / (1 + e^{x_i β})

Page 149: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 149

Once β is fixed, we can compute p and sample the labels from Bernoulli(p(x_i)) for each x_i.

We then fit the binary logistic regression using the original x_i and the generated y_i to obtain β̂, which will be quite close to, but not identical to, the “true” β.

We then repeat the sampling procedure to create another set of labels and another β̂.

By repeating this procedure 1000 times, we will be able to assess the distribution of β̂.
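A sketch of this repeated-sampling procedure in MATLAB, continuing the snippet above (the loop sizes and the Newton settings are illustrative):

% Repeatedly sample labels from the "true" beta and refit by Newton's method.
reps = 1000;  nu = 0.1;  maxIter = 100;
p_true = 1 ./ (1 + exp(-X * be_true));            % Pr(y_i = 1) under the true beta
BetaHat = zeros(length(be_true), reps);
for r = 1:reps
    y = double(rand(size(p_true)) < p_true);      % Bernoulli(p(x_i)) labels
    beta = zeros(length(be_true), 1);
    for t = 1:maxIter                             % Newton iterations with shrinkage nu
        p = 1 ./ (1 + exp(-X * beta));
        W = diag(p .* (1 - p));
        beta = beta + nu * ((X' * W * X) \ (X' * (y - p)));
    end
    BetaHat(:, r) = beta;
end
mse = mean((BetaHat - repmat(be_true, 1, reps)).^2, 2);  % empirical MSE of each coefficient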

Page 150: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 150

[Figures: MSE vs. iterations for each coefficient, empirical vs. theoretical; true β0 = −10.0371, β1 = −0.033621, β2 = 0.24113, β3 = 0.97415.]

The MSEs for all β̂_j converge with increasing iterations. However, they deviate from the “true” variances predicted by [X^T W X]^{−1}, most likely because our sample size n = 173 is too small for the large-sample theory to be accurate.

Page 151: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 151

Experiments on the Zipcode data

[Images: sample zipcode digits labeled 2, 5, 3, 4 and 5, 1, 0, 0.]

Conjecture: If we display the p-values from the z-test, we might be able to see

some images similar to digits.

Page 152: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 152

Displaying the p-values as images

[Images: p-values displayed as images, one panel per digit class 0–9.]

Page 153: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 153

Displaying 1 − p-values as images

[Images: 1 − p-values displayed as images, one panel per digit class 0–9.]

Page 154: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 154

Displaying only top (smallest) 50 p-values as images

[Images: the smallest 50 p-values displayed as images, one panel per digit class 0–9.]

Page 155: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 155

Plausible Interpretations: The asymptotic theory says

β̂ ∼ N(β, [X^T W X]^{−1})

Using only the marginal (diagonal) information,

β̂_j ∼ N(β_j, [X^T W X]^{−1}_{jj})

may result in a serious loss of information. In particular, when the variables are highly correlated, as in this dataset, it is not realistic to expect that the marginal information alone will be sufficient.

In other words, for the zipcode data, many pixels “work together” to provide strong discriminating power. This is the power of team work.

Page 156: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 156

Multi-Class Ordinal Logistic Regression

For zip-code recognition, it is natural to treat each class (0 to 9) equally, because in general there are indeed no orders among them (unless you are doing specific studies in which the zip code information reveals physical locations).

In many applications, however, there are natural orders among the class labels. For example, in the crab data, it might be reasonable to consider # Sa as ordinal because it reflects the growth process. The variable “Spine condition” may also be ordinal.

Another example is webpage relevance ranking. A page with a rank of “perfect” (4) is certainly more important than a page rated “bad” (0).

Page 157: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 157

Practical Strategies

• For binary classification, it does not matter.

• In many cases, we can just ignore the orders.

• We can fit K binary logistic regressions by grouping the data according to whether the labels are smaller or larger than L, i.e., fitting

Pr (Label > L)

from which one can compute the individual class probabilities:

Pr (Label = L) = Pr (Label > L − 1) − Pr (Label > L)

One drawback is that for some data points the fitted class probabilities may be smaller than 0 after subtraction. But if you have lots of data, this method is often quite effective in practice, for example, in our previous work on ranking webpages. Do read the slides on ranking if you are interested.

• More sophisticated models...

Page 158: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 158

Methods for Modern Massive Data Sets (MMDS)

1. Normal Random Projections

2. Cauchy Random Projections

3. Stable Random Projections

4. Random Projections for Computing Higher-Order Distances

5. Skewed Stable Random Projections

6. Tentative: Sparse Signal Recovery

Page 159: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 159

An Introduction to Random Projections

Many applications require a data matrix A ∈ R^{n×D}.

For example, the term-by-document matrix may contain n = 10^10 documents (web pages) and D = 10^6 single words, or D = 10^12 double words (bi-gram model), or D = 10^18 triple words (tri-gram model).

Many matrix operations boil down to computing how close (or how far) two rows (or columns) of the matrix are. For example, linear least squares: (A^T A)^{−1} A^T y.

Challenges: The matrix may be too large to store, or computing A^T A is too expensive.

Page 160: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 160

Random Projections: Replace A by B = A × R

[Diagram: A (n × D) times R (D × k) equals B (n × k).]

R ∈ R^{D×k}: a random matrix with i.i.d. entries sampled from N(0, 1).

B ∈ R^{n×k}: the projected matrix, also random.

k is very small (e.g., k = 50 ∼ 100), but n and D are very large.

B approximately preserves the Euclidean distances and dot products between any two rows of A. In particular, E(B B^T) = k A A^T, so (1/k) B B^T is an unbiased estimate of A A^T.
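A small MATLAB sketch of the construction, on a toy matrix (the sizes are illustrative and far smaller than in the motivating example):

% Normal random projection: B = A * R, with R having i.i.d. N(0,1) entries.
n = 100;  D = 10000;  k = 50;
A = randn(n, D);                    % placeholder data matrix (real data would go here)
R = randn(D, k);                    % random projection matrix
B = A * R;                          % n x k projected matrix
% Squared norm of the first row before projection vs. its estimate after projection:
[sum(A(1,:).^2), sum(B(1,:).^2) / k]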

Page 161: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 161

Consider the first two rows in A: u1, u2 ∈ R^D.

u1 = {u_{1,1}, u_{1,2}, u_{1,3}, ..., u_{1,i}, ..., u_{1,D}},   u2 = {u_{2,1}, u_{2,2}, u_{2,3}, ..., u_{2,i}, ..., u_{2,D}}

and the first two rows in B: v1, v2 ∈ R^k.

v1 = {v_{1,1}, v_{1,2}, v_{1,3}, ..., v_{1,j}, ..., v_{1,k}},   v2 = {v_{2,1}, v_{2,2}, v_{2,3}, ..., v_{2,j}, ..., v_{2,k}}

v1 = R^T u1,   v2 = R^T u2.

R = {r_{ij}}, i = 1 to D and j = 1 to k, with r_{ij} ∼ N(0, 1).

Page 162: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 162

v1 = R^T u1, v2 = R^T u2.   R = {r_{ij}}, i = 1 to D and j = 1 to k.

v_{1,j} = ∑_{i=1}^D r_{ij} u_{1,i},   v_{2,j} = ∑_{i=1}^D r_{ij} u_{2,i},

v_{1,j} − v_{2,j} = ∑_{i=1}^D r_{ij} [u_{1,i} − u_{2,i}]

The squared Euclidean norm of u1: ∑_{i=1}^D |u_{1,i}|^2.   The squared Euclidean norm of v1: ∑_{j=1}^k |v_{1,j}|^2.

The squared Euclidean distance between u1 and u2: ∑_{i=1}^D |u_{1,i} − u_{2,i}|^2.   The squared Euclidean distance between v1 and v2: ∑_{j=1}^k |v_{1,j} − v_{2,j}|^2.

Page 163: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 163

What are we hoping for?

• ∑_{j=1}^k |v_{1,j}|^2 ≈ ∑_{i=1}^D |u_{1,i}|^2, as close as possible.

• ∑_{j=1}^k |v_{1,j} − v_{2,j}|^2 ≈ ∑_{i=1}^D |u_{1,i} − u_{2,i}|^2, as close as possible.

• k should be as small as possible, for a specified level of accuracy.

Page 164: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 164

Unbiased Estimator of d and m1, m2

We need a good estimator, unbiased and with small variance.

Note that the estimation problem is essentially the same for d and for m1 (m2). Thus, we can focus on estimating m1.

By random projections, we have k i.i.d. samples (why?)

v_{1,j} = ∑_{i=1}^D r_{ij} u_{1,i},   j = 1, 2, ..., k

Because r_{ij} ∼ N(0, 1), we can develop estimators and analyze their properties using normal and χ^2 distributions. But we can also solve the problem without using normals.

Page 165: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 165

Unbiased Estimator of m1

v_{1,j} = ∑_{i=1}^D r_{ij} u_{1,i},   j = 1, 2, ..., k,   (r_{ij} ∼ N(0, 1))

To get started, let's first look at the moments:

E(v_{1,j}) = E( ∑_{i=1}^D r_{ij} u_{1,i} ) = ∑_{i=1}^D E(r_{ij}) u_{1,i} = 0

Page 166: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 166

E(v_{1,j}^2) = E[ ∑_{i=1}^D r_{ij} u_{1,i} ]^2

= E[ ∑_{i=1}^D r_{ij}^2 u_{1,i}^2 + ∑_{i ≠ i'} r_{ij} u_{1,i} r_{i'j} u_{1,i'} ]

= ∑_{i=1}^D E(r_{ij}^2) u_{1,i}^2 + ∑_{i ≠ i'} E(r_{ij} r_{i'j}) u_{1,i} u_{1,i'}

= ( ∑_{i=1}^D u_{1,i}^2 + 0 ) = m1

Great! m1 is exactly what we are after.

Since we have k i.i.d. samples v_{1,j}, we can simply average them to estimate m1.

Page 167: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 167

An unbiased estimator of the squared Euclidean norm m1 = ∑_{i=1}^D |u_{1,i}|^2 is

m̂1 = (1/k) ∑_{j=1}^k |v_{1,j}|^2,

E(m̂1) = (1/k) ∑_{j=1}^k E(|v_{1,j}|^2) = (1/k) ∑_{j=1}^k m1 = m1

We need to analyze its variance to assess its accuracy.

Recall, our goal is to use k (the number of projections) as small as possible.

Page 168: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 168

Var(m̂1) = (1/k^2) ∑_{j=1}^k Var(|v_{1,j}|^2) = (1/k) Var(|v_{1,j}|^2)

= (1/k) [ E(|v_{1,j}|^4) − E^2(|v_{1,j}|^2) ]

= (1/k) [ E( ∑_{i=1}^D r_{ij} u_{1,i} )^4 − m1^2 ]

We can compute E( ∑_{i=1}^D r_{ij} u_{1,i} )^4 directly, but it would be much easier if we take advantage of the χ^2 distribution.

Page 169: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 169

χ2 Distribution

If X ∼ N(0, 1), then Y = X^2 follows a Chi-square distribution with one degree of freedom, denoted by χ^2_1.

If X_j, j = 1 to k, are i.i.d. normal, X_j ∼ N(0, 1), then Y = ∑_{j=1}^k X_j^2 follows a Chi-square distribution with k degrees of freedom, denoted by χ^2_k.

If Y ∼ χ^2_k, then

E(Y) = k,   Var(Y) = 2k

Page 170: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 170

Recall, after random projections,

v_{1,j} = ∑_{i=1}^D r_{ij} u_{1,i},   j = 1, 2, ..., k,   r_{ij} ∼ N(0, 1)

Therefore, v_{1,j} also has a normal distribution:

v_{1,j} ∼ N( 0, ∑_{i=1}^D |u_{1,i}|^2 ) = N(0, m1)

Equivalently, v_{1,j}/√m1 ∼ N(0, 1).

Therefore,

[ v_{1,j}/√m1 ]^2 = v_{1,j}^2 / m1 ∼ χ^2_1,   Var( v_{1,j}^2 / m1 ) = 2,   Var( v_{1,j}^2 ) = 2 m1^2

Now we can figure out the variance formula for random projections.

Page 171: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 171

Var(m̂1) = (1/k) Var(|v_{1,j}|^2) = 2 m1^2 / k

Implication:

Var(m̂1) / m1^2 = 2/k,   independent of m1

Var(m̂1)/m1^2 is known as the coefficient of variation.

——————-

We have solved the variance using χ^2_1.

We can actually figure out the distribution of m̂1 using χ^2_k.

Page 172: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 172

m̂1 = (1/k) ∑_{j=1}^k |v_{1,j}|^2,   v_{1,j} ∼ N(0, m1)

Because the v_{1,j}'s are i.i.d., we know

k m̂1 / m1 = ∑_{j=1}^k ( v_{1,j}/√m1 )^2 ∼ χ^2_k   (why?)

This will be useful for analyzing the error bound using probability inequalities.

We can also write down the moments of m̂1 directly using χ^2_k.

Page 173: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 173

Recall, if Y ∼ χ^2_k, then E(Y) = k and Var(Y) = 2k

=⇒   E( k m̂1 / m1 ) = k,   Var( k m̂1 / m1 ) = 2k,

=⇒   Var(m̂1) = 2k m1^2 / k^2 = 2 m1^2 / k

Page 174: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 174

An unbiased estimator of the squared Euclidean distance d = ∑_{i=1}^D |u_{1,i} − u_{2,i}|^2 is

d̂ = (1/k) ∑_{j=1}^k |v_{1,j} − v_{2,j}|^2,   k d̂ / d ∼ χ^2_k,   Var(d̂) = 2 d^2 / k.

These can be derived in exactly the same way as for the estimator of m1.

Note that the coefficient of variation for d̂,

Var(d̂) / d^2 = 2/k,   independent of d,

meaning that the errors are pre-determined by k, a huge advantage.
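A quick MATLAB check of the estimator and its variance 2d^2/k, using two arbitrary vectors and repeated projections (all sizes are illustrative):

% Estimate the squared distance d between u1 and u2 from k projections.
D = 1000;  k = 50;  reps = 1000;
u1 = randn(D, 1);  u2 = randn(D, 1);
d = sum((u1 - u2).^2);                       % true squared Euclidean distance
d_hat = zeros(reps, 1);
for r = 1:reps
    R = randn(D, k);
    v1 = R' * u1;  v2 = R' * u2;
    d_hat(r) = mean((v1 - v2).^2);           % (1/k) * sum_j |v1_j - v2_j|^2
end
[mean(d_hat), d]                             % unbiasedness check
[var(d_hat), 2 * d^2 / k]                    % variance check against 2 d^2 / k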

Page 175: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 175

More probability problems

• What is the error probability P( |d̂ − d| ≥ ε d )?

• How large should k be?

• What about the inner (dot) product a = ∑_{i=1}^D u_{1,i} u_{2,i}?

Page 176: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 176

An unbiased estimator of the inner product a = ∑_{i=1}^D u_{1,i} u_{2,i} is

â = (1/k) ∑_{j=1}^k v_{1,j} v_{2,j},

E(â) = a,

Var(â) = (m1 m2 + a^2) / k

Proof:

v_{1,j} v_{2,j} = [ ∑_{i=1}^D u_{1,i} r_{ij} ] [ ∑_{i=1}^D u_{2,i} r_{ij} ]

Page 177: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 177

v_{1,j} v_{2,j} = [ ∑_{i=1}^D u_{1,i} r_{ij} ] [ ∑_{i=1}^D u_{2,i} r_{ij} ]

= ∑_{i=1}^D u_{1,i} u_{2,i} r_{ij}^2 + ∑_{i ≠ i'} u_{1,i} u_{2,i'} r_{ij} r_{i'j}

=⇒

E(v_{1,j} v_{2,j}) = ∑_{i=1}^D u_{1,i} u_{2,i} E[r_{ij}^2] + ∑_{i ≠ i'} u_{1,i} u_{2,i'} E[r_{ij} r_{i'j}]

= ∑_{i=1}^D u_{1,i} u_{2,i} · 1 + ∑_{i ≠ i'} u_{1,i} u_{2,i'} · 0

= ∑_{i=1}^D u_{1,i} u_{2,i} = a

This proves the unbiasedness.

Page 178: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 178

We first derive the variance of â using a complicated brute-force method; then we show a much simpler method using conditional expectation.

[v_{1,j} v_{2,j}]^2 = [ ∑_{i=1}^D u_{1,i} u_{2,i} r_{ij}^2 + ∑_{i ≠ i'} u_{1,i} u_{2,i'} r_{ij} r_{i'j} ]^2

= [ ∑_{i=1}^D u_{1,i} u_{2,i} r_{ij}^2 ]^2 + [ ∑_{i ≠ i'} u_{1,i} u_{2,i'} r_{ij} r_{i'j} ]^2 + ...

= ∑_{i=1}^D [u_{1,i} u_{2,i}]^2 r_{ij}^4 + 2 ∑_{i ≠ i'} u_{1,i} u_{2,i} u_{1,i'} u_{2,i'} [r_{ij} r_{i'j}]^2 + ∑_{i ≠ i'} [u_{1,i} u_{2,i'}]^2 [r_{ij} r_{i'j}]^2 + ...

Why can we ignore the rest of the terms (after taking expectations)?

Page 179: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 179

Why can we ignore the rest of the terms (after taking expectations)?

Recall r_{ij} ∼ N(0, 1) i.i.d.:

E(r_{ij}) = 0,   E(r_{ij}^2) = 1,   E(r_{ij} r_{i'j}) = E(r_{ij}) E(r_{i'j}) = 0

E(r_{ij}^3) = 0,   E(r_{ij}^4) = 3,   E(r_{ij}^2 r_{i'j}) = E(r_{ij}^2) E(r_{i'j}) = 0

Page 180: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 180

Therefore,

E[v_{1,j} v_{2,j}]^2 = 3 ∑_{i=1}^D [u_{1,i} u_{2,i}]^2 + 2 ∑_{i ≠ i'} u_{1,i} u_{2,i} u_{1,i'} u_{2,i'} + ∑_{i ≠ i'} [u_{1,i} u_{2,i'}]^2

But

a^2 = [ ∑_{i=1}^D u_{1,i} u_{2,i} ]^2 = ∑_{i=1}^D [u_{1,i} u_{2,i}]^2 + ∑_{i ≠ i'} u_{1,i} u_{2,i} u_{1,i'} u_{2,i'}

m1 m2 = [ ∑_{i=1}^D |u_{1,i}|^2 ] [ ∑_{i=1}^D |u_{2,i}|^2 ] = ∑_{i=1}^D [u_{1,i} u_{2,i}]^2 + ∑_{i ≠ i'} [u_{1,i} u_{2,i'}]^2

Therefore,

E[v_{1,j} v_{2,j}]^2 = m1 m2 + 2 a^2,   Var[v_{1,j} v_{2,j}] = m1 m2 + a^2

An unbiased estimator of the inner product $a = \sum_{i=1}^D u_{1,i} u_{2,i}$

$$\hat{a} = \frac{1}{k}\sum_{j=1}^{k} v_{1,j} v_{2,j}, \qquad E(\hat{a}) = a, \qquad \mathrm{Var}(\hat{a}) = \frac{m_1 m_2 + a^2}{k}.$$

The coefficient of variation

$$\frac{\mathrm{Var}(\hat{a})}{a^2} = \frac{m_1 m_2 + a^2}{a^2}\,\frac{1}{k}$$

is not independent of $a$. When the two vectors $u_1$ and $u_2$ are almost orthogonal, $a \approx 0$,

$\Longrightarrow$ the coefficient of variation $\approx \infty$,

$\Longrightarrow$ random projections may not be good for estimating inner products.
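To see this numerically, here is a small MATLAB sketch (not from the slides) that checks the variance formula for a nearly orthogonal pair, where the standard deviation of $\hat{a}$ dwarfs $a$ itself; the vectors and sizes are arbitrary and the repetition count is kept modest.

% Minimal sketch: the inner-product estimator when a is close to 0.
D = 1000; k = 100; nrep = 500;
u1 = randn(D,1);
u2 = randn(D,1);                         % nearly orthogonal to u1, so a is close to 0
a  = u1'*u2;
ahat = zeros(nrep,1);
for r = 1:nrep
    R = randn(D,k);
    ahat(r) = mean((u1'*R).*(u2'*R));    % the estimator a_hat from the slide
end
fprintf('a = %.1f, std(ahat) = %.1f, theory sqrt((m1*m2+a^2)/k) = %.1f\n', ...
        a, std(ahat), sqrt((sum(u1.^2)*sum(u2.^2) + a^2)/k));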

The joint distribution of $v_{1,j} = \sum_{i=1}^D u_{1,i} r_{ij}$ and $v_{2,j} = \sum_{i=1}^D u_{2,i} r_{ij}$:

$$E(v_{1,j}) = 0,\quad \mathrm{Var}(v_{1,j}) = \sum_{i=1}^D |u_{1,i}|^2 = m_1,$$
$$E(v_{2,j}) = 0,\quad \mathrm{Var}(v_{2,j}) = \sum_{i=1}^D |u_{2,i}|^2 = m_2,$$
$$\mathrm{Cov}(v_{1,j}, v_{2,j}) = E(v_{1,j} v_{2,j}) - E(v_{1,j})E(v_{2,j}) = a.$$

$v_{1,j}$ and $v_{2,j}$ are jointly normal (bivariate normal):

$$\begin{bmatrix} v_{1,j} \\ v_{2,j} \end{bmatrix} \sim N\left(\mu = \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \Sigma = \begin{bmatrix} m_1 & a \\ a & m_2 \end{bmatrix}\right)$$

(What if we know $m_1$ and $m_2$ exactly? For example, by one scan of the data matrix.)

Review Bivariate Normal Distribution

The random variables $X$ and $Y$ have a bivariate normal distribution if, for constants $\mu_x$, $\mu_y$, $\sigma_x > 0$, $\sigma_y > 0$, $-1 < \rho < 1$, their joint density function is given, for all $-\infty < x, y < \infty$, by

$$f(x,y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(x-\mu_x)^2}{\sigma_x^2} + \frac{(y-\mu_y)^2}{\sigma_y^2} - \frac{2\rho (x-\mu_x)(y-\mu_y)}{\sigma_x \sigma_y} \right] \right\}.$$

If $X$ and $Y$ are independent, then $\rho = 0$, and

$$f(x,y) = \frac{1}{2\pi \sigma_x \sigma_y} \exp\left\{ -\frac{1}{2} \left[ \frac{(x-\mu_x)^2}{\sigma_x^2} + \frac{(y-\mu_y)^2}{\sigma_y^2} \right] \right\}.$$

Denote that $X$ and $Y$ are jointly normal:

$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim N\left(\mu = \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix},\ \Sigma = \begin{bmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{bmatrix}\right)$$

$X$ and $Y$ are marginally normal:

$$X \sim N(\mu_x, \sigma_x^2), \qquad Y \sim N(\mu_y, \sigma_y^2).$$

$X$ and $Y$ are also conditionally normal:

$$X\,|\,Y \sim N\left( \mu_x + \rho (y - \mu_y)\frac{\sigma_x}{\sigma_y},\ (1-\rho^2)\sigma_x^2 \right), \qquad
Y\,|\,X \sim N\left( \mu_y + \rho (x - \mu_x)\frac{\sigma_y}{\sigma_x},\ (1-\rho^2)\sigma_y^2 \right).$$

Bivariate Normal and Random Projections

$$A \times R = B$$

$v_1$ and $v_2$, the first two rows in $B$, have $k$ entries each:

$$v_{1,j} = \sum_{i=1}^D u_{1,i} r_{ij}, \qquad v_{2,j} = \sum_{i=1}^D u_{2,i} r_{ij}.$$

$v_{1,j}$ and $v_{2,j}$ are bivariate normal:

$$\begin{bmatrix} v_{1,j} \\ v_{2,j} \end{bmatrix} \sim N\left(\mu = \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \Sigma = \begin{bmatrix} m_1 & a \\ a & m_2 \end{bmatrix}\right),$$

$$m_1 = \sum_{i=1}^D |u_{1,i}|^2, \qquad m_2 = \sum_{i=1}^D |u_{2,i}|^2, \qquad a = \sum_{i=1}^D u_{1,i} u_{2,i}.$$

Simplify calculations using conditional normality

$$v_{1,j}\,|\,v_{2,j} \sim N\left( \frac{a}{m_2} v_{2,j},\ \frac{m_1 m_2 - a^2}{m_2} \right)$$

$$E(v_{1,j} v_{2,j})^2 = E\left( E\left( v_{1,j}^2 v_{2,j}^2 \,|\, v_{2,j} \right) \right) = E\left( v_{2,j}^2\, E\left( v_{1,j}^2 \,|\, v_{2,j} \right) \right)
= E\left( v_{2,j}^2 \left( \frac{m_1 m_2 - a^2}{m_2} + \left( \frac{a}{m_2} v_{2,j} \right)^2 \right) \right)$$
$$= m_2\,\frac{m_1 m_2 - a^2}{m_2} + 3 m_2^2\, \frac{a^2}{m_2^2} = m_1 m_2 + 2a^2.$$

The unbiased estimator $\hat{a} = \frac{1}{k}\sum_{j=1}^{k} v_{1,j} v_{2,j}$ has variance

$$\mathrm{Var}(\hat{a}) = \frac{1}{k}\left( m_1 m_2 + a^2 \right).$$

Review Moment Generating Function (MGF)

Definition: For a random variable $X$, its moment generating function (MGF) is defined as

$$M_X(t) = E\left[e^{tX}\right] = \begin{cases} \sum_x p(x)\, e^{tx} & \text{if } X \text{ is discrete,} \\ \int_{-\infty}^{\infty} e^{tx} f(x)\, dx & \text{if } X \text{ is continuous.} \end{cases}$$

When it exists in an open interval around $t = 0$, the MGF $M_X(t)$ uniquely determines the distribution of $X$.

Review MGF of Normal

Suppose $X \sim N(0,1)$, i.e., $f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.

$$M_X(t) = \int_{-\infty}^{\infty} e^{tx}\, \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2} + tx}\, dx
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2 - 2tx + t^2 - t^2}{2}}\, dx
= e^{t^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{(x-t)^2}{2}}\, dx
= e^{t^2/2}.$$

Suppose $Y \sim N(\mu, \sigma^2)$. Write $Y = \sigma X + \mu$, where $X \sim N(0,1)$.

$$M_Y(t) = E\left[e^{tY}\right] = E\left[e^{\mu t + \sigma t X}\right] = e^{\mu t}\, E\left[e^{\sigma t X}\right].$$

We can view $\sigma t$ as another $t'$:

$$M_Y(t) = e^{\mu t} M_X(\sigma t) = e^{\mu t} \times e^{\sigma^2 t^2/2} = e^{\mu t + \frac{\sigma^2}{2} t^2}.$$

Review MGF of Chi-Square

If $X_j$, $j = 1$ to $k$, are i.i.d. $N(0,1)$, then $Y = \sum_{j=1}^k X_j^2 \sim \chi^2_k$, a chi-squared distribution with $k$ degrees of freedom.

By the independence of the $X_j$,

$$M_Y(t) = E\left[e^{Yt}\right] = E\left[e^{t\sum_{j=1}^k X_j^2}\right] = \prod_{j=1}^k E\left[e^{tX_j^2}\right] = \left( E\left[e^{tX_j^2}\right] \right)^k.$$

$$E\left[e^{tX_j^2}\right] = \int_{-\infty}^{\infty} e^{tx^2}\, \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2} + tx^2}\, dx
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2(1-2t)}{2}}\, dx
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2\sigma^2}}\, dx, \qquad \left(\sigma^2 = \frac{1}{1-2t}\right).$$

Since $\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}\, dx = 1$, we have $\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2\sigma^2}}\, dx = \sigma$, and therefore

$$E\left[e^{tX_j^2}\right] = \frac{1}{(1-2t)^{1/2}}, \qquad
M_Y(t) = \left( E\left[e^{tX_j^2}\right] \right)^k = \frac{1}{(1-2t)^{k/2}}, \quad (t < 1/2).$$

MGF for Random Projections

In random projections, the unbiased estimator $\hat{d} = \frac{1}{k}\sum_{j=1}^k |v_{1,j} - v_{2,j}|^2$ satisfies

$$\frac{k\hat{d}}{d} = \sum_{j=1}^{k} \frac{|v_{1,j} - v_{2,j}|^2}{d} \sim \chi^2_k.$$

Q: What is the MGF of $\hat{d}$?

Solution:

$$M_{\hat{d}}(t) = E\left(e^{\hat{d}t}\right) = E\left( e^{\left[\frac{k\hat{d}}{d}\right]\left[\frac{dt}{k}\right]} \right) = \left( 1 - \frac{2dt}{k} \right)^{-k/2},$$

where $2dt/k < 1$, i.e., $t < k/(2d)$.

Review Moments and MGF

$$M_X(t) = E\left[e^{tX}\right] \;\Longrightarrow\; M_X'(t) = E\left[X e^{tX}\right] \;\Longrightarrow\; M_X^{(n)}(t) = E\left[X^n e^{tX}\right].$$

Setting $t = 0$,

$$E\left[X^n\right] = M_X^{(n)}(0).$$

Example: $X \sim \chi^2_k$, with $M_X(t) = \frac{1}{(1-2t)^{k/2}}$.

$$M'(t) = -\frac{k}{2}(1-2t)^{-k/2-1}(-2) = k(1-2t)^{-k/2-1},$$
$$M''(t) = k\left(-\frac{k}{2}-1\right)(1-2t)^{-k/2-2}(-2) = k(k+2)(1-2t)^{-k/2-2}.$$

Therefore,

$$E(X) = M'(0) = k, \qquad E(X^2) = M''(0) = k^2 + 2k, \qquad \mathrm{Var}(X) = (k^2+2k) - k^2 = 2k.$$

MGF and Moments of $\hat{a}$ in Random Projections

The unbiased estimator of the inner product: $\hat{a} = \frac{1}{k}\sum_{j=1}^{k} v_{1,j} v_{2,j}$.

Using conditional expectation:

$$v_{1,j}\,|\,v_{2,j} \sim N\left( \frac{a}{m_2} v_{2,j},\ \frac{m_1 m_2 - a^2}{m_2} \right), \qquad v_{2,j} \sim N(0, m_2).$$

For simplicity, let

$$x = v_{1,j}, \quad y = v_{2,j}, \quad \mu = \frac{a}{m_2} v_{2,j} = \frac{a}{m_2} y, \quad \sigma^2 = \frac{m_1 m_2 - a^2}{m_2}.$$

$$E\left(\exp(v_{1,j} v_{2,j} t)\right) = E\left(\exp(xyt)\right) = E\left( E\left(\exp(xyt)\,|\,y\right) \right).$$

Using the MGF of $x\,|\,y \sim N(\mu, \sigma^2)$,

$$E\left(\exp(xyt)\,|\,y\right) = e^{\mu y t + \frac{\sigma^2}{2}(yt)^2}, \qquad
E\left( E\left(\exp(xyt)\,|\,y\right) \right) = E\left( e^{\mu y t + \frac{\sigma^2}{2}(yt)^2} \right),$$

$$\mu y t + \frac{\sigma^2}{2}(yt)^2 = y^2\left( \frac{a}{m_2} t + \frac{\sigma^2}{2} t^2 \right).$$

Since $y \sim N(0, m_2)$, we know $\frac{y^2}{m_2} \sim \chi^2_1$.

Using the MGF of $\chi^2_1$, we obtain

$$E\left( e^{\mu y t + \frac{\sigma^2}{2}(yt)^2} \right)
= E\left( e^{\frac{y^2}{m_2}\, m_2\left( \frac{a}{m_2} t + \frac{\sigma^2}{2} t^2 \right)} \right)
= \left( 1 - 2m_2\left( \frac{a}{m_2} t + \frac{\sigma^2}{2} t^2 \right) \right)^{-1/2}
= \left( 1 - 2at - \left(m_1 m_2 - a^2\right) t^2 \right)^{-\frac{1}{2}}.$$

By independence,

$$M_{\hat{a}}(t) = \left( 1 - \frac{2at}{k} - \left(m_1 m_2 - a^2\right) \frac{t^2}{k^2} \right)^{-\frac{k}{2}}.$$

Now we can use this MGF to calculate the moments of $\hat{a}$.

$$M_{\hat{a}}(t) = \left( 1 - \frac{2at}{k} - \left(m_1 m_2 - a^2\right) \frac{t^2}{k^2} \right)^{-\frac{k}{2}},$$

$$M_{\hat{a}}^{(1)}(t) = \left(-\frac{k}{2}\right)\left[ \left( 1 - \frac{2at}{k} - \left(m_1 m_2 - a^2\right) \frac{t^2}{k^2} \right)^{-\frac{k}{2}-1} \right] \times \left( -\frac{2a}{k} - \left(m_1 m_2 - a^2\right) \frac{2t}{k^2} \right).$$

The term in $[\,\ldots\,]$ equals 1 at $t = 0$, so it does not matter.

Therefore,

$$E(\hat{a}) = M_{\hat{a}}^{(1)}(0) = \left(-\frac{k}{2}\right)\left(-\frac{2a}{k}\right) = a.$$

Following a similar procedure, we can obtain

$$\mathrm{Var}(\hat{a}) = \frac{m_1 m_2 + a^2}{k}, \qquad
E(\hat{a} - a)^3 = \frac{2a}{k^2}\left( 3m_1 m_2 + a^2 \right).$$

Tail Probabilities

The tail probability $P(X > t)$ is extremely important.

For example, in random projections,

$$P\left( |\hat{d} - d| \ge \epsilon d \right)$$

tells us the probability that the difference (error) between the estimated Euclidean distance $\hat{d}$ and the true distance $d$ exceeds an $\epsilon$ fraction of the true distance $d$.

Q: Is it just the cumulative distribution function (CDF)?

Tail Probability Inequalities (Bounds)

$$P(X > t) \le\ ???$$

Reasons to study tail probability bounds:

• Even if the distribution of $X$ is known, evaluating $P(X > t)$ often requires numerical methods.

• Often the exact distribution of $X$ is unknown. Instead, we may know only the moments (mean, variance, MGF, etc.).

• Theoretical reasons, for example, studying how fast the error decreases.

Several Tail Probability Inequalities (Bounds)

• Markov's Inequality. Uses only the first moment. The most basic.

• Chebyshev's Inequality. Uses only the second moment.

• Chernoff's Inequality. Uses the MGF. The most accurate, and popular among theorists.

Markov's Inequality: Theorem A in Section 4.1

If $X$ is a random variable with $P(X \ge 0) = 1$, and for which $E(X)$ exists, then

$$P(X \ge t) \le \frac{E(X)}{t}.$$

Proof: Assume $X$ is continuous with probability density $f(x)$. Then

$$E(X) = \int_0^{\infty} x f(x)\, dx \ge \int_t^{\infty} x f(x)\, dx \ge \int_t^{\infty} t f(x)\, dx = t\, P(X \ge t).$$

See the textbook for the proof assuming $X$ is discrete.

Many extremely useful bounds can be obtained from Markov's inequality.

Markov's inequality: $P(X \ge t) \le \frac{E(X)}{t}$. If $t = kE(X)$, then

$$P(X \ge t) = P(X \ge kE(X)) \le \frac{1}{k}.$$

The bound decreases only at the rate $\frac{1}{k}$, which is too slow.

The original Markov's inequality utilizes only the first moment (hence its inaccuracy).

Chebyshev's Inequality

Let $X$ be a random variable with mean $\mu$ and variance $\sigma^2$. Then for any $t > 0$,

$$P(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}.$$

Proof: Let $Y = (X - \mu)^2 = |X - \mu|^2$ and $w = t^2$. Then

$$P(Y \ge w) \le \frac{E(Y)}{w} = \frac{E(X - \mu)^2}{w} = \frac{\sigma^2}{w}.$$

Note that $|X - \mu|^2 \ge t^2 \iff |X - \mu| \ge t$. Therefore,

$$P(|X - \mu| \ge t) = P\left( |X - \mu|^2 \ge t^2 \right) \le \frac{\sigma^2}{t^2}.$$

Chebyshev's inequality: $P(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}$. If $t = k\sigma$, then

$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$

The bound decreases at the rate $\frac{1}{k^2}$, which is faster than $\frac{1}{k}$.
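As a small MATLAB illustration (not from the slides), the next sketch compares the Markov and Chebyshev bounds with the exact tail of an exponential random variable with mean 1, for which $P(X \ge t) = e^{-t}$; the thresholds are arbitrary.

% Minimal sketch: Markov and Chebyshev bounds vs. the exact tail of Exp(1).
mu = 1; sigma2 = 1;                   % mean and variance of Exp(1)
t  = 2:2:10;                          % thresholds, all larger than mu
exact  = exp(-t);                     % exact P(X >= t)
markov = mu ./ t;                     % Markov bound E(X)/t
cheby  = sigma2 ./ (t - mu).^2;       % Chebyshev bound on P(|X - mu| >= t - mu)
disp([t' exact' markov' cheby']);     % columns: t, exact, Markov, Chebyshev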

Chernoff Inequality

If $X$ is a random variable with finite MGF $M_X(t)$, then for any $\epsilon > 0$,

$$P(X \ge \epsilon) \le e^{-t\epsilon} M_X(t), \quad \text{for all } t > 0,$$
$$P(X \le \epsilon) \le e^{-t\epsilon} M_X(t), \quad \text{for all } t < 0.$$

Application: One can choose the $t$ that minimizes the upper bound. This usually leads to accurate probability bounds, which decrease exponentially fast.

Proof: Use Markov's inequality.

For $t > 0$, because $X > \epsilon \iff e^{tX} > e^{t\epsilon}$ (a monotone transformation),

$$P(X > \epsilon) = P\left( e^{tX} \ge e^{t\epsilon} \right) \le \frac{E\left[e^{tX}\right]}{e^{t\epsilon}} = e^{-t\epsilon} M_X(t).$$

Tail Bounds of Normal Random Variables

$X \sim N(\mu, \sigma^2)$, and assume $\mu > 0$. We need a bound on $P(|X - \mu| \ge \epsilon\mu)$.

Chebyshev's inequality:

$$P(|X - \mu| \ge \epsilon\mu) \le \frac{\sigma^2}{\epsilon^2\mu^2} = \frac{1}{\epsilon^2}\left[\frac{\sigma^2}{\mu^2}\right].$$

The bound is not good enough, decreasing only at the rate $\frac{1}{\epsilon^2}$.

Tail Bounds of Normal Using Chernoff's Inequality

Right tail bound $P(X - \mu \ge \epsilon\mu)$: for any $t > 0$,

$$P(X - \mu \ge \epsilon\mu) = P(X \ge (1+\epsilon)\mu)
\le e^{-t(1+\epsilon)\mu} M_X(t)
= e^{-t(1+\epsilon)\mu}\, e^{\mu t + \sigma^2 t^2/2}
= e^{-t\epsilon\mu + \sigma^2 t^2/2}.$$

What's next? Since the inequality holds for any $t > 0$, we can choose the $t$ that minimizes the upper bound.

Right tail bound $P(X - \mu \ge \epsilon\mu)$

Choose $t = t^*$ to minimize $g(t) = -t\epsilon\mu + \sigma^2 t^2/2$:

$$g'(t) = -\epsilon\mu + \sigma^2 t = 0 \;\Longrightarrow\; t^* = \frac{\mu\epsilon}{\sigma^2} \;\Longrightarrow\; g(t^*) = -\frac{\epsilon^2}{2}\frac{\mu^2}{\sigma^2}.$$

Therefore,

$$P(X - \mu \ge \epsilon\mu) \le e^{-\frac{\epsilon^2}{2}\frac{\mu^2}{\sigma^2}},$$

decreasing at the rate $e^{-\epsilon^2}$.

Left tail bound $P(X - \mu \le -\epsilon\mu)$: for any $t < 0$,

$$P(X - \mu \le -\epsilon\mu) = P(X \le (1-\epsilon)\mu)
\le e^{-t(1-\epsilon)\mu} M_X(t)
= e^{-t(1-\epsilon)\mu}\, e^{\mu t + \sigma^2 t^2/2}
= e^{t\epsilon\mu + \sigma^2 t^2/2}.$$

Choose $t = t^* = -\frac{\mu\epsilon}{\sigma^2}$ to minimize $t\epsilon\mu + \sigma^2 t^2/2$. Therefore,

$$P(X - \mu \le -\epsilon\mu) \le e^{-\frac{\epsilon^2}{2}\frac{\mu^2}{\sigma^2}}.$$

Combining the left and right tail bounds:

$$P(|X - \mu| \ge \epsilon\mu) = P(X - \mu \ge \epsilon\mu) + P(X - \mu \le -\epsilon\mu) \le 2e^{-\frac{\epsilon^2}{2}\frac{\mu^2}{\sigma^2}}.$$
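A quick MATLAB check (not from the slides) of how tight this Chernoff bound is: the exact two-sided normal tail equals $\mathrm{erfc}\!\left(\epsilon\mu/(\sigma\sqrt{2})\right)$, so only base MATLAB is needed; the values of $\mu$, $\sigma$, and $\epsilon$ are arbitrary.

% Minimal sketch: Chernoff bound vs. exact two-sided normal tail (uses erfc).
mu = 10; sigma = 2;
eps_ = 0.1:0.1:0.5;
exact = erfc( eps_*mu / (sigma*sqrt(2)) );           % exact P(|X - mu| >= eps*mu)
bound = 2*exp( -(eps_.^2) * mu^2 / (2*sigma^2) );    % Chernoff bound from the slide
disp([eps_' exact' bound']);                         % columns: eps, exact, bound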

Sample Size Selection Using Tail Bounds

$X_i \sim N(\mu, \sigma^2)$, i.i.d., $i = 1$ to $k$. An unbiased estimator of $\mu$ is $\hat{\mu}$:

$$\hat{\mu} = \frac{1}{k}\sum_{i=1}^{k} X_i, \qquad \hat{\mu} \sim N\left(\mu, \frac{\sigma^2}{k}\right).$$

Choose $k$ such that

$$P(|\hat{\mu} - \mu| \ge \epsilon\mu) \le \delta.$$

We already know $P(|\hat{\mu} - \mu| \ge \epsilon\mu) \le 2e^{-\frac{\epsilon^2}{2}\frac{\mu^2}{\sigma^2/k}}$.

It suffices to select $k$ such that

$$2e^{-\frac{\epsilon^2}{2}\frac{k\mu^2}{\sigma^2}} \le \delta
\;\Longrightarrow\; e^{-\frac{\epsilon^2}{2}\frac{k\mu^2}{\sigma^2}} \le \frac{\delta}{2}
\;\Longrightarrow\; -\frac{\epsilon^2}{2}\frac{k\mu^2}{\sigma^2} \le \log\left(\frac{\delta}{2}\right)
\;\Longrightarrow\; \frac{\epsilon^2}{2}\frac{k\mu^2}{\sigma^2} \ge -\log\left(\frac{\delta}{2}\right)
\;\Longrightarrow\; k \ge \left[-\log\left(\frac{\delta}{2}\right)\right]\frac{2}{\epsilon^2}\frac{\sigma^2}{\mu^2}.$$

Suppose $X_i \sim N(\mu, \sigma^2)$, $i = 1$ to $k$, i.i.d. Then $\hat{\mu} = \frac{1}{k}\sum_{i=1}^{k} X_i$ is an unbiased estimator of $\mu$. If the sample size $k$ satisfies

$$k \ge \left[\log\left(\frac{2}{\delta}\right)\right]\frac{2}{\epsilon^2}\frac{\sigma^2}{\mu^2},$$

then with probability at least $1 - \delta$, the estimate $\hat{\mu}$ is within a $1 \pm \epsilon$ factor of the true $\mu$, i.e., $|\hat{\mu} - \mu| \le \epsilon\mu$.
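The sample-size rule is easy to verify by simulation. The following MATLAB sketch (not part of the slides) computes $k$ from the formula above and estimates the empirical failure rate; $\mu$, $\sigma$, $\epsilon$, and $\delta$ are arbitrary choices.

% Minimal sketch: sample size from the bound, plus an empirical coverage check.
mu = 5; sigma = 3; eps_ = 0.1; delta = 0.05;
k = ceil( log(2/delta) * 2/eps_^2 * sigma^2/mu^2 );
nrep = 10000; fail = 0;
for r = 1:nrep
    muhat = mean( mu + sigma*randn(k,1) );
    fail  = fail + (abs(muhat - mu) >= eps_*mu);
end
fprintf('k = %d, empirical failure rate = %.4f (target delta = %.2f)\n', ...
        k, fail/nrep, delta);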

What affects the sample size $k$?

$$k \ge \left[\log\left(\frac{2}{\delta}\right)\right]\frac{2}{\epsilon^2}\frac{\sigma^2}{\mu^2}$$

• $\delta$: level of significance. Lower $\delta$ → more stringent guarantee → larger $k$.

• $\frac{\sigma^2}{\mu^2}$: noise-to-signal ratio. Higher $\frac{\sigma^2}{\mu^2}$ → larger $k$.

• $\epsilon$: accuracy. Lower $\epsilon$ → more accurate → larger $k$.

• The evaluation criterion. For example, relative error $|\hat{\mu} - \mu| \le \epsilon\mu$, or absolute error $|\hat{\mu} - \mu| \le \epsilon$?

Tail Bounds and Random Projections

Recall that in random projections, $\frac{k\hat{d}}{d} \sim \chi^2_k$.

$$P\left(|\hat{d} - d| \ge \epsilon d\right) = P\left(|k\hat{d}/d - k| \ge \epsilon k\right) = P\left(k\hat{d}/d \ge (1+\epsilon)k\right) + P\left(k\hat{d}/d \le (1-\epsilon)k\right).$$

$$P\left(k\hat{d}/d \ge (1+\epsilon)k\right) \le e^{-t(1+\epsilon)k}\left[1 - 2t\right]^{-k/2}
= \exp\left( -t(1+\epsilon)k - \frac{k}{2}\log(1-2t) \right)
= \exp\left( -k\left[ t(1+\epsilon) + \log(1-2t)/2 \right] \right),$$

which is minimized at the $t$ such that $\left[ t(1+\epsilon) + \log(1-2t)/2 \right]' = 0$:

$$(1+\epsilon) - \frac{1}{1-2t} = 0 \;\Longrightarrow\; 1 - 2t = \frac{1}{1+\epsilon} \;\Longrightarrow\; t = \frac{\epsilon}{2(1+\epsilon)}.$$

Therefore, the right tail bound is

$$P\left(k\hat{d}/d \ge (1+\epsilon)k\right) \le \exp\left( -k\left[ \frac{\epsilon}{2(1+\epsilon)}(1+\epsilon) - \log(1+\epsilon)/2 \right] \right)
= \exp\left( -\frac{k}{2}\left[ \epsilon - \log(1+\epsilon) \right] \right)
= \exp\left( -\frac{k}{2}\left[ \frac{\epsilon^2}{2} - \frac{\epsilon^3}{3} + \ldots \right] \right).$$

Similarly, we can obtain the left tail bound:

$$P\left(k\hat{d}/d \le (1-\epsilon)k\right) \le \exp\left( -\frac{k}{2}\left[ -\epsilon - \log(1-\epsilon) \right] \right)
= \exp\left( -\frac{k}{2}\left[ \frac{\epsilon^2}{2} + \frac{\epsilon^3}{3} + \ldots \right] \right).$$

Therefore,

$$P\left(|\hat{d} - d| \ge \epsilon d\right) = P\left(k\hat{d}/d \ge (1+\epsilon)k\right) + P\left(k\hat{d}/d \le (1-\epsilon)k\right)$$
$$\le \exp\left( -\frac{k}{2}\left[ \frac{\epsilon^2}{2} - \frac{\epsilon^3}{3} + \ldots \right] \right) + \exp\left( -\frac{k}{2}\left[ \frac{\epsilon^2}{2} + \frac{\epsilon^3}{3} + \ldots \right] \right)
\le 2\exp\left( -\frac{k}{2}\left[ \frac{\epsilon^2}{2} - \frac{\epsilon^3}{3} + \ldots \right] \right),$$

which means that, in order for $P\left(|\hat{d} - d| \ge \epsilon d\right) \le \delta$, it suffices to let

$$k \ge \frac{2\log(2/\delta)}{\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}}.$$

Normally, $\epsilon$ is small. Hence we can simply say $k = O\left(1/\epsilon^2\right)$.
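Here is a minimal MATLAB sketch (not from the slides) that computes the sample size $k$ from this bound and checks the failure probability by simulation; $\epsilon$, $\delta$, $D$, and the data vectors are arbitrary, and the number of repetitions is kept small for speed.

% Minimal sketch: k from the chi-square tail bound, plus an empirical check.
eps_ = 0.3; delta = 0.05;
k = ceil( 2*log(2/delta) / (eps_^2/2 - eps_^3/3) );
D = 200; u1 = rand(D,1); u2 = rand(D,1);
d = sum((u1 - u2).^2);
nrep = 2000; fail = 0;
for r = 1:nrep
    R  = randn(D,k);
    dh = mean( ((u1 - u2)'*R).^2 );     % same distance estimator as before
    fail = fail + (abs(dh - d) >= eps_*d);
end
fprintf('k = %d, empirical failure rate = %.4f (target %.2f)\n', k, fail/nrep, delta);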

Improving Random Projections Using Marginal Information

Recall that the projected data, $v_{1,j}$ and $v_{2,j}$, are bivariate normal:

$$\begin{bmatrix} v_{1,j} \\ v_{2,j} \end{bmatrix} \sim N\left(\mu = \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \Sigma = \begin{bmatrix} m_1 & a \\ a & m_2 \end{bmatrix}\right).$$

The observation is that one does not need to estimate $m_1$ and $m_2$, because they can be computed exactly with a linear scan of the data.

In fact, when $a^2 \approx m_1 m_2$ (i.e., the two original vectors are almost identical up to a scale factor), the estimator $\hat{a} = \frac{1}{k}\sum_{j=1}^k v_{1,j} v_{2,j}$ becomes very poor (its variance is maximized). In this case, one can first estimate $d$ and then infer $a$, because $d = m_1 + m_2 - 2a$. The question is whether we can find a systematic strategy to improve the estimates of $a$ and $d$ in all situations. We can resort to the MLE.

The MLE Results

The MLE, denoted by $\hat{a}_{MLE}$, is the solution to a cubic equation:

$$a^3 - a^2\left(v_1^{\mathrm{T}}v_2\right)/k + a\left(-m_1 m_2 + m_1\|v_2\|^2/k + m_2\|v_1\|^2/k\right) - m_1 m_2\, v_1^{\mathrm{T}}v_2/k = 0.$$

$$E\left(\hat{a}_{MLE} - a\right) = O(k^{-2}),$$
$$E\left(\left(\hat{a}_{MLE} - a\right)^3\right) = \frac{-2a(3m_1 m_2 + a^2)(m_1 m_2 - a^2)^3}{k^2(m_1 m_2 + a^2)^3} + O(k^{-3}),$$
$$\mathrm{Var}\left(\hat{a}_{MLE}\right) = \frac{1}{k}\frac{\left(m_1 m_2 - a^2\right)^2}{m_1 m_2 + a^2} + \frac{1}{k^2}\frac{4(m_1 m_2 - a^2)^4}{(m_1 m_2 + a^2)^4}m_1 m_2 + O(k^{-3}).$$

However, the cubic MLE equation admits multiple real roots with a small probability when the sample size $k$ is small.
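The cubic can be solved directly with MATLAB's roots(); the sketch below (not from the slides) forms the coefficients from one set of projections, with the marginal norms $m_1$, $m_2$ computed exactly from the data. The data and sizes are arbitrary.

% Minimal sketch: solve the cubic MLE equation for the inner product a.
D = 1000; k = 50;
u1 = randn(D,1); u2 = 0.5*u1 + randn(D,1);    % an arbitrary correlated pair
m1 = sum(u1.^2); m2 = sum(u2.^2); a_true = u1'*u2;
R  = randn(D,k);
v1 = (u1'*R)';  v2 = (u2'*R)';
% coefficients of a^3 - a^2 (v1'v2)/k + a (-m1 m2 + m1||v2||^2/k + m2||v1||^2/k) - m1 m2 (v1'v2)/k
c  = [1, -(v1'*v2)/k, -m1*m2 + m1*sum(v2.^2)/k + m2*sum(v1.^2)/k, -m1*m2*(v1'*v2)/k];
r  = roots(c);
r  = real(r(abs(imag(r)) < 1e-8));            % keep the real root(s), usually exactly one
fprintf('true a = %.1f, MLE root(s): %s\n', a_true, mat2str(r', 4));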

One can show that

$$\Pr(\text{multiple real roots}) = \Pr\left( P^2\left(11 - Q^2/4 - 4Q + P^2\right) + (Q-1)^3 \le 0 \right),$$

where

$$P = \frac{v_1^{\mathrm{T}}v_2}{k\sqrt{m_1 m_2}}, \qquad Q = \frac{\|v_1\|^2}{k m_1} + \frac{\|v_2\|^2}{k m_2}.$$

This probability is (crudely) bounded by

$$\Pr(\text{multiple real roots}) \le e^{-0.0085k} + e^{-0.0966k}.$$

When $a = m_1 = m_2$, this probability can be (sharply) bounded by

$$\Pr(\text{multiple real roots}\,|\,a = m_1 = m_2) \le e^{-1.5328k} + e^{-0.4672k}.$$

We can also use simulations to compute this probability.

[Figure: probability of multiple real roots versus sample size $k$ (vertical axis on a log scale, from 0.001 to 1), for $a' = 0$, $a' = 0.5$, $a' = 1$, together with the upper bound for $a' = 1$.]

Simulations show that $\Pr(\text{multiple real roots})$ decreases exponentially fast with increasing sample size $k$ (notice the log scale on the vertical axis). Once $k \ge 8$, the probability that the cubic MLE equation admits multiple roots becomes so small ($\le 1\%$) that it can be safely ignored in practice. Here $a' = \frac{a}{\sqrt{m_1 m_2}}$.

Derivation of the MLE Equation

First, we write down the joint likelihood function for $\{v_{1,j}, v_{2,j}\}_{j=1}^k$:

$$\mathrm{lik}\left(\{v_{1,j}, v_{2,j}\}_{j=1}^k\right) \propto |\Sigma|^{-\frac{k}{2}} \exp\left( -\frac{1}{2}\sum_{j=1}^{k} \begin{bmatrix} v_{1,j} & v_{2,j} \end{bmatrix} \Sigma^{-1} \begin{bmatrix} v_{1,j} \\ v_{2,j} \end{bmatrix} \right),$$

where (assuming $m_1 m_2 \neq a^2$ to avoid triviality)

$$|\Sigma| = m_1 m_2 - a^2, \qquad \Sigma^{-1} = \frac{1}{m_1 m_2 - a^2}\begin{bmatrix} m_2 & -a \\ -a & m_1 \end{bmatrix},$$

which allows us to express the log-likelihood function, $l(a)$, as

$$l(a) = -\frac{k}{2}\log\left(m_1 m_2 - a^2\right) - \frac{\sum_{j=1}^{k}\left( v_{1,j}^2 m_2 - 2 v_{1,j} v_{2,j} a + v_{2,j}^2 m_1 \right)}{2(m_1 m_2 - a^2)}.$$

Setting $l'(a)$ to zero, we obtain $\hat{a}_{MLE}$ as the solution to a cubic equation.

Large-sample theory tells us that $\hat{a}_{MLE}$ is asymptotically unbiased and converges weakly to a normal random variable $N\left(a,\ \mathrm{Var}(\hat{a}_{MLE}) = \frac{1}{I(a)}\right)$, where $I(a)$, the expected Fisher information, is $I(a) = -E(l''(a))$. Some algebra shows that

$$I(a) = k\frac{m_1 m_2 + a^2}{(m_1 m_2 - a^2)^2}, \qquad \mathrm{Var}(\hat{a}_{MLE}) = \frac{1}{k}\frac{\left(m_1 m_2 - a^2\right)^2}{m_1 m_2 + a^2} + O\left(\frac{1}{k^2}\right).$$

Higher-order terms can be obtained with more careful analysis. The bias is

$$E\left(\hat{a}_{MLE} - a\right) = -\frac{E(l'''(a)) + 2I'(a)}{2I^2(a)} + O(k^{-2}),$$

which is often called the "Bartlett correction." Some algebra shows that this estimator does not have an $O(k^{-1})$ bias.

The third central moment is

$$E\left(\hat{a}_{MLE} - a\right)^3 = \frac{-3I'(a) - E(l'''(a))}{I^3(a)} + O(k^{-3})
= \frac{-2a(3m_1 m_2 + a^2)(m_1 m_2 - a^2)^3}{k^2(m_1 m_2 + a^2)^3} + O(k^{-3}).$$

The $O(k^{-2})$ term of the variance, denoted by $V_2^c$, can be written as

$$V_2^c = \frac{1}{I^3(a)}\left( E\left(l''(a)\right)^2 - I^2(a) - \frac{\partial\left( E(l'''(a)) + 2I'(a) \right)}{\partial a} \right)
+ \frac{1}{2I^4(a)}\left( 10\left(I'(a)\right)^2 - E(l'''(a))\left( E(l'''(a)) - 4I'(a) \right) \right)
= \frac{4}{k^2}\frac{\left(m_1 m_2 - a^2\right)^4}{(m_1 m_2 + a^2)^4}m_1 m_2,$$

after some truly grueling algebra.

Sign Random Projection

Normal random projection would be just an interesting idea with little industry impact if sign random projection had not been discovered. This is because industry data are often binary and sparse, for which other algorithms such as minwise hashing can be more suitable.

Instead of storing each projected sample using (e.g.) 64 bits, we can simply store the sign (i.e., 1 bit). Interestingly, the collision probability has a closed form:

$$T = \Pr\left(\mathrm{sign}(v_{1,j}) = \mathrm{sign}(v_{2,j})\right) = 1 - \frac{\theta}{\pi}, \qquad \cos\theta = \frac{a}{\sqrt{m_1 m_2}}.$$

From $k$ i.i.d. samples, one can estimate $T$, then $\theta$, and then $a$, again assuming $m_1$ and $m_2$ are known.
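A minimal MATLAB sketch (not part of the slides) of this estimation chain, with arbitrary data and with $m_1$, $m_2$ computed exactly:

% Minimal sketch: estimate the inner product from 1-bit (sign) random projections.
D = 1000; k = 500;
u1 = randn(D,1); u2 = 0.3*u1 + randn(D,1);
m1 = sum(u1.^2); m2 = sum(u2.^2); a_true = u1'*u2;
R  = randn(D,k);
s1 = sign(u1'*R); s2 = sign(u2'*R);
That  = mean(s1 == s2);                 % estimate of the collision probability T
theta = pi*(1 - That);                  % invert T = 1 - theta/pi
a_sgn = cos(theta)*sqrt(m1*m2);         % estimator of a
fprintf('true a = %.1f, sign-projection estimate = %.1f\n', a_true, a_sgn);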

$$\hat{a}_{Sign} = \cos(\hat{\theta})\sqrt{m_1 m_2}.$$

By the delta method, $\hat{a}_{Sign}$ is asymptotically unbiased with asymptotic variance

$$\mathrm{Var}\left(\hat{a}_{Sign}\right) = \mathrm{Var}(\hat{\theta})\sin^2(\theta)\, m_1 m_2 = \frac{\theta(\pi - \theta)}{k}\sin^2(\theta)\, m_1 m_2,$$

because

$$\mathrm{Var}\left(\hat{\theta}\right) = \frac{\pi^2}{k}\left(1 - \frac{\theta}{\pi}\right)\left(\frac{\theta}{\pi}\right) = \frac{\theta(\pi - \theta)}{k}.$$

Regular random projections store real numbers (e.g., 64 bits). For the same number of projections (the same $k$), sign random projections obviously have larger variance. However, if the variance is inflated only by a factor of (e.g.) 4, sign random projections would still be preferable, because we could increase $k$ to (e.g.) $4k$ to achieve the same accuracy while the storage cost remains lower than that of regular random projections.

Define

$$V_{Sign} = \frac{\mathrm{Var}\left(\hat{a}_{Sign}\right)}{\mathrm{Var}\left(\hat{a}_{MLE}\right)}
= \frac{\theta(\pi-\theta)\sin^2(\theta)\, m_1 m_2}{\frac{(m_1 m_2 - a^2)^2}{m_1 m_2 + a^2}}
= \frac{\theta(\pi-\theta)\left(1 + \cos^2(\theta)\right)}{\sin^2(\theta)},$$

which is symmetric about $\theta = \frac{\pi}{2}$ and monotonically decreasing on $(0, \frac{\pi}{2}]$, with minimum $\frac{\pi^2}{4} \approx 2.47$ attained at $\theta = \frac{\pi}{2}$.

[Figure: the variance ratio $V_{Sign}$ plotted against $\theta$ (in units of $\pi$); the ratio is large for small $\theta$ and decreases toward its minimum near $\theta = \pi/2$.]

When the data points are nearly uncorrelated ($\theta$ close to $\frac{\pi}{2}$), sign random projections should have good performance.

However, some applications such as duplicate detection are interested in data points that are close to each other, for which sign random projections may cause relatively large errors. In that case, we are better off using regular normal random projections with marginal information.

Proof of the Sign Random Projection Collision Probability

$$g(\rho) = \int_0^{\infty}\!\!\int_0^{\infty} f(x,y)\, dx\, dy
= \int_0^{\infty}\!\!\int_0^{\infty} \frac{1}{2\pi\sqrt{1-\rho^2}}\, e^{-\frac{x^2 - 2\rho x y + y^2}{2(1-\rho^2)}}\, dx\, dy$$
$$= \int_0^{\infty} \frac{1}{2\pi\sqrt{1-\rho^2}}\, e^{-\frac{y^2}{2(1-\rho^2)}}\, dy \int_0^{\infty} e^{-\frac{x^2 - 2\rho x y}{2(1-\rho^2)}}\, dx
= \int_0^{\infty} \frac{1}{2\pi\sqrt{1-\rho^2}}\, e^{-\frac{y^2}{2(1-\rho^2)}}\, dy \int_0^{\infty} e^{-\frac{(x - y\rho)^2}{2(1-\rho^2)}}\, e^{\frac{y^2\rho^2}{2(1-\rho^2)}}\, dx$$
$$= \int_0^{\infty} \frac{1}{2\pi\sqrt{1-\rho^2}}\, e^{-\frac{y^2}{2}}\, dy \int_0^{\infty} e^{-\frac{(x - y\rho)^2}{2(1-\rho^2)}}\, dx
= \int_0^{\infty} \frac{1}{2\pi\sqrt{1-\rho^2}}\, e^{-\frac{y^2}{2}}\, dy \int_{\frac{-y\rho}{\sqrt{1-\rho^2}}}^{\infty} e^{-\frac{t^2}{2}}\sqrt{1-\rho^2}\, dt$$
$$= \int_0^{\infty} \phi(y)\, \Phi\!\left( \frac{y\rho}{\sqrt{1-\rho^2}} \right) dy.$$

$$\frac{\partial g(\rho)}{\partial \rho} = \int_0^{\infty} \phi(y)\, \phi\!\left( \frac{y\rho}{\sqrt{1-\rho^2}} \right) \frac{y}{(1-\rho^2)^{3/2}}\, dy
= \frac{1}{(1-\rho^2)^{3/2}}\cdot\frac{1}{2\pi}\int_0^{\infty} y\, e^{-\frac{y^2}{2(1-\rho^2)}}\, dy
= \frac{1}{(1-\rho^2)^{3/2}}\cdot\frac{1}{2\pi}\cdot(1-\rho^2)
= \frac{1}{(1-\rho^2)^{1/2}}\,\frac{1}{2\pi}.$$

Note that $g(0) = \frac{1}{4}$. Hence

$$g(\rho) = \int_0^{\rho} \frac{1}{(1-\rho^2)^{1/2}}\,\frac{1}{2\pi}\, d\rho + \frac{1}{4}
= \frac{1}{2\pi}\sin^{-1}(\rho) + \frac{1}{4}
= \frac{1}{2\pi}\left( \frac{\pi}{2} - \theta \right) + \frac{1}{4}
= \frac{1}{2} - \frac{\theta}{2\pi}.$$

This proves the desired probability $T = 2g(\rho) = 1 - \frac{\theta}{\pi}$. Here $\rho = \frac{a}{\sqrt{m_1 m_2}}$.

Comparing Random Projection with Simple Random Sampling

Suppose we randomly sample $k$ elements from $u_1$ and denote the samples by $s_1, s_2, \ldots, s_k$. Then an unbiased estimator of $m_1 = \sum_{i=1}^D u_{1,i}^2$ would be

$$\hat{m}_{1,s} = \frac{D}{k}\sum_{j=1}^{k} s_j^2, \qquad
E\left(\hat{m}_{1,s}\right) = D\, E(s_j^2) = D\,\frac{\sum_{i=1}^D u_{1,i}^2}{D} = m_1.$$

The variance would be (assuming $k \ll D$):

$$\mathrm{Var}\left(\hat{m}_{1,s}\right) = \frac{D^2}{k}\left( E(s_j^4) - E^2(s_j^2) \right)
= \frac{D^2}{k}\left[ \frac{\sum_{i=1}^D u_{1,i}^4}{D} - \left( \frac{\sum_{i=1}^D u_{1,i}^2}{D} \right)^2 \right],$$

which can be dominated by the fourth-order moment of the data.

Recall that the variance of the random projection estimator involves only the second-order moment. When the data are heavy-tailed, simple random sampling will perform much worse than random projection.

Summary of Normal Random Projections

Random Projections: replace $A$ by $B = A \times R$.

• An elegant method and an interesting (elementary) probability exercise. Suitable for approximating Euclidean distances in massive, dense, and heavy-tailed (some entries are excessively large) data matrices.

• It does not take advantage of data sparsity.

• It has guaranteed performance when estimating the $l_2$ distance.

• The straightforward estimator of the inner product can be quite unsatisfactory.

• An MLE can improve the estimates by using marginal information.

• What is often used in industry is the sign (1-bit) random projection.

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 237

Normality Assumption is Not Necessary

A R = B

R ∈ RD×k: a random matrix, with i.i.d. entries sampled from N(0, 1).

B ∈ Rn×k : projected matrix, also random.

B approximately preserves the Euclidean distance and dot products between any

two rows of A. In particular, E (BBT) = AAT.

However, we do not really need to use normal distribution for sampling the

projection matrix. In fact, any zero-mean distribution with finite variance should

work (central limit theorem).

Sparse Random Projection Matrix

Instead of $N(0,1)$, we can sample the entries of $R$ i.i.d. from

$$r_{ij} = \sqrt{s} \times \begin{cases} 1 & \text{with prob. } \frac{1}{2s} \\ 0 & \text{with prob. } 1 - \frac{1}{s} \\ -1 & \text{with prob. } \frac{1}{2s} \end{cases}$$

Here the factor $\sqrt{s}$ is just for convenience. With this choice,

$$E(r_{ij}) = 0, \quad E(r_{ij}^2) = 1, \quad E(r_{ij}^4) = s, \quad E(|r_{ij}^3|) = \sqrt{s},$$
$$E(r_{ij} r_{i'j'}) = 0, \quad E\left(r_{ij}^2 r_{i'j'}\right) = 0 \quad \text{when } i \neq i' \text{ or } j \neq j'.$$

We still use the same unbiased estimators as in normal random projections, i.e.,

$$\hat{m}_1 = \frac{1}{k}\sum_{j=1}^{k} |v_{1,j}|^2, \qquad
\hat{d} = \frac{1}{k}\sum_{j=1}^{k} |v_{1,j} - v_{2,j}|^2, \qquad
\hat{a} = \frac{1}{k}\sum_{j=1}^{k} v_{1,j} v_{2,j}.$$

Their variances can be proved to be

$$\mathrm{Var}\left(\hat{m}_1\right) = \frac{1}{k}\left( 2m_1^2 + (s-3)\sum_{i=1}^D u_{1,i}^4 \right),$$
$$\mathrm{Var}\left(\hat{d}\right) = \frac{1}{k}\left( 2d^2 + (s-3)\sum_{i=1}^D (u_{1,i} - u_{2,i})^4 \right),$$
$$\mathrm{Var}\left(\hat{a}\right) = \frac{1}{k}\left( m_1 m_2 + a^2 + (s-3)\sum_{i=1}^D u_{1,i}^2 u_{2,i}^2 \right).$$

Interestingly, when $s < 3$, these variances are strictly smaller than the variances under normal random projections, regardless of the data.
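A small MATLAB check (not from the slides) of the variance formula for $\hat{a}$ under this sparse projection scheme; $s$, $D$, $k$ and the data are arbitrary, and the number of repetitions is kept modest.

% Minimal sketch: sparse projection matrix and empirical check of Var(a_hat).
D = 1000; k = 100; s = 10; nrep = 500;
u1 = randn(D,1); u2 = randn(D,1);
m1 = sum(u1.^2); m2 = sum(u2.^2); a = u1'*u2;
theo = ( m1*m2 + a^2 + (s-3)*sum(u1.^2 .* u2.^2) ) / k;
ahat = zeros(nrep,1);
for r = 1:nrep
    U = rand(D,k);
    R = sqrt(s) * ( (U < 1/(2*s)) - (U > 1 - 1/(2*s)) );  % entries sqrt(s)*{+1, 0, -1}
    ahat(r) = mean( (u1'*R) .* (u2'*R) );
end
fprintf('theoretical Var = %.1f, empirical Var = %.1f\n', theo, var(ahat));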

Even when $s$ is large, for example $s = \sqrt{D}$, the variances may not increase much, unless the fourth moments of the data are too large.

To see this,

$$\frac{(s-3)\sum_{i=1}^D u_{1,i}^4}{m_1^2} = \frac{s-3}{D}\cdot \frac{\sum_{i=1}^D u_{1,i}^4 / D}{\left(m_1/D\right)^2},$$

which can be written as $O\left(\frac{s-3}{D}\right)$ if the data can be assumed to have finite fourth moments. When $D$ is very large (as in practice), we can choose $s$ to be very large as long as $\frac{s-3}{D}$ remains relatively small.

For example, $s = \sqrt{D}$ is often a good choice if $D$ is truly large and the data are not too heavy-tailed.

Proof of the Variances

It suffices to study $\hat{a}$. (Why?) Note that

$$a^2 = \left( \sum_{i=1}^D u_{1,i} u_{2,i} \right)^2 = \sum_{i=1}^D u_{1,i}^2 u_{2,i}^2 + 2\sum_{i < i'} u_{1,i} u_{2,i} u_{1,i'} u_{2,i'},$$

$$v_{1,j} v_{2,j} = \sum_{i=1}^D r_{ij}^2\, u_{1,i} u_{2,i} + \sum_{i\neq i'} r_{ij} u_{1,i}\, r_{i'j} u_{2,i'},$$

$$E\left(v_{1,j} v_{2,j}\right) = \sum_{i=1}^D E(r_{ij}^2)\, u_{1,i} u_{2,i} + \sum_{i\neq i'} E(r_{ij}) u_{1,i}\, E(r_{i'j}) u_{2,i'} = \sum_{i=1}^D u_{1,i} u_{2,i}.$$

$$v_{1,j}^2 v_{2,j}^2 = \left( \sum_{i=1}^D r_{ij}^2\, u_{1,i} u_{2,i} + \sum_{i\neq i'} r_{ij} u_{1,i}\, r_{i'j} u_{2,i'} \right)^2$$
$$= \sum_{i=1}^D r_{ij}^4\, u_{1,i}^2 u_{2,i}^2
+ 2\sum_{i < i'} r_{ij}^2\, u_{1,i} u_{2,i}\, r_{i'j}^2\, u_{1,i'} u_{2,i'}
+ \left( \sum_{i\neq i'} r_{ij} u_{1,i}\, r_{i'j} u_{2,i'} \right)^2
+ 2\sum_{i=1}^D r_{ij}^2\, u_{1,i} u_{2,i} \sum_{i\neq i'} r_{ij} u_{2,i}\, r_{i'j} u_{1,i'},$$

$$E\left( v_{1,j}^2 v_{2,j}^2 \right)
= s\sum_{i=1}^D u_{1,i}^2 u_{2,i}^2 + 4\sum_{i < i'} u_{1,i} u_{2,i} u_{1,i'} u_{2,i'} + \sum_{i\neq i'} u_{1,i}^2 u_{2,i'}^2$$
$$= (s-2)\sum_{i=1}^D u_{1,i}^2 u_{2,i}^2 + \sum_{i\neq i'} u_{1,i}^2 u_{2,i'}^2 + 2a^2
= m_1 m_2 + (s-3)\sum_{i=1}^D u_{1,i}^2 u_{2,i}^2 + 2a^2,$$

$$\mathrm{Var}\left(\hat{a}\right) = \frac{1}{k}\left( m_1 m_2 + a^2 + (s-3)\sum_{i=1}^D u_{1,i}^2 u_{2,i}^2 \right).$$

Discussions

• Sparse random projections have significant advantages because (i) random number generation is much simpler; (ii) matrix multiplication is essentially avoided; (iii) storing the projection matrix is much less costly; etc.

• Note that the analysis only needs $E(r_{ij}) = E(r_{ij}^3) = 0$, $E(r_{ij}^2) = 1$, and $E(r_{ij}^4) = s$. The exact distribution of $r_{ij}$ is actually irrelevant. Therefore, the analysis is much more general than the particular choice of random projection matrix above.

Cauchy Random Projections

$$A \times R = B$$

$R \in \mathbb{R}^{D\times k}$: a random matrix with i.i.d. entries sampled from the standard Cauchy distribution $C(0,1)$. $B \in \mathbb{R}^{n\times k}$: the projected matrix, also random.

It turns out that $B$ contains sufficient information to estimate the original $l_1$ distances in $A$. Consider the first two rows of $A$; then $d = d_1 = \sum_{i=1}^D |u_{1,i} - u_{2,i}|$.

The $l_1$ distance is often believed to provide more "robust" results (e.g., for clustering or classification) than the $l_2$ distance (but we should keep in mind that for many datasets the $l_1$ distance is not a good choice).

Normal Random Projections Cannot Estimate the l1 Distance

Recall that if $r_{ij} \sim N(0,1)$, then $v_{1,j} \sim N\left(0, \sum_{i=1}^D |u_{1,i}|^2\right)$, which contains no information about the $l_1$ norm.

If $r_{ij}$ is sampled from another distribution, as long as $E(r_{ij}) = 0$ and $E(r_{ij}^2) < \infty$, the central limit theorem (CLT) implies that the projected data are still approximately normal.

Therefore, in order to "avoid" the CLT, we should sample from distributions that do not have bounded variance, or even bounded mean. The Cauchy distribution is a well-known example.

Review Cauchy Distribution

A Cauchy random variable $z \sim C(0, \gamma)$ has the density

$$f(z) = \frac{\gamma}{\pi}\frac{1}{z^2 + \gamma^2}, \qquad \gamma > 0,\ -\infty < z < \infty,$$

and the characteristic function

$$E\left( \exp(\sqrt{-1}\, z t) \right) = \exp\left( -\gamma|t| \right).$$

Consider $z_1, z_2, \ldots, z_D$ i.i.d. $C(0, \gamma)$ and any constants $c_1, c_2, \ldots, c_D$. Then

$$E\left( \exp\left( \sqrt{-1}\, t \sum_{i=1}^D c_i z_i \right) \right) = \exp\left( -\gamma \sum_{i=1}^D |c_i|\, |t| \right),$$

which means the weighted sum $\sum_{i=1}^D c_i z_i \sim C\left(0, \gamma\sum_{i=1}^D |c_i|\right)$. This is the foundation of Cauchy random projection.

Parameter Estimation Problem in Cauchy Random Projections

In Cauchy random projections, we let $r_{ij} \sim C(0,1)$, and $v_{1,j} = \sum_{i=1}^D u_{1,i} r_{ij}$, $v_{2,j} = \sum_{i=1}^D u_{2,i} r_{ij}$. Therefore,

$$v_{1,j} \sim C\left( 0, \sum_{i=1}^D |u_{1,i}| \right), \qquad
v_{2,j} \sim C\left( 0, \sum_{i=1}^D |u_{2,i}| \right),$$
$$x_j = v_{1,j} - v_{2,j} \sim C\left( 0, \sum_{i=1}^D |u_{1,i} - u_{2,i}| \right).$$

That is, the task boils down to estimating the scale parameter, which happens to be the $l_1$ distance $d = |u_1 - u_2| = \sum_{i=1}^D |u_{1,i} - u_{2,i}|$ in the original space.

Three Types of Estimators

Given $k$ i.i.d. samples $x_j \sim C(0, d)$, the task is to estimate $d$. We know the sample mean does not work, because $E|x| = \infty$. We study three types of estimators:

1. the bias-corrected sample median estimator,

2. the bias-corrected geometric mean estimator,

3. the bias-corrected MLE.

The Bias-corrected Sample Median Estimator

$$\hat{d}_{me,c} = \frac{\hat{d}_{me}}{b_{me}},$$

where

$$\hat{d}_{me} = \mathrm{median}(|x_j|,\ j = 1, 2, \ldots, k), \qquad
b_{me} = \int_0^1 \frac{(2m+1)!}{(m!)^2}\tan\left(\frac{\pi}{2}t\right)\left(t - t^2\right)^m dt, \qquad k = 2m+1.$$

Here, for convenience, we only consider $k = 2m+1$, $m = 1, 2, 3, \ldots$

Note that $b_{me}$ can be numerically evaluated and tabulated for each $k$.

Some properties of $\hat{d}_{me,c}$:

• $E\left(\hat{d}_{me,c}\right) = d$, i.e., $\hat{d}_{me,c}$ is unbiased.

• When $k \ge 5$, the variance of $\hat{d}_{me,c}$ is

$$\mathrm{Var}\left(\hat{d}_{me,c}\right) = d^2\left( \frac{(m!)^2}{(2m+1)!}\; \frac{\int_0^1 \tan^2\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m dt}{\left( \int_0^1 \tan\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m dt \right)^2} - 1 \right),$$

while $\mathrm{Var}\left(\hat{d}_{me,c}\right) = \infty$ if $k = 3$.

• As $k \to \infty$, $\hat{d}_{me,c}$ converges to a normal in distribution:

$$\sqrt{k}\left( \hat{d}_{me,c} - d \right) \xrightarrow{D} N\left( 0, \frac{\pi^2}{4}d^2 \right).$$

The bias-correction factor $b_{me}$ can be numerically evaluated and tabulated as a function of $k = 2m+1$. For $k > 50$, the bias is negligible.

[Figure: the bias-correction factor $b_{me}$ plotted against the sample size $k$ (from about 3 to 50); it decreases from roughly 1.7 toward 1.]

The Bias-corrected Geometric Mean Estimator

The bias-corrected geometric mean estimator is defined as

$$\hat{d}_{gm,c} = \cos^k\left(\frac{\pi}{2k}\right)\prod_{j=1}^{k} |x_j|^{1/k}, \qquad k > 1.$$

Useful properties of $\hat{d}_{gm,c}$ include the following (a short numerical sketch appears after this list of properties):

• It is unbiased, i.e., $E\left(\hat{d}_{gm,c}\right) = d$.

• Its variance is (for $k > 2$)

$$\mathrm{Var}\left(\hat{d}_{gm,c}\right) = d^2\left( \frac{\cos^{2k}\left(\frac{\pi}{2k}\right)}{\cos^{k}\left(\frac{\pi}{k}\right)} - 1 \right)
= \frac{\pi^2}{4}\frac{d^2}{k} + \frac{\pi^4}{32}\frac{d^2}{k^2} + O\left(\frac{1}{k^3}\right).$$

• For $0 \le \epsilon \le 1$, its tail bounds can be expressed in exponential form:

$$\Pr\left( \hat{d}_{gm,c} - d > \epsilon d \right) \le \exp\left( -k\,\frac{\epsilon^2}{8(1+\epsilon)} \right),$$
$$\Pr\left( \hat{d}_{gm,c} - d < -\epsilon d \right) \le \exp\left( -k\,\frac{\epsilon^2}{8(1+\epsilon)} \right), \qquad k \ge \frac{\pi^2}{1.5\epsilon}.$$

• These exponential tail bounds yield an analog of the Johnson-Lindenstrauss (JL) Lemma for dimension reduction in $l_1$:

If $k \ge \frac{8(2\log n - \log\delta)}{\epsilon^2/(1+\epsilon)} \ge \frac{\pi^2}{1.5\epsilon}$, then with probability at least $1 - \delta$, one can recover the original $l_1$ distance between any pair of data points (among all $n$ data points) within a $1 \pm \epsilon$ ($0 \le \epsilon \le 1$) factor of the truth using $\hat{d}_{gm,c}$, i.e., $|\hat{d}_{gm,c} - d| \le \epsilon d$.
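A minimal MATLAB sketch (not from the slides) of Cauchy random projections with the geometric mean estimator; Cauchy variates are generated as $\tan(\pi(U - 0.5))$, and the data and sizes are arbitrary.

% Minimal sketch: Cauchy random projections and the geometric mean estimator of l1.
D = 5000; k = 100;
u1 = rand(D,1); u2 = rand(D,1);
d_true = sum(abs(u1 - u2));              % true l1 distance
R  = tan(pi*(rand(D,k) - 0.5));          % i.i.d. standard Cauchy entries
x  = ((u1 - u2)'*R)';                    % x_j ~ C(0, d_true)
% geometric mean estimator, computed on the log scale for numerical stability
d_gm = cos(pi/(2*k))^k * exp(mean(log(abs(x))));
d_me = median(abs(x));                   % sample median estimate (before bias correction)
fprintf('true d = %.1f, d_gm = %.1f, uncorrected median estimate = %.1f\n', ...
        d_true, d_gm, d_me);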

Fractional Moments of the Cauchy

Assume $x \sim C(0, d)$. Then

$$E\left( |x|^{\lambda} \right) = \frac{d^{\lambda}}{\cos(\lambda\pi/2)}, \qquad |\lambda| < 1.$$

Proof:

$$E\left( |x|^{\lambda} \right) = \frac{2d}{\pi}\int_0^{\infty}\frac{y^{\lambda}}{y^2 + d^2}\, dy
= \frac{d^{\lambda}}{\pi}\int_0^{\infty}\frac{y^{(\lambda-1)/2}}{y + 1}\, dy
= \frac{d^{\lambda}}{\cos(\lambda\pi/2)},$$

with the help of integral tables.

By letting $\lambda = 1/k$, we obtain the geometric mean estimator.

The Bias-corrected Maximum Likelihood Estimator

The bias-corrected maximum likelihood estimator (MLE) is

$$\hat{d}_{MLE,c} = \hat{d}_{MLE}\left( 1 - \frac{1}{k} \right),$$

where $\hat{d}_{MLE}$ solves the nonlinear MLE equation

$$-\frac{k}{\hat{d}_{MLE}} + \sum_{j=1}^{k}\frac{2\hat{d}_{MLE}}{x_j^2 + \hat{d}_{MLE}^2} = 0.$$

Some properties of $\hat{d}_{MLE,c}$:

• It is nearly unbiased: $E\left(\hat{d}_{MLE,c}\right) = d + O\left(\frac{1}{k^2}\right)$.

• Its asymptotic variance is

$$\mathrm{Var}\left(\hat{d}_{MLE,c}\right) = \frac{2d^2}{k} + \frac{3d^2}{k^2} + O\left(\frac{1}{k^3}\right),$$

i.e., $\frac{\mathrm{Var}(\hat{d}_{MLE,c})}{\mathrm{Var}(\hat{d}_{me,c})} \to \frac{8}{\pi^2}$ and $\frac{\mathrm{Var}(\hat{d}_{MLE,c})}{\mathrm{Var}(\hat{d}_{gm,c})} \to \frac{8}{\pi^2}$ as $k \to \infty$ ($\frac{8}{\pi^2} \approx 0.81$).

• Its distribution can be accurately approximated by an inverse Gaussian, at least in the small-deviation range. The inverse Gaussian approximation suggests the following approximate tail bound:

$$\Pr\left( |\hat{d}_{MLE,c} - d| \ge \epsilon d \right) \;\overset{\sim}{\le}\; 2\exp\left( -\frac{\epsilon^2/(1+\epsilon)}{2\left( \frac{2}{k} + \frac{3}{k^2} \right)} \right), \qquad 0 \le \epsilon \le 1,$$

which has been verified by simulations down to tail probabilities of about $10^{-10}$.

Sampling from the Cauchy and Equivalent Distributions

A Cauchy random variable $Z \sim C(0,1)$ has the density

$$f_Z(z) = \frac{1}{\pi}\frac{1}{z^2 + 1}, \qquad -\infty < z < \infty.$$

There are various ways to sample from $C(0,1)$:

• If $X, Y \sim N(0,1)$ i.i.d., then $Z = X/Y \sim C(0,1)$.

• If $U \sim U(0,1)$, then $Z = \tan\left(\pi[U - 0.5]\right) \sim C(0,1)$.

• We can use $\frac{1}{U}$ to approximate a Cauchy, as they are asymptotically equivalent in the tail. Note that $f_Z(z) = \frac{1}{\pi}\frac{1}{z^2+1} \approx \frac{1}{\pi}\frac{1}{z^2}$ for large $z$; in other words, $\Pr(Z > z) \approx \frac{1}{\pi z} \propto \frac{1}{z}$ for large $z$, which is the tail of the $\eta$-Pareto distribution with $\eta = 1$.

If $Z \sim \eta$-Pareto, then $\Pr(Z > z) = \frac{1}{z^{\eta}}$, and $E(|Z|^{\lambda}) < \infty$ only if $\lambda < \eta$.

Very Sparse Cauchy Random Projections

In practice, we can sample $r_{ij}$ from the following very sparse distribution:

$$r_{ij} = \begin{cases} 1/U_1 & \text{with prob. } \frac{\beta}{2} \\ 0 & \text{with prob. } 1 - \beta \\ -1/U_2 & \text{with prob. } \frac{\beta}{2} \end{cases}$$

where $U_1$ and $U_2$ are independent uniform random variables on $(0,1)$.

One can show via the Fourier transform that, as $D \to \infty$,

$$\frac{\sum_{i=1}^D c_i r_{ij}}{\beta\sum_{i=1}^D |c_i|} \;\xrightarrow{D}\; C\left( 0, \frac{\pi}{2} \right),$$

which, for convenience, is written as

$$\sum_{i=1}^D r_{ij} c_i \;\xrightarrow{D}\; C\left( 0, \frac{\pi}{2}\beta\sum_{i=1}^D |c_i| \right).$$
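A minimal MATLAB sketch (not from the slides) of the very sparse scheme: the projected differences are approximately $C\left(0, \frac{\pi}{2}\beta d\right)$, so the geometric mean estimate of the scale is rescaled by $\frac{\pi}{2}\beta$ to recover $d$; all sizes and data below are arbitrary.

% Minimal sketch: very sparse Cauchy-like projections for the l1 distance.
D = 10000; k = 100; beta = 0.05;
u1 = rand(D,1); u2 = rand(D,1);
d_true = sum(abs(u1 - u2));
x = zeros(k,1);
for j = 1:k
    U = rand(D,1); r = zeros(D,1);
    pos = U < beta/2;  neg = U > 1 - beta/2;
    r(pos) =  1./rand(sum(pos),1);            % +1/U1 with prob. beta/2
    r(neg) = -1./rand(sum(neg),1);            % -1/U2 with prob. beta/2
    x(j) = (u1 - u2)'*r;                      % approximately C(0, (pi/2)*beta*d)
end
scale_hat = cos(pi/(2*k))^k * exp(mean(log(abs(x))));   % geometric mean scale estimate
d_hat = scale_hat / ((pi/2)*beta);
fprintf('true d = %.1f, estimated d = %.1f\n', d_true, d_hat);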

Numerical Experiments on Synthetic Data

We simulate data from an $\eta$-Pareto distribution with $\eta = 1.1$ (i.e., highly heavy-tailed data), and then apply very sparse stable random projections with $\beta = 0.05$ (i.e., a 20-fold speedup) to estimate the $l_1$ norm of the data.

The mean square errors (MSE), computed from $10^6$ simulations for each $k$ and $D$, are plotted against the sample size $k$ for $D$ = 100, 500, 1000, and 5000.

We compute the empirical variances from $10^6$ simulations for every $k$ and $D$. The "theoretic" curve is the theoretical variance assuming the data are exactly (instead of asymptotically) Cauchy. When $D = 100$, the performance is poor, but as soon as $D \ge 500$, very sparse Cauchy random projections produce results very similar to regular Cauchy random projections.

[Figure: MSE versus sample size $k$ for Pareto data ($\eta = 1.1$, $\beta = 0.05$); curves for the theoretical variance and for $D$ = 100, 500, 1000, 5000.]

Next, we simulate data from an $\eta$-Pareto distribution, $P_\eta$, for $\eta = 1.5$ and $\eta = 2.0$. More aggressively, we let $\beta = D^{-0.6}$ and $D^{-0.75}$. We will see that $D$ does not have to be very large for the asymptotic theory to work well.

[Figure: MSE versus sample size $k$ for Pareto data ($\eta = 1.5$, $\beta = D^{-0.6}$); curves for the theoretical variance and for $D$ = 100, 500, 1000.]

[Figure: MSE versus sample size $k$ for Pareto data ($\eta = 2$, $\beta = D^{-0.75}$); curves for the theoretical variance and for $D$ = 100, 500, 1000.]

Experiments on Web Crawl Data

We apply very sparse stable random projections to some Web crawl data. We pick two pairs of words: THIS-HAVE and SCHOOL-PROGRAM.

For each word (vector), the $i$-th entry ($i = 1$ to $D = 65536$) is the number of occurrences of this word in the $i$-th Web page. It is well known that word frequency data are highly heavy-tailed and highly sparse.

For each pair, we estimate the $l_1$ distance using very sparse stable random projections with $\beta$ = 0.1, 0.01, and 0.001. For the pair THIS-HAVE, even with $\beta = 0.001$, the results are indistinguishable from what we would obtain with exact stable random projections. For the pair SCHOOL-PROGRAM, the results are good when $\beta = 0.01$; however, when $\beta = 0.001$, we see larger errors, because the data are both sparse and highly heavy-tailed.

[Figure: MSE versus sample size $k$ for the word pair THIS-HAVE; curves for the theoretical variance and for $\beta$ = 0.1, 0.01, 0.001.]

[Figure: MSE versus sample size $k$ for the word pair SCHOOL-PROGRAM; curves for the theoretical variance and for $\beta$ = 0.1, 0.01, 0.001.]

Page 268: BTRY6520/STSCI6520 Fall, 2012 Department of Statistical ...stat.rutgers.edu/home/pingli/STSCI6520/Lecture/lecture.pdf · BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science

BTRY6520/STSCI6520 Fall, 2012 Department of Statistical Science Cornell University 268

Classifying Cancers in Microarray Data

Usually the purpose of computing distances is for the subsequent tasks such as

clustering, classification, information retrieval, etc. Here we consider the task of

classifying deceases Harvard microarray dataset. The original dataset contains

176 samples (specimens) in 12600 gene dimensions.

We conduct both Cauchy random projections and very sparse stable random

projections (β = 0.1, 0.01, and 0.001) and classify the specimens using a

5-nearest neighbor classifier based on the estimated l1 distances (using the

geometric mean estimator).

We observe (i): stable random projections can achieve similar classification

accuracy using about 100 projections (as opposed to the original D = 12600

dimensions); (ii): very sparse stable random projections work well when β = 0.1

and 0.01. Even with β = 0.001, the classification results are only slightly worse.
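For concreteness, here is a small Matlab sketch of the classification step: given an n × n matrix of (estimated) l1 distances and class labels, classify each specimen by a leave-one-out 5-nearest-neighbor vote. The data, labels, and distance matrix below are toy stand-ins; the projection/estimation step that would produce the estimated distances is assumed to have been done already.

% Leave-one-out 5-nearest-neighbor classification from a distance matrix.
n = 176;  d0 = 20;  m = 5;
X = rand(n, d0);  labels = randi(5, n, 1);        % hypothetical data and class labels
Dist = squeeze(sum(abs(permute(X,[1 3 2]) - permute(X,[3 1 2])), 3));  % n x n l1 distances
pred = zeros(n, 1);
for t = 1:n
    d = Dist(t, :);  d(t) = inf;                  % exclude the point itself
    [~, idx] = sort(d, 'ascend');
    pred(t) = mode(labels(idx(1:m)));             % majority vote among the m neighbors
end
err_rate = mean(pred ~= labels);                  % misclassification rate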


[Figures: mean and standard deviation of the misclassification errors versus sample size k, for Cauchy random projections and for very sparse stable random projections with β = 0.1, 0.01, 0.001.]


Choice of the Random Projection Matrix R

The choice of R depends on which pairwise lp distance (between rows) is of interest.

• If p = 2 (Euclidean distance), sample the entries of R from the normal distribution (or any normal-like distribution with finite variance).

• If p = 1 (Manhattan distance), sample R from the Cauchy distribution.

• If p = 0 (Hamming distance), sample R from a p-stable distribution with p ≈ 0.

• For general 0 < p ≤ 2, sample R from a p-stable distribution (a sampling sketch follows this list).
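A Matlab sketch of how the entries of R could be sampled for different p. For general 0 < p ≤ 2 it uses the Chambers-Mallows-Stuck construction for symmetric p-stable variables; the helper below is an illustration, not code from the course.

% Sample a D x k projection matrix R for a given p (minimal sketch).
D = 1000; k = 50; p = 1.3;
if p == 2
    R = randn(D, k);                          % normal entries (Euclidean distance)
elseif p == 1
    R = tan(pi * (rand(D, k) - 0.5));         % standard Cauchy entries
else
    % Chambers-Mallows-Stuck sampler for symmetric p-stable variables.
    U = pi * (rand(D, k) - 0.5);              % Uniform(-pi/2, pi/2)
    W = -log(rand(D, k));                     % Exp(1)
    R = (sin(p*U) ./ cos(U).^(1/p)) .* (cos(U - p*U) ./ W).^((1 - p)/p);
end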


Stable Random Projections (SRP)

Given two D-dimensional vectors u1, u2 ∈ R^D, let v1 = R^T u1 and v2 = R^T u2. Then

v_{1,j} − v_{2,j} = ∑_{i=1}^D (u_{1,i} − u_{2,i}) r_{ij} ∼ S(p, d_p),   where d_p = ∑_{i=1}^D |u_{1,i} − u_{2,i}|^p.

Thus, if we only need the distance d_p, we can perform this projection k times and estimate the scale parameter from the resultant k i.i.d. stable samples.

SRP essentially boils down to a statistical estimation problem. The main open problem is how to estimate d_p when p > 2 (e.g., quantities related to skewness and kurtosis).
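A minimal end-to-end Matlab sketch of this estimation problem, assuming the simple estimators below (mean of squares for p = 2, sample median of absolute values for p = 1) rather than the more refined estimators discussed elsewhere in the lectures; data and variable names are illustrative.

% Stable random projections: estimate d_p from k i.i.d. stable samples.
D = 5000; k = 100;
u1 = randn(1, D); u2 = randn(1, D);          % toy data vectors
% p = 2: normal projections; each projected difference is N(0, d_2).
Rn = randn(D, k);
s2 = (u1 - u2) * Rn;                         % k samples
d2_hat = mean(s2.^2);                        % estimates d_2 = sum_i (u1_i - u2_i)^2
% p = 1: Cauchy projections; each projected difference is Cauchy(0, d_1).
Rc = tan(pi * (rand(D, k) - 0.5));
s1 = (u1 - u2) * Rc;
d1_hat = median(abs(s1));                    % median of |Cauchy(0, d)| equals d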


Applications of Random Projections

Numerous applications:

• Data visualization, e.g., multi-dimensional scaling (MDS) requires a pairwise similarity matrix.

• Machine learning, e.g., the support vector machine (SVM) requires a kernel/distance matrix.

• Information retrieval, e.g., filtering nearly duplicate documents (often measured by distance).

• Databases, e.g., estimating join sizes (dot products) for optimizing query execution.

• Dynamic data stream computations, e.g., estimating summary statistics for visualizing/detecting anomalies in real time.

Advantages (over sampling methods)

• Guaranteed accuracies in many cases, even on heavy-tailed data.

Disadvantages

• Limited to 0 < p ≤ 2 (although recently we extended it to p = 4, 6, 8...)

• One projection only for one p (i.e., not one-sketch-for-all).

• It does not take advantage of data sparsity.


Impact of the Choice of p

Experiments on classification using m-nearest neighbors and the lp distance

[Figures: classification error rate (%) versus the norm p, for m = 1, 5, 10 nearest neighbors, on the Mnist and Letter datasets.]

lp distance: d_p = ∑_{i=1}^D |x_i − y_i|^p (the horizontal axis "Norm (p)" in the figures is this p).

Interestingly, on many data sets the lowest classification errors occur at p ≥ 4, yet p-stable random projections are normally limited to 0 < p ≤ 2.
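A small Matlab sketch of the kind of experiment summarized in these figures: for each p on a grid, compute d_p = ∑ |x_i − y_i|^p between test and training points and record the m-nearest-neighbor error. The data below are synthetic stand-ins for the benchmark datasets.

% m-nearest-neighbor classification error as a function of p (toy data).
ntr = 300; nte = 200; d0 = 20; m = 5;
Xtr = randn(ntr, d0); ytr = (Xtr(:,1) > 0) + 1;     % two synthetic classes
Xte = randn(nte, d0); yte = (Xte(:,1) > 0) + 1;
ps  = 0.5:0.5:10;  err = zeros(size(ps));
for t = 1:numel(ps)
    p = ps(t);  pred = zeros(nte, 1);
    for j = 1:nte
        dp = sum(abs(Xtr - Xte(j,:)).^p, 2);        % d_p to every training point
        [~, idx] = sort(dp);
        pred(j) = mode(ytr(idx(1:m)));              % m-nearest-neighbor vote
    end
    err(t) = mean(pred ~= yte);
end
plot(ps, 100*err); xlabel('Norm (p)'); ylabel('Classification error rate (%)');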


[Figures: classification error rate (%) versus the norm p, for m = 1, 5, 10 nearest neighbors, on the Mnist10k, Zipcode, Realsim, and Gisette datasets.]


Efficient Algorithms for Estimating lp Distances (p > 2)

It is difficult to approximate the lp distance when p > 2: d_p = ∑_{i=1}^D |x_i − y_i|^p.

• When 0 < p ≤ 2, the space (sample size) needed is O(1/ε^2) = O(1) if we treat ε as a constant. Note that this is independent of D.

• When p > 2, the space (sample size) needed is Ω(D^{1−2/p}).

• No practical algorithms were known that approximate the lp distances well for general p > 2. Even CRS (Conditional Random Sampling, another line of work we will study later) does not work well when p is large.


Simple Random Sampling

Randomly sample k columns from the data matrix to compute the l4 distances. Consider, for example, two rows of A, denoted by x and y. Denote the sampled k entries by x_j, y_j, j = 1 to k. The estimator, denoted by d(4),S, is

d(4),S = (D/k) ∑_{j=1}^k |x_j − y_j|^4

Var(d(4),S) = (D^2/k^2) · k · [ E(|x_j − y_j|^8) − E^2(|x_j − y_j|^4) ]

            = (D/k) [ ∑_{i=1}^D |x_i − y_i|^8 − (∑_{i=1}^D |x_i − y_i|^4)^2 / D ].

In the worst case, the variance is dominated by the 8th order terms. Thus,

random sampling can have very large errors.
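A direct Matlab sketch of this estimator (variable names are illustrative): sample k coordinates uniformly (with replacement, for simplicity) and scale by D/k.

% Simple random (coordinate) sampling estimate of the l4 distance.
D = 10000; k = 100;
x = randn(1, D); y = randn(1, D);            % toy data rows
idx = randi(D, 1, k);                        % k sampled coordinates, with replacement
d4_S     = (D / k) * sum(abs(x(idx) - y(idx)).^4);
d4_exact = sum(abs(x - y).^4);               % for comparison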


Conditional Random Sampling for L4 Distance

Conditional Random Sampling (CRS) was recently proposed for sampling from

sparse data. The details will soon be explained. Denote the CRS estimate by

d(4),CRS . The variance is

Var(d(4),CRS) ≈ (max{|x|_0, |y|_0} / D) × Var(d(4),S),

where |x|_0 and |y|_0 are the numbers of non-zeros in the vectors x and y.

While the variance of CRS is reduced substantially by taking advantage of the

data sparsity, it is nevertheless still dominated by the 8th order terms.


A Solution Based on (Normal) Random Projections for p = 4

Based on a (retrospectively) simple trick:

d(4) = ∑_{i=1}^D |x_i − y_i|^4 = ∑_{i=1}^D x_i^4 + ∑_{i=1}^D y_i^4 + 6 ∑_{i=1}^D x_i^2 y_i^2 − 4 ∑_{i=1}^D x_i^3 y_i − 4 ∑_{i=1}^D x_i y_i^3.

(A quick numerical check of this identity is given after the list below.)

• ∑_{i=1}^D x_i^4 and ∑_{i=1}^D y_i^4 may be computed exactly by one scan of the data.

• ∑_{i=1}^D x_i^2 y_i^2, ∑_{i=1}^D x_i^3 y_i, and ∑_{i=1}^D x_i y_i^3 are inner products at different orders and can be approximated by (normal) random projections.

• For example, we apply normal random projections to x_i^2 and y_i^2 to estimate the inner product ∑_{i=1}^D x_i^2 y_i^2. This is certainly possible. Then we apply random projections to x_i^3 and y_i to estimate ∑_{i=1}^D x_i^3 y_i, etc.

• The question is: should we use just one projection matrix or should we use three?
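The promised numerical check of the expansion, in Matlab with arbitrary test vectors:

% Check: sum |x-y|^4 = sum x^4 + sum y^4 + 6 sum x^2 y^2 - 4 sum x^3 y - 4 sum x y^3.
x = randn(1, 1000); y = randn(1, 1000);
lhs = sum((x - y).^4);
rhs = sum(x.^4) + sum(y.^4) + 6*sum(x.^2 .* y.^2) - 4*sum(x.^3 .* y) - 4*sum(x .* y.^3);
abs(lhs - rhs)      % numerically zero (up to floating-point round-off)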


Efficient Algorithm for Estimating l4 Distance (Three Projections)

Apply three projection matrices R^(1), R^(2), R^(3) ∈ R^{D×k}, with i.i.d. N(0, 1) entries, to the vectors x, y ∈ R^D, to generate six vectors:

u_{1,j} = ∑_{i=1}^D x_i r^(1)_{ij},   u_{2,j} = ∑_{i=1}^D x_i^2 r^(2)_{ij},   u_{3,j} = ∑_{i=1}^D x_i^3 r^(3)_{ij},

v_{1,j} = ∑_{i=1}^D y_i r^(3)_{ij},   v_{2,j} = ∑_{i=1}^D y_i^2 r^(2)_{ij},   v_{3,j} = ∑_{i=1}^D y_i^3 r^(1)_{ij}.

We have an unbiased estimator, denoted by d(4),3p:

d(4),3p = ∑_{i=1}^D x_i^4 + ∑_{i=1}^D y_i^4 + (1/k) (6 u_2^T v_2 − 4 u_3^T v_1 − 4 u_1^T v_3).
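A Matlab sketch of this three-projection estimator as defined above; the pairing of the matrices (R^(1) with u_1 and v_3, R^(2) with u_2 and v_2, R^(3) with u_3 and v_1) follows the display, and the data are toy stand-ins.

% Three-projection estimator of the l4 distance.
D = 5000; k = 200;
x = rand(1, D); y = rand(1, D);
R1 = randn(D, k); R2 = randn(D, k); R3 = randn(D, k);
u1 = x      * R1;  u2 = (x.^2) * R2;  u3 = (x.^3) * R3;
v1 = y      * R3;  v2 = (y.^2) * R2;  v3 = (y.^3) * R1;
d4_3p    = sum(x.^4) + sum(y.^4) + (1/k) * (6*(u2*v2') - 4*(u3*v1') - 4*(u1*v3'));
d4_exact = sum(abs(x - y).^4);       % for comparison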


d(4),3p = ∑_{i=1}^D x_i^4 + ∑_{i=1}^D y_i^4 + (1/k) (6 u_2^T v_2 − 4 u_3^T v_1 − 4 u_1^T v_3)

The variance is basically the sum of the three variances from normal random projections (recall the formula (m1 m2 + a^2)/k). With all sums over i = 1 to D,

Var(d(4),3p) = (36/k) [ ∑ x_i^4 ∑ y_i^4 + (∑ x_i^2 y_i^2)^2 ]

             + (16/k) [ ∑ x_i^6 ∑ y_i^2 + (∑ x_i^3 y_i)^2 ]

             + (16/k) [ ∑ x_i^2 ∑ y_i^6 + (∑ x_i y_i^3)^2 ].

Note that the highest-order term is only 6th, not 8th.


Efficient Algorithm for Estimating l4 Distance (One Projection)

Apply only one projection matrix R ∈ R^{D×k}, with i.i.d. entries r_{ij} ∼ N(0, 1), to the vectors x, y ∈ R^D, to generate six vectors:

u_{1,j} = ∑_{i=1}^D x_i r_{ij},   u_{2,j} = ∑_{i=1}^D x_i^2 r_{ij},   u_{3,j} = ∑_{i=1}^D x_i^3 r_{ij},

v_{1,j} = ∑_{i=1}^D y_i r_{ij},   v_{2,j} = ∑_{i=1}^D y_i^2 r_{ij},   v_{3,j} = ∑_{i=1}^D y_i^3 r_{ij}.

A simple unbiased estimator of d(4) = ∑_{i=1}^D |x_i − y_i|^4 is

d(4),1p = ∑_{i=1}^D x_i^4 + ∑_{i=1}^D y_i^4 + (1/k) (6 u_2^T v_2 − 4 u_3^T v_1 − 4 u_1^T v_3).
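The one-projection variant only changes how the six vectors are formed; a Matlab sketch mirroring the previous one:

% One-projection estimator of the l4 distance (a single R shared by all six vectors).
D = 5000; k = 200;
x = rand(1, D); y = rand(1, D);
R  = randn(D, k);
u1 = x * R;  u2 = (x.^2) * R;  u3 = (x.^3) * R;
v1 = y * R;  v2 = (y.^2) * R;  v3 = (y.^3) * R;
d4_1p = sum(x.^4) + sum(y.^4) + (1/k) * (6*(u2*v2') - 4*(u3*v1') - 4*(u1*v3'));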


The variance computations become more difficult because of the correlations. With all sums over i = 1 to D,

Var(d(4),1p) = Var(d(4),3p) + ∆1p

             = (36/k) [ ∑ x_i^4 ∑ y_i^4 + (∑ x_i^2 y_i^2)^2 ]

             + (16/k) [ ∑ x_i^6 ∑ y_i^2 + (∑ x_i^3 y_i)^2 ]

             + (16/k) [ ∑ x_i^2 ∑ y_i^6 + (∑ x_i y_i^3)^2 ] + ∆1p,

∆1p = − (48/k) ( ∑ x_i^5 ∑ y_i^3 + ∑ x_i^2 y_i ∑ x_i^3 y_i^2 )

      − (48/k) ( ∑ x_i^3 ∑ y_i^5 + ∑ x_i y_i^2 ∑ x_i^2 y_i^3 )

      + (32/k) ( ∑ x_i^4 ∑ y_i^4 + ∑ x_i y_i ∑ x_i^3 y_i^3 ).

When the data are positive (often the case in practice), it is usually true that ∆1p < 0, i.e., the correlation helps.


Improving Estimates Using Marginal Information

Neither d(4),3p nor d(4),1p performs well when the data are highly correlated, i.e., when x_i ≈ y_i. Recall that for normal random projections, we can use the marginal information and the MLE to improve the estimates.

For simplicity, assume that we use three projection matrices. In this case, we can estimate d(4) by d(4),3p,m, where

d(4),3p,m = ∑_{i=1}^D x_i^4 + ∑_{i=1}^D y_i^4 + 6 a_{2,2} − 4 a_{3,1} − 4 a_{1,3},

and a_{2,2}, a_{3,1}, a_{1,3} are, respectively, the solutions to the following three cubic equations:


(All sums are over i = 1 to D.)

a_{2,2}^3 − (a_{2,2}^2 / k) u_2^T v_2 − (1/k) ∑ x_i^4 ∑ y_i^4 u_2^T v_2 − a_{2,2} ( ∑ x_i^4 ∑ y_i^4 ) + (a_{2,2}/k) ( ∑ x_i^4 ‖v_2‖^2 + ∑ y_i^4 ‖u_2‖^2 ) = 0.

a_{3,1}^3 − (a_{3,1}^2 / k) u_3^T v_1 − (1/k) ∑ x_i^6 ∑ y_i^2 u_3^T v_1 − a_{3,1} ( ∑ x_i^6 ∑ y_i^2 ) + (a_{3,1}/k) ( ∑ x_i^6 ‖v_1‖^2 + ∑ y_i^2 ‖u_3‖^2 ) = 0.

a_{1,3}^3 − (a_{1,3}^2 / k) u_1^T v_3 − (1/k) ∑ x_i^2 ∑ y_i^6 u_1^T v_3 − a_{1,3} ( ∑ x_i^2 ∑ y_i^6 ) + (a_{1,3}/k) ( ∑ x_i^2 ‖v_3‖^2 + ∑ y_i^6 ‖u_1‖^2 ) = 0.
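In Matlab, each of these cubics can be solved with roots(); a sketch for a_{2,2} follows (the other two are analogous), continuing the three-projection sketch above so that x, y, u2, v2, and k are already defined. The root-selection rule below (take the real root closest to the plug-in estimate u_2^T v_2 / k) is a practical choice rather than something prescribed by the slides.

% Solve the cubic for a_{2,2}, with margins m1 = sum x^4 and m2 = sum y^4.
m1 = sum(x.^4);  m2 = sum(y.^4);
c  = u2 * v2';                                         % u_2^T v_2  (u2, v2 are 1 x k)
coef = [1, -c/k, -m1*m2 + (m1*(v2*v2') + m2*(u2*u2'))/k, -(m1*m2*c)/k];
r = roots(coef);
r = r(abs(imag(r)) < 1e-6 * (1 + abs(r)));             % keep the real roots
[~, j] = min(abs(real(r) - c/k));                      % pick the root nearest c/k
a22 = real(r(j));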


Asymptotically (as k → ∞), the variance would be (with all sums over i = 1 to D)

Var(d(4),3p,m) = 36 Var(a_{2,2}) + 16 Var(a_{3,1}) + 16 Var(a_{1,3})

= (36/k) ( ∑ x_i^4 ∑ y_i^4 − (∑ x_i^2 y_i^2)^2 )^2 / ( ∑ x_i^4 ∑ y_i^4 + (∑ x_i^2 y_i^2)^2 )

+ (16/k) ( ∑ x_i^6 ∑ y_i^2 − (∑ x_i^3 y_i)^2 )^2 / ( ∑ x_i^6 ∑ y_i^2 + (∑ x_i^3 y_i)^2 )

+ (16/k) ( ∑ x_i^2 ∑ y_i^6 − (∑ x_i y_i^3)^2 )^2 / ( ∑ x_i^2 ∑ y_i^6 + (∑ x_i y_i^3)^2 ) + O(1/k^2)

But when x_i = y_i, we do not obtain zero variance. This is disappointing. Even more disappointingly, d(4),1p,m (i.e., one projection and using margins) will not achieve zero variance when x_i = y_i either.


A Good Estimator for Highly Correlated Data (i.e., x_i ≈ y_i)

Instead of using the exact values, we can also estimate ∑_{i=1}^D x_i^4 and ∑_{i=1}^D y_i^4. This is counter-intuitive, but it actually works well when x_i ≈ y_i; it is a nice example of utilizing noise cancellation:

d(4),1p,I = (1/k) ( 4 u_1^T u_3 − 3 u_2^T u_2 + 4 v_1^T v_3 − 3 v_2^T v_2 + 6 u_2^T v_2 − 4 u_3^T v_1 − 4 u_1^T v_3 ).

One can see that when x = y (i.e., u = v), d(4),1p,I = 0 always. The variance analysis, though, requires good patience, because there are many cross-terms.
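A Matlab sketch of this estimator, continuing the one-projection setup above (u1, u2, u3, v1, v2, v3 all built from a single R); with the form reconstructed above, the estimate is exactly zero whenever x equals y.

% Correlation-exploiting estimator: every term is estimated, so the noise
% cancels when x and y coincide (d4_1p_I is exactly 0 when x == y).
d4_1p_I = (1/k) * ( 4*(u1*u3') + 4*(v1*v3') - 3*(u2*u2') - 3*(v2*v2') ...
                    + 6*(u2*v2') - 4*(u3*v1') - 4*(u1*v3') );

Equivalently, d4_1p_I = (1/k) * ( 4*(u1-v1)*(u3-v3)' - 3*(u2-v2)*(u2-v2)' ), which makes the cancellation at x = y explicit.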


Var(d(4),1p,I) = Var(d(4),1p) + ∆I,

with (all sums over i = 1 to D)

∆I = (36/k) (∑ x_i^2 y_i^2)^2 + (34/k) [ (∑ x_i^4)^2 + (∑ y_i^4)^2 ] + (32/k) ∑ x_i y_i^3 ∑ x_i^3 y_i

− (72/k) [ ∑ x_i^4 ∑ x_i^2 y_i^2 + ∑ y_i^4 ∑ x_i^2 y_i^2 ]

− (32/k) [ ∑ x_i^4 ∑ x_i^3 y_i + ∑ y_i^4 ∑ x_i^3 y_i + ∑ x_i^4 ∑ x_i y_i^3 + ∑ y_i^4 ∑ x_i y_i^3 ]

− (48/k) [ ∑ x_i^3 ∑ x_i^5 + ∑ x_i^2 y_i ∑ x_i^2 y_i^3 ]

− (48/k) [ ∑ y_i^3 ∑ y_i^5 + ∑ x_i y_i^2 ∑ x_i^3 y_i^2 ]

+ (48/k) [ ∑ x_i^3 ∑ x_i^3 y_i^2 + ∑ x_i^5 ∑ x_i y_i^2 + ∑ y_i^3 ∑ x_i^2 y_i^3 + ∑ x_i^3 ∑ x_i^2 y_i^3 ]

+ (48/k) [ ∑ y_i^3 ∑ x_i^3 y_i^2 + ∑ y_i^5 ∑ x_i^2 y_i + ∑ x_i^5 ∑ x_i^2 y_i + ∑ y_i^5 ∑ x_i y_i^2 ]

− (32/k) [ ∑ x_i^6 ∑ x_i y_i + ∑ y_i^2 ∑ x_i^3 y_i^3 + ∑ x_i^2 ∑ x_i^3 y_i^3 + ∑ y_i^6 ∑ x_i y_i ]

+ (16/k) [ ∑ x_i^2 ∑ x_i^6 + ∑ y_i^2 ∑ y_i^6 + 2 ∑ x_i y_i ∑ x_i^3 y_i^3 ].


Comparisons with Random Sampling and CRS

[Figures: normalized MSE versus k for the word pairs HONG-KONG and OF-AND. For each pair, one panel compares random sampling and CRS (empirical versus theoretical), and the other compares the one-projection estimators d(4),1p, d(4),1p,m, and d(4),1p,I against the theoretical MSE.]

• Random sampling has very large errors. CRS helps significantly.

• Even CRS is substantially less accurate than our random projection method

for this task. Note that lower MSE is better on the figures.

• For these two cases, our d(4),1p,I is substantially better than our d(4),1p

estimator. In fact, for HONG-KONG, random sampling is even better than

d(4),1p. This is because OF-AND and HONG-KONG are two highly

correlated pairs, especially HONG-KONG.


Nearest Neighbor Classification

We applied random projections to estimate the l4 distances for m-nearest neighbor classification. We compared the average classification error rates, together with the errors obtained using the original l4 distances (horizontal lines).

[Figures: mean classification error rate (%) versus k for m = 1, 10, 20 nearest neighbors on the Gisette and Realsim datasets, comparing the estimators with and without margins against the error using the original l4 distances (horizontal lines).]

With more than about k = 500 projections, using the projected data achieved accuracies similar to (and in some cases even better than) using the original data.


Conclusion

• (Symmetric) Stable Random Projection (SRP) is a very effective technique for

efficiently computing the lp distances with 0 < p ≤ 2.

• Depending on the dataset, the optimal p varies. Interestingly, the optimal p is often larger than 2, for which approximating the lp distance is a difficult task.

• Estimating the lp distances for 0 < p ≤ 2 is by now (almost) a standard technique.

• There is a recent method to efficiently compute the lp distances for p = 4, 6, 8, ...: a very simple algorithm with a fairly complicated analysis.

• Main open problems: (i) to develop methods for p = 3, 5, 7, ..., especially p = 3; (ii) to extend the methods to data streams.