
Sub-Gaussian Estimators of the Mean of a Random Matrix with Entries Possessing Only Two Moments

Stas Minsker, University of Southern California

July 21, 2016

ICERM Workshop

Simple question: how to estimate the mean?

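Assume that $X_1, \dots, X_n$ are i.i.d. $N(\mu, \sigma_0^2)$.

Problem: construct $\mathrm{CI}_{\mathrm{norm}}(\alpha)$ for $\mu$ with coverage probability $\ge 1 - 2\alpha$.

Solution: compute $\hat\mu_n := \frac{1}{n}\sum_{j=1}^n X_j$, take

$$\mathrm{CI}_{\mathrm{norm}}(\alpha) = \left[\hat\mu_n - \sigma_0\sqrt{2}\sqrt{\frac{\log(1/\alpha)}{n}},\ \ \hat\mu_n + \sigma_0\sqrt{2}\sqrt{\frac{\log(1/\alpha)}{n}}\right]$$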

Coverage is guaranteed since

$$\Pr\left(\left|\hat\mu_n - \mu\right| \ge \sigma_0\sqrt{\frac{2\log(1/\alpha)}{n}}\right) \le 2\alpha.$$
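The Gaussian interval can be sketched in a few lines of Python. This is only an illustration: the function name `ci_norm`, the seed and the sample parameters are assumptions made for the example, not part of the talk.

```python
import numpy as np

def ci_norm(x, sigma0, alpha):
    """Interval [mean - r, mean + r] with r = sigma0 * sqrt(2 * log(1/alpha) / n)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    radius = sigma0 * np.sqrt(2.0 * np.log(1.0 / alpha) / n)
    center = x.mean()  # the sample mean
    return center - radius, center + radius

# Illustrative data: the seed and the parameters are arbitrary choices.
rng = np.random.default_rng(0)
sample = rng.normal(loc=1.0, scale=2.0, size=1000)
lo, hi = ci_norm(sample, sigma0=2.0, alpha=0.05)
print(f"CI_norm(0.05) = [{lo:.3f}, {hi:.3f}]")
```

The radius depends on $\alpha$ only through $\sqrt{\log(1/\alpha)}$, which is the benchmark the rest of the talk tries to match without the normality assumption.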

Example: how to estimate the mean?

P. J. Huber (1964): "...This raises a question which could have been asked already by Gauss, but which was, as far as I know, only raised a few years ago (notably by Tukey): what happens if the true distribution deviates slightly from the assumed normal one?"

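Going back to our question: what if $X_1, \dots, X_n$ are i.i.d. copies of $X \sim \Pi$ such that

$$EX = \mu, \quad \mathrm{Var}(X) \le \sigma_0^2?$$

Problem: construct $\mathrm{CI}$ for $\mu$ with coverage probability $\ge 1 - \alpha$ such that for any $\alpha$

$$\mathrm{length}(\mathrm{CI}(\alpha)) \le (\text{Absolute constant}) \cdot \mathrm{length}(\mathrm{CI}_{\mathrm{norm}}(\alpha))$$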

No additional assumptions on Π are imposed.

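Remark: guarantees for the sample mean $\hat\mu_n = \frac{1}{n}\sum_{j=1}^n X_j$ are unsatisfactory. Under the variance assumption alone, Chebyshev's inequality only gives

$$\Pr\left(\left|\hat\mu_n - \mu\right| \ge \sigma_0\sqrt{\frac{1/\alpha}{n}}\right) \le \alpha,$$

so the deviation scales as $\sqrt{1/\alpha}$ rather than $\sqrt{\log(1/\alpha)}$.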
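To see how much is lost, the two radii can be compared numerically. The sketch below is an illustration only; the helper names are hypothetical, and the two bounds are stated at slightly different levels ($\alpha$ for Chebyshev, $2\alpha$ for the Gaussian tail), which does not change the picture.

```python
import math

def chebyshev_radius(sigma0, n, alpha):
    """Radius sigma0 * sqrt((1/alpha) / n) from Chebyshev's inequality."""
    return sigma0 * math.sqrt((1.0 / alpha) / n)

def gaussian_radius(sigma0, n, alpha):
    """Radius sigma0 * sqrt(2 * log(1/alpha) / n) from the Gaussian tail bound."""
    return sigma0 * math.sqrt(2.0 * math.log(1.0 / alpha) / n)

# The ratio does not depend on n or sigma0; it grows quickly as alpha shrinks.
for alpha in (1e-1, 1e-2, 1e-4):
    ratio = chebyshev_radius(1.0, 100, alpha) / gaussian_radius(1.0, 100, alpha)
    print(f"alpha={alpha:g}: Chebyshev / Gaussian radius = {ratio:.2f}")
```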

Does the solution exist?


Answer (somewhat unexpected?): Yes!

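Construction: [A. Nemirovski, D. Yudin '83; N. Alon, Y. Matias, M. Szegedy '96; R. Oliveira, M. Lerasle '11]

Split the sample into $k = \lfloor \log(1/\alpha) \rfloor + 1$ groups $G_1, \dots, G_k$ of size $\simeq n/k$ each, and average within each group:

$$\underbrace{X_1, \dots, X_{|G_1|}}_{G_1} \ \mapsto\ \hat\mu_1 := \frac{1}{|G_1|}\sum_{X_i \in G_1} X_i, \qquad \dots, \qquad \underbrace{X_{n-|G_k|+1}, \dots, X_n}_{G_k} \ \mapsto\ \hat\mu_k := \frac{1}{|G_k|}\sum_{X_i \in G_k} X_i$$

$$\hat\mu^* = \hat\mu^*(\alpha) := \mathrm{median}(\hat\mu_1, \dots, \hat\mu_k)$$

Claim:

$$\Pr\left(|\hat\mu^* - \mu| \ge 7.7\,\sigma_0\sqrt{\frac{\log(e/\alpha)}{n}}\right) \le \alpha$$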


Then take

$$\mathrm{CI}(\alpha) = \left[\hat\mu^* - 7.7\,\sigma_0\sqrt{\frac{\log(e/\alpha)}{n}},\ \ \hat\mu^* + 7.7\,\sigma_0\sqrt{\frac{\log(e/\alpha)}{n}}\right]$$
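A minimal sketch of the median-of-means construction and the resulting interval follows. The function names are hypothetical, consecutive blocks are used for the groups, and the cap on the number of groups is an added safeguard rather than something stated on the slide.

```python
import math
import numpy as np

def median_of_means(x, alpha):
    """Median of the means of k = floor(log(1/alpha)) + 1 consecutive groups."""
    x = np.asarray(x, dtype=float)
    k = math.floor(math.log(1.0 / alpha)) + 1
    k = min(k, x.size)                 # safeguard (assumption): at most one group per point
    groups = np.array_split(x, k)      # consecutive groups of size about n / k
    group_means = np.array([g.mean() for g in groups])
    return float(np.median(group_means))

def mom_confidence_interval(x, sigma0, alpha):
    """Interval centered at the median of means with radius 7.7 * sigma0 * sqrt(log(e/alpha) / n)."""
    n = len(x)
    center = median_of_means(x, alpha)
    radius = 7.7 * sigma0 * math.sqrt(math.log(math.e / alpha) / n)
    return center - radius, center + radius
```

For example, with `alpha = 0.1` the sample is split into three groups, so a single group contaminated by large values cannot move the median of the group means.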

Idea of the proof:

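[Figure: the group means $\hat\mu_1, \dots, \hat\mu_k$ and the true mean $\mu$ marked on the real line.]

$|\hat\mu^* - \mu| \ge s \implies$ at least half of the events $\{|\hat\mu_j - \mu| \ge s\}$ occur. The groups are disjoint, so these events are independent, and each one has small probability by Chebyshev's inequality.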

Improve the constant?

O. Catoni's estimator (2012), "Generalized truncation": let $\theta > 0$, let $\psi$ satisfy

$$-\log(1 - x + x^2/2) \le \psi(x) \le \log(1 + x + x^2/2),$$

and define $\hat\mu$ via

$$\sum_{j=1}^n \psi\big(\theta(X_j - \hat\mu)\big) = 0.$$



Truncation $\tau(x) = (|x| \wedge 1)\,\mathrm{sign}(x)$ satisfies a weaker inequality

$$-\log(1 - x + x^2) \le \tau(x) \le \log(1 + x + x^2)$$




Intuition: for small $\theta > 0$,

$$\sum_{j=1}^n \psi\big(\theta(X_j - \hat\mu)\big) \simeq \sum_{j=1}^n \theta(X_j - \hat\mu) = 0 \implies \hat\mu \simeq \frac{1}{n}\sum_{j=1}^n X_j$$



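The following holds: set $\theta_* = \sqrt{\frac{2\log(1/\alpha)}{n}}\,\frac{1}{\sigma_0}$. Then

$$|\hat\mu - \mu| \le \big(\sqrt{2} + o(1)\big)\,\sigma_0\sqrt{\frac{\log(1/\alpha)}{n}}$$

with probability $\ge 1 - 2\alpha$.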
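The estimating equation can be solved numerically. The sketch below makes two assumptions that are not on the slides: it uses the particular influence function $\psi(x) = \mathrm{sign}(x)\log(1 + |x| + x^2/2)$, which satisfies the two-sided inequality above, and it finds the root by bisection, which works because the left-hand side is monotone in $\hat\mu$ for this $\psi$.

```python
import numpy as np

def psi(x):
    """Assumed choice of psi: sign(x) * log(1 + |x| + x^2 / 2)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x**2)

def catoni_mean(x, sigma0, alpha, tol=1e-10, max_iter=200):
    """Root of sum_j psi(theta * (x_j - m)) = 0 with theta = sqrt(2 log(1/alpha) / n) / sigma0."""
    x = np.asarray(x, dtype=float)
    n = x.size
    theta = np.sqrt(2.0 * np.log(1.0 / alpha) / n) / sigma0
    # g(m) = sum_j psi(theta * (x_j - m)) is decreasing in m,
    # with g(min x) >= 0 >= g(max x), so a root lies in [min x, max x].
    lo, hi = x.min(), x.max()
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if psi(theta * (x - mid)).sum() > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

On a sample of 99 zeros and one observation equal to 1000, the sample mean is 10, while this estimator stays close to 0 because $\psi$ grows only logarithmically.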

Extensions to higher dimensions

A natural question: is it possible to extend the presented techniques to the multivariate mean?

Motivation: PCA

Genes mirror geography within Europe, J. Novembre et al., Nature 2008.

Mathematical framework:

$$Y_1, \dots, Y_n \in \mathbb{R}^d, \ \text{i.i.d.}, \quad EY_j = 0, \quad EY_jY_j^T = \Sigma.$$

Goal: construct $\hat\Sigma$, an estimator of $\Sigma$ such that $\big\|\hat\Sigma - \Sigma\big\|_{\mathrm{Op}}$ is small.

Sample covariance

$$\widetilde\Sigma_n = \frac{1}{n}\sum_{j=1}^n Y_jY_j^T$$

is very sensitive to outliers.



[Figure: page excerpt from J. Novembre et al., Nature, Vol. 456, 6 November 2008, showing "Figure 1 | Population structure within Europe": a statistical summary of genetic data from 1,387 Europeans based on principal component axis one (PC1) and axis two (PC2), with the PC axes rotated to emphasize the similarity to the geographic map of Europe.]

Good explanation for non-experts: https://faculty.washington.edu/tathornt/SISG2015/lectures/assoc2015session05.pdf







Naive approach: apply the "median trick" (or Catoni's estimator) coordinatewise. This makes the bound dimension-dependent.

Better approach: replace the usual median by the geometric median,

$$x_* = \mathrm{med}(x_1, \dots, x_k) := \operatorname{argmin}_{y \in \mathbb{R}^d} \sum_{j=1}^k \|y - x_j\|.$$

Still some issues:
1. does not work well for small sample sizes;
2. yields bounds in the wrong norm.

Alternatives: Tyler's M-estimator, Maronna's M-estimator; guarantees are limited to special classes of distributions.
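The geometric median has no closed form, but it can be approximated iteratively. The sketch below uses Weiszfeld's algorithm, a standard choice that the slides do not mention; the function name, iteration count and tolerances are assumptions made for illustration.

```python
import numpy as np

def geometric_median(points, n_iter=200, eps=1e-12, tol=1e-10):
    """Approximate argmin_y sum_j ||y - x_j|| by Weiszfeld's iteration."""
    pts = np.asarray(points, dtype=float)        # shape (k, d), one point per row
    y = pts.mean(axis=0)                         # start from the coordinatewise mean
    for _ in range(n_iter):
        dist = np.linalg.norm(pts - y, axis=1)
        w = 1.0 / np.maximum(dist, eps)          # guard against division by zero
        y_new = (w[:, None] * pts).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y
```

In the median-of-means setting, the rows of `points` would be the group means computed on disjoint parts of the sample.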

Matrix functions

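$f : \mathbb{R} \mapsto \mathbb{R}$, $A = A^T = U\Lambda U^T$, then

$$f(A) = U f(\Lambda) U^T, \qquad f(\Lambda) = f\begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_d \end{pmatrix} = \begin{pmatrix} f(\lambda_1) & & \\ & \ddots & \\ & & f(\lambda_d) \end{pmatrix}$$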
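In code, this definition amounts to an eigendecomposition followed by applying $f$ to the eigenvalues. A minimal sketch (the name `matrix_function` is hypothetical, and $f$ is assumed to act elementwise on a NumPy array):

```python
import numpy as np

def matrix_function(f, A):
    """Return f(A) = U f(Lambda) U^T for a symmetric matrix A."""
    A = np.asarray(A, dtype=float)
    eigvals, U = np.linalg.eigh(A)       # A = U diag(eigvals) U^T
    return (U * f(eigvals)) @ U.T        # scale the columns of U by f(eigvals)

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
A_squared = matrix_function(np.square, A)   # agrees with A @ A
```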

Construction of the estimator

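$X \in \mathbb{R}^{d \times d}$ is a symmetric random matrix, $X_1, \dots, X_n \in \mathbb{R}^{d \times d}$ are i.i.d. copies of $X$, $E\|X\|_F^2 < \infty$.

No additional assumptions.

$-\log(1 - x + x^2/2) \le \psi(x) \le \log(1 + x + x^2/2)$, $\theta > 0$, define

$$\hat\Sigma_n = \frac{1}{n\theta}\sum_{j=1}^n \psi(\theta X_j)$$

For example, if $X_j = Y_jY_j^T$, we get

$$\hat\Sigma_n = \frac{1}{n\theta}\sum_{j=1}^n \psi\big(\theta Y_jY_j^T\big)$$

Intuition: for small $\theta$, $\psi(\theta x) \simeq \theta x$, hence

$$\hat\Sigma_n \simeq \text{Sample mean} + o(\theta)$$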





Note that

$$\psi\big(\theta Y_jY_j^T\big) = \psi\big(\theta\|Y_j\|_2^2\big)\,\frac{Y_j}{\|Y_j\|_2}\,\frac{Y_j^T}{\|Y_j\|_2}$$

is easy to compute.
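Because of this rank-one identity, the covariance estimator needs no eigendecompositions: each $Y_jY_j^T$ is simply reweighted by $\psi(\theta\|Y_j\|_2^2) / (\theta\|Y_j\|_2^2)$. The sketch below assumes the specific $\psi(x) = \mathrm{sign}(x)\log(1 + |x| + x^2/2)$, which is one function satisfying the stated inequality; the function names are hypothetical.

```python
import numpy as np

def psi(x):
    """Assumed choice of psi: sign(x) * log(1 + |x| + x^2 / 2)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x**2)

def robust_covariance(Y, theta):
    """(1 / (n * theta)) * sum_j psi(theta * Y_j Y_j^T), via the rank-one identity."""
    Y = np.asarray(Y, dtype=float)               # shape (n, d), rows are the Y_j
    n = Y.shape[0]
    sq_norms = np.einsum("ij,ij->i", Y, Y)       # ||Y_j||_2^2
    safe = np.where(sq_norms > 0, sq_norms, 1.0) # avoid 0 / 0 when Y_j = 0
    weights = psi(theta * sq_norms) / (theta * safe)
    return (Y * weights[:, None]).T @ Y / n
```

Observations with a large norm receive a weight much smaller than 1, while observations with $\theta\|Y_j\|_2^2$ close to 0 receive a weight close to 1, as in the sample covariance.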


Theorem (M., 2016)

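$X_1, \dots, X_n$ are i.i.d. Assume that $\sigma^2 \ge \|EX^2\|$. Let $\theta = \sqrt{\frac{2\log(d/\alpha)}{n}}\,\frac{1}{\sigma}$, then

$$\big\|\hat\Sigma_n - EX\big\| \le \sigma\sqrt{\frac{2\log(d/\alpha)}{n}}$$

For example, in covariance estimation $\sigma^2 = \big\|E\,\|Y\|_2^2\,YY^T\big\|$.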

Compare to:

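Theorem (Matrix Bernstein inequality, Tropp '11)

$X, X_1, \dots, X_n \in \mathbb{R}^{d \times d}$ are i.i.d., $\sigma_0^2 = \big\|E(X - EX)^2\big\|$, $\|X\| \le M$. Then for all $0 < \alpha < 1$,

$$\Big\|\frac{1}{n}\sum_{j=1}^n X_j - EX\Big\| \le \max\left(2\sigma_0\sqrt{\frac{\log(d/\alpha)}{n}},\ \frac{4}{3}\,\frac{M\log(d/\alpha)}{n}\right)$$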


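Further improvements: shift the data, $X_j \mapsto X_j - S$, and set

$$\hat\Sigma(S) = S + \underbrace{\frac{1}{n\theta}\sum_{j=1}^n \psi\big(\theta(X_j - S)\big)}_{\simeq\, EX - S}.$$

"Ideal choice" $S = EX$ is unavailable $\implies$ use the initial estimator $\hat\Sigma_n$ in place of $S$.

Iterate... The limit $S_\infty$ is a fixed point:

$$S_\infty = S_\infty + \underbrace{\frac{1}{n\theta}\sum_{j=1}^n \psi\big(\theta(X_j - S_\infty)\big)}_{=\,0}$$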
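The iteration can be sketched as a plain fixed-point loop. Several things here are assumptions rather than the author's procedure: the specific $\psi$, the fixed number of iterations in place of a stopping rule, and the names `psi_matrix` and `shifted_estimator`. The slides do not say how the iteration is run in practice or when it converges.

```python
import numpy as np

def psi(x):
    """Assumed choice of psi: sign(x) * log(1 + |x| + x^2 / 2)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x**2)

def psi_matrix(A):
    """Apply psi to a symmetric matrix through its eigenvalues."""
    lam, U = np.linalg.eigh(A)
    return (U * psi(lam)) @ U.T

def shifted_estimator(X, theta, n_iter=50):
    """Iterate S <- S + (1 / (n * theta)) * sum_j psi(theta * (X_j - S))."""
    X = np.asarray(X, dtype=float)       # shape (n, d, d), each X[j] symmetric
    n = X.shape[0]
    # Initial estimator: the unshifted version, i.e. S = 0 in the update.
    S = sum(psi_matrix(theta * Xj) for Xj in X) / (n * theta)
    for _ in range(n_iter):
        correction = sum(psi_matrix(theta * (Xj - S)) for Xj in X) / (n * theta)
        S = S + correction
    return S
```

Each step costs one eigendecomposition per observation, so this is noticeably more expensive than the rank-one shortcut available for the unshifted covariance estimator.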

Theorem (M., 2016)

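Assume that $\sigma_0^2 \ge \|E(X - EX)^2\|$. Let $\theta = \sqrt{\frac{2\log(d/\alpha)}{n}}\,\frac{1}{\sigma_0}$, and

$$\frac{1}{n\theta}\sum_{j=1}^n \psi\big(\theta(X_j - S_\infty)\big) = 0.$$

Assume that $n$ is large enough ($n \gtrsim d^3$). Then $S_\infty$ exists and

$$\big\|S_\infty - EX\big\| \le C\sigma_0\sqrt{\frac{\log(d/\alpha)}{n}}$$

with probability $\ge 1 - \alpha$.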

Numerical results

$Y_1, \dots, Y_n \in \mathbb{R}^{100}$,

$$\Sigma = \mathrm{diag}(10, 5, 1, \dots, 1) \in \mathbb{R}^{100 \times 100}$$

$Y_{i,j} \sim$ symmetric Pareto-type distribution with 4 moments.

Numerical results

Histograms over 500 replications: n = 100.

[Figure: histograms (frequency versus error) comparing the sample covariance estimator, with error $\|S_n - \Sigma\| / \|\Sigma\|$, and the robust covariance estimator, with error $\|\hat\Sigma_n - \Sigma\| / \|\Sigma\|$.]

Numerical results


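[Figure: histograms (frequency versus error) of the leading-eigenvector errors $\|u_1(\hat\Sigma_n)u_1(\hat\Sigma_n)^T - u_1(\Sigma)u_1(\Sigma)^T\|$ for the robust estimator and $\|u_1(S_n)u_1(S_n)^T - u_1(\Sigma)u_1(\Sigma)^T\|$ for the sample covariance.]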

Numerical results


[Figure: histograms (frequency versus error) of the robust estimator error $\|\hat\Sigma_n - \Sigma\| / \|\Sigma\|$ and the sample covariance error $\|S_n - \Sigma\| / \|\Sigma\|$.]

Numerical results


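[Figure: histograms (frequency versus error) of the leading-eigenvector errors $\|u_1(\hat\Sigma_n)u_1(\hat\Sigma_n)^T - u_1(\Sigma)u_1(\Sigma)^T\|$ for the robust estimator and $\|u_1(S_n)u_1(S_n)^T - u_1(\Sigma)u_1(\Sigma)^T\|$ for the sample covariance.]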

Matrix Completion

Observe some entries of the ratings matrix

A0 =

            movie 1   movie 2   ...   movie n
  user 1       *         *      ...      *
  ...         ...       ...     ...     ...
  user k       *         *      ...      *

Question: can we predict the unobserved entries?

Matrix Completion

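$$\mathcal{X} = \big\{ e_j(d)\,e_k^T(d),\ 1 \le j \le d,\ 1 \le k \le d \big\}.$$

$X_1, \dots, X_n$ is an independent sample from $\Pi := \mathrm{Unif}(\mathcal{X})$, and the observations $Y_j$, $j = 1, \dots, n$, have the form

$$Y_j = \mathrm{tr}(X_j^T A_0) + \xi_j \quad \text{("noisy matrix entry")},$$

where $\xi_j$, $j = 1, \dots, n$, is additive noise.

$E(YX) = \frac{1}{d^2}A_0$, hence a natural estimator of $A_0$ is

$$\hat A = \frac{d^2}{n}\sum_{j=1}^n Y_jX_j.$$

Incorporate low rank assumption:

$$\hat A_\tau = \operatorname{argmin}_{A \in \mathbb{R}^{d \times d}} \left[\frac{\|A - \hat A\|_F^2}{d^2} + \tau\|A\|_1\right]$$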
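If $\|A\|_1$ is read as the nuclear norm (the sum of singular values), which is an assumption suggested by the low-rank motivation, the penalized problem has a closed-form solution: soft-threshold the singular values of $\hat A$ at level $\tau d^2 / 2$. A minimal sketch with hypothetical function names:

```python
import numpy as np

def soft_threshold_singular_values(B, level):
    """Shrink every singular value of B by `level`, truncating at zero."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - level, 0.0)) @ Vt

def penalized_estimator(A_hat, tau):
    """Minimizer of ||A - A_hat||_F^2 / d^2 + tau * ||A||_1 (nuclear norm assumed)."""
    d = A_hat.shape[0]
    # Setting the derivative in each singular value to zero gives
    # (2 / d^2) * (a - s) + tau = 0, i.e. a = s - tau * d^2 / 2, truncated at zero.
    return soft_threshold_singular_values(A_hat, tau * d**2 / 2.0)
```

Singular values below the threshold are set to zero, which is how the penalty produces a low-rank estimate.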

Matrix completion

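What if noise $\xi_j$ is heavy-tailed (only $\mathrm{Var}(\xi_j) < \infty$)?

Replace $\hat A$ with a "robust" estimator

$$\hat R = \frac{d^2}{n\theta}\sum_{j=1}^n \psi\big(\theta Y_j H(X_j)\big)$$

and

$$\hat R_\tau = \operatorname{argmin}_{A \in \mathbb{R}^{d \times d}} \left[\frac{\|A - \hat R\|_F^2}{d^2} + \tau\|A\|_1\right].$$

Here, $H(X) = \begin{pmatrix} 0 & X \\ X^T & 0 \end{pmatrix}$ is the so-called self-adjoint dilation.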
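The dilation turns an arbitrary matrix into a symmetric one, so that matrix functions such as $\psi$ can be applied through the eigendecomposition. A short sketch (the function name is hypothetical):

```python
import numpy as np

def self_adjoint_dilation(X):
    """Return the symmetric block matrix [[0, X], [X^T, 0]]."""
    X = np.asarray(X, dtype=float)
    d1, d2 = X.shape
    return np.block([[np.zeros((d1, d1)), X],
                     [X.T, np.zeros((d2, d2))]])

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
H = self_adjoint_dilation(X)   # 4 x 4 symmetric matrix
```

The nonzero eigenvalues of $H(X)$ are plus and minus the singular values of $X$, so the operator norm of $H(X)$ equals that of $X$.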


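Theorem (M., 2016)

Take

$$\tau = \mathrm{Const} \cdot \sqrt{\frac{t + \log 2d}{nd}},$$

then

$$\frac{1}{d^2}\big\|\hat R_\tau - H(A_0)\big\|_F^2 \le \left(\frac{1 + \sqrt{2}}{2}\right)^2 \frac{d \cdot 2\,\mathrm{rank}(A_0)}{n}\,\sqrt{t + \log 2d}$$

with probability $\ge 1 - e^{-t}$.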

Thank you for your attention!
