Statistical Decision Theory and Information Constraints
Michael I. Jordan, University of California, Berkeley
November 20, 2014

Page 1:

Statistical Decision Theory and Information Constraints

Michael I. Jordan University of California, Berkeley

November 20, 2014

Page 2:

What Is the Big Data Phenomenon?

•  Science in confirmatory mode (e.g., particle physics)
•  Science in exploratory mode (e.g., astronomy, genomics)
•  Measurement of human activity, particularly online activity, is generating massive datasets that can be used (e.g.) for personalization and for creating markets
•  Sensor networks are becoming pervasive

Pages 3-7:

What Are the Conceptual/Mathematical Issues?

•  The need to control statistical risk under constraints on algorithmic runtime
   –  how do risk and runtime trade off as a function of the amount of data?
•  Statistical inference with distributed and streaming data
   –  how is inferential quality impacted by communication constraints?
•  The tradeoff between statistical risk and privacy (and other externalities)
•  Many other issues that require a blend of statistical thinking (e.g., a focus on sampling, confidence intervals, evaluation, diagnostics, causal inference) and computational thinking (e.g., scalability, abstraction)

Pages 8-9:

Our Approach

•  Take (classical) statistical decision theory as a mathematical point of departure
•  Treat computation, communication, privacy, etc. as constraints on statistical risk
•  This induces tradeoffs among these quantities and the number of data points
•  Under the hood: geometry, information theory, and optimization

Page 10:

Background

•  In the 1930s, Wald laid the foundations of statistical decision theory
•  Given a family of probability distributions $\mathcal{P}$, a parameter $\theta(P)$ for each $P \in \mathcal{P}$, an estimator $\hat{\theta}$, and a loss $l(\hat{\theta}, \theta(P))$, define the risk:

   $R_P(\hat{\theta}) := \mathbb{E}_P\big[ l(\hat{\theta}, \theta(P)) \big]$

•  Minimax principle [Wald '39, '43]: choose the estimator minimizing the worst-case risk:

   $\sup_{P \in \mathcal{P}} \mathbb{E}_P\big[ l(\hat{\theta}, \theta(P)) \big]$

Page 11:

Part I: Privacy and Minimax Risk

with John Duchi and Martin Wainwright, University of California, Berkeley

Page 12:

Privacy and Risk

•  Individuals are not generally willing to allow their personal data to be used without control on how it will be used and how much privacy loss they will incur
•  We will quantify "privacy loss" via differential privacy
•  We then treat differential privacy as a constraint on inference via statistical decision theory
•  This yields (personal) tradeoffs between privacy loss and inferential gain

Pages 13-19:

A Model of Privacy

Local privacy: providers do not trust the collector [Warner 65; Evfimievski et al. 03]

[Figure: individuals hold private data $X_1, X_2, \dots, X_n$ with $X_i \overset{\text{iid}}{\sim} P$, $i \in \{1, \dots, n\}$; each $X_i$ passes through a private channel $Q(\cdot \mid X_i)$ to produce $Z_i$, and the estimator sees only the privatized messages, $Z_1^n \mapsto \hat{\theta}(Z_1^n)$.]

Pages 20-21:

Definitions of Privacy

Definition: a channel $Q$ is $\alpha$-differentially private if

   $\sup_{S,\; x \in \mathcal{X},\; x' \in \mathcal{X}} \frac{Q(Z \in S \mid x)}{Q(Z \in S \mid x')} \le \exp(\alpha)$

equivalently, $\log Q(z \mid x) - \log Q(z \mid x')$ is bounded by $\alpha$ uniformly in $z, x, x'$.

[Dwork, McSherry, Nissim, Smith 06]
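As a concrete illustration (mine, not from the slides), binary randomized response is the canonical channel satisfying this definition: with truth probability $e^\alpha/(1+e^\alpha)$, the worst-case likelihood ratio is exactly $e^\alpha$. A minimal sketch:

```python
import math
import random

def randomized_response(x: int, alpha: float) -> int:
    """Privately report a bit x in {0, 1}: tell the truth with
    probability e^alpha / (1 + e^alpha), otherwise flip the bit."""
    p_truth = math.exp(alpha) / (1 + math.exp(alpha))
    return x if random.random() < p_truth else 1 - x

def privacy_ratio(alpha: float) -> float:
    """Worst-case ratio Q(z | x) / Q(z | x') over z, x, x' for the
    channel above; alpha-differential privacy needs this <= e^alpha."""
    p = math.exp(alpha) / (1 + math.exp(alpha))
    return max(p, 1 - p) / min(p, 1 - p)
```

Here `privacy_ratio(alpha)` equals $e^\alpha$ exactly, so this channel meets the constraint with equality.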

Pages 22-25:

Private Minimax Risk

Central object of study: the minimax risk.

•  Parameter of the distribution: $\theta(P)$
•  Family of distributions: $\mathcal{P}$
•  Loss measuring error: $\ell$
•  Family of $\alpha$-private channels: $\mathcal{Q}_\alpha$

Minimax risk:

   $\mathfrak{M}_n(\theta(\mathcal{P}), \ell) := \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\big[ \ell(\hat{\theta}(X_1^n), \theta(P)) \big]$

$\alpha$-private minimax risk (the outer infimum is over the best $\alpha$-private channel, so this is the minimax risk under the privacy constraint):

   $\mathfrak{M}_n(\theta(\mathcal{P}), \ell, \alpha) := \inf_{Q \in \mathcal{Q}_\alpha} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P,Q}\big[ \ell(\hat{\theta}(Z_1^n), \theta(P)) \big]$

Pages 26-27:

Vignette: Private Mean (Location) Estimation

Example: estimate reasons for hospital visits. Patients admitted to hospital for substance abuse; estimate the prevalence of different substances. Each record is a binary indicator vector, and the targets are the population proportions:

   Substance       Record   Proportion
   Alcohol         1        $\theta_1 = .45$
   Cocaine         1        $\theta_2 = .32$
   Heroin          0        $\theta_3 = .16$
   Cannabis        0        $\theta_4 = .20$
   LSD             0        $\theta_5 = .00$
   Amphetamines    0        $\theta_6 = .02$

Pages 28-31:

Vignette: Mean Estimation

Consider estimation of the mean $\theta(P) := \mathbb{E}_P[X] \in \mathbb{R}^d$, with errors measured in $\ell_\infty$-norm, i.e. $\mathbb{E}[\|\hat{\theta} - \theta\|_\infty]$, for

   $\mathcal{P}_d := \{ \text{distributions } P \text{ supported on } [-1, 1]^d \}$

Proposition (minimax rate; achieved by the sample mean):

   $\mathfrak{M}_n(\mathcal{P}_d, \|\cdot\|_\infty) \asymp \min\Big\{ 1, \frac{\sqrt{\log d}}{\sqrt{n}} \Big\}$

Proposition (private minimax rate for $\alpha = O(1)$):

   $\mathfrak{M}_n(\mathcal{P}_d, \|\cdot\|_\infty, \alpha) \asymp \min\Big\{ 1, \frac{\sqrt{d \log d}}{\sqrt{n \alpha^2}} \Big\}$

Note: the effective sample size is $n \mapsto n\alpha^2 / d$.

Pages 32-34:

Optimal Mechanism?

Non-private observation:

   $X = (1, 0, 1, 0, 0)^\top$

Idea 1: add independent noise (e.g. the standard Laplace mechanism):

   $Z = X + W = (1 + W_1,\ 0 + W_2,\ 1 + W_3,\ 0 + W_4,\ 0 + W_5)^\top$

Problem: the noise magnitude is much too large (and this is unavoidable: the mechanism is provably sub-optimal).
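To make the magnitude problem concrete, here is a sketch (mine, not the talk's) of the Laplace mechanism applied to a vector in $[0,1]^d$: the $\ell_1$-sensitivity is $d$, so each coordinate needs Laplace noise of scale $d/\alpha$, which swamps a $\{0,1\}$-valued signal as $d$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(x: np.ndarray, alpha: float) -> np.ndarray:
    """alpha-differentially private release of x in [0, 1]^d.
    The l1-sensitivity of the identity map on [0, 1]^d is d, so
    each coordinate receives Laplace noise of scale d / alpha."""
    d = x.size
    return x + rng.laplace(scale=d / alpha, size=d)

x = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
z = laplace_mechanism(x, alpha=1.0)
# Per-coordinate noise std is sqrt(2) * d / alpha (about 7 here),
# an order of magnitude larger than the {0, 1} signal itself.
```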

Pages 35-39:

Optimal Mechanism

Non-private observation:

   $X = (1, 0, 1, 0, 0)^\top$

•  Draw $v$ uniformly in $\{0, 1\}^d$, giving two candidate views:

   View 1: $v = (0, 1, 1, 0, 0)^\top$ (closer: 3 coordinates overlap with $X$)
   View 2: $1 - v = (1, 0, 0, 1, 1)^\top$ (farther: 2 coordinates overlap)

•  With probability $\frac{e^\alpha}{1 + e^\alpha}$, choose the closer of $v$ and $1 - v$ to $X$; otherwise, choose the farther
•  At the end: compute the sample average and de-bias
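The resampling scheme above can be sketched in code (my rendering of the slide's description; the exact de-biasing constant is omitted):

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def private_view(x: np.ndarray, alpha: float) -> np.ndarray:
    """One draw from the mechanism: sample v uniformly in {0,1}^d,
    then report whichever of v and 1 - v agrees with x on more
    coordinates, with probability e^alpha / (1 + e^alpha); report
    the other one otherwise."""
    d = x.size
    v = rng.integers(0, 2, size=d)
    if np.sum(v == x) >= np.sum((1 - v) == x):
        closer, farther = v, 1 - v
    else:
        closer, farther = 1 - v, v
    p_closer = math.exp(alpha) / (1 + math.exp(alpha))
    return closer if rng.random() < p_closer else farther

z = private_view(np.array([1, 0, 1, 0, 0]), alpha=1.0)
```

Averaging many such views and applying a fixed affine de-biasing correction recovers an unbiased mean estimate, as on the slide.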

Page 40:

Empirical Evidence

Estimate the proportion of emergency room visits involving different substances.

[Figure: estimation error as a function of sample size $n$. Data source: Drug Abuse Warning Network.]

Pages 41-44:

Sample Size Reductions

Given an $\alpha$-private channel $Q$, a pair $\{P_1, P_2\}$ induces the marginals

   $M_j(S) := \int Q(S \mid x_1, \dots, x_n)\, dP_j^n(x_1, \dots, x_n), \quad j \in \{1, 2\}$

Question: how much "contraction" does privacy induce?

Theorem (data processing): for any $\alpha$-private channel $Q$ and an i.i.d. sample of size $n$,

   $D_{\mathrm{kl}}(M_1 \,\|\, M_2) + D_{\mathrm{kl}}(M_2 \,\|\, M_1) \le 4n (e^\alpha - 1)^2 \|P_1 - P_2\|_{\mathrm{TV}}^2$

Note: for $\alpha \lesssim 1$, this yields the effective sample size reduction $n \mapsto n\alpha^2$.
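The theorem can be sanity-checked numerically for the binary randomized-response channel with a single observation ($n = 1$); the Bernoulli parameters below are hypothetical:

```python
import math

def kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def rr_marginal(theta: float, alpha: float) -> float:
    """P(Z = 1) when X ~ Bernoulli(theta) passes through binary
    randomized response with truth probability e^a / (1 + e^a)."""
    t = math.exp(alpha) / (1 + math.exp(alpha))
    return theta * t + (1 - theta) * (1 - t)

alpha, p1, p2 = 0.5, 0.3, 0.7
m1, m2 = rr_marginal(p1, alpha), rr_marginal(p2, alpha)
symm_kl = kl(m1, m2) + kl(m2, m1)
tv = abs(p1 - p2)  # total variation between two Bernoullis
bound = 4 * 1 * (math.exp(alpha) - 1) ** 2 * tv ** 2  # theorem, n = 1
assert symm_kl <= bound
```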

Page 45:

Final Remarks: Privacy

Rough technique: reduce estimation to testing, then apply information-theoretic lower bounds for testing [Le Cam; Hasminskii; Ibragimov; Assouad; Birgé; Barron; Yu; ...]

Key: this allows identification of new optimal mechanisms.

Additional examples:
‣ Fixed-design regression
‣ Convex risk minimization
‣ Multinomial estimation
‣ Nonparametric density estimation

Almost always: effective sample size reduction $n \mapsto n\alpha^2$; in $d$-dimensional problems, $n \mapsto n\alpha^2 / d$.

Page 46:

Part II: Communication and Minimax Risk

with John Duchi, Martin Wainwright and Yuchen Zhang

University of California, Berkeley

Pages 47-51:

Communication Constraints

•  Large data necessitates distributed storage
•  Independent data collection (e.g. hospitals)
•  Privacy?

Setting: each of $m$ agents has a sample of size $n$, $X^i = (X^i_1, X^i_2, \dots, X^i_n)$; agent $i$ sends a message $Z_i$ to a fusion center, which computes the estimate $\hat{\theta}$.

Question: what are the tradeoffs between communication and statistical utility?

[Yao 79; Abelson 80; Tsitsiklis and Luo 87; Han & Amari 98; Tatikonda & Mitter 04; ...]

Pages 52-53:

Minimax Risk with $B$-Bounded Communication

Central object of study: the minimax communication.

•  Parameter of the distribution: $\theta(P)$
•  Family of distributions: $\mathcal{P}$
•  Loss: $\|\cdot\|_2^2$
•  Protocols: $Z_i = \pi(X^i)$, with each $Z_i$ constrained to be at most $B$ bits; $\Pi_B$ is the family of such protocols

   $\mathfrak{M}_n(\theta(\mathcal{P}), B) := \inf_{\pi \in \Pi_B} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\big[ \|\hat{\theta}(Z_1^m) - \theta(P)\|_2^2 \big]$

Pages 54-57:

Vignette: Mean Estimation

Consider estimation in the normal location family, $X^i_j \overset{\text{iid}}{\sim} N(\theta, \sigma^2 I_{d \times d})$ with $\theta \in [-1, 1]^d$.

Theorem (minimax rate): when each agent has a sample of size $n$,

   $\mathbb{E}\big[ \|\hat{\theta}(X^1, \dots, X^m) - \theta\|_2^2 \big] \asymp \frac{\sigma^2 d}{nm}$

Theorem (minimax rate with $B$-bounded communication): when each agent has a sample of size $n$ and sends $B$ bits,

   $\frac{d}{B \wedge d} \cdot \frac{1}{\log m} \cdot \frac{\sigma^2 d}{nm} \;\lesssim\; \mathfrak{M}_n(\mathcal{N}_d, B) \;\lesssim\; \frac{d \log m}{B \wedge d} \cdot \frac{\sigma^2 d}{nm}$

Consequence: each agent must send $\approx d$ bits for optimal estimation.
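One way to see that $\approx d$ bits per agent can suffice (an illustrative protocol of mine, not the paper's optimal one) is unbiased one-bit-per-coordinate stochastic rounding of each agent's local mean:

```python
import numpy as np

rng = np.random.default_rng(2)

def quantize_one_bit(local_mean: np.ndarray) -> np.ndarray:
    """Stochastically round each coordinate of a vector in [-1, 1]
    to {-1, +1} so that E[bit] equals the coordinate: d bits total."""
    p_plus = (local_mean + 1.0) / 2.0
    return np.where(rng.random(local_mean.size) < p_plus, 1.0, -1.0)

# m agents, each with n Gaussian observations of theta in [-1, 1]^d
m, n, d, sigma = 200, 50, 4, 1.0
theta = np.array([0.5, -0.25, 0.0, 0.75])
msgs = [
    quantize_one_bit(
        np.clip((theta + sigma * rng.standard_normal((n, d))).mean(axis=0), -1, 1)
    )
    for _ in range(m)
]
theta_hat = np.mean(msgs, axis=0)  # unbiased up to the (rare) clipping
```

The fusion center just averages the $d$-bit messages; the small bias introduced by clipping local means into $[-1, 1]$ is the price of this simple sketch.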

Page 58:

Part III: Computation and Minimax Risk

with John Duchi and Brendan McMahan

Pages 59-61:

Towards Computation

Regression model $Y \approx X\theta + \varepsilon$: target $Y \in \mathbb{R}^n$, highly-varying or sparse design $X \in \mathbb{R}^{n \times d}$, parameter $\theta \in \mathbb{R}^d$, noise $\varepsilon \in \mathbb{R}^n$.

•  Both large $n$ and large $d$
•  Columns very different (ill-conditioning)
•  Analogous models for classification (e.g. logistic)

[D., Hazan, Singer 11; D., Jordan, McMahan 13]

Pages 62-63:

Sparse and Heavy-Tailed

Text data classification: binary word indicators for $d \approx 5 \cdot 10^4$ words; bigrams lead to $d \approx 3 \cdot 10^6$ variables, $< 1\%$ non-zero.

Medical risk prediction: $n \approx 10^5$ patients and $d \approx 3 \cdot 10^4$ indicator variables; binary interactions yield $d \approx 4.5 \cdot 10^8$ (sparse) features.

[Figure: feature frequencies of 2000 webpages.]

[Manning & Schütze 99; Shah & Meinshausen 13; Li & König 11; Ma et al. 09; Crammer, Dredze, Kulesza 09; ...]

Pages 64-66:

Convex Risk Minimization

Large $n$, large $d$: focus on prediction. Minimize

   $R_P(\theta) := \mathbb{E}_P[\ell(X; \theta)]$

where $\ell$ is convex in $\theta$ and $X \sim P$.

Computation = optimization complexity [Nemirovski & Yudin 83]: one unit of computation maps $\{x, \theta\}$ to the pair $\ell(x; \theta)$, $\nabla_\theta \ell(x; \theta)$.

Minimax risk with $n$ computations:

   $\mathfrak{M}_n(\Theta, \mathcal{P}, \ell) := \inf_{\hat{\theta} \in \mathcal{C}_n} \sup_{P \in \mathcal{P}} \Big\{ \mathbb{E}_P[R_P(\hat{\theta})] - \min_{\theta \in \Theta} R_P(\theta) \Big\}$

Page 67:

Optimal Convergence

Linear functionals on the hypercube: $\ell(x; \theta) = x^\top \theta$ with $\Theta = [-1, 1]^d$, so the problem is governed by $\mathbb{E}[X]$. Feature $j$ of $X \in \{-1, 0, 1\}^d$ appears with probability $p_j$.

Lower bound:

   $\mathfrak{M}_n(\Theta, \mathcal{P}, \ell) \ge \frac{1}{8} \cdot \frac{1}{\sqrt{n}} \sum_{j=1}^{d} \sqrt{p_j}$

Pages 68–71

Methods and challenges: stochastic approximation

Iterate:
1. Random sample $X_i$; compute the gradient $g_i = \nabla \ell(X_i; \theta_i)$
2. Update $\theta_{i+1} \leftarrow \theta_i - \frac{1}{\sqrt{i}} \, g_i$

Efficient? Sometimes.
[Robbins & Monro 51; Polyak & Juditsky 92; Nedić & Bertsekas 01; Lai 03; Nemirovski, Juditsky, Lan, Shapiro 09; Agarwal, Bartlett, Ravikumar, Wainwright 12; ...]

Problem: this method (and its variants) is sub-optimal for data with high-variance features, sometimes by a factor of $d$ more than necessary.
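The iteration above is the classical Robbins–Monro stochastic-approximation scheme. A minimal Python sketch; the linear loss, the projection onto $[-1,1]^d$, and the synthetic data distribution are illustrative assumptions of this example, not prescriptions from the deck:

```python
import numpy as np

def sgd_sqrt_step(loss_grad, samples, theta0):
    """Robbins-Monro iteration: theta_{i+1} <- theta_i - (1/sqrt(i)) * g_i,
    with g_i = grad loss(X_i; theta_i), projected onto Theta = [-1, 1]^d."""
    theta = theta0.astype(float).copy()
    for i, x in enumerate(samples, start=1):
        g = loss_grad(x, theta)
        theta -= g / np.sqrt(i)                # 1/sqrt(i) step size
        theta = np.clip(theta, -1.0, 1.0)      # projection onto the hypercube
    return theta

# Illustrative linear loss l(x; theta) = -x^T theta (gradient -x); its risk
# minimizer over [-1, 1]^d is sign(E[X]) coordinatewise.
rng = np.random.default_rng(0)
X = rng.choice([-1, 0, 1], size=(5000, 3), p=[0.05, 0.80, 0.15])
theta = sgd_sqrt_step(lambda x, th: -x.astype(float), X, np.zeros(3))
```

Since $\mathbb{E}[X_j] = 0.1 > 0$ under these sampling probabilities, the projected iterates drift toward the corner $(1, 1, 1)$ of the hypercube.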

Pages 72–76

Methods and challenges: stochastic approximation (continued)

Same iteration: $g_i = \nabla \ell(X_i; \theta_i)$, $\theta_{i+1} \leftarrow \theta_i - \frac{1}{\sqrt{i}} \, g_i$.

[Figure: contours of the risk $R(\theta)$, stretched along directions corresponding to uncommon features and compressed along common ones; rescaling the risk turns the hard geometry into an easier one.]

Can we do something a bit more second order? AdaGrad:

$$A_i = \operatorname{diag}\Big( \sum_{j \le i} g_j g_j^\top \Big)^{1/2}, \qquad \theta_{i+1} \leftarrow \theta_i - A_i^{-1} g_i$$
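Because $A_i$ keeps only the diagonal of $\sum_{j \le i} g_j g_j^\top$, the update needs just a running per-coordinate sum of squared gradients. A minimal sketch; the least-squares example, step size `eta`, and data distribution are illustrative assumptions:

```python
import numpy as np

def adagrad(loss_grad, samples, theta0, eta=0.5, eps=1e-8):
    """Diagonal AdaGrad: keep sum_sq = diag(sum_{j<=i} g_j g_j^T) and take
    per-coordinate steps theta <- theta - eta * g / sqrt(sum_sq)."""
    theta = theta0.astype(float).copy()
    sum_sq = np.zeros_like(theta)              # running diagonal of sum_j g_j g_j^T
    for s in samples:
        g = loss_grad(s, theta)
        sum_sq += g * g
        theta -= eta * g / (np.sqrt(sum_sq) + eps)
    return theta

# Illustrative least-squares problem with one common and many rare features:
# rare coordinates accumulate little squared gradient, so they get larger steps.
rng = np.random.default_rng(1)
d, n = 10, 2000
freq = np.full(d, 0.05)
freq[0] = 0.9                                   # feature 0 is common, the rest rare
X = (rng.random((n, d)) < freq).astype(float)
y = X @ np.ones(d)                              # true weights are all ones
grad = lambda s, th: (th @ s[0] - s[1]) * s[0]  # per-sample squared-loss gradient
theta = adagrad(grad, list(zip(X, y)), np.zeros(d))
```

The per-coordinate rescaling is exactly the adaptivity the surrounding slides quantify: rarely firing features receive large steps, frequently firing ones small steps.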

Pages 77–78

Optimal convergence: upper bounds

Iterate:
1. Gradient $g_i = \nabla \ell(X_i; \theta_i)$
2. Scale $A_i = \operatorname{diag}\big( \sum_{j \le i} g_j g_j^\top \big)^{1/2}$
3. Update $\theta_{i+1} \leftarrow \theta_i - A_i^{-1} g_i$

For any risk functional,

$$\mathbb{E}[R(\hat{\theta})] - R(\theta^*) \;\le\; \frac{1}{\sqrt{2n}} \, \mathbb{E}\bigg[ \min_{\text{diagonal } A \succeq 0} \bigg\{ \|\theta^*\|_\infty^2 \operatorname{tr}(A) + \sum_{i=1}^{n} g_i^\top A^{-1} g_i \bigg\} \bigg]$$

The minimizing $A$ plays the role of an oracle, adaptive to this data.

Pages 79–81

Optimal convergence: matching upper and lower bounds

Setting: linear functionals on the hypercube, $\ell(x; \theta) = x^\top \theta$, $\Theta = [-1, 1]^d$, feature $j$ of $X$ appears with probability $p_j$.

Lower bound:

$$M_n(\Theta, \mathcal{P}, \ell) \;\ge\; \frac{1}{8} \cdot \frac{1}{\sqrt{n}} \sum_{j=1}^{d} \sqrt{p_j}$$

Upper bound, for any risk functional:

$$\mathbb{E}[R(\hat{\theta})] - R(\theta^*) \;\le\; \frac{1}{\sqrt{2n}} \, \mathbb{E}\bigg[ \min_{\text{diagonal } A \succeq 0} \bigg\{ \|\theta^*\|_\infty^2 \operatorname{tr}(A) + \sum_{i=1}^{n} g_i^\top A^{-1} g_i \bigg\} \bigg]$$

Specialized to linear models on the hypercube:

$$\mathbb{E}[R(\hat{\theta})] - R(\theta^*) \;\le\; \frac{\sqrt{2}}{\sqrt{n}} \sum_{j=1}^{d} \sqrt{p_j}$$

Note: the adaptive method's upper bound is less than a factor of 12 larger than the lower bound ($\sqrt{2}$ vs. $1/8$, i.e. $8\sqrt{2} \approx 11.3$).
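A quick numeric illustration of the gap these rates capture. The "non-adaptive" quantity below is the usual $\|\theta^*\|_2 \sqrt{\mathbb{E}\|g\|_2^2}$ scaling of plain SGD guarantees, supplied here as an assumption for comparison; with one common feature and many very rare ones, the adaptive constant $\sum_j \sqrt{p_j}$ is smaller by roughly $\sqrt{d}$ (the deck notes the gap can be as large as $d$ in some configurations):

```python
import numpy as np

d = 1_000_000
p = np.full(d, 1e-12)   # almost all features extremely rare
p[0] = 1.0              # one common feature

adaptive = np.sum(np.sqrt(p))                   # sum_j sqrt(p_j): adaptive rate constant
nonadaptive = np.sqrt(d) * np.sqrt(np.sum(p))   # ||theta*||_2 * sqrt(E||g||_2^2), theta* in {-1,1}^d
print(adaptive, nonadaptive, nonadaptive / adaptive)
```

Here `adaptive` is about 2 while `nonadaptive` is about 1000, a separation on the order of $\sqrt{d}$.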

Pages 82–83

Empirics: text classification

Reuters news feed document classification task: d = 2,000,000 features, 4,000 non-zero per document, n = 800,000 documents.

[Figure: misclassification rate¹ by method; the adaptive method gives a 25–50% improvement.]

¹ [Crammer et al. '06]

Pages 84–85

Empirics: image ranking

Task: 15,000 different nouns; given a noun, rank images in order of relevance for that noun. Per noun: $n \approx 2 \cdot 10^6$ observations in $d \approx 10^4$ dimensions. [Bengio, Weston, Grangier 10]

[Figure: proportion of good images in the top k, as a function of k.]

Pages 86–88

One more result

[Figure: average frame accuracy (%) on the test set vs. time (hours), comparing SGD, GPU, Downpour SGD, Downpour SGD w/ AdaGrad (the adaptive method), and Sandblaster L-BFGS.] [Dean et al. 12]

Order of magnitude improvement in time and cost. In many production systems at Google. (The only result I may show...)

Pages 89–92

Unifying picture

Constraints in minimax analysis:

- Reduction of estimation to testing: a Markov chain $V \to X \to Z$, where $Z$ is constrained ($Z$ is private, only a few bits, a function evaluation, ...); the task is to find the initial $V$ using only $Z$.
- Information-theoretic techniques: strong data processing. Relate the information in $V \to Z$ to that in $V \to X$ via the constraint on the pair $X \to Z$, then apply Fano, Assouad, or Le Cam.

Understanding these constraints leads to new, optimal schemes.
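The chain $V \to X \to Z$ obeys the data-processing inequality $I(V; Z) \le I(V; X)$; strong data processing sharpens the constant using the channel $X \to Z$. A minimal sketch checking the plain inequality on a small discrete chain (all distributions here are made-up illustrations):

```python
import numpy as np

def mutual_info(pxy):
    """I(X;Y) in nats from a joint pmf matrix pxy."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

# Markov chain V -> X -> Z over binary alphabets
pV = np.array([0.5, 0.5])
pX_given_V = np.array([[0.9, 0.1],    # rows: v, cols: x
                       [0.2, 0.8]])
pZ_given_X = np.array([[0.7, 0.3],    # a noisy channel: a constrained view of X
                       [0.4, 0.6]])

pVX = pV[:, None] * pX_given_V        # joint of (V, X)
pVZ = pVX @ pZ_given_X                # joint of (V, Z), using Z independent of V given X

I_VX, I_VZ = mutual_info(pVX), mutual_info(pVZ)
```

The strong version of the inequality would give $I(V; Z) \le \eta \, I(V; X)$ with a contraction coefficient $\eta < 1$ determined by the channel $p(z \mid x)$, which is what turns information constraints into quantitative minimax lower bounds.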