Statistical Decision Theory and Information Constraints
Michael I. Jordan, University of California, Berkeley
November 20, 2014

Page 1:

Statistical Decision Theory and Information Constraints

Michael I. Jordan University of California, Berkeley

November 20, 2014

Page 2:

What Is the Big Data Phenomenon?

•  Science in confirmatory mode (e.g., particle physics)
•  Science in exploratory mode (e.g., astronomy, genomics)
•  Measurement of human activity, particularly online activity, is generating massive datasets that can be used (e.g.) for personalization and for creating markets
•  Sensor networks are becoming pervasive

Pages 3-7:

What Are the Conceptual/Mathematical Issues?

•  The need to control statistical risk under constraints on algorithmic runtime
   –  how do risk and runtime trade off as a function of the amount of data?
•  Statistical inference with distributed and streaming data
   –  how is inferential quality impacted by communication constraints?
•  The tradeoff between statistical risk and privacy (and other externalities)
•  Many other issues that require a blend of statistical thinking (e.g., a focus on sampling, confidence intervals, evaluation, diagnostics, causal inference) and computational thinking (e.g., scalability, abstraction)

Pages 8-9:

Our Approach

•  Take (classical) statistical decision theory as a mathematical point of departure
•  Treat computation, communication, privacy, etc. as constraints on statistical risk
•  This induces tradeoffs among these quantities and the number of data points
•  Under the hood: geometry, information theory, and optimization

Page 10:

Background

•  In the 1930s, Wald laid the foundations of statistical decision theory
•  Given a family of probability distributions $\mathcal{P}$, a parameter $\theta(P)$ for each $P \in \mathcal{P}$, an estimator $\hat{\theta}$, and a loss $l(\hat{\theta}, \theta(P))$, define the risk:

   $R_P(\hat{\theta}) := \mathbb{E}_P\big[ l(\hat{\theta}, \theta(P)) \big]$

•  Minimax principle [Wald '39, '43]: choose the estimator minimizing the worst-case risk:

   $\sup_{P \in \mathcal{P}} \mathbb{E}_P\big[ l(\hat{\theta}, \theta(P)) \big]$

Page 11:

Part I: Privacy and Minimax Risk

with John Duchi and Martin Wainwright, University of California, Berkeley

Page 12:

Privacy and Risk

•  Individuals are not generally willing to allow their personal data to be used without control on how it will be used and how much privacy loss they will incur
•  We will quantify "privacy loss" via differential privacy
•  We then treat differential privacy as a constraint on inference via statistical decision theory
•  This yields (personal) tradeoffs between privacy loss and inferential gain

Pages 13-19:

A Model of Privacy

Local privacy: providers do not trust the collector [Warner 65; Evfimievski et al. 03]

[Figure: individuals hold private data $X_1, X_2, \dots, X_n$ with $X_i \overset{\text{iid}}{\sim} P$, $i \in \{1, \dots, n\}$; each $X_i$ passes through a private channel $Q(\cdot \mid X_i)$ to produce $Z_i$, and the estimator sees only the privatized messages, $Z_1^n \mapsto \hat{\theta}(Z_1^n)$.]

Pages 20-21:

Definitions of Privacy

Definition: a channel $Q$ is $\alpha$-differentially private if

   $\sup_{S,\; x \in \mathcal{X},\; x' \in \mathcal{X}} \frac{Q(Z \in S \mid x)}{Q(Z \in S \mid x')} \le \exp(\alpha)$

equivalently, $\log Q(z \mid x) - \log Q(z \mid x')$ is bounded by $\alpha$ uniformly in $z, x, x'$.

[Dwork, McSherry, Nissim, Smith 06]
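As a concrete illustration (mine, not from the slides), binary randomized response is the canonical channel satisfying this definition: with truth probability $e^\alpha/(1+e^\alpha)$, the worst-case likelihood ratio is exactly $e^\alpha$. A minimal sketch:

```python
import math
import random

def randomized_response(x: int, alpha: float) -> int:
    """Privately report a bit x in {0, 1}: tell the truth with
    probability e^alpha / (1 + e^alpha), otherwise flip the bit."""
    p_truth = math.exp(alpha) / (1 + math.exp(alpha))
    return x if random.random() < p_truth else 1 - x

def privacy_ratio(alpha: float) -> float:
    """Worst-case ratio Q(z | x) / Q(z | x') over z, x, x' for the
    channel above; alpha-differential privacy needs this <= e^alpha."""
    p = math.exp(alpha) / (1 + math.exp(alpha))
    return max(p, 1 - p) / min(p, 1 - p)
```

Here `privacy_ratio(alpha)` equals $e^\alpha$ exactly, so this channel meets the constraint with equality.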

Pages 22-25:

Private Minimax Risk

Central object of study: the minimax risk.

•  Parameter of the distribution: $\theta(P)$
•  Family of distributions: $\mathcal{P}$
•  Loss measuring error: $\ell$
•  Family of $\alpha$-private channels: $\mathcal{Q}_\alpha$

Minimax risk:

   $\mathfrak{M}_n(\theta(\mathcal{P}), \ell) := \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\big[ \ell(\hat{\theta}(X_1^n), \theta(P)) \big]$

$\alpha$-private minimax risk (the outer infimum is over the best $\alpha$-private channel, so this is the minimax risk under the privacy constraint):

   $\mathfrak{M}_n(\theta(\mathcal{P}), \ell, \alpha) := \inf_{Q \in \mathcal{Q}_\alpha} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P,Q}\big[ \ell(\hat{\theta}(Z_1^n), \theta(P)) \big]$

Pages 26-27:

Vignette: Private Mean (Location) Estimation

Example: estimate reasons for hospital visits. Patients admitted to hospital for substance abuse; estimate the prevalence of different substances. Each record is a binary indicator vector, and the targets are the population proportions:

   Substance       Record   Proportion
   Alcohol         1        $\theta_1 = .45$
   Cocaine         1        $\theta_2 = .32$
   Heroin          0        $\theta_3 = .16$
   Cannabis        0        $\theta_4 = .20$
   LSD             0        $\theta_5 = .00$
   Amphetamines    0        $\theta_6 = .02$

Pages 28-31:

Vignette: Mean Estimation

Consider estimation of the mean $\theta(P) := \mathbb{E}_P[X] \in \mathbb{R}^d$, with errors measured in $\ell_\infty$-norm, i.e. $\mathbb{E}[\|\hat{\theta} - \theta\|_\infty]$, for

   $\mathcal{P}_d := \{ \text{distributions } P \text{ supported on } [-1, 1]^d \}$

Proposition (minimax rate; achieved by the sample mean):

   $\mathfrak{M}_n(\mathcal{P}_d, \|\cdot\|_\infty) \asymp \min\Big\{ 1, \frac{\sqrt{\log d}}{\sqrt{n}} \Big\}$

Proposition (private minimax rate for $\alpha = O(1)$):

   $\mathfrak{M}_n(\mathcal{P}_d, \|\cdot\|_\infty, \alpha) \asymp \min\Big\{ 1, \frac{\sqrt{d \log d}}{\sqrt{n \alpha^2}} \Big\}$

Note: the effective sample size is $n \mapsto n\alpha^2 / d$.

Pages 32-34:

Optimal Mechanism?

Non-private observation:

   $X = (1, 0, 1, 0, 0)^\top$

Idea 1: add independent noise (e.g. the standard Laplace mechanism):

   $Z = X + W = (1 + W_1,\ 0 + W_2,\ 1 + W_3,\ 0 + W_4,\ 0 + W_5)^\top$

Problem: the noise magnitude is much too large (and this is unavoidable: the mechanism is provably sub-optimal).
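To make the magnitude problem concrete, here is a sketch (mine, not the talk's) of the Laplace mechanism applied to a vector in $[0,1]^d$: the $\ell_1$-sensitivity is $d$, so each coordinate needs Laplace noise of scale $d/\alpha$, which swamps a $\{0,1\}$-valued signal as $d$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(x: np.ndarray, alpha: float) -> np.ndarray:
    """alpha-differentially private release of x in [0, 1]^d.
    The l1-sensitivity of the identity map on [0, 1]^d is d, so
    each coordinate receives Laplace noise of scale d / alpha."""
    d = x.size
    return x + rng.laplace(scale=d / alpha, size=d)

x = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
z = laplace_mechanism(x, alpha=1.0)
# Per-coordinate noise std is sqrt(2) * d / alpha (about 7 here),
# an order of magnitude larger than the {0, 1} signal itself.
```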

Pages 35-39:

Optimal Mechanism

Non-private observation:

   $X = (1, 0, 1, 0, 0)^\top$

•  Draw $v$ uniformly in $\{0, 1\}^d$, giving two candidate views:

   View 1: $v = (0, 1, 1, 0, 0)^\top$ (closer: 3 coordinates overlap with $X$)
   View 2: $1 - v = (1, 0, 0, 1, 1)^\top$ (farther: 2 coordinates overlap)

•  With probability $\frac{e^\alpha}{1 + e^\alpha}$, choose the closer of $v$ and $1 - v$ to $X$; otherwise, choose the farther
•  At the end: compute the sample average and de-bias
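The resampling scheme above can be sketched in code (my rendering of the slide's description; the exact de-biasing constant is omitted):

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def private_view(x: np.ndarray, alpha: float) -> np.ndarray:
    """One draw from the mechanism: sample v uniformly in {0,1}^d,
    then report whichever of v and 1 - v agrees with x on more
    coordinates, with probability e^alpha / (1 + e^alpha); report
    the other one otherwise."""
    d = x.size
    v = rng.integers(0, 2, size=d)
    if np.sum(v == x) >= np.sum((1 - v) == x):
        closer, farther = v, 1 - v
    else:
        closer, farther = 1 - v, v
    p_closer = math.exp(alpha) / (1 + math.exp(alpha))
    return closer if rng.random() < p_closer else farther

z = private_view(np.array([1, 0, 1, 0, 0]), alpha=1.0)
```

Averaging many such views and applying a fixed affine de-biasing correction recovers an unbiased mean estimate, as on the slide.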

Page 40:

Empirical Evidence

Estimate the proportion of emergency room visits involving different substances.

[Figure: estimation error as a function of sample size $n$. Data source: Drug Abuse Warning Network.]

Pages 41-44:

Sample Size Reductions

Given an $\alpha$-private channel $Q$, a pair $\{P_1, P_2\}$ induces the marginals

   $M_j(S) := \int Q(S \mid x_1, \dots, x_n)\, dP_j^n(x_1, \dots, x_n), \quad j \in \{1, 2\}$

Question: how much "contraction" does privacy induce?

Theorem (data processing): for any $\alpha$-private channel $Q$ and an i.i.d. sample of size $n$,

   $D_{\mathrm{kl}}(M_1 \,\|\, M_2) + D_{\mathrm{kl}}(M_2 \,\|\, M_1) \le 4n (e^\alpha - 1)^2 \|P_1 - P_2\|_{\mathrm{TV}}^2$

Note: for $\alpha \lesssim 1$, this yields the effective sample size reduction $n \mapsto n\alpha^2$.
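The theorem can be sanity-checked numerically for the binary randomized-response channel with a single observation ($n = 1$); the Bernoulli parameters below are hypothetical:

```python
import math

def kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def rr_marginal(theta: float, alpha: float) -> float:
    """P(Z = 1) when X ~ Bernoulli(theta) passes through binary
    randomized response with truth probability e^a / (1 + e^a)."""
    t = math.exp(alpha) / (1 + math.exp(alpha))
    return theta * t + (1 - theta) * (1 - t)

alpha, p1, p2 = 0.5, 0.3, 0.7
m1, m2 = rr_marginal(p1, alpha), rr_marginal(p2, alpha)
symm_kl = kl(m1, m2) + kl(m2, m1)
tv = abs(p1 - p2)  # total variation between two Bernoullis
bound = 4 * 1 * (math.exp(alpha) - 1) ** 2 * tv ** 2  # theorem, n = 1
assert symm_kl <= bound
```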

Page 45:

Final Remarks: Privacy

Rough technique: reduce estimation to testing, then apply information-theoretic lower bounds for testing [Le Cam; Hasminskii; Ibragimov; Assouad; Birgé; Barron; Yu; ...]

Key: this allows identification of new optimal mechanisms.

Additional examples:
‣ Fixed-design regression
‣ Convex risk minimization
‣ Multinomial estimation
‣ Nonparametric density estimation

Almost always: effective sample size reduction $n \mapsto n\alpha^2$; in $d$-dimensional problems, $n \mapsto n\alpha^2 / d$.

Page 46:

Part II: Communication and Minimax Risk

with John Duchi, Martin Wainwright and Yuchen Zhang

University of California, Berkeley

Pages 47-51:

Communication Constraints

•  Large data necessitates distributed storage
•  Independent data collection (e.g. hospitals)
•  Privacy?

Setting: each of $m$ agents has a sample of size $n$, $X^i = (X^i_1, X^i_2, \dots, X^i_n)$; agent $i$ sends a message $Z_i$ to a fusion center, which computes the estimate $\hat{\theta}$.

Question: what are the tradeoffs between communication and statistical utility?

[Yao 79; Abelson 80; Tsitsiklis and Luo 87; Han & Amari 98; Tatikonda & Mitter 04; ...]

Pages 52-53:

Minimax Risk with $B$-Bounded Communication

Central object of study: the minimax communication.

•  Parameter of the distribution: $\theta(P)$
•  Family of distributions: $\mathcal{P}$
•  Loss: $\|\cdot\|_2^2$
•  Protocols: $Z_i = \pi(X^i)$, with each $Z_i$ constrained to be at most $B$ bits; $\Pi_B$ is the family of such protocols

   $\mathfrak{M}_n(\theta(\mathcal{P}), B) := \inf_{\pi \in \Pi_B} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\big[ \|\hat{\theta}(Z_1^m) - \theta(P)\|_2^2 \big]$

Pages 54-57:

Vignette: Mean Estimation

Consider estimation in the normal location family, $X^i_j \overset{\text{iid}}{\sim} N(\theta, \sigma^2 I_{d \times d})$ with $\theta \in [-1, 1]^d$.

Theorem (minimax rate): when each agent has a sample of size $n$,

   $\mathbb{E}\big[ \|\hat{\theta}(X^1, \dots, X^m) - \theta\|_2^2 \big] \asymp \frac{\sigma^2 d}{nm}$

Theorem (minimax rate with $B$-bounded communication): when each agent has a sample of size $n$ and sends $B$ bits,

   $\frac{d}{B \wedge d} \cdot \frac{1}{\log m} \cdot \frac{\sigma^2 d}{nm} \;\lesssim\; \mathfrak{M}_n(\mathcal{N}_d, B) \;\lesssim\; \frac{d \log m}{B \wedge d} \cdot \frac{\sigma^2 d}{nm}$

Consequence: each agent must send $\approx d$ bits for optimal estimation.
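One way to see that $\approx d$ bits per agent can suffice (an illustrative protocol of mine, not the paper's optimal one) is unbiased one-bit-per-coordinate stochastic rounding of each agent's local mean:

```python
import numpy as np

rng = np.random.default_rng(2)

def quantize_one_bit(local_mean: np.ndarray) -> np.ndarray:
    """Stochastically round each coordinate of a vector in [-1, 1]
    to {-1, +1} so that E[bit] equals the coordinate: d bits total."""
    p_plus = (local_mean + 1.0) / 2.0
    return np.where(rng.random(local_mean.size) < p_plus, 1.0, -1.0)

# m agents, each with n Gaussian observations of theta in [-1, 1]^d
m, n, d, sigma = 200, 50, 4, 1.0
theta = np.array([0.5, -0.25, 0.0, 0.75])
msgs = [
    quantize_one_bit(
        np.clip((theta + sigma * rng.standard_normal((n, d))).mean(axis=0), -1, 1)
    )
    for _ in range(m)
]
theta_hat = np.mean(msgs, axis=0)  # unbiased up to the (rare) clipping
```

The fusion center just averages the $d$-bit messages; the small bias introduced by clipping local means into $[-1, 1]$ is the price of this simple sketch.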

Page 58:

Part III: Computation and Minimax Risk

with John Duchi and Brendan McMahan

Pages 59-61:

Towards Computation

Regression model $Y \approx X\theta + \varepsilon$: target $Y \in \mathbb{R}^n$, highly-varying or sparse design $X \in \mathbb{R}^{n \times d}$, parameter $\theta \in \mathbb{R}^d$, noise $\varepsilon \in \mathbb{R}^n$.

•  Both large $n$ and large $d$
•  Columns very different (ill-conditioning)
•  Analogous models for classification (e.g. logistic)

[D., Hazan, Singer 11; D., Jordan, McMahan 13]

Pages 62-63:

Sparse and Heavy-Tailed

Text data classification: binary word indicators for $d \approx 5 \cdot 10^4$ words; bigrams lead to $d \approx 3 \cdot 10^6$ variables, $< 1\%$ non-zero.

Medical risk prediction: $n \approx 10^5$ patients and $d \approx 3 \cdot 10^4$ indicator variables; binary interactions yield $d \approx 4.5 \cdot 10^8$ (sparse) features.

[Figure: feature frequencies of 2000 webpages.]

[Manning & Schütze 99; Shah & Meinshausen 13; Li & König 11; Ma et al. 09; Crammer, Dredze, Kulesza 09; ...]

Pages 64-66:

Convex Risk Minimization

Large $n$, large $d$: focus on prediction. Minimize

   $R_P(\theta) := \mathbb{E}_P[\ell(X; \theta)]$

where $\ell$ is convex in $\theta$ and $X \sim P$.

Computation = optimization complexity [Nemirovski & Yudin 83]: one unit of computation maps $\{x, \theta\}$ to the pair $\ell(x; \theta)$, $\nabla_\theta \ell(x; \theta)$.

Minimax risk with $n$ computations:

   $\mathfrak{M}_n(\Theta, \mathcal{P}, \ell) := \inf_{\hat{\theta} \in \mathcal{C}_n} \sup_{P \in \mathcal{P}} \Big\{ \mathbb{E}_P[R_P(\hat{\theta})] - \min_{\theta \in \Theta} R_P(\theta) \Big\}$

Page 67:

Optimal Convergence

Linear functionals on the hypercube: $\ell(x; \theta) = x^\top \theta$ with $\Theta = [-1, 1]^d$, so the problem is governed by $\mathbb{E}[X]$. Feature $j$ of $X \in \{-1, 0, 1\}^d$ appears with probability $p_j$.

Lower bound:

   $\mathfrak{M}_n(\Theta, \mathcal{P}, \ell) \ge \frac{1}{8} \cdot \frac{1}{\sqrt{n}} \sum_{j=1}^{d} \sqrt{p_j}$

Pages 68–71

Methods and challenges: stochastic approximation

Iterate:
1. Random sample $X_i$; compute the gradient $g_i = \nabla \ell(X_i; \theta_i)$
2. Update $\theta_{i+1} \leftarrow \theta_i - \frac{1}{\sqrt{i}} \, g_i$

Efficient? Sometimes.
[Robbins & Monro 51; Polyak & Juditsky 92; Nedić & Bertsekas 01; Lai 03; Nemirovski, Juditsky, Lan, Shapiro 09; Agarwal, Bartlett, Ravikumar, Wainwright 12; ...]

Problem: this method (and its variants) is sub-optimal for data with high-variance features, sometimes by a factor of $d$ more than necessary.
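The iteration above is the classical Robbins–Monro stochastic-approximation scheme. A minimal Python sketch; the linear loss, the projection onto $[-1,1]^d$, and the synthetic data distribution are illustrative assumptions of this example, not prescriptions from the deck:

```python
import numpy as np

def sgd_sqrt_step(loss_grad, samples, theta0):
    """Robbins-Monro iteration: theta_{i+1} <- theta_i - (1/sqrt(i)) * g_i,
    with g_i = grad loss(X_i; theta_i), projected onto Theta = [-1, 1]^d."""
    theta = theta0.astype(float).copy()
    for i, x in enumerate(samples, start=1):
        g = loss_grad(x, theta)
        theta -= g / np.sqrt(i)                # 1/sqrt(i) step size
        theta = np.clip(theta, -1.0, 1.0)      # projection onto the hypercube
    return theta

# Illustrative linear loss l(x; theta) = -x^T theta (gradient -x); its risk
# minimizer over [-1, 1]^d is sign(E[X]) coordinatewise.
rng = np.random.default_rng(0)
X = rng.choice([-1, 0, 1], size=(5000, 3), p=[0.05, 0.80, 0.15])
theta = sgd_sqrt_step(lambda x, th: -x.astype(float), X, np.zeros(3))
```

Since $\mathbb{E}[X_j] = 0.1 > 0$ under these sampling probabilities, the projected iterates drift toward the corner $(1, 1, 1)$ of the hypercube.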

Pages 72–76

Methods and challenges: stochastic approximation (continued)

Same iteration: $g_i = \nabla \ell(X_i; \theta_i)$, $\theta_{i+1} \leftarrow \theta_i - \frac{1}{\sqrt{i}} \, g_i$.

[Figure: contours of the risk $R(\theta)$, stretched along directions corresponding to uncommon features and compressed along common ones; rescaling the risk turns the hard geometry into an easier one.]

Can we do something a bit more second order? AdaGrad:

$$A_i = \operatorname{diag}\Big( \sum_{j \le i} g_j g_j^\top \Big)^{1/2}, \qquad \theta_{i+1} \leftarrow \theta_i - A_i^{-1} g_i$$
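Because $A_i$ keeps only the diagonal of $\sum_{j \le i} g_j g_j^\top$, the update needs just a running per-coordinate sum of squared gradients. A minimal sketch; the least-squares example, step size `eta`, and data distribution are illustrative assumptions:

```python
import numpy as np

def adagrad(loss_grad, samples, theta0, eta=0.5, eps=1e-8):
    """Diagonal AdaGrad: keep sum_sq = diag(sum_{j<=i} g_j g_j^T) and take
    per-coordinate steps theta <- theta - eta * g / sqrt(sum_sq)."""
    theta = theta0.astype(float).copy()
    sum_sq = np.zeros_like(theta)              # running diagonal of sum_j g_j g_j^T
    for s in samples:
        g = loss_grad(s, theta)
        sum_sq += g * g
        theta -= eta * g / (np.sqrt(sum_sq) + eps)
    return theta

# Illustrative least-squares problem with one common and many rare features:
# rare coordinates accumulate little squared gradient, so they get larger steps.
rng = np.random.default_rng(1)
d, n = 10, 2000
freq = np.full(d, 0.05)
freq[0] = 0.9                                   # feature 0 is common, the rest rare
X = (rng.random((n, d)) < freq).astype(float)
y = X @ np.ones(d)                              # true weights are all ones
grad = lambda s, th: (th @ s[0] - s[1]) * s[0]  # per-sample squared-loss gradient
theta = adagrad(grad, list(zip(X, y)), np.zeros(d))
```

The per-coordinate rescaling is exactly the adaptivity the surrounding slides quantify: rarely firing features receive large steps, frequently firing ones small steps.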

Pages 77–78

Optimal convergence: upper bounds

Iterate:
1. Gradient $g_i = \nabla \ell(X_i; \theta_i)$
2. Scale $A_i = \operatorname{diag}\big( \sum_{j \le i} g_j g_j^\top \big)^{1/2}$
3. Update $\theta_{i+1} \leftarrow \theta_i - A_i^{-1} g_i$

For any risk functional,

$$\mathbb{E}[R(\hat{\theta})] - R(\theta^*) \;\le\; \frac{1}{\sqrt{2n}} \, \mathbb{E}\bigg[ \min_{\text{diagonal } A \succeq 0} \bigg\{ \|\theta^*\|_\infty^2 \operatorname{tr}(A) + \sum_{i=1}^{n} g_i^\top A^{-1} g_i \bigg\} \bigg]$$

The minimizing $A$ plays the role of an oracle, adaptive to this data.

Pages 79–81

Optimal convergence: matching upper and lower bounds

Setting: linear functionals on the hypercube, $\ell(x; \theta) = x^\top \theta$, $\Theta = [-1, 1]^d$, feature $j$ of $X$ appears with probability $p_j$.

Lower bound:

$$M_n(\Theta, \mathcal{P}, \ell) \;\ge\; \frac{1}{8} \cdot \frac{1}{\sqrt{n}} \sum_{j=1}^{d} \sqrt{p_j}$$

Upper bound, for any risk functional:

$$\mathbb{E}[R(\hat{\theta})] - R(\theta^*) \;\le\; \frac{1}{\sqrt{2n}} \, \mathbb{E}\bigg[ \min_{\text{diagonal } A \succeq 0} \bigg\{ \|\theta^*\|_\infty^2 \operatorname{tr}(A) + \sum_{i=1}^{n} g_i^\top A^{-1} g_i \bigg\} \bigg]$$

Specialized to linear models on the hypercube:

$$\mathbb{E}[R(\hat{\theta})] - R(\theta^*) \;\le\; \frac{\sqrt{2}}{\sqrt{n}} \sum_{j=1}^{d} \sqrt{p_j}$$

Note: the adaptive method's upper bound is less than a factor of 12 larger than the lower bound ($\sqrt{2}$ vs. $1/8$, i.e. $8\sqrt{2} \approx 11.3$).
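A quick numeric illustration of the gap these rates capture. The "non-adaptive" quantity below is the usual $\|\theta^*\|_2 \sqrt{\mathbb{E}\|g\|_2^2}$ scaling of plain SGD guarantees, supplied here as an assumption for comparison; with one common feature and many very rare ones, the adaptive constant $\sum_j \sqrt{p_j}$ is smaller by roughly $\sqrt{d}$ (the deck notes the gap can be as large as $d$ in some configurations):

```python
import numpy as np

d = 1_000_000
p = np.full(d, 1e-12)   # almost all features extremely rare
p[0] = 1.0              # one common feature

adaptive = np.sum(np.sqrt(p))                   # sum_j sqrt(p_j): adaptive rate constant
nonadaptive = np.sqrt(d) * np.sqrt(np.sum(p))   # ||theta*||_2 * sqrt(E||g||_2^2), theta* in {-1,1}^d
print(adaptive, nonadaptive, nonadaptive / adaptive)
```

Here `adaptive` is about 2 while `nonadaptive` is about 1000, a separation on the order of $\sqrt{d}$.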

Pages 82–83

Empirics: text classification

Reuters news feed document classification task: d = 2,000,000 features, 4,000 non-zero per document, n = 800,000 documents.

[Figure: misclassification rate¹ by method; the adaptive method gives a 25–50% improvement.]

¹ [Crammer et al. '06]

Pages 84–85

Empirics: image ranking

Task: 15,000 different nouns; given a noun, rank images in order of relevance for that noun. Per noun: $n \approx 2 \cdot 10^6$ observations in $d \approx 10^4$ dimensions. [Bengio, Weston, Grangier 10]

[Figure: proportion of good images in the top k, as a function of k.]

Pages 86–88

One more result

[Figure: average frame accuracy (%) on the test set vs. time (hours), comparing SGD, GPU, Downpour SGD, Downpour SGD w/ AdaGrad (the adaptive method), and Sandblaster L-BFGS.] [Dean et al. 12]

Order of magnitude improvement in time and cost. In many production systems at Google. (The only result I may show...)

Pages 89–92

Unifying picture

Constraints in minimax analysis:

- Reduction of estimation to testing: a Markov chain $V \to X \to Z$, where $Z$ is constrained ($Z$ is private, only a few bits, a function evaluation, ...); the task is to find the initial $V$ using only $Z$.
- Information-theoretic techniques: strong data processing. Relate the information in $V \to Z$ to that in $V \to X$ via the constraint on the pair $X \to Z$, then apply Fano, Assouad, or Le Cam.

Understanding these constraints leads to new, optimal schemes.
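The chain $V \to X \to Z$ obeys the data-processing inequality $I(V; Z) \le I(V; X)$; strong data processing sharpens the constant using the channel $X \to Z$. A minimal sketch checking the plain inequality on a small discrete chain (all distributions here are made-up illustrations):

```python
import numpy as np

def mutual_info(pxy):
    """I(X;Y) in nats from a joint pmf matrix pxy."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

# Markov chain V -> X -> Z over binary alphabets
pV = np.array([0.5, 0.5])
pX_given_V = np.array([[0.9, 0.1],    # rows: v, cols: x
                       [0.2, 0.8]])
pZ_given_X = np.array([[0.7, 0.3],    # a noisy channel: a constrained view of X
                       [0.4, 0.6]])

pVX = pV[:, None] * pX_given_V        # joint of (V, X)
pVZ = pVX @ pZ_given_X                # joint of (V, Z), using Z independent of V given X

I_VX, I_VZ = mutual_info(pVX), mutual_info(pVZ)
```

The strong version of the inequality would give $I(V; Z) \le \eta \, I(V; X)$ with a contraction coefficient $\eta < 1$ determined by the channel $p(z \mid x)$, which is what turns information constraints into quantitative minimax lower bounds.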