Page 1

Multi‐Task Averaging: Theory and Practice

1

Maya R. Gupta, Google Research, Univ. Washington
Sergey Feldman, Univ. Washington
Bela Frigyik, Univ. Pecs

Page 2

Aristotle

2

The idea of a mean is old:

"By the mean of a thing I denote a point equally distant from either extreme..." - Aristotle

$v = \frac{y_{\min} + y_{\max}}{2}, \qquad v - y_{\min} = y_{\max} - v$

[Figure: number line with $y_{\min}$, $v$, $y_{\max}$]

Page 3

Tycho Brahe (16th century)

3

Averaged to reduce measurement error.

$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$

Page 4

Legendre (1805)

4

Legendre noted the mean minimizes squared error:

$\bar{y} = \arg\min_{\mu} \sum_{i=1}^{N} (y_i - \mu)^2$

Page 5

Legendre (1805)

5

Legendre noted the mean minimizes squared error:

$\bar{y} = \arg\min_{\mu} \sum_{i=1}^{N} (y_i - \mu)^2$

Banerjee et al. 2005: the mean minimizes any Bregman divergence:

$\bar{y} = \arg\min_{\mu} \sum_{i=1}^{N} \psi(y_i, \mu)$

Frigyik et al. 2008: the mean minimizes any functional Bregman divergence.

Page 6

Gauss (1809)

6

The average was central to Gauss's construction of the normal distribution. His goals:

- a smooth distribution
- whose likelihood peak was at the sample mean.

Page 7

Fisher 1922

7

"...no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter..." - R. A. Fisher

Page 8

Stein's Paradox 1956

8

Total squared error can be reduced by estimating each of the means of T Gaussian random variables using data sampled from all of them, even if the random variables are independent and have different means.

[Figure: tasks $t = 1, 2, \ldots, T$ with means $\mu_1, \mu_2, \mu_3, \ldots$]

Page 9

Stein Estimation: One Sample Case

9

Problem: estimate means $\{\mu_t\}$ of $T$ Gaussian random variables.

Given: random sample $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$ for $t = 1, \ldots, T$.

[Figure: three Gaussian densities with means $\mu_1, \mu_2, \mu_3$]

Page 10

Stein Estimation: One Sample Case

10

Problem: estimate means $\{\mu_t\}$ of $T$ Gaussian random variables.

Given: random sample $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$ for $t = 1, \ldots, T$.

Maximum Likelihood Estimate: $\hat{\mu}_t = Y_t$

Page 11

Stein Estimation: One Sample Case

11

Problem: estimate means $\{\mu_t\}$ of $T$ Gaussian random variables.

Given: random sample $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$ for $t = 1, \ldots, T$.

Maximum Likelihood Estimate: $\hat{\mu}_t = Y_t$

James-Stein Estimate:

$\hat{\mu}_t^{\mathrm{JS}} = \left(1 - \frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$
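To make the estimator concrete, here is a minimal NumPy sketch of the one-sample James-Stein estimate shown on this slide; the function name and the toy data are mine, and the noise variance sigma2 is assumed known, as in the slide's setup.

```python
import numpy as np

def james_stein(y, sigma2):
    """One-sample James-Stein estimate of T Gaussian means.

    y      : length-T array, one observation Y_t per task
    sigma2 : known observation variance sigma^2
    """
    T = len(y)
    shrinkage = 1.0 - (T - 2) * sigma2 / np.sum(y ** 2)
    return shrinkage * y

# Toy usage: shrink five noisy observations toward zero.
rng = np.random.default_rng(0)
y = rng.normal(loc=[0.5, -0.2, 1.0, 0.0, 0.3], scale=1.0)
print(james_stein(y, sigma2=1.0))
```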

Page 12

James‐Stein Estimator Derivation: Efron and Morris 1972 Empirical Bayes Argument

12

Key assumptions:
• $\mu_t \sim \mathcal{N}(0, \tau^2)$, with $\tau^2$ unknown
• $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$, with $\sigma^2$ known

James-Stein Estimate:

$\hat{\mu}_t^{\mathrm{JS}} = \left(1 - \frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$

Page 13

James‐Stein Estimator Derivation: Efron and Morris 1972 Empirical Bayes Argument

13

Key assumptions:
• $\mu_t \sim \mathcal{N}(0, \tau^2)$, with $\tau^2$ unknown
• $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$, with $\sigma^2$ known

The posterior mean under these assumptions is

$E[\mu_t \mid Y_t] = \left(1 - \frac{\sigma^2}{\tau^2 + \sigma^2}\right) Y_t$

James-Stein Estimate:

$\hat{\mu}_t^{\mathrm{JS}} = \left(1 - \frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$

Page 14

James‐Stein Estimator Derivation: Efron and Morris 1972 Empirical Bayes Argument

Key assumptions:
• $\mu_t \sim \mathcal{N}(0, \tau^2)$, with $\tau^2$ unknown
• $Y_t \sim \mathcal{N}(\mu_t, \sigma^2)$, with $\sigma^2$ known

$E[\mu_t \mid Y_t] = \left(1 - \frac{\sigma^2}{\tau^2 + \sigma^2}\right) Y_t$

The unknown shrinkage factor can be estimated from the data, since

$E\!\left[\frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right] = \frac{\sigma^2}{\tau^2 + \sigma^2}$

which yields the James-Stein Estimate:

$\hat{\mu}_t^{\mathrm{JS}} = \left(1 - \frac{(T-2)\,\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$

Page 15

A More General JSE (Bock, 1972)

15

Key assumptions:
• $\mu_t \sim \mathcal{N}(\xi, \tau^2)$, with $\tau^2$ and $\xi$ unknown
• $Y_{ti} \sim \mathcal{N}(\mu_t, \sigma_t^2)$ for $i = 1, \ldots, N_t$, with $\sigma_t^2$ unknown
• $\xi = \frac{1}{T}\sum_{r=1}^{T} \bar{Y}_r$
• Positive part: $(x)_+ = \max(x, 0)$
• Diagonal $\Sigma$ with $\Sigma_{tt} = \frac{\sigma_t^2}{N_t}$
• $\bar{Y}$ is a length-$T$ vector with $t$-th entry $\bar{Y}_t$

General James-Stein Estimate:

$\hat{\mu}_t^{\mathrm{JS}} = \xi + \left(1 - \frac{T-3}{(\bar{Y}-\xi)^{\top}\Sigma^{-1}(\bar{Y}-\xi)}\right)_{\!+} \left(\bar{Y}_t - \xi\right)$
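A minimal NumPy sketch of this more general positive-part estimator, under the slide's definitions ($\Sigma$ diagonal with entries $\sigma_t^2/N_t$, $\xi$ the mean of the sample means); the function name and argument layout are my own.

```python
import numpy as np

def general_james_stein(y_bar, sigma2, n):
    """Positive-part James-Stein estimate (Bock, 1972) as written on the slide.

    y_bar  : length-T array of sample means, one per task
    sigma2 : length-T array of per-task variances sigma_t^2
    n      : length-T array of per-task sample sizes N_t
    """
    T = len(y_bar)
    xi = np.mean(y_bar)                         # xi = (1/T) sum_r Ybar_r
    d = y_bar - xi
    sigma_tt = sigma2 / n                       # diagonal of Sigma
    quad = np.sum(d ** 2 / sigma_tt)            # (Ybar - xi)^T Sigma^{-1} (Ybar - xi)
    shrinkage = max(1.0 - (T - 3) / quad, 0.0)  # positive part
    return xi + shrinkage * d
```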

Page 16

James‐Stein Dominates

16

James and Stein's theorem (1961):

For $T > 3$, the general JSE dominates the sample average (MLE):

$E\left[\|\mu - \hat{\mu}^{\mathrm{JS}}\|_2^2\right] \le E\left[\|\mu - \bar{Y}\|_2^2\right]$

for every choice of $\mu$. This is also written as:

$R(\hat{\mu}^{\mathrm{JS}}) \le R(\bar{Y})$

Page 17

17

James-Stein Estimation ↔ Empirical Bayes

Multi-task Averaging (MTA) ↔ Empirical Loss Minimization with Regularization
(empirical loss: Vapnik; regularization: Tikhonov)

Page 18

Multi‐Task Averaging (Feldman et al. 2012)

18

Problem: estimate means $\{\mu_t\}$ of $T$ random variables.

Given: $N_t$ IID samples $\{y_{ti}\}_{i=1}^{N_t}$ from each random variable.

Data model: $Y_{ti}$ drawn IID from $\nu_t$ with finite mean $\mu_t$.

[Figure: tasks $t = 1, 2, \ldots, T$ with means $\mu_1, \mu_2, \mu_3, \ldots$]

Page 19

Building the MTA Objective

19

"Single-task" averaging:

$\bar{y}_t = \arg\min_{\mu_t} \sum_{i=1}^{N_t} (y_{ti} - \mu_t)^2$

Page 20

Building the MTA Objective

20

"Single-task" averaging, added across tasks:

$\{\bar{y}_t\}_{t=1}^{T} = \arg\min_{\{\mu_t\}_{t=1}^{T}} \frac{1}{T}\sum_{t=1}^{T}\sum_{i=1}^{N_t} (y_{ti} - \mu_t)^2$

Page 21

Building the MTA Objective

21

"Single-task" averaging, with each task's squared error scaled by the task variance (a Mahalanobis distance):

$\{\bar{y}_t\}_{t=1}^{T} = \arg\min_{\{\mu_t\}_{t=1}^{T}} \frac{1}{T}\sum_{t=1}^{T}\sum_{i=1}^{N_t} \frac{(y_{ti} - \mu_t)^2}{\sigma_t^2}$

Page 22

The MTA Objective

22

"Multi-task" averaging:

$\{\hat{\mu}_t^{\mathrm{MTA}}\}_{t=1}^{T} = \arg\min_{\{\mu_t\}_{t=1}^{T}} \frac{1}{T}\sum_{t=1}^{T}\sum_{i=1}^{N_t} \frac{(y_{ti} - \mu_t)^2}{\sigma_t^2} + \frac{\gamma}{T^2}\sum_{r=1}^{T}\sum_{s=1}^{T} A_{rs}(\mu_r - \mu_s)^2$

The first term is the Mahalanobis distance to the samples from task $t$; $A_{rs}$ is the similarity between tasks $r$ and $s$.

Page 23

The MTA Objective

23

"Multi-task" averaging (objective as above).

Task 1: Estimate average movie ticket price.
Task 2: Estimate mean age of kids at summer camp.

Page 24

The MTA Objective

24

"Multi-task" averaging (objective as above).

Task 1: Estimate average movie ticket price.
Task 2: Estimate the price of tea in China?

Page 25

The MTA Objective

25

"Multi-task" averaging (objective as above).

[Figure: sample averages $\bar{y}_1$ and $\bar{y}_2$ for tasks $t = 1$ and $t = 2$]

Page 26

The MTA Objective

26

"Multi-task" averaging (objective as above).

[Figure: MTA estimates $\hat{\mu}_1^{\mathrm{MTA}}$ and $\hat{\mu}_2^{\mathrm{MTA}}$ pulled toward each other relative to $\bar{y}_1$ and $\bar{y}_2$]

Page 27

The MTA Objective

27

"Multi-task" averaging (objective as above).

[Figure: another configuration of sample averages $\bar{y}_1$ and $\bar{y}_2$ for tasks $t = 1$ and $t = 2$]

Page 28

The MTA Objective

28

"Multi-task" averaging (objective as above).

[Figure: MTA estimates $\hat{\mu}_1^{\mathrm{MTA}}$ and $\hat{\mu}_2^{\mathrm{MTA}}$ shrunk toward each other relative to $\bar{y}_1$ and $\bar{y}_2$]

Page 29

The MTA Objective

29

"Multi-task" averaging (objective as above).

[Figure: sample averages $\bar{y}_1$ and $\bar{y}_2$ for tasks $t = 1$ and $t = 2$]

Page 30

The MTA Objective

30

"Multi-task" averaging (objective as above).

[Figure: MTA estimates $\hat{\mu}_1^{\mathrm{MTA}}$ and $\hat{\mu}_2^{\mathrm{MTA}}$ shrunk toward each other relative to $\bar{y}_1$ and $\bar{y}_2$]

Page 31

The MTA Objective

31

"Multi-task" averaging (objective as above).

The empirical loss term lowers bias; the regularizer lowers estimation variance.

Page 32

MTA Closed Form Solution

32

For non-negative A, the minimizer of the "multi-task" averaging objective is

$\hat{\mu}^{\mathrm{MTA}} = \left(I + \frac{\gamma}{T}\,\Sigma L\right)^{-1} \bar{y}$

where $\bar{y}$ is the vector of the $T$ sample averages, $L$ is the graph Laplacian of $A + A^{\top}$, and $\Sigma$ is the diagonal matrix of sample-mean variances, $\Sigma_{tt} = \frac{\sigma_t^2}{N_t}$.
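A minimal NumPy sketch of this closed form; the names are mine, and the Laplacian is built from $A + A^{\top}$ with $\Sigma_{tt} = \sigma_t^2/N_t$ as the slide specifies.

```python
import numpy as np

def mta(y_bar, sigma2, n, A, gamma=1.0):
    """MTA closed form: (I + (gamma / T) Sigma L)^{-1} y_bar."""
    T = len(y_bar)
    Sigma = np.diag(sigma2 / n)              # Sigma_tt = sigma_t^2 / N_t
    A_sym = A + A.T
    L = np.diag(A_sym.sum(axis=1)) - A_sym   # graph Laplacian of A + A^T
    W = np.linalg.inv(np.eye(T) + (gamma / T) * Sigma @ L)
    return W @ y_bar                         # L @ ones = 0, so rows of W sum to 1
```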

Page 33

MTA Closed Form Solution

33

$\hat{\mu}^{\mathrm{MTA}} = \left(I + \frac{\gamma}{T}\,\Sigma L\right)^{-1} \bar{y}$

Lemma: this inverse always exists if $A_{rs} \ge 0$, $\gamma \ge 0$, and $N_t \ge 1$.

Page 34

MTA Closed Form Solution

34

$\hat{\mu}^{\mathrm{MTA}} = \left(I + \frac{\gamma}{T}\,\Sigma L\right)^{-1} \bar{y} = W\bar{y}$

The MTA estimates are a linear combination of the $T$ sample averages, through the $T \times T$ matrix $W$.

Page 35

MTA Closed Form Solution

35

$\hat{\mu}^{\mathrm{MTA}} = \left(I + \frac{\gamma}{T}\,\Sigma L\right)^{-1} \bar{y} = W\bar{y}$

Theorem: $W$ is right-stochastic, so the MTA estimates are convex combinations of the $T$ sample averages.

Page 36

When is MTA Better than the sample means?

36

Task 1 (estimate average movie ticket price): $Y_{1i} = \mu_1 + \epsilon_1$, with $N_1$ samples and variance $\sigma_1^2$.
Task 2 (estimate mean age of kids at summer camp): $Y_{2i} = \mu_2 + \epsilon_2$, with $N_2$ samples and variance $\sigma_2^2$.

Page 37

When is MTA Better than the sample means?

37

With $Y_{1i} = \mu_1 + \epsilon_1$ ($N_1$ samples, variance $\sigma_1^2$) and $Y_{2i} = \mu_2 + \epsilon_2$ ($N_2$ samples, variance $\sigma_2^2$), the MTA estimate $\left(I + \frac{\gamma}{T}\Sigma L\right)^{-1}\bar{Y}$ works out to

$\hat{\mu}_1^{\mathrm{MTA}} = \left(\frac{T + \frac{\sigma_2^2}{N_2}A_{12}}{T + \frac{\sigma_1^2}{N_1}A_{12} + \frac{\sigma_2^2}{N_2}A_{12}}\right)\bar{Y}_1 + \left(\frac{\frac{\sigma_1^2}{N_1}A_{12}}{T + \frac{\sigma_1^2}{N_1}A_{12} + \frac{\sigma_2^2}{N_2}A_{12}}\right)\bar{Y}_2$

Page 38

When is MTA Better than the sample means?

38

With $Y_{1i} = \mu_1 + \epsilon_1$ ($N_1$ samples, variance $\sigma_1^2$) and $Y_{2i} = \mu_2 + \epsilon_2$ ($N_2$ samples, variance $\sigma_2^2$), the MTA estimate

$\hat{\mu}_1^{\mathrm{MTA}} = \left(\frac{T + \frac{\sigma_2^2}{N_2}A_{12}}{T + \frac{\sigma_1^2}{N_1}A_{12} + \frac{\sigma_2^2}{N_2}A_{12}}\right)\bar{Y}_1 + \left(\frac{\frac{\sigma_1^2}{N_1}A_{12}}{T + \frac{\sigma_1^2}{N_1}A_{12} + \frac{\sigma_2^2}{N_2}A_{12}}\right)\bar{Y}_2$

is biased, but has smaller error variance than the sample averages. In particular,

$\mathrm{Risk}[\hat{\mu}_1^{\mathrm{MTA}}] < \mathrm{Risk}[\bar{Y}_1] \quad \text{if} \quad (\mu_1 - \mu_2)^2 < \frac{4}{A_{12}} + \frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}$

Page 39

39

$\mathrm{Risk}[\hat{\mu}^{\mathrm{MTA}}] < \mathrm{Risk}[\bar{Y}] \quad \text{if} \quad (\mu_1 - \mu_2)^2 - \frac{\sigma_1^2}{N_1} - \frac{\sigma_2^2}{N_2} < \frac{4}{A_{12}}$

Page 40

Optimal A for T = 2

40

Example: [Figure: distributions $Y_1$, $Y_2$ with means $\mu_1$, $\mu_2$]

Answer: the optimal task similarity in terms of MSE is

$A^{*}_{12} = \frac{2}{(\mu_1 - \mu_2)^2}$

Page 41

Estimated Optimal A for T = 2

41

Example: [Figure: distributions $Y_1$, $Y_2$ with means $\mu_1$, $\mu_2$]

Answer: the optimal task similarity in terms of MSE is

$A^{*}_{12} = \frac{2}{(\mu_1 - \mu_2)^2}$

estimated in practice as

$A^{*}_{12} = \frac{2}{(\bar{y}_1 - \bar{y}_2)^2}$

Page 42

42

[Plot: performance as a function of the similarity $A_{12}$, with the optimal similarity $A^{*}_{12} = \frac{2}{(\mu_1 - \mu_2)^2}$ marked; parameters $\sigma_1^2 = 1$, $\sigma_2^2 = 1$, $\mu_1 = 0$, $\mu_2 = 1$]

Page 43

Optimal A for T > 2

43

• Bad news: no simple analytical minimization of risk for T > 2.

Page 44

Optimal A for T > 2

44

• Bad news: no simple analytical minimization of risk for T > 2.

• One solution: use the pairwise estimate to populate A:

$A^{*}_{rs} = \frac{2}{(\bar{y}_r - \bar{y}_s)^2}$

[Figure: task graph with pairwise similarities $A^{*}_{12}, A^{*}_{13}, A^{*}_{21}, A^{*}_{23}, A^{*}_{31}, A^{*}_{32}$]

Page 45

Optimal A for T > 2

• Bad news: no simple analytical minimization of risk for T > 2.

• One solution: use the pairwise estimate to populate A.

• Better solution: constrain $A = a\mathbf{1}\mathbf{1}^{\top}$ and optimize over $a$.

[Figure: task graph with a single shared similarity $a$ on every edge]

Page 46

Optimal A for T > 2

46

• Bad news: no simple analytical minimization of risk for T > 2.

• One solution: use the pairwise estimate to populate A.

• Better solution: constrain $A = a\mathbf{1}\mathbf{1}^{\top}$ and optimize over $a$.

• Analyzable: the optimal similarity $a$ is

$a^{*} = \frac{2}{\frac{1}{T(T-1)}\sum_{r=1}^{T}\sum_{s=1}^{T}(\mu_r - \mu_s)^2}$

i.e., 2 divided by the average squared distance between the $T$ means.

Page 47

Optimal A for T > 2

47

• Bad news: no simple analytical minimization of risk for T > 2.

• One solution: use the pairwise estimate to populate A.

• Better solution: constrain $A = a\mathbf{1}\mathbf{1}^{\top}$ and optimize over $a$.

• Analyzable: the optimal similarity $a$ is

$a^{*} = \frac{2}{\frac{1}{T(T-1)}\sum_{r=1}^{T}\sum_{s=1}^{T}(\mu_r - \mu_s)^2}$

• In practice, we estimate (see the sketch below)

$a^{*} = \frac{2}{\frac{1}{T(T-1)}\sum_{r=1}^{T}\sum_{s=1}^{T}(\bar{y}_r - \bar{y}_s)^2}$
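A minimal NumPy sketch of this plug-in estimate of the constant similarity; the function name is mine.

```python
import numpy as np

def estimate_constant_similarity(y_bar):
    """a = 2 / (average squared distance between the T sample means)."""
    T = len(y_bar)
    sq_dists = (y_bar[:, None] - y_bar[None, :]) ** 2
    avg_sq_dist = sq_dists.sum() / (T * (T - 1))   # diagonal terms are zero
    return 2.0 / avg_sq_dist
```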

Page 48

How to Set Similarity Matrix A?

48

Choose A to minimize expected total squared error.

Choose A to minimize worst-case total squared error.

Page 49

Minimax A for T = 2

To find minimax A, we need:

1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$.

[Figure: the square $[b_l, b_u]^2$ with corners $(b_l, b_l)$, $(b_l, b_u)$, $(b_u, b_l)$, $(b_u, b_u)$]

Page 50

Minimax A for T = 2

To find minimax A, we need:

1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$.

[Figure: the square $[b_l, b_u]^2$ with corners $(b_l, b_l)$, $(b_l, b_u)$, $(b_u, b_l)$, $(b_u, b_u)$]

Page 51

Minimax A for T = 2

To find minimax A, we need:

1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$.

2. A least favorable prior (LFP):

$p(\mu_1, \mu_2) = \begin{cases} \tfrac{1}{2}, & \text{if } (\mu_1, \mu_2) = (b_l, b_u) \\ \tfrac{1}{2}, & \text{if } (\mu_1, \mu_2) = (b_u, b_l) \\ 0, & \text{otherwise.} \end{cases}$

[Figure: the square $[b_l, b_u]^2$ with the LFP mass on the corners $(b_l, b_u)$ and $(b_u, b_l)$]

Page 52

Minimax A for T = 2

To find minimax A, we need:

1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$.

2. A least favorable prior (LFP).

3. A Bayes-optimal estimator with constant risk w.r.t. the LFP: MTA with similarity

$A^{\mathrm{mm}}_{12} = \frac{2}{(b_u - b_l)^2}$

[Figure: the square $[b_l, b_u]^2$ with its corners marked]

Page 53

Minimax A for T = 2

To find minimax A, we need:

1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$.

2. A least favorable prior (LFP).

3. A Bayes-optimal estimator with constant risk w.r.t. the LFP: MTA with similarity

$A^{\mathrm{mm}}_{12} = \frac{2}{(b_u - b_l)^2}$

$b_l = \min_t \bar{y}_t, \quad b_u = \max_t \bar{y}_t$

[Figure: the square $[b_l, b_u]^2$ with the sample averages $\bar{y}_1$, $\bar{y}_2$ marked]

Page 54

Minimax A for T > 2

To find minimax A, we need:

1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$.

2. A least favorable prior (LFP).

3. A Bayes-optimal estimator with constant risk w.r.t. the LFP: MTA with constant similarity

$a^{\mathrm{mm}} = \frac{2}{(b_u - b_l)^2}$

$b_l = \min_t \bar{y}_t, \quad b_u = \max_t \bar{y}_t$

[Figure: the square $[b_l, b_u]^2$; number line with the sample averages $\bar{y}_1, \ldots, \bar{y}_5$]
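A minimal sketch of the corresponding minimax constant similarity, with the bounds estimated from the sample means as on the slide; the naming is mine.

```python
import numpy as np

def minimax_constant_similarity(y_bar):
    """a_mm = 2 / (b_u - b_l)^2 with b_l = min_t ybar_t and b_u = max_t ybar_t."""
    b_l, b_u = float(np.min(y_bar)), float(np.max(y_bar))
    return 2.0 / (b_u - b_l) ** 2
```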

Page 55

Estimator Summary

55

Page 56

Simulations

56

Gaussian simulations:
  $\mu_t \sim \mathcal{N}(0, \sigma_{\mu}^2)$
  $\sigma_t^2 \sim \mathrm{Gamma}(0.9, 1.0) + 0.1$
  $N_t \sim \mathcal{U}\{2, \ldots, 100\}$
  $y_{ti} \sim \mathcal{N}(\mu_t, \sigma_t^2)$

Uniform simulations:
  $\mu_t \sim \mathcal{U}(-\sqrt{3\sigma_{\mu}^2}, \sqrt{3\sigma_{\mu}^2})$
  $\sigma_t^2 \sim \mathcal{U}(0.1, 2.0)$
  $N_t \sim \mathcal{U}\{2, \ldots, 100\}$
  $y_{ti} \sim \mathcal{U}[\mu_t - \sqrt{3\sigma_t^2},\; \mu_t + \sqrt{3\sigma_t^2}]$
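A minimal NumPy sketch of one draw from the Gaussian simulation column above; the function name, the default $\sigma_{\mu}^2$, and returning the true means alongside the summary statistics are my own choices.

```python
import numpy as np

def gaussian_simulation(T, sigma_mu2=1.0, seed=None):
    """One simulated dataset: returns sample means, variances, sizes, and true means."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, np.sqrt(sigma_mu2), size=T)    # mu_t ~ N(0, sigma_mu^2)
    sigma2 = rng.gamma(0.9, 1.0, size=T) + 0.1          # sigma_t^2 ~ Gamma(0.9, 1.0) + 0.1
    n = rng.integers(2, 101, size=T)                    # N_t ~ U{2, ..., 100}
    y_bar = np.array([rng.normal(mu[t], np.sqrt(sigma2[t]), n[t]).mean()
                      for t in range(T)])
    return y_bar, sigma2, n, mu
```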

Page 57

5‐fold randomized Cross‐Validation 

57

Page 58

Gaussian Simulation, T = 5

58

(Lower is better.)

Page 59

Gaussian Simulation, T = 25

59

Page 60

Gaussian Simulation, T = 500

60

Page 61

Uniform Simulation, T = 5

61

Page 62

Uniform Simulation, T = 25

62

Page 63

Uniform Simulation, T = 500

63

Page 64

Scales O(T)

64

For constant and minimax similarity, the MTA weight matrix can be rewritten using the Sherman-Morrison formula:

$W = \left(I + \Sigma L(a\mathbf{1}\mathbf{1}^{\top})\right)^{-1}$
$\;\;= \left(I + \Sigma(aT I - a\mathbf{1}\mathbf{1}^{\top})\right)^{-1}$
$\;\;= \left(I + aT\Sigma - a\Sigma\mathbf{1}\mathbf{1}^{\top}\right)^{-1}$
$\;\;= \left(Z - z\mathbf{1}^{\top}\right)^{-1}$
$\;\;= Z^{-1} + \frac{Z^{-1}z\mathbf{1}^{\top}Z^{-1}}{1 - \mathbf{1}^{\top}Z^{-1}z}$

with $Z = I + aT\Sigma$ (diagonal) and $z = a\Sigma\mathbf{1}$. Since $Z$ is diagonal, $Z^{-1}$, $Z^{-1}z$, and $\mathbf{1}^{\top}Z^{-1}$ can all be computed in $O(T)$, so $W\bar{Y}$ is $O(T)$.
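A minimal NumPy sketch of the $O(T)$ product $W\bar{Y}$ for the constant-similarity case, following the Sherman-Morrison rewrite above (the $\gamma/T$ factor is assumed absorbed into $a$); the variable names follow the slide, the function name is mine.

```python
import numpy as np

def mta_constant_similarity(y_bar, sigma2, n, a):
    """Compute W @ y_bar in O(T) for A = a * 1 1^T via Sherman-Morrison."""
    T = len(y_bar)
    s = sigma2 / n                       # diagonal of Sigma
    Z_diag = 1.0 + a * T * s             # Z = I + a T Sigma (diagonal)
    z = a * s                            # z = a Sigma 1
    Zinv_y = y_bar / Z_diag              # Z^{-1} y_bar
    Zinv_z = z / Z_diag                  # Z^{-1} z
    denom = 1.0 - Zinv_z.sum()           # 1 - 1^T Z^{-1} z
    return Zinv_y + Zinv_z * Zinv_y.sum() / denom
```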

Page 65

Application: Class Grades

65

Problem: estimate final grades $\{\mu_t\}$ of $T$ students.

Given: $N$ homework grades $\{y_{ti}\}_{i=1}^{N}$ from each student.

Page 66

Application: Class Grades

66

Problem: estimate final grades $\{\mu_t\}$ of $T$ students.

Given: $N$ homework grades $\{y_{ti}\}_{i=1}^{N}$ from each student.

• 16 classrooms → 16 datasets.
• Uncurved grades normalized to be between 0 and 100.
• Pooled variance used for all tasks in a dataset.
• Final class grades include homeworks, projects, labs, quizzes, midterms, and the final exam.

Page 67

67

Percent change in risk vs. single-task.

Page 68

68

Percent change in risk vs. single-task.

Page 69

69

Percent change in risk vs. single-task.

Page 70

70

Percent change in risk vs. single-task.

Page 71

Application: Product Sales

71

Exp 1: How much will the t-th customer spend on their next order?

Given: dollar amounts $\{y_{ti}\}_{i=1}^{N_t}$ that the t-th customer spent on $N_t$ orders.

• T = 477
• $y_{ti}$ ranged from $15 to $480.
• $N_t$ ranged from 2 to 17.

Page 72

Application: Product Sales

72

Exp 2: If you bought the t-th puzzle, how much will you spend on your next order?

Given: dollar amounts $\{y_{ti}\}_{i=1}^{N_t}$ that each of $N_t$ customers spent after buying the t-th puzzle.

• T = 77
• $y_{ti}$ ranged from $0 to $480.
• $N_t$ ranged from 8 to 348.

Page 73

Application: Product Sales

73

No ground truth → use the sample means from all the data as $\mu_t$, and use a random half of the data to get $\bar{y}_t$.

[Figure: order histories for Customer 1, Customer 2, ..., Customer T]

Page 74

Application: Product Sales

74

No ground truth → use the sample means from all the data as $\mu_t$, and use a random half of the data to get $\bar{y}_t$.

[Figure: full order histories for Customers 1 through T, giving $\mu_1, \mu_2, \ldots, \mu_T$]

Page 75

Application: Product Sales

75

No ground truth → use the sample means from all the data as $\mu_t$, and use a random half of the data to get $\bar{y}_t$.

[Figure: a random half of each customer's orders, giving $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_T$]

Page 76

Application: Product Sales

76

Percent change in risk vs. single-task, averaged over 1000 random splits.

(Lower is better.)

Page 77

Application: Product Sales

77

Percent change in risk vs. single-task, averaged over 1000 random splits.

(Lower is better.)

Page 78

Model Mismatch: 2008 Election

78

Problem: What percent of the t-th state's vote will go to Obama and McCain on election day?

Given: $N_t$ pre-election polls $\{y_{ti}\}_{i=1}^{N_t}$ from each state.

Page 79

Model Mismatch: 2008 Election

79

Problem: What percent of the t-th state's vote will go to Obama and McCain on election day?

Given: $N_t$ pre-election polls $\{y_{ti}\}_{i=1}^{N_t}$ from each state.

Percent change in average risk vs. single-task. (Lower is better.)

Page 80

MTA Applied to Kernel Density Estimation

80

KDE: Given that events $\{x_i\}$ happened, estimate the probability of event $z$ as

$p(z) = \frac{1}{N}\sum_{i=1}^{N} K(x_i, z)$

[Figure: events $x_1, \ldots, x_5$ around a query point $z$]

Page 81

MTA Applied to Kernel Density Estimation

81

KDE: Given that events $\{x_i\}$ happened, estimate the probability of event $z$ as

$p(z) = \frac{1}{N}\sum_{i=1}^{N} K(x_i, z)$

[Figure: events $x_1, \ldots, x_5$ around a query point $z$]

Page 82

MTA Applied to Kernel Density Estimation

82

KDE: Given that events $\{x_i\}$ happened, estimate the probability of event $z$ as

$p(z) = \frac{1}{N}\sum_{i=1}^{N} K(x_i, z)$

Equivalently,

$\arg\min_{y(z)} \sum_{i=1}^{N} \left(K(x_i, z) - y(z)\right)^2$

Page 83

MTA Applied to Kernel Density Estimation

83

KDE: Given that events $\{x_i\}$ happened, estimate the probability of event $z$ as

$p(z) = \frac{1}{N}\sum_{i=1}^{N} K(x_i, z)$

Equivalently,

$\arg\min_{y(z)} \sum_{i=1}^{N} \left(K(x_i, z) - y(z)\right)^2$

Use MTA to form a multi-task KDE:

$\arg\min_{\{y_t(z_t)\}_{t=1}^{T}} \sum_{t=1}^{T}\sum_{i=1}^{N_t} \left(K_t(x_{ti}, z_t) - y_t(z_t)\right)^2 + \gamma \sum_{r=1}^{T}\sum_{s=1}^{T} A_{rs}\left(y_r(z_r) - y_s(z_s)\right)^2$
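A minimal NumPy sketch of this multi-task KDE at one query point per task, obtained by setting the gradient of the objective above to zero and solving the resulting linear system; the Gaussian kernel and all names are my own choices (the slides do not specify the $K_t$).

```python
import numpy as np

def gaussian_kernel(X, z, bandwidth=1.0):
    """Hypothetical isotropic Gaussian kernel; any kernel K_t could be used."""
    d2 = np.sum((X - z) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / bandwidth ** 2)

def mt_kde(X_tasks, z_tasks, A, gamma=1.0, bandwidth=1.0):
    """Smoothed density estimates y_t(z_t) for the MT-KDE objective on the slide.

    X_tasks : list of (N_t, d) arrays of observed events per task
    z_tasks : (T, d) array, one query point z_t per task
    A       : (T, T) nonnegative task-similarity matrix
    """
    k_bar = np.array([gaussian_kernel(X, z, bandwidth).mean()
                      for X, z in zip(X_tasks, z_tasks)])   # single-task KDE values
    n = np.array([len(X) for X in X_tasks], dtype=float)
    A_sym = A + A.T
    L = np.diag(A_sym.sum(axis=1)) - A_sym
    # Setting the gradient to zero gives (diag(N_t) + gamma L) y = diag(N_t) k_bar.
    return np.linalg.solve(np.diag(n) + gamma * L, n * k_bar)
```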

Page 84

MT‐KDE for Terrorism Risk Assessment

84

Problem: Estimate the probability of terrorist events at 40,000 locations in Jerusalem, each location $z, x_i \in \mathbb{R}^{74}$.

$T = 7$ terrorist groups.

Task similarity matrix A from terrorism expert Mohammed Hafez:

[Figure: expert-provided task similarity matrix]

Page 85

MT‐KDE for Terrorism Risk Assessment

85

Mean Reciprocal Rank of a Left-Out Event:

               Suicides (T = 17)   Bombings (T = 11)
Single task         .145               .1096
James-Stein         .145               .1096
MTA constant        .1897              .1096
MTA minimax         .1897              .1096
Expert sim          .1292              .0089

Page 86

MTA is an intuitive, simple, accurate approach to estimating multiple means jointly.

When can you estimate multiple means at once? 

Can you estimate the task similarities better?

Learn more:  see our 2012 NIPS paper or email me for the journal paper  ([email protected])

86

Last Slide

Page 87

Shrinkage estimators of the form (J-S, and more)

$\hat{\mu}_t = \lambda\bar{y}_t + (1-\lambda)\sum_{r=1}^{T}\alpha_r\bar{y}_r, \qquad 0 < \lambda \le 1, \quad \sum_{r=1}^{T}\alpha_r = 1, \quad \alpha_r \ge 0\ \forall r$

can be written $\hat{\mu} = W\bar{y}$.

MTA: $\hat{\mu} = W\bar{y}$ with right-stochastic $W = \left(I + \frac{\gamma}{T}\,\Sigma L\right)^{-1}$, diagonal $\Sigma$ with $\Sigma_{tt} \ge 0$, $A_{rs} \ge 0$, $\gamma \ge 0$.

Page 88

Bayesian Analysis: IGMRFs

88

Recall that

$\frac{1}{2}\sum_{r=1}^{T}\sum_{s=1}^{T} A_{rs}(y_r - y_s)^2 = \frac{1}{2}\, y^{\top} L_S\, y$

where $L = D - A$. The above regularizer can be thought of as coming from an intrinsic (improper) GMRF prior (Rue and Held, '05):

$p(y) = (2\pi)^{-\frac{T}{2}} |L_S|^{\frac{1}{2}} \exp\!\left(-\tfrac{1}{2}\, y^{\top} L_S\, y\right)$

Usually used for graphical models when $L_S$ is sparse.

Page 89

Bayesian Analysis

89

Assuming the differences are independent:

$p(y) \propto \prod_{r=1}^{T}\prod_{s=1}^{T} e^{-\gamma A_{rs}(y_r - y_s)^2}$

For T = 3:

$Y_1 - Y_2 \sim \mathcal{N}(0,\; 1/(2\gamma A_{12}))$
$Y_2 - Y_3 \sim \mathcal{N}(0,\; 1/(2\gamma A_{23}))$
$Y_1 - Y_3 \sim \mathcal{N}(0,\; 1/(2\gamma A_{13}))$

$Y_1 - Y_3 = (Y_1 - Y_2) + (Y_2 - Y_3) \;\Rightarrow\; \frac{1}{A_{13}} = \frac{1}{A_{12}} + \frac{1}{A_{23}}$
$Y_1 - Y_2 = (Y_1 - Y_3) + (Y_3 - Y_2) \;\Rightarrow\; \frac{1}{A_{12}} = \frac{1}{A_{13}} + \frac{1}{A_{32}}$
$Y_2 - Y_3 = (Y_2 - Y_1) + (Y_1 - Y_3) \;\Rightarrow\; \frac{1}{A_{23}} = \frac{1}{A_{21}} + \frac{1}{A_{13}}$

Impossible to satisfy all three right-hand sides with any finite A!

Page 90

Related Multi‐Task Regularizers

90

$\sum_{r=1}^{T} \left\|\beta_r - \frac{1}{T}\sum_{s=1}^{T}\beta_s\right\|_2^2$ : Distance to mean (Evgeniou and Pontil, 2004)

$\|\beta\|_*$ : Trace norm (Abernethy et al., 2009)

$\mathrm{tr}(\beta^{\top} D^{-1} \beta)$ : Learned, shared feature covariance matrix (Argyriou et al., 2008)

$\mathrm{tr}(\beta \Sigma^{-1} \beta^{\top})$ : Learned task covariance matrix (Jacob et al., 2008; Zhang and Yeung, 2010)

$\sum_{r=1}^{T}\sum_{s=1}^{T} A_{rs}\|\beta_r - \beta_s\|_2^2$ : Pairwise distance regularizer (Sheldon, 2008) or constraint (Kato et al., 2007)

Page 91

Gaussian Simulation, T = 2

91

(Lower is better.)

Page 92

Uniform Simulation, T = 2

92

(Lower is better.)

Page 93

Pairwise T = 5 Results

93

(Lower is better.)

Page 94

Oracle T = 5 Results

94

(Lower is better.)

Page 95

Stein's Unbiased Risk Estimate

95

• The true $A^{*}_{12}$ depends on the unknown $\mu_t$. We plugged in $\bar{y}_t$ to get:

$A^{*}_{12} = \frac{2}{(\bar{y}_1 - \bar{y}_2)^2}$

• Another approach: minimize Stein's unbiased risk estimate (SURE), which is an empirical proxy $Q$ such that $E[Q] = \text{risk}$.

• Result:

$A^{\mathrm{SURE}}_{12} = \left(\frac{2}{(\bar{y}_1 - \bar{y}_2)^2 - \frac{\sigma_1^2}{N_1} - \frac{\sigma_2^2}{N_2}}\right)_{\!+}$
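A minimal sketch of this SURE-based similarity for T = 2, applying the positive-part operation exactly as written (division by zero is not handled in this sketch); the names are mine, and var1, var2 denote the sample-mean variances $\sigma_t^2 / N_t$.

```python
def sure_similarity(y_bar1, y_bar2, var1, var2):
    """A_12^SURE = ( 2 / ((ybar1 - ybar2)^2 - var1 - var2) )_+"""
    denom = (y_bar1 - y_bar2) ** 2 - var1 - var2
    return max(2.0 / denom, 0.0)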

Page 96

SURE T = 2 Experiments

96

(Lower is better.)

Page 97

Alternative Formulation

97

MTA is

$\left(I + \frac{\gamma}{T}\,\Sigma L\right)^{-1}\bar{Y}$, with optimal similarity $A^{*}_{12} = \frac{2}{(\mu_1 - \mu_2)^2}$.

The MTA variant is

$\Sigma^{1/2}\left(I + \frac{\gamma}{T}\,L\right)^{-1}\Sigma^{-1/2}\,\bar{Y}$, with optimal similarity $A^{*}_{12} = \frac{2}{\left(\frac{\mu_1}{\sigma_1} - \frac{\mu_2}{\sigma_2}\right)^2}$.

Different notions of distance! What if $\mu_1 = 2$, $\sigma_1 = 1$, $\mu_2 = 4$, $\sigma_2 = 2$?
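Working the slide's question through both formulas (my arithmetic, not from the slides): the original optimal similarity is $A^{*}_{12} = \frac{2}{(\mu_1 - \mu_2)^2} = \frac{2}{(2-4)^2} = \frac{1}{2}$, while the variant gives $\frac{2}{\left(\frac{\mu_1}{\sigma_1} - \frac{\mu_2}{\sigma_2}\right)^2} = \frac{2}{\left(\frac{2}{1} - \frac{4}{2}\right)^2} = \frac{2}{0}$, which is unbounded: after scaling by the standard deviations, the two tasks look identical to the variant.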

Page 98

Alternative Formulation T = 2 Results

98

(Lower is better.)

Page 99

MTA Closed Form Solution

99

$\hat{\mu}^{\mathrm{MTA}} = \left(I + \frac{\gamma}{T}\,\Sigma L\right)^{-1} \bar{y}$

This is a more general form of the regularized Laplacian kernel (Smola and Kondor, 2003).

Page 100

Stein's Unbiased Risk Estimate

100

• The true $A^{*}_{12}$ depends on the unknown $\mu_t$. We plugged in $\bar{y}_t$ to get:

$A^{*}_{12} = \frac{2}{(\bar{y}_1 - \bar{y}_2)^2}$

• Another approach: minimize Stein's unbiased risk estimate (SURE), which is an empirical proxy $Q$ such that $E[Q] = \text{risk}$.

• Result:

$A^{\mathrm{SURE}}_{12} = \left(\frac{2}{(\bar{y}_1 - \bar{y}_2)^2 - \frac{\sigma_1^2}{N_1} - \frac{\sigma_2^2}{N_2}}\right)_{\!+}$

Page 101

SURE T = 2 Experiments

101

(Lower is better.)