
Advanced Topics in Machine Learning (Part II)

5. Multi-Task Learning

February 13, 2009

Andreas Argyriou


Today’s Plan

• What multi-task learning is

• Regularisation methods for multi-task learning

• Learning multiple tasks on a subspace & an alternating algorithm

• Other multi-task learning approaches


Supervised Tasks

• Recall the notion of a supervised regression or classification task

• Given a set of input/output pairs (a training set), we wish to compute the functional relationship between the input and the output

$x \xrightarrow{\;f\;} y$


Multiple Supervised Tasks

• What if we have multiple supervised tasks?

$x \xrightarrow{\;f_1\;} y$

$\quad\vdots$

$x \xrightarrow{\;f_n\;} y$

• Assuming that there are relations among the n tasks, is learning them together better than learning each of them separately?


Example (Marketing Survey)

• There are 180 persons and 8 computer models; each model is represented as a vector $x$; each person rates all the models on a $\{0, \dots, 10\}$ scale (likelihood of purchase) [Lenk et al. 1996]

• Each person corresponds to a task: we wish to learn a “decision function” $f_t$ for each person $t$

• But the ways different persons make decisions about products are related

• Can we exploit the fact that the tasks are related?

• Can we say anything about the preferences of a new person?


Example (Collaborative Filtering)

• E.g. the Netflix database: there are ratings of M movies by n users; each user has rated only a small set of movies

• So, the users/movies matrix is only partially observed

• Can we fill in the remaining entries of this matrix, i.e. can we recommend to a user a movie he/she would like to watch?

• Similar to the previous example, but now there is only partial information; in some data sets, we may even know nothing about the movies

• The tasks are again related


Example (Computer Vision)

• From [Torralba et al., 2004]: detection of multiple object classes in cluttered scenes

• Detection of each object corresponds to a classification task

• The input data here are the images; note that the input data are shared by all the tasks


Example (Computer Vision)

• The assumption made is that human vision uses simple features for detecting a large number of different objects; so there are relations among object detection tasks

• Another example is character recognition

• Humans learn to recognise characters at an early age; but then if they see a new character (e.g. the euro currency symbol) they only need one or two training examples

• Thus, character recognition tasks are related; and these relations help us learn new tasks of the same type


Learning Theoretic View: Environments of Tasks

• Each task $t$ can be viewed as a probability measure on $\mathbb{R}^d \times \mathbb{R}$ (inputs $x$ and outputs $y$), e.g. $y = f_t(x) + \text{noise}$

• Define an environment as a probability measure on a set of learning tasks [Baxter, 1996], e.g. favouring tasks related in some sense

• To generate a task-specific sample from the environment:

– draw a task $t$ from the environment

– generate a sample $\{(x_{t1}, y_{t1}), \dots, (x_{tm}, y_{tm})\} \in (\mathbb{R}^d \times \mathbb{R})^m$ using task $t$


Learning Theoretic View (contd.)

• Generalisation error bounds from [Baxter, 1996] indicate that

– as the number of tasks n increases, a smaller sample size m per task is required (with high probability)

– having learned n tasks, the error of learning a novel task from the environment, using the knowledge about the n tasks, is bounded (with high probability)

• Other works give error bounds under more specific assumptions


Learning Paradigm

• Tasks $t = 1, \dots, n$

• We are given m examples per task: $(x_{t1}, y_{t1}), \dots, (x_{tm}, y_{tm}) \in \mathbb{R}^d \times \mathbb{R}$ (the framework allows for different sample sizes per task without any substantial change)

• The goal is to learn n functions $f_1, \dots, f_n$ and, for each task t, predict on input x by computing $f_t(x)$

• A subsequent goal is to exploit what we learned about $f_1, \dots, f_n$ in order to learn a novel function $f_{t'}$ from a new sample $(x_{t'1}, y_{t'1}), \dots, (x_{t'm}, y_{t'm})$


Learning a Common Kernel

• Idea: assume that related tasks share the same kernel

• Assume that we use a common feature map for all tasks:

$$\phi(x) = Rx, \qquad R \in \mathbb{R}^{d \times d}$$

• This corresponds to a common linear kernel

$$K(x, x') = x^\top D x', \qquad \text{where } D := R^\top R$$
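As a quick illustration (not part of the original slides), the following minimal numpy sketch checks that the feature map $\phi(x) = Rx$ induces exactly this kernel; the dimension and the random $R$ are arbitrary choices.

```python
import numpy as np

# Minimal sketch: the feature map phi(x) = R x induces the linear kernel
# K(x, x') = x^T D x' with D = R^T R.
rng = np.random.default_rng(0)
d = 5
R = rng.standard_normal((d, d))            # common feature map (assumed invertible)
D = R.T @ R                                # induced kernel parameter

x, x_prime = rng.standard_normal(d), rng.standard_normal(d)
k_via_features = (R @ x) @ (R @ x_prime)   # <phi(x), phi(x')>
k_via_kernel = x @ D @ x_prime             # x^T D x'
assert np.isclose(k_via_features, k_via_kernel)
```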


Learning a Common Kernel (contd.)

• For each task t, we solve the regularisation problem

$$\min_{z_t \in \mathbb{R}^d} \; \sum_{i=1}^m E\big(z_t^\top R x_{ti},\, y_{ti}\big) + \gamma \|z_t\|^2$$

• This is equivalent to solving the joint problem

$$\min_{z_1, \dots, z_n \in \mathbb{R}^d} \; \sum_{t=1}^n \sum_{i=1}^m E\big(z_t^\top R x_{ti},\, y_{ti}\big) + \gamma \sum_{t=1}^n \|z_t\|^2$$

(why? the joint objective is a sum of per-task terms with no shared variables, so it decouples across the tasks)


Learning a Common Kernel (contd.)

• Using the change of variables $w_t = R^\top z_t$ and assuming R is invertible, we obtain the problem

$$\min_{w_1, \dots, w_n \in \mathbb{R}^d} \; \sum_{t=1}^n \sum_{i=1}^m E\big(w_t^\top x_{ti},\, y_{ti}\big) + \gamma \sum_{t=1}^n w_t^\top D^{-1} w_t$$

• This gives a function $f_t$ for each task t, assuming a given linear kernel in common

• Why not learn this kernel?
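A small numerical check (my own addition, assuming an invertible $R$) of the change of variables used above: with $w_t = R^\top z_t$ and $D = R^\top R$, the penalty $\|z_t\|^2$ equals $w_t^\top D^{-1} w_t$.

```python
import numpy as np

# Check that ||z_t||^2 = w_t^T D^{-1} w_t under w_t = R^T z_t, D = R^T R.
rng = np.random.default_rng(1)
d = 6
R = rng.standard_normal((d, d))   # generically invertible
D = R.T @ R
z_t = rng.standard_normal(d)
w_t = R.T @ z_t
assert np.isclose(z_t @ z_t, w_t @ np.linalg.solve(D, w_t))
```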


Learning a Common Kernel (contd.)

$$\inf_{D \succ 0,\; \mathrm{tr}(D) \le 1} \;\; \min_{w_1, \dots, w_n \in \mathbb{R}^d} \; \sum_{t=1}^n \sum_{i=1}^m E\big(w_t^\top x_{ti},\, y_{ti}\big) + \gamma \sum_{t=1}^n w_t^\top D^{-1} w_t \qquad (1)$$

• Here, the convex set of kernels is generated by infinitely many basic kernels

$$\mathcal{K} = \{K(x, x') = x^\top D x' : D \succ 0,\; \mathrm{tr}(D) \le 1\}$$

• Note: there is an inf since the set $\mathcal{K}$ is open


Learning a Common Kernel (contd.)

• Why do we bound the kernel ($\mathrm{tr}(D) \le 1$)?

Normalisation: if $D \to \infty$, the regulariser approaches zero and we would overfit

• If we use a convex loss function E, the problem is convex: the function $(w_t, D) \mapsto w_t^\top D^{-1} w_t$ is jointly convex over $D \succ 0$, as we saw in the last lecture

• The functions learned are

$$f_t(x) = z_t^\top R x = w_t^\top x$$


Transfer Learning

• The linear kernel learned can be transferred to new tasks

• Suppose we have solved (1) and found D

• Given a sample for a new task, $(x_{t'1}, y_{t'1}), \dots, (x_{t'm}, y_{t'm})$, we solve the problem

$$\min_{w \in \mathbb{R}^d} \; \sum_{i=1}^m E\big(w^\top x_{t'i},\, y_{t'i}\big) + \gamma\, w^\top D^{-1} w$$

and obtain the function $f_{t'}$
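For the square loss $E(a, y) = (a - y)^2$ this transfer step has a closed form, $w = (X^\top X + \gamma D^{-1})^{-1} X^\top y$. The sketch below is my own illustration; the toy $D$ is just a random trace-one PSD matrix standing in for a learned one.

```python
import numpy as np

def transfer_ridge(X_new, y_new, D, gamma):
    """Transfer step for the square loss: minimise
    sum_i (w^T x_i - y_i)^2 + gamma * w^T D^{-1} w,
    whose minimiser is w = (X^T X + gamma * D^{-1})^{-1} X^T y."""
    A = X_new.T @ X_new + gamma * np.linalg.inv(D)
    return np.linalg.solve(A, X_new.T @ y_new)

# Toy usage with a stand-in D (random PSD matrix with unit trace).
rng = np.random.default_rng(2)
m, d = 20, 5
B = rng.standard_normal((d, d))
D = B @ B.T / np.trace(B @ B.T)
X_new = rng.standard_normal((m, d))
y_new = X_new @ rng.standard_normal(d) + 0.1 * rng.standard_normal(m)
w_new = transfer_ridge(X_new, y_new, D, gamma=0.1)
```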


Optimality Conditions

• Let us fix $w_1, \dots, w_n$ and minimise (1) w.r.t. D:

$$\inf_{D \succ 0,\; \mathrm{tr}(D) \le 1} \; \sum_{t=1}^n w_t^\top D^{-1} w_t = \sum_{t=1}^n \mathrm{tr}\big(D^{-1} w_t w_t^\top\big) = \mathrm{tr}\Big(D^{-1} \sum_{t=1}^n w_t w_t^\top\Big) = \mathrm{tr}\big(D^{-1} W W^\top\big)$$

where $W = \big[\, w_1 \;\cdots\; w_n \,\big]$

• Clearly, this expression is smallest when $\mathrm{tr}(D) = 1$


Optimality Conditions (contd.)

• So, we can minimise the Lagrangian

$$\mathrm{tr}\big(D^{-1} W W^\top\big) + \alpha\big(\mathrm{tr}(D) - 1\big)$$

Setting the derivative to zero, we get

$$-D^{-1} W W^\top D^{-1} + \alpha I_d = 0 \;\;\overset{\mathrm{tr}(D)=1}{\Longrightarrow}\;\; D = \frac{(W W^\top)^{\frac{1}{2}}}{\mathrm{tr}\big((W W^\top)^{\frac{1}{2}}\big)}$$

Note 1: the stationary point is a minimiser since the Lagrangian is convex for $D \succ 0$

Note 2: when W is rank-deficient, the above expression is the limit of a minimising sequence
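The D-update above is easy to compute from an SVD of $W$. Below is a small numpy sketch (my own addition); the `eps` floor is an assumption used to handle the rank-deficient case mentioned in Note 2.

```python
import numpy as np

def optimal_D(W, eps=1e-8):
    """D = (W W^T)^{1/2} / tr((W W^T)^{1/2}), computed via an SVD of W.
    eps floors the singular values so D stays invertible when W is rank-deficient
    (there the exact formula is only the limit of a minimising sequence)."""
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    sqrt_WWt = (U * np.maximum(s, eps)) @ U.T    # (W W^T)^{1/2} = U diag(s) U^T
    return sqrt_WWt / np.trace(sqrt_WWt)
```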


Optimality Conditions (contd.)

• Thus, the optimal W, D satisfy

$$D = \frac{(W W^\top)^{\frac{1}{2}}}{\mathrm{tr}\big((W W^\top)^{\frac{1}{2}}\big)} \qquad (2)$$

$$w_t = \operatorname*{argmin}_{w \in \mathbb{R}^d} \; \sum_{i=1}^m E\big(w^\top x_{ti},\, y_{ti}\big) + \gamma\, w^\top D^{-1} w$$

• We use these conditions to obtain an alternating algorithm


Alternating Minimization Algorithm

• Alternating minimization over W and D

Initialisation: set $D = \frac{1}{d} I_d$

while convergence condition is not true do

    for $t = 1, \dots, n$: learn each $w_t$ independently by minimising

    $$\sum_{i=1}^m E\big(w^\top x_{ti},\, y_{ti}\big) + \gamma\, w^\top D^{-1} w$$

    using the Gram matrix $\big(x_{ti}^\top D x_{tj}\big)_{i,j=1}^m$

    end for

    set $D = \dfrac{(W W^\top)^{\frac{1}{2}}}{\mathrm{tr}\big((W W^\top)^{\frac{1}{2}}\big)}$

end while
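A runnable sketch of this scheme for the square loss (my own illustration, not code from the lecture): each $w_t$ step is then a generalised ridge regression in closed form, and the D step uses the SVD formula from (2); `eps` is an added smoothing constant playing the role of the perturbation mentioned on the next slide.

```python
import numpy as np

def alternating_mtl(X, Y, gamma=0.1, n_iter=50, eps=1e-6):
    """Alternating minimisation sketch for the square loss.
    X: (n_tasks, m, d) inputs, Y: (n_tasks, m) outputs."""
    n, m, d = X.shape
    D = np.eye(d) / d                               # feasible start: tr(D) = 1
    W = np.zeros((d, n))
    for _ in range(n_iter):
        D_inv = np.linalg.inv(D)
        # w-step: each task is an independent generalised ridge regression
        for t in range(n):
            A = X[t].T @ X[t] + gamma * D_inv
            W[:, t] = np.linalg.solve(A, X[t].T @ Y[t])
        # D-step: D = (W W^T)^{1/2} / tr((W W^T)^{1/2}), via an SVD of W
        U, s, _ = np.linalg.svd(W, full_matrices=False)
        sqrt_WWt = (U * np.maximum(s, eps)) @ U.T
        D = sqrt_WWt / np.trace(sqrt_WWt)
    return W, D

# Toy usage: 10 tasks whose true weight vectors share a 2-dimensional subspace.
rng = np.random.default_rng(0)
n, m, d = 10, 30, 8
W_true = rng.standard_normal((d, 2)) @ rng.standard_normal((2, n))
X = rng.standard_normal((n, m, d))
Y = np.stack([X[t] @ W_true[:, t] + 0.1 * rng.standard_normal(m) for t in range(n)])
W_hat, D_hat = alternating_mtl(X, Y)
```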


Alternating Minimization (contd.)

• Each $w_t$ step is a regularisation problem (e.g. SVM, ridge regression, etc.)

• Each D step requires an SVD of the matrix W

• The algorithm (with some perturbation) can be shown to converge to the optimal solution

• This fact is independent of the starting value of D


Regularisation with the Trace Norm

• What if we substitute for D in (1) using (2)?

• The regulariser equals

$$\mathrm{tr}\big(D^{-1} W W^\top\big) = \Big(\mathrm{tr}\big((W W^\top)^{\frac{1}{2}}\big)\Big)^2 = \Big(\mathrm{tr}\big((U \Sigma^2 U^\top)^{\frac{1}{2}}\big)\Big)^2 = \big(\mathrm{tr}(U \Sigma U^\top)\big)^2 = \Big(\sum_{i=1}^r \sigma_i\Big)^2$$

where $W = U \Sigma V^\top$ is an SVD, r is the rank of W and $\sigma_1, \dots, \sigma_r$ are its singular values
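A quick numerical confirmation of this identity (my own addition): with $D$ set by (2), the regulariser $\mathrm{tr}(D^{-1} W W^\top)$ equals the squared sum of the singular values of $W$.

```python
import numpy as np

# Check: with D = (W W^T)^{1/2} / tr((W W^T)^{1/2}),
# tr(D^{-1} W W^T) = (sum of singular values of W)^2.
rng = np.random.default_rng(3)
d, n = 4, 6                                  # d <= n so W W^T is full rank
W = rng.standard_normal((d, n))
U, s, _ = np.linalg.svd(W, full_matrices=False)
sqrt_WWt = (U * s) @ U.T
D = sqrt_WWt / np.trace(sqrt_WWt)
lhs = np.trace(np.linalg.solve(D, W @ W.T))  # tr(D^{-1} W W^T)
rhs = np.sum(s) ** 2
assert np.isclose(lhs, rhs)
```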


Regularisation with the Trace Norm (contd.)

• The sum of the singular values of a matrix W can be shown to be a norm and is called the trace norm (or nuclear norm) of W

$$\|W\|_{\mathrm{tr}} = \sum_{i=1}^r \sigma_i$$

• Thus, problem (1) is equivalent to

$$\min_{w_1, \dots, w_n \in \mathbb{R}^d} \; \sum_{t=1}^n \sum_{i=1}^m E\big(w_t^\top x_{ti},\, y_{ti}\big) + \gamma \|W\|_{\mathrm{tr}}^2 \qquad (3)$$


Learning Multiple Tasks on a Subspace

• Regularising with the trace norm in (3) tends¹ to favour low-rank matrices W

• This means that, in many cases, the vectors $w_1, \dots, w_n$ are likely to lie on a low-dimensional subspace of $\mathbb{R}^d$

¹ Under conditions. This is an active topic of research, see e.g. [Candes and Recht, 2008]


Learning Multiple Tasks on a Subspace (contd.)

• The effect is the matrix analogue of the Lasso; the trace norm is an $L_1$ norm on the singular values

• The trace norm and its square are non-differentiable functions (hence, it is not very efficient to optimise (3) with gradient descent)

[Figure: regularised objective vs. number of iterations for the alternating algorithm and for gradient descent with step sizes η = 0.01, 0.03, 0.05 (left); running time in seconds vs. number of tasks for the alternating algorithm and gradient descent with η = 0.05 (right)]


SVM Case

• If E is the hinge loss, problem (1) becomes

$$\inf_{D \succ 0,\; \mathrm{tr}(D) \le 1} \;\; \min_{\substack{w_1, \dots, w_n \in \mathbb{R}^d \\ \xi_{ti} \in \mathbb{R}}} \; \sum_{t=1}^n \sum_{i=1}^m \xi_{ti} + \gamma \sum_{t=1}^n w_t^\top D^{-1} w_t$$

$$\text{s.t.} \quad \xi_{ti} \ge 1 - (w_t^\top x_{ti})\, y_{ti}, \quad \xi_{ti} \ge 0 \quad \forall\, t, i$$

or

$$\inf_{D \succ 0,\; \mathrm{tr}(D) \le 1} \;\; \min_{\substack{w_1, \dots, w_n \in \mathbb{R}^d \\ \xi_{ti} \in \mathbb{R},\, p_t \in \mathbb{R}}} \; \sum_{t=1}^n \sum_{i=1}^m \xi_{ti} + \gamma \sum_{t=1}^n p_t$$

$$\text{s.t.} \quad w_t^\top D^{-1} w_t \le p_t \quad \forall\, t$$

$$\qquad\; \xi_{ti} \ge 1 - (w_t^\top x_{ti})\, y_{ti}, \quad \xi_{ti} \ge 0 \quad \forall\, t, i$$


SVM Case (contd.)

• Use Schur’s complement lemma: if $C \succ 0$ then

$$A - B^\top C^{-1} B \succeq 0 \;\Longleftrightarrow\; \begin{pmatrix} A & B^\top \\ B & C \end{pmatrix} \succeq 0$$

and the problem becomes

$$\inf_{D \succ 0,\; \mathrm{tr}(D) \le 1} \;\; \min_{\substack{w_1, \dots, w_n \in \mathbb{R}^d \\ \xi_{ti} \in \mathbb{R},\, p_t \in \mathbb{R}}} \; \sum_{t=1}^n \sum_{i=1}^m \xi_{ti} + \gamma \sum_{t=1}^n p_t$$

$$\text{s.t.} \quad \begin{pmatrix} p_t & w_t^\top \\ w_t & D \end{pmatrix} \succeq 0 \quad \forall\, t$$

$$\qquad\; \xi_{ti} \ge 1 - (w_t^\top x_{ti})\, y_{ti}, \quad \xi_{ti} \ge 0 \quad \forall\, t, i$$

• This is an SDP; similarly, multiple ridge regressions lead to an SDP
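For concreteness, here is a hedged sketch (not from the lecture) of this SDP written with the cvxpy modelling library and the SCS solver; the function name, the array layout and the toy data are my own choices, and the formulation only makes sense at small scale, in line with the scalability remark on the later Algorithms slide.

```python
import cvxpy as cp
import numpy as np

def mtl_svm_sdp(X, Y, gamma=0.1):
    """SDP sketch of the hinge-loss problem above.
    X: (n_tasks, m, d) inputs, Y: (n_tasks, m) labels in {-1, +1}."""
    n, m, d = X.shape
    D = cp.Variable((d, d), symmetric=True)
    W = cp.Variable((d, n))
    xi = cp.Variable((n, m), nonneg=True)
    p = cp.Variable(n)
    constraints = [D >> 0, cp.trace(D) <= 1]
    for t in range(n):
        w_t = cp.reshape(W[:, t], (d, 1))
        # Schur-complement form of the constraint w_t^T D^{-1} w_t <= p_t
        M = cp.bmat([[cp.reshape(p[t], (1, 1)), w_t.T],
                     [w_t, D]])
        constraints += [M >> 0,
                        xi[t, :] >= 1 - cp.multiply(Y[t], X[t] @ W[:, t])]
    prob = cp.Problem(cp.Minimize(cp.sum(xi) + gamma * cp.sum(p)), constraints)
    prob.solve(solver=cp.SCS)
    return W.value, D.value

# Tiny toy usage (generic SDP solvers do not scale to large problems).
rng = np.random.default_rng(4)
n, m, d = 3, 10, 4
X = rng.standard_normal((n, m, d))
Y = np.sign(X @ rng.standard_normal(d))
W_hat, D_hat = mtl_svm_sdp(X, Y)
```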


Kernelisation / Representer Theorem

• It can be shown that any solution satisfies a multi-task representer theorem

$$w_t = \sum_{s=1}^n \sum_{i=1}^m c^{(t)}_{si}\, x_{si} \qquad \text{for all } t = 1, \dots, n$$

where the $c^{(t)}_{si}$ are real numbers

• Consequently, a nonlinear feature map can be used in place of x and we can obtain an equivalent problem involving only the Gram matrix

• All the tasks are involved in the above expression and in the Gram matrix (unlike the single-task representer theorem)


Algorithms

• One approach is to use standard solvers (interior-point methods) for the SDP; these cannot scale beyond roughly $10^3$ variables and constraints

• Another approach is the alternating algorithm or related approaches [Ando & Zhang, 2005, Cai et al., 2008]; empirically, the number of iterations is observed to be small; however, an SVD is required at every iteration

• A gradient descent approach can be used (e.g. by smoothing the trace norm); as we have seen, it requires many more iterations than the alternating approach; also, computing the gradient w.r.t. the matrix can be costly


Experiment (Marketing Survey)

• Consumers’ ratings of products [Lenk et al. 1996]

• 180 persons (tasks)

• 8 PC models (training examples); 4 PC models (test examples)

• 13 binary input variables (RAM, CPU, price etc.) + bias term

• Integer output in {0, . . . , 10} (likelihood of purchase)

• The square loss was used


Experiment (Marketing Survey)

[Figure: test error vs. number of tasks (left); eigenvalues of D, Eig(D) (right)]

• Performance improves with more tasks (for learning the tasks independently, the error is 16.53)

• A single most important feature shared by all persons


Experiment (Marketing Survey)

[Figure: components of the most important feature $u_1$ across the input variables TE, RAM, SC, CPU, HD, CD, CA, CO, AV, WA, SW, GU, PR]

Method                              RMSE
Alternating                         1.93
Hierarchical Bayes [Lenk et al.]    1.90

• The most important feature weighs technical characteristics (RAM, CPU, CD-ROM) vs. price


Other Multi-Task Learning Approaches

• Other regularisation / kernel-based approaches [Evgeniou, Micchelli, Pontil 2005]

• E.g. we could regularise with $\|w_t - w_0\|^2$, that is, the task functions should be close to a mean (which is also learned); a minimal sketch of this idea follows below

• Another regulariser could be $A_{st}\, \|w_s - w_t\|^2$ (summed over pairs of tasks), where A is a matrix whose (s, t) entry expresses how related tasks s and t are
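The following is a minimal sketch (my own illustration, square loss, names hypothetical) of the first idea: each task's weight vector is pulled towards a shared mean $w_0$, and $w_0$ itself is re-estimated as the average of the task vectors.

```python
import numpy as np

def mean_regularised_mtl(X, Y, gamma=1.0, n_iter=20):
    """Regularise each task towards a learned shared mean w0 (square loss).
    X: (n_tasks, m, d) inputs, Y: (n_tasks, m) outputs."""
    n, m, d = X.shape
    w0 = np.zeros(d)
    W = np.zeros((n, d))
    for _ in range(n_iter):
        for t in range(n):
            # ridge regression pulled towards the current mean w0
            A = X[t].T @ X[t] + gamma * np.eye(d)
            W[t] = np.linalg.solve(A, X[t].T @ Y[t] + gamma * w0)
        w0 = W.mean(axis=0)   # optimal shared mean given the current w_t
    return W, w0
```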


Neural Network Approaches

• E.g. [Baxter 1996, Caruana 1997, Silver & Mercer 1996]

• Learn a small number of common features jointly for all the tasks; use a hidden layer with few nodes and a set of network weights shared by all the tasks
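A minimal PyTorch sketch of such an architecture (my own illustration, not from the cited works): one small hidden layer whose weights are shared across tasks, with a separate linear output head per task.

```python
import torch
import torch.nn as nn

class SharedFeatureNet(nn.Module):
    """A small hidden layer shared by all tasks, plus one output head per task."""
    def __init__(self, d, n_tasks, n_hidden=4):
        super().__init__()
        self.shared = nn.Linear(d, n_hidden)   # weights shared by all the tasks
        self.heads = nn.ModuleList(nn.Linear(n_hidden, 1) for _ in range(n_tasks))

    def forward(self, x, task):
        # common features followed by the task-specific output
        return self.heads[task](torch.tanh(self.shared(x)))
```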


Bayesian Approaches

• Hierarchical Bayes [Bakker & Heskes 2003, Xue et al. 2007, Yu et al. 2005, Zhang et al. 2006, etc.]

• Idea: enforce task relatedness through a common prior probability distribution on the tasks’ parameters

• The prior is learned as part of the training process

• Some approaches use Dirichlet processes, Gaussian processes, ICA etc.


Related Work in Statistics and Machine Learning

• Multilevel modeling [Goldstein, 1991]

• Reduced rank regression [Izenman, 1975]

• Other: estimating gradients within multi-task regularisation [Guinney et al., 2007]; co-regularisation for semi-supervised learning [Rosenberg & Bartlett 2007]; task clustering with nearest neighbours [Thrun & O’Sullivan 1996]; group Lasso [Bakin 1999, Obozinski et al. 2006, Yuan & Lin 2006], which can be applied to MTL; etc.


References

[R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR 2005]

[A. Argyriou, T. Evgeniou and M. Pontil. Multi-task feature learning. NIPS 2006]

[B. Bakker and T. Heskes. Task clustering and gating for Bayesian multi-task learning. JMLR 2003]

[J. Baxter. A model for inductive bias learning. JAIR 2000]

[J. F. Cai, E. J. Candes and Z. Shen. A Singular Value Thresholding Algorithm for Matrix Completion. 2008]

[R. Caruana. Multi-task learning. JMLR 1997]

[T. Evgeniou, C. A. Micchelli and M. Pontil. Learning multiple tasks with kernel methods. JMLR 2005]


[A. Maurer. Bounds for linear multi-task learning. JMLR 2006]

[R. Raina, A. Y. Ng and D. Koller. Constructing informative priors using transfer learning. ICML 2006]

[N. Srebro, J. D. M. Rennie and T. S. Jaakkola. Maximum-margin matrix factorization. NIPS 2004]

[Y. Xue, X. Liao, L. Carin and B. Krishnapuram. Multi-task learning for classification with Dirichlet process priors. JMLR 2007]

[K. Yu, V. Tresp and A. Schwaighofer. Learning Gaussian processes from multiple tasks. ICML 2005]

[M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. Roy. Stat. Soc. 2006]

[J. Zhang, Z. Ghahramani and Y. Yang. Learning multiple related tasks using latent independent component analysis. NIPS 2006]
