Advanced Topics in Machine Learning
(Part II)
5. Multi-Task Learning
February 13, 2009
Andreas Argyriou
Today’s Plan
• What multi-task learning is
• Regularisation methods for multi-task learning
• Learning multiple tasks on a subspace & an alternating algorithm
• Other multi-task learning approaches
Supervised Tasks
• Recall the notion of a supervised regression or classification task
• Given a set of input/output pairs (training set), we wish to compute the functional relationship between the input and the output

x −−−f−−→ y
Multiple Supervised Tasks
• What if we have multiple supervised tasks?
x −−−f1−−→ y
...
x −−−fn−−→ y

• Assuming that there are relations among the n tasks, is learning them together better than learning each of them separately?
Example (Marketing Survey)
• There are 180 persons; there are 8 computer models; each model is represented as a vector x; each person rates all the models on a {0, . . . , 10} scale (likelihood of purchase) [Lenk et al. 1996]
• Each person corresponds to a task: we wish to learn a “decision function” ft for each person t
• But the ways different persons make decisions about products arerelated
• Can we exploit the fact that the tasks are related?
• Can we say anything about the preferences of a new person?
Example (Collaborative Filtering)
• E.g. the Netflix database; there are ratings of M movies by n users; each user has rated only a small set of movies
• So, the users/movies matrix is only partially observed
• Can we fill in the remaining entries of this matrix, i.e. can we recommend to a user a movie he/she would like to watch?
• Similar to the previous example, but now there is only partial information; in some data sets, we may even know nothing about the movies
• The tasks are again related
Example (Computer Vision)
• From [Torralba et al., 2004]; detection of multiple object classes in cluttered scenes
• Detection of each object corresponds to a classification task
• The input data here are the images; note that the input data are shared by all the tasks
Example (Computer Vision)
• The assumption made is that human vision uses simple features for detecting a large number of different objects; so there are relations among object detection tasks
• Another example is character recognition
• Humans learn to recognise characters at an early age; but then if they see a new character (e.g. the euro currency symbol) they only need one or two training examples
• Thus, character recognition tasks are related; and these relations help us learn new tasks of the same type
Learning Theoretic View: Environments of Tasks
• Each task t can be viewed as a probability measure on IR^d × IR (inputs x and outputs y), e.g. y = f_t(x) + noise
• Define an environment as a probability measure on a set of learning tasks [Baxter, 1996], e.g. favouring tasks related in some sense
• To sample a task-specific sample from the environment:
– draw a task t from the environment
– generate a sample {(x_t1, y_t1), . . . , (x_tm, y_tm)} ∈ (IR^d × IR)^m using task t
Learning Theoretic View (contd.)
• Generalisation error bounds from [Baxter, 1996] indicate that
– as the number of tasks n increases, a smaller sample size m per task is required (with high probability)
– having learned n tasks, the error of learning a novel task from the environment, using the knowledge about the n tasks, is bounded (with high probability)
• Other works give error bounds under more specific assumptions
Learning Paradigm
• Tasks t = 1, . . . , n
• We are given m examples per task: (x_t1, y_t1), . . . , (x_tm, y_tm) ∈ IR^d × IR (the framework allows for different sample sizes per task without any substantial change)
• The goal is to learn n functions f_1, . . . , f_n and for each task t predict on input x by computing f_t(x)
• A subsequent goal is to exploit what we learned about f_1, . . . , f_n in order to learn a novel function f_t′ from a new sample (x_t′1, y_t′1), . . . , (x_t′m, y_t′m)
Learning a Common Kernel
• Idea: assume that related tasks share the same kernel
• Assume that we use a common feature map
φ(x) = Rx   where R ∈ IR^{d×d}

for all tasks
• This corresponds to a common linear kernel

K(x, x′) = x^T D x′   where D := R^T R
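As a quick sanity check (a minimal numpy sketch of my own, not from the lecture), the linear kernel K(x, x′) = x^T D x′ with D := R^T R is exactly the inner product of the mapped features Rx and Rx′:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
R = rng.standard_normal((d, d))   # common feature map phi(x) = R x
D = R.T @ R                       # induced linear kernel matrix

x, xp = rng.standard_normal(d), rng.standard_normal(d)

# K(x, x') = x^T D x' coincides with <phi(x), phi(x')>
assert np.isclose(x @ D @ xp, (R @ x) @ (R @ xp))
```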
Learning a Common Kernel (contd.)
• For each task t, we solve the regularisation problem
min_{z_t ∈ IR^d}  Σ_{i=1}^m E(z_t^T R x_ti, y_ti) + γ ‖z_t‖²

• This is equivalent to solving the joint problem

min_{z_1,...,z_n ∈ IR^d}  Σ_{t=1}^n Σ_{i=1}^m E(z_t^T R x_ti, y_ti) + γ Σ_{t=1}^n ‖z_t‖²

(why?)
Learning a Common Kernel (contd.)
• Using the variable change w_t = R^T z_t and assuming R is invertible, we obtain the problem

min_{w_1,...,w_n ∈ IR^d}  Σ_{t=1}^n Σ_{i=1}^m E(w_t^T x_ti, y_ti) + γ Σ_{t=1}^n w_t^T D^{-1} w_t

• This gives a function f_t for each task t, assuming a given linear kernel in common
• Why not learn this kernel?
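The variable change can be checked numerically: with w = R^T z and D = R^T R for an invertible R, the penalty w^T D^{-1} w recovers ‖z‖². A small numpy sketch (R is a random matrix of my choosing, shifted to be safely invertible):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
R = rng.standard_normal((d, d)) + 3 * np.eye(d)  # (almost surely) invertible
D = R.T @ R
z = rng.standard_normal(d)

w = R.T @ z                          # variable change w_t = R^T z_t
penalty = w @ np.linalg.solve(D, w)  # w^T D^{-1} w

assert np.isclose(penalty, z @ z)    # equals ||z||^2
```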
Learning a Common Kernel (contd.)
inf_{D ≻ 0, tr(D) ≤ 1}  min_{w_1,...,w_n ∈ IR^d}  Σ_{t=1}^n Σ_{i=1}^m E(w_t^T x_ti, y_ti) + γ Σ_{t=1}^n w_t^T D^{-1} w_t     (1)

• Here, the convex set of kernels is generated by infinitely many basic kernels

K = {K(x, x′) = x^T D x′ : D ≻ 0, tr(D) ≤ 1}

• Note: there is an inf since the set K is open
Learning a Common Kernel (contd.)
• Why do we bound the kernel (tr(D) ≤ 1)?
Normalisation issues; if D → ∞, the regulariser approaches zero and we would overfit
• If we use a convex loss function E, the problem is convex: the function (w_t, D) ↦ w_t^T D^{-1} w_t is convex over D ≻ 0, as we saw during the last lecture
• The functions learned are

f_t(x) = z_t^T R x = w_t^T x
Transfer Learning
• The linear kernel learned can be transferred to new tasks
• Suppose we have solved (1) and found D
• Given a sample for a new task, (x_t′1, y_t′1), . . . , (x_t′m, y_t′m), we solve the problem

min_{w ∈ IR^d}  Σ_{i=1}^m E(w^T x_t′i, y_t′i) + γ w^T D^{-1} w

and obtain the function f_t′
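With the square loss, this transfer step has a closed form: setting the gradient to zero gives w = (X′^T X′ + γ D^{-1})^{-1} X′^T y′. A minimal numpy sketch of my own (the function name and the random feasible D are illustrative assumptions, not from the lecture):

```python
import numpy as np

def transfer_ridge(X_new, y_new, D, gamma=0.1):
    """Solve min_w sum_i (w^T x_i - y_i)^2 + gamma * w^T D^{-1} w
    for a new task, reusing the kernel matrix D learned on old tasks."""
    D_inv = np.linalg.inv(D)
    return np.linalg.solve(X_new.T @ X_new + gamma * D_inv,
                           X_new.T @ y_new)

# Example with a random positive definite D, normalised so tr(D) = 1
rng = np.random.default_rng(2)
d, m = 4, 30
A = rng.standard_normal((d, d))
D = A @ A.T + np.eye(d)
D /= np.trace(D)

X_new = rng.standard_normal((m, d))
w_true = rng.standard_normal(d)
w = transfer_ridge(X_new, X_new @ w_true, D, gamma=0.01)
assert np.allclose(w, w_true, atol=0.1)   # noiseless data, mild shrinkage
```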
Optimality Conditions
• Let us fix w_1, . . . , w_n and minimise (1) wrt. D; the regulariser can be rewritten as

Σ_{t=1}^n w_t^T D^{-1} w_t = Σ_{t=1}^n tr(D^{-1} w_t w_t^T) = tr(D^{-1} Σ_{t=1}^n w_t w_t^T) = tr(D^{-1} W W^T)

where W = [w_1 . . . w_n], so we need to solve

inf_{D ≻ 0, tr(D) ≤ 1}  tr(D^{-1} W W^T)

• Clearly, this expression is smallest when tr(D) = 1
Optimality Conditions (contd.)
• So, we can minimise the Lagrangian

tr(D^{-1} W W^T) + α (tr(D) − 1)

Setting the derivative to zero, we get

−D^{-1} W W^T D^{-1} + α I_d = 0   ⟹ (using tr(D) = 1)   D = (W W^T)^{1/2} / tr((W W^T)^{1/2})

Note 1: the stationary point is a minimiser since the Lagrangian is convex for D ≻ 0
Note 2: when W is rank-deficient, the above expression is the limit of a minimising sequence
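The closed form for D can be sanity-checked numerically: it attains a lower value of tr(D^{-1} W W^T) than any other feasible D we sample. A numpy sketch of my own (the matrix square root is computed via an eigendecomposition; W is full row rank here so D^{-1} exists):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 5, 8
W = rng.standard_normal((d, n))   # columns w_1, ..., w_n; full rank a.s.
M = W @ W.T

# D* = (W W^T)^{1/2} / tr((W W^T)^{1/2})
eigval, eigvec = np.linalg.eigh(M)
sqrt_M = eigvec @ np.diag(np.sqrt(np.clip(eigval, 0, None))) @ eigvec.T
D_star = sqrt_M / np.trace(sqrt_M)

def objective(D):
    return np.trace(np.linalg.solve(D, M))   # tr(D^{-1} W W^T)

# Random feasible alternatives (D > 0, tr(D) = 1) never do better
for _ in range(100):
    A = rng.standard_normal((d, d))
    D = A @ A.T + 1e-3 * np.eye(d)
    D /= np.trace(D)
    assert objective(D_star) <= objective(D) + 1e-8
```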
Optimality Conditions (contd.)
• Thus, the optimal W, D satisfy

D = (W W^T)^{1/2} / tr((W W^T)^{1/2})     (2)

w_t = argmin_{w ∈ IR^d}  Σ_{i=1}^m E(w^T x_ti, y_ti) + γ w^T D^{-1} w

• We use these conditions to obtain an alternating algorithm
Alternating Minimization Algorithm
• Alternating minimization over W and D
Initialization: set D = I_d / d
while convergence condition is not true do
  for t = 1, . . . , n, learn each w_t independently by minimizing
    Σ_{i=1}^m E(w^T x_ti, y_ti) + γ w^T D^{-1} w
  using the Gram matrix (x_ti^T D x_tj)_{i,j=1}^m
  end for
  set D = (W W^T)^{1/2} / tr((W W^T)^{1/2})
end while
Alternating Minimization (contd.)
• Each w_t step is a regularization problem (e.g. SVM, ridge regression, etc.)
• Each D step requires an SVD of matrix W
• The algorithm (with some perturbation) can be shown to converge tothe optimal solution
• This fact is independent of the starting value of D
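The alternating scheme above can be sketched in a few lines of numpy for the square loss (an illustrative implementation of my own, not the lecture's code; the small eps perturbation keeps D invertible, in the spirit of the perturbed variant the convergence result refers to):

```python
import numpy as np

def multitask_alternating(X, Y, gamma=0.1, n_iter=50, eps=1e-6):
    """Alternating minimisation for problem (1) with the square loss.

    X: (n_tasks, m, d) inputs; Y: (n_tasks, m) outputs.
    """
    n, m, d = X.shape
    D = np.eye(d) / d                        # feasible start: tr(D) = 1
    W = np.zeros((d, n))
    for _ in range(n_iter):
        # w_t step: ridge regression with penalty gamma * w^T D^{-1} w,
        # i.e. w_t = (X_t^T X_t + gamma * D^{-1})^{-1} X_t^T y_t
        D_inv = np.linalg.inv(D + eps * np.eye(d))
        for t in range(n):
            W[:, t] = np.linalg.solve(X[t].T @ X[t] + gamma * D_inv,
                                      X[t].T @ Y[t])
        # D step: the closed form (2), via an eigendecomposition of W W^T
        eigval, eigvec = np.linalg.eigh(W @ W.T)
        sqrt_M = eigvec @ np.diag(np.sqrt(np.clip(eigval, 0, None))) @ eigvec.T
        D = sqrt_M / np.trace(sqrt_M)
    return W, D
```

On synthetic tasks that all share a single direction, D ends up concentrating almost all of its trace on one eigenvalue, reflecting the low-dimensional structure the method is designed to find.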
Regularisation with the Trace Norm
• What if we substitute for D in (1) using (2)?
• The regulariser equals

tr(D^{-1} W W^T) = (tr((W W^T)^{1/2}))² = (tr((U Σ² U^T)^{1/2}))² = (tr(U Σ U^T))² = (Σ_{i=1}^r σ_i)²

where W = U Σ V^T is an SVD, r is the rank of W and σ_1, . . . , σ_r its singular values
Regularisation with the Trace Norm (contd.)
• The sum of the singular values of a matrix W can be shown to be a norm and is called the trace norm (or nuclear norm) of W

‖W‖_tr = Σ_{i=1}^r σ_i

• Thus, problem (1) is equivalent to

min_{w_1,...,w_n ∈ IR^d}  Σ_{t=1}^n Σ_{i=1}^m E(w_t^T x_ti, y_ti) + γ ‖W‖²_tr     (3)
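The identity behind this equivalence, tr(D^{-1} W W^T) = ‖W‖²_tr at the optimal D from (2), is easy to verify numerically. A numpy sketch of my own (W is taken full row rank so D^{-1} exists):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 4, 7
W = rng.standard_normal((d, n))   # full row rank almost surely

# Trace norm: the sum of the singular values of W
trace_norm = np.linalg.svd(W, compute_uv=False).sum()

# Optimal D from (2), via an eigendecomposition of W W^T
M = W @ W.T
eigval, eigvec = np.linalg.eigh(M)
sqrt_M = eigvec @ np.diag(np.sqrt(np.clip(eigval, 0, None))) @ eigvec.T
D = sqrt_M / np.trace(sqrt_M)

# The regulariser at this D equals the squared trace norm
assert np.isclose(np.trace(np.linalg.solve(D, M)), trace_norm ** 2)
```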
Learning Multiple Tasks on a Subspace
• Regularising with the trace norm in (3) tends¹ to favour low-rank matrices W
• This means that, in many cases, the vectors w_1, . . . , w_n are likely to lie on a low-dimensional subspace of IR^d

¹ Under conditions. This is an active topic of research, see e.g. [Candes and Recht, 2008]
Learning Multiple Tasks on a Subspace (contd.)
• The effect is the matrix analogue of the Lasso; the trace norm is an L1 norm on the singular values
• The trace norm and its square are non-differentiable functions (hence, it is not very efficient to optimise (3) with gradient descent)
[Figure: left, regularisation error vs. number of iterations for gradient descent with step sizes η = 0.01, 0.03, 0.05 and for the alternating algorithm; right, running time (seconds) vs. number of tasks for the alternating algorithm and gradient descent with η = 0.05]
SVM Case
• If E is the hinge loss, problem (1) becomes

inf_{D ≻ 0, tr(D) ≤ 1}  min_{w_t ∈ IR^d, ξ_ti ∈ IR}  Σ_{t=1}^n Σ_{i=1}^m ξ_ti + γ Σ_{t=1}^n w_t^T D^{-1} w_t
s.t.  ξ_ti ≥ 1 − (w_t^T x_ti) y_ti,  ξ_ti ≥ 0   ∀ t, i

or

inf_{D ≻ 0, tr(D) ≤ 1}  min_{w_t ∈ IR^d, ξ_ti ∈ IR, p_t ∈ IR}  Σ_{t=1}^n Σ_{i=1}^m ξ_ti + γ Σ_{t=1}^n p_t
s.t.  w_t^T D^{-1} w_t ≤ p_t   ∀ t
      ξ_ti ≥ 1 − (w_t^T x_ti) y_ti,  ξ_ti ≥ 0   ∀ t, i
SVM Case (contd.)
• Use Schur’s complement lemma: if C ≻ 0 then

A − B^T C^{-1} B ⪰ 0   ⇔   [ A  B^T ; B  C ] ⪰ 0

and the problem becomes

inf_{D ≻ 0, tr(D) ≤ 1}  min_{w_t ∈ IR^d, ξ_ti ∈ IR, p_t ∈ IR}  Σ_{t=1}^n Σ_{i=1}^m ξ_ti + γ Σ_{t=1}^n p_t
s.t.  [ p_t  w_t^T ; w_t  D ] ⪰ 0   ∀ t
      ξ_ti ≥ 1 − (w_t^T x_ti) y_ti,  ξ_ti ≥ 0   ∀ t, i

• This is an SDP; similarly, multiple ridge regressions lead to an SDP
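Schur's complement lemma itself can be checked on random instances (a small numpy sketch of my own; the matrices are arbitrary random choices, with C forced positive definite as the lemma requires):

```python
import numpy as np

rng = np.random.default_rng(5)

def is_psd(M, tol=1e-8):
    return np.linalg.eigvalsh(M).min() >= -tol

for _ in range(50):
    p, q = 2, 3
    B = rng.standard_normal((q, p))
    C0 = rng.standard_normal((q, q))
    C = C0 @ C0.T + 0.1 * np.eye(q)                # C > 0
    A0 = rng.standard_normal((p, p))
    A = A0 @ A0.T - rng.uniform(0, 2) * np.eye(p)  # symmetric, mixed sign

    block = np.block([[A, B.T], [B, C]])
    schur = A - B.T @ np.linalg.solve(C, B)

    # The block matrix is PSD iff the Schur complement A - B^T C^{-1} B is
    assert is_psd(block) == is_psd(schur)
```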
Kernelisation / Representer Theorem
• It can be shown that any solution satisfies a multi-task representer theorem

w_t = Σ_{s=1}^n Σ_{i=1}^m c^{(t)}_{si} x_si   for all t = 1, . . . , n

where the c^{(t)}_{si} are real numbers

• Consequently, a nonlinear feature map can be used in place of x and we can obtain an equivalent problem involving only the Gram matrix
• All the tasks are involved in the above expression and in the Gram matrix (unlike the single-task representer theorem)
Algorithms
• One approach is to use standard solvers (interior-point methods) for the SDP; these cannot scale above roughly 10³ variables and constraints
• Another approach is the alternating algorithm or related approaches [Ando & Zhang, 2005, Cai et al., 2008]; empirically, the number of iterations is observed to be small; however, an SVD is required at every iteration
• A gradient descent approach can be used (e.g. by smoothing the trace norm); as we have seen, it requires many more iterations than the alternating approach; also, computing the gradient wrt. the matrix can be costly
Experiment (Marketing Survey)
• Consumers’ ratings of products [Lenk et al. 1996]
• 180 persons (tasks)
• 8 PC models (training examples); 4 PC models (test examples)
• 13 binary input variables (RAM, CPU, price etc.) + bias term
• Integer output in {0, . . . , 10} (likelihood of purchase)
• The square loss was used
Experiment (Marketing Survey)
[Figure: left, test error vs. number of tasks; right, the eigenvalues of D, with a single dominant eigenvalue]

• Performance improves with more tasks (for learning the tasks independently, error = 16.53)
• A single most important feature is shared by all persons
Experiment (Marketing Survey)
[Figure: components of the most important feature u1 across the input variables (TE, RAM, SC, CPU, HD, CD, CA, CO, AV, WA, SW, GU, PR)]

Method                              RMSE
Alternating                         1.93
Hierarchical Bayes [Lenk et al.]    1.90

• The most important feature weighs technical characteristics (RAM, CPU, CD-ROM) vs. price
Other Multi-Task Learning Approaches
• Other regularisation / kernel based approaches [Evgeniou, Micchelli,Pontil 2005]
• E.g. we could regularise with ‖w_t − w_0‖², that is, the task functions should be close to a mean function w_0 (which is also learned)
• Another regulariser could be Σ_{s,t} A_st ‖w_s − w_t‖², where A is a matrix whose (s, t) entry expresses how related tasks s and t are
Neural Network Approaches
• E.g. [Baxter 1996, Caruana 1997, Silver & Mercer 1996]
• Learn a small number of common features jointly for all the tasks; use a hidden layer with few nodes and a set of network weights shared by all the tasks
Bayesian Approaches
• Hierarchical Bayes [Bakker & Heskes 2003, Xue et al. 2007, Yu et al. 2005, Zhang et al. 2006, etc.]
• Idea: enforce task relatedness through a common prior probability distribution on the tasks’ parameters
• The prior is learned as part of the training process
• Some approaches use Dirichlet processes, Gaussian processes, ICA etc.
Related Work in Statistics and Machine Learning
• Multilevel modeling [Goldstein, 1991]
• Reduced rank regression [Izenman, 1975]
• Other: estimating gradients within multi-task regularisation [Guinney et al., 2007]; co-regularisation for semi-supervised learning [Rosenberg & Bartlett 2007]; task clustering with nearest neighbours [Thrun & O’Sullivan 1996]; the group Lasso [Bakin 1999, Obozinski et al. 2006, Yuan & Lin 2006], which can be applied to MTL; etc.
References
[R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR 2005]
[A. Argyriou, T. Evgeniou and M. Pontil. Multi-task feature learning. NIPS 2006]
[B. Bakker and T. Heskes. Task clustering and gating for Bayesian multi-task learning. JMLR 2003]
[J. Baxter. A model for inductive bias learning. JAIR 2000]
[J. F. Cai, E. J. Candes and Z. Shen. A singular value thresholding algorithm for matrix completion. 2008]
[R. Caruana. Multi-task learning. Machine Learning 1997]
[T. Evgeniou, C. A. Micchelli and M. Pontil. Learning multiple tasks with kernel methods. JMLR 2005]
References
[A. Maurer. Bounds for linear multi-task learning. JMLR 2006]
[R. Raina, A. Y. Ng and D. Koller. Constructing informative priors using transfer learning. ICML 2006]
[N. Srebro, J. D. M. Rennie and T. S. Jaakkola. Maximum-margin matrix factorization. NIPS 2004]
[Y. Xue, X. Liao, L. Carin and B. Krishnapuram. Multi-task learning for classification with Dirichlet process priors. JMLR 2007]
[K. Yu, V. Tresp and A. Schwaighofer. Learning Gaussian processes from multiple tasks. ICML 2005]
[M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. Roy. Stat. Soc. 2006]
[J. Zhang, Z. Ghahramani and Y. Yang. Learning multiple related tasks using latent independent component analysis. NIPS 2006]