chameleon: learning model initialization across tasks with ... · due to the rapid growth of deep...

9
Chameleon: Learning Model Initializations Across Tasks With Different Schemas Lukas Brinkmeyer * , Rafael Rego Drumond * , Randolf Scholz, Josif Grabocka, Lars Schmidt-Thieme Information Systems and Machine-Learning Lab - University of Hildesheim Samelsonplatz 1 Hildesheim - 31141 - Germany {brinkmeyer,radrumond,scholz,josif,schmidt-thieme}@ismll.uni-hildesheim.de Abstract Parametric models, and particularly neural networks, require weight initialization as a starting point for gradient-based optimization. In most current practices, this is accomplished by using some form of random initialization. Instead, recent work shows that a specific initial parameter set can be learned from a population of tasks, i.e., dataset and target variable for supervised learning tasks. Using this initial parameter set leads to faster convergence for new tasks (model-agnostic meta-learning). Currently, methods for learning model initializations are limited to a population of tasks shar- ing the same schema, i.e., the same number, order, type and semantics of predictor and target variables. In this paper, we address the problem of meta-learning parameter initialization across tasks with different schemas, i.e., if the number of predictors varies across tasks, while they still share some variables. We propose Chameleon, a model that learns to align different pre- dictor schemas to a common representation. We use permutations and masks of the predictors of the training tasks at hand. In experiments on real-life data sets, we show that Chameleon successfully can learn parameter initializations across tasks with different schemas pro- viding a 26% lift on accuracy on average over random initialization and of 5% over a state-of-the-art method for fixed-schema learning model initializations. To the best of our knowledge, our paper is the first work on the problem of learning model initialization across tasks with different schemas. Introduction Due to the rapid growth of Deep Learning (Goodfellow, Bengio, and Courville 2016), the field of Machine Learn- ing has gained an immense popularity among researchers and the industry. Even people without much knowledge regarding this area are starting to build their own mod- els for research or business applications. However, this lack of expertise might become an obstacle in various cases such as finding the adequate hyperparameters for the desired modeling of the task. Meta-learning (Finn, * Equal Contribution Xu, and Levine 2018) and hyperparameter optimiza- tion (Wistuba, Schilling, and Schmidt-Thieme 2015; Schilling, Wistuba, and Schmidt-Thieme 2016) have aided both experts and non-experts in finding the best solutions such as model size, step length, regularization and parameter initialization (Kotthoff et al. 2017). Current approaches optimize virtually all of these parameters through the use of hyperparameter search methods with exception of the weights initialization. Instead, a suitable distribution is selected in almost all cases (Glorot and Bengio 2010; He et al. 2015). The reason is clear, a simple binary classification model with one fully-connected hidden layer with 64 neurons, input size 2 and output size 1, already yields at least 192 continuous weight parameters. This makes the hyper- parameter space of the initial network weights, even for small networks, far too complex for finding a good combination in a feasible time. Recent approaches in meta-learning and multi-task learning have shown that it is possible to learn such a per-weight initialization across different tasks if they have the same schema. For that reason, all of these approaches rely on using image tasks which can be easily scaled to a fixed shape. We want to extend this work to classical vector data, e.g. learning an initialization from a hearts disease data set which can then be used for a diabetes detection task that relies on similar features. In contrast to image data, there is no trivial solution to map Chameleon Component Target Permutation Figure 1: The main idea of Chameleon is to shift feature vectors of different tasks to a shared representation. In this picture the top part represents schemas from different tasks of the same domain. The bottom part represent the aligned features in a fixed schema. arXiv:1909.13576v2 [cs.LG] 1 Oct 2019

Upload: others

Post on 26-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Chameleon: Learning Model Initialization Across Tasks With ... · Due to the rapid growth of Deep Learning (Goodfellow, Bengio, and Courville 2016), the field of Machine Learn-ing

Chameleon: Learning Model Initializations Across Tasks WithDifferent Schemas

Lukas Brinkmeyer∗, Rafael Rego Drumond∗,Randolf Scholz, Josif Grabocka, Lars Schmidt-Thieme

Information Systems and Machine-Learning Lab - University of HildesheimSamelsonplatz 1

Hildesheim - 31141 - Germany{brinkmeyer,radrumond,scholz,josif,schmidt-thieme}@ismll.uni-hildesheim.de

Abstract

Parametric models, and particularly neural networks,require weight initialization as a starting point forgradient-based optimization. In most current practices,this is accomplished by using some form of randominitialization. Instead, recent work shows that a specificinitial parameter set can be learned from a populationof tasks, i.e., dataset and target variable for supervisedlearning tasks. Using this initial parameter set leadsto faster convergence for new tasks (model-agnosticmeta-learning). Currently, methods for learning modelinitializations are limited to a population of tasks shar-ing the same schema, i.e., the same number, order, typeand semantics of predictor and target variables.In this paper, we address the problem of meta-learningparameter initialization across tasks with differentschemas, i.e., if the number of predictors varies acrosstasks, while they still share some variables. We proposeChameleon, a model that learns to align different pre-dictor schemas to a common representation. We usepermutations and masks of the predictors of the trainingtasks at hand. In experiments on real-life data sets, weshow that Chameleon successfully can learn parameterinitializations across tasks with different schemas pro-viding a 26% lift on accuracy on average over randominitialization and of 5% over a state-of-the-art methodfor fixed-schema learning model initializations. To thebest of our knowledge, our paper is the first work onthe problem of learning model initialization across taskswith different schemas.

IntroductionDue to the rapid growth of Deep Learning (Goodfellow,Bengio, and Courville 2016), the field of Machine Learn-ing has gained an immense popularity among researchersand the industry. Even people without much knowledgeregarding this area are starting to build their own mod-els for research or business applications. However, thislack of expertise might become an obstacle in variouscases such as finding the adequate hyperparameters forthe desired modeling of the task. Meta-learning (Finn,

∗Equal Contribution

Xu, and Levine 2018) and hyperparameter optimiza-tion (Wistuba, Schilling, and Schmidt-Thieme 2015;Schilling, Wistuba, and Schmidt-Thieme 2016) haveaided both experts and non-experts in finding the bestsolutions such as model size, step length, regularizationand parameter initialization (Kotthoff et al. 2017).

Current approaches optimize virtually all of theseparameters through the use of hyperparameter searchmethods with exception of the weights initialization.Instead, a suitable distribution is selected in almost allcases (Glorot and Bengio 2010; He et al. 2015). Thereason is clear, a simple binary classification model withone fully-connected hidden layer with 64 neurons, inputsize 2 and output size 1, already yields at least 192continuous weight parameters. This makes the hyper-parameter space of the initial network weights, evenfor small networks, far too complex for finding a goodcombination in a feasible time.

Recent approaches in meta-learning and multi-tasklearning have shown that it is possible to learn sucha per-weight initialization across different tasks if theyhave the same schema. For that reason, all of theseapproaches rely on using image tasks which can be easilyscaled to a fixed shape. We want to extend this work toclassical vector data, e.g. learning an initialization froma hearts disease data set which can then be used for adiabetes detection task that relies on similar features. Incontrast to image data, there is no trivial solution to map

ChameleonComponent

TargetPermutation

Figure 1: The main idea of Chameleon is to shift featurevectors of different tasks to a shared representation.In this picture the top part represents schemas fromdifferent tasks of the same domain. The bottom partrepresent the aligned features in a fixed schema.

arX

iv:1

909.

1357

6v2

[cs

.LG

] 1

Oct

201

9

Page 2: Chameleon: Learning Model Initialization Across Tasks With ... · Due to the rapid growth of Deep Learning (Goodfellow, Bengio, and Courville 2016), the field of Machine Learn-ing

different feature vectors to a common representation.Simply padding the smaller task, so that both datasets have the same number of features, still does notguarantee that similar features contained in both tasksare aligned. Thus, we require an encoder that mapshearts and diabetes data to one feature representationwhich then can be used to find a general initialization viapopular meta-learning algorithms like reptile (Nichol,Achiam, and Schulman 2018a).

We propose a set-wise feature transformation modelcalled chameleon, named after a reptile capable ofadjusting its colors according to the environment inwhich it is located. chameleon deals with differentschemas by projecting them to a fixed input space whilekeeping features from different tasks but of the sametype or distribution in the same positions, as illustratedby Figure 1.

Our main contributions are as follows: (1) We showhow current meta-learning approaches can work withtasks of different representation as long as they theshare some variables with respect to type or seman-tics. (2) We propose an architecture, chameleon, thatcan generalize a projection of any feature space to ashared predictor vector. (3) We demonstrate that wecan improve state-of-the-art meta-learning methods bycombining them with this component.

Related WorkOur goal is to extend recent meta-learning approachesthat make use of multi-task and transfer learningmethods by adding a feature alignment to cast differ-ent inputs to a fixed representation. In this section wewill discuss various works related to our approach.

Research on Transfer Learning (Sung et al. 2018;Pan and Yang 2010) has shown that training a model ondifferent auxiliary tasks before actually fitting it to thetarget problem may provide better results. For instance,(Zoph et al. 2018) pre-train blocks of convolutional neu-ral networks on a small data set before training a jointmodel with a much larger data set. In contrast, usingonly a single image data set with different tasks, (Ranjan,Patel, and Chellappa 2019) train a truncated versionof Alex-Net (Krizhevsky, Sutskever, and Hinton 2012)with different heads. Each head tackles one specific tasksuch as detecting human-faces, gender, position, land-mark and rotation of the image. Although they work onthe same input data, each task has a specific loss thatback-propagates to the initial layers of the network.

Similarly, the work of (Kendall, Gal, and Cipolla 2018)train an encoder with three different decoder heads, aim-ing to reconstruct the image pixel-wise based on seman-tic segmentation, object segmentation, and pixel depth.These papers show that different tasks can be used inparallel during the training or pre-training of a modelwith a distinct loss to further improve on the perfor-mance of the next task. Moreover, the positive effect oftransfer learning has also been successfully demonstratedacross tasks from different domains (Ganin et al. 2016;Tzeng et al. 2015; 2017).

Motivated by transfer learning, few-shot learning re-search aims to generalize a representation of a new classof samples given only few instances (Duan et al. 2017;Finn et al. 2017; Snell, Swersky, and Zemel 2017).Several meta-learning approaches have been devel-oped to tackle this problem by introducing architec-tures and parameterization techniques specifically suitedfor few-shot classification (Munkhdalai and Yu 2017;Mishra et al. 2018). Moreover, (Finn, Abbeel, and Levine2017) showed that an adapted learning paradigm couldbe sufficient for learning across tasks. Other researchdirections such as the work of (Jomaa, Grabocka, andSchmidt-Thieme 2019) uses meta-learning to extractuseful meta-features from data sets to improve hyper-parameter optimization. The Model Agnostic Meta-Learning (maml) method describes a model initializa-tion algorithm by training an arbitrary model f acrossdifferent tasks. Instead of sequentially training the modelone task at a time, it uses update steps from differenttasks to find a common gradient direction that achievesa fast convergence for any objective. In other words,for each meta-learning update, we would need an initialvalue for the model parameters θ. Then, we sample tasksT , and for each task Tτ we find an updated version ofθ using N examples from the task performing gradientdescent as in:

θ′i ← θ − α∇θLτi(fθ) (1)

The final update of θ will be:

θ ← θ − β 1|T |∇θ∑i

Lτi(fθ′i) (2)

Finn et al. state that maml does not require learning anupdate rule (Ravi and Larochelle 2016), or restrictingtheir model architecture (Santoro et al. 2016). They ex-tended their approach by incorporating a probabilisticcomponent such that for a new task, the model is sam-pled from a distribution of models to guarantee a highermodel diversification for ambiguous tasks (Finn, Xu,and Levine 2018). However, maml requires to computesecond order derivatives, ending up being computation-ally heavy. reptile (Nichol, Achiam, and Schulman2018a) simplifies maml by numerically approximatingEquation (2) to replace the second derivative with theweights difference:

θ ← θ − β 1|T |∑i

(θ′i − θ) (3)

Which means we can use the difference between theprevious and updated version as an approximation ofthe second derivatives to reduce computational costs.The serial version is presented in Algorithm 1.

All of these approaches rely not only on a fixed featurespace but also on an identical alignment across all tasks.However, even similar data sets only share a subset oftheir features, while often times having a different orderor representation. To make the feature space uniqueacross all data sets, it is required to align the features.

Page 3: Chameleon: Learning Model Initialization Across Tasks With ... · Due to the rapid growth of Deep Learning (Goodfellow, Bengio, and Courville 2016), the field of Machine Learn-ing

Algorithm 1: reptile (Nichol, Achiam, andSchulman 2018a)

Data: meta data set T , learning rate β1 Randomly initialize model f with weight

parameters θ2 for iteration = 1, 2, ... do3 Sample task Tτ ∼ T with loss Lτ4 θ′ ← θ5 for k steps = 1,2,... do6 θ′ ← θ′ − α∇θ′Lτ (Yτ , f(Xτ ; θ′))7 end8 θ ← θ − β(θ′ − θ)9 end

10 return model f , weight parameters θ

The work of (Jomaa, Grabocka, and Schmidt-Thieme2019) uses meta-learning to extract useful meta-featuresfrom data sets independent of their size to improvehyper-parameter optimization. However, the data setfeatures are compressed without aligning them since thegoal is to generate a fixed number of meta-features andno instance-wise transformation.

Different approaches in the literature deal with fea-ture alignment in various fields. The work from (Panet al. 2010) describes a procedure for sentiment anal-ysis which aligns words across different domains thathave the same sentiment. Recently, (Zhang, Yu, and Tao2018) proposed an unsupervised framework called LocalDeep-Feature Alignment. The procedure computes theglobal alignment which is shared across all local neigh-borhoods by multiplying the local representations witha transformation matrix. So far none of these methodsfind a common alignment between features of differentdata sets.

We propose a novel feature alignment componentnamed chameleon which enables state-of-the-art meth-ods to work on top of tasks whose predictor vectorsdiffer not only in their length but also their concretealignment.

MethodologyMethods like reptile and maml try to find the bestinitialization for a specific model to work on a set Tof similar tasks. Typically, these approaches select asimple two-layer feed-forward neural network as theirbase model, in this work referred to as y. A task τ ∈ Tconsists of predictor data Xτ , a target Yτ , a predefinedtraining/test split τ = (Xtrain

τ , Y trainτ , Xtest

τ , Y testτ ) and a

loss function Lτ . However, every task τ has to share thesame predictor space of common size K, where similarfeatures shared across tasks are in the same position. Anencoder is needed to map the data representation Xτ

with feature length Fτ to a shared latent representationXτ with fixed feature length K.

enc : Xτ ∈ RN×Fτ 7−→ Xτ ∈ RN×K (4)

Where N represents the number of instances in Xτ , Fτis the number of features of task τ , and K is the sizeof the desired feature space. By combining this encoderwith a model y that works on a fixed input size K andoutputs the predicted target i.e. binary classification, itis possible to apply the reptile algorithm to learn aninitialization θinit across tasks with different predictorvectors. The optimization objective then becomes themeta loss for the combined network f = y ◦ enc over aset of tasks T :

argminθinit

Eτ∼T Lτ(Y testτ , f

(Xtestτ ; θ(u)

τ

))s.t. θ(u)

τ = A(u)(Xtrainτ , Y train

τ , Lτ , f ; θinit) (5)

Where θinit is the concatenation of the initial weightsθinit

enc for the encoder and θinity for model y, and θ

(u)τ

are the updated weights after applying the learningprocedure A for u iterations on the task τ as defined inAlgorithm 1 for the inner updates of reptile.

It is important to mention that learning one weightparameterization across any heterogeneous set of tasksis extremely difficult since it is most likely impossible tofind one initialization for two tasks with a vastly differentnumber and type of features. By contrast if two tasks areof similar length and share similar features, one can alignthe similar features in a shared representation (schema)so that a model can directly learn across different tasksby transforming the predictor as illustrated in Figure 1.

ChameleonConsider a set of tasks where a reordering matrix Πτ

exists that transforms predictor data Xτ into Xτ withXτ having the same schema for every task τ :

Xτ = Xτ ·Πτ ,where (6)[x1,1 . . . x1,K

.... . .

...xN,1 . . . xN,K

]︸ ︷︷ ︸

=

[x1,1 . . . x1,Fτ

.... . .

...xN,1 . . . xN,Fτ

]︸ ︷︷ ︸

·

[π1,1 . . . π1,K

.... . .

...πFτ ,1 . . . πFτ ,K

]︸ ︷︷ ︸

Πτ

Every xm,n represents the feature n of sample m. Everyπm,n represent how much of feature m (from samples inXτ ) should be shifted to position n in the adapted inputXτ . Finally, every xm,n represent the new feature n ofsample m in Xτ with the adapted shape and size. Thiscan be also expressed as xm,n =

∑Fτj=1 xm,jπj,n. In order

to achieve the same Xτ when permuting two features ofa task Xτ , we must simply permute the correspondingrows in Πτ to achieve the same Xτ .

For example: Consider that task a has features [apples,bananas, melons] and task b features [lemons, bananas,apples]. Both can be transformed to the same represen-tation [apples, lemons, bananas, melons] by replacingmissing features with zeros and reordering them. This

Page 4: Chameleon: Learning Model Initialization Across Tasks With ... · Due to the rapid growth of Deep Learning (Goodfellow, Bengio, and Courville 2016), the field of Machine Learn-ing

InputXτSize:N×Fτ

Fτ×N Fτ×32 F ×32 Fτ×64 Fτ×KΠ(Xτ)Fτ×K

Chameleon

Reshape Conv1D(N×32×1)

DenseLayer(64)

Conv1D DenseLayer(K)

Softmax

OutputXτΠ(Xτ)Size:N×K

Mat.Mul.τ

(32×32×1)

Figure 2: The Chameleon Architecture: N represents the number of samples in τ , Fτ is equal to the featurelength of τ , and K is defined as the desired feature space. “Conv(a× b× c)” is a convolution operation with a inputchannels, b output channels and kernel length c. “Dense Layer(i)” is a fully connected layer with i neurons thattransforms each of the F inputs to an i-dimensional feature vector.

transformation must have the same result for a and b in-dependently of their feature order. In a real life scenario,features might come with different names or sometimestheir similarity is not clear to the human eye.

Our proposed component, denoted by Φ, takes a taskand outputs the corresponding reordering matrix:

Φ(Xτ , θenc) = Πτ (7)

The component Φ is a neural network parameterizedby θenc. It consists of two 1D-Convolutions, two densehidden layers, where the last one is the output layer thatestimates the alignment matrix via a softmax activation.The input transposed to size [Fτ ×N ] (where N is thenumber of samples) represents all the values that eachfeature has for every sample. We apply a linear trans-formation using 1D-Convolutions with kernel length 1and output size of 32 channels, twice. The output of size[Fτ ×32] is followed by one regular linear transformationof size 64 and another one of size K, resulting in one vec-tor per original feature displaying the relation to each ofthe K features in the target space. Each of these vectorspasses through a softmax layer, computing the probabil-ity that a feature of Xτ will be shifted to each positionof Xτ . This simplifies the objective of the network to asimple classification based on the new position of eachfeature in τ . The overall architecture can be seen in Fig-ure 2. This reordering can be used to encode the tasksto the shared representation as defined in Equation (6).The encoder necessary for training reptile across taskswith different predictor vector by optimizing Equation(5) is then given as:

enc : Xτ 7−→ Xτ · Φ(Xτ , θenc) = Xτ · Πτ (8)

Reordering TrainingSimply joint training the network y ◦ Φ as describedabove, will not teach Φ how to reorder the features to ashared representation. That is why training Φ specificallywith the objective of reordering features is necessary.In order to do so, it is essential to optimize the modelon a reordering objective while using a meta data set.

This contains similar tasks τ , meaning for all tasksthere exists a reordering matrix Πτ that maps to theshared representation. If Πτ is known beforehand, wecan optimize Chameleon by minimizing the expectedreordering loss over the meta data set:

θenc = argminθenc

Eτ∼T trainLΦ

(Πτ , Πτ

)(9)

where LΦ is the softmax cross-entropy loss, Πτ is theground-truth (one-hot encoding of the new position foreach variable), and Πτ is the prediction. This trainingprocedure can be seen in Algorithm 2. The trainedchameleon model can then be used to compute theΠτ for any unseen task τ ∈ T .

Algorithm 2: chameleon TrainingData: Meta data set T , latent dimension K,

learning rate γ1 Randomly initialize weight parameters θenc of

chameleon model2 for training iteration = 1, 2, ... do3 randomly sample τ ∼ T4 θenc ←− θenc − γ∇LΦ(Πτ ,Φ(Xτ , θenc)) .5 end6 return θenc

After this training procedure we can use the learnedweights as initialization for Φ before optimizing y◦Φ withreptile without further using LΦ. Experiments showthat this procedure improves our results significantlywhile compared to only optimizing the joint meta-loss.

Training the chameleon component to reorder simi-lar tasks to a shared representation not only requires ameta data set, but one where the true reordering matrixΠτ is provided for every task. In application this meansmanually matching similar features of different trainingtasks so that novel tasks can be matched automatically.However, it is possible to sample a broad number oftasks from a single data set by sampling smaller subtasks from it, selecting a random subset of features in

Page 5: Chameleon: Learning Model Initialization Across Tasks With ... · Due to the rapid growth of Deep Learning (Goodfellow, Bengio, and Courville 2016), the field of Machine Learn-ing

Data set Samples Features Features in TrainWine 6497 12 8

Abalone 4177 8 5Telescope 19020 11 8

Hearts 597 13 13*Diabetes 768 8 0*

Table 1: List of data sets used in each of our experiments.∗ Used in conjunction with all of Hearts features inTraining and all of Diabetes Features in Test

arbitrary order for N random instances. Thus, it is notnecessary to manually match the features since all thesesub tasks share the same Πτ apart from the respectivepermutation of the rows as mentioned above.

ExperimentsIn this section, we describe our experimental setup andpresent the results of our approach on different datasets. For the sake of simplicity we restrict ourselves tobinary classification tasks. As described in the previoussection, our architecture consists of chameleon and abase model y. In all of our experiments we compare theperformance of four approaches by training on randomlysampled tasks: (i) with model y from a random initializa-tion, (ii) with model y from the initialization obtainedby reptile training on the meta training data T train

padded to size K, (iii) with the joint model y ◦ enc fromthe initialization obtained by reptile training on themeta training data T train, and (iv) with the joint modelfrom the initialization obtained by training chameleonas defined in Equation (9) before using reptile.

The number of gradient updates on a new task in thereptile algorithm is set to 10. Likewise, a task fromthe meta test data is evaluated by initializing with thecurrent weights θinit, performing 10 updates with itstraining data and then scoring the respective test data.All experiments are conducted with the same modelarchitecture. The base model y is a feed-forward neuralnetwork with two dense hidden layers which have 64neurons each. chameleon is utilized as described in theprevious section and combined with the base network.This architecture was chosen to match the original oneused in the papers for maml and reptile.

Meta Data sets For experimental purposes, we uti-lize a single data set as meta data set by samplingthe training and test tasks from the data set. This al-lows us to evaluate our method on different domainswithout manually matching training tasks since Πτ isnaturally given. Novel features can also be introducedduring testing by not only splitting the instances butalso the features of a data set in train and test partition.Training tasks τ ∈ T train are then sampled by selectinga random subset of the training features in arbitraryorder for N instances, while a stratified sampling guar-antees that test tasks τ ∈ T test contain both features

from train and test, while sampling the instances fromthe test set only. In order to demonstrate our methodon two actually separated tasks we use one data set asT train and a different, related data set as T test.

Data Sets. We evaluate our approach on five UCIdata sets (Dua and Graff 2017) in four experimentalsetups. The characteristics of the individual data setsare described in Table 1. The Hearts and the Diabetesdataset will be used in conjunction to evaluate the per-formance when testing on a similar but separate dataset.We binarize both datasets from the level of the severityof the disease to “Disease” or “No Disease”. For theWine data set we use binary labels according to thecolor of the Wine (red or white). In the case of theAbalone dataset, the binary labels indicate whether thenumber of rings of an abalone shell is less than 10 orgreater equal 10. The Telescope dataset is already abinary classification task.

In contrast to our baseline approach (Nichol, Achiam,and Schulman 2018a) and most related literature (Finn,Abbeel, and Levine 2017; Snell, Swersky, and Zemel2017; Finn, Xu, and Levine 2018) we are not evaluatingour approach on image data. Realigning features is notfeasible with images since only the original sequenceof pixels represents the image, while the channels (forcolored inputs) already follow a standard input sequence(red, green and blue). Additionally, as mentioned beforeimage schemas can be aligned trivially by scaling to thesame size, which is why we keep our scope on standard1-dimensional non-sequential data.

Design Setup. For the experiments on the Wine,Abalone and Telescope datasets, 30 percent of the in-stances are used for initializing chameleon, 50 percentfor training and 20 percent for testing. In the finalexperiment, the Hearts and the diabetes dataset areused in conjunction. The full training is performed onthe Hearts dataset, while the diabetes dataset is usedfor testing. For all experiments, we sample 30 trainingand 30 validation instances for a new task. During thereordering-training phase and the inner updates of jointtraining we use the adam optimizer (Kingma and Ba2014) with an initial learning rate of 0.00001 and 0.0001respectively. The meta-updates of reptile are carriedout with a learning rate β of 0.01. The reordering train-ing phase is run for 1000 epochs. All experiments areconducted in two variants: (i) with all features being uti-lized during training and (ii) with novel features duringtest time. For this purpose we impose a train test spliton the features as stated in Table 1 and when sampling avalidation task we sample 25 to 75% of the features fromthe unseen ones. We call this experiment setting ”Split”,while (i) is refered to as ”No Split”. In the final exper-iment, we evaluate the use of the Hearts Disease datafor training, while evaluating the resulting initializationon the Diabetes data. These data sets share a subset offeatures i.e. age and blood pressure, while having strong

Page 6: Chameleon: Learning Model Initialization Across Tasks With ... · Due to the rapid growth of Deep Learning (Goodfellow, Bengio, and Courville 2016), the field of Machine Learn-ing

correlation between other features due to their shareddomain. Our work is built on top of reptile (Nichol,Achiam, and Schulman 2018a) but can be used in con-junction with any meta-learning method. We opted touse reptile since it does not require second derivatives,and the code is publicly available (Nichol, Achiam, andSchulman 2018b) while also being easy to adapt to ourproblem.

0 2000 4000 6000 8000 10000 12000 14000Sampled Tasks for Meta-Training

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Avera

ge T

ask

Loss

aft

er

Tra

inin

g 1

0 E

poch

s

(a)(i) y - Scratch (ii) y (iii) y (iv) y - Pretrained

10 100 200 300Epochs

0.3

0.4

0.5

0.6

0.7

Loss

(b)Chameleon

Scratch

10 100 200 300Epochs

0.2

0.4

0.6

0.8

(c)Chameleon

Scratch

10 100 200 300Epochs

0.3

0.5

0.7

0.9

(d)Chameleon

Scratch

Figure 3: Meta test loss (cf. eq. (3.5), u = 10) on theWine data set. Graph (a) shows the average Meta-Testloss over 10 runs during reptile training. The graphis smoothed with a moving average of size 20 due tothe variance when using a meta batch size of one whensampling Test tasks. Each point represents the test losson a sampled task after training on it for 10 epochs.Graphs (b), (c) and (d) show a snapshot for the cor-reponding training of a single task until convergenceafter initializing with the current learned parametersθinit. The dotted black line marks the value plotted ingraph (a) after 10 epochs. Note that the line for scratchtraining does not improve since it represents the modelrandomly initialized for every task.

Initial Experiment. In our first experiment we usethe Wine data set (Dua and Graff 2017; Cortez et al.2009) to illustrate how our approach can learn a usefulinitialization across tasks of different representation (con-tribution 1). We analyze the four approaches describedabove by training the model y or y ◦ Φ respectively.The models for approaches (ii) and (iii) are trained us-ing reptile, while for (iv) chameleon is first trainedfor 1000 epochs before performing reptile training on

1 3 5 7 9 11True Feature Position

1

3

5

7

9

11

Reord

eri

ng

0.0

0.2

0.4

0.6

0.8

1.0

Perc

ent

Shifte

d

Figure 4: Heat Map of the feature shifts for the Winedata set computed with chameleon after reordering-training. The x-axis represents the twelve features ofthe wine data set in the correct order and the y-axisshows to which position these features are shifted.

the joint model. In every reptile epoch we evaluatethe current initialization by training on a task sampledfrom the evaluation set T test (we always discard theupdates from the evaluation set). In parallel we alsorepeat the last step for (i) but evaluate on the model ywith randomly initialized weights, we refer to this stepas Scratch.

Our experimental results can be analyzed by observ-ing the test loss for a task after performing the innerupdates. Figure 3(a) shows the validation loss during thereptile training for the Wine experiments. Each datapoint corresponds to the training of a validation taskover 10 epochs and evaluating its test data afterwards.Figures 3(b), (c) and (d) show an exemplary snapshot ofthe test loss when training a sampled task with the pro-posed method (iv). The snapshots show the training of atask until convergence when initializing with the currentθinit, while the actual reptile training only computes10 updates, indicated by the dotted line, before perform-ing a meta-update. The snapshots show the expectedreptile behavior, namely a faster convergence whenusing the current learned initialization compared to arandom one. Note that Figure 3(a) shows the test loss af-ter 10 updates, which is why the snapshot (b) after 1000meta-epochs shows a lower convergence while the graphin (a) is already outperforming training from scratch.One can see that all three approaches trained with rep-tile generate an initialization that converges faster thanrandom. However, simply adding the chameleon com-ponent only leads to a faster meta-training convergence.Only when including the pre-training phase defined inAlgorithm 2, one can observe a lower meta-loss whichcorresponds to a more efficient initialization.

The result of pretraining chameleon on the Winedata set can be seen in Figure 4. The x-axis shows thetwelve distinct features of the Wine data, and the y-axisrepresents the predicted position. The color indicatesthe average portion of the feature that is shifted to the

Page 7: Chameleon: Learning Model Initialization Across Tasks With ... · Due to the rapid growth of Deep Learning (Goodfellow, Bengio, and Courville 2016), the field of Machine Learn-ing

Loss AccuracyData Set

Wine (NS)Wine (S)

Abalone (NS)Abalone (S)

Telescope (NS)Telescope (S)

Hearts+Diabetes

Scratch Reptile Cha. Cha.+0.82 0.49 0.48 0.370.83 0.48 0.48 0.430.83 0.54 0.54 0.520.82 0.54 0.54 0.520.82 0.51 0.46 0.360.82 0.52 0.48 0.420.84 0.59 0.59 0.56

Scratch Reptile Cha. Cha.+0.52 0.78 0.79 0.850.52 0.79 0.80 0.820.52 0.72 0.72 0.740.52 0.72 0.72 0.730.52 0.76 0.78 0.840.52 0.75 0.77 0.800.51 0.69 0.68 0.70

Table 2: Loss and accuracy scores for the experiments with each data set and each model averaged over 20 tasks. Allresults are measured after performing 10 update steps on a new task and evaluating the loss/accuracy on a validationset. ’S’ stands for ”Split” and ’NS’ for ”No Feature Split”. ”Cha.” represents the Reptile experiments using Chameleonalong with the same model, while in ”Cha+” Chameleon is pre-trained. Best results are bold-faced.

0 2000 4000 6000 8000 10000 12000 14000Sampled Tasks for Meta-Training

0.50

0.55

0.60

0.65

0.70

0.75

0.80

0.85

Aver

age

Task

Los

s afte

r Tra

inin

g 10

Epo

chs

(i) y - Scratch (ii) y (iii) y (iv) y - Pretrained

Figure 5: Meta test loss (cf. eq. (3.5), u = 10) on theHeart vs. Diabetes data sets. The graph shows the aver-age Meta-Test loss over 10 runs during reptile training.The graph is smoothed with a moving average of size20 due to the variance when using a meta batch size ofone when sampling Test tasks. Each point representsthe test loss on a sampled task after training on it for10 epochs.

corresponding position. One can see that the componentmanages to learn the true feature position in almost allcases. Moreover, this illustration does also show thatchameleon can be used to detect similarity betweendifferent features by indicating which pairs are confusedmost often. For example, feature two and three areshowing a strong correlation which is very plausiblesince they depict the “free sulfur dioxide” and “totalsulfur dioxide level” of the Wine. This demonstrates thatour proposed architecture is able to learn an alignmentbetween different feature spaces (contribution 2).

Final Results. In Figures 5 we can see the resultsfor the hearts vs. diabetes experiment (our supplemen-tary material contains the remaining results). In allexperiments simply adding chameleon does not ele-

vate the performance by a significant margin. However,adding reordering-training always results in a clear per-formance lift over the other approaches. The fact thatwe can only see performance improvements when usingreordering-training shows that a higher model capacityalone is not responsible for the results. In all experiments,pre-training chameleon shows superior performance(contribution 3). When adding novel features during testtime chameleon is still able to outperform the othersetups although with a lower margin. Most importantly,when training on the Hearts data while evaluating onthe Diabetes data, our method still provides noticeableimprovements. These findings are also depicted in ournumerical results stated in Table 2.

Hardware. We run our experiments in CPU usingIntel Xeon E5410 processors. Pre-training takes in av-erage 10 minutes, while joint training takes around 7hours to complete for all the tasks of each experiment.Our code is available online for reproduction purposesat: https://github.com/radrumond/Chameleon.

ConclusionIn this paper we show how multi-tasking meta-learnerscan work along tasks with different predictor vectors. Inorder to do so we present chameleon, a network capableof realigning features based on their distributions. Wealso propose a novel pre-training framework that isshown to learn useful permutations across tasks in asupervised fashion without requiring actual labels. Ourmodel shows performance lifts even when presented withfeatures not seen in training.

By experimenting with different data sets and evencombining two distinct ones from the same domain, weshow that chameleon can be generalized for any kindof task. Another advantage of chameleon is that itcan be used with any model and optimization method.

As future work, we would like to extend chameleonto time-series features and reinforcement learning, asthese tend to present more variations in different tasksand they would benefit greatly from our model.

Page 8: Chameleon: Learning Model Initialization Across Tasks With ... · Due to the rapid growth of Deep Learning (Goodfellow, Bengio, and Courville 2016), the field of Machine Learn-ing

ReferencesCortez, P.; Cerdeira, A.; Almeida, F.; Matos, T.; andReis, J. 2009. Modeling wine preferences by data min-ing from physicochemical properties. Decision SupportSystems 47(4):547–553.Dua, D., and Graff, C. 2017. UCI machine learning repos-itory.Duan, Y.; Andrychowicz, M.; Stadie, B.; Ho, O. J.;Schneider, J.; Sutskever, I.; Abbeel, P.; and Zaremba, W.2017. One-shot imitation learning. In Advances in neuralinformation processing systems, 1087–1098.Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnosticmeta-learning for fast adaptation of deep networks. InProceedings of the 34th International Conference on Ma-chine Learning-Volume 70, 1126–1135. JMLR. org.Finn, C.; Yu, T.; Zhang, T.; Abbeel, P.; and Levine,S. 2017. One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905.Finn, C.; Xu, K.; and Levine, S. 2018. Probabilisticmodel-agnostic meta-learning. In Advances in Neural In-formation Processing Systems, 9537–9548.Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.;Larochelle, H.; Laviolette, F.; Marchand, M.; and Lem-pitsky, V. 2016. Domain-adversarial training of neuralnetworks. The Journal of Machine Learning Research17.1:2096–2030.Glorot, X., and Bengio, Y. 2010. Understanding thedifficulty of training deep feedforward neural networks.In Proceedings of the thirteenth international conferenceon artificial intelligence and statistics, 249–256.Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. DeepLearning. MIT Press. http://www.deeplearningbook.org.He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delvingdeep into rectifiers: Surpassing human-level performanceon imagenet classification. In The IEEE InternationalConference on Computer Vision (ICCV).Jomaa, H. S.; Grabocka, J.; and Schmidt-Thieme, L.2019. Dataset2vec: Learning dataset meta-features. arXivpreprint arXiv:1905.11063.Kendall, A.; Gal, Y.; and Cipolla, R. 2018. Multi-tasklearning using uncertainty to weigh losses for scene geome-try and semantics. In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition, 7482–7491.Kingma, D. P., and Ba, J. 2014. Adam: A method forstochastic optimization. arXiv preprint arXiv:1412.6980.Kotthoff, L.; Thornton, C.; Hoos, H. H.; Hutter, F.;and Leyton-Brown, K. 2017. Auto-weka 2.0: Auto-matic model selection and hyperparameter optimizationin weka. The Journal of Machine Learning Research18(1):826–830.Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Im-agenet classification with deep convolutional neural net-works. In Advances in neural information processing sys-tems, 1097–1105.Mishra, N.; Rohaninejad, M.; Chen, X.; and Abbeel, P.2018. A simple neural attentive meta-learner. In Inter-national Conference on Learning Representations.

Munkhdalai, T., and Yu, H. 2017. Meta networks. InProceedings of the 34th International Conference on Ma-chine Learning-Volume, 2554–2563.Nichol, A.; Achiam, J.; and Schulman, J. 2018a. On first-order meta-learning algorithms. CoRR abs/1803.02999.Nichol, A.; Achiam, J.; and Schulman, J. 2018b. Su-pervised reptile. https://github.com/openai/supervised-reptile.Pan, S. J., and Yang, Q. 2010. A survey on transferlearning. IEEE Transactions on knowledge and data en-gineering 22(10):1345–1359.Pan, S. J.; Ni, X.; Sun, J.-T.; Yang, Q.; and Chen, Z.2010. Cross-domain sentiment classification via spectralfeature alignment. In Proceedings of the 19th interna-tional conference on World wide web, 751–760. ACM.Ranjan, R.; Patel, V. M.; and Chellappa, R. 2019. Hyper-face: A deep multi-task learning framework for face detec-tion, landmark localization, pose estimation, and genderrecognition. IEEE Transactions on Pattern Analysis andMachine Intelligence 41(1):121–135.Ravi, S., and Larochelle, H. 2016. Optimization as amodel for few-shot learning. ICLR.Santoro, A.; Bartunov, S.; Botvinick, M.; Wierstra, D.;and Lillicrap, T. 2016. Meta-learning with memory-augmented neural networks. In International conferenceon machine learning, 1842–1850.Schilling, N.; Wistuba, M.; and Schmidt-Thieme, L. 2016.Scalable hyperparameter optimization with products ofgaussian process experts. In Joint European confer-ence on machine learning and knowledge discovery indatabases, 33–48. Springer.Snell, J.; Swersky, K.; and Zemel, R. 2017. Prototypicalnetworks for few-shot learning. In Advances in NeuralInformation Processing Systems, 4077–4087.Sung, F.; Yang, Yongxin an Zhang, d. L.; Xiang, T.; Torr,P. H.; and Hospedales, T. M. 2018. Learning to compare:Relation network for few-shot learning. In Proceedings ofthe IEEE Conference on Computer Vision and PatternRecognition, 1199–1208.Tzeng, E.; Hoffman, J.; Darrell, T.; and Saenko, K. 2015.Simultaneous deep transfer across domains and tasks.In Proceedings of the IEEE International Conference onComputer Vision, 4068–4076.Tzeng, E.; Hoffman, J.; Saenko, K.; and Darrell, T. 2017.Adversarial discriminative domain adaptation. In Pro-ceedings of the IEEE Conference on Computer Vision andPattern Recognition, 7167–7176.Wistuba, M.; Schilling, N.; and Schmidt-Thieme, L. 2015.Learning hyperparameter optimization initializations. In2015 IEEE International Conference on Data Scienceand Advanced Analytics (DSAA), 1–10. IEEE.Zhang, J.; Yu, J.; and Tao, D. 2018. Local deep-featurealignment for unsupervised dimension reduction. IEEETransactions on Image Processing 27(5):2420–2432.Zoph, B.; Vasudevan, V.; Shlens, J.; and Le, Q. V. 2018.Learning transferable architectures for scalable imagerecognition. In Proceedings of the IEEE conference oncomputer vision and pattern recognition, 8697–8710.

Page 9: Chameleon: Learning Model Initialization Across Tasks With ... · Due to the rapid growth of Deep Learning (Goodfellow, Bengio, and Courville 2016), the field of Machine Learn-ing

Supplementary Material

0 2500 5000 7500 10000 12500

0.4

0.6

0.8

Aver

age

Task

Los

s

Wine - No Split

0 2500 5000 7500 10000 125000.5

0.6

0.7

0.8

Abalone - No Split

0 2500 5000 7500 10000 12500

0.4

0.6

0.8

Telescope - No Split

0 2500 5000 7500 10000 12500Sampled Tasks for Meta-Training

0.4

0.5

0.6

0.7

0.8

0.9

Aver

age

Task

Los

s

Wine - Split

0 2500 5000 7500 10000 12500Sampled Tasks for Meta-Training

0.5

0.6

0.7

0.8

Abalone - Split

0 2500 5000 7500 10000 12500Sampled Tasks for Meta-Training

0.4

0.5

0.6

0.7

0.8

0.9Telescope - Split

(i) y - Scratch (ii) y (iii) y (iv) y - Pretrained

Figure 6: The graphs shows the average Meta-Test lossover 10 runs during reptile training. The graph issmoothed with a moving average of size 20 due to thevariance when using a meta batch size of one whensampling Test tasks. Each point represents the test losson a sampled task after training on it for 10 epochs.