
Diverse Ensemble Evolution: Curriculum Data-Model Marriage

Tianyi Zhou, Shengjie Wang, Jeff A. Bilmes
Depts. of Computer Science and Engineering, and Electrical and Computer Engineering
University of Washington, Seattle
{tianyizh, wangsj, bilmes}@uw.edu

    Abstract

We study a new method “Diverse Ensemble Evolution (DivE2)” to train an ensemble of machine learning models that assigns data to models at each training epoch based on each model’s current expertise and an intra- and inter-model diversity reward. DivE2 schedules, over the course of training epochs, the relative importance of these characteristics; it starts by selecting easy samples for each model, and then gradually adjusts towards the models having specialized and complementary expertise on subsets of the training data, thereby encouraging high accuracy of the ensemble. We utilize an intra-model diversity term on data assigned to each model, and an inter-model diversity term on data assigned to pairs of models, to penalize both within-model and cross-model redundancy. We formulate the data-model marriage problem as a generalized bipartite matching, represented as submodular maximization subject to two matroid constraints. DivE2 solves a sequence of continuous-combinatorial optimizations with slowly varying objectives and constraints. The combinatorial part handles the data-model marriage while the continuous part updates model parameters based on the assignments. In experiments, DivE2 outperforms other ensemble training methods under a variety of model aggregation techniques, while also maintaining competitive efficiency.

    1 Introduction

Ensemble methods [7, 57, 31, 8] are simple and powerful machine learning approaches to obtain improved performance by aggregating predictions (e.g., majority voting or weighted averaging) over multiple models. Over the past few decades, they have been widely applied, consistently yielding good results. For neural networks (NN) in particular, ensemble methods have shown their utility from the early 1980s [72, 28, 39, 10] to recent times [50, 27, 66, 20]. State-of-the-art results on many contemporary competitions/benchmarks are achieved via ensembles of deep neural networks (DNNs), e.g., ImageNet [17], SQuAD [55], and the Kaggle competitions (https://www.kaggle.com/). In addition to boosting state-of-the-art performance of collections of large models, ensembles of small and weak models can achieve performance comparable to much larger individual models, and this can be useful when machine resources are limited. Inference over an ensemble of models, moreover, can be easily parallelized even on a distributed machine.

A key reason for the success of ensemble methods is that the diversity among different models can reduce the variance of the combined predictions and improve generalization. Intuitively, diverse models tend to make mistakes on different samples in different ways (e.g., assigning the largest probability to different wrong classes), so during majority voting or averaging, those different mistakes cancel each other out and the correct predictions can prevail. As neural networks grow larger in size and intricacy, their variance correspondingly increases, offering opportunity for reduction by a diverse ensemble of such networks.

    32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.


Randomization is a widely used technique to produce diverse ensembles. Classical ensemble methods such as random initialization [17, 63], random forests [31, 8], and Bagging [7, 19] encourage diversity by randomly initializing starting points/subspaces or by resampling the training set for different models. Ensemble-like methods for DNNs, e.g., dropout [61] and swapout [59], implicitly train multiple diverse models by randomly dropping hidden units out during the training of a single model. Diversity can also be promoted by sequentially training multiple models to encourage a difference between the current and previously trained models, such as Boosting [57, 23, 50] and snapshot ensembles [32]. Such sequential methods, however, are hard to parallelize and can lead to long training times when applied to neural networks.

Despite the consensus that diversity is essential to ensemble training, there is little work explicitly encouraging and controlling diversity during ensemble model training. Most previous methods encourage diversity only implicitly, are incapable of adjusting the amount of diversity precisely based on criteria determined during different learning stages, and lack an explicit diversity representation. Some methods implicitly encourage diversity during training, but they rely on learning rate scheduling (e.g., snapshot ensembles [32]) or on end-to-end training of an additive combination of models (e.g., mixture of experts [33, 34, 58]) to promote diversity, which is hard to control and interpret.

Moreover, many existing ensemble training methods train all models in the ensemble on all samples in the training set by repeatedly iterating through it, so the training cost increases linearly with both the number of models and the number of samples. In such cases, each model might waste much of its time on a large number of redundant or irrelevant samples that have already been learnt, and that might contribute nearly zero-valued gradients. The performance of an ensemble on each sample depends only on whether a subset of models (e.g., half, for majority voting) makes a correct prediction, so it should be unnecessary to train each model on every sample.

In this paper, we aim to achieve an ensemble of models using explicitly encouraged diversity and focused expertise, i.e., each model is an expert in a sufficiently large local region of the data space, and all the models together cover the entire space. We propose an efficient meta-algorithm, “diverse ensemble evolution (DivE2)”, that “evolves” the ensemble adaptively by changing, over stages, both the diversity encouragement and each model’s expertise, based on information available during ensemble training. It does this by encouraging both intra- and inter-model diversity. Each training stage is formulated as a hybrid continuous-combinatorial optimization. The combinatorial part solves a data-model-marriage assignment via submodular generalized bipartite matchings; the algorithm explicitly controls the diversity of the ensemble and the expertise of each model by assigning different subsets of the training data to different models. The continuous part trains each model’s parameters using the assigned subset of data. At each stage, all the models may be updated in parallel after receiving their assigned data.

A similar approach to encouraging inter-model diversity was used in [15], but there the diversity of different models was achieved by encouraging diverse subsets of features, and the goal was to cluster the features into potentially overlapping groups; here we encourage diverse subsets of samples to be assigned to models during the training process, matching data samples to models.

We apply DivE2 to four benchmark datasets and show that it improves over randomization-based ensemble training methods under a variety of approaches for aggregating ensemble models into a single prediction. Moreover, with model-selection-based ensemble aggregation (defined below), DivE2 can quickly reach reasonably good ensemble performance after only a few learning stages, even though each individual model has poor performance on the entire training set. Furthermore, DivE2 exhibits competitive efficiency and good interpretability of model expertise, both of which can be important in DNN training.

    2 Diverse Ensemble Evolution (DivE2): Formulation

    2.1 Data-Model Marriage

An ensemble of models can make an accurate prediction on a sample without requiring that each model make an accurate prediction on that sample [41, 26, 42]. Rather, it requires a subset of models to produce accurate predictions, and the remainder may err in different ways. Hence, rather than training each model on the entire training set, we may in theory assign a data subset to each model. Then each sample is learned by a subset of models, and different models are trained on different subsets, thereby avoiding common mistakes across models.

Consider a weighted bipartite graph (see Fig. 1), with the set of $n = |V|$ training samples V on the left side, the set of $m = |U|$ models U on the right side, and edges $E \triangleq \{(v_j, u_i) \mid v_j \in V, u_i \in U\}$ connecting all sample-model pairs, with edge weights defined by the loss $\ell(v_j; w_i)$ of sample $v_j = (x_j, y_j)$ (where $x_j$ is the feature vector and $y_j$ is the label(s)) on model $u_i$ (where model $u_i$ is parameterized by $w_i$). We wish to marry samples with models by selecting a subset of edges having overall small loss. We can express this as follows:

$$
\max_{\{w_i\}_{i=1}^{m}} \;\; \max_{A \subseteq E} \;\; \sum_{(v_j, u_i) \in A} \big(\beta - \ell(v_j; w_i)\big), \qquad (1)
$$

where $\beta - \ell(v_j; w_i)$ translates loss to reward (or accuracy), and $\beta$ is a constant larger than any per-sample loss on any model, i.e., $\beta \geq \ell(v_j; w_i)$ for all $i, j$.

Figure 1: Data-Model Marriage as Bipartite Matching. Samples V (left) and models U (right) are connected by edges E; each sample is incident to at most k selected edges (k = 2 in the figure) and each model to at most p selected edges (p = 3 in the figure).

With no constraints, all edges are selected, thus requiring all models to learn all samples. As mentioned above, for ensembles, every sample need only be learned by a few models, and thus, for any sample v, we may wish to limit the number of incident edges selected to be no greater than k. This can be achieved using a partition matroid $\mathcal{M}_V = (E, \mathcal{I}_V)$, where $\mathcal{I}_V$ contains all subsets of E in which no sample is incident to more than k edges, i.e., $\mathcal{I}_V = \{A \subseteq E : |A \cap \delta(v)| \leq k, \forall v \in V\}$, where $\delta(v) \subseteq E$ is the set of edges incident to v (likewise $\delta(u)$ for $u \in U$). Therefore, as long as a selected subset $A \subseteq E$ satisfies the constraint ($A \in \mathcal{I}_V$), every sample is selected by at most k models.

With only the constraint $A \in \mathcal{I}_V$, different models can be assigned dramatically differently sized data subsets. In the extreme unbalanced case, k models might get all the training data, while the other models get no data at all. This is obviously undesirable, because those k models will learn from the same data (no diversity and no specialized and complementary expertise), while the other models learn nothing. Running time is also not improved, since training time is linear in the size of the largest assigned data set, which is all of the data in this case. We therefore introduce a second partition matroid constraint $\mathcal{M}_U = (E, \mathcal{I}_U)$, which limits to p the number of samples selected by each model. Specifically, $\mathcal{I}_U = \{A \subseteq E : |A \cap \delta(u)| \leq p, \forall u \in U\}$. Eq. (1) then becomes:

$$
\max_{\{w_i\}_{i=1}^{m}} \;\; \max_{A \subseteq E,\, A \in \mathcal{I}_V \cap \mathcal{I}_U} \;\; \sum_{(v_j, u_i) \in A} \big(\beta - \ell(v_j; w_i)\big). \qquad (2)
$$

The interplay between the two constraints (i.e., $\mathcal{I}_V$: k models per sample, and $\mathcal{I}_U$: p samples per model) is important to our later design of a curriculum that leads to a diverse and complementary ensemble. When $mp < nk$, $\mathcal{I}_U$ tends to saturate (i.e., $|A \cap \delta(u)| = p, \forall u \in U$) earlier than $\mathcal{I}_V$. Hence, each model generally has the opportunity to select the top-p easiest samples (i.e., those having the smallest loss) for itself. We call this the “model selecting sample” (or early learning) phase, where each model quickly learns to perform well on a subset of data. On the other hand, when $mp > nk$, $\mathcal{I}_V$ tends to saturate earlier (i.e., $|A \cap \delta(v)| = k, \forall v \in V$), and each sample generally has the opportunity to select the best k models for itself. We call this the “sample selecting model” (or later learning) phase, where models may develop complementary expertise so that together they can perform accurately over the entire data space. We give conditions on which phase dominates in Lemma 1.
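To make the interplay of the two budgets concrete, here is a minimal sketch (ours, not the authors' code) of a greedy edge selection for the purely modular reward in Eq. (2): edges are taken in order of decreasing reward $\beta - \ell(v_j; w_i)$ as long as neither the per-sample budget k (matroid $\mathcal{I}_V$) nor the per-model budget p (matroid $\mathcal{I}_U$) is violated. The loss matrix and the constant `beta` are hypothetical placeholders.

```python
import numpy as np

def greedy_matroid_assignment(loss, k, p, beta=None):
    """Greedy data-model assignment for the modular reward in Eq. (2).

    loss : (n, m) array, loss[j, i] = loss of sample j on model i.
    k    : max number of models per sample  (partition matroid I_V).
    p    : max number of samples per model  (partition matroid I_U).
    Returns a list of selected edges (j, i).
    """
    n, m = loss.shape
    if beta is None:
        beta = loss.max() + 1e-6              # constant exceeding every per-sample loss
    reward = beta - loss
    order = np.argsort(-reward, axis=None)    # edges sorted by decreasing reward
    deg_sample = np.zeros(n, dtype=int)       # selected edges incident to each sample
    deg_model = np.zeros(m, dtype=int)        # selected edges incident to each model
    assignment = []
    for e in order:
        j, i = divmod(int(e), m)              # unravel flat index into (sample, model)
        if deg_sample[j] < k and deg_model[i] < p:   # stay inside both I_V and I_U
            assignment.append((j, i))
            deg_sample[j] += 1
            deg_model[i] += 1
    return assignment

# Toy usage: 6 samples, 3 models, each sample learned by at most k = 2 models,
# each model assigned at most p = 4 samples.
rng = np.random.default_rng(0)
print(greedy_matroid_assignment(rng.random((6, 3)), k=2, p=4))
```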

    2.2 Inter-model & Intra-model Diversity

To encourage multiple models to gain different proficiencies, the subsets of training data assigned to different models should be diverse. The two constraints introduced above help encourage diversity to a certain extent when p and k are small. For example, when $k = 1$ and $p \leq n/m$, no pair of models shares any training sample. Different training samples, however, can still be similar and thus redundant, and in this case the above approach might not encourage diversity when p or k is large. Therefore, we incorporate during training an explicit inter-model diversity term $F_{\mathrm{inter}}(A) \triangleq \sum_{i,j \in [m],\, i < j} F(\cdots)$, defined on the data assigned to each pair of models, where F is a submodular function, i.e., for any $A \subseteq B \subseteq V$ and $v \notin B$, $v \in V$, we have $F(\{v\} \cup A) - F(A) \geq F(\{v\} \cup B) - F(B)$. Submodular functions have been applied to a variety of diversity-driven tasks to achieve good results [45, 44, 3, 54, 25, 22].

Another issue with Eq. (2) is that each model might select easy but redundant samples when constraint $\mathcal{I}_U$ dominates (the model-selecting-sample phase). This is problematic, as each model might quickly focus on a small group of easy samples and may overfit to such a small region of the data space. We therefore introduce another set function $F_{\mathrm{intra}}(A) = \sum_{i \in [m]} F'(\delta(u_i) \cap A)$ to promote the diversity of the samples assigned to each model. Similar to F, we also choose $F'$ to be a submodular function. The optimization procedure now becomes:

$$
\max_{W} \;\; \max_{A \subseteq E,\, A \in \mathcal{I}_V \cap \mathcal{I}_U} \;\; G(A, W) \triangleq \sum_{(v_j, u_i) \in A} \big(\beta - \ell(v_j; w_i)\big) + \lambda F_{\mathrm{inter}}(A) + \gamma F_{\mathrm{intra}}(A), \qquad (3)
$$

where $\lambda$ and $\gamma$ are two non-negative weights to control the trade-offs between the reward term and the diversity terms, and we denote $W \triangleq \{w_i\}_{i=1}^{m}$ for simplicity. By optimizing the objective G(A, W), we explicitly encourage model diversity in the ensemble, while ensuring every sample gets learned by k models so that the ensemble can generate correct predictions. A form of this objective has been called “submodular generalized matchings” [1], where it was used to associate peptides and spectra.

    3 Diverse Ensemble Evolution (DivE2): Algorithm

    3.1 Solving a Continuous-Combinatorial Optimization

Algorithm 1 SELECTLEARN$(k, p, \lambda, \gamma, \{w^0_i\}_{i=1}^m)$
 1: Input: $\{v_j\}_{j=1}^n$, $\{\ell(\cdot; w_i)\}_{i=1}^m$, $\pi(\cdot; \eta)$
 2: Output: $\{w_i\}_{i=1}^m$
 3: Initialize: $w_i \leftarrow w^0_i\ \forall i \in [m]$, $t = 0$
 4: while not “converged” do
 5:     $W \leftarrow \{w^t_i\}_{i=1}^m$, define $G(\cdot, W)$ by W;
 6:     $\hat{A} \leftarrow \mathrm{SUBMODULARMAX}(G(\cdot, W), k, p)$;
 7:     if $G(\hat{A}, W) > G(A, W)$ then
 8:         $A \leftarrow \hat{A}$;
 9:     end if
10:     for $i \in \{1, \cdots, m\}$ do
11:         $-\nabla \hat{H}(w^t_i) \leftarrow \frac{\partial}{\partial w^t_i} \sum_{v_j \in V(A \cap \delta(u_i))} \ell(v_j; w^t_i)$;
12:         $w^{t+1}_i \leftarrow w^t_i + \pi\big(\{w^\tau_i, -\nabla \hat{H}(w^\tau_i)\}_{\tau \in [1,t]};\ \eta_t\big)$;
13:     end for
14:     $t \leftarrow t + 1$;
15: end while

Eq. (3) is a hybrid optimization involving both a continuous variable W and a discrete variable A. It reduces to maximization of a piecewise continuous function $H(W) \triangleq \max_{A \subseteq E,\, A \in \mathcal{I}_V \cap \mathcal{I}_U} G(A, W)$, with each piece defined by a fixed A achieving the maximum of G(A, W) in a local region of W. Suppose that A is fixed; then maximizing G(A, W) (or H(W)) consists of m independent continuous minimization problems, i.e., $\min_{w_i} \sum_{v_j \in V(A \cap \delta(u_i))} \ell(v_j; w_i)$, $\forall i \in [m]$. Here $V(A) \subseteq V$ denotes the samples incident to the set of edges $A \subseteq E$, so $V(A \cap \delta(u_i))$ is the subset of samples assigned to model $u_i$. When the loss $\ell(\cdot; w_i)$ is convex w.r.t. $w_i$ for every i, a globally optimal solution to the above continuous minimization can be obtained by various off-the-shelf algorithms. When $\ell(\cdot; w_i)$ is non-convex, e.g., when each model is a deep neural network, there also exist many practical and provable algorithms that can achieve a locally optimal solution, say, by backpropagation.

Suppose we fix W; then maximizing G(A, W) reduces to the data assignment problem (a generalized bipartite matching problem [43]; see Sec. 5.3 of the Appendix [71] for more details), and the optimal A defines one piece of H(W) in the vicinity of W. Finding the optimal assignment is NP-hard, since $G(\cdot, W)$ is a submodular function (a weighted sum of a modular and two submodular functions) and we wish to maximize it over a feasibility constraint consisting of the intersection of two partition matroids ($\mathcal{I}_V \cap \mathcal{I}_U$). Thanks to submodularity, fast approximate algorithms [51, 48, 49] exist that find a good-quality approximately optimal solution. Let $\hat{H}(W)$ denote the piecewise continuous function achieved when the discrete problem is solved approximately using submodular optimization; then we have $\hat{H}(W) \geq \alpha \cdot H(W)$ for every W, where $\alpha \in [0, 1]$ is the approximation factor. Therefore, solving the max-max problem in Eq. (3) requires interaction between a combinatorial (specifically, submodular) optimizer and a continuous (convex or non-convex) optimizer $\pi(\cdot; \eta)$.² We alternate between the two optimizations while keeping the objective G(A, W) non-decreasing.

² The optimizer $\pi(\cdot; \eta)$ can be any gradient descent method, e.g., SGD, momentum methods, Nesterov’s accelerated gradient [52], Adagrad [18], Adam [36], etc. Here the first argument ($\cdot$) can include any historical solutions and gradients, and $\eta$ is a learning rate schedule (i.e., the learning rate is $\eta_t$ for iteration t).


Intuitively, we utilize the discrete optimization to select a better piece of $\hat{H}(W)$, and then apply the continuous optimization to find a better solution on that piece.

Details are given in Algorithm 1. In each iteration, we compute an approximate solution $\hat{A} \subseteq E$ using submodular maximization SUBMODULARMAX (line 6); in lines 7-9 we compare $\hat{A}$ with the old A on $G(\cdot, W)$ and choose the better one; lines 10-13 run an optimizer $\pi(\cdot; \eta)$ to update each model $w_i$ according to its assigned data. Algorithm 1 always generates a non-decreasing sequence of objective values (assuming $\pi(\cdot; \eta)$ does the same using, say, a line search). With a damped learning rate, only small adjustments get applied to W and $G(\cdot, W)$. Thus, after a certain point the combinatorial part repeatedly selects the same A (and the test in line 7 eventually is always false), so the algorithm then converges as the continuous optimizer converges.³

³ Convergence is defined as the gradient $\nabla \hat{H}(W)$ w.r.t. W being zero. In practice, we use $\|\nabla \hat{H}(W)\| \leq \epsilon$ for a small $\epsilon$.
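As a concrete, much-simplified illustration of Algorithm 1's alternation, the toy sketch below pairs a greedy assignment step with one gradient step per model, for an ensemble of linear least-squares models under the purely modular reward ($\lambda = \gamma = 0$). It is our own sketch under those assumptions, not the paper's implementation; plain gradient descent stands in for the generic optimizer $\pi(\cdot; \eta)$, and the data are random placeholders.

```python
import numpy as np

def select_learn_toy(X, y, m=3, k=1, p=None, steps=10, lr=0.1, seed=0):
    """Toy SELECTLEARN: alternate (i) greedy assignment of samples to models under
    the per-sample (k) and per-model (p) budgets, and (ii) a gradient step for each
    linear model w_i on its assigned samples, using squared loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    p = p if p is not None else int(np.ceil(n * k / m))
    W = rng.normal(scale=0.1, size=(m, d))          # one weight vector per model
    for t in range(steps):
        # --- combinatorial step: assign samples to models by smallest current loss ---
        loss = (X @ W.T - y[:, None]) ** 2          # (n, m) per-sample, per-model loss
        order = np.argsort(loss, axis=None)         # edges by increasing loss
        deg_v, deg_u = np.zeros(n, int), np.zeros(m, int)
        assigned = [[] for _ in range(m)]
        for e in order:
            j, i = divmod(int(e), m)
            if deg_v[j] < k and deg_u[i] < p:
                assigned[i].append(j)
                deg_v[j] += 1
                deg_u[i] += 1
        # --- continuous step: one gradient-descent update per model on its subset ---
        for i in range(m):
            idx = np.array(assigned[i])
            if idx.size == 0:
                continue
            resid = X[idx] @ W[i] - y[idx]
            grad = 2 * X[idx].T @ resid / idx.size  # gradient of mean squared loss
            W[i] -= lr * grad
    return W

# Toy usage on random regression data.
rng = np.random.default_rng(1)
X, y = rng.normal(size=(60, 5)), rng.normal(size=60)
print(select_learn_toy(X, y).shape)   # (3, 5): three specialized linear models
```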

    3.2 Theoretical Perspectives

An interesting viewpoint of the max-max problem in Eq. (3) is its analogy to the K-means problem [46]. Eq. (3) strictly generalizes the K-means objective: set $\lambda = \gamma = 0$ and $k = 1$, take m to be the number of desired clusters, let the loss be the distance metric used in K-means (e.g., the L2 distance), and let each model be a real-valued vector of the same dimension as x. Since the K-means problem is NP-hard, our objective is also NP-hard.
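To illustrate the K-means analogy, the small check below (our own, on random placeholder data) confirms that the assignment step of Eq. (3) with $\lambda = \gamma = 0$, $k = 1$, an uncapped p, squared L2 loss, and vector-valued “models” reduces to the standard nearest-centroid assignment of K-means. The constant `beta` is a placeholder reward offset.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # samples
C = rng.normal(size=(4, 2))              # m = 4 "models" = candidate centroids

# Loss of sample j on model i: squared L2 distance to centroid i.
loss = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)     # shape (n, m)
beta = loss.max() + 1.0                                   # constant larger than any loss

# Eq. (3) assignment with lambda = gamma = 0, k = 1 and no per-model cap:
# each sample keeps the single edge with the largest reward beta - loss ...
dive2_assign = (beta - loss).argmax(axis=1)

# ... which is exactly the K-means (nearest-centroid) assignment step.
kmeans_assign = np.array([np.argmin(((x - C) ** 2).sum(-1)) for x in X])
assert np.array_equal(dive2_assign, kmeans_assign)
print("assignments match for all", len(X), "samples")
```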

We next analyze conditions for either of the constraints ($\mathcal{I}_V$, $\mathcal{I}_U$) introduced in Section 2.1 to saturate. In the two extreme cases, we know that the “sample selecting model” constraint $\mathcal{I}_V$ saturates when $nk \ll mp$ (e.g., $k = 1$ and $p = n$), and the “model selecting sample” constraint $\mathcal{I}_U$ saturates when $nk \gg mp$ (e.g., $k = m$ and $p = 1$). However, it is not clear what exactly happens between these extremes. The following lemma shows the precise saturation conditions of the two constraints, with proof details in Section 5.1 of the Appendix [71].

Lemma 1. If SUBMODULARMAX is the greedy algorithm or one of its variants, the data assignment $\hat{A}$ produced by lines 6-9 in Algorithm 1 fulfills: 1) $\mathcal{I}_V$ saturates, i.e., $|\hat{A} \cap \delta(v)| = k, \forall v \in V$, and $|\hat{A}| = nk$, if $k < \frac{mp + p}{n + (p - 1)}$; 2) $\mathcal{I}_U$ saturates, i.e., $|\hat{A} \cap \delta(u)| = p, \forall u \in U$, and $|\hat{A}| = mp$, if $k > \frac{mp - p}{n - (p - 1)}$; 3) when $\frac{mp + p}{n + (p - 1)} \leq k \leq \frac{mp - p}{n - (p - 1)}$, we have $|\hat{A}| \geq \min\{(k - 1) + (m - k + 1)p,\ (p - 1) + (n - p + 1)k\}$.

As stated above, we can think of the objective H(W) as a piecewise function, where each piece is associated with a solution to the discrete optimization problem. Since it is NP-hard to optimize the discrete problem, Algorithm 1 optimizes W on $\hat{H}(W)$, which is defined by the SUBMODULARMAX solutions, rather than on H(W). Algorithm 1 has the following properties.

Proposition 1. Algorithm 1: (1) generates a monotonically non-decreasing sequence of objective values G(A, W) (assuming $\pi(\cdot; \eta)$ does the same); (2) converges to a stationary point of $\hat{H}(W)$; and (3) for any loss $\ell(v; w)$ that is $\sigma$-strongly convex w.r.t. w, if SUBMODULARMAX has approximation factor $\alpha$, it converges to a local optimum $\hat{W} \in \arg\max_{W \in \mathcal{K}} \hat{H}(W)$ (i.e., $\hat{W}$ is optimal in a local area $\mathcal{K}$) such that for any local optimum $W^*_{\mathrm{loc}} \in \mathcal{K}$ of the true objective H(W), we have

$$
\hat{H}(\hat{W}) \;\geq\; \alpha H(W^*_{\mathrm{loc}}) + \frac{\sigma}{2} \cdot \min\{(k-1) + (m-k+1)p,\; (p-1) + (n-p+1)k\} \cdot \|\hat{W} - W^*_{\mathrm{loc}}\|_2^2. \qquad (4)
$$

The proof is in Section 5.2 of the Appendix [71]. The result in Eq. (4) implies that in any local area $\mathcal{K}$, even if $\hat{W}$ is not close to $W^*_{\mathrm{loc}}$ (i.e., $\|\hat{W} - W^*_{\mathrm{loc}}\|_2$ is large), the algorithm can still achieve an objective $\hat{H}(\hat{W})$ close to $H(W^*_{\mathrm{loc}})$, which is a good approximate solution from the perspective of maximizing G(A, W). Section 5.3 of the Appendix [71] shows that the approximation factor for the greedy algorithm is $\alpha = 1/(2 + \kappa_G)$, where $\kappa_G$ is the curvature of $G(\cdot, W)$. When the weights $\lambda$ and $\gamma$ are small, $\kappa_G$ decreases and $G(\cdot, W)$ becomes more modular. Therefore, the approximation ratio $\alpha$ increases and the lower bound in Eq. (4) improves. For general non-convex losses and models (e.g., DNNs), Eq. (4) degenerates to the weaker bound $\hat{H}(\hat{W}) \geq \alpha H(W^*_{\mathrm{loc}})$.

3.3 Ensemble Evolution: Curricula for Diverse Ensembles with Complementary Expertise

For a model ensemble to produce correct predictions, we require only that every sample be learnt by a few (small k) models. Optimizing Eq. (3) with a small k from the beginning, however, might be harmful: the models are randomly initialized, and using the losses of such early-stage models as the edge weights together with a small k could lead to arbitrary samples being associated with, and subsequently locked to, models. We would, instead, rather have a larger k and more use of the diversity terms at the beginning. To address this, we design an ensemble curriculum to guide the training process and to gradually approach our ultimate goal.

Algorithm 2 Diverse Ensemble Evolution (DIVE2)
 1: Input: $\{(x_j, y_j)\}_{j=1}^n$, $\{w^0_i\}_{i=1}^m$, $\pi(\cdot; \eta)$, $\mu$, $\Delta k$, $\Delta p$, $T$
 2: Output: $\{w^t_i\}_{i=1}^m$
 3: Initialize: $k \leftarrow m$, $p \geq 1$ s.t. $mp \leq nk$, $\lambda \in [0, 1]$, $\gamma \in [0, 1]$
 4: for $t \in \{1, \cdots, T\}$ do
 5:     $\{w^t_i\}_{i=1}^m \leftarrow \mathrm{SELECTLEARN}(k, p, \lambda, \gamma, \{w^{t-1}_i\}_{i=1}^m)$;
 6:     $\lambda \leftarrow (1 - \mu) \cdot \lambda$, $\gamma \leftarrow (1 - \mu) \cdot \gamma$;
 7:     $k \leftarrow \max\{\lceil k - \Delta k \rceil, 1\}$, $p \leftarrow \min\{\lfloor p + \Delta p \rfloor, n\}$;
 8: end for

In Section 2.1, we discussed two extreme training regimes ($mp < nk$ and $mp > nk$). In the first regime, there are plenty of samples to go around but models may only choose a limited set of samples, so this encourages different models to improve on samples they are already good at. In the first regime, however, inter-model diversity is important to encourage models to become sufficiently different from each other. Intra-model diversity is also important in the first regime, since it discourages models from being trained on entirely redundant data. In the second regime, there are plenty of models to go around but samples may choose only a limited number of models. Each model is then given a set of samples that it is particularly good at, and further training yields further specialization. Since all samples are assigned models, this leads to complementary proficiencies covering the data space.

These observations suggest that we start in the first regime ($mp \leq nk$) with small p and large k, and gradually switch to the second regime ($mp \geq nk$) by slowly increasing p and decreasing k. In earlier stages, we should also set the diversity weights $\lambda$ and $\gamma$ to be large, and then slowly reduce them as we move towards the second regime. It is worth noting that, besides intra-model diversity regularization, increasing p also helps to expand the expertise of each model, since it encourages each model to select more diverse samples. Decreasing k also helps to encourage inter-model diversity, since it allows each sample to be shared by fewer models.

In later stages, the solution of Algorithm 1 becomes more exact. With $\lambda$ and $\gamma$ decreasing, according to Lemma 2 the curvature $\kappa_G$ of $G(\cdot, W)$ approaches 0, the approximation factor $\alpha = 1/(2 + \kappa_G)$ of the greedy algorithm increases, and the approximate objective $\hat{H}(W) \geq \alpha H(W)$ becomes closer to the true objective H(W). Moreover, during later stages, when the “sample selecting model” constraint ($\mathcal{I}_V$) dominates and $\lambda$ and $\gamma$ are almost 0, the inner modular maximization is solved exactly ($\alpha = 1$) and the greedy algorithm degenerates to simple sorting.

The detailed diverse ensemble evolution (DivE2) procedure is shown in Algorithm 2. The curriculum is composed of T stages. Each stage uses SELECTLEARN (Algorithm 1) to (approximately) solve a continuous-combinatorial optimization of the form of Eq. (3) with pre-specified values of $(k, p, \lambda, \gamma)$ and the initialization $\{w^{t-1}_i\}_{i=1}^m$ from the previous episode as a warm start (line 5). The procedure reduces $\lambda$ and $\gamma$ by a multiplicative factor $(1 - \mu)$ in line 6, and linearly decreases k by $\Delta k$ and increases p by $\Delta p$ in line 7. Both k and p are restricted to be integers within their legal ranges, i.e., $k \in [1, m]$ and $p \in [1, n]$. The warm-start initialization is similar in spirit to continuation schemes used in previous curriculum learning (CL) [6, 5, 4, 35, 2, 60, 70] and SPL [40, 64, 62, 65] work, to avoid getting trapped in local minima and to stabilize optimization. As consecutive problems have the same form with similar parameters $(k, p, \lambda, \gamma)$, the solution to the previous problem might still evaluate well on the next one. Hence, instead of running lines 5-14 of Algorithm 1 until full convergence (as instructed by line 4), we run them for 10 iterations to reduce running time.
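The curriculum bookkeeping of Algorithm 2 (lines 6-7) can be written down directly; below is a sketch of the schedule alone, decoupled from training, using the settings reported later in Section 4 (k: 6 to 1, p: n/2m to 3n/m over T = 200 stages). The initial weights `lam0`, `gam0` and the decay rate `mu` are placeholders of our choosing, not values from the paper.

```python
import math

def dive2_schedule(n, m, T=200, k0=6, kT=1, p0=None, pT=None,
                   lam0=1.0, gam0=1.0, mu=0.02):
    """Yield per-stage hyperparameters (k, p, lam, gam) of a DivE2-style curriculum:
    lam and gam decay multiplicatively by (1 - mu); k and p change linearly."""
    p0 = p0 if p0 is not None else n / (2 * m)       # start: small p, large k
    pT = pT if pT is not None else 3 * n / m         # end:   large p, small k
    dk = (k0 - kT) / (T - 1)                         # linear decrement for k
    dp = (pT - p0) / (T - 1)                         # linear increment for p
    k, p, lam, gam = float(k0), float(p0), lam0, gam0
    for t in range(T):
        yield max(math.ceil(k), 1), min(math.floor(p), n), lam, gam
        lam, gam = (1 - mu) * lam, (1 - mu) * gam    # shrink the diversity weights
        k = max(k - dk, kT)                          # move towards "sample selecting model"
        p = min(p + dp, pT)

# Toy usage: print a few stages of the schedule for n = 50000 samples, m = 10 models.
for t, (k, p, lam, gam) in enumerate(dive2_schedule(n=50000, m=10)):
    if t in (0, 100, 199):
        print(f"stage {t:3d}: k={k}, p={p}, lambda={lam:.3f}, gamma={gam:.3f}")
```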

    4 Experiments

We apply three different ensemble training methods to train ensembles of neural networks with different structures on four datasets, namely: (1) MobileNetV2 [56] on CIFAR10 [38]; (2) ResNet18 [29] on CIFAR100 [38]; (3) CNNs with two convolutional layers⁴ on Fashion-MNIST (“Fashion” in all tables) [69]; and (4) CNNs with six convolutional layers on STL10 [12]⁵. The three training methods include DivE2 and two widely used approaches as baselines, which are

• Bagging (BAG) [7]: sample a new training set of the same size as the original one (with replacement) for each model, and train the model for several epochs on the sampled training set.

• RandINIT (RND): randomly initialize the weights of each model, and train it for several epochs on the whole training set.

⁴ A variant of LeNet5 with 64 kernels in each convolutional layer.
⁵ The network structure is from https://github.com/aaron-xichen/pytorch-playground.

Details can be found in Table 3 of the Appendix [71]. We fix the number of models at $m = 10$ throughout, and use $\ell_2$ parameter regularization on w with weight $1 \times 10^{-4}$. In DivE2's training phase, we start from $k = 6$, $p = n/2m$ and linearly change to $k = 1$, $p = 3n/m$ over $T = 200$ episodes. We employ the “facility location” submodular function [14, 45] for both the intra- and inter-model diversity, i.e., $F(A) = \sum_{v' \in V} \max_{v \in V(A)} \omega_{v,v'}$, where $\omega_{v,v'}$ represents the similarity between samples v and v'. We utilize a Gaussian kernel for similarity based on neural-net features $z(v)$ for each v, i.e., $\omega_{v,v'} = \exp\big(-\|z(v) - z(v')\|^2 / 2\sigma^2\big)$, where $\sigma$ is the mean value of all the $n(n-1)/2$ pairwise distances. For every dataset, we train a neural network on a small random subset of the training data (e.g., hundreds of samples) for one epoch, and use the inputs to the last fully connected layer as the features z. These features are also used in the Top-k DCS-KNN approach (below) to compute the pairwise $\ell_2$ distances that determine the K nearest neighbors.
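The facility location diversity term and the Gaussian-kernel similarity described above can be computed directly from the features z; here is a small sketch (our own, with random features standing in for the network activations):

```python
import numpy as np

def gaussian_similarity(Z):
    """omega[v, v'] = exp(-||z(v) - z(v')||^2 / (2 sigma^2)),
    with sigma set to the mean of all n(n-1)/2 pairwise distances."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)      # squared pairwise distances
    dist = np.sqrt(d2)
    sigma = dist[np.triu_indices_from(dist, k=1)].mean()     # mean pairwise distance
    return np.exp(-d2 / (2 * sigma ** 2))

def facility_location(selected, omega):
    """F(A) = sum_{v'} max_{v in A} omega[v, v']: how well the selected samples
    'cover' every sample in the ground set under the similarity omega."""
    if len(selected) == 0:
        return 0.0
    return omega[list(selected), :].max(axis=0).sum()

# Toy usage with random 16-dimensional features for 200 samples.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 16))
omega = gaussian_similarity(Z)
A = [0, 10, 50, 120]                      # indices of samples assigned to one model
print(round(facility_location(A, omega), 3))
# A sample similar to ones already selected adds only a small marginal gain
# (diminishing returns), which is the submodularity the method exploits.
print(round(facility_location(A + [1], omega) - facility_location(A, omega), 3))
```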

    4.1 Aggregation Methods using an Ensemble of Models

Figure 2: Comparison of DivE2 with Bagging (upper row) and RandINIT (lower row) in terms of test accuracy (%) vs. number of training batches on CIFAR10, CIFAR100, Fashion-MNIST, and STL10, with m = 10 and k = 3.

For ensemble model aggregation, when applying a trained ensemble of models to new samples, we must determine (1) which models to use, and (2) how to aggregate their outputs. Here we mainly discuss the first point, i.e., different model selection methods, because the aggregation we employ is either an even or a weighted average of the selected model outputs. Static model selection methods [72, 10, 53] choose a subset of models from the ensemble and apply it to all samples. By contrast, dynamic classifier selection (DCS) [11, 47, 73, 16] selects a different subset of models to be aggregated for each sample. KNN-based DCS [68, 37] is a widely used method that usually achieves better performance than other DCS and static methods. During training, DivE2 assigns different subsets of samples to different models, so for aggregation we may benefit more from using sample-specific model selection methods. Therefore, we focus on DCS-type methods, in particular the following:

• Top-k Oracle: average the outputs (e.g., logits before applying softmax) of the top-k models with the smallest loss on the given sample. It requires knowing the true label, and thus is a cheating method that cannot be applied in practice. However, it provides a useful upper bound on the other methods that select k models for aggregation.

• All Average: evenly average the outputs of all m models.

• Random-k Average: randomly select k models and average their outputs.

• Top-k Confidence: select the top-k models with the highest confidence (i.e., the highest probability of the predicted class) on the given sample, and average their outputs (see the sketch after this list).

• Top-k DCS-KNN: apply a KNN-based DCS method, i.e., find the K nearest neighbors of the given sample among the training data, select the top-k models assigned to those K nearest neighbors by Top-k Oracle, and average their outputs.

• Top-k NN-LossPredict: train an L2-regression neural net with m outputs to predict the per-sample losses on the m models, using a training set composed of all training samples and their losses on the trained models. For aggregation, select the top-k models with the smallest predicted losses on the given sample, and average their outputs.
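As referenced in the list above, here is a minimal sketch of the Top-k Confidence aggregation rule (our own, with random logits standing in for the m models' outputs): for each sample, keep the k models whose top-class probability is highest and average their logits.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def top_k_confidence_aggregate(logits, k):
    """logits: (num_models, num_samples, num_classes) raw model outputs.
    For each sample, select the k models with the highest confidence
    (max class probability) and average their logits."""
    probs = softmax(logits, axis=-1)
    confidence = probs.max(axis=-1)                       # (m, n) top-class probability
    top_k = np.argsort(-confidence, axis=0)[:k]           # (k, n) most confident models
    n = logits.shape[1]
    picked = logits[top_k, np.arange(n)[None, :], :]      # (k, n, num_classes)
    return picked.mean(axis=0)                            # averaged logits per sample

# Toy usage: m = 10 models, 5 samples, 3 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 5, 3))
agg = top_k_confidence_aggregate(logits, k=3)
print(agg.shape, softmax(agg).argmax(-1))                 # (5, 3) and predicted classes
```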

We compare the three training methods combined with the aforementioned aggregation methods for different values of k.⁶ We summarize the highest test-set accuracy for k = 3 in Table 2, and show how the test accuracy improves as training proceeds (i.e., as the total number of training batches over all models increases) in Fig. 2. In Fig. 2, solid curves denote DivE2, while dashed curves denote the baseline training methods. Different colors refer to different aggregation methods, and gray curves represent single-model performance (gray solid curves denote models trained by DivE2, while gray dashed curves denote models trained by the other baselines). Similar results for k = 5 and k = 7 can be found in the Appendix [71]. In addition, we also tested DivE2 without the “model selecting sample” constraint and without any diversity terms, which is equivalent to [41, 26, 42] in the multi-class case. It achieves a test accuracy of 90.11% (vs. 94.36% for DivE2) on CIFAR10 and 71.01% (vs. 78.89% for DivE2) on CIFAR100 when using Top-3 NN-LP for aggregation.

Table 1: Total time (secs.) of DivE2 and time spent only on SUBMODULARMAX.

Dataset           CIFAR10     CIFAR100    Fashion    STL10
Total time        26790.75s   34658.27s   2922.89s   4065.81s
SUBMODULARMAX     1857.36s    2697.36s    81.64s     378.84s

Table 2: The highest test accuracy (%) achieved by different combinations of ensemble training and aggregation methods on four datasets, with k = 3. DivE2 usually requires less training time than the others to achieve its highest accuracy. The best non-cheating test accuracy (i.e., not Top-k Oracle) in each column is marked with *.

Train:Aggregation             CIFAR10   CIFAR100   Fashion   STL10
BAG:Top-k Oracle (Cheat)      97.85     88.02      95.60     89.13
BAG:All Average               93.69     73.12      91.24     74.96
BAG:Random-k Avg.             93.05     72.86      91.00     74.03
BAG:Top-k Confidence          93.51     74.59      90.81     75.76
BAG:Top-k DCS-KNN             92.86     73.06      91.39     74.07
BAG:Top-k NN-L.P.             93.45     73.62      92.38     75.16
RND:Top-k Oracle (Cheat)      97.80     87.01      95.71     89.54
RND:All Average               93.28     75.71      91.13     77.13
RND:Random-k Avg.             93.11     75.56      90.77     76.75
RND:Top-k Confidence          93.51     76.54      91.07     77.93
RND:Top-k DCS-KNN             93.18     75.72      92.01     77.23
RND:Top-k NN-L.P.             93.69     76.69      92.48     77.28
DivE2:Top-k Oracle (Cheat)    98.01     90.12      96.40     90.18
DivE2:All Average             94.20     79.12*     86.16     78.95
DivE2:Random-k Avg.           93.26     77.69      82.75     78.59
DivE2:Top-k Confidence        94.05     78.76      92.10     79.38
DivE2:Top-k DCS-KNN           93.81     77.61      92.10     79.23
DivE2:Top-k NN-L.P.           94.36*    78.89      92.76*    80.49*

Top-k Oracle (cheating) is always the best and provides an upper bound. In addition, DivE2 usually has a higher upper bound than the others, and thus has more potential for future improvement. Solid curves (DivE2) are usually above the dashed curves (the baselines) in later stages, regardless of which aggregation method is used. Although diversity introduces more difficult samples and leads to slower convergence in early stages, it helps accelerate convergence in later stages. Although the test accuracy of single models trained by DivE2 is usually lower than that of models trained by the baselines, the test accuracy of the ensemble is better. This indicates that different models indeed develop different local expertise. Hence, each model performs well only in a local region and poorly elsewhere; however, their expertise is complementary, so the overall performance of the ensemble exceeds the baselines. We visualize the expertise of each model across different classes in Fig. 3 of the Appendix [71], using Fashion-MNIST as an example. Among all aggregation methods, Top-k NN-LossPredict and Top-k DCS-KNN show comparable or better performance than the other aggregation methods, but require much lower aggregation cost when k is small. As shown in the Appendix [71], when changing k from a minority (k = 3) to a majority (k = 7), the test accuracy of these two aggregation methods usually improves by a large margin. According to Table 1, DivE2 requires only a small amount of extra computation time for the data assignment. The model training dominates the computation but is highly parallelizable, since the updates of different models are independent.

Acknowledgments  This material is based upon work supported by the National Science Foundation under Grant No. IIS-1162606, the National Institutes of Health under award R01GM103544, and by a Google, a Microsoft, and an Intel research award. This research is also supported by the CONIX Research Center, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.

⁶ The k used in aggregation is fixed, and is different from the k used in training (which decreases from 6 to 1).


References

[1] Wenruo Bai, Jeffrey Bilmes, and William S. Noble. Bipartite matching generalizations for peptide identification in tandem mass spectrometry. In 7th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB), ACM SIGBio, Seattle, WA, October 2016.

[2] Sumit Basu and Janara Christensen. Teaching classification boundaries to humans. In AAAI, pages 109–115, 2013.

[3] Dhruv Batra, Payman Yadollahpour, Abner Guzman-Rivera, and Gregory Shakhnarovich. Diverse M-best solutions in Markov random fields. In ECCV, pages 1–16, 2012.

[4] Yoshua Bengio. Evolving Culture Versus Local Minima, pages 109–138. Springer Berlin Heidelberg, 2014.

[5] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

[6] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, pages 41–48, 2009.

[7] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

[8] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[9] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, pages 535–541, 2006.

[10] Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble selection from libraries of models. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, 2004.

[11] Paulo R. Cavalin, Robert Sabourin, and Ching Y. Suen. Dynamic selection approaches for multiple classifier systems. Neural Computing and Applications, 22(3):673–688, 2013.

[12] Adam Coates, Honglak Lee, and Andrew Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, pages 215–223, 2011.

[13] M. Conforti and G. Cornuéjols. Submodular set functions, matroids and the greedy algorithm: tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete Applied Mathematics, 7(3):251–274, 1984.

[14] G. Cornuéjols, M. Fisher, and G. L. Nemhauser. On the uncapacitated location problem. Annals of Discrete Mathematics, 1:163–177, 1977.

[15] Andrew Cotter, Mahdi Milani Fard, Seungil You, Maya Gupta, and Jeff Bilmes. Constrained interacting submodular groupings. In International Conference on Machine Learning (ICML), Stockholm, Sweden, July 2018.

[16] Rafael M. O. Cruz, Robert Sabourin, and George D. C. Cavalcanti. Dynamic classifier selection: Recent advances and perspectives. Information Fusion, 41:195–216, 2018.

[17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[18] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[19] B. Efron. Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1):1–26, 1979.


[20] Gamaleldin F. Elsayed, Shreya Shankar, Brian Cheung, Nicolas Papernot, Alex Kurakin, Ian J. Goodfellow, and Jascha Sohl-Dickstein. Adversarial examples that fool both human and computer vision. arXiv, 2018.

[21] M. L. Fisher, G. L. Nemhauser, and L. A. Wolsey. An analysis of approximations for maximizing submodular set functions-II. Mathematical Programming Studies, 8, 1978.

[22] Madalina Fiterau and Artur Dubrawski. Projection retrieval for classification. In Advances in Neural Information Processing Systems 25, pages 3023–3031, 2012.

[23] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[24] Satoru Fujishige. Submodular Functions and Optimization. Annals of Discrete Mathematics. Elsevier, 2005.

[25] Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. Near-optimal MAP inference for determinantal point processes. In NIPS, pages 2735–2743, 2012.

[26] Abner Guzmán-Rivera, Dhruv Batra, and Pushmeet Kohli. Multiple choice learning: Learning to produce multiple structured outputs. In Advances in Neural Information Processing Systems 25, pages 1799–1807, 2012.

[27] Shizhong Han, Zibo Meng, Ahmed-Shehab Khan, and Yan Tong. Incremental boosting convolutional neural network for facial action unit recognition. In Advances in Neural Information Processing Systems (NIPS), pages 109–117, 2016.

[28] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001, 1990.

[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[30] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.

[31] Tin Kam Ho. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, volume 1, pages 278–282, 1995.

[32] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot ensembles: Train 1, get M for free. In International Conference on Learning Representations (ICLR), 2017.

[33] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.

[34] Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.

[35] Faisal Khan, Xiaojin (Jerry) Zhu, and Bilge Mutlu. How do humans teach: On curriculum learning and teaching dimension. In NIPS, pages 1449–1457, 2011.

[36] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[37] Albert H. R. Ko, Robert Sabourin, and Alceu Souza Britto, Jr. From dynamic classifier selection to dynamic ensemble selection. Pattern Recognition, 41(5):1718–1731, 2008.

[38] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[39] Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross validation, and active learning. In Advances in Neural Information Processing Systems (NIPS), pages 231–238, 1995.


[40] M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In NIPS, pages 1189–1197, 2010.

[41] Stefan Lee, Senthil Purushwalkam, Michael Cogswell, David J. Crandall, and Dhruv Batra. Why M heads are better than one: Training a diverse ensemble of deep networks. arXiv, abs/1511.06314, 2015.

[42] Stefan Lee, Senthil Purushwalkam Shiva Prakash, Michael Cogswell, Viresh Ranjan, David Crandall, and Dhruv Batra. Stochastic multiple choice learning for training diverse deep ensembles. In Advances in Neural Information Processing Systems 29, pages 2119–2127, 2016.

[43] Hui Lin and Jeff Bilmes. Word alignment via submodular maximization over matroids. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 170–175. Association for Computational Linguistics, 2011.

[44] Hui Lin and Jeff A. Bilmes. A class of submodular functions for document summarization. In ACL, pages 510–520, 2011.

[45] Hui Lin, Jeff A. Bilmes, and Shasha Xie. Graph-based submodular selection for extractive summarization. In Proc. IEEE Automatic Speech Recognition and Understanding (ASRU), Merano, Italy, December 2009.

[46] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory (TIT), 28(2):129–137, 1982.

[47] Christopher J. Merz. Dynamical Selection of Learning Algorithms, pages 281–290. Springer New York, 1996.

[48] Michel Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques, volume 7 of Lecture Notes in Control and Information Sciences, chapter 27, pages 234–243. Springer Berlin Heidelberg, 1978.

[49] Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas Krause. Lazier than lazy greedy. In AAAI, pages 1812–1818, 2015.

[50] Mohammad Moghimi, Mohammad Saberian, Jian Yang, Li-Jia Li, Nuno Vasconcelos, and Serge Belongie. Boosted convolutional neural networks. In British Machine Vision Conference (BMVC), 2016.

[51] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions-I. Mathematical Programming, 14(1):265–294, 1978.

[52] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[53] Ioannis Partalas, Grigorios Tsoumakas, and Ioannis Vlahavas. Focused ensemble selection: A diversity-based method for greedy ensemble selection. In European Conference on Artificial Intelligence (ECAI), pages 117–121, 2008.

[54] Adarsh Prasad, Stefanie Jegelka, and Dhruv Batra. Submodular meets structured: Finding diverse subsets in exponentially-large structured item sets. In NIPS, pages 2645–2653, 2014.

[55] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.

[56] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. arXiv, 2018.

[57] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.

[58] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR), 2017.


[59] Saurabh Singh, Derek Hoiem, and David Forsyth. Swapout: Learning an ensemble of deep architectures. In Advances in Neural Information Processing Systems (NIPS), pages 28–36, 2016.

[60] Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. Baby Steps: How “Less is More” in unsupervised dependency parsing. In NIPS 2009 Workshop on Grammar Induction, Representation of Language and Language Learning, 2009.

[61] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15:1929–1958, 2014.

[62] James Steven Supancic III and Deva Ramanan. Self-paced learning for long-term tracking. In CVPR, pages 2379–2386, 2013.

[63] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.

[64] Kevin Tang, Vignesh Ramanathan, Li Fei-Fei, and Daphne Koller. Shifting weights: Adapting object detectors from image to video. In NIPS, pages 638–646, 2012.

[65] Ye Tang, Yu-Bin Yang, and Yang Gao. Self-paced dictionary learning for image classification. In MM, pages 833–836, 2012.

[66] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations (ICLR), 2018.

[67] Andreas Veit, Michael J. Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems (NIPS), pages 550–558, 2016.

[68] K. Woods, W. P. Kegelmeyer, and K. Bowyer. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):405–410, 1997.

[69] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms, 2017.

[70] Tianyi Zhou and Jeff Bilmes. Minimax curriculum learning: Machine teaching with desirable difficulties and scheduled diversity. In International Conference on Learning Representations (ICLR), 2018.

[71] Tianyi Zhou, Shengjie Wang, and Jeff Bilmes. Supplementary material for diverse ensemble evolution. In NIPS, 2018.

[72] Zhi-Hua Zhou, Jianxin Wu, and Wei Tang. Ensembling neural networks: Many could be better than all. Artificial Intelligence, 137(1):239–263, 2002.

[73] Xingquan Zhu, Xindong Wu, and Ying Yang. Dynamic classifier selection for effective mining from noisy data streams. In Data Mining, 2004. ICDM ’04. Fourth IEEE International Conference on, pages 305–312, 2004.
