arXiv:1206.5533v1 [cs.LG] 24 Jun 2012

Practical Recommendations for Gradient-Based Training of Deep Architectures

Yoshua Bengio

June 26, 2012

Abstract

Learning algorithms related to artificial neural networks and in particular for Deep Learning may seem to involve many bells and whistles, called hyper-parameters. This chapter is meant as a practical guide with recommendations for some of the most commonly used hyper-parameters, in particular in the context of learning algorithms based on back-propagated gradient and gradient-based optimization. It also discusses how to deal with the fact that more interesting results can be obtained when allowing one to adjust many hyper-parameters. Overall, it describes elements of the practice used to successfully and efficiently train and debug large-scale and often deep multi-layer neural networks. It closes with open questions about the training difficulties observed with deeper architectures.

1 Introduction

Following a decade of lower activity, research in artificial neural networks was revived after a 2006 breakthrough (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007) in the area of Deep Learning. See (Bengio, 2009) for a review. Many of the practical recommendations that justified the previous edition of this book are still valid, and new elements were added, while some survived longer by virtue of the practical advantages they provided. The panorama presented in this chapter regards some of these surviving or novel elements of practice, focusing on learning algorithms aiming at training deep neural networks, but leaving most of the material specific to the Boltzmann machine family to another chapter (Hinton, 2013).

Although such recommendations come out of a living practice that emerged from years of experimentation and to some extent mathematical justification, they should be challenged. They constitute a good starting point for the experimenter and user of learning algorithms but very often have not been formally validated, leaving open many questions that can be answered either by theoretical analysis or by solid comparative experimental work (ideally by both). A good indication of the need for such validation is that different researchers and research groups do not always agree on the practice of training neural networks.

Several of the recommendations presented here can be found implemented in the Deep Learning Tutorials 1 and in the related Pylearn2 library 2, all based on the Theano library (discussed below) written in the Python programming language.

The 2006 Deep Learning breakthrough centered on the use of unsupervised representation learning to help learning internal representations 3 by providing a local training signal at each level of a hierarchy of features 4.

1 http://deeplearning.net/tutorial/
2 http://deeplearning.net/software/pylearn2
3 A neural network computes a sequence of data transformations, each step encoding the raw input into an intermediate or internal representation, in principle to make the prediction or modeling task of interest easier.
4 In standard multi-layer neural networks trained using back-propagated gradients, the only signal that drives parameter updates is provided at the output of the network (and then propagated backwards). Some unsupervised learning algorithms provide a local source of guidance for the parameter update in each layer, based only on the inputs and outputs of that layer.


Unsupervised representation learning algorithms can be applied several times to learn different layers of a deep model. Several unsupervised representation learning algorithms have been proposed since then. Those covered in this chapter (such as auto-encoder variants) retain many of the properties of artificial multi-layer neural networks, relying on the back-propagation algorithm to estimate stochastic gradients. Deep Learning algorithms such as those based on the Boltzmann machine and those based on auto-encoder or sparse coding variants often include a supervised fine-tuning stage. This supervised fine-tuning as well as the gradient descent performed with auto-encoder variants also involves the back-propagation algorithm, just as when training deterministic feedforward or recurrent artificial neural networks. Hence this chapter also includes recommendations for training ordinary supervised deterministic neural networks or, more generally, most machine learning algorithms relying on iterative gradient-based optimization of a parametrized learner with respect to an explicit training criterion.

This chapter assumes that the reader already understands the standard algorithms for training supervised multi-layer neural networks, with the loss gradient computed thanks to the back-propagation algorithm (Rumelhart et al., 1986). It starts by explaining basic concepts behind Deep Learning and the greedy layer-wise pretraining strategy (Section 1.1), and modern unsupervised pre-training algorithms (denoising and contractive auto-encoders) that are closely related in the way they are trained to standard multi-layer neural networks (Section 1.2). It then reviews in Section 2 basic concepts in iterative gradient-based optimization and in particular the stochastic gradient method, gradient computation with a flow graph, automatic differentiation, and online learning. The main section of this chapter is Section 3, which explains hyper-parameters in general, their optimization, and specifically covers the main hyper-parameters of neural networks. Section 4 briefly describes simple ideas and methods to debug and visualize neural networks, while Section 5 covers parallelism, sparse high-dimensional inputs, symbolic inputs and embeddings, and multi-relational learning. The chapter closes (Section 6) with open questions on the difficulty of training deep architectures and improving the optimization methods for neural networks.

1.1 Deep Learning and Greedy Layer-Wise Pretraining

The notion of reuse, which explains the power of distributed representations (Bengio, 2009), is also at the heart of the theoretical advantages behind Deep Learning. The depth of a circuit is the length of the longest path from an input node of the circuit to an output node of the circuit. Formally, one can change the depth of a given circuit by changing the definition of what each node can compute, but only by a constant factor (Bengio, 2009). The typical computations we allow in each node include: weighted sum, product, artificial neuron model (such as a monotone non-linearity on top of an affine transformation), computation of a kernel, or logic gates. Theoretical results (Hastad, 1986; Hastad and Goldmann, 1991; Bengio et al., 2006b; Bengio and LeCun, 2007; Bengio and Delalleau, 2011) clearly identify families of functions where a deep representation can be exponentially more efficient than one that is insufficiently deep. If the same family of functions can be represented with fewer parameters (or more precisely with a smaller VC-dimension 5), learning theory would suggest that it can be learned with fewer examples, yielding improvements in both computational efficiency and statistical efficiency.

Another important motivation for feature learning and Deep Learning is that they can be done with unlabeled examples, so long as the factors (unobserved random variables explaining the data) relevant to the questions we will ask later (e.g. classes to be predicted) are somehow salient in the input distribution itself.

5 Note that in our experiments, deep architectures tend to generalize very well even when they have quite large numbers of parameters.


This is true under the manifold hypothesis, which states that natural classes and other high-level concepts in which humans are interested are associated with low-dimensional regions in input space (manifolds) near which the distribution concentrates, and that different class manifolds are well-separated by regions of very low density. It means that a small semantic change around a particular example can be captured by changing only a few numbers in a high-level abstract representation space. As a consequence, feature learning and Deep Learning are intimately related to principles of unsupervised learning, and they can be exploited in the semi-supervised setting (where only a few examples are labeled), as well as in the transfer learning and multi-task settings (where we aim to generalize to new classes or tasks). The underlying hypothesis is that many of the underlying factors are shared across classes or tasks. Since representation learning aims to extract and isolate these factors, representations can be shared across classes and tasks.

One of the most commonly used approaches for training deep neural networks is based on greedy layer-wise pre-training (Bengio et al., 2007). The idea, first introduced in Hinton et al. (2006), is to train one layer of a deep architecture at a time using unsupervised representation learning. Each level takes as input the representation learned at the previous level and learns a new representation. The learned representation(s) can then be used as input to predict variables of interest, for example to classify objects. After unsupervised pre-training, one can also perform supervised fine-tuning of the whole system 6, i.e., optimize not just the classifier but also the lower levels of the feature hierarchy with respect to some objective of interest. Combining unsupervised pre-training and supervised fine-tuning usually gives better generalization than pure supervised learning from a purely random initialization. The unsupervised representation learning algorithms for pre-training proposed in 2006 were the Restricted Boltzmann Machine or RBM (Hinton et al., 2006), the auto-encoder (Bengio et al., 2007) and a sparsifying form of auto-encoder similar to sparse coding (Ranzato et al., 2007).

6 The whole system composes the computation of the representation with the computation of the predictor's output.
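To make the layer-wise procedure concrete, here is a minimal sketch in Python. The helper names (fit_unsupervised_layer, transform, stack, fit_supervised) are illustrative assumptions, standing in for RBM or auto-encoder training and for supervised back-propagation; they are not part of any specific library.

    # Sketch of greedy layer-wise unsupervised pre-training (illustrative only).
    def greedy_layerwise_pretrain(X, layer_sizes, fit_unsupervised_layer):
        """Train one layer at a time; each level consumes the previous level's output."""
        layers, H = [], X              # H is the representation at the current level
        for n_hidden in layer_sizes:
            layer = fit_unsupervised_layer(H, n_hidden)   # e.g. an RBM or auto-encoder
            layers.append(layer)
            H = layer.transform(H)     # learned representation becomes the next input
        return layers

    # Possible usage (hypothetical helpers):
    #   layers = greedy_layerwise_pretrain(X_unlabeled, [1000, 1000, 1000], fit_autoencoder)
    #   model = fit_supervised(stack(layers), X_labeled, y_labeled)   # supervised fine-tuning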

1.2 Denoising and Contractive Auto-Encoders

An auto-encoder has two parts: an encoder function f that maps the input x to a representation h = f(x), and a decoder function g that maps h back in the space of x in order to reconstruct x. In the regular auto-encoder the reconstruction function r(x) = g(f(x)) is trained to minimize the average value of a reconstruction loss on the training examples (but reconstruction loss should be high for most other input configurations 7). Examples of reconstruction loss functions include $||x - r(x)||^2$ (for real-valued inputs) and $-\sum_i x_i \log r_i(x) + (1 - x_i) \log(1 - r_i(x))$ (when interpreting $x_i$ as a bit or a probability of a binary event). Auto-encoders capture the input distribution by learning to better reconstruct more likely input configurations. The difference between the reconstruction vector and the input vector can be shown to be related to the log-density gradient as estimated by the learner (Vincent, 2011) and the Jacobian matrix of the reconstruction with respect to the input gives information about the second derivative of the density, i.e., in which direction the density remains high when you are on a high-density manifold (Rifai et al., 2011). In the Denoising Auto-Encoder (DAE) and the Contractive Auto-Encoder (CAE), the training procedure also introduces robustness (insensitivity to small variations), respectively in the reconstruction r(x) or in the representation f(x). In the DAE (Vincent et al., 2008, 2010), this is achieved by training with stochastically corrupted inputs, but trying to reconstruct the uncorrupted inputs. In the CAE (Rifai et al., 2011), this is achieved by adding an explicit regularizing term in the training criterion, proportional to the norm of the Jacobian of the encoder, $||\frac{\partial f(x)}{\partial x}||^2$. But the CAE and the DAE are very related: when the noise is Gaussian and small, the denoising error minimized by the DAE is equivalent to minimizing the norm of the Jacobian of the reconstruction function $r(\cdot) = g(f(\cdot))$

7 Different mechanisms have been proposed to push reconstruction error up in low density areas: denoising criterion, contractive criterion, and code sparsity. It has been argued that such constraints play a role similar to the partition function for Boltzmann machines (Ranzato et al., 2008a).


(whereas the CAE minimizes the norm of the Jacobian of the encoder f(·)). Besides Gaussian noise, another interesting form of corruption has been very successful with DAEs: it is called the masking corruption and consists in randomly zeroing out a large fraction (like 20% or even 50%) of the inputs, where the zeroed out subset is randomly selected for each example. In addition to the contraction effect, it forces the learned encoder to be able to rely only on an arbitrary subset of the input features.

Another way to prevent the auto-encoder from perfectly reconstructing everywhere is to introduce a sparsity penalty on h, discussed below (Section 3.1).
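The masking corruption and the cross-entropy reconstruction loss described above can be sketched in a few lines of Python. The sigmoid encoder/decoder with tied weights used here is one common but not mandatory choice, and all names are illustrative:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def dae_minibatch_loss(X, W, b_enc, b_dec, corruption=0.3, rng=None):
        """Denoising auto-encoder loss on one mini-batch (sketch).

        Masking corruption: randomly zero out a fraction of each input, then
        reconstruct the *uncorrupted* input with the cross-entropy loss for
        inputs interpreted as bits/probabilities in [0, 1].
        """
        rng = rng or np.random.default_rng()
        mask = rng.binomial(1, 1.0 - corruption, size=X.shape)  # 0 marks a zeroed-out input
        H = sigmoid((X * mask) @ W + b_enc)                     # representation f(corrupted x)
        R = sigmoid(H @ W.T + b_dec)                            # reconstruction r(corrupted x)
        eps = 1e-12                                             # numerical safety only
        return -np.mean(np.sum(X * np.log(R + eps)
                               + (1 - X) * np.log(1 - R + eps), axis=1))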

2 Gradients

2.1 Gradient Descent and Learning Rate

Many learning algorithms rely on gradient-based numerical optimization of a training criterion. Let L(z, θ) be the loss incurred on example z when the parameter vector takes value θ. The gradient vector for the loss associated with a single example is $\frac{\partial L(z,\theta)}{\partial \theta}$, and it is used as the core part of the computation of parameter updates for gradient-based numerical optimization algorithms. For example, simple online (or stochastic) gradient descent (Robbins and Monro, 1951; Bottou and LeCun, 2004) updates the parameters after each example is seen according to

$\theta^{(t)} \leftarrow \theta^{(t-1)} - \epsilon_t \frac{\partial L(z_t, \theta)}{\partial \theta}$

where $z_t$ is an example sampled at iteration t and where $\epsilon_t$ is a hyper-parameter called the learning rate, whose choice is crucial. If the learning rate is too large 8, the average loss will increase. The optimal learning rate is usually close to the largest learning rate that does not cause divergence of the training criterion, an observation that can guide heuristics for setting the learning rate (Bengio, 2011), e.g., start with a large learning rate and if the training criterion diverges, try again with a learning rate 3 times smaller, etc., until no divergence is observed.

8 Above a value which is approximately 2 times the largest eigenvalue of the average loss Hessian matrix.

The learning rate can be kept constant (with no guarantee of convergence), or it can be decreased appropriately (e.g., typically in O(1/t)) to ensure convergence. A common heuristic we use is

$\epsilon_t = \frac{\epsilon_0 \tau}{\tau + t}$   (1)

where we now have two hyper-parameters (instead of one in the constant learning rate case): the initial learning rate $\epsilon_0$ and the time-constant of decay, τ. In practice, we use mini-batch updates based on the average of the gradients 9 inside each block of B examples:

$\theta^{(t)} \leftarrow \theta^{(t-1)} - \epsilon_t \frac{1}{B} \sum_{t'=Bt+1}^{B(t+1)} \frac{\partial L(z_{t'}, \theta)}{\partial \theta}$.   (2)

With B = 1 we are back to ordinary online gradient descent, while with B equal to the training set size, this is standard (also called ’batch’) gradient descent. With intermediate values of B there is generally a sweet spot. When B increases we can get faster computation by taking advantage of parallelism or efficient matrix-matrix multiplications (instead of separate matrix-vector multiplications), often gaining a factor of 2 in practice in overall training time. On the other hand, as B increases, the number of updates per computation done decreases, which slows down clock-time convergence because fewer updates can be done in the same computing time. Keep in mind that even the true gradient direction (averaging over the whole training set) is only the steepest descent direction locally, and may not point in the right direction when considering larger steps. In addition, because the training criterion is not quadratic in the parameters, as one moves in parameter space the optimal descent direction keeps changing. Because the gradient direction is not quite the right direction of descent, there is no point in spending a lot of computation to estimate it precisely for gradient descent.

9 Compared to a sum, an average makes a small change in B have only a small effect on the optimal learning rate, with an increase in B generally allowing a small increase in the learning rate because of the reduced variance of the gradient.


Instead, doing more updates more frequently helps to explore more and faster, especially with large learning rates. In addition, smaller values of B may benefit from more exploration in parameter space and a form of regularization, both due to the ’noise’ injected in the gradient estimator, which may explain the better test results sometimes observed with smaller B.

When the training set is finite, training proceeds by sweeps through the training set, called epochs, and full training usually requires many epochs (iterations through the training set). Note that stochastic gradient (either one example at a time or with mini-batches) is different from ordinary gradient descent, sometimes called ’batch gradient descent’, which corresponds to the case where B equals the training set size, i.e., there is one parameter update per epoch. The great advantage of stochastic gradient descent and other online or mini-batch update methods is that their convergence does not depend on the size of the training set, only on the number of updates and the richness of the training distribution. In the limit of a large or infinite training set, a batch method (which updates only after seeing all the examples) is hopeless. In fact, even for ordinary datasets of tens or hundreds of thousands of examples (or more!), stochastic gradient descent converges much faster than ordinary (batch) gradient descent, and beyond some dataset sizes the speed-up is almost linear (i.e., doubling the size almost doubles the gain). It is really important to use the stochastic version in order to get reasonable clock-time convergence speeds.

As for any stochastic gradient descent method (including the mini-batch case), it is important for efficiency of the estimator that each example or mini-batch be sampled approximately independently. Because random access to memory (or even worse, to disk) is expensive, a good approximation, called incremental gradient (Bertsekas, 2010), is to visit the examples (or mini-batches) in a fixed order corresponding to their order in memory or disk (repeating the examples in the same order on a second epoch, if not in the pure online case where each example is visited only once), provided the examples or mini-batches are first put in a random order (to make sure this is the case, it could be useful to first shuffle the examples). Faster convergence has been observed if the order in which the mini-batches are visited is shuffled before each epoch, which is efficient if the training set holds in computer memory.
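A minimal mini-batch SGD loop following Eqs. (1) and (2) might look as follows; grad_loss is an assumed callable returning the mini-batch average gradient, and the function names are illustrative rather than taken from any library:

    import numpy as np

    def sgd_train(theta, grad_loss, examples, epsilon0=0.01, tau=None, B=32,
                  n_epochs=10, seed=0):
        """Mini-batch stochastic gradient descent (sketch of Eqs. (1)-(2)).

        grad_loss(batch, theta) must return (1/B) * sum over the batch of
        dL(z, theta)/dtheta.  tau=None keeps the learning rate constant;
        otherwise epsilon_t = epsilon0 * tau / (tau + t) as in Eq. (1).
        """
        rng = np.random.default_rng(seed)
        idx = np.arange(len(examples))
        t = 0
        for _ in range(n_epochs):
            rng.shuffle(idx)                       # fresh random visiting order each epoch
            for start in range(0, len(idx), B):
                batch = [examples[i] for i in idx[start:start + B]]
                eps_t = epsilon0 if tau is None else epsilon0 * tau / (tau + t)
                theta = theta - eps_t * grad_loss(batch, theta)   # Eq. (2)
                t += 1
        return theta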

2.2 Gradient Computation and Automatic Differentiation

The gradient can be either computed manually or through automatic differentiation. Either way, it helps to structure this computation as a flow graph, in order to prevent mathematical mistakes and make sure an implementation is computationally efficient. The computation of the loss L(z, θ) as a function of θ can be laid out in a graph whose nodes correspond to elementary operations such as addition, multiplication, and non-linear operations such as the neural network's activation function (e.g., sigmoid or hyperbolic tangent), possibly at the level of vectors, matrices or tensors. The flow graph is directed and acyclic and has three types of nodes: input nodes, internal nodes, and output nodes. Each of its nodes is associated with a numerical output which is the result of the application of that computation (none in the case of input nodes), taking as input the output of previous nodes in a directed acyclic graph. z and θ (or their elements) are the input nodes of the graph (they do not have inputs themselves) and L(z, θ) is a scalar output of the graph. Note that here, in the supervised case, z can include an input part x (e.g. an image) and a target part y (e.g. a target class associated with an object in the image). In the unsupervised case z = x. In a semi-supervised case, there is a mix of labeled and unlabeled examples, and z includes y on the labeled examples but not on the unlabeled ones.

In addition to associating a numerical output $o_a$ to each node a of the flow graph, we can associate a gradient $g_a = \frac{\partial L(z,\theta)}{\partial o_a}$. The gradient will be defined and computed recursively in the graph, in the opposite direction of the computation of the nodes' outputs, i.e., whereas $o_a$ is computed using outputs $o_p$ of predecessor nodes p of a, $g_a$ will be computed using the gradients $g_s$ of successor nodes s of a. More precisely, the chain rule dictates

$g_a = \sum_s g_s \frac{\partial o_s}{\partial o_a}$


where the sum is over immediate successors of a. Only output nodes have no successor, and in particular for the output node that computes L, the gradient is set to 1 since $\frac{\partial L}{\partial L} = 1$, thus initializing the recursion. Manual or automatic differentiation then only requires to define the partial derivative associated with each type of operation performed by any node of the graph. When implementing gradient descent algorithms with manual differentiation the result tends to be verbose, brittle code that lacks modularity, all bad things in terms of software engineering. A better approach is to express the flow graph in terms of objects that modularize how to compute outputs from inputs as well as how to compute the partial derivatives necessary for gradient descent. One can pre-define the operations of these objects (in a “forward propagation” or fprop method) and their partial derivatives (in a “backward propagation” or bprop method) and encapsulate these computations in an object that knows how to compute its output given its inputs, and how to compute the gradient with respect to its inputs given the gradient with respect to its output. This is the strategy adopted in the Theano library 10 with its Op objects (Bergstra et al., 2010), as well as in libraries such as Torch 11 (Collobert et al., 2011b) and Lush 12.

Compared to Torch and Lush, Theano adds an interesting ingredient which makes it a full-fledged automatic differentiation tool: symbolic computation. The flow graph itself (without the numerical values attached) can be viewed as a symbolic representation (in a data structure) of a numerical computation. In Theano, the gradient computation is first performed symbolically, i.e., each Op object knows how to create other Ops corresponding to the computation of the partial derivatives associated with that Op. Hence the symbolic differentiation of the output of a flow graph with respect to any or all of its input nodes can be performed easily in most cases, yielding another flow graph which specifies how to compute these gradients, given the input of the original graph.

10 http://deeplearning.net/software/theano/
11 http://www.torch.ch
12 http://lush.sourceforge.net

Since the gradient graph typically contains the original graph (mapping parameters to loss) as a sub-graph, in order to make computations efficient it is important to automate (as done in Theano) a number of simplifications which are graph transformations preserving the semantics of the output (given the input) but yielding smaller (or more numerically stable or more efficiently computed) graphs (e.g., removing redundant computations). To take advantage of the fact that computing the loss gradient includes as a first step computing the loss itself, it is advantageous to structure the code so that both the loss and its gradient are computed at once, with a single graph having multiple outputs. The advantages of performing gradient computations symbolically are numerous. First of all, one can readily compute gradients over gradients, i.e., second derivatives, which are useful for some learning algorithms. Second, one can define algorithms or training criteria involving gradients themselves, as required for example in the Contractive Auto-Encoder (which uses the norm of a Jacobian matrix in its training criterion, i.e., really requires second derivatives, which here are cheap to compute). Third, it makes it easy to implement other useful graph transformations such as graph simplifications or numerical optimizations and transformations that help make the numerical results more robust and more efficient (such as working in the domain of logarithms of probabilities rather than in the domain of probabilities directly). Other potential beneficial applications of such symbolic manipulations include parallelization and additional differential operators (such as the R-operator, recently implemented in Theano, which is very useful to compute the product of a Jacobian matrix $\frac{\partial f(x)}{\partial x}$ or Hessian matrix $\frac{\partial^2 L(x,\theta)}{\partial \theta^2}$ with a vector without ever having to actually compute and store the Hessian matrix itself (Pearlmutter, 1994)).

2.3 Online Learning and Optimization of Generalization Error

The objective of learning is not to minimize training error or even the training criterion.


The latter is a surrogate for generalization error, i.e., performance on new (out-of-sample) examples, and there are no hard guarantees that minimizing the training criterion will yield good generalization error: it depends on the appropriateness of the parametrization and training criterion (with the corresponding prior they imply) for the task at hand.

Many learning tasks of interest will require huge quantities of data (most of which will be unlabeled) and as the number of examples increases, so long as capacity is limited (the number of parameters is small compared to the number of examples), training error and generalization error approach each other. In the regime of such large datasets, we can consider that the learner sees an unending stream of examples (e.g., think about a process that harvests text and images from the web and feeds them to a machine learning algorithm). In that context, it is most efficient to simply update the parameters of the model after each example or few examples, as they arrive. This is the ideal online learning scenario, and in a simplified setting, we can even consider each new example z as being sampled i.i.d. from an unknown generating distribution with probability density p(z). More realistically, examples in online learning do not arrive i.i.d. but instead from an unknown stochastic process which exhibits serial correlation and other temporal dependencies. However, if we consider the simplified case of i.i.d. data, there is an interesting observation to be made: the online learner is performing stochastic gradient descent on its generalization error. Indeed, the generalization error of a learner with parameters θ and loss function L is

$C = E[L(z, \theta)] = \int p(z) L(z, \theta)\, dz$

while the stochastic gradient from sample z is

$g = \frac{\partial L(z, \theta)}{\partial \theta}$

with z a random variable sampled from p. The gradient of the generalization error is

$\frac{\partial C}{\partial \theta} = \frac{\partial}{\partial \theta} \int p(z) L(z, \theta)\, dz = \int p(z) \frac{\partial L(z, \theta)}{\partial \theta}\, dz = E[g]$

showing that the online gradient g is an unbiased estimator of the generalization error gradient $\frac{\partial C}{\partial \theta}$. It means that online learners, when given a stream of non-repetitive training data, really optimize (maybe not in the optimal way, i.e., using a first-order gradient technique) what we really care about: generalization error.
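A tiny numerical check of this unbiasedness claim (an illustration, not from the chapter): take L(z, θ) = (z − θ)² with z drawn from N(μ, 1), so the generalization error C(θ) = E[(z − θ)²] has gradient ∂C/∂θ = 2(θ − μ), and the average of per-sample gradients 2(θ − z) should approach it.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, theta = 1.5, 0.0
    z = rng.normal(mu, 1.0, size=100_000)          # i.i.d. stream of examples
    online_estimate = np.mean(2.0 * (theta - z))   # average of per-sample gradients dL/dtheta
    exact_gradient = 2.0 * (theta - mu)            # gradient of the generalization error
    print(online_estimate, exact_gradient)         # both close to -3.0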

3 Hyper-Parameters

A pure learning algorithm can be seen as a function taking data as input and producing as output a function (e.g. a predictor) or model (i.e. a bunch of functions). However, in practice, many learning algorithms involve hyper-parameters, i.e., annoying buttons to be adjusted. In many algorithms such as Deep Learning algorithms the number of hyper-parameters (ten or more!) can make the idea of having to adjust all of them unappealing. In addition, it has been shown that the use of computer clusters for hyper-parameter selection can have an important effect on results (Pinto et al., 2009). Choosing hyper-parameter values is formally equivalent to the question of model selection, i.e., given a family or set of learning algorithms, how to pick the most appropriate one? We define a hyper-parameter for a learning algorithm A as a value to be selected prior to the actual application of A to the data, a value that is not directly selected by the learning algorithm itself. It is basically an outside control knob. It can be discrete (as in model selection) or continuous (such as the learning rate discussed above). Of course, one can hide these hyper-parameters by wrapping another learning algorithm, say B, around A, to select A's hyper-parameters (e.g. to minimize validation set error). We can then call B a hyper-learner, and if B has no hyper-parameters itself then the composition of B over A could be a “pure” learning algorithm, with no hyper-parameter. In the end, to apply a learner to data, one has to have a pure learning algorithm. The hyper-parameters can be fixed by hand or tuned by an algorithm, but they have to be selected. Some hyper-parameters can be selected based on the performance of A on its training data, but most cannot. For any hyper-parameter that has an impact on the effective capacity of a learner, it makes more sense to select its value based on out-of-sample data (outside the training set), e.g., a validation set performance, online error, or cross-validation error.


Note that some learning algorithms (in particular unsupervised learning algorithms such as algorithms for training RBMs by approximate maximum likelihood) are problematic in this respect because we cannot directly measure the quantity that is to be optimized (e.g. the likelihood): it is intractable. On the other hand, the expected denoising reconstruction error is easy to estimate (by just averaging the denoising error over a validation set).

Once some out-of-sample data has been used for selecting hyper-parameters, it cannot be used anymore to obtain an unbiased estimator of generalization performance, so one typically uses a test set (or double cross-validation 13, in the case of small datasets) to estimate the generalization error of the pure learning algorithm (with hyper-parameter selection hidden inside).
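The idea of wrapping a learner A with a hyper-learner B that selects A's hyper-parameters on out-of-sample data can be sketched as follows; train_A, valid_error and the candidate list are assumed interfaces, not part of any particular library:

    def hyper_learner_B(train_A, valid_error, candidates, train_data, valid_data):
        """Select A's hyper-parameters by minimizing validation error (sketch).

        train_A(hyperparams, data) returns a trained predictor;
        valid_error(predictor, data) estimates out-of-sample error.
        """
        best_err, best_predictor = float("inf"), None
        for hp in candidates:                        # e.g. a grid or random draws
            predictor = train_A(hp, train_data)      # apply A with these hyper-parameters
            err = valid_error(predictor, valid_data)
            if err < best_err:
                best_err, best_predictor = err, predictor
        return best_predictor                        # B composed with A acts as a "pure" learner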

3.1 Neural Network Hyper-Parameters

Different learning algorithms involve different sets of hyper-parameters, and it is useful to get a sense of the kinds of choices that practitioners have to make; we focus here mostly on those relevant to neural networks and Deep Learning algorithms.

3.1.1 Hyper-Parameters of the Approximate Optimization

First of all, several learning algorithms can be viewed as the combination of a training criterion and a model on the one hand, and of a particular procedure for approximately optimizing this criterion on the other hand. Correspondingly, one should distinguish hyper-parameters associated with the optimizer from hyper-parameters associated with the model itself, i.e., typically the function class, regularizer and loss function.

13 Double cross-validation applies recursively the idea of cross-validation, using an outer loop cross-validation to evaluate generalization error and then applying an inner loop cross-validation inside each outer loop split's training subset (i.e., splitting it again into training and validation folds) in order to select hyper-parameters for that split.

We have already mentioned above some of the hyper-parameters typically associated with stochastic gradient descent with mini-batches. Here is a more extensive descriptive list, focusing on those used in stochastic (mini-batch) gradient descent (although the number of training iterations is relevant to all iterative optimization algorithms).

• The initial learning rate ($\epsilon_0$ above, Eq. (1)). This is often the single most important hyper-parameter and one should always make sure that it has been tuned (up to approximately a factor of 2). Typical values for a neural network with standardized inputs (or inputs mapped to the (0,1) interval) are less than 1 and greater than $10^{-6}$, but these should not be taken as strict ranges and greatly depend on the parametrization of the model. A default value of 0.01 typically works for standard multi-layer neural networks but it would be foolish to rely exclusively on this default value. If there is only time to optimize one hyper-parameter and one uses stochastic gradient descent, then this is the hyper-parameter that is worth tuning.

• The hyper-parameters of the learning rate schedule (such as the time constant τ in Eq. (1)). The default value of τ → ∞ means that the learning rate is constant over training iterations. In many cases the benefit of choosing other than this default value is small. An alternative to Eq. (1) used in Bergstra and Bengio (2012) is

$\epsilon_t = \frac{\epsilon_0 \tau}{\max(t, \tau)}$

which keeps the learning rate constant for the first τ steps and then decreases it in $O(1/t^{\alpha})$, with traditional recommendations (based on asymptotic analysis of the convex case) suggesting α = 1. See Bach and Moulines (2011) for a recent analysis of the rate of convergence for the general case of α ≤ 1, suggesting that smaller values of α should be used in the non-convex case, especially when using a gradient averaging or momentum technique (see below).


An adaptive and heuristic way of automatically setting τ above is to keep $\epsilon_t$ constant until the training criterion stops decreasing significantly (by more than some relative improvement threshold) from epoch to epoch. That threshold is a less sensitive hyper-parameter than τ itself.

• The mini-batch size (B in Eq. (2)) is typically chosen between 1 and a few hundred, e.g. B = 32 is a good default value, with values above 10 taking advantage of the speed-up of matrix-matrix products over matrix-vector products. The impact of B is mostly computational, i.e., larger B yields faster computation (with appropriate implementations) but requires visiting more examples in order to reach the same error, since there are fewer updates per epoch. In theory, this hyper-parameter should impact training time and not really test performance, so it can be optimized separately from the other hyper-parameters, by comparing training curves (training and validation error vs amount of training time), after the other hyper-parameters (except learning rate) have been selected. B and $\epsilon_0$ may slightly interact so both should be re-optimized at the end. Once B is selected, it can generally be fixed while the other hyper-parameters can be further optimized.

• Number of training iterations T (measured in mini-batch updates). This hyper-parameter is particular in that it can be optimized almost for free using the principle of early stopping: by keeping track of the out-of-sample error (as for example estimated on a validation set) as training progresses (every N updates), one can decide how long to train for any given setting of all the other hyper-parameters. Early stopping is an inexpensive way to avoid strong overfitting, i.e., even if the other hyper-parameters would lead to overfitting, early stopping will considerably reduce the overfitting damage that would otherwise ensue. It also means that it hides the overfitting effect of other hyper-parameters, possibly obscuring the analysis that one may want to do when trying to figure out the effect of individual hyper-parameters, i.e., it tends to even out the performance obtained by many otherwise overfitting configurations of hyper-parameters by compensating for a too large capacity with a smaller training time. For this reason, it might be useful to turn early-stopping off when analyzing the effect of individual hyper-parameters. Now let us turn to implementation details. Practically, one needs to continue training beyond the selected number of training iterations T (which should be the point of lowest validation error in the training run) in order to ascertain that validation error is unlikely to go lower than at the selected point. A heuristic introduced in the Deep Learning Tutorials 14 is based on the idea of patience (set initially to 10000 examples in the MLP tutorial), which is a minimum number of training examples to see after the candidate selected point T before deciding to stop training (i.e. accept this candidate as the final answer). As new candidate selected points T (new minima of the validation error) are observed, patience is increased, either multiplicatively or additively on top of the last T found. Hence, if we find a new minimum 15 at t, we save the current best model, update T ← t and we increase our patience up to t + constant or t × constant. Note that validation error should not be estimated after each training update (that would be really wasteful) but after every N examples, where N is at least as large as the validation set (ideally several times larger so that the early stopping overhead remains small). A code sketch of this patience heuristic is given at the end of this list.

• Momentum β. It has long been advocated (Hinton, 1978, 2010) to temporally smooth out the stochastic gradient samples obtained during the stochastic gradient descent. For example, a moving average of the past gradients can be computed with $\bar{g} \leftarrow (1 - \beta)\bar{g} + \beta g$, where $g = \frac{\partial L(z_t, \theta)}{\partial \theta}$ is the instantaneous gradient and β is a small positive coefficient that controls how fast the old examples get down-weighted in the moving average.

14 http://deeplearning.net/tutorial/
15 Ideally, we should use a statistical test of significance and accept a new minimum (over a longer training period) only if the improvement is statistically significant, based on the size and variance estimates one can compute for the validation set.


The simplest momentum recipe is to make the updates proportional to this smoothed gradient estimator $\bar{g}$ instead of the instantaneous gradient g (a two-line sketch of this update is given at the end of this subsection). The idea is that it removes some of the noise and oscillations that gradient descent has, in particular in the directions of high curvature of the loss function 16. A default value of β = 1 (i.e., no smoothing) works well in many cases but in some cases momentum seems to make a positive difference. Polyak averaging (Polyak and Juditsky, 1992) is a related form of parameter averaging 17 that has theoretical advantages and has been advocated and shown to bring improvements on some unsupervised learning procedures such as RBMs (Swersky et al., 2010). More recently, several mathematically motivated algorithms (Nesterov, 2009; Le Roux et al., 2012) have been proposed that incorporate some form of momentum and that also ensure much faster convergence (linear rather than sublinear) compared to stochastic gradient descent, at least for convex optimization problems. Note however that in the pure online case (stream of examples) and under some assumptions, the sublinear rate of convergence of stochastic gradient descent with O(1/t) decrease of learning rate is an optimal rate, at least for convex problems (Nemirovski and Yudin, 1983). That would suggest that for really large training sets it may not be possible to obtain better rates than ordinary stochastic gradient descent, albeit the constants in front (which depend on the condition number of the Hessian) may still be greatly reduced by using second-order information online (Bottou and LeCun, 2004; Bottou and Bousquet, 2008).

16 Think about a ball coming down a valley. Since it has not started from the bottom of the valley it will oscillate between its sides as it settles deeper, forcing the learning rate to be small to avoid large oscillations that would kick it out of the valley. Averaging out the local gradients along the way will cancel the opposing forces from each side of the valley.

17 Polyak averaging uses for predictions a moving average of the parameters found in the trajectory of stochastic gradient descent.

• Layer-wise optimization hyper-parameters: although rarely done, it is possible to use different values of optimization hyper-parameters (such as the learning rate) on different layers of a multi-layer network. This is especially appropriate (and easier to do) in the context of layer-wise unsupervised pre-training, since each layer is trained separately (while the layers below are kept fixed). This would be particularly useful when the number of units per layer varies a lot from layer to layer. See the paragraph below entitled Layer-wise optimization of hyper-parameters (Sec. 3.3.4). Some researchers also advocate the use of different learning rates for the different types of parameters one finds in the model (such as biases and weights in the standard multi-layer network), but the issue becomes more important when parameters such as precision or variance are included in the lot (Courville et al., 2011).
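The patience heuristic for choosing the number of training iterations T (see the corresponding item above) can be sketched as follows; update_model and validation_error are assumed callables, and the multiplicative patience update is only one of the two options mentioned:

    def train_with_patience(update_model, validation_error, N=10_000,
                            initial_patience=10_000, patience_factor=2.0):
        """Early stopping with 'patience' (sketch of the heuristic described above).

        update_model(N) performs N more training updates/examples and
        validation_error() estimates the current validation error.  Training
        stops once more than `patience` examples have been seen without
        observing a new minimum of the validation error.
        """
        patience = initial_patience
        best_error, best_T, t = float("inf"), 0, 0
        while t < patience:
            update_model(N)                # continue training for N more examples/updates
            t += N
            err = validation_error()       # estimated only every N examples, not each update
            if err < best_error:           # new minimum: keep this candidate T
                best_error, best_T = err, t
                patience = max(patience, int(t * patience_factor))
        return best_T, best_error          # T = point of lowest validation error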

With other optimization algorithms, the hyper-parameters are typically different. For example, Conjugate Gradients (CG) algorithms typically have a number of line search steps (which is a hyper-parameter) and a tolerance for stopping each line search (another hyper-parameter). An optimization algorithm like L-BFGS (limited-memory BFGS) also has a hyper-parameter controlling the memory usage of the algorithm, the rank of the Hessian approximation kept in memory, which also has an influence on the efficiency of each step. Both CG and L-BFGS are iterative (e.g., over line searches), and the number of iterations can be optimized as described above for stochastic gradient descent, with early stopping.
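For completeness, the gradient smoothing of the Momentum item above amounts to the following update step (a sketch; beta = 1 recovers plain stochastic gradient, and this is the simple moving-average form, not Nesterov's or Polyak's variants):

    def sgd_momentum_step(theta, g_bar, g, eps_t, beta):
        """One update with a moving-average ('momentum') gradient estimate (sketch).

        g is the instantaneous mini-batch gradient, g_bar the running average.
        """
        g_bar = (1.0 - beta) * g_bar + beta * g   # smooth the stochastic gradient samples
        theta = theta - eps_t * g_bar             # update with the smoothed gradient
        return theta, g_bar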

3.2 Hyper-Parameters of the Model and Training Criterion

Let us now turn to “model” and “criterion” hyper-parameters typically found in neural networks, especially deep neural networks.

• Number of hidden units $n_h$. Each layer in a multi-layer neural network typically has a size that we are free to set and that controls capacity.


Because of early stopping and possibly other regularizers (e.g., weight decay, discussed below), it is mostly important to choose $n_h$ large enough. Larger than optimal values typically do not hurt generalization performance much, but of course they require proportionally more computation (in $O(n_h^2)$ if scaling all the layers at the same time in a fully connected architecture). Like for many other hyper-parameters, there is the option of allowing a different value of $n_h$ for each hidden layer 18 of a deep architecture. See the paragraph below entitled Layer-wise optimization of hyper-parameters (Sec. 3.3.4). In a large comparative study (Larochelle et al., 2009), we found that using the same size for all layers worked generally better or the same as using a decreasing size (pyramid-like) or increasing size (upside down pyramid), but of course this may be data-dependent. For most tasks that we worked on, we find that an overcomplete 19 first hidden layer works better than an undercomplete one. Another even more often validated empirical observation is that the optimal $n_h$ is much larger when using unsupervised pre-training in a supervised neural network, e.g., going from hundreds of units to thousands of units.

• Weight decay regularization coefficient λ. A way to reduce overfitting is to add a regularization term to the training criterion, which limits the capacity of the learner. The parameters of machine learning models can be regularized by pushing them towards a prior value, which is typically 0. L2 regularization adds a term $\lambda \sum_i \theta_i^2$ to the training criterion, while L1 regularization adds a term $\lambda \sum_i |\theta_i|$. Both types of terms can be included. There is a clean Bayesian justification for such a regularization term: it is the negative log-prior $-\log P(\theta)$ on the parameters θ. The training criterion then corresponds to the negative joint likelihood of data and parameters, $-\log P(\mathrm{data}|\theta) - \log P(\theta)$, with the loss function L(z, θ) being interpreted as $-\log P(z|\theta)$

18 A hidden layer is a group of units that is neither an input layer nor an output layer.
19 Larger than the input vector.

and $-\log P(\mathrm{data}|\theta) = -\sum_{t=1}^{T} L(z_t, \theta)$ if the data consists of T i.i.d. examples $z_t$. This detail is important to note because when one is doing stochastic gradient-based learning, one needs an unbiased estimator of the gradient of the total training criterion (including both the total loss and the regularizer), but one only considers a single mini-batch or example at a time. How should the regularizer be weighted in this sum, which is different from the sum of the regularizer and the total loss on all examples? On each mini-batch update, the gradient of the regularization penalty should be multiplied not just by λ but also by $\frac{B}{T}$, i.e., one over the number of updates needed to go once through the training set. When the training set size is not a multiple of B, the last mini-batch will have size B′ < B and the contribution of the regularizer to the mini-batch gradient should therefore be modified accordingly (i.e. scaled by $\frac{B'}{B}$ compared to other mini-batches). In the pure online setting (where there is no training set size fixed ahead of time, nor any iterating again over the examples), it would then make sense to use $\frac{B}{t}$ at example t, i.e., one over the number of updates to date. L2 regularization penalizes large values more strongly and corresponds to a Gaussian prior $\propto \exp(-\frac{1}{2}\frac{||\theta||^2}{\sigma^2})$ with prior variance $\sigma^2 = 1/(2\lambda)$. Note that there is a connection between early stopping (see above, choosing the number of training iterations) and L2 regularization (Collobert and Bengio, 2004), with one basically playing the same role as the other (but early stopping allows a much more efficient selection of the hyper-parameter value, which suggests dropping L2 regularization altogether when early-stopping is used). L1 regularization makes sure that parameters that are not really very useful are driven to zero (i.e. encouraging sparsity of the parameter values), and corresponds to a Laplace density prior $\propto e^{-\frac{|\theta|}{s}}$ with scale parameter $s = \frac{1}{\lambda}$. L1 regularization often helps to make the input filters 20 cleaner (more spatially localized) and easier to interpret.

20 The input weights of a 1st layer neuron are often called “filters” because of analogies with signal processing techniques such as convolutions.


Stochastic gradient descent will not yield actual zeros but values hovering around zero. If both L1 and L2 regularization are used, a different coefficient (i.e. a different hyper-parameter) should be considered for each, and one may also use a different coefficient for different layers. In particular, the input weights and output weights may be treated differently.

One reason for treating output weights differently is that we know that it is sufficient to regularize only the output weights in order to constrain capacity: in the limit case of the number of hidden units going to infinity, L2 regularization corresponds to Support Vector Machines (SVM) while L1 regularization corresponds to boosting (Bengio et al., 2006a). Another reason for treating inputs and outputs differently from hidden units is because they may be sparse. For example, some input features may be 0 most of the time while others are non-zero frequently. In that case, there are fewer examples that inform the model about that rarely active input feature, and the corresponding parameters (weights outgoing from the corresponding input units) should be more regularized than the parameters associated with frequently observed inputs. A similar situation may occur with target variables that are sparse (e.g., trying to predict rarely observed events). In both cases, the effective number of meaningful updates seen by these parameters is less than the actual number of updates. This suggests scaling the regularization coefficient of these parameters by one over the effective number of updates seen by the parameter. A related formula turns up in Bayesian probit regression applied to sparse inputs (Graepel et al., 2010). Some practitioners also choose to penalize only the weights w and not the biases b associated with the hidden unit activations w′z + b for a unit taking the vector of values z as input. This guarantees that even with strong regularization, the predictor would converge to the optimal constant predictor, rather than the one corresponding to 0 activation. For example, with the mean-square loss and the cross-entropy loss, the optimal constant predictor is the output average.

• Sparsity of activation regularization coefficient α. A common practice in the Deep Learning literature (Ranzato et al., 2007, 2008b; Lee et al., 2008, 2009; Bagnell and Bradley, 2009; Glorot et al., 2011a; Coates and Ng, 2011; Goodfellow et al., 2011) consists in adding a penalty term to the training criterion that encourages the hidden units to be sparse, i.e., with values at or near 0. Although the L1 penalty (discussed above in the case of weights) can also be applied to hidden unit activations, this is mathematically very different from the L1 regularization term on parameters. Whereas the latter corresponds to a prior on the parameters, the former does not, because it involves the training distribution (since we are looking at data-dependent hidden unit outputs). Although we will not discuss this much here, the inspiration for a sparse representation in Deep Learning comes from the earlier work on sparse coding (Olshausen and Field, 1997). As discussed in Goodfellow et al. (2009), sparse representations may be advantageous because they encourage representations that disentangle the underlying factors of representation. A sparsity-inducing penalty is also a way to regularize (in the sense of reducing the number of examples that the learner can learn by heart) (Ranzato et al., 2008b), which means that the sparsity coefficient is likely to interact with the many other hyper-parameters which influence capacity. In general, increased sparsity can be compensated by a larger number of hidden units.

Several approaches have been proposed to induce a sparse representation (i.e., with more hidden units whose activation is at or near 0). One approach (Ranzato et al., 2008b; Le et al., 2011; Zou et al., 2011) is simply to penalize the L1 norm of the representation or another function of the hidden units' activation (such as the Student-t log-prior).


This typically makes sense for non-linearities such as the sigmoid, which have a saturating output around 0, but not for the hyperbolic tangent non-linearity (whose saturation is near the -1 and 1 extremes rather than near the origin). Another option is to penalize the biases of the hidden units, to make them more negative (Ranzato et al., 2007; Lee et al., 2008; Goodfellow et al., 2009; Larochelle and Bengio, 2008). Note that penalizing the bias runs the danger that the weights could compensate for the bias 21, which could hurt the numerical optimization of parameters. When directly penalizing the hidden unit outputs, several variants can be found in the literature, but no clear comparative analysis has been published to evaluate which one works better. Although the L1 penalty (i.e., simply α times the sum of output elements $h_j$ in the case of sigmoid non-linearity) would seem the most natural (because of its use in sparse coding), it is used in few papers involving sparse auto-encoders. A close cousin of the L1 penalty is the Student-t penalty ($\log(1 + h_j^2)$), originally proposed for sparse coding (Olshausen and Field, 1997). Several researchers penalize the average output $\bar{h}_j$ (e.g. over a mini-batch), and instead of pushing it to 0, encourage it to approach a fixed target ρ. This can be done through a mean-square error penalty such as $\sum_j (\rho - \bar{h}_j)^2$, or maybe more sensibly (because $h_j$ behaves like a probability), a Kullback-Leibler divergence with respect to the binomial distribution with probability ρ, $-\rho \log \bar{h}_j - (1 - \rho) \log(1 - \bar{h}_j) + \text{constant}$, e.g., with ρ = 0.05, as in (Hinton, 2010). In addition to the regularization penalty itself, the choice of activation function can have a strong impact on the sparsity obtained. In particular, rectifying non-linearities (such as max(0, x), instead of a sigmoid) have been very successful in several instances (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a; Mesnil et al., 2011; Glorot et al., 2011b).

21 because the input to the layer generally has a non-zeroaverage, that when multiplied by the weights acts like a bias

actual zeros are the expected result of the op-timization. In that case, ordinary stochasticgradient is not guaranteed to find these zeros(it will oscillate around) and other methodssuch as proximal gradient are more appropri-ate (Bertsekas, 2010).
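As an illustration of the Kullback-Leibler sparsity penalty just described, here is a small numpy sketch (ours, not from the cited works) applied to the mini-batch average activation of sigmoid hidden units, with target ρ = 0.05; up to the constant ρ log ρ + (1 − ρ) log(1 − ρ), it matches the penalty −ρ log h̄_j − (1 − ρ) log(1 − h̄_j) above.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def kl_sparsity_penalty(H, rho=0.05, alpha=1e-3, eps=1e-8):
        """KL(rho || mean activation) summed over hidden units, where H holds
        the sigmoid activations of a mini-batch (shape: batch_size x n_hidden)."""
        h_bar = np.clip(H.mean(axis=0), eps, 1.0 - eps)     # average activation per unit
        kl = rho * np.log(rho / h_bar) + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - h_bar))
        return alpha * kl.sum()

    # Toy usage on random pre-activations: 64 examples, 100 hidden units.
    rng = np.random.RandomState(0)
    H = sigmoid(rng.randn(64, 100))
    print(kl_sparsity_penalty(H))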

• Neuron non-linearity. The typical neuron output is s(a) = s(w′x + b), where x is the vector of inputs into the neuron, w the vector of weights and b the offset or bias parameter, while s is a scalar non-linear function. Several non-linearities have been proposed and some choices of non-linearities have been shown to be more successful (Jarrett et al., 2009; Glorot and Bengio, 2010; Glorot et al., 2011a). The most commonly used by the author, for hidden units, are the sigmoid 1/(1 + e^−a), the hyperbolic tangent (e^a − e^−a)/(e^a + e^−a), and the rectifier max(0, a). Note that the sigmoid was shown to yield serious optimization difficulties when used as the top hidden layer of a deep supervised network (Glorot and Bengio, 2010) without unsupervised pre-training, but works well for auto-encoder variants 22. For output (or reconstruction) units, hard neuron non-linearities like the rectifier do not make sense because when the unit is saturated (e.g. a < 0 for the rectifier) and associated with a loss, no gradient is propagated inside the network, i.e., there is no chance to correct the error 23. In the case of hidden layers the gradient manages to go through a subset of the hidden units, even if the others are saturated. For output units a good recipe is to obtain the output non-linearity and the loss by considering the associated negative log-likelihood and choosing an appropriate (conditional) output probability model, usually in the exponential family. For example, one can typically take squared error and linear outputs to correspond to a Gaussian output model, cross-entropy and sigmoids to correspond to a binomial output model, and the loss − log output[target class] with softmax outputs to correspond to multinomial output variables. For reasons yet to be elucidated, having a sigmoidal non-linearity on the output (reconstruction) units (along with target inputs normalized in the (0,1) interval) seems to be helpful when training the contractive auto-encoder.

22 The author hypothesizes that this discrepancy is due to the fact that the weight matrix W of an auto-encoder of the form r(x) = W^T sigmoid(Wx) is pulled towards being orthonormal, since this would make the auto-encoder closer to the identity function, because W^T Wx ≈ x when W is orthonormal and x is in the span of the rows of W.

23 A hard non-linearity for the output units is very different from a hard non-linearity in the loss function, such as the hinge loss. In the latter case the derivative is 0 only when there is no error.
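The pairing of output non-linearity and loss via a negative log-likelihood can be written compactly as follows (a Python/numpy sketch of ours with illustrative helper names, not code from the text):

    import numpy as np

    def gaussian_nll(y, linear_output):
        """Squared error with linear outputs: Gaussian output model (up to constants)."""
        return 0.5 * np.sum((y - linear_output) ** 2)

    def binomial_nll(y, sigmoid_output, eps=1e-8):
        """Cross-entropy with sigmoid outputs: binomial (Bernoulli) output model."""
        p = np.clip(sigmoid_output, eps, 1.0 - eps)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    def multinomial_nll(target_class, softmax_output, eps=1e-8):
        """-log output[target class] with softmax outputs: multinomial output model."""
        return -np.log(softmax_output[target_class] + eps)

    # Toy usage.
    print(gaussian_nll(np.array([1.0, -2.0]), np.array([0.9, -1.5])))
    print(binomial_nll(np.array([1, 0, 1]), np.array([0.8, 0.3, 0.6])))
    print(multinomial_nll(2, np.array([0.1, 0.2, 0.7])))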

• Weights initialization scaling coefficient. Biases can generally be initialized to zero but weights need to be initialized carefully to break the symmetry between hidden units of the same layer 24. Because different output units receive different gradient signals, this symmetry breaking issue does not concern the output weights (into the output units), which can therefore also be set to zero. Although several fixed recipes (LeCun et al., 1998a; Glorot and Bengio, 2010) for initializing the weights into hidden layers have been proposed (i.e. a hyper-parameter is the discrete choice between them), Bergstra and Bengio (2012) also inserted as an extra hyper-parameter a scaling coefficient for the initialization range. These recipes are based on the idea that units with more inputs (the fan-in of the unit) should have smaller weights. Both LeCun et al. (1998a) and Glorot and Bengio (2010) recommend scaling by the inverse of the square root of the fan-in, although Glorot and Bengio (2010) and the Deep Learning Tutorials use a combination of the fan-in and fan-out, e.g., sample a Uniform(−r, r) with r = √(6/(fan-in + fan-out)) for hyperbolic tangent units and r = 4√(6/(fan-in + fan-out)) for sigmoid units. We have found that we could avoid any hyper-parameter related to initialization using these formulas (and the derivation in Glorot and Bengio (2010) can be used to derive the formula for other settings). Note however that in the case of RBMs, a zero-mean Gaussian with a small standard deviation around 0.1 or 0.01 works well (Hinton, 2010) to initialize the weights, while visible biases are typically set to their optimal value if the weights were 0, i.e., log(x̄/(1 − x̄)) in the case of a binomial visible unit whose corresponding binary input feature has empirical mean x̄ in the training set.

24 By symmetry, if hidden units of the same layer share the same input and output weights, they will compute the same output and receive the same gradient, hence performing the same update and remaining identical, thus wasting capacity.

An important choice is whether one should use unsupervised pre-training (and which unsupervised feature learning algorithm to use) in order to initialize parameters. In most settings we have found unsupervised pre-training to help and very rarely to hurt, but of course that implies additional training time and additional hyper-parameters.
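A minimal numpy sketch of the initialization formulas quoted in this item (ours; the function name glorot_uniform_init is just a convenient label):

    import numpy as np

    def glorot_uniform_init(fan_in, fan_out, non_linearity="tanh", rng=None):
        """Sample a (fan_in x fan_out) weight matrix from Uniform(-r, r), with
        r = sqrt(6 / (fan_in + fan_out)) for tanh units and 4 times that for sigmoid units."""
        rng = rng if rng is not None else np.random.RandomState(1234)
        r = np.sqrt(6.0 / (fan_in + fan_out))
        if non_linearity == "sigmoid":
            r *= 4.0
        return rng.uniform(low=-r, high=r, size=(fan_in, fan_out))

    # Biases can be initialized to zero; output weights may also be set to zero.
    W1 = glorot_uniform_init(784, 500, "tanh")
    b1 = np.zeros(500)
    print(W1.shape, float(W1.min()), float(W1.max()))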

• Random seeds. There are often several sources of randomness in the training of neural networks and deep learners (such as random initialization, sampling of examples, sampling of hidden units in stochastic models such as RBMs, or sampling of corruption noise in denoising auto-encoders). Some random seeds could therefore yield better results than others. Because of the presence of local minima in the training criterion of neural networks (except in the linear case or with fixed lower layers), parameter initialization matters. See Erhan et al. (2010b) for an example of measuring the histogram of test errors for hundreds of different random seeds. Typically, the choice of random seed only has a slight effect on the result and can mostly be ignored, at least for most of the hyper-parameter search process. If computing power is available, then a final set of jobs with different random seeds (5 to 10) for a small set of best choices of hyper-parameter values can squeeze a bit more performance. Another way to exploit computing power to push performance a bit is model averaging, as in Bagging (Breiman, 1994) and Bayesian methods. After training them, the outputs of different networks (or in general different learning algorithms) can be averaged. For example, the difference between the neural networks being averaged into a committee may come from the different seeds used for parameter initialization, or the use of different subsets of input variables, or different subsets of training examples (the latter being called Bagging).

• Preprocessing. Many preprocessing steps have been proposed to massage raw data into appropriate inputs for neural networks, and model selection must also choose among them. In addition to element-wise standardization (subtract the mean and divide by the standard deviation), Principal Components Analysis (PCA) has often been advocated (LeCun et al., 1998a; Bergstra and Bengio, 2012) and also allows dimensionality reduction, at the price of an extra hyper-parameter (the number of principal components retained, or the proportion of variance explained). A convenient non-linear preprocessing is the uniformization (Mesnil et al., 2011) of each feature: one estimates its cumulative distribution F_i and then transforms each feature x_i by its quantile F_i^{-1}(x_i), i.e., returns an approximate normalized rank or quantile for the value x_i. A transform that is simpler to compute and that may help reduce the tails of input features is a non-linearity such as the logarithm or the square root, in an attempt to make them more Gaussian-like.
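Both the element-wise standardization and the uniformization can be sketched in a few lines of numpy (an illustrative implementation of ours, not the one used in the cited work):

    import numpy as np

    def standardize(X, eps=1e-8):
        """Element-wise standardization: subtract the mean and divide by the standard deviation."""
        return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

    def uniformize(X):
        """Map each feature value to its approximate normalized rank in the training set,
        i.e., an empirical-quantile transform yielding values in (0, 1)."""
        n, d = X.shape
        U = np.empty_like(X, dtype=float)
        for j in range(d):
            ranks = np.argsort(np.argsort(X[:, j]))     # ranks 0 .. n-1
            U[:, j] = (ranks + 1.0) / (n + 1.0)
        return U

    rng = np.random.RandomState(0)
    X = np.exp(rng.randn(1000, 3))                      # heavy-tailed toy features
    print(standardize(X).mean(axis=0))
    print(uniformize(X).min(), uniformize(X).max())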

In addition to the above somewhat generic choices, more choices arise with different architectures and learning algorithms. For example, the denoising auto-encoder has a hyper-parameter scaling the amount of input corruption, and the contractive auto-encoder has as hyper-parameter a coefficient scaling the norm of the Jacobian of the encoder, i.e., controlling the importance of the contraction penalty. The latter seems to be a rather sensitive hyper-parameter that must be tuned carefully. The contractive auto-encoder's success also seems sensitive to the weight tying constraint used in many auto-encoder architectures: the decoder's weight matrix is set equal to the transpose of the encoder's weight matrix. The specific architecture used in the contractive auto-encoder (with tied weights, sigmoid non-linearities on hidden and reconstruction units, along with squared loss or cross-entropy loss) works quite well, but other related variants do not always train well, for reasons that remain to be understood.

There are also many architectural choices that are relevant in the case of convolutional architectures (e.g. for modeling images, time-series or sound) (LeCun et al., 1989, 1998b; Le et al., 2010) in which hidden units have local receptive fields. Their discussion is postponed to another chapter (LeCun, 2013).

3.3 Manual Search and Grid Search

Many of the hyper-parameters or model choices described above can be ignored by picking a standard recipe suggested here or in some other paper. Still, one is left with a substantial number of choices to be made, which may give the impression of neural network training as an art. With modern computing facilities based on large computer clusters, it is however possible to make the optimization of hyper-parameters a more reproducible and automated process, using techniques such as grid search or, better, random search, or even hyper-parameter optimization, all discussed below.

3.3.1 General guidance for the exploration of hyper-parameters

First of all, let us consider recommendations for exploring hyper-parameter settings, whether with manual search, an automated procedure, or a combination of both. We call a numerical hyper-parameter one that involves choosing a real number or an integer, as opposed to making a discrete choice from an unordered set. Examples of numerical hyper-parameters are regularization coefficients, the number of hidden units, the number of training iterations, etc. One has to think of hyper-parameter selection as a difficult form of learning. The training criterion for this learning is the validation error, which is a proxy for generalization error. Unfortunately, the relation between hyper-parameters and validation error can be complicated. Although to a first approximation we expect a kind of U-shaped curve (when considering only a single hyper-parameter, the others being fixed), this curve can also have noisy variations, in part due to the use of finite data sets.

• Best value on the border. When considering the validation error obtained for different values of a numerical hyper-parameter, one should pay attention to whether or not the best value found is near the border of the investigated interval. If it is near the border, then it is recommended to explore further beyond that border, for the obvious reason that such a result strongly suggests that better values could be found beyond it. Because the relation between a hyper-parameter and validation error can be noisy, it is generally not enough to try very few values. For instance, trying only 3 values for a numerical hyper-parameter is insufficient, even if the best value found is the middle one.

• Scale of values considered. Exploring values of a numerical hyper-parameter entails choosing a starting interval, which is therefore a kind of hyper-hyper-parameter. By choosing the interval large enough to start with, but based on previous experience with this hyper-parameter, we ensure that we do not get completely wrong results. Now, instead of choosing the intermediate values linearly in the chosen interval, it often makes much more sense to consider a linear or uniform sampling in the log-domain (in the space of the logarithm of the hyper-parameter). For example, the results obtained with a learning rate of 0.01 are likely to be very similar to the results with 0.011, while results with 0.001 could be quite different from results with 0.002, even though the absolute difference is the same in both cases. The ratio between different values is a better guide of the expected impact of the change. That is why exploring uniformly or regularly-spaced values in the space of the logarithm of the numerical hyper-parameter is typically preferred.
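For concreteness, a small sketch (ours) of choosing regularly-spaced candidate values in the log-domain rather than linearly, here for the learning rate:

    import numpy as np

    low, high, K = 1e-4, 1e-1, 6    # interval of learning rates, number of candidate values

    linear_values = np.linspace(low, high, K)
    log_values = np.exp(np.linspace(np.log(low), np.log(high), K))   # equivalently np.logspace(-4, -1, 6)

    print(np.round(linear_values, 5))   # ratios between consecutive values vary wildly
    print(np.round(log_values, 5))      # consecutive values differ by a constant ratio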

• Computational considerations. Validation error is actually not the only measure to consider in selecting hyper-parameters. Often, one also has to consider computational cost, either of training or prediction. Computing resources for training and prediction are limited and generally condition the choice of intervals of values to consider (for example, increasing the number of hidden units or the number of training iterations also scales up computation). An interesting idea is to use computationally cheap estimators of validation error to select some hyper-parameters. For example, Saxe et al. (2011) showed that the architecture hyper-parameters of convolutional networks could be selected using random weights in the lower layers of the network. While this yields a noisy and biased (pessimistic) estimator of the validation error which would be obtained with full training, it appears to be correlated with it, which is enough for hyper-parameter selection (or for keeping under consideration, for further and more expensive evaluation, only the few best choices found). Even without cheap estimators of generalization error, high-throughput computing (e.g., on clusters, GPUs, or clusters of GPUs) can be exploited to run not just hundreds but thousands of training jobs, something not conceivable only a few years ago, with each job taking on the order of hours or days for larger datasets. With cheap surrogates, some researchers have run on the order of ten thousand trials, and we can expect future advances in parallelized computing power to boost these numbers.

3.3.2 Coordinate Descent and Multi-Resolution Search

When performing a manual search and with access to only a single computer, a reasonable strategy is coordinate descent: change only one hyper-parameter at a time, always making a change from the best configuration of hyper-parameters found up to now. Instead of a standard coordinate descent (which systematically cycles through all the variables to be optimized), one can make sure to regularly fine-tune the most sensitive variables, such as the learning rate.

Another important idea is that there is no point in exploring the effect of fine changes before one or more reasonably good settings have been found. The idea of multi-resolution search is to start the search by considering large changes of the numerical hyper-parameters. One can then start from the one or few best configurations found and explore more locally around them with smaller changes.

3.3.3 Automated and Semi-automated Grid Search

Once some interval or set of values has been selected for each hyper-parameter (thus defining a search space), a simple strategy that exploits parallel computing is the grid search. One first needs to convert the numerical intervals into lists of values (e.g., K regularly-spaced values in the log-domain of the hyper-parameter). The grid search is simply an exhaustive search through all the combinations of these values. The cross-product of these lists contains a number of elements that is unfortunately exponential in the number of hyper-parameters (e.g., with 5 hyper-parameters, each allowed to take 6 different values, one gets 6^5 = 7776 configurations). In section 3.4 below we consider an approach that works more efficiently than the grid search when the number of hyper-parameters increases beyond 2 or 3.

The advantage of the grid search, compared to many other optimization strategies (such as coordinate descent), is that it is fully parallelizable. If a large computer cluster is available, it is tempting to choose a model selection strategy that can take advantage of parallelization. One practical disadvantage of grid search, with a parallelized set of jobs on a cluster, is that if only one of the jobs fails (for all kinds of hardware and software reasons this is very common) then one has to launch another volley of jobs to complete the grid (and yet a third one if any of these fails, etc.), thus multiplying the overall computing time.

Typically, a single grid search is not enough and practitioners tend to proceed with a sequence of grid searches, each time adjusting the ranges of values considered based on the previous results obtained. Although this can be done manually, the procedure can also be automated by using the idea of multi-resolution search to guide this outer loop. Different, more local grid searches can be launched in the neighborhood of the best solutions found previously. In addition, the idea of coordinate descent can also be thrown in, by making each grid search focus on only a few of the hyper-parameters. For example, it is common practice to start by exploring the initial learning rate while keeping fixed (and initially constant) the learning rate descent schedule. Once the shape of the schedule has been chosen, it may be possible to further refine the learning rate, but within a smaller interval around the best value found.

Humans can get very good at performing hyper-parameter search, and having a human in the loop also has the advantage that it can help detect bugs or unwanted or unexpected behavior of a learning algorithm. However, for the sake of reproducibility, machine learning researchers should strive to use procedures that do not involve human decisions in the middle, only at the outset (e.g., setting hyper-parameter ranges, which can be specified in a paper describing the experiments).
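As an illustration of the cross-product construction behind grid search, here is a sketch of ours; train_and_validate is a hypothetical stand-in for launching a training job and returning its validation error:

    import itertools
    import random

    # Candidate values, with the learning rate regularly spaced in the log-domain.
    grid = {
        "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
        "n_hidden": [100, 300, 1000],
    }

    def train_and_validate(config):
        """Hypothetical stand-in for a training job; returns a validation error."""
        random.seed(str(sorted(config.items())))
        return random.random()

    # Exhaustive cross-product of the candidate lists: 4 * 3 = 12 configurations,
    # which in practice would be dispatched as independent (parallel) jobs.
    configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
    results = [(train_and_validate(c), c) for c in configs]
    print(min(results, key=lambda r: r[0]))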

3.3.4 Layer-wise optimization of hyper-parameters

In the case of Deep Learning with unsupervised pre-training there is an opportunity for combining coordinate descent with a cheap relative evaluation, on the validation set, of some hyper-parameter choices. The idea, described by Mesnil et al. (2011); Bengio (2011), is to perform greedy choices for the hyper-parameters associated with lower layers (near the input) before training the higher layers. One first trains (unsupervised) the first layer with different hyper-parameter values and somehow estimates the relative validation error that would be obtained from these different configurations if the final network only had this single layer as internal representation. In the common case where the ultimate task is supervised, this means training a simple supervised predictor (e.g. a linear classifier) on top of the learned representation. In the case of a linear predictor (e.g. regression or logistic regression) this can even be done on the fly while unsupervised training of the representation progresses (i.e. it can be used for early stopping as well), as in (Larochelle et al., 2009). Once a set of apparently good (according to this greedy evaluation) hyper-parameter values has been found (or possibly using only the best one found), these good values can be used as a starting point to train (and hyper-optimize) a second layer in the same way, etc. The completely greedy approach is to keep only the best configuration found so far (for the lower layers), but keeping the K best configurations overall only multiplies the computational cost of hyper-parameter selection by K for layers beyond the first one. Since greedy layer-wise pre-training does not modify the lower layers when pre-training the upper layers, this is also very efficient computationally. This procedure allows one to set the hyper-parameters associated with the unsupervised pre-training stage, and then there remain hyper-parameters to be selected for the supervised fine-tuning stage, if one is desired. A final supervised fine-tuning stage is strongly suggested, especially when there are many labeled examples (Lamblin and Bengio, 2010).

3.4 Random Sampling of Hyper-Parameters

A serious problem with the grid search approach to finding good hyper-parameter configurations is that it scales exponentially with the number of hyper-parameters considered. In the above sections we have discussed numerous hyper-parameters, and if all of them were to be explored at the same time it would be impossible to use only a grid search to do so.

One may think that there are no other options, simply because this is an instance of the curse of dimensionality. But as we have found in our work on Deep Learning (Bengio, 2009), if there is some structure in the target function we are trying to discover, then there is a chance to find good solutions without paying an exponential price. It turns out that in many practical cases we have encountered, there is a kind of structure that random sampling can exploit (Bergstra and Bengio, 2012). The idea of random sampling is to replace the regular grid by a random (typically uniform) sampling. Each tested hyper-parameter configuration is selected by independently sampling each hyper-parameter from a prior distribution (typically uniform in the log-domain, inside the interval of interest). For a discrete hyper-parameter, a multinomial distribution can be defined according to our prior beliefs on the likely good values (at worst, a uniform distribution across the allowed values). In fact, we can use our prior knowledge to make this prior distribution quite sophisticated. For example, we can readily include knowledge that some values of some hyper-parameters only make sense in the context of other particular values of hyper-parameters. This is a practical consideration, for example, when considering layer-specific hyper-parameters when the number of layers itself is a hyper-parameter.
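A minimal sketch (ours) of drawing such random configurations, with a continuous hyper-parameter sampled uniformly in the log-domain and discrete ones sampled from a prior over their allowed values; train_and_validate is again a hypothetical stand-in for a training job:

    import numpy as np

    rng = np.random.RandomState(42)

    def sample_config():
        """Independently sample each hyper-parameter from its prior distribution."""
        return {
            # Continuous hyper-parameter: uniform in the log-domain over [1e-5, 1e-1].
            "learning_rate": float(np.exp(rng.uniform(np.log(1e-5), np.log(1e-1)))),
            # Discrete hyper-parameters: multinomial priors over the allowed values.
            "n_hidden": int(rng.choice([100, 300, 1000], p=[0.25, 0.5, 0.25])),
            "activation": str(rng.choice(["tanh", "rectifier"])),
        }

    def train_and_validate(config):
        """Hypothetical stand-in returning a validation error for one trial."""
        return float(rng.rand())

    trials = [(train_and_validate(c), c) for c in (sample_config() for _ in range(100))]
    # Curve of the best validation error within the first N trials, N = 1..100.
    best_in_first_N = np.minimum.accumulate([err for err, _ in trials])
    print(min(trials, key=lambda t: t[0]))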

The experiments performed in (Bergstra and Bengio, 2012) show that random sampling can be many times more efficient than grid search as soon as the number of hyper-parameters goes beyond the 2 or 3 typically seen with SVMs and vanilla neural networks. The main reason faster convergence is observed is that random sampling allows one to explore more values of each hyper-parameter, whereas in grid search the same value of a hyper-parameter is repeated in exponentially many configurations (of all the other hyper-parameters). In particular, if only a small subset of the hyper-parameters really matters, then this procedure can be shown to be exponentially more efficient. What we found is that for different datasets and architectures, the subset of hyper-parameters that mattered most was different, but it was often the case that only a few hyper-parameters made a big difference (and the learning rate is always one of them!). When marginalizing (by averaging or minimizing) the validation performance to visualize the effect of one or two hyper-parameters, we get a noisier picture (because of the random variations of the other hyper-parameters) but one with much more resolution (because so many more different values are considered). Practically, one can plot the curves of best validation error as the number of random trials 25 is increased (with mean and standard deviation, obtained by considering, for each choice of number of trials, all possible same-size subsets of trials), and these curves tell us whether we are approaching a plateau, i.e., whether it is worth it or not to continue launching jobs; in other words, we can perform a kind of early stopping in the outer optimization over hyper-parameters. Note that one should distinguish the curve of the "best trial in the first N trials" from the curve of the mean (and standard deviation) of the "best in a subset of size N". The latter is a better statistical representative of the improvements we should expect if we increase the number of trials. Even if the former has reached a plateau, the latter may still be on the increase, pointing to the need for more hyper-parameter configuration samples, i.e., more trials (Bergstra and Bengio, 2012). Comparing these curves with the equivalent ones obtained from grid search, we see faster convergence with random search. On the other hand, note that one advantage of grid search compared to random sampling is that the qualitative analysis of results is easier, because one can consider variations of a single hyper-parameter with all the other hyper-parameters being fixed. It may remain a valid option to do a small grid search around the best solutions found by random search, considering only the hyper-parameters that were found to matter or which concern a scientific question of interest 26.

Random search maintains the advantage of easy parallelization provided by grid search and improves on it. Indeed, a practical advantage of random search compared to grid search is that if one of the jobs fails (due to hardware or software problems, a common issue on large clusters) then there is no need to re-launch that job. It also means that if one has launched 100 random search jobs and finds that the convergence curve still has an interesting slope, one can launch another 50 or 100 without wasting the first 100. It is not that simple to combine the results of two grid searches because they are not always compatible (i.e., one is not a subset of the other).

Finally, although random search is a useful addition to the toolbox of the practitioner, semi-automatic exploration is still helpful and one will often iterate between launching a new volley of jobs and analyzing the results obtained with the previous volley in order to guide model design and research. What we need is more, and more efficient, automation of hyper-parameter optimization. There are some interesting steps in this direction (Hutter, 2009; Bergstra et al., 2011; Hutter et al., 2011; Srinivasan and Ramakrishnan, 2011) but much more needs to be done.

25 each random trial corresponding to a training job with a particular choice of hyper-parameter values

26 This is often the case in machine learning research, e.g., does depth of architecture matter? Then we need to control accurately for the effect of depth, with all other hyper-parameters optimized for each value of depth.

4 Debugging and Analysis

4.1 Gradient Checking and Controlled Overfitting

A very useful debugging step consists in verifying that the implementation of the gradient ∂L/∂θ is compatible with the computation of L as a function of θ. If the analytically computed gradient does not match the one obtained by a finite difference approximation, this signals that a bug is probably present somewhere. First of all, by looking at the indices i for which one gets an important relative change between ∂L/∂θ_i and its finite difference approximation, we can get hints as to where the problem may be. An error in sign is particularly troubling, of course. A good next step is then to verify in the same way intermediate gradients ∂L/∂a, with a being quantities that depend on the faulty θ, such as intervening neuron activations.

As many researchers know, the gradient can be approximated by a finite difference approximation obtained from the first-order Taylor expansion of a scalar function f with respect to a scalar argument x:

∂f(x)/∂x ≈ (f(x + ε) − f(x)) / ε.

But a less known fact is that a second-order approximation can be achieved by considering the following alternative formula:

∂f(x)/∂x ≈ (f(x + ε) − f(x − ε)) / (2ε).

The second-order terms of the Taylor expansions of f(x + ε) and f(x − ε) cancel each other because they are even, i.e., this formula is twice as expensive but provides quadratically more precision.



Note that because of finite precision computation, there will be a difference between the analytic (even correct) and finite difference gradient. Contrary to naive expectations, the relative difference may grow if we choose an ε that is too small, i.e., the error should first decrease as ε is decreased, and then may worsen when numerical precision kicks in, due to non-linearities. We have often used a value of ε = 10^−4 in neural networks, a value that is sufficiently small to detect most bugs.
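A small numpy sketch of such a gradient check using the centered (second-order) formula with ε = 10^−4; the quadratic loss below is a toy placeholder of ours, not an example from the text:

    import numpy as np

    def numeric_gradient(f, theta, eps=1e-4):
        """Centered finite-difference approximation of df/dtheta, one coordinate at a time."""
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e.flat[i] = eps
            grad.flat[i] = (f(theta + e) - f(theta - e)) / (2.0 * eps)
        return grad

    # Toy loss L(theta) = 0.5 * ||A theta - b||^2, whose analytic gradient is A^T (A theta - b).
    rng = np.random.RandomState(0)
    A, b = rng.randn(5, 3), rng.randn(5)
    loss = lambda theta: 0.5 * np.sum((A.dot(theta) - b) ** 2)
    theta = rng.randn(3)

    analytic = A.T.dot(A.dot(theta) - b)
    numeric = numeric_gradient(loss, theta)
    # A large relative difference for some coordinate i hints at where a bug may be.
    print(np.abs(analytic - numeric) / np.maximum(np.abs(analytic) + np.abs(numeric), 1e-8))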

Once the gradient is known to be well computed,another sanity check is that gradient descent (or anyother gradient-based optimization) should be ableto overfit on a small training set 27. In particu-lar, to factor out effects of SGD hyper-parameters, agood sanity check for the code (and the other hyper-parameters) is to verify that one can overfit on a smalltraining set using a powerful second order methodsuch as L-BFGS. For any optimizer, though, as thenumber of examples is increased, the degradation oftraining error should be gradual while validation er-ror should improve. And one typically sees the advan-tages of SGD over batch second-order methods likeL-BFGS increase as the training set size increases.The break-even point may depend on the task, paral-lelization (multi-core or GPU, see Sec.5 below), andarchitecture (number of computations compared tonumber of parameters, per example).

Of course, the real goal of learning is to achieve good generalization error, and the latter can be estimated by measuring performance on an independent test set. When test error is considered too high, the first question to ask is whether it is because of a difficulty in optimizing the training criterion or because of overfitting. Comparing training error and test error (and how they change as we change hyper-parameters that influence capacity, such as the number of training iterations) helps to answer that question. Depending on the answer, of course, the appropriate ways to improve test error are different. Optimization difficulties can be fixed by looking for bugs in the training code, inappropriate values of optimization hyper-parameters, or simply insufficient capacity (e.g. not enough degrees of freedom, hidden units, embedding sizes, etc.). Overfitting difficulties can be addressed by collecting more training data, introducing more or better regularization terms, or considering different function families (or neural network architectures). In a multi-layer neural network, both problems can be present simultaneously. For example, as discussed in Bengio et al. (2007); Bengio (2009), it is possible to have zero training error with a large top-level hidden layer that allows the output layer to overfit, while the lower layers are not doing a good job of extracting useful features because they were not properly optimized.

Unless one uses a framework such as Theano, which automatically handles the efficient allocation of buffers for intermediate results, it is important to pay attention to such buffers in the design of the code. The first objective is to avoid memory allocation in the middle of the training loop, i.e., all memory buffers should be allocated once and for all. Careless reuse of the same memory buffers for different uses can however lead to bugs, which can be checked, in the debugging phase, by initializing buffers to the NaN (Not-A-Number) value, which propagates into downstream computations (making it easy to detect that uninitialized values were used) 28.

27 In principle, bad local minima could prevent that, but in the overfitting regime, e.g., with more hidden units than examples, the global minimum of the training error can generally be reached almost surely from random initialization, presumably because the training criterion becomes convex in the parameters that suffice to get the training error to zero (Bengio et al., 2006a), i.e., the output weights of the neural network.

4.2 Visualizations and Statistics

The most basic statistics that should be measuredduring training are error statistics. The average losson the training set and the validation set and theirevolution during training are very useful to monitorprogress and differentiate overfitting from poor op-timization. To make comparisons easier, it may beuseful to compare neural networks during training interms of their “age” (number of updates made timesmini-batch size B, i.e., number of examples visited)rather than in terms of number of epochs (which isvery sensitive to the training set size).

28 Personal communication from David Warde-Farley, wholearned this trick from Sam Roweis.



When using unsupervised training to learn the firstfew layers of a deep architecture, a very common de-bugging and analysis tool is the visualization of fil-ters, i.e., of the weight vectors associated with in-dividual hidden units. This is simplest in the caseof the first layer and where the inputs are images(or image patches), time-series, or spectrograms (allof which are visually interpretable). Several recipeshave been proposed to extend this idea to visualizethe preferred input of hidden units in layers thatfollow the first one (Lee et al., 2008; Erhan et al.,2010a). In the case of the first layer, since one of-ten obtains Gabor filters, a parametric fit of thesefilters to the weight vector can be done so as to vi-sualize the distribution of orientations, position andscale of the learned filters. An interesting specialcase of visualizing first-layer weights is the visual-ization of word embeddings (see Section 5.3 below)using a dimensionality reduction technique such ast-SNE (van der Maaten and Hinton, 2008).In the case where the inputs are not images or eas-

ily visualizable, or to get a sense of the weight valuesin different hidden units, Hinton diagrams (Hinton,1989) are also very useful, using small squares whosecolor (black or white) indicates a weight’s sign andwhose area represents its magnitude.Another way to visualize what has been learned

by an unsupervised (or joint label-input) model isto look at samples from the model. Sampling pro-cedures have been defined at the outset for RBMs,Deep Belief Nets, and Deep Boltzmann Machines,for example based on Gibbs sampling. When weightsbecome larger, mixing between modes can becomevery slow with Gibbs sampling. An interesting alter-native is rates-FPCD (Tieleman and Hinton, 2009;Breuleux et al., 2011) which appears to be more ro-bust to this problem and generally mixes faster, butat the cost of losing theoretical guarantees.In the case of auto-encoder variants, it was not

clear until recently whether they were really captur-ing the underlying density (since they are not opti-mized with respect to the maximum likelihood prin-ciple or an approximation of it). It was thereforeeven less clear if there existed appropriate samplingalgorithms for auto-encoders, but a recent proposalfor sampling from contractive auto-encoders appears

to be working very well (Rifai et al., 2012), based onarguments about the geometric interpretation of thefirst derivative of the encoder.To get a sense of what individual hidden units rep-

resent, it has also been proposed to vary only oneunit while keeping the others fixed, e.g., to the valueobtained by finding the hidden units representationassociated with a particular input example.Another interesting technique is the visual-

ization of the learning trajectory in functionspace (Erhan et al., 2010b). The idea is to asso-ciate the function (as opposed to simply the pa-rameters) computed by a neural network with alow-dimensional (2-D or 3-D) representation, e.g.,with the t-SNE (van der Maaten and Hinton, 2008)or Isomap (Tenenbaum et al., 2000) algorithms, andthen plot the evolution of this function during train-ing, or the population of such trajectories for differentinitializations. This provides visualization of effec-tive local minima 29 and shows that no two differentrandom initializations ended up in the same effectivelocal minimum.Finally, another useful type of visualization is to

display statistics (e.g., histogram, mean and stan-dard deviation) of activations (inputs and outputsof the non-linearities at each layer), activation gradi-ents, parameters and parameter gradients, by groups(e.g. different layers, biases vs weights) and acrosstraining iterations. See Glorot and Bengio (2010) fora practical example.
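As a concrete illustration of this last kind of monitoring, one can log simple summary statistics per group of quantities at regular intervals during training; the sketch below (ours, with made-up group names) prints the mean, standard deviation and range for each group:

    import numpy as np

    def summarize(name, values):
        """One-line summary (mean, std, min, max) for a group of values."""
        v = np.asarray(values).ravel()
        return "%-10s mean=%+.4f std=%.4f min=%+.4f max=%+.4f" % (
            name, v.mean(), v.std(), v.min(), v.max())

    # Hypothetical quantities gathered at one training iteration.
    rng = np.random.RandomState(0)
    groups = {
        "W_layer1": 0.1 * rng.randn(784, 500),      # weights of the first layer
        "b_layer1": np.zeros(500),                  # biases
        "h_layer1": np.tanh(rng.randn(64, 500)),    # activations on a mini-batch
        "grad_W1": 1e-3 * rng.randn(784, 500),      # parameter gradients
    }
    for name, values in groups.items():
        print(summarize(name, values))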

5 Other Recommendations

5.1 Multi-core machines, BLAS and GPUs

Matrix operations are the most time-consuming inefficient implementations of many machine learningalgorithms and this is particularly true of neural net-works and deep architectures. The basic operationsare matrix-vector products (forward propagation andback-propagation) and vector times vector outer

29 It is difficult to know for sure if it is a true local minimum or if it only appears to be one because the optimization algorithm is stuck.



products (resulting in a matrix of weights for the weight gradient computation). Matrix-matrix multiplications can be done substantially faster than the equivalent sequence of matrix-vector products, for two reasons: smart caching mechanisms such as those implemented in the BLAS library (which is called from many higher-level environments such as python's numpy and Theano, Matlab, Torch or Lush), and parallelism. Appropriate versions of BLAS can take advantage of multi-core machines to distribute these computations across cores. The speed-up is however generally only a fraction of the total speed-up one could hope for (e.g. 3× to 4× on a 4-core machine), because of communication overheads and because not all of the computation is parallelized. Parallelism becomes more efficient when the sizes of these matrices are increased, which is why mini-batch updates can be computationally advantageous, and more so when more cores are present.

The extreme multi-core machines are the GPUs (Graphics Processing Units), with hundreds of cores. Unfortunately, they also come with constraints and specialized compilers which make it more difficult to fully take advantage of their potential. On 512-core machines, we are routinely able to get speed-ups of 10× to 40× for large neural networks. To make the use of GPUs practical, it really helps to use existing libraries that efficiently implement computations on GPUs. See Bergstra et al. (2010) for a comparative study of the Theano library (which compiles numpy-like code for GPUs). One practical issue is that only the GPU-compiled operations will typically be done on the GPU, and that transfers between the GPU and CPU considerably slow things down. It is important to use a profiler to find out what is done on the GPU and how efficient these operations are, in order to quickly invest one's time where needed to make an implementation GPU-efficient and keep most operations on the GPU card.

5.2 Sparse High-Dimensional Inputs

Sparse high-dimensional inputs can be efficiently han-dled by traditional supervised neural networks by us-ing a sparse matrix multiplication. Typically, the in-put is a sparse vector while the weights are in a dense

matrix, and one should use an efficient implementa-tion made for just this case in order to optimally takeadvantage of sparsity. There is still going to be anoverhead on the order of 2× or more (on the multiply-add operations, not the others) compared to a denseimplementation of the matrix-vector product.For many unsupervised learning algorithms there is

unfortunately a difficulty. The computation for theselearning algorithms usually involves some kind of re-construction of the input (like for all auto-encodervariants, but also for RBMs and sparse coding vari-ants), as if the inputs were in the output space ofthe learner. Two exceptions to this problem aresemi-supervised embedding (Weston et al., 2008) andSlow Feature Analysis (Wiskott and Sejnowski, 2002;Berkes and Wiskott, 2002). The former pulls the rep-resentation of nearby examples near each other whilealso tuning the representation for a supervised learn-ing task, and the latter maximizes the learned fea-tures’ variance while minimizing their covariance andmaximizing their temporal auto-correlation.For algorithms that do need a form of input re-

construction, an efficient approach based on sam-pled reconstruction (Dauphin et al., 2011) has beenproposed, successfully implemented and evaluatedfor the case of auto-encoders and denoising auto-encoders. The first idea is that on each example (ormini-batch), one samples a subset of the elements ofthe reconstruction vector, along with the associatedreconstruction loss. Only the loss for these sampledelements (and the associated back-propagation oper-ations into hidden units and reconstruction weights)needs to be computed. That alone would reduce thecomputational cost but make the gradient much morenoisy and possibly biased as well, if the sampling dis-tribution was chosen not uniform. To reduce thevariance of that estimator, the idea is to guess forwhich features the reconstruction loss will be largerand to sample with higher probability these features(and their loss). In particular, the authors alwayssample the features associated with a non-zero in theinput (or the corrupted input, in the denoising case),and uniformly sample an equal number of zeros. Tomake the estimator unbiased now requires introduc-ing a weight on the reconstruction loss associatedwith each sampled feature, inversely proportional to



the probability of sampling it, i.e., this is an impor-tance sampling scheme. The experiments show thatthe speed-up increases linearly with the amount ofsparsity while the average loss is optimized as well asin the deterministic full-computation case.
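To illustrate the sparse product mentioned at the start of this section (a sketch of ours, assuming scipy is available; this is not code from the cited works), the forward computation for a sparse mini-batch can use a sparse matrix type so that only the non-zero entries contribute multiply-add operations:

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.RandomState(0)
    n_features, n_hidden, batch_size = 20000, 200, 32

    # Sparse high-dimensional mini-batch: about 0.1% of the entries are non-zero.
    X = sp.random(batch_size, n_features, density=0.001, format="csr", random_state=rng)
    W = 0.01 * rng.randn(n_features, n_hidden)      # dense weight matrix
    b = np.zeros(n_hidden)

    # Only the non-zero input entries participate in the multiply-adds of X.dot(W).
    H = np.tanh(X.dot(W) + b)
    print(H.shape, X.nnz, "non-zeros out of", batch_size * n_features)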

5.3 Symbolic Variables, Embeddings, Multi-Task Learning and Multi-Relational Learning

Parameter sharing (Lang and Hinton, 1988; LeCun,1989; Lang and Hinton, 1988; Caruana, 1993; Baxter,1995, 1997) is an old neural network technique for in-creasing statistical power: if a parameter is used in Ntimes more contexts (different tasks, different parts ofthe input, etc.) then it may be as if we had N timesmore training examples for tuning its value. Moreexamples to estimate a parameter reduces its vari-ance (with respect to sampling of training examples),which is directly influencing generalization error: forexample the generalization mean squared error canbe decomposed as the sum of a bias term and a vari-ance term (Geman et al., 1992). The reuse idea wasfirst exploited by applying the same parameter to dif-ferent parts of the input, as in convolutional neu-ral networks (Lang and Hinton, 1988; LeCun, 1989).Reuse was also exploited by sharing the lower lay-ers of a network (and the representation of the inputthat they capture) across multiple tasks associatedwith different outputs of the network (Caruana, 1993;Baxter, 1995, 1997). This idea is also one of the keymotivations behind Deep Learning (Bengio, 2009) be-cause one can think of the intermediate features com-puted in higher (deeper) layers as different tasks thatcan share the sub-features computed in lower layers(nearer the input). This very basic notion of reuseis key to improving generalization in many settings,guiding the design of neural network architectures inpractical applications as well.

An interesting special case of these ideas is in thecontext of learning with symbolic data. If some in-put variables are symbolic, taking value in a finitealphabet, they can be represented as neural networkinputs by a one-hot subvector (with a 0 everywhereexcept at the position associated with the particular

symbol) of the input vector. Now, sometimes differ-ent input variables refer to different instances of thesame type of symbol. A patent example is with neurallanguage models (Bengio et al., 2003; Bengio, 2008),where the input is a sequence of words. In thesemodels, the same input layer weights are reused forwords at different positions in the input sequence (asin convolutional networks). The product of a one-hotsub-vector with this shared weight matrix is a gener-ally dense vector, and this associates each symbol inthe alphabet with a point in a vector space (the resultof the matrix multiplication, which equals one of thecolumns of the matrix), which we call its embedding.The idea of vector space representations for wordsand symbols is older (Deerwester et al., 1990) and isa particular case of the notion of distributed repre-sentation (Hinton, 1986, 1989) central to the connec-tionist approaches. Learned embeddings of symbols(or other objects) can be conveniently visualized us-ing a dimensionality reduction algorithm such as t-SNE (van der Maaten and Hinton, 2008).In addition to sharing the embedding parame-

ters across positions of words in an input sentence,Collobert et al. (2011a) share them across naturallanguage processing tasks such as Part-Of-Speechtagging, chunking and semantic role labeling. Param-eter sharing is a key idea behind convolutional nets,recurrent neural networks and dynamic Bayes nets, inwhich the same parameters are used for different tem-poral or spatial slices of the data. This idea has beengeneralized from sequences and 2-D images to arbi-trary graphs with recursive neural networks or recur-sive graphical models (Pollack, 1990; Frasconi et al.,1998; Bottou, 2011; Socher et al., 2011), MarkovLogic Networks (Richardson and Domingos, 2006)and relational learning (Getoor and Taskar, 2006). Arelational database can be seen as a set of objects (orvalues) and relations between them, of the form (ob-ject1, relation-type, object2). The same global set ofparameters can be shared to characterize such rela-tions, across relations (which can be seen as tasks)and objects. Object-specific parameters are the pa-rameters specifying the embedding of a discrete ob-ject. One can think of the elements of each embed-ding vector as implicit learned attributes. Differenttasks may demand different attributes, so that ob-



jects which share some underlying characteristics andbehavior should end up having similar values of someof their attributes. For example, words appearing insemantically and syntactically similar contexts endup getting a very close embedding (Collobert et al.,2011a). If the same attributes can be useful for sev-eral tasks, then statistical power is gained throughparameter sharing, and transfer of information be-tween tasks can happen, making the data of sometask informative for generalizing properly on anothertask.The idea proposed in Bordes et al. (2011, 2012) is

to learn an energy function that is lower for posi-tive (valid) relations present in the training set, andparametrized in two parts: on the one hand the sym-bol embeddings and on the other hand the rest ofthe neural network that maps them to a scalar en-ergy. In addition, by considering relation types them-selves as particular symbolic objects, the model canreason about relations themselves and have relationsbetween relation types. For example, “to be” can actas a relation type (in subject-attribute relations) butin the statement “’To be’ is a verb” it appears bothas a relation type and as an object of the relation.Such multi-relational learning opens the door to

the application of neural networks outside of theirtraditional applications, which was based on a singlehomogeneous source of data, often seen as a matrixwith one row per example and one column (or groupof columns) per random variable. Instead, one of-ten has multiple heterogeneous sources of data (typ-ically providing examples seen as a tuple of values),each involving different random variables. So longas these different sources share some variables, thenthese multi-relational multi-task learning approachescan be applied. Each variable can be associated withits embedding function (that maps the value of a vari-able to a generic representation space that is validacross tasks and data sources). This framework canbe applied not only to symbolic data but to mixedsymbolic/numeric data if the mapping from objectto embedding is generalized from a table look-up toa parametrized function (the simplest being a linearmapping) from its raw attributes (e.g., image fea-tures) to its embedding. This has been exploitedsuccessfully to design image search systems in which

images and queries are mapped to the same semanticspace (Weston et al., 2011).
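To make concrete the embedding mechanics described earlier in this section (a one-hot sub-vector multiplied by a shared weight matrix selects that symbol's embedding), here is a small numpy sketch of ours, storing one embedding per row (the transpose of the column convention used in the text); the tiny vocabulary is made up for illustration:

    import numpy as np

    vocab = {"the": 0, "cat": 1, "sat": 2}          # symbolic alphabet mapped to indices
    d_embed = 4
    rng = np.random.RandomState(0)
    E = 0.1 * rng.randn(len(vocab), d_embed)        # shared embedding matrix, one row per symbol

    def one_hot(index, size):
        v = np.zeros(size)
        v[index] = 1.0
        return v

    word = "cat"
    # Multiplying a one-hot vector by the shared matrix selects a single row: the embedding.
    via_product = one_hot(vocab[word], len(vocab)).dot(E)
    via_lookup = E[vocab[word]]                     # equivalent table look-up, much cheaper in practice
    print(np.allclose(via_product, via_lookup))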

6 Open Questions

6.1 On the Added Difficulty of Training Deeper Architectures

There are experimental results which provide someevidence that, at least in some circumstances, deeperneural networks are more difficult to train thanshallow ones, in the sense that there is a greaterchance of missing out on better minima when start-ing from random initialization. This is borne outby all the experiments where we find that initializ-ing global training of a deep architecture can dras-tically improve performance. In the Deep Learn-ing literature this has been shown with the use ofunsupervised pre-training, both applied to super-vised tasks — training a neural network for clas-sification (Hinton et al., 2006; Bengio et al., 2007;Ranzato et al., 2007) — and unsupervised tasks —training a Deep Boltzmann Machine to model thedata distribution (Salakhutdinov and Hinton, 2009).

The learning trajectories visualizationsof Erhan et al. (2010b) have shown that evenwhen starting from nearby configurations in functionspace, different initializations seem to always fall ina different effective local minimum. Furthermore,the same study showed that the minima found whenusing unsupervised pre-training were far in functionspace from those found from random initialization,in addition to giving better generalization error.Both of these findings highlight the importance ofinitialization, hence of local minima effects, in deepnetworks. Finally, it has been shown that theseeffects were both increased when considering deeperarchitectures (Erhan et al., 2010b).

There are also results showing that specific ways of setting the initial distribution and ordering of examples ("curriculum learning") can yield better solutions (Elman, 1993; Bengio et al., 2009; Krueger and Dayan, 2009). This also suggests that very particular ways of initializing parameters, very different from uniform sampling, can have a strong impact on the solutions found by gradient descent. The hypothesis proposed in (Bengio et al., 2009) is that curriculum learning can act similarly to a continuation method, i.e., starting from an easier optimization task (e.g. convex) and tracking the local minimum as the learning task is gradually made more difficult and closer to the real task of interest.

Why would training deeper networks be more dif-ficult? This is clearly still an open question. Aplausible partial answer is that deeper networks arealso more non-linear (since each layer composes morenon-linearity on top of the previous ones), makinggradient-based methods less efficient. It may also bethat the number and structure of local minima bothchange qualitatively as we increase depth. Theoreti-cal arguments support a potentially exponential gainin expressive power of deeper architectures (Bengio,2009; Bengio and Delalleau, 2011) and it would beplausible that with this added expressive power com-ing from the combinatorics of composed reuse of sub-functions could come a corresponding increase in thenumber (and possibly quality) of local minima. Butthe best ones could then also be more difficult to find.

On the practical side, several experimental resultspoint to factors that may help training deep architec-tures:

• A local training signal. What many success-ful procedures for training deep networks havein common is that they involve a local trainingsignal that helps each layer decide what to dowithout requiring the back-propagation of gradi-ents through many non-linearities. This includesof course the many variants of greedy layer-wisepre-training but also the less well-known semi-supervised embedding algorithm (Weston et al.,2008).

• Initialization in the right range. Basedon the idea that both activations and gradientsshould be able to flow well through a deep ar-chitecture without significant reduction in vari-ance, Glorot and Bengio (2010) proposed set-ting up the initial weights to make the Jaco-bian of each layer have singular values near 1 (orpreserve variance in both directions). In their

experiments this clearly helped greatly reduc-ing the gap between purely supervised and pre-trained (with layer-wise unsupervised learning)deep networks.

• Choice of non-linearities. In the same study (Glorot and Bengio, 2010) and a follow-up (Glorot et al., 2011a) it was shown that the choice of hidden layer non-linearities interacted with depth. In particular, without unsupervised pre-training, a deep neural network with sigmoids in the top hidden layer would get stuck for a long time on a plateau and generally produce inferior results, due to the special role of 0 and of the initial gradients from the output units. Symmetric non-linearities like the hyperbolic tangent did not suffer from that problem, while softer non-linearities (without exponential tails) such as the softsign function s(a) = a/(1 + |a|) worked even better. In Glorot et al. (2011a) it was shown that an asymmetric but hard-limiting non-linearity such as the rectifier (s(a) = max(0, a), see also (Nair and Hinton, 2010)) actually worked very well (but should not be used for output units), in spite of the prior belief that, because saturated hidden units pass no gradient, gradients would not flow well into lower layers. In fact gradients flow very well, but on selected paths. A recent heuristic that is related to the difficulty of gradient propagation through neural net non-linearities is the idea of "centering" the non-linear operation such that each hidden unit has zero average output and zero average slope (Schraudolph, 1998; Raiko et al., 2012).

6.2 Adaptive Learning Rates and Second-Order Methods

To improve convergence and remove learning ratesfrom the list of hyper-parameters, many authors haveadvocated exploring adaptive learning rate methods,either for a global learning rate (Cho et al., 2011),a layer-wise learning rate, a neuron-wise learningrate, or a parameter-wise learning rate (Bordes et al.,2009) (which then starts to look like a diagonal New-

LeCun (1987) and LeCun et al. (1998a) advocate the use of a second-order diagonal Newton (always positive) approximation, with one learning rate per parameter (associated with the approximated inverse second derivative of the loss with respect to that parameter). Hinton (2010) proposes scaling learning rates so that the average weight update is on the order of 1/1000th of the weight magnitude. LeCun et al. (1998a) also propose a simple power method to estimate the largest eigenvalue of the Hessian (whose inverse would then give the optimal learning rate). An interesting alternative to variants of Newton's method are variants of the natural gradient method (Amari, 1998), but like the basic Newton method it is computationally too expensive, requiring operations on a square matrix whose side is the number of parameters. Diagonal and low-rank online approximations of the natural gradient (Le Roux et al., 2008; Le Roux et al., 2011) have been proposed and shown to speed up training in some contexts.

Whereas stochastic gradient descent converges very quickly initially, it is generally slower than second-order methods for the final convergence, and this may matter in some applications. As a consequence, batch training algorithms (performing only one update after seeing the whole training set) such as the Conjugate Gradient method (a second-order method) have dominated stochastic gradient descent for datasets that are not too large (e.g., fewer than thousands or tens of thousands of examples). Furthermore, it has recently been proposed, and successfully applied, to use second-order methods over large mini-batches (Le et al., 2011; Martens, 2010). The idea is to do just a few iterations of the second-order method on each mini-batch and then move on to the next mini-batch, starting from the best previous point found.

At this point in time, however, although second-order and natural gradient methods are conceptually appealing, have demonstrably helped in the studied cases, and may in the end prove to be very important, they have not yet become a standard for neural network optimization; they need to be validated, and perhaps improved, by other researchers before displacing simple (mini-batch) stochastic gradient descent variants.
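One way to read the 1/1000 rule attributed above to Hinton (2010) is as a monitoring heuristic. The sketch below is one possible interpretation, not code prescribed by that report: it nudges a global learning rate so that the average update stays near one thousandth of the average weight magnitude, and the tolerance and adjustment factors are arbitrary illustrative choices.

```python
import numpy as np

TARGET_RATIO = 1e-3   # target: mean |update| ~ 1/1000 of mean |weight|

def adjust_learning_rate(lr, weights, gradient, tol=3.0):
    """Nudge a global learning rate so that the average parameter update stays
    close to TARGET_RATIO times the average weight magnitude.
    `tol` and the 0.9 / 1.1 factors are arbitrary illustrative choices."""
    update_scale = np.mean(np.abs(lr * gradient))
    weight_scale = np.mean(np.abs(weights)) + 1e-12   # avoid division by zero
    ratio = update_scale / weight_scale
    if ratio > tol * TARGET_RATIO:        # updates too large relative to weights
        lr *= 0.9
    elif ratio < TARGET_RATIO / tol:      # updates too small
        lr *= 1.1
    return lr

# Example usage inside a training loop (weights and gradient as flat vectors):
# lr = adjust_learning_rate(lr, weights, gradient)
# weights -= lr * gradient
```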
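The power-method suggestion of LeCun et al. (1998a) can likewise be sketched with finite-difference Hessian-vector products. The code below is my own illustration, assuming a generic grad_fn(theta) that returns the gradient of the training loss at a parameter vector; the quadratic loss at the end is only a toy check of the estimate.

```python
import numpy as np

def hessian_vector_product(grad_fn, theta, v, eps=1e-4):
    """Finite-difference approximation of H v from two gradient evaluations:
    H v ~= (grad(theta + eps*v) - grad(theta)) / eps."""
    return (grad_fn(theta + eps * v) - grad_fn(theta)) / eps

def largest_hessian_eigenvalue(grad_fn, theta, n_iter=20):
    """Power iteration on the (implicit) Hessian of the loss at theta."""
    v = np.random.normal(size=theta.shape)
    v /= np.linalg.norm(v)
    eigval = 0.0
    for _ in range(n_iter):
        hv = hessian_vector_product(grad_fn, theta, v)
        eigval = float(np.dot(v, hv))            # Rayleigh quotient estimate
        v = hv / (np.linalg.norm(hv) + 1e-12)    # re-normalize for next step
    return eigval

# Toy check on a quadratic loss 0.5 * theta' A theta, whose Hessian is A.
A = np.diag([1.0, 2.0, 10.0])
grad_fn = lambda theta: A.dot(theta)
lam_max = largest_hessian_eigenvalue(grad_fn, np.ones(3))
print("estimated largest eigenvalue:", lam_max)            # approaches 10
print("candidate learning rate ~ 1/lambda_max:", 1.0 / lam_max)
```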

6.3 Conclusion

In spite of decades of experimental and theoretical work on artificial neural networks, and with all the impressive progress made since the first edition of this book, in particular in the area of Deep Learning, there is still much to be done to better train neural networks and to better understand the underlying issues that can make the training task difficult. As stated in the introduction, the wisdom distilled here should be taken as a guideline, to be tried and challenged, not as a practice set in stone. The practice summarized here, coupled with the increase in available computing power, now allows researchers to train neural networks on a scale that is far beyond what was possible at the time of the first edition of this book, helping to move us closer to artificial intelligence.

Acknowledgements

The author is grateful for the comments and feedback provided by Nicolas Le Roux, Ian Goodfellow, James Bergstra, Guillaume Desjardins, Razvan Pascanu, David Warde-Farley, Eric Larsen, Frederic Bastien, and Sina Honari, as well as for the financial support of NSERC, FQRNT, CIFAR, and the Canada Research Chairs.

References

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.

Bach, F. and Moulines, E. (2011). Non-asymptotic analysis of stochastic approximation algorithms. In NIPS'2011.

Bagnell, J. A. and Bradley, D. M. (2009). Differentiable sparse coding. In NIPS'2009, pages 113–120.

Baxter, J. (1995). Learning internal representations. In COLT'95, pages 311–320.

Baxter, J. (1997). A Bayesian/information theoretic model of learning via multiple task sampling. Machine Learning, 28, 7–40.

Bengio, Y. (2008). Neural net language models. Scholarpedia, 3(1), 3881.

Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.

Bengio, Y. (2011). Deep learning of representations for unsupervised and transfer learning. In JMLR W&CP: Proc. Unsupervised and Transfer Learning.

Bengio, Y. and Delalleau, O. (2011). On the expressive power of deep architectures. In ALT'2011.

Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In Large Scale Kernel Machines.

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. JMLR, 3, 1137–1155.

Bengio, Y., Le Roux, N., Vincent, P., Delalleau, O., and Marcotte, P. (2006a). Convex neural networks. In NIPS'2005, pages 123–130.

Bengio, Y., Delalleau, O., and Le Roux, N. (2006b). The curse of highly variable functions for local kernel machines. In NIPS'2005, pages 107–114.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS'2006.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In ICML'09.

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Machine Learning Res., 13, 281–305.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proc. Python for Scientific Comp. Conf. (SciPy).

Bergstra, J., Bardenet, R., Bengio, Y., and Kegl, B. (2011). Algorithms for hyper-parameter optimization. In NIPS'2011.

Berkes, P. and Wiskott, L. (2002). Applying slow feature analysis to image sequences yields a rich repertoire of complex cell properties. In ICANN'02, pages 81–86.

Bertsekas, D. P. (2010). Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Technical Report 2848, LIDS.

Bordes, A., Bottou, L., and Gallinari, P. (2009). SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10, 1737–1754.

Bordes, A., Weston, J., Collobert, R., and Bengio, Y. (2011). Learning structured embeddings of knowledge bases. In AAAI 2011.

Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and meaning representations for open-text semantic parsing. AISTATS'2012.

Bottou, L. (2011). From machine learning to machine reasoning. Technical report, arXiv.1102.1808.

Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In NIPS'2008.

Bottou, L. and LeCun, Y. (2004). Large-scale on-line learning. In NIPS'2003.

Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140.

Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly generating representative samples from an RBM-derived process. Neural Computation, 23(8), 2053–2073.

Caruana, R. (1993). Multitask connectionist learning. In Proceedings of the 1993 Connectionist Models Summer School, pages 372–379.

Cho, K., Raiko, T., and Ilin, A. (2011). Enhanced gradient and adaptive learning rate for training restricted Boltzmann machines. In ICML'2011, pages 105–112.

Coates, A. and Ng, A. Y. (2011). The importance of encoding versus training with sparse coding and vector quantization. In ICML'2011.

Collobert, R. and Bengio, S. (2004). Links between perceptrons, MLPs and SVMs. In ICML'2004.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011a). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493–2537.

Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011b). Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop.

Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by spike-and-slab RBMs. In ICML'2011.

Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Sampled reconstruction for large-scale learning of embeddings. In Proc. ICML'2011.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. J. Am. Soc. Information Science, 41(6), 391–407.

Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 781–799.

Erhan, D., Courville, A., and Bengio, Y. (2010a). Understanding representations learned in deep architectures. Technical Report 1355, Universite de Montreal/DIRO.

Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. (2010b). Why does unsupervised pre-training help deep learning? J. Machine Learning Res., 11, 625–660.

Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks, 9(5), 768–786.

Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58.

Getoor, L. and Taskar, B. (2006). Introduction to Statistical Relational Learning. MIT Press.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In AISTATS'2010, pages 249–256.

Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In AISTATS'2011.

Glorot, X., Bordes, A., and Bengio, Y. (2011b). Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML'2011.

Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep networks. In NIPS'2009, pages 646–654.

Goodfellow, I., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models.

Graepel, T., Candela, J. Q., Borchert, T., and Herbrich, R. (2010). Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In ICML'2010.

Hastad, J. (1986). Almost optimal lower bounds for small depth circuits. In STOC'86, pages 6–20.

Hastad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational Complexity, 1, 113–129.

Hinton, G. E. (1978). Relaxation and its role in vision. Ph.D. thesis, University of Edinburgh.

Hinton, G. E. (1986). Learning distributed representations of concepts. In Proc. 8th Annual Conf. Cog. Sc. Society, pages 1–12.

Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40, 185–234.

Hinton, G. E. (2010). A practical guide to training restricted Boltzmann machines. Technical Report UTML TR 2010-003, Department of Computer Science, University of Toronto.

Hinton, G. E. (2013). A practical guide to training restricted Boltzmann machines. In K.-R. Muller, G. Montavon, and G. B. Orr, editors, Neural Networks: Tricks of the Trade, Reloaded. Springer.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Hutter, F. (2009). Automated Configuration of Algorithms for Solving Hard Computational Problems. Ph.D. thesis, University of British Columbia.

Hutter, F., Hoos, H., and Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. In LION-5.

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In ICCV'09.

Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Y. (2009). Learning invariant features through topographic filter maps. In CVPR'2009.

Krueger, K. A. and Dayan, P. (2009). Flexible shaping: how learning in small steps helps. Cognition, 110, 380–394.

Lamblin, P. and Bengio, Y. (2010). Important gains from supervised fine-tuning of deep architectures on large labeled sets. NIPS*2010 Deep Learning and Unsupervised Feature Learning Workshop.

Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon University.

Larochelle, H. and Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines. In ICML'2008.

Larochelle, H., Bengio, Y., Louradour, J., and Lamblin, P. (2009). Exploring strategies for training deep neural networks. J. Machine Learning Res., 10, 1–40.

Le, Q., Ngiam, J., Chen, Z., hao Chia, D. J., Koh, P. W., and Ng, A. (2010). Tiled convolutional neural networks. In NIPS'2010.

Le, Q., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. (2011). On optimization methods for deep learning. In ICML'2011.

Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2008). Topmoumoute online natural gradient algorithm. In NIPS'07.

Le Roux, N., Bengio, Y., and Fitzgibbon, A. (2011). Improving first and second-order methods by modeling uncertainty. In Optimization for Machine Learning. MIT Press.

Le Roux, N., Schmidt, M., and Bach, F. (2012). A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical report, arXiv:1202.6258.

LeCun, Y. (1987). Modeles connexionistes de l'apprentissage. Ph.D. thesis, Universite de Paris VI.

LeCun, Y. (1989). Generalization and network design strategies. Technical Report CRG-TR-89-4, University of Toronto.

LeCun, Y. (2013). to appear. In K.-R. Muller, G. Montavon, and G. B. Orr, editors, Neural Networks: Tricks of the Trade, Reloaded. Springer.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.

LeCun, Y., Bottou, L., Orr, G. B., and Muller, K. (1998a). Efficient backprop. In Neural Networks, Tricks of the Trade.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient based learning applied to document recognition. IEEE, 86(11), 2278–2324.

Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In NIPS'07.

Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML'2009.

Martens, J. (2010). Deep learning via Hessian-free optimization. In ICML'2010, pages 735–742.

Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra, J. (2011). Unsupervised and transfer learning challenge: a deep learning approach. In JMLR W&CP: Proc. Unsupervised and Transfer Learning, volume 7.

Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In ICML'2010.

Nemirovski, A. and Yudin, D. (1983). Problem complexity and method efficiency in optimization. Wiley.

Nesterov, Y. (2009). Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1), 221–259.

Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 37, 3311–3325.

Pearlmutter, B. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1), 147–160.

Pinto, N., Doukhan, D., DiCarlo, J. J., and Cox, D. D. (2009). A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Comput Biol, 5(11), e1000579.

Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1), 77–105.

Polyak, B. and Juditsky, A. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization, 30(4), 838–855.

Raiko, T., Valpola, H., and LeCun, Y. (2012). Deep learning made easier by linear transformations in perceptrons. In AISTATS'2012.

Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In NIPS'06.

Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2008a). Sparse feature learning for deep belief networks. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20 (NIPS'07), pages 1185–1192, Cambridge, MA. MIT Press.

Ranzato, M., Boureau, Y., and LeCun, Y. (2008b). Sparse feature learning for deep belief networks. In NIPS'2007.

Richardson, M. and Domingos, P. (2006). Markov logic networks. Machine Learning, 62, 107–136.

Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011). Contracting auto-encoders: Explicit invariance during feature extraction. In ICML'2011.

Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive auto-encoders. In ICML'2012.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.

Salakhutdinov, R. and Hinton, G. (2009). Deep Boltzmann machines. In AISTATS'2009.

Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. (2011). On random weights and unsupervised feature learning. In ICML'2011.

Schraudolph, N. N. (1998). Centering neural network gradient factors. In G. B. Orr and K.-R. Muller, editors, Neural Networks: Tricks of the Trade, pages 548–548. Springer.

Socher, R., Lin, C., Ng, A. Y., and Manning, C. (2011). Learning continuous phrase representations and syntactic parsing with recursive neural networks. In ICML'2011.

Srinivasan, A. and Ramakrishnan, G. (2011). Parameter screening and optimisation for ILP using designed experiments. Journal of Machine Learning Research, 12, 627–662.

Swersky, K., Chen, B., Marlin, B., and de Freitas, N. (2010). A tutorial on stochastic approximation algorithms for training restricted Boltzmann machines and deep belief nets. In Information Theory and Applications Workshop.

Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.

Tieleman, T. and Hinton, G. (2009). Using fast weights to improve persistent contrastive divergence. In ICML'2009.

van der Maaten, L. and Hinton, G. E. (2008). Visualizing data using t-SNE. J. Machine Learning Res., 9.

Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation, 23(7).

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In ICML 2008.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Machine Learning Res., 11.

Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In ICML 2008.

Weston, J., Bengio, S., and Usunier, N. (2011). Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI.

Wiskott, L. and Sejnowski, T. J. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770.

Zou, W. Y., Ng, A. Y., and Yu, K. (2011). Unsupervised learning of visual invariance with temporal coherence. In NIPS 2011 Workshop on Deep Learning and Unsupervised Feature Learning.
