Lecture 4: Backpropagation and Neural Networks, Part 1

Source: Lecture 4: Backpropagation and Neural Networks part 1, cs231n.stanford.edu/slides/2016/winter1516_lecture4.pdf



TRANSCRIPT

  • Slide 1

    Lecture 4: Backpropagation and Neural Networks part 1

  • Slide 2

    Administrative: A1 is due Jan 20 (Wednesday), ~150 hours left. Warning: Jan 18 (Monday) is a holiday (no class / office hours).

    Also note: lectures are non-exhaustive. Read the course notes for completeness.

    I'll hold make-up office hours on Wed Jan 20, 5pm @ Gates 259.

  • Slide 3

    Where we are: we have a scores function s = f(x; W) = Wx, the SVM data loss, and the full loss = data loss + regularization. We want the gradient of the loss with respect to the weights.

  • Slide 4

    Optimization (image credits to Alec Radford)

  • Slide 5

    Gradient Descent

    Numerical gradient: slow :(, approximate :(, easy to write :)
    Analytic gradient: fast :), exact :), error-prone :(

    In practice: derive the analytic gradient, then check your implementation with the numerical gradient (a gradient check).
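    A minimal numpy sketch of such a gradient check (the test function f here is illustrative, not from the slides):

    import numpy as np

    def numerical_gradient(f, x, h=1e-5):
        # centered finite differences, one dimension at a time: slow but easy to write
        grad = np.zeros_like(x)
        for i in range(x.size):
            old = x.flat[i]
            x.flat[i] = old + h; fp = f(x)
            x.flat[i] = old - h; fm = f(x)
            x.flat[i] = old
            grad.flat[i] = (fp - fm) / (2 * h)
        return grad

    # example: the analytic gradient of f(x) = sum(x**2) is 2*x
    x = np.random.randn(5)
    analytic = 2 * x
    numeric = numerical_gradient(lambda v: np.sum(v**2), x)
    print(np.max(np.abs(analytic - numeric)))   # should be ~1e-9 or smaller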

  • Slide 6

    Computational Graph: the inputs x and W feed a multiply node (*) that produces the scores s; the scores feed the hinge loss (the data loss); W also feeds a regularization node R; and a final add node (+) combines them into the total loss L.
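    A minimal sketch of evaluating this graph forward for one example (the shapes, the label y, and the regularization strength are assumptions):

    import numpy as np

    x = np.random.randn(3073)               # input (with the bias trick)
    W = 0.001 * np.random.randn(10, 3073)   # weights
    y = 3                                   # correct class label
    reg = 1e-3                              # regularization strength

    s = W.dot(x)                                             # * node: scores
    margins = np.maximum(0, s - s[y] + 1.0); margins[y] = 0
    data_loss = np.sum(margins)                              # hinge loss node
    R = reg * np.sum(W * W)                                  # regularization node
    L = data_loss + R                                        # + node: total loss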

  • Slide 7

    Convolutional Network (AlexNet): input image and weights at one end, loss at the other.

  • Slides 8-9

    Neural Turing Machine: another large computational graph, from the input tape to the loss.

  • Slides 10-21 (one worked example, built up node by node)

    e.g. x = -2, y = 5, z = -4

    Want: the gradient of the output with respect to x, y, and z.

    Chain rule: the gradient on each input is the local gradient of the node times the gradient flowing in from above.
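    The example on these slides is f(x, y, z) = (x + y) z (assuming the standard CS231n example that these values come from); a minimal sketch of its forward and backward passes:

    # forward pass for f(x, y, z) = (x + y) * z
    x, y, z = -2, 5, -4
    q = x + y            # q = 3
    f = q * z            # f = -12

    # backward pass: apply the chain rule node by node, from the output back
    dfdz = q             # df/dz = q = 3
    dfdq = z             # df/dq = z = -4
    dfdx = 1.0 * dfdq    # dq/dx = 1, so df/dx = df/dq * dq/dx = -4
    dfdy = 1.0 * dfdq    # dq/dy = 1, so df/dy = -4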

  • Slides 22-27 (zooming in on a single gate, built up step by step)

    A gate f sits in the circuit with activations flowing through it on the forward pass. It can compute its local gradient (the derivative of its output with respect to each input) on its own. During backprop it receives the gradient on its output and multiplies it by the local gradients to produce the gradients on its inputs, which are then passed further back.
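    In symbols, for a gate z = f(x, y) inside a circuit that eventually computes a loss L:

    \frac{\partial L}{\partial x} = \frac{\partial L}{\partial z}\,\frac{\partial z}{\partial x},
    \qquad
    \frac{\partial L}{\partial y} = \frac{\partial L}{\partial z}\,\frac{\partial z}{\partial y}

    where \partial L / \partial z is the gradient arriving from above and \partial z / \partial x, \partial z / \partial y are the local gradients.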

  • Slides 28-41 (another example, backpropagated one node at a time)

    Another example: a longer expression, with each gate's gradient computed as [local gradient] x [its gradient]. Sample steps from the slides:

    (-1) * (-0.20) = 0.20

    [local gradient] x [its gradient]: [1] x [0.2] = 0.2 and [1] x [0.2] = 0.2 (both inputs of the add gate)

    [local gradient] x [its gradient]: x0: [2] x [0.2] = 0.4, w0: [-1] x [0.2] = -0.2
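    These numbers match the lecture's running example f(w, x) = 1 / (1 + e^-(w0*x0 + w1*x1 + w2)); assuming those inputs, a minimal sketch of the full forward and backward pass:

    import math

    w = [2.0, -3.0, -3.0]    # w0, w1, w2 (w2 acts as a bias)
    x = [-1.0, -2.0]         # x0, x1

    # forward pass
    dot = w[0]*x[0] + w[1]*x[1] + w[2]       # = 1.0
    f = 1.0 / (1.0 + math.exp(-dot))         # sigmoid output, ~0.73

    # backward pass, using d(sigmoid)/d(dot) = (1 - f) * f, ~0.2
    ddot = (1.0 - f) * f
    dx = [w[0]*ddot, w[1]*ddot]              # dx0 ~ 0.4, dx1 ~ -0.6
    dw = [x[0]*ddot, x[1]*ddot, 1.0*ddot]    # dw0 ~ -0.2, dw1 ~ -0.4, dw2 ~ 0.2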

  • Slides 42-43

    Sigmoid function and the sigmoid gate: the whole gate can be backpropagated through in one step, because its local gradient has a simple form, e.g. (0.73) * (1 - 0.73) = 0.2
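    The reason the gate's local gradient is this simple:

    \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
    \frac{d\sigma}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \big(1 - \sigma(x)\big)\,\sigma(x)

    so with an output of 0.73 the local gradient is (1 - 0.73)(0.73), about 0.2.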

  • Slide 44

    Patterns in backward flow

    add gate: gradient distributor
    max gate: gradient router
    mul gate: gradient switcher?
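    A minimal sketch of these three patterns as local backward rules (the function names are illustrative):

    def add_backward(dout):
        # add gate: distributes the incoming gradient to both inputs unchanged
        return dout, dout

    def max_backward(x, y, dout):
        # max gate: routes the incoming gradient to whichever input was larger
        return (dout, 0.0) if x > y else (0.0, dout)

    def mul_backward(x, y, dout):
        # mul gate: scales the incoming gradient by the *other* input ("switcher")
        return y * dout, x * dout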

  • Slide 45

    Gradients add at branches: when a variable feeds into several parts of the circuit, the gradients flowing back along each branch are summed (+).

  • Slide 46

    Implementation: forward/backward API

    Graph (or Net) object (rough pseudocode)
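    A rough sketch of what such a Graph/Net object could look like (the class and attribute names here are assumptions, not the lecture's actual code):

    class Net:
        def __init__(self, gates):
            self.gates = gates                 # gates in topologically sorted order

        def forward(self, x):
            out = x
            for gate in self.gates:            # iterate over gates in topological order
                out = gate.forward(out)
            return out                         # the loss

        def backward(self):
            dout = 1.0                         # gradient of the loss w.r.t. itself
            for gate in reversed(self.gates):  # reverse topological order
                dout = gate.backward(dout)
            return dout                        # gradient on the input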

  • Slides 47-48

    Implementation: forward/backward API for a single multiply gate * with inputs x, y and output z (x, y, z are scalars): forward computes z = x * y; backward takes the gradient on z and returns the gradients on x and y.
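    A minimal sketch of such a gate under this API (caching the inputs in forward is the key point):

    class MultiplyGate:
        def forward(self, x, y):
            self.x, self.y = x, y    # cache the inputs; they are needed in backward
            return x * y
        def backward(self, dz):      # dz is dL/dz flowing in from above
            dx = self.y * dz         # dL/dx = y * dL/dz
            dy = self.x * dz         # dL/dy = x * dL/dz
            return dx, dy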

  • Slides 49-50

    Example: Torch Layers (a Torch layer is essentially the same thing as a gate with a forward/backward interface)

  • Slide 51

    Example: Torch MulConstant (initialization, forward(), backward())
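    The real layer is written in Lua/Torch; a rough Python analogue of the same three pieces (not the actual Torch source):

    class MulConstant:
        def __init__(self, scale):          # initialization: store the constant
            self.scale = scale
        def forward(self, x):               # forward(): output = scale * input
            return self.scale * x
        def backward(self, grad_output):    # backward(): gradInput = scale * gradOutput
            return self.scale * grad_output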

  • Slide 52

    Example: Caffe Layers

  • Slide 53

    Caffe Sigmoid Layer: the backward pass multiplies the local sigmoid gradient by top_diff (chain rule).
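    Not the actual Caffe (C++) source; a numpy analogue of the layer's two passes:

    import numpy as np

    def sigmoid_forward(bottom):
        top = 1.0 / (1.0 + np.exp(-bottom))
        return top

    def sigmoid_backward(top, top_diff):
        # the local gradient of the sigmoid is top * (1 - top);
        # the chain rule multiplies it by the incoming top_diff
        bottom_diff = top_diff * top * (1.0 - top)
        return bottom_diff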

  • Slide 54

    Gradients for vectorized code: x, y, z are now vectors flowing through a node f. The local gradient is now the Jacobian matrix (the derivative of each element of z with respect to each element of x), and the gradients flowing backward are combined with it by matrix multiplication.

  • Slides 55-58

    Vectorized operations: f(x) = max(0, x), applied elementwise to a 4096-d input vector, producing a 4096-d output vector.

    Q: what is the size of the Jacobian matrix? [4096 x 4096!]

    Q2: what does it look like?

    In practice we process an entire minibatch (e.g. 100 examples) at one time, i.e. the Jacobian would technically be a [409,600 x 409,600] matrix :\
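    Because the operation is elementwise, that Jacobian is diagonal, and in practice it is never formed: the backward pass reduces to a mask. A minimal numpy sketch:

    import numpy as np

    def relu_forward(x):
        return np.maximum(0, x)

    def relu_backward(x, dout):
        dx = dout * (x > 0)   # equivalent to multiplying by the (diagonal) Jacobian
        return dx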

  • Slide 59

    Assignment: writing SVM/Softmax. Stage your forward/backward computation!

    E.g. for the SVM: compute the margins as an intermediate stage.
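    A minimal single-example sketch of that staging, assuming a vector of scores s, correct label y, and margin 1 as in the lecture's SVM loss:

    import numpy as np

    def svm_loss_staged(s, y):
        # forward: stage the margins so they can be reused in backward
        margins = np.maximum(0, s - s[y] + 1.0)
        margins[y] = 0.0
        loss = np.sum(margins)

        # backward: gradient of the loss with respect to the scores, via the staged margins
        ds = (margins > 0).astype(float)
        ds[y] = -np.sum(margins > 0)
        return loss, ds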

  • Slide 60

    Summary so far

    - neural nets will be very large: no hope of writing down gradient formula by hand for all parameters

    - backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates

    - implementations maintain a graph structure, where the nodes implement the forward() / backward() API.

    - forward: compute result of an operation and save any intermediates needed for gradient computation in memory

    - backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs.


  • Slides 62-66

    Neural Network: without the brain stuff

    (Before) Linear score function: s = Wx

    (Now) 2-layer Neural Network: s = W2 max(0, W1 x)
    x -> W1 -> h -> W2 -> s, with sizes 3072 -> 100 -> 10

    or 3-layer Neural Network: s = W3 max(0, W2 max(0, W1 x))
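    A minimal numpy sketch of the 2-layer score function with the sizes on the slide (the weight scale is an arbitrary choice):

    import numpy as np

    x  = np.random.randn(3072)               # input image, flattened (CIFAR-10 size)
    W1 = 0.01 * np.random.randn(100, 3072)   # first layer weights
    W2 = 0.01 * np.random.randn(10, 100)     # second layer weights

    h = np.maximum(0, W1.dot(x))             # hidden layer h, 100-d (ReLU)
    s = W2.dot(h)                            # class scores s, 10-d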

  • Slide 67

    Full implementation of training a 2-layer Neural Network needs ~11 lines:

    from @iamtrask, http://iamtrask.github.io/2015/07/12/basic-python-network/
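    A sketch in the spirit of the referenced post (not a verbatim copy): a tiny 1-hidden-layer network with sigmoid units, trained by backprop on a toy dataset.

    import numpy as np

    X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]], dtype=float)   # toy inputs
    y = np.array([[0,1,1,0]], dtype=float).T                       # toy targets
    W1 = 2*np.random.random((3,4)) - 1
    W2 = 2*np.random.random((4,1)) - 1
    for _ in range(10000):
        h   = 1/(1+np.exp(-X.dot(W1)))        # forward: hidden layer
        out = 1/(1+np.exp(-h.dot(W2)))        # forward: output
        d_out = (y - out) * out*(1-out)       # backward: output delta
        d_h   = d_out.dot(W2.T) * h*(1-h)     # backward: hidden delta (chain rule)
        W2 += h.T.dot(d_out)                  # update (learning rate folded in)
        W1 += X.T.dot(d_h)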

  • Slide 68

    Assignment: writing a 2-layer Net. Stage your forward/backward computation!


  • Slide 73

    sigmoid activation function


  • Slide 75

    Be very careful with your brain analogies. Biological neurons:
    - Many different types
    - Dendrites can perform complex non-linear computations
    - Synapses are not a single weight but a complex non-linear dynamical system
    - Rate code may not be adequate

    [Dendritic Computation. London and Hausser]

  • Slide 76

    Activation Functions:
    - Sigmoid
    - tanh: tanh(x)
    - ReLU: max(0, x)
    - Leaky ReLU: max(0.1x, x)
    - Maxout
    - ELU
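    The first four written as elementwise numpy functions (Maxout and ELU take extra parameters and are left out of this sketch):

    import numpy as np

    def sigmoid(x):     return 1.0 / (1.0 + np.exp(-x))
    def tanh(x):        return np.tanh(x)
    def relu(x):        return np.maximum(0, x)
    def leaky_relu(x):  return np.maximum(0.1 * x, x)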

  • Slide 77

    Neural Networks: Architectures

    Fully-connected layers.
    "2-layer Neural Net", or "1-hidden-layer Neural Net"
    "3-layer Neural Net", or "2-hidden-layer Neural Net"

  • Slides 78-79

    Example feed-forward computation of a Neural Network: we can efficiently evaluate an entire layer of neurons at once.
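    A minimal sketch of that evaluation, one matrix multiply per layer (the sizes and random weights are placeholders, in the style of the course notes):

    import numpy as np

    f = lambda x: 1.0 / (1.0 + np.exp(-x))          # sigmoid activation function
    x = np.random.randn(3, 1)                       # input column vector
    W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))
    W2, b2 = np.random.randn(4, 4), np.zeros((4, 1))
    W3, b3 = np.random.randn(1, 4), np.zeros((1, 1))

    h1  = f(W1.dot(x)  + b1)   # first hidden layer, shape (4, 1)
    h2  = f(W2.dot(h1) + b2)   # second hidden layer, shape (4, 1)
    out = W3.dot(h2) + b3      # output neuron, shape (1, 1)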

  • Slide 80

    Setting the number of layers and their sizes

    more neurons = more capacity

  • Slide 81

    (you can play with this demo over at ConvNetJS: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html)

    Do not use size of neural network as a regularizer. Use stronger regularization instead:


  • Slide 82

    Summary

    - we arrange neurons into fully-connected layers
    - the abstraction of a layer has the nice property that it allows us to use efficient vectorized code (e.g. matrix multiplies)
    - neural networks are not really neural
    - neural networks: bigger = better (but might have to regularize more strongly)

  • Slide 83

    Next Lecture:

    More than you ever wanted to know about Neural Networks and how to train them.

  • Slide 84

    A complex graph with inputs x and outputs y:
    - reverse-mode differentiation: if you want the effect of many things on one thing (reused for many different x)
    - forward-mode differentiation: if you want the effect of one thing on many things (reused for many different y)