lecture 2: deep learning fundamentals -...
TRANSCRIPT
![Page 1: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/1.jpg)
1Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Lecture 2:Deep Learning Fundamentals
![Page 2: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/2.jpg)
2Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Announcements
- A0 was released yesterday. Due next Tuesday, Sep 22.- Setup assignment for later homeworks.
- Project partner finding session this Friday, Sep 18.- Office hours will start next week
![Page 3: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/3.jpg)
3Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Last time: key ingredients of deep learning successAlgorithms Compute
Data
![Page 4: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/4.jpg)
4Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Today: Review of deep learning fundamentals
- Machine learning vs. deep learning framework
- Deep learning basics through a simple example- Defining a neural network architecture- Defining a loss function- Optimizing the loss function
- Model implementation using deep learning frameworks
- Design choices
![Page 5: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/5.jpg)
5Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Machine learning framework
Data-driven learning of a mapping from input to output
![Page 6: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/6.jpg)
6Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Machine learning framework
Input Feature extractor
Machine learning model
Output
(e.g., ) (e.g., color and texture histograms)
(e.g., support vector machines and random forests)
(e.g., presence or not or disease)
Data-driven learning of a mapping from input to output
Traditional machine learning approaches
![Page 7: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/7.jpg)
7Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Machine learning framework
Traditional machine learning
Input Feature extractor
Machine learning model
Output
(e.g., ) (e.g., color and texture histograms)
(e.g., support vector machines and random forests)
(e.g., presence or not or disease)
Q: What other features could be of interestin this X-ray? (Raise hand or type in chat box)
![Page 8: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/8.jpg)
8Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Deep learning (a type of machine learning)
![Page 9: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/9.jpg)
9Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Deep learning (a type of machine learning)
Traditional machine learning
Input Feature extractor
Machine learning model
Output
(e.g., ) (e.g., color and texture histograms)
(e.g., support vector machines and random forests)
(e.g., presence or not or disease)
![Page 10: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/10.jpg)
10Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Deep learning (a type of machine learning)
Deep Learning Model
Traditional machine learning
Input Feature extractor
Machine learning model
Output
Deep learning
Input Output
(e.g., ) (e.g., color and texture histograms)
(e.g., support vector machines and random forests)
(e.g., presence or not or disease)
(e.g., ) (e.g., presence or not or disease)
(e.g., convolutional and recurrent neural networks)
![Page 11: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/11.jpg)
11Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Deep learning (a type of machine learning)
Deep Learning Model
Traditional machine learning
Input Feature extractor
Machine learning model
Output
Deep learning
Input Output
(e.g., ) (e.g., color and texture histograms)
(e.g., support vector machines and random forests)
(e.g., presence or not or disease)
(e.g., ) (e.g., presence or not or disease)
(e.g., convolutional and recurrent neural networks)
Directly learns what are useful (and better!) features from the training data
![Page 12: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/12.jpg)
12Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
How do deep learning models perform feature extraction?
Input
(e.g., )
Output
(e.g., presence or not or disease)
![Page 13: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/13.jpg)
13Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
How do deep learning models perform feature extraction?
Input
(e.g., )
Output
(e.g., presence or not or disease)
Hierarchical structure of neural networks
allows compositional extraction of
increasingly complex features
Low-level features
Mid-level features
High-level features
Feature visualizations from Zeiler and Fergus 2013
![Page 14: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/14.jpg)
14Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Topics we will cover
1. Preparing data for deep learning2. Neural network models3. Training neural networks4. Evaluating models
Input
(e.g., )
Output
(e.g., presence or not or disease)
![Page 15: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/15.jpg)
15Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
First (today): deep learning basics through a simple example
Model input: data vector Model output: prediction (single number)
Let us consider the task of regression: predicting a single real-valued output from input data
![Page 16: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/16.jpg)
16Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Model input: data vector Model output: prediction (single number)
Let us consider the task of regression: predicting a single real-valued output from input data
Example: predicting hospital length-of-stay from clinical variables in the electronic health record
[age, weight, …, temperature, oxygen saturation] length-of-stay (days)
Example: predicting expression level of a target gene from the expression levels of N landmark genes
expression levels of N landmark genes expression level of target gene
First (today): deep learning basics through a simple example
![Page 17: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/17.jpg)
17Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Regression tasks
Breakout session:1x 5-minute breakout (~4 students each)
- Introduce yourself!- What other regression tasks are there in the
medical field?- What should be the inputs?- What should be the output?- What are the key features to extract?
![Page 18: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/18.jpg)
18Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Defining a neural network architectureOur first architecture: a single-layer, fully connected neural network
![Page 19: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/19.jpg)
19Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Defining a neural network architectureOur first architecture: a single-layer, fully connected neural network
all inputs of a layer are connected to all outputs of a layer
![Page 20: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/20.jpg)
20Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Defining a neural network architectureOur first architecture: a single-layer, fully connected neural network
all inputs of a layer are connected to all outputs of a layer
For simplicity, use a 3-dimensional input (N = 3)
Output:
![Page 21: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/21.jpg)
21Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Defining a neural network architectureOur first architecture: a single-layer, fully connected neural network
all inputs of a layer are connected to all outputs of a layer
Output:
For simplicity, use a 3-dimensional input (N = 3)
bias term (allows constant shift)
![Page 22: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/22.jpg)
22Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Defining a neural network architectureOur first architecture: a single-layer, fully connected neural network
all inputs of a layer are connected to all outputs of a layer
Output:
For simplicity, use a 3-dimensional input (N = 3)
layer inputs
layer output(s)
bias term (allows constant shift)
![Page 23: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/23.jpg)
23Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Defining a neural network architecture
Neural network parameters:
Our first architecture: a single-layer, fully connected neural network
all inputs of a layer are connected to all outputs of a layer
Output:
For simplicity, use a 3-dimensional input (N = 3)
layer inputs
layer output(s)
bias term (allows constant shift)
![Page 24: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/24.jpg)
24Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Defining a neural network architecture
Neural network parameters:
Our first architecture: a single-layer, fully connected neural network
all inputs of a layer are connected to all outputs of a layer
Output:
For simplicity, use a 3-dimensional input (N = 3)
layer inputs
layer output(s)
bias term (allows constant shift)
layer “weights” layer bias
![Page 25: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/25.jpg)
25Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Defining a neural network architecture
Neural network parameters:
Our first architecture: a single-layer, fully connected neural network
all inputs of a layer are connected to all outputs of a layer
Output:
For simplicity, use a 3-dimensional input (N = 3)
layer inputs
layer output(s)
bias term (allows constant shift)
layer “weights” layer bias
Often refer to all parameters together as just “weights”. Bias is implicitly assumed.
![Page 26: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/26.jpg)
26Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Defining a neural network architecture
Neural network parameters:
Our first architecture: a single-layer, fully connected neural network
all inputs of a layer are connected to all outputs of a layer
Output:
For simplicity, use a 3-dimensional input (N = 3)
layer inputs
layer output(s)
Caveats of our first (simple) neural network architecture:- Single layer still “shallow”, not yet a “deep” neural network. Will see how to stack multiple layers.- Also equivalent to a linear regression model! But useful base case for deep learning.
bias term (allows constant shift)
layer “weights” layer bias
Often refer to all parameters together as just “weights”. Bias is implicitly assumed.
![Page 27: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/27.jpg)
27Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Defining a loss function
Neural network parameters:
Output:
![Page 28: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/28.jpg)
28Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Defining a loss function
Loss functions are quantitative measures of how satisfactory the model predictions are (i.e., how “good” the model parameters are).
Neural network parameters:
Output:
![Page 29: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/29.jpg)
29Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Defining a loss function
Loss functions are quantitative measures of how satisfactory the model predictions are (i.e., how “good” the model parameters are).
We will use the mean square error (MSE) loss which is standard for regression.
Neural network parameters:
Output:
![Page 30: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/30.jpg)
30Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Defining a loss function
MSE loss for a single example , when the prediction is and the correct (ground truth) output is :
Neural network parameters:
Output:
Loss functions are quantitative measures of how satisfactory the model predictions are (i.e., how “good” the model parameters are).
We will use the mean square error (MSE) loss which is standard for regression.
![Page 31: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/31.jpg)
31Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Defining a loss function
MSE loss for a single example , when the prediction is and the correct (ground truth) output is :
the loss is small when the prediction is close to the ground truth
Neural network parameters:
Output:
Loss functions are quantitative measures of how satisfactory the model predictions are (i.e., how “good” the model parameters are).
We will use the mean square error (MSE) loss which is standard for regression.
![Page 32: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/32.jpg)
32Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Defining a loss function
MSE loss for a single example , when the prediction is and the correct (ground truth) output is :
MSE loss over a set of examples :
the loss is small when the prediction is close to the ground truth
Neural network parameters:
Output:
Loss functions are quantitative measures of how satisfactory the model predictions are (i.e., how “good” the model parameters are).
We will use the mean square error (MSE) loss which is standard for regression.
![Page 33: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/33.jpg)
33Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Goal: find the “best” values of the model parameters that minimize the loss function
Optimizing the loss function: gradient descent
![Page 34: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/34.jpg)
34Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Goal: find the “best” values of the model parameters that minimize the loss function
The approach we will take: gradient descent
Optimizing the loss function: gradient descent
![Page 35: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/35.jpg)
35Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Goal: find the “best” values of the model parameters that minimize the loss function
The approach we will take: gradient descent
Optimizing the loss function: gradient descent
“Loss landscape”: the value of the loss function at every value of the model parameters
Loss
Figure credit: https://easyai.tech/wp-content/uploads/2019/01/tiduxiajiang-1.png
![Page 36: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/36.jpg)
36Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Goal: find the “best” values of the model parameters that minimize the loss function
The approach we will take: gradient descent
Optimizing the loss function: gradient descent
“Loss landscape”: the value of the loss function at every value of the model parameters
Main idea: iteratively update the model parameters, to take steps in the local direction of steepest (negative) slope, i.e., the negative gradient
Loss
Figure credit: https://easyai.tech/wp-content/uploads/2019/01/tiduxiajiang-1.png
![Page 37: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/37.jpg)
37Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Goal: find the “best” values of the model parameters that minimize the loss function
The approach we will take: gradient descent
Optimizing the loss function: gradient descent
“Loss landscape”: the value of the loss function at every value of the model parameters
Main idea: iteratively update the model parameters, to take steps in the local direction of steepest (negative) slope, i.e., the negative gradient
We will be able to use gradient descent to iteratively optimize the complex loss function landscapes corresponding to deep neural networks!
Loss
Figure credit: https://easyai.tech/wp-content/uploads/2019/01/tiduxiajiang-1.png
![Page 38: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/38.jpg)
38Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
The derivative of a function is a measure of local slope.
Ex:
Review from calculus: derivatives and gradients
![Page 39: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/39.jpg)
39Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
The derivative of a function is a measure of local slope.
Ex:
Review from calculus: derivatives and gradients
![Page 40: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/40.jpg)
40Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
The derivative of a function is a measure of local slope.
Ex:
Review from calculus: derivatives and gradients
![Page 41: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/41.jpg)
41Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
The derivative of a function is a measure of local slope.
Ex:
Review from calculus: derivatives and gradients
![Page 42: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/42.jpg)
42Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
The gradient of a function of multiple variables is the vector of partial derivatives of the function with respect to each variable.
Ex:
The derivative of a function is a measure of local slope.
Ex:
Review from calculus: derivatives and gradients
![Page 43: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/43.jpg)
43Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
The gradient of a function of multiple variables is the vector of partial derivatives of the function with respect to each variable.
Ex:
The gradient evaluated at a particular point is the direction of steepest ascent of the function.
The derivative of a function is a measure of local slope.
Ex:
Review from calculus: derivatives and gradients
![Page 44: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/44.jpg)
44Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
The gradient of a function of multiple variables is the vector of partial derivatives of the function with respect to each variable.
Ex:
The gradient evaluated at a particular point is the direction of steepest ascent of the function.
The derivative of a function is a measure of local slope.
Ex:
Review from calculus: derivatives and gradients
The negative of the gradient is the direction of steepest descent -> direction we want to move in the loss function landscape!
![Page 45: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/45.jpg)
45Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Gradient descent algorithmLet the gradient of the loss function with respect to the model parameters w be:
![Page 46: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/46.jpg)
46Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Gradient descent algorithmLet the gradient of the loss function with respect to the model parameters w be:
For ease of notation, rewrite parameter as corresponding to :
![Page 47: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/47.jpg)
47Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Gradient descent algorithmLet the gradient of the loss function with respect to the model parameters w be:
Then we can minimize the loss function by iteratively updating the model parameters (“taking steps”) in the direction of the negative gradient, until convergence:
For ease of notation, rewrite parameter as corresponding to :
![Page 48: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/48.jpg)
48Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Gradient descent algorithmLet the gradient of the loss function with respect to the model parameters w be:
Then we can minimize the loss function by iteratively updating the model parameters (“taking steps”) in the direction of the negative gradient, until convergence:
For ease of notation, rewrite parameter as corresponding to :
“Step size” hyperparameter (design choice) indicating how big of a step in the negative gradient direction we want to take at each update.Too big -> may overshoot minima.Too small -> optimization takes too long.
![Page 49: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/49.jpg)
49Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Gradient descent algorithm: in code
![Page 50: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/50.jpg)
50Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Stochastic gradient descent (SGD)Evaluating gradient involves iterating over all data examples, which can be slow!
In practice, usually use stochastic gradient descent: estimate gradient over a sample of data examples (usually as many as can fit on GPU at one time, e.g. 32, 64, 128)
![Page 51: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/51.jpg)
51Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Optimizing the loss function: our example
Loss function:
Per-example:
Over M examples:
Neural network parameters:
Output:
![Page 52: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/52.jpg)
52Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Optimizing the loss function: our example
Loss function:
Per-example:
Over M examples:
Neural network parameters:
Output:
Gradient of loss w.r.t. weights:
Partial derivative of loss w.r.t. kth weight:
![Page 53: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/53.jpg)
53Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Optimizing the loss function: our example
Loss function:
Per-example:
Over M examples:
Neural network parameters:
Output:
Gradient of loss w.r.t. weights:
Partial derivative of loss w.r.t. kth weight:Chain rule
![Page 54: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/54.jpg)
54Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Optimizing the loss function: our example
Loss function:
Gradient of loss w.r.t. weights:
Per-example:
Over M examples:
Partial derivative of loss w.r.t. kth weight:
Neural network parameters:
Output:
Over M examples
![Page 55: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/55.jpg)
55Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Optimizing the loss function: our example
Loss function:
Gradient of loss w.r.t. weights:
Per-example:
Over M examples:
Partial derivative of loss w.r.t. kth weight:
Neural network parameters:
Output:
Full gradient expression:
![Page 56: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/56.jpg)
56Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Now: a two-layer fully-connected neural networkOutput:
![Page 57: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/57.jpg)
57Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Output:
Sigmoid “activation function”
Now: a two-layer fully-connected neural network
![Page 58: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/58.jpg)
58Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Output:
Full function expression:
Now: a two-layer fully-connected neural network
![Page 59: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/59.jpg)
59Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Output:
Sigmoid “activation function”Activation functions
Introduce non-linearity into the model -- allowing it to represent highly complex functions.
Full function expression:
Now: a two-layer fully-connected neural network
![Page 60: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/60.jpg)
60Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Output:
Sigmoid “activation function”Activation functions
introduce non-linearity into the model -- allowing it to represent highly complex functions.
A fully-connected neural network (also known as multi-layer perceptron) is a stack of [affine transformation + activation function] layers.
Full function expression:
Now: a two-layer fully-connected neural network
![Page 61: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/61.jpg)
61Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Output:
Now: a two-layer fully-connected neural network
![Page 62: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/62.jpg)
62Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Neural network parameters:
Output:
Now: a two-layer fully-connected neural network
![Page 63: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/63.jpg)
63Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Neural network parameters:
Output:
Loss function (regression loss, same as before):
Per-example:
Over M examples:
Now: a two-layer fully-connected neural network
![Page 64: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/64.jpg)
64Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Neural network parameters:
Output:
Loss function (regression loss, same as before):
Per-example:
Over M examples:
Gradient of loss w.r.t. weights:Function more complex -> now much harder to derive the expressions! Instead… computational graphs and backpropagation.
Now: a two-layer fully-connected neural network
![Page 65: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/65.jpg)
65Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Computing gradients with backpropagation
![Page 66: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/66.jpg)
66Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Computing gradients with backpropagation
Think of computing loss function as staged computation of intermediate variables:
Network output:
![Page 67: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/67.jpg)
67Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Computing gradients with backpropagation
Think of computing loss function as staged computation of intermediate variables:
Network output:
“Forward pass”:
![Page 68: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/68.jpg)
68Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Computing gradients with backpropagation
Think of computing loss function as staged computation of intermediate variables:
Now, can use a repeated application of the chain rule, going backwards through the computational graph, to obtain the gradient of the loss with respect to each node of the computation graph.
Network output:
“Forward pass”:
![Page 69: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/69.jpg)
69Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Computing gradients with backpropagation
Think of computing loss function as staged computation of intermediate variables:
Now, can use a repeated application of the chain rule, going backwards through the computational graph, to obtain the gradient of the loss with respect to each node of the computation graph.
Network output:
“Forward pass”:
“Backward pass”: (not all gradients shown)(not all gradients shown)
![Page 70: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/70.jpg)
70Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Computing gradients with backpropagation
Think of computing loss function as staged computation of intermediate variables:
Now, can use a repeated application of the chain rule, going backwards through the computational graph, to obtain the gradient of the loss with respect to each node of the computation graph.
Network output:
“Forward pass”:
“Backward pass”: (not all gradients shown)
Plug in from earlier computations via chain rule
![Page 71: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/71.jpg)
71Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Computing gradients with backpropagation
Think of computing loss function as staged computation of intermediate variables:
Now, can use a repeated application of the chain rule, going backwards through the computational graph, to obtain the gradient of the loss with respect to each node of the computation graph.
Network output:
“Forward pass”:
“Backward pass”:
Plug in from earlier computations via chain rule
Local gradients to derive
(not all gradients shown)
![Page 72: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/72.jpg)
72Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Computing gradients with backpropagation
Key idea: Don’t mathematically derive entire math expression for e.g. dL / dW1. By writing it as nested applications of the chain rule, only have to derive simple “local” gradients representing relationships between connected nodes of the graph (e.g. dH / dW1).
![Page 73: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/73.jpg)
73Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Computing gradients with backpropagation
Key idea: Don’t mathematically derive entire math expression for e.g. dL / dW1. By writing it as nested applications of the chain rule, only have to derive simple “local” gradients representing relationships between connectected nodes of the graph (e.g. dH / dW1).
Can use more or less intermediate variables to control how difficult local gradients are to derive!
![Page 74: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/74.jpg)
74Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Training our two-layer neural network in code, using backpropagation
![Page 75: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/75.jpg)
75Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Training our two-layer neural network in code, using backpropagation
Initialize model parameters
![Page 76: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/76.jpg)
76Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Training our two-layer neural network in code, using backpropagation
Forward pass
![Page 77: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/77.jpg)
77Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Training our two-layer neural network in code, using backpropagation
Backwardpass
![Page 78: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/78.jpg)
78Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Training our two-layer neural network in code, using backpropagation
Upstream gradient
Downstream gradient Backward
pass
![Page 79: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/79.jpg)
79Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Training our two-layer neural network in code, using backpropagation
Gradientupdate
![Page 80: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/80.jpg)
80Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Deep learning software frameworks- Makes our lives easier by providing implementations and higher-level abstractions of many
components for deep learning, and running them on GPUs:- Dataset batching, model definition, gradient computation, optimization, etc.
![Page 81: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/81.jpg)
81Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Deep learning software frameworks- Makes our lives easier by providing implementations and higher-level abstractions of many
components for deep learning, and running them on GPUs:- Dataset batching, model definition, gradient computation, optimization, etc.
- Automatic differentiation: if we define nodes in a computational graph, will automatically implement backpropagation for us
- Supports many common operations with local gradients already implemented- Can still define custom operations
![Page 82: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/82.jpg)
82Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Deep learning software frameworks- Makes our lives easier by providing implementations and higher-level abstractions of many
components for deep learning, and running them on GPUs:- Dataset batching, model definition, gradient computation, optimization, etc.
- Automatic differentiation: if we define nodes in a computational graph, will automatically implement backpropagation for us
- Supports many common operations with local gradients already implemented- Can still define custom operations
- A number of popular options, e.g. Tensorflow and PyTorch. Recent stable versions (TF 2.0, PyTorch 1.3) work largely in a similar fashion (not necessarily true for earlier versions). We will use Tensorflow 2.0 in this class.
![Page 83: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/83.jpg)
83Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Deep learning software frameworks- Makes our lives easier by providing implementations and higher-level abstractions of many
components for deep learning, and running them on GPUs:- Dataset batching, model definition, gradient computation, optimization, etc.
- Automatic differentiation: if we define nodes in a computational graph, will automatically implement backpropagation for us
- Supports many common operations with local gradients already implemented- Can still define custom operations
- A number of popular options, e.g. Tensorflow and PyTorch. Recent stable versions (TF 2.0, PyTorch 1.3) work largely in a similar fashion (not necessarily true for earlier versions). We will use Tensorflow 2.0 in this class.
- More next Friday, Sept 25, during our Tensorflow section.
![Page 84: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/84.jpg)
84Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Training our two-layer neural network in code, in Tensorflow 2.0
![Page 85: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/85.jpg)
85Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Training our two-layer neural network in code, in Tensorflow 2.0
Convert data to TF tensors, create a TF dataset
![Page 86: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/86.jpg)
86Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Training our two-layer neural network in code, in Tensorflow 2.0
Initialize parameters to be learned as tf.Variable -> allows them to receive gradient updates during optimization
![Page 87: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/87.jpg)
87Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Training our two-layer neural network in code, in Tensorflow 2.0
Initialize a TF optimizer
![Page 88: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/88.jpg)
88Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Training our two-layer neural network in code, in Tensorflow 2.0
All operations defined under the gradient tape will be used to construct a computational graph
![Page 89: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/89.jpg)
89Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Training our two-layer neural network in code, in Tensorflow 2.0
The computational graph for our two-layer neural network
![Page 90: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/90.jpg)
90Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Training our two-layer neural network in code, in Tensorflow 2.0
Evaluate gradients using automatic differentiation and perform gradient update
![Page 91: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/91.jpg)
91Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Also high level libraries built on top of Tensorflow, that provide even easier-to-use APIs:
In Tensorflow 2.0: In Keras:
![Page 92: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/92.jpg)
92Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Also high level libraries built on top of Tensorflow, that provide even easier-to-use APIs:
In Tensorflow 2.0: In Keras:
Stack of layers
![Page 93: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/93.jpg)
93Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Also high level libraries built on top of Tensorflow, that provide even easier-to-use APIs:
In Tensorflow 2.0: In Keras:Fully-connected layer
![Page 94: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/94.jpg)
94Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Also high level libraries built on top of Tensorflow, that provide even easier-to-use APIs:
In Tensorflow 2.0: In Keras:
Activation function and bias configurations included!
![Page 95: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/95.jpg)
95Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Also high level libraries built on top of Tensorflow, that provide even easier-to-use APIs:
In Tensorflow 2.0: In Keras:
Specify hyperparameters for the training procedure
![Page 96: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/96.jpg)
96Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Design choices
![Page 97: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/97.jpg)
97Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Training hyperparameters: control knobs for the art of training neural networks
Optimization methods: SGD, SGD with momentum, RMSProp, Adam
SGD
SGD+MomentumRMSProp
Adam
- Adam is a good default choice in many cases; it often works ok even with constant learning rate
- SGD+Momentum can outperform Adam but may require more tuning of LR and schedule
![Page 98: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/98.jpg)
98Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.
Q: Which one of these learning rates is best to use?
Slide credit: CS231n
![Page 99: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/99.jpg)
99Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.
Q: Which one of these learning rates is best to use?
A: All of them! Start with large learning rate and decay over time
Slide credit: CS231n
![Page 100: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/100.jpg)
100Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Reduce learning rateStep: Reduce learning rate at a few fixed points. E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.
Learning rate decay
Slide credit: CS231n
![Page 101: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/101.jpg)
101Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
: Initial learning rate: Learning rate at epoch t: Total number of epochs
Step: Reduce learning rate at a few fixed points. E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.
Fancy decay schedules like cosine, linear, inverse sqrt.
Empirical rule of thumb: If you increase the batch size by N, also scale the initial learning rate by N
Learning rate decay
Loshchilov and Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts”, ICLR 2017Radford et al, “Improving Language Understanding by Generative Pre-Training”, 2018Feichtenhofer et al, “SlowFast Networks for Video Recognition”, arXiv 2018Child at al, “Generating Long Sequences with Sparse Transformers”, arXiv 2019
Slide credit: CS231n
![Page 102: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/102.jpg)
102Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Monitor learning curves
Periodically evaluate validation loss
Also useful to plot performance on final metric
Figure credit: https://www.learnopencv.com/wp-content/uploads/2017/11/cnn-keras-curves-without-aug.jpg
![Page 103: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/103.jpg)
103Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Monitor learning curves
Figure credit: CS231n
Training loss can be noisy. Using a scatter plot or plotting moving average can help better visualize trends.
![Page 104: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/104.jpg)
104Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Overfitting vs. underfittingOverfitting
Loss
Time
Training loss much better than validation
Training loss may continue to get better while validation plateaus or gets worse
TrainingValidation
![Page 105: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/105.jpg)
105Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Overfitting vs. underfittingOverfitting Underfitting
Loss
Time
Loss
Time
Training loss much better than validation
Training loss may continue to get better while validation plateaus or gets worse
Small or no gap between training and validation loss
May have relatively higher loss overall (model not learning sufficiently)
TrainingValidation
![Page 106: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/106.jpg)
106Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Overfitting vs. underfitting
Breakout session:1x 5-minute breakout (~4 students each)
- Introduce yourself!- What are some ways to combat overfitting?- What are some ways to combat underfitting?
![Page 107: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/107.jpg)
107Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Overfitting vs. underfittingOverfitting Underfitting
Loss
Time
Loss
Time
Training loss much better than validation
Training loss may continue to get better while validation plateaus or gets worse
Small or no gap between training and validation loss
May have relatively higher loss overall (model not learning sufficiently)
TrainingValidation
![Page 108: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/108.jpg)
108Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Overfitting vs. underfittingOverfitting
Loss
Time
Training loss much better than validation
Training loss may continue to get better while validation plateaus or gets worse
Model is “overfitting” to the training data. Best strategy: Increase data or regularize model. Second strategy: decrease model capacity (make simpler)
TrainingValidation
![Page 109: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/109.jpg)
109Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Overfitting vs. underfittingOverfitting Underfitting
Loss
Time
Loss
Time
Training loss much better than validation
Training loss may continue to get better while validation plateaus or gets worse
Model is “overfitting” to the training data. Best strategy: Increase data or regularize model. Second strategy: decrease model capacity (make simpler)
Small or no gap between training and validation loss
May have relatively higher loss overall (model not learning sufficiently)
Model is not able to sufficiently learn to fit the data well. Best strategy: Increase complexity (e.g. size) of the model. Second strategy: make problem simpler (easier task, cleaner data)
TrainingValidation
![Page 110: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/110.jpg)
110Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Overfitting vs. underfitting: more intuition
Overfitting Underfitting
Figure credit: https://qph.fs.quoracdn.net/main-qimg-412c8556aacf7e25b86bba63e9e67ac6-c
![Page 111: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/111.jpg)
111Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Healthy learning curves
Loss
Time
Steep improvement at beginning
Continue to gradually improve
![Page 112: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/112.jpg)
112Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Healthy learning curves
Loss
Time
In practice, models with best final metric (e.g. accuracy) often have slight overfitting.
Intuition: slightly push complexity of model to the highest that the data can handle
Steep improvement at beginning
Continue to gradually improve
![Page 113: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/113.jpg)
113Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
More debugging
Loss
Time
TrainingValidation
![Page 114: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/114.jpg)
114Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
More debugging
Loss
Time
Plateau may be bad weight initialization
TrainingValidation
![Page 115: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/115.jpg)
115Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
More debugging
Loss
Time
Loss
Time
Plateau may be bad weight initialization
TrainingValidation
![Page 116: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/116.jpg)
116Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
More debugging
Loss
Time
Loss
Time
Plateau may be bad weight initialization
Loss decreasing but slowly -> try higher learning rate
TrainingValidation
![Page 117: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/117.jpg)
117Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
More debugging
Loss
Time
Loss
Time
Plateau may be bad weight initialization Loss
Time
Loss decreasing but slowly -> try higher learning rate
TrainingValidation
![Page 118: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/118.jpg)
118Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
More debugging
Loss
Time
Loss
Time
Plateau may be bad weight initialization Loss
Time
Healthy loss curve plateaus -> try further learning rate decay at plateau point
Loss decreasing but slowly -> try higher learning rate
TrainingValidation
![Page 119: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/119.jpg)
119Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
More debugging
Loss
Time
Loss
Time
Loss
Time
Plateau may be bad weight initialization Loss
Time
Healthy loss curve plateaus -> try further learning rate decay at plateau point
Loss decreasing but slowly -> try higher learning rate
TrainingValidation
![Page 120: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/120.jpg)
120Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
More debugging
Loss
Time
Loss
Time
Loss
Time
Plateau may be bad weight initialization Loss
Time
Healthy loss curve plateaus -> try further learning rate decay at plateau point
Loss decreasing but slowly -> try higher learning rate
If you further decay learning rate too early, may look like this -> inefficient learning vs. keeping higher learning rate longer
TrainingValidation
![Page 121: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/121.jpg)
121Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
More debugging
Loss
Time
Loss
Time
Accuracy
Time
Loss
Time
Plateau may be bad weight initialization Loss
Time
Healthy loss curve plateaus -> try further learning rate decay at plateau point
Loss decreasing but slowly -> try higher learning rate
If you further decay learning rate too early, may look like this -> inefficient learning vs. keeping higher learning rate longer
Final metric is still improving -> keep training!
TrainingValidation
![Page 122: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/122.jpg)
122Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Iteration
Loss
Iteration
Accuracy
Stop training here
Stop training the model when accuracy on the validation set decreasesOr train for a long time, but always keep track of the model snapshot that worked best on val.
Early stopping: always do this
Slide credit: CS231n
TrainingValidation
![Page 123: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/123.jpg)
123Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Design choices: network architectures
Major design choices:- Architecture type
(ResNet, DenseNet, etc. for CNNs)
- Depth ( # layers)- For MLPs, # neurons in
each layer (hidden layer size)
- For CNNs, # filters, filter size, filter stride in each layer
- Look at argument options in Tensorflow when defining network layers
![Page 124: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/124.jpg)
124Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Design choices: network architectures
Major design choices:- Architecture type
(ResNet, DenseNet, etc. for CNNs)
- Depth ( # layers)- For MLPs, # neurons in
each layer (hidden layer size)
- For CNNs, # filters, filter size, filter stride in each layer
- Look at argument options in Tensorflow when defining network layers
If trying to make network bigger (when underfitting) or smaller (when overfitting), network depth and hidden layer size best to adjust first. Don’t waste too much time early on fiddling with choices that only minorly change architecture.
![Page 125: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/125.jpg)
125Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Design choices: regularization (loss term)Remember optimizing loss functions, which express how well model fit training data, e.g.:
![Page 126: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/126.jpg)
126Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Design choices: regularization (loss term)Remember optimizing loss functions, which express how well model fit training data, e.g.:
Regularization adds a term to this, to express preferences on the weights (that prevent it from fitting too well to the training data). Used to combat overfitting:
Data loss Regularization loss
importance of reg. term
![Page 127: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/127.jpg)
127Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Design choices: regularization (loss term)Remember optimizing loss functions, which express how well model fit training data, e.g.:
Regularization adds a term to this, to express preferences on the weights (that prevent it from fitting too well to the training data). Used to combat overfitting:
ExamplesL2 regularization: (weight decay) L1 regularization: Elastic net (L1 + L2):
Data loss Regularization loss
https://www.tensorflow.org/api_docs/python/tf/keras/regularizers
importance of reg. term
![Page 128: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/128.jpg)
128Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Design choices: regularization (loss term)Remember optimizing loss functions, which express how well model fit training data, e.g.:
Regularization adds a term to this, to express preferences on the weights (that prevent it from fitting too well to the training data). Used to combat overfitting:
ExamplesL2 regularization: (weight decay) L1 regularization: Elastic net (L1 + L2):
Data loss Regularization loss
L2 most popular: low loss when all weights are relatively small. More strongly penalizes large weights vs L1. Expresses preference for simple models (need large coefficients to fit a function to extreme outlier values).
https://www.tensorflow.org/api_docs/python/tf/keras/regularizers
importance of reg. term
![Page 129: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/129.jpg)
129Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Design choices: regularization (loss term)Remember optimizing loss functions, which express how well model fit training data, e.g.:
Regularization adds a term to this, to express preferences on the weights (that prevent it from fitting too well to the training data). Used to combat overfitting:
ExamplesL2 regularization: (weight decay) L1 regularization: Elastic net (L1 + L2):
Data loss Regularization loss
L2 most popular: low loss when all weights are relatively small. More strongly penalizes large weights vs L1. Expresses preference for simple models (need large coefficients to fit a function to extreme outlier values).
https://www.tensorflow.org/api_docs/python/tf/keras/regularizers
Next: implicit regularizers that do not add an explicit term; instead do something implicit in network to prevent it from fitting too well to training data
importance of reg. term
![Page 130: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/130.jpg)
130Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Design choices: regularization (dropout)First example of an implicit regularizer.During training, at each iteration of forward pass randomly set some neurons to zero (i.e., change network architecture such that paths to some neurons are removed).
During testing, all neurons are active. But scale neuron outputs by dropout probability p, such that expected output during training and testing match.
Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014. Figure credit: CS231n.
![Page 131: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/131.jpg)
131Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Design choices: regularization (dropout)First example of an implicit regularizer.During training, at each iteration of forward pass randomly set some neurons to zero (i.e., change network architecture such that paths to some neurons are removed).
During testing, all neurons are active. But scale neuron outputs by dropout probability p, such that expected output during training and testing match.
Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014. Figure credit: CS231n.
Probability of “dropping out” each neuron at a forward pass is hyperparameter p. 0.5 and 0.9 are common (high!).
![Page 132: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/132.jpg)
132Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Design choices: regularization (dropout)First example of an implicit regularizer.During training, at each iteration of forward pass randomly set some neurons to zero (i.e., change network architecture such that paths to some neurons are removed).
During testing, all neurons are active. But scale neuron outputs by dropout probability p, such that expected output during training and testing match.
Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014. Figure credit: CS231n.
Probability of “dropping out” each neuron at a forward pass is hyperparameter p. 0.5 and 0.9 are common (high!).
Intuition: dropout is equivalent to training a large ensemble of different models that share parameters.
![Page 133: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/133.jpg)
133Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Design choices: regularization (batch normalization)Another example of an implicit regularizer.Insert BN layers after FC or conv layers, before activation function.During training, at each iteration of forward pass normalize neuron activations by mean and variance of minibatch. Also learn scale and shift parameter to get final output.
During testing, normalize by a fixed mean and variance computed from the entire training set. Use learned scale and shift parameters.
![Page 134: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/134.jpg)
134Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Design choices: regularization (batch normalization)Another example of an implicit regularizer.Insert BN layers after FC or conv layers, before activation function.During training, at each iteration of forward pass normalize neuron activations by mean and variance of minibatch. Also learn scale and shift parameter to get final output.
During testing, normalize by a fixed mean and variance computed from the entire training set. Use learned scale and shift parameters.
Intuition: batch normalization allows keeping the weights in a healthy range. Also some randomness at training due to different effect from each minibatch sampling -> regularization!
![Page 135: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/135.jpg)
135Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Design choices: data augmentationAugment effective training data size by simulating more diversity from existing data. Random combinations of:
- Translation and scaling- Distortion- Image color adjustment- Etc.
![Page 136: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/136.jpg)
136Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Design choices: data augmentationAugment effective training data size by simulating more diversity from existing data. Random combinations of:
- Translation and scaling- Distortion- Image color adjustment- Etc.
Think about the domain of your data: what makes sense as realistic augmentation operations?
![Page 137: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/137.jpg)
137Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Deep learning for healthcare: the rise of medical data
Q: What are other examples of potential data augmentations?
(Raise hand or type in chat box)
![Page 138: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/138.jpg)
138Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Model inference
![Page 139: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/139.jpg)
139Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Maximizing test-time performance: apply data augmentation operations
Main idea: apply model on multiple variants of a data example, and then take average or max of scores
Can use data augmentation operations we saw during training! E.g.:
- Evaluate at different translations and scales- Common approach for images: evaluate image crops at 4 corners and center,
+ horizontally flipped versions -> average 10 scores
![Page 140: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/140.jpg)
140Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
1. Train multiple independent models2. At test time average their results
Enjoy 2% extra performance
Model ensembles
Slide credit: CS231n
![Page 141: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/141.jpg)
141Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Instead of training independent models, use multiple snapshots of a single model during training!
Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016Huang et al, “Snapshot ensembles: train 1, get M for free”, ICLR 2017Figures copyright Yixuan Li and Geoff Pleiss, 2017. Reproduced with permission.
Model ensembles: tips and tricks
Slide credit: CS231n
![Page 142: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/142.jpg)
142Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Instead of training independent models, use multiple snapshots of a single model during training!
Cyclic learning rate schedules can make this work even better!Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016
Huang et al, “Snapshot ensembles: train 1, get M for free”, ICLR 2017Figures copyright Yixuan Li and Geoff Pleiss, 2017. Reproduced with permission.
Model ensembles: tips and tricks
Slide credit: CS231n
![Page 143: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/143.jpg)
143Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Deeper models
![Page 144: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/144.jpg)
144Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Training more complex neural networks is a straightforward extension
Now a 6-layer network
![Page 145: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/145.jpg)
145Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
“Deep learning”Can continue to stack more layers to get deeper models!
![Page 146: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/146.jpg)
146Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
“Deep learning”Can continue to stack more layers to get deeper models!
Input layer
![Page 147: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/147.jpg)
147Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
“Deep learning”Can continue to stack more layers to get deeper models!
Input layer“Hidden” layers - will see lots of diversity in size (# neurons), type (linear, convolutional, etc.), and activation function (sigmoid, ReLU, etc.)
![Page 148: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/148.jpg)
148Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
“Deep learning”Can continue to stack more layers to get deeper models!
Input layer
Output layer - will differ for different types of tasks (e.g. regression). Should match with loss function.“Hidden” layers - will see lots of diversity
in size (# neurons), type (linear, convolutional, etc.), and activation function (sigmoid, ReLU, etc.)
![Page 149: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/149.jpg)
149Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
“Deep learning”Can continue to stack more layers to get deeper models!
Input layer
Output layer - will differ for different types of tasks (e.g. regression). Should match with loss function.“Hidden” layers - will see lots of diversity
in size (# neurons), type (linear, convolutional, etc.), and activation function (sigmoid, ReLU, etc.)
Vanilla fully-connected neural networks (MLPs) usually pretty shallow -- otherwise too many parameters! ~2-3 layers. Can have wide range in size of layers (16, 64, 256, 1000, etc.) depending on how much data you have.
![Page 150: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/150.jpg)
150Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
“Deep learning”Can continue to stack more layers to get deeper models!
Input layer
Output layer - will differ for different types of tasks (e.g. regression). Should match with loss function.“Hidden” layers - will see lots of diversity
in size (# neurons), type (linear, convolutional, etc.), and activation function (sigmoid, ReLU, etc.)
Vanilla fully-connected neural networks (MLPs) usually pretty shallow -- otherwise too many parameters! ~2-3 layers. Can have wide range in size of layers (16, 64, 256, 1000, etc.) depending on how much data you have.
Will see different classes of neural networks that leverage structure in data to reduce parameters + increase network depth
![Page 151: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/151.jpg)
151Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Deep learning
Q: What tasks other than regression are there?
(Raise hand or type in chat box)
![Page 152: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/152.jpg)
152Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Common activation functionsYou will see these extensively, typically after linear or convolutional layers.They add nonlinearity to allow the model to express complex nonlinear functions.
Sigmoid
Tanh Leaky ReLU
ReLU
and many more...
You can find these in Keras: https://keras.io/layers/advanced-activations/
![Page 153: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/153.jpg)
153Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Common activation functionsYou will see these extensively, typically after linear or convolutional layers.They add nonlinearity to allow the model to express complex nonlinear functions.
Sigmoid
Tanh Leaky ReLU
ReLU
and many more...
You can find these in Keras: https://keras.io/layers/advanced-activations/
Typical in modern CNNs and MLPs
![Page 154: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/154.jpg)
154Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Common activation functionsYou will see these extensively, typically after linear or convolutional layers.They add nonlinearity to allow the model to express complex nonlinear functions.
Sigmoid
Tanh Leaky ReLU
ReLU
and many more...
You can find these in Keras: https://keras.io/layers/advanced-activations/
Will see in recurrent neural networks. Also used in early MLPs and CNNs.
![Page 155: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/155.jpg)
155Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Will see different classes of neural networks
...
Input sequence
Output sequence
Fully connected neural networks(linear layers, good for “feature vector” inputs)
Convolutional neural networks(convolutional layers, good for image inputs)
Recurrent neural networks(linear layers modeling recurrence relation across
sequence, good for sequence inputs)
![Page 156: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/156.jpg)
156Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Project
Zoom Poll:- Do you have a project in mind?- Are you looking for project partners?
![Page 157: Lecture 2: Deep Learning Fundamentals - web.stanford.eduweb.stanford.edu/class/biods220/lectures/lecture2.pdf · Serena Yeung BIODS 220: AI in Healthcare Lecture 2 - Today: Review](https://reader036.vdocument.in/reader036/viewer/2022071213/603dfb2c2713463c837f0ad3/html5/thumbnails/157.jpg)
157Serena Yeung BIODS 220: AI in Healthcare Lecture 2 -
Summary- Went over how to define, train, and tune a neural network
- This Friday’s section (Zoom link on Canvas) will be a project partner finding section
- Next Friday’s section will provide an in-depth tutorial on Tensorflow
- Next class: will go in-depth into- Medical image data- Classification models- Data, model, evaluation considerations