cp365 artificial intelligence - colorado...
TRANSCRIPT
CP365 Artificial Intelligence
Tech News!
Apple news conference tomorrow?
Google cancels Project Ara modular phone
Weather-Based Stock Market Predictions?
Dataset Preparation
Clean – remove bogus data/fill in missing data
Normalize data – adjust features to be similar magnitudes
Deal with Missing Data
Option 1: remove datapoints with any missing feature values
Option 2: fill in missing data with <data_missing> tags for categorical data
Option 3: fill in missing data with global means for numeric data
Option 4: fill in missing data with values from similar data points
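Option 3 can be sketched in a few lines of plain Python; the height values below are made up for illustration, with None marking a missing entry.

```python
# Sketch: fill missing numeric values with the global mean (Option 3).
# None marks a missing value; the data is hypothetical.
heights = [131.2, 176.7, None, 161.0]

present = [h for h in heights if h is not None]
mean_height = sum(present) / len(present)   # mean of the observed values

filled = [h if h is not None else mean_height for h in heights]
```

Option 1 is the same idea in reverse: keep only the datapoints whose every feature is present.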
Remove Outliers
Some datapoints may have ridiculous feature values.
We can remove outliers from our dataset to increase performance.
What is an outlier?
Outliers
Patient Height (cm) | Patient Weight (kg) | ... | Prognosis
131.2 | 59.2 | ... | Good
176.7 | 82.9 | ... | Good
12613.9 | 66.0 | ... | Poor
161.0 | 70.2 | ... | Poor
Obvious outlier. How can we define what makes an outlier?
We could use 3σ as the threshold.
The height column has x̄ = 156.3 and σ = 23.1 (without the possible outlier).
The 3σ thresholds would be (156.3 - 3 * 23.1, 156.3 + 3 * 23.1), or (87.0, 225.6).
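The 3σ rule above can be sketched directly. This uses the sample standard deviation (dividing by n - 1), which matches the slide's σ = 23.1; the heights are the slide's values with the suspect point held out.

```python
# Sketch: flag a value as an outlier if it falls outside mean ± 3σ.
def three_sigma_bounds(values):
    n = len(values)
    mean = sum(values) / n
    # Sample variance (n - 1), matching the slide's sigma of 23.1.
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    sigma = var ** 0.5
    return mean - 3 * sigma, mean + 3 * sigma

heights = [131.2, 176.7, 161.0]            # stats computed without the suspect point
low, high = three_sigma_bounds(heights)     # roughly (87.0, 225.6)
is_outlier = not (low <= 12613.9 <= high)   # the 12613.9 cm "patient" is rejected
```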
A Bad Dataset
Patient Height (nm) | Patient Weight (tons) | ... | Prognosis
1.31 x 10⁹ | 0.065 | ... | Good
1.76 x 10⁹ | 0.091 | ... | Good
1.23 x 10⁹ | 0.073 | ... | Poor
1.61 x 10⁹ | 0.077 | ... | Poor
How will these large differences affect learning?
Data Normalization Procedure
Patient Height (nm): 1.31 x 10⁹, 1.76 x 10⁹, 1.23 x 10⁹, 1.61 x 10⁹
Range of extreme values: 1.23 x 10⁹ to 1.76 x 10⁹
Normalized range: 0.0 (or -1.0) to 1.0
Mapping: old range → normalized range
Data Normalization Formula
Patient Height (nm): 1.31 x 10⁹, 1.76 x 10⁹, 1.23 x 10⁹, 1.61 x 10⁹
Say we want the normalized value, newpt, for the first height, 1.31 x 10⁹, called pt.
oldmax = 1.76 x 10⁹, oldmin = 1.23 x 10⁹
newmax = 1.0, newmin = 0.0
newpt = ((pt − oldmin) / (oldmax − oldmin)) * (newmax − newmin) + newmin
newpt = 0.15
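The formula translates directly into a small helper; applied to the slide's four heights it sends the min to 0.0, the max to 1.0, and the first height to about 0.15.

```python
# Sketch of the min-max normalization formula from the slide.
def normalize(pt, oldmin, oldmax, newmin=0.0, newmax=1.0):
    return (pt - oldmin) / (oldmax - oldmin) * (newmax - newmin) + newmin

heights = [1.31e9, 1.76e9, 1.23e9, 1.61e9]
lo, hi = min(heights), max(heights)
scaled = [normalize(h, lo, hi) for h in heights]
# scaled[0] is about 0.15, matching the slide's newpt
```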
How do we know if an ML model is any good?
Overfitting
[Figure: error vs. epoch curves for training and testing data]
A Biological Neuron
Human Brain
How many neurons?
Animal | Number of Neurons (cerebral cortex)
Rat 20,000,000
Dog 160,000,000
Cat 300,000,000
Pig 450,000,000
Horse 1,200,000,000
Dolphin 5,800,000,000
African Elephant 11,000,000,000
Human 20,000,000,000
How many connections?
Human 100,000,000,000,000
Google (2012) 1,700,000,000
Google/Stanford (2013) 11,200,000,000
Digital Reasoning (2015) 160,000,000,000
Artificial Neuron
Threshold Function
Input connections and weights: w1, w2, w3
Output connections
Hard Threshold
S = Σᵢ (inputᵢ * weightᵢ)
if S > THRESHOLD: output = 1
else: output = 0
Hard Threshold: Step Function
Write down artificial neurons with weights and thresholds that model the following functions:
Identity
Logical AND
Logical OR
Logical XOR
Constant function
Sigmoid Threshold
Threshold Function
w1, w2, w3
S = Σᵢ (inputᵢ * weightᵢ)
output = 1 / (1 + e^(−S))
Sigmoid Threshold: 'S' Function
sigmoid
w1 = 0.1
w2 = 0.2
w3 = 0.42
Features: x1 = 0.66, x2 = 0.11, x3 = 0.20
s = w1 * x1 + w2 * x2 + w3 * x3
s = 0.1 * 0.66 + 0.2 * 0.11 + 0.42 * 0.2
s = 0.17
1 / (1 + e^(−0.17)) = 0.54
Output Calculations
sigmoid
w1 = 0.1
w2 = 0.2
w3 = 0.42
Features: x1 = 0.66, x2 = 0.11, x3 = 0.20
y1 = 0.54
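The same output calculation, as a minimal sketch using the weights and features from the slide:

```python
import math

# Sketch: one sigmoid neuron's output for the slide's weights and inputs.
def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

weights = [0.1, 0.2, 0.42]
inputs = [0.66, 0.11, 0.20]

s = sum(w * x for w, x in zip(weights, inputs))   # weighted sum of inputs
y1 = sigmoid(s)                                    # squash through the sigmoid
```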
Perceptron Network
Input Layer
Output Layer
Perceptron: Linear Boundary
Linear Boundary?
Multilayer Network
Input Layer
Hidden Layer(s)
Output Layer
ANN Learning – How to get the weights?
[Figure: error surface plotted over weight1 and weight2]
ANN Learning
● How do we get the right weights?
● Perceptron: Gradient descent
● Multilayer Network: Back propagation
Node Activation Function
a_j = g(input_j) = g(Σ_{i=0}^{n} w_ij a_i)
a_j is the activation (output) of node j.
g is the threshold activation function.
The input to g is the sum, over all incoming connections, of each input activation times its weight.
Minimize Global Error Function
error = Σ_j (t_j − a_j)²
For every output node, j, sum up the difference in target value vs. generated output value, squared.
Perceptron Learning
Δw_ij = η (t_j − a_j) a_i
Update the weight on connection i → j
η: the learning rate (0.3ish)
(t_j − a_j): difference in target and generated output.
a_i: input activation
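The update rule is one line of code. A minimal sketch, with made-up weights, inputs, target, and output:

```python
# Sketch of the perceptron weight-update rule:
#   delta_w_ij = eta * (t_j - a_j) * a_i
def update_weights(weights, inputs, target, output, eta=0.3):
    return [w + eta * (target - output) * a
            for w, a in zip(weights, inputs)]

# Hypothetical example values: output 0.3 was too low for target 1.0.
w = update_weights([0.5, -0.2], [1.0, 0.0], target=1.0, output=0.3)
# w[0] grows toward the target; w[1] is unchanged because its input was 0
```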
Let's learn NAND!
Dataset: NAND
Input1 | Input2 | Label
0 | 0 | 1
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0
Network: inputs In1 and In2 plus a constant 1.0 input, connected by weights W1, W2, W3 to one output node, Out.
Starting weight values: W1 = 0.81, W2 = 0.55, W3 = 0.16
η = 0.3
Use sigmoid threshold
a_j = g(input_j) = g(Σ_{i=0}^{n} w_ij a_i)
Δw_ij = η (t_j − a_j) a_i
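The exercise above can be run end to end. This is a sketch, assuming W3 is the weight on the constant 1.0 (bias) input; the epoch count is arbitrary.

```python
import math

# Sketch: train the slide's single sigmoid unit on NAND with the
# update rule delta_w = eta * (t - a) * input.
def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

data = [([0, 0], 1), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
w = [0.81, 0.55, 0.16]      # W1, W2, W3 from the slide
eta = 0.3

for epoch in range(5000):
    for (x1, x2), t in data:
        inputs = [x1, x2, 1.0]              # constant 1.0 drives W3
        a = sigmoid(sum(wi * xi for wi, xi in zip(w, inputs)))
        w = [wi + eta * (t - a) * xi for wi, xi in zip(w, inputs)]
```

After training, the unit's outputs should sit near the NAND truth table: close to 1 for (0,0), (0,1), (1,0) and close to 0 for (1,1).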
ANN Learning - Backpropagation
Input Layer
Hidden Layer
Output Layer
Put in input values and feed the activation forward to produce the output.
ANN Learning - Backpropagation
Input Layer
Hidden Layer
Output Layer
Calculate the error in the output layer and then back-propagate it to update lower weights.
ANN Learning - Backpropagation
Δw_ij = η δ_j a_i
Update the weight on connection i → j
a_i: input activation
δ_j: think of this as the error measure for node j. It is computed differently for output and hidden weights.
ANN Learning – Backpropagation for Output Nodes
δ_j = a_j (1 − a_j) (t_j − a_j)
Error measure for output node, j.
a_j (1 − a_j): derivative of the sigmoid function.
(t_j − a_j): difference in target vs. generated output.
ANN Learning – Backpropagation for Hidden Nodes
δ_j = a_j (1 − a_j) Σ_k δ_k w_jk
Error measure for hidden node, j.
a_j (1 − a_j): derivative of the sigmoid function.
Σ_k δ_k w_jk: a weighted combination of the error measures of the output-side nodes this node's weights contribute to.
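The two delta formulas can be sketched for the simplest case: one output node k fed by one hidden node j. All activation, target, and weight values below are made up for illustration.

```python
# Sketch: the backpropagation error measures from the slides.
def delta_output(a, t):
    # delta_j = a_j * (1 - a_j) * (t_j - a_j)
    return a * (1 - a) * (t - a)

def delta_hidden(a, downstream):
    # delta_j = a_j * (1 - a_j) * sum_k(delta_k * w_jk)
    # downstream is a list of (delta_k, w_jk) pairs.
    return a * (1 - a) * sum(d * w for d, w in downstream)

# Hypothetical values: output activation 0.52 vs. target 1.0,
# hidden activation 0.4, connecting weight 0.7.
d_out = delta_output(a=0.52, t=1.0)
d_hid = delta_hidden(a=0.4, downstream=[(d_out, 0.7)])
```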
ANN Learning
● Initialize random network weights
● for epoch in range NUMBER_EPOCHS:
  ● Train network on random presentation of instances
  ● Update weights with backpropagation
  ● Report global error function value
Choosing the Learning Rate, η
What happened when our learning rate was too high for linear regression?
How do we choose an appropriate learning rate for ANNs?
Bold Driver
After each epoch...
if error went down: η = η * 1.05
else: η = η * 0.50
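The bold-driver schedule is a one-liner; the error values below are hypothetical.

```python
# Sketch of the bold-driver learning-rate schedule from the slide:
# grow eta slightly when error improves, halve it when error worsens.
def bold_driver(eta, prev_error, new_error):
    return eta * 1.05 if new_error < prev_error else eta * 0.50

eta = 0.3
eta = bold_driver(eta, prev_error=1.0, new_error=0.8)   # error fell: eta grows
eta = bold_driver(eta, prev_error=0.8, new_error=0.9)   # error rose: eta halves
```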
Choosing the Network Structure
Input Layer
Hidden Layer
Output Layer
How many nodes? What are their connections?
# of input nodes determined by the number of function inputs.
# of output nodes determined by the number of function outputs.
Too few hidden nodes: unable to get a detailed enough approximation of the target function.
Too many hidden nodes: slower to train and easier to overfit training data.
ANN Representational Power
● With one hidden layer: model all continuous functions
● With two hidden layers: model all functions
Rules of Thumb
● Use 1 or 2 hidden layers
● Use about (2/3)n hidden nodes for reasonably complex functions
● Don't train for too many epochs
Splitting up datasets
Training data – use to train your ML model
Validation data – use to improve your ML model while training
Testing data – use to test performance of your ML model
K-Fold Cross Validation
Full Dataset → Dataset split into k chunks
K-Fold Cross Validation: Pass 1
Training Dataset Validation Dataset
K-Fold Cross Validation: Pass 2
Training Dataset Validation Dataset
K-Fold Cross Validation
Perform K training/validation passes
Each pass counts as a classification accuracy sample
Extreme case: K = dataset size (leave-one-out testing)
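The fold bookkeeping can be sketched with index lists: each pass holds one chunk out for validation and trains on the rest.

```python
# Sketch: split dataset indices into k folds for cross validation.
def k_fold_splits(n, k):
    # Deal indices round-robin into k folds.
    folds = [list(range(n))[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, val

# 5-fold split of a 10-item dataset: 5 training/validation passes.
splits = list(k_fold_splits(10, 5))
```

With k equal to the dataset size, each validation fold holds a single point, which is the leave-one-out case from the slide.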
ANN Implementation?
Break!