
CP365: Artificial Intelligence

Tech News!

Apple news conference tomorrow?

Google cancels Project Ara modular phone

Weather-Based Stock Market Predictions?

Dataset Preparation

Clean – remove bogus data/fill in missing data

Normalize data – adjust features to be similar magnitudes

Deal with Missing Data

Option 1: remove datapoints with any missing feature values

Option 2: fill in missing data with <data_missing> tags for categorical data

Option 3: fill in missing data with global means for numeric data

Option 4: fill in missing data with values from similar data points
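A minimal sketch of the first three options in Python, assuming each datapoint is a dict mapping feature names to values and None marks a missing value (both conventions are mine, for illustration). Option 4 would additionally need a similarity measure, e.g. nearest neighbors.

```python
def drop_incomplete(data):
    """Option 1: remove datapoints with any missing feature values."""
    return [pt for pt in data if None not in pt.values()]

def fill_categorical(data, feature, tag="<data_missing>"):
    """Option 2: tag missing categorical values with a special token."""
    for pt in data:
        if pt[feature] is None:
            pt[feature] = tag

def fill_with_global_mean(data, feature):
    """Option 3: fill missing numeric values with the global mean."""
    present = [pt[feature] for pt in data if pt[feature] is not None]
    mean = sum(present) / len(present)
    for pt in data:
        if pt[feature] is None:
            pt[feature] = mean
```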

Remove Outliers

Some datapoints may have ridiculous feature values.

We can remove outliers from our dataset to increase performance.

What is an outlier?

Outliers

Patient Height (cm)   Patient Weight (kg)   ...   Prognosis
131.2                 59.2                  ...   Good
176.7                 82.9                  ...   Good
12613.9               66.0                  ...   Poor
161.0                 70.2                  ...   Poor

The height of 12613.9 cm is an obvious outlier. How can we define what makes an outlier?

We could use 3σ as the threshold.

The height column has mean x̄ = 156.3 and σ = 23.1 (computed without the possible outlier).

The 3σ thresholds would be (156.3 - 3 * 23.1, 156.3 + 3 * 23.1), or (87.0, 225.6).
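A sketch of the 3σ test in Python. Note that, as on the slide, the mean and σ are computed without the suspect point; an extreme value left in can inflate σ enough to hide itself. Function names are my own.

```python
import statistics

def is_outlier(values, idx, num_sigmas=3.0):
    """3-sigma test: compare values[idx] against the mean and standard
    deviation of the *remaining* values (the suspect point is left out)."""
    rest = values[:idx] + values[idx + 1:]
    mean = statistics.mean(rest)
    sigma = statistics.stdev(rest)
    return abs(values[idx] - mean) > num_sigmas * sigma

heights = [131.2, 176.7, 12613.9, 161.0]
kept = [h for i, h in enumerate(heights) if not is_outlier(heights, i)]
print(kept)  # [131.2, 176.7, 161.0] -- the 12613.9 cm entry is dropped
```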

A Bad Dataset

Patient Height (nm)   Patient Weight (tons)   ...   Prognosis
1.31 x 10⁹            0.065                   ...   Good
1.76 x 10⁹            0.091                   ...   Good
1.23 x 10⁹            0.073                   ...   Poor
1.61 x 10⁹            0.077                   ...   Poor

How will these large differences affect learning?

Data Normalization Procedure

Patient Height (nm): 1.31 x 10⁹, 1.76 x 10⁹, 1.23 x 10⁹, 1.61 x 10⁹

Range of extreme values: (1.23 x 10⁹, 1.76 x 10⁹)

Normalized range: (0.0, 1.0), or (-1.0, 1.0)

[Diagram: each height maps linearly from the old range onto the normalized range.]

Data Normalization Formula

$newpt = \frac{pt - oldmin}{oldmax - oldmin} \cdot (newmax - newmin) + newmin$

Patient Height (nm): 1.31 x 10⁹, 1.76 x 10⁹, 1.23 x 10⁹, 1.61 x 10⁹

Say we want the normalized value, newpt, for the first height, 1.31 x 10⁹, called pt.

oldmax = 1.76 x 10⁹, oldmin = 1.23 x 10⁹
newmax = 1.0, newmin = 0.0

newpt = ((1.31 x 10⁹ − 1.23 x 10⁹) / (1.76 x 10⁹ − 1.23 x 10⁹)) ⋅ (1.0 − 0.0) + 0.0 ≈ 0.15
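The formula translates directly into Python; a small sketch using the slide's heights (the function name is my own):

```python
def normalize(pt, oldmin, oldmax, newmin=0.0, newmax=1.0):
    """Map pt linearly from [oldmin, oldmax] onto [newmin, newmax]."""
    return (pt - oldmin) / (oldmax - oldmin) * (newmax - newmin) + newmin

heights = [1.31e9, 1.76e9, 1.23e9, 1.61e9]
lo, hi = min(heights), max(heights)
print([round(normalize(h, lo, hi), 2) for h in heights])
# [0.15, 1.0, 0.0, 0.72]
```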

How do we know if an ML model is any good?

Overfitting

[Plot: error vs. epoch; training error keeps falling while testing error bottoms out and then rises.]

A Biological Neuron

Human Brain

How many neurons?

Animal             Number of Neurons (cerebral cortex)
Rat                20,000,000
Dog                160,000,000
Cat                300,000,000
Pig                450,000,000
Horse              1,200,000,000
Dolphin            5,800,000,000
African Elephant   11,000,000,000
Human              20,000,000,000

How many connections?

Human                      100,000,000,000,000
Google (2012)              1,700,000,000
Google/Stanford (2013)     11,200,000,000
Digital Reasoning (2015)   160,000,000,000

Artificial Neuron

[Diagram: input connections with weights w1, w2, w3 feed a threshold function; output connections carry the result.]

Hard Threshold

[Diagram: inputs with weights w1, w2, w3 feed the threshold function.]

$S = \sum_i input_i \cdot weight_i$

if S > THRESHOLD: output = 1
else: output = 0

Hard Threshold: Step Function

Write down artificial neurons with weights and thresholds that model the following functions:

Identity
Logical AND
Logical OR
Logical XOR
Constant function
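As a sanity check for one of these, here is a sketch of a hard-threshold neuron in Python; the weights and threshold shown are one of several workable choices for logical AND.

```python
def hard_threshold_neuron(inputs, weights, threshold):
    """Output 1 when the weighted input sum exceeds the threshold, else 0."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return 1 if s > threshold else 0

# Logical AND: only (1, 1) pushes the sum past 1.5.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", hard_threshold_neuron([x1, x2], [1.0, 1.0], 1.5))

# Lowering the threshold to 0.5 gives logical OR. No single unit can
# produce XOR, since one neuron only draws a linear boundary.
```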

Sigmoid Threshold

[Diagram: inputs with weights w1, w2, w3 feed the threshold function.]

$S = \sum_i input_i \cdot weight_i$

$output = \frac{1}{1 + e^{-S}}$

Sigmoid Threshold: 'S' Function

Output Calculations

[Diagram: a sigmoid unit with w1 = 0.1, w2 = 0.2, w3 = 0.42.]

Features: x1 = 0.66, x2 = 0.11, x3 = 0.20

s = w1 * x1 + w2 * x2 + w3 * x3
s = 0.1 * 0.66 + 0.2 * 0.11 + 0.42 * 0.20
s = 0.172

$\frac{1}{1 + e^{-0.172}} = 0.54$

y1 = 0.54
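The same calculation in Python (the sigmoid helper is my own naming):

```python
import math

def sigmoid(s):
    """The 'S' threshold: squashes any real s into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

weights = [0.1, 0.2, 0.42]
features = [0.66, 0.11, 0.20]
s = sum(w * x for w, x in zip(weights, features))
print(round(s, 3), round(sigmoid(s), 2))  # 0.172 0.54
```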

Perceptron Network

[Diagram: input layer connected directly to output layer.]

Perceptron: Linear Boundary

Linear Boundary?

Multilayer Network

[Diagram: input layer → hidden layer(s) → output layer.]

ANN Learning – How to get the weights?

[Plot: error surface as a function of weight1 and weight2.]

ANN Learning

● How do we get the right weights?

● Perceptron: gradient descent

● Multilayer network: back-propagation

Node Activation Function

$a_j = g(input_j) = g\left(\sum_{i=0}^{n} w_{ij} a_i\right)$

a_j: the activation (output) of node j.

g: the threshold activation function.

input_j: the weighted sum of all input activations.

Minimize Global Error Function

$error = \sum_j (t_j - a_j)^2$

For every output node, j, sum up the difference between the target value and the generated output value, squared.
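In code this is a one-liner; a sketch assuming parallel lists of target and output values:

```python
def global_error(targets, outputs):
    """Sum of squared (target - output) differences over all output nodes."""
    return sum((t - a) ** 2 for t, a in zip(targets, outputs))

print(global_error([1.0], [0.54]))  # 0.2116
```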

Perceptron Learning

$\Delta w_{ij} = \eta (t_j - a_j) a_i$

Update the weight on connection i → j.

η: the learning rate (0.3ish).

(t_j − a_j): the difference in target and generated output.

a_i: the input activation.
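A sketch of the rule as a function, assuming a single node j whose incoming activations sit in a list (names are mine):

```python
def perceptron_update(weights, inputs, target, output, eta=0.3):
    """Delta rule: w_ij += eta * (t_j - a_j) * a_i for every incoming weight."""
    return [w + eta * (target - output) * a
            for w, a in zip(weights, inputs)]
```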

Let's learn NAND!

Dataset: NAND

Input1  Input2  Label
0       0       1
0       1       1
1       0       1
1       1       0

[Diagram: inputs In1 and In2 enter with weights W1 and W2; a constant bias input of 1.0 enters with weight W3; a single output node Out.]

Starting weight values: W1 = 0.81, W2 = 0.55, W3 = 0.16

η = 0.3

Use the sigmoid threshold.

$a_j = g(input_j) = g\left(\sum_{i=0}^{n} w_{ij} a_i\right)$

$\Delta w_{ij} = \eta (t_j - a_j) a_i$
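Putting the pieces together, a sketch of the whole exercise: one sigmoid unit with the slide's starting weights and the bias input fixed at 1.0, trained with the perceptron rule (the epoch count is an arbitrary choice of mine):

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# NAND dataset; the constant 1.0 bias input carries weight W3.
DATA = [([0, 0], 1), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
weights = [0.81, 0.55, 0.16]  # W1, W2, W3 from the slide
eta = 0.3

for epoch in range(5000):
    random.shuffle(DATA)                     # random presentation order
    for (x1, x2), target in DATA:
        inputs = [x1, x2, 1.0]               # append the bias input
        output = sigmoid(sum(w * a for w, a in zip(weights, inputs)))
        # Perceptron rule: w_i += eta * (t - a) * a_i
        weights = [w + eta * (target - output) * a
                   for w, a in zip(weights, inputs)]

for (x1, x2), target in sorted(DATA):
    out = sigmoid(sum(w * a for w, a in zip(weights, [x1, x2, 1.0])))
    print(x1, x2, target, round(out, 2))     # outputs approach the labels
```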

ANN Learning - Backpropagation

[Diagram: input layer → hidden layer → output layer.]

Put in input values and feed the activation forward to produce the output.

Then calculate the error in the output layer and back-propagate it to update lower weights.

ANN Learning - Backpropagation

$\Delta w_{ij} = \eta \delta_j a_i$

Update the weight on connection i → j.

δ_j: think of this as the error measure for node j. It is different for output and hidden weights.

a_i: the input activation.

ANN Learning – Backpropagation for Output Nodes

$\delta_j = a_j(1 - a_j)(t_j - a_j)$

δ_j: the error measure for output node j.

a_j(1 − a_j): the derivative of the sigmoid function.

(t_j − a_j): the difference in target vs. generated output.

ANN Learning – Backpropagation for Hidden Nodes

$\delta_j = a_j(1 - a_j)\sum_k \delta_k w_{jk}$

δ_j: the error measure for hidden node j.

a_j(1 − a_j): the derivative of the sigmoid function.

∑_k δ_k w_{jk}: a combination of the output errors that this node's weights contribute to.
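A compact sketch of both formulas inside one training step for a single-hidden-layer network (structure and naming are my own; bias terms are omitted for brevity):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def backprop_step(x, t, w_hidden, w_out, eta=0.3):
    """One forward pass plus one backpropagation update.
    w_hidden[j][i]: weight from input i to hidden node j.
    w_out[k][j]:    weight from hidden node j to output node k."""
    # Feed the activation forward.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in w_out]

    # Output nodes: delta_k = a_k (1 - a_k)(t_k - a_k)
    d_out = [yk * (1 - yk) * (tk - yk) for yk, tk in zip(y, t)]

    # Hidden nodes: delta_j = a_j (1 - a_j) * sum_k delta_k * w_jk
    d_hid = [hj * (1 - hj) * sum(d_out[k] * w_out[k][j]
                                 for k in range(len(y)))
             for j, hj in enumerate(h)]

    # Both layers use the same update: w_ij += eta * delta_j * a_i
    for k, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] += eta * d_out[k] * h[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * d_hid[j] * x[i]
    return y
```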

ANN Learning

● Initialize random network weights
● for epoch in range(NUMBER_EPOCHS):
    ● Train the network on a random presentation of instances
    ● Update weights with backpropagation
    ● Report the global error function value

Choosing the Learning Rate, η

What happened when our learning rate was too high for linear regression?

How do we choose an appropriate learning rate for ANNs?

Bold Driver

After each epoch...

if error went down: η = η * 1.05
else: η = η * 0.50
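As a sketch in Python:

```python
def bold_driver(eta, error, prev_error):
    """Grow eta gently after a good epoch; halve it after a bad one."""
    return eta * 1.05 if error < prev_error else eta * 0.50
```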


Choosing the Network Structure

[Diagram: input layer → hidden layer → output layer.]

How many nodes? What are their connections?

# of output nodes: determined by the number of function outputs.

# of input nodes: determined by the number of function inputs.

Too few hidden nodes: unable to get a detailed enough approximation of the target function.

Too many hidden nodes: slower to train and easier to overfit training data.

ANN Representational Power

● With one hidden layer: can model all continuous functions

● With two hidden layers: can model all functions

Rules of Thumb

● Use 1 or 2 hidden layers

● Use about (2/3)n hidden nodes for reasonably complex functions

● Don't train for too many epochs

Splitting up datasets

Training data – use to train your ML model

Validation data – use to improve your ML model while training

Testing data – use to test performance of your ML model

K-Fold Cross Validation

The full dataset is split into k chunks. On each pass, one chunk serves as the validation dataset and the remaining chunks serve as the training dataset.

Perform k training/validation passes.

Each pass counts as a classification accuracy sample.

Extreme case: K = dataset size (leave-one-out testing).
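A sketch of the splitting logic, assuming the dataset is a list that has already been shuffled:

```python
def k_fold_splits(data, k):
    """Yield (training, validation) pairs, one per fold."""
    chunk = len(data) // k
    for i in range(k):
        validation = data[i * chunk:(i + 1) * chunk]
        training = data[:i * chunk] + data[(i + 1) * chunk:]
        yield training, validation

# k = len(data) gives leave-one-out testing: each pass validates
# on exactly one held-out datapoint.
```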

ANN Implementation?

Break!
