
Artificial Neural Networks

[Diagram: a small network with inputs x1, x2, x3]

What are they?
Inspired by the human brain. The human brain has about 86 billion neurons and requires about 20% of your body's energy to function. These neurons are connected by roughly 100 trillion to 1 quadrillion synapses.

What are they?
[Diagram: a human and a machine side by side, each mapping input to output]

What are they?
1. Originally developed by Warren McCulloch and Walter Pitts (1943).
2. First trainable network, the perceptron, proposed by Frank Rosenblatt (1957).
3. Started off as an unsupervised learning tool.
   1. Had problems with computing time and could not compute XOR.
   2. Was abandoned in favor of other algorithms.
4. Werbos's (1975) backpropagation algorithm:
   1. Incorporated supervision and solved XOR.
   2. But was still too slow compared with other algorithms, e.g., support vector machines.
5. Backpropagation was accelerated by GPUs in 2010 and shown to be more efficient and cost-effective.

GPUs
GPUs handle parallel operations much better (they run thousands of threads at once), although each individual core is slower than a CPU core. The matrix multiplication steps in ANNs can be run in parallel, resulting in considerable time and cost savings. The best CPUs handle about 50 GB/s of memory bandwidth, while the best GPUs handle about 750 GB/s.

                    CPU (i9 X-series)     GPU (GeForce GTX 1080)
Cores               18 (36 threads)       2560
Clock speed (GHz)   4.4                   1.6
Memory              shared (system RAM)   8 GB
Price ($)           1799                  549
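To make the parallelism point concrete, here is a minimal NumPy sketch (added here, not from the original slides) of a fully connected layer written as a single matrix multiplication; the layer sizes and variable names are illustrative assumptions. On a GPU-backed array library, the same expression is dispatched across thousands of threads.

```python
import numpy as np

# A fully connected layer is y = g(x @ W + b): one matrix multiply
# plus an elementwise activation, both highly parallel operations.
rng = np.random.default_rng(0)

batch = rng.standard_normal((64, 784))   # 64 input vectors (assumed sizes)
W = rng.standard_normal((784, 256))      # weight matrix for a 784 -> 256 layer
b = np.zeros(256)                        # bias vector

z = batch @ W + b                        # every output entry is independent
h = 1.0 / (1.0 + np.exp(-z))             # sigmoid activation, also elementwise
print(h.shape)                           # (64, 256)
```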

What are they?
[Diagram repeated: a human and a machine, each mapping input to output]

Neuron

[Diagram: a single neuron with inputs x0 (the bias input), x1, x2, weights w0 = −10, w1 = 20, w2 = 20, and output y]

z = x0w0 + x1w1 + x2w2 = Σ(i=0 to 2) wi xi

y = g(z) = g(w0x0 + w1x1 + w2x2), where g is the activation function

Activation function

[Same neuron: inputs x0 (bias), x1, x2 with weights w0 = −10, w1 = 20, w2 = 20]

z = w0x0 + w1x1 + w2x2 = Σ(i=0 to 2) wi xi

y = g(z), where g is the activation function:

g(z) = 1 / (1 + e^(−z))
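To make the formulas concrete, here is a small Python sketch (added, not part of the slides) of this single sigmoid neuron with the weights shown above; the bias input x0 is fixed at 1, and the function names are my own. With w0 = −10 and w1 = w2 = 20, binary inputs give an output close to x1 OR x2.

```python
import math

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x1, x2, w0=-10.0, w1=20.0, w2=20.0):
    x0 = 1.0                          # bias input is fixed at 1
    z = w0 * x0 + w1 * x1 + w2 * x2   # weighted sum of inputs
    return sigmoid(z)                 # y = g(z)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(neuron(x1, x2), 4))
# Output is ~0 only for (0, 0) and ~1 otherwise, i.e. a soft OR gate.
```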

Activation Functions

■ Sigmoid: g(z) = 1 / (1 + e^(−z))

■ ReLU: r(z) = max(0, z)

■ Hyperbolic tangent: tanh(z) = 2 / (1 + e^(−2z)) − 1

■ Leaky ReLU: lr(z) = αz for z < 0; z for z ≥ 0
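A short NumPy sketch (added) implementing the four activation functions exactly as written above; the Leaky ReLU slope α = 0.01 is an assumed typical value, since the slide leaves α unspecified.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0   # equivalent to np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # alpha (assumed 0.01 here) is the slope used for negative inputs
    return np.where(z < 0, alpha * z, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z), leaky_relu(z), sep="\n")
```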

Neural Network

[Diagram: a fully connected network with inputs x1, x2, x3 feeding an input layer, hidden layers (with bias units b0, b1), and an output layer mapping input to output]

XOR Gate

X  Y  | output
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 0

[Diagram: a 2-2-1 network for XOR. Inputs x and y feed hidden neurons h0 = g(z0) and h1 = g(z1) through weights w0–w3; the hidden neurons feed the output neuron h2 = g(z2) through weights w4 and w5. z is the network output and za is the actual (target) output.]

XOR Gate

[Same XOR truth table and network, now with initial weights:]

w0 = 0.4, w1 = 0.6, w2 = 0.3, w3 = 0.8, w4 = 0.7, w5 = 0.1

output = h2 = z

Feedforward propagation

[Same XOR network and weights; input x = 1, y = 1]

activation function, g(z) = 1 / (1 + e^(−z))

z0 = w0x + w2y = 0.4 × 1 + 0.3 × 1 = 0.7
z1 = w1x + w3y = 0.6 × 1 + 0.8 × 1 = 1.4

h0 = g(z0) = 1 / (1 + e^(−0.7)) = 0.67
h1 = g(z1) = 1 / (1 + e^(−1.4)) = 0.8

z2 = w4h0 + w5h1 = 0.7 × 0.67 + 0.1 × 0.8 = 0.63
h2 = g(z2) = 1 / (1 + e^(−0.63)) = 0.65

h2 = z = 0.65
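For readers who want to run the example, here is a minimal Python sketch (added, not from the slides) of the same feedforward pass with the weights above; the function layout and names are my own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def feedforward(x, y, w):
    # w holds the six weights from the slides
    z0 = w["w0"] * x + w["w2"] * y        # input to first hidden neuron
    z1 = w["w1"] * x + w["w3"] * y        # input to second hidden neuron
    h0, h1 = sigmoid(z0), sigmoid(z1)     # hidden activations
    z2 = w["w4"] * h0 + w["w5"] * h1      # input to output neuron
    h2 = sigmoid(z2)                      # network output z
    return z0, z1, h0, h1, z2, h2

weights = {"w0": 0.4, "w1": 0.6, "w2": 0.3, "w3": 0.8, "w4": 0.7, "w5": 0.1}
print(feedforward(1, 1, weights))         # h0 ≈ 0.67, h1 ≈ 0.80
```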

Feedforward propagation

[Same feedforward pass as above: z0 = 0.7, z1 = 1.4, h0 = 0.67, h1 = 0.8, z2 = 0.63, h2 = z = 0.65; input x = 1, y = 1, target za = 0]

Mean Squared Error, E = ½ (z − za)² = ½ (0.65 − 0)² = 0.21
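And a one-function sketch (added) of the error measure as the slide defines it:

```python
def mse(z, za):
    # E = 1/2 * (z - za)^2, the slide's single-example squared error
    return 0.5 * (z - za) ** 2

print(mse(0.65, 0))   # 0.21125, which the slide rounds to 0.21
```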

Backpropagation

[Same network, feedforward values, and error E = ½ (z − za)² = ½ (0.65 − 0)² = 0.21]

∂E/∂w4 = (∂E/∂h2) · (∂h2/∂z2) · (∂z2/∂w4) = 0.65 × 0.228 × 0.67 = 0.09

∂E/∂h2 = ∂/∂h2 [½ (h2 − za)²] = h2 − za = 0.65

∂h2/∂z2 = ∂/∂z2 g(z2) = g(z2)(1 − g(z2)) = h2(1 − h2) = 0.65 × 0.35 = 0.228
  (in general, g′(z) = ∂/∂z [1 / (1 + e^(−z))] = g(z)[1 − g(z)])

∂z2/∂w4 = ∂/∂w4 [w4h0 + w5h1] = h0 = 0.67
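The same chain rule written as a small Python sketch (added, not from the slides) for both output-layer weights, taking the slide's rounded values of h0, h1, and h2 as inputs:

```python
# Rounded values carried over from the feedforward slide
h0, h1, h2 = 0.67, 0.8, 0.65
za = 0                        # target output for input (1, 1)

dE_dh2 = h2 - za              # dE/dh2 = h2 - za
dh2_dz2 = h2 * (1 - h2)       # sigmoid derivative g'(z2) = h2(1 - h2)

dE_dw4 = dE_dh2 * dh2_dz2 * h0   # dz2/dw4 = h0
dE_dw5 = dE_dh2 * dh2_dz2 * h1   # dz2/dw5 = h1
print(dE_dw4, dE_dw5)            # ≈ 0.099 and ≈ 0.118
```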

Backpropagation

[Same network, feedforward values, and error E = 0.21 as before]

∂E/∂w5 = (∂E/∂h2) · (∂h2/∂z2) · (∂z2/∂w5) = 0.65 × 0.228 × 0.8 = 0.118

∂E/∂h2 = ∂/∂h2 [½ (h2 − za)²] = h2 − za = 0.65

∂h2/∂z2 = g(z2)(1 − g(z2)) = h2(1 − h2) = 0.65 × 0.35 = 0.228

∂z2/∂w5 = ∂/∂w5 [w4h0 + w5h1] = h1 = 0.8

Backpropagation

[Same network, feedforward values, and error E = 0.21 as before]

∂E/∂w0 = (∂E/∂h0) · (∂h0/∂z0) · (∂z0/∂w0) = 0.103 × 0.221 × 1 = 0.0228

∂E/∂h0 = (∂E/∂h2) · (∂h2/∂z2) · (∂z2/∂h0) = (h2 − za) × h2(1 − h2) × w4 = 0.65 × 0.65(1 − 0.65) × 0.7 = 0.103

∂h0/∂z0 = ∂/∂z0 g(z0) = g(z0)(1 − g(z0)) = h0(1 − h0) = 0.67 × 0.33 = 0.221

∂z0/∂w0 = ∂/∂w0 [w0x + w2y] = x = 1
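Extending the sketch (again added, not part of the slides) to the hidden-layer weights: the error is first pushed back through the output neuron to get ∂E/∂h0 and ∂E/∂h1, then through each hidden neuron's sigmoid derivative.

```python
# Rounded values carried over from the feedforward slide
x, y = 1, 1
h0, h1, h2 = 0.67, 0.8, 0.65
w4, w5 = 0.7, 0.1
za = 0

delta_out = (h2 - za) * h2 * (1 - h2)    # dE/dz2, shared by all paths

dE_dh0 = delta_out * w4                  # dE/dh0 = dE/dz2 * dz2/dh0
dE_dh1 = delta_out * w5                  # dE/dh1 = dE/dz2 * dz2/dh1

dE_dw0 = dE_dh0 * h0 * (1 - h0) * x      # dz0/dw0 = x
dE_dw2 = dE_dh0 * h0 * (1 - h0) * y      # dz0/dw2 = y
dE_dw1 = dE_dh1 * h1 * (1 - h1) * x      # dz1/dw1 = x
dE_dw3 = dE_dh1 * h1 * (1 - h1) * y      # dz1/dw3 = y
print(dE_dw0, dE_dw1, dE_dw2, dE_dw3)
```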

Backpropagation

[Same network, feedforward values, and error E = 0.21 as before]

∂E/∂w2 = (∂E/∂h0) · (∂h0/∂z0) · (∂z0/∂w2) = 0.103 × 0.221 × 1 = 0.0228

∂E/∂h0 = (∂E/∂h2) · (∂h2/∂z2) · (∂z2/∂h0) = (h2 − za) × h2(1 − h2) × w4 = 0.65 × 0.65(1 − 0.65) × 0.7 = 0.103

∂h0/∂z0 = g(z0)(1 − g(z0)) = h0(1 − h0) = 0.67 × 0.33 = 0.221

∂z0/∂w2 = ∂/∂w2 [w0x + w2y] = y = 1

Backpropagation

[Same network, feedforward values, and error E = 0.21 as before]

∂E/∂w1 = (∂E/∂h1) · (∂h1/∂z1) · (∂z1/∂w1) = 0.014 × 0.16 × 1 = 0.0022

∂E/∂h1 = (∂E/∂h2) · (∂h2/∂z2) · (∂z2/∂h1) = (h2 − za) × h2(1 − h2) × w5 = 0.65 × 0.65(1 − 0.65) × 0.1 = 0.014

∂h1/∂z1 = g(z1)(1 − g(z1)) = h1(1 − h1) = 0.8 × 0.2 = 0.16

∂z1/∂w1 = ∂/∂w1 [w1x + w3y] = x = 1

Backpropagation

[Same network, feedforward values, and error E = 0.21 as before]

∂E/∂w3 = (∂E/∂h1) · (∂h1/∂z1) · (∂z1/∂w3) = 0.014 × 0.16 × 1 = 0.0022

∂E/∂h1 = (∂E/∂h2) · (∂h2/∂z2) · (∂z2/∂h1) = (h2 − za) × h2(1 − h2) × w5 = 0.65 × 0.65(1 − 0.65) × 0.1 = 0.014

∂h1/∂z1 = g(z1)(1 − g(z1)) = h1(1 − h1) = 0.8 × 0.2 = 0.16

∂z1/∂w3 = ∂/∂w3 [w1x + w3y] = y = 1

Gradient descent weight updates

[Same XOR network. Summary of the feedforward pass and all six gradients:]

activation function, g(z) = 1 / (1 + e^(−z));  g′(z) = g(z)[1 − g(z)]

Feedforward (x = 1, y = 1): z0 = 0.7, z1 = 1.4, h0 = 0.67, h1 = 0.8, z2 = 0.63, h2 = z = 0.65

Mean Squared Error, E = ½ (z − za)² = ½ (0.65 − 0)² = 0.21

∂E/∂w4 = (∂E/∂h2) · (∂h2/∂z2) · (∂z2/∂w4) = 0.65 × 0.228 × 0.67 = 0.09
∂E/∂w5 = (∂E/∂h2) · (∂h2/∂z2) · (∂z2/∂w5) = 0.65 × 0.228 × 0.8 = 0.118
∂E/∂w0 = (∂E/∂h0) · (∂h0/∂z0) · (∂z0/∂w0) = 0.103 × 0.221 × 1 = 0.0228
∂E/∂w2 = (∂E/∂h0) · (∂h0/∂z0) · (∂z0/∂w2) = 0.103 × 0.221 × 1 = 0.0228
∂E/∂w1 = (∂E/∂h1) · (∂h1/∂z1) · (∂z1/∂w1) = 0.014 × 0.16 × 1 = 0.0022
∂E/∂w3 = (∂E/∂h1) · (∂h1/∂z1) · (∂z1/∂w3) = 0.014 × 0.16 × 1 = 0.0022

Learning rate, α = 0.01

w0 = w0 − α ∂E/∂w0;  w1 = w1 − α ∂E/∂w1;  w2 = w2 − α ∂E/∂w2;
w3 = w3 − α ∂E/∂w3;  w4 = w4 − α ∂E/∂w4;  w5 = w5 − α ∂E/∂w5

[Flow: feedforward → backprop → gradient descent weight update, repeated for each training example]
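Putting the three stages together, here is a compact, self-contained Python sketch (added, not part of the slides) of the full cycle (feedforward, backpropagation, and the gradient-descent weight update) looped over the XOR training set. The variable names and epoch count are my own choices; the network, loss, and learning rate α = 0.01 match the slides. Since the slides' network has no bias terms, the sketch is meant to show the mechanics of the update cycle rather than a fully tuned XOR solver.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Initial weights and learning rate from the slides
w = [0.4, 0.6, 0.3, 0.8, 0.7, 0.1]   # w0..w5
alpha = 0.01
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

for epoch in range(10000):                      # assumed number of passes
    for (x, y), za in xor_data:
        # feedforward
        h0 = sigmoid(w[0] * x + w[2] * y)
        h1 = sigmoid(w[1] * x + w[3] * y)
        h2 = sigmoid(w[4] * h0 + w[5] * h1)     # network output z
        # backprop (chain rule, as on the preceding slides)
        delta_out = (h2 - za) * h2 * (1 - h2)
        grads = [delta_out * w[4] * h0 * (1 - h0) * x,   # dE/dw0
                 delta_out * w[5] * h1 * (1 - h1) * x,   # dE/dw1
                 delta_out * w[4] * h0 * (1 - h0) * y,   # dE/dw2
                 delta_out * w[5] * h1 * (1 - h1) * y,   # dE/dw3
                 delta_out * h0,                         # dE/dw4
                 delta_out * h1]                         # dE/dw5
        # gradient descent weight update
        w = [wi - alpha * gi for wi, gi in zip(w, grads)]

print(w)
```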
