
Page 1:

Neural Network Regularization and Activation Function

Dr. Mongkol Ekpanyapong

Page 2:

MNIST Dataset

• MNIST is often used as the “hello world” dataset of machine learning
• Here, we will use an ANN for MNIST classification

Page 3:

One-hot encoding

• Each label is represented as a vector in which the entry at the label's index is set to 1 and all other entries are 0
• It helps the model produce cleaner classifications with less confusion between classes
• For example, averaging five samples of digit 1 and five samples of digit 3 under the integer representation (if done carelessly) yields digit 2, a class that never occurred (see the sketch below)
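A minimal NumPy sketch of the averaging pitfall described above (the variable names are illustrative, not from the slides):

import numpy as np

# Integer labels: five samples of digit 1 and five of digit 3
int_labels = np.array([1]*5 + [3]*5)
print(int_labels.mean())        # 2.0 -- looks like digit 2, which never occurred

# One-hot labels: the average keeps the two true classes separate
one_hot = np.zeros((10, 10))
one_hot[np.arange(10), int_labels] = 1
print(one_hot.mean(axis=0))     # 0.5 at index 1 and index 3, 0 elsewhere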

Page 4:

MNIST Training

import sys, numpy as np
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

images, labels = (x_train[0:1000].reshape(1000, 28*28) / 255, y_train[0:1000])

one_hot_labels = np.zeros((len(labels), 10))
for i, l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels

test_images = x_test.reshape(len(x_test), 28*28) / 255
test_labels = np.zeros((len(y_test), 10))
for i, l in enumerate(y_test):
    test_labels[i][l] = 1

np.random.seed(1)

relu = lambda x: (x >= 0) * x      # returns x if x >= 0, returns 0 otherwise
relu2deriv = lambda x: x >= 0      # returns 1 for input >= 0, returns 0 otherwise

alpha, iterations, hidden_size, pixels_per_image, num_labels = (0.005, 350, 40, 784, 10)

Page 5:

weights_0_1 = 0.2 * np.random.random((pixels_per_image, hidden_size)) - 0.1
weights_1_2 = 0.2 * np.random.random((hidden_size, num_labels)) - 0.1

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(len(images)):
        layer_0 = images[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)

        error += np.sum((labels[i:i+1] - layer_2) ** 2)
        correct_cnt += int(np.argmax(layer_2) == np.argmax(labels[i:i+1]))

        layer_2_delta = (labels[i:i+1] - layer_2)
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)

        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    if j == 0 or (j + 1) % 50 == 0:
        print("[INFO] epoch={}, loss={:.7f}".format(j + 1, error))

    sys.stdout.write("\r I:" + str(j) +
                     " Train-Err:" + str(error / float(len(images)))[0:5] +
                     " Train-Acc:" + str(correct_cnt / float(len(images))))

Page 6:

MNIST Training

• A three-layer ANN
• Accuracy reaches 1.00 on the training data
• What about the result on the test data?

Page 7:

Testing on the test data

# inside the outer training loop on Page 5 (j is the epoch index)
if (j % 10 == 0 or j == iterations - 1):
    error, correct_cnt = (0.0, 0)
    for i in range(len(test_images)):
        layer_0 = test_images[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)

        error += np.sum((test_labels[i:i+1] - layer_2) ** 2)
        correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))

    sys.stdout.write(" Test-Err:" + str(error / float(len(test_images)))[0:5] +
                     " Test-Acc:" + str(correct_cnt / float(len(test_images))) + "\n")
    print()

Page 8:

Results

Loss function

Page 9:

Memorization vs. Generalization

• Why does the test accuracy go down?
• The network is starting to memorize the training data
• This is known as overfitting (an ANN can get worse if we train it too much)
• How can we solve the problem?

Page 10:

Regularization

• Early Stopping

• Dropout

• Batch Gradient Descent

• Loss function modification

Page 11:

Early Stopping

• Once the network is trained too much, it starts to memorize the training data
• How do we know when to stop training?
• The only way to know is to run the model on data that isn't in your training set and that truly represents the real data
• This is the rationale behind the validation set (a sketch follows)
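A minimal sketch of how early stopping could be added to the from-scratch network from Pages 4-5. The patience logic and variable names here are illustrative rather than from the slides, and the MNIST test split stands in for a proper validation set:

best_val_err, best_weights, patience, wait = float("inf"), None, 10, 0

for j in range(iterations):
    # one training epoch (same inner loop as on Page 5)
    for i in range(len(images)):
        layer_0 = images[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        layer_2_delta = (labels[i:i+1] - layer_2)
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    # validation error after this epoch
    val_err = 0.0
    for i in range(len(test_images)):
        layer_1 = relu(np.dot(test_images[i:i+1], weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        val_err += np.sum((test_labels[i:i+1] - layer_2) ** 2)

    if val_err < best_val_err:
        best_val_err, wait = val_err, 0
        best_weights = (weights_0_1.copy(), weights_1_2.copy())   # remember the best model so far
    else:
        wait += 1
        if wait >= patience:            # no improvement for `patience` epochs in a row
            print("Early stopping at epoch", j)
            break

weights_0_1, weights_1_2 = best_weights  # roll back to the weights with the best validation error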

Page 12:

Dropout

• Randomly turns neurons off (sets their activations to 0)
• Dropout makes a big network act like a small one by randomly training little subsections of the network at a time
• Small networks are harder to overfit

Page 13:

Example

(code figure; annotation: "need to multiply with layer 0")

Page 14:

Explanation

• We draw a Bernoulli random mask with 50% ones and 50% zeros
• Note that we have to multiply the output of layer 1 by 2 to compensate for the 50% dropout
• Otherwise, the sum of the values flowing into layer 2 would be cut in half (a sketch follows)
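A minimal sketch of dropout inserted into the inner training loop from Page 5. This follows the pattern described on this slide; the exact code on the slide figures is not reproduced here, and the earlier variables (images, labels, weights, relu, alpha) are reused:

error, correct_cnt = (0.0, 0)
for i in range(len(images)):
    layer_0 = images[i:i+1]
    layer_1 = relu(np.dot(layer_0, weights_0_1))

    # Bernoulli mask: 50% ones, 50% zeros; multiply layer 1 by 2 to keep its expected sum unchanged
    dropout_mask = np.random.randint(2, size=layer_1.shape)
    layer_1 *= dropout_mask * 2

    layer_2 = np.dot(layer_1, weights_1_2)
    error += np.sum((labels[i:i+1] - layer_2) ** 2)
    correct_cnt += int(np.argmax(layer_2) == np.argmax(labels[i:i+1]))

    layer_2_delta = (labels[i:i+1] - layer_2)
    layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
    layer_1_delta *= dropout_mask   # the mask also gates the delta flowing back toward layer 0
    # (presumably what the "need to multiply with layer 0" annotation on the Example slide refers to)

    weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
    weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)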

Page 15:

Page 16:

Output

Page 17:

Batch Gradient Descent

• We train on a batch of examples instead of one at a time
• It gives smoother accuracy because the noise of individual examples is averaged out
• Note that the learning rate alpha has to be increased in proportion to the batch size
• Accuracy is also better, due to the reduction in noise
• Computation is much faster because we work with one big matrix (a sketch follows)
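A minimal mini-batch version of the from-scratch loop from Page 5, assuming a batch size of 100 and a correspondingly larger alpha (both values here are illustrative, not taken from the slides):

batch_size = 100
alpha = 0.3   # roughly scaled up with the batch size

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(len(images) // batch_size):
        batch_start, batch_end = i * batch_size, (i + 1) * batch_size

        layer_0 = images[batch_start:batch_end]         # (100, 784)
        layer_1 = relu(np.dot(layer_0, weights_0_1))    # (100, 40)
        layer_2 = np.dot(layer_1, weights_1_2)          # (100, 10)

        error += np.sum((labels[batch_start:batch_end] - layer_2) ** 2)
        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) ==
                               np.argmax(labels[batch_start + k:batch_start + k + 1]))

        # average the output delta over the batch before backpropagating
        layer_2_delta = (labels[batch_start:batch_end] - layer_2) / batch_size
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)

        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)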

Page 18:

Code

Page 19:

Output

Page 20:

Activation Function

• An activation function is a function applied to the neurons in a layer during prediction
• ReLU is one example of an activation function
• There are some constraints on the properties of activation functions

Page 21:

Activation Function Properties

• The function must be continuous and infinite in domain
• The function is monotonic
• The function is nonlinear
• The function is efficiently computable

Page 22:

Examples of activation functions (definitions sketched below)

• Sigmoid
• Tanh
• ReLU (Rectified Linear Unit) and ELU (Exponential Linear Unit)
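For reference, the standard definitions of these functions written as standalone NumPy helpers (a sketch, not code from the slides):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                         # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                                        # squashes to (-1, 1)

def relu(x):
    return np.maximum(0, x)                                  # 0 for x < 0, identity for x >= 0

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1))      # smooth exponential negative tail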

Page 23:

Standard Recommendation

• Most recent papers tend to use ReLU as the baseline
• With a good enough configuration, ELU can give 2-3% better accuracy

Page 24:

Standard Output Layer Activation Functions

• Predict raw data value (regression) – no activation function
• Predict binary output (yes/no) – sigmoid activation function
• Predict which one (from many classes) – softmax activation function

Page 25:

Example

• Raw output

• Sigmoid

• Softmax

Page 26:

Softmax Function
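The standard definition, softmax(x)_i = exp(x_i) / sum_j exp(x_j), written as a small NumPy helper (a sketch for reference, not the code from the slide):

import numpy as np

def softmax(x):
    # subtract the max for numerical stability; the result sums to 1 across classes
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)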

Page 27:

Modify Activation Function

• Feed Forward

• Back Propagation

Page 28:

Derivative Function

• ReLU derivative function
• Multiply the delta by the weights, as in the sketch below
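For reference, the two lines from the Page 5 training loop that this slide refers to (shown here in isolation):

relu2deriv = lambda x: x >= 0   # ReLU derivative: 1 where the input was non-negative, 0 elsewhere

# propagate the delta back through the weights, then gate it by the activation's derivative
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)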

Page 29:

Back Propagation

Page 30:

Page 31:

Output

Page 32:

Weight Initialization

How do we initialize weight matrices?

• Constant initialization
• Uniform and normal distribution
• LeCun uniform and normal
• Glorot/Xavier uniform and normal
• He et al./Kaiming/MSRA uniform and normal

Page 33:

Constant Initialization

• We can initialize the weights with constant zeros, ones, or any other constant value (e.g. a matrix with 64 rows and 32 columns, as in the sketch below)
• It does not work well in practice because every weight starts with the same fixed value
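A minimal sketch of constant initialization for the 64 x 32 weight matrix mentioned on the slide (the variable names and the 0.5 constant are illustrative):

import numpy as np

# 64 rows, 32 columns
W_zeros = np.zeros((64, 32))        # all zeros
W_ones  = np.ones((64, 32))         # all ones
W_const = np.full((64, 32), 0.5)    # any other constant value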

Page 34:

Uniform and Normal Distribution

• A uniform distribution draws a random value from the range [lower, upper] with equal probability
• A normal distribution draws from a Gaussian distribution
• Both can be used for weight initialization, but various heuristics provide better performance
• e.g. a normal distribution with standard deviation = 0.05 (sketch below)

Image from Wikipedia
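A minimal NumPy sketch of both initializers. The 0.05 standard deviation follows the slide; the [-0.05, 0.05] bounds and the layer shape are illustrative:

import numpy as np

shape = (64, 32)   # illustrative layer shape

# uniform: every value in [lower, upper] is equally likely
W_uniform = np.random.uniform(low=-0.05, high=0.05, size=shape)

# normal: Gaussian with mean 0 and standard deviation 0.05
W_normal = np.random.normal(loc=0.0, scale=0.05, size=shape)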

Page 35:

LeCun Uniform and Normal

• The idea is to scale the distribution using the fan in (number of inputs to the layer) and fan out (number of outputs from the layer), combined with a uniform or normal distribution
• For the LeCun normal distribution, see the sketch below
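The standard definition (as used, for example, by Keras' lecun_normal and lecun_uniform initializers), shown as a NumPy sketch with an illustrative layer shape:

import numpy as np

fan_in, fan_out = 64, 32   # illustrative layer shape

# LeCun normal:  stddev = sqrt(1 / fan_in)
# LeCun uniform: limit  = sqrt(3 / fan_in), values drawn from [-limit, limit]
W = np.random.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))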

Page 36:

Glorot/Xavier Uniform and Normal

• Similar to LeCun with a minor change in the equation
• For the Glorot/Xavier normal distribution, see the sketch below
• Provides good performance in general cases
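The standard definition (as used, for example, by Keras' glorot_normal and glorot_uniform initializers); it averages fan in and fan out instead of using fan in alone:

import numpy as np

fan_in, fan_out = 64, 32   # illustrative layer shape, as in the LeCun sketch

# Glorot/Xavier normal:  stddev = sqrt(2 / (fan_in + fan_out))
# Glorot/Xavier uniform: limit  = sqrt(6 / (fan_in + fan_out)), values drawn from [-limit, limit]
W = np.random.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))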

Page 37:

He et al.

• Usually used for deep networks
• For the He et al. normal distribution, see the sketch below
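The standard definition (as used, for example, by Keras' he_normal and he_uniform initializers):

import numpy as np

fan_in, fan_out = 64, 32   # illustrative layer shape

# He normal:  stddev = sqrt(2 / fan_in)
# He uniform: limit  = sqrt(6 / fan_in), values drawn from [-limit, limit]
W = np.random.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))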

Page 38:

For deep learning, the network has to have more than two hidden layers.

Page 39:

Keras implementation of MNIST

from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report   # needed for the evaluation step on the next page
from keras.models import Sequential
from keras.layers.core import Dense
from keras.optimizers import SGD
from keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np

# load MNIST, flatten each 28x28 image into a 784-dim vector, and scale pixels to [0, 1]
(trainX, trainY), (testX, testY) = mnist.load_data()
trainX = trainX.reshape(trainX.shape[0], 28*28) / 255
testX = testX.reshape(testX.shape[0], 28*28) / 255

Page 40:

# one-hot encode the labels
lb = LabelBinarizer()
trainY = lb.fit_transform(trainY)
testY = lb.transform(testY)

# 784-256-128-10 fully connected network
model = Sequential()
model.add(Dense(256, input_shape=(784,), activation="sigmoid"))
model.add(Dense(128, activation="sigmoid"))
model.add(Dense(10, activation="softmax"))

# train the model using SGD
print("[INFO] training network...")
sgd = SGD(0.01)
model.compile(loss="categorical_crossentropy", optimizer=sgd,
              metrics=["accuracy"])
H = model.fit(trainX, trainY, validation_data=(testX, testY),
              epochs=100, batch_size=128)

# evaluate the network
print("[INFO] evaluating network...")
predictions = model.predict(testX, batch_size=128)
print(classification_report(testY.argmax(axis=1),
                            predictions.argmax(axis=1),
                            target_names=[str(x) for x in lb.classes_]))

Page 41:

# plot the training loss and accuracy
# (newer Keras versions store the accuracy history under "accuracy"/"val_accuracy" instead of "acc"/"val_acc")
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, 100), H.history["loss"], label="train_loss")
plt.plot(np.arange(0, 100), H.history["val_loss"], label="val_loss")
plt.plot(np.arange(0, 100), H.history["acc"], label="train_acc")
plt.plot(np.arange(0, 100), H.history["val_acc"], label="val_acc")
plt.title("Training Loss and Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend()
plt.show()
# plt.savefig("output_file")

Page 42:

Output

• The test result should reach around 92% accuracy

Page 43:

Four Ingredients in a NN Recipe

• Dataset – the more data, the better the accuracy
• Loss Function – usually categorical cross-entropy
• Model/Architecture – next slide
• Optimization – the default is Stochastic Gradient Descent

Page 44:

Model/Architecture

Which model to use depends on these questions:

• How many data points do you have?
• How many classes are there?
• How similar or dissimilar are the classes?
• How large is the intra-class variance?

Page 45:

CIFAR

from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report
from keras.models import Sequential
from keras.layers.core import Dense
from keras.optimizers import SGD
from keras.datasets import cifar10
import matplotlib.pyplot as plt
import numpy as np

Page 46:

print("[INFO] loading CIFAR-10 data...")

((trainX, trainY), (testX, testY)) = cifar10.load_data()

trainX = trainX.astype("float") / 255.0

testX = testX.astype("float") / 255.0

trainX = trainX.reshape((trainX.shape[0], 3072))

testX = testX.reshape((testX.shape[0], 3072))

lb = LabelBinarizer()

trainY = lb.fit_transform(trainY)

testY = lb.transform(testY)

labelNames = ["airplane", "automobile", "bird", "cat", "deer",

"dog", "frog", "horse", "ship", "truck"]

model = Sequential()

model.add(Dense(1024, input_shape=(3072,), activation="relu"))

model.add(Dense(512, activation="relu"))

model.add(Dense(10, activation="softmax"))

Page 47:

print("[INFO] training network...")

sgd = SGD(0.01)

model.compile(loss="categorical_crossentropy", optimizer=sgd,

metrics=["accuracy"])

H = model.fit(trainX, trainY, validation_data=(testX, testY),

epochs=100, batch_size=32)

print("[INFO] evaluating network...")

predictions = model.predict(testX, batch_size=32)

print(classification_report(testY.argmax(axis=1),

predictions.argmax(axis=1), target_names=labelNames))

Page 48:

# plot the training loss and accuracy
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, 100), H.history["loss"], label="train_loss")
plt.plot(np.arange(0, 100), H.history["val_loss"], label="val_loss")
plt.plot(np.arange(0, 100), H.history["acc"], label="train_acc")
plt.plot(np.arange(0, 100), H.history["val_acc"], label="val_acc")
plt.title("Training Loss and Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend()
plt.show()

Page 49:

CIFAR Output

Page 50:

Training Loss and Accuracy

Page 51:

Questions?