
Page 1:

Neural Network Regularization and Activation Function

Dr. Mongkol Ekpanyapong

Page 2:

MNIST Dataset

• MNIST is often used as the “hello world” dataset of machine learning
• Here, we will use an ANN for MNIST classification

Page 3:

One-hot encoding

• Each label is represented as a vector in which the entry at the label's index is set to 1 and all other entries are 0
• It helps the model produce cleaner classifications with less confusion between classes
• For example, averaging five samples of digit 1 and five samples of digit 3 under the integer representation (if done carelessly) yields digit 2, a class that never occurred (see the sketch below)
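A minimal NumPy sketch of the averaging pitfall described above (the variable names are illustrative, not from the slides):

import numpy as np

# Integer labels: five samples of digit 1 and five of digit 3
int_labels = np.array([1]*5 + [3]*5)
print(int_labels.mean())        # 2.0 -- looks like digit 2, which never occurred

# One-hot labels: the average keeps the two true classes separate
one_hot = np.zeros((10, 10))
one_hot[np.arange(10), int_labels] = 1
print(one_hot.mean(axis=0))     # 0.5 at index 1 and index 3, 0 elsewhere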

Page 4:

MNIST Training

import sys, numpy as np
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

images, labels = (x_train[0:1000].reshape(1000, 28*28) / 255, y_train[0:1000])

one_hot_labels = np.zeros((len(labels), 10))
for i, l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels

test_images = x_test.reshape(len(x_test), 28*28) / 255
test_labels = np.zeros((len(y_test), 10))
for i, l in enumerate(y_test):
    test_labels[i][l] = 1

np.random.seed(1)

relu = lambda x: (x >= 0) * x      # returns x if x >= 0, returns 0 otherwise
relu2deriv = lambda x: x >= 0      # returns 1 for input >= 0, returns 0 otherwise

alpha, iterations, hidden_size, pixels_per_image, num_labels = (0.005, 350, 40, 784, 10)

Page 5:

weights_0_1 = 0.2 * np.random.random((pixels_per_image, hidden_size)) - 0.1
weights_1_2 = 0.2 * np.random.random((hidden_size, num_labels)) - 0.1

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(len(images)):
        layer_0 = images[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)

        error += np.sum((labels[i:i+1] - layer_2) ** 2)
        correct_cnt += int(np.argmax(layer_2) == np.argmax(labels[i:i+1]))

        layer_2_delta = (labels[i:i+1] - layer_2)
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)

        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    if j == 0 or (j + 1) % 50 == 0:
        print("[INFO] epoch={}, loss={:.7f}".format(j + 1, error))

    sys.stdout.write("\r I:" + str(j) +
                     " Train-Err:" + str(error / float(len(images)))[0:5] +
                     " Train-Acc:" + str(correct_cnt / float(len(images))))

Page 6:

MNIST Training

• A three-layer ANN
• Accuracy reaches 1.00 on the training data
• What about the result on the test data?

Page 7:

Testing on the test data

# inside the outer training loop on Page 5 (j is the epoch index)
if (j % 10 == 0 or j == iterations - 1):
    error, correct_cnt = (0.0, 0)
    for i in range(len(test_images)):
        layer_0 = test_images[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)

        error += np.sum((test_labels[i:i+1] - layer_2) ** 2)
        correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))

    sys.stdout.write(" Test-Err:" + str(error / float(len(test_images)))[0:5] +
                     " Test-Acc:" + str(correct_cnt / float(len(test_images))) + "\n")
    print()

Page 8:

Results

Loss function

Page 9:

Memorization vs. Generalization

• Why does the test accuracy go down?
• The network is starting to memorize the training data
• This is known as overfitting (an ANN can get worse if we train it too much)
• How can we solve the problem?

Page 10:

Regularization

• Early Stopping

• Dropout

• Batch Gradient Descent

• Loss function modification

Page 11:

Early Stopping

• Once the network is trained too much, it starts to memorize the training data
• How do we know when to stop training?
• The only way to know is to run the model on data that isn't in your training set and that truly represents the real data
• This is the rationale behind the validation set (a sketch follows)
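A minimal sketch of how early stopping could be added to the from-scratch network from Pages 4-5. The patience logic and variable names here are illustrative rather than from the slides, and the MNIST test split stands in for a proper validation set:

best_val_err, best_weights, patience, wait = float("inf"), None, 10, 0

for j in range(iterations):
    # one training epoch (same inner loop as on Page 5)
    for i in range(len(images)):
        layer_0 = images[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        layer_2_delta = (labels[i:i+1] - layer_2)
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    # validation error after this epoch
    val_err = 0.0
    for i in range(len(test_images)):
        layer_1 = relu(np.dot(test_images[i:i+1], weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        val_err += np.sum((test_labels[i:i+1] - layer_2) ** 2)

    if val_err < best_val_err:
        best_val_err, wait = val_err, 0
        best_weights = (weights_0_1.copy(), weights_1_2.copy())   # remember the best model so far
    else:
        wait += 1
        if wait >= patience:            # no improvement for `patience` epochs in a row
            print("Early stopping at epoch", j)
            break

weights_0_1, weights_1_2 = best_weights  # roll back to the weights with the best validation error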

Page 12:

Dropout

• Randomly turns neurons off (sets their activations to 0)
• Dropout makes a big network act like a small one by randomly training little subsections of the network at a time
• Small networks are harder to overfit

Page 13:

Example

(code figure; annotation: "need to multiply with layer 0")

Page 14:

Explanation

• We draw a Bernoulli random mask with 50% ones and 50% zeros
• Note that we have to multiply the output of layer 1 by 2 to compensate for the 50% dropout
• Otherwise, the sum of the values flowing into layer 2 would be cut in half (a sketch follows)
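A minimal sketch of dropout inserted into the inner training loop from Page 5. This follows the pattern described on this slide; the exact code on the slide figures is not reproduced here, and the earlier variables (images, labels, weights, relu, alpha) are reused:

error, correct_cnt = (0.0, 0)
for i in range(len(images)):
    layer_0 = images[i:i+1]
    layer_1 = relu(np.dot(layer_0, weights_0_1))

    # Bernoulli mask: 50% ones, 50% zeros; multiply layer 1 by 2 to keep its expected sum unchanged
    dropout_mask = np.random.randint(2, size=layer_1.shape)
    layer_1 *= dropout_mask * 2

    layer_2 = np.dot(layer_1, weights_1_2)
    error += np.sum((labels[i:i+1] - layer_2) ** 2)
    correct_cnt += int(np.argmax(layer_2) == np.argmax(labels[i:i+1]))

    layer_2_delta = (labels[i:i+1] - layer_2)
    layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
    layer_1_delta *= dropout_mask   # the mask also gates the delta flowing back toward layer 0
    # (presumably what the "need to multiply with layer 0" annotation on the Example slide refers to)

    weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
    weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)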

Page 15:

Page 16:

Output

Page 17:

Batch Gradient Descent

• We train on a batch of examples instead of one at a time
• It gives smoother accuracy because the noise of individual examples is averaged out
• Note that the learning rate alpha has to be increased in proportion to the batch size
• Accuracy is also better, due to the reduction in noise
• Computation is much faster because we work with one big matrix (a sketch follows)
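A minimal mini-batch version of the from-scratch loop from Page 5, assuming a batch size of 100 and a correspondingly larger alpha (both values here are illustrative, not taken from the slides):

batch_size = 100
alpha = 0.3   # roughly scaled up with the batch size

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(len(images) // batch_size):
        batch_start, batch_end = i * batch_size, (i + 1) * batch_size

        layer_0 = images[batch_start:batch_end]         # (100, 784)
        layer_1 = relu(np.dot(layer_0, weights_0_1))    # (100, 40)
        layer_2 = np.dot(layer_1, weights_1_2)          # (100, 10)

        error += np.sum((labels[batch_start:batch_end] - layer_2) ** 2)
        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) ==
                               np.argmax(labels[batch_start + k:batch_start + k + 1]))

        # average the output delta over the batch before backpropagating
        layer_2_delta = (labels[batch_start:batch_end] - layer_2) / batch_size
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)

        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)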

Page 18:

Code

Page 19:

Output

Page 20:

Activation Function

• An activation function is a function applied to the neurons in a layer during prediction
• ReLU is one example of an activation function
• There are some constraints on the properties of activation functions

Page 21:

Activation Function Properties

• The function must be continuous and infinite in domain
• The function is monotonic
• The function is nonlinear
• The function is efficiently computable

Page 22:

Examples of activation functions (definitions sketched below)

• Sigmoid
• Tanh
• ReLU (Rectified Linear Unit) and ELU (Exponential Linear Unit)
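For reference, the standard definitions of these functions written as standalone NumPy helpers (a sketch, not code from the slides):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                         # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                                        # squashes to (-1, 1)

def relu(x):
    return np.maximum(0, x)                                  # 0 for x < 0, identity for x >= 0

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1))      # smooth exponential negative tail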

Page 23:

Standard Recommendation

• Most recent papers tend to use ReLU as the baseline
• With a good enough configuration, ELU can give 2-3% better accuracy

Page 24:

Standard Output Layer Activation Functions

• Predict raw data value (regression) – no activation function
• Predict binary output (yes/no) – sigmoid activation function
• Predict which one (from many classes) – softmax activation function

Page 25:

Example

• Raw output

• Sigmoid

• Softmax

Page 26:

Softmax Function
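The standard definition, softmax(x)_i = exp(x_i) / sum_j exp(x_j), written as a small NumPy helper (a sketch for reference, not the code from the slide):

import numpy as np

def softmax(x):
    # subtract the max for numerical stability; the result sums to 1 across classes
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)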

Page 27:

Modify Activation Function

• Feed Forward

• Back Propagation

Page 28:

Derivative Function

• ReLU derivative function
• Multiply the delta by the weights, as in the sketch below
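For reference, the two lines from the Page 5 training loop that this slide refers to (shown here in isolation):

relu2deriv = lambda x: x >= 0   # ReLU derivative: 1 where the input was non-negative, 0 elsewhere

# propagate the delta back through the weights, then gate it by the activation's derivative
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)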

Page 29:

Back Propagation

Page 30:

Page 31:

Output

Page 32:

Weight Initialization

How do we initialize weight matrices?

• Constant initialization
• Uniform and normal distribution
• LeCun uniform and normal
• Glorot/Xavier uniform and normal
• He et al./Kaiming/MSRA uniform and normal

Page 33:

Constant Initialization

• We can initialize the weights with constant zeros, ones, or any other constant value (e.g. a matrix with 64 rows and 32 columns, as in the sketch below)
• It does not work well in practice because every weight starts with the same fixed value
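A minimal sketch of constant initialization for the 64 x 32 weight matrix mentioned on the slide (the variable names and the 0.5 constant are illustrative):

import numpy as np

# 64 rows, 32 columns
W_zeros = np.zeros((64, 32))        # all zeros
W_ones  = np.ones((64, 32))         # all ones
W_const = np.full((64, 32), 0.5)    # any other constant value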

Page 34:

Uniform and Normal Distribution

• A uniform distribution draws a random value from the range [lower, upper] with equal probability
• A normal distribution draws from a Gaussian distribution
• Both can be used for weight initialization, but various heuristics provide better performance
• e.g. a normal distribution with standard deviation = 0.05 (sketch below)

Image from Wikipedia
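A minimal NumPy sketch of both initializers. The 0.05 standard deviation follows the slide; the [-0.05, 0.05] bounds and the layer shape are illustrative:

import numpy as np

shape = (64, 32)   # illustrative layer shape

# uniform: every value in [lower, upper] is equally likely
W_uniform = np.random.uniform(low=-0.05, high=0.05, size=shape)

# normal: Gaussian with mean 0 and standard deviation 0.05
W_normal = np.random.normal(loc=0.0, scale=0.05, size=shape)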

Page 35:

LeCun Uniform and Normal

• The idea is to scale the distribution using the fan in (number of inputs to the layer) and fan out (number of outputs from the layer), combined with a uniform or normal distribution
• For the LeCun normal distribution, see the sketch below
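The standard definition (as used, for example, by Keras' lecun_normal and lecun_uniform initializers), shown as a NumPy sketch with an illustrative layer shape:

import numpy as np

fan_in, fan_out = 64, 32   # illustrative layer shape

# LeCun normal:  stddev = sqrt(1 / fan_in)
# LeCun uniform: limit  = sqrt(3 / fan_in), values drawn from [-limit, limit]
W = np.random.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))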

Page 36:

Glorot/Xavier Uniform and Normal

• Similar to LeCun with a minor change in the equation
• For the Glorot/Xavier normal distribution, see the sketch below
• Provides good performance in general cases
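The standard definition (as used, for example, by Keras' glorot_normal and glorot_uniform initializers); it averages fan in and fan out instead of using fan in alone:

import numpy as np

fan_in, fan_out = 64, 32   # illustrative layer shape, as in the LeCun sketch

# Glorot/Xavier normal:  stddev = sqrt(2 / (fan_in + fan_out))
# Glorot/Xavier uniform: limit  = sqrt(6 / (fan_in + fan_out)), values drawn from [-limit, limit]
W = np.random.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))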

Page 37:

He et al.

• Usually used for deep networks
• For the He et al. normal distribution, see the sketch below
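The standard definition (as used, for example, by Keras' he_normal and he_uniform initializers):

import numpy as np

fan_in, fan_out = 64, 32   # illustrative layer shape

# He normal:  stddev = sqrt(2 / fan_in)
# He uniform: limit  = sqrt(6 / fan_in), values drawn from [-limit, limit]
W = np.random.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))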

Page 38:

For deep learning, the network has to have more than two hidden layers.

Page 39:

Keras implementation of MNIST

from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report   # needed for the evaluation step on the next page
from keras.models import Sequential
from keras.layers.core import Dense
from keras.optimizers import SGD
from keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np

# load MNIST, flatten each 28x28 image into a 784-dim vector, and scale pixels to [0, 1]
(trainX, trainY), (testX, testY) = mnist.load_data()
trainX = trainX.reshape(trainX.shape[0], 28*28) / 255
testX = testX.reshape(testX.shape[0], 28*28) / 255

Page 40:

# one-hot encode the labels
lb = LabelBinarizer()
trainY = lb.fit_transform(trainY)
testY = lb.transform(testY)

# 784-256-128-10 fully connected network
model = Sequential()
model.add(Dense(256, input_shape=(784,), activation="sigmoid"))
model.add(Dense(128, activation="sigmoid"))
model.add(Dense(10, activation="softmax"))

# train the model using SGD
print("[INFO] training network...")
sgd = SGD(0.01)
model.compile(loss="categorical_crossentropy", optimizer=sgd,
              metrics=["accuracy"])
H = model.fit(trainX, trainY, validation_data=(testX, testY),
              epochs=100, batch_size=128)

# evaluate the network
print("[INFO] evaluating network...")
predictions = model.predict(testX, batch_size=128)
print(classification_report(testY.argmax(axis=1),
                            predictions.argmax(axis=1),
                            target_names=[str(x) for x in lb.classes_]))

Page 41:

# plot the training loss and accuracy
# (newer Keras versions store the accuracy history under "accuracy"/"val_accuracy" instead of "acc"/"val_acc")
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, 100), H.history["loss"], label="train_loss")
plt.plot(np.arange(0, 100), H.history["val_loss"], label="val_loss")
plt.plot(np.arange(0, 100), H.history["acc"], label="train_acc")
plt.plot(np.arange(0, 100), H.history["val_acc"], label="val_acc")
plt.title("Training Loss and Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend()
plt.show()
# plt.savefig("output_file")

Page 42:

Output

• The test result should reach around 92% accuracy

Page 43:

Four Ingredients in a NN Recipe

• Dataset – the more data, the better the accuracy
• Loss Function – usually categorical cross-entropy
• Model/Architecture – next slide
• Optimization – the default is Stochastic Gradient Descent

Page 44:

Model/Architecture

Which model to use depends on these questions:

• How many data points do you have?
• How many classes are there?
• How similar or dissimilar are the classes?
• How large is the intra-class variance?

Page 45:

CIFAR

from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report
from keras.models import Sequential
from keras.layers.core import Dense
from keras.optimizers import SGD
from keras.datasets import cifar10
import matplotlib.pyplot as plt
import numpy as np

Page 46:

print("[INFO] loading CIFAR-10 data...")

((trainX, trainY), (testX, testY)) = cifar10.load_data()

trainX = trainX.astype("float") / 255.0

testX = testX.astype("float") / 255.0

trainX = trainX.reshape((trainX.shape[0], 3072))

testX = testX.reshape((testX.shape[0], 3072))

lb = LabelBinarizer()

trainY = lb.fit_transform(trainY)

testY = lb.transform(testY)

labelNames = ["airplane", "automobile", "bird", "cat", "deer",

"dog", "frog", "horse", "ship", "truck"]

model = Sequential()

model.add(Dense(1024, input_shape=(3072,), activation="relu"))

model.add(Dense(512, activation="relu"))

model.add(Dense(10, activation="softmax"))

Page 47:

print("[INFO] training network...")

sgd = SGD(0.01)

model.compile(loss="categorical_crossentropy", optimizer=sgd,

metrics=["accuracy"])

H = model.fit(trainX, trainY, validation_data=(testX, testY),

epochs=100, batch_size=32)

print("[INFO] evaluating network...")

predictions = model.predict(testX, batch_size=32)

print(classification_report(testY.argmax(axis=1),

predictions.argmax(axis=1), target_names=labelNames))

Page 48:

# plot the training loss and accuracy
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, 100), H.history["loss"], label="train_loss")
plt.plot(np.arange(0, 100), H.history["val_loss"], label="val_loss")
plt.plot(np.arange(0, 100), H.history["acc"], label="train_acc")
plt.plot(np.arange(0, 100), H.history["val_acc"], label="val_acc")
plt.title("Training Loss and Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend()
plt.show()

Page 49:

CIFAR Output

Page 50:

Training Loss and Accuracy

Page 51:

Questions?