
DEGREE PROJECT IN COMPUTER SCIENCE AND COMMUNICATION, SECOND LEVEL

STOCKHOLM, SWEDEN 2015

Hand Detection and Pose Estimation using Convolutional Neural Networks

ADAM KNUTSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Hand Detection and Pose Estimation using Convolutional Neural Networks

ADAM KNUTSSON

Master's Thesis at the Computer Vision and Active Perception Lab (CVAP)
School of Computer Science and Communication (CSC)
KTH Royal Institute of Technology, Sweden

Supervisors: Hedvig Kjellström & Alessandro Pieropan
Examiner: Danica Kragic Jensfelt


Acknowledgements

First and foremost, I would like to thank Hedvig Kjellström and Alessandro Pieropan at the Computer Vision and Active Perception Lab (CVAP), School of Computer Science and Communication (CSC), KTH Royal Institute of Technology, for their excellent supervision of this project. After our meetings during the project, I have always felt inspired with new ideas, empowered to continue my work, and reassured that the scientific direction of the project is interesting and relevant. Without their support and guidance, the outcome of the project would probably be very different.

Furthermore, since this project is one of the final elements to be completed in my M.Sc. degree, I would like to thank all the people that I have had the benefit of meeting and working with during my time at KTH, especially my fellow students. I've been on a fantastic journey and truly grown as a person, and I know that this development is not a product of my own education, but a product of all the people that I have met and worked with.

I'd also like to thank my family for their endless support. Even though science and technology are not among their main areas of interest, they have always been very supportive of me walking different paths in life than the ones they chose for themselves. Their support has been, and will continue to be, invaluable to me.

Finally, I would like to convey my gratitude to NVIDIA Corporation and their hardware donation program for supplying this project with a Tesla K40 GPU.


Abstract

This thesis examines how convolutional neural networks can be applied to the problem of hand detection and hand pose estimation.

Two families of convolutional neural networks are trained, aimed at performing the task of classification or regression. The networks are trained on specialized data generated from publicly available datasets. The algorithms used to generate the specialized data are also disclosed.

The main focus has been to investigate the different structural properties of convolutional neural networks, not to build optimized hand detection or hand pose estimation systems.

Experiments revealed that classifier networks featuring a relatively high number of convolutions offer the highest performance on external validation data. Additionally, shallow classifier networks featuring a relatively low number of convolutions yield a high classification accuracy on training and testing data, but a very low accuracy on the validation set. This effect uncovers one of the fundamental difficulties in building a hand detection system: the asymmetric classification problem. Further investigation also suggests that relatively shallow classifier networks probably become merely color sensitive.

Furthermore, regressor networks featuring multiscale inputs typically yielded the lowest error when tasked with computing key-point locations directly from data. It is also revealed that color data implicitly contain more information, making it easier to compute key-point locations, especially in the image space. However, to be able to derive color-invariant features, deeper regressor networks are required.


Referat

Hand detection and pose estimation using convolutional neural networks

This degree project examines how convolutional neural networks can be used for detection of, and estimation of the pose of, hands.

Two families of neural networks are trained, with the aim of performing classification or regression. The networks are trained on specialized data generated from publicly available datasets. The algorithms for generating the specialized data are also presented in full.

The main purpose of the work has been to investigate the structural properties of the networks, and relate these to performance, not to build a complete system for hand detection or hand pose estimation.

The experiments showed that classifier networks with a relatively large number of convolutions give the highest performance on validation data. Furthermore, classifier networks with a relatively small number of convolutions seem to give good performance on training and testing data, but very poor performance on validation data. This relationship reveals a fundamental difficulty in training a neural network for classification of hands, namely the strongly asymmetric classification problem. Further investigation also indicates that classifier networks with a relatively small number of convolutions probably just become color sensitive.

The experiments also showed that regressor networks using data at multiple scales gave the lowest error when computing the positions of hand key-points directly from data. Finally, it emerged that color data, in contrast to depth data, implicitly contain more information, which makes it comparatively easier to compute key-points, above all in the two-dimensional image space. However, to extract this implicit information, relatively deep networks are required.


Contents

1 Introduction 1
  1.1 The Human Hand: An Abstract Concept 1
  1.2 Contributions 2
  1.3 Related Work 2
    1.3.1 Hand Tracking 2
    1.3.2 Deep Neural Networks 2
    1.3.3 GPGPU & HPC Technology 3
    1.3.4 Software 4
    1.3.5 Datasets 5
  1.4 Report Disposition and Structure 5

I Convolutional Neural Networks 7

2 Artificial Neural Networks 9
  2.1 The Artificial Neuron 9
    2.1.1 The Perceptron 10
    2.1.2 The Multilayer Perceptron 12
    2.1.3 Activation Functions 13
  2.2 Alternative Interconnection Operations 15
    2.2.1 Convolution Layers 16
    2.2.2 Pooling Layers 16
    2.2.3 Local Response Normalization 18

3 Learning in Multilayer Networks 19
  3.1 The Classic Backpropagation Algorithm 19
  3.2 Gradient Approximations 21
    3.2.1 Stochastic Gradient Descent (SGD) 21
  3.3 Loss Functions 21
    3.3.1 Euclidean Loss 22
    3.3.2 Softmax Loss 22
  3.4 Learning Parameters 22
    3.4.1 Learning Rate 23
    3.4.2 Momentum 23
  3.5 Weights 23
    3.5.1 Initialization 23
    3.5.2 Decay 25
  3.6 Learning Rate Policy 26
    3.6.1 Step (STEP) 26

4 Network Architectures 27
  4.1 Classification 27
    4.1.1 Convolutional Layers 28
    4.1.2 Fully Connected Layers 32
  4.2 Regression 34
    4.2.1 Shallow Network 34
    4.2.2 Deep Network 36
    4.2.3 Multiscale Network 36

II Using Convolutional Neural Networks for Hand Detection and Pose Estimation 39

5 Datasets 41
  5.1 The NYU Hand Pose Dataset (NYU) 41
    5.1.1 Data 41
    5.1.2 Labels 42
  5.2 The Oxford Hand Dataset (OX) 44
    5.2.1 Data 44
    5.2.2 Labels 45

6 Hand Detection 47
  6.1 Training & Testing Data 47
  6.2 Validation Data 49
  6.3 Training Details 51
    6.3.1 Learning Mechanism 52
  6.4 Classification Performance 52
    6.4.1 Classification Accuracy on Training & Testing Set 52
    6.4.2 Examples of Misclassified Samples in the Testing Set 60
    6.4.3 Classification Accuracy on Validation Set 60
  6.5 Detection Test 62
    6.5.1 Test Setup 62
    6.5.2 Detection & Precision Result 62

7 Hand Pose Estimation 67
  7.1 Training & Testing Data 67
  7.2 Data Interfacing 69
    7.2.1 Depth Data 69
    7.2.2 RGB Data 70
    7.2.3 RGB & Depth Data 70
  7.3 Training Details 72
    7.3.1 Learning Mechanism 72
  7.4 Euclidean Loss on Training & Testing Set 73
  7.5 Key-point Prediction on Test Set Data 74

III Analysis 87

8 Discussion 89
  8.1 Hand Detection 89
    8.1.1 High Test Set Accuracy 89
    8.1.2 High Performance of Relatively Shallow Networks 90
    8.1.3 Maximum Test Set Accuracy 90
    8.1.4 The Asymmetric Classification Problem 90
    8.1.5 Number of Convolutions and Validation Set Performance 91
  8.2 Hand Pose Estimation 91
    8.2.1 Relative Performance of Regressor Networks 91
    8.2.2 Information Content in Different Types of Data 92

9 Conclusions & Future Work 93
  9.1 Conclusions 93
  9.2 Future Work 94
    9.2.1 The Asymmetric Classification Problem and Data 94
    9.2.2 Non-naive Convolutional Neural Networks 94
    9.2.3 Low Dimensional Embedding and Model Constraints 95

Bibliography 97

Appendices 100

A Additional Training Strategies 101
  A.1 Gradient Approximations 101
    A.1.1 Nesterov's Accelerated Gradient (NESTEROV) 101
    A.1.2 Adaptive Gradient (ADAGRAD) 101
  A.2 Learning Rate Policy 102
    A.2.1 Exponential (EXP) 102
    A.2.2 Inverse (INV) 102
  A.3 Activation Functions 103
    A.3.1 The Absolute Value Unit (ABSVAL) 103
    A.3.2 The Hyperbolic Tangent Unit (TANH) 103
    A.3.3 The Binomial Normal Log Likelihood Unit (BNLL) 104

Chapter 1

Introduction

1.1 The Human Hand: An Abstract Concept

The human hand is an intricate and complex piece of the human anatomy. Consisting of 27 bones and a large set of muscles and tendons, an essential model of the hand can easily constitute 30–50 degrees of freedom [1]. The hand also contains many different types of ligaments, each imposing constraints on the motion and flexibility of the hand. However, the motion ability, visual attributes and structure of a hand can vary significantly between individuals. The hand is also subject to extensive occlusion, both caused by the hand itself, e.g. from crossing one's fingers or clenching the hand into a closed fist, and from objects that the hand is interacting with, e.g. when grasping a ball or playing an instrument.

While humans are quite good at inferring the presence and pose of a hand, even during complex motion and strong occlusion, the task is relatively difficult for machines and computer vision systems compared to, e.g., face detection and head pose estimation [2]. This is because the complex and articulated structure of the hand makes the mapping from appearance to pose highly non-linear. The high level of non-linearity makes the task difficult for classic feature-based machine learning methods.

However, in recent years, machine learning methods capable of performing so-called deep (or hierarchical) learning have been introduced. This family of methods has the ability to model high-level abstractions, and a high degree of non-linearity, in different types of data. In other words, the algorithms are capable of learning abstract concepts that are composed of sets of less abstract features.

One group of methods within the deep learning family, called convolutional neural networks, is able to learn highly non-linear mappings by interconnecting layers of artificial neurons in a fashion inspired by biological systems and neuroscience. A convolutional neural network may consist of many different layer types, with different interconnection patterns and activation functions at different layers, which makes operations layer dependent. This is in clear contrast to the more classic, homogeneously interconnected neural networks, e.g. the Multilayer Perceptron (MLP) and the Hopfield Network.

The non-linear modeling capability of convolutional neural networks, and their relation to biological vision systems, which are highly successful and sophisticated, make the method a suitable candidate for the task of hand detection and pose estimation.

1.2 Contributions

In this report, several convolutional neural network architectures suitable for hand detection and pose estimation are evaluated. The evaluated architectures are mainly simplified, naïve versions of architectures previously applied with success to well-known computer vision challenges by other authors within the computer vision community.

Also, algorithms are developed to derive specialized datasets, usable for training,testing and validation, from publicly available datasets.

However, due to the nature of the project, this report omits meta-parameter optimization in training. Instead, the different structural properties of the convolutional neural networks are examined.

1.3 Related Work

1.3.1 Hand Tracking

One of the simplest and, unfortunately, most computationally demanding approaches to hand detection and pose estimation is a template-based search. This approach can be implemented successfully despite the complexity of the human hand. However, some non-trivial simplifications have to be made in order for the system to perform well, e.g. using a colored glove and an RGB camera [3].

Another widely used approach in computer vision and parameter estimation is the utilization of feature extractors. An extractor generates points in a high-dimensional feature space given the input data, e.g. an RGB image, on which classification is then performed. Examples of feature extractors that have been used are Histogram of Oriented Gradients (HOG), Hu moments and Shape Context Descriptors (SCD) [1]. Feature extractors can also be used in conjunction with frequency analysis to increase performance [4].

Probabilistic and generative methods can also be utilized. Markov Random Fields have proven to yield good detection and pose estimation performance for the human hand, even when the hand is interacting with objects [5]. Hierarchical Bayesian Filters have also been implemented, resulting in a great performance improvement compared to the more classic particle filters [6].

1.3.2 Deep Neural Networks

Neural networks have regained a lot of scientific traction in recent years. The convolutional neural network, a deep neural network with receptive-field properties inspired by the visual cortex of animals [7], has become the dominant approach to image classification, object detection and localization, e.g. in the ImageNet Large Scale Visual Recognition Challenge [8, 9]. Convolutional neural networks have also proven to be a feasible approach in areas where other technologies are computationally intractable [10].

Convolutional networks were first applied with great success to image classification, sparking the remarkable interest in using deep neural networks in many different kinds of applications [11]. The functionality of the convolutional neural network was then extended to object detection and localization using trainable regressor networks [12, 13].

A convolutional neural network can be considered to have (at least) two distinct layer classes: convolutional and classification layers. The convolutional layers act as adaptive feature extractors, capable of learning to decompose raw input data into hierarchical features [14]. Already-trained convolutional layers can be reused for other recognition tasks with great results by retraining only the classification layers on new data [15].

Convolutional neural networks have also shown a remarkable ability to learn highly complex tasks from unprocessed, raw input data in areas such as natural language processing, without any information about underlying language semantics or syntax [16].

One approach to improving performance and extending the functionality of neural networks is to interconnect network modules instead of increasing the number of layers and neurons per layer [17]. In order to train larger and/or deeper networks, the learning rule and training strategy have to be revised. One important milestone in neural network training is the introduction of Stochastic Gradient Descent [18]. Also, as networks become larger, and thereby more powerful, one has to take action to prevent the system from overfitting. One approach is to force individual neurons to take on learning responsibility from neighboring neurons by introducing random "brain damage". This scheme, called Dropout, has been applied with great success [19, 11].
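The Dropout scheme just described is simple enough to sketch directly. The following is an illustrative NumPy implementation of so-called inverted dropout; the function name, the default rate of 0.5 and the scaling convention are assumptions for illustration, not details taken from this thesis:

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True, rng=None):
    """Randomly zero out neurons ("brain damage") during training.

    Uses the "inverted dropout" convention: surviving activations are
    scaled by 1 / (1 - drop_prob), so no rescaling is needed at test time.
    """
    if not training or drop_prob == 0.0:
        return activations
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= drop_prob  # True = neuron survives
    return activations * mask / (1.0 - drop_prob)

# Example: roughly half of the activations are silenced, survivors doubled.
acts = np.ones(10)
out = dropout(acts, drop_prob=0.5, rng=np.random.default_rng(0))
```

Because surviving activations are rescaled during training, the expected activation is unchanged and no compensation is needed at test time.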

1.3.3 GPGPU & HPC Technology

The essential arithmetic required by neural networks is multiplication (e.g. synaptic strength modulation), summation (e.g. neuron excitement) and function evaluation (e.g. neuron activation and output). A relatively large, deep and densely connected neural network can easily consist of billions of parameters that require tuning in order for the system to emulate learning. However, the structure of a neural network makes it highly suitable for parallel computation [21, 22].

During the past 15 years, processor technology has forked into two distinct development branches: throughput-optimized and latency-optimized systems [23]. Graphics Processing Units (GPUs) have been developed into high-throughput systems, capable of supplying highly parallel computing resources in consumer-grade products, see Figure 1.1.

Figure 1.1: Development of theoretical GFLOP/s performance of different computing hardware over time [20]. Blue lines represent CPU single- and double-precision performance; green lines represent GPU single- and double-precision performance.

Another, more recent, key enhancement is the development of General-Purpose Computing on Graphics Processing Units (GPGPU) platforms, which allow GPUs to be utilized for non-graphics-related computations. One such platform is the NVIDIA Compute Unified Device Architecture (CUDA), which exposes the virtual instruction set of the NVIDIA GPU to C and C++ [24, 20].

Finally, GPU developers are researching and providing GPU-accelerated libraries that can be used to increase performance in certain applications such as machine learning and signal processing. One recently released library that can be used to accelerate neural networks is the NVIDIA cuDNN library [25].

1.3.4 Software

The recently developed Caffe deep learning framework can be used to combine effective low-level GPGPU operations suitable for neural networks, an expressive and modular architecture, and a high-level programming language interface [26]. Since neural network operations often consist of linear algebra [22], Caffe requires the Basic Linear Algebra Subprograms (BLAS) [27] and the Boost C++ library [28]. Caffe also requires the OpenCV library for image and data pre-processing as well as high-level machine learning tasks. OpenCV can also provide generic image I/O and display interfaces from a high-level programming language.

1.3.5 Datasets

In recent years, many datasets aimed at hand pose estimation have been presented [29]. The ones most interesting and suitable for this project are the NYU Hand Pose Dataset and the Oxford Hand Dataset. The NYU Hand Pose Dataset consists of about 243 k RGB-D images labeled with the key-point locations of the hand [30]. The Oxford Hand Dataset consists of about 13 k RGB images of different scenes featuring humans, with bounding-box annotations around the hands visible in the image [31].

As a final remark, large amounts of synthetic data depicting hands can be generated using computer software and graphics rendering engines, e.g. the LibHand library [32, 1].

1.4 Report Disposition and Structure

Chapter 2 gives a small historic overview, presents the fundamental idea behind neural networks, and describes the basic components of a convolutional neural network in detail. Chapter 3 explains the fundamental learning mechanism, backwards propagation of errors, used to train neural networks. In other words, the purpose of Chapters 2 and 3 is to familiarize the reader with the basic operational principles and mechanisms of neural networks.

Chapter 4 then describes network architectures suitable for the different tasks. Section 4.2 presents architectures suitable for hand pose estimation, and Section 4.1 presents architectures suitable for hand classification in a detector framework.

The utilized public datasets, and their content, are presented in Chapter 5.

The experimental setup, e.g. training, testing and validation data and meta-parameter settings, and the performance of the networks suitable for hand pose estimation are presented in Chapter 7. The corresponding information about the networks suitable for hand classification is presented in Chapter 6. Chapter 6 also evaluates the detector performance of some suitable network classifiers.

The results are analyzed and discussed in Chapter 8 and, finally, conclusions and suggestions regarding future work are presented in Chapter 9.


Part I

Convolutional Neural Networks


Chapter 2

Artificial Neural Networks

A convolutional neural network is a special type of feed-forward, non-recurrent artificial neural network where the neurons are tiled so that they respond to overlapping segments of the data. In order to familiarize the reader with the concept of convolutional neural networks, this chapter presents the fundamental components of such a network, as well as a small historical overview of their development and motivation.
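The "tiled, overlapping" idea can be illustrated in a few lines: every output neuron applies the same weights to a shifted window of the input, which is exactly a one-dimensional valid correlation (the operation convolution layers compute). The window size, weights and data below are arbitrary illustrations, not values from this thesis:

```python
import numpy as np

def tiled_responses(x, w, stride=1):
    """Illustration of the 'tiled neuron' idea: every output neuron
    applies the same weights w to an overlapping window of the input x,
    which is exactly a (valid) 1-D correlation."""
    n = len(w)
    return np.array([np.dot(x[i:i + n], w)
                     for i in range(0, len(x) - n + 1, stride)])

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, -1.0])          # shared weights for every tile
r = tiled_responses(x, w)          # adjacent windows overlap by one sample
```

Weight sharing across tiles is what distinguishes this from a fully connected layer: the same small filter is reused at every position.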

2.1 The Artificial Neuron

The human brain consists of about 10^11 neurons, each of which is connected to thousands of other neurons, yielding a massive 10^14 connections within the brain [22]. There are many different types of connective configurations, since different groups of neurons perform different tasks. However, the most basic function of all neurons is the same: to decide whether or not to propagate an action potential through the axon, given an input to the dendrites [33].

The study of biological, individual neurons has historically been difficult due to their small size and complex structure. However, in 1952, Alan Lloyd Hodgkin and Andrew Huxley were able to measure and characterize a differential equation, called the Hodgkin-Huxley model, that describes the membrane potential within a neuron, by studying the nervous system of a giant squid. The model earned them the Nobel prize in 1963 [22].

However, the Hodgkin-Huxley model is based on knowledge of chemical concentrations in the nervous system, which describes more than the basic input-output relationship of a neuron. A model that only describes this relationship was suggested by Warren S. McCulloch and Walter Pitts in 1943 [22]. It consists of three components:

• A set of weighted inputs, corresponding to the interconnecting synapses andthe dendrites of a neuron.


Figure 2.1: A schematic of a McCulloch and Pitts artificial neuron with three inputs and input values, a_n, a set of synaptic weights, w_n, a summation function, and a generic activation function, f(·), yielding an activation, ϕ. In this example N = 3.

• A summation of the weighted inputs, corresponding to the membrane and soma of the neuron.

• An activation function, which historically has been a threshold function yielding a (binary) action potential, corresponding to the axon of the neuron. However, in artificial neural networks, other activation functions are typically used. These are described in Section 2.1.3.

A schematic of a McCulloch and Pitts neuron is shown in Figure 2.1. The operation performed by this neuron is the one typically referred to in this report. The output of the neuron is the weighted sum of the input values mapped by the activation function, f(·), i.e.

\[
\varphi = f\left(\sum_{n=1}^{N} a_n w_n\right). \tag{2.1}
\]

One can observe that the sum computed by the neuron (or the output of the neuron if a linear unity activation function is used, i.e. f(x) = x) can be described as the matrix operation

\[
\sum_{n=1}^{N} a_n w_n =
\begin{pmatrix} a_1 & a_2 & \cdots & a_N \end{pmatrix}
\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{pmatrix}
= \vec{A}\,\vec{w} \tag{2.2}
\]

where \(\vec{A}\) is the input vector and \(\vec{w}\) is the weight vector.
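As a concrete sketch, the neuron of Equation 2.1 is a dot product followed by an activation function. The threshold choice of f below, and the specific input and weight values, are illustrative assumptions:

```python
import numpy as np

def neuron_output(a, w, f=lambda s: 1.0 if s > 0 else 0.0):
    """McCulloch-Pitts style neuron: the weighted sum of inputs mapped
    through an activation function f, as in Equation 2.1."""
    s = float(np.dot(a, w))   # the matrix form A.w of Equation 2.2
    return f(s)

a = np.array([0.5, -1.0, 2.0])   # input values a_n (N = 3)
w = np.array([1.0, 0.5, 0.25])   # synaptic weights w_n
y = neuron_output(a, w)          # sum = 0.5 - 0.5 + 0.5 = 0.5 > 0, so it fires
```

Passing the identity function for f recovers the raw weighted sum, i.e. the linear unity case mentioned above.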

2.1.1 The Perceptron

The perceptron is able to perform simple machine learning tasks, e.g. linear classification and linear regression [22]. An example of a perceptron is shown in Figure 2.2. It is essentially a set of artificial neurons with (typically) thresholded binary outputs, i.e.

2.1. THE ARTIFICIAL NEURON

Figure 2.2: A schematic of a perceptron featuring three inputs and input values, a_n, and two neurons with generic activation functions, f(·), yielding a set of activations, ϕ_m. In this example N = 3 and M = 2. The input neurons are not performing any computations; they are only depicted in order to show how the different input values are fed into the network.

\varphi_m =
\begin{cases}
1 & \text{if } \sum_{n=1}^{N} a_n w_{nm} > \theta_m \\
0 & \text{if } \sum_{n=1}^{N} a_n w_{nm} \le \theta_m
\end{cases} \quad (2.3)

However, in general, the threshold vector \vec{\theta} is replaced by introducing a bias input to each neuron with a constant input value. By utilizing this scheme, the perceptron is capable of adjusting the threshold of individual neurons by changing the bias weight during learning.

The activation of neuron m in the perceptron is, again, the weighted sum of its input mapped by the activation function, i.e.

\varphi_m = f\left(\sum_{n=1}^{N} a_n w_{nm}\right) \quad (2.4)

One can also observe that the sum computed by the neurons (or the output of the neurons if unity linear activation functions are used, i.e. f(h) = h) in the perceptron can be described by the matrix operation

\begin{pmatrix}
\sum_{n=1}^{N} a_n w_{n1} \\
\sum_{n=1}^{N} a_n w_{n2} \\
\vdots \\
\sum_{n=1}^{N} a_n w_{nM}
\end{pmatrix}^{T}
=
\begin{pmatrix} a_1 & a_2 & \cdots & a_N \end{pmatrix}
\begin{pmatrix}
w_{11} & w_{12} & \cdots & w_{1M} \\
w_{21} & w_{22} & \cdots & w_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
w_{N1} & w_{N2} & \cdots & w_{NM}
\end{pmatrix}
= A w \quad (2.5)

where A is the input vector and w is the weight matrix.


Figure 2.3: A schematic of a Multilayer Perceptron featuring three inputs and input values, a_n, two neurons in the hidden layer and two neurons in the output layer. The hidden layer has activation function f(·) (here considered built in to the output of the neurons in the hidden layer) and the output layer has activation function g(·). In this example N = 3, M = 2 and K = 2.

Perceptron Learning

During learning, the perceptron adjusts the weights residing within the weight matrix in order to generate the desired output, or target, given some inputs that should generate that specific target. The perceptron learning rule states that

w_{nm} \leftarrow w_{nm} + \alpha (t_m - \varphi_m)\, a_n \quad (2.6)

where the arrow indicates an update of the value, t_m is the target value of neuron m, ϕ_m is the computed ("current") output of neuron m, a_n is input n and α is the so-called learning rate, described in more detail in Section 3.4.
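A minimal sketch of the perceptron learning rule (Eq. 2.6), with the threshold replaced by a bias input clamped to 1 as described above. The helper names and the AND example are our own illustration, not code from the thesis:

```python
def predict(a, w):
    """Thresholded binary outputs as in Eq. 2.3, with the bias input appended."""
    a = list(a) + [1.0]
    return [1 if sum(a[n] * w[n][m] for n in range(len(a))) > 0 else 0
            for m in range(len(w[0]))]

def train_perceptron(samples, n_inputs, n_outputs, alpha=0.1, epochs=20):
    """Perceptron learning rule (Eq. 2.6): w_nm <- w_nm + alpha*(t_m - phi_m)*a_n."""
    w = [[0.0] * n_outputs for _ in range(n_inputs + 1)]  # last row: bias weights
    for _ in range(epochs):
        for a, t in samples:
            phi = predict(a, w)          # current output of the perceptron
            a = list(a) + [1.0]          # append the constant bias input
            for n in range(len(a)):
                for m in range(n_outputs):
                    w[n][m] += alpha * (t[m] - phi[m]) * a[n]
    return w

# Learn the linearly separable logical AND function.
data = [((0, 0), (0,)), ((0, 1), (0,)), ((1, 0), (0,)), ((1, 1), (1,))]
w = train_perceptron(data, n_inputs=2, n_outputs=1)
```

After a few epochs the weights stop changing and the perceptron classifies all four AND patterns correctly, which is guaranteed for linearly separable data.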

2.1.2 The Multilayer Perceptron

A multilayer perceptron (MLP) is essentially a set of perceptrons interconnected into layers. An example is shown in Figure 2.3. An MLP can consist of multiple hidden layers, but normally only a few hidden layers are used [22]. The MLP is able to perform nonlinear classification and regression, in contrast to the single layer perceptron.

The activation of neuron k in the output layer is now the weighted sum of the activations in the previous, hidden layer, mapped by the activation function in the output layer, i.e.

\varphi_k = g\left(\sum_{m=1}^{M} v_{mk}\, \varphi_m\right)
= g\left(\sum_{m=1}^{M} v_{mk}\, f\left(\sum_{n=1}^{N} a_n w_{nm}\right)\right) \quad (2.7)

If the activation functions in the hidden layer and the output layer are unity linear, i.e. f(h) = g(h) = h, then the output of the network can be described by the matrix operation

\varphi = (A w)\, v \quad (2.8)

where A is 1-by-N, w is N-by-M and v is M-by-K. However, if we now consider the associative property of matrix multiplication, we know that

\varphi = (A w)\, v = A\, (w v) = A w'. \quad (2.9)

This means that a multilayer perceptron collapses into a perceptron with weight matrix w' = wv if unity linear activation functions are used. In other words, the MLP requires nonlinear activation functions in order to give any improvement compared to the perceptron. Common activation functions are presented in Section 2.1.3.
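The collapse argument of Equation (2.9) is easy to verify numerically. The nested-list matrices below are an illustrative sketch of ours, not from the thesis:

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

A = [[1.0, 2.0, 3.0]]                       # 1-by-N input vector
w = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]    # N-by-M hidden weights
v = [[1.0, -1.0], [0.5, 2.0]]               # M-by-K output weights

# With linear activations, the "two-layer" network (Aw)v equals the
# single collapsed layer A(wv) with weight matrix w' = wv.
two_layer = matmul(matmul(A, w), v)
one_layer = matmul(A, matmul(w, v))
```

Both products yield the same 1-by-K output row, which is exactly why nonlinearity is needed between the layers.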

Multilayer Perceptron Learning

The MLP, compared to the perceptron, also performs learning by adjusting the weights residing within the weight matrices representing the network. However, one significant difficulty arises compared to the perceptron learning rule: which weights to adjust, and by how much? Both sets of weights undeniably contribute to the output of the network! The problem can be described more formally as:

• We know the network output and desired target, but we don’t know the input to the neurons in the output layer from the hidden layer.

• If the network has one hidden layer, we know the input to the neurons in this layer, but not the desired target. If the network has multiple hidden layers, we know neither the inputs nor the targets of these layers.

The solution to this problem is to use an algorithm that performs backward propagation of errors, or Backpropagation, by means of gradient descent. This algorithm is described in detail in Chapter 3, Section 3.1.

2.1.3 Activation Functions

Instead of considering the activation function to be an operational part of a neuron, e.g. the neuron that computes the weighted sum of its inputs, it can be thought of as a special type of neuron that maps one input to one output. These types of neurons are referred to as units. See Figure 2.4. There is a wide range of activation functions. For the sake of simplicity, the most common ones are characterized below.

The Rectified Linearity Unit (ReLU)

The Rectified Linearity unit (ReLU) is a continuous (but not everywhere differentiable), non-saturating activation function described by

\varphi(h) = \max(h, 0) \quad (2.10)


Figure 2.4: Schematic of an activation neuron, or unit, used to yield an activation, ϕ, given the input h. Note that this type of neuron only has one input and one output.

Figure 2.5: Activation function of the Rectified Linearity unit (ReLU).

meaning that

\varphi(h) =
\begin{cases}
h & \text{if } h > 0 \\
0 & \text{if } h \le 0.
\end{cases} \quad (2.11)

The ReLU is thereby unity linear w.r.t. the input if the input value is larger than zero, and zero otherwise. See Figure 2.5. The ReLU has been shown to give fast computations and good performance in so-called convolutional neural networks [11].

The Power Unit (POWER)

The Power unit (POWER) is a continuous, non-saturating activation function described by

\varphi(h) = (a + b h)^c \quad (2.12)

where a, b and c are parameters controlling the shift, scale, and power respectively. The POWER unit is thereby able to yield x- and y-antisymmetric activation if c is odd, and y-symmetric activation if c is even. The unit is also able to yield non-rectified linear activation. See Figure 2.6.


Figure 2.6: Activation function of the Power unit (POWER). The different lines correspond to different parameter settings, i.e. shift, a, scale, b, and power, c (a = 0, b = 1, c = 3; a = 0, b = 0.5, c = 3; a = 0, b = 1, c = 1).

Figure 2.7: Activation function of the Sigmoid unit (SIGMOID).

The Sigmoid Unit (SIGMOID)

The Sigmoid unit (SIGMOID) is a continuous, saturating activation function described by a special case of the logistic function

\varphi(h) = \frac{1}{1 + e^{-\beta h}} \quad (2.13)

where β = 1. The SIGMOID does not yield negative activation and is therefore only antisymmetric w.r.t. the y-axis. See Figure 2.7.
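The three units above (Eqs. 2.10, 2.12 and 2.13) can be sketched directly; the parameter defaults below are our own choices, not prescribed by the thesis:

```python
import math

def relu(h):
    """ReLU (Eq. 2.10): max(h, 0)."""
    return max(h, 0.0)

def power(h, a=0.0, b=1.0, c=1):
    """Power unit (Eq. 2.12): (a + b*h)^c."""
    return (a + b * h) ** c

def sigmoid(h, beta=1.0):
    """Sigmoid unit (Eq. 2.13): 1 / (1 + exp(-beta*h))."""
    return 1.0 / (1.0 + math.exp(-beta * h))
```

Note the saturation of the sigmoid: for strongly negative inputs its output approaches 0, while the ReLU and odd-power units pass large inputs through unsquashed.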

2.2 Alternative Interconnection Operations

Both the Perceptron and the Multilayer Perceptron feature so-called fully connected layers, meaning that every neuron in each layer is connected to every neuron in the previous layer. However, there are other essential types of connective configurations, e.g. convolution and pooling, that are described in this section.


Figure 2.8: A schematic of a one dimensional convolution layer with one kernel of size 3, i.e. \vec{w} = (w_1, w_2, w_3), and a stride of 1. Note that the weights used by the neurons in the convolutional layer are shared between all neurons. The output of the convolutional layer is called the feature map corresponding to the kernel \vec{w}.

2.2.1 Convolution Layers

A neuron in a convolutional layer is typically connected to only a set, or tile, of neurons in the previous layer. Thereby, the neuron in the convolutional layer only responds to the activation of the neurons within this tile. The size of the tile, and thereby also the size of the weight matrix (also called kernel, filter or filter kernel), can be of any size that fits the outputs of the previous layer.

The overlapping of the tiles in a convolutional layer is defined by how far apart the centers of the tiles are placed; this is referred to as the stride of the convolutional layer. A convolutional layer often features many kernels, resulting in a set of different outputs, called feature maps. The weights used in each kernel are shared by all neurons in that feature map; this is referred to as weight sharing. See Figure 2.8.
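A one dimensional convolution layer with weight sharing, as in Figure 2.8, can be sketched as follows (an illustrative sketch of ours, not code from the thesis):

```python
def conv1d(inputs, kernel, stride=1):
    """One dimensional convolution layer: each output neuron sees one tile of
    the input, and all neurons share the same kernel weights (weight sharing).
    The returned list is the feature map corresponding to the kernel."""
    k = len(kernel)
    return [sum(inputs[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(inputs) - k + 1, stride)]

# Kernel of size 3 slid over four inputs with stride 1, as in Figure 2.8.
feature_map = conv1d([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, -0.5], stride=1)
```

Increasing `stride` moves the tile centers further apart and shrinks the feature map accordingly.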

2.2.2 Pooling Layers

A neuron in a pooling layer essentially performs a sub-sampling operation on a tile of neurons in the previous layer, either by letting its output correspond to only one of the inputs, e.g. max pooling (MAX), or by letting its output correspond to the average of the input, i.e. average pooling (AVG). An example of the generic pooling operation is shown in Figure 2.9.

Figure 2.9: A schematic of a one dimensional pooling layer with a pooling kernel of size 3 and a stride of 1. Note that no weights or summation are used by the pooling neurons and that both the input and output are activation values.

Max Pooling (MAX)

In max pooling, the output of the pooling neuron is the maximum value that resides within the input tile, i.e.

\varphi_o = \max(\varphi_1, \varphi_2, \cdots, \varphi_N) \quad (2.14)

where N is the total number of elements within the tile.

Average Pooling (AVG)

In average pooling, the output of the pooling neuron is the average of the input values that reside within the input tile, i.e.

\varphi_o = \frac{1}{N} \sum_{n=1}^{N} \varphi_n \quad (2.15)

where N is the total number of elements within the tile.
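Both pooling variants (Eqs. 2.14 and 2.15) can be sketched with one generic tiling function, mirroring Figure 2.9; the names are our own:

```python
def avg(tile):
    """Average pooling over one tile (Eq. 2.15)."""
    return sum(tile) / len(tile)

def pool1d(activations, size, stride, op=max):
    """Generic 1-D pooling: apply `op` to each tile of the previous layer's
    activations. MAX pooling uses the built-in max (Eq. 2.14), AVG uses avg."""
    return [op(activations[i:i + size])
            for i in range(0, len(activations) - size + 1, stride)]

# Pooling kernel of size 3 with stride 1, as in Figure 2.9.
max_pooled = pool1d([1.0, 5.0, 2.0, 4.0], size=3, stride=1, op=max)
avg_pooled = pool1d([1.0, 5.0, 2.0, 4.0], size=3, stride=1, op=avg)
```

Note that, as in the figure, no weights are involved: the pooling layer has nothing to learn.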


2.2.3 Local Response Normalization

Local response normalization performs lateral inhibition by normalizing over local input regions. The regions can be a spatial region within a feature map or a region across a set of feature maps at one spatial point. Each input value is normalized by

\varphi \leftarrow \varphi \left(1 + \frac{\alpha}{n} \sum_{i} \varphi_i^2\right)^{-\beta} \quad (2.16)

where n is the size of each local region. The sum is taken over the region centered at that value. α is the scaling parameter and β is the exponent parameter.
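A sketch of the normalization in Equation (2.16) over a 1-D list of values; the truncation of windows at the edges is our own assumption for this illustration, and the function name is ours:

```python
def lrn(values, n, alpha, beta):
    """Local response normalization (Eq. 2.16): each value is divided by
    (1 + (alpha/n) * sum of squares over a window of n neighbours centred
    on it)^beta. Windows are simply truncated at the edges here."""
    half = n // 2
    out = []
    for i, v in enumerate(values):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        scale = 1.0 + (alpha / n) * sum(x * x for x in values[lo:hi])
        out.append(v / scale ** beta)
    return out
```

Large activations surrounded by other large activations are damped the most, which is the lateral-inhibition effect described above.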

Chapter 3

Learning in Multilayer Networks

As mentioned in Section 2.1.2, learning, or weight updating, becomes difficult when the network consists of multiple layers. A set of solutions to this problem is described below. However, all algorithms work by utilizing the same principle, namely backwards propagation of errors. The principle can in short be described as:

1. Feed an input vector to the network. Compute and store the activation of all neurons and generate an output, i.e. a prediction. This is referred to as the forward pass, or going forward.

2. Measure the error between the output prediction and the desired target.

3. Compute the gradient w.r.t. the weights in each layer, starting at the output. This is referred to as the backwards pass, or going backwards.

4. Update the weights in the network by using a combination of the negative gradient, the previous weights etc.

3.1 The Classic Backpropagation Algorithm

Historically, multilayer perceptrons featuring a few layers were trained by utilizing the Backpropagation algorithm [34], described below [22]. In this example, the Sigmoid function is used as activation function, i.e.

\varphi = f(h) = \frac{1}{1 + e^{-h}} \quad (3.1)

where h is the weighted sum computed by a neuron. The Sigmoid function has the derivative

\frac{\partial \varphi}{\partial h} = \varphi (1 - \varphi). \quad (3.2)

The algorithm works as follows:


1. Compute the activation of all neurons in the network, yielding an output (the forward pass, or going forward):

• Activation of neuron m in the hidden layer(s):

\varphi_m = \frac{1}{1 + e^{-h_m}}, \quad \text{where } h_m = \sum_{n=1}^{N} a_n w_{nm} \quad (3.3)

• Do this for all the hidden layers until you get to the output layer. In this example, we only have one hidden layer, in order to match the annotations used in Figure 2.3. The output layer has activation:

\varphi_k = \frac{1}{1 + e^{-h_k}}, \quad \text{where } h_k = \sum_{m=1}^{M} \varphi_m v_{mk} \quad (3.4)

2. Compute the error and correct the weights layer-wise (the backward pass, or going backwards):

• Compute the error at the output:

\delta_{o,k} = (t_k - \varphi_k)\, \varphi_k (1 - \varphi_k) \quad (3.5)

• Compute the error in the hidden layer(s):

\delta_{h,m} = \varphi_m (1 - \varphi_m) \sum_{k=1}^{K} v_{mk}\, \delta_{o,k} \quad (3.6)

• Update the output layer weights:

v_{mk} \leftarrow v_{mk} + \alpha\, \delta_{o,k}\, \varphi_m \quad (3.7)

• Update the hidden layer weights

w_{nm} \leftarrow w_{nm} + \alpha\, \delta_{h,m}\, a_n \quad (3.8)

where α is the learning rate.
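The whole algorithm (Eqs. 3.3–3.8) fits in a short Python sketch, with biases modelled as constant-1 inputs to each layer. The XOR example and all names are our own illustration, not from the thesis:

```python
import math
import random

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def train_mlp(samples, n_in, n_hidden, n_out, alpha=0.5, epochs=5000, seed=0):
    """Classic backpropagation for one hidden layer of sigmoid units.
    Biases are modelled as an extra input/hidden unit clamped to 1."""
    rng = random.Random(seed)
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in + 1)]
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for a, t in samples:
            a = list(a) + [1.0]
            # forward pass (Eqs. 3.3 and 3.4)
            phi_h = [sigmoid(sum(a[n] * w[n][m] for n in range(len(a))))
                     for m in range(n_hidden)] + [1.0]
            phi_o = [sigmoid(sum(phi_h[m] * v[m][k] for m in range(len(phi_h))))
                     for k in range(n_out)]
            # backward pass: output and hidden errors (Eqs. 3.5 and 3.6)
            d_o = [(t[k] - phi_o[k]) * phi_o[k] * (1.0 - phi_o[k])
                   for k in range(n_out)]
            d_h = [phi_h[m] * (1.0 - phi_h[m]) *
                   sum(v[m][k] * d_o[k] for k in range(n_out))
                   for m in range(n_hidden)]
            # weight updates (Eqs. 3.7 and 3.8)
            for m in range(len(phi_h)):
                for k in range(n_out):
                    v[m][k] += alpha * d_o[k] * phi_h[m]
            for n in range(len(a)):
                for m in range(n_hidden):
                    w[n][m] += alpha * d_h[m] * a[n]
    return w, v

def predict(a, w, v):
    a = list(a) + [1.0]
    phi_h = [sigmoid(sum(a[n] * w[n][m] for n in range(len(a))))
             for m in range(len(w[0]))] + [1.0]
    return [sigmoid(sum(phi_h[m] * v[m][k] for m in range(len(phi_h))))
            for k in range(len(v[0]))]

# XOR is not linearly separable, so it genuinely needs the hidden layer.
xor = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
w, v = train_mlp(xor, n_in=2, n_hidden=3, n_out=1)
```

Training drives the squared output error well below its value at the random initialization, illustrating how the hidden-layer error of Equation (3.6) solves the credit-assignment problem stated in Section 2.1.2.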

This original backpropagation algorithm performs so-called gradient descent optimization [34]. If we consider the input vector x and the predicted output \hat{y}, we define the loss function, \ell, as the cost of predicting \hat{y} when the target is y, i.e. \ell(\hat{y}, y). We know that the predicted output \hat{y} can be thought of as a transformation of the arbitrary input vector by a function, f, parametrized by the weights inside the network, i.e. \hat{y} = f_w(x). Now, the loss function can be described as \ell(\hat{y}, y) = \ell(f_w(x), y), or Q(z, w) = \ell(f_w(x), y), where z is an input and output data pair (x, y).


In this framework, gradient descent optimization is performed by updating the weights according to

v_{t+1} = \mu v_t - \alpha \frac{1}{n} \sum_{i=1}^{n} \nabla_w Q(z_i, w_t) \quad (3.9)
w_{t+1} = w_t + v_{t+1}

where α is the learning rate and µ is the momentum. Also, note that the loss is computed as the average over a set of n data pairs. This algorithm reaches, when the learning rate is small enough, linear convergence [18].

3.2 Gradient Approximations

However, this algorithm is not suitable when data is abundant or when the network consists of many layers, since the calculation of the gradient becomes very computationally expensive. Instead, we can utilize approximations of the gradients, estimated by only using some data pairs. These schemes are presented in Sections 3.2.1, A.1.1 and A.1.2. These approximations are essential in order to make deep neural networks computationally reasonable.

3.2.1 Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent approximates the gradient w.r.t. the weights in the loss function by computing it from only one, randomized, data pair, z_t, i.e.

v_{t+1} = \mu v_t - \alpha \nabla_w Q(z_t, w_t) \quad (3.10)
w_{t+1} = w_t + v_{t+1}

where α is the learning rate, µ is the momentum and t is the current weight state, i.e. before updating. The convergence speed of SGD is approximately O(1/t) when the learning rate is reduced both fast enough and slowly enough.
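A sketch of the SGD-with-momentum update (Eq. 3.10) on a toy least-squares problem; the problem setup and names are our own illustration:

```python
import random

def sgd_momentum(grad, w0, data, alpha, mu, epochs, seed=0):
    """SGD with momentum (Eq. 3.10): the gradient is estimated from one
    randomly drawn data pair z_t per update, not from the whole dataset."""
    rng = random.Random(seed)
    w, vel = w0, 0.0
    for _ in range(epochs):
        z = rng.choice(data)                  # one randomized data pair z_t
        vel = mu * vel - alpha * grad(z, w)   # v_{t+1} = mu*v_t - alpha*grad
        w = w + vel                           # w_{t+1} = w_t + v_{t+1}
    return w

# Toy problem: fit the single weight w in y = w*x with per-sample loss
# Q(z, w) = (w*x - y)^2, whose gradient is 2*x*(w*x - y).
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0)]
grad = lambda z, w: 2.0 * z[0] * (w * z[0] - z[1])
w_fit = sgd_momentum(grad, 0.0, data, alpha=0.02, mu=0.9, epochs=500)
```

Because every data pair shares the same minimizer (w = 3), the stochastic updates settle there; with noisy real data the iterates instead fluctuate around the minimum, which is why the learning rate is decayed.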

3.3 Loss Functions

As mentioned previously, the loss function computes, in some sense, the prediction error at the output of the network. It is the gradient of this error w.r.t. the weights that is used to update the weights according to the algorithm described in Section 3.2.1. The loss function can thereby be considered to drive the learning process. There are different types of loss functions, and the choice of loss function is mainly dependent on the problem at hand. Normally, the loss is computed over a batch of data, i.e. a set of data and its labels. In the equations below, the batch index is denoted b, and n denotes the n:th dimension of the arrays.


3.3.1 Euclidean Loss

The Euclidean loss function computes the batch average of the squared L2-norm, i.e.

\ell(\hat{y}, y) = \frac{1}{2B} \sum_{b=1}^{B} \|\hat{y}_b - y_b\|_2^2, \quad \hat{y}_b, y_b \in \mathbb{R}^N \quad (3.11)

where \hat{y}_b is the prediction made by the network, and y_b the desired output. Here, both \hat{y}_b and y_b are N-dimensional vectors.

The Euclidean loss function is suitable when the network should be trained to perform regression, e.g. trained to yield numeric output given some input, i.e. \hat{y}_b = f_w(x_b) where \hat{y}_b, x_b \in \mathbb{R}^N.
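A direct sketch of Equation (3.11), with predictions and targets given as a batch of N-dimensional lists (the function name is ours):

```python
def euclidean_loss(preds, targets):
    """Euclidean loss (Eq. 3.11): the squared L2 norm of the prediction
    error, summed per sample and averaged over the batch with a factor 1/2."""
    B = len(preds)
    return sum(sum((yh - y) ** 2 for yh, y in zip(yb_hat, yb))
               for yb_hat, yb in zip(preds, targets)) / (2.0 * B)
```

The factor 1/2 is conventional: it cancels the 2 that appears when differentiating the squared error, giving a cleaner gradient.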

3.3.2 Softmax Loss

The Softmax loss function computes the multinomial logistic loss of the prediction made by the network, \hat{y}_b, and the target label, l_b, i.e.

\ell(\hat{y}, y) = -\frac{1}{B} \sum_{b=1}^{B} \log\left(\sigma(\hat{y}_b)_{l_b}\right) \quad (3.12)

\sigma(\hat{y}_b)_i = \frac{e^{\hat{y}_{b,i}}}{\sum_{n=1}^{N} e^{\hat{y}_{b,n}}} \quad \text{for } i = 1, \ldots, N. \quad (3.13)

Essentially, Equation (3.13) describes a squashing operation that transforms the predicted output of the network, \hat{y}_b, into a discrete probability distribution, \sigma(\hat{y}_b). Equation (3.12) then computes the multinomial logistic loss of this probability distribution w.r.t. the target label l_b. The Softmax loss function is therefore suitable when the network should be trained to perform classification over a set of distinct classes.
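A sketch of Equations (3.12) and (3.13). Shifting the scores by their maximum before exponentiation is a standard numerical-stability trick, not something the thesis prescribes; it does not change the result:

```python
import math

def softmax(y):
    """Eq. 3.13: squash a score vector into a discrete probability
    distribution. Scores are shifted by their max for numerical stability."""
    m = max(y)
    e = [math.exp(yi - m) for yi in y]
    s = sum(e)
    return [ei / s for ei in e]

def softmax_loss(preds, labels):
    """Eq. 3.12: mean multinomial logistic loss, i.e. -log of the probability
    the network assigns to the target label l_b, averaged over the batch."""
    B = len(preds)
    return -sum(math.log(softmax(yb)[lb]) for yb, lb in zip(preds, labels)) / B
```

For a completely uninformative prediction (all scores equal) over N classes, the loss is log(N), a useful sanity check when training a classifier from scratch.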

3.4 Learning Parameters

As mentioned previously, a neural network transforms the input vector, x, yielding a predicted output vector, \hat{y}, by means of a function parameterized by the weights residing within the network, i.e. \hat{y} = f_w(x). The loss function describes the error between the predicted output, \hat{y}, and the desired output, y. The weights are then updated according to Equation (3.10), (A.1) or (A.2). However, as seen in the equations, the actual computed weight update is also dependent on the learning rate, α, and the momentum, µ. These are the so-called meta-parameters, since they can be adjusted in order to optimize the weight updating, which is itself an optimization algorithm.


3.4.1 Learning Rate

The learning rate, α, describes how much the weights should be updated when the network predicts an output that is different from the desired output. A large learning rate makes the weights change quickly, even when the predicted output error is small. A small learning rate may not change the weights enough to compensate for the predicted output error, meaning that the network will have to be shown the input and desired output a larger number of times. In other words, a large learning rate may prevent the network from converging, i.e. prevent the weights from reaching asymptotically stable numeric values, whilst a small learning rate makes the network slower to train. Normally, the learning rate is large at first and then decreased as training progresses, offering a trade-off between speed and convergence [22]. This scheme is described in Section 3.6.

3.4.2 Momentum

The optimization performed when updating the weights residing within the neural network typically guarantees convergence at a local minimum w.r.t. the loss function [22]. For neural networks in general, only the so-called Boltzmann machine can, under some conditions, guarantee convergence to a global minimum [21].

However, by adding momentum, µ, to the weight update, a neural network is less likely to converge to a small local minimum, which can be thought of as a shallow crevasse compared to a valley, which in this analogy is a large local minimum. Using momentum also makes it possible to use a smaller learning rate, which means, as mentioned earlier, that learning becomes more stable [22]. The momentum is constant during training and typically set to a value of µ = 0.9 [26, 35, 22].

3.5 Weights

In addition to the learning rate and momentum, some parameters related to the weights themselves can be altered in order to improve the performance of the network. These include weight initialization and weight decay.

3.5.1 Initialization

Before training can commence, initial values have to be assigned to the weights in order to allow for numeric computation. This process, called weight initialization, can assign constant values (CONST) to the weights or assign values drawn from a probability distribution, which is the approach typically used.

Random Initialization

The two most typical means of random initialization are to draw the initial weight value as an instance of a stochastic variable associated with a random distribution


Figure 3.1: A comparison between the Gaussian and the uniform probability density functions, W ∈ N(0, 0.5) and W ∈ U(−1, 1).

which is typically Gaussian (GAUSSIAN) or uniform (UNIFORM). For the Gaussian initialization, we let the weights assume values corresponding to a set of outcomes generated by the stochastic variable W ∈ N(µ, σ²), which has the probability density function

f(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \quad (3.14)

For uniform initialization, we let the weights assume values corresponding to a set of outcomes generated by the stochastic variable W ∈ U(a, b), which has the probability density function

f(x, a, b) =
\begin{cases}
\frac{1}{b-a} & \text{for } a \le x \le b, \\
0 & \text{for } x < a \text{ or } x > b.
\end{cases} \quad (3.15)

A comparison between the two distributions is shown in Figure 3.1.

When performing random initialization, the distribution parameters, i.e. µ and σ, or a (MIN) and b (MAX), must be assigned manually. As a good rule of thumb, sampling the Gaussian probability density function is preferred over sampling the uniform probability density function. The probability density function is commonly centered around zero, i.e. µ = 0, and the standard deviation (STD) is often small, e.g. σ = 0.01.

Xavier Initialization (XAVIER)

Instead of manually selecting the distribution parameters, they can be computed adaptively by selecting the standard deviation based on how many neurons are interacting. This is referred to as Xavier initialization. More specifically, each neuron assigns values to its own weights locally, by sampling a Gaussian probability density function with a variance that is defined by how many neurons feed into the neuron itself. Let W_n ∈ N(µ_n, σ_n) be the stochastic variable that


Figure 3.2: Plot showing a set of the resulting Gaussian probability density functions used by a specific neuron when utilizing Xavier weight initialization, for i_in,n = 5, 10 and 20. As seen, the variance of the sampled probability distribution becomes smaller when the number of inputs to the neuron becomes larger.

neuron n draws its weight samples from; the Xavier algorithm then states that

\sigma_n^2 = \frac{1}{i_{in,n}} \;\Leftrightarrow\; \sigma_n = \sqrt{\frac{1}{i_{in,n}}} \quad (3.16)

where i_{in,n} is the number of neurons feeding into neuron n [26]. This means that

W_n \in N\left(\mu_n, \sqrt{\frac{1}{i_{in,n}}}\,\right) \quad (3.17)

when utilizing Xavier initialization. The resulting Gaussian probability density function describing W_n as a function of the number of neurons feeding into neuron n is shown in Figure 3.2.
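Xavier initialization (Eqs. 3.16–3.17) reduces to one Gaussian draw per weight with σ = √(1/i_in). This sketch uses Python's `random.gauss`; the function name and layer shape are our own:

```python
import random

def xavier_init(n_in, n_neurons, mu=0.0, seed=0):
    """Xavier initialization (Eqs. 3.16-3.17): each neuron draws its n_in
    incoming weights from a Gaussian with sigma = sqrt(1 / n_in)."""
    rng = random.Random(seed)
    sigma = (1.0 / n_in) ** 0.5
    return [[rng.gauss(mu, sigma) for _ in range(n_in)]
            for _ in range(n_neurons)]

# A layer of 50 neurons, each fed by 100 inputs, so sigma = 0.1.
layer = xavier_init(n_in=100, n_neurons=50)
```

With many inputs the per-weight standard deviation shrinks, keeping the variance of each neuron's weighted sum roughly independent of the fan-in, which is the point of the scheme.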

3.5.2 Decay

Another scheme that can improve the generalization performance of a neural network is to introduce so-called weight decay. The process essentially reduces the magnitude of the weights in proportion to the weights themselves as training progresses.

The motivation for using weight decay is that smaller weights centered around zero yield more linear activation, especially in networks that mostly utilize the Sigmoid or hyperbolic tangent as activation functions, see Figures 2.7 and A.4. When using weight decay, only the weights responsible for non-linear learning are allowed to assume large values.

Formally, each weight is multiplied by a factor of 1 − λ during each iteration [26]. However, weight decay may also decrease learning performance, since the effective learning rate is altered by means of weight decay [22].


Figure 3.3: The base learning rate modulation factor when using the step learning rate policy, for γ = 0.90, 0.85 and 0.75 with s = 25.

3.6 Learning Rate Policy

Assigning appropriate values to meta-parameters, i.e. learning rate and momentum, selecting a weight initialization scheme and, if desired, setting up a weight decay, are all important for learning and generalization performance. However, some of the parameters, especially the learning rate, can also be altered during training in order to, as mentioned earlier, offer a good trade-off between learning performance and convergence. Typically, the learning rate should be relatively large during the beginning of training and then reduced as training progresses. The learning rate could also be constant during all iterations of training. The rule applied to reduce the learning rate, or keep it constant, is called the learning rate policy, and the different schemes, except for a constant learning rate policy, are presented in short below.

3.6.1 Step (STEP)

The stepped learning rate policy can be described as multiplying the learning rate by a factor, γ, every s:th iteration. Mathematically, the learning rate, α, at iteration n, i.e. α_n, can be described as

\alpha_n = \alpha_0\, \gamma^{\lfloor n/s \rfloor} \quad (3.18)

where α_0 is the base learning rate and γ^{\lfloor n/s \rfloor} is the so-called base learning rate modulation factor. By convention, \lfloor \cdot \rfloor represents flooring integer rounding. The base learning rate modulation for the step learning rate policy is depicted in Figure 3.3.
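Equation (3.18) in code form (the function name is ours):

```python
def step_lr(n, alpha0, gamma, s):
    """Step learning rate policy (Eq. 3.18): alpha_n = alpha0 * gamma**(n // s),
    i.e. the rate is multiplied by gamma every s-th iteration."""
    return alpha0 * gamma ** (n // s)
```

With, say, α_0 = 0.01, γ = 0.9 and s = 25, the rate stays at 0.01 for the first 25 iterations, then drops to 0.009, then to 0.0081, matching the curves of Figure 3.3.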

Chapter 4

Network Architectures

Given the information about the common layer types of a neural network provided in Chapter 2, another important question arises: How should these layers be arranged and specified in order to be able to perform the desired task?

In this chapter, some different types of network architectures that can be suitable for regression and classification will be presented. However, the number of implementable architectures is understandably very large, since there are many different design options to consider, e.g. selecting the number of layers, which type of layer to use at a certain depth, layer parameters, number of outputs, activation functions, appropriate meta-parameters etc. Therefore, the architectures utilized in this thesis will mainly leverage work conducted by previous authors.

4.1 Classification

Object detection is a common task within the area of computer vision and deals with detecting instances of semantic objects of certain classes, e.g. humans, apples, cars or hands.

Detection is normally conducted by searching, i.e. selecting sub-regions of the image and then classifying the different sub-regions into a set of classes. There are different types of searching algorithms, ranging from exhaustive search, i.e. sequential sub-region selection at different positions and scales at a fixed stride and scaling, to searches that employ schemes of segmentation in order to select sub-regions [36].

In this section, three convolutional architectures and two types of fully connected architectures, inspired by the work of [11], will be presented. The full network described by [11] was proposed to classify images from the ImageNET dataset in the ILSVRC-2010 competition [8]. At that time, the network achieved record-breaking performance, sparking a big interest in deep convolutional networks. However, the ImageNET dataset consists of 1 k classes, which means that there is a substantial risk of overfitting when training the full network on only two classes, i.e. hand or no hand. Therefore, we will evaluate different instances of the original network.


As mentioned previously, convolutional neural networks can be considered to consist of two parts: feature extracting layers and the layers used to perform classification. In practice, this means that the convolutional layers are the ones performing feature extraction, and the fully connected layers are performing the classification based on the features provided by the convolutional layers. Because of this, the different convolutional architectures are presented in Section 4.1.1 and two fully connected architectures are presented in Section 4.1.2.

4.1.1 Convolutional Layers

The network described in [11] consists of five convolutional layers, each with different convolving parameters, e.g. kernel size, number of feature maps etc. However, as mentioned previously, with the reduced number of classes, utilizing all the convolutional layers will probably lead to overfitting and loss of generalization performance. Therefore, we will examine the performance of one, three and five convolutional layers, corresponding to the first, the first three, and all five convolutional layers in the original network. In the original network, some convolutional layers utilize Local Response Normalization (LRN), described in Section 2.2.3. However, we omit using this scheme and focus on exploring the performance of structural differences.

Single Convolution

The single convolution feature extraction architecture has one data input, accepting data with a batch size of B entries, a height, H, of 224 elements, a width, W, of 224 elements and 3 channels, C.

The data is then convolved by a convolutional layer consisting of 96 kernels with a size of 11, a stride of 4, and no padding. The 96 feature maps are then rectified by means of the ReLU and finally pooled by means of MAX-pooling, with a kernel size of 3 and a stride of 2. A schematic of these layers is shown in Figure 4.1.
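The resulting spatial sizes are not stated in the thesis, but can be computed with the standard output-size formula for convolution and pooling layers (assuming flooring division, as is conventional):

```python
def conv_out_size(in_size, kernel, stride, pad=0):
    """Standard spatial output size of a convolution or pooling layer:
    floor((in + 2*pad - kernel) / stride) + 1."""
    return (in_size + 2 * pad - kernel) // stride + 1

# Single-convolution architecture: 224x224 input, 11x11 kernels, stride 4,
# no padding, followed by 3x3 MAX pooling with stride 2.
after_conv = conv_out_size(224, kernel=11, stride=4)
after_pool = conv_out_size(after_conv, kernel=3, stride=2)
```

Under these assumptions, the 96 feature maps would be 54×54 after the convolution and 26×26 after the pooling.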

Three Convolutions

The feature extraction layers featuring three convolutions have one data input, accepting the same data as the single convolution feature extraction layers.

The first convolution and pooling operation is identical to the single convolution network described above. However, the output is then convolved by another convolutional layer consisting of 256 kernels with a size of 5, a stride of 1 and a padding of 2 (cf. Figure 4.2). Thereafter follows yet another MAX-pooling with a kernel size of 3 and a stride of 2.

Finally, another convolution is performed, consisting of 384 kernels with a size of 3, a stride of 1 and a padding of 1. A schematic of all three layers is shown in Figure 4.2.


4.1. CLASSIFICATION

[Figure 4.1 schematic: RGB Image Data (B, 3, 224, 224) → Convolution 1 (kernel size 11, stride 4, pad 0, 96 maps) → ReLU 1 → Pool 1 (MAX pooling, kernel size 3, stride 2, pad 0) → Feature Data (B, F)]

Figure 4.1: A schematic of feature extraction layers featuring one convolution. It is a simplified version of the first layer in the network described by [11].

Five Convolutions

The feature extraction layers featuring five convolutions have one data input, accepting the same data as the single convolution feature extraction layers.

The first three convolutional layers and the associated pooling operations are identical to the three convolution network described above. After the third convolution, the output is convolved two more times, without intermediate pooling operations. The fourth convolutional layer consists of 384 kernels with a size of 3, a stride of 1 and a padding of 1. The fifth convolutional layer consists of 256 kernels with a size of 3, a stride of 1 and a padding of 1, followed by a final MAX-pooling with a kernel size of 3 and a stride of 2 (cf. Figure 4.3). A schematic of all five layers is shown in Figure 4.3.
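Propagating the spatial size through the full five-convolution stack of Figure 4.3 shows the flattened feature dimension F that reaches the fully connected layers. The sketch below uses the floor rounding convention throughout; with Caffe's ceil rounding for pooled sizes the final maps come out slightly larger (6×6, giving F = 9216):

```python
def out_size(in_size, kernel, stride=1, pad=0):
    return (in_size + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, pad, feature maps); pooling keeps the channel count
layers = [
    ("conv1", 11, 4, 0, 96),
    ("pool1",  3, 2, 0, None),
    ("conv2",  5, 1, 2, 256),
    ("pool2",  3, 2, 0, None),
    ("conv3",  3, 1, 1, 384),
    ("conv4",  3, 1, 1, 384),
    ("conv5",  3, 1, 1, 256),
    ("pool5",  3, 2, 0, None),
]

size, channels = 224, 3
for name, k, s, p, maps in layers:
    size = out_size(size, k, s, p)
    if maps is not None:
        channels = maps

# Flattened feature dimension F fed to the fully connected layers
F = channels * size * size
print(channels, size, F)
```

Under the floor convention this gives 256 maps of 5×5, i.e. F = 6400.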


[Figure 4.2 schematic: RGB Image Data (B, 3, 224, 224) → Convolution 1 (kernel size 11, stride 4, pad 0, 96 maps) → ReLU 1 → Pool 1 (MAX pooling, kernel size 3, stride 2, pad 0) → Convolution 2 (kernel size 5, stride 1, pad 2, 256 maps) → ReLU 2 → Pool 2 (MAX pooling, kernel size 3, stride 2, pad 0) → Convolution 3 (kernel size 3, stride 1, pad 1, 384 maps) → ReLU 3 → Feature Data (B, F)]

Figure 4.2: A schematic of feature extraction layers featuring three convolutions. It is a simplified version of the first three layers in the network described by [11].


[Figure 4.3 schematic: RGB Image Data (B, 3, 224, 224) → Convolution 1 (kernel size 11, stride 4, pad 0, 96 maps) → ReLU 1 → Pool 1 (MAX pooling, kernel size 3, stride 2, pad 0) → Convolution 2 (kernel size 5, stride 1, pad 2, 256 maps) → ReLU 2 → Pool 2 (MAX pooling, kernel size 3, stride 2, pad 0) → Convolution 3 (kernel size 3, stride 1, pad 1, 384 maps) → ReLU 3 → Convolution 4 (kernel size 3, stride 1, pad 1, 384 maps) → ReLU 4 → Convolution 5 (kernel size 3, stride 1, pad 1, 256 maps) → ReLU 5 → Pool 5 (MAX pooling, kernel size 3, stride 2, pad 0) → Feature Data (B, F)]

Figure 4.3: A schematic of feature extraction layers featuring five convolutions. It is a simplified version of the five layers in the network described by [11].


[Figure 4.4 schematic: Feature Data (B, F) → Fully Connected 1 (InnerProduct, 4096 neurons) → ReLU 2 → Fully Connected 2 (InnerProduct, 2 neurons) → Score (B, C)]

Figure 4.4: A schematic of a classifier using two fully connected layers with 4096 neurons in the first layer, two neurons in the output layer and a ReLU non-linearity.

4.1.2 Fully Connected Layers

The network described in [11] consists of three fully connected layers with 4096 neurons in the first and second layers and 1000 neurons in the output layer in order to yield prediction scores over 1000 classes.

However, as mentioned previously, with the reduced number of classes, it is probable that the decision boundary could have a relatively simple shape. Therefore, we'll examine the performance of two and three fully connected layers with ≤ 4096 neurons in the first, or first and second, layers respectively.

The original network in [11] also utilizes the so-called Dropout scheme in order to prevent overfitting. However, since we are more interested in examining the classification accuracy w.r.t. structural properties, we omit the Dropout scheme in our architectures.


[Figure 4.5 schematic: Feature Data (B, F) → Fully Connected 1 (InnerProduct, 4096 neurons) → ReLU 2 → Fully Connected 2 (InnerProduct, 4096 neurons) → ReLU 3 → Fully Connected 3 (InnerProduct, 2 neurons) → Score (B, C)]

Figure 4.5: A schematic of a classifier using three fully connected layers with 4096 neurons in the first and second layers, two neurons in the output layer and two ReLU non-linearities.

Two Layers

The classifier layers accept feature data from the convolutional layers, represented as a feature vector consisting of B entries and F dimensions. After the first inner product operation, the ReLU is applied. Finally, the last inner product operation yields score data of B entries and C classes. A schematic overview of these classifier layers is shown in Figure 4.4.
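The two-layer classifier therefore amounts to an inner product, a ReLU, and a second inner product. A toy pure-Python forward pass makes the data flow concrete; the dimensions (F = 3, four hidden neurons, C = 2) and the weights are made up for illustration only:

```python
def inner_product(x, weights, biases):
    """y_j = sum_i w_ji * x_i + b_j for one sample."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

# Toy feature vector (F = 3) and hypothetical weights
features = [1.0, -2.0, 0.5]
W1 = [[0.1, 0.2, 0.3], [0.4, -0.1, 0.0], [0.0, 0.5, -0.2], [0.2, 0.2, 0.2]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, -0.5, 1.0]]   # C = 2 output scores
b2 = [0.0, 0.0]

hidden = relu(inner_product(features, W1, b1))
scores = inner_product(hidden, W2, b2)   # one score per class
print(scores)
```

In a real network the weights are of course learned; the point here is only the InnerProduct → ReLU → InnerProduct structure.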

Three Layers

Using three fully connected layers, the ReLU is applied after the first and second inner product operations. Finally, the last inner product operation yields score data of B entries and C classes. A schematic overview of these classifier layers is shown in Figure 4.5.


4.2 Regression

Hand pose estimation is mainly a regression problem: given an input array, x, generate an output array directly, i.e. y = fw(x), describing some parameters and not a class nor a distribution over classes. The mapping from the input array to the output array should also be continuous.

Continuous hand pose estimation (regression) has seen a lot of progress during recent years, and hand pose classification, i.e. classifying the appearance of a hand into a set of distinct poses, e.g. sign language characters, is mainly solved [29]. However, continuous hand pose estimation has proven difficult when the hand is interacting with objects and subject to occlusion [29].

In the work of [35], three different network architectures are presented, each with the task of performing continuous hand pose estimation given depth data of uncluttered scenes:

• A shallow convolutional network, consisting of only one convolutional layer, one pooling layer and two fully connected layers.

• A deep convolutional neural network, consisting of three convolutional layers, three pooling layers and three fully connected layers.

• A multiscale convolutional network with three scaled inputs, i.e. the depth data is subsampled and fed into the network through different inputs, three convolutional layers, two pooling layers and two fully connected layers.

Schematics describing the architectures of the shallow, deep and multiscale networks are shown in Figures 4.6, 4.7 and 4.8 respectively. The architectures are described in more detail in Sections 4.2.1, 4.2.2 and 4.2.3.

4.2.1 Shallow Network

The shallow network has one data input, accepting data with a batch size of N entries, a height, H, of 256 elements, a width, W, of 256 elements and C number of channels.

The data is then convolved by a convolutional layer consisting of 8 kernels with a size of 5 and a stride of 1. The convolutional layer uses no padding, which means that each kernel will only output values corresponding to valid convolutions over the data. In other words, the output of each kernel will have a height reduced by kh − 1 and a width reduced by kw − 1, where kh and kw are the height and the width of the kernel respectively. The 8 feature maps computed by the convolutional layer are then pooled by a max pooling layer with a kernel size of 4.
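The "valid convolution" size reduction is easiest to see in one dimension: an input of length n filtered with a kernel of length k yields n − k + 1 outputs. A small illustrative sketch (a sliding-window correlation, kernel not flipped):

```python
def valid_conv1d(signal, kernel):
    """1-D sliding-window correlation over valid positions only (no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [1, 2, 3, 4, 5, 6]
kernel = [1, 0, -1]
out = valid_conv1d(signal, kernel)
print(len(signal), len(out))  # 6 -> 4 = 6 - 3 + 1
```

The same reduction happens independently in height and width for the 2-D case, hence kh − 1 and kw − 1.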

The output of the pooling layer is then fed into the fully connected layers, the first of which has 1024 outputs and the second of which has Y outputs, corresponding to the output vector y. All layers with activation functions utilize the Rectified Linear Unit (ReLU).


[Figure 4.6 schematic: Data (B, C, 256, 256) → Convolution 1 (kernel size 5, stride 1, pad 0, 8 maps) → ReLU 1 → Pool 1 (MAX pooling, kernel size 4, stride 1, pad 0) → Fully Connected 1 (InnerProduct, 1024) → ReLU 2 → Fully Connected 2 (InnerProduct) → Output (B, Y)]

Figure 4.6: A schematic of the shallow network, featuring only one convolutional layer, one pooling layer and two fully connected layers.


4.2.2 Deep Network

The deep network has one data input, which accepts the same type of data as the shallow network, i.e. data with a batch size of N entries, a height, H, of 256 elements, a width, W, of 256 elements and C number of channels.

The data is then convolved by a convolutional layer consisting of 8 convolution kernels with a size of 5, a stride of 1 and no padding. The 8 feature maps computed by the convolutional layer are then pooled by a max pooling layer with a pooling kernel size of 4. This convolution and pooling operation is then iterated once again, i.e. a total of two convolutions and two pooling operations.

The data is then convolved again, but this time with a smaller convolution kernel size, namely 3. The output of this convolutional layer is then fed directly into the first fully connected layer without a pooling layer in between.

The first and second fully connected layers have 1024 outputs each and the third fully connected layer has Y outputs, corresponding to the output vector y. All layers with activation functions utilize the Rectified Linear Unit (ReLU).

4.2.3 Multiscale Network

The multiscale network has three data inputs, in clear contrast to the shallow and deep networks described previously. The first input accepts data with a batch size of N entries, a height, H, of 256 elements, a width, W, of 256 elements and C number of channels. This data is then convolved by a convolutional layer consisting of 8 kernels with a size of 5, a stride of 1 and no padding. The 8 feature maps computed by the convolutional layer are then pooled by a max pooling layer with a kernel size of 2 and a stride of 1.

The second input accepts data with the same number of batch entries, N, and the same number of channels, C, but the height and the width of the data are halved. This is achieved by subsampling the data, reducing the height and width by a factor of two. This data is then convolved by a convolutional layer consisting of 8 kernels with a size of 5, a stride of 1 and no padding. The 8 feature maps computed by the convolutional layer are then pooled by a max pooling layer with a kernel size of 2 and a stride of 1.

The third input accepts data with the same number of batch entries, N, and the same number of channels, C, as the first and second inputs, but the height and width of the data are only one fourth of the original data. This is achieved by subsampling the original data, reducing the height and the width by a factor of four. This data is then convolved by a convolutional layer consisting of 8 kernels with a size of 5, a stride of 1 and no padding. No pooling is performed on the 8 feature maps computed by the convolutional layer for this input.
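The subsampling itself can be done in several ways; the thesis does not specify the exact filter, so the sketch below simply keeps every second (or fourth) row and column, which is the simplest nearest-neighbour variant:

```python
def subsample(image, factor):
    """Keep every factor-th row and column (nearest-neighbour subsampling).
    Illustrative only: the actual subsampling filter is not specified."""
    return [row[::factor] for row in image[::factor]]

# Toy 8x8 "image" standing in for the 256x256 depth map
full = [[r * 8 + c for c in range(8)] for r in range(8)]
half = subsample(full, 2)      # 4x4, the half-resolution input
quarter = subsample(full, 4)   # 2x2, the quarter-resolution input
print(len(half), len(half[0]), len(quarter), len(quarter[0]))
```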

The data generated by each input branch is then merged (indicated by a flattening and concatenation in Figure 4.8) and fed into two fully connected layers, the first of which has 1024 outputs and the second of which has Y outputs, corresponding to the output vector y. All layers with activation functions utilize the Rectified Linear Unit (ReLU).
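The merge step amounts to flattening each branch's feature maps and concatenating the results into one vector for the first fully connected layer. A minimal sketch with toy branch outputs (not the real 256/128/64 sizes):

```python
def flatten(maps):
    """Flatten a list of 2-D feature maps into one vector."""
    return [v for fmap in maps for row in fmap for v in row]

# Toy branch outputs: lists of (height x width) feature maps
branch_full    = [[[1, 2], [3, 4]]] * 2   # 2 maps of 2x2
branch_half    = [[[5]]] * 3              # 3 maps of 1x1
branch_quarter = [[[6, 7]]]               # 1 map of 1x2

merged = flatten(branch_full) + flatten(branch_half) + flatten(branch_quarter)
print(len(merged))  # 2*4 + 3*1 + 1*2 = 13 inputs to Fully Connected 1
```

The dimensionality of the merged vector is simply the sum of the flattened branch sizes.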


[Figure 4.7 schematic: Data (B, C, 256, 256) → Convolution 1 (kernel size 5, stride 1, pad 0, 8 maps) → ReLU 1 → Pool 1 (MAX pooling, kernel size 3, stride 1, pad 0) → Convolution 2 (kernel size 5, stride 1, pad 0, 8 maps) → ReLU 2 → Pool 2 (MAX pooling, kernel size 3, stride 1, pad 0) → Convolution 3 (kernel size 3, stride 1, pad 0, 8 maps) → ReLU 3 → Fully Connected 1 (InnerProduct, 1024) → ReLU 4 → Fully Connected 2 (InnerProduct, 1024) → ReLU 5 → Fully Connected 3 (InnerProduct) → Output (B, Y)]

Figure 4.7: A schematic of the deep network, consisting of three convolutional layers and three fully connected layers.


[Figure 4.8 schematic: three inputs, Data (B, C, 256, 256), Data (B, C, 128, 128) and Data (B, C, 64, 64); each is convolved (kernel size 5, stride 1, pad 0, 8 maps) and passed through a ReLU; the 256×256 and 128×128 branches are MAX-pooled (kernel size 2, stride 1, pad 0); all branches are flattened and concatenated, then fed through Fully Connected 1 (InnerProduct, 1024) → ReLU → Fully Connected 2 (InnerProduct) → Output (B, Y)]

Figure 4.8: A schematic of the multiscale network, consisting of three data inputs with one convolutional layer each. After data concatenation, the network has two fully connected layers, similar to the shallow network.


Part II

Using Convolutional Neural Networks for Hand Detection and Pose Estimation


Chapter 5

Datasets

In order to achieve deep learning, and to properly train and test deep neural networks, relatively large datasets are required. Publicly available datasets are also preferred, since they make it possible to compare performance to previous work and allow others to recreate the results.

In this chapter, the datasets utilized in this thesis are presented. The details cover the extent of the datasets, i.e. how much data they contain, what data they contain and what labels they contain.

5.1 The NYU Hand Pose Dataset (NYU)

The dataset was presented in 2014 as part of a paper suggesting a pipeline for continuous pose recovery of human hands using convolutional networks [30]. The dataset consists of images with depth and color data generated using three PrimeSense cameras. The dataset also contains synthetic depth data, essentially noise- and artifact-free versions of the depth data obtained from the cameras, generated with the LibHand library [32]. The data is labeled with (u, v, d) and (x, y, z) location information, marking 36 key-points of the hand, e.g. positions of finger tips, joints etc.

5.1.1 Data

The data is already split into a training and a testing dataset, each containing a set of depth images, color images and synthetic depth images. The training dataset consists of 72757 depth images and color images per camera, as well as corresponding synthetic depth images, for a total of 654813 (72757 × 3 × 3) images. The testing dataset consists of 8252 depth images and color images per camera and corresponding synthetic depth images, for a total of 74268 (8252 × 3 × 3) images.
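The totals are just the per-frame counts multiplied by the number of cameras and image types (depth, color, synthetic); a quick check:

```python
frames_train, frames_test = 72757, 8252
cameras, image_types = 3, 3  # three PrimeSense views; depth, color, synthetic

total_train = frames_train * cameras * image_types
total_test = frames_test * cameras * image_types
print(total_train, total_test)  # 654813 and 74268
```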

The depth images from the three PrimeSense cameras corresponding to the first frame are shown in Figure 5.1. The corresponding synthetic depth images are shown in


(a) Camera #1, frame #1 (b) Camera #2, frame #1 (c) Camera #3, frame #1

Figure 5.1: Disparity (depth) images corresponding to the first frame from the three different PrimeSense cameras in the training dataset, packed into the three-channel color space. The top 8 bits of the depth data are packed into the green color channel and the bottom 8 bits into the blue color channel.

(a) Pseudo-camera #1, frame #1 (b) Pseudo-camera #2, frame #1 (c) Pseudo-camera #3, frame #1

Figure 5.2: Synthetic disparity (depth) images corresponding to the first frame of the three cameras, generated using the LibHand library [32]. Again, the top 8 bits of depth data are packed into the green color channel and the bottom 8 bits into the blue color channel.

Figure 5.2 and the corresponding color images are shown in Figure 5.3.

5.1.2 Labels

The labeling of the dataset consists of (u, v, d) and (x, y, z) position information corresponding to 36 hand key-points associated with each frame. The labels are generated by annotating data from all cameras [30]. An example of the (u, v, d) labels from the first frame and first camera, overlaid on a gray-scale image where the intensity represents depth, is shown in Figure 5.4. Additionally, the labels overlaid on a 3D rendering of PrimeSense-generated depth data and synthetic depth data are shown in Figure 5.5.


(a) Camera #1, frame #1 (b) Camera #2, frame #1 (c) Camera #3, frame #1

Figure 5.3: Color information from the three different PrimeSense cameras corresponding to the first frame.

Figure 5.4: The 36 (u, v, d) labels of the first frame overlaid on the disparity information from the first camera.


(a) PrimeSense depth data. (b) Synthetic depth data.

Figure 5.5: The 36 (u, v, d) labels corresponding to the first frame, converted into (x, y, z) Cartesian space and overlaid on a 3D rendering of the (u, v, d) information converted into (x, y, z) Cartesian space (both real and synthetic data).

5.2 The Oxford Hand Dataset (OX)

The dataset was presented in 2011 as part of a paper suggesting a two-stage method for detecting unconstrained hands and their orientation [31].

The dataset itself is a concatenation of data from different sources that was labeled in the work of [31] and split into training, testing and validation sets. The training set consists of 4070 images with a total of 9163 labeled hand instances. The testing set consists of 822 images with a total of 2031 labeled hand instances. Finally, the validation set consists of 739 images with a total of 2031 labeled hand instances.

In this thesis, only the testing set of the dataset is utilized. This implies that all subsequent references to this dataset actually refer to its testing set. The data and labels are presented in Section 5.2.1 and Section 5.2.2 respectively.

5.2.1 Data

The data consists of 822 RGB images of varying sizes. The images essentially contain humans in different scenes and environments. There are also images


(a) Image #30. (b) Image #31. (c) Image #47.

Figure 5.6: Three images from the testing set of the OX dataset.

(a) Image #30 with label. (b) Image #31 with label. (c) Image #47 with label.

Figure 5.7: Three images with overlaid labels from the testing set of the OX dataset.

consisting of multiple subjects and people interacting. More formally, the data consists of images from the PASCAL VOC 2007 test set and the PASCAL VOC 2010 human layout validation set [31, 37]. Some example images from the OX test set are shown in Figure 5.6.

5.2.2 Labels

Each image has n labels, each consisting of four (u, v) points. Together, the four points represent a bounding box around one of the hands in the image. The sides of the bounding box are all parallel, or orthogonal, to each other, meaning that each bounding box constitutes a proper rectangle. However, the bounding box is orientation sensitive, meaning that its sides do not have to be parallel to the image (u, v) index system. The images shown in Figure 5.6 with their corresponding bounding boxes are shown in Figure 5.7.


Chapter 6

Hand Detection

Using the different architectures described in Section 4.1, and the datasets described in Chapter 5, several convolutional neural network classifiers, suitable for hand detection, can be trained.

In this chapter, an augmented version of the NYU dataset, used for training and testing of the different classifiers, is presented, and the meta-parameters used in training are declared. An augmented version of the Oxford dataset is used for external validation. Finally, the performance of the different classifiers is reported using the classification accuracy measurement in training, testing and validation.

6.1 Training & Testing Data

In order to train a hand classifier usable in a detector framework, a dataset consisting of hands, i.e. positive samples, and non-hands, i.e. negative samples, must be created.

The positive samples should consist of images relatively dominated by a hand. In other words, the positive samples should not be images containing a hand among other things. Using images dominated by a hand as positive samples also aids non-maximum suppression in the detector: regions that should be classified as a hand must be dominated by the hand in order to classify the entire region as a hand. Thereby, the bounding box proposed by the detector should bound the hand relatively tightly.

Regarding negative samples, they should contain all other things that are not hands. This criterion is quite difficult to meet but, as a simplification, the set of negative samples should be as diverse as possible.

With the above in mind, we have created a new dataset, derived from the NYU dataset by using its labels and automated scripting. The algorithm can be described as follows:

1. Load the RGB image and the (u, v, d) label corresponding to a certain cameraand frame.


2. Compute the minimum and maximum (u, v) values in the label and round them to integers usable for slicing.

3. Slice the original image array by using the minimum and maximum (u, v) values as slicing bounds. Store the sliced image array as a positive sample, i.e. an image dominated by the hand. Return to the original image array. If the slicing operation fails due to bounding errors, skip the current RGB image and continue with the next RGB image.

4. Compute the average of the minimum and maximum (u, v) values and, if necessary, round the result. This is the center of the sliced image array and, approximately, the center of the hand in the original image.

5. Randomly generate two new values corresponding to a random height, hr, and a random width, wr. These values are drawn from the discrete uniform distribution U{32, 128}.

6. Do until success:

a) Randomly generate two new values, ur and vr, corresponding to a point in the RGB image, drawn from the discrete uniform distributions U{wr, 640 − wr} and U{hr, 480 − hr} respectively.

b) If the Euclidean distance between this random point and the hand center is larger than max(hr, wr), slice the original image array around the randomized point using the randomized width and height. Store this sliced array as a negative sample. Register success.

c) If the Euclidean distance between the random point and the hand center is smaller than max(hr, wr), randomly generate a new point in the image array.
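Steps 5-6 amount to rejection sampling of a random crop whose center is far enough from the hand. A sketch in Python (the function name, default image size of 640×480 from the steps above, and seeded RNG are all illustrative, not the thesis scripts):

```python
import random

def sample_negative_box(hand_center, img_w=640, img_h=480, seed=0):
    """Rejection-sample a random crop whose center is far from the hand
    (a sketch of steps 5-6; names and structure are illustrative)."""
    rng = random.Random(seed)
    # Step 5: random crop height/width drawn from U{32, 128}
    hr = rng.randint(32, 128)
    wr = rng.randint(32, 128)
    while True:
        # Step 6a: random center point, kept away from the image border
        ur = rng.randint(wr, img_w - wr)
        vr = rng.randint(hr, img_h - hr)
        dist = ((ur - hand_center[0]) ** 2 + (vr - hand_center[1]) ** 2) ** 0.5
        # Step 6b: accept only if the crop center is far enough from the hand
        if dist > max(hr, wr):
            return ur, vr, wr, hr

u, v, w, h = sample_negative_box(hand_center=(320, 240))
print(u, v, w, h)
```

The accepted crop is guaranteed to lie inside the image and to be centered further than max(hr, wr) from the hand center.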

Applying this scheme to the NYU dataset, more specifically to the first 70 k images per camera in the training set and the first 8 k images per camera in the testing set, yields 414882 images for training and 48000 images for testing. Half of each set constitutes positive samples and the other half negative samples. The positive and negative samples generated from the first frame of the three different cameras are shown in Figure 6.1 and Figure 6.2 respectively.


(a) Camera #1, frame #1. (b) Camera #2, frame #1. (c) Camera #3, frame #1.

Figure 6.1: Positive classification data samples created from the NYU dataset by using its annotations.

(a) Camera #1, frame #1. (b) Camera #2, frame #1. (c) Camera #3, frame #1.

Figure 6.2: Negative classification data samples created from the NYU dataset by using its annotations and random sampling.

6.2 Validation Data

In order to test the generalization performance of the hand classifier, i.e. validate the classifier, another type of dataset, with different scenery and generation setup, should be used. For this task, the testing set of the OX dataset, described in detail in Section 5.2, is very suitable, because it contains a lot of labeled hands with highly varying size, orientation and occlusion.

In order to adapt the data in the dataset to validating a classifier, we again derive a new dataset from the original dataset by using its labels and automated scripting. The algorithm can be described as follows:

1. Load the RGB image and all the n labels associated to the RGB image.

2. For all labels associated with the RGB image:

a) Compute the minimum and maximum (u, v) values in the n:th label and round them to integers suitable for slicing. Compute the width, wn, height, hn, and center point (ucn, vcn) of this new bounding box.

(a) Image #30. (b) Image #31. (c) Image #47.

Figure 6.3: Positive classification data samples created from the OX dataset by using its annotations.

b) Randomly generate two new values, urn and vrn, corresponding to a point in the RGB image, drawn from the discrete uniform distributions U{wn/2, W − wn} and U{hn, H − hn} respectively, where H and W are the height and width of the RGB image.

c) Do until success or a maximum of 500 tries:

i. If the Euclidean distance between the randomized point (urn, vrn) and every center point (ucn, vcn) is larger than

√( max(hn, wn)² + max(hn, wn)² ) = √2 · max(hn, wn)   ∀ n   (6.1)

slice the original RGB image using the n:th minimum and maximum (u, v) values. Store this sliced array as a positive sample. Also slice the original RGB image around the randomized point using the hn and wn values. Store this sliced array as a negative sample. Register success.

ii. If the Euclidean distance between the randomized point and any of the n center points is smaller than this, re-randomize (urn, vrn).

Applying this scheme to all the images in the OX test set generated a total of 4010 samples. Half of these samples constitute positive samples, i.e. images containing a hand, and the other half constitute negative samples, i.e. images of the same size as the corresponding positive sample but not portraying a hand. The positive samples generated from the images in Figure 5.6 are shown in Figure 6.3. Negative samples are shown in Figure 6.4.


(a) Image #30. (b) Image #31. (c) Image #47.

Figure 6.4: Negative classification data samples created from the OX dataset by using its annotations and random sampling.

Meta-parameters:
    α0: 0.001   µ: 0.9   Policy: STEP   γ: 0.1   s: 10^4   Decay: 0.0005

Initialization:
    Weights: GAUSSIAN (STD: 0.01)
    Biases: CONSTANT (VALUE: 0)

Multipliers:
    Weights: learning rate 1, weight decay 1
    Biases: learning rate 2, weight decay 0

Table 6.1: Training meta-parameters, initialization schemes and multiplier settings used when training all network classifiers.
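Under Caffe's STEP policy, the learning rate used at iteration i is α(i) = α0 · γ^⌊i/s⌋. A small sketch with the values from Table 6.1:

```python
def step_lr(iteration, alpha0=0.001, gamma=0.1, step=10_000):
    """Caffe STEP learning-rate policy: alpha0 * gamma^floor(iter/step)."""
    return alpha0 * gamma ** (iteration // step)

for it in (0, 9_999, 10_000, 19_999):
    print(it, step_lr(it))
# The rate drops by a factor of 10 every s = 10^4 iterations, so a
# 20 k-iteration run sees two rate levels: 1e-3 and then 1e-4.
```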

6.3 Training Details

In order to compare only the classification performance of the different networks, identical training methodologies, e.g. meta-parameter settings, are used globally for all networks. The settings are presented in detail in Table 6.1.

Images are processed in batches of 128 and training is carried out for a total of 20 k iterations. In other words, a total of 2.56 M images are processed, meaning that all the data will be processed for approximately 6 epochs. The ordering of the data is shuffled between every epoch. Testing is carried out every 1 k iterations.
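The epoch count follows directly from the batch size, the iteration count, and the 414882-image training set from Section 6.1:

```python
batch_size, iterations = 128, 20_000
images_processed = batch_size * iterations    # 2,560,000 images in total
epochs = images_processed / 414_882           # training-set size from Section 6.1
print(images_processed, round(epochs, 2))     # roughly 6 passes over the data
```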

The different networks are implemented using the Caffe framework and computation is carried out on an NVIDIA Tesla K40 GPU. The SGD algorithm is used to approximate gradients.

[Diagram: Score (B, C) and Class Label (B, C) feed both an Accuracy (Accuracy) layer, yielding Accuracy (1), and a Softmax Loss (SoftmaxWithLoss) layer, yielding Loss (1).]

Figure 6.5: Top layers when training all neural network classifiers. The Accuracy layer does not contribute to the backwards propagation of error, but the accuracy measurement is simpler to relate to classifier performance. The Softmax with loss layer is the one computing the error, and it drives backpropagation.

6.3.1 Learning Mechanism

When training a classifier, the network predicts a probability distribution over all the possible classes. The probability distribution is then compared to the correct class, i.e. the label, and a scalar error metric is calculated. This metric is then used to drive the backwards propagation of error. The error metric is computed from the scores yielded by the network by means of softmax squashing and multinomial logistic regression. However, these two operations are merged into one operation in the Caffe framework and referred to as a softmax with loss layer.

The softmax with loss error measurement is a measure of classifier performance. However, the classification accuracy is a simpler measurement, more directly relatable to classifier performance. The classification accuracy is the ratio of correctly classified samples to the total number of samples; a classification accuracy of one means that all samples were correctly classified, and vice versa.

In order to drive the learning process, and to report the training and testing classification accuracy as training progresses, the layers shown in Figure 6.5 are connected to the top of each classifier network during training and testing.
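The two measures can be written out explicitly; a numpy sketch for illustration (not Caffe's implementation):

```python
import numpy as np

def softmax_with_loss(scores, labels):
    """Softmax squashing followed by multinomial logistic loss (batch mean)."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # numeric stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def accuracy(scores, labels):
    """Fraction of samples whose highest score matches the label."""
    return (scores.argmax(axis=1) == labels).mean()
```

Both functions take a (B, C) score array and a length-B label vector, mirroring the (B, C) blobs in Figure 6.5.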

6.4 Classification Performance

6.4.1 Classification Accuracy on Training & Testing Set

The training and testing classification accuracies are reported by the Caffe framework as training progresses, and the resulting accuracies for the different classifier networks are reported in the graphs below. The results are organized as follows:

• The testing and training accuracy of the classifier network featuring one convolution with two or three fully connected layers is shown in Figures 6.6 and 6.9, respectively.

• The testing and training accuracy of the classifier network featuring three convolutions with two or three fully connected layers is shown in Figures 6.7 and 6.10, respectively.

• The testing and training accuracy of the classifier network featuring five convolutions with two or three fully connected layers is shown in Figures 6.8 and 6.11, respectively.

(a) Training accuracy. (b) Testing accuracy.

Figure 6.6: Classification accuracy on the training and testing set derived from the NYU dataset when using a neural network classifier consisting of one convolution and two fully connected layers with nnf neurons in the first fully connected layer. Curves are plotted against the training iteration (0 to 2·10^4) for nnf = 128, 256, 512, 1024, 2048 and 4096.

(a) Training accuracy. (b) Testing accuracy.

Figure 6.7: Classification accuracy on the training and testing set derived from the NYU dataset when using a neural network classifier consisting of three convolutions and two fully connected layers with nnf neurons in the first fully connected layer. Curves are plotted against the training iteration (0 to 2·10^4) for nnf = 128, 256, 512, 1024, 2048 and 4096.

(a) Training accuracy. (b) Testing accuracy.

Figure 6.8: Classification accuracy on the training and testing set derived from the NYU dataset when using a neural network classifier consisting of five convolutions and two fully connected layers with nnf neurons in the first fully connected layer. Curves are plotted against the training iteration (0 to 2·10^4) for nnf = 128, 256, 512, 1024, 2048 and 4096.

(a) Training accuracy. (b) Testing accuracy.

Figure 6.9: Classification accuracy on the training and testing set derived from the NYU dataset when using a neural network classifier consisting of one convolution and three fully connected layers with nnf neurons in the first and second fully connected layers. Curves are plotted against the training iteration (0 to 2·10^4) for nnf = 128, 256, 512, 1024, 2048 and 4096.

(a) Training accuracy. (b) Testing accuracy.

Figure 6.10: Classification accuracy on the training and testing set derived from the NYU dataset when using a neural network classifier consisting of three convolutions and three fully connected layers with nnf neurons in the first and second fully connected layers. Curves are plotted against the training iteration (0 to 2·10^4) for nnf = 128, 256, 512, 1024, 2048 and 4096.

(a) Training accuracy. (b) Testing accuracy.

Figure 6.11: Classification accuracy on the training and testing set derived from the NYU dataset when using a neural network classifier consisting of five convolutions and three fully connected layers with nnf neurons in the first and second fully connected layers. Curves are plotted against the training iteration (0 to 2·10^4) for nnf = 128, 256, 512, 1024, 2048 and 4096.

(a) Camera #1. Positive sample #3068. (b) Camera #1. Positive sample #5133. (c) Image #3. Positive sample #53.

Figure 6.12: Positive samples classified as negative samples by a classifier network featuring one convolutional layer and two fully connected layers.

(a) Camera #1. Negative sample #7304. (b) Camera #1. Negative sample #1047. (c) Image #3. Negative sample #3841.

Figure 6.13: Negative samples classified as positive samples by a classifier network featuring one convolutional layer and two fully connected layers.

6.4.2 Examples of Misclassified Samples in the Testing Set

As seen in Figure 6.6 through Figure 6.11, the classification accuracy on the test set reaches a maximum value of about 92 %, meaning that approximately 8 % of the samples are misclassified by the majority of the networks. Some of these samples, misclassified by a network featuring one convolutional layer and two fully connected layers with 1024 neurons in the first layer, are shown in Figures 6.12 and 6.13.

6.4.3 Classification Accuracy on Validation Set

The classification accuracy of all classifier networks on the validation set, described in Section 6.2, is reported in Figure 6.14. Figure 6.14a shows the classification accuracy of all networks featuring one convolution. Figures 6.14b and 6.14c show the classification accuracy of all networks featuring three and five convolutions, respectively.

(a) One convolution. (b) Three convolutions. (c) Five convolutions.

Figure 6.14: Classification accuracy on the validation dataset derived from the OX dataset testing set, plotted against nnf = 128, 256, 512, 1024, 2048 and 4096 for nfc = 2 and nfc = 3. Here, nnf is the number of neurons in the first fully connected layer (in networks featuring two fully connected layers) or the number of neurons in the first and second fully connected layers (in networks featuring three fully connected layers), and nfc is the number of fully connected layers.

6.5 Detection Test

In order to test the network classifiers' detection performance, a small sliding window test can be conducted. To reduce the number of experiments necessary, two classifier networks are selected as candidates:

• The network featuring one convolutional layer, and two fully connected layers with 1024 neurons in the first fully connected layer. This network is relatively shallow, but still capable of yielding a 100 % accuracy on the training set and an accuracy of about 92 % on the testing set.

• The network featuring five convolutional layers, and three fully connected layers with 512 neurons in the first and second fully connected layers. This network is relatively deep and narrow, but it is the network yielding the highest accuracy on the validation set.

6.5.1 Test Setup

In order to speed up the detection test further, the size of the sliding window is set to be approximately the same size as the label bounding box. This is the only bounding box size used in each test. A constant window step size of 8 pixels is used. Windows are stepped in row-major order.
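The row-major window sweep can be sketched as a simple generator (an illustration, not the thesis code; the frame and window sizes below are examples):

```python
def sliding_windows(img_h, img_w, win_h, win_w, step=8):
    """Yield (top, left) corners of windows in row-major order."""
    for top in range(0, img_h - win_h + 1, step):
        for left in range(0, img_w - win_w + 1, step):
            yield top, left
```

Assuming a 480 × 640 frame and the 146 × 129 window of Figure 6.15a, this produces 42 × 64 = 2688 candidate windows per image.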

By convention, a detection is perceived as correct, or successful, when the ratio between the intersection area and the union area is larger than 0.5, i.e. 50 %. This ratio, or overlap score, can be described more formally as

O = A(BL ∩ BW) / A(BL ∪ BW)    (6.2)

where BL and BW are the axis-aligned bounding boxes around the hand, i.e. the label, and the positively classified window, respectively.
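Equation (6.2) translates directly into code; a small sketch, with boxes given as (u_min, v_min, u_max, v_max) tuples (a hypothetical convention):

```python
def overlap_score(box_l, box_w):
    """Overlap score O = A(B_L ∩ B_W) / A(B_L ∪ B_W) for axis-aligned boxes."""
    u1 = max(box_l[0], box_w[0])
    v1 = max(box_l[1], box_w[1])
    u2 = min(box_l[2], box_w[2])
    v2 = min(box_l[3], box_w[3])
    inter = max(0, u2 - u1) * max(0, v2 - v1)              # intersection area
    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_l) + area(box_w) - inter              # union area
    return inter / union if union > 0 else 0.0
```

A detection then counts as successful when overlap_score(label_box, window_box) > 0.5.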

6.5.2 Detection & Precision Result

Detection is performed on different test set images from the original NYU dataset. The sliding window size and the resulting precision are reported in each figure caption.

• Examples of test set images where the relatively shallow classification network is able to detect the hand are shown in Figure 6.15. Figure 6.16 shows test set images where the network is unable to detect the hand.

• Examples of test set images where the relatively deep classification network is able to detect the hand are shown in Figure 6.17. Figure 6.18 shows test set images where the network is unable to detect the hand.

(a) Detection test. Camera #1. Image #2000. Resulting precision: 2.17 %. Window height: 146. Window width: 129.

(b) Detection test. Camera #1. Image #4000. Resulting precision: 0.51 %. Window height: 69. Window width: 47.

Figure 6.15: Detection test, conducted on test set images by the network featuring one convolutional layer and two fully connected layers, where the hand has been successfully detected. Green bounding boxes indicate successful detections, whilst red bounding boxes indicate false detections.

(a) Detection test. Camera #2. Image #2000. Resulting precision: 0 %. Window height: 122. Window width: 55.

(b) Detection test. Camera #1. Image #6000. Resulting precision: 0 %. Window height: 74. Window width: 38.

Figure 6.16: Detection test, conducted on test set images by the network featuring one convolutional layer and two fully connected layers, where the hand has not been successfully detected. Green bounding boxes indicate successful detections, whilst red bounding boxes indicate false detections.

(a) Detection test. Camera #3. Image #2000. Resulting precision: 14.3 %. Window height: 122. Window width: 55.

(b) Detection test. Camera #3. Image #6000. Resulting precision: 2.4 %. Window height: 67. Window width: 61.

Figure 6.17: Detection test, conducted on test set images by the network featuring five convolutional layers and three fully connected layers, where the hand has been successfully detected. Green bounding boxes indicate successful detections, whilst red bounding boxes indicate false detections.

(a) Detection test. Camera #3. Image #1. Resulting precision: 0 %. Window height: 122. Window width: 38.

(b) Detection test. Camera #3. Image #8000. Resulting precision: 0 %. Window height: 72. Window width: 34.

Figure 6.18: Detection test, conducted on test set images by the network featuring five convolutional layers and three fully connected layers, where the hand has not been successfully detected. Green bounding boxes indicate successful detections, whilst red bounding boxes indicate false detections.

Chapter 7

Hand Pose Estimation

Using the network architectures suitable for regression presented in Section 4.2, and the datasets presented in Chapter 5, the ability of the different networks to predict the key point locations of a hand given different types of data, i.e. depth, color or both, will be examined in this chapter.

7.1 Training & Testing Data

In order to train a regressor network capable of computing the key point locations in the (u, v, d) space, a dataset consisting of images dominated by fully visible hands must be created. The images in the new dataset should also be of a fixed height and width in order to map directly to the input neurons of the network without re-sampling. If the images were to be re-sampled before being fed into the network, the re-sampling would impose a spatial transformation unique to every image, and each key point label would have to be subject to the same transformation, which is currently difficult to implement in the Caffe framework.

With the above in mind, we have created a new dataset suitable for training a regressor network, derived from the NYU dataset by using its labels and automated scripting. The algorithm can be described as follows:

1. Load the RGB image and depth image, and the (u, v, d) label corresponding to a certain camera and frame.

2. Compute the minimum and maximum (u, v) values in the label and round them to integers.

3. Compute the average of the minimum and maximum (u, v) values and, if necessary, round the result. This is the approximate center of the hand.

4. Try to slice the RGB image and depth image around this (u, v) center point so that two new image arrays with a quadratic size of 256 are generated. If any of the slicing bounds exceeds the image dimensions, skip that image pair.

(a) Camera #1, frame #1. (b) Camera #2, frame #1. (c) Camera #3, frame #1.

Figure 7.1: RGB image data samples created from the NYU dataset by using its annotations.

(a) Camera #1, frame #1. (b) Camera #2, frame #1. (c) Camera #3, frame #1.

Figure 7.2: Depth image data samples created from the NYU dataset by using its annotations.

5. Recompute the label so that it corresponds to the new (u, v, d) space defined by the sliced arrays.

6. Store the new sliced arrays and corresponding label.

Applying this scheme to the NYU dataset, more specifically to the first 70 k images per camera in the training set and to the first 8 k images per camera in the testing set, yields 385796 images for training and 44212 images for testing. Half of these new image sets constitute RGB images and the other half the corresponding depth images. The RGB images generated from the first frame of the three different cameras are shown in Figure 7.1, and the depth images are shown in Figure 7.2.
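Steps 1 through 6 can be sketched as follows. This is a simplified reconstruction, not the thesis script; the indexing convention (u as the row index) is an assumption.

```python
import numpy as np

def crop_hand(rgb, depth, label_uvd, size=256):
    """Crop a size x size patch around the hand centre and shift the label.

    Returns None when the crop would fall outside the image, in which case
    the image pair is skipped (step 4).
    """
    uv = label_uvd[:, :2]
    centre = np.rint((uv.min(axis=0) + uv.max(axis=0)) / 2).astype(int)
    u0, v0 = centre - size // 2
    u1, v1 = centre + size // 2
    if u0 < 0 or v0 < 0 or u1 > rgb.shape[0] or v1 > rgb.shape[1]:
        return None                                        # slicing bound exceeded
    new_label = label_uvd.copy()
    new_label[:, :2] -= (u0, v0)                           # recompute (u, v) (step 5)
    return rgb[u0:u1, v0:v1], depth[u0:u1, v0:v1], new_label
```

The d component of each keypoint is left unchanged, since the crop only translates the image in the (u, v) plane.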

7.2 Data Interfacing

As mentioned previously, the depth data is encoded as color images in the original NYU dataset and in the derived dataset. Therefore, a scheme for decoding this three channel color representation into depth data, directly implementable in the Caffe framework, has been developed; it is described in Section 7.2.1. The scheme is then extended to concatenate the color and depth data, which is described in Section 7.2.3.

7.2.1 Depth Data

Normally, depth is represented using a 16 bit unsigned integer. Packing this into a three channel color space, where each channel is only capable of holding an 8 bit unsigned integer, means splitting the 16 bit integer into two parts: the top 8 bits and the bottom 8 bits. As mentioned in Section 5.1 and shown in Figure 5.1, the NYU dataset packs the top 8 bits (the 8 bits representing the highest terms in the binary number) into the green channel and the bottom 8 bits (the 8 bits representing the lowest terms in the binary number) into the blue channel.

The Caffe framework supports reading a batch of images directly from disk, yielding a four-dimensional floating point array representing the images with dimensions (B, C, H, W), where B is the batch size, C is the number of channels, H is the height of the image and W is the width of the image.

Understandably, depth data should be represented by a single channel, i.e. (B, 1, H, W), and the channel should contain the depth value corresponding to the 16 bit unsigned integer before the three channel color encoding. Therefore, the depth data (single channel value) must be decoded from the data in the three color channels. The process of decoding the depth data can be thought of as a set of operations on the original (B, 3, H, W) array generated when reading B depth images from disk:

1. Split the array along the second dimension, yielding three (B, 1, H, W) arrays, each containing the information from one channel.

2. Perform element-wise multiplication of all the elements in the second array by 2^8 (corresponding to an 8 bit binary shift). The second array represents the green color channel for all images in the batch, i.e. the top 8 bit channel; this yields a modified (B, 1, H, W) array modulated by 2^8.

3. Perform element-wise addition of the three arrays along the second dimension, yielding a final (B, 1, H, W) array representing depth data.

A schematic overview of the process is shown in Figure 7.3. Using this approach, the networks described in Section 4.2 can be trained to estimate the key point locations using the depth images in the derived dataset.

[Diagram: Depth Image (B, 3, H, W) → Split (Slice) → B (B, 1, H, W), G (B, 1, H, W), R (B, 1, H, W); the G slice passes through Modulate (Power); Add (Eltwise) combines the three slices into Data (B, 1, H, W).]

Figure 7.3: Data preparation pipeline for converting the three channel color encoding of depth information to single channel depth data. The Modulate (Power) unit performs multiplication by 2^8.

7.2.2 RGB Data

Color data can be an important source of information when inferring the key point locations of a hand. The information can also be used to derive other important measures, e.g. color and intensity gradients, which have proven to be very potent in the task of hand pose estimation [38].

The data interfacing step when using the color images of the derived dataset, shown in Figure 7.4, is considerably simpler than the one described in Section 7.2.1, since the Caffe framework is designed to utilize color or gray scale images as a primary data source [26].

7.2.3 RGB & Depth Data

Finally, both the depth and color data can be used to estimate the key point locations of the hand. To do this, the data is interfaced by concatenating the data produced by the approaches described in Section 7.2.1 and Section 7.2.2 along the second dimension, i.e. along the channel axis. This means that a floating point array with dimensions (B, 4, H, W) will be fed into the network. This combined interface is shown in Figure 7.5.
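The channel-axis concatenation itself can be sketched in one line of numpy (illustrative, not the Caffe layer):

```python
import numpy as np

def concat_rgbd(rgb_batch, depth_batch):
    """Stack a (B, 3, H, W) RGB batch and a (B, 1, H, W) depth batch into
    a single (B, 4, H, W) network input along the channel axis."""
    return np.concatenate([rgb_batch, depth_batch], axis=1)
```

Because the channels are simply stacked, the network input is order sensitive: the first three channels must always carry the color data and the fourth the depth data.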

[Diagram: RGB Image (B, 3, H, W) → Direct Feed (No Change) → Data (B, 3, H, W).]

Figure 7.4: Data preparation pipeline for converting three channel RGB information to data.

[Diagram: Depth Image (B, 3, H, W) → Split (Slice) → B, G, R slices, with the G slice passing through Modulate (Power); Add (Eltwise) → Depth Data (B, 1, H, W); Concatenate (Concatenate) joins the RGB Image (B, 3, H, W) and the Depth Data into Data (B, 4, H, W).]

Figure 7.5: Data preparation pipeline for converting the three channel encoding of depth information and three channel RGB information to a four channel data array, containing both color and depth data. The Modulate (Power) unit performs multiplication by 2^8.

[Diagram: Label (B, Y) and Output (B, Y) → Euclidean Loss (EuclideanLoss) → Loss (1).]

Figure 7.6: Top layers when training all neural network regressors. Note that the Euclidean loss is computed as a batch average metric.

7.3 Training Details

The different structural properties of the Shallow, Deep and Multiscale networks made some meta-parameter tuning necessary in order to produce valid results. The learning parameter settings and the utilized weight initialization method are presented in detail in Table 7.1. In short, the Deep and Multiscale networks required a larger base learning rate and a faster learning rate reduction.

Images are processed in batches of 50 and training is carried out for a total of 15 k iterations. In other words, a total of 750 k images are processed, meaning that all the data will be processed for approximately 4 epochs. However, since the input is order sensitive when concatenating depth and RGB data, the data is not shuffled between epochs. Testing is carried out every 1 k iterations.

The networks, and interfacing pipelines, are implemented using the Caffe framework and computation is carried out on an NVIDIA Tesla K40 GPU. The SGD algorithm is used to approximate gradients.

7.3.1 Learning Mechanism

When training a regressor, the network predicts the key-point locations of the hand directly. Therefore, the Euclidean distance measure between the predicted key-point location vector and the label vector can be used to drive the backwards propagation of error, and as a measure of relative performance between the different types of networks and data. The layer shown in Figure 7.6 is connected to the top of each regressor network during training and testing.
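Caffe's EuclideanLoss layer computes the batch-averaged squared L2 distance, E = 1/(2B) Σ ||output_i − label_i||^2; a numpy sketch:

```python
import numpy as np

def euclidean_loss(pred, label):
    """Caffe-style EuclideanLoss: 1/(2B) * sum of ||pred_i - label_i||^2
    over a batch of B samples, each with Y output coordinates."""
    B = pred.shape[0]
    return float(np.sum((pred - label) ** 2) / (2 * B))
```

Note that this value grows with the number of output coordinates Y, which is one reason raw loss magnitudes are hard to interpret on their own.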

Meta-parameters (Shallow network; depth data & depth and RGB data):
    α0: 10^-13    µ: 0.9    Policy: STEP    γ: 0.01    s: 5000    Decay: 0

Meta-parameters (Deep & Multiscale networks; depth data & depth and RGB data):
    α0: 10^-11    µ: 0.9    Policy: STEP    γ: 0.0001    s: 5000    Decay: 0

Meta-parameters (Shallow network; RGB data):
    α0: 10^-12    µ: 0.9    Policy: STEP    γ: 0.01    s: 5000    Decay: 0

Meta-parameters (Deep & Multiscale networks; RGB data):
    α0: 10^-10    µ: 0.9    Policy: STEP    γ: 0.0001    s: 5000    Decay: 0

Initialization (all networks):
    Weights: GAUSSIAN (STD: 0.01)
    Biases: CONSTANT (VALUE: 0)

Multipliers (all networks; Learning Rate / Weight Decay):
    Weights: 1 / 1
    Biases: 2 / 0

Table 7.1: Training meta-parameters, initialization schemes and multiplier settings used when training the different network regressors on data of any kind, i.e. depth, RGB or both.

7.4 Euclidean Loss on Training & Testing Set

The training and testing Euclidean loss are reported by the Caffe framework as training progresses. The Euclidean loss for the different regressor networks, trained on different types of data, is reported in the following graphs:

• The testing and training Euclidean loss during training of the different networks using depth data is shown in Figure 7.7.

• The testing and training Euclidean loss during training of the different networks using RGB data is shown in Figure 7.8.

• The testing and training Euclidean loss during training of the different networks using depth and RGB data is shown in Figure 7.9.

7.5 Key-point Prediction on Test Set Data

The Euclidean loss is not intuitively relatable to regressor performance. A Euclidean loss of zero is understandably ideal, but how high is the Euclidean loss when the prediction is off by a small amount? Or, more interestingly, how close to the label are the keypoints predicted by networks with the losses presented in Section 7.4?

Using the examples shown in Figure 7.10, the following predictions are made:

• Using the depth data in Figures 7.10a and 7.10c, forward predictions are made with the deep regressor network trained on depth data. The predicted keypoint locations, and the corresponding label keypoints, are shown in Figures 7.11 and 7.12, as well as in Figures 7.15 and 7.16.

• Using the RGB data in Figures 7.10b and 7.10d, forward predictions are made with the deep regressor network trained on RGB data. The predicted keypoint locations, and the corresponding label keypoints, are shown in Figures 7.13 and 7.14, as well as in Figures 7.17 and 7.18.

(a) Training loss. (b) Testing loss.

Figure 7.7: Euclidean loss on the training and testing set derived from the NYU dataset for the Shallow, Deep and Multiscale networks when using depth data, plotted on a logarithmic scale (10^4 to 10^8) against the training iteration (0 to 1.4·10^4).

(a) Training loss. (b) Testing loss.

Figure 7.8: Euclidean loss on the training and testing set derived from the NYU dataset for the Shallow, Deep and Multiscale networks when using RGB data, plotted on a logarithmic scale (10^4 to 10^8) against the training iteration (0 to 1.4·10^4).

(a) Training loss. (b) Testing loss.

Figure 7.9: Euclidean loss on the training and testing set derived from the NYU dataset for the Shallow, Deep and Multiscale networks when using RGB and depth data, plotted on a logarithmic scale (10^4 to 10^8) against the training iteration (0 to 1.4·10^4).

(a) Depth data. Camera #1. Frame #1. (b) RGB data. Camera #1. Frame #1.

(c) Depth data. Camera #3. Frame #1000. (d) RGB data. Camera #3. Frame #1000.

Figure 7.10: Data used to make a forward prediction. More specifically, frames from the test set derived from the NYU dataset.

[Figure: keypoint label and prediction plotted in image-matrix coordinates (index v horizontal, 256 − u vertical); (a) without pair-wise interconnection, (b) with pair-wise interconnection, displacement shown.]

Figure 7.11: Keypoint label and prediction in the uv-space computed by the deep regressor network on the depth data in Figure 7.10a.


[Figure: distance measure vs. keypoint index (0 to 40) for label and prediction.]

Figure 7.12: Keypoint label and prediction in the d-space computed by the deep regressor network on the depth data in Figure 7.10a.


[Figure: keypoint label and prediction plotted in image-matrix coordinates (index v horizontal, 256 − u vertical); (a) without pair-wise interconnection, (b) with pair-wise interconnection, displacement shown.]

Figure 7.13: Keypoint label and prediction in the uv-space computed by the deep regressor network on the RGB data in Figure 7.10b.


[Figure: distance measure vs. keypoint index (0 to 40) for label and prediction.]

Figure 7.14: Keypoint label and prediction in the d-space computed by the deep regressor network on the RGB data in Figure 7.10b.


[Figure: keypoint label and prediction plotted in image-matrix coordinates (index v horizontal, 256 − u vertical); (a) without pair-wise interconnection, (b) with pair-wise interconnection, displacement shown.]

Figure 7.15: Keypoint label and prediction in the uv-space computed by the deep regressor network on the depth data in Figure 7.10c.


[Figure: distance measure vs. keypoint index (0 to 40) for label and prediction.]

Figure 7.16: Keypoint label and prediction in the d-space computed by the deep regressor network on the depth data in Figure 7.10c.


[Figure: keypoint label and prediction plotted in image-matrix coordinates (index v horizontal, 256 − u vertical); (a) without pair-wise interconnection, (b) with pair-wise interconnection, displacement shown.]

Figure 7.17: Keypoint label and prediction in the uv-space computed by the deep regressor network on the RGB data in Figure 7.10d.


[Figure: distance measure vs. keypoint index (0 to 40) for label and prediction.]

Figure 7.18: Keypoint label and prediction in the d-space computed by the deep regressor network on the RGB data in Figure 7.10d.


Part III

Analysis


Chapter 8

Discussion

In this chapter, an analysis of the results presented in Chapter 6 and Chapter 7 will be presented. Since all the experiments conducted can be divided into two different tasks, or research problems, detection and pose estimation, this chapter will examine the results from the two research problems separately.

8.1 Hand Detection

8.1.1 High Test Set Accuracy

As shown in Figure 6.6 through Figure 6.11, many classifier networks reach a very high classification accuracy on the training and testing set. Only networks featuring one convolutional layer and two fully connected layers seem to yield a significant performance difference w.r.t. the number of neurons in the fully connected layers.

In machine learning, a training set classification accuracy of 1, i.e. all training set samples correctly classified, indicates some degree of overfitting. However, another sign of overfitting to the training set is a decreasing classification performance on the testing set, which cannot be observed in the conducted experiments. This non-decreasing, high testing set accuracy is believed to be caused by a high similarity between the training and testing set. This suspicion is strengthened by inspecting the data in the training and testing set:

• The data collection setup is identical, meaning that the camera angles, camera types and lighting conditions are approximately the same regardless of human subject.

• There is a low variability between the human subjects used to record the data. All subjects are adults with light-toned skin, which causes a relatively dense cluster in an equivalent feature space.

In other words, the generated hand classification dataset can be considered too easy, making it hard to test the limits of our method. To thoroughly examine the performance of convolutional neural networks usable in hand detection, a harder dataset would be required.

8.1.2 High Performance of Relatively Shallow Networks

Another important observation is that relatively shallow networks, e.g. the network featuring one convolutional layer and two fully connected layers, yield a relatively high classification accuracy on the training and testing set. However, the performance is also somewhat dependent on the number of neurons in the fully connected layer.

This high classification performance is believed to be caused by the difficulties in constructing a representative negative class. The classification task is highly asymmetric, i.e. it has one positive class representing hands and another negative class representing everything else, which is a semantic construct that is difficult to capture.

From manual inspection of the negative classification samples, generated using the scheme described in Section 6.1, it is observed that the samples typically contain a lot of black background. This leads to the hypothesis that the shallow classifier networks are probably triggered by non-dark colors, which is the simplest feature to classify by. This idea is also strengthened by observing the detection tests in Figure 6.15 and 6.16, where the shallow classifier network classifies a majority of the white regions as hands.

8.1.3 Maximum Test Set Accuracy

As shown in Figure 6.6 through Figure 6.11, many networks reach a maximum testing set accuracy of approximately 92 %, meaning that about 8 % of the testing set samples are misclassified. The nature of these samples is believed to vary depending on the classifier network, but from manual inspection the set of misclassified samples is dominated by false positives, i.e. samples that are classified as hands even though they are not hands, which further implies the difficulties in representing all objects that are not hands in a single class.

8.1.4 The Asymmetric Classification Problem

As mentioned above, there are major difficulties in representing every object that is not a hand as a class. In an imaginary feature space, the positive class, i.e. hands, will be a relatively small cluster compared to the negative class, i.e. everything else, whose samples would fill the rest of the feature space.

Clearly, the schemes used to generate negative samples from the original dataset in the classification experiments have not been able to capture the diversity of the true negative samples in the data. However, in order to generate a more diverse and representative set of negative samples, usable in future work related to hand detection, additional data is required since the scene diversity in the original dataset is very limited.


8.1.5 Number of Convolutions and Validation Set Performance

An interesting observation from the experiments is that even though the shallow networks yield a relatively high classification accuracy on the training and testing set, they yield a poor, if not catastrophic, performance on the validation set, as shown in Figure 6.14. A classification accuracy of 50 % means that the classifier network is assigning the same class to all samples, since the data is made up of an equal number of positive and negative samples.
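To make the 50 % figure concrete, here is a minimal sketch (toy data, not from the thesis experiments) showing that a degenerate classifier assigning a single class to every sample scores exactly 50 % on a balanced dataset:

```python
# Hypothetical balanced dataset: equal numbers of positive (hand)
# and negative (non-hand) labels.
labels = [1] * 500 + [0] * 500

# Degenerate classifier: assigns the same class to every sample.
predictions = [1] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.5
```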

However, as shown in Figure 6.14, adding convolutional layers systematically improves the classification accuracy on the validation set, i.e. three convolutions increase performance for all classifier networks compared to one convolution, and five convolutions increase performance compared to three convolutions, yielding a maximum validation set classification accuracy of approximately 70 %.

This is believed to be related to the ability of relatively deep networks to model a higher order of non-linearity compared to shallow networks. As mentioned previously, the shallow classifier networks are believed to be very color sensitive, meaning that shallow networks fail to correctly classify hands whose features reside within other regions of the color space, where a majority of the validation set samples are located. Color can be considered a first order feature, meaning that it can be inferred relatively easily from the data, i.e. using a few non-linear transformations. Higher order features, e.g. shapes, patterns and gradients, can be inferred from data as well, but require multiple non-linear transformations, which calls for a deeper convolutional neural network.

It is therefore believed that deeper networks learn more higher order semantic features, e.g. shapes of fingers and palm, than the lower level semantic features, e.g. color, learned by the shallow networks. This higher level of semantics is believed to be the reason for the systematically higher validation set classification performance. Further traction is added to this hypothesis by observing the detection tests in Figure 6.17 and Figure 6.18, where the deep network yields fewer false positives compared to the shallow networks.

8.2 Hand Pose Estimation

8.2.1 Relative Performance of Regressor Networks

As shown in Figure 7.7 through Figure 7.9, the multiscale regressor network seems to yield the lowest Euclidean loss on almost all types of data in both training and testing. This result is in line with the experiments conducted by [35].
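For reference, Caffe's EuclideanLoss layer, which produces the loss values plotted in these figures, computes the sum of squared differences between prediction and label, scaled by twice the batch size. A minimal NumPy sketch of that definition, using toy values rather than thesis data:

```python
import numpy as np

# Caffe-style Euclidean loss: L = 1/(2N) * sum_n ||pred_n - label_n||^2,
# where N is the batch size (first dimension).
def euclidean_loss(pred, label):
    pred, label = np.asarray(pred, float), np.asarray(label, float)
    n = pred.shape[0]  # batch size
    return np.sum((pred - label) ** 2) / (2.0 * n)

# Toy batch of two samples with three key-point coordinates each.
pred  = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
label = [[1.0, 2.0, 5.0], [0.0, 1.0, 0.0]]
print(euclidean_loss(pred, label))  # (4 + 1) / (2 * 2) = 1.25
```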

However, there are numerous ways to tune the different networks that have not been explored in this thesis. The meta-parameters presented in Table 7.1 are not optimal for the regression problem at hand. Also, the current version of the Caffe framework, even though it supports regression, is mainly aimed at classification tasks. In future work, the Caffe framework may have a broader set of tools that simplifies working with regression tasks. Other tools currently being developed, such as the NVIDIA Deep Learning GPU Training System (DIGITS) [39], may also serve as viable software candidates for further work concerning regression tasks.

8.2.2 Information Content in Different Types of Data

In Figure 7.7 it is observed that the shallow and deep regressor networks have approximately equivalent performance on depth data. However, when looking at Figure 7.8, the shallow network performs considerably worse than the deep and multiscale networks. This is believed to be caused by the shallow network's poor ability to model high non-linearity. The mapping from color data to approximate key-point locations can be considered more non-linear than the mapping from depth data to key-point locations. This is in line with the previous result on hand detection and the generic idea behind deep neural networks, i.e. that deep networks have the ability to model a high order of non-linearity better than shallow networks.

Furthermore, when observing the forward predictions made in Figure 7.11 through Figure 7.18, the deep regressor network trained on color data yields a smaller prediction error compared to the deep regressor network trained on depth data. This effect is especially prominent when the hand has a lot of visible edges between its components, e.g. fingers, as shown in Figure 7.13. This is also consistent with the notion that more information, e.g. edges and gradients, can be derived from color data than from depth data. However, this also requires a more potent network, i.e. a relatively deep network.

Finally, as shown in Figure 7.14 and Figure 7.18, the deep regressor network trained on color data seems to estimate the key-point depth with relatively good precision even though no depth data has been supplied during training, apart from the labels. This may indicate that the depth labels for all key-points in all samples across the dataset are located at approximately the same depth, and that the network reaches a minimum w.r.t. the loss function by predicting depths around this mean depth. However, by studying the relative shape of the label and prediction lines depicted in Figure 7.14 and Figure 7.18, one can observe that there are some similarities between the shapes of the label and prediction that indicate that this may not be the case.


Chapter 9

Conclusions & Future Work

In this chapter, given the results in Chapter 6 and Chapter 7, as well as the discussion in Chapter 8, the final conclusions of this thesis will be summarized.

Additionally, some future work will be presented, both in a narrow perspective, i.e. alterations and probable improvements to the experiments conducted in this thesis, and in a wider perspective, i.e. how to build additional research and experiments upon this thesis.

9.1 Conclusions

• Convolutional neural networks can be considered a suitable family of classifiers usable in hand detection systems. However, training a two-class hand classifier imposes high diversity requirements on the data representing the negative class.

• In our experiments, we created a relatively small dataset unable to capture the diversity of the true negative set. We therefore believe that a larger and more diverse dataset will reduce the number of false positives in a hand detection system and also improve the accuracy on validation data.

• Shallow convolutional classifier networks yield poor classification accuracy on validation data regardless of the number of neurons in the fully connected layers. Further experiments indicated that convolutional neural networks featuring relatively few convolutions mainly use color as feature descriptors.

• Deep convolutional classifier networks yield the highest classification accuracy on validation data. It is therefore believed that convolutional neural networks featuring a relatively high number of convolutions are able to learn more abstract features, e.g. shapes and gradients, which are color independent and carry higher order information.


• Convolutional neural regressor networks that accept data at different scales as input typically yield the lowest error when set to directly compute the key-point locations of the hand in the (u, v, d)-space.

• Relatively shallow convolutional regressor networks yield a performance comparable to the deep and multiscale convolutional regressor networks on depth data. However, on color data alone, shallow regressor networks perform considerably worse. This can be explained by considering the mapping from color data to key-point locations more non-linear than the mapping from depth data to key-point locations.

• Color data offers the smallest key-point displacement in our experiments. It also appears that color data can be used to infer depth with relatively high accuracy, but this is something we cannot confirm using our experiments alone.

• Using convolutional regressor networks to directly compute the key-point locations in the (u, v, d)-space unfortunately yields relatively high errors. However, convolutional neural networks can be used as an intermediate step in a hand pose estimation system.

9.2 Future Work

9.2.1 The Asymmetric Classification Problem and Data

In the classification and detection experiments, a prominent cause of low validation set classification accuracy and detection performance is the asymmetric classification problem. It is difficult to capture the nature of the negative class, i.e. all objects that are not a hand. The negative samples generated in these experiments are uniformly sampled from the data in the original dataset, leading to a negative set which poorly represents the true set of negative samples. We therefore believe that a more diverse and larger set of negative samples could improve performance, especially in hand detection.

9.2.2 Non-naive Convolutional Neural Networks

In these experiments, the convolutional neural networks used for the different tasks have been relatively naive. The goal was to investigate the performance of the different networks with respect to structural properties, i.e. the number of convolutional layers, the number of fully connected layers and the number of neurons. It is therefore believed that schemes presented to reduce overfitting and balance neural connections, e.g. Local Response Normalization (LRN) [11] and Dropout [19], would increase the performance of both the classifier and regressor networks.


9.2.3 Low Dimensional Embedding and Model Constraints

The regressor networks used in our experiments, as well as in [35], predict the (u, v, d) key-point locations directly from image data. Other authors have used convolutional neural networks as an intermediate step in a pose estimation pipeline, where the output of the network has been fitted to a hand model in order to impose constraints and thereby improve the pose estimate. Such a model could possibly be imposed on the output of our networks. Also, constraints can be introduced directly in the network by having a very narrow fully connected layer, i.e. one with a few neurons, before the last fully connected output layer. Conceptually, this narrow, or bottleneck, layer creates a low dimensional embedding of the hand pose, and the full hand pose is reconstructed from this low dimensional embedding by the last fully connected layer.
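The bottleneck idea can be sketched as two fully connected layers with a narrow layer in between. All layer sizes below are hypothetical and only illustrate the shape of the proposal, not the actual network architecture:

```python
import numpy as np

# Hypothetical sizes: 1024 input features, an 8-dimensional bottleneck,
# and 36 outputs (assumed: 12 key-points times (u, v, d)).
rng = np.random.default_rng(0)
n_features, n_embed, n_outputs = 1024, 8, 36
W_embed = rng.standard_normal((n_features, n_embed)) * 0.01
W_out   = rng.standard_normal((n_embed, n_outputs)) * 0.01

def forward(features):
    # Narrow ReLU layer: the low dimensional embedding of the hand pose.
    embedding = np.maximum(features @ W_embed, 0.0)
    # Last fully connected layer reconstructs the full pose vector.
    return embedding @ W_out

pose = forward(rng.standard_normal(n_features))
print(pose.shape)  # (36,)
```

The point of the design is that any predicted pose must pass through the 8-dimensional embedding, which acts as an implicit constraint on the space of reachable hand poses.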


Bibliography

[1] A. Thippur, C. H. Ek, and H. Kjellström, “Inferring hand pose: A comparative study of visual shape features,” in Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on. IEEE, 2013, pp. 1–8.

[2] G. Fanelli, J. Gall, and L. Van Gool, “Real time head pose estimation with random regression forests,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 617–624.

[3] R. Y. Wang and J. Popović, “Real-time hand-tracking with a color glove,” ACM Transactions on Graphics (TOG), vol. 28, no. 3, p. 63, 2009.

[4] M. Kölsch and M. Turk, “Robust hand detection,” in FGR, 2004, pp. 614–619.

[5] H. Hamer, K. Schindler, E. Koller-Meier, and L. Van Gool, “Tracking a hand manipulating an object,” in Computer Vision, 2009 IEEE 12th International Conference On. IEEE, 2009, pp. 1475–1482.

[6] B. Stenger, A. Thayananthan, P. H. Torr, and R. Cipolla, “Model-based hand tracking using a hierarchical Bayesian filter,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 28, no. 9, pp. 1372–1384, 2006.

[7] Convolutional Neural Networks (LeNet). [Online]. Available: http://deeplearning.net/tutorial/lenet.html

[8] Large Scale Visual Recognition Challenge. [Online]. Available: http://image-net.org/challenges/LSVRC/

[9] Y. LeCun. Convolutional Neural Networks: Machine Learning for Computer Perception. GPU Technology Conference On-Demand Webinar. [Online]. Available: http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php?searchByKeyword=yann&searchItems=&sessionTopic=&sessionEvent=&sessionYear=2014&sessionFormat=&submit=&select=+

[10] Y. LeCun, F. J. Huang, and L. Bottou, “Learning methods for generic object recognition with invariance to pose and lighting,” in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, vol. 2. IEEE, 2004, pp. II–97.


[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[12] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” arXiv preprint arXiv:1312.6229, 2013.

[13] C. Garcia and M. Delakis, “Convolutional face finder: A neural architecture for fast and robust face detection,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 26, no. 11, pp. 1408–1423, 2004.

[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 580–587.

[15] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: an astounding baseline for recognition,” arXiv preprint arXiv:1403.6382, 2014.

[16] X. Zhang and Y. LeCun, “Text understanding from scratch,” arXiv preprint arXiv:1502.01710, 2015.

[17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” arXiv preprint arXiv:1409.4842, 2014.

[18] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010. Springer, 2010, pp. 177–186.

[19] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[20] CUDA C Programming Guide. [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/

[21] Artificial Neural Networks and Other Learning Systems (DD2432). [Online]. Available: https://www.kth.se/student/kurser/kurs/DD2432?l=en

[22] S. Marsland, Machine Learning: An Algorithmic Perspective. CRC Press, 2011. [Online]. Available: http://books.google.se/books?id=n66O8a4SWGEC

[23] Intro to Parallel Programming. [Online]. Available: https://www.udacity.com/course/cs344

[24] About CUDA. [Online]. Available: https://developer.nvidia.com/about-cuda/


[25] NVIDIA cuDNN – GPU Accelerated Machine Learning. [Online]. Available: https://developer.nvidia.com/cuDNN/

[26] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.

[27] Basic Linear Algebra Subprograms (BLAS). [Online]. Available: http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms

[28] Boost C++ Libraries. [Online]. Available: http://www.boost.org/

[29] J. S. Supancic III, G. Rogez, Y. Yang, J. Shotton, and D. Ramanan, “Depth-based hand pose estimation: methods, data, and challenges,” arXiv preprint arXiv:1504.06378, 2015.

[30] J. Tompson, M. Stein, Y. Lecun, and K. Perlin, “Real-time continuous pose recovery of human hands using convolutional networks,” ACM Transactions on Graphics, vol. 33, August 2014.

[31] A. Mittal, A. Zisserman, and P. H. S. Torr, “Hand detection using multiple proposals,” in British Machine Vision Conference, 2011.

[32] M. Šarić, “Libhand: A library for hand articulation,” 2011, version 0.9. [Online]. Available: http://www.libhand.org/

[33] D. Purves, G. J. Augustine, D. Fitzpatrick, W. C. Hall, A.-S. LaMantia, and L. E. White, Neuroscience. Sinauer Associates, 2012. [Online]. Available: http://books.google.se/books?id=B5YXRAAACAAJ

[34] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” DTIC Document, Tech. Rep., 1985.

[35] M. Oberweger, P. Wohlhart, and V. Lepetit, “Hands deep in deep learning for hand pose estimation,” arXiv preprint arXiv:1502.06807, 2015.

[36] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013. [Online]. Available: https://ivi.fnwi.uva.nl/isis/publications/2013/UijlingsIJCV2013

[37] The PASCAL Visual Object Classes Homepage. [Online]. Available: http://host.robots.ox.ac.uk/pascal/VOC/

[38] J. Romero, H. Kjellstrom, and D. Kragic, “Monocular real-time 3d articulated hand pose estimation,” in Humanoid Robots, 2009. Humanoids 2009. 9th IEEE-RAS International Conference on, Dec 2009, pp. 87–92.


[39] NVIDIA DIGITS – Interactive Deep Learning GPU Training System. [Online]. Available: https://developer.nvidia.com/digits

[40] Y. Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k²),” in Soviet Mathematics Doklady, vol. 27, no. 2, 1983, pp. 372–376.

[41] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1139–1147.

[42] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” The Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.


Appendix A

Additional Training Strategies

In this chapter, some additional training options implementable in the Caffe framework are presented.

A.1 Gradient Approximations

Instead of using the Stochastic Gradient Descent (SGD) algorithm when approximating gradients during backwards propagation of errors, one could use Nesterov's Accelerated Gradient (NESTEROV) or the Adaptive Gradient (ADAGRAD) algorithm.

A.1.1 Nesterov’s Accelerated Gradient (NESTEROV)

Nesterov's Accelerated Gradient was originally proposed as an optimal method for convex optimization capable of achieving a convergence rate of O(1/t²) in some cases, but generally not in deep neural networks [40, 26]. However, the method has proven to be very effective when optimizing certain types of deep neural networks, e.g. deep MNIST autoencoders [41]. The weight update formula is similar to that of Stochastic Gradient Descent but with one key difference:


v_{t+1} = \mu v_t - \alpha \nabla Q(z_t, w_t + \mu v_t)    (A.1)

w_{t+1} = w_t + v_{t+1},

i.e. the method computes the gradient on the weights with added momentum.
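One NESTEROV update step following Equation A.1 can be sketched as below; the loss gradient grad_Q and the hyper-parameter values are illustrative, not Caffe's defaults:

```python
import numpy as np

# One Nesterov step: the gradient is evaluated at the "looked ahead"
# point w_t + mu*v_t rather than at w_t itself (Equation A.1).
def nesterov_step(w, v, grad_Q, lr=0.1, mu=0.9):
    v_next = mu * v - lr * grad_Q(w + mu * v)  # momentum + look-ahead gradient
    w_next = w + v_next                        # weight update
    return w_next, v_next

# Toy quadratic loss Q(w) = 0.5*||w||^2, whose gradient is simply w.
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, v = nesterov_step(w, v, lambda w: w)
print(np.allclose(w, 0.0, atol=1e-3))  # iterates converge towards the minimum at 0
```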

A.1.2 Adaptive Gradient (ADAGRAD)

The Adaptive Gradient method utilizes the update information from the previous t update steps and computes each component in the next weight vector as

(w_{t+1})_i = (w_t)_i - \frac{\alpha \, (\nabla Q(z_t, w_t))_i}{\sqrt{\sum_{t'=1}^{t} (\nabla Q(z_{t'}, w_{t'}))_i^2}}    (A.2)


where i denotes the component in the weight vector and t′ is a variable describing a set of previous states [42]. In practical implementations, due to memory limitations and computational demand, not all previous weight values are stored or utilized in the gradient approximation [26].
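A sketch of the per-component ADAGRAD update in Equation A.2, keeping only the running sum of squared gradients rather than all previous states; the values are toy values, and eps is a small constant added for numerical stability, as is common in practical implementations:

```python
import numpy as np

def adagrad_step(w, grad, hist, lr=0.5, eps=1e-8):
    hist = hist + grad ** 2                    # accumulated squared gradients
    w = w - lr * grad / (np.sqrt(hist) + eps)  # per-component effective learning rate
    return w, hist

# Toy quadratic loss: the gradient equals the weight vector itself.
w0 = np.array([1.0, -3.0])
w1, hist = adagrad_step(w0, w0, np.zeros(2))
print(w1)  # the first step moves every component by lr toward zero
```

Note that on the very first step each component moves by exactly lr in magnitude regardless of the gradient's size, since the accumulated history equals the squared gradient itself; this is the per-component scaling that distinguishes ADAGRAD from plain SGD.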

A.2 Learning Rate Policy

Instead of using the discrete stepped learning rate policy (STEP), continuous learning rate policies with a faster reduction of the learning rate can be utilized.

A.2.1 Exponential (EXP)

Using the exponential learning rate policy, the learning rate α at iteration n, i.e. α_n, is described as

\alpha_n = \alpha_0 \gamma^n    (A.3)

where α_0 is the base learning rate and γ^n is the base learning rate modulation factor. The learning rate modulation factor when using the exponential learning rate policy is shown in Figure A.1.

A.2.2 Inverse (INV)The inverse learning rate policy dictates that the learning rate α, at iteration n,i.e. αn is described as

α_n = α_0 (1 + γn)^{−c} (A.4)

where α_0 is the base learning rate and (1 + γn)^{−c} is the base learning rate modulation factor. Here, c is a positive constant. The modulation factor for the inverse learning rate policy is shown in Figure A.2.
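Both policies are simple closed-form functions of the iteration number. The sketch below evaluates the schedules of Equations A.3 and A.4; the parameter values are illustrative, not those used in the thesis experiments:

```python
def lr_exp(n, base_lr=0.01, gamma=0.99):
    """Exponential policy (Eq. A.3): alpha_n = alpha_0 * gamma^n."""
    return base_lr * gamma ** n

def lr_inv(n, base_lr=0.01, gamma=0.5, c=0.3):
    """Inverse policy (Eq. A.4): alpha_n = alpha_0 * (1 + gamma*n)^(-c)."""
    return base_lr * (1.0 + gamma * n) ** (-c)

# Both schedules start at the base learning rate and decay
# monotonically; EXP decays geometrically, INV polynomially.
rates_exp = [lr_exp(n) for n in range(100)]
rates_inv = [lr_inv(n) for n in range(100)]
```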

Figure A.1: The base learning rate modulation factor for the exponential learning rate policy (shown for γ = 0.99, 0.98 and 0.95).


Figure A.2: The base learning rate modulation factor for the inverse learning rate policy (shown for γ = 0.75, 0.50 and 0.25, with c = 0.3).

A.3 Activation Functions

The Rectified Linear Unit (ReLU) is the most commonly used activation function in successful classification and pose regression tasks [11, 35]. However, a wide range of additional activation functions have been used historically.

A.3.1 The Absolute Value Unit (ABSVAL)

The Absolute Value unit (ABSVAL) is a non-saturating activation function, with a discontinuous derivative at the origin, described by

ϕ(h) = |h|. (A.5)

The ABSVAL is thereby linear with unit slope for positive input values and linear with negative unit slope for negative input values. See Figure A.3.

A.3.2 The Hyperbolic Tangent Unit (TANH)

The Hyperbolic Tangent unit (TANH) is a continuous, saturating activation functiondescribed by

ϕ(h) = tanh(h). (A.6)

The TANH yields negative activation for negative input values; it is an odd function, i.e. antisymmetric about the origin. See Figure A.4.


Figure A.3: Activation function of the Absolute Value unit, ABSVAL.

Figure A.4: Activation function of the Hyperbolic Tangent unit, TANH.

A.3.3 The Binomial Normal Log Likelihood Unit (BNLL)

The Binomial Normal Log Likelihood unit (BNLL) is a continuous, non-saturating activation function described by

ϕ(h) = log(1 + eh). (A.7)

The BNLL does not yield negative activation, and its response can be considered a smooth, continuous version of the ReLU. See Figure A.5.
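The three activation functions above can be evaluated side by side. This small sketch (function names are illustrative) demonstrates their defining properties: ABSVAL mirrors negative inputs, TANH saturates within (−1, 1), and BNLL stays positive while approaching the identity for large inputs:

```python
import math

def absval(h):
    """ABSVAL (Eq. A.5): absolute value of the weighted sum."""
    return abs(h)

def tanh_unit(h):
    """TANH (Eq. A.6): saturating, odd activation."""
    return math.tanh(h)

def bnll(h):
    """BNLL (Eq. A.7): log(1 + e^h), a smooth analogue of the ReLU."""
    return math.log1p(math.exp(h))

# Compare responses for a negative, zero, and positive weighted sum.
for h in (-2.0, 0.0, 2.0):
    print(f"h={h:+.1f}  absval={absval(h):.4f}  "
          f"tanh={tanh_unit(h):+.4f}  bnll={bnll(h):.4f}")
```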


Figure A.5: Activation function of the Binomial Normal Log Likelihood unit, BNLL.
