professional project - c++ opencl - platform agnostic hardware acceleration for deep neural...

University of Surrey Faculty of engineering and physical sciences

Department of Computing

Final Year Project Report

19/05/2016

Title: Platform agnostic hardware acceleration for deep neural networks

Student: Callum McMahon

URN: 6279333

Supervisor: Lillian Tang

Platform agnostic hardware acceleration for deep neural networks P a g e | 1

Contents Abstract ....................................................................................................................................... 3

Abbreviations .............................................................................................................................. 3

Introduction ................................................................................................................................. 4

Background ............................................................................................................................. 4

Objectives................................................................................................................................ 5

Literature Review ........................................................................................................................ 5

Pre-existing software packages ............................................................................................... 5

Exploring Caffe’s OpenCL branch in more depth ..................................................................... 5

Theoretical groundwork ........................................................................................................... 7

Multi Layer feed forward perception ..................................................................................... 7

Modern Activation Functions and the Back Propagation algorithm ....................................... 8

Weight regularization ........................................................................................................... 9

OpenCL learning resources and reference material ............................................................... 11

System Design .......................................................................................................................... 11

Development environment ..................................................................................................... 11

Essential Requirements ......................................................................................................... 12

Implementation Deliverables.................................................................................................. 12

Technical Challenges ............................................................................................................ 12

Feeding the OpenCL device ............................................................................................... 12

OpenCL kernel efficiency considerations ........................................................................... 14

Using clFFT ....................................................................................................................... 15

Implementation Schedule ...................................................................................................... 15

Design specification ............................................................................................................... 15

Designing a flexible network architecture ........................................................................... 15

Validation tests ................................................................................................................... 16

Class hierarchy .................................................................................................................. 17

Results ...................................................................................................................................... 18

Requirement satisfaction ....................................................................................................... 18

Refer to system design, essential and optional requirements, page 11. ................................. 18

Test validation Results........................................................................................................... 19

MNIST classification examples .............................................................................................. 20

Result Discussion .................................................................................................................. 20

Evaluation ................................................................................................................................. 21

Further Work ......................................................................................................................... 21

Conclusion............................................................................................................................. 21

Deployment guide ..................................................................................................................... 22


Bibliography .............................................................................................................................. 23

Appendices ............................................................................................................................... 25

A - Network validation architectures....................................................................................... 25

A.1. MNIST ........................................................................................................................... 25

A.2. sin(a) .............................................................................................................................. 25

A.3. sort(a, b, c, d, e) ............................................................................................................ 26

A.4. polynomial ...................................................................................................................... 26

A.5. MNIST ........................................................................................................................... 27

B – clFFT library expeiment ................................................................................................... 27

B.1. Fourier transform and inverse fourier transform via clFFT and OpenCL ........................ 27

B.2. Program outputs from B.1. Showing only the first column for succinctness. ................... 31

C – Gantt time plans .............................................................................................................. 32


Abstract This report provides an overview of resources available for deep neural network machine

learning. Current state of the art software libraries employ massively vectorised training

pipelines, enabling highly parallel computation and hence faster training convergence. Graphics

processing units, provide access to a greater threading capability than a typical central

processing unit. As such, a number of libraries have been developed with alternative fast native

GPU code paths. Current implementations are tightly integrated with the CUDA platform, a

proprietary programming model restricted to Nvidia GPUs.

In response a basic cross platform neural network library has been developed in C++,

demonstrating the feasibility of a single high performance platform agnostic code path. The

library has been built on top of the OpenCL programming framework. OpenCL is maintained by

a non-profit consortium group, Khronos, with implementations available on a number of devices

from different vendors.

Validation tests were performed on multilayer neural networks to assess training performance

and final network accuracy. Training consisted of multiple passes using back propagation and an

adaptive global learning rate.

A network consisting of two hidden linear rectifier layers was trained on the MNIST dataset; a

well known set of labelled greyscale digit images. The best observed error was achieved with a

total of 1099770 trainable parameters over 200 epochs, attaining a classification error of 4.5%.

Each epoch consisted of 5000 stochastic samples and back propagation passes. Total training

time was 53 minutes. Good fast convergence was observed using fewer training epochs. Using

10 epochs, a classification error rate of 9.6% was observed; taking 164.6 seconds of training on

an AMD Fury X.

Training on the Fury X was found to be approximately x5 faster than the i7-6700k. The Fury X

boasts approximately x72 the single floating point performance of the i7-6700k, suggesting

further optimisations can be made.

For demonstration purposes, windows x64 has been explicitly targeted by this release; porting to

another operating system would be trivial. The library has been written against OpenCL version

2.0 in order to take advantage of fine control over job queues. All recent CPUS and GPUs from

AMD and Intel are OpenCL 2.0 capable. Currently Nvidia devices only support OpenCL 1.2, but

2.0 support is likely to come in the near future.

Abbreviations CPU Central Processing Unit

GPU Graphics Processing Unit

CUDA Compute Unified Device Architecture

OpenCL Open Computing Language

clBLAS OpenCL Basic Linear Algebra Subprograms

clFFT OpenCL Fast Fourier Transform

Linear Unit Linear Unit

Rectified Linear Unit ReLU

LU Linear Unit

SiU Sigmoid Unit


Introduction

Background The field of machine learning is currently experiencing renewed interest. Developments in deep

neural network architectures and training methods have resulted in greatly improved model

learning accuracy for difficult tasks. Refinements to techniques are being continually developed,

with error rates as low as 15.2% being reported in difficult tasks such as speech recognition [1].

Companies are making investing large sums into neural network research. See Facebook open

sourcing deep learning modules for Torch [2]. There have been a number of high profile public

successes, as Alphabet’s AlphaGo, the first program to ever beat a professional Go player

without a handicap [3].

Figure 1.1 Google trend data showing the popularity of search terms. Note the rapid rise of "deep learning" searches.

Deep neural networks are an evolution of single hidden layer neural networks. Whilst the idea of

a distributed computational network was conceived in the late fifties, inspired by biological

models, it was not until the invention of back propagation in 1970[4] that an effective network

training method was available. 1985 saw the first proposal of introducing convolution layers [5].

Since then a large number of new methods have been introduced: weight decay [6], fast

convolution layers using Fourier transforms [7], dropout [8], long short term memory networks

[9].

Demand for increased computational performance has risen with the increasing complexity of

neural networks. In 1995 it was demonstrated that GPUs could be used to effectively train neural

networks [10]. Neural network optimisation is a massively parallel problem, and as such is well

suited to GPU architectures, which give access to a much larger number of threads than a

typical CPU.

GPUs APIs were originally designed with fixed pipeline designed to produce visual effects.

Traditionally it has been very difficult to run exploit GPU parallelism for algorithm computation.

However, graphics API pipelines have become increasingly generic to handle more intricate

computer graphics methods [11][12]. Hardware vendors have subsequently released more

generic compute platforms [13][14][15][16] that can run on code against GPU hardware,

designed for the needs of the scientific computing community. Nvidia CUDA 1.0 was released in

2007, OpenCL 1.0 in 2009. Both OpenCL and CUDA program kernels are based on the C++14

specification.


CUDA is currently the more mature of the two GPU compute platforms, boasting a wider

selection of libraries [17]. This has directly translated into more widespread CUDA hardware

acceleration for training deep neural networks. In contrast, OpenCL implementations are

generally incomplete or non-existent (table 2.1). However, CUDA is a proprietary platform that

will only run on Nidia’s GPU hardware [18]. OpenCL implementations exist across a range of

hardware from different vendors, including both CPUs and GPUs [19]. OpenCL has the potential

to provide a single unified fast code path for training deep Neural Networks.

Objectives

To develop a basic library deep learning library that utilises OpenCL for all intensive

operations.

Develop an easy to use interface within C++.

Maintain compatibility across as many OpenCL platforms as possible.

Minimise external dependencies to ease setup and increase portability.

Literature Review

Pre-existing software packages

Software Primary language

interface

Other language

interfaces

CUDA GPU

support

OpenCL CPU / GPU

support

Caffe Python C++, Matlab Yes Third party branch from

AMD, but only neared

feature completion as of

late August 2015.

Neon Python Yes No.

Theano Python Yes In development.

Tensorflow Python C++ (graphs only) Yes In development.

Torch Lua C Yes Third party branch in

development.

Figure 2.1.1 An overview of popular deep learning software environments.

None of the popular deep learning libraries provide official OpenCL support. Caffe is the only

library with a feature complete OpenCl branch.

Exploring Caffe’s OpenCL branch in more depth There a large number of dependencies [20] required for installation. Installations are restricted to

Ubuntu 12.04 or later. Only AMD GPUs are currently supported. Building and deploying the full

caffe OpenCL stack was deemed outside the scope of this project. Test performance metrics are

available on the github page [21], see Fig 2.2.1.


Platform Speed (images per second)

AMD W9100 & A10-7850k 255

AMD R9 Fury & A10-7850k 261

AMD R290X @1000MHz & A10-7850k 268

AMD S9150 @900MHz & Xeon E5-2640 227

Figure 2.2.1. Training performance using the well known AlexNet network. [22]

The network inputs used by Alexnet were images of 256x256 resolution. Multiplying out the total

number of pixels by the number images processed per second, we can see that OpenCL’s caffe

branch is capable of training approximately 17,104,896 inputs per second on an AMD Fury X.

Platform Speed (images per second)

AMD W9100 & A10-7850k 590

AMD R9 Fury & A10-7850k 699

AMD R290X @1000MHz & A10-7850k 606

AMD S9150 @900MHz & Xeon E5-2640 452

Figure 2.2.2. Recognition performance using AlexNet. [22]

Similarly, we can see that an approximately 45,809,664 inputs per second can be processed.


Theoretical groundwork

Multi Layer feed forward perception

The perceptron network was first proposed in 1958 by Frank Rosenblatt [24]. Perceptrons are

connected into a directed graph. The

perceptrons at the start of the graph

correspond to the network’s inputs.

Perceptrons at the end of the graph, the

outputs. Input values are passed into the input

perceptrons. Each subsequent perceptron

computes a weighted sum of the outputs from

prior connected perceptrons. The summed

value is then passed through an activation

function, A(x), and passed on through to the

next set of perceptrons. This process is

continued until the network output is reached.

These networks were handcrafted by

tweaking connection weight values. Modern

neural networks employ learning algorithms to

automatically update weight values.

𝐴 𝑥 =𝑑(max{𝑥, 0})

𝑑𝑥

Figure 2.3.2. The Heaviside step function was the activation function originally used by Rosenblatt. It has since been replaced by differentiable functions. Differentiable activation functions allow gradient descent to be used to modify connection weights in such a way that the network can be taught to output a set of desired values for a given input.

Figure 2.3.1. A diagram showing how a single perceptron unit processes inputs within a network. This process is called a forward pass.


Modern Activation Functions and the Back Propagation algorithm

Back propagation[3] is widely used as a training algorithm for neural networks. It is a class of

gradient descent algorithm. It works by first performing a forward pass of the network. See [25]

for an overview of the algorithm.

𝑝𝑗 = 𝐴

𝑝𝑖𝑤𝑖𝑗

𝐼𝑛𝑐𝑜𝑚𝑖𝑛𝑔𝑤𝑒𝑖𝑔 ℎ𝑡𝑠

𝑖=1

Where 𝐴() is an activation function, 𝑤𝑖𝑗 a weight between units 𝑖 and 𝑗 and 𝑝𝑥 is the output of

unit 𝑥. 𝑖 is the index of the unit closest to the input layer.

The activation function must be differentiable so that an error gradient may be calculated. The

sigmoid function is commonly used. The linear rectifier activation function has been shown to

have better characteristics under some conditions [26]. The linear rectifier prevents the

vanishing gradient problem experienced by the sigmoid activation function, where weights of

large magnitude will have activation gradients of 0, or near to 0, which in turns reduces the

weight update deltas to 0, or near 0.

Sigmoid, derivative: 𝐴 𝑥 = 1

(1+𝑒−𝑥 )

𝑑𝐴 𝑥

𝑑𝑥=

𝑒𝑥

1+𝑒−𝑥 2 = 𝐴 𝑥 (1 − 𝐴 𝑋 )

Linear rectifier, derivative: 𝐴 𝑥 = 𝑙𝑛(1 + 𝑒𝑥) 𝑑𝐴 𝑥

𝑑𝑥=

1

1+𝑒−𝑥

An error delta is calculated at each output unit by finding the difference between its output and a

desired output value. The error deltas are propagated back through the network to the input

layer, storing deltas at each unit. This is referred to as a backwards pass.

Delta error for output units: 𝛿𝑝𝑗= (𝑝𝑗 − 𝑡𝑗 )

𝑑𝐴 𝑝𝑗

𝑑𝑥

Where 𝑡𝑗 denotes the 𝑡𝑡ℎ output unit’s target value.

Delta error for inner units: 𝛿𝑝𝑗=

𝑑𝐴 𝑝𝑗

𝑑𝑥 𝛿𝑝𝑖

𝑤𝑖𝑗

𝑂𝑢𝑡𝑔𝑜𝑖𝑛𝑔𝑤𝑒𝑖𝑔 ℎ𝑡𝑠

𝑖=1

𝑤𝑖𝑗 is the weight from unit 𝑖 in the previously visited layer, to 𝑗 in the current layer. i.e. 𝑖 is

the index of the unit closest to the output layer.

Finally, weights are moved by value proportional to the error delta at the unit they provide inputs

for. The direction of change is opposite to the sign of the delta. The deltas are proportional to the

rate of change of the network’s error with respect to the incoming weights.

∆𝑤𝑖𝑗 = −𝑎𝛿𝑗𝑝𝑖 =𝑑𝐸𝑟𝑟𝑜𝑟

𝑑𝑤𝑖𝑗

Where 𝑎 is the learning rate. 𝑖 is now the index of the unit closest to the input layer.


The learning rate, 𝑎, must be small enough to allow the network to converge, yet large

enough to give a reasonable training time. Small 𝑎 values may also cause the network to

get stuck in local error minima.

Weight regularization

Weight regularization is commonly applied in one of two forms: weight decay [6], or dropout [8].

Weight regularization is intended to prevent overfitting, whereby the network learns to exactly

produce the training outputs, rather than learning a generalized pattern. Over fitted networks

perform poorly on validation test sets.

Weight decay modification to the weight update rule: ∆𝑤𝑖𝑗 = −𝑎𝛿𝑗𝑝𝑖 − 𝑑 ∗ 𝑠𝑖𝑔𝑛(𝛿𝑗𝑝𝑖)

Where 𝑑 is a small decay factor, such that 𝑑 ≪ 𝑎.

Weight decay may however reduce final network performance, as it will create moving global

optima. It is preferable to use dropout where possible. The dropout modification is applied to the

forward pass during training, giving each unit a small probability of outputting a value of 0.

𝑝𝑗 =

𝐴

𝑝𝑖𝑤𝑖𝑗

𝐼𝑛𝑐𝑜𝑚𝑖𝑛𝑔𝑤𝑒𝑖𝑔 ℎ𝑡𝑠

𝑖=1

𝑟𝑛𝑑 0.0, 1.0 < 𝑑

0 𝑟𝑛𝑑 0.0, 1.0 ≥ 𝑑

Where 𝑑 is a small dropout probability such that 0.0 ≤ 𝑑 < 1.0.

Dropout attempts to spread learned patterns across the network, rather than isolated groups of

units.

Convolution Layers and Fast Convolutions

Convolution layers provide method of introducing translation resistant weights into the network

[27]. Units within a convolution layer share weights in a spatial pattern, allowing the network to

quickly generalize for inputs containing translated patterns. Stacked convolution layers can

identify identify extremely complex patterns much more rapidly than a typical multi layer network;

convolution networks have seen great success in many applications.

Figure 2.3.3 A diagram showing how the weights are shared across convolutional layer units.


Convolution operations can however be expensive for large kernels, being 𝑂(𝑛𝑘2), where 𝑛 is

the number of units in the convolutional layer, and 𝑘 is the kernel width. It has been recognised

that that convolution theorem can be applied to give greatly reduced computation time of

𝑂(𝑛𝑙𝑜𝑔 𝑛 ) for the forward pass [28].

𝐹 𝑐 . 𝑘 = 𝐹 𝑐 ∗ 𝐹(𝑘)

∴ 𝑐. 𝑘 = 𝐹−1(𝐹 𝑐 ∗ 𝐹 𝑘 )

The convolution theorem shows that the elementwise product of two matrices is equal to the

product of their fourier transforms. Using the fast fourier transform algorithm, 𝐹(𝑐) and 𝐹(𝑘) can

be computed in 𝑂(𝑛𝑙𝑜𝑔 𝑛 ), where 𝑛 is the number of elements in 𝑐 or 𝑘 (they must have the

same number of elements). Similarly, the backpropagation algorithm may be also be modified to

take advantage of this identity [28].

Delta errors for convolutional output layer: 𝜹𝒋 = 𝑝𝑑𝐴 𝑝

𝑑𝑝 − 𝒕

Note 𝑝𝑑𝐴 𝑝

𝑑𝑝 is a matrix of the output layer multiplied element wise with the derivatives of

the activation function.

Delta errors for convolutional inner layer 𝜹𝒋 = 𝑑𝐴 𝒍𝒋

𝑑𝑝 𝒍𝑖 ∗ 𝒘𝑇

𝑖𝑗

Where 𝑖 and 𝑗 are now indexes between network layers, rather than units. For the

backwards pass 𝑖 is the index of the layer closest to the output layer.

Where 𝒍𝑖 = 𝑝 𝑖 , denoting the matrix of outputs for layer 𝑖.

Weight updates for a convolutional kernel: ∆𝒘𝑖𝑗 = −𝑎(𝜹𝑗 ∗ 𝒍𝑖) =𝑑𝑬

𝑑𝒘

For the weight updates, 𝑖 is the index of the layer closest to the input layer.


OpenCL learning resources and reference material

Having never worked with OpenCL before, I ended up working through a number of

tutorials and example programs. Listed below are all the resources I used.

Resource Type

Name Location

PDF, specification

OpenCL 2.0 specification https://www.khronos.org/registry/cl/specs/opencl-2.0.pdf

Website, reference

clBLAS manual and reference http://clmathlibraries.github.io/clBLAS/

Website, reference

clFFT manual and reference http://clmathlibraries.github.io/clFFT/

Book Heterogeneous Computing with OpenCL 2.0, By David Kaeli,

Perhaad Mistry, Dana Schaa and Dong Ping Zhang

http://developer.amd.com/partners/university-programs/heterogeneous-computing-

with-opencl/

Website, tutorial

Oak Ridge laboratory, OpenCL vector addition tutorial

https://www.olcf.ornl.gov/tutorials/opencl-vector-addition/

Website, tutorial

AMD, Intro to OpenCL tutorial http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-

resources/introductory-tutorial-to-opencl/ Figure 2.4.1. Learning resources

System Design

Development environment The OpenCL specification is written against C++, and is subsequently the language of choice for

this project.

Windows was chosen as the development environment due to personal familiarity with the visual

studio software package. Visual Studio 2015 is used to provide an up to date implementation of

the C++11 specification. In keeping with the project objectives, windows specific code shall be

restricted to the main.cpp file. All other code will be written with the standard template in mind,

and as such should compile under g++ and run on Linux.

Familiarisation with OpenCL showed that developing optimised kernels is difficult. Consequently,

I decided to employ AMD’s clBLAS library where possible. clBLAS provides a set of common

basic linear algebra kernels. AMD also provides clFFT for computing fast fourier transforms.

clFFT was added as an additional dependency so as to assist in implementing fast convolution

layers (Fig. 3.2.1).


Essential Requirements 1. A network class capable of:

a. Constructing multi layer feed forward neural networks. The programmer

should be able to easily specify the number of units within each layer.

b. Training neural networks. Training performance must be reported through

cross validation against test data.

c. Testing neural networks. A method must be implemented that returns

information on the network’s the mean standard error across a batch of test

data.

d. Processing inputs. A method must be implemented that allows the network to

accept a single set of inputs from the main program thread, returning the

corresponding output from the network.

2. A layer class that provides a logical ordering of network computational units.

3. An implementation of the back propagation training algorithm.

4. An implementation of the sigmoid activation function and its corresponding

differential.

5. A sample program capable of demonstrating network training and testing functionality

on different OpenCL devices.

6. Unit testing, testing trained Network accuracy by validating against a dataset

generated from a mathematical function.

Optional Requirements

1. Unit testing, testing trained Network accuracy by validating against a well known pre-

constructed dataset.

2. Implementation of a convolutional layer and convolutional kernel classes. These must

provide:

a. Weight sharing across spatially separated neuron units.

b. Modification to the back propagation algorithm to handle shared weights.

3. An implementation of the linear rectifier activation function and its corresponding

differential.

4. Network regularization. Either through weight decay or dropout.

Implementation Deliverables

1. A Visual Studio 2015 C++ solution containing a working example of the developed deep

neural network library.

2. Headers and associated .cpp definitions with comments describing how the library works.

3. OpenCL kernel code.

4. clBLAS and clFFT included as dynamic link libraries.

Technical Challenges

Feeding the OpenCL device

OpenCL provides a high latency, high throughput bridge between the host device and the

compute device. The host device and compute device share one or more queues. The host

produces jobs and inserts them into a queue. The compute device consumes job items from the

queue. By default, OpenCL creates a serial queue, forcing the compute device to compute jobs

in order. This is not ideal, as some jobs may take only a fraction of the compute device’s


resources. Setting CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when creating the cl_queue will

enable the device to consume jobs out of order.

Each job is associated with an event, which may be in one of four states, queued, submitted,

running, and complete. Jobs are also associated with event completion wait lists, allowing for

synchronization and dependency blocks. Ideally the work queue will be saturated so that

compute device can be continually working on jobs.

Figure 3.1.1. A visualization of job how the queue controls job consumption. The queue is saturated, there are more jobs available for the compute device to consume, as shown by the line in red. The host device is shown in green. Independent jobs are undertaken either in parallel, or in an undetermined serial order.

Behaviour is undefined if the compute device attempts to write or read from cl_mem buffers

being modified by the host device. The reverse is also true. Consequently, the queue must be

utilised to stall both host and compute device until read / write operations are finished. The

number of read and write operations between the host and compute device should be minimised

in order to prevent stalls. As such, as much data as possible should be kept device side.


Figure 3.1.2. A forced synchronization point. The host is attempting to read the cl_mem holding the network’s output. A backward gather compute job is available, but cannot be consumed until the host has finished its read.

OpenCL kernel efficiency considerations

OpenCL kernels are small programs that run on the openCL compute device. OpenCL kernels

are compiled using an OpenCL device context by the host at program start up. The host can

then queue the kernel binary to the compute device as part of a compute task. Similarly, the

openCL host can queue read or write operations to modify or view the contents of cl_mem

buffers held in the compute device’s global cache.

Figure 3.1.3. A depiction of the hardware differences exposed by OpenCL. OpenCL devices typically have access to much larger number of threads. An AMD Fury X GPU has access to 4096 threads.


The specification is designed with massive parallelism in mind. An instance of a submitted kernel

program is launched for each thread in the global work group. The global work group is

subdivided into equally sized local work groups. Each thread has access to a small but very fast

local memory cache, and a slower, but larger work group memory cache. All threads have

access to the global memory cache. Threads may only communicate within their work group.

Task division is primarily achieved using the thread’s unique id, which lies in the range

0 >= x< global work group size. Kernels jobs are only marked as complete once all their threads

have finished, as such the kernel is only as fast as its slowest thread.

It is also worth noting that GPUs often implement reduced instruction sets. Consequently some

function calls can have large overheads. For example, the modulo operator is expensive on

AMD GPU hardware.

Using clFFT

The clFFT library is relatively complex, yet I could only find three example programs. I

subsequently created a small program to see if I could successfully transform real valued 2D

matrix into the complex frequency domain, then back again to the spatial domain. The test was

successful. See Appendix B.1. for the code and B.2. for results.

Implementation Schedule For the original implementation schedule, refer to Appendix C.1. A modified schedule was

created in at the end of December 2015 after the initial project proposal was recognized to be

too complex for the given time frame. See Appendix C.2. Originally I had hoped to demonstrate

basic speech recognition capabilities; however this would require that convolution features be

fully implemented. Other commitments meant that I was unsure whether or not convolution layer

functionality could be implemented in time. Instead I decided that the implementation would

benefit from greater focus on testing core multi layer network functionality and performance.

Design specification

Designing a flexible network architecture

Rather than adding computational units directly into the Layer class, it was decided to wrap them

within a pool class. This gives the programmer

more flexibility when defining network

architecture, as shown by Fig 3.2.1. This was

an early design decision, a result of designing

a way in which convolution layers and

standard unit layers could be integrated in a

complimentary fashion, rather than forcing the

programmer to choose between one or the

other. Layers enforce the sequence in which

the forward and backward passes visit units.

Pass are performed in parallel for pools in the

same layer. MatrixPools are pools of standard

units with biases. ConvPools are pools of

convolutional units arraned into a 2D matrix.

ConvPool units share a single bias between

them for each incoming convolutional kernel.

Figure 3.2.1. A network architecture example that might be used.


Validation tests

All training outputs are normalised into the range 0.0 to 1.0 such that they are compatible with

logistic units typically used by output layers. Linear rectifiers are not suitable for use in the

network output layer.

1. MNIST handwritten character recognition, 60,000 labelled training images, 10,000

labelled testing images. [29]. Network input of 28x28 = 784 LU. Output of 10 SiLU, with

the index of the unit of largest responsive corresponding to the digit’s classification.

Generating random values a, b, c, d, e in the range 0.0 to 1.0.

2. Sin(a), 1000 testing values, 200 training values. Network input of 1 LU. Output of 1 SiLU.

3. sort(a, b, c, d, e) sorting 5 parameters, 1000 testing values, 200 training values. Network

input of 5 LU. Output of 5 SiLU.

4. polynomial, 3.0f*a*a + a + 7.0f*b + 1.0f, 1000 testing values, 200 training values. Network

input of 2 LU. Output of 1 SiLU.


Class hierarchy

Figure 3.3.1. A UML diagram showing the basic relationship between network classes. Important field members are shown. The Network class is intended to provide the primary interface used by the programmer.


Results

Requirement satisfaction

Refer to system design, essential and optional requirements, page 11.

1. a. Full compliance

1. b. Full compliance.

1. c. Full compliance.

1. d. Full compliance.

2. Full compliance.

3. Full compliance.

4. Full compliance.

5. Full compliance.

6. Full compliance.

Optional Requirements

1. Full compliance, MNIST [29] handwritten digit dataset validation provided.

2. Partial compliance. clFFT tests completed. Interface and class structure for

convolution units and kernels added. No implementations currently present.

3. Full compliance. Linear rectifiers are used as the default activation function for hidden

layers.

4. No compliance. A test was conducted with weight decay, but was not found to

increase network test validation accuracy. Consequently it was decided not to include the

weight modification change. Further resting required.


Test validation Results Table 4.1.1. Results from validation runs with varying epoch numbers. The initial learn rate for all tests was 0.001.

OpenCL Device

Validation test type

Training time

(seconds) Epochs Network structure

Training sample

selection

Training passes

per epoch

Mean standard

error

Classification

error

AMD Fury X (8192 GFlops) MNIST 10.941 5

Appendix A.1 random 2000 0.1198 0.1819

Intel i7-6700k (114 Gflops) MNIST 65.4442 5










AMD Fury X (8192 GFlops) Sin(x) 9.3542 20

Appendix A.2 all 800 0.0124 N/A

Intel i7-6700k (114 Gflops) Sin(x) 19.0043 20










AMD Fury X (8192 GFlops)

Sort(a, b, c, d, e) 25.9588 20


Intel i7-6700k (114 Gflops)

Sort(a, b, c, d, e) 169.0974 20



Sort(a, b, c, d, e) 25.6606 20



Sort(a, b, c, d, e) 173.9321 20



Sort(a, b, c, d, e) 25.6168 20



Sort(a, b, c, d, e) 169.9004 20


AMD Fury X (8192 GFlops) Polynomial 17.789 20


Intel i7-6700k (114 Gflops) Polynomial 90.2876 20

Appendix A.1.4 all 800 0.0315 N/A














MNIST classification examples

Figure 4.2.1. A randomly sampled 2

misclassified by the neural network a 0.

Figure 4.2.1. A randomly sampled 2

that is correctly classified.

Figure 4.2.1. A randomly sampled

5 that is correctly classified.

Result Discussion Taking the mean average ratio of i7-6700k run times over Fury X run times from table 4.1.1,

gives a mean ratio of 4.97. This is low considering the Fury X has 8192 GFlops of compared to

the i7-6700k’s 114, which would suggest a ratio closer to 72. It is possible that the task queue is

not saturated, and that the OpenCL device is idling for a number of cycles, which would suggest

the main threa is causing throttling. Alternatively, it is possible that an OpenCL kernel is causing

a bottleneck due to poor optimisation. Further investigation is required.

Overall performance is acceptable on the Fury X, but has a some way to go before have

comparable performance of popular public libraries. The 10 epoch fury X test with a 5000

sample rate completed training in 165 seconds, and had 1,099,770 trainable parameters. A


similar network was setup within python using theaon, via python and lasagne, to provide a

reference. The theano network had 945,768 parameters, and achieved a training time of 44

seconds on an i7-7600k over 10 epochs. Final accuracy was relatively similar. My OpenCL

implementation achieved a misclassification rate of 10%. Theano achieved an error of 8%.

Recognition rate was good, taking 15.5 seconds to recognise all 10,000 MNIST test images,

giving an image per second rate of 645. Multiplying out by the size of the input 28x28 = 784, this

gives a total rate of 505,680 inputs processed per second. Caffe’s OpenCL branch is

approximately 90x faster as processing inputs, and significantly faster at training. Though it is

worth noting that batching is used for the caffe test results published on github.

The i7-6700k’s training could be quite long on my OpenCL implementation. For example, the 20

epoch MNIST test with a 2000 sample rate took 136 seconds, despite having only 218,842

training parameters.

A longer training session was undertaken using the network described in Appendix A.1.5.,

achieving a good final error rate of 4.5%, the same as what was achieved by a two layer neural

network by a popular publication on document recognition [29][30]. The network also proved

accurate over the modelled mathematical functions: sin(x), sort(a, b, c, d, e) and the polynomial

function, achieving best respective errors of 8.4%, 15%, 19%.

Evaluation

Further Work 1. Debugging performance issues.

2. Finishing integration of optional requirements.

3. Possibly worth investigating the removal of the majority of queue jobs by calling kernels

from the device. OpenCL 2.0 allows compute devices to make kernel calls. This feature

was not explored, as it adds significant design complexity. clBLAS would have to be

modified to handle custom kernel post / pre callback. clFFT supports this feature.

Conclusion Considering the complexity of the project, I believe the outcome to be reasonable. A cross

platform deep learning library was developed in C++, and demonstrated to work successfully on

a range of tasks. Though performance was not ideal, I am confident the bottlenecks could be

identified by isolating the execution times for the called OpenCL kenerls.


Deployment guide

Hardware requirements:

OpenCL 2.0 compatible device

x64 Windows environment (tested on windows, 7, 9, 10)

Software requirements:

AMD App SDK 3.00 or greater

Building from source requires visual studio 2015 or newer

1. Proceed to http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-

parallel-processing-app-sdk/.

2. Download and install AMD APP SDK 3.0 for windows 64 bit.

3. Unzip Code_Base.zip

Running the binary:

4. Proceed to the “./Backpropagation/Bin” folder

5. Run Backpropagation.exe

Compiling from source:

4. Proceed to the “./Backpropagation/Backpropagation” folder

5. Open visual studio 2015

6. Click file -> open project/solution

7. Open Backpropagation.sln

8. Press ctl + f5 to compile and run

http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/

http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/


Bibliography

[1] Sainath, Tara N., Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran. "Deep

convolutional neural networks for LVCSR." InAcoustics, Speech and Signal Processing (ICASSP), 2013

IEEE International Conference on, pp. 8614-8618. IEEE, 2013.

[2] https://research.facebook.com/blog/fair-open-sources-deep-learning-modules-for-torch/

[3] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,

Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman,

Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray

Kavukcuoglu, Thore Graepel, and Demis Hassabis. "Mastering the Game of Go with Deep Neural

Networks and Tree Search." Nature 529, no. 7587 (2016): 484.

[4] Linnainmaa, Seppo. "The representation of the cumulative rounding error of an algorithm as a Taylor

expansion of the local rounding errors." Master's Thesis (in Finnish), Univ. Helsinki (1970): 6-7.

[5] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by

error propagation. No. ICS-8506. CALIFORNIA UNIV SAN DIEGO LA JOLLA INST FOR COGNITIVE

SCIENCE, 1985.

[6] Rumelhart, D.E., Hinton, G.E. and Williams, R.J., 1988. Learning representations by back-propagating

errors. Cognitive modeling, 5(3), p.714.

[7] Mathieu, Michael, Mikael Henaff, and Yann LeCun. "Fast training of convolutional networks through

FFTs." arXiv preprint arXiv:1312.5851 (2013).

[8] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: A simple

way to prevent neural networks from overfitting.The Journal of Machine Learning Research, 15(1),

pp.1929-1958.

[9] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory."Neural computation 9, no. 8

(1997): 1735-1780.

[10] Martínez-Zarzuela, Mario, Francisco Javier Díaz Pernas, José Fernando Díez Higuera, and Míriam Antón Rodríguez. "Fuzzy ART neural network parallel computing on the GPU." In Computational and Ambient Intelligence, pp. 463-470. Springer Berlin Heidelberg, 2007. [11] (Shader model 5 for DirectX), accessed 21/ 05/ 2016, https://www.google.co.uk/search?q=shader+model+5&oq=shader+model+5&aqs=chrome..69i57.3354j0j7&sourceid=chrome&ie=UTF-8 [12] John Kessenich, Dave Baldwin, Randi Rost, “The OpenGL Shader language”, https://www.opengl.org/registry/doc/GLSLangSpec.4.50.pdf

[13] http://www.nvidia.co.uk/object/cuda-parallel-computing-uk.html, accessed 22/05/2016 [14] https://www.khronos.org/opencl/, accessed 22/05/2016 [15] http://developer.amd.com/tools-and-sdks/opencl-zone/, accessed 22/05/2016 [16] https://software.intel.com/en-us/intel-opencl?cid=sem43700008896000156&intel_term=intel+openCL&gclid=CjwKEAjwsYW6BRCTzvu5y8DPhi0SJABnGLlHWfkJo5tNdbBubNlnsqdz_nyHUSfm6SPPlECfXbtAgxoCSvXw_wcB&gclsrc=aw.ds, accessed 22/05/2016


[17] https://developer.nvidia.com/gpu-accelerated-libraries, accessed 22/05/2016

[18] https://developer.nvidia.com/cuda-gpus, accessed 22/05/2016

[19] https://www.khronos.org/conformance/adopters/conformant-products#opencl, accessed 22/05/2016

[20] https://github.com/amd/OpenCL-caffe/wiki/How-to-set-up-clBLAS-and-OpenCL, accessed 22/05/2016

[21] https://github.com/amd/OpenCL-caffe, accessed 22/05/2016

[22] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Imagenet classification with deep convolutional

neural networks. In Advances in neural information processing systems (pp. 1097-1105).

[23] Kulkarni, Sanjeev, and Harman, Gilbert. "Multilayer Networks." In Wiley Series in Probability and

Statistics, 99-115. Hoboken, NJ, USA: John Wiley & Sons, 2011.

[24] Rosenblatt, Frank. "The perceptron: a probabilistic model for information storage and organization in

the brain." Psychological review 65, no. 6 (1958): 386.

[25] Narsky, Ilya, and Porter, Frank C. "Neural Networks." In Statistical Analysis Techniques in Particle

Physics, 251-63. Weinheim, Germany: Wiley‐VCH Verlag GmbH & KGaA, 2013. Chapter 12.

[26] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Deep sparse rectifier neural networks."

In International Conference on Artificial Intelligence and Statistics, pp. 315-323. 2011.

[27] Simard, P.Y., Steinkraus, D. and Platt, J.C., 2003, August. Best practices for convolutional neural

networks applied to visual document analysis. In null(p. 958). IEEE.

[28] Mathieu, M., Henaff, M. and LeCun, Y., 2013. Fast training of convolutional networks through

FFTs. arXiv preprint arXiv:1312.5851.

[29] http://yann.lecun.com/exdb/mnist/, accessed 22/05/2016

[30] LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. "Gradient-based learning applied to document

recognition." Proceedings of the IEEE86, no. 11 (1998): 2278-2324.


Appendices

A - Network validation architectures

A.1. MNIST

Trainable parameters 21,8842

A.2. sin(a)

Trainable parameters 387


A.3. sort(a, b, c, d, e)


A.4. polynomial



A.5. MNIST

Trainable parameters 1,099,770

B – clFFT library expeiment

B.1. Fourier transform and inverse fourier transform via clFFT and OpenCL

/* ************************************************************************ * Copyright 2013 Advanced Micro Devices, Inc. * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. * You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. * ************************************************************************/ /* ************************************************************************ * Copyright Callum McMahon * * Added inverse hermitian transform, showing how data can * be to transformed back to spatial domain. * Terminal outputs after inverse should match the original dataset. * ************************************************************************/ /* No need to explicitely include the OpenCL headers */


#include <clFFT.h> #include <stdio.h> #include <stdlib.h> #include <math.h> int main(void) { system("MODE CON COLS=80 LINES=1024"); cl_int err; cl_platform_id platform = 0; cl_device_id device = 0; cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 }; cl_context ctx = 0; cl_command_queue queue = 0; cl_mem bufX, bufY; float *X, *Y; cl_event event = NULL; int ret = 0; const size_t N0 = 8, N1 = 8; char platform_name[128]; char device_name[128]; /* FFT library realted declarations */ clfftPlanHandle planHandle; clfftDim dim = CLFFT_2D; size_t clLengths[2] = { N0, N1 }; int fac = ((N1 / 2) + 1);//=N1; //size_t l = N0; size_t clOutStrides[2] = { 1, fac }; size_t clInStrides[2] = { 1, N0 }; /* Setup OpenCL environment. */ err = clGetPlatformIDs(1, &platform, NULL); size_t ret_param_size = 0; err = clGetPlatformInfo(platform, CL_PLATFORM_NAME, sizeof(platform_name), platform_name, &ret_param_size); printf("Platform found: %s\n", platform_name); err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL); err = clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(device_name), device_name, &ret_param_size); printf("Device found on the above platform: %s\n", device_name); props[1] = (cl_context_properties)platform; ctx = clCreateContext(props, 1, &device, NULL, NULL, &err); queue = clCreateCommandQueueWithProperties(ctx, device, 0, &err); /* Setup clFFT. */ clfftSetupData fftSetup; err = clfftInitSetupData(&fftSetup); err = clfftSetup(&fftSetup); /* Allocate host & initialize data. */ /* Only allocation shown for simplicity. */ size_t buffer_size_x = N0 * N1 * sizeof(*X);


size_t buffer_size_y = ((N0+2) * N1) * sizeof(*Y); X = (float *)malloc(buffer_size_x); Y = (float *)malloc(buffer_size_y); /* print input array just using the * indices to fill the array with data */ printf("\nPerforming fft on an two dimensional array of size N0 x N1 : %ld x %ld\n", N0, N1); int i, j; i = j = 0; for (i = 0; i<N0; ++i) { for (j = 0; j<N1; ++j) { float x = 0.5f; float y = 0.5f; unsigned idx = (j + i*N0); X[idx] = sin(1.0f*(float)i) + cos(0.4f*(float)j); printf("\n(%f) ", X[idx]); } printf("\n"); } /* Prepare OpenCL memory objects and place data inside them. */ bufX = clCreateBuffer(ctx, CL_MEM_READ_WRITE, buffer_size_x, NULL, &err); //CL_MEM_READ_ONLY bufY = clCreateBuffer(ctx, CL_MEM_READ_WRITE, buffer_size_y, NULL, &err); err = clEnqueueWriteBuffer(queue, bufX, CL_TRUE, 0, buffer_size_x, X, 0, NULL, NULL); /* Create a default plan for a complex FFT. */ err = clfftCreateDefaultPlan(&planHandle, ctx, dim, clLengths); /* Set plan parameters. */ err = clfftSetPlanPrecision(planHandle, CLFFT_SINGLE); err = clfftSetLayout(planHandle, CLFFT_REAL, CLFFT_HERMITIAN_INTERLEAVED); err = clfftSetResultLocation(planHandle, CLFFT_OUTOFPLACE); err = clfftSetPlanOutStride(planHandle, dim, clOutStrides); err = clfftSetPlanInStride(planHandle, dim, clInStrides); /* Bake the plan. */ err = clfftBakePlan(planHandle, 1, &queue, NULL, NULL); /* Execute the plan. */ err = clfftEnqueueTransform(planHandle, CLFFT_FORWARD, 1, &queue, 0, NULL, NULL, &bufX, &bufY, NULL); /* Wait for calculations to be finished. */ err = clFinish(queue); /* Fetch results of calculations. */ err = clEnqueueReadBuffer(queue, bufY, CL_TRUE, 0, buffer_size_y, Y, 0, NULL, NULL); /* print output array */ printf("\n\nfft result: \n"); i = j = 0; for (i = 0; i<N0; ++i) { for (j = 0; j<fac; ++j) { unsigned idx = 2 * (j + i*fac); printf("\n(%f) ", sqrt(Y[idx] * Y[idx] + Y[idx+1] * Y[idx+1])); //fiddle with restults to test //Y[idx] += 0.01f*(float)idx;


} printf("\n"); } printf("\n"); //***************** //revserse! //***************** printf("\n\n *** reverse ***\n\n"); //clOutStrides[0] = { 1, fac }; //clInStrides[0] = { 1, N0 }; err = clEnqueueWriteBuffer(queue, bufY, CL_TRUE, 0, buffer_size_y, Y, 0, NULL, NULL); /* Create a default plan for a complex FFT. */ err = clfftCreateDefaultPlan(&planHandle, ctx, dim, clLengths); /* Set plan parameters. */ err = clfftSetPlanPrecision(planHandle, CLFFT_SINGLE); err = clfftSetLayout(planHandle, CLFFT_HERMITIAN_INTERLEAVED, CLFFT_REAL); err = clfftSetResultLocation(planHandle, CLFFT_OUTOFPLACE); err = clfftSetPlanOutStride(planHandle, dim, clInStrides); err = clfftSetPlanInStride(planHandle, dim, clOutStrides); /* Bake the plan. */ err = clfftBakePlan(planHandle, 1, &queue, NULL, NULL); /* Execute the plan. */ err = clfftEnqueueTransform(planHandle, CLFFT_FORWARD, 1, &queue, 0, NULL, NULL, &bufY, &bufX, NULL); /* Wait for calculations to be finished. */ err = clFinish(queue); /* Fetch results of calculations. */ err = clEnqueueReadBuffer(queue, bufX, CL_TRUE, 0, buffer_size_x, X, 0, NULL, NULL); i = j = 0; for (i = 0; i<N0; ++i) { for (j = 0; j<N1; ++j) { float x = 0.5f; float y = 0.5f; unsigned idx = (j + i*N0); printf("\n(%f) ", X[idx]); } printf("\n"); } //***************** //revserse END //***************** /* Release OpenCL memory objects. */ clReleaseMemObject(bufX); free(X); clReleaseMemObject(bufY); free(Y); /* Release the plan. */


err = clfftDestroyPlan(&planHandle); /* Release clFFT library. */ clfftTeardown(); /* Release OpenCL working objects. */ clReleaseCommandQueue(queue); clReleaseContext(ctx); getchar(); return ret; }

B.2. Program outputs from B.1. Showing only the first column for succinctness.

Platform found: Intel(R) OpenCL

Device found on the above platform: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz

Performing fft on an two dimensional array of size N0 x N1 : 8 x 8

(1.000000)

(0.921061)

(0.696707)

(0.362358)

(-0.029200)

(-0.416147)

(-0.737394)

(-0.942222)

fft result:

(11.271166)

(27.725875)

(11.865518)

(8.765699)

(8.040510)

*** reverse ***

(1.000000)

(0.921061)

(0.696707)

(0.362358)

(-0.029200)

(-0.416147)

(-0.737394)

(-0.942222)


C – Gantt time plans

C.1. Original Gantt time plane


C.2. Modified Gantt time plane

professional project - c++ opencl - platform agnostic hardware acceleration for deep neural...

Documents