
SpiNNaker Demonstration - University of Manchester (studentnet.cs.manchester.ac.uk/resources/library/3...)

SpiNNaker Demonstration

Student: Ion Diaconu, Supervisor Steve Furber

May 3, 2016


Abstract

This report details my efforts in attempting to implement a face and facial expression recognizer on SpiNNaker. I started by researching spiking neural networks and current vision models and by familiarizing myself with Nengo (the neural description language I used). Afterwards I put together a set of functional and non-functional requirements for my project. Using these, I designed a 3-layer feed-forward neural network which could theoretically accomplish the task. I took a modular programming approach to implementing the solution by first making a list of modules/milestones with their perceived difficulty and estimated time of completion. I also implemented the model in plain Python, both as a benchmark and as a proof of concept for the model itself. I managed to make a convolution network on SpiNNaker as well as a simple classifier, but my full model turned out to be too big (far too many neurons and populations) to be able to run on SpiNNaker. Attempts at reducing the resolution and neural approximation led to an unsatisfying accuracy. Still, the model could remain viable if implemented in a different manner.


Contents

1 Introduction
  1.1 Project Overview and Aim
  1.2 Spiking Neural Networks
  1.3 SpiNNaker and Nengo
  1.4 Vision in Biology
  1.5 Motivation and Existing Solutions
  1.6 Functional and non-Functional Requirements

2 Design
  2.1 Overall Design
  2.2 Preprocessing
  2.3 Edge Enhancing
  2.4 Gabor Filtering
  2.5 Classifier
  2.6 Implementation Approach

3 Implementation
  3.1 Preprocessing
  3.2 Edge Enhancing
  3.3 Gabor Filtering
  3.4 Classifier

4 Testing and Validation

5 Conclusion


List of Figures

1 Different ways of approximating firing rates. Image from Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems - Peter Dayan, L. F. Abbott, pg 10

2 Average firing rate of a cat's primary visual cortex V1 neurons plotted as a function of the orientation of the light bar stimulus. Image from Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems - Peter Dayan, L. F. Abbott, pg 10; data points from Henry et al., 1974

3 SpiNNaker chip architecture. Image from http://apt.cs.manchester.ac.uk/projects/SpiNNaker/SpiNNchip/

4 Chip layout on a SpiNNaker board. Image from http://apt.cs.manchester.ac.uk/projects/SpiNNaker/architecture/

5 A represents a simple neuron. The visual information goes through the receptive field and, using a threshold, a response is obtained. B represents a complex neuron, which sums the responses of multiple simple neurons to form its own response. Figure from http://www.scholarpedia.org/article/Area_V1#Receptive_fields

6 The overall design of the model. Image modified from:

7 A list of tasks with their estimated finish time as well as a perceived difficulty colour code. Green - no considerable difficulty, Yellow - moderate difficulty, Brown - challenging, Red - very challenging

8 The convoluter unit.

9 The classifying layer (with 3 trained classifiers)


1 Introduction

This chapter covers the overall project scope, goal and motivation. It presents the functional and non-functional requirements of the project, as well as some perceived difficulties in implementing it.

1.1 Project Overview and Aim

The project title is "SpiNNaker Demonstration", which allows for freedom of scope. After researching the possibilities offered by the platform, I decided to attempt to implement a face and facial expression recognizer using a spiking neural network running on SpiNNaker. The program would be given a number of faces with different facial expressions expressing 7 emotions (Neutral, Angry, Disgusted, Afraid, Happy, Surprised and Sad), which would be used to train the network. After training, the network would be able to receive as input an image or video of a face and successfully identify the person and his/her facial expression.

INSERT IMAGE HERE

1.2 Spiking Neural Networks

What are artificial neural networks? Artificial neural networks are a class of computational models inspired by their biological equivalent in the brain. They work by approximating a function, generally unknown, which has a large number of inputs. The basic building block of a neural network is the neuron. Like its biological equivalent, it has a number of inputs X1, X2, ..., Xn (dendrites) which come from either arbitrary inputs (sensory organs) or other neurons. These inputs accumulate in the neuron and, after a certain threshold T is met, the neuron fires what is called an action potential (spike) through a single output (axon), which is followed by a rest period called the post-synaptic break. The inputs also have synaptic weights W1, W2, ..., Wn assigned to them depending on the strength of the connection, which can also change over time (plasticity).

T ≤ ∑_{i=1}^{n} X_i W_i  ⇒  SPIKE

Artificial neural networks (ANNs) have been around in the world of computer science for over sixty years. The McCulloch-Pitts threshold neuron was one of the earliest models introduced and is known to represent what is called the first generation of ANNs. The model itself was very simple: if the sum of the weighted inputs a neuron received was over a threshold, it would send a 'high' signal; otherwise it would send a 'low' signal. However, when arranged in a multi-layered structure, such neurons could approximate a large number of functions and are universal for digital computations.
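The threshold behaviour described above can be sketched in a few lines of Python (the function name and the AND-gate example are illustrative, not part of the original model description):

```python
def mcculloch_pitts(inputs, weights, threshold):
    """First-generation threshold neuron: output a 'high' signal (1) only
    when the weighted input sum reaches the threshold T, otherwise 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# With weights (1, 1) and threshold 1.5 the neuron behaves as a logical
# AND gate: it fires only when both binary inputs are high.
and_gate = [mcculloch_pitts(pair, [1, 1], 1.5)
            for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Networks of such units, wired in layers, give the universality for digital computation mentioned above.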

The second generation built upon the first by removing the threshold and instead using a continuous activation function. This allowed for analogue output, which meant that the networks could approximate analogue functions arbitrarily well. As such, second generation ANNs are universal for analogue computations.

The first two generations do not count the individual spikes. Instead their output (which usually lies between 0 and 1) works as a normalized firing rate at


Figure 1: Different ways of approximating firing rates. Image from Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems - Peter Dayan, L. F. Abbott, pg 10

a certain point in time. As such, the second generation is better at approximating the firing rate than the first generation, which is limited in its approximation since it can only output 0 or 1, losing some of the actual spike information.

In biology, determining the exact firing rate of a neuron (or a cluster of neurons) is fairly difficult since, as a probability density function, it cannot be determined from a limited range of data. There are, however, methods for approximating it, such as dividing the time into discrete periods of ∆t, counting the spikes in each period and then dividing the counts by ∆t. Figure 1 shows the result of a number of different methods.
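The binning method just described can be sketched as follows (function name, bin width and the evenly spaced spike train are illustrative):

```python
import numpy as np

def binned_firing_rate(spike_times, t_end, dt):
    """Approximate the firing rate by counting spikes in windows of width
    dt and dividing each count by dt (rate in spikes per second)."""
    edges = np.arange(0.0, t_end + dt, dt)
    counts, _ = np.histogram(spike_times, bins=edges)
    return counts / dt

# Ten evenly spaced spikes over one second, estimated with 0.2 s bins:
# each bin holds two spikes, so every estimate is 10 spikes/s.
spikes = np.linspace(0.05, 0.95, 10)
rates = binned_firing_rate(spikes, t_end=1.0, dt=0.2)
```

Smaller bins trade a noisier estimate for better time resolution, which is the trade-off the different curves in figure 1 illustrate.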

Third generation neural networks (or spiking neural networks) get even closer to the actual biological neurons by firing individual spikes. This incorporates a time component into the information sent (just like real neurons), which allows for multiplexing of information. Neurons which can send and receive individual spikes can use pulse-coding schemes (the spikes or pulses represent the actual information), which can be much faster than rate-coding schemes (the information is in the firing rate). This is especially important for time-sensitive information, such as identifying whether a person you are looking at is a friend or a stranger.

If we observe the neural response of a cluster of neurons to a stimulus s


Figure 2: Average firing rate of a cat's primary visual cortex V1 neurons plotted as a function of the orientation of the light bar stimulus. Image from Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems - Peter Dayan, L. F. Abbott, pg 10; data points from Henry et al., 1974

over a period of time, we can plot the average firing rate as a function of the stimulus, f(s). This is called a tuning curve (figure 2). By changing the neuron parameters (weights, activation function, post-synaptic break etc.) we can adjust the tuning curve to fit the function we desire. As such, we can "train" neuron clusters to approximate functions in order to use them.

REF1: Abbott, L. F. & Nelson, S. B. Synaptic Plasticity: taming the beast, Nature Neuroscience Review, vol. 3, p.1178-1183 (2000).

REF2: Maass, W. The Third Generation of Neural Network Models, Technische Universitat Graz (1997).

REF3: Ferster, D. & Spruston, N. Cracking the neuronal code, Science, vol. 270, p.756-757 (1995).

REF4: Thorpe, S., Delorme, A., Van Rullen, R. Spike based strategies for rapid processing, Neural Networks, vol. 14(6-7), p.715-726 (2001).

1.3 SpiNNaker and Nengo

SpiNNaker (Spiking Neural Network Architecture) is a project aimed at building a massively parallel computer capable of simulating and capturing very large spiking neural networks, in order to emulate and better understand how neural processing works in our brains. (REF1)

SpiNNaker is made out of SpiNNaker multi-core chips. Each chip is a Globally Asynchronous Locally Synchronous (GALS) system with eighteen ARM968 processors. The inter-processor infrastructure is based on an efficient multicast solution inspired by neurobiology. The communication is based on packets, which only carry information about the sender and the receiver, with the infrastructure being responsible for delivering them to their destination (figure 3).

Each chip also has 128 Mbytes of SDRAM (Synchronous Dynamic Random Access Memory). Each chip has six bidirectional inter-chip links and is arranged


Figure 3: SpiNNaker chip architecture. Image from http://apt.cs.manchester.ac.uk/projects/SpiNNaker/SpiNNchip/

on the board in a "hexagonal" format, allowing for fast multicast (figure 4). The specifics of how SpiNNaker works are outside the scope of this project, but what is important is that SpiNNaker allows for real-time simulation of massive spiking neural networks, such as one dealing with facial recognition.

SpiNNaker currently supports two neural networking languages: PyNN and Nengo. For this project I have chosen to work with Nengo, since Spaun, the world's largest functional brain model, was developed in Nengo. Nengo (Neural ENGineering Objects) is a spiking neural network building and simulation language developed over the last several years at the Centre for Theoretical Neuroscience at the University of Waterloo. With nengoSpiNNaker (a SpiNNaker backend for Nengo), any solution built in Nengo can be run on SpiNNaker.

REF1: Furber, S. and Temple, S. Neural Systems Engineering. Journal of the Royal Society Interface. 2007, 4.

REF2: Furber, S., et al. Overview of the SpiNNaker System Architecture. IEEE Transactions on Computers. 2012, Vol. PP, 99.


Figure 4: Chip layout on a SpiNNaker board. Image from http://apt.cs.manchester.ac.uk/projects/SpiNNaker/architecture/

1.4 Vision in Biology

The processing of visual information happens in the visual cortex. According to the ventral/dorsal model, information from the visual organs accumulates in the lateral geniculate nucleus (LGN) and is then transmitted to the primary visual cortex area V1. (REF1) The information is then transmitted along two different pathways: the ventral stream (specialised in recognition and object classification) and the dorsal stream (specialised in kinetic motion and distance perception). REF2 The process is roughly feed-forward, with higher layers receiving information from the lower levels (from the retina to the LGN, to V1, down the dorsal/ventral stream). Remarkably, the information is considerably compressed without any noticeable loss. REF3

In the primary visual cortex area V1, the neurons are usually divided into two types, simple and complex, depending on the structure of their receptive fields. Simple neurons have their receptive field divided into ON regions (responding to light onset/dark offset) and OFF regions (responding to light offset/dark onset) and have the role of enhancing certain features (such as edges). Complex neurons can be thought of as integrating the output of multiple simple neurons in order to detect orientated edges and grids. (REF4 REF5) Figure 5 shows a simplified representation.

For facial/expression recognition, the information from V1 proceeds along the ventral stream through visual area V2, then through visual area V4, and on to the inferior temporal cortex (IT cortex). The exact functions of each visual area are not known for certain and are outside the scope of this project. In essence, V2, V4 and the IT cortex do increasingly complex recognition (from edges, orientations, size and colour to complex shapes) and are connected to memory (V2 was found to play an important role in object recognition memory and in short to long term memory conversion). REF6 REF8 The entire process


Figure 5: A represents a simple neuron. The visual information goes through the receptive field and, using a threshold, a response is obtained. B represents a complex neuron, which sums the responses of multiple simple neurons to form its own response. Figure from http://www.scholarpedia.org/article/Area_V1#Receptive_fields


is roughly feed-forward, although feed-back does occur (especially between V1 and V2, and between V2 and V4/IT). REF7

REF1: Cudeiro, Javier; Sillito, Adam M. (2006). "Looking back: corticothalamic feedback and early visual processing". Trends in Neurosciences 29 (6): 298–306. doi:10.1016/j.tins.2006.05.002. PMID 16712965.

REF2: Ungerleider LG, Mishkin M (1982). "Two Cortical Visual Systems". In Ingle DJ, Goodale MA and Mansfield RJW. Analysis of Visual Behavior. Boston: MIT Press. pp. 549–586.

REF3: L. Zhaoping, "Theoretical understanding of the early visual processes by data compression and data selection," Network: Computation in Neural Systems, vol. 17, no. 4, pp. 301–334, 2006.

REF4: Movshon JA, Thompson ID, Tolhurst DJ (1978b). Receptive field organization of complex cells in the cat's striate cortex. J Physiol.

REF5: S. G. Wysoski, L. Benuskova, and N. Kasabov, "Fast and adaptive network of spiking neurons for multi-view visual pattern recognition," Neurocomputing, vol. 71, no. 13-15, pp. 2563–2575, 2008.

REF6: Bussey, T. J.; Saksida, L. M. (2007). "Memory, perception, and the ventral visual-perirhinal-hippocampal stream: thinking outside of the boxes". Hippocampus 17 (9): 898–908. doi:10.1002/hipo.20320. PMID 17636546.

REF7: Stepniewska, I.; Kaas, J. H. (1996). "Topographic patterns of V2 cortical connections in macaque monkeys". The Journal of Comparative Neurology 371 (1): 129–152. doi:10.1002/(SICI)1096-9861(19960715)371:1<129::AID-CNE8>3.0.CO;2-5. PMID 8835723.

REF8: Lopez-Aranda et al. (2009). "Role of Layer 6 of V2 Visual Cortex in Object Recognition Memory". Science 325 (5936): 87–89. doi:10.1126/science.1170869. PMID 19574389.

1.5 Motivation and Existing Solutions

For as long as we have had cameras and computers, there has been a demand for face and facial expression recognition. The human facial expression conveys a large amount of non-verbal information about the emotional and physical state of the sender. REF1 Small facial movements, eye gaze shifting or nostril flaring all give us crucial context in which to interpret the information we receive. Understanding the difference between thanks given with a large smile and thinned eyes or with a frown and pursed lips is a crucial skill for successful communication.

If human-computer interactions are ever to evolve into a natural, intuitive and easy to use interface, software which can achieve such recognition on a level similar to a human will be necessary. REF3 There are also other applications for such a technology. It can be used in security or law enforcement (instead of biometrics, or for identifying suspects) or even as a new marketing tool (tracking responses to certain products). REF2

The first attempt at automatic facial expression recognition was made in 1978 by Suwa et al., who presented a method for analysing facial expressions from multiple frames using twenty tracking points. Although the proposal was made in 1978, it took until the 90's for facial expression recognition to become actively pursued. This was mostly due to expensive computing power as well as a lack of interest in human computer interaction. REF4


Since the 90's, face and facial expression recognition has come far. With increased computing power, better cameras and techniques such as analysing the relative position and shape of the nose, mouth, jaw and eyes, or building 3D models of the face, facial recognition algorithms have achieved human-level accuracy in ideal conditions (good lighting and a frontal image). Even in less than ideal conditions, the systems can still maintain over 80% accuracy. REF5 There have been a few attempts at building a visual face and facial expression system based on the human visual cortex using spiking neural networks. Two of them are Delorme's one-spike-per-neuron face recognition model REF6 and Si-Yao Fu's cortex-like mechanism for facial expression recognition. REF7 The model this project uses is inspired by both of them.

REF1: G. Donato, M. S. Bartlett, J. C. Hager, P. Ekman, T. J. Sejnowski, "Classifying Facial Actions", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, no. 10, 1999.

REF2: "Facial Recognition: Who's Tracking You in Public?". Consumer Reports. Retrieved 04-04-2016.

REF3: A. van Dam, "Beyond WIMP", IEEE Computer Graphics and Applications, vol. 20, no. 1.

REF4: A. Samal and P. A. Iyengar, "Automatic Recognition and Analysis of Human Faces and Facial Expressions: A Survey," Pattern Recognition, vol. 25, no. 1, pp. 65-77, 1992.

REF5: https://arxiv.org/ftp/arxiv/papers/1203/1203.6722.pdf

REF6: "Face identification using one spike per neuron: resistance to image degradations", Arnaud Delorme and Simon J. Thorpe, Neural Networks, 14(6-7), 795-804, 2001.

REF7: "A Spiking Neural Network Based Cortex-Like Mechanism and Application to Facial Expression Recognition", Si-Yao Fu, Guo-Sheng Yang, and Xin-Kai Kuai, Hindawi Publishing Corporation, Computational Intelligence and Neuroscience, Volume 2012, Article ID 946589, 13 pages, doi:10.1155/2012/946589.

1.6 Functional and non-Functional Requirements

Functional:
- Be able to detect faces
- Be able to classify the 7 facial expressions accurately
- Be able to recognize faces accurately

Non-functional:
- Be resistant to bad image quality
- Recognize/classify in real-time
- Be resistant to make-up, slight angles, hats or glasses

2 Design

This chapter covers the project's overall design. It also covers the implementation philosophy and plan.

2.1 Overall Design

The software has a layered design, with each layer passing along information in a feed-forward fashion. There is no back-propagation, in the interest of fast recognition. There are a total of four layers: Preprocessing, Edge Enhancing, Gabor Filtering and Classifying. Each layer will be explained in detail in its own chapter, but the overall structure is as shown in figure 6.


Figure 6: The overall design of the model. Image modified from:

2.2 Preprocessing

In the interest of meeting the functional requirements, the project assumes ideal input conditions (the input face will be centred in the visual field, with minimal variance in the distance from the lens and good illumination). However, in order to make the classification more robust, I have added this layer, whose sole purpose is improving accuracy and speed by optimizing the condition of the input. It also serves as an abstraction of some of the brain's not so well understood "features", such as near lossless compression.

Preprocessing has four steps: conversion from RGB to grayscale, illumination normalization, downscaling and face/eye/mouth detection.

The conversion from RGB to grayscale is done in order to remove unnecessary colour. While research has shown that colour does indeed have a role in recognition and can boost accuracy in certain models REF1, in this particular model it just obscures the important information. The conversion itself can be easily accomplished by applying the following operation to each pixel:

GrayscalePixel = 0.2989 · Pixel_R + 0.5870 · Pixel_G + 0.1140 · Pixel_B

Illumination normalization is a well-known technique in the field of facial recognition which helps reduce the impact of poor illumination on the accuracy of the model. Poor illumination has a big detrimental effect on the accuracy of most models and is one of the challenges in constructing a robust system. REF2 Illumination normalization refers to reducing the overall standard deviation, in order to avoid having the difference between poorly and well illuminated areas affect the accuracy of the model. A simple but effective form of illumination normalization can be accomplished in grayscale by applying the following formula to each pixel:

NewPixel = 255 · (OldPixel − min(AllOldPixels)) / (max(AllOldPixels) − min(AllOldPixels))
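Both formulas translate directly into NumPy; the following is a minimal sketch (function names and the tiny test image are illustrative):

```python
import numpy as np

def to_grayscale(rgb):
    """Luminance-weighted RGB -> grayscale conversion, using the
    0.2989/0.5870/0.1140 channel weights from the formula above."""
    return 0.2989 * rgb[..., 0] + 0.5870 * rgb[..., 1] + 0.1140 * rgb[..., 2]

def normalize_illumination(gray):
    """Min-max stretch of a grayscale image onto the full 0..255 range."""
    lo, hi = gray.min(), gray.max()
    return (gray - lo) * 255.0 / (hi - lo)

# A 1x2 "image" with a dark and a bright pixel, stretched to 0 and 255.
img = np.array([[[10, 10, 10], [200, 200, 200]]], dtype=float)
gray = to_grayscale(img)
norm = normalize_illumination(gray)
```

Note that the stretch divides by max − min, so a completely uniform image would need special-casing in practice.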

Downscaling the image might be a necessary action in order to reduce the overall number of pixels and improve performance. SpiNNaker can currently support up to 24 48-node boards, which limits the number of neuron clusters that can be simulated at a time. There are also other concerns, such as long set-up times and potential errors (from either fewer neurons per cluster or hardware failures). For downscaling, this project uses a simple mean of the grayscale values of the downscaled pixels.
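The mean-based downscaling can be sketched as block averaging (the function name and the trimming of non-divisible edges are my own illustrative choices):

```python
import numpy as np

def downscale_mean(gray, factor):
    """Downscale by an integer factor: each output pixel is the mean of a
    factor x factor block of input pixels."""
    h, w = gray.shape
    h, w = h - h % factor, w - w % factor   # trim to a multiple of factor
    blocks = gray[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

# A 4x4 gradient reduced to 2x2: each output value is a 2x2 block mean.
img = np.arange(16, dtype=float).reshape(4, 4)
small = downscale_mean(img, 2)
```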

Face/eye/mouth detection is needed in order to focus the classification and remove unnecessary information such as hats, clothes, eye-glasses or backgrounds. Classifying the eyes, mouth and overall facial expression can also improve accuracy and even be used to detect more subtle cues, such as fake smiles (when the different elements contradict each other, e.g. a smiling mouth and flared nostrils).

REF1: "Contribution of color to face recognition", Andrew W. Yip, Pawan Sinha, Perception, 2002, volume 31, pages 995-1003, DOI:10.1068/p3376.

REF2: "Face recognition: The problem of compensating for changes in illumination direction.", Y. Adini, Y. Moses, and S. Ullman, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):721–732, 1997.

2.3 Edge Enhancing

This layer focuses on eliminating unnecessary details and bringing into focus the desired features, which in this case are the edges. An edge can be defined as a boundary between two different groups. In order to recognize a face, we first need to identify its shape and the shape of its features (eyes, eyebrows etc.). Finding these features' boundaries or edges is a non-trivial problem, as it is hard to tell where exactly the point or line which separates two features lies (where does the nose begin? FIGURE?).

There are, however, techniques which can enhance potential edges. One such technique is applying a kernel (a square convolution matrix) to an image, in a process called image convolution. If we view the image as a matrix, then we can move the kernel across each pixel and apply the following operation to the overlapped sub-matrix:

P_{2,2} =
⎡K_{1,1} K_{1,2} K_{1,3}⎤   ⎡P_{1,1} P_{1,2} P_{1,3}⎤
⎢K_{2,1} K_{2,2} K_{2,3}⎥ ∗ ⎢P_{2,1} P_{2,2} P_{2,3}⎥ = ∑_{i,j=1}^{3} K_{i,j} P_{i,j}
⎣K_{3,1} K_{3,2} K_{3,3}⎦   ⎣P_{3,1} P_{3,2} P_{3,3}⎦

If the kernel is not symmetric (not equal to its transpose), then it needs to be flipped on its horizontal and vertical axes before applying the operation.

There are different types of kernels which can enhance different features, but for this project I have chosen four kernels which are focused on enhancing different kinds of edges. (FIGURE FOR KERNELS)
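The per-pixel operation above, including the kernel flip, can be sketched as follows. The kernel shown is a standard Sobel-style horizontal-edge kernel chosen for illustration; it is not necessarily one of the four kernels the project uses:

```python
import numpy as np

def convolve_pixel(kernel, patch):
    """Apply a 3x3 kernel to the 3x3 image patch centred on one pixel:
    flip the kernel on both axes, then sum the element-wise products."""
    flipped = kernel[::-1, ::-1]
    return float(np.sum(flipped * patch))

# A Sobel-style kernel that responds to horizontal edges.
sobel_y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]], dtype=float)

# Bright rows above dark rows give a strong response; a flat patch gives none.
edge_patch = np.array([[1, 1, 1],
                       [0, 0, 0],
                       [0, 0, 0]], dtype=float)
flat_patch = np.zeros((3, 3))

edge_response = convolve_pixel(sobel_y, edge_patch)
flat_response = convolve_pixel(sobel_y, flat_patch)
```

Sliding this operation across every pixel of the image yields the full edge-enhanced output.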

A more computationally complex method (which can have better results) is applying a difference of Gaussians (DoG). By taking a grayscale image I and two Gaussian blurred versions of it (with different variances δ and Kδ, where K is a size ratio) and subtracting them, one can increase the visibility of edges. The mathematical formula for this is (∗ represents a convolution):

Result_{δ,Kδ}(x, y) = I ∗ (1/(δ√2π)) e^(−(x² + y²)/(2δ²)) − I ∗ (1/(Kδ√2π)) e^(−(x² + y²)/(2K²δ²))
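Since convolution is linear, the two blurred images can be subtracted by convolving once with the difference of the two Gaussian kernels. A minimal sketch, with illustrative values for the kernel size, δ and K:

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """2-D Gaussian kernel, normalized so its entries sum to 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return g / g.sum()

def difference_of_gaussians(size, sigma, k):
    """DoG kernel: a narrow Gaussian minus a wider one (ratio k).
    Convolving an image with this single kernel is equivalent to
    subtracting the two separately blurred images."""
    return gaussian_kernel(size, sigma) - gaussian_kernel(size, k * sigma)

# Positive centre, negative surround: the classic edge-enhancing shape.
dog = difference_of_gaussians(size=7, sigma=1.0, k=1.6)
```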

2.4 Gabor Filtering

This layer simulates the simple and complex cells in the primary visual cortex V1 by using Gabor filters. A Gabor filter is a type of linear filter which can detect


edges in images at certain angles. In order to maximise the accuracy of the edge detection, this layer uses 8 Gabor filters adapted to detect edges at 8 different orientations, 45 degrees apart (0 deg, 45 deg, 90 deg, 135 deg, 180 deg, 225 deg, 270 deg, 315 deg). REF1 REF2

A Gabor filter can be thought of as a kernel specialized for detecting edges with different orientations. The formula for creating a Gabor filter for an angle θ is:

gb(x, y) = exp(−(θ_x² + γ²θ_y²) / (2σ²)) · cos(2πθ_x/λ + ψ)

where

θ_x = x cos(θ) + y sin(θ)
θ_y = −x sin(θ) + y cos(θ)

with σ = standard deviation, λ = wavelength, θ = angle, ψ = phase offset and γ = spatial aspect ratio.
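The formula translates into a small NumPy routine. The parameter defaults below (σ, λ, ψ, γ and the kernel size) are illustrative only, not the values used in the model:

```python
import numpy as np

def gabor_kernel(size, theta, sigma=2.0, lam=4.0, psi=0.0, gamma=0.5):
    """Real-valued Gabor kernel for orientation theta (radians): a
    Gaussian envelope multiplied by a cosine wave of wavelength lam."""
    ax = np.arange(size) - size // 2
    x, y = np.meshgrid(ax, ax)
    x_t = x * np.cos(theta) + y * np.sin(theta)     # theta_x
    y_t = -x * np.sin(theta) + y * np.cos(theta)    # theta_y
    envelope = np.exp(-(x_t**2 + (gamma * y_t)**2) / (2.0 * sigma**2))
    return envelope * np.cos(2.0 * np.pi * x_t / lam + psi)

# A bank of 8 filters at orientations 45 degrees apart, as in the model.
bank = [gabor_kernel(9, np.deg2rad(a)) for a in range(0, 360, 45)]
```

Convolving the edge-enhanced image with each filter in the bank produces one orientation-selective response map per angle.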

Due to the high dimensionality of the Gabor filter result, some kind of feature reduction might be needed. PCA (principal component analysis) or ICA (independent component analysis) are both viable options. Unfortunately, implementing either with spiking neural networks is a non-trivial problem. REF3

REF1: S. G. Wysoski, L. Benuskova, and N. Kasabov, "Fast and adaptive network of spiking neurons for multi-view visual pattern recognition," Neurocomputing, vol. 71, no. 13-15, pp. 2563–2575, 2008.

REF2: "A Spiking Neural Network Based Cortex-Like Mechanism and Application to Facial Expression Recognition", Si-Yao Fu, Guo-Sheng Yang, and Xin-Kai Kuai, Hindawi Publishing Corporation, Computational Intelligence and Neuroscience, Volume 2012, Article ID 946589, 13 pages, doi:10.1155/2012/946589.

REF3: Savin C, Joshi P, Triesch J (2010). Independent Component Analysis in Spiking Neurons. PLoS Comput Biol 6(4): e1000757. doi:10.1371/journal.pcbi.1000757.

2.5 Classifier

This layer consists of populations of neurons which are trained to react to certain features in the centre of their receptive field. For each training image, the first three layers are applied and then a population is specifically trained to recognize the image. The training itself is done by modifying the tuning curve of the neurons in the population, so that the neurons act as a function of the likelihood of the input being the desired one.

The classifier takes the input from the third layer and runs it through each trained neuron population. As such, it gives a set of probabilities for the input image corresponding to each of the people and expressions in the training data. Afterwards, the classification is done by taking the most probable of all possibilities.
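The final step is a winner-take-all readout over the population outputs, which can be sketched as an argmax (the labels and probability values below are made up for illustration):

```python
import numpy as np

def classify(probabilities, labels):
    """Winner-take-all readout: each trained population reports a
    likelihood for its own (person, expression) pair; the most probable
    pair is returned as the classification."""
    return labels[int(np.argmax(probabilities))]

# Three hypothetical trained populations and their reported likelihoods.
labels = [("Person A", "Happy"), ("Person A", "Sad"), ("Person B", "Happy")]
winner = classify([0.2, 0.1, 0.7], labels)
```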

2.6 Implementation Approach

This chapter focuses on the approach used when implementing the project. After an initial period of research into spiking neural networks, SpiNNaker and how to achieve the desired requirements, I devised the model described in the previous chapters. I designed the model with modular programming principles in mind. Each layer has its own role and has a fixed input/output size and type,


Figure 7: A list of tasks with their estimated finish time as well as a perceived difficulty colour code. Green - no considerable difficulty, Yellow - moderate difficulty, Brown - challenging, Red - very challenging

which allows for independent testing. The layers themselves are also composed of smaller modular components which can be checked separately. I have also compiled a list of modules (or milestones) which describes the estimated finish time and the perceived difficulty of each task (figure 7).

In order to be able to efficiently test and train the model, I needed a large sample of high-quality pictures of faces with the desired facial expressions (Neutral, Angry, Disgusted, Afraid, Happy, Surprised, Sad). As such, I used the JAFFE REF1 (Japanese Female Facial Expression) database, which has face photographs of various Japanese female models with the desired expressions. This way, I can also test the robustness of the system by "sabotaging" the images in deliberate ways to see what the model is strong and weak against.

The project also required something to test against, so I decided to develop in parallel an equivalent solution in plain Python (without spiking neural networks), which I could use to benchmark the solution.

REF1: M. J. Lyons, S. Akamatsu, M. Kamachi, J. Gyoba, and J. Budynek, "The Japanese female facial expression (JAFFE) database," 1998.


Figure 8: The convoluter unit.

3 Implementation

This chapter focuses on how the previously described model was implemented.

3.1 Preprocessing

The preprocessing "optimizes" the input image before feeding it into the network. This part is not implemented using a neural network, as it serves as an abstraction of the initial input filtering done by the brain REF1.

As described in the design, the RGB-to-grayscale conversion and the illumination normalization can be achieved using simple mathematical formulas.

REF1: M. R. Harter and C. J. Aine, "Brain mechanisms of visual selective attention," Varieties of Attention (1984): 293-321.
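The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the report's exact formulas: it assumes the standard ITU-R BT.601 luminance weights for the grayscale conversion and a simple zero-mean, unit-variance rescaling for the illumination normalization.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an HxWx3 RGB image to grayscale using the
    ITU-R BT.601 luminance weights (an assumed, standard choice)."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def normalize_illumination(gray):
    """Zero-mean, unit-variance rescaling: one simple way to reduce
    global lighting differences between photographs."""
    return (gray - gray.mean()) / (gray.std() + 1e-8)
```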

3.2 Edge Enhancing

The basic building block of the edge-enhancing layer is the convolution unit (Figure 8). It takes a kernel and the overlapped submatrix as input and outputs a single value: the result of the convolution between them. The convolution is achieved using Nengo's vector multiplication network. This network is part of the Nengo API and computes the element-wise product of two equally sized vectors REF1. After the elements are multiplied, the resulting outputs are combined into one final output via synaptic connections (effectively adding them).
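The ideal computation that a single convoluter unit approximates can be written down directly; this sketch is the mathematical operation only, not the Nengo network that implements it:

```python
import numpy as np

def convoluter_unit(kernel, submatrix):
    """Ideal behaviour of one convoluter unit: an element-wise product
    of the kernel and the overlapped submatrix (the vector multiplication
    network), followed by a sum (the converging synaptic connections)."""
    products = kernel.ravel() * submatrix.ravel()  # element-wise stage
    return products.sum()                          # summation stage
```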


Figure 9: The classifying layer (with three trained classifiers).

Each pixel gets a convoluter unit assigned to it. The overlapping submatrix and the kernel are connected to the corresponding convoluter, and the outputted result is written to the corresponding output pixel.

REF1: Gosmann, 2015, ”Precise multiplications with the NEF”, Retrievedon 30 April 2016, url: http://nbviewer.jupyter.org/github/ctn-archive/technical-reports/blob/master/Precise-multiplications-with-the-NEF.ipynb#An-alternative-network
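The per-pixel wiring described above amounts to a same-size, zero-padded sliding-window operation. A plain-Python sketch of that wiring (not the Nengo implementation itself; note it computes cross-correlation, i.e. the kernel is not flipped):

```python
import numpy as np

def convolve_layer(image, kernel):
    """One convoluter unit per pixel: each output pixel is the summed
    element-wise product between the kernel and the zero-padded
    submatrix centred on that pixel."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            sub = padded[i:i + kh, j:j + kw]
            out[i, j] = (sub * kernel).sum()
    return out
```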

3.3 Gabor Filtering

The Gabor filters were created using the mathematical formula described in the design of this layer. The convolution was handled with the same system used for edge enhancing.
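For reference, a real-valued Gabor kernel in its usual parameterization can be generated as below. This follows the standard textbook formulation; the exact parameterization used in the design chapter may differ.

```python
import numpy as np

def gabor_kernel(size, theta, lam, sigma, gamma=0.5, psi=0.0):
    """Standard real Gabor filter: a cosine carrier of wavelength `lam`
    at orientation `theta`, under a Gaussian envelope of width `sigma`
    and aspect ratio `gamma`, with phase offset `psi`."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / lam + psi)
    return envelope * carrier
```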

3.4 Classifier

The basic classifier unit uses Nengo's target-function feature, which is available when building a connection between two neuron clusters. The function attempts to "fit" a given set of evaluation points onto a given set of target points. This takes advantage of the neurons' inherent ability to recognize patterns and similarities REF1.
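Under the Neural Engineering Framework, this "fitting" reduces to solving a least-squares problem for decoding weights that map neuron activities at the evaluation points onto the targets. A minimal, unregularized sketch (Nengo's actual solver also adds noise regularization):

```python
import numpy as np

def fit_decoders(activities, targets):
    """Least-squares fit of decoding weights so that
    activities @ decoders approximates the target points."""
    decoders, *_ = np.linalg.lstsq(activities, targets, rcond=None)
    return decoders
```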

By using a classifier unit for each model and each expression, we get a set of probabilities for the likelihood of the input being a certain model with a certain expression. We can then feed these outputs into a Basal Ganglia-Thalamus network: a Nengo API network which works as a winner-take-all system, inhibiting all but the largest input it receives (Figure 9).
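The steady-state effect of the winner-take-all stage can be summarized as below. The real Basal Ganglia-Thalamus network achieves this dynamically through mutual inhibition; this sketch only shows the idealized end result.

```python
import numpy as np

def winner_take_all(probabilities):
    """Idealized winner-take-all: suppress every channel except the
    largest one, mimicking the Basal Ganglia-Thalamus steady state."""
    out = np.zeros_like(probabilities, dtype=float)
    out[np.argmax(probabilities)] = probabilities.max()
    return out
```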


REF1: C. M. Bishop, "Neural Networks for Pattern Recognition," Oxford, England: Oxford University Press, 1996.

4 Testing and Validation

As I worked through the project, I developed an equivalent model in plain Python (the same model, but without spiking neural networks) in order to benchmark and test the solution. Unfortunately, the resulting SpiNNaker model turned out to be too large. The edge-enhancing and Gabor-filter layers both assumed one convoluter unit per pixel, so a very large number of neuron populations was needed to compute the convolutions. Given the limitation of 24 boards on SpiNNaker, this meant a maximum resolution of 8x8 with a very poor approximation rate (few neurons per population), at which the network failed to accomplish its task. The Python equivalent was able to complete the task correctly, lending the model some credibility, but the current convolution implementation is too costly in terms of neurons to work.

The classifier also suffered due to the high dimensionality of the Gabor filter output. Due to time constraints I was unable to develop a feature-reduction function, but without one the classifier cannot correctly distinguish between its inputs.

There are solutions to both of these problems: convolution can be done efficiently by using manually calculated weights through a single neuron population (for an entire image), and ICA has been implemented in a spiking neural network.
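The "manually calculated weights" idea amounts to expressing the whole convolution as one fixed weight matrix, so a single population can compute the entire image at once. A dense, unoptimized sketch of constructing such a matrix (the real implementation would exploit its sparsity):

```python
import numpy as np

def convolution_weight_matrix(shape, kernel):
    """Build W such that W @ image.ravel() equals the zero-padded,
    same-size convolution of the image with the kernel. Each row of W
    holds the kernel taps for one output pixel."""
    h, w = shape
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    W = np.zeros((h * w, h * w))
    for i in range(h):
        for j in range(w):
            row = i * w + j
            for di in range(kh):
                for dj in range(kw):
                    ii, jj = i + di - ph, j + dj - pw
                    if 0 <= ii < h and 0 <= jj < w:
                        W[row, ii * w + jj] = kernel[di, dj]
    return W
```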

5 Conclusion

The project is a proof of concept for a working face and facial expression recognition model on SpiNNaker. The model serves as a good base for further development, and the implementation exposes the pitfalls of certain approaches. I have learned a tremendous amount from this project, both technically and personally: what it means to do research and to read through academic papers, the importance (and difficulty!) of a sustained work ethic, and the importance of planning the approach and keeping track of milestones.
