
Hardware Implementation of Back-Propagation Neural Networks for Real-Time Video Image Learning and Processing

H. Madokoro and K. Sato
Department of Machine Intelligence and Systems Engineering, Faculty of Systems Science and Technology,
Akita Prefectural University, Yurihonjo, Japan
Email: {madokoro, ksato}@akita-pu.ac.jp

Abstract— This paper presents a digital hardware Back-Propagation (BP) model for real-time learning in the field of video image processing. The model is a layer-parallel architecture with a 16-bit fixed point specialized for video image processing. We have compared our model with a standard BP model that used a double-precision floating point. Simulation results show that our model has capabilities equal to those of the standard BP model. We have implemented the model on an FPGA board that we originally designed and developed for experimental use as a platform for real-time video image processing. Experimental results show that our model performed 100,000 epochs/frame learning, which corresponds to 90 MCUPS, and was able to test all pixels on interlaced video images.

Index Terms— Field programmable gate arrays, back-propagation neural networks, real-time video image processing, million connection updates per second.

I. INTRODUCTION

Neural-network based embedded computing is one of the significant technologies for a ubiquitous society. Neural networks, especially the Back-Propagation (BP) neural networks of Rumelhart [1], have been used widely in numerous systems such as image processing, computer vision, biometrics, and others. Although BP provides excellent mapping ability, its learning requires high calculation costs. The original strength of a neural network is its parallel and distributed architecture. However, numerous neural networks are run on von Neumann type general-purpose computers as sequential algorithms. Processors used in embedded systems or gadgets are insufficient to run neural networks in real time. Hardware implementation is the best solution for neural-network applications in ubiquitous computing.

According to the progress of large-scale and high-density Field Programmable Gate Arrays (FPGAs), numerous models have been proposed to implement neural networks on FPGAs using digital architecture [2]–[9]. Neural networks and FPGAs offer a good combination because neural networks must change their structure according to problems and FPGAs are able to change their structure through programming alone. Our goal is FPGA implementation of neural networks for real-time learning in video image processing. In video image processing, scenes or targets change instantly and continually. Therefore, it is important to repeat learning according to those environmental changes. We define learning within one frame per second for a Video Graphics Array (VGA) image as real-time learning.

This paper presents a digital hardware BP model for real-time learning in the field of video image processing. We simulate our model to evaluate its convergence and compare the results with a standard BP model that used a double-precision floating point. Moreover, we implement the model on an FPGA board that we originally designed and developed for experimental use as a platform for real-time learning and video image processing. Experimental results show that our model performed 100,000 epochs/frame learning, which corresponds to 90 Million Connection Updates Per Second (MCUPS), and was able to test all pixels on interlaced video images.

II. RELATED STUDIES

When neural networks are used for complex and actual applications such as image processing, speech recognition, robot vision, etc., different neural-network algorithms and conventional methods are combined in one system so that the most appropriate method can be used for each subtask [10]. Therefore, neural networks of various types have been proposed and implemented on configurable devices as digital architectures. Moreover, evolutionary algorithms have been implemented in parallel [11], [12].

Maeda et al. [2] proposed a recursive learning scheme for Recurrent Neural Networks (RNNs) based on the simultaneous perturbation method. They implemented Hopfield Neural Networks (HNNs), a typical model of RNNs, on an Altera APEX-series FPGA. They evaluated their model using oscillatory patterns. Their experimental results indicated that the model achieved perfect agreement with the desired patterns based on ground truth. However, it is a challenging task for RNNs to set up the weight values in networks for a specific purpose. Moreover, HNNs suffer from a local-minimum problem in convergence when applied to an actual problem with high-dimensional datasets.

JOURNAL OF COMPUTERS, VOL. 8, NO. 3, MARCH 2013 559

© 2013 ACADEMY PUBLISHER
doi:10.4304/jcp.8.3.559-566

Martincigh et al. [3] presented an architecture for pulse-mode neurons based on a voting circuit. They used stochastic pulse modulation in which the values of the neuron inputs were coded in terms of bit probabilities. They implemented their model on a Xilinx Virtex-series FPGA and compared it with Hikawa's model [5]. The experimental results indicated that they implemented their model with fewer resources: I/O blocks, function generators, configurable logic blocks, D-FFs, and critical paths. However, evaluation of changing the rate of the shift-register reset against a change of the activation steepness remained for future work. Moreover, no results were shown for applications of this method.

Anguita et al. [4] proposed a digital architecture for Support Vector Machines (SVMs). They analyzed the quantization effects on the performance of SVMs implemented on a Xilinx Virtex-II FPGA. The main advantage of SVMs consists in the structure of the learning algorithm, which is characterized by the resolution of a constrained quadratic programming problem in which the drawback of local minima is completely avoided. They optimized SVMs for targeting digital solutions that can be easily mapped to an FPGA. They used a well-known benchmark dataset, Sonar, to evaluate the performance of this model. They considered that applying their method to an actual problem remained as future work, although they showed real-time performance in this experiment.

Porrmann et al. [10] proposed a hardware accelerator for Self-Organizing feature Maps (SOMs) with a massively parallel architecture for embedded applications requiring only small areas of silicon. They developed a prototype of a heterogeneous multiprocessor system called Multiprocessor system for Neural Applications (MoNA). This system consists of a PC and a board with a Xilinx XC4052XL FPGA. They implemented a SOM with a mapping layer of 16×16 units. The performance reached 4.609 MCPS for recall and 0.383 MCUPS for learning. This is an unsupervised model, although they mentioned the U-Matrix for visualizing category boundaries. Applications of unsupervised neural networks are limited because of their low learning performance.

Hikawa [5] proposed pulse-mode multilayer neural networks trained using BP. He realized a piecewise-linear function using a simple multiplexer for the calculation of the weighted sum of a neuron. Moreover, the accompanying synapse multiplier has no restriction. The FPGA implementation of learning and testing of BP demonstrated 0.226 MCUPS and 2.996 MCUPS for 29 weights and 290 weights, respectively. However, this performance evaluation was conducted using a single two-dimensional binary classification problem. He did not mention how to use this hardware BP model for an actual problem.

Sackinger et al. [9] developed an Analog Neural Network ALU (ANNA) chip using a Xilinx 4005 FPGA as a platform for a wide variety of algorithms used in neural-network applications as well as image analysis. Their emulating cellular neural-network model demonstrated two billion connection updates per second. For a character recognition problem, this model recognized 1,000 characters per second. However, the target of this application remained binary images.

Figure 1. Whole architecture: the input layer is three units, the hidden layer is three units, and the output layer is two units.

Figure 2. Hidden module of three input ports, one output port, and two ports used for back-propagation signals.

III. HARDWARE MODEL

A. Overview

Figure 1 shows the network architecture of our model. The network comprises three layers: an input layer, a hidden layer, and an output layer. As described in this study, our final application is human skin-color extraction from video images. The input layer comprises three units that correspond to RGB signals. The output layer comprises two units that correspond to the objective region and the non-objective one, e.g. skin-color pixels and non-skin-color pixels. We assigned three units to the hidden layer.

Our model consists of two modules: a hidden module and an output module. The hidden module consists of one hidden unit and a bus. The bus is connected between the hidden units and all input units to maintain weights. All hidden modules run in parallel in a forward-propagation process and a back-propagation process. The output module consists of one output unit and a bus. The bus is connected between the output units and all hidden units to maintain weights. All output modules also perform forward-propagation and back-propagation in parallel.

Figure 3. Output module of three input ports, one output port, one port for teaching signals, and two ports for propagating signals to hidden modules.

Figure 4. The dashed line is the sigmoid function and the solid line is our approximated function.

The bus connections are shifted to reduce weight-initialization steps. The upper left module is a clock generator that produces clocks for forward-propagation and back-propagation.

B. Hidden Modules

Figure 2 shows a block diagram of the hidden module. The hidden module maintains weights w_{ij}(t) between the j-th hidden unit and all input units (1 ≤ i ≤ I). In the forward-propagation phase, the hidden module is presented with input data x_i(t) from the input layer. Its output value u_j(t) is calculated as

    u_j(t) = f( \sum_{i=1}^{I} x_i(t) w_{ij}(t) ),    (1)

where f is a sigmoid function defined as

    f(x) = 1 / (1 + e^{-x}).    (2)

Figure 5. The dashed line is the sigmoid differential function and the solid line is our approximated function.

Figure 6. CLK is used for the base clock. CLK_FP and CLK_BP are used for the forward-propagation clock and the back-propagation clock, respectively.

Figure 7. Image used for training: (a) learning image; (b) distribution of training data. The marks depicted '×' are training points for the skin-color region. The marks depicted '+' are training points for the non-skin-color region. Each point was selected by an experimenter through a mouse click operation.
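As a software cross-check, the forward pass of Eqs. (1) and (2) can be sketched as follows; the input vector, weight values, and function names here are illustrative, not taken from the trained network.

```python
import math

def sigmoid(x):
    # Eq. (2): f(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

def hidden_forward(x, w):
    # Eq. (1): u_j(t) = f(sum_i x_i(t) * w_ij(t)) for one hidden unit j
    return sigmoid(sum(xi * wi for xi, wi in zip(x, w)))

# Illustrative 3-input example matching the RGB input layer.
x = [0.8, 0.4, 0.2]       # normalized R, G, B (hypothetical values)
w = [0.5, -0.3, 0.1]      # hypothetical weights
u = hidden_forward(x, w)  # a value strictly between 0 and 1
```

When the weighted sum is zero the unit outputs exactly 0.5, which gives a quick sanity check for the sigmoid.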

Figure 8. (a) The skin-color extraction result from a double-precision floating-point model. (b) The skin-color extraction result achieved using our model.

For updating weights w_{ij}(t) in the back-propagation phase, Δw_{ji}(t) is given as

    Δw_{ji}(t) = u_j(t)(1 − u_j(t)) ( \sum_{k=1}^{K} w_{kj}(t) δ_k ) + Δw_{ji}(t − 1),    (3)

where δ_k is the back-propagation value from the output modules.

C. Output Modules

Figure 3 shows a block diagram of the output module. The output module maintains weights w_{jk}(t) between the k-th output unit and all hidden units (1 ≤ j ≤ J). In the forward-propagation phase, the output module is presented with input data u_j(t) from the hidden layer; its output value o_k(t) is calculated as

    o_k(t) = f( \sum_{j=1}^{J} u_j(t) w_{jk}(t) ),    (4)

where f is the sigmoid function previously defined in formula (2).

For updating weights w_{kj}(t) in the back-propagation phase, Δw_{kj}(t) is given as

    Δw_{kj}(t) = δ_k u_j(t) + Δw_{kj}(t − 1),    (5)

where t_k(t) is the teaching signal and δ_k is defined as

    δ_k = (t_k(t) − o_k(t)) o_k(t) (1 − o_k(t)),    (6)

which is used in the output modules and propagates to the hidden modules via the buses.
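Taken literally, the update rules of Eqs. (3), (5), and (6) can be sketched as below. Note that, as printed, the equations carry no explicit learning-rate or momentum coefficients; the function names are ours.

```python
def output_delta(t_k, o_k):
    # Eq. (6): delta_k = (t_k - o_k) * o_k * (1 - o_k)
    return (t_k - o_k) * o_k * (1.0 - o_k)

def output_weight_change(delta_k, u_j, prev_dw):
    # Eq. (5): dw_kj(t) = delta_k * u_j(t) + dw_kj(t-1)
    return delta_k * u_j + prev_dw

def hidden_weight_change(u_j, w_out, deltas, prev_dw):
    # Eq. (3): dw_ji(t) = u_j(1-u_j) * sum_k w_kj(t)*delta_k + dw_ji(t-1)
    back = sum(w * d for w, d in zip(w_out, deltas))
    return u_j * (1.0 - u_j) * back + prev_dw
```

The `prev_dw` terms carry the previous weight change into the next update, matching the Δw(t−1) terms of Eqs. (3) and (5).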

Figure 9. Mean squared error (MSE) versus iteration in software simulation: (a) double-precision floating-point model; (b) 16-bit fixed-point model.

D. Sigmoid Function

We use 16-bit fixed-point signals for all weights and parameters. The fractional part is 8 bits and the integer part is 7 bits; the remaining bit is the sign. This assignment can express values from -127.99609375 to 127.99609375 in steps of 0.00390625.
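The fixed-point format can be checked with a short sketch (the helper names are ours, and the rounding and saturation behavior are assumptions, since only the bit widths are specified):

```python
FRAC_BITS = 8
SCALE = 1 << FRAC_BITS   # 256 steps per unit, i.e. step 0.00390625

def to_fixed(x):
    # Encode into 16 bits: 1 sign bit, 7 integer bits, 8 fractional bits,
    # saturating at +/-127.99609375 (assumed behavior).
    n = int(round(x * SCALE))
    return max(-32767, min(32767, n))

def to_float(n):
    # Decode back to a real value.
    return n / SCALE

# The step size and extremes follow directly from the bit split.
step = to_float(1)       # smallest increment
top = to_float(32767)    # largest representable magnitude
```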

Figure 4 shows the approximated sigmoid function using the 16-bit fixed point. For hardware implementation, we use a matrix (lookup table) to calculate the sigmoid function. We set the horizontal resolution to 0.5 to reduce the matrix size. Figure 5 shows the approximated sigmoid differential function. We also use a matrix with the horizontal resolution set to 0.5. We have evaluated the effect of these approximations through simulation.
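A table-based sigmoid with 0.5 horizontal resolution can be sketched as follows; how inputs are snapped to table entries and how values outside the plotted [-6, 6] range are handled are our assumptions.

```python
import math

STEP = 0.5
# One entry per multiple of STEP over [-6, 6]: 25 entries in total.
TABLE = [1.0 / (1.0 + math.exp(-(i * STEP))) for i in range(-12, 13)]

def sigmoid_lut(x):
    # Clip to the tabulated range, then snap to the nearest entry.
    x = max(-6.0, min(6.0, x))
    return TABLE[round(x / STEP) + 12]
```

Outside the tabulated range the sigmoid is nearly saturated, so clipping to the end entries costs little accuracy.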

E. Clocks

We used the non-interlaced VGA pixel clock generated by the digital video decoder on our originally developed FPGA board. The clock rate is 24.5454 MHz, as shown in Fig. 6. The clocks used in learning and testing are also shown in Fig. 6. Forward-propagation and back-propagation are performed using these clocks. Of them, CLK_FP is the forward-propagation clock and CLK_BP is the back-propagation clock.

In the learning phase, one epoch of learning is performed with four clocks. The CLK_FP falling edge triggers forward-propagation on the hidden modules; its rising edge triggers forward-propagation on the output modules. Subsequently, the CLK_BP falling edge triggers back-propagation on the output modules, and its rising edge triggers back-propagation on the hidden modules.

In the testing phase, testing of one pixel is performed with two clocks. The CLK_FP falling edge triggers forward-propagation on the hidden modules, and its rising edge triggers forward-propagation on the output modules. CLK_BP is not used in the testing phase.

IV. SIMULATION

For determining the network parameters, we simulated our model on a personal computer. The application in this paper is skin-color extraction from video images. Figure 7(a) shows an image for learning. The man in the image is the first author of this paper. For extracting learning data, we selected five points for the skin-color region, depicted using the mark '×', and five points for the non-skin-color region, depicted using the mark '+', through a mouse click operation. We extracted learning data from the second-neighborhood pixels, 5×5 pixels at each point. Figure 7(b) portrays the distribution of training data. Decision boundaries are created through learning.
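Reading "second neighborhood" as the 5×5 window centered on each clicked point, the training-data extraction can be sketched as below (the function and variable names are ours):

```python
def patch_5x5(image, cx, cy):
    # Gather the 25 pixels of the 5x5 neighborhood around (cx, cy);
    # image is a 2-D grid of pixels, indexed image[y][x].
    return [image[y][x]
            for y in range(cy - 2, cy + 3)
            for x in range(cx - 2, cx + 3)]

# Hypothetical 7x7 single-color image: every pixel is the same RGB tuple.
img = [[(200, 150, 120)] * 7 for _ in range(7)]
patch = patch_5x5(img, 3, 3)  # 25 training pixels from one clicked point
```

Ten clicked points (five skin, five non-skin) would thus yield 250 training pixels in total.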

Figure 8(a) shows an extraction result from a double-precision floating-point model, which is widely used on von Neumann type general-purpose computers. Figure 8(b) shows an extraction result achieved using our model. Original colored pixels show skin-color pixels; black pixels show non-skin-color pixels. Although false-positive regions in the lower right of the image and false-negative parts under his right eye and at the top of his nose are apparent, a good result was obtained using our model. The result resembled results achieved using the standard BP model.

Figure 9 shows Mean Squared Error (MSE) rates of a double-precision floating-point model and the 16-bit fixed-point version of our model. The MSE is calculated from the output values of the model and the ground-truth values that are presented to the network as teaching signals. The MSE of the double-precision floating-point model converged to 1.00 after 48,000 epochs. On the other hand, the MSE of our model was 1.746094 after 100,000 epochs. Judging by the final value of convergence, our model is inferior to the double-precision floating-point model. However, the MSE converged over the epochs. The unsteady amplitude is caused by the resolution of the 16-bit fixed point.

V. HARDWARE IMPLEMENTATION

We designed and developed the original FPGA board shown in Fig. 10(a) as a platform for real-time learning and video image processing [15]. On this board, an FPGA (Stratix EP1S80B956C6; Altera Corp.) is mounted through a socket. In addition, a video D/A converter (DAC-350; Datel Inc.) and a digital video decoder (MSM7664BTB; Oki Electric Industry Inc.) are installed for the video output and input, respectively. Moreover, a PCMCIA interface port is equipped for the extension of a high-capacity storage device or a Wi-Fi wireless communication device. We used a CCD camera to obtain images at 60 frames per second directly on the board shown in Fig. 10(a).

Figure 10. Originally developed FPGA board: (a) originally developed FPGA board; (b) experimental environment with a CCD camera, the FPGA board, and a display. Designed by Akita Prefectural Industrial Technology Center (AIT) and assembled by Akita Electronics Systems Co., Ltd. We conducted the experiment using this board and a CCD camera for real-time image processing.

TABLE I. RESULTS OF SYNTHESIS.

Item               | Device specs. | Result | Rate (%)
Logic elements     | 79,040        | 4,847  | 6
Pins               | 692           | 102    | 14
Memory bits        | 7,427,520     | 31,232 | ≪1
DSP block (9-bit)  | 176           | 86     | 48

We have described our model in VHDL. Table I shows the synthesis result using an FPGA design tool (Quartus II; Altera Corp.). The DSP blocks number 86 elements, 48% of the total resources. The BP algorithm requires numerous multipliers for multiplication of weights. On the other hand, the logic elements number only 4,874, which is 6% of the total resources.

Collection of learning data is an important but difficult problem for supervised neural networks. For hardware implementation, we used the fixed window interface shown in Fig. 11 to collect learning data. The centered region is the objective target; the four corner regions are the non-objective targets. Figure 12 shows the extraction results. The upper two rows depict red and blue mark extraction results. Both are good results because the marks are originally colored. The lower two rows depict skin-color extraction results. Skin-color regions are extracted in the face and the hand, similar to the results of the simulation.

Figure 12. Experimental results of color plates and the skin region: (a) extraction of red color; (b) result of red color; (c) extraction of blue color; (d) result of blue color; (e) extraction of skin color; (f)–(h) results of skin color 1–3.


Figure 11. Regions for acquisition of training data. The center region is an objective target; the regions in the four directions are non-objective targets.

TABLE II. COMPARISON RESULT WITH EXISTING METHODS.

Methods            | Size (In–Hid–Out) | MCUPS
Our model          | 3–2–2             | 90
Hikawa et al. [5]  | 2–2–1             | 85.7
Gadea et al. [13]  | 2–6–3–2           | 81
Botros et al. [14] | 5–4–2             | 4

VI. DISCUSSION

We defined learning within one frame as real-time learning for video images. The mapping capabilities of neural networks can be changed according to environmental changes if the network can learn within one frame. The resolution of the video image that we used is VGA, 640×480 pixels. The frame rate is 60 frames per second, and the pixel clock is 24 MHz. We use the pixel clock as the base clock. In the learning phase, our model marked 100,000 epochs/frame (24 MHz ÷ 4 clocks ÷ 60 frames = 100,000 epochs). There are 15 connections in our model: nine connections between the three input units and three hidden units, and six connections between the three hidden units and two output units. Therefore, the performance corresponds to 90 MCUPS (100,000 epochs/frame × 15 connections × 60 frames). Table II shows a comparison with three existing methods. The performance of our method is superior to the existing methods. In the testing phase, our model can test one result for every two pixels of non-interlaced images. Our model can test all pixels if output images are interlaced.
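The throughput arithmetic in this paragraph can be verified directly:

```python
pixel_clock = 24_000_000     # Hz, the nominal 24 MHz used in the arithmetic
clocks_per_epoch = 4         # two CLK_FP edges plus two CLK_BP edges
frames_per_sec = 60
connections = 3 * 3 + 3 * 2  # 9 input-to-hidden + 6 hidden-to-output = 15

epochs_per_frame = pixel_clock // clocks_per_epoch // frames_per_sec
mcups = epochs_per_frame * connections * frames_per_sec / 1_000_000
# epochs_per_frame -> 100,000; mcups -> 90.0
```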

Learning is terminated when the iterations reach the maximum epoch or the MSE falls below a threshold value. As described in this study, we set 1,000,000 epochs for the maximum and 0.01 for the threshold. However, calculation of the MSE depends on the number of learning datasets. Moreover, it is very difficult to design for a parallel architecture. Therefore, we only use the maximum epoch. On the other hand, BP learning might not always succeed. If the maximum epoch is used for termination of learning, then local minima or overfitting sometimes occur. We confirmed through the simulation in the previous section that BP learning converges within 100,000 epochs. We have realized real-time learning of 100,000 epochs/frame on the FPGA. Our model can relearn in a subsequent frame even if BP learning fails because it encounters a local minimum or overfitting. Moreover, our model can learn in a subsequent frame even if targets change.

VII. CONCLUSION

This paper presented a digital hardware BP model for real-time learning in the field of video image processing. We simulated a skin-color extraction experiment from video images to determine network parameters. We implemented this model on an FPGA board that we developed for exclusive use. Our model performed real-time learning of 100,000 epochs/frame, which corresponds to 90 MCUPS. Results for skin-color extraction were as good as the simulation results.

Future studies will examine implementation of weight saving and loading functions, addition of soft-core processors to control network parameters for a software and hardware co-design environment, and use of dynamic reconfiguration devices.

ACKNOWLEDGMENT

This work was supported by a Grant-in-Aid for Young Scientists (B) No. 21700257 from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

REFERENCES

[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, Oct. 1986.

[2] Y. Maeda and M. Wakamura, "Simultaneous perturbation learning rule for recurrent neural networks and its FPGA implementation," IEEE Trans. Neural Networks, vol. 16, no. 6, pp. 1664–1672, Nov. 2005.

[3] M. Martincigh and A. Abramo, "A new architecture for digital stochastic pulse-mode neurons based on the voting circuit," IEEE Trans. Neural Networks, vol. 16, no. 6, pp. 1685–1693, Nov. 2005.

[4] D. Anguita, A. Boni, and S. Ridella, "A digital architecture for support vector machines: theory, algorithm, and FPGA implementation," IEEE Trans. Neural Networks, vol. 14, no. 5, pp. 993–1009, Sep. 2003.

[5] H. Hikawa, "A digital hardware pulse-mode neuron with piecewise linear activation function," IEEE Trans. Neural Networks, vol. 14, no. 5, pp. 1028–1037, Sep. 2003.

[6] F. Yang and M. Paindavoine, "Implementation of an RBF neural network on embedded systems: real-time face tracking and identity verification," IEEE Trans. Neural Networks, vol. 14, no. 5, pp. 1162–1175, Sep. 2003.

[7] H. S. Ng and K. P. Lam, "Analog and digital FPGA implementation of BRIN for optimization problems," IEEE Trans. Neural Networks, vol. 14, no. 5, pp. 1413–1425, Sep. 2003.

[8] Y. Maeda and T. Tada, "FPGA implementation of a pulse density neural network with learning ability using simultaneous perturbation," IEEE Trans. Neural Networks, vol. 14, no. 3, pp. 688–695, May 2003.

[9] E. Sackinger and H. P. Graf, "A board system for high-speed image analysis and neural networks," IEEE Trans. Neural Networks, vol. 7, no. 1, pp. 214–221, Jan. 1996.

[10] M. Porrmann, U. Witkowski, and U. Ruckert, "A massively parallel architecture for self-organizing feature maps," IEEE Trans. Neural Networks, vol. 14, no. 5, pp. 1110–1121, Sep. 2003.

[11] H. Zheng, M. Hou, and Y. Wang, "An efficient hybrid clustering-PSO algorithm for anomaly intrusion detection," Journal of Software, vol. 6, no. 12, pp. 2350–2360, Dec. 2012.

[12] Y. Liu, X. Ling, Z. Shi, M. Lv, J. Fang, and L. Zhang, "A survey on particle swarm optimization algorithms for multimodal function optimization," Journal of Software, vol. 6, no. 12, pp. 2449–2455, Dec. 2012.

[13] R. Gadea, J. Cerda, F. Ballester, and A. Mocholi, "Artificial neural network implementation on a single FPGA of a pipelined on-line backpropagation," Proc. 13th International Symposium on System Synthesis, pp. 225–230, Aug. 2000.

[14] N. M. Botros and M. Abduo-Aziz, "Hardware implementation of an artificial neural network using field programmable gate arrays," IEEE Trans. on Industrial Electronics, vol. 41, pp. 665–667, Dec. 1994.

[15] H. Madokoro, K. Sato, and M. Ishii, "FPGA implementation of back-propagation neural networks for real-time image processing," Proc. International Symposium on Field-Programmable Gate Arrays, Feb. 2007.

Hirokazu Madokoro received the ME degree in information engineering from Akita University in 2000 and joined Matsushita Systems Engineering Corporation. He moved to Akita Prefectural Industrial Technology Center and Akita Research Institute of Advanced Technology in 2002 and 2006, respectively. He received the PhD degree from Nara Institute of Science and Technology in 2010. He is currently an assistant professor at the Department of Machine Intelligence and Systems Engineering, Akita Prefectural University. His research interests include machine learning and robot vision. He is a member of the Robotics Society of Japan, the Japan Society for Welfare Engineering, the Institute of Electronics, Information and Communication Engineers, and the IEEE.

Kazuhito Sato received the ME degree in electrical engineering from Akita University in 1975 and joined Hitachi Engineering Corporation. He moved to Akita Prefectural Industrial Technology Center and Akita Research Institute of Advanced Technology in 1979 and 2005, respectively. He received the PhD degree from Akita University in 1997. He is currently an associate professor at the Department of Machine Intelligence and Systems Engineering, Akita Prefectural University. He is engaged in the development of equipment for noninvasive inspection of electronic parts, various kinds of expert systems, and MRI brain image diagnostic algorithms. His current research interests include biometrics, medical image processing, facial expression analysis, and computer vision. He is a member of the Medical Information Society, the Medical Imaging Technology Society, the Japan Society for Welfare Engineering, the Institute of Electronics, Information and Communication Engineers, and the IEEE.
