http://dx.doi.org/10.5573/JSTS.2012.12.2.150 JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.12, NO.2, JUNE, 2012
An FPGA-based Parallel Hardware Architecture for Real-time Eye Detection
Dongkyun Kim*, Junhee Jung*, Thuy Tuong Nguyen*, Daijin Kim**, Munsang Kim***, Key Ho Kwon*, and Jae Wook Jeon*
Abstract—Eye detection is widely used in applications,
such as face recognition, driver behavior analysis, and
human-computer interaction. However, it is difficult
to achieve real-time performance with software-based
eye detection in an embedded environment. In this
paper, we propose a parallel hardware architecture
for real-time eye detection. We use the AdaBoost
algorithm with the modified census transform (MCT) to
detect eyes on a face image. We parallelize part of the
algorithm to speed up processing. Several downscaled
pyramid images of the eye candidate region are
generated in parallel using the input face image. We
can detect the left and the right eye simultaneously
using these downscaled images. The sequential data
processing bottleneck caused by repetitive operation
is removed by employing a pipelined parallel
architecture. The proposed architecture is designed
using Verilog HDL and implemented on a Virtex-5
FPGA for prototyping and evaluation. The proposed
system can detect eyes within 0.15 ms in a VGA image.
Index Terms—Eye detection, hardware architecture,
FPGA, image processing, HDL
I. INTRODUCTION
Eye detection is an image processing technique to
locate eye position from an input facial image. Eye
detection is widely used in applications, including face
recognition, driver behavior analysis, and human-
computer interaction [1-4]. The extraction of local facial
features, such as nose, eye, and mouth, is important in
face recognition to analyze the face image [1-2]. The eye
is the most important and most commonly adopted
feature for human recognition [2]. Eye detection is used
to track the eye and gaze to monitor driver vigilance in
driver behavior analysis [3]. An eye-gaze tracking system can be used for pointing and selection in human-computer interaction [4].
Z. Zhu and Q. Ji classified traditional eye detection methods into three categories: template-based methods, appearance-based methods, and feature-based methods [5]. In the template-based methods, a generic eye model is first designed and template matching is used to detect the eyes in the image. The deformable template method is a commonly used template-based method [6-8].
In this method, an eye model is first designed and then
the eye position is obtained via a recursive process. This
method can detect the eye accurately, but it is
computationally expensive and good image contrast is
required for the method to converge correctly.
The appearance-based methods detect eyes based on
their photometric appearance [9-11]. These methods
usually need a large amount of training data representing
the eyes of different subjects, under different face
orientations, and under different illumination conditions.
These data are used to train a classifier; detection is then
achieved via classification. The feature-based methods
use characteristics, such as the edge, iris intensity, color
distribution, and the flesh of the eyes to identify features
around the eyes [12].

Manuscript received Aug. 23, 2011; revised Dec. 19, 2011. *Dep. ECE, Sungkyunkwan University, Suwon, Korea. **Dep. SCE, POSTECH, Pohang, Korea. ***Adv. Robotics Research Center, KIST, Seoul, Korea. E-mail: [email protected]
There are several approaches to reduce eye detection
processing time. These approaches can be broadly
classified into two groups: software-based and hardware-
based. The focus in the software-based approach is
designing and optimizing an eye detection algorithm for
real-time performance. Conversely, in the hardware-
based approach, the eye detection algorithm is
parallelized and implemented on DSP or FPGA to
achieve real-time performance.
T. D'Orazio et al. proposed a real-time eye detection
algorithm [13]. They use iris geometrical information to
determine the eye region candidate and symmetry to
select the pair of eyes. Their algorithm consists of two
steps. First, the region candidate that contains one eye is
detected from the whole image by matching the edge
directions with an edge template of the iris. Then, in the
opposite region, a search is carried out for the second eye,
whose distance and orientation are compatible with the
range of possible eye positions. The software was implemented using Visual C++ on a 1 GHz Pentium III.
They use a camera with a frame rate of 7.5 fps. The
search for the eyes in an image of 640×480 takes about
0.3 sec.
K. Lin et al. proposed a fast eye detection scheme for
use in video streams rather than still images [14]. The
temporal coherence of sequential frames was used to
improve the detection speed. First, the eye detector
trained by the AdaBoost algorithm is used to roughly
obtain the eye positions. Then, these candidate positions
are filtered according to geometrical patterns of human
eyes. The detected eye regions are then taken as the
initial detecting window. The detecting window is
updated after each frame is detected. The experimental
system was developed with Visual C++ 6.0 on a 2.8 GHz Pentium with 1 GB of memory. The mean detection rate was
92.73% for 320×240 resolution test videos in their
experiments, with a speed of 24.98 ms per frame.
However, CPU-based implementation has two major
bottlenecks that limit performance: data transfer
bandwidth and sequential data processing rate [15]. First,
video frames are captured from the camera and
transferred to the CPU main memory via a transfer
channel, such as USB or IEEE1394. These channels have
limited bandwidth; hence they impose a data flow
bottleneck. After a frame is grabbed and moved to
memory, the CPU can sequentially process the pixels.
The CPU adds a large overhead to the actual computation.
Its time is split between moving data between memory and registers, applying arithmetic and logic operations to the data, maintaining the algorithm flow, handling branches, and fetching code. Additionally, a relatively
long latency is imposed by the need to wait for an entire
frame to be captured before it is processed.
Deployment and real-time processing of an eye detection algorithm in an embedded environment are difficult for these reasons. Therefore, A. Amir et al.
implemented an embedded system for eye detection
using a CMOS sensor and FPGA to overcome the
shortcomings of CPU-based eye detection [15]. Their
hardware-based embedded system for eye detection was
implemented using simple logic gates, with no CPU and
no addressable frame buffers. The image-processing
algorithm was redesigned to enable highly parallel,
single-pass image-processing implementation. Their
system processes 640×480 progressive scan frames at a
rate of 60 fps, and outputs a compact list of eye
coordinates via USB communication. A. Amir et al. thus reduced processing time by using an FPGA [15]. However, the eye detection algorithm they used is relatively simple; hence, the eye miss rate is relatively high.
B. Jun and D. Kim proposed a face detection algorithm
using MCT and AdaBoost [16]. I. Choi and D. Kim
presented an eye detection method using AdaBoost
training with MCT-based eye features [17]. The MCT-
based AdaBoost algorithm is very robust, even when the
illumination varies, and is very fast. They chose the MCT-based AdaBoost training method to detect the face and the eyes due to its simplicity of learning and its high detection speed.
In this paper, we propose a dedicated hardware
architecture for eye detection. We use the MCT-based
AdaBoost eye detection algorithm proposed by I. Choi
and D. Kim [17]. We revise the algorithm slightly for
parallelization. We parallelize the part of the algorithm
that is iterated to speed up processing. Several
downscaled images of eye candidate regions are
generated in parallel using the input face image in the
pyramid downscaling module. These images are passed
to the MCT and cascade classifier module via a pipeline
and processed in parallel. Thus, we can detect the left
and right eye simultaneously. The sequential data
processing bottleneck caused by repetitive feature
evaluation is removed by employing a pipelined parallel architecture.
The proposed architecture is described using Verilog
HDL (Hardware Description Language) and implemented
on an FPGA. Our system uses CameraLink to interface
with the camera. The CameraLink interface transfers the
control signal and data signal synchronously. The entire
system is synchronized with a pixel clock signal from the
camera. Our system does not need an interface such as
USB; hence we have removed the bottleneck caused by
the interface bandwidth. The frame rate of the proposed
system can be flexibly changed in proportion to the
frequency of the camera clock. The proposed system can
detect left and right eyes within 0.15 ms using a VGA
(640×480) image and a 25 MHz clock frequency.
This paper consists of six sections. The
remainder of the paper is organized as follows. In
Section 2, we describe the details of the MCT and
AdaBoost algorithm and the overall flow of the eye
detection algorithm that we use. Section 3 describes the
proposed hardware architecture and the simulation
results. Section 4 presents the configuration of the system
and the implementation result. Section 5 presents the
experimental results. Finally, our conclusions are
presented in Section 6.
II. EYE DETECTION ALGORITHM
1. Modified Census Transform
The modified census transform is a non-parametric
local transform that is a modification of the census
transform by B. Fröba, and A. Ernst [18]. It is an ordered
set of comparisons of pixel intensities in a local
neighborhood to find out which pixels have lower
intensity than the mean pixel intensity. In general, the
size of the local neighborhood is not restricted, but in this
work, we always assume a 3×3 surrounding.
The modified census transform is described below. We define N(x) as the local spatial neighborhood of the pixel x, so that x ∉ N(x), and N′(x) = N(x) ∪ {x}. Let I(y) and Ī(x) be the intensity of a pixel y and the mean intensity over the neighborhood N′(x), respectively. The modified census transform can be presented as

\[ \Gamma(x) = \bigotimes_{y \in N'(x)} \zeta\big(\bar{I}(x), I(y)\big), \]

where ζ(Ī(x), I(y)) is a comparison function that yields 1 if I(y) < Ī(x) and 0 otherwise, and ⊗ is a concatenation operation that forms a bit vector (see Fig. 1). If we consider a 3×3 neighborhood, we are able to determine 511 distinct structure kernels.

Fig. 1. Example of the modified census transform.

Fig. 1 shows an example of the modified census transform. In this example, the mean intensity Ī(x) is 95.11. Each pixel in N′(x) is transformed to 0 or 1 by applying ζ(Ī(x), I(y)), and the bits are concatenated in order, as shown in Fig. 1.
2. AdaBoost Algorithm
AdaBoost, short for Adaptive Boosting, is a machine learning algorithm formulated by Y. Freund and R.
Schapire [19]. They introduced the AdaBoost algorithm
briefly in another work and showed that it had good
generalization capability [20]. It is a meta-algorithm, and
can be used in conjunction with many other learning
algorithms to improve the performance. AdaBoost is
adaptive, in the sense that subsequent classifiers that are
built are tweaked in favor of those instances
misclassified by previous classifiers. AdaBoost is
sensitive to noisy data and outliers. However, in some
situations it can be less susceptible to overfitting than most learning algorithms [21].
AdaBoost calls a weak classifier repeatedly in a series
of rounds t = 1, 2, ... , T from a total of T classifiers. For
each call, a distribution of weights Dt is updated that
indicates the importance of examples in the data set for
the classification. The weights of each incorrectly classified example are increased (or, alternatively, the weights of each correctly classified example are decreased) in each round, so that the new classifier focuses more on those examples [21].
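For reference, the standard weight update of Freund and Schapire [19] that realizes this behavior is shown below, with labels y_i ∈ {-1,+1}, weak classifier outputs h_t(x_i) ∈ {-1,+1}, weighted error ε_t, and normalizer Z_t:

```latex
\[
D_{t+1}(i) \;=\; \frac{D_t(i)\,\exp\!\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t},
\qquad
\alpha_t \;=\; \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}.
\]
```

Misclassified examples (y_i ≠ h_t(x_i)) have their weights multiplied by e^{α_t} > 1, which is exactly the re-weighting described above.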
3. MCT-based AdaBoost Algorithm
In this paper, we use the AdaBoost training method
with MCT-based eye features proposed by I. Choi and D.
Kim [17]. In the AdaBoost training, they constructed a
weak classifier that classifies the eye and non-eye
patterns and then constructed a strong classifier that is a
linear combination of weak classifiers [17].
In the detection, they scanned the eye candidate region
by moving a 12×8 scanning window and
obtained a confidence value corresponding to the current
window location using the strong classifier. Then, they
determined the window location whose confidence value
was maximized as the location of the detected eye. Their
MCT-based AdaBoost training was performed using only
the left eye and non-eye training images. Therefore,
when they tried to detect the right eye, they needed to
flip the right subregion of the face image [17].
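Written compactly (our notation, consistent with the description above), the strong classifier confidence at a window position w is the weighted vote of the T weak classifiers, and the detected eye is the position that maximizes it:

```latex
\[
H(w) \;=\; \sum_{t=1}^{T} \alpha_t\, h_t(w),
\qquad
w^{*} \;=\; \arg\max_{w} H(w).
\]
```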
Fig. 2 shows the flowchart of the eye detection
algorithm. The eye detection algorithm requires a face
image and the coordinates of the face region as inputs. If
the detection completes successfully, the algorithm
outputs the coordinates of the left and right eyes as
results.
Eye candidate regions are cropped for pyramid
downscaling in the first step. The face region is split in
half horizontally and vertically. The top-left subregion is
set as the left eye candidate region and top-right
subregion is set as the right eye candidate region. Then,
the right eye candidate region is flipped, because the
training data for the right eye is the same as for the left.
These candidate images are downscaled to a specific size
to get pyramid images. There are eight pyramid images (four for the left and four for the right), and the image sizes are 33×36, 31×34, 29×32, and 27×29. Four downscaled images are enough for an indoor environment where the supported distance is 0.5 to 8 meters. These sizes of the downscaled images were determined empirically.
MCT is applied to the downscaled pyramid images in
the second step. The window size for the transform is
3×3. We can get MCT images by sliding the window
across the pyramid image. Then, the MCT-transformed
pyramid images are passed to the cascade classifier to
detect eyes.
The MCT images are scanned by the classifier and
candidate eye locations are extracted in the third step.
There are three stages of classifiers. The window size for
the classifier is 15×15. The classifier slides the window
across the MCT image and extracts the confidence value
of the current window position. If the confidence value exceeds the stage-specific threshold, then the confidence value of the next-stage classifier is calculated. If the confidence value of the last stage also exceeds its threshold, then an eye candidate is detected at the current window position, and the eye coordinates can be calculated from that position. If no eye candidates are detected after sliding, then detection has failed.
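In other words, writing H_s for the stage-s confidence and T_s for its trained threshold (the quantities shown in Fig. 3), a window position w is reported as an eye candidate only if every stage passes:

```latex
\[
\text{candidate}(w) \;=\; \bigwedge_{s=1}^{3} \big[\, H_s(w) \ge T_s \,\big].
\]
```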
Eye detection for one pyramid image is completed in
the third step. The algorithm repeats these steps for all
the pyramid images. After the left eye detection finishes,
the same process is applied to the right eye pyramid
images. After this repetitive operation, the left eye
candidates list and right eye candidates list are updated.
In the post-processing step, if there are several eye candidates in the list, the eye candidate that has the maximum confidence value is selected as the result.

Fig. 2. Eye detection algorithm flowchart.
III. PROPOSED HARDWARE ARCHITECTURE
1. Overall Architecture
Fig. 3 shows the overall hardware architecture of the
proposed system. The proposed system consists of four
sub-modules: a pyramid downscaling module, an MCT
module, a parallel classifier module, and a post-
processing module. The face detector and SRAM are not
part of the proposed system.
The face detector receives the image from the camera
and detects faces on the image. The face detector outputs
the top-left and bottom-right coordinates of the face
region. The face detector passes the face region
coordinates to the eye detection module after detecting
the faces. The SRAM interface module writes every frame of the image from the camera to SRAM. This image contains the eye regions, whose size is not uniform; the stored non-uniform eye image is then used by the pyramid downscaling module to generate the fixed-size pyramid images.
The pyramid downscaling module receives as input the
left and the right eye region images from SRAM. SRAM
cannot be accessed in parallel, so we have to crop the
image in serial. First, the downscaled images of the left
and right eye regions are generated using SRAM data
and stored in registers. After generating the first
downscaled images, we can generate other downscaled
images in parallel, because the registers can be accessed
in parallel. This differs from the software algorithm, in which pyramid images are generated one by one.
All generated pyramid images are passed to the MCT
module in parallel through pipelines. There are eight
MCT modules for eight pyramid images. In the MCT
modules, the 3×3 window slides across the pyramid
images for the census transform. MCT images are passed
to the classifier module in parallel through pipelines.
The classifier module extracts features from the MCT
image. In this module, a window with a pre-defined size
is used to scan through the MCT image from top-left to
bottom-right. At every position of the scanning window, if the confidence value of each stage is greater than the corresponding threshold, then the current window region is considered an eye candidate. The threshold values are
built in advance (by learning), corresponding to the
classifier’s stages. The coordinates of the eye candidate
can be calculated based on the current position of the
scanning window. In contrast to the software
implementation, in this paper we apply the 3-stage
classifier in parallel for eight MCT images. Hence, we
need 24 classifiers (3 stages × 8 images) for detection.
Moreover, our hardware implementation can detect the
left eye and the right one simultaneously. As a result,
several eye candidates can be detected within one left or
right eye region. If there is more than one eye candidate,
the candidate that has the maximum confidence value is
selected as the final detection result.
2. Flip / Pyramid Downscaling Module
Fig. 4 shows the pyramid downscaling images of eye
candidate regions. The pyramid downscaling module
needs a face image and face region coordinates to
generate pyramid images. The face image is stored in
SRAM and face region coordinates can be received from
the face detector.
First, the pyramid module splits the eye candidate
regions. As shown in Fig. 4, the top-left region is set as
the left eye candidate region and the top-right region is
set as the right eye candidate region. Image pixels in
SRAM can be accessed one-by-one every pixel clock
period. Thus, we cannot generate all eight pyramid
images simultaneously. Therefore, we should generate
the first pyramid images of the left and right eye regions
in serial.

Fig. 3. Overall hardware architecture of the proposed system.

In Fig. 4, image (1) and image (2) are generated
in serial, and these image data are stored in registers. We
use nearest-neighbor interpolation to generate
downscaling images. We can generate other downscaled
images in parallel using these images stored in registers.
In the software algorithm, the right eye is flipped after
cropping the image. As shown in Fig. 4, the right eye
candidate region is scanned from right to left. Thus, flip
is unnecessary in our architecture. The sizes of the pyramid images are the same as in the software algorithm (33×36, 31×34, 29×32, and 27×29).
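As an illustration of how the nearest-neighbor generation and the free flip can be realized, the following Verilog sketch generates source addresses for one pyramid image. The module name, ports, and the 16.16 fixed-point stepping are our assumptions; the paper does not specify this interface. For the right-eye region, feeding a mirrored source column index (right edge minus src_x) would implement the flip at no extra cost.

```verilog
// Nearest-neighbor address generator for one fixed-size pyramid image
// (a sketch under assumed names/widths, not the paper's exact design).
// step_x/step_y hold (source size / destination size) in 16.16 fixed
// point; truncating the accumulator to its integer part selects the
// source pixel, a common nearest-neighbor approximation.
module nn_downscale_addr #(
    parameter DST_W = 33,               // e.g., the 33x36 pyramid image
    parameter DST_H = 36
)(
    input  wire        clk,
    input  wire        rst,
    input  wire [31:0] step_x, step_y,  // 16.16 fixed-point increments
    output reg  [15:0] src_x, src_y,    // selected source coordinates
    output reg         done
);
    reg [31:0] acc_x, acc_y;            // fixed-point accumulators
    reg [15:0] dx, dy;                  // destination pixel counters

    always @(posedge clk) begin
        if (rst) begin
            acc_x <= 0; acc_y <= 0; dx <= 0; dy <= 0; done <= 0;
        end else if (!done) begin
            src_x <= acc_x[31:16];      // integer part = source column
            src_y <= acc_y[31:16];      // integer part = source row
            if (dx == DST_W-1) begin    // end of a destination row
                dx <= 0;  acc_x <= 0;
                dy <= dy + 1;  acc_y <= acc_y + step_y;
                if (dy == DST_H-1) done <= 1;
            end else begin
                dx <= dx + 1;  acc_x <= acc_x + step_x;
            end
        end
    end
endmodule
```

Instantiating one such generator per pyramid image allows all eight images to be filled concurrently once the first left and right crops are available in registers.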
3. MCT Module
Fig. 5 shows the realization of MCT bit computation.
The average intensity of pixels in the 3×3 window is
needed to calculate the MCT bit vector. We need to use
the divider module to calculate the average intensity.
However, the computation latency of the divider module
is very long compared to the adder module. Thus, we use
the adder and shifter instead of the divider.
We do not calculate the average intensity directly. Instead, we calculate SUM_I and I(x,y)×9, as shown in Fig. 5. SUM_I is the sum of the pixels in the window and is computed with an adder tree. The I(x,y)×9 value is obtained by multiplying the intensity of each pixel by nine, implemented as a left shift by three bits plus the original value using a combination of a shifter and an adder, as shown in Fig. 5. Since comparing SUM_I with I(x,y)×9 is equivalent to comparing the window mean with I(x,y), this comparison yields the MCT bit for each pixel in the window.
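The following Verilog sketch shows this divider-free MCT computation (port names and widths are our assumptions; the comparison direction follows Fig. 5 and the definition in Section II.1):

```verilog
// Divider-free MCT over a 3x3 window (a sketch, not the paper's exact
// module). SUM_I is the 9-pixel sum from an adder tree; each MCT bit
// tests SUM_I > 9*I, i.e., window mean > pixel, where 9*I is formed as
// (I << 3) + I with a shifter and an adder instead of a multiplier.
module mct_3x3 (
    input  wire       clk,
    input  wire [7:0] p11, p12, p13,    // 3x3 window of 8-bit pixels
    input  wire [7:0] p21, p22, p23,
    input  wire [7:0] p31, p32, p33,
    output reg  [8:0] mct               // 9-bit MCT vector
);
    // adder tree: 9 * 255 = 2295 fits in 12 bits
    wire [11:0] sum_i = (p11 + p12 + p13) +
                        (p21 + p22 + p23) +
                        (p31 + p32 + p33);

    // one comparison per pixel: 1 when the pixel is darker than the mean
    function [0:0] mct_bit(input [11:0] s, input [7:0] p);
        mct_bit = (s > (({4'b0, p} << 3) + {4'b0, p}));
    endfunction

    always @(posedge clk)
        mct <= {mct_bit(sum_i, p11), mct_bit(sum_i, p12), mct_bit(sum_i, p13),
                mct_bit(sum_i, p21), mct_bit(sum_i, p22), mct_bit(sum_i, p23),
                mct_bit(sum_i, p31), mct_bit(sum_i, p32), mct_bit(sum_i, p33)};
endmodule
```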
4. Parallel Classifier Module
In the classifier module, a window of a pre-defined
size (15×15) slides from top-left to bottom-right on the
MCT images, as shown in Fig. 6. In contrast to the
software algorithm, we apply the 3-stage classifier in
parallel for eight MCT images. Thus, we can calculate the confidence values of all stage classifiers simultaneously and detect both eyes at the same time. This requires 24 classifiers (8 images × 3 stages).
We use a revised version of the parallel classifier
cascade architecture proposed by Jin et al. [22] for eye
detection, as shown in Fig. 7.

Fig. 4. Pyramid downscaling images.

Fig. 5. Realization of MCT bit computation.

Fig. 6. Parallel classifier cascade for left eye detection.

They designed the classifiers using look-up tables (LUTs), since each
classifier consists of the set of feature locations and the
confidence value corresponding to the LBP (local binary
pattern). LUTs contain training data generated by
AdaBoost training. We revise this LUT for MCT rather
than LBP. The three stages of the cascades have an
optimized number of allowed positions: 22, 36, and 100,
respectively. In the software algorithm, the classifiers are
cascaded in serial. In the proposed hardware, however,
all the cascade stages are pipelined and operate in
parallel. The feature extraction result is generated at each
pixel clock time after the fixed pipeline latency.
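A simplified sketch of one such LUT-based stage is given below (names, widths, and the flat port packing are our assumptions, not the exact design of [22] or of this paper). Each allowed feature position indexes a 512-entry confidence LUT filled with AdaBoost training data; the loop describes an adder tree that the synthesizer unrolls, and the stage passes when the summed confidence clears its threshold.

```verilog
// One pipelined classifier stage (sketch under assumed parameters).
// Stage 1 would use NUM_FEAT = 22, stage 2 -> 36, stage 3 -> 100.
module classifier_stage #(
    parameter NUM_FEAT = 22,
    parameter CONF_W   = 16              // assumed confidence width
)(
    input  wire                      clk,
    input  wire [9*NUM_FEAT-1:0]     mct_bits,  // 9-bit MCT codes, packed
    input  wire signed [CONF_W-1:0]  threshold, // from AdaBoost training
    output reg                       pass
);
    // one confidence LUT per feature position (contents from training)
    reg signed [CONF_W-1:0] conf_lut [0:NUM_FEAT-1][0:511];

    integer i;
    reg signed [CONF_W+7:0] sum;         // headroom for up to ~128 terms
    always @(posedge clk) begin
        sum = 0;                          // combinational adder tree
        for (i = 0; i < NUM_FEAT; i = i + 1)
            sum = sum + conf_lut[i][mct_bits[9*i +: 9]];
        pass <= (sum > threshold);        // registered stage decision
    end
endmodule
```

Three such stages per image, pipelined back-to-back and replicated for the eight pyramid images, give the 24 classifiers described above; every stage produces its decision each pixel clock after the fixed pipeline latency.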
IV. SYSTEM IMPLEMENTATION
1. Synthesis Result
All hardware modules are implemented using Verilog
HDL and simulated by the ModelSim 6.5 Simulator for
functional verification. We use a camera with a 24.5454
MHz clock frequency for the implementation, so for
consistency we use a 25 MHz clock frequency for the
simulation.
The hardware module is synthesized by the Synplify
logic synthesis tool from Synplicity. The target device
for the implementation is a Virtex-5 LX330 FPGA from
Xilinx. We use the Xilinx ISE 10.1.3 integration tool for
front-end and back-end implementation. The place and
route phase is performed by Xilinx PAR (Place and
Route Program).
Table 1 shows the device utilization summary reported
by the Xilinx ISE. The number of occupied slices and the
total used memory of the proposed system are 39,914 (about 76% of the device) and 288 KB, respectively, as shown in Table 1. The equivalent gate count is about 200 K gates.
Table 2 shows the timing summary reported by the
Synplify synthesis tool. The requested frequency is 25.0
MHz and the maximum estimated frequency of the
implemented system is 106.3 MHz. We use a progressive
camera that has a 24.5454 MHz pixel clock. This camera
captures the VGA image at 60 fps. We can expect a rate
of around 240 fps for the processing of standard VGA
images at the reported maximum estimated frequency of
the system, because the frame rate increases linearly with
the pixel clock increment.
2. System Configuration
Fig. 8 shows the system configuration for testing and
verifying the eye detection system. The overall system
consists of four PCBs (printed circuit boards): a
CameraLink interface board, two FPGA boards for face
and eye detection, and an SRAM/DVI interface board.
The CameraLink interface board receives sync signals
and image data signals from the camera using the
CameraLink interface. These signals include pixel clock,
VSYNC (vertical sync - frame valid), HSYNC (horizontal
sync - line valid), DVAL (data valid) and streaming
image data. The FPGA boards grab these signals for use in face and eye detection.
Fig. 7. Parallel classifier cascade and feature evaluation.
Table 1. Device utilization and equivalent gate count
Logic utilization Used Avail. Util.
Num. of slice registers 45,716 207,360 22%
Num. of slice LUTs 133,916 207,360 64%
Num. of occupied slices 39,914 51,840 76%
Memory (KB) 288 10,368 2%
Equivalent gate-count : about 200 K gates
Table 2. Timing summary
Requested Frequency 25.0 MHz
Estimated Frequency 106.3 MHz
Requested Period 40.000 ns
Estimated Period 9.409 ns
Slack 30.591 ns
We use two FPGA boards. One
is used for face detection and the other is used for eye
detection. We use the face detector proposed by Jin et al.
[22] to detect the coordinates of face regions. The single
image frame is stored and updated in SRAM to generate
pyramid downscaling images. The eye detection result is
displayed on the monitor using a DVI interface.
V. EXPERIMENTAL RESULT
The robustness of our system was evaluated using both a public image database and real people's faces captured by our camera system. The database images were downloaded from the FDDB face database [23]. We selected the first directory, named 2002-07, in the database for the experiment. This directory contains 934 images with 1394 faces, i.e., 1394 left-right eye pairs.
The evaluation was made using four criteria. First, the
precision and recall values were analyzed with respect to
all left-right eye pairs. These values were calculated
based on the number of correct detections (true positives),
the number of missed detections (false negatives), and the
number of wrong detections (false positives). Second, the results were analyzed with out-of-focus images excluded. Similarly, the two remaining criteria excluded rotation-in-plane (RIP) cases and with-glasses cases, respectively. Fig. 9 shows the experimental
results in four different cases. It is noted that our
implementation does not support the case of rotation-out-of-plane (ROP); thus, all faces rotated out of the plane (i.e., faces in which only one eye is visible) were not considered in the result evaluation. The experimental
results yielded 93% detection accuracy. Fig. 10 shows
sample image results where faces and eyes were
successfully detected under different conditions.
Fig. 11 and Table 3 show the detection results for real people's faces captured by our camera system, with and without glasses. The detection rate is 87% over 1200 frames. As shown in Table 3, the detection rate for people wearing glasses drops significantly. Light spots or blooming can appear near the eyes due to reflections from the glasses, and these effects lower the detection rate. Therefore, we need some additional processing to reduce the effects caused by glasses.

Fig. 11. Online detection results of a person with glasses: (a) correctly detected results, (b) falsely detected results.

Table 3. Detection results for persons with and without glasses

Person     Glasses   Frames   Falsely Detected   Detection Rate
Person 1   Yes       300      43                 85.67 %
Person 1   No        300      10                 96.67 %
Person 2   Yes       300      85                 71.67 %
Person 2   No        300      17                 94.33 %
Total      -         1200     155                87.08 %
Fig. 8. System configuration for eye detection.

Fig. 9. Evaluation of four different cases: all left-right eye pairs, all except out-of-focus eyes, all except RIP of faces or eyes, all except eyes with glasses.

Fig. 10. Sample results yielded from the face image database.

By considering the detection results for the provided image database (software implementation) and the real people's faces captured by our camera system (hardware
implementation), we can see that the detection rate of the
hardware implementation is lower than that of the
software implementation in the with-glasses case.
However, the processing time of the hardware system is
much better than that of the software: 0.15 ms/frame
versus 20 ms/frame. We used a personal computer for the
evaluation of the software implementation, which had an
AMD Athlon 64 X2 Dual Core Processor 3800+ 2.00
GHz and 2-GB RAM.
Table 4 shows a comparison with other eye detection systems. Our method handles the largest image size at the fastest processing speed, with only a small decrease in detection accuracy, and is well suited to hardware implementation.

Table 4. Performance summary of reported software-based eye detection systems

Implementation        Image Size   Detection Rate   FPS      Features
K. Peng 2005 [24]     92×112       95.2 %           1 fps    Feature Localization, Template Matching
P. Wang 2005 [25]     320×240      94.5 %           3 fps    Fisher Discriminant Analysis, AdaBoost, Recursive Nonparametric Discriminant Analysis
Q. Wang 2006 [26]     384×286      95.6 %           -        Binary Template Matching, Support Vector Machines
T. Akashi 2007 [27]   320×240      97.9 %           35 fps   Template Matching, Genetic Algorithm
H.-J. Kim 2008 [28]   352×240      94.6 %           5 fps    Zernike Moments, Support Vector Machines
M. Shafi 2009 [29]    480×540      87 %             1 fps    Geometry, Shape, Density Features
H. Liu 2010 [30]      640×480      94.16 %          30 fps   Zernike Moments, Support Vector Machines, Template Matching
Ours                  640×480      93 %             50 fps   AdaBoost, Modified Census Transform, Multi-scale Processing
VI. CONCLUSIONS
In this paper, we proposed a dedicated hardware
architecture for eye detection. We implemented it on an
FPGA. The implemented system processes each pyramid
image of the left and the right eye candidate regions in
parallel. The processing time to detect the left and right
eyes of one face is less than 0.15 ms with a VGA image.
The maximum estimated frequency of the implemented
system is 106.3 MHz. We can expect a rate of around
240 fps for the processing of standard VGA images at the
reported maximum estimated frequency of the system,
because the frame rate increases linearly with the pixel
clock increment.
The detection rate of our system is 93% with the
FDDB face database and 87% for real people's faces.
However, the detection rate for people with glasses drops
significantly. This is a weakness of our proposed system.
In the future, we will add a pre-processing step to reduce
the effects caused by glasses.
ACKNOWLEDGMENTS
This work was supported by the Ministry of
Knowledge Economy and the Korea Institute for Advancement of Technology through the Workforce
Development Program in Strategic Technology.
REFERENCES
[1] R. Chellappa, C. L. Wilson, and S. Sirohey, “Human
and Machine Recognition of Faces: A Survey”,
Proceedings of the IEEE, Vol.83, No.5, May, 1995.
[2] G. C. Feng, and P. C. Yuen, “Variance projection
function and its application to eye detection for
human face recognition”, Pattern Recognition
Letters, Vol.19, No.9, pp.899-906, Jul., 1998.
[3] Q. Ji, and X. Yang, “Real-Time Eye, Gaze, and
Face Pose Tracking for Monitoring Driver
Vigilance”, Real-Time Imaging, Vol.8, No.5,
pp.357-377, Oct., 2002.
[4] Robert J. K. Jacob, “The Use of Eye Movements in
Human-Computer Interaction Techniques: What
You Look At is What You Get,” ACM
Transactions on Information Systems (TOIS), Vol.9,
No.2, Apr., 1991.
[5] Z. Zhu, Q. Ji, “Robust real-time eye detection and
tracking under variable lighting conditions and
various face orientations,” Computer Vision and
Image Understanding, Vol.98, No.1, pp.124-154,
Apr., 2005.
A. L. Yuille, P. W. Hallinan, and D. S. Cohen,
“Feature Extraction from Faces Using Deformable
Templates,” International Journal of Computer
Vision, Vol.8, No.2, pp.99-111, 1992.
[7] X. Xie, R. Sudhakar, and H. Zhuang, “On
improving eye feature extraction using deformable
templates,” Pattern Recognition, Vol.27, No.6,
pp.791-799, Jun., 1994.
[8] K. Lam, and H. Yan, “Locating and extracting the
eye in human face images,” Pattern Recognition,
Vol.29, No.5, pp.771-779, May, 1996.
[9] A. Pentland, B. Moghaddam, and T. Starner,
“View-Based and Modular Eigenspaces for Face
Recognition,” in Proceedings of IEEE Conference
on Computer Vision and Pattern Recognition
(CVPR'94), pp.84-91, 1994.
[10] W. Huang, and R. Mariani, “Face Detection and
Precise Eyes Location,” in Proceedings of
International Conference on Pattern Recognition
(ICPR'00), Vol.4, pp.722-727, 2000.
[11] J. Huang, H. Wechsler, “Eye Detection Using
optimal Wavelet Packets and Radial Basis
Functions (RBFs),” International Journal of
Pattern Recognition and Artificial Intelligence,
Vol.13, No.7, pp.1009-1025, 1999.
[12] S. A. Sirohey, A. Rosenfeld, “Eye detection in a
face image using linear and nonlinear filters,”
Pattern Recognition, Vol.34, No.7, pp.1367-1391,
2001.
T. D'Orazio, M. Leo, G. Cicirelli, and A. Distante,
“An Algorithm for real time eye detection in face
images”, in Proceedings of the 17th International
Conference on Pattern Recognition (ICPR'04),
Vol.3, 2004.
[14] K. Lin, J. Huang, J. Chen, and C. Zhou, “Real-time
Eye Detection in Video Streams,” Fourth
International Conference on Natural Computation
(ICNC'08), pp.193-197, Oct., 2008.
[15] A. Amir, L. Zimet, A. Sangiovanni-Vincentelli, and
S. Kao, “An embedded system for an eye-detection
sensor”, Computer Vision and Image Understanding,
Vol.98, No.1, pp.104-123, Apr., 2005.
[16] B. Jun, and D. Kim, “Robust Real-Time Face
Detection Using Face Certainty Map”, Lecture
Notes in Computer Science, Vol.4642, pp.29-38,
2007.
[17] I. Choi, and D. Kim, “Eye Correction Using
Correlation Information,” Lecture Notes in
Computer Science, Vol.4843, pp.698-707, 2007.
[18] B. Fröba, and A. Ernst, “Face Detection with the
Modified Census Transform,” in Proceedings of
the Sixth IEEE International Conference on
Automatic Face and Gesture Recognition (FGR'04),
pp.91-96, May 2004.
[19] Y. Freund, and R. E. Schapire, “A Decision-
Theoretic Generalization of On-Line Learning and
an Application to Boosting,” Journal of Computer
and Systems Sciences, Vol.55, pp.119-139, Aug.,
1997.
[20] Y. Freund, and R. E. Schapire, “A Short
Introduction to Boosting,” Journal of Japanese
Society for Artificial Intelligence, Vol.14, No.5,
pp.771-780, 1999.
[21] P. Viola and M. Jones, “Fast and Robust
Classification using Asymmetric AdaBoost and a
Detector Cascade,” Advances in Neural
Information Processing Systems, Vol.2, pp.1311-
1318, 2002.
[22] S. Jin, D. Kim, T. T. Nguyen, D. Kim, M. Kim and
J. W. Jeon, “Design and Implementation of a
Pipelined Datapath for High-Speed Face Detection
Using FPGA,” IEEE Transaction on Industrial
Informatics, Vol.8, No.1, pp.158-167, Feb., 2012.
[23] V. Jain and E. Learned-Miller, "FDDB: A Benchmark for Face Detection in Unconstrained Settings," Technical Report UM-CS-2010-009, Dept. of Computer Science, University of Massachusetts, Amherst, 2010. http://vis-www.cs.umass.edu/fddb
[24] K. Peng, L. Chen, S. Ruan, and G. Kukharev, “A
Robust Algorithm for Eye Detection on Gray
Intensity Face without Spectacles,” Journal of
Computer Science and Technology, Vol.5, No.3,
pp.127-132, 2005.
[25] P. Wang and Qiang Ji, “Learning Discriminant
Features for Multi-View Face and Eye Detection,”
in Proceedings of IEEE Conference on Computer
Vision and Pattern Recognition, 2005.
[26] Q. Wang and J. Yang, “Eye Detection in Facial
Images with Unconstrained Background,” Journal
of Pattern Recognition Research, Vol.1, pp.55-62,
2006.
[27] T. Akashi, Y. Wakasa, K. Tanaka, S. Karungaru,
and M. Fukumi, “Using Genetic Algorithm for Eye
Detection and Tracking in Video Sequence,”
Journal of Systemics, Cybernetics, and Informatics,
Vol.5, No.2, pp.72-78, 2007.
[28] H.-J. Kim and W.-Y. Kim, “Eye Detection in
Facial Images using Zernike Moments with SVM,”
ETRI Journal, Vol.30, No.2, 2008.
[29] M. Shafi and P. W. H. Chung, “A Hybrid Method
for Eyes Detection in Facial Images,” International
Journal of Electrical, Computer, and Systems
Engineering, Vol.42, pp.231–236, 2009.
[30] H. Liu and Q. Liu, “Robust Real-Time Eye
Detection and Tracking for Rotated Facial Images
under Complex Conditions,” in Proceedings of
International Conference on Natural Computation,
2010.
Dongkyun Kim received B.S. and
M.S. degrees in electrical and computer
engineering from Sungkyunkwan
University, Suwon, Korea, in 2007
and 2009, respectively. He is currently
working toward a Ph.D. degree with
the School of Information and
Communication Engineering, Sungkyunkwan University.
His research interests include image/speech signal
processing, embedded systems, and real-time applications.
Junhee Jung received a B.S. degree
in the School of Information and
Communication Engineering from
Sungkyunkwan University, Korea, in
2009 and an M.S. degree in the
Department of Electrical and Computer
Engineering from Sungkyunkwan
University, Korea, in 2011.
His interests include image processing, computer vision
and SoC.
Thuy Tuong Nguyen received a B.S.
(magna cum laude) degree in computer
science from Nong Lam University,
Ho Chi Minh City, Vietnam, in 2006, and M.S. and Ph.D. degrees in electrical
and computer engineering from
Sungkyunkwan University, Suwon,
Korea, in 2009 and 2012, respectively.
His research interests include computer vision, image
processing, and graphics processing unit computing.
Daijin Kim received a B.S. degree in
electronic engineering from Yonsei
University, Seoul, Korea, in 1981, an
M.S. degree in electrical engineering
from the Korea Advanced Institute of
Science and Technology (KAIST),
Taejon, 1984, and a Ph.D. degree in
electrical and computer engineering from Syracuse
University, Syracuse, NY, in 1991.
He is currently a Professor in the Department of Computer Science and Engineering at POSTECH. His research interests
include intelligent fuzzy system, soft computing, genetic
algorithms, and hybridization of evolution and learning.
Munsang Kim received his B.S. and
M.S. degrees in mechanical
engineering from the Seoul National
University in 1980 and 1982,
respectively, and the Dr.-Ing. degree
in robotics from the Technical
University of Berlin, Germany, in
1987. Since 1987, he has been working as a research scientist at the Korea Institute of Science and Technology (KIST), Korea. He has led the Advanced Robotics Research Center since 2000 and has been the director of the Intelligent Robot-The Frontier 21 Program since 2003. His current research
interests are the design and control of novel mobile
manipulation systems, haptic device design and control,
and sensor application in intelligent robots.
Key Ho Kwon received a B.S. degree
from the Department of Electronics
Engineering, Seoul National University,
Korea, in 1975, an M.S. degree from
the Department of Electronics
Engineering, Seoul National University,
Korea, in 1978, and a Ph.D. degree
from the Department of Electronics Engineering, Seoul
National University, Korea, in 1989. He is currently a
Professor at the School of Information and Communication
Engineering, Sungkyunkwan University.
His research interests include artificial intelligence, fuzzy
theory, neural networks, and genetic evolutionary
algorithms.
Jae Wook Jeon received B.S. and
M.S. degrees in electronics engineering
from Seoul National University,
Seoul, Korea, in 1984 and 1986,
respectively, and a Ph.D. degree in
electrical engineering from Purdue
University, West Lafayette, IN, in
1990.
From 1990 to 1994, he was a Senior Researcher with
Samsung Electronics, Suwon, Korea. Since 1994, he has
been with the School of Electrical and Computer
Engineering, Sungkyunkwan University, Suwon, first as
an Assistant Professor, and currently as a Professor. His
research interests include robotics, embedded systems,
and factory automation.