a 99.4 fps optical flow estimation processor with image ... · as shown in fig. 1(b), for real-time...

8
JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.19, NO.1, FEBRUARY, 2019 ISSN(Print) 1598-1657 https://doi.org/10.5573/JSTS.2019.19.1.116 ISSN(Online) 2233-4866 Manuscript received Jan 30, 2019; accepted Jan. 31, 2019 School of EE, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, Korea E-mail : [email protected] A 99.4 fps Optical Flow Estimation Processor with Image Tiling for Action Recognition in Mobile Devices Juhyoung Lee, Sungpill Choi, Jinmook Lee, Sanghoon Kang, and Hoi-Jun Yoo Abstract—A 99.4 fps optical flow estimation (OFE) processor with image tiling is proposed for action recognition in mobile devices. The OFE is essential for the high action recognition accuracy. However, it is unsuitable for real-time constraint in a mobile computing environment because it requires a huge amount of external memory accesses (EMAs) and matrix computations. For mitigating the external memory bandwidth requirement, this paper proposes the tile-based hierarchical OFE. It divides input images into several tiles and enables intermediate data reusing with 326.4 KB on-chip memory and 175.8 MB/s external memory bandwidth. Moreover, a background decision unit with early termination is proposed to reduce computation workload. It gets rid of unnecessary matrix computation by terminates the computation early for zero optical flow region. As a result, the proposed features reduce external memory bandwidth by 99.3 % and increase throughput by 50.7 %, respectively. The proposed 12.8 mm2 OFE processor is implemented in 65 nm CMOS technology, and it achieves the real-time OFE with 99.4 frames- per-second (fps) throughput for an image resolution of QVGA (320 × 240). Index Terms—Optical flow estimation, action recognition, tile-based processing, early termination, real-time mobile user interface, high-throughput ASIC I. INTRODUCTION Recently, action recognition has been highlighted as a next-generation user interface in mobile devices such as pet robots for human-computer interaction (HCI) [1, 2]. The most important thing in action recognition on mobile devices is the high recognition accuracy, thus several previous works [3, 4] used deep neural networks with RGB images as their input. However, in order to achieve more robust action recognition performance on illumination changes and clutters in the background, we need motion information as well as RGB images [5]. For Scale 0 Image t Image t-1 Flow Generation Scale n's Optical Flow Sc. 2's Optical Flow Sc. 1's Optical Flow Final Optical Flow Scale Space Generation Blurring & Subsampling Wraping Scale 1 Scale 2 Blur & Subsample Pyramid t Pyramid t-1 Conjugate Gradient Descent (Matrix Computations) Scale n Scale n n=2 n=1 n=0 (a) Computation Workload Ext. Memory Bandwidth 1.98GFLOPS 0.13GB/s : Conjugate Gradient Descent : Parameter Generation : Scale Space Generation 18GB/s 31.1GFLOPS 2.78GFLOPS (b) Fig. 1. (a) The algorithm flow of the optical flow estimation (OFE), (b) External memory bandwidth, computation workload breakdown of the OFE.

Upload: others

Post on 21-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A 99.4 fps Optical Flow Estimation Processor with Image ... · As shown in Fig. 1(b), for real-time (30 fps) OFE with QVGA (320×240) images, 18.1 GB/s external memory bandwidth and

JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.19, NO.1, FEBRUARY, 2019 ISSN(Print) 1598-1657 https://doi.org/10.5573/JSTS.2019.19.1.116 ISSN(Online) 2233-4866

Manuscript received Jan 30, 2019; accepted Jan. 31, 2019 School of EE, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, Korea E-mail : [email protected]

A 99.4 fps Optical Flow Estimation Processor with Image Tiling for Action Recognition in Mobile Devices

Juhyoung Lee, Sungpill Choi, Jinmook Lee, Sanghoon Kang, and Hoi-Jun Yoo

Abstract—A 99.4 fps optical flow estimation (OFE) processor with image tiling is proposed for action recognition in mobile devices. The OFE is essential for the high action recognition accuracy. However, it is unsuitable for real-time constraint in a mobile computing environment because it requires a huge amount of external memory accesses (EMAs) and matrix computations. For mitigating the external memory bandwidth requirement, this paper proposes the tile-based hierarchical OFE. It divides input images into several tiles and enables intermediate data reusing with 326.4 KB on-chip memory and 175.8 MB/s external memory bandwidth. Moreover, a background decision unit with early termination is proposed to reduce computation workload. It gets rid of unnecessary matrix computation by terminates the computation early for zero optical flow region. As a result, the proposed features reduce external memory bandwidth by 99.3 % and increase throughput by 50.7 %, respectively. The proposed 12.8 mm2 OFE processor is implemented in 65 nm CMOS technology, and it achieves the real-time OFE with 99.4 frames-per-second (fps) throughput for an image resolution of QVGA (320 × 240). Index Terms—Optical flow estimation, action recognition, tile-based processing, early termination, real-time mobile user interface, high-throughput ASIC

I. INTRODUCTION

Recently, action recognition has been highlighted as a next-generation user interface in mobile devices such as pet robots for human-computer interaction (HCI) [1, 2]. The most important thing in action recognition on mobile devices is the high recognition accuracy, thus several previous works [3, 4] used deep neural networks with RGB images as their input. However, in order to achieve more robust action recognition performance on illumination changes and clutters in the background, we need motion information as well as RGB images [5]. For

Scale 0Image t

Image t-1

Flow Generation

Scale n's Optical Flow

Sc. 2's Optical Flow Sc. 1's Optical Flow Final Optical Flow

Scale Space GenerationBlurring & Subsampling

Wraping

Scale 1

Scale 2

Blur & Subsample

Pyramid t

Pyramid t-1

Conjugate Gradient Descent

(Matrix Computations)

Scale n Scale nn=2

n=1n=0

(a)

Computation Workload

Ext. Memory Bandwidth

1.98GFLOPS

0.13GB/s

: Conjugate Gradient Descent

: Parameter Generation

: Scale Space Generation

(b)18GB/s

31.1GFLOPS

2.78GFLOPS

(b)

Fig. 1. (a) The algorithm flow of the optical flow estimation (OFE), (b) External memory bandwidth, computation workload breakdown of the OFE.

Page 2: A 99.4 fps Optical Flow Estimation Processor with Image ... · As shown in Fig. 1(b), for real-time (30 fps) OFE with QVGA (320×240) images, 18.1 GB/s external memory bandwidth and

JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.19, NO.1, FEBRUARY, 2019 117

example, action recognition accuracy is enhanced by about 14% when the motion information is used along with RGB images on a UCF-101 dataset [6].

In order to obtain the motion information, optical flow estimation (OFE) is broadly used. It is an algorithm which calculates a set of vectors which contain an object or pattern’s movement in the image sequence. Fig. 1(a) shows the overall algorithm of the Brox Flow [7, 8], which is an OFE algorithm that increases the action recognition accuracy. The OFE consists of two stages: 1) Scale space generation and 2) Flow generation. First, in the scale space generation stage, it generates multiple images in different scales (Gaussian Pyramid) to achieve scale invariance. Secondly, in the flow generation stage, conjugate gradient descent (CGD) algorithm estimates the final optical flow result by minimizing a cost function that is calculated by comparing the previous frame with the current frame for every scale in the pyramid.

However, it is difficult to implement real-time OFE in mobile devices due to massive amount of external memory access and OFE’s huge computational cost.

As shown in Fig. 1(b), for real-time (30 fps) OFE with QVGA (320×240) images, 18.1 GB/s external memory bandwidth and 35.7 GFLOPS of computational throughput are required. Over 99.3% of the external memory bandwidth and 85% of the computation are occupied by the CGD iterations in flow generation stage. It is due to the CGD, which performs a lot of large (over 320 × 240 for QVGA image) matrix multiplications for every iteration.

In this paper, we propose a real-time OFE processor with two key features. 1) Tile-based hierarchical optical

flow estimation algorithm (THOFE) is proposed to mitigate external memory bandwidth requirement. The THOFE divides input images into several tiles and reuses the intermediate data on-chip during the CGD as much as possible. Thus, it minimizes the required external memory bandwidth to 175.8 MB/s and required on-chip memory capacity from 4 MB to 326.4 KB.

2) Background detection unit with early termination algorithm is proposed to reduce the required computational cost by omitting redundant image patches with zero optical flow, thus increasing the computational throughput by 50.7%.

From the two features, the proposed hardware achieves 99.4 fps OFE with only 175.8 MB/s external memory bandwidth. Compared to the previous work [9], the selection of optimal tile size and increased parallelism helped improve the OFE throughput by 116%.

II. PROPOSED TILE-BASED HIERARCHICAL OFE

1. Basic Concept & Description of the Proposed OFE The basic concept of the proposed tile-based

hierarchical OFE algorithm is to divide the input image to multiple smaller image tiles to reduce external memory access (EMA).

The CGD iterations, which is performed in the flow generation stage, occupy the largest portion of EMA. It is because, the CGD includes many iterations of matrix multiplication (over 320 × 240 × 26 dimension for QVGA image), whose dimension is too large to store the

Conventional OFE

Tile-based OFE

Tile-based Optical Flow Estimation with 2 Cores @S3

Tile-based Hierarchical Optical Flow Estimation

S4

S0S1

@S0~S2 (No-Tile) @S3 (4-tile w/o Overlapping) @S3 (4-tile w/ Overlapping)Small O.FDetection

Large Optical Flow Detection

Small Optical Flow Detection available

No Optical Flow Detection at Edge

Small O.FDetection

T=0(Core 0)

CGD Iteration x 200

CGD Iteration x 200

A B(Core 1)

T=1

DC(Core 0)

CGD Iteration x 200

CGD Iteration x 200

(Core 1)B

C DA

Fig. 2. The overall flow of proposed tile-based hierarchical optical flow estimation.

Page 3: A 99.4 fps Optical Flow Estimation Processor with Image ... · As shown in Fig. 1(b), for real-time (30 fps) OFE with QVGA (320×240) images, 18.1 GB/s external memory bandwidth and

118 JUHYOUNG LEE et al : A 99.4 fps OPTICAL FLOW ESTIMATION PROCESSOR WITH IMAGE TILING FOR ACTION …

entire elements on-chip. In addition, the EMA requirement of the CGD is proportional to the size of the input image. The conventional OFE [8] performs the CGD for the entire image at once. However, it is not efficient to accelerate the CGD for the entire image in mobile devices which have limited memory capacity.

Even though integration of larger on-chip memory might help reduce the EMA, over 4 MB of on-chip memory capacity (for QVGA image) is required, which is not feasible for low power consumption. Therefore, if we can reduce the input image size of the CGD, we will be able to reduce the EMA without the need for large on-chip memory size.

Fig. 2 shows the proposed tile-based OFE algorithm. It divides an input image into several tiles, and performs the CGD in a separate manner. In other words, the only data which must be located in the on-chip memory during the CGD iterations are from the current processing tile. Therefore, the required memory size of the proposed tile-based hierarchical OFE is proportional to the tile size, which is much smaller than the whole input image size, to which the conventional method’s memory size was proportional. This tile division method is performed in both spatially (different tiles processed in the different cores) and temporally (different tiles processed at the different time). After CGD iterations are performed on each tile, the generated optical flow images from different tiles are collected in a final optical flow image. In addition, the number of tile division is determined by the size of each scale space image. In other words, the tile division is utilized only for lower layers of the scale space which has a larger image size.

However, due to the loss of information near tile boundaries, accuracy degradation must occur in tile-based hierarchical OFE. For example, the target motion vector which lies across different tiles cannot be estimated with a single tile. Therefore, tile overlapping is applied to prevent the degradation in the proposed OFE. By setting the overlapping region length to include motion vectors near the tile boundaries, we can avoid accuracy degradation. An analysis of proper overlapping length will be shown in Section II-B.

The proposed method requires a smaller on-chip memory size but a larger computation workload due to duplicated computations in the overlapping regions. This trade-off relationship is directly affected by the tile size.

The smaller the tile size, the smaller the required memory the larger computation overhead. The analysis of this trade-off and proper tile size will be also shown in Section II-B.

There are three advantages of the proposed method. First, the required memory size is reduced by a ratio of (tile size)/(image size) compared to the conventional algorithm [8]. Second, the proposed method shows scalability because the tile size is not affected by the image size. Even though the input images become larger, the proposed method can accelerate the OFE using the same on-chip memory whereas the conventional method requires larger on-chip memory. Finally, a parallel processing inside the input image is available because there is no data dependency among different tiles.

0 2 3 5 6Overlapping Length

1 4 7

0.029-0.021

-0.044-0.052 -0.056 -0.060 -0.060 -0.060

0

0.05

Origianl EPE Tile-based EPE (PX)-

-0.05

(a)

Req

uire

d M

emor

y (K

B)

40/40

Wor

kloa

d (G

OPS

)

35/35 30/30 25/25 20/20

65.6 KB

16.4 KB

54.9 GOPS

280 GOPS

0

300

0

200

100

15/15

35

70

9.2 KB

124 GOPS

Tile Size

(b)

40/40 35/35 30/30 25/25 20/200

15/15Tile Size

36013037

2561 2207 20342576

2000

4000

Required Memory Workload (KB GOPS)g g

(c)

Fig. 3. (a) The endpoint error difference between original OFE algorithm and proposed tile-based OFE algorithm, (b) The required memory and workload analysis by sweeping tile size L, (c) The product of required memory and workload of several tile sizes.

Page 4: A 99.4 fps Optical Flow Estimation Processor with Image ... · As shown in Fig. 1(b), for real-time (30 fps) OFE with QVGA (320×240) images, 18.1 GB/s external memory bandwidth and

JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.19, NO.1, FEBRUARY, 2019 119

2. Analysis on Tile-based Hierarchical OFE Fig. 3(a) shows the endpoint error difference of the

conventional method and the proposed tile-based method over different overlapping lengths. The endpoint error (EPE) is measured around tile boundaries with the Middlebury dataset [10], and is defined by (1),

2 2EPE ( ) ( )GT GTu u v v= - + - (1)

where ( , )u v is the estimated optical flow vector and

( , )GT GTu v is the ground truth optical flow vector. For the overlapping length larger than 5 pixels, the

EPE around tile boundaries of the tiling method converges to the lowest point. Therefore, the overlapping length is set to 5 pixels.

Fig. 3(b) shows the computation workload and the required memory over different tile sizes with the overlapping length set as 5 pixels. Fig. 3(c) shows the (required memory × workload) over different tile sizes. In these analyses, the square-shaped tile is used. The overlapping region of length 5 is concatenated to the 4 sides of each tile. The results show that setting tile size as 20 generates the lowest (required memory × workload).

Fig. 4(a)-(f) show the proposed tile-based OFE results, with the overlapping length as 5 and the tile size as 20, in comparison with the conventional OFE method. The results show that there is only a 0.05 EPE difference between the conventional method and the proposed tile-based approach.

III. OVERALL ARCHITECTURE

Fig. 5 illustrates the overall architecture of the proposed processor. It consists of the 16 tile CGD cores, 1-D SIMD core, scale space core, and top RISC controller. The tile CGD cores are integrated to accelerate the CGD while exploiting the parallelism of the proposed tile-based hierarchical OFE. The 1-D SIMD core is integrated to accelerate the matrix computations which are utilized as generating the parameters for the CGD. The scale space core accelerates the image filtering which generates the scale spaces. The top RISC controller is used for the communication among every component in this processor through network-on-chip (NoC).

Each tile CGD core consists of a tile CGD unit, a background detection unit (BDU), a 20.4 KB tile CGD memory, and a core controller. The tile CGD unit, which includes 14 5-way parallel multipliers, 16 5-way parallel adders, and an adder tree. The BDU performs background detection for early termination.

The 1-D SIMD core consists of an element-wise multiplication engine, an element-wise addition engine, and a 32 KB DMEM. The scale space core has a convolution engine for filtering process of scale space generation and a 32 KB DMEM.

The overall flow of the tile-based hierarchical OFE on the proposed processor is as follows: 1) The scale space core generates multiple scales of the input image. 2) The 1-D SIMD core generates parameters for the CGD operation using the generated images. 3) Each tile CGD core operates in a parallel manner to generate an optical flow for each tile with the generated parameters. 4) After

Tile CGD Core 15BDU

Distance Map Register (20B)

Skipping Detector

x16Internal NoC

Parameter Memory (11.7

KB)CGD Memory

(26.3 KB)

CGD Unit

Add

erTr

ee

x16

5-way Add.

x14

5-way Mult.

Parameter Memory (11.7

KB)Distance Map Register (20B)

Skipping Detector

CGD Memory (26.3 KB)

CGD Unit

Add

erTr

ee

x16

5-way Add.

x14

5-way Mult.

Tile-based OFE CoreParameter

Memory (11.7 KB)

Distance Map Register (20B)

Skipping Detector

CGD Memory (26.3 KB)

CGD Unit

Add

erTr

ee

x16

5-way Add.

x14

5-way Mult.

Tile CGD Core 0

Core Controller

Tile CGD Memory (20.4 KB)

Tile CGD Unit

Add

erTr

ee

x16

5-way Add.

x14

5-way Mult.

Ext. IF Ext. IF

Scale Space Core

Conv. Engine

DMEM (32 KB)

Top

Con

trol

ler1-D SIMD Core

EltMult.EltAdd.

DMEM (32 KB)

BDU

Distance Map Register (20B)

Background Detector

Fig. 5. The overall architecture of the proposed OFE processor.

EPE : 0.457EPE : 0.508

(a) (b) (c)

(d) (e) (f)

Color Code of End Point Error Map

1 2 3 4 5 6 7 80

Fig. 4. (a) Ground truth optical flow, (b) Conventional algorithm’s optical flow, (c) Tile-based algorithm’s optical flow (d) RGB image, (e) Endpoint error map of conventional algorithm, (f) Endpoint error map of tile-based.

Page 5: A 99.4 fps Optical Flow Estimation Processor with Image ... · As shown in Fig. 1(b), for real-time (30 fps) OFE with QVGA (320×240) images, 18.1 GB/s external memory bandwidth and

120 JUHYOUNG LEE et al : A 99.4 fps OPTICAL FLOW ESTIMATION PROCESSOR WITH IMAGE TILING FOR ACTION …

the optical flow tiles are collected, it finally forms an entire optical flow image.

The tile size and parallelism are the two major differences between the proposed processor and the previous work [9]. In [9], the processor was optimized for 90 × 70 tile size and took shared memory architecture. In [9], data of one tile were stored in the shared memory and were accessed by two CGD cores. As a result, 328 KB of CGD memory was required, with 9.12 GOPS of computation overhead [9]. On the other hand, the proposed processor is optimized for 20 × 20 tile size, which shows the lowest (required memory × workload). Since the memory requirement for processing a single tile has been greatly reduced, several tiles can be processed simultaneously while using the similar sized memory to [9]. Therefore, the memory of each tile is allocated locally to each core, and 16 cores are integrated to operate in parallel. As a result, external memory access of 1.74 MB / frame is required for real-time OFE

using 326 KB of CGD memory. Even though the computation overhead of 93.3 GOPS occurs, the throughput increases 2.16 times due to the parallelism of 16 CGD cores.

IV. PROPOSED BACKGROUND DECISION UNIT

WITH EARLY TERMINATION

In the optical flow results, the pixels of zeros do not

have any information and they may correspond to a background of the scene. Especially in the case of action recognition, there are numerous pixels with zero motion vector. The OFE result with UCF-101 dataset [11] reveals that average 78.5 % pixels of a scene have zero optical flow. Therefore, CGD throughput can be increased by terminating the computations early for zero optical flow pixels.

Fig. 6 shows the flow of the proposed early termination scheme. The CGD processing with the proposed scheme is divided into 2 phases. During phase 1, CGD iterations are performed through the whole image. Then, the background detection may be performed following phase 1 by thresholding the magnitude of the optical flow vector. Thus, the pixels with small optical flow vector are abandoned. After the background detection, CGD iterations are performed without the abandoned pixels during phase 2. Early termination of the CGD process accomplishes 50.7 % throughput enhancement than the conventional method with 0.7% misprediction rate in the UCF-101 test dataset.

After the background detection, the addresses of non-terminated pixels must be saved for subsequent iterations. Non-terminated pixel’s addresses are stored efficiently utilizing only the distance between non-terminated pixels. In the tile CGD cores, distance map registers of total 320

Phase 1 Phase 2 Result

CGD Iterations

StopZeroO.F.

Early Terminated

Region

Freeze Background

Zero O.F.

CGD Iterations

Fig. 6. The overall flow of the early termination scheme.

Thresholding

2 0 01

du du dv

Adder

Pixel's O.F. Magnitude

Mult. Mult.

Counting Distance

Update Distance Map Register's Row

dv

Background Detector Tile CGD Core

Row of du, dv

Tile CGD Unit

du Distance Map Register

row1 row2 row3 row4

dv

1212

4 0

0000

2 1 0 0

T=1~2 T=3~5 T=6~8 T=9~10

Update Dis. map

Small Large

dudv

: Large value: Small value

01 21 1

1

1 1 2 0 2 1 1 0 1 4 0 0

1Tile CGD Unit

Tile CGD Unit

Tile CGD Unit

(a) (b)

Fig. 7. (a) The background detector, (b) the operation on tile CGD core.

Process 65 nm CMOS

Frequency 200 MHz

Area 4.0 X 3.2 mm2

Power 278 mW

Operating Voltage 1.2 V

Frame Rate 99.4 fps

Tile CGD Core 11~15

Tile CGD Core 7~10

Tile CGD Core 3~6

Scale Space Core Top Ctrl.

Tile

CG

D

Cor

e 0

Mem

ory

Tile

CG

D

Cor

e 1

Mem

ory

Tile

CG

D

Cor

e 3

Mem

ory

Tile

CG

D

Cor

e 2

Mem

ory

Tile

Par

am M

emor

y 0

Tile

Par

am M

emor

y 1

SIMD Core

Fig. 8. The layout photograph and performance summary.

Page 6: A 99.4 fps Optical Flow Estimation Processor with Image ... · As shown in Fig. 1(b), for real-time (30 fps) OFE with QVGA (320×240) images, 18.1 GB/s external memory bandwidth and

JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.19, NO.1, FEBRUARY, 2019 121

bytes are integrated to store. Fig. 7(a) explains the block diagram of the background

detector. It updates the distance map register through thresholding the magnitude of each region’s optical flow vectors (du, dv). If the optical flow of the selected region is smaller than the threshold, the background detector increments the stored distance value. If not, the background detector stores the distance counter value into the distance map and resets the distance counter to 0 for the next region. Fig. 7(b) shows the operation of the tile CGD core for early termination using the stored distance map. The tile CGD core looks-up the distance map register and fetches the non-terminated pixels’ optical flow vectors from the tile CGD memory. And it feeds the vector into the tile CGD unit. If the distance map value is 0, the operation of the corresponding row is terminated and proceeds to the next row.

IV. IMPLEMENTATION RESULTS

The proposed OFE processor shown in Fig. 8 is simulated using 65 nm CMOS technology and it occupies 12.8 mm2 die area. The processor can operate at a maximum 200 MHz clock frequency with 1.2 V supply, and it consumes 278.4 mW power. It achieves 99.4 fps throughput for QVGA resolution (320 × 240) at 200 MHz clock frequency. Table 1 shows the comparison table with previous OFE hardware implementations. The proposed tile-based OFE processor shows 116 % higher throughput than [9]. This is because the proposed processor can exploit higher parallelism with the same memory size by selecting a more optimal

tile size than [9]. The proposed tile-based hierarchical OFE processor fulfills the real-time constraint despite the fact that the OFE is performed for all pixels (100 % density). [13] can also estimate a 100% density optical flow map, but the proposed algorithm shows a much lower AAE than [13]. Fig. 9 shows the constant current injection capability of DCG. It can inject constant amplitudes of 100 kHz pseudo-sine currents, up to kΩs of load condition.

Fig. 9(a) indicates the estimated optical flow images from the UCF-101 dataset [11]. Fig. 9(b) indicates the comparison of action recognition accuracy utilizing the LRCN network [6] with the optical flow maps of the proposed OFE and the baseline [7]. Randomly selected 1000 videos from UCF-101 were utilized for the test dataset, and the proposed OFE shows only 0.8% accuracy drop.

V. CONCLUSIONS

In this paper, we implemented and simulated a tile-based OFE processor. It can support the OFE for action recognition in mobile devices in real-time with 2 key features: 1) The tile-based hierarchical OFE to reduce external memory access efficiently with a limited on-chip memory, 2) the background detection unit with early termination to increase the throughput by 50.7 %. With these features, the proposed OFE processor achieves 99.4 fps throughput for QVGA resolution (320 × 240) at 200 MHz core frequency, showing 278.4 mW power consumption, and satisfying the real-time constraint.

Table 1. Comparison Table

a Average angular error measured on the Yosemite sequence without clouds.

b Density of pixels computed across all the pixels of the Yosemite sequence without clouds.

ApplyEyeMakeup BalanceBeam

Early Termination Ratio : 96.42% Early Termination Ratio : 51.1%

(a)

Baseline [7]/w Proposed Optical Flow

80.9 %80.1 %

Action Recognition Accuracy [6]

(b)

Fig. 9. (a) Estimated optical flow images, (b) Action recognition accuracy of proposed optical flow and baseline [7] (@ 1000 randomly sampled UCF-101 test dataset).

Page 7: A 99.4 fps Optical Flow Estimation Processor with Image ... · As shown in Fig. 1(b), for real-time (30 fps) OFE with QVGA (320×240) images, 18.1 GB/s external memory bandwidth and

122 JUHYOUNG LEE et al : A 99.4 fps OPTICAL FLOW ESTIMATION PROCESSOR WITH IMAGE TILING FOR ACTION …

ACKNOWLEDGMENTS

Manuscript submitted October 5. This work was supported by Institute for Information & communications Technology Promotion(IITP) grant funded by the Korea government(MSIP) (No. 2016-0-00207, Intelligent Processor Architectures and Application Softwares for CNN (Convolutional Neural Network)-RNN (Recurrent Neural Network)).

REFERENCES

[1] Chrungoo A., Manimaran S.S., Ravindran B. “Activity Recognition for Natural Human Robot Interaction”. ICSR 2014. Lecture Notes in Computer Science, vol 8755. Springer, 2014

[2] Fahimeh Rezazadegan, Sareh Shirazi, Ben Upcrofit, and Michael Milford. “Action recognition: From static datasets to moving robots.” In Proceedings of IEEE International Conference on Robotics and Automation, pages 3185–3191, 2017.

[3] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE PAMI, 35(1):221–231, 2013.

[4] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proc. CVPR, 2014.

[5] Sevilla-Lara, L., Liao, Y., Guney, F., Jampani, V., Geiger, A., Black, M.J.: On the integration of optical flow and action recognition. arXiv preprint arXiv:1712.08416 (2017)

[6] Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko and Trevor Darrell. “Long-Term Recurrent Convolutional Networks for Visual Recognition and Description”, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[7] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. “High accuracy optical flow estimation based on a theory for warping”. In ECCV, 2004.

[8] C. Liu. Beyond Pixels: Exploring New Representations and Applications for Motion Analysis. PhD thesis, MIT, 2009.

[9] J. Lee, C. Kim, S. Choi, D. Shin, S. Kang and H.

Yoo, "A 46.1 fps Global Matching Optical Flow Estimation Processor for Action Recognition in Mobile Devices," 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 2018, pp. 1-5.

[10] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski, “A database and evaluation methodology for optical flow,” International Journal of Computer Vision, vol. 92, no. 1, pp. 1–31, 2011.

[11] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” CRCV-TR-12-01, Tech. Rep., 2012.

[12] H. Seong, C. Rhee, and H. Lee, “A novel hardware architecture of the lucas-kanade optical flow for reduced frame memory access,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. PP, no. 99, pp. 1–1, 2015.

[13] M. Komorkiewicz, T. Kryjak, and M. Gorgoń, “Efficient hardware implementation of the Horn–Schunck algorithm for high-resolution real-time dense optical flow sensor”, Sensors, 1424-8220, 14 (2) (2014), pp. 2860-2891.

Juhyoung Lee is a master student at the Korea Advanced Institute of Science and Technology (KAIST). His current research interests include energy-efficient multicore architec-tures and systems especially focused on artificial intelligence and com-

puter vision fields. Lee received a BS in electrical engineering from the KAIST. He is a student member of IEEE. Contact him at [email protected].

Sungpill Choi (S'13 – M’15) received the B.S. and M.S. degrees in Department of Electrical Engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea in 2012 and 2014, respectively. He is currently working

toward the Ph.D. degree in the same department at KAIST. His research interests include the low-power and memory-efficient architecture design for mobile vision SoC especially object segmentation, stereoscopic and deep learning.

Page 8: A 99.4 fps Optical Flow Estimation Processor with Image ... · As shown in Fig. 1(b), for real-time (30 fps) OFE with QVGA (320×240) images, 18.1 GB/s external memory bandwidth and

JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.19, NO.1, FEBRUARY, 2019 123

Jinmook Lee (S’15) received the B.S. degrees in electrical engineering from Hanyang University, Seoul, Korea in 2014, and the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology, Daejeon, South

Korea, in 2016, where he is currently pursuing the Ph. D. degree. His research interests include energy-efficient deep learning inference/training accelerator ASIC design, embedded deep learning platform design and verification, embedded system development with FPGA programming, and deep learning algorithm for sequence recognition. Affiliations: KAIST (Korea Advanced Institute of Science and Technology) Postal Address: #1233, Dept. of Electrical Engineering, KAIST, 335 Gwahangno, Yuseong-gu, Daejeon 305-701, Republic of Korea Phone: +82-42-350-5468 Fax: +82-42-350-3410

Sanghoon Kang (S’16) received the B.S. and M.S. degrees with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2016. He is currently pursuing the Ph.D. degree with the

School of Electrical Engineering. His current research interests include low-power vision system-on-chip design and deep learning processor design.