


ISSCC 2012 / SESSION 28 / ADAPTIVE & LOW-POWER CIRCUITS / 28.2

28.2 A 1.0TOPS/W 36-Core Neocortical Computing Processor with 2.3Tb/s Kautz NoC for Universal Visual Recognition

Chuan-Yung Tsai, Yu-Ju Lee, Chun-Ting Chen, Liang-Gee Chen

National Taiwan University, Taipei, Taiwan

Unlike human brains, where various kinds of visual recognition tasks are carried out with homogeneous neocortical circuits and a unified working mechanism, existing visual recognition processors [1-4] rely on multiple algorithms and heterogeneous multicore architectures to accomplish their narrowly predefined recognition tasks. In contrast, new brain-mimicking recognition algorithms have been proposed recently [5,6]; we label such algorithms as belonging to the Neocortical Computing (NC) model. Based on neuroscience findings, NC algorithms model both the human brain's static and dynamic visual recognition streams (as shown in Fig. 28.2.1) through a series of unified matching/pooling operations. The algorithms exhibit high visual recognition capability and perform accurately on a wide range of image/video recognition tasks. However, using the NC model for real-time recognition involves several hundred GOPS of both dense and sparse matrix calculations and requires over 1.5Tb/s inter-stage data bandwidth – requirements that cannot be met efficiently on existing parallel architectures.
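To make the matching/pooling terminology concrete, below is a minimal software sketch of one NC stage under simple assumptions (a dense 2D correlation-style matching operation followed by non-overlapping 2D maximum pooling, written in plain NumPy); the function names, window sizes and data are illustrative and do not describe the chip's actual datapath.

import numpy as np

def matching(feature, kernel):
    # Dense 2D "matching" operation: slide the kernel over the feature map
    # and accumulate products (the NC model uses linear/Laplacian/Gaussian kernels).
    kh, kw = kernel.shape
    fh, fw = feature.shape
    out = np.zeros((fh - kh + 1, fw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(feature[i:i+kh, j:j+kw] * kernel)
    return out

def pooling(feature, window=2):
    # 2D maximum "pooling" operation over non-overlapping windows.
    fh, fw = feature.shape
    out = np.zeros((fh // window, fw // window))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature[i*window:(i+1)*window, j*window:(j+1)*window].max()
    return out

# One feature-extraction stage: matching followed by pooling.
stage_out = pooling(matching(np.random.rand(16, 16), np.random.rand(3, 3)))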

In this paper, an NC processor for power-efficient real-time universal visual recognition is proposed with the following features: 1) A grey matter-like homogeneous many-core architecture with event-driven hybrid MIMD execution provides 1.0TOPS/W efficient acceleration for NC operations; 2) A white matter-like Kautz NoC provides 2.3Tb/s throughput, fault/congestion avoidance and redundancy-free multicast with 151Tb/s/W NoC power efficiency, which is 2.7-3.9× higher than previous NoC-based visual recognition processors [1,2].

Figure 28.2.2 shows the system architecture of the Kautz NoC-based 36-core NC processor, where the Kautz graph is implemented as an NoC. A Kautz graph is defined by its degree d and diameter k (both d and k are 3 in this work) and has N = (d+1)d^(k-1) nodes (N = 36 cores in this work). The Kautz NoC provides white matter-like routing capability – the maximum hop count between any 2 cores is less than log_d(N). Fault/congestion tolerance is also possible, as there are d disjoint routing paths between any 2 cores. The grey matter-like homogeneous cores support unified NC operations and provide programmability for various recognition workloads. As the examples in Fig. 28.2.2 show, such a combination provides better scalability in terms of both application and architecture compared to heterogeneous designs [1-4]. In this processor, each core is addressed by a string of 3 base-4 numbers (i.e. 0/1/2/3; core addresses are 6b in total) and adjacent numbers must be different (e.g. 122 is not a valid address). Each core unidirectionally links to 3 other cores with left-shifted addresses. For example, core 121 links to 210, 212 and 213, and receives links from 012, 212 and 312, accordingly. Through this string-shifting procedure, a core can reach any other core within 3 hops. All cores share a 32b address space, where each core is addressed by the most significant 6b and has a 26b private address space. A memory map, which defines the NC model, is loaded through two 64b system buses as memory pages.
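The Kautz addressing described above can be sketched in a few lines; the snippet below (an illustrative model, not the hardware) enumerates the 36 valid core addresses, derives each core's left-shift successors, and confirms the 3-hop reachability bound. Only the addressing rules come from the text; the helper names are hypothetical.

from itertools import product
from collections import deque

D, K = 3, 3                 # degree d = 3, diameter k = 3
DIGITS = range(D + 1)       # base-4 digits 0..3

# Valid Kautz addresses: length-k digit strings with no two equal adjacent digits.
cores = [a for a in product(DIGITS, repeat=K)
         if all(a[i] != a[i+1] for i in range(K - 1))]
assert len(cores) == (D + 1) * D**(K - 1) == 36

def successors(addr):
    # Left-shift the address and append any digit that differs from the new last digit.
    return [addr[1:] + (x,) for x in DIGITS if x != addr[-1]]

def hops(src, dst):
    # BFS over the unidirectional links to find the minimum hop count.
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, h = frontier.popleft()
        if node == dst:
            return h
        for nxt in successors(node):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, h + 1))

assert successors((1, 2, 1)) == [(2, 1, 0), (2, 1, 2), (2, 1, 3)]     # as in the text
assert max(hops(s, d) for s in cores for d in cores) <= K             # any core within 3 hops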

Figure 28.2.3 shows a block diagram of the NC core and its processing element (PE). All computations are event-driven. Specifically, a core is only turned on when a packet is received from another core through the NoC, or from the system bus. Inactive cores and their components are clock gated. Arriving packets are decoded into instructions and queued in the instruction FIFO. The instruction dispatcher can issue up to 2 instructions to the PEs and paging memory units that cover the instruction's target addresses. Page misses are handled by the memory management unit (MMU) and DMA. In the proposed hybrid MIMD mode, each of the 2 issued instructions can be either SIMD or SISD. Thus, the core can flexibly switch between maximal acceleration for dense NC operations (by SIMD) and minimal power consumption for sparse NC operations (by SISD). With this scheme, a 10.3× recognition speed increase and a 4.43× power reduction are achieved. Inside each PE, the 16b arithmetic unit can execute up to 4 operations per cycle, i.e. 1GOPS @ 250MHz. Successful instruction execution causes the resultant datum to be sent to the packet encoder and fired to its subsequent target addresses through the NoC.
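A behavioral sketch of the event-driven hybrid MIMD dispatch just described, under the assumption of 10 PEs per core (PE 0 to PE 9 in Fig. 28.2.3); the packet and instruction formats are invented for illustration and do not reflect the real ISA.

from collections import deque

NUM_PES = 10   # PE 0 .. PE 9 per core (Fig. 28.2.3)

def core_step(packet_queue, inst_fifo):
    # Event-driven: with no pending packets or instructions the core stays clock-gated.
    if not packet_queue and not inst_fifo:
        return "clock-gated"
    # Arriving packets are decoded into instructions and queued (decode abstracted away).
    while packet_queue:
        inst_fifo.append(packet_queue.popleft())
    # Hybrid MIMD: issue up to 2 instructions, each independently SIMD (all PEs) or SISD (1 PE).
    issued = []
    for _ in range(min(2, len(inst_fifo))):
        inst = inst_fifo.popleft()
        lanes = NUM_PES if inst["mode"] == "SIMD" else 1
        issued.append((inst["op"], lanes))
    return issued

# Example: a dense 5x5 convolution (SIMD) and a sparse maximum (SISD) issued in the same cycle.
packets = deque([{"op": "conv5x5", "mode": "SIMD"}, {"op": "max", "mode": "SISD"}])
print(core_step(packets, deque()))   # [('conv5x5', 10), ('max', 1)]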

Figure 28.2.4 shows a block diagram of the Kautz NoC router and defines its routing rules. The router performs distributed low-radix routing and all input ports share one routing FIFO, which minimizes area and improves power efficiency. All packets can be routed/multicast with the distributed routing-string-based procedure, as illustrated in Fig. 28.2.4. Information regarding faulty/congested cores or NoC links can be used to correct routing strings and redirect packets for fault/congestion avoidance with minimal hop-count overhead. Based on the properties of a Kautz graph, the NC processor can sustain at least 2 faulty cores/links. The proposed redundancy-free multicast scheme exploits the manner in which cores are named by using disallowed name strings as group multicast addresses without header redundancies (e.g. 122, a disallowed name, can be used to represent the cores 120, 121 and 123 as a subgroup, and 11X represents the entire core group 1). With this scheme, a distributed minimum-spanning-tree-based multicast is realized and provides a 1.75× recognition speed increase. Compared to a non-fault-tolerant star-/tree-based NoC [1-3], the Kautz NoC provides better efficiency (since centralized high-radix routers are area/power hungry). Compared to a fault-tolerant mesh-based NoC, the Kautz NoC has a smaller diameter and lower routing delay (3 hops vs. 10 hops, and 16% less average packet delay when running NC applications vs. a 6×6 mesh).
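The routing-string and multicast-naming ideas above lend themselves to a compact sketch. In the snippet below, the overlap rule for building a routing string is inferred from the Kautz left-shift links and the routing-unit flow in Fig. 28.2.4, and the multicast expansion covers only the repeated-trailing-digit case from the example (122 naming 120/121/123); function names are illustrative.

def routing_string(src, dst):
    # Overlap the tail of the source address with the head of the destination as far
    # as possible; each digit appended beyond the source is one hop to traverse.
    for overlap in range(len(src), -1, -1):
        if src[len(src) - overlap:] == dst[:overlap]:
            return src + dst[overlap:]

def multicast_group(addr, digits="0123"):
    # Redundancy-free multicast: a disallowed name with a repeated trailing digit pair,
    # e.g. "122", names the subgroup of valid cores sharing its 2-digit prefix.
    if addr[-1] == addr[-2]:
        return [addr[:-1] + d for d in digits if d != addr[-2]]
    return [addr]   # a valid (unicast) address names only itself

print(routing_string("121", "323"))   # "121323": 121 -> 213 -> 132 -> 323, the fault-free route of Fig. 28.2.4
print(multicast_group("122"))         # ['120', '121', '123']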

Figure 28.2.5 shows the single-step communication/execution scheme for this chip, which is called push-based processing. Unlike conventional parallel processors, where data are first placed on, then pulled from, caches multiple times by multiple cores/threads, push-based processing passes data and instructions as multicast packets through the energy-efficient Kautz NoC, which mimics a neuron's firing process, as shown in Fig. 28.2.1. According to each datum's associated NC operation, a SIMD or SISD instruction is executed. Zero-valued data and data associated with zero-valued coefficients do not result in any computation being performed, providing further power reductions. Recognition speed is improved by 2.09×, and power is reduced by 1.38×, due to push-based processing. The benefits of the incorporated features are summarized in Fig. 28.2.5. Overall, 37.8× acceleration and a 6.25× total power reduction are achieved.
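A minimal sketch of the pull- vs. push-based contrast drawn above, for a sparse accumulation y[i] += c[i]*x[i]: the pull-based version reads every element from a shared buffer, zeros included, whereas the push-based version only "fires" packets for non-zero operands, so zero-valued data never trigger computation. The data layout and packet format are illustrative assumptions.

def pull_based(y, x, coeffs):
    # Every datum is placed in a shared buffer ("cache") and pulled by the target,
    # including zeros that contribute nothing.
    for i, (xi, ci) in enumerate(zip(x, coeffs)):
        y[i] += ci * xi
    return y

def push_based(y, x, coeffs):
    # Only non-zero data with non-zero coefficients are pushed as packets;
    # zero-valued operands never reach the target core, saving work and power.
    packets = [(i, ci * xi) for i, (xi, ci) in enumerate(zip(x, coeffs))
               if xi != 0 and ci != 0]
    for i, contribution in packets:     # each packet triggers one SISD update
        y[i] += contribution
    return y

x      = [0, 3, 0, 0, 5]
coeffs = [2, 4, 1, 0, 0]
assert pull_based([0]*5, x, coeffs) == push_based([0]*5, x, coeffs) == [0, 12, 0, 0, 0]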

Figure 28.2.6 offers a summary of the chip's features. The NC processor is implemented on a 4.5×4.5mm² die using the TSMC 65nm CMOS process. A wide range of visual recognition applications, including image (object/face/scene) and video (action/sport), is supported, and the chip works in real time with high accuracy. Compared with other programmable visual recognition processors [1-4], high power efficiencies are achieved. The die photo is shown in Fig. 28.2.7.

Acknowledgments:
We thank the TSMC University Shuttle Program for chip fabrication and the National Chip Implementation Center for chip testing. This work is funded by the National Science Council. We also thank Tung-Chien Chen, Yu-Han Chen, Chih-Chi Cheng, Shao-Yi Chien, Tzu-Der Chuang, David Cox and Pai-Heng Hsiao for their valuable suggestions.

References:
[1] K. Kim, et al., "A 125GOPS 583mW Network-on-Chip Based Parallel Processor with Bio-Inspired Visual Attention Engine," ISSCC Dig. Tech. Papers, pp. 308-309, 2008.
[2] J.-Y. Kim, et al., "A 201.4GOPS 496mW Real-Time Multi-Object Recognition Processor with Bio-Inspired Neural Perception Engine," ISSCC Dig. Tech. Papers, pp. 150-151, 2009.
[3] S. Lee, et al., "A 345mW Heterogeneous Many-Core Processor with an Intelligent Inference Engine for Robust Object Recognition," ISSCC Dig. Tech. Papers, pp. 332-333, 2010.
[4] T. Kurafuji, et al., "A Scalable Massively Parallel Processor for Real-Time Image Processing," ISSCC Dig. Tech. Papers, pp. 334-335, 2010.
[5] H. Jhuang, et al., "A Biologically Inspired System for Action Recognition," IEEE International Conf. on Computer Vision, pp. 1-8, 2007.
[6] J. Mutch, et al., "Object Class Recognition and Localization Using Sparse Features with Limited Receptive Fields," International Journal of Computer Vision, vol. 80, no. 1, pp. 45-57, 2008.

978-1-4673-0377-4/12/$31.00 ©2012 IEEE


ISSCC 2012 / February 22, 2012 / 2:00 PM

Figure 28.2.1: NC model and supported applications of NC processor.
Figure 28.2.2: NC processor architecture and scalability.
Figure 28.2.3: Event-driven NC core/PE architectures and execution modes.
Figure 28.2.4: Kautz NoC router architecture and routing/multicast flows.
Figure 28.2.5: Push-based processing mechanism and speed/power summary of proposed features.
Figure 28.2.6: Chip features and comparison.

[Figure 28.2.1 artwork: the ventral (static) and dorsal (dynamic) visual recognition streams of the NC model [5,6], expressed as unified matching/pooling operations – multi-scale 2D/3D convolution with 2D/pyramidal average or maximum pooling, dense/sparse 2D linear/Laplacian/Gaussian-kernel matching with dense/sparse 2D average or maximum pooling in the feature-extraction stages, and dense/sparse 1D linear-kernel (SVM) or L1/L2-distance (KNN) matching with 1D maximum pooling in the feature-classification stage – turning an image/video input into a feature pyramid, feature vector and recognition answer. Grey matter (computing) maps onto the homogeneous cores and white matter (connection) onto the Kautz NoC. Supported applications span indoor/outdoor object, face and scene identification as well as human action, surveillance and sport video classification, a wider range than [1-3] and the partial coverage of [4].]

[Figure 28.2.2 artwork: the 36 NC cores arranged in four core groups (CG 0-3), each core labelled with its 3-digit Kautz address and attached to its own NoC router; two 64-bit system bus interfaces connect the processor to a RISC host, DDR SDRAM and a still/video camera. Scalability examples contrast heterogeneous processors [1-4], whose dedicated feature-extraction or feature-classification cores become overloaded and whose links become congested depending on the application, with the homogeneous NC processor, whose cores and NoC stay balanced under unified workloads.]

[Figure 28.2.3 artwork: NC core block diagram – Kautz NoC router, packet decode/encode, 4-deep instruction FIFO, instruction dispatcher, control, DMA/MMU/TLB, a local memory unit, nine paging memory units (2KB single-port SRAM, 16b x 1K, each) and PEs 0-9 on an internal bus; each PE contains an 81B register file (72b x 9), two adder-subtractors, a multiplier, a comparator, a data parser/packer and PE control, with links to its neighbor PEs. Execution-mode examples: SISD for dense 1D and sparse 2D/3D instructions (e.g. maximum), SIMD for dense 2D/3D instructions (e.g. 9x9 convolution), and hybrid MIMD issuing 2 SISD/SIMD instructions (e.g. 5x5 and 3x3 convolutions) together.]

[Figure 28.2.4 artwork: Kautz NoC router block diagram – input buffers with input control, a port-shared routing FIFO (4-in-3-out, 84b wide, 8 deep), routing units RU 1-3, a multicast unit (MU), and routing control driven by fault/congestion information strings (FCIS1/FCIS2), with interfaces to the local packet encode/decode and the other cores. Routing unit flow: 1) generate the routing string (RS) from source S1S2S3 and destination D1D2D3 – S1S2S3D3 if S2==D1 and S3==D2, S1S2S3D2D3 if S3==D1, otherwise S1S2S3D1D2D3; 2) check the RS against FCIS1/FCIS2 and, if it contains a faulty/congested substring, detour by trying to insert 1 digit, else 2 digits; 3) decide the output port from the RS. Multicast unit flow: generate the RS, detect multicast, clone packets and modify target addresses, e.g. a multicast to 211 detected at core 121 triggers the packet to be cloned and sent to 210, 212 and 213 (the subgroup defined as 211) via all output ports. Routing example from 121 to 323: 121-213-132-323 if no fault, or 121-210-103-032-323 if core 132 is faulty.]

[Figure 28.2.5 artwork: pull-based processing – (1) data, including zeros, are placed on a cache, (2) data are pulled by the targets, with SIMD instructions and memory-request addresses updating target arrays y on cores A and B element by element – versus push-based processing, where (1) data from a source array x on core Z are pushed/multicast to the targets directly, excluding zeros and not-to-be-used values.]

[Figure 28.2.5 chart data*: recognition speed / average power of 63fps / 205mW (all features on), 36fps / 209mW (multicast off), 3.5fps / 927mW (hybrid MIMD off) and 1.7fps / 1284mW (push-based processing off); per-feature benefits of 10.3x speed / 4.43x power (hybrid MIMD), 2.09x / 1.38x (push-based processing) and 1.75x / 1.02x (multicast), giving 37.8x speed and 6.25x power overall. *Benchmark: 128x128 1000-class indoor object image recognition.]
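As a quick consistency check on these numbers (my own arithmetic, not taken from the paper), the three per-feature factors compound to roughly the reported totals, and the bar values are consistent with the features being disabled cumulatively from left to right:

speed_factors = [10.3, 2.09, 1.75]    # hybrid MIMD, push-based processing, multicast
power_factors = [4.43, 1.38, 1.02]

speed_total, power_total = 1.0, 1.0
for s, p in zip(speed_factors, power_factors):
    speed_total *= s
    power_total *= p

print(round(speed_total, 1), round(power_total, 2))   # 37.7 and 6.24, vs. reported 37.8x and 6.25x
print(round(63 / 36, 2), round(36 / 3.5, 1), round(3.5 / 1.7, 2))   # 1.75, 10.3, 2.06 (vs. 1.75x, 10.3x, 2.09x)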

[Figure 28.2.6 table data:
Technology: TSMC 65nm 1P9M CMOS
Die Size: 20.25mm² (4.5mm x 4.5mm)
Power Supply: Core 1.0V, I/O 2.5V
Operating Frequency: 250MHz
Gate / SRAM: 2.2M / 648KB
Total Power Consumption: 205mW / 351mW (Average / Peak)
Peak Performance: 360GOPS
Power Efficiency: 1026GOPS/W
NoC Throughput: 2.3Tb/s (Aggregated Bandwidth)
NoC Power Consumption: 15mW
NoC Power Efficiency: 151Tb/s/W
Fault / Congestion Tolerance: 2 Cores / 2 NoC Links / 1 Core + 1 NoC Link
Recognition Speed: 128x128 image input (enhanced NC model) 33-68fps; 128x128 image input 63-130fps; 256x256 image input 15-32fps; 128x128 video input 14-26fps
Online Instance Learning: Yes (KNN Mode)
Power Efficiency* Comparison (GOPS/W): ISSCC08 [1] 616, ISSCC09 [2] 835, ISSCC10 [3]** 933, ISSCC10 [4] 310, This Work 1026
NoC Power Efficiency* Comparison (Tb/s/W): ISSCC08 [1]** 38.5, ISSCC09 [2]** 56.8, This Work 151
*Measured under the applications shown in Fig. 28.2.1 (11 image/video databases); technology scaling of power (130nm to 65nm): P65 = P130 x (C65/C130) x (V65/V130)² = P130/2.88.
**Number from the corresponding JSSC paper.]
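Two more small consistency checks (my own arithmetic, not from the paper): peak performance over peak power reproduces the headline compute efficiency, and NoC bandwidth over NoC power lands close to the reported NoC efficiency, with the NoC ratios over [1] and [2] matching the 2.7-3.9x claim in the introduction.

print(round(360 / 0.351))            # 1026 GOPS/W, matching the table entry
print(round(2.3 / 0.015, 1))         # 153.3 Tb/s/W, close to the reported 151 Tb/s/W (2.3Tb/s is rounded)
print(round(151 / 56.8, 1), round(151 / 38.5, 1))   # 2.7x over [2], 3.9x over [1]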


ISSCC 2012 PAPER CONTINUATIONS

Figure 28.2.7: Chip micrograph.

[Figure 28.2.7 artwork: die micrograph showing the Kautz NoC at the center, surrounded by core groups 0-3 (9 NC cores each) and their paging memory arrays.]