NMAX: Fast, Modular, Low Latency, Low Cost/Power Neural Inferencing (transcript)
No NDA Required – Public Information
Leader in eFPGA
• TSMC IP Alliance Member
• eFPGA working silicon in TSMC 40/28/16/12
• eFPGA in design for GF14 and TSMC 7/7+
• >>10 customers in deal/design/fab/silicon stages, 16-180nm; more in negotiation
• First 6 announced agreements
NMAX Value Proposition
▪ Inferencing: 8x8 or 16x8 natively; 16x16 at ½ throughput
▪ High MAC utilization: more throughput out of less silicon
▪ Low DRAM bandwidth: more throughput at less system cost and power
▪ Low latency: high performance, low latency at batch = 1
▪ Modular: can respond quickly to customer needs
▪ Scalable: doubling MACs means doubling throughput
▪ Flexible: run any NN, or multiple NNs
▪ TensorFlow/Caffe: easy to program
NMAX Inferencing Applications
▪ Edge
• Automotive
• Aerospace
• Cameras
• Cell Phones
• Drones
• PCs
• Retail
• Robotics
• Speech
• Surveillance
▪ Data Centers
Comparing Neural Engine Alternatives
Neural Inferencing Challenges & Terminology
• An inferencing chip with 1,000 MACs @ 1 GHz has 1 TMAC/s = 2 TOPS (each MAC performs a multiply plus an accumulate, i.e. 2 operations)
▪ This is a peak number: no one uses all of the MACs
• Challenge: move data at low power where needed to keep MACs utilized
• Challenge: maintain high performance at batch=1 for lowest latency
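That peak-TOPS arithmetic can be sketched directly (a minimal illustration; the function name and structure are mine, not from any NMAX tool):

```python
# Peak TOPS for an array of MACs: each MAC does a multiply and an
# accumulate per cycle, so one MAC/cycle counts as 2 operations.
OPS_PER_MAC = 2

def peak_tops(num_macs: int, clock_hz: float) -> float:
    """Peak (not sustained) TOPS for num_macs MACs at a given clock."""
    return num_macs * clock_hz * OPS_PER_MAC / 1e12

print(peak_tops(1000, 1e9))  # 1000 MACs @ 1 GHz -> 2.0 TOPS peak
```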
How do you get the NN throughput you need?
1. Determine how many OPS/MACs are needed for each image
▪ YOLOv3 2MP = 400 Billion MACs/image = 800 Billion Operations/image
2. Determine how many images/second you need to process
▪ YOLOv3 2MP autonomous driving: 30 images/sec = 24 TOPS throughput
3. Determine how many MACs you need, using this formula:
▪ Y TOPS Peak = X TOPS Throughput ÷ MAC utilization
▪ MAC utilization will vary based on the NN, image size, and batch size
▪ Batch=1 is what you need at the edge
▪ Number of MACs required = Y TOPS Peak ÷ (2 × MAC clock frequency), since each MAC completes 2 operations per cycle
▪ NOTE: no shortcuts in the above for model pruning, Winograd, or compression
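The three sizing steps above can be sketched as follows (an illustration using the slide's YOLOv3 2MP numbers; the 50% MAC utilization is an assumed placeholder, and the factor of 2 reflects multiply+accumulate per MAC per cycle):

```python
OPS_PER_MAC = 2  # one multiply + one accumulate per MAC per cycle

def required_peak_tops(ops_per_image: float, images_per_sec: float,
                       mac_utilization: float) -> float:
    """Steps 1-3: throughput TOPS scaled up by achievable utilization."""
    throughput_tops = ops_per_image * images_per_sec / 1e12
    return throughput_tops / mac_utilization

def required_macs(peak_tops: float, clock_hz: float) -> float:
    """Convert peak TOPS back into a MAC count at a given clock."""
    return peak_tops * 1e12 / (clock_hz * OPS_PER_MAC)

# YOLOv3 2MP: 800 Billion operations/image at 30 images/sec = 24 TOPS throughput.
peak = required_peak_tops(800e9, 30, mac_utilization=0.5)  # assumed 50%
print(peak)                      # 48.0 TOPS peak needed
print(required_macs(peak, 1e9))  # 24000.0 MACs at 1 GHz
```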
MAC Utilization / MAC Efficiency
• A MAC can only do a useful calculation if both the activation and the weight are available on its inputs; if not, it stalls
• MAC Utilization = (# of useful MAC calculations) ÷ (# of MAC calculations available)
• Example:
▪ Nvidia Tesla T4 claims 3920 images/second on ResNet-50 @ Batch=28
▪ Each ResNet-50 image takes 7 Billion Operations (3.5 Billion MACs)
▪ So T4 is doing 3920 * 7 Billion Ops = 27.44 Trillion Ops/sec = 27.44 TOPS
▪ T4 data sheet claims 130 TOPS (int8)
▪ So T4 MAC utilization, for ResNet-50 @ Batch=28, is 27.44 ÷ 130 ≈ 21%
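The T4 arithmetic above, as a quick sketch (the function name is mine):

```python
def mac_utilization(images_per_sec: float, ops_per_image: float,
                    peak_tops: float) -> float:
    """Achieved TOPS divided by the datasheet's peak TOPS."""
    achieved_tops = images_per_sec * ops_per_image / 1e12
    return achieved_tops / peak_tops

# Nvidia Tesla T4: 3920 images/sec on ResNet-50 (7 Billion ops/image),
# 130 TOPS peak (int8) per the datasheet claim quoted above.
u = mac_utilization(3920, 7e9, 130)
print(f"{u:.0%}")  # prints "21%"
```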
Microsoft BrainWave Slide from HotChips 2018:
[Chart: 99th-percentile latency and hardware utilization (%) vs. batch size; existing solutions approach the IDEAL utilization curve only as batch size grows, pushing latency past the maximum allowed latency.]
Batching improves HW utilization but increases latency.
ResNet-50 Int8
▪ Image classification
▪ 224 x 224 pixels
▪ 50-stage neural network
▪ 22.7 Million weights
▪ 3.5 Billion MACs per image = 7 Billion Operations per image
ResNet-50 Images/Second vs Batch Size: NMAX utilization high at batch=1
[Chart: images/second vs. batch size (Batch=1, 5, 10, 28) for Nvidia Tesla T4, Habana Goya, NMAX 6x6, NMAX 12x6, and NMAX 12x12; y-axis 0–16,000 images/second. The batch=1 region is marked EDGE.]
Real Time Object Recognition: YOLOv3
YOLOv3 Int8
▪ Real time object recognition
▪ 2 Megapixel images
▪ >100-stage neural network
▪ 62 Million weights
▪ 400 Billion MACs per image = 800 Billion Operations per image
NMAX: YOLOv3, 2048x1024, Batch=1 using 2 x 4Gbit LPDDR4 DRAM
10x reduction in DRAM BW requirements vs competing solutions (<25 vs >300GB/s)
                        NMAX 12x12   NMAX 12x6   NMAX 6x6
SRAM size               64 MB        64 MB       32 MB
TOPS peak               147          73          37
Throughput (@1 GHz)     124 fps      72 fps      27 fps
Latency                 8 ms         14 ms       37 ms
Avg. DRAM BW            12 GB/s      14 GB/s     10 GB/s
Avg. SRAM BW            177 GB/s     103 GB/s    34 GB/s
XFLX & ArrayLINX BW     18 TB/s      10 TB/s     4 TB/s
MAC efficiency*         67%          78%         58%
Effective throughput    98 TOPS      58 TOPS     22 TOPS (T4-class performance)
*at the maximum usable DRAM BW of 25 GB/s
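The figures above can be cross-checked (a sketch, assuming throughput ≈ peak TOPS × MAC efficiency ÷ 800 Billion operations per YOLOv3 2MP image; the small differences from the quoted fps come from the rounded efficiency percentages):

```python
OPS_PER_IMAGE = 800e9  # YOLOv3 2MP: 800 Billion operations/image

def fps(peak_tops: float, efficiency: float) -> float:
    """Frames/sec implied by peak TOPS and MAC efficiency."""
    return peak_tops * 1e12 * efficiency / OPS_PER_IMAGE

print(round(fps(147, 0.67)))  # NMAX 12x12 -> 123 (quoted: 124 fps)
print(round(fps(73, 0.78)))   # NMAX 12x6  -> 71  (quoted: 72 fps)
print(round(fps(37, 0.58)))   # NMAX 6x6   -> 27  (quoted: 27 fps)
```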
Why is NMAX the right solution for high-performance inferencing?
• The most efficient implementation of any neural network is a hardwired ASIC
• But customers want reconfigurability
• NMAX is the closest reconfigurable architecture to hardwired ASIC
▪ Each stage when configured executes just like an ASIC
• NMAX can run any neural network written in TensorFlow/Caffe
NMAX512 Tile Microarchitecture: 1 TOPS @ <2mm² in TSMC 16FFC/12FFC

Features:
• 8 NMAX clusters achieve 50-90% MAC efficiency
• Local eFPGA logic (EFLX) for:
▪ Control logic & management
▪ Reconfigurable data flow
▪ Additional signal processing (e.g. ReLU, Sigmoid, Tanh)
• Local L1 SRAM for weights & activations
• L2 SRAM (via RAMLINX)
• L3 storage through DDR/PCIe
• High-speed XFLX interconnect links all blocks within the tile
• High-speed ArrayLINX connects adjacent NMAX tiles to create larger NMAX arrays by abutment

[Diagram: NMAX512 tile (architectural diagram, not to scale) — eight NMAX clusters with L1 SRAM, EFLX logic, and EFLX IO blocks joined by the XFLX interconnect; ArrayLINX links to adjacent tiles; L2 SRAM via RAMLINX; DDR, PCIe & SoC connections.]
Every Tile is Reconfigured (quickly & differently) every stage
[Diagram: the NMAX512 tile (architectural diagram, not to scale), showing input activations flowing from L2 SRAM through the NMAX clusters and back out to L2 SRAM.]

This example does a matrix multiply of a 512-element activation vector from the prior stage by a weight matrix; the product is then passed through the activation function to produce the activation vector for the next stage. Input activations are read from L2 SRAM, and output activations are written back to L2 SRAM.
NMAX Clusters Systolically Multiply the Activation Vector by the Weights
• Example: a 4-element input vector multiplied by a 4x4 weight matrix
Source: Hardware for Neural networks, page 466, https://page.mi.fu-berlin.de/rojas/neural/chapter/K18.pdf
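The systolic multiply can be sketched as a cycle-by-cycle toy model (illustrative only; this simplified Python loop shows the accumulation pattern, not the actual NMAX microarchitecture):

```python
def systolic_matvec(weights, activations):
    """Multiply a weight matrix by an activation vector, consuming one
    activation (and so one weight column) per 'cycle': acc += W[:, t] * a[t]."""
    acc = [0] * len(weights)
    for t in range(len(activations)):     # one activation enters per cycle
        for row in range(len(weights)):   # each row's MAC accumulates in place
            acc[row] += weights[row][t] * activations[t]
    return acc

# 4-element vector times a 4x4 (here diagonal) weight matrix.
W = [[1, 0, 0, 0],
     [0, 2, 0, 0],
     [0, 0, 3, 0],
     [0, 0, 0, 4]]
print(systolic_matvec(W, [1, 1, 1, 1]))  # [1, 2, 3, 4]
```

After the last activation has entered, each row's accumulator holds one element of the output vector, which is what the pipelined hardware drains out.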
Modular NMAX arrays: easily scale from 1 to >100 TOPS

[Diagram: 2x2 NMAX512 array (architectural diagram, not to scale) — four NMAX tiles with four shared L2 SRAM blocks between them, a DDR interface, and an SoC/PCIe connection.]

Features:
• NMAX tiles form arrays by abutment
• ArrayLINX interconnect on all 4 sides of each NMAX tile connects automatically to provide high-bandwidth array-wide interconnect
• Shared L2 SRAM:
▪ Local, high-capacity SRAMs placed in between NMAX tiles
▪ Hold weights for each layer, as well as activations passed from one layer to the next
▪ EFLX place-and-route algorithms minimize interconnect distances between SRAM and NMAX
NMAX is dataflow: NMAX Compiler maps from Caffe/TensorFlow
• The mapped NN automatically "unrolls" onto the NMAX hardware
• Control logic & data operators map to EFLX reconfigurable logic
[Diagram: a small dataflow example — a 4-input, 4-neuron layer (inputs i0–i3, neurons n0–n3, weights w00, w11, w22, w33) unrolled onto an NMAX tile, with input loads (ld), intermediate values (i'0–i'3, n°0–n°3), and data out (dout).]