Rethinking computation: A processor architecture for machine intelligence

Amir Khosrowshahi, Co-founder and CTO, Nervana
17 May 2016
About nervana

• A platform for machine intelligence
• Enables deep learning at scale
• Optimized from algorithms to silicon
Model and substrate for computation

(Diagram: the mammalian cortex maps to a functional model, which in turn maps to a machine learning model; both links are marked "?". Hard!)
Model and substrate for computation

Do this instead: map a deep learning model onto a custom ASIC through a software stack (a sketch follows below):
• Model description language
• Hardware abstraction layer
• Distributed primitives
• Compilers, drivers

Feasible, but still hard.
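To make the stack concrete, here is a minimal sketch of the hardware-abstraction-layer idea: the model description reduces to a small set of tensor ops, and each target supplies its own implementation of those ops. All names here are hypothetical illustrations, not Nervana's internals.

from abc import ABC, abstractmethod
import numpy as np

class Backend(ABC):
    """Hardware abstraction layer: models are compiled down to a small
    set of tensor ops; each target (CPU, GPU, custom ASIC) implements them."""
    @abstractmethod
    def gemm(self, a, b):
        ...

class CPUBackend(Backend):
    """Reference implementation on the host CPU."""
    def gemm(self, a, b):
        return np.dot(a, b)

def mlp_forward(be, x, w1, w2):
    """A 'model description' reduced to its essence: a fixed sequence of ops."""
    h = np.maximum(be.gemm(x, w1), 0)   # affine layer + ReLU
    return be.gemm(h, w2)

x, w1, w2 = np.ones((4, 8)), np.ones((8, 16)), np.ones((16, 2))
print(mlp_forward(CPUBackend(), x, w1, w2).shape)   # (4, 2)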
Application areas

• Healthcare
• Agriculture
• Finance
• Online services
• Automotive
• Energy
nervana cloud

(Diagram: images, text, tabular data, speech, time series, and video flow into the nervana cloud, where models are imported, built, trained, and deployed.)
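The last step of that pipeline, querying a deployed model for predictions, is the kind of call a client would make over HTTP. A hedged sketch follows; the route, payload, and response shape are hypothetical stand-ins, not Nervana's documented API:

import json
import urllib.request

def predict(host, model_id, data):
    """POST data to a deployed model and return its predictions.
    The route and payload below are hypothetical stand-ins."""
    req = urllib.request.Request(
        "https://%s/models/%s/predict" % (host, model_id),
        data=json.dumps({"inputs": data}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# e.g. predict("cloud.nervanasys.com", "42", [[0.1, 0.7, 0.2]])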
Deep learning as a core technology

The 'Google Brain' model: a single deep learning (DL) core powers many products, including Photos, Maps, Voice Search, self-driving cars, ad targeting, and machine translation.

The Nervana Platform follows the same pattern: one DL core serving image classification, object localization, video indexing, speech recognition, and natural language.
nervana neon

• Fastest library
• Model support
  • Models: ConvNet, RNN, LSTM, MLP, DQN, NTM
  • Domains: images, video, speech, text, time series
• Cloud integration (CLI usage below)
• Multiple backends: CPU, GPU, multiple GPUs, parameter server, (Xeon Phi), nervana TPU
• Optimized at assembler level
Running locally:

% python rnn.py                       # or: neon rnn.yaml

Running in nervana cloud:

% ncloud submit --py rnn.py           # or: --yaml rnn.yaml
% ncloud show <model_id>
% ncloud list
% ncloud deploy <model_id>
% ncloud predict <model_id> <data>    # or use the REST API
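A script like rnn.py above is ordinary Python against the neon API. Below is a minimal sketch of what such a script might contain, shown with a small multilayer perceptron on synthetic data for brevity and assuming the neon 1.x API (an actual rnn.py would swap in recurrent layers and a sequence iterator):

import numpy as np
from neon.backends import gen_backend
from neon.callbacks.callbacks import Callbacks
from neon.data import ArrayIterator
from neon.initializers import Gaussian
from neon.layers import Affine, GeneralizedCost
from neon.models import Model
from neon.optimizers import GradientDescentMomentum
from neon.transforms import Rectlin, Softmax, CrossEntropyMulti

be = gen_backend(backend='cpu', batch_size=128)    # or backend='gpu'

# synthetic stand-in for a real dataset: 1000 examples, 64 features, 10 classes
X = np.random.rand(1000, 64).astype(np.float32)
y = np.random.randint(0, 10, 1000)
train = ArrayIterator(X, y, nclass=10)

init = Gaussian(scale=0.01)
mlp = Model(layers=[Affine(nout=100, init=init, activation=Rectlin()),
                    Affine(nout=10, init=init, activation=Softmax())])

mlp.fit(train,
        cost=GeneralizedCost(costfunc=CrossEntropyMulti()),
        optimizer=GradientDescentMomentum(0.1, momentum_coef=0.9),
        num_epochs=5,
        callbacks=Callbacks(mlp))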
nervana tensor processing unit (TPU)

• Unprecedented compute density: 1 nervana engine @ 200 watts ≈ 10 GPUs @ 2,000 watts ≈ 200 CPUs @ 20,000 watts
• Scalable distributed architecture (diagram: a mesh of directly interconnected nervana nodes)
• Memory near computation (diagram: a conventional CPU places control and ALU across a bus from its instruction and data memory; the nervana design places data memory beside the compute, with control alongside)
• Learning and inference
• Exploit limited precision (see the sketch below)
• Incorporate latest advances
• Power efficiency
• 10-100x gain: the architecture is optimized for the algorithm
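On "exploit limited precision": deep learning workloads tolerate low-precision arithmetic well, which is what lets a custom design trade wide floating-point units for many more narrow ones. A toy numerical illustration follows (my own sketch of generic fixed-point rounding, not Nervana's actual number format):

import numpy as np

np.random.seed(0)

def quantize(x, bits=16, scale=1.0):
    """Round x onto a signed fixed-point grid spanning [-scale, scale]."""
    step = scale / 2 ** (bits - 1)
    return np.clip(np.round(x / step) * step, -scale, scale)

w = 0.01 * np.random.randn(4096)       # weights
a = np.random.randn(4096)              # activations

exact = np.dot(w, a)
approx = np.dot(quantize(w, scale=0.1), quantize(a, scale=8.0))
print("float64: %+.6f   16-bit: %+.6f   rel. err: %.1e"
      % (exact, approx, abs(approx - exact) / abs(exact)))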
General purpose computation

2000s: SoC
Motivation: reduce power and cost; fungible computing.
Enabled inexpensive mobile devices.
Dennard scaling has ended

(Plot: transistor counts keep rising while clock speed, power, and performance per clock have leveled off.)

What's next?
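Why the trend broke, in one line of arithmetic: dynamic power goes as P ∝ C·V²·f. The back-of-envelope sketch below is standard reasoning, not a calculation from the talk:

s = 0.7  # linear shrink per process generation

# Classic Dennard scaling: C -> s*C, V -> s*V, f -> f/s, area -> s^2
power_dennard = s * s**2 * (1.0 / s)        # new power = s^2 * old
density_dennard = power_dennard / s**2      # per unit area: unchanged
print("Dennard:      power density x%.2f per generation (flat)" % density_dennard)

# Post-Dennard: voltage stopped scaling (V -> V), everything else as before
power_now = s * 1.0 * (1.0 / s)             # new power = old
density_now = power_now / s**2              # per unit area: ~2x per generation
print("Post-Dennard: power density x%.2f per generation (rising)" % density_now)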
Many-core tiled architectures
From Tilera's TILEPro series architecture overview (Chapter 2):

"The Tile Processor™ implements Tilera's multicore architecture, incorporating a two-dimensional array of processing elements (each referred to as a tile), connected via multiple two-dimensional mesh networks. Based on Tilera's iMesh™ Interconnect technology, the architecture is scalable and provides high bandwidth and extremely low latency communication among tiles. The Tile Processor integrates external memory and I/O interfaces on chip and is a complete programmable multicore processor. External memory and I/O interfaces are connected to the tiles via the iMesh interconnect.

Each tile is a powerful, full-featured computing system that can independently run an entire operating system, such as Linux. Each tile implements a 32-bit integer processor engine utilizing a three-way Very Long Instruction Word (VLIW) architecture with its own program counter (PC), cache, and DMA subsystem. An individual tile is capable of executing up to three operations per cycle."

(Figure 2-1: the 64-core TILEPro64 hardware architecture, an 8×8 mesh of tiles, each containing a processor engine, cache engine, and switch engine attached to the on-chip networks (CDN, TDN, IDN, MDN, STN, UDN), with DDR2 memory controllers, PCIe, GbE/10GbE, and other I/O interfaces on the periphery.)
2010s: multi-core, GPGPU
Motivation: increased performance without higher clock rates or smaller devices.
Requires changes in programming paradigm.
Examples: Tilera TILEPro, NVIDIA GM204, Intel Xeon Phi (Knights Landing).
Special purpose computation: Anton
From the paper:

"Anton 2 comprises a number of identical ASICs ('nodes') that are directly connected to form a three-dimensional torus topology (Fig. 3). Inter-node connections consist of two bidirectional 112 Gb/s 'channels', providing a total raw bandwidth (data plus encoding overhead) of 224 Gb/s to each of an ASIC's six neighbors within the torus. The ASIC contains an array of two types of computational tiles. There are 16 flexible subsystem ('flex') tiles, each with 4 embedded processors called geometry cores (GCs), 256 KB of SRAM, a network interface, and a dispatch unit [20] that provides hardware support for fine-grained event-driven computation (Fig. 4a). In addition, there are 2 high-throughput interaction subsystem (HTIS) tiles, each with an array of 38 pairwise point interaction modules (PPIMs) for computing pairwise interactions (Fig. 4b). These tiles are clocked at 1.65 GHz and are connected by an on-chip mesh network consisting of 317 Gb/s bidirectional links [52]. A host interface communicates with an external host processor over a PCI Express link, and a 'logic analyzer' captures and stores on-die activity for debugging purposes.

The HTIS tile computes pairwise interactions between two (possibly overlapping) sets of atoms or grid points using the same strategy as the Anton 1 HTIS [29]. It has a maximum throughput of two interactions per cycle per PPIM. The computation is directed by an interaction control block (ICB), which receives atoms and grid points from the network, sorts them into 128 queues within a 640-KB local memory, then loads data from the queues into the PPIM array. The ICB is in turn controlled by a sequence of commands that it receives from the GC within a mini-flex tile, which is a scaled-down version of the flex tile containing a single GC, 64 KB of SRAM, a dispatch unit, and a network interface. Controlling the ICB with a mini-flex tile simplifies the task of programming Anton 2 by allowing the same toolchain to be used for all embedded software and enabling code sharing between the flex tile and HTIS tile software.

While the GCs and PPIMs have been given the same names as their Anton 1 counterparts, the blocks themselves have been redesigned. The GCs have an entirely new instruction-set architecture, and the PPIM datapaths have fundamentally changed to provide improved performance and flexibility.

Table 1 gives a block-level comparison of the Anton 1 and Anton 2 ASICs. The improvement in inter-node bandwidth is due to a combination of more aggressive signaling technology (14 Gb/s vs. 4.6 Gb/s), more total pins devoted to inter-node signaling (384 vs. 264), and a more efficient physical-layer..."

(Fig. 3: the Anton 2 ASICs are directly connected by high-speed channels to form a three-dimensional torus topology. Each 20.4 mm × 20 mm ASIC, implemented in 40-nm technology, contains 2 connections to each neighbor in the torus, 16 flex tiles, 2 HTIS tiles, a host interface, and an on-die logic analyzer.)

(Fig. 4: a flex tile contains 4 geometry cores, 256 KB of shared SRAM, a dispatch unit, a network interface, and an internal network. An HTIS tile contains a mini-flex tile, an interaction control block (ICB), and an array of 38 PPIMs; the ICB sorts atoms and grid points into 128 queues stored in a 640-KB local memory, loads data from the local memory into the PPIM array, and converts the output of the PPIM array into network packets. The ICB is controlled by a sequence of commands that it receives from the mini-flex GC.)
(Shaw et al., 2014)
Computational motifs

   Motif                          Examples
1  Dense linear algebra           Matrix multiply (GEMM)
2  Sparse linear algebra          SpMV
3  Spectral methods               FFT
4  N-body methods                 Molecular dynamics
5  Structured grids               Lattice Boltzmann
6  Unstructured grids             CFD
7  Map-reduce                     Expectation maximization
8  Combinational logic            Encryption, hashing
9  Graph traversal                Decision trees, quicksort
10 Dynamic programming            Forward-backward
11 Backtrack, branch and bound    Constraint satisfaction
12 Graphical models               HMM, Bayesian networks
13 Finite state machines          Compilers

(Asanovic et al., 2006)

These motifs can be implemented using:
• Silicon
• Software
• Neural network architectures!
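Motif 1, dense linear algebra, is the workhorse of deep learning: even convolutions are commonly lowered to a single matrix multiply (GEMM). A minimal sketch of that standard "im2col" lowering (illustrative, not Nervana's actual kernels):

import numpy as np

def conv2d_as_gemm(x, w):
    """Valid 2-D cross-correlation of x (H x W) with kernel w (K x K),
    computed as one matrix multiply after an im2col data rearrangement."""
    H, W = x.shape
    K = w.shape[0]
    Ho, Wo = H - K + 1, W - K + 1
    # im2col: each K x K input patch becomes one column
    cols = np.empty((K * K, Ho * Wo))
    for i in range(Ho):
        for j in range(Wo):
            cols[:, i * Wo + j] = x[i:i + K, j:j + K].ravel()
    # a single GEMM produces every output pixel
    return (w.ravel() @ cols).reshape(Ho, Wo)

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0              # 3 x 3 box filter
print(conv2d_as_gemm(x, w))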
Summary

• Computers are tools for solving the problems of their time
• Was: coding, calculation, graphics, the web
• Today: learning and inference on data
• Deep learning is a computational paradigm
• A custom architecture can do vastly better
• We are hiring! Summer interns and full time.