Rethinking computation: A processor architecture for machine intelligence

Amir Khosrowshahi, Co-founder and CTO, Nervana
17 May 2016
About nervana

• A platform for machine intelligence
• Enables deep learning at scale
• Optimized from algorithms to silicon
Model and substrate for computation

(Diagram: the mammalian cortex maps to a functional model, which in turn maps to a machine learning model; both links are marked "?". Hard!)
Model and substrate for computation

Do this instead: map a deep learning model onto a custom ASIC through a software stack (a sketch follows below):
• Model description language
• Hardware abstraction layer
• Distributed primitives
• Compilers, drivers

Feasible, but still hard.
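To make the stack concrete, here is a minimal sketch of the hardware-abstraction-layer idea: the model description reduces to a small set of tensor ops, and each target supplies its own implementation of those ops. All names here are hypothetical illustrations, not Nervana's internals.

from abc import ABC, abstractmethod
import numpy as np

class Backend(ABC):
    """Hardware abstraction layer: models are compiled down to a small
    set of tensor ops; each target (CPU, GPU, custom ASIC) implements them."""
    @abstractmethod
    def gemm(self, a, b):
        ...

class CPUBackend(Backend):
    """Reference implementation on the host CPU."""
    def gemm(self, a, b):
        return np.dot(a, b)

def mlp_forward(be, x, w1, w2):
    """A 'model description' reduced to its essence: a fixed sequence of ops."""
    h = np.maximum(be.gemm(x, w1), 0)   # affine layer + ReLU
    return be.gemm(h, w2)

x, w1, w2 = np.ones((4, 8)), np.ones((8, 16)), np.ones((16, 2))
print(mlp_forward(CPUBackend(), x, w1, w2).shape)   # (4, 2)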
Application areas

• Healthcare
• Agriculture
• Finance
• Online services
• Automotive
• Energy
nervana cloud

(Diagram: images, text, tabular data, speech, time series, and video flow into the nervana cloud, where models are imported, built, trained, and deployed.)
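The last step of that pipeline, querying a deployed model for predictions, is the kind of call a client would make over HTTP. A hedged sketch follows; the route, payload, and response shape are hypothetical stand-ins, not Nervana's documented API:

import json
import urllib.request

def predict(host, model_id, data):
    """POST data to a deployed model and return its predictions.
    The route and payload below are hypothetical stand-ins."""
    req = urllib.request.Request(
        "https://%s/models/%s/predict" % (host, model_id),
        data=json.dumps({"inputs": data}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# e.g. predict("cloud.nervanasys.com", "42", [[0.1, 0.7, 0.2]])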
Deep learning as a core technology

The 'Google Brain' model: a single deep learning (DL) core powers many products, including Photos, Maps, Voice Search, self-driving cars, ad targeting, and machine translation.

The Nervana Platform follows the same pattern: one DL core serving image classification, object localization, video indexing, speech recognition, and natural language.
nervana neon

• Fastest library
• Model support
  • Models: ConvNet, RNN, LSTM, MLP, DQN, NTM
  • Domains: images, video, speech, text, time series
• Cloud integration (CLI usage below)
• Multiple backends: CPU, GPU, multiple GPUs, parameter server, (Xeon Phi), nervana TPU
• Optimized at assembler level
Running locally:

% python rnn.py                       # or: neon rnn.yaml

Running in nervana cloud:

% ncloud submit --py rnn.py           # or: --yaml rnn.yaml
% ncloud show <model_id>
% ncloud list
% ncloud deploy <model_id>
% ncloud predict <model_id> <data>    # or use the REST API
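A script like rnn.py above is ordinary Python against the neon API. Below is a minimal sketch of what such a script might contain, shown with a small multilayer perceptron on synthetic data for brevity and assuming the neon 1.x API (an actual rnn.py would swap in recurrent layers and a sequence iterator):

import numpy as np
from neon.backends import gen_backend
from neon.callbacks.callbacks import Callbacks
from neon.data import ArrayIterator
from neon.initializers import Gaussian
from neon.layers import Affine, GeneralizedCost
from neon.models import Model
from neon.optimizers import GradientDescentMomentum
from neon.transforms import Rectlin, Softmax, CrossEntropyMulti

be = gen_backend(backend='cpu', batch_size=128)    # or backend='gpu'

# synthetic stand-in for a real dataset: 1000 examples, 64 features, 10 classes
X = np.random.rand(1000, 64).astype(np.float32)
y = np.random.randint(0, 10, 1000)
train = ArrayIterator(X, y, nclass=10)

init = Gaussian(scale=0.01)
mlp = Model(layers=[Affine(nout=100, init=init, activation=Rectlin()),
                    Affine(nout=10, init=init, activation=Softmax())])

mlp.fit(train,
        cost=GeneralizedCost(costfunc=CrossEntropyMulti()),
        optimizer=GradientDescentMomentum(0.1, momentum_coef=0.9),
        num_epochs=5,
        callbacks=Callbacks(mlp))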
nervana tensor processing unit (TPU)

• Unprecedented compute density: 1 nervana engine @ 200 watts ≈ 10 GPUs @ 2,000 watts ≈ 200 CPUs @ 20,000 watts
• Scalable distributed architecture (diagram: a mesh of directly interconnected nervana nodes)
• Memory near computation (diagram: a conventional CPU places control and ALU across a bus from its instruction and data memory; the nervana design places data memory beside the compute, with control alongside)
• Learning and inference
• Exploit limited precision (see the sketch below)
• Incorporate latest advances
• Power efficiency
• 10-100x gain: the architecture is optimized for the algorithm
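On "exploit limited precision": deep learning workloads tolerate low-precision arithmetic well, which is what lets a custom design trade wide floating-point units for many more narrow ones. A toy numerical illustration follows (my own sketch of generic fixed-point rounding, not Nervana's actual number format):

import numpy as np

np.random.seed(0)

def quantize(x, bits=16, scale=1.0):
    """Round x onto a signed fixed-point grid spanning [-scale, scale]."""
    step = scale / 2 ** (bits - 1)
    return np.clip(np.round(x / step) * step, -scale, scale)

w = 0.01 * np.random.randn(4096)       # weights
a = np.random.randn(4096)              # activations

exact = np.dot(w, a)
approx = np.dot(quantize(w, scale=0.1), quantize(a, scale=8.0))
print("float64: %+.6f   16-bit: %+.6f   rel. err: %.1e"
      % (exact, approx, abs(approx - exact) / abs(exact)))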
General purpose computation

2000s: SoC
Motivation: reduce power and cost; fungible computing.
Enabled inexpensive mobile devices.
Dennard scaling has ended

(Plot: transistor counts keep rising while clock speed, power, and performance per clock have leveled off.)

What's next?
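Why the trend broke, in one line of arithmetic: dynamic power goes as P ∝ C·V²·f. The back-of-envelope sketch below is standard reasoning, not a calculation from the talk:

s = 0.7  # linear shrink per process generation

# Classic Dennard scaling: C -> s*C, V -> s*V, f -> f/s, area -> s^2
power_dennard = s * s**2 * (1.0 / s)        # new power = s^2 * old
density_dennard = power_dennard / s**2      # per unit area: unchanged
print("Dennard:      power density x%.2f per generation (flat)" % density_dennard)

# Post-Dennard: voltage stopped scaling (V -> V), everything else as before
power_now = s * 1.0 * (1.0 / s)             # new power = old
density_now = power_now / s**2              # per unit area: ~2x per generation
print("Post-Dennard: power density x%.2f per generation (rising)" % density_now)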
Many-core tiled architectures
From Tilera's TILEPro series architecture overview (Chapter 2):

"The Tile Processor™ implements Tilera's multicore architecture, incorporating a two-dimensional array of processing elements (each referred to as a tile), connected via multiple two-dimensional mesh networks. Based on Tilera's iMesh™ Interconnect technology, the architecture is scalable and provides high bandwidth and extremely low latency communication among tiles. The Tile Processor integrates external memory and I/O interfaces on chip and is a complete programmable multicore processor. External memory and I/O interfaces are connected to the tiles via the iMesh interconnect.

Each tile is a powerful, full-featured computing system that can independently run an entire operating system, such as Linux. Each tile implements a 32-bit integer processor engine utilizing a three-way Very Long Instruction Word (VLIW) architecture with its own program counter (PC), cache, and DMA subsystem. An individual tile is capable of executing up to three operations per cycle."

(Figure 2-1: the 64-core TILEPro64 hardware architecture, an 8×8 mesh of tiles, each containing a processor engine, cache engine, and switch engine attached to the on-chip networks (CDN, TDN, IDN, MDN, STN, UDN), with DDR2 memory controllers, PCIe, GbE/10GbE, and other I/O interfaces on the periphery.)
2010s: multi-core, GPGPU
Motivation: increased performance without higher clock rates or smaller devices.
Requires changes in programming paradigm.
Examples: Tilera TILEPro, NVIDIA GM204, Intel Xeon Phi (Knights Landing).
Special purpose computation: Anton
From the paper:

"Anton 2 comprises a number of identical ASICs ('nodes') that are directly connected to form a three-dimensional torus topology (Fig. 3). Inter-node connections consist of two bidirectional 112 Gb/s 'channels', providing a total raw bandwidth (data plus encoding overhead) of 224 Gb/s to each of an ASIC's six neighbors within the torus. The ASIC contains an array of two types of computational tiles. There are 16 flexible subsystem ('flex') tiles, each with 4 embedded processors called geometry cores (GCs), 256 KB of SRAM, a network interface, and a dispatch unit [20] that provides hardware support for fine-grained event-driven computation (Fig. 4a). In addition, there are 2 high-throughput interaction subsystem (HTIS) tiles, each with an array of 38 pairwise point interaction modules (PPIMs) for computing pairwise interactions (Fig. 4b). These tiles are clocked at 1.65 GHz and are connected by an on-chip mesh network consisting of 317 Gb/s bidirectional links [52]. A host interface communicates with an external host processor over a PCI Express link, and a 'logic analyzer' captures and stores on-die activity for debugging purposes.

The HTIS tile computes pairwise interactions between two (possibly overlapping) sets of atoms or grid points using the same strategy as the Anton 1 HTIS [29]. It has a maximum throughput of two interactions per cycle per PPIM. The computation is directed by an interaction control block (ICB), which receives atoms and grid points from the network, sorts them into 128 queues within a 640-KB local memory, then loads data from the queues into the PPIM array. The ICB is in turn controlled by a sequence of commands that it receives from the GC within a mini-flex tile, which is a scaled-down version of the flex tile containing a single GC, 64 KB of SRAM, a dispatch unit, and a network interface. Controlling the ICB with a mini-flex tile simplifies the task of programming Anton 2 by allowing the same toolchain to be used for all embedded software and enabling code sharing between the flex tile and HTIS tile software.

While the GCs and PPIMs have been given the same names as their Anton 1 counterparts, the blocks themselves have been redesigned. The GCs have an entirely new instruction-set architecture, and the PPIM datapaths have fundamentally changed to provide improved performance and flexibility.

Table 1 gives a block-level comparison of the Anton 1 and Anton 2 ASICs. The improvement in inter-node bandwidth is due to a combination of more aggressive signaling technology (14 Gb/s vs. 4.6 Gb/s), more total pins devoted to inter-node signaling (384 vs. 264), and a more efficient physical-layer..."

(Fig. 3: the Anton 2 ASICs are directly connected by high-speed channels to form a three-dimensional torus topology. Each 20.4 mm × 20 mm ASIC, implemented in 40-nm technology, contains 2 connections to each neighbor in the torus, 16 flex tiles, 2 HTIS tiles, a host interface, and an on-die logic analyzer.)

(Fig. 4: a flex tile contains 4 geometry cores, 256 KB of shared SRAM, a dispatch unit, a network interface, and an internal network. An HTIS tile contains a mini-flex tile, an interaction control block (ICB), and an array of 38 PPIMs; the ICB sorts atoms and grid points into 128 queues stored in a 640-KB local memory, loads data from the local memory into the PPIM array, and converts the output of the PPIM array into network packets. The ICB is controlled by a sequence of commands that it receives from the mini-flex GC.)
(Shaw et al., 2014)
Computational motifs

   Motif                          Examples
1  Dense linear algebra           Matrix multiply (GEMM)
2  Sparse linear algebra          SpMV
3  Spectral methods               FFT
4  N-body methods                 Molecular dynamics
5  Structured grids               Lattice Boltzmann
6  Unstructured grids             CFD
7  Map-reduce                     Expectation maximization
8  Combinational logic            Encryption, hashing
9  Graph traversal                Decision trees, quicksort
10 Dynamic programming            Forward-backward
11 Backtrack, branch and bound    Constraint satisfaction
12 Graphical models               HMM, Bayesian networks
13 Finite state machines          Compilers

(Asanovic et al., 2006)

These motifs can be implemented using:
• Silicon
• Software
• Neural network architectures!
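Motif 1, dense linear algebra, is the workhorse of deep learning: even convolutions are commonly lowered to a single matrix multiply (GEMM). A minimal sketch of that standard "im2col" lowering (illustrative, not Nervana's actual kernels):

import numpy as np

def conv2d_as_gemm(x, w):
    """Valid 2-D cross-correlation of x (H x W) with kernel w (K x K),
    computed as one matrix multiply after an im2col data rearrangement."""
    H, W = x.shape
    K = w.shape[0]
    Ho, Wo = H - K + 1, W - K + 1
    # im2col: each K x K input patch becomes one column
    cols = np.empty((K * K, Ho * Wo))
    for i in range(Ho):
        for j in range(Wo):
            cols[:, i * Wo + j] = x[i:i + K, j:j + K].ravel()
    # a single GEMM produces every output pixel
    return (w.ravel() @ cols).reshape(Ho, Wo)

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0              # 3 x 3 box filter
print(conv2d_as_gemm(x, w))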
Summary

• Computers are tools for solving the problems of their time
• Was: coding, calculation, graphics, the web
• Today: learning and inference on data
• Deep learning is a computational paradigm
• A custom architecture can do vastly better
• We are hiring! Summer interns and full time.