
From PIM to Petaflops Computing

MIND: Scalable Embedded Computing through Advanced Processor in Memory

Thomas Sterling
California Institute of Technology
and
NASA Jet Propulsion Laboratory

September 24, 2002

Presentation to the High Performance Embedded Computing Conference 2002



Summary of Mission Driver Factors

- Speed of light precludes real-time manual control
- Mission duration and spacecraft lifetime up to 100 years
- Adaptivity to system and environmental uncertainty through reasoning
- Cost of ground-based deep space tracking and high-bandwidth downlink
- Weight and cost of spacecraft high-bandwidth downlink
  - Antennas, transmitter, power supply
  - Raw power source
  - Maneuver rockets and/or inertial storage, mid-course main engine thrusters
  - Launch vehicle fuel and type
- On-board science computation
- On-board mission planning (long term and real time)
- On-board mission fault detection, diagnosis, and reconfiguration
- Obstructed mission profiles


Goals for a New Generation of Spaceborne Supercomputer

- Performance gain of 100x to 10,000x
- Low power, high power efficiency
- Wide range for active power management
- Fault tolerance and graceful degradation
- High scalability to meet widely varying mission profiles
- Common ISA for software reuse and technology migration
- Multitasking, real-time response
- Numeric, data-oriented, and symbolic computation


Processor in Memory (PIM)

- PIM merges logic with memory
  - Wide ALUs next to the row buffer
  - Optimized for memory throughput, not ALU utilization
- PIM has the potential of riding Moore's law while
  - greatly increasing effective memory bandwidth,
  - providing many more concurrent execution threads,
  - reducing latency,
  - reducing power, and
  - increasing overall system efficiency
- It may also simplify programming and system design

[Figure: PIM chip layout. Labels: Memory Stack (four blocks), Sense Amps (on both sides of each memory stack), Decode, Node Logic.]


Why is PIM Inevitable?

- Separation between memory and logic is artificial
  - von Neumann bottleneck
  - Imposed by technology limitations
  - Not a desirable property of computer architecture
- Technology now brings down the barrier
  - We didn’t do it because we couldn’t do it
  - We can do it, so we will do it
- What to do with a billion transistors
  - Complexity cannot be extended indefinitely
  - Synthesis of simple elements through replication
  - Means to fault tolerance, lower power
- Normalize memory touch time by scaling bandwidth with capacity
  - Without it, it takes ever longer to look at each memory block
- Will be a mass-market commercial commodity
  - Drivers outside of the HPC thrust
  - Cousin to embedded computing


Current PIM Projects

- IBM Blue Gene
  - Pflops computer for protein folding
- UC Berkeley IRAM
  - Attached to conventional servers for multimedia
- USC ISI DIVA
  - Irregular data structure manipulation
- U of Notre Dame PIM-lite
  - Multithreaded
- Caltech MIND
  - Virtual everything for scalable, fault-tolerant, general-purpose computing


Limitations of Current PIM Architectures

- No global address space
- No virtual to physical address translation
  - DIVA recognizes pointers for irregular data handling
- Do not exploit full potential memory bandwidth
  - Most use full row buffer
  - Blue Gene/Cyclops has 32 nodes
- No memory to memory process invocation
  - PIM-lite & DIVA use parcels for method driven computation
- No low overhead context switching
  - BG/C and PIM-lite have some support for multithreading


MIND Architecture

- Memory-Intelligence-and-Networking Devices
- Target systems
  - Homogeneous MIND arrays
  - Heterogeneous MIND layer with external high-speed processors
  - Scalable embedded
- Addresses challenges of:
  - global shared memory and virtual paged management
  - irregular data structure handling
  - dynamic adaptive on-chip resource management
  - inter-chip transactions
  - global system locality and latency management
  - power management and system configurability
  - fault tolerance


Attributes of MIND Architecture

- Parcel active message driven computing
  - Decoupled split-transaction execution
  - System wide latency hiding
  - Move work to data instead of data to work
- Multithreaded control
  - Unified dynamic mechanism for resource management
  - Latency hiding
  - Real time response
- Virtual to physical address translation in memory
  - Global distributed shared memory through distributed directory table
  - Dynamic page migration
  - Wide registers serve as context sensitive TLB
- Graceful degradation for fault tolerance


MIND Mesh Array

[Figure: mesh array of PIM-MT nodes, with a sensor and an actuator attached to the array.]


Diagram - MIND Chip Architecture

[Figure: MIND chip block diagram. Labels: Nodes (four), On-chip communications, Shared Computing Resources, Explicit Signals, System memory bus interface, Stream and backing store I/O interface, Parcel Interfaces.]


MIND Node

[Figure: MIND node block diagram. Labels: Wide Register Bank, Multithreading Execution Control, memory address buffer, Parcel Handler, Parcel Interface, On-chip Interface, Permutation Network, Wide Multi Word ALU, Memory Stack, sense amps & row buffer, Memory Controller.]


Unified Register Set Supports a Diversity of Runtime Mechanisms

- Node status word
- Thread state
- Parcel decoding
- Parcel construction
- Vector register
- Translation Lookaside Buffer
- Instruction cache
- Data cache
- Irregular Data Structure Node (data, pointers, etc.)


MIND Node Instruction Set

- Basic set of word operations
- Row wide field permutations for reordering and alignment
- Data parallel ops across row-wide register and delimited subfields
- Parallel dual ops with key field and data field for rapid associative searches
- Thread management and control
- Parcel explicit create, send, receive
- Virtual and physical word access; local, on-chip, remote
- Floating point
- Reconfiguration
- Protected supervisor
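The key-field/data-field dual operation can be pictured as one comparison applied to every subfield of a row at once. The sketch below is illustrative only: the row layout (a sequence of key/data pairs), the name `associative_search`, and the use of a Python list to stand in for a row-wide register are assumptions, not the MIND instruction encoding.

```python
from typing import List, Tuple

# Hypothetical row layout: each subfield holds a (key field, data field) pair.
Row = List[Tuple[int, int]]


def associative_search(row: Row, key: int) -> List[int]:
    """Return the data fields of every subfield whose key field matches.

    In hardware the comparisons would be applied across the whole row in
    one operation; a list comprehension stands in for that parallelism.
    """
    match_mask = [k == key for k, _ in row]                  # compare all key fields
    return [d for (_, d), hit in zip(row, match_mask) if hit]  # select matching data


row = [(7, 100), (3, 200), (7, 300), (9, 400)]
print(associative_search(row, 7))
```

The match mask is kept as a separate step because a row-wide ALU would most plausibly produce it first and then use it to select or compact the data fields.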


Multithreading in PIMS

- MIND must respond asynchronously to service requests from multiple sources
- Parcel-driven computing requires rapid response to incident packets
- Hardware supports multitasking for multiple concurrent method instantiations
- High memory bandwidth utilization by overlapping computation with access ops
- Manages shared on-chip resources
- Provides fine-grain context switching
- Latency hiding


Single HWT; Multiple Memory Banks; Multithreaded

[Figure: normalized instructions per cycle (0 to 1.2) versus number of threads (1 to 10), with one curve each for 1, 2, 3, and 4 memory banks. Probability of a register-to-register instruction fixed at 0.7; probability of a data cache hit fixed at 0.9; memory access fixed at 70 cycles.]

NOTE: For 1 and 2 memory banks, memory becomes the bottleneck as the number of threads increases, while for 3 and 4 banks the single HWT becomes the system bottleneck.
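A back-of-the-envelope bound reproduces the qualitative behavior in the note above. This is not the simulation behind the chart; it is a minimal sketch that assumes every non-register instruction is a memory reference, a cache miss blocks the issuing thread and occupies one bank for the full 70 cycles, and the single HWT issues at most one instruction per cycle. The function name `normalized_ipc` is hypothetical.

```python
def normalized_ipc(threads: int, banks: int,
                   p_reg: float = 0.7, p_hit: float = 0.9,
                   mem_cycles: int = 70) -> float:
    """Upper bound on instructions per cycle for one hardware thread unit."""
    # Fraction of instructions that go all the way to a memory bank.
    miss_rate = (1.0 - p_reg) * (1.0 - p_hit)
    # Average instructions a thread issues before it blocks on a miss.
    run_length = 1.0 / miss_rate
    # Bound from thread supply: each thread is runnable for run_length
    # cycles out of every run_length + mem_cycles.
    thread_bound = threads * run_length / (run_length + mem_cycles)
    # Bound from bank supply: each bank completes one access per mem_cycles.
    bank_bound = banks / (mem_cycles * miss_rate)
    # The HWT itself cannot exceed one instruction per cycle.
    return min(1.0, thread_bound, bank_bound)


for banks in (1, 2, 3, 4):
    print(banks, [round(normalized_ipc(t, banks), 2) for t in range(1, 11)])
```

Under these assumptions the miss rate is 0.03 per instruction, so one bank caps throughput near 0.48 and two banks near 0.95, while three or more banks could feed more than one instruction per cycle and the HWT becomes the limit.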


PIM Parcel Model

- Parcel: logically complete grouping of info sent to a node on a PIM chip
  - by SPELLs, other PIM nodes
- At arrival, triggers local computation:
  - Read from local memory
  - Perform some operation(s)
  - Write back locally (optional)
  - Return value to sender (optional)
  - Initiate additional parcel(s) (optional)
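The arrival sequence listed above can be sketched as a small handler. The `Parcel` and `Node` classes, their field names, and the `outbox` list are hypothetical stand-ins chosen for illustration; the slides do not specify a parcel format.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Parcel:
    address: int                            # target word in the node's local memory
    operation: Callable[[int, int], int]    # applied to (stored value, operand)
    operand: int = 0
    write_back: bool = False                # optional: store the result locally
    reply_to: Optional[str] = None          # optional: sender that wants the result
    follow_up: Optional[Tuple[str, "Parcel"]] = None  # optional: (destination, parcel)


@dataclass
class Node:
    name: str
    memory: Dict[int, int] = field(default_factory=dict)
    outbox: List[tuple] = field(default_factory=list)  # (destination, payload) pairs

    def handle(self, parcel: Parcel) -> None:
        value = self.memory.get(parcel.address, 0)          # read from local memory
        result = parcel.operation(value, parcel.operand)    # perform the operation
        if parcel.write_back:
            self.memory[parcel.address] = result            # write back locally
        if parcel.reply_to is not None:
            self.outbox.append((parcel.reply_to, result))   # return value to sender
        if parcel.follow_up is not None:
            self.outbox.append(parcel.follow_up)            # initiate another parcel


node = Node("n0", {16: 5})
node.handle(Parcel(address=16, operation=lambda v, x: v + x, operand=3,
                   write_back=True, reply_to="host"))
print(node.memory[16], node.outbox)
```

The work (an add) travels to the data rather than the data travelling to a remote processor, which is the "move work to data" idea from the attributes slide.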


PIM Node Architecture

[Figure: PIM node block diagram. Labels: Row, RAM Array, "VLIW" Instruction Store, Active Parcels, Parcel Queue (Address, Command, Data Operand), Iterator or Multi-Cycle Thread, CPU, ASAP Logic.]


Virtual Page Handling

- Pages preferentially distributed in local groups with associated page entry tables
- Directory table entries located by physical address
- Pages may be randomly distributed within MIND chip or group
- Pages may be randomly distributed, requiring second hop from page table location
- Supervisor address space supports local node overhead and service tasks
- Copying to physical pages, not to virtual
- Demand paging to/from backing store or other MIND chips
- Nodes directly address memory of others on same MIND chip
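One way to picture the directory lookup and the occasional second hop is sketched below. Everything specific here is an assumption made for illustration: the 4 KB page size, the modulo rule in `home_node` for placing directory entries, and the `MindNode`, `translate`, and `migrate` names. The slides state only that directory entries are found by physical address and that a page may live away from its page table entry.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

PAGE_SIZE = 4096  # assumed page size, not taken from the slides


@dataclass
class MindNode:
    node_id: int
    # Directory entries held on this node: virtual page -> (owner node, physical frame)
    directory: Dict[int, Tuple[int, int]] = field(default_factory=dict)


def home_node(vpage: int, num_nodes: int) -> int:
    """Assumed rule: the directory entry sits at a fixed, computable location."""
    return vpage % num_nodes


def translate(nodes: List[MindNode], vaddr: int) -> Tuple[int, int, int]:
    """Return (owner node, physical address on that node, hops taken)."""
    vpage, offset = divmod(vaddr, PAGE_SIZE)
    home = nodes[home_node(vpage, len(nodes))]
    owner, frame = home.directory[vpage]
    # One hop if the page sits with its directory entry, two otherwise.
    hops = 1 if owner == home.node_id else 2
    return owner, frame * PAGE_SIZE + offset, hops


def migrate(nodes: List[MindNode], vpage: int, new_owner: int, new_frame: int) -> None:
    """Page migration only rewrites the directory entry; virtual addresses are unchanged."""
    nodes[home_node(vpage, len(nodes))].directory[vpage] = (new_owner, new_frame)


nodes = [MindNode(i) for i in range(4)]
nodes[1].directory[5] = (1, 2)   # page 5 lives on its home node
nodes[2].directory[6] = (3, 0)   # page 6 lives elsewhere: second hop needed
print(translate(nodes, 5 * PAGE_SIZE + 12))
print(translate(nodes, 6 * PAGE_SIZE))
```

Because software only ever holds virtual addresses, moving a page changes one directory entry and nothing else, which is what allows reconfiguration without explicit remapping.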


Fault Tolerance

- Near fine-grain redundancy provides multiple alike resources to perform workload tasks.
- Even single-chip Gilgamesh (for rovers, sensor webs) will incorporate 4-way to 16-way redundancy and graceful degradation.
- Hardware architecture includes fault detection mechanisms.
- Software tags for bit-checking at hardware speeds; includes constant memory scrubbing.
- Monitor threads for background fault detection and diagnosis.
- Virtual data and tasks permit rapid reconfiguration without software regeneration or explicit remapping.


System Availability as a Function of the Number of Faults Before Node Failure

[Figure: percentage of total system capacity available (0% to 100%) versus elapsed time (0 to 2400), with one curve each for node failure after 1 fault, 2 faults, and 3 faults. MTBF = 1 unit; exponential arrival rate of faults; 64 modules, 4 nodes per module.]
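A simple analytic model gives curves of the same general shape as the availability chart. It is a sketch under stated assumptions, not the model used to produce the figure: the MTBF of 1 unit is taken to be system-wide, faults land uniformly at random on the 256 nodes, failed nodes are never repaired, and capacity is the expected fraction of nodes still alive. The function name `capacity_available` is hypothetical.

```python
import math


def capacity_available(t: float, faults_to_fail: int,
                       modules: int = 64, nodes_per_module: int = 4,
                       mtbf: float = 1.0) -> float:
    """Expected fraction of nodes still working at elapsed time t."""
    nodes = modules * nodes_per_module
    # Faults arrive system-wide at rate 1/mtbf, so each node has seen a
    # Poisson-distributed number of faults with this mean by time t.
    mean_faults = t / (mtbf * nodes)
    # A node survives while it has absorbed fewer than faults_to_fail faults.
    return sum(math.exp(-mean_faults) * mean_faults ** i / math.factorial(i)
               for i in range(faults_to_fail))


for k in (1, 2, 3):
    print(k, [round(capacity_available(t, k), 3) for t in (0, 240, 480, 960, 2400)])
```

Tolerating even one extra fault per node shifts the whole curve to the right, which is the argument for fine-grain redundancy with graceful degradation.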


Real Time Response

- Multiple nodes permit dedication of a single node to a single real-time task
- Threads and pages can be nailed down for real-time tasks
- Multithreading uses real-time priority for guaranteed reaction time
- Preemptive memory access
- Virtual address translation can be buffered in registers as TLB
- Hardwired signal lines from sensors and to actuators


Power Reduction Strategy

- Objective: achieve a 10x to 100x reduction in power over conventional systems of comparable performance.
- On-chip data operations avoid external I/O drivers.
- Number of memory block row accesses reduced because all row bits are available for processing.
- Simple processor with reduced logic. No branch prediction, speculative execution, or complex scoreboarding.
- No caches.
- Power management of separate processor/memory nodes.


Earth Simulator


Architectures

[Figure: stacked chart of system counts (0 to 500) by architecture class (Single Processor, SMP, MPP, SIMD, Constellation, Cluster - NOW) for each June and November list from Jun-93 to Jun-01. Annotated machines: Y-MP C90, Sun HPC, Paragon, CM5, T3D, T3E, SP2, Cluster of Sun HPC, ASCI Red, CM2, VP500, SX3.]


Cascade Node

[Figure: Cascade node block diagram. Labels: Compute Unit with 100 Gflops MTV Processor and Compiler Managed Cache; Non Blocking Router (To Interconnect Network, To External I/O); Main Memory PIM Array (three DRAM + ALU pairs); 3/2 Memory (three HD-RAM blocks). Courtesy of Thomas Sterling.]


Roles for PIM/MIND in Cascade

- Perform in-place operations on zero-reuse data
- Exploit high degree data parallelism
- Rapid updates on contiguous data blocks
- Rapid associative searches through contiguous data blocks
- Gather-scatters
- Tree/graph walking
- Enables efficient and concurrent array transpose
- Permits fine grain manipulation of sparse and irregular data structures
- Parallel prefix operations
- In-memory data movement
- Memory management overhead work
- Engage in prestaging of data for MTV/HWT processors
- Fault monitoring, detection, and cleanup
- Manage 3/2 memory layer


Speedup of Smart Memory Over Dumb Memory for Various LWT Clock Rates (64 Smart Memory Nodes)

[Figure: speedup (0 to 25) versus LWT clock rate in MHz (0 to 2000).]


FPGA-based Breadboard

- FPGA technology has reached million gate count
- Rapid prototyping enabled
- MIND breadboard
  - Dual node MIND module
  - Each node
    - 2 FPGAs
    - 8 Mbytes of SRAM
    - External serial interconnect for parcels
    - Interface to other on-board node
- Test Facility
  - Rack of four cages
  - Each cage with eight MIND modules
- Alpha boards near completion (4)
- Beta board design waiting on next-generation parts


[Figure: MIND breadboard module block diagram. Labels: 256-bit wide SRAM; FPGA A and FPGA B; data lines D0-D127 and D128-D255; address lines A0-A17; CTRL; 1394 PHY (two); MCU; 1394 LLC+PHY; Configuration; host; Node 1; Node 2; Remote nodes.]
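The breadboard numbers can be cross-checked with a little arithmetic. The sketch below assumes the 18 address lines (A0-A17) each select one full 256-bit word, and that the test facility is fully populated with dual-node modules; the function name `breadboard_totals` is hypothetical.

```python
def breadboard_totals(address_lines: int = 18, data_width_bits: int = 256,
                      cages: int = 4, modules_per_cage: int = 8,
                      nodes_per_module: int = 2, fpgas_per_node: int = 2) -> dict:
    """Derive per-node SRAM size and full-rack totals from the slide figures."""
    # 2^18 words of 256 bits (32 bytes) each.
    sram_bytes_per_node = (2 ** address_lines) * data_width_bits // 8
    nodes = cages * modules_per_cage * nodes_per_module
    return {
        "sram_mib_per_node": sram_bytes_per_node // (1024 * 1024),
        "nodes": nodes,
        "fpgas": nodes * fpgas_per_node,
        "total_sram_mib": nodes * sram_bytes_per_node // (1024 * 1024),
    }


print(breadboard_totals())
```

Under these assumptions the address and data widths give exactly the 8 Mbytes per node quoted for the breadboard, and a full rack comes to 64 nodes.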


MIND Prototype
