From PIM to Petaflops Computing
MIND: Scalable Embedded Computing through Advanced Processor in Memory
Thomas Sterling
California Institute of Technology
and NASA Jet Propulsion Laboratory
September 24, 2002
Presentation to the High Performance Embedded Computing Conference 2002
September 24, 2002 Thomas Sterling - Caltech & NASA JPL 2
Summary of Mission Driver Factors
- Speed of light precludes real-time manual control
- Mission duration and spacecraft lifetime up to 100 years
- Adaptivity to system and environmental uncertainty through reasoning
- Cost of ground-based deep space tracking and high-bandwidth downlink
- Weight and cost of spacecraft high-bandwidth downlink
  - Antennas, transmitter, power supply
  - Raw power source
  - Maneuver rockets and/or inertial storage, mid-course main engine thrusters
  - Launch vehicle fuel and type
- On-board science computation
- On-board mission planning (long term and real time)
- On-board mission fault detection, diagnosis, and reconfiguration
- Obstructed mission profiles
Goals for a New Generation of Spaceborne Supercomputer
- Performance gain of 100x to 10,000x
- Low power, high power efficiency
- Wide range for active power management
- Fault tolerance and graceful degradation
- High scalability to meet widely varying mission profiles
- Common ISA for software reuse and technology migration
- Multitasking, real-time response
- Numeric, data-oriented, and symbolic computation
Processor in Memory (PIM)
- PIM merges logic with memory
  - Wide ALUs next to the row buffer
  - Optimized for memory throughput, not ALU utilization
- PIM has the potential of riding Moore's law while
  - greatly increasing effective memory bandwidth,
  - providing many more concurrent execution threads,
  - reducing latency,
  - reducing power, and
  - increasing overall system efficiency
- It may also simplify programming and system design
[Figure: PIM chip layout. Labels: memory stacks, each with sense amps; decode; node logic]
Why is PIM Inevitable?
- Separation between memory and logic is artificial
  - von Neumann bottleneck
  - Imposed by technology limitations
  - Not a desirable property of computer architecture
- Technology now brings down the barrier
  - We didn't do it because we couldn't do it
  - We can do it, so we will do it
- What to do with a billion transistors
  - Complexity cannot be extended indefinitely
  - Synthesis of simple elements through replication
  - Means to fault tolerance, lower power
- Normalize memory touch time by scaling bandwidth with capacity
  - Without it, it takes ever longer to look at each memory block
- Will be a mass-market commercial commodity
  - Drivers outside of HPC thrust
  - Cousin to embedded computing
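
The memory touch time point can be made concrete with a little arithmetic. The sketch below treats touch time as capacity divided by aggregate bandwidth. The function names and all numbers are illustrative assumptions, not figures from the talk.

```python
def touch_time_seconds(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to read every byte of memory once at the given bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s


def pim_touch_time_seconds(nodes: int, bytes_per_node: float,
                           node_bandwidth_bytes_per_s: float) -> float:
    """Touch time when each node scans only its own memory, all in parallel."""
    capacity = nodes * bytes_per_node
    bandwidth = nodes * node_bandwidth_bytes_per_s
    return touch_time_seconds(capacity, bandwidth)


# Hypothetical numbers, chosen only to show the trend.
GIB = 2 ** 30
MIB = 2 ** 20
fixed_bandwidth = 8 * GIB  # one shared memory channel (assumed)

# Fixed bandwidth: touch time grows with capacity.
for capacity_gib in (1, 4, 16):
    print(capacity_gib, touch_time_seconds(capacity_gib * GIB, fixed_bandwidth))

# Bandwidth scaled with capacity: touch time stays constant as nodes are added.
for nodes in (16, 64, 256):
    print(nodes, pim_touch_time_seconds(nodes, 64 * MIB, 2 * GIB))
```

With a fixed channel, touch time grows linearly with capacity; when every added node brings its own bandwidth, it does not grow at all.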
Current PIM Projects
- IBM Blue Gene
  - Petaflops computer for protein folding
- UC Berkeley IRAM
  - Attached to conventional servers for multimedia
- USC ISI DIVA
  - Irregular data structure manipulation
- U of Notre Dame PIM-lite
  - Multithreaded
- Caltech MIND
  - Virtual everything for scalable, fault-tolerant, general-purpose computing
Limitations of Current PIM Architectures
- No global address space
- No virtual to physical address translation
  - DIVA recognizes pointers for irregular data handling
- Do not exploit full potential memory bandwidth
  - Most use full row buffer
  - Blue Gene/Cyclops has 32 nodes
- No memory to memory process invocation
  - PIM-lite & DIVA use parcels for method-driven computation
- No low-overhead context switching
  - BG/C and PIM-lite have some support for multithreading
MIND Architecture
- Memory-Intelligence-and-Networking Devices
- Target systems
  - Homogeneous MIND arrays
  - Heterogeneous MIND layer with external high-speed processors
  - Scalable embedded
- Addresses challenges of:
  - global shared memory and virtual page management
  - irregular data structure handling
  - dynamic adaptive on-chip resource management
  - inter-chip transactions
  - global system locality and latency management
  - power management and system configurability
  - fault tolerance
Attributes of MIND Architecture
- Parcel active message driven computing
  - Decoupled split-transaction execution
  - System-wide latency hiding
  - Move work to data instead of data to work
- Multithreaded control
  - Unified dynamic mechanism for resource management
  - Latency hiding
  - Real-time response
- Virtual to physical address translation in memory
  - Global distributed shared memory through distributed directory table
  - Dynamic page migration
  - Wide registers serve as context-sensitive TLB
- Graceful degradation for fault tolerance
MIND Mesh Array
[Figure: mesh array of PIM-MT nodes, with a sensor and an actuator attached to the array]
Diagram - MIND Chip Architecture
[Figure labels: on-chip communications; shared computing resources; nodes (x4); explicit signals; system memory bus interface; stream and backing store I/O interface; parcel interfaces]
MIND Node
[Figure labels: wide register bank; multithreading execution control; memory address buffer; parcel handler; parcel interface; on-chip interface; permutation network; wide multi-word ALU; memory stack; sense amps & row buffer; memory controller]
Unified Register Set Supports a Diversity of Runtime Mechanisms
- Node status word
- Thread state
- Parcel decoding
- Parcel construction
- Vector register
- Translation Lookaside Buffer
- Instruction cache
- Data cache
- Irregular data structure node (data, pointers, etc.)
MIND Node Instruction Set
- Basic set of word operations
- Row-wide field permutations for reordering and alignment
- Data parallel ops across row-wide register and delimited subfields
- Parallel dual ops with key field and data field for rapid associative searches
- Thread management and control
- Parcel explicit create, send, receive
- Virtual and physical word access; local, on-chip, remote
- Floating point
- Reconfiguration
- Protected supervisor
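
The dual key/data operations can be pictured as a compare across every key subfield of a row, followed by selection of the matching data fields. The sketch below is a plain-Python model of that idea only. The row layout (a list of key/data pairs) and the function name are assumptions, and hardware would perform the compare across the whole row at once rather than in a loop.

```python
from typing import List, Tuple

Row = List[Tuple[int, int]]  # (key field, data field) pairs; layout is assumed


def associative_search(row: Row, key: int) -> List[int]:
    """Return the data fields whose key field equals `key`.

    Step 1 models the row-wide compare (one match bit per subfield);
    step 2 models selecting the data fields under that match mask.
    """
    match_mask = [k == key for k, _ in row]
    return [data for (_, data), hit in zip(row, match_mask) if hit]


row = [(7, 100), (3, 200), (7, 300), (9, 400)]
print(associative_search(row, 7))  # [100, 300]
```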
Multithreading in PIMs
- MIND must respond asynchronously to service requests from multiple sources
- Parcel-driven computing requires rapid response to incident packets
- Hardware supports multitasking for multiple concurrent method instantiations
- High memory bandwidth utilization by overlapping computation with access ops
- Manages shared on-chip resources
- Provides fine-grain context switching
- Latency hiding
Single HWT; Multiple Memory Banks; MultiThread
- Probability of reg-to-reg instruction fixed at 0.7
- Probability of data cache hit fixed at 0.9
- Memory access fixed at 70 cycles

[Chart: normalized instructions per cycle (0 to 1.2) versus number of threads (1 to 10), one curve each for 1, 2, 3, and 4 memory banks]

NOTE: For 1 & 2 memory banks, memory becomes the bottleneck as the number of threads increases, while for 3 & 4 banks the single HWT becomes the system bottleneck.
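
The bottleneck note can be checked with a rough bound built from the three fixed parameters. This is a sketch under stated assumptions, not the model behind the chart: it assumes every instruction takes one issue cycle, that an instruction that is not reg-to-reg is a memory operation, and that a data cache miss stalls its thread and occupies one bank for the full 70 cycles. The function name is hypothetical.

```python
def ipc_upper_bound(threads: int, banks: int,
                    p_reg: float = 0.7, p_hit: float = 0.9,
                    mem_cycles: int = 70) -> float:
    """Upper bound on instructions per cycle for one hardware thread unit.

    Assumptions (not from the slides): one issue cycle per instruction;
    a non reg-to-reg instruction that misses the data cache stalls its
    thread and keeps one memory bank busy for `mem_cycles`.
    """
    miss_per_instr = (1.0 - p_reg) * (1.0 - p_hit)        # about 0.03
    bank_cycles_per_instr = miss_per_instr * mem_cycles   # about 2.1
    issue_bound = 1.0                                      # single issue slot
    thread_bound = threads / (1.0 + bank_cycles_per_instr)
    bank_bound = banks / bank_cycles_per_instr
    return min(issue_bound, thread_bound, bank_bound)


for banks in (1, 2, 3, 4):
    print(banks, [round(ipc_upper_bound(t, banks), 2) for t in (1, 2, 4, 8)])
```

Under these assumptions the bank bound is below 1 for one and two banks (about 0.48 and 0.95) and above 1 for three and four, so the single issue slot becomes the limit. That is consistent with the note, though the chart's exact values may come from a more detailed simulation.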
PIM Parcel Model
- Parcel: logically complete grouping of info sent to a node on a PIM chip
  - by SPELLs, other PIM nodes
- At arrival, triggers local computation:
  - Read from local memory
  - Perform some operation(s)
  - Write back locally (optional)
  - Return value to sender (optional)
  - Initiate additional parcel(s) (optional)
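
The arrival steps listed above can be sketched as a small handler. Everything named here (`Parcel`, `Node`, the field names, the `outbox` list) is hypothetical and chosen for illustration; the slides do not define a parcel format. Initiating additional parcels is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Parcel:
    address: int                          # target location in the node's memory
    operation: Callable[[int, int], int]  # action to apply (hypothetical form)
    operand: int = 0
    write_back: bool = False              # optional local write
    reply_to: Optional[int] = None        # optional sender node id for the result


class Node:
    def __init__(self, node_id: int, memory: Dict[int, int]):
        self.node_id = node_id
        self.memory = memory
        self.outbox: List[Tuple[int, int]] = []  # (destination node, value)

    def handle(self, parcel: Parcel) -> int:
        value = self.memory[parcel.address]                # read from local memory
        result = parcel.operation(value, parcel.operand)   # perform the operation
        if parcel.write_back:
            self.memory[parcel.address] = result           # optional write back
        if parcel.reply_to is not None:
            self.outbox.append((parcel.reply_to, result))  # optional return value
        return result


node = Node(node_id=3, memory={0x10: 41})
node.handle(Parcel(address=0x10, operation=lambda v, x: v + x,
                   operand=1, write_back=True, reply_to=0))
print(node.memory[0x10], node.outbox)  # 42 [(0, 42)]
```

The point of the model is that the work travels to the node that owns the data, and only the small result travels back.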
PIM Node Architecture
[Figure labels: row; RAM array; "VLIW" instruction store; active parcels; parcel queue; address; command; data operand; iterator or multi-cycle thread; CPU; ASAP logic]
Virtual Page Handling
- Pages preferentially distributed in local groups with associated page entry tables
- Directory table entries located by physical address
- Pages may be randomly distributed within MIND chip or group
- Pages may be randomly distributed, requiring second hop from page table location
- Supervisor address space supports local node overhead and service tasks
- Copying to physical pages, not to virtual
- Demand paging to/from backing store or other MIND chips
- Nodes directly address memory of others on same MIND chip
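
The second-hop case can be illustrated with a toy directory lookup. This is a sketch under loud assumptions: the modulo mapping from virtual page number to the node holding its directory entry, the node count, and all function names are invented for illustration and are not the MIND scheme, which the slides describe only in outline.

```python
from typing import Dict, Tuple

NUM_NODES = 4  # hypothetical node count


def directory_home(vpn: int) -> int:
    """Node holding the directory entry for a virtual page (assumed mapping)."""
    return vpn % NUM_NODES


# directory[node][vpn] -> (node that stores the page, physical frame)
directory: Dict[int, Dict[int, Tuple[int, int]]] = {n: {} for n in range(NUM_NODES)}


def map_page(vpn: int, owner: int, frame: int) -> None:
    """Record where a virtual page currently lives."""
    directory[directory_home(vpn)][vpn] = (owner, frame)


def translate(vpn: int) -> Tuple[int, int, int]:
    """Return (owner node, frame, hops) for a virtual page number."""
    home = directory_home(vpn)
    owner, frame = directory[home][vpn]
    hops = 1 if owner == home else 2  # second hop when the page lives elsewhere
    return owner, frame, hops


map_page(vpn=5, owner=1, frame=12)  # page co-located with its directory entry
map_page(vpn=6, owner=3, frame=7)   # page placed away from its directory entry
print(translate(5), translate(6))   # (1, 12, 1) (3, 7, 2)
```

Keeping pages in the same local group as their page entry tables avoids the second hop, which is why the first bullet calls that placement preferential.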
Fault Tolerance
- Near fine-grain redundancy provides multiple identical resources to perform workload tasks.
- Even single-chip Gilgamesh (for rovers, sensor webs) will incorporate 4-way to 16-way redundancy and graceful degradation.
- Hardware architecture includes fault detection mechanisms.
- Software tags for bit-checking at hardware speeds; includes constant memory scrubbing.
- Monitor threads for background fault detection and diagnosis.
- Virtual data and tasks permit rapid reconfiguration without software regeneration or explicit remapping.
System Availability as Function of Number of Faults Before Node Failure
- MTBF = 1 unit
- Exponential arrival rate of faults
- 64 modules, 4 nodes/module

[Chart: % total system capacity available (0% to 100%) versus elapsed time (0 to 2400), one curve each for 1 fault, 2 faults, and 3 faults before node failure]
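
One way to reason about curves of this kind is the standard survival formula for a unit that fails on its k-th fault when faults arrive as a Poisson process. The sketch below is not the model used for the chart. It assumes the stated MTBF is system-wide and that faults land evenly on 64 x 4 = 256 nodes; both the interpretation and the function names are assumptions.

```python
import math


def node_survival(t: float, faults_to_fail: int, fault_rate: float) -> float:
    """P(node still working at time t) when faults arrive as a Poisson
    process with rate `fault_rate` and the node fails on its k-th fault."""
    x = fault_rate * t
    return math.exp(-x) * sum(x ** i / math.factorial(i)
                              for i in range(faults_to_fail))


def expected_capacity(t: float, faults_to_fail: int,
                      nodes: int = 256, system_mtbf: float = 1.0) -> float:
    """Expected fraction of nodes available, assuming the system-wide fault
    stream (one fault per `system_mtbf`) is spread evenly over all nodes."""
    per_node_rate = 1.0 / (system_mtbf * nodes)
    return node_survival(t, faults_to_fail, per_node_rate)


for k in (1, 2, 3):
    print(k, [round(expected_capacity(t, k), 3) for t in (0, 240, 480, 960)])
```

Tolerating more faults per node keeps capacity high for longer before it falls off, which is the qualitative behavior the chart title describes.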
Real Time Response
- Multiple nodes permit dedication of a single node to a single real-time task
- Threads and pages can be nailed down for real-time tasks
- Multithreading uses real-time priority for guaranteed reaction time
- Preemptive memory access
- Virtual address translation can be buffered in registers as TLB
- Hardwired signal lines from sensors and to actuators
Power Reduction Strategy
- Objective: achieve 10x to 100x reduction in power over conventional systems of comparable performance.
- On-chip data operations avoid external I/O drivers.
- Number of memory block row accesses reduced because all row bits are available for processing.
- Simple processor with reduced logic. No branch prediction, speculative execution, or complex scoreboarding.
- No caches.
- Power management of separate processor/memory nodes.
Earth Simulator
Architectures
[Chart: systems by architecture class (y-axis 0 to 500), June 1993 to June 2001. Classes: Single Processor, SMP, MPP, SIMD, Constellation, Cluster - NOW. Labeled systems: Y-MP C90, Sun HPC, Paragon, CM5, T3D, T3E, SP2, Cluster of Sun HPC, ASCI Red, CM2, VP500, SX3]
Cascade Node
[Figure labels: DRAM with ALU (x3); HD-RAM (x3); 100 Gflops MTV processor; compiler-managed cache; non-blocking router; to interconnect network; to external I/O; main memory; PIM array; 3/2 memory; compute unit. Courtesy of Thomas Sterling]
Roles for PIM/MIND in Cascade
- Perform in-place operations on zero-reuse data
- Exploit high degree of data parallelism
- Rapid updates on contiguous data blocks
- Rapid associative searches through contiguous data blocks
- Gather-scatters
- Tree/graph walking
- Enables efficient and concurrent array transpose
- Permits fine-grain manipulation of sparse and irregular data structures
- Parallel prefix operations
- In-memory data movement
- Memory management overhead work
- Engage in prestaging of data for MTV/HWT processors
- Fault monitoring, detection, and cleanup
- Manage 3/2 memory layer
Speedup of Smart Memory Over Dumb Memory for Various LWT Clock Rates
- 64 smart memory nodes

[Chart: speedup (0 to 25) versus LWT clock rate (0 to 2000 MHz)]
FPGA-based Breadboard
- FPGA technology has reached million-gate counts
- Rapid prototyping enabled
- MIND breadboard
  - Dual-node MIND module
  - Each node
    - 2 FPGAs
    - 8 Mbytes of SRAM
    - External serial interconnect for parcels
    - Interface to other on-board node
- Test facility
  - Rack of four cages
  - Each cage with eight MIND modules
- Alpha boards near completion (4)
- Beta board design awaiting next-generation parts
[Figure labels: 256-bit wide SRAM; FPGA A; FPGA B; D0-D127; D128-D255; A0-A17; CTRL; 1394 PHY (x2); MCU; 1394 LLC+PHY; configuration; host; Node 1; Node 2; remote nodes]
MIND Prototype