m-machine and grids parallel computer architectures
DESCRIPTION
M-Machine and Grids Parallel Computer Architectures. Navendu Jain. Readings. The M-machine multicomputer Marco et al., MICRO 1995 Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor Keckler et al., MICRO 1998 - PowerPoint PPT PresentationTRANSCRIPT
M-Machine and M-Machine and GridsGrids
Parallel Computer ArchitecturesParallel Computer Architectures
Navendu JainNavendu Jain
ReadingsReadings
The M-machine multicomputerThe M-machine multicomputer Marco et al., MICRO 1995 Marco et al., MICRO 1995
Exploiting fine-grain thread level parExploiting fine-grain thread level parallelism on the MIT multi-ALU proceallelism on the MIT multi-ALU processorssor
Keckler et al., MICRO 1998Keckler et al., MICRO 1998 A design space evaluation of grid proA design space evaluation of grid pro
cessor architecturescessor architectures
Nagarajan et al., MICRO 2001Nagarajan et al., MICRO 2001
OutlineOutline
The M-Machine MulticomputerThe M-Machine Multicomputer
Thread Level Parallelism on M-Thread Level Parallelism on M-MachineMachine
Grid Processor ArchitecturesGrid Processor Architectures
Review and Discussion Review and Discussion
The M-Machine The M-Machine MulticomputerMulticomputer
Design MotivationDesign Motivation
Achieve higher throughput of memory Achieve higher throughput of memory resourcesresources Increase chip area devoted to processorsIncrease chip area devoted to processors Arithmetic to bandwidth ratio of 12 Arithmetic to bandwidth ratio of 12
operations/wordoperations/word Minimize global communication (local sync.)Minimize global communication (local sync.) Faster execution of fixed size problemsFaster execution of fixed size problems Easier programmability of parallel Easier programmability of parallel
computerscomputers Incremental approachIncremental approach
ArchitectureArchitecture
A bi-directional 3-D network mesh of A bi-directional 3-D network mesh of multi-threaded processing nodesmulti-threaded processing nodes
A chip comprises of a multi-ALU processor A chip comprises of a multi-ALU processor (MAP) and 128KB on-chip sync. DRAM (MAP) and 128KB on-chip sync. DRAM
A user-accessible message passing system A user-accessible message passing system (SEND)(SEND)
Single global virtual address space Single global virtual address space Target CLK 100 MHz (control logic Target CLK 100 MHz (control logic
40MHz)40MHz)
Multi-ALU processor Multi-ALU processor (MAP)(MAP)
A MAP chip A MAP chip comprises : comprises : Three 64-bit 3-issue Three 64-bit 3-issue
clustersclusters 2-way interleaved on-2-way interleaved on-
chip cachechip cache A Memory SwitchA Memory Switch A Cluster switchA Cluster switch External memory External memory
interfaceinterface On-chip network On-chip network
interfaces and routersinterfaces and routers
A MAP ClusterA MAP Cluster
64-bit three issue 64-bit three issue pipelined processorpipelined processor
2 Integer ALUs 2 Integer ALUs 1 Floating point ALU1 Floating point ALU Register Files Register Files 4KB Instruction 4KB Instruction
cachecache A MAP instruction A MAP instruction
has 1, 2 or 3 has 1, 2 or 3 operationsoperations
Map Chip Die Map Chip Die (18 mm side, 5M transistors)(18 mm side, 5M transistors)
Exploiting Parallelism Exploiting Parallelism on on
M-MachineM-Machine
ThreadsThreads
Exploit ILP both with-in and across Exploit ILP both with-in and across the clustersthe clusters
Horizontal Threads (H-Threads)Horizontal Threads (H-Threads) Instruction level parallelismInstruction level parallelism Executes on a single MAP clusterExecutes on a single MAP cluster 3-wide instruction stream3-wide instruction stream Communication/synchronization Communication/synchronization
through messages/registers/memorythrough messages/registers/memory Max. 6 H-Threads can be interleaved Max. 6 H-Threads can be interleaved
dynamically on a cycle-by-cycle basis dynamically on a cycle-by-cycle basis
Threads (contd.)Threads (contd.)
Vertical Threads (V-Threads)Vertical Threads (V-Threads) Thread level parallelism (a standard Thread level parallelism (a standard
process) process) contains up-to 4 H-Threads (one per contains up-to 4 H-Threads (one per
cluster)cluster) Flexibility of scheduling (compiler/run-time)Flexibility of scheduling (compiler/run-time) Communication/synchronization through Communication/synchronization through
registersregisters At-most 6 resident V-ThreadsAt-most 6 resident V-Threads
4 user slots, 1 event slot, 1 exception 4 user slots, 1 event slot, 1 exception
Concurrency Model Concurrency Model Three Levels of ParallelismThree Levels of Parallelism
Instruction Level Parallelism ( 1 instruction)Instruction Level Parallelism ( 1 instruction) VLIW, Superscalar processorsVLIW, Superscalar processors Issues: Control Flow, Data dependency, Issues: Control Flow, Data dependency,
ScalabilityScalability Thread Level Parallelism (~ 1000 Thread Level Parallelism (~ 1000
instructions)instructions) Chip MultiprocessorsChip Multiprocessors Issues: Limited coarse TLP, Inner cores non-Issues: Limited coarse TLP, Inner cores non-
optimaloptimal Fine grain Parallelism (~ 50 – 1000 Fine grain Parallelism (~ 50 – 1000
instructions)instructions)
MappingMapping
ILPILP Sub-exprSub-expr ClusterCluster
Fine-GrainFine-GrainInner loopsInner loops
sub-sub-routinesroutines
MAP chipMAP chip
TLPTLPOuter Outer loopsloops
Co-Co-routinesroutines
NodesNodes
ProgramProgram ArchitectureArchitectureGranularityGranularity
Fine-grain overheadsFine-grain overheads
Thread creation (11 cycles – Thread creation (11 cycles – hforkhfork)) CommunicationCommunication
Register-Register read/writesRegister-Register read/writes Message passing/on-chip cacheMessage passing/on-chip cache
SynchronizationSynchronization Blocking on a register (full/empty bit)Blocking on a register (full/empty bit) Barrier Instruction (Barrier Instruction (cbarcbar instruction) instruction) Memory (sync bit)Memory (sync bit)
Grid Processor Grid Processor ArchitectureArchitecture
Design MotivationDesign Motivation
Continued scaling of the clock rateContinued scaling of the clock rate Scalability of the processing coreScalability of the processing core Higher ILP - Instruction throughput Higher ILP - Instruction throughput
(IPC)(IPC) Mitigate global wire and delay Mitigate global wire and delay
overheadsoverheads Closer coupling of Architecture and Closer coupling of Architecture and
compilercompiler
ArchitectureArchitecture
An inter-connected 2-D network of ALU arraysAn inter-connected 2-D network of ALU arrays Each node has a IB and a execution unitEach node has a IB and a execution unit A single control thread maps instructions to A single control thread maps instructions to
nodes nodes Block-Atomic Execution ModelBlock-Atomic Execution Model
Mapping blocks of statically scheduled instructionsMapping blocks of statically scheduled instructions Dynamic execution in data-flow orderDynamic execution in data-flow order Forwarding temp. values to the consumer ALUsForwarding temp. values to the consumer ALUs Critical path scheduled along shortest physical pathCritical path scheduled along shortest physical path
GPA ArchitectureGPA Architecture
Example: Block-Atomic Example: Block-Atomic MappingMapping
ImplementationImplementation
Instruction fetch and map Instruction fetch and map predicated hyper-block, predicated hyper-block, move move instructionsinstructions
Execution - control logicExecution - control logic Operand routing – max 3 dest., Operand routing – max 3 dest., split split
instructionsinstructions Hyper-block controlHyper-block control
Predication (execute-all approach), Predication (execute-all approach), cmove cmove instructionsinstructions
Block-commitBlock-commit Block-stitchingBlock-stitching
Review and DiscussionReview and Discussion
Key Ideas: ConvergenceKey Ideas: Convergence
Microprocessor – no. of superscalar processors Microprocessor – no. of superscalar processors comm./sync. via registers – low overheadscomm./sync. via registers – low overheads
Exploiting ILP – TLP granularitiesExploiting ILP – TLP granularities Dependency mapped to a grid of ALUsDependency mapped to a grid of ALUs
Replication reduces design/verification effortReplication reduces design/verification effort Point-to-point communicationPoint-to-point communication Exposing architecture partitioning and flow of Exposing architecture partitioning and flow of
operations to the compileroperations to the compiler Avoid wire, routing delays, memory wall Avoid wire, routing delays, memory wall
problemsproblems
Ideas: DivergenceIdeas: Divergence
M-MachineM-Machine On-chip cache On-chip cache Register based mech. Register based mech.
[Delays][Delays] Broadcasting and Point-to-point communicationBroadcasting and Point-to-point communication GPAGPA Register Set Register Set Grid: Chaining [Scalability] Grid: Chaining [Scalability] Point-to-point communicationPoint-to-point communicationTERATERA Fine-grain threads – Memory comm/sync Fine-grain threads – Memory comm/sync
(full/empty)(full/empty) No support for single threaded codeNo support for single threaded code
Drawbacks (Unresolved Drawbacks (Unresolved Issues)Issues)
M-MachineM-Machine ScalabilityScalability Clock speedsClock speeds Memory Memory
SynchronizationSynchronization
(use (use hforkhfork))
Grid Processor Arch.Grid Processor Arch. Data Caches far from Data Caches far from
ALUsALUs Incur delays between Incur delays between
dependent operations due dependent operations due to network router and to network router and wireswires
Complex Frame-Complex Frame-management and Block-management and Block-stitchingstitching
Explicit compiler Explicit compiler dependencedependence
Challenges/Future Challenges/Future DirectionsDirections
Architectural support to extract TLPArchitectural support to extract TLP
Parallelizing compiler technologyParallelizing compiler technology
How many cores/threadsHow many cores/threads No. of threads – memory latency, wire delays No. of threads – memory latency, wire delays
[Flynn][Flynn] Inter-thread communicationInter-thread communication Height of Grid == 8 (IPC 5-6) [GPA, Peter]Height of Grid == 8 (IPC 5-6) [GPA, Peter] Optimization - f(comm., delays, memory costs)Optimization - f(comm., delays, memory costs)
Challenges (contd.)Challenges (contd.)
On-fly data-dependence detection On-fly data-dependence detection (RAW/WAR)(RAW/WAR)
TLP/ILP Balance – M Multi-TLP/ILP Balance – M Multi-ComputerComputer
ThanksThanks