TRANSCRIPT
Multithreaded Systems for Emerging High-Performance Applications
2008 DOE Summer School in Multiscale Mathematics and High-Performance Computing
Daniel G. Chavarría
Outline
☞ High-performance computing systems
  Processor architecture
  Memory subsystems
Multithreaded processors
Cray XMT multithreaded system
Programming models & environments (Cray XMT)
Examples
High-Performance Computing Systems
Nowadays, HPC systems are parallel computing systems
Consisting of hundreds of processors (or more)
Connected by high bandwidth, low-latency networks
Collections of PCs connected by Ethernet are not HPC systems
Basic building block is a node: a server-like computer (a few processor sockets, memory and network interconnect cards, possibly I/O devices)
Nodes are parallel computers on their own: they usually contain >= 2 processor sockets with multiple cores per processor
HPC systems have a multiplicity of applications in scientific and engineering areas: physics, chemistry, material design, mechanical design
HPC Systems (cont.)
Two basic kinds of HPC systems:
Distributed memory systems
Shared memory systems
Distributed memory HPC systems:
The typical HPC system; processors only have direct access to local memory on the node
Remote memory on other nodes must be accessed indirectly via a library call.
Can scale to tens and hundreds of thousands of processors (Blue Gene/P@LLNL and others)
Shared memory HPC systems:
Processors have direct access to local memory on the node and to remote memory on other nodes
Speed of access may vary
More difficult to scale beyond a few thousand processors (Columbia SGI Altix@NASA)
HPC Systems (cont.)
[Figure: Distributed Memory HPC System (high-level view). Two nodes, each containing CPUs, local memory, and NICs; memory on other nodes is reached only by indirect access over the network.]
Most HPC systems fit in this category
Programming HPC systems is difficult: access to remote data usually requires message exchange or coordination between nodes (a minimal sketch follows)
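As an illustration (not from the slides), a minimal MPI sketch of such indirect access: node 1 cannot load node 0's memory directly, so node 0 must explicitly send a copy of the element over the network:

#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double data[100], value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        data[42] = 3.14;
        /* the owner of the data sends one element to rank 1 */
        MPI_Send(&data[42], 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* rank 1 receives a copy; it never dereferences remote memory */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}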
HPC Systems (cont.)
[Figure: Shared Memory HPC System (high-level view). Two nodes, each containing CPUs, local memory, and NICs; memory on remote nodes is reached by direct access through the interconnect.]
SGI Altix, HP Superdome, SunFire servers, Cray XMT
Shared memory helps a lot in easing parallel programming; however, synchronizing accesses to shared data becomes challenging
Processor Architecture
[Figure: Modern dual-core CPU (>= 2 GHz). Each core has its own execution units and L1 cache; the cores share an L2 cache, an L3 cache, and the memory interface. Relative access speeds: L1 1x, L2 1/10x, L3 1/50x, memory 1/500x.]
Processor Architecture (cont.)
[Figure: The Memory Wall Problem. Processor speed (MHz) diverges from memory speed over 1990-2002; typical access latencies: L1 1-2 cycles, L2 10 cycles, L3 50 cycles, main memory 500 cycles.]
W. Wulf and S. McKee, "Hitting the Memory Wall: Implications of the Obvious", ACM SIGARCH Computer Architecture News, 1995
Outline
High-performance computing systems
  Processor architecture
  Memory subsystems
☞ Multithreaded processors
Cray XMT multithreaded system
Programming models & environments (Cray XMT)
Examples
Multithreaded Processors
Commodity memory is slow, custom memory is very expensive:
What can be done about it?
Idea: cover latency of memory loads with other (useful) computation
OK, how do we do this?
Use multiple execution contexts on the same processor, switch between them when issuing load operations
Execution contexts correspond to threads
Examples: Cray ThreadStorm processors, Sun Niagara 1 & 2 processors, Intel Hyperthreading
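A rough back-of-the-envelope (the issue interval is an assumed illustrative number, not from the slides): by Little's law, the number of thread contexts needed to hide memory latency is roughly

$\text{contexts} \approx \frac{\text{memory latency}}{\text{cycles between loads}} \approx \frac{500\ \text{cycles}}{4\ \text{cycles}} \approx 125$

which is on the order of the 128 hardware contexts per processor that the Cray ThreadStorm provides (see below).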
Multithreaded Processors (cont.)
[Figure: Multithreaded processor. Several hardware thread contexts (T0 through T5) share the same execution units; the processor switches among them to hide memory latency.]
Each thread has its own independent instruction stream (program counter)
Each thread has its own independent register set
Cray XMT multithreaded system
ThreadStorm processors run at 500 MHz
128 hardware thread contexts, each with its own set of 32 registers
No data cache
128 KB, 4-way associative data buffer on the memory side
Extra bits in each 64-bit memory word: full/empty bits for synchronization
Hashed memory at a 64-byte granularity, i.e. logical addresses that are contiguous across 64-byte boundaries map to non-contiguous physical locations
Global shared memory
Cray XMT multithreaded system (cont.)
3rd generation multithreaded system from Cray
Infrastructure is based on XT3/4, scalable up to 8192 processors
SeaStar network, torus topology, service and I/O nodes
Compute nodes contain 4 ThreadStorm multithreaded processors instead of 4 AMD Opteron processors
Hybrid execution capabilities: code can run on ThreadStorm processors in collaboration with code running on Opteron processors
Example topology                          6x12x8     11x12x8    11x12x16   22x12x16   14x24x24
Processors                                576        1056       2112       4224       8064
Memory capacity                           9 TB       16.5 TB    33 TB      66 TB      126 TB
Sustainable remote memory reference
rate (per processor)                      60 MW/s    60 MW/s    45 MW/s    33 MW/s    30 MW/s
Sustainable remote memory reference
rate (aggregate)                          34.6 GW/s  63.4 GW/s  95.0 GW/s  139.4 GW/s 241.9 GW/s
Relative size                             1.0        1.8        3.7        7.3        14.0
Relative performance                      1.0        1.8        2.8        4.0        7.0
Cray XMT multithreaded system (cont.)
[Figure: XMT system overview. The Compute Partition runs MTX (BSD-based); the Service Partition runs Linux on specialized nodes: Login PEs, IO Server PEs, Network Server PEs, FS Metadata Server PEs, and Database Server PEs. Service nodes connect via PCI-X to 10 GigE networks and via Fiber Channel to RAID controllers.]
Cray XMT multithreaded system (cont.)
[Figure: XMT compute blade. Four processor sockets, each with 4 DIMM slots and a Cray SeaStar2 interconnect chip, plus an L0 RAS computer and redundant VRMs.]
Outline
High-performance computing systems
  Processor architecture
  Memory subsystems
Multithreaded processors
Cray XMT multithreaded system
☞ Programming models & environments (Cray XMT)
Examples
Programming Models & Environments
Compute nodes run MTX, Multithreaded UNIX
Service nodes run Linux
I/O calls are offloaded to the service nodes
Service nodes are used for compilation
Operating Systems
  LINUX on service and I/O nodes
  MTX on compute nodes
  Syscall offload for I/O
Run-Time System
  Job launch
  Node allocator
  Logarithmic loader
Programming Environment
  Cray MTA compiler: C, C++
  Debugger: mdb
  Performance tools: canal, traceview
High Performance File Systems
  Lustre
System Mgmt and Admin
  Accounting
  Red Storm Management System
  RSMS Graphical User Interface
Programming Models & Environments (cont.)
Cray MTA C/C++ Compiler
Parallelizing compiler technology enabled by XMT hardware features:
  Quasi-uniform memory latency
  No data caches
  Hardware operations for fine-grained synchronization and atomic updates
Support for data parallel compilation
  Focus on loops operating on equivalent data sets
  User-provided pragmas (directives) to help out the compiler
Support for parallelizing reductions: sum, product, min, max
Support for parallelizing linear recurrences in loops:

do i = 1, n
   x(i) = x(i - 1) + y
end do
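A one-line reasoning step (added explanation, not on the slide): because each iteration adds the same constant y, this recurrence has the closed form $x_i = x_0 + i\,y$, so the compiler can compute all iterations independently in parallel.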
Examples
Vector Dot Product:
double s, a[n], b[n];
...
s = 0.0;
for (i = 0; i < n; i++)
    s += a[i] * b[i];
The compiler automatically recognizes and parallelizes the reduction, using per-thread accumulation followed by atomic updates to the global accumulator (sketched below)
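A minimal sketch of that transformation (illustrative, not the compiler's actual output; my_begin and my_end are hypothetical per-thread loop bounds, and readfe()/writeef() are the XMT full/empty operations discussed later in this talk):

double partial = 0.0;                  /* per-thread private accumulator */
for (i = my_begin; i < my_end; i++)    /* this thread's chunk of the iterations */
    partial += a[i] * b[i];
/* Merge into the shared total: readfe() empties the word (acquire),
   writeef() refills it (release), making the update a critical section. */
double t = readfe(&s);
writeef(&s, t + partial);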
Examples (cont.)
Sparse Matrix-Vector Product
C (n x 1) = A (n x m) * B (m x 1)
Store A in packed row form:
A[nz], where nz is the number of non-zeros
cols[nz] stores the column index of the non-zeros
rows[n+1] stores the start index of each row in A (with rows[n] = nz)
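As a tiny illustration (example values, not from the slides), the 2 x 3 matrix with rows (5 0 0) and (0 8 3) is stored as:

A    = {5.0, 8.0, 3.0}
cols = {0, 1, 2}
rows = {0, 1, 3}   /* row 0 spans A[0..0], row 1 spans A[1..2], rows[2] = nz */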
#pragma mta use 100 streams
#pragma mta assert no dependence
for (i = 0; i < n; i++) {
    int j;
    double sum = 0.0;
    for (j = rows[i]; j < rows[i+1]; j++)
        sum += A[j] * B[cols[j]];
    C[i] = sum;
}
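As used here, the pragmas advise the compiler rather than change the program's meaning: "assert no dependence" promises that the outer-loop iterations are independent, and "use 100 streams" requests on the order of 100 hardware streams per processor for this loop.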
Examples (cont.)
Parallel Prefix Sum
Input vector A of length N
Result vector R of length N: $R_i = \sum_{j=1}^{i} A_j$
Sequential code:

sum = 0;
for (i = 0; i < n; i++) {
    sum += A[i];
    R[i] = sum;
}

Difficult to parallelize since the value at position i depends on the sum of the previous values...
Examples (cont.)
• Parallel Prefix Sum
– Parallelization idea:

[Figure: tree-based doubling scheme. At each step, pairs of elements are added in parallel, and intermediate positions are updated with the pairwise sums.]
int sum[n];

/* initialize both arrays with the input values */
for (i = 0; i < n; i++) {
    sum[i] = A[i];
    R[i] = sum[i];
}

/* doubling steps: each step combines partial sums that are `step`
   apart, doubling the range each R[i] covers */
for (step = 1; step < n; step *= 2) {
    for (i = step; i < n; i++)
        R[i] = sum[i - step] + sum[i];
    for (i = 0; i < n; i++)
        sum[i] = R[i];
}
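For example (illustration added here), with A = {1, 2, 3, 4}:

after step = 1:  R = {1, 3, 5, 7}
after step = 2:  R = {1, 3, 6, 10}

matching the sequential result.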
Examples (cont.)
• Parallel Prefix Sum
– Actual code:

[Figure: the slide's actual XMT code, with a logarithmic number of sequential outer steps and parallel inner loops.]
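The code itself did not survive the transcript; a minimal sketch of how the doubling phase might be written for the XMT (an assumption, reusing the pragma from the sparse matrix-vector example):

/* O(log n) sequential outer steps */
for (step = 1; step < n; step *= 2) {
#pragma mta assert no dependence
    for (i = step; i < n; i++)       /* parallel inner loop */
        R[i] = sum[i - step] + sum[i];
#pragma mta assert no dependence
    for (i = 0; i < n; i++)          /* parallel copy-back */
        sum[i] = R[i];
}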
Examples (cont.)
• Parallel Tree Population
• PDTree: "Partial Dimensions" Tree
• Count combinations of values for different variables
[Figure: PDTree example. A root count N with child nodes for variables B and C holding (value, count) pairs such as (b0, N), (b1, N), (c0, N), (c1, N); deeper levels hold pairs for variables A, C, and D, e.g. (a0, N), (a1, N), (c0, N), (c1, N), (d0, N), (d1, N).]
PDTree
• PDTree implemented using a multiple-type, recursive tree structure
– Root node is an array of ValueNodes (counts for different value instances of the root variables)
– Interior and leaf nodes are linked lists of ValueNodes
– Inserting a record at the top level involves just incrementing the counter of the corresponding ValueNode
– XMT's int_fetch_add() atomic operation is used to increment counters
– Inserting a record at other levels requires the traversal of a linked list to find the right ValueNode
– If the ValueNode does not exist, it must be appended to the end of the list
• Inserting at other levels when the node does not exist is tricky
– To ensure safety, the end pointer of the list must be locked
– Use readfe() and writeef() XMT operations to create critical sections (see the sketch below)
• Take advantage of full/empty bits on each memory word
• As data analysis progresses, the probability of conflicts between threads decreases
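The listing below is a hypothetical sketch of this append protocol (the node layout, function name, and malloc-based allocation are illustrative assumptions, not the talk's actual code); int_fetch_add(), readfe(), and writeef() are used as the XMT compiler generics operating on 64-bit words:

#include <stdlib.h>

/* Hypothetical node layout (illustrative, not from the talk). */
typedef struct ValueNode {
    long value;              /* value instance of the variable  */
    long count;              /* occurrence count                */
    struct ValueNode *next;  /* its full/empty bit doubles as a */
                             /* lock on the end of the list     */
} ValueNode;

/* Count `value` in the list, appending a new node if it is absent. */
void count_value(ValueNode *node, long value) {
    for (;;) {
        if (node->value == value) {
            int_fetch_add(&node->count, 1);  /* atomic increment */
            return;
        }
        if (node->next == NULL) {
            /* Lock the end pointer: readfe() waits until the word is
               full, reads it, and leaves it empty, stalling any other
               thread that attempts the same append. */
            ValueNode *next = (ValueNode *) readfe(&node->next);
            if (next != NULL) {
                /* Another thread appended first: release the lock
                   and keep traversing. */
                writeef(&node->next, next);
                node = next;
                continue;
            }
            ValueNode *fresh = malloc(sizeof(ValueNode));
            fresh->value = value;
            fresh->count = 1;
            fresh->next  = NULL;
            /* writeef() stores the new pointer and sets the word back
               to full, releasing any waiting threads. */
            writeef(&node->next, fresh);
            return;
        }
        node = node->next;
    }
}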
PDTree (cont.)
[Figure: two threads racing to append. T1 and T2 both try to grab the end pointer of a list containing nodes vi = j (count) and vi = k (count). T1 succeeds and inserts a new node vi = m (count); T2 now holds a lock on a pointer that is no longer the end of the list.]