
Multithreaded Systems for Emerging High-Performance Applications

2008 DOE Summer School in Multiscale Mathematics and High-Performance Computing

Daniel G. Chavarría

Outline

☞ High-performance computing systems
    Processor architecture
    Memory subsystems
Multithreaded processors
Cray XMT multithreaded system
Programming models & environments (Cray XMT)
Examples

High-Performance Computing Systems

Nowadays, HPC systems are parallel computing systems consisting of hundreds of processors (or more)

Connected by high bandwidth, low-latency networks

Collections of PCs connected by Ethernet are not HPC systems

Basic building block is a node: server-like computer (a few processor sockets, memory and network interconnect cards, possibly I/O devices).

Nodes are parallel computers in their own right: they usually contain >= 2 processor sockets with multiple cores per processor.

HPC systems have a multiplicity of applications in scientific and engineering areas: physics, chemistry, materials design, mechanical design.

HPC Systems (cont.)

Two basic kinds of HPC systems:
Distributed memory systems
Shared memory systems

Distributed memory HPC systems: the typical HPC system; processors only have direct access to the local memory on their node.

Remote memory on other nodes must be accessed indirectly via a library call.

Can scale to tens and hundreds of thousands of processors (Blue Gene/P@LLNL and others)

Shared memory HPC systems: processors have direct access to local memory on the node and to remote memory on other nodes.

Speed of access may vary

More difficult to scale beyond a few thousand processors (Columbia SGI Altix@NASA)

HPC Systems (cont.)

[Figure: Distributed Memory HPC System (high-level view). Two nodes, each with CPUs, local memory, and a NIC; memory on a remote node is reached only indirectly through the network.]

Most HPC systems fit in this category. Programming HPC systems is difficult: access to remote data usually requires message exchange or coordination between nodes.
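A minimal sketch of such a message exchange using MPI (MPI is not named on the slides; the ranks, tag, and value below are purely illustrative):

#include <mpi.h>

/* The value lives in rank 1's local memory; rank 0 can only obtain it
 * through an explicit message exchange. */
int main(int argc, char **argv)
{
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {                     /* owner of the data        */
        value = 42.0;
        MPI_Send(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {              /* needs the remote value   */
        MPI_Recv(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}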

HPC Systems (cont.)

[Figure: Shared Memory HPC System (high-level view). Two nodes with CPUs, memory, and NICs; any CPU can access memory on either node directly.]

Examples: SGI Altix, HP Superdome, SunFire servers, Cray XMT. Shared memory helps a lot in easing parallel programming; however, synchronizing accesses to shared data becomes challenging.

Processor Architecture

[Figure: Modern dual-core CPU (>= 2 GHz). Each core has its own execution units and L1 cache; the cores share an L2 cache, an L3 cache, and the memory interface. Relative access speeds are roughly 1x (L1), 1/10x (L2), 1/50x (L3), and 1/500x (memory).]

Processor Architecture (cont.)

Memory Wall Problem:

[Figure: clock rate (MHz) vs. year, 1990-2002. Processor speed climbs toward ~3000 MHz while memory speed improves far more slowly.]

Typical access latencies: L1 ~1-2 cycles, L2 ~10 cycles, L3 ~50 cycles, main memory ~500 cycles.

W. Wulf and S. McKee, "Hitting the memory wall: Implications of the obvious", ACM SIGARCH Computer Architecture News, 1995.

Outline

High-performance computing systems
    Processor architecture
    Memory subsystems
☞ Multithreaded processors
Cray XMT multithreaded system
Programming models & environments (Cray XMT)
Examples

Multithreaded Processors

Commodity memory is slow, custom memory is very expensive:

What can be done about it?

Idea: cover latency of memory loads with other (useful) computation

OK, how do we do this?

Use multiple execution contexts on the same processor, switch between them when issuing load operations

Execution contexts correspond to threads

Examples: Cray ThreadStorm processors, Sun Niagara 1 & 2 processors, Intel Hyperthreading

Multithreaded Processors (cont.)

[Figure: a multithreaded processor. Thread contexts T0 through T5 share a single set of execution units.]

Each thread has its own independent instruction stream (program counter)

Each thread has its own independent register set

Cray XMT multithreaded system

ThreadStorm processors run at 500 MHz
128 hardware thread contexts, each with its own set of 32 registers
No data cache
128 KB, 4-way associative data buffer on the memory side
Extra bits in each 64-bit memory word: full/empty bits for synchronization
Hashed memory at a 64-byte granularity, i.e. contiguous logical addresses at 64-byte boundaries map to non-contiguous physical locations
Global shared memory

Cray XMT multithreaded system (cont.)

3rd-generation multithreaded system from Cray
Infrastructure is based on the XT3/4, scalable up to 8192 processors

SeaStar network, torus topology, service and I/O nodes

Compute nodes contain 4 ThreadStorm multithreaded processors instead of 4 AMD Opteron processors

Hybrid execution capabilities: code can run on ThreadStorm processors in collaboration with code running on Opteron processors

Example topology       6x12x8     11x12x8    11x12x16   22x12x16    14x24x24
Processors             576        1056       2112       4224        8064
Memory capacity        9 TB       16.5 TB    33 TB      66 TB       126 TB
Sustainable remote memory reference rate:
  per processor        60 MW/s    60 MW/s    45 MW/s    33 MW/s     30 MW/s
  aggregate            34.6 GW/s  63.4 GW/s  95.0 GW/s  139.4 GW/s  241.9 GW/s
Relative size          1.0        1.8        3.7        7.3         14.0
Relative performance   1.0        1.8        2.8        4.0         7.0

Cray XMT multithreaded system (cont.)

[Figure: XMT system software organization. The compute partition runs MTX (BSD-based); the service partition runs Linux on specialized nodes: login PEs, I/O server PEs, network server PEs, FS metadata server PEs, and database server PEs. Service nodes attach via PCI-X to 10 GigE networks and to RAID controllers over Fibre Channel.]

Cray XMT multithreaded system (cont.)

[Figure: XMT compute blade. Four Cray SeaStar2™ interconnect chips, 4 DIMM slots, an L0 RAS computer, and redundant VRMs.]

Outline

High-performance computing systems
    Processor architecture
    Memory subsystems
Multithreaded processors
Cray XMT multithreaded system
☞ Programming models & environments (Cray XMT)
Examples

Programming Models & Environments

Compute nodes run MTX, a multithreaded UNIX
Service nodes run Linux
I/O calls are offloaded to the service nodes
Service nodes are used for compilation

Operating systems
    Linux on service and I/O nodes
    MTX on compute nodes
    Syscall offload for I/O
Run-time system
    Job launch
    Node allocator
    Logarithmic loader
Programming environment
    Cray MTA compiler: C, C++
    Debugger: mdb
    Performance tools: canal, traceview
High-performance file systems
    Lustre
System management and administration
    Accounting
    Red Storm Management System (RSMS)
    RSMS graphical user interface

Programming Models & Environments (cont.)

Cray MTA C/C++ compiler
Parallelizing compiler technology enabled by XMT hardware features

Quasi-uniform memory latency

No data caches

Hardware operations for fine-grained synchronization and atomic updates

Support for data parallel compilation

Focus on loops operating on equivalent data sets

User-provided pragmas (directives) to help out the compiler

Support for parallelizing reductions: sum, prod., min, max

Support for parallelizing linear recurrences in loops, e.g. (see the sketch after the loop):

do i = 1, n
   x(i) = x(i - 1) + y
end do
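For this particular recurrence the parallelization is easy to see: unrolling it gives the closed form x(i) = x(0) + i*y, so every element can be computed independently. A small C sketch of that idea (array and variable names are illustrative, and x is assumed to have n+1 elements with x[0] holding the initial value):

/* Sequential form: each iteration depends on the previous one. */
for (int i = 1; i <= n; i++)
    x[i] = x[i - 1] + y;

/* Equivalent parallel form: x[i] = x[0] + i*y, so every iteration
 * is independent and can run concurrently. */
for (int i = 1; i <= n; i++)
    x[i] = x[0] + i * y;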


Examples

Vector Dot Product:


double s, a[n], b[n];
...
s = 0.0;
for (i = 0; i < n; i++)
    s += a[i] * b[i];

The compiler automatically recognizes and parallelizes the reduction, using a combination of per-thread accumulation followed by atomic updates to the global accumulator, as sketched below.
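Conceptually, the transformation looks something like the following plain-C sketch. It reuses s, a, b, and n from the fragment above; the thread count and the manual loop splitting are illustrative only, not the compiler's actual output, and the combining phase is shown serially where the XMT would use synchronized updates:

/* Illustrative sketch of the two-phase reduction strategy. */
#define NTHREADS 128                 /* assumed number of thread contexts */

double partial[NTHREADS];

/* Phase 1: each thread accumulates into its own private partial sum. */
for (int t = 0; t < NTHREADS; t++) {
    partial[t] = 0.0;
    for (int i = t; i < n; i += NTHREADS)
        partial[t] += a[i] * b[i];
}

/* Phase 2: combine the partial sums into the global accumulator
 * (on the XMT this combination uses atomic updates). */
s = 0.0;
for (int t = 0; t < NTHREADS; t++)
    s += partial[t];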

Examples (cont.)

Sparse Matrix-Vector Product

C(n x 1) = A(n x m) * B(m x 1)

Store A in packed row form (a small made-up example follows the code below):
A[nz], where nz is the number of non-zeros

cols[nz] stores the column index of each non-zero

rows[n+1] stores the start index of each row in A (rows[n] = nz acts as a sentinel)


#pragma mta use 100 streams
#pragma mta assert no dependence
for (i = 0; i < n; i++) {
    int j;
    double sum = 0.0;
    for (j = rows[i]; j < rows[i+1]; j++)
        sum += A[j] * B[cols[j]];
    C[i] = sum;
}
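As a concrete illustration of the packed row format (the matrix values here are made up, not from the slides):

/* The 3x3 matrix
 *   [ 5 0 2 ]
 *   [ 0 3 0 ]
 *   [ 1 0 4 ]
 * stored in packed row form, with nz = 5 non-zeros: */
double A[]    = { 5.0, 2.0, 3.0, 1.0, 4.0 };  /* non-zero values, row by row   */
int    cols[] = { 0,   2,   1,   0,   2   };  /* column index of each non-zero */
int    rows[] = { 0, 2, 3, 5 };               /* row start offsets; rows[3] = nz */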

Examples (cont.)

Parallel Prefix Sum
Input vector A of length N

Result vector R of length N:

R_i = \sum_{j=1}^{i} A_j

Sequential code:

sum = 0;
for (i = 0; i < n; i++) {
    sum += A[i];
    R[i] = sum;
}

Difficult to parallelize since the value at position i depends on the sum of the previous values…

Examples (cont.)

• Parallel Prefix Sum
 – Parallelization idea


[Figure: doubling scheme. At each step, pairs of elements are added in parallel and intermediate positions are updated with the pairwise sums.]

int sum[n];

for (i = 0; i < n; i++) {        /* inclusive base case: sum[i] = A[i]       */
    sum[i] = A[i];
    R[i] = sum[i];
}

for (step = 1; step < n; step *= 2) {
    for (i = step; i < n; i++)   /* add pairs 'step' positions apart         */
        R[i] = sum[i - step] + sum[i];
    for (i = 0; i < n; i++)      /* copy back for the next doubling step     */
        sum[i] = R[i];
}

Examples (cont.)

• Parallel Prefix Sum
 – Actual code:


Logarithmic number of sequential steps (the outer loop over step)

Parallel inner loops
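The listing itself is not reproduced in this transcript. As a hedged sketch only (the pragma placement and loop bounds are assumptions, not the slide's actual code), an XMT-style version of the doubling scheme from the previous slide might look like:

/* Sketch only. The outer loop over 'step' runs about log2(n) times
 * sequentially; the two inner loops carry no loop-to-loop dependences,
 * so the MTA compiler can run their iterations in parallel
 * (the asserts make that explicit). */
for (step = 1; step < n; step *= 2) {
#pragma mta assert no dependence
    for (i = step; i < n; i++)
        R[i] = sum[i - step] + sum[i];
#pragma mta assert no dependence
    for (i = 0; i < n; i++)
        sum[i] = R[i];
}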

Examples (cont.)

• Parallel Tree Population
• PDTree: “Partial Dimensions” Tree
• Count combinations of values for different variables

[Figure: example PDTree over variables A, B, C, D. The root N has child arrays for B and C holding ValueNodes (b0, N), (b1, N), (c0, N), (c1, N); below them, nodes for A, C, and D hold ValueNodes such as (a0, N), (a1, N), (c0, N), (c1, N), (d0, N), (d1, N), each pairing a value instance with its count N.]

PDTree

• PDTree implemented using a multiple-type, recursive tree structure
 – Root node is an array of ValueNodes (counts for different value instances of the root variables)
 – Interior and leaf nodes are linked lists of ValueNodes
 – Inserting a record at the top level involves just incrementing the counter of the corresponding ValueNode
 – The XMT’s int_fetch_add() atomic operation is used to increment counters
 – Inserting a record at other levels requires the traversal of a linked list to find the right ValueNode
 – If the ValueNode does not exist, it must be appended to the end of the list
• Inserting at other levels when the node does not exist is tricky (see the sketch below)
 – To ensure safety, the end pointer of the list must be locked
 – Use the readfe() and writeef() XMT operations to create critical sections
• Take advantage of full/empty bits on each memory word
• As data analysis progresses, the probability of conflicts between threads is lower
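A minimal sketch of that locked append (not the actual PDTree code): it assumes the XMT generics readfe(), writeef(), and int_fetch_add(), and the ValueNode layout and function name below are hypothetical. readfe() reads a word only when its full/empty bit is FULL and leaves it EMPTY, so competing threads stall until the matching writeef() refills it.

#include <stdlib.h>

/* Hypothetical node layout; on the XMT each field is a 64-bit word
 * with its own full/empty bit. */
typedef struct ValueNode {
    long value;                 /* value instance of this variable */
    long count;                 /* occurrence count                */
    struct ValueNode *next;     /* next entry in the linked list   */
} ValueNode;

void count_value(ValueNode *head, ValueNode **tailp, long value)
{
    /* 1. Traverse the list looking for an existing ValueNode. */
    for (ValueNode *n = head; n != NULL; n = n->next)
        if (n->value == value) {
            int_fetch_add(&n->count, 1);   /* atomic counter update */
            return;
        }

    /* 2. Not found: lock the end pointer.  readfe() returns *tailp
     *    and leaves the word EMPTY, so any other appender blocks
     *    here until our writeef() below. */
    ValueNode *tail = readfe(tailp);

    /* (A full implementation would re-check here whether another
     *  thread appended the same value while we were traversing.) */
    ValueNode *node = malloc(sizeof *node);
    node->value = value;
    node->count = 1;
    node->next  = NULL;
    tail->next  = node;

    writeef(tailp, node);       /* refill with the new tail, bit set FULL */
}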


PDTree (cont.)

[Figure: two threads appending to the same list of ValueNodes (vi = j, count), (vi = k, count). T1 and T2 both try to grab the end pointer; T1 succeeds and inserts a new node (vi = m, count), leaving T2 holding a lock on a pointer that is no longer the end of the list.]

Questions

• Thanks for your attention.
