
Overview: Classical Parallel Hardware (COMP4300/8300 L2-3, 2020)

Review of Single Processor Design
• so we talk the same language
• many things happen in parallel even on a single processor
• identify potential issues for parallel hardware
• why use 2 CPUs if you could simply double the speed of one!

Multiple Processor Design
• architectural classifications
• shared/distributed memory
• hierarchical/flat memory
• dynamic/static processor connectivity
• evaluating static networks
• case study: the NCI Raijin supercomputer


The Processor

Performs:
• floating point operations (FLOPS): add, multiply, division (sqrt, maybe!)
• integer operations (MIPS): adds etc., plus logical ops and instruction processing
  – MIPS: Machine Instructions Per Second (on a very old VAX-11/780)
  – anyway, what counts as a machine instruction differs between CPUs!
• our primary focus will be on floating point operations

The clock:
• all ops take a fixed number of clock ticks to complete
• clock speed is measured in GHz (10^9 cycles/second), i.e. a clock period in ns (10^-9 seconds)
  – Apple iPhone 6 ARM A8 1.4 GHz (0.71 ns), NCI Gadi Intel Xeon Cascade Lake 3.2 GHz (0.31 ns), IBM zEC12 5.5 GHz (0.18 ns)
• clock speed is limited by etching (feature size) and the speed of light, which motivates parallelism
  – (to my knowledge) the IBM zEC12 is the fastest commodity processor at 5.5 GHz
  – light travels about 1 m in 3.2 ns, and a chip is a few cm across!


Processor Performance

FLOP/s   Prefix   Occurrence
10^3     kilo     very badly written code
10^6     mega     badly written code
10^9     giga     single core
10^12    tera     multiple chips (NCI)
10^15    peta     23 machines in the Top500 (Nov 2012, measured)
10^18    exa      around 2020!

• PC: 2.5 GHz Core2 Quad, 4 (cores) × 4 (ops/cycle) × 2.5 GHz ≡ 40 GFLOPS
• Bunyip: Pentium III, 96 (nodes) × 2 (sockets) × 1 (op/cycle) × 550 MHz ≡ 105 GFLOPS
• NCI Raijin: 3592 (nodes) × 2 (sockets) × 8 (cores) × 8 (ops/cycle) × 2.6 GHz ≡ 1.19 PFLOPS
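These peak figures are simple products of the hardware parameters. As a minimal sketch in C (with the Raijin numbers from the last bullet hard-coded):

  #include <stdio.h>

  /* peak FLOPS = nodes x sockets x cores x (FP ops per cycle) x clock (Hz) */
  int main(void) {
      double peak = 3592 * 2 * 8 * 8 * 2.6e9;
      printf("Raijin peak: %.3f PFLOPS\n", peak / 1e15);  /* prints 1.195 */
      return 0;
  }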


Adding Numbers

Consider adding two double precision (8-byte) numbers:

bit 0: ± (sign) · bits 1–11: exponent · bits 12–63: significand

Possible steps:
• determine the largest exponent
• normalize the significand of the smaller exponent to the larger
• add the significands
• re-normalize the significand and exponent of the result

Four steps, each taking 1 tick, implies 4 ticks per addition (FLOP).
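The field layout above can be inspected directly. A small C sketch (illustrative only; the test value -6.25 is an arbitrary choice):

  #include <stdio.h>
  #include <stdint.h>
  #include <string.h>

  int main(void) {
      double x = -6.25;               /* = -1.1001 (binary) x 2^2 */
      uint64_t bits;
      memcpy(&bits, &x, sizeof bits); /* reinterpret the 64 bits */
      uint64_t sign = bits >> 63;                /* sign bit */
      uint64_t exp  = (bits >> 52) & 0x7FF;      /* 11-bit exponent, biased by 1023 */
      uint64_t frac = bits & 0xFFFFFFFFFFFFFULL; /* 52-bit significand (implicit leading 1) */
      printf("sign=%llu  exponent=%lld  significand=0x%013llx\n",
             (unsigned long long)sign, (long long)(exp - 1023),
             (unsigned long long)frac);
      /* prints: sign=1  exponent=2  significand=0x9000000000000 */
      return 0;
  }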


Operation Pipelining

Step in pipeline:

  Waiting        1        2        3        4      Done
  X(6)  →   X(5)  →  X(4)  →  X(3)  →  X(2)  →    X(1)

• X(1) takes 4 ticks to appear (startup latency); X(2) appears 1 tick after X(1)
• asymptotically achieves 1 result per clock tick
• the operation (X) is said to be pipelined: the steps in the pipeline run in parallel
• requires the same op applied consecutively to different (independent) data items
  – good for "vector operations" (note limitations on chaining output data to input)
• not all operations are pipelined, e.g. integer multiply, division, sqrt. E.g. on the Alpha EV6 (latency = ticks to the first result, repeat = ticks between results):

  Operation   Latency   Repeat
  +, -, *     4         1
  /           15        12
  sqrt        33        30
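The table gives a simple timing model: the first result costs the latency, and each subsequent result costs the repeat interval. A sketch (the function and the choice of n = 1000 are mine):

  #include <stdio.h>

  /* ticks for n results from one unit: latency + (n-1)*repeat */
  static long ticks(long n, long latency, long repeat) {
      return latency + (n - 1) * repeat;
  }

  int main(void) {
      long n = 1000;  /* pipelined +,-,* approach 1 tick/result; / and sqrt do not */
      printf("+,-,* : %ld ticks (%.2f/result)\n", ticks(n, 4, 1),   ticks(n, 4, 1) / (double)n);
      printf("/     : %ld ticks (%.2f/result)\n", ticks(n, 15, 12), ticks(n, 15, 12) / (double)n);
      printf("sqrt  : %ld ticks (%.2f/result)\n", ticks(n, 33, 30), ticks(n, 33, 30) / (double)n);
      return 0;
  }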


Instruction Pipelining

• break instruction execution into k stages ⇒ can get ≤ k-way ||ism
  (generally, the circuitry for each stage is independent)
• e.g. (k = 5) stages: FI = Fetch Instruction, DI = Decode Instruction, FO = Fetch Operand, EX = Execute Instruction, WB = Write Back

  (branch): FI DI FO EX WB
  (guess):     FI DI FO EX WB
  (guess):        FI DI FO EX WB
  (guess):           FI DI FO EX WB
  (sure):               FI DI FO EX WB

  – note: the FO & WB stages may involve memory accesses (and may possibly stall the pipeline)
  – conditional branch instructions are problematic: a wrong guess may require flushing succeeding instructions from the pipeline
• the tendency is to increase the number of stages, but this increases the startup latency, e.g. number of stages: UltraSPARC II (9), UltraSPARC III (14), Intel Prescott (31)


Pipelining: Dependent Instructions

Principle: the CPU must ensure the result is the same as if there were no pipelining (or ||ism).

• instructions requiring only 1 cycle in the EX stage:

  add %i1, -1, %i1   ! i1 = i1 - 1
  cmp %i1, 0         ! is i1 = 0?

  can be handled by pipeline feedback from the EX stage to the next cycle

• (important) instructions requiring c cycles for the EX stage (i.e. f.p. *, +, load, store) are normally implemented by having c EX stages. This requires any dependent instruction to be delayed by c cycles, e.g. c = 3:

  fmuld %f0, %f2, %f4   ! I0: f4 = f0 * f2 (f.p.)
  ....                  ! I1:
  ....                  ! I2:
  faddd %f4, %f6, %f6   ! I3: f6 = f4 + f6 (f.p.)

  I0: FI DI FO EX1 EX2 EX3 WB
  I1:    FI DI FO EX1 EX2 EX3 WB
  I2:       FI DI FO EX1 EX2 EX3 WB
  I3:          FI DI FO  EX1 EX2 EX3 WB


Superscalar (Multiple Instruction Issue)

A small number (w) of instructions is scheduled by the hardware to execute together.

• groups must have an appropriate 'instruction mix', e.g. the UltraSPARC (w = 4) allows per group:
  – ≤ 2 different floating point instructions
  – ≤ 1 load/store
  – ≤ 1 branch
  – ≤ 2 integer/logical instructions
• gives ≤ w-way ||ism over different types of instructions
• generally requires:
  – multiple (≥ w) instruction fetches
  – an extra grouping (G) stage in the pipeline
• problem: amplifies dependencies and the other problems of pipelining by a factor of w!
• issue: the instruction mix must be balanced for maximum performance, i.e. floating point * and + must be balanced (see the sketch below)
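As an example of a balanced mix (a sketch of mine, not from the slides): a dot product issues one floating point multiply and one floating point add per element, keeping both units busy, whereas a loop of pure additions would leave the multiplier idle.

  /* one FP multiply + one FP add per iteration: a balanced instruction mix */
  double dot(const double *a, const double *b, long n) {
      double s = 0.0;
      for (long i = 0; i < n; i++)
          s += a[i] * b[i];
      return s;
  }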


Review - Instruction Level Parallelism

• instruction level parallelism (ILP): pipelining and superscalar issue together offer ≤ kw-way ||ism
• branch prediction techniques alleviate the issue of conditional branches
  – use a table recording the results of recently-taken branches
• out-of-order execution alleviates the issue of dependencies
  – pulls fetched instructions into a buffer of size W, W ≥ w
  – executes them in any order provided dependencies are not violated
  – must keep track of all 'in-flight' instructions and their associated registers (O(W^2) area and power!)
• in most situations, the compiler can do as good a job as a human at exposing this parallelism (ILP was part of the 'Free Lunch')

An exception: how to get an UltraSPARC III to execute 1 f.p. multiply (and add!) per cycle.


Memory Structure

Consider the DAXPY computation:

y(i) = 1.234 * x(i) + y(i)

If, theoretically, the CPU can perform the 2 FLOPs in 1 cycle:

• the memory system must load two doubles (x(i) and y(i), 16 bytes) and store one (y(i), 8 bytes) each clock cycle
  – on a 1 GHz system this implies 16 GB/s of load traffic and 8 GB/s of store traffic
• typically:
  – a processor core can only issue one load OR store instruction per clock cycle
  – DDR3-SDRAM memory is available clocked at ≈1066 MHz, with access times to match

Memory latency and bandwidth are critical performance issues.

• caches: reduce latency and provide improved cache-to-CPU bandwidth
• multiple memory banks: improve bandwidth (by parallel access)
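For reference, the DAXPY loop itself, written as a plain C sketch; the traffic counts above follow directly from its loop body:

  /* per iteration: load x[i] and y[i] (16 bytes), store y[i] (8 bytes),
     i.e. 24 bytes of memory traffic for 2 FLOPs */
  void daxpy(long n, double a, const double *x, double *y) {
      for (long i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }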


Cache Memory

Main Memory     (large, cheap; high latency, low bandwidth)
      ↓
Cache           (small, fast, expensive; lower latency, higher bandwidth)
      ↓
CPU Registers

• part of the memory hierarchy
• cache hit: the data is in cache and received in a few cycles
• cache miss: the data must be fetched from main memory (or a higher-level cache)
• try to ensure data is in cache (or as close to the CPU as possible)
• can we 'block' the algorithm to minimize memory traffic? (see the sketch below)
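As an illustration of blocking, here is a tiled matrix multiply, a C sketch under my own choice of tile size BS: each tile is reused many times while it is cache-resident, cutting main memory traffic relative to the naive triple loop.

  /* C = C + A*B on n x n row-major matrices, processed in BS x BS tiles;
     pick BS so three BS x BS tiles of doubles fit in cache */
  #define BS 64
  void matmul_blocked(int n, const double *a, const double *b, double *c) {
      for (int ii = 0; ii < n; ii += BS)
      for (int kk = 0; kk < n; kk += BS)
      for (int jj = 0; jj < n; jj += BS)
          for (int i = ii; i < ii + BS && i < n; i++)
              for (int k = kk; k < kk + BS && k < n; k++) {
                  double aik = a[i * n + k];  /* reused across the whole j loop */
                  for (int j = jj; j < jj + BS && j < n; j++)
                      c[i * n + j] += aik * b[k * n + j];
              }
  }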

Cache memory is effective because algorithms often use data that:

• was recently accessed from memory (temporal locality)
• was close to other recently accessed data (spatial locality)

Note: duplication of data in the cache will have implications for parallel systems!


Cache Mapping

Blocks of main memory are mapped to cache lines.

• cache lines are typically 16–128 bytes wide
• the mapping may be direct, or n-way associative, e.g. 2-way associative mapping:

  [Figure: a 4-line, 2-way associative cache: main memory blocks map alternately to the set {line 0, line 1} and the set {line 2, line 3}, and may occupy either line of their set.]

• the entire cache line is fetched from memory, not just one element (why?)
  – try to structure code to use an entire cache line of data
  – best to have unit stride (see the sketch below)
  – pointer chasing is very bad! (large address 'strides' guaranteed!)
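A C sketch of the stride point (the array and function names are mine): C stores matrices row-major, so the first version touches consecutive addresses and uses every byte of each fetched line, while the second jumps n doubles between accesses:

  /* unit stride: consecutive addresses, whole cache lines are used */
  double sum_by_rows(int n, const double *a) {
      double s = 0.0;
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++)
              s += a[i * n + j];
      return s;
  }

  /* stride of n doubles: for large n, a new cache line on every access */
  double sum_by_cols(int n, const double *a) {
      double s = 0.0;
      for (int j = 0; j < n; j++)
          for (int i = 0; i < n; i++)
              s += a[i * n + j];
      return s;
  }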


Memory Banks

Memory bandwidth can be improved by having multiple parallel paths to/from memory, organized in b banks.

[Figure: a CPU connected in parallel to Bank 1 … Bank 4.]

• the traditional solution used by vector processors
• good performance for unit stride
• very bad performance if bank conflicts occur
  – worst case: power-of-2 strides, when b is a power of 2 (see the sketch below)
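A sketch of why power-of-2 strides hurt, assuming word-interleaved banks (the bank of word i is i mod b; b = 4 here is my choice):

  #include <stdio.h>

  int main(void) {
      int b = 4;  /* number of banks (assumed) */
      for (int stride = 1; stride <= 4; stride *= 2) {
          printf("stride %d touches banks:", stride);
          for (int i = 0; i < 8; i++)
              printf(" %d", (i * stride) % b);  /* bank hit by the i-th access */
          printf("\n");
      }
      return 0;
  }

  /* stride 1 cycles through all banks (0 1 2 3 ...);
     stride 4 hammers bank 0 only, so every access conflicts */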


Parallelism in Memory Access

Dual Inline Memory Module (DIMM)

• a processor may have 2 or 4 DIMMs (i.e. b = 16, 32), depending on its lowest-level cache line size
• each DRAM chip access is pipelined (select row address, select column address, access byte)


Going Parallel

• inevitably, the performance of a single processor is limited by the clock speed
• improved manufacturing increases the clock rate, but it is ultimately limited by the speed of light
• ILP allows multiple ops at once, but is limited

It's time to go parallel!

Hardware Issues
• Flynn's taxonomy of parallel processors
  – SIMD/MIMD
• shared/distributed memory
• hierarchical/flat memory
• dynamic/static processor connectivity
• characteristics of static networks
• case study: NCI's Raijin (2012–2019)


Architecture Classification: Flynn’s Taxonomy

• why classify?
  – what kind of parallelism is employed?
  – which architecture has the best prospects for the future?
  – what has already been achieved by current architecture types?
  – it reveals configurations that have not yet been considered by the system architect
  – it enables the building of performance models

Flynn's taxonomy is based on the degree of parallelism, with 4 categories determined by the number of instruction and data streams:

                                    Data Stream
                          Single              Multiple
  Instruction   Single    SISD (1 CPU)        SIMD (array/vector processor)
  Stream        Multiple  MISD (pipelined?)   MIMD (multiple processor)


SIMD and MIMD

SIMD: Single Instruction Multiple Data
• also known as data parallel or array processors
• vector processors (to some extent)
• current examples include SSE instructions, the SPEs on the CellBE, and GPUs
• NVIDIA's SIMT (T = Threads) is a slight variation

MIMD: Multiple Instruction Multiple Data
• examples include a quad-core PC and the 24-core Xeons on Gadi

[Figure: SIMD: simple CPUs driven by a single global control unit over an interconnect; MIMD: CPUs each with their own control unit over an interconnect.]


MIMD

The most successful parallel model
• more general purpose than SIMD (e.g. the CM5 could emulate the CM2)
• harder to program, as processors are not synchronized at the instruction level

Design issues for MIMD machines
• scheduling: efficient allocation of processors to tasks in a dynamic fashion
• synchronization: prevent processors accessing the same data simultaneously
• interconnect design: processor-to-memory and processor-to-processor interconnects; also the I/O network (often processors are dedicated to I/O devices)
• overhead: inevitably there is some overhead in coordinating activities between processors, e.g. resolving contention for resources
• partitioning: identifying parallelism in algorithms that can exploit concurrent processing streams is non-trivial

(Aside: SPMD, Single Program Multiple Data, is more restrictive than MIMD, implying that all processors run the same executable.)


Address Space Organization: Message Passing

• each processor has local or private memory
• processors interact solely by message passing
• commonly known as distributed memory parallel computers
• memory bandwidth scales with the number of processors
• example: between "nodes" on the NCI Gadi system

[Figure: four processors, each with its own memory, attached to an interconnect.]


Address Space Organization: Shared Address Space

• processors interact by modifying data objects stored in a shared address space
• the simplest solution is flat or uniform memory access (UMA)
• scalability of memory bandwidth and of processor-to-processor communication (arising from cache line transfers) are problems
• so is synchronizing access to shared data objects
• example: a dual/quad core laptop (ignoring caches)

[Figure: four memories and four processors attached to a common interconnect (UMA).]


Non-Uniform Memory Access (NUMA)

The machine includes some hierarchy in its memory structure:
• all memory is visible to the programmer (a single address space), but some memory takes longer to access than the rest
• in this sense, caches introduce one level of NUMA
• e.g. between sockets on the NCI Gadi system, or in a multi-socket Opteron system (some ANU research on an Opteron (2011))

[Figure: two NUMA layouts: in one, the memories sit across the interconnect from the processors (each with a cache); in the other, each processor has a directly attached local memory and reaches remote memories over the interconnect.]


Shared Address Space Access

Parallel Random Access Machine (PRAM): an idealization of any shared memory machine

• what happens when multiple processors try to read/write the same memory location at the same time?
• PRAM models:
  – Exclusive-read, exclusive-write (EREW PRAM)
  – Concurrent-read, exclusive-write (CREW PRAM)
  – Exclusive-read, concurrent-write (ERCW PRAM)
  – Concurrent-read, concurrent-write (CRCW PRAM)

Concurrent read is OK, but concurrent write requires arbitration:

• Common: the write is allowed only if all values being written are identical
• Arbitrary: an arbitrary processor is allowed to proceed and the rest fail
• Priority: processors are organized into a predefined prioritized list; the processor with the highest priority succeeds and the rest fail
• Sum: the sum of all the quantities is written


Dynamic Processor Connectivity: Crossbar

• a non-blocking network: the connection of two processors does not block connections between other processors
• complexity grows as O(p^2)
• may be used to connect processors, each with its own local memory

[Figure: a grid of switches connecting processors (each with local memory) to one another.]

• may also be used to connect processors with various memory banks, e.g. the 8 cores on an UltraSPARC T2 access the 8 banks of its level 2 cache


Dynamic Processor Connectivity: Multi-staged Networks

[Figure: an omega network: 8 processors (000–111) connected through a multi-stage switching network to 8 memories (000–111).]

• consists of lg p stages, where p is the number of processors (lg p = log2(p))
• let s and t be the binary representations of the message's source and destination; at stage 1:
  – route straight through if the most significant bits of s and t are the same
  – otherwise, cross over
• the process repeats at the next stage using the next most significant bit, etc. (see the sketch below)
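The routing rule as a C sketch (p = 8; the example source and destination are my choice):

  #include <stdio.h>

  int main(void) {
      int stages = 3;         /* lg p for p = 8 */
      unsigned s = 2, t = 5;  /* source 010, destination 101 */
      for (int k = 0; k < stages; k++) {
          int bit = stages - 1 - k;  /* most significant bit first */
          if (((s >> bit) & 1) == ((t >> bit) & 1))
              printf("stage %d: route through\n", k + 1);
          else
              printf("stage %d: crossover\n", k + 1);
      }
      return 0;
  }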


Dynamic Processor Connectivity: Bus

• a processor gains exclusive access to the bus for some period
• the performance of the bus limits scalability

[Figure: four processors with caches sharing a single bus to four memories.]

Performance: crossbar > multistage > bus
Cost: crossbar > multistage > bus

Discussion point: which would you use for small, medium and large numbers of processors (p)? What values of p would (roughly) be the cutoff points?


Static Processor Connectivity: Complete, Mesh, Tree

• completely connected (becomes very complex!)
• linear array/ring, mesh/2D torus
• tree (static if the nodes are processors; in the switched variant, processors sit at the leaves and switches at the internal nodes)


Static Processor Connectivity: Hypercube

[Figure: a 4-dimensional hypercube with 16 processors labeled 0000–1111.]

• a multidimensional mesh with exactly two processors in each dimension: p = 2^d, where d is the dimension of the hypercube
• advantage: ≤ d 'hops' between any two processors
• disadvantage: the number of connections per processor is d
• can be generalized to k-ary d-cubes, e.g. a 4×4 torus is a 4-ary 2-cube
• examples: Intel iPSC Hypercube, NCube, SGI Origin, Cray T3D, TOFU


Static Processor Connectivity: Hypercube Characteristics

[Figure: the same 4-dimensional hypercube, processors labeled 0000–1111.]

• two processors are connected directly ONLY IF their binary labels differ by one bit
• in a d-dimensional hypercube, each processor connects directly to d others
• a d-dimensional hypercube can be partitioned into two (d−1)-dimensional sub-cubes, etc.
• the number of links in the shortest path between two processors is the Hamming distance between their labels
  – the Hamming distance between processors labeled s and t is the number of bits that are on in the binary representation of s ⊕ t, where ⊕ is the bitwise exclusive-or operation (e.g. 3 for 101⊕010 and 2 for 011⊕101); see the sketch below
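The slide's two examples, as a minimal C sketch:

  #include <stdio.h>

  /* hops between hypercube nodes s and t = number of 1 bits in s XOR t */
  static int hamming(unsigned s, unsigned t) {
      int d = 0;
      for (unsigned x = s ^ t; x != 0; x >>= 1)
          d += x & 1;
      return d;
  }

  int main(void) {
      printf("%d\n", hamming(5, 2));  /* 101 xor 010 = 111 -> 3 hops */
      printf("%d\n", hamming(3, 5));  /* 011 xor 101 = 110 -> 2 hops */
      return 0;
  }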


Evaluating Static Interconnection Networks #1

Diameter
• the maximum distance between any two processors in the network
• directly determines communication time (latency)

Connectivity
• the multiplicity of paths between any two processors
• high connectivity is desirable as it minimizes contention (and also enhances fault-tolerance)
• arc connectivity of the network: the minimum number of arcs that must be removed to break the network into two disconnected networks
  – 1 for linear arrays and binary trees
  – 2 for rings and 2D meshes
  – 4 for a 2D torus
  – d for d-dimensional hypercubes


Evaluating Static Interconnection Networks #2

Channel width
• the number of bits that can be communicated simultaneously over a link connecting two processors

Bisection width and bandwidth
• bisection width: the minimum number of communication links that must be removed to partition the network into two equal halves
• bisection bandwidth: the minimum volume of communication allowed between any two halves of the network with equal numbers of processors

Cost
• many criteria can be used; we will use the number of communication links or wires required by the network


Summary: Static Interconnection Characteristics

Network               Diameter       Bisection width  Arc connectivity  Cost (no. of links)
Completely-connected  1              p^2/4            p−1               p(p−1)/2
Binary tree           2 lg((p+1)/2)  1                1                 p−1
Linear array          p−1            1                1                 p−1
Ring                  ⌊p/2⌋          2                2                 p
2D mesh               2(√p − 1)      √p               2                 2(p − √p)
2D torus              2⌊√p/2⌋        2√p              4                 2p
Hypercube             lg p           p/2              lg p              (p lg p)/2

Note: the binary tree suffers from a bottleneck: all traffic between the left and right sub-trees must pass through the root. The fat tree interconnect alleviates this:
• usually, only the leaf nodes contain a processor
• it can easily be 'partitioned' into sub-networks without degraded performance (useful for supercomputers)

Discussion point: which would you use for small, medium and large numbers of processors (p)? What values of p would (roughly) be the cutoff points?


NCI’s Raijin: A Petascale Supercomputer

• 57,472 cores (dual socket, 8-core Intel Xeon Sandy Bridge, 2.6 GHz) in 3592 compute nodes
• 160 TBytes (approx.) of main memory
• Mellanox Infiniband FDR interconnect (52 km of cables)
• interconnects: ring (cores), full (sockets), fat tree (nodes)
• 10 PBytes (approx.) of usable fast filesystem (for short-term scratch space, apps, home directories)
• power: 1.5 MW maximum load
• cooling systems: 100 tonnes of water
• 24th fastest in the world on debut (November 2012); the first petaflop system in Australia
  – fastest (parallel) filesystem in the southern hemisphere
  – custom monitoring and deployment
  – custom Linux kernel
  – highly customised PBS Pro scheduler


NCI’s Integrated High Performance Environment


Further Reading: Parallel Hardware

• The Free Lunch Is Over!
• Ch 1, 2.1–2.4 of Introduction to Parallel Computing
• Ch 1, 2 of Principles of Parallel Programming
• lecture notes from Calvin Lin: A Success Story: ISA, and Parallel Architectures
