TRANSCRIPT
CME342 - Parallel Methods in Numerical Analysis
April 2, 2014 Lecture 2
Parallel Architectures
2
Announcements 1. Subscribe to the mailing list.
Go to lists.stanford.edu and follow the directions in the 1st handout. Subscribe to the list named: cme342-class
2. Sign your name up on the list I am passing around if you have not done so.
If you are not on the list, you will not have preferred access to cluster resources / GPUs.
3. If you missed lecture notes or handouts, they are on the web page now
http://adl.stanford.edu/cme342
3
Some units
1 MFlop/s = 10^6 flops/sec    1 MByte = 10^6 bytes
1 GFlop/s = 10^9 flops/sec    1 GByte = 10^9 bytes
1 TFlop/s = 10^12 flops/sec   1 TByte = 10^12 bytes
4
Performance goals
5
Microprocessor performance
6
Top 500 List – November 2013
7
Top 500 List – Historical Trends
8
Parallel Computers
• A parallel computer is a collection of CPUs/processing units that cooperate to solve a problem. It can solve a problem faster, or simply solve a bigger problem.
  – How large is the collection?
  – How is the memory organized?
  – How do they communicate and transfer information?

This is not a course on parallel architectures: this lecture is just an overview. You need some knowledge of the underlying hardware to use parallel computers efficiently.
9
Categorization of Parallel Architectures
• Control mechanism: instruction stream and data stream
• Process granularity
• Address space organization
• Interconnection network
  – Static
  – Dynamic
10
Control mechanism (Flynn’s taxonomy)
• SISD: Single Instruction stream Single Data stream
• SIMD: Single Instruction stream Multiple Data stream
• MIMD: Multiple Instruction stream Multiple Data stream
• MISD: Multiple Instruction stream Single Data stream
11
SIMD
• Multiple processing elements work under the supervision of a single control unit
  – Examples: Thinking Machines CM-2, MasPar MP-2, Quadrics
  – SIMD extensions are now present in commercial microprocessors (MMX and SSE in Intel x86, 3DNow! in AMD K6 and Athlon, AltiVec in Motorola G4)
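To make the SIMD idea concrete, here is a minimal sketch (not from the lecture; the function name and the assumption that n is a multiple of 4 are ours) that adds two float arrays four elements at a time with the SSE intrinsics mentioned above:

/* Illustrative SIMD sketch: add two float arrays four elements at a time
 * using SSE. One _mm_add_ps instruction operates on all four lanes.     */
#include <xmmintrin.h>                     /* SSE intrinsics, __m128 type */

void add_arrays_sse(const float *a, const float *b, float *c, int n)
{
    /* assumes n is a multiple of 4 and the arrays do not overlap */
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(b + i);   /* load 4 floats from b */
        __m128 vc = _mm_add_ps(va, vb);    /* add all 4 lanes at once */
        _mm_storeu_ps(c + i, vc);          /* store 4 results into c */
    }
}

On a SISD machine the same loop would issue one scalar add per element; the single-instruction-multiple-data extension processes four data elements per instruction.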
12
MIMD
• Each processing element is capable of executing a different program independent of the other processors
• Most multiprocessors can be classified in this category
13
Process granularity
• Coarse grain: Cray C90, Fujitsu
• Medium grain: IBM SP2, CM-5, clusters
• Fine grain: CM-2, Quadrics, Blue Gene/L?
14
Address space:
• Single address space
  – Uniform Memory Access (UMA): SMP
  – Non-Uniform Memory Access (NUMA)
• Local address spaces:
  – Message passing
15
SMP architecture
[Diagram: four CPUs, each with its own cache, connected by a bus or crossbar switch to shared memory and I/O]
• SMP uses shared system resources (memory, I/O) that can be accessed equally from all the processors
• Cache coherence is maintained
16
NUMA architecture
• Shared address space
• Memory latency varies depending on whether you access local or remote memory
• Cache coherence is maintained using a hardware or software protocol
[Diagram: four CPUs with caches connected by a bus or crossbar switch; each CPU has its own local memory module]
17
Message-passing / Distributed Memory
• Local address space
• No cache coherence
[Diagram: four CPUs with caches, each with its own local memory, connected by a communication network]
18
Hybrids?
• Notice that, currently, a CPU can have many cores that share memory within one processor (and a node may contain more than one such processor)
• In addition, accelerator cards with separate memory can be present (heterogeneous architectures: GPUs and others)
• No cache coherence
[Diagram: nodes consisting of a CPU with cache and memory plus a GPU with its own device memory attached over PCIx, connected by a communication network]
19
Dynamic interconnections
• Crossbar switching: the most expensive and most extensive interconnection
• Bus connected: processors are connected to memory through a common datapath
• Multistage interconnection: butterfly, Omega network, perfect shuffle, etc.
[Diagram: processors P1, P2 connected to memories M1, M2 through a multistage (butterfly) network]
20
Static interconnection networks
• Complete interconnection
• Star interconnection
• Linear array
• Mesh: 2D/3D mesh, 2D/3D torus
• Tree and fat tree network
• Hypercube network
21
Characteristics of static networks
• Diameter: maximum distance between any two processors in the network
  – Complete connection:  D = 1
  – Linear array:         D = N - 1
  – Ring:                 D = N/2
  – 2D mesh:              D = 2(√N - 1)
  – 2D torus:             D = 2⌊√N/2⌋
  – Hypercube:            D = log2 N
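As a quick sanity check of these formulas, a small sketch (illustrative only, not part of the lecture) that evaluates each diameter for an example machine of N = 64 processors:

/* Illustrative sketch: evaluate the diameter formulas above for N = 64. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    int N = 64;                                          /* example size */
    printf("complete connection: %d\n", 1);              /* D = 1        */
    printf("linear array:        %d\n", N - 1);          /* D = 63       */
    printf("ring:                %d\n", N / 2);          /* D = 32       */
    printf("2D mesh:             %.0f\n", 2.0 * (sqrt((double) N) - 1.0)); /* 14 */
    printf("2D torus:            %d\n", 2 * (int)(sqrt((double) N) / 2));  /* 8  */
    printf("hypercube:           %.0f\n", log2((double) N));               /* 6  */
    return 0;
}

The hypercube's logarithmic diameter is what makes it attractive for large N, at the price of log N links per processor.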
22
Characteristics of static networks
• Bisection width: the minimum number of communication links that have to be removed to partition the network into two halves.
• Channel rate: peak rate at which a single wire can deliver bits
• Channel bandwidth: the product of channel rate and channel width
• Bisection bandwidth B: the product of bisection width and channel bandwidth
23
Linear array, ring, mesh, torus
Processors are arranged as a d-dimensional grid or torus
24
Tree, Fat-tree
• Tree network: there is only one path between any pair of processors.
• Fat tree network: increases the number of communication links close to the root
25
Hypercube
[Diagram: 1-D, 2-D and 3-D hypercubes]
26
Binary Reflected Gray (BRG) code:
G(i,d) denotes the i-th entry in the sequence of d-bit Gray codes. G(i,d+1) is derived from G(i,d) by reflecting the table and prefixing the reflected entries with 1 and the original entries with 0.
27
Example of BRG code
1-bit   2-bit   3-bit     8p ring   8p hypercube node
  0      00      000         0            0
  1      01      001         1            1
         11      011         2            3
         10      010         3            2
                 110         4            6
                 111         5            7
                 101         6            5
                 100         7            4

(Processor i of an 8-processor ring is placed on hypercube node G(i,3).)
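The reflect-and-prefix construction above is equivalent to the closed form G(i,d) = i XOR (i >> 1). A minimal sketch (ours, not from the slides) that reproduces the 3-bit column and the ring-to-hypercube mapping in the table:

/* Illustrative sketch: binary reflected Gray code via g = i ^ (i >> 1),
 * which matches the reflect-and-prefix construction described above.   */
#include <stdio.h>

static unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main(void)
{
    int d = 3;                                /* number of bits          */
    for (unsigned i = 0; i < (1u << d); i++) {
        unsigned g = gray(i);
        printf("ring processor %u -> hypercube node %u (", i, g);
        for (int b = d - 1; b >= 0; b--)      /* print g as a d-bit code */
            putchar(((g >> b) & 1) ? '1' : '0');
        printf(")\n");
    }
    return 0;
}

Running it prints the sequence 0, 1, 3, 2, 6, 7, 5, 4, i.e. the 8-processor hypercube column above: consecutive ring processors always land on hypercube nodes whose labels differ in exactly one bit.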
28
Embedding other networks into hypercubes
• Mapping a linear array onto a hypercube: a linear array (or ring) of 2^d processors can be embedded into a d-dimensional hypercube by mapping processor i onto processor G(i,d) of the hypercube
• Mapping a 2^r x 2^s mesh onto a hypercube:
  – processor(i,j) ---> G(i,r)||G(j,s)   (|| denotes concatenation)
The hypercube is a rich topology; many other networks can be "easily" mapped onto it (see the sketch below).
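As an illustration of the concatenation rule (ours, not from the slides; the name mesh_to_hypercube is made up), a sketch that computes the hypercube node hosting mesh processor (i,j):

/* Illustrative sketch: embed a 2^r x 2^s mesh into an (r+s)-dimensional
 * hypercube via processor(i,j) -> G(i,r) || G(j,s).                     */
#include <stdio.h>

static unsigned gray_code(unsigned i) { return i ^ (i >> 1); }

/* Concatenate the r-bit Gray code of i with the s-bit Gray code of j:
 * the result is the label of the hypercube node for mesh node (i,j).    */
static unsigned mesh_to_hypercube(unsigned i, unsigned j, unsigned r, unsigned s)
{
    (void) r;                 /* r only determines the width of the i field */
    return (gray_code(i) << s) | gray_code(j);
}

int main(void)
{
    /* example: processor (2,5) of a 4 x 8 mesh (r = 2, s = 3) */
    printf("(2,5) -> hypercube node %u\n", mesh_to_hypercube(2, 5, 2, 3));
    return 0;
}

Neighbors along a mesh row or column differ by one Gray-code step in one field, so they map to hypercube nodes connected by a direct link.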
29
Trade-off among different networks
Network                Minimum latency   Maximum BW per proc   Wires        Switches     Example
Completely connected   Constant          Constant              O(p*p)       -            -
Crossbar               Constant          Constant              O(p)         O(p*p)       Cray
Bus                    Constant          O(1/p)                O(p)         O(p)         SGI Challenge
Mesh                   O(sqrt p)         Constant              O(p)         -            Intel ASCI Red
Hypercube              O(log p)          Constant              O(p log p)   -            SGI Origin
Switched               O(log p)          Constant              O(p log p)   O(p log p)   IBM SP-2
30
Beowulf
• Cluster built with commodity hardware components
  – PC hardware (x86, Alpha, PowerPC)
  – Commercial high-speed interconnection (100Base-T, Gigabit Ethernet, Myrinet, SCI, Infiniband)
  – Linux, FreeBSD operating system
http://www.beowulf.org
31
Appleseed: PowerPC cluster
http://exodus.physics.ucla.edu/appleseed
32
Clusters of SMPs
• The next generation of supercomputers will have thousands of SMP nodes connected
  – Increase the computational power of the single node
  – Keep the number of nodes "low"
  – New programming approach needed: MPI + threads (OpenMP, Pthreads, ...)
  – ASCI White, Compaq SC, IBM SP3, ...
http://www.llnl.gov/asci
33
Multithreaded architecture
• The MTA system provides scalable shared memory, in which every processor has equal access to every memory location
• No concerns about the layout of memory
• Each MTA processor has up to 128 RISC-like virtual processors
http://www.tera.com (now Cray)
Each virtual processor is a hardware stream with its own instruction counter, register set, stream status word and target and trap registers. A different hardware stream is activated every clock period.
34
Earth Simulator
• From 2002 to 2004, the Earth Simulator was the most powerful supercomputer in the world
• 40 Teraflops of peak performance
• 10 Terabytes of memory
• Uses a crossbar switch to connect 640 nodes: 3,000 km of cables!!!!
• Each node has 8 vector processors
• Sustained performance of up to 20 TFlops on climate simulation, 15 TFlops on a 4096^3 isotropic turbulence simulation
35
BlueGene/L
• In 2006-07, BG/L was the #1 computer on the TOP500 supercomputer list
• 32 x 32 x 64 3D torus
• 131,000 processors
• System on a chip
• Low-cost, low-power processor
• 360 teraOps peak
• 280 teraOps sustained (Linpack)
• 32 tebibytes of memory
36
Tianhe-1 and -2
• National Super Computer Center in Guangzhou
• 3,120,000 cores
• Intel Xeon E5-2692, 2.2 GHz
• Intel Xeon Phi 31S1P accelerator cards
• TH Express-2 interconnect
• Linpack performance: 33.8 PFlop/s
• Power: 18.8 MWatt
• Intel CC compiler, Intel MKL-11.0.0
• MPICH2 for communication
37
Programming models
• Shared memory:
  – Automatic parallelization
  – Pthreads
  – Compiler directives: OpenMP
• Message passing:
  – MPI: Message Passing Interface
  – PVM: Parallel Virtual Machine
  – HPF: High Performance Fortran
38
Pthreads
• POSIX threads:
  – Standard definition but non-standard implementation
  – Hard to code
More Recent Models
• Extracting high performance from Graphics Processing Units (GPUs) using
  – CUDA for NVIDIA cards
  – OpenCL for CPUs / GPUs / DSPs / FPGAs
  – OpenACC for heterogeneous systems
• And the list goes on…what should you do?
39
40
OpenMP
• New “de-facto” standard available on all major platforms
• Easy to implement
• Single source for parallel and serial code
41
MPI
• Standard parallel interface:
  – MPI 1
  – MPI 2: extends MPI 1 with one-sided communication
• Need to rewrite the code
• Code portable across all architectures
42
PVM
• Parallel Virtual Machine
• Another popular message passing interface
• Useful in environments with multiple vendors and for the MPMD approach
43
Simple code to compute π

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals: '
      read *, n
c calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
      pi = w * sum
      print *, 'computed pi = ', pi
      stop
      end
44
OpenMP code

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals: '
      read *, n
c calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
!$OMP PARALLEL DO PRIVATE(x), SHARED(w)
!$OMP& REDUCTION(+: sum)
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
!$OMP END PARALLEL DO
      pi = w * sum
      print *, 'computed pi = ', pi
      stop
      end
45
MPI code

      program compute_pi
      include 'mpif.h'
      double precision mypi, pi, w, sum, x, f, a
      integer n, myid, numprocs, i, rc, ierr
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      if ( myid .eq. 0 ) then
         print *, 'Enter number of intervals: '
         read *, n
      endif
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
c calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = myid+1, n, numprocs
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      enddo
      mypi = w * sum
46
MPI code (continued)

c collect all the partial sums
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     $                MPI_COMM_WORLD, ierr)
c node 0 prints the answer.
      if (myid .eq. 0) then
         print *, 'computed pi = ', pi
      endif
      call MPI_FINALIZE(rc)
      stop
      end
47
Pthreads code

#include <stdio.h>
#include <unistd.h>
#include <sys/times.h>
#include <pthread.h>
#include <stdlib.h>

float sumall, width;
int i, iend;
__private int ibegin;
__private int cut;
__private float x;
__private float xsum;
void *do_work();

main(argc, argv)
int argc;
char *argv[];
{
/* Pi - Program loops over slices in interval, summing area of each slice */
    struct tms time_buf;
    float ticks, t1, t2;
    int intrvls, numthreads, istart;
    pthread_t pt[32];

/* get intervals from command line */
    intrvls = atoi(argv[1]);
    numthreads = atoi(getenv("PL_NUM_THREADS"));
    printf(" intervals = %d PL_NUM_THREADS = %d \n", intrvls, numthreads);

/* get number of clock ticks per second and initialize timer */
    ticks = sysconf(_SC_CLK_TCK);
    t2 = times(&time_buf);

/* - - Compute width of cuts */
    width = 1. / intrvls;
    sumall = 0.0;
48
Pthreads code (continued)

/* - - Loop over interval, summing areas */
    istart = 1;
    iend = intrvls/numthreads;
    for (i = 0; i < numthreads - 1; i++)
    {
        pthread_create(&pt[i], pthread_attr_default, do_work, (void *) istart);
        istart += iend;
    }
    do_work(istart);
    istart += iend;
    for (i = 0; i < numthreads - 1; i++)
    {
        pthread_join(pt[i], NULL);
    }
/* - - finish any remaining slices */
    iend = intrvls - (intrvls/numthreads) * numthreads;
    if (iend) do_work(istart);
/* - - Finish overall timing and write results */
    t1 = times(&time_buf);
    printf("Time in main = %20.14e sum = %20.14f \n", (t1 - t2)/ticks, sumall);
    printf("Error = %20.15e \n", sumall - 3.14159265358979323846);
}

void *do_work(istart)
int istart;
{
    ibegin = istart;
    xsum = 0.0;
    for (cut = ibegin; cut < ibegin + iend; cut++)
    {
        x = (((float) cut) - .5) * width;
        xsum += width * 4. / (1. + x * x);
    }
    sumall += xsum;
}
49
PVM code 1/3

      program compute_pi_master
      include '~/pvm3/include/fpvm3.h'
      parameter (NTASKS = 5)
      parameter (INTERVALS = 1000)
      integer mytid
      integer tids(NTASKS)
      real sum, area
      real width
      integer i, numt, msgtype, bufid, bytes, who, info

      sum = 0.0
C Enroll in PVM
      call pvmfmytid(mytid)
C spawn off NTASKS workers
      call pvmfspawn('comppi.worker', PVMDEFAULT, ' ',
     +               NTASKS, tids, numt)
      width = 0.0
      i = 0
C Multi-cast initial dummy message to workers
      msgtype = 0
      call pvmfinitsend(0, info)
      call pvmfpack(INTEGER4, i, 1, 1, info)
      call pvmfpack(REAL4, width, 1, 1, info)
      call pvmfmcast(NTASKS, tids, msgtype, info)
C compute interval width
      width = 1.0 / INTERVALS
C for each interval: 1) receive area from worker, 2) add area to sum,
C 3) send worker new interval number and width
      do i = 1, INTERVALS
         call pvmfrecv(-1, -1, bufid)
         call pvmfbufinfo(bufid, bytes, msgtype, who, info)
         call pvmfunpack(REAL4, area, 1, 1, info)
         sum = sum + area
         call pvmfinitsend(PvmDataDefault, info)
         call pvmfpack(INTEGER4, i, 1, 1, info)
         call pvmfpack(REAL4, width, 1, 1, info)
         call pvmfsend(who, msgtype, info)
      enddo
50
PVM code 2/3

C Signal to workers that tasks are done
      i = -1
      call pvmfinitsend(0, info)
      call pvmfpack(INTEGER4, i, 1, 1, info)
      call pvmfpack(REAL4, width, 1, 1, info)

C Collect the last NTASKS areas and send the completion signal
      do i = 1, NTASKS
         call pvmfrecv(-1, -1, bufid)
         call pvmfbufinfo(bufid, bytes, msgtype, who, info)
         call pvmfunpack(REAL4, area, 1, 1, info)
         sum = sum + area
         call pvmfsend(who, msgtype, info)
      enddo

      print 10, sum
 10   format(1x, 'Computed value of Pi is ', F8.6)

      call pvmfexit(info)
      end
51
PVM code 3/3

      integer mytid, master
      real area
      real width, int_val, height
      integer int_num
C Enroll in PVM
      call pvmfmytid(mytid)
C who is sending me work?
      call pvmfparent(master)

C receive first job from the master
      call pvmfrecv(-1, -1, info)
      call pvmfunpack(INTEGER4, int_num, 1, 1, info)
      call pvmfunpack(REAL4, width, 1, 1, info)

C While I've not been sent the signal to quit, I'll keep processing
 40   if (int_num .eq. -1) goto 50
C compute interval value from interval number
      int_val = int_num * width
C compute height of given rectangle
      height = F(int_val)
C compute area
      area = height * width
C send area back to master
      call pvmfinitsend(PvmDataDefault, info)
      call pvmfpack(REAL4, area, 1, 1, info)
      call pvmfsend(master, 9, info)
C Wait for next job from master
      call pvmfrecv(-1, -1, info)
      call pvmfunpack(INTEGER4, int_num, 1, 1, info)
      call pvmfunpack(REAL4, width, 1, 1, info)
      goto 40
C all done
 50   call pvmfexit(info)
      end

      REAL FUNCTION F(X)
      REAL X
      F = 4.0/(1.0+X**2)
      RETURN
      END