TRANSCRIPT
CME342 - Parallel Methods in Numerical Analysis
April 2, 2014 Lecture 2
Parallel Architectures
2
Announcements 1. Subscribe to the mailing list.
Go to lists.stanford.edu and follow the directions in the 1st handout. Subscribe to the list named: cme342-class
2. Sign your name up on the list I am passing around if you have not done so.
If you are not on the list, you will not have preferred access to cluster resources / GPUs.
3. If you missed lecture notes or handouts, they are on the web page now
http://adl.stanford.edu/cme342
3
Some units
1 MFlop/s = 10^6 flops/sec    1 MByte = 10^6 bytes
1 GFlop/s = 10^9 flops/sec    1 GByte = 10^9 bytes
1 TFlop/s = 10^12 flops/sec   1 TByte = 10^12 bytes
4
Performance goals
5
Microprocessor performance
6
Top 500 List – November 2013
7
Top 500 List – Historical Trends
8
Parallel Computers
• A parallel computer is a collection of CPUs/processing units that cooperate to solve a problem. It can solve a problem faster, or simply solve a bigger problem.
  – How large is the collection?
  – How is the memory organized?
  – How do they communicate and transfer information?

This is not a course on parallel architectures: this lecture is just an overview. You need some knowledge of the underlying hardware to use parallel computers efficiently.
9
Categorization of Parallel Architectures
• Control mechanism: instruction stream and data stream
• Process granularity
• Address space organization
• Interconnection network
  – Static
  – Dynamic
10
Control mechanism (Flynn’s taxonomy)
• SISD: Single Instruction stream Single Data stream
• SIMD: Single Instruction stream Multiple Data stream
• MIMD: Multiple Instruction stream Multiple Data stream
• MISD: Multiple Instruction stream Single Data stream
11
SIMD
• Multiple processing elements work under the supervision of a single control unit
  – Examples: Thinking Machines CM-2, MasPar MP-2, Quadrics
  – SIMD extensions are now present in commercial microprocessors (MMX and SSE in Intel x86, 3DNow! in AMD K6 and Athlon, AltiVec in Motorola G4)
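To make the SIMD idea concrete, here is a minimal sketch (not from the lecture; the function name and the assumption that n is a multiple of 4 are ours) that adds two float arrays four elements at a time with the SSE intrinsics mentioned above:

/* Illustrative SIMD sketch: add two float arrays four elements at a time
 * using SSE. One _mm_add_ps instruction operates on all four lanes.     */
#include <xmmintrin.h>                     /* SSE intrinsics, __m128 type */

void add_arrays_sse(const float *a, const float *b, float *c, int n)
{
    /* assumes n is a multiple of 4 and the arrays do not overlap */
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(b + i);   /* load 4 floats from b */
        __m128 vc = _mm_add_ps(va, vb);    /* add all 4 lanes at once */
        _mm_storeu_ps(c + i, vc);          /* store 4 results into c */
    }
}

On a SISD machine the same loop would issue one scalar add per element; the single-instruction-multiple-data extension processes four data elements per instruction.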
12
MIMD
• Each processing element is capable of executing a different program independent of the other processors
• Most multiprocessors can be classified in this category
13
Process granularity
• Coarse grain: Cray C90, Fujitsu
• Medium grain: IBM SP2, CM-5, clusters
• Fine grain: CM-2, Quadrics, Blue Gene/L?
14
Address space:
• Single address space
  – Uniform Memory Access (UMA): SMP
  – Non-Uniform Memory Access (NUMA)
• Local address spaces:
  – Message passing
15
SMP architecture
[Diagram: four CPUs, each with its own cache, connected by a bus or crossbar switch to shared memory and I/O]
• SMP uses shared system resources (memory, I/O) that can be accessed equally from all the processors
• Cache coherence is maintained
16
NUMA architecture
• Shared address space
• Memory latency varies depending on whether you access local or remote memory
• Cache coherence is maintained using a hardware or software protocol
[Diagram: four CPUs with caches connected by a bus or crossbar switch; each CPU has its own local memory module]
17
Message-passing / Distributed Memory
• Local address space
• No cache coherence
[Diagram: four CPUs with caches, each with its own local memory, connected by a communication network]
18
Hybrids?
• Notice that, currently, a CPU can have many cores that share memory within one processor (and a node may contain more than one such processor)
• In addition, accelerator cards with separate memory can be present (heterogeneous architectures: GPUs and others)
• No cache coherence
[Diagram: nodes consisting of a CPU with cache and memory plus a GPU with its own device memory attached over PCIx, connected by a communication network]
19
Dynamic interconnections
• Crossbar switching: the most expensive and most extensive interconnection
• Bus connected: processors are connected to memory through a common datapath
• Multistage interconnection: butterfly, Omega network, perfect shuffle, etc.
[Diagram: processors P1, P2 connected to memories M1, M2 through a multistage (butterfly) network]
20
Static interconnection networks
• Complete interconnection
• Star interconnection
• Linear array
• Mesh: 2D/3D mesh, 2D/3D torus
• Tree and fat tree network
• Hypercube network
21
Characteristics of static networks
• Diameter: maximum distance between any two processors in the network
  – Complete connection:  D = 1
  – Linear array:         D = N - 1
  – Ring:                 D = N/2
  – 2D mesh:              D = 2(√N - 1)
  – 2D torus:             D = 2⌊√N/2⌋
  – Hypercube:            D = log2 N
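As a quick sanity check of these formulas, a small sketch (illustrative only, not part of the lecture) that evaluates each diameter for an example machine of N = 64 processors:

/* Illustrative sketch: evaluate the diameter formulas above for N = 64. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    int N = 64;                                          /* example size */
    printf("complete connection: %d\n", 1);              /* D = 1        */
    printf("linear array:        %d\n", N - 1);          /* D = 63       */
    printf("ring:                %d\n", N / 2);          /* D = 32       */
    printf("2D mesh:             %.0f\n", 2.0 * (sqrt((double) N) - 1.0)); /* 14 */
    printf("2D torus:            %d\n", 2 * (int)(sqrt((double) N) / 2));  /* 8  */
    printf("hypercube:           %.0f\n", log2((double) N));               /* 6  */
    return 0;
}

The hypercube's logarithmic diameter is what makes it attractive for large N, at the price of log N links per processor.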
22
Characteristics of static networks
• Bisection width: the minimum number of communication links that have to be removed to partition the network into two halves.
• Channel rate: peak rate at which a single wire can deliver bits
• Channel bandwidth: the product of channel rate and channel width
• Bisection bandwidth B: the product of bisection width and channel bandwidth
23
Linear array, ring, mesh, torus
Processors are arranged as a d-dimensional grid or torus
24
Tree, Fat-tree
• Tree network: there is only one path between any pair of processors.
• Fat tree network: increases the number of communication links close to the root
25
Hypercube
[Diagram: 1-D, 2-D and 3-D hypercubes]
26
Binary Reflected Gray (BRG) code:
G(i,d) denotes the i-th entry in the sequence of d-bit Gray codes. G(i,d+1) is derived from G(i,d) by reflecting the table and prefixing the reflected entries with 1 and the original entries with 0.
27
Example of BRG code
1-bit   2-bit   3-bit     8p ring   8p hypercube node
  0      00      000         0            0
  1      01      001         1            1
         11      011         2            3
         10      010         3            2
                 110         4            6
                 111         5            7
                 101         6            5
                 100         7            4

(Processor i of an 8-processor ring is placed on hypercube node G(i,3).)
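The reflect-and-prefix construction above is equivalent to the closed form G(i,d) = i XOR (i >> 1). A minimal sketch (ours, not from the slides) that reproduces the 3-bit column and the ring-to-hypercube mapping in the table:

/* Illustrative sketch: binary reflected Gray code via g = i ^ (i >> 1),
 * which matches the reflect-and-prefix construction described above.   */
#include <stdio.h>

static unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main(void)
{
    int d = 3;                                /* number of bits          */
    for (unsigned i = 0; i < (1u << d); i++) {
        unsigned g = gray(i);
        printf("ring processor %u -> hypercube node %u (", i, g);
        for (int b = d - 1; b >= 0; b--)      /* print g as a d-bit code */
            putchar(((g >> b) & 1) ? '1' : '0');
        printf(")\n");
    }
    return 0;
}

Running it prints the sequence 0, 1, 3, 2, 6, 7, 5, 4, i.e. the 8-processor hypercube column above: consecutive ring processors always land on hypercube nodes whose labels differ in exactly one bit.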
28
Embedding other networks into hypercubes
• Mapping a linear array onto a hypercube: a linear array (or ring) of 2^d processors can be embedded into a d-dimensional hypercube by mapping processor i onto processor G(i,d) of the hypercube
• Mapping a 2^r x 2^s mesh onto a hypercube:
  – processor(i,j) ---> G(i,r)||G(j,s)   (|| denotes concatenation)
The hypercube is a rich topology; many other networks can be "easily" mapped onto it (see the sketch below).
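As an illustration of the concatenation rule (ours, not from the slides; the name mesh_to_hypercube is made up), a sketch that computes the hypercube node hosting mesh processor (i,j):

/* Illustrative sketch: embed a 2^r x 2^s mesh into an (r+s)-dimensional
 * hypercube via processor(i,j) -> G(i,r) || G(j,s).                     */
#include <stdio.h>

static unsigned gray_code(unsigned i) { return i ^ (i >> 1); }

/* Concatenate the r-bit Gray code of i with the s-bit Gray code of j:
 * the result is the label of the hypercube node for mesh node (i,j).    */
static unsigned mesh_to_hypercube(unsigned i, unsigned j, unsigned r, unsigned s)
{
    (void) r;                 /* r only determines the width of the i field */
    return (gray_code(i) << s) | gray_code(j);
}

int main(void)
{
    /* example: processor (2,5) of a 4 x 8 mesh (r = 2, s = 3) */
    printf("(2,5) -> hypercube node %u\n", mesh_to_hypercube(2, 5, 2, 3));
    return 0;
}

Neighbors along a mesh row or column differ by one Gray-code step in one field, so they map to hypercube nodes connected by a direct link.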
29
Trade-off among different networks
Network                Minimum latency   Maximum BW per proc   Wires        Switches     Example
Completely connected   Constant          Constant              O(p*p)       -            -
Crossbar               Constant          Constant              O(p)         O(p*p)       Cray
Bus                    Constant          O(1/p)                O(p)         O(p)         SGI Challenge
Mesh                   O(sqrt p)         Constant              O(p)         -            Intel ASCI Red
Hypercube              O(log p)          Constant              O(p log p)   -            SGI Origin
Switched               O(log p)          Constant              O(p log p)   O(p log p)   IBM SP-2
30
Beowulf
• Cluster built with commodity hardware components
  – PC hardware (x86, Alpha, PowerPC)
  – Commercial high-speed interconnection (100Base-T, Gigabit Ethernet, Myrinet, SCI, Infiniband)
  – Linux, FreeBSD operating system
http://www.beowulf.org
31
Appleseed: PowerPC cluster
http://exodus.physics.ucla.edu/appleseed
32
Clusters of SMPs
• The next generation of supercomputers will have thousands of SMP nodes connected
  – Increase the computational power of the single node
  – Keep the number of nodes "low"
  – New programming approach needed: MPI + threads (OpenMP, Pthreads, ...)
  – ASCI White, Compaq SC, IBM SP3, ...
http://www.llnl.gov/asci
33
Multithreaded architecture
• The MTA system provides scalable shared memory, in which every processor has equal access to every memory location
• No concerns about the layout of memory
• Each MTA processor has up to 128 RISC-like virtual processors
http://www.tera.com (now Cray)
Each virtual processor is a hardware stream with its own instruction counter, register set, stream status word and target and trap registers. A different hardware stream is activated every clock period.
34
Earth Simulator
• From 2002 to 2004, the Earth Simulator was the most powerful supercomputer in the world
• 40 Teraflops of peak performance
• 10 Terabytes of memory
• Uses a crossbar switch to connect 640 nodes: 3,000 km of cables!!!!
• Each node has 8 vector processors
• Sustained performance of up to 20 TFlops on climate simulation, 15 TFlops on a 4096^3 isotropic turbulence simulation
35
BlueGene/L
• In 2006-07, BG/L was the #1 computer on the TOP500 supercomputer list
• 32 x 32 x 64 3D torus
• 131,000 processors
• System on a chip
• Low-cost, low-power processor
• 360 teraOps peak
• 280 teraOps sustained (Linpack)
• 32 tebibytes of memory
36
Tianhe-1 and -2
• National Super Computer Center in Guangzhou
• 3,120,000 cores
• Intel Xeon E5-2692, 2.2 GHz
• Intel Xeon Phi 31S1P accelerator cards
• TH Express-2 interconnect
• Linpack performance: 33.8 PFlop/s
• Power: 18.8 MWatt
• Intel CC compiler, Intel MKL-11.0.0
• MPICH2 for communication
37
Programming models
• Shared memory:
  – Automatic parallelization
  – Pthreads
  – Compiler directives: OpenMP
• Message passing:
  – MPI: Message Passing Interface
  – PVM: Parallel Virtual Machine
  – HPF: High Performance Fortran
38
Pthreads
• POSIX threads:
  – Standard definition but non-standard implementation
  – Hard to code
More Recent Models
• Extracting high performance from Graphics Processing Units (GPUs) using
  – CUDA for NVIDIA cards
  – OpenCL for CPUs / GPUs / DSPs / FPGAs
  – OpenACC for heterogeneous systems
• And the list goes on…what should you do?
39
40
OpenMP
• New “de-facto” standard available on all major platforms
• Easy to implement
• Single source for parallel and serial code
41
MPI
• Standard parallel interface:
  – MPI 1
  – MPI 2: extends MPI 1 with one-sided communication
• Need to rewrite the code
• Code portable across all architectures
42
PVM
• Parallel Virtual Machine
• Another popular message passing interface
• Useful in environments with multiple vendors and for the MPMD approach
43
Simple code to compute π

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals: '
      read *, n
c calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
      pi = w * sum
      print *, 'computed pi = ', pi
      stop
      end
44
OpenMP code

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals: '
      read *, n
c calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
!$OMP PARALLEL DO PRIVATE(x), SHARED(w)
!$OMP& REDUCTION(+: sum)
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
!$OMP END PARALLEL DO
      pi = w * sum
      print *, 'computed pi = ', pi
      stop
      end
45
MPI code

      program compute_pi
      include 'mpif.h'
      double precision mypi, pi, w, sum, x, f, a
      integer n, myid, numprocs, i, rc, ierr
c function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
      if ( myid .eq. 0 ) then
         print *, 'Enter number of intervals: '
         read *, n
      endif
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
c calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = myid+1, n, numprocs
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      enddo
      mypi = w * sum
46
MPI code (continued)

c collect all the partial sums
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     $                MPI_COMM_WORLD, ierr)
c node 0 prints the answer.
      if (myid .eq. 0) then
         print *, 'computed pi = ', pi
      endif
      call MPI_FINALIZE(rc)
      stop
      end
47
Pthreads code

#include <stdio.h>
#include <unistd.h>
#include <sys/times.h>
#include <pthread.h>
#include <stdlib.h>

float sumall, width;
int i, iend;
__private int ibegin;
__private int cut;
__private float x;
__private float xsum;
void *do_work();

main(argc, argv)
int argc;
char *argv[];
{
/* Pi - Program loops over slices in interval, summing area of each slice */
    struct tms time_buf;
    float ticks, t1, t2;
    int intrvls, numthreads, istart;
    pthread_t pt[32];

/* get intervals from command line */
    intrvls = atoi(argv[1]);
    numthreads = atoi(getenv("PL_NUM_THREADS"));
    printf(" intervals = %d PL_NUM_THREADS = %d \n", intrvls, numthreads);

/* get number of clock ticks per second and initialize timer */
    ticks = sysconf(_SC_CLK_TCK);
    t2 = times(&time_buf);

/* - - Compute width of cuts */
    width = 1. / intrvls;
    sumall = 0.0;
48
Pthreads code (continued)

/* - - Loop over interval, summing areas */
    istart = 1;
    iend = intrvls/numthreads;
    for (i = 0; i < numthreads - 1; i++)
    {
        pthread_create(&pt[i], pthread_attr_default, do_work, (void *) istart);
        istart += iend;
    }
    do_work(istart);
    istart += iend;
    for (i = 0; i < numthreads - 1; i++)
    {
        pthread_join(pt[i], NULL);
    }
/* - - finish any remaining slices */
    iend = intrvls - (intrvls/numthreads) * numthreads;
    if (iend) do_work(istart);
/* - - Finish overall timing and write results */
    t1 = times(&time_buf);
    printf("Time in main = %20.14e sum = %20.14f \n", (t1 - t2)/ticks, sumall);
    printf("Error = %20.15e \n", sumall - 3.14159265358979323846);
}

void *do_work(istart)
int istart;
{
    ibegin = istart;
    xsum = 0.0;
    for (cut = ibegin; cut < ibegin + iend; cut++)
    {
        x = (((float) cut) - .5) * width;
        xsum += width * 4. / (1. + x * x);
    }
    sumall += xsum;
}
49
PVM code 1/3

      program compute_pi_master
      include '~/pvm3/include/fpvm3.h'
      parameter (NTASKS = 5)
      parameter (INTERVALS = 1000)
      integer mytid
      integer tids(NTASKS)
      real sum, area
      real width
      integer i, numt, msgtype, bufid, bytes, who, info

      sum = 0.0
C Enroll in PVM
      call pvmfmytid(mytid)
C spawn off NTASKS workers
      call pvmfspawn('comppi.worker', PVMDEFAULT, ' ',
     +               NTASKS, tids, numt)
      width = 0.0
      i = 0
C Multi-cast initial dummy message to workers
      msgtype = 0
      call pvmfinitsend(0, info)
      call pvmfpack(INTEGER4, i, 1, 1, info)
      call pvmfpack(REAL4, width, 1, 1, info)
      call pvmfmcast(NTASKS, tids, msgtype, info)
C compute interval width
      width = 1.0 / INTERVALS
C for each interval: 1) receive area from worker, 2) add area to sum,
C 3) send worker new interval number and width
      do i = 1, INTERVALS
         call pvmfrecv(-1, -1, bufid)
         call pvmfbufinfo(bufid, bytes, msgtype, who, info)
         call pvmfunpack(REAL4, area, 1, 1, info)
         sum = sum + area
         call pvmfinitsend(PvmDataDefault, info)
         call pvmfpack(INTEGER4, i, 1, 1, info)
         call pvmfpack(REAL4, width, 1, 1, info)
         call pvmfsend(who, msgtype, info)
      enddo
50
PVM code 2/3

C Signal to workers that tasks are done
      i = -1
      call pvmfinitsend(0, info)
      call pvmfpack(INTEGER4, i, 1, 1, info)
      call pvmfpack(REAL4, width, 1, 1, info)

C Collect the last NTASKS areas and send the completion signal
      do i = 1, NTASKS
         call pvmfrecv(-1, -1, bufid)
         call pvmfbufinfo(bufid, bytes, msgtype, who, info)
         call pvmfunpack(REAL4, area, 1, 1, info)
         sum = sum + area
         call pvmfsend(who, msgtype, info)
      enddo

      print 10, sum
 10   format(1x, 'Computed value of Pi is ', F8.6)

      call pvmfexit(info)
      end
51
PVM code 3/3

      integer mytid, master
      real area
      real width, int_val, height
      integer int_num
C Enroll in PVM
      call pvmfmytid(mytid)
C who is sending me work?
      call pvmfparent(master)

C receive first job from the master
      call pvmfrecv(-1, -1, info)
      call pvmfunpack(INTEGER4, int_num, 1, 1, info)
      call pvmfunpack(REAL4, width, 1, 1, info)

C While I've not been sent the signal to quit, I'll keep processing
 40   if (int_num .eq. -1) goto 50
C compute interval value from interval number
      int_val = int_num * width
C compute height of given rectangle
      height = F(int_val)
C compute area
      area = height * width
C send area back to master
      call pvmfinitsend(PvmDataDefault, info)
      call pvmfpack(REAL4, area, 1, 1, info)
      call pvmfsend(master, 9, info)
C Wait for next job from master
      call pvmfrecv(-1, -1, info)
      call pvmfunpack(INTEGER4, int_num, 1, 1, info)
      call pvmfunpack(REAL4, width, 1, 1, info)
      goto 40
C all done
 50   call pvmfexit(info)
      end

      REAL FUNCTION F(X)
      REAL X
      F = 4.0/(1.0+X**2)
      RETURN
      END