plan of lectures 1. development of data parallel programming. historical view of languages up to...
TRANSCRIPT
Plan of lectures 1. Development of data parallel programming.
Historical view of languages up to HPF. 2. Issues in translation of HPF. Simple
examples. Distributed array descriptors. 3. Communication in data parallel languages.
Communication patterns. Runtime libraries. 4. An “HPspmd” programming model.
Motivations. Introduction to HPJava. 5. Java for HP computing. Java Grande.
mpiJava.
Development of Data-Parallel Programming
Bryan Carpenter
NPAC at Syracuse UniversitySyracuse, NY [email protected]
Goals of this lecture Review the historical development
of data-parallel programming, up to and including emergence of High Performance Fortran.
Illustrate evolution of ideas by describing some major practical programming languages for data parallelism.
Contents of Lecture Overview.
Need for massive parallelism: SIMD, MIMD, SPMD.
Early data-parallel languages. ILLIAC IV CFD and DAP FORTRAN.
Later SIMD languages. Fortran 90 Connection Machine Fortran, related languages.
High Performance Fortran. Multi-processing and distributed data Processor arrangements and templates,
alignment Programming examples.
Moore’s law and massive parallelism Microprocessors double in speed
every eighteen months. To outpace growth in power of
sequential computers, parallelism better offer orders of magnitude better performance.
Moderate parallelism cannot give this speedup—massive parallelism or bust!
Niches for massive parallelism today Dedicated clusters of commodity
processors. Government labs, internet servers, Pixar Renderfarm, …
Harvesting cycles of processors distributed over the internet.
Two principal forms of massive parallelism: task farming, and data parallelism.
Data parallelism Large data structures, typically arrays,
split across computational nodes. Each node computes local patch of arrays. Typically need some intermediate results
from other nodes, but communication should not dominate.
Logical model of fine-grained parallelism—at level of array elements, rather than tasks.
Programming languages for massive parallelism Task farming adequately expressed in
conventional languages task = function call libraries handle communication + load-balancing
Data parallelism can be expressed in conventional languages + MPI (say), but historically higher level languages have been quite successful Potentially avoid the problems of general
concurrent programming (non-determinism, synchronization, deadlock, …)
Why special languages for data-parallelism? Originally devised for Single
Instruction, Multiple Data (SIMD) computers.
Specialized architectures demanded special language features.
New abstraction: distributed array. Central control avoided problems of
concurrent programming—concentrate on expressing parallel algorithms.
Advent of MIMD computers. By 90s, SIMD lost ground—general purpose
microprocessors too cheap, powerful. Multiple Instruction, Multiple Data (MIMD) computers could be built from commodity components.
Still want to avoid complexity of general concurrent programming. Single Program, Multiple Data (SPMD)—now a programming model rather than computer architecture.
Similarities to SIMD programming suggest similar language ideas should apply.
The road to HPF High Performance Fortran took ideas
from SIMD languages—embedded them in a standardized language for programming MIMD computers in SPMD style.
Next section of this lecture will review historically important SIMD languages—learn where the ideas came from.
ILLIAC IV ILLIAC IV development started late 60s.
Fully operational 1975. Illinois then NASA Ames.
SIMD computer for array processing. Control Unit + 64 Processing elements. 2K
words memory per PE. CU can access all memory. PEs can access
local memory and communicate with neighbours.
CU reads program text—broadcasts instructions to PEs.
ILLIAC IV CFD Software problematic for ILLIAC. CFD language developed at
Computational Fluid Dynamics Branch of Ames.
“Fortran-like” rather than a true FORTRAN dialect.
Deliberately no attempt to hide hardware peculiarities—on contrary, tried to give programmer direct access to all features of the hardware.
Data types in CFD Basic types included CU REAL, CU
INTEGER, PE REAL, PE INTEGER. Ordinary arrays can be in PE or CU
memory: CU REAL A, B(100) PE INTEGER I PE REAL D(25), E(1000)
Vector-aligned arrays in PE memory: PE INTEGER J(*) PE REAL X(*,4), Y(*,2,8)
Vector aligned arrays in CFD Early language instantiation of
Distributed Array concept. Only first dimension can be distributed.
Extent of distributed dimension exactly 64.
J(1) stored on first PE, J(2) on second PE, etc.
X(1,1),…,X(1,4) on first PE, X(2,1),…, X(2,4) on second PE, etc.
Parallel computation in CFD Parallel computation only in vector
assignments. Subscript is *. Communication between PEs by adding
shift to *: DIFP(*) = P(* + 1) – P(* - 1)
Conditional vector assignment: IF(A(*) .LT. 0) A(*) = -A(*)
PEs can locally access different locations: DIAG(*) = RHO(*, X(*))
Vector subscript must be in sequential dimension.
Other ILLIAC languages Glypnir
Algol-like rather than Fortran-like. swords (“super-words”) rather than explicit
parallel arrays. IVTRAN
Higher-level Fortran dialect. Scheme for mapping general arrays to PEs. Parallel loop similar to FORALL. Compiler never fully debugged.
ICL DAP FORTRAN ICL DAP (Distributed Array Processor)—early
commercial parallel computer. Available early 80s.
Trend to many, simple PEs: 64 × 64 grid of 4096 bit-serial processors.
FP performance much lower than ILLIAC, but good performance in other domains.
Module of ICL 2900 mainframe computer—shared memory and other facilities with host.
Programmed in DAP FORTRAN.
Constrained arrays in DAP FORTRAN First one or two extents could be omitted:
DIMENSION A(), BB(,) INTEGER II(,) REAL AA(,3), BBB(,,5)
A is a vector. BB and II are matrices. AA an array of vectors. BBB an array of
matrices. Extents of constrained dimensions 64
again (vectors of 4096 also possible).
Array assignments in DAP Fortran Leave subscripts empty for array
expression. + or – subscript used for nearest neighbour access: U(,) = 0.25 * (U(,-) + U(,+) + U(-,) + U(+,))
Masked assignment by using logical matrix as subscript: LOGICAL L(,) ... L(,) = BB(,) .LT. 0 B(L) = -B(,)
Standardized array processing: Fortran 90 By the mid 80s SIMD and vector computers
looked well-established. The Fortran 8X committee saw a need to make array assignments standard language features.
Fortran 90 (as it eventually became) added many new features to support modern software engineering. It also added new “array syntax”, including array assignments, array sections, transformational intrinsics.
Fortran 90 array assignments Rank-2 arrays, shape (5, 10):
REAL A(5, 10), B(5, 10), C(5, 10)
Array expressions involve conforming arrays: A + B, B * C, A + (B * C)
Arrays conform if they have same shape.
B * C, for example, stands for B(1,1) * C(1,1) . . . B(1,10) * C(1,10) · · · B(5,1) * C(5,1) . . . B(5,10) * C(5,10)
Scalars in array expressions Scalars conform with any array:
REAL D
Expression C + D stands for: C(1,1) + D . . . C(1,10) + D · · · C(5,1) + D . . . C(5,10) + D
Array sections
Array subscripts can be triplets: X(1:10:3) stands for the subarray: (X(1), X(4), X(7), X(10))
Special cases of triplets: X(1:4), stride defaults to 1: (X(1), X(2), X(3), X(4)) X(:) selects whole of array dimension, as
originally declared. Most useful for multi-dimensional arrays—here X(:) just equivalent to X.
Array assignments Simple assignment:
U(2:N-1) = 0.5 * (U(1:N-2) + U(3:N))
Right-hand-side expression must conform with left-hand-side variable.
Conditional assignment: WHERE (A .LT. 0.0) A = -A
The condition must be a logical array that conforms with the left-hand-side of the assignment.
Array intrinsics Fortran 90 added many transformational
intrinsics—operations on whole arrays. CSHIFT, EOSHIFT shift arrays in a
dimension. TRANSPOSE transposes array. RESHAPE: general reshaping of array. SUM, PRODUCT, MAXVAL, MINVAL, . . . :
reduction operations that take an array and return scalar or reduced-rank array.
The Connection Machine Thinking Machines Corp founded 1983 to
supply connectionist computers for AI. In practise, their Connection Machine series
was used largely for scientific computing. CM-2 launched 1986:
SIMD architecture, bit-serial PE like DAP. Up to 65536 PEs connected on a hypercube
network. Floating point coprocessors could give peak
performance 28 GFLOPS. Programmed initially in *LISP, then C*. By
1990 there was a CM Fortran compiler.
CM Fortran Extended from FORTRAN 77. Included all the new array syntax
of Fortran 90. FORALL statement. Array distribution directives.
Mapping arrays in CM Fortran Arrays can be allocated on front-end
(VAX, Sun,etc) or on the CM itself. By default an array is a CM array if
it ever used in a F90 array assignment in the current procedure; it is a front-end array if it is only ever used in F77 style.
Distributed arrays (“CM arrays”) can have any shape.
Layout of CM arrays Arrays mapped to Virtual Processor sets
(VP sets), one array element per VP. By default VP set has minimal shape s.t.:
VP grid is large enough to hold the array, Extents of VP grid are powers of 2, Total number VPs is a multiple of number PEs.
Map first array element to first VP. Disable or ignore VPs unneeded to hold
elements.
LAYOUTdirectives in CM Fortran Layout directive can override defaults. Serial dimensions can be elected:
REAL A(10, 64, 100)CMF$ LAYOUT A(:SERIAL, :NEWS, :NEWS)
A aligned with B: REAL B(64, 100)CMF$ LAYOUT B(:NEWS, :NEWS)
All dimensions serial implies front-end array.
LAYOUT directives can appear in procedure interface blocks.
ALIGN directives in CM Fortran Can explicitly specify alignment
relations: REAL V(100), B(64, 100)CMF$ ALIGN V(I) WITH B(1, I)
Offset alignments also allowed: REAL C(32, 50)CMF$ ALIGN C(I, J) WITH B(I+5, J+2)
More general alignments, eg transposed, not allowed.
FORALL Parallelism must be explicit. Eg in F90
style: U(2:N-1, 2:N-1) =& 0.25 * (U(2:N-1, 1:N-2) + U(2:N-1, 3:N)& + U(1:N-2, 2:N-1) + U(3:N, 2:N-1))
When array syntax unwieldy, can use FORALL statement: FORALL (I = 2:N-1, J = 2:N-1)& U(I, J) = 0.25 * (U(I, J-1) + U(I, J+1) +& U(I-1, J) + U(I+1, J))
Languages related to CM Fortran MasPar Fortran. Contemporary SIMD
computer corporation. Language very similar to CM Fortran. Slightly different syntax for directives.
*LISP. Original language of CM. Data parallel dialect of Common LISP. pvars rather than distributed arrays.
C*. CM version of C, extended with syntax for distributed arrays. Introduced shape concept, similar to templates in later HPF.
The High Performance Fortran Forum By early 90s, value of portable,
standardized languages universally acknowledged.
Goal of HPF Forum—a single language for High Performance programming. Effective across architectures—vector, SIMD, MIMD, though SPMD a focus.
Supported by most major vendors then: Cray, DEC, Fujitsu, HP, IBM, Intel, MasPar, Meiko,
nCube, Sun, Thinking Machines HPF 1.0 standard published 1993.
Multiprocessing and Distributed Data
Contemporary parallel computers are built from autonomous processors with some local memory. Processors access their local memory rapidly, or other processors memory much more slowly. Programs should be arranged to minimize non-local accesses.
ProcessorsMemoryareas
Excessive communication Computation divided into parallel
subcomputations, each computation involving operands from multiple processors.
Ideal decomposition All operands of individual subcomputation
on single processor; operands for different tasks on different processors.
Placement of data and computation HPF allows intricate control over
placement of array elements, through mapping directives.
HPF 1.0 allows no direct control over placement of computation. Compiler should infer good place to compute, according to location of operands (eg, “owner computes” heuristic).
Processor arrangements Abstract, program-level representation
of processor set over which data is distributed.
Directive syntax similar to Fortran array declaration:!HPF$ PROCESSORS P(10)
Set P contains 10 processors. Processor arrangements can be multi-
dimensional:!HPF$ PROCESSORS Q(4, 4)
Templates Abstract array shape, to which actual
data can be aligned. (Usually much finer grain than processor arrangements.)
Declaration syntax:!HPF$ TEMPLATE T(50, 50, 50)
Templates distributed over processor arrangements using DISTRIBUTE directive. Following examples assume:!HPF$ PROCESSORS P1(4)!HPF$ TEMPLATE T1(17)
Distributing a multidimensional template
Distribution formats can be mixed:!HPF$ PROCESSORS P2(4, 3)!HPF$ TEMPLATE T2(17, 20)!HPF$ DISTRIBUTE T2(CYCLIC, BLOCK) ONTO P2
Some dimensions may be serial, or collapsed:!HPF$ DISTRIBUTE T2(BLOCK, *) ONTO P1
LU Decomposition algorithm
REAL A(N, N)
DO K = 1, N – 1 DO J = K + 1, N A(K, J) = A(K, J) / A(K, K) DO I = K + 1, N A(I, J) = A(I, J) – A(I, K) * A(K, J) ENDDO ENDDO ENDDO
Parallel LU Decomposition REAL A(N, N) REAL COL(N), ROW(N)
DO K = 1, N – 1 COL(K:N) = A(K:N, K) A(K, K+1:N) = A(K, K+1:N) / COL(K) ROW(K+1:N) = A(K, K+1:N) FORALL (I = K+1:N, J = K+1:N) & A(I, J) = A(I, J) – COL(I) * ROW(J) ENDDO
Simple align directive
Create template matching principal array (A), then ALIGN the array to the template:!HPF$ TEMPLATE T(N, N)!HPF$ ALIGN A(I, J) WITH T(I, J)
Aligning auxiliary arrays Want to minimize communication in
largest parallel loop: FORALL (I = K+1:N, J = K+1:N) & A(I, J) = A(I, J) – COL(I) * ROW(J)
Suggests replicated alignments for COL and ROW:!HPF$ ALIGN COL(I) WITH T(I, *)!HPF$ ALIGN COL(J) WITH T(*, J)
Communication in LU example The statement
COL(K:N) = A(K:N, K)
broadcasts Kth column of A.
The statement A(K, K+1:N) = A(K, K+1:N) / COL(K)
needs no communication (reason for using COL(K) rather than A(K, K)).
The statement ROW(K+1:N) = A(K, K+1:N)
broadcasts Kth row of A.
Distribution in LU example
Block-wise distribution unattractive due to poor load balancing. Cyclic distribution preferable:
!HPF$ PROCESSORS P(NP, NP)!HPF$ DISTRIBUTE T(CYCLIC, CYCLIC) ONTO P
LU example with directives!HPF$ PROCESSORS P(NP, NP)!HPF$ TEMPLATE T(N, N)!HPF$ DISTRIBUTE T(CYCLIC, CYCLIC) ONTO P
REAL A(N, N)!HPF$ ALIGN A(I, J) WITH T(I, J)
REAL COL(N), ROW(N)!HPF$ ALIGN COL(I) WITH T(I, *)!HPF$ ALIGN COL(J) WITH T(*, J)
DO K = 1, N – 1 COL(K:N) = A(K:N, K) A(K, K+1:N) = A(K, K+1:N) / COL(K) ROW(K+1:N) = A(K, K+1:N) FORALL (I = K+1:N, J = K+1:N) & A(I, J) = A(I, J) – COL(I) * ROW(J) ENDDO
Other alignment options
Permuted dimensions: DIMENSION B(N, N)!HPF$ ALIGN B(I, J) WITH T(J, I)
Affine intra-dimensional alignment: DIMENSION C(N/2, N/2)!HPF$ ALIGN C(I, J) WITH T(N/2 + I, 2*J)