
Plan of lectures

1. Development of data parallel programming. Historical view of languages up to HPF.
2. Issues in translation of HPF. Simple examples. Distributed array descriptors.
3. Communication in data parallel languages. Communication patterns. Runtime libraries.
4. An “HPspmd” programming model. Motivations. Introduction to HPJava.
5. Java for HP computing. Java Grande. mpiJava.

Lecture materials

http://www.npac.syr.edu/projects/pcrc/HPJava/beijing.html

Development of Data-Parallel Programming

Bryan Carpenter

NPAC at Syracuse University, Syracuse, NY
[email protected]

Goals of this lecture

Review the historical development of data-parallel programming, up to and including the emergence of High Performance Fortran.

Illustrate the evolution of ideas by describing some major practical programming languages for data parallelism.

Contents of Lecture

Overview. Need for massive parallelism: SIMD, MIMD, SPMD.

Early data-parallel languages. ILLIAC IV CFD and DAP FORTRAN.

Later SIMD languages. Fortran 90, Connection Machine Fortran, related languages.

High Performance Fortran. Multi-processing and distributed data. Processor arrangements and templates, alignment. Programming examples.

Moore’s law and massive parallelism

Microprocessors double in speed every eighteen months.

To outpace the growth in power of sequential computers, parallelism had better offer orders of magnitude better performance.

Moderate parallelism cannot give this speedup—massive parallelism or bust!

Niches for massive parallelism today

Dedicated clusters of commodity processors. Government labs, internet servers, Pixar Renderfarm, …

Harvesting cycles of processors distributed over the internet.

Two principal forms of massive parallelism: task farming, and data parallelism.

Data parallelism

Large data structures, typically arrays, split across computational nodes. Each node computes a local patch of the arrays. Typically some intermediate results are needed from other nodes, but communication should not dominate.

Logical model of fine-grained parallelism—at the level of array elements, rather than tasks.

Programming languages for massive parallelism

Task farming adequately expressed in conventional languages: task = function call; libraries handle communication + load-balancing.

Data parallelism can be expressed in conventional languages + MPI (say), but historically higher level languages have been quite successful. Potentially avoid the problems of general concurrent programming (non-determinism, synchronization, deadlock, …).

Why special languages for data-parallelism?

Originally devised for Single Instruction, Multiple Data (SIMD) computers.

Specialized architectures demanded special language features.

New abstraction: the distributed array.

Central control avoided problems of concurrent programming—concentrate on expressing parallel algorithms.

Advent of MIMD computers

By the 90s, SIMD lost ground—general purpose microprocessors too cheap, powerful. Multiple Instruction, Multiple Data (MIMD) computers could be built from commodity components.

Still want to avoid the complexity of general concurrent programming. Single Program, Multiple Data (SPMD)—now a programming model rather than a computer architecture.

Similarities to SIMD programming suggest similar language ideas should apply.

The road to HPF

High Performance Fortran took ideas from SIMD languages—embedded them in a standardized language for programming MIMD computers in SPMD style.

The next section of this lecture will review historically important SIMD languages—learn where the ideas came from.

ILLIAC IV

ILLIAC IV development started late 60s. Fully operational 1975. Illinois, then NASA Ames.

SIMD computer for array processing. Control Unit + 64 Processing Elements. 2K words memory per PE.

CU can access all memory. PEs can access local memory and communicate with neighbours.

CU reads program text—broadcasts instructions to PEs.

Architecture of the ILLIAC IV

ILLIAC IV CFD

Software was problematic for ILLIAC. The CFD language was developed at the Computational Fluid Dynamics Branch of Ames.

“Fortran-like” rather than a true FORTRAN dialect.

Deliberately no attempt to hide hardware peculiarities—on the contrary, tried to give the programmer direct access to all features of the hardware.

Data types in CFD

Basic types included CU REAL, CU INTEGER, PE REAL, PE INTEGER.

Ordinary arrays can be in PE or CU memory:

      CU REAL A, B(100)
      PE INTEGER I
      PE REAL D(25), E(1000)

Vector-aligned arrays in PE memory:

      PE INTEGER J(*)
      PE REAL X(*,4), Y(*,2,8)

Vector-aligned arrays in CFD

Early language instantiation of the Distributed Array concept.

Only the first dimension can be distributed. Extent of the distributed dimension is exactly 64.

J(1) stored on first PE, J(2) on second PE, etc.

X(1,1),…,X(1,4) on first PE, X(2,1),…,X(2,4) on second PE, etc.

Parallel computation in CFD

Parallel computation only in vector assignments. Subscript is *.

Communication between PEs by adding a shift to *:

      DIFP(*) = P(* + 1) - P(* - 1)

Conditional vector assignment:

      IF(A(*) .LT. 0) A(*) = -A(*)

PEs can locally access different locations:

      DIAG(*) = RHO(*, X(*))

Vector subscript must be in the sequential dimension.
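For readers who want to run something equivalent today, here is a rough analogue of these CFD idioms (an illustration only, not from the slides, and jumping ahead to the Fortran 90 array syntax introduced later in this lecture). The program name and data are invented; the distributed 64-element vector is simulated by an ordinary array.

PROGRAM cfd_analogue
  ! Rough Fortran 90 analogue of the CFD fragments above (CFD itself ran
  ! only on the ILLIAC IV).  The * subscript becomes an explicit section.
  IMPLICIT NONE
  INTEGER, PARAMETER :: NPE = 64
  REAL :: P(0:NPE+1), DIFP(NPE), A(NPE)
  INTEGER :: I

  P = (/ (REAL(I), I = 0, NPE+1) /)
  A = (/ (REAL(I - 32), I = 1, NPE) /)

  ! DIFP(*) = P(*+1) - P(*-1) becomes a shifted-section expression.
  DIFP = P(2:NPE+1) - P(0:NPE-1)

  ! IF(A(*) .LT. 0) A(*) = -A(*) becomes a masked (WHERE) assignment.
  WHERE (A .LT. 0.0) A = -A

  PRINT *, MAXVAL(DIFP), MINVAL(A)
END PROGRAM cfd_analogue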

Other ILLIAC languages

Glypnir: Algol-like rather than Fortran-like. swords (“super-words”) rather than explicit parallel arrays.

IVTRAN: higher-level Fortran dialect. Scheme for mapping general arrays to PEs. Parallel loop similar to FORALL. Compiler never fully debugged.

ICL DAP FORTRAN

ICL DAP (Distributed Array Processor)—early commercial parallel computer. Available early 80s.

Trend to many, simple PEs: 64 × 64 grid of 4096 bit-serial processors.

FP performance much lower than ILLIAC, but good performance in other domains.

Module of ICL 2900 mainframe computer—shared memory and other facilities with host.

Programmed in DAP FORTRAN.

Constrained arrays in DAP FORTRAN

First one or two extents could be omitted:

      DIMENSION A(), BB(,)
      INTEGER II(,)
      REAL AA(,3), BBB(,,5)

A is a vector. BB and II are matrices. AA is an array of vectors. BBB is an array of matrices.

Extents of constrained dimensions are 64 again (vectors of 4096 also possible).

Array assignments in DAP FORTRAN

Leave subscripts empty for an array expression. + or - subscript used for nearest-neighbour access:

      U(,) = 0.25 * (U(,-) + U(,+) + U(-,) + U(+,))

Masked assignment by using a logical matrix as subscript:

      LOGICAL L(,)
      ...
      L(,) = BB(,) .LT. 0
      BB(L) = -BB(,)

Standardized array processing: Fortran 90

By the mid 80s, SIMD and vector computers looked well-established. The Fortran 8X committee saw a need to make array assignments standard language features.

Fortran 90 (as it eventually became) added many new features to support modern software engineering. It also added new “array syntax”, including array assignments, array sections, and transformational intrinsics.

Fortran 90 array assignments

Rank-2 arrays, shape (5, 10):

      REAL A(5, 10), B(5, 10), C(5, 10)

Array expressions involve conforming arrays:

      A + B, B * C, A + (B * C)

Arrays conform if they have the same shape.

B * C, for example, stands for the element-wise array

      B(1,1) * C(1,1)   . . .   B(1,10) * C(1,10)
              · · ·
      B(5,1) * C(5,1)   . . .   B(5,10) * C(5,10)

Scalars in array expressions

Scalars conform with any array:

      REAL D

The expression C + D stands for

      C(1,1) + D   . . .   C(1,10) + D
              · · ·
      C(5,1) + D   . . .   C(5,10) + D

Array sections

Array subscripts can be triplets: X(1:10:3) stands for the subarray (X(1), X(4), X(7), X(10)).

Special cases of triplets:

X(1:4), stride defaults to 1: (X(1), X(2), X(3), X(4)).

X(:) selects the whole of an array dimension, as originally declared. Most useful for multi-dimensional arrays—here X(:) is just equivalent to X.

Array assignments

Simple assignment:

      U(2:N-1) = 0.5 * (U(1:N-2) + U(3:N))

The right-hand-side expression must conform with the left-hand-side variable.

Conditional assignment:

      WHERE (A .LT. 0.0) A = -A

The condition must be a logical array that conforms with the left-hand-side of the assignment.
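The fragments above can be collected into a complete program. The following minimal sketch (not from the slides; the program name and data are invented for illustration) exercises section assignment and WHERE, and compiles with any Fortran 90 compiler.

PROGRAM array_syntax_demo
  ! Minimal sketch illustrating Fortran 90 array syntax: triplet
  ! sections, whole-array expressions, and masked (WHERE) assignment.
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 10
  REAL :: U(N), A(N)
  INTEGER :: I

  DO I = 1, N
    U(I) = REAL(I)
    A(I) = REAL(I - 5)        ! A mixture of negative and positive values.
  END DO

  ! Section assignment: interior points become averages of their neighbours.
  U(2:N-1) = 0.5 * (U(1:N-2) + U(3:N))

  ! Conditional assignment: take absolute values of the negative elements.
  WHERE (A .LT. 0.0) A = -A

  PRINT *, 'U =', U
  PRINT *, 'A =', A
END PROGRAM array_syntax_demo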

Array intrinsics

Fortran 90 added many transformational intrinsics—operations on whole arrays.

CSHIFT, EOSHIFT: shift arrays in a dimension.

TRANSPOSE: transposes an array.

RESHAPE: general reshaping of an array.

SUM, PRODUCT, MAXVAL, MINVAL, . . . : reduction operations that take an array and return a scalar or reduced-rank array.
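A small sketch (again not from the slides; names are invented) showing a few of these intrinsics in action, with the expected values in the comments.

PROGRAM intrinsics_demo
  ! Exercises RESHAPE, TRANSPOSE, CSHIFT and some reductions.
  IMPLICIT NONE
  REAL :: A(3, 4), V(5)
  INTEGER :: I

  A = RESHAPE((/ (REAL(I), I = 1, 12) /), (/ 3, 4 /))  ! Columns 1..3, 4..6, ...
  V = (/ (REAL(I), I = 1, 5) /)                        ! (1, 2, 3, 4, 5)

  PRINT *, SHAPE(TRANSPOSE(A))     ! 4 3
  PRINT *, CSHIFT(V, 1)            ! 2.0 3.0 4.0 5.0 1.0
  PRINT *, SUM(A)                  ! 78.0 (scalar reduction)
  PRINT *, SUM(A, DIM=1)           ! 6.0 15.0 24.0 33.0 (rank-1 result)
  PRINT *, MAXVAL(V)               ! 5.0
END PROGRAM intrinsics_demo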

The Connection Machine

Thinking Machines Corp founded 1983 to supply connectionist computers for AI. In practice, their Connection Machine series was used largely for scientific computing.

CM-2 launched 1986: SIMD architecture, bit-serial PEs like the DAP. Up to 65536 PEs connected on a hypercube network. Floating point coprocessors could give peak performance of 28 GFLOPS.

Programmed initially in *LISP, then C*. By 1990 there was a CM Fortran compiler.

CM Fortran

Extended from FORTRAN 77. Included all the new array syntax of Fortran 90. FORALL statement. Array distribution directives.

Mapping arrays in CM Fortran

Arrays can be allocated on the front-end (VAX, Sun, etc.) or on the CM itself.

By default an array is a CM array if it is ever used in an F90 array assignment in the current procedure; it is a front-end array if it is only ever used in F77 style.

Distributed arrays (“CM arrays”) can have any shape.

Layout of CM arrays

Arrays mapped to Virtual Processor sets (VP sets), one array element per VP.

By default the VP set has the minimal shape such that: the VP grid is large enough to hold the array; extents of the VP grid are powers of 2; the total number of VPs is a multiple of the number of PEs.

Map the first array element to the first VP. Disable or ignore VPs not needed to hold elements.

LAYOUT directives in CM Fortran

The LAYOUT directive can override defaults. Serial dimensions can be selected:

      REAL A(10, 64, 100)
CMF$  LAYOUT A(:SERIAL, :NEWS, :NEWS)

A aligned with B:

      REAL B(64, 100)
CMF$  LAYOUT B(:NEWS, :NEWS)

All dimensions serial implies a front-end array.

LAYOUT directives can appear in procedure interface blocks.

ALIGN directives in CM Fortran

Can explicitly specify alignment relations:

      REAL V(100), B(64, 100)
CMF$  ALIGN V(I) WITH B(1, I)

Offset alignments are also allowed:

      REAL C(32, 50)
CMF$  ALIGN C(I, J) WITH B(I+5, J+2)

More general alignments, eg transposed, are not allowed.

Layouts and alignments in CM Fortran

FORALL

Parallelism must be explicit. Eg in F90 style:

      U(2:N-1, 2:N-1) =                                  &
          0.25 * (U(2:N-1, 1:N-2) + U(2:N-1, 3:N)        &
                + U(1:N-2, 2:N-1) + U(3:N, 2:N-1))

When array syntax is unwieldy, can use the FORALL statement:

      FORALL (I = 2:N-1, J = 2:N-1)                      &
        U(I, J) = 0.25 * (U(I, J-1) + U(I, J+1) +        &
                          U(I-1, J) + U(I+1, J))
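To see that the two forms really are equivalent, here is a self-contained sketch (not from the slides; names invented) that applies the same relaxation step once with array syntax and once with FORALL, then compares the results.

PROGRAM forall_demo
  ! One Jacobi-style relaxation step, written with array sections (on V)
  ! and with FORALL (on U).  Both forms evaluate all right-hand sides
  ! before assigning, so the results are identical.
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 8
  REAL :: U(N, N), V(N, N)
  INTEGER :: I, J

  U = 0.0
  U(1, :) = 1.0                ! Fix one boundary edge.
  V = U

  V(2:N-1, 2:N-1) =                                     &
      0.25 * (V(2:N-1, 1:N-2) + V(2:N-1, 3:N)           &
            + V(1:N-2, 2:N-1) + V(3:N, 2:N-1))

  FORALL (I = 2:N-1, J = 2:N-1)                         &
    U(I, J) = 0.25 * (U(I, J-1) + U(I, J+1) +           &
                      U(I-1, J) + U(I+1, J))

  PRINT *, 'max difference =', MAXVAL(ABS(U - V))       ! Should print 0.0.
END PROGRAM forall_demo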

Languages related to CM Fortran

MasPar Fortran, from a contemporary SIMD computer corporation. Language very similar to CM Fortran. Slightly different syntax for directives.

*LISP. Original language of the CM. Data parallel dialect of Common LISP. pvars rather than distributed arrays.

C*. CM version of C, extended with syntax for distributed arrays. Introduced the shape concept, similar to templates in later HPF.

The High Performance Fortran Forum

By the early 90s, the value of portable, standardized languages was universally acknowledged.

Goal of the HPF Forum—a single language for High Performance programming. Effective across architectures—vector, SIMD, MIMD, though SPMD a focus.

Supported by most major vendors of the time: Cray, DEC, Fujitsu, HP, IBM, Intel, MasPar, Meiko, nCube, Sun, Thinking Machines.

HPF 1.0 standard published 1993.

Multiprocessing and Distributed Data

Contemporary parallel computers are built from autonomous processors with some local memory. Processors access their own local memory rapidly, and other processors’ memory much more slowly. Programs should be arranged to minimize non-local accesses.

(Figure: processors and their memory areas.)

Excessive communication

Computation divided into parallel subcomputations, each subcomputation involving operands from multiple processors.

Bad load balancing

All parallel subcomputations access operands on the same processor.

Ideal decomposition

All operands of an individual subcomputation on a single processor; operands for different tasks on different processors.

Placement of data and computation

HPF allows intricate control over placement of array elements, through mapping directives.

HPF 1.0 allows no direct control over placement of computation. The compiler should infer a good place to compute, according to the location of operands (eg, the “owner computes” heuristic).

Stages of data mapping in HPF

Processor arrangements

Abstract, program-level representation of the processor set over which data is distributed.

Directive syntax similar to a Fortran array declaration:

!HPF$ PROCESSORS P(10)

The set P contains 10 processors.

Processor arrangements can be multi-dimensional:

!HPF$ PROCESSORS Q(4, 4)

Templates

Abstract array shape, to which actual data can be aligned. (Usually much finer grain than processor arrangements.)

Declaration syntax:

!HPF$ TEMPLATE T(50, 50, 50)

Templates are distributed over processor arrangements using the DISTRIBUTE directive. The following examples assume:

!HPF$ PROCESSORS P1(4)
!HPF$ TEMPLATE T1(17)

Simple block distribution

!HPF$ DISTRIBUTE T1(BLOCK) ONTO P1

Simple cyclic distribution

!HPF$ DISTRIBUTE T1(CYCLIC) ONTO P1

Block distribution with specified block-size

!HPF$ DISTRIBUTE T1(BLOCK(6)) ONTO P1

Block-cyclic distribution

!HPF$ DISTRIBUTE T1(CYCLIC(3)) ONTO P1
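To make the four formats concrete, the following sketch (not from the slides; names invented) computes, for each of the 17 template elements, the index of the owning processor in P1 (numbered 0 to 3) under each distribution. The ownership formulas are the standard ones, with the default BLOCK size CEILING(17/4) = 5.

PROGRAM distribution_demo
  ! Owner of each element of T1(17) on P1(4) under the four distribution
  ! formats above.  Processors are numbered 0 to 3.
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 17, P = 4
  INTEGER :: I, B

  B = (N + P - 1) / P                  ! Default BLOCK size: CEILING(N/P) = 5.
  DO I = 1, N
    PRINT '(A, I3, 4(A, I2))', 'element', I,              &
      '   BLOCK:',     (I - 1) / B,                       &
      '   CYCLIC:',    MOD(I - 1, P),                     &
      '   BLOCK(6):',  (I - 1) / 6,                       &
      '   CYCLIC(3):', MOD((I - 1) / 3, P)
  END DO
END PROGRAM distribution_demo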

Distributing a multidimensional template

Distribution formats can be mixed:

!HPF$ PROCESSORS P2(4, 3)
!HPF$ TEMPLATE T2(17, 20)
!HPF$ DISTRIBUTE T2(CYCLIC, BLOCK) ONTO P2

Some dimensions may be serial, or collapsed:

!HPF$ DISTRIBUTE T2(BLOCK, *) ONTO P1

LU Decomposition algorithm

      REAL A(N, N)

      DO K = 1, N - 1
        DO J = K + 1, N
          A(K, J) = A(K, J) / A(K, K)
          DO I = K + 1, N
            A(I, J) = A(I, J) - A(I, K) * A(K, J)
          ENDDO
        ENDDO
      ENDDO

Parallel LU Decomposition

      REAL A(N, N)
      REAL COL(N), ROW(N)

      DO K = 1, N - 1
        COL(K:N) = A(K:N, K)
        A(K, K+1:N) = A(K, K+1:N) / COL(K)
        ROW(K+1:N) = A(K, K+1:N)
        FORALL (I = K+1:N, J = K+1:N)                  &
          A(I, J) = A(I, J) - COL(I) * ROW(J)
      ENDDO
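As a check that the array-syntax formulation really computes an LU factorization, here is a self-contained harness (not in the slides; names and the test matrix are invented) that runs the loop above on a small diagonally dominant matrix and verifies that L times U reproduces A. In this variant L keeps the diagonal and U has unit diagonal (a Crout-style factorization).

PROGRAM lu_demo
  ! Run the array-syntax LU loop on a small matrix and check the factors.
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 4
  REAL :: A(N, N), A0(N, N), L(N, N), U(N, N)
  REAL :: COL(N), ROW(N)
  INTEGER :: I, J, K

  CALL RANDOM_NUMBER(A)
  DO I = 1, N
    A(I, I) = A(I, I) + REAL(N)      ! Diagonal dominance: no pivoting needed.
  END DO
  A0 = A

  DO K = 1, N - 1
    COL(K:N) = A(K:N, K)
    A(K, K+1:N) = A(K, K+1:N) / COL(K)
    ROW(K+1:N) = A(K, K+1:N)
    FORALL (I = K+1:N, J = K+1:N)                      &
      A(I, J) = A(I, J) - COL(I) * ROW(J)
  END DO

  ! Unpack factors: L = lower triangle including the diagonal,
  ! U = strict upper triangle with unit diagonal.
  L = 0.0
  U = 0.0
  DO J = 1, N
    L(J:N, J) = A(J:N, J)
    U(J, J) = 1.0
    U(J, J+1:N) = A(J, J+1:N)
  END DO

  PRINT *, 'max |L*U - A| =', MAXVAL(ABS(MATMUL(L, U) - A0))
END PROGRAM lu_demo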

Simple align directive

Create a template matching the principal array (A), then ALIGN the array to the template:

!HPF$ TEMPLATE T(N, N)
!HPF$ ALIGN A(I, J) WITH T(I, J)

Aligning auxiliary arrays

Want to minimize communication in the largest parallel loop:

      FORALL (I = K+1:N, J = K+1:N)                  &
        A(I, J) = A(I, J) - COL(I) * ROW(J)

This suggests replicated alignments for COL and ROW:

!HPF$ ALIGN COL(I) WITH T(I, *)
!HPF$ ALIGN ROW(J) WITH T(*, J)

Alignment of arrays in LU example

Communication in LU example

The statement

      COL(K:N) = A(K:N, K)

broadcasts the Kth column of A.

The statement

      A(K, K+1:N) = A(K, K+1:N) / COL(K)

needs no communication (the reason for using COL(K) rather than A(K, K)).

The statement

      ROW(K+1:N) = A(K, K+1:N)

broadcasts the Kth row of A.

Distribution in LU example

Block-wise distribution unattractive due to poor load balancing. Cyclic distribution preferable:

!HPF$ PROCESSORS P(NP, NP)
!HPF$ DISTRIBUTE T(CYCLIC, CYCLIC) ONTO P
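The load-balance argument can be seen numerically. This sketch (not from the slides; names and the sizes are invented) counts, at elimination step K, how many of the still-active rows each of the NP processors in one grid dimension owns under the two distributions.

PROGRAM lu_balance_demo
  ! At step K, count ownership of the active rows K+1..N among NP
  ! processors under BLOCK and CYCLIC row distributions.
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 16, NP = 4, K = 12
  INTEGER :: OWNER_BLOCK(NP), OWNER_CYCLIC(NP)
  INTEGER :: I, B

  B = (N + NP - 1) / NP                ! Default BLOCK size.
  OWNER_BLOCK = 0
  OWNER_CYCLIC = 0
  DO I = K + 1, N
    OWNER_BLOCK((I - 1) / B + 1)     = OWNER_BLOCK((I - 1) / B + 1) + 1
    OWNER_CYCLIC(MOD(I - 1, NP) + 1) = OWNER_CYCLIC(MOD(I - 1, NP) + 1) + 1
  END DO

  PRINT *, 'active rows per processor, BLOCK : ', OWNER_BLOCK    ! 0 0 0 4
  PRINT *, 'active rows per processor, CYCLIC: ', OWNER_CYCLIC   ! 1 1 1 1
END PROGRAM lu_balance_demo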

LU example with directives

!HPF$ PROCESSORS P(NP, NP)
!HPF$ TEMPLATE T(N, N)
!HPF$ DISTRIBUTE T(CYCLIC, CYCLIC) ONTO P

      REAL A(N, N)
!HPF$ ALIGN A(I, J) WITH T(I, J)

      REAL COL(N), ROW(N)
!HPF$ ALIGN COL(I) WITH T(I, *)
!HPF$ ALIGN ROW(J) WITH T(*, J)

      DO K = 1, N - 1
        COL(K:N) = A(K:N, K)
        A(K, K+1:N) = A(K, K+1:N) / COL(K)
        ROW(K+1:N) = A(K, K+1:N)
        FORALL (I = K+1:N, J = K+1:N)                  &
          A(I, J) = A(I, J) - COL(I) * ROW(J)
      ENDDO

Other alignment options

Permuted dimensions:

      DIMENSION B(N, N)
!HPF$ ALIGN B(I, J) WITH T(J, I)

Affine intra-dimensional alignment:

      DIMENSION C(N/2, N/2)
!HPF$ ALIGN C(I, J) WITH T(N/2 + I, 2*J)

Intra-dimensional alignment example

Next Lecture: Issues in translation of High Performance Fortran

Translation of simple HPF programs to SPMD (MPI) programs.

Design of a runtime Distributed Array Descriptor.