mapping stream programs onto heterogeneous multiprocessor systems [by barcelona supercomputing...
TRANSCRIPT
![Page 1: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/1.jpg)
Mapping Stream Programs onto HeterogeneousMultiprocessor Systems
[by Barcelona Supercomputing Centre, Spain, Oct 09]
S. M. FarhadProgramming Language Group
School of Information TechnologyThe University of Sydney
1
![Page 2: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/2.jpg)
Abstract
Presents a partitioning and allocation algorithm for stream compiler Targeting heterogeneous multiprocessors with constrained
distributed memory and communication topology Introduce a novel definition of connectedness
Which enables the algorithm to model the capabilities of the compiler
The algorithm uses convexity and connectedness constraints to produce partitions Easier to compile and require short pipelines
Shows StreamIt 2.1.1 benchmarks results for an SMP, 2 X 2 mesh, SMP plus accelerator and IMP QS20 blade which are within 5% of the optimum
2
![Page 3: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/3.jpg)
Motivation
Recent trends: an increasing number of on-chip processor cores Intel IXP2850 [16], TI OMAP [7], Nexperia Home Platform [9],
and ST Nomadik Programs using explicit threads have to be tuned to match
the target The interesting thing is to automatically map a portable
program onto a heterogeneous multiprocessor Many multimedia and radio applications contain abundant
task and data parallelism But it is hard to extract from C code Stream languages represent the application as
independent actors communicating via point-to-point streams
3
![Page 4: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/4.jpg)
Overview
The input to the partitioning algorithm is the program graph of kernels and streams
The output is the mapping to fuse kernels to tasks, and to allocate tasks to processors
The difference between kernels and tasks, Kernels: the work function of an actor Tasks: which are present in the executable, contain
multiple kernels and are scheduled on a known processor The algorithm requires
The average data rate on each stream, and (SDF: statically)
The average load of each kernel on each processor (profiling)
4
![Page 5: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/5.jpg)
The ACOTES Stream Compiler This work is part of the ACOTES European
project Developing an open source stream compiler for
embedded systems Will automatically map a stream program onto a
multicore system Program will be written in SPM: an annotated C
programming language The target is represented using the ASM:
supports heterogeneous platforms
![Page 6: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/6.jpg)
The Mapping Phase
Blockingand splitting
Partitioning
Software pipeliningand queue length
assignment
Initial Partitioning
Merge tasks
Move bottlenecks
Create tasks
Reallocate tasks
A polyhedral optimization pass unrolls to aggregate computation
and communication, andsplits stateless kernels for greater
parallelism
Partitioning algorithm described in this paper
Performs softwarepipelining and allocates memory for the stream buffers
![Page 7: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/7.jpg)
Convex Connected Partitions
Convexity: a partition is convex if the graph of dependencies between tasks is acyclic
Equivalently, every directed path between two kernels in the same task is internal to that task
The convexity constraint is for Avoiding long software pipelines
Require long pipelines for a small increase in throughput (unaware of pipelining cost)
![Page 8: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/8.jpg)
8
Quick Review: Software Pipelining
RectPolar
RectPolar
RectPolar
RectPolar
Prologue
New Steady
State
• New steady-state is free of dependencies
![Page 9: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/9.jpg)
The number of pipeline stages
![Page 10: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/10.jpg)
Convex Partition
Data flowis from processor p1 (black) to p2 (grey) and p3 (white),
and from p2 to p3—an acyclic graph
![Page 11: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/11.jpg)
![Page 12: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/12.jpg)
Connectedness
The connectedness constraint To help code
generation It is easier to fuse
adjacent kernels
If k2 and k3 are fused into one task, then the entire graph must be fused
![Page 13: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/13.jpg)
Connectedness Contd.
A naïve definition: considers a partition to be connected when each processor has a weakly connected subgraph
Unfortunately, wide split-joins, as in filterbank, do not usually have good partitions subject to this constraint
Because strict connectedness Fig (c) performs 28% worse than Fig (a)
Generalize connectivity as a set of basic connected sets [Joseph’06]
![Page 14: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/14.jpg)
Formalization of the Problem
The target is represented as an undirected bipartite graph H = (V, E), where
is the set of vertices, a disjoint union of processors, P, and interconnects, I and E is the set of edges
Processor weight: wp Clock speed in GHz Interconnect weight: wu is bandwidth in GB/s Static route between processor p and q: , if
it uses interconnect u, and 0 otherwise In general,
14
IPV
1upqr
uqp
upq rr
![Page 15: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/15.jpg)
Topology of the Targets
15
Processorsa and b unable to
execute code; therefore
= wa = wb = 0
![Page 16: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/16.jpg)
Formalization of the Problem Contd. The program is represented as a directed
acyclic graph, G = (K, S), where K is the set of kernels, and S is the set of
streams Load of kernel i on processor p, denoted cip,
is the mean number of gigacycles in some fixed time period tau
Load of stream ij, denoted cij is the mean number of gigabytes transferred in time tau
16
![Page 17: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/17.jpg)
Formalization of the Problem Contd. The output of the algorithm is two map
functions Firstly, T maps kernels onto tasks, and Secondly, P maps tasks onto processors
The partition implied by T must be convex, so the graph of dependencies between tasks is acyclic
17
![Page 18: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/18.jpg)
Formalization of the Problem Contd. The cost on processor p or interconnect u is
The goal is to find the allocation (T, P), which minimises the maximum values of all the Cp and Cu
Subject to the convexity and connectedness constraints
18
qp
p
KjKiu
ij
Pqp
upq
u
Kip
ipp
w
crC
w
cC
,,
![Page 19: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/19.jpg)
Partitioning the Target
The algorithm First divides the target into two subgraphs, P1 and
P2, and An aggregate interconnect, I, balancing two
objectives: The subgraphs should have roughly equal total CPU
performance, and The aggregate interconnect bandwidth between them
should be low
19
![Page 20: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/20.jpg)
Partitioning the Target Cond.
Communications bottleneck
The target is divided into halves to maximise alpha
20
21 ,max
PqPpu
uqp
upq
Iu w
rrC
21
,minPq
q
Pp
p wwC
They find an approximate solution using a variant of theKernighan and Lin partitioning algorithm
![Page 21: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/21.jpg)
Partitioning the Program
The program graph is given edge and vertex weights
The edge weight for stream ij, denoted cij is the cost in cycles in time tau, if assigned to the aggregate interconnect
The vertex weight for kernel i is a pair (ciP1, ciP2), The goal is to find a two-way partition {K1,K2} to
minimize the bottleneck given by:
21
212
2
1
1
,
,,maxKjKiij
Kj
jP
Ki
iP cccc
![Page 22: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/22.jpg)
Partitioning the Program Contd. The partitioning algorithm is a branch and bound
search Each node in the search tree inherits a partial
partition (K1, K2), and unassigned vertices X; at the root K1 = K2 = 0 and X = K
The minimal cost, cK1K2 , for all partitions in the subtree rooted at node (K1,K2)
22
2121
21
,
,,maxKjKiij
Ki
iB
Ki
iAKK ccclbc
1111
21
,
,,maxKjKiij
Ki
iB
Ki
iAKK cccubc
![Page 23: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/23.jpg)
The Branch and Bound Search
23
qp
p
KjKiu
ij
Pqp
upq
u
Kip
ipp
w
crC
w
cC
,,
2121
21
,
,,maxKjKiij
Ki
iB
Ki
iAKK ccclbc
1111
21
,
,,maxKjKiij
Ki
iB
Ki
iAKK cccubc
![Page 24: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/24.jpg)
Refinement of the Partition
The refinement stage starts with a valid initial partition Merge tasks Move bottlenecks Create tasks Reallocate tasks
24
![Page 25: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/25.jpg)
Evaluation
25
![Page 26: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/26.jpg)
Convergence of the Refinement Phase
26
![Page 27: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/27.jpg)
Summary
A fast and robust partitioning algorithm for an iterative stream compiler
The algorithm maps an unstructured variable data rate stream program onto a heterogeneous multiprocessor system with any communications topology
The algorithm favours convex connected partitions, which do not require software pipelining and are easy to compile
The performance is, on average, within 5% of the optimum performance
27
![Page 28: Mapping Stream Programs onto Heterogeneous Multiprocessor Systems [by Barcelona Supercomputing Centre, Spain, Oct 09] S. M. Farhad Programming Language](https://reader035.vdocument.in/reader035/viewer/2022062716/56649e115503460f94afcb3e/html5/thumbnails/28.jpg)
Question?
Thank you
28