S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures
DESCRIPTION
S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures. Mr. Henry Hoffmann, Dr. Jeremy Kepner, Mr. Robert Bond, MIT Lincoln Laboratory. 27 September 2001, HPEC Workshop, Lexington, MA.
TRANSCRIPT
MIT Lincoln Laboratory, 010927-S3p-HPEC-jvk.ppt
S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures
Mr. Henry Hoffmann, Dr. Jeremy Kepner, Mr. Robert Bond, MIT Lincoln Laboratory
27 September 2001, HPEC Workshop, Lexington, MA
This work is sponsored by the United States Air Force under Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.
Outline
• Introduction
  • Problem Statement
  • S3P Program
• Design
• Demonstration
• Results
• Summary
PCA Need: System Level Optimization
Signal Processing Application (made up of PCA components):
• Filter: XOUT = FIR(XIN)
• Detect: XOUT = |XIN| > c
• Beamform: XOUT = w*XIN
[Diagram: Applications are built from Components (A, B) running on Morphware (Hardware and Software)]
• Applications are built with components
• Components have a defined scope
  • Capable of local optimization
• The system requires global optimization
  • Not visible to components
  • Too complex to add to the application
• Need system level optimization capabilities as part of PCA
Example: Optimum System Latency
[Plots: (left) Component Latency vs. Hardware Units (N), with Beamform Latency = 2/N and Filter Latency = 1/N and the local optimum marked; (right) System Latency over Filter Hardware vs. Beamform Hardware, with the constraints Latency < 8 and Hardware < 32 and the global optimum marked]
• Simple two component system
• The local optimum fails to satisfy the global constraints
• Need a system view to find the global optimum
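The two-component example above can be sketched as a brute-force search: a purely local optimizer would size each stage in isolation, while the global optimizer searches joint allocations against both constraints at once. A minimal sketch, assuming the slide's 2/N and 1/N latency scalings and that system latency is the sum of the stage latencies (function and variable names are illustrative):

```python
def stage_latencies(n_bf, n_fi):
    # Assumed per-stage latencies from the slide: Beamform = 2/N, Filter = 1/N
    return 2.0 / n_bf, 1.0 / n_fi

def global_optimum(hw_budget, latency_bound):
    """Search all joint allocations; prefer fewer hardware units, then lower latency."""
    best_key, best_alloc = None, None
    for n_bf in range(1, hw_budget):
        for n_fi in range(1, hw_budget - n_bf + 1):
            latency = sum(stage_latencies(n_bf, n_fi))
            if latency < latency_bound:
                key = (n_bf + n_fi, latency)   # (hardware used, system latency)
                if best_key is None or key < best_key:
                    best_key, best_alloc = key, (n_bf, n_fi)
    return best_key, best_alloc

# Constraints from the slide: Latency < 8, Hardware < 32 (units assumed)
print(global_optimum(32, 8))
```

The same search generalizes to any number of stages; only the enumeration of joint allocations grows, which is exactly why S3P turns to graph algorithms later in the talk.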
System Optimization Challenge
Signal Processing Application:
• Filter: XOUT = FIR(XIN)
• Detect: XOUT = |XIN| > c
• Beamform: XOUT = w*XIN
Compute Fabric (Cluster, FPGA, SOC, …)
Optimal Resource Allocation (Latency, Throughput, Memory, Bandwidth, …)
• Optimizing to system constraints requires a two-way component/system knowledge exchange
• Need a framework to mediate the exchange and perform system level optimization
S3P Lincoln Internal R&D Program
• Parallel Signal Processing: Kepner/Hoffmann (Lincoln)
• Self-Optimizing Software: Leiserson/Frigo (MIT LCS)
• Goal: applications that self-optimize to any hardware
• Combine Lincoln Laboratory system expertise with the LCS FFTW approach
S3P brings the self-optimizing (FFTW) approach to parallel signal processing systems
S3P Framework
• The framework exploits a graph theory abstraction
• Broadly applicable to system optimization problems
• Defines clear component and system requirements
[Diagram: Algorithm Stages 1..N, each with Processor Mappings 1..M; candidate mappings are timed and verified to find the best mappings]
Outline
• Introduction
• Design
  • Requirements
  • Graph Theory
• Demonstration
• Results
• Summary
System Requirements
• Each compute stage can be mapped to different sets of hardware and timed
• Filter: XOUT = FIR(XIN)
• Detect: XOUT = |XIN| > c
• Beamform: XOUT = w*XIN
An application must be:
• Decomposable into Tasks (computation) and Conduits (communication)
• Mappable to different sets of hardware
• Measurable in the resource usage of each mapping
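The three requirements can be phrased as a tiny interface sketch (hypothetical class and method names, not the actual PVL API): tasks are the decomposable units, each carrying a set of candidate mappings and a measured time per mapping.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One decomposable compute stage (Conduits between tasks are analogous)."""
    name: str
    mappings: list = field(default_factory=list)   # Mappable: candidate processor sets
    timings: dict = field(default_factory=dict)    # Measurable: mapping -> time (sec)

    def add_mapping(self, procs):
        self.mappings.append(tuple(procs))

    def record_time(self, procs, seconds):
        self.timings[tuple(procs)] = seconds

# Decomposable: the application is a chain of Tasks joined by Conduits.
pipeline = [Task("Beamform"), Task("Filter"), Task("Detect")]
pipeline[0].add_mapping([0, 1])
pipeline[0].record_time([0, 1], 1.5)
print(pipeline[0].timings)   # {(0, 1): 1.5}
```

Anything satisfying this shape, whatever its real implementation, gives the optimizer enough to build and search the system graph described next.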
System Graph
Beamform Filter Detect
• A node is a unique mapping of a task
• An edge is a conduit between a pair of task mappings
• The System Graph can store the hardware resource usage of every possible Task and Conduit
Path = System Mapping
Beamform Filter Detect
• Each path is a complete system mapping
• The "best" path is the optimal system mapping
• The graph construct is very general and widely used for optimization problems
• Many efficient techniques exist for choosing the "best" path (under constraints), such as Dynamic Programming
Example: Maximize Throughput
Beamform Filter Detect
• A node stores the task time for each mapping
• An edge stores the conduit time for a given pair of mappings
• Goal: maximize throughput and minimize hardware
• Choose the path with the smallest bottleneck that satisfies the hardware constraint
[Graph: candidate mappings per task with measured node times (e.g. 1.5, 3.0, 6.0, … 16.0) and conduit times; more hardware reduces a task's time]
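A brute-force version of this selection can be sketched as follows: enumerate every path through the system graph, score each by its bottleneck (the slowest task or conduit on the path), and keep the minimum. The times below are illustrative, not the slide's measurements.

```python
from itertools import product

def best_throughput_path(stage_times, conduit_time):
    """stage_times: one dict per stage, mapping -> task time.
    conduit_time(stage_index, map_a, map_b): conduit time between mappings.
    Returns (bottleneck, mappings) minimizing the slowest step."""
    best = (float("inf"), None)
    for path in product(*(list(s) for s in stage_times)):
        steps = [stage_times[i][m] for i, m in enumerate(path)]
        steps += [conduit_time(i, path[i], path[i + 1]) for i in range(len(path) - 1)]
        bottleneck = max(steps)
        if bottleneck < best[0]:
            best = (bottleneck, path)
    return best

# Two stages, mappings keyed by processor count; doubling CPUs halves task time,
# with a flat 2.0-unit conduit cost between stages (all values assumed).
stage_times = [{1: 6.0, 2: 3.0}, {1: 16.0, 2: 8.0}]
print(best_throughput_path(stage_times, lambda i, a, b: 2.0))
```

Exhaustive enumeration is exponential in the number of stages, which motivates the dynamic programming and Dijkstra formulations on the next slide.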
Path Finding Algorithms
• The graph construct is very general and widely used for optimization problems
• Many efficient techniques exist for choosing the "best" path (under constraints), such as Dijkstra's Algorithm and Dynamic Programming
N = total hardware units, M = number of tasks, Pi = number of mappings for task i
Dynamic Programming:
  t = M
  pathTable[M][N] = all infinite weight paths
  for( j : 1..M ){
    for( k : 1..Pj ){
      for( i : j+1..N-t+1 ){
        if( i - size[k] >= j ){
          if( j > 1 ){
            w = weight[pathTable[j-1][i-size[k]]] + weight[k]
                + weight[edge[last[pathTable[j-1][i-size[k]]], k]]
            p = addVertex[pathTable[j-1][i-size[k]], k]
          } else {
            w = weight[k]
            p = makePath[k]
          }
          if( weight[pathTable[j][i]] > w ){
            pathTable[j][i] = p
          }
        }
      }
    }
    t = t - 1
  }
Dijkstra's Algorithm:
  Initialize graph G
  Initialize source vertex s
  Store all vertices of G in a minimum priority queue Q
  while( Q is not empty ){
    u = pop[Q]
    for( each vertex v adjacent to u ){
      w = u.totalPathWeight() + weight of edge <u,v> + v.weight()
      if( v.totalPathWeight() > w ){
        v.totalPathWeight() = w
        v.predecessor() = u
      }
    }
  }
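The Dijkstra variant above differs from the textbook algorithm only in that nodes (task mappings) carry weights as well as edges (conduits). A runnable sketch under that reading, with an illustrative two-stage graph (all names and values are assumptions, not the slide's data):

```python
import heapq

def best_path(node_wt, edges, sources, sinks):
    """Dijkstra over a system graph where both nodes and edges are weighted."""
    dist = {v: float("inf") for v in node_wt}
    pred = {}
    pq = []
    for s in sources:                       # a path's weight includes its first node
        dist[s] = node_wt[s]
        heapq.heappush(pq, (dist[s], s))
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue                        # stale queue entry
        for v, w_edge in edges.get(u, []):
            w = d + w_edge + node_wt[v]     # edge weight plus next node's weight
            if w < dist[v]:
                dist[v], pred[v] = w, u
                heapq.heappush(pq, (w, v))
    end = min(sinks, key=lambda v: dist[v])
    path = [end]
    while path[-1] in pred:                 # walk predecessors back to a source
        path.append(pred[path[-1]])
    return dist[end], path[::-1]

# Two Beamform mappings (B1, B2) feeding two Filter mappings (F1, F2).
nodes = {"B1": 2.0, "B2": 1.0, "F1": 4.0, "F2": 3.0}
edges = {"B1": [("F1", 1.0), ("F2", 2.0)], "B2": [("F1", 2.0), ("F2", 1.0)]}
print(best_path(nodes, edges, ["B1", "B2"], ["F1", "F2"]))
```

Note this minimizes the summed path weight, which suits the latency objective; the throughput objective replaces the sum with the path's maximum (bottleneck) as shown earlier.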
S3P Inputs and Outputs
[Diagram: the Application plus Hardware Information, Algorithm Information, and System Constraints (some required, some optional) feed the S3P Framework, which produces the "best" system mapping]
• Can flexibly add information about
  • Application
  • Algorithm
  • System
  • Hardware
Outline
• Introduction
• Design
• Demonstration
  • Application
  • Middleware
  • Hardware
  • S3P
• Results
• Summary
S3P Demonstration Testbed
Multi-Stage Application: Input → Low Pass Filter → Beamform → Matched Filter
Middleware (PVL): Map, Task, Conduit
Hardware: Workstation Cluster
S3P Engine
Multi-Stage Application
[Diagram of the processing chain:]
• Input: produces XIN
• Low Pass Filter: applies FIR1 (weights W1) and FIR2 (weights W2) to XIN
• Beamform: XOUT = mult(XIN, W3)
• Matched Filter: applies FFT, multiplication by weights W4, and IFFT to XIN
Features
• "Generic" radar/sonar signal processing chain
• Utilizes key kernels (FIR, matrix multiply, FFT, and corner turn)
• Scalable to any problem size (fully parameterized algorithm)
• Self-validates (built-in target generator)
Parallel Vector Library (PVL)
[Diagram: Signal Processing & Control Mapping]
PVL classes, their roles, and the parallelism each supports:
• Matrix/Vector: used to perform matrix/vector algebra on data spanning multiple processors (Data parallelism)
• Computation: performs signal/image processing functions on matrices/vectors, e.g. FFT, FIR, QR (Data & Task parallelism)
• Task: supports algorithm decomposition, i.e. the boxes in a signal flow diagram (Task & Pipeline parallelism)
• Conduit: supports data movement between tasks, i.e. the arrows in a signal flow diagram (Task & Pipeline parallelism)
• Map: specifies how Tasks, Matrices/Vectors, and Computations are distributed on processors (Data, Task & Pipeline parallelism)
• Grid: organizes processors into a 2D layout
• Simple mappable components support data, task, and pipeline parallelism
Hardware Platform
• Network of 8 Linux workstations
  – Dual 800 MHz Pentium III processors
• Communication
  – Gigabit Ethernet, 8-port switch
  – Isolated network
• Software
  – Linux kernel release 2.2.14
  – GNU C++ Compiler
  – MPICH communication library over TCP/IP
Advantages
• Software tools
• Widely available
• Inexpensive (high Mflops/$)
• Excellent rapid prototyping platform
Disadvantages
• Non real-time OS
• Non real-time messaging
• Slower interconnect
• Difficult to model
• Erratic SMP behavior
S3P Engine
[Diagram: Hardware Information, Algorithm Information, System Constraints, and the Application Program feed the S3P Engine (Map Generator → Map Timer → Map Selector), which outputs the "best" system mapping]
• The Map Generator constructs the system graph for all candidate mappings
• The Map Timer times each node and edge of the system graph
• The Map Selector searches the system graph for the optimal set of maps
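The three phases can be sketched end-to-end as a toy pipeline (hypothetical names throughout; the timer here is a synthetic cost model rather than a real measurement, and the selector simply minimizes each stage's time instead of searching whole paths under global constraints):

```python
def generate_maps(stages, max_procs):
    # Map Generator: every stage may be mapped to 1..max_procs processors
    return {s: list(range(1, max_procs + 1)) for s in stages}

def time_maps(candidates, timer):
    # Map Timer: measure (here: model) every candidate mapping
    return {s: {n: timer(s, n) for n in ns} for s, ns in candidates.items()}

def select_best(timings):
    # Map Selector: pick the fastest mapping per stage (the real selector
    # searches the system graph under global hardware constraints)
    return {s: min(ts, key=ts.get) for s, ts in timings.items()}

stages = ["Input", "LowPassFilter", "Beamform", "MatchedFilter"]
work = {"Input": 1.0, "LowPassFilter": 4.0, "Beamform": 2.0, "MatchedFilter": 8.0}
timings = time_maps(generate_maps(stages, 4), lambda s, n: work[s] / n)
print(select_best(timings))
```

Separating generation, timing, and selection is the design point: the same selector can consume either measured times (as in the demonstration) or simulated times (as in the 128-CPU results later).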
Outline
• Introduction
• Design
• Demonstration
• Results
  • Simulated/Predicted/Measured
  • Optimal Mappings
  • Validation and Verification
• Summary
Optimal Throughput
Input → Low Pass Filter → Beamform → Matched Filter
• Vary the number of processors (1-4 CPUs) used on each stage
• Time each computation stage and communication conduit
• Find the path with the minimum bottleneck
[Table: measured stage and conduit times for each CPU count; the best mapping achieves 30 msec (1.6 MHz BW), and with more hardware 15 msec (3.2 MHz BW)]
S3P Timings (4 cpu max)
Tasks: Input, Low Pass Filter, Beamform, Matched Filter; mappings of 1-4 CPUs
• Graphical depiction of timings (wider is better)
S3P Timings (12 CPU max, wider is better)
Tasks: Input, Low Pass Filter, Beamform, Matched Filter; mappings of 2-12 CPUs
• The large amount of data requires an algorithm to find the best path
Predicted and Achieved Latency(4-8 cpu max)
• Find the path that produces minimum latency for a given number of processors
• Excellent agreement between S3P predicted and achieved latencies
[Plots: Latency (sec) vs. Maximum Number of Processors for the Large (48x128K) and Small (48x4K) problem sizes]
Predicted and Achieved Throughput(4-8 cpu max)
• Find the path that produces maximum throughput for a given number of processors
• Excellent agreement between S3P predicted and achieved throughput
[Plots: Throughput (pulses/sec) vs. Maximum Number of Processors for the Large (48x128K) and Small (48x4K) problem sizes]
SMP Results (16 cpu max)
• SMP overstresses Linux real-time capabilities
• Poor overall system performance
• Divergence between predicted and measured throughput
[Plot: Throughput (pulses/sec) vs. Maximum Number of Processors for the Large (48x128K) problem size]
Simulated (128 cpu max)
• The simulator allows exploration of larger systems
[Plots: Throughput (pulses/sec) and Latency (sec) vs. Maximum Number of Processors for the Small (48x4K) problem size]
Reducing the Search Space: Algorithm Comparison
• Graph algorithms provide baseline performance
• Hill Climbing performance varies as a function of initialization and neighborhood definition
• The preprocessor outperforms all other algorithms
[Plot: Number of Timings Required vs. Maximum Number of Processors for each algorithm]
Future Work
• Program area
  – Determine how to incorporate global optimization into other middleware efforts (e.g. PCA, HPEC-SI, …)
• Hardware area
  – Scale and demonstrate on a larger/real-time system: the HPCMO Mercury system at WPAFB; expect even better results than on the Linux cluster
  – Apply to parallel hardware (RAW)
• Algorithm area
  – Exploit ways of reducing the search space
  – Provide solution "families" via sensitivity analysis
Outline
• Introduction
• Design
• Demonstration
• Results
• Summary
Summary
• System level constraints (latency, throughput, hardware size, …) necessitate system level optimization
• The application requirements for system level optimization are:
  – Decomposable into components (input, filtering, output, …)
  – Mappable to different configurations (# processors, # links, …)
  – Measurable resource usage (time, memory, …)
• S3P demonstrates that global optimization is feasible separately from the application
Acknowledgements
• Matteo Frigo (MIT/LCS & Vanu, Inc.)
• Charles Leiserson (MIT/LCS)
• Adam Wierman (CMU)