S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures
DESCRIPTION
S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures. Mr. Henry Hoffmann, Dr. Jeremy Kepner, Mr. Robert Bond, MIT Lincoln Laboratory. 27 September 2001, HPEC Workshop, Lexington, MA.
TRANSCRIPT
MIT Lincoln Laboratory, 010927-S3p-HPEC-jvk.ppt
S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures
Mr. Henry Hoffmann, Dr. Jeremy Kepner, Mr. Robert Bond, MIT Lincoln Laboratory
27 September 2001, HPEC Workshop, Lexington, MA
This work is sponsored by the United States Air Force under Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.
Outline
• Introduction
  • Problem Statement
  • S3P Program
• Design
• Demonstration
• Results
• Summary
PCA Need: System Level Optimization
Signal Processing Application (made up of PCA components):
• Filter: XOUT = FIR(XIN)
• Detect: XOUT = |XIN| > c
• Beamform: XOUT = w*XIN
[Diagram: Applications are built from Components (A, B) running on Morphware (Hardware and Software)]
• Applications are built with components
• Components have a defined scope
  • Capable of local optimization
• The system requires global optimization
  • Not visible to components
  • Too complex to add to the application
• Need system level optimization capabilities as part of PCA
Example: Optimum System Latency
[Plots: (left) Component Latency vs. Hardware Units (N), with Beamform Latency = 2/N and Filter Latency = 1/N and the local optimum marked; (right) System Latency over Filter Hardware vs. Beamform Hardware, with the constraints Latency < 8 and Hardware < 32 and the global optimum marked]
• Simple two component system
• The local optimum fails to satisfy the global constraints
• Need a system view to find the global optimum
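The two-component example above can be sketched as a brute-force search: a purely local optimizer would size each stage in isolation, while the global optimizer searches joint allocations against both constraints at once. A minimal sketch, assuming the slide's 2/N and 1/N latency scalings and that system latency is the sum of the stage latencies (function and variable names are illustrative):

```python
def stage_latencies(n_bf, n_fi):
    # Assumed per-stage latencies from the slide: Beamform = 2/N, Filter = 1/N
    return 2.0 / n_bf, 1.0 / n_fi

def global_optimum(hw_budget, latency_bound):
    """Search all joint allocations; prefer fewer hardware units, then lower latency."""
    best_key, best_alloc = None, None
    for n_bf in range(1, hw_budget):
        for n_fi in range(1, hw_budget - n_bf + 1):
            latency = sum(stage_latencies(n_bf, n_fi))
            if latency < latency_bound:
                key = (n_bf + n_fi, latency)   # (hardware used, system latency)
                if best_key is None or key < best_key:
                    best_key, best_alloc = key, (n_bf, n_fi)
    return best_key, best_alloc

# Constraints from the slide: Latency < 8, Hardware < 32 (units assumed)
print(global_optimum(32, 8))
```

The same search generalizes to any number of stages; only the enumeration of joint allocations grows, which is exactly why S3P turns to graph algorithms later in the talk.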
System Optimization Challenge
Signal Processing Application:
• Filter: XOUT = FIR(XIN)
• Detect: XOUT = |XIN| > c
• Beamform: XOUT = w*XIN
Compute Fabric (Cluster, FPGA, SOC, …)
Optimal Resource Allocation (Latency, Throughput, Memory, Bandwidth, …)
• Optimizing to system constraints requires a two-way component/system knowledge exchange
• Need a framework to mediate the exchange and perform system level optimization
S3P Lincoln Internal R&D Program
• Parallel Signal Processing: Kepner/Hoffmann (Lincoln)
• Self-Optimizing Software: Leiserson/Frigo (MIT LCS)
• Goal: applications that self-optimize to any hardware
• Combine Lincoln Laboratory system expertise with the LCS FFTW approach
S3P brings the self-optimizing (FFTW) approach to parallel signal processing systems
S3P Framework
• The framework exploits a graph theory abstraction
• Broadly applicable to system optimization problems
• Defines clear component and system requirements
[Diagram: Algorithm Stages 1..N, each with Processor Mappings 1..M; candidate mappings are timed and verified to find the best mappings]
Outline
• Introduction
• Design
  • Requirements
  • Graph Theory
• Demonstration
• Results
• Summary
System Requirements
• Each compute stage can be mapped to different sets of hardware and timed
• Filter: XOUT = FIR(XIN)
• Detect: XOUT = |XIN| > c
• Beamform: XOUT = w*XIN
An application must be:
• Decomposable into Tasks (computation) and Conduits (communication)
• Mappable to different sets of hardware
• Measurable in the resource usage of each mapping
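The three requirements can be phrased as a tiny interface sketch (hypothetical class and method names, not the actual PVL API): tasks are the decomposable units, each carrying a set of candidate mappings and a measured time per mapping.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One decomposable compute stage (Conduits between tasks are analogous)."""
    name: str
    mappings: list = field(default_factory=list)   # Mappable: candidate processor sets
    timings: dict = field(default_factory=dict)    # Measurable: mapping -> time (sec)

    def add_mapping(self, procs):
        self.mappings.append(tuple(procs))

    def record_time(self, procs, seconds):
        self.timings[tuple(procs)] = seconds

# Decomposable: the application is a chain of Tasks joined by Conduits.
pipeline = [Task("Beamform"), Task("Filter"), Task("Detect")]
pipeline[0].add_mapping([0, 1])
pipeline[0].record_time([0, 1], 1.5)
print(pipeline[0].timings)   # {(0, 1): 1.5}
```

Anything satisfying this shape, whatever its real implementation, gives the optimizer enough to build and search the system graph described next.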
System Graph
Beamform Filter Detect
• A node is a unique mapping of a task
• An edge is a conduit between a pair of task mappings
• The System Graph can store the hardware resource usage of every possible Task and Conduit
Path = System Mapping
Beamform Filter Detect
• Each path is a complete system mapping
• The "best" path is the optimal system mapping
• The graph construct is very general and widely used for optimization problems
• Many efficient techniques exist for choosing the "best" path (under constraints), such as Dynamic Programming
Example: Maximize Throughput
Beamform Filter Detect
• A node stores the task time for each mapping
• An edge stores the conduit time for a given pair of mappings
• Goal: maximize throughput and minimize hardware
• Choose the path with the smallest bottleneck that satisfies the hardware constraint
[Graph: candidate mappings per task with measured node times (e.g. 1.5, 3.0, 6.0, … 16.0) and conduit times; more hardware reduces a task's time]
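A brute-force version of this selection can be sketched as follows: enumerate every path through the system graph, score each by its bottleneck (the slowest task or conduit on the path), and keep the minimum. The times below are illustrative, not the slide's measurements.

```python
from itertools import product

def best_throughput_path(stage_times, conduit_time):
    """stage_times: one dict per stage, mapping -> task time.
    conduit_time(stage_index, map_a, map_b): conduit time between mappings.
    Returns (bottleneck, mappings) minimizing the slowest step."""
    best = (float("inf"), None)
    for path in product(*(list(s) for s in stage_times)):
        steps = [stage_times[i][m] for i, m in enumerate(path)]
        steps += [conduit_time(i, path[i], path[i + 1]) for i in range(len(path) - 1)]
        bottleneck = max(steps)
        if bottleneck < best[0]:
            best = (bottleneck, path)
    return best

# Two stages, mappings keyed by processor count; doubling CPUs halves task time,
# with a flat 2.0-unit conduit cost between stages (all values assumed).
stage_times = [{1: 6.0, 2: 3.0}, {1: 16.0, 2: 8.0}]
print(best_throughput_path(stage_times, lambda i, a, b: 2.0))
```

Exhaustive enumeration is exponential in the number of stages, which motivates the dynamic programming and Dijkstra formulations on the next slide.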
Path Finding Algorithms
• The graph construct is very general and widely used for optimization problems
• Many efficient techniques exist for choosing the "best" path (under constraints), such as Dijkstra's Algorithm and Dynamic Programming
N = total hardware units, M = number of tasks, Pi = number of mappings for task i
Dynamic Programming:
  t = M
  pathTable[M][N] = all infinite weight paths
  for( j : 1..M ){
    for( k : 1..Pj ){
      for( i : j+1..N-t+1 ){
        if( i - size[k] >= j ){
          if( j > 1 ){
            w = weight[pathTable[j-1][i-size[k]]] + weight[k]
                + weight[edge[last[pathTable[j-1][i-size[k]]], k]]
            p = addVertex[pathTable[j-1][i-size[k]], k]
          } else {
            w = weight[k]
            p = makePath[k]
          }
          if( weight[pathTable[j][i]] > w ){
            pathTable[j][i] = p
          }
        }
      }
    }
    t = t - 1
  }
Dijkstra's Algorithm:
  Initialize graph G
  Initialize source vertex s
  Store all vertices of G in a minimum priority queue Q
  while( Q is not empty ){
    u = pop[Q]
    for( each vertex v adjacent to u ){
      w = u.totalPathWeight() + weight of edge <u,v> + v.weight()
      if( v.totalPathWeight() > w ){
        v.totalPathWeight() = w
        v.predecessor() = u
      }
    }
  }
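The Dijkstra variant above differs from the textbook algorithm only in that nodes (task mappings) carry weights as well as edges (conduits). A runnable sketch under that reading, with an illustrative two-stage graph (all names and values are assumptions, not the slide's data):

```python
import heapq

def best_path(node_wt, edges, sources, sinks):
    """Dijkstra over a system graph where both nodes and edges are weighted."""
    dist = {v: float("inf") for v in node_wt}
    pred = {}
    pq = []
    for s in sources:                       # a path's weight includes its first node
        dist[s] = node_wt[s]
        heapq.heappush(pq, (dist[s], s))
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue                        # stale queue entry
        for v, w_edge in edges.get(u, []):
            w = d + w_edge + node_wt[v]     # edge weight plus next node's weight
            if w < dist[v]:
                dist[v], pred[v] = w, u
                heapq.heappush(pq, (w, v))
    end = min(sinks, key=lambda v: dist[v])
    path = [end]
    while path[-1] in pred:                 # walk predecessors back to a source
        path.append(pred[path[-1]])
    return dist[end], path[::-1]

# Two Beamform mappings (B1, B2) feeding two Filter mappings (F1, F2).
nodes = {"B1": 2.0, "B2": 1.0, "F1": 4.0, "F2": 3.0}
edges = {"B1": [("F1", 1.0), ("F2", 2.0)], "B2": [("F1", 2.0), ("F2", 1.0)]}
print(best_path(nodes, edges, ["B1", "B2"], ["F1", "F2"]))
```

Note this minimizes the summed path weight, which suits the latency objective; the throughput objective replaces the sum with the path's maximum (bottleneck) as shown earlier.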
S3P Inputs and Outputs
[Diagram: the Application plus Hardware Information, Algorithm Information, and System Constraints (some required, some optional) feed the S3P Framework, which produces the "best" system mapping]
• Can flexibly add information about
  • Application
  • Algorithm
  • System
  • Hardware
Outline
• Introduction
• Design
• Demonstration
  • Application
  • Middleware
  • Hardware
  • S3P
• Results
• Summary
S3P Demonstration Testbed
Multi-Stage Application: Input → Low Pass Filter → Beamform → Matched Filter
Middleware (PVL): Map, Task, Conduit
Hardware: Workstation Cluster
S3P Engine
Multi-Stage Application
[Diagram of the processing chain:]
• Input: produces XIN
• Low Pass Filter: applies FIR1 (weights W1) and FIR2 (weights W2) to XIN
• Beamform: XOUT = mult(XIN, W3)
• Matched Filter: applies FFT, multiplication by weights W4, and IFFT to XIN
Features
• "Generic" radar/sonar signal processing chain
• Utilizes key kernels (FIR, matrix multiply, FFT, and corner turn)
• Scalable to any problem size (fully parameterized algorithm)
• Self-validates (built-in target generator)
Parallel Vector Library (PVL)
[Diagram: Signal Processing & Control Mapping]
PVL classes, their roles, and the parallelism each supports:
• Matrix/Vector: used to perform matrix/vector algebra on data spanning multiple processors (Data parallelism)
• Computation: performs signal/image processing functions on matrices/vectors, e.g. FFT, FIR, QR (Data & Task parallelism)
• Task: supports algorithm decomposition, i.e. the boxes in a signal flow diagram (Task & Pipeline parallelism)
• Conduit: supports data movement between tasks, i.e. the arrows in a signal flow diagram (Task & Pipeline parallelism)
• Map: specifies how Tasks, Matrices/Vectors, and Computations are distributed on processors (Data, Task & Pipeline parallelism)
• Grid: organizes processors into a 2D layout
• Simple mappable components support data, task, and pipeline parallelism
Hardware Platform
• Network of 8 Linux workstations
  – Dual 800 MHz Pentium III processors
• Communication
  – Gigabit Ethernet, 8-port switch
  – Isolated network
• Software
  – Linux kernel release 2.2.14
  – GNU C++ Compiler
  – MPICH communication library over TCP/IP
Advantages
• Software tools
• Widely available
• Inexpensive (high Mflops/$)
• Excellent rapid prototyping platform
Disadvantages
• Non real-time OS
• Non real-time messaging
• Slower interconnect
• Difficult to model
• Erratic SMP behavior
S3P Engine
[Diagram: Hardware Information, Algorithm Information, System Constraints, and the Application Program feed the S3P Engine (Map Generator → Map Timer → Map Selector), which outputs the "best" system mapping]
• The Map Generator constructs the system graph for all candidate mappings
• The Map Timer times each node and edge of the system graph
• The Map Selector searches the system graph for the optimal set of maps
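The three phases can be sketched end-to-end as a toy pipeline (hypothetical names throughout; the timer here is a synthetic cost model rather than a real measurement, and the selector simply minimizes each stage's time instead of searching whole paths under global constraints):

```python
def generate_maps(stages, max_procs):
    # Map Generator: every stage may be mapped to 1..max_procs processors
    return {s: list(range(1, max_procs + 1)) for s in stages}

def time_maps(candidates, timer):
    # Map Timer: measure (here: model) every candidate mapping
    return {s: {n: timer(s, n) for n in ns} for s, ns in candidates.items()}

def select_best(timings):
    # Map Selector: pick the fastest mapping per stage (the real selector
    # searches the system graph under global hardware constraints)
    return {s: min(ts, key=ts.get) for s, ts in timings.items()}

stages = ["Input", "LowPassFilter", "Beamform", "MatchedFilter"]
work = {"Input": 1.0, "LowPassFilter": 4.0, "Beamform": 2.0, "MatchedFilter": 8.0}
timings = time_maps(generate_maps(stages, 4), lambda s, n: work[s] / n)
print(select_best(timings))
```

Separating generation, timing, and selection is the design point: the same selector can consume either measured times (as in the demonstration) or simulated times (as in the 128-CPU results later).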
Outline
• Introduction
• Design
• Demonstration
• Results
  • Simulated/Predicted/Measured
  • Optimal Mappings
  • Validation and Verification
• Summary
Optimal Throughput
Input → Low Pass Filter → Beamform → Matched Filter
• Vary the number of processors (1-4 CPUs) used on each stage
• Time each computation stage and communication conduit
• Find the path with the minimum bottleneck
[Table: measured stage and conduit times for each CPU count; the best mapping achieves 30 msec (1.6 MHz BW), and with more hardware 15 msec (3.2 MHz BW)]
S3P Timings (4 cpu max)
Tasks: Input, Low Pass Filter, Beamform, Matched Filter; mappings of 1-4 CPUs
• Graphical depiction of timings (wider is better)
S3P Timings (12 CPU max, wider is better)
Tasks: Input, Low Pass Filter, Beamform, Matched Filter; mappings of 2-12 CPUs
• The large amount of data requires an algorithm to find the best path
Predicted and Achieved Latency(4-8 cpu max)
• Find the path that produces minimum latency for a given number of processors
• Excellent agreement between S3P predicted and achieved latencies
[Plots: Latency (sec) vs. Maximum Number of Processors for the Large (48x128K) and Small (48x4K) problem sizes]
Predicted and Achieved Throughput(4-8 cpu max)
• Find the path that produces maximum throughput for a given number of processors
• Excellent agreement between S3P predicted and achieved throughput
[Plots: Throughput (pulses/sec) vs. Maximum Number of Processors for the Large (48x128K) and Small (48x4K) problem sizes]
SMP Results (16 cpu max)
• SMP overstresses Linux real-time capabilities
• Poor overall system performance
• Divergence between predicted and measured throughput
[Plot: Throughput (pulses/sec) vs. Maximum Number of Processors for the Large (48x128K) problem size]
Simulated (128 cpu max)
• The simulator allows exploration of larger systems
[Plots: Throughput (pulses/sec) and Latency (sec) vs. Maximum Number of Processors for the Small (48x4K) problem size]
Reducing the Search Space: Algorithm Comparison
• Graph algorithms provide baseline performance
• Hill Climbing performance varies as a function of initialization and neighborhood definition
• The preprocessor outperforms all other algorithms
[Plot: Number of Timings Required vs. Maximum Number of Processors for each algorithm]
Future Work
• Program area
  – Determine how to incorporate global optimization into other middleware efforts (e.g. PCA, HPEC-SI, …)
• Hardware area
  – Scale and demonstrate on a larger/real-time system: the HPCMO Mercury system at WPAFB; expect even better results than on the Linux cluster
  – Apply to parallel hardware (RAW)
• Algorithm area
  – Exploit ways of reducing the search space
  – Provide solution "families" via sensitivity analysis
Outline
• Introduction
• Design
• Demonstration
• Results
• Summary
Summary
• System level constraints (latency, throughput, hardware size, …) necessitate system level optimization
• The application requirements for system level optimization are:
  – Decomposable into components (input, filtering, output, …)
  – Mappable to different configurations (# processors, # links, …)
  – Measurable resource usage (time, memory, …)
• S3P demonstrates that global optimization is feasible separately from the application
Acknowledgements
• Matteo Frigo (MIT/LCS & Vanu, Inc.)
• Charles Leiserson (MIT/LCS)
• Adam Wierman (CMU)