
Page 1: Efficient Radar Processing Via Array and Index Algebras


University at Albany, SUNY

Efficient Radar Processing Via Array and Index Algebras

Lenore R. Mullin, Daniel J. Rosenkrantz, Harry B. Hunt III, and Xingmin Luo

University at Albany, SUNY

NSF CCR 0105536

Page 2: Efficient Radar Processing Via Array and Index Algebras


Outline

• Overview
  – Motivation
    • Radar Software Processing: to exceed 1 × 10^12 ops/second
    • The Mapping Problem: Efficient Use of Memory Hierarchy, Portable, Scalable, …
    • Radar Uses Linear and Multi-linear Operators: Array-Based Operations
    • Array Operations Require Array Algebra and Index Calculus
• Array Algebra: MoA and Index Calculus: Psi Calculus
  – Reshape to Use Processor/Memory Hierarchy Efficiently: Lift Dimension
  – High-Level Monolithic Operations: Remove Temporaries
• Time Domain Convolution
• Benefits of Using MoA and Psi Calculus

Page 3: Efficient Radar Processing Via Array and Index Algebras


Levels of Processor/Memory Hierarchy

• Can be modeled by increasing the dimensionality of the data array.
  – Additional dimension for each level of the hierarchy.
  – Envision the data as reshaped to reflect the increased dimensionality.
  – The calculus automatically transforms the algorithm to reflect the reshaped data array.
  – Data layout, data movement, and scalarization are automatically generated based on the reshaped data array.

Page 4: Efficient Radar Processing Via Array and Index Algebras


Levels of Processor/Memory Hierarchy (continued)

• Math and indexing operations in same expression

• Framework for design space search
  – Rigorous and provably correct
  – Extensible to complex architectures

Approach: Mathematics of Arrays. Example: “raising” array dimensionality for y = conv(x).

[Figure: the memory hierarchy (Main Memory, L2 Cache, L1 Cache) and parallelism are modeled by a map of x = < 0 1 2 … 35 > into 3-element blocks (< 0 1 2 >, < 3 4 5 >, …, < 33 34 35 >) distributed over processors P0, P1, P2.]
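To make the reshape concrete, the following is a minimal C++ sketch (not from the slides; the loop and variable names are ours) that lifts a flat 36-element x into a 3 × 4 × 3 view — processor axis, blocks per processor, elements per block — assuming, for illustration, a contiguous block-per-processor map. The index arithmetic it spells out by hand is the kind of layout the psi calculus generates automatically from the lifted shape.

#include <array>
#include <cstdio>

int main() {
    std::array<int, 36> x;                        // flat data: 0 .. 35
    for (int i = 0; i < 36; ++i) x[i] = i;

    const int procs = 3, blocks = 4, blk = 3;     // lifted shape < 3 4 3 >
    for (int p = 0; p < procs; ++p) {             // processor axis
        std::printf("P%d:", p);
        for (int b = 0; b < blocks; ++b) {        // block axis
            std::printf("  <");
            for (int e = 0; e < blk; ++e)         // element-within-block axis
                std::printf(" %d", x[(p * blocks + b) * blk + e]);
            std::printf(" >");
        }
        std::printf("\n");
    }
    return 0;
}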

Page 5: Efficient Radar Processing Via Array and Index Algebras


Application Domain: Signal Processing

3-D radar data processing as a composition of monolithic array operations:

  Pulse Compression → Doppler Filtering → Beamforming → Detection
  (key kernels: convolution, matrix multiply)

Both the algorithm and the architectural information (hardware info: memory, processor) are inputs. The algorithm is changed to better match the hardware/memory/communication structure by lifting dimension algebraically:
– Model processors: dim = dim + 1
– Model time variance: dim = dim + 1
– Model Level 1 cache: dim = dim + 1
– Model all three: dim = dim + 3

Page 6: Efficient Radar Processing Via Array and Index Algebras


Current Abstraction Approaches

Even when operations compose, they do not compose as X(Y(Z)) without temporary arrays.

[Figure: taxonomy of current approaches]
• Libraries (scalable/portable, fine-tuned for high performance): BLAS, LINPACK, LAPACK, ScaLAPACK; ATLAS; PVL, Blitz++, MTL; PETE AST preprocessor.
• Classical compiler technology and optimization: loop transformation theories, grammar changes, compiler AST optimizations, standard compiler optimizations.
• Some modern programming languages with monolithic arrays (partial algebras): Fortran 95, ZPL, C++ with classes, functions, and templates (compiled); MATLAB (interpreted).
• Requires highly skilled programmers.

Page 7: Efficient Radar Processing Via Array and Index Algebras


Outline

• Overview

• Array Algebra: MoA and Index Calculus: Psi Calculus

– Reshape to use Processor/Memory Hierarchy Efficiently: Lift Dimension

– High-Level Monolithic Operations: Remove Temporaries

• Time Domain Convolution

• Benefits of Using MoA and Psi Calculus

Page 8: Efficient Radar Processing Via Array and Index Algebras


PSI Calculus

Basic Properties:
• Index calculus: centers around the psi function.
• Shape-polymorphic functions and operators: operations are defined using shapes and psi.
• Fundamental type is the array, modeled as (shape_vector, components).
  – Scalars are 0-dimensional arrays, that is: (empty_vector, scalar value).
• Denotational Normal Form (DNF) = reduced form in Cartesian coordinates (independent of data layout: row major, column major, regular sparse, …).
• Operational Normal Form (ONF) = reduced form for 1-d memory layout(s).
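As an illustration of the (shape_vector, components) model, here is a small C++ sketch — our own, not the authors' implementation; the MoaArray type and psi function names are illustrative — of an array carried as a shape vector plus flat components, with psi mapping a full index to the component it selects, assuming a row-major flat store.

#include <cstddef>
#include <iostream>
#include <vector>

struct MoaArray {                        // array = (shape_vector, components)
    std::vector<std::size_t> shape;      // empty shape => 0-dimensional scalar
    std::vector<double> data;            // flat components
};

// psi: a full index vector selects a single component of the array.
double psi(const MoaArray& a, const std::vector<std::size_t>& idx) {
    std::size_t offset = 0;
    for (std::size_t d = 0; d < a.shape.size(); ++d)
        offset = offset * a.shape[d] + idx[d];   // row-major ravel of the index
    return a.data[offset];
}

int main() {
    MoaArray a{{2, 3}, {0, 1, 2, 3, 4, 5}};      // a 2 x 3 array
    std::cout << psi(a, {1, 2}) << "\n";         // component at index <1 2>: prints 5
}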

Page 9: Efficient Radar Processing Via Array and Index Algebras


Psi Reduction

PSI Calculus rules are applied mechanically to produce the ONF, which is easily translated to an optimal loop implementation. The ONF has the minimum number of reads and writes.

A = cat(rev(B), rev(C))  becomes, by “psi” reduction:

  A[i] = B[B.size-1-i]            if 0 ≤ i < B.size
  A[i] = C[C.size+B.size-1-i]     if B.size ≤ i < B.size+C.size
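The reduced form translates directly into loops. A sketch of the implementation it dictates (our own, assuming plain std::vector operands; the helper name cat_rev_rev is ours):

#include <cstddef>
#include <vector>

// A = cat(rev(B), rev(C)) with no temporaries for rev(B) or rev(C):
// each element of A is computed and written exactly once.
std::vector<double> cat_rev_rev(const std::vector<double>& B,
                                const std::vector<double>& C) {
    std::vector<double> A(B.size() + C.size());
    for (std::size_t i = 0; i < B.size(); ++i)          // 0 <= i < B.size
        A[i] = B[B.size() - 1 - i];
    for (std::size_t i = B.size(); i < A.size(); ++i)   // B.size <= i < B.size + C.size
        A[i] = C[C.size() + B.size() - 1 - i];
    return A;
}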

Page 10: Efficient Radar Processing Via Array and Index Algebras


Some Psi Calculus Operations

Page 11: Efficient Radar Processing Via Array and Index Algebras


Convolution: PSI Calculus Description

PSI Calculus operators compose to form higher level operations.

Definition of y = conv(h, x):

  y[n] = Σ_{k=0}^{M-1} h[k]·x′[n+k],   0 ≤ n < N+M-1,

where x has N elements, h has M elements, and x′ is x padded by M-1 zeros on either end.

Algorithm and PSI Calculus description, with x = < 1 2 3 4 > and h = < 5 6 7 > (N = 4, M = 3):

1. Initial step:                      x = < 1 2 3 4 >,  h = < 5 6 7 >
2. Form x′:                           x′ = cat(reshape(<M-1>, <0>), cat(x, reshape(<M-1>, <0>)))
                                      x′ = < 0 0 1 2 3 4 0 0 >
3. Rotate x′ (N+M-1) times:           x′rot = binaryOmega(rotate, 0, iota(N+M-1), 1, x′)
                                      x′rot = < 0 0 1 2 … >  < 0 1 2 3 … >  < 1 2 3 4 … >  …
4. Take the size-of-h part of x′rot:  x′final = binaryOmega(take, 0, reshape(<N+M-1>, <M>), 1, x′rot)
                                      x′final = < 0 0 1 >  < 0 1 2 >  < 1 2 3 >  …
5. Multiply:                          Prod = binaryOmega(*, 1, h, 1, x′final)
                                      Prod = < 0 0 7 >  < 0 6 14 >  < 5 12 21 >  …
6. Sum:                               Y = unaryOmega(sum, 1, Prod)
                                      Y = < 7 20 38 … >
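For reference, a direct C++ rendering of the definition above (our sketch, not the derived ONF code; the helper name conv and the padding arithmetic are ours): pad x with M-1 zeros on either end and accumulate y[n] = Σ h[k]·x′[n+k], with no intermediate arrays beyond x′ itself.

#include <cstddef>
#include <vector>

std::vector<double> conv(const std::vector<double>& h,
                         const std::vector<double>& x) {
    const std::size_t M = h.size(), N = x.size();
    std::vector<double> xp(N + 2 * (M - 1), 0.0);  // x' : x padded by M-1 zeros on either end
    for (std::size_t i = 0; i < N; ++i) xp[M - 1 + i] = x[i];

    std::vector<double> y(N + M - 1, 0.0);         // N+M-1 outputs
    for (std::size_t n = 0; n < y.size(); ++n)
        for (std::size_t k = 0; k < M; ++k)
            y[n] += h[k] * xp[n + k];              // y[n] = sum_k h[k] * x'[n+k]
    return y;
}
// With x = < 1 2 3 4 > and h = < 5 6 7 > this yields y = < 7 20 38 56 39 20 >.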

Page 12: Efficient Radar Processing Via Array and Index Algebras


Experimental Platform and Method

Hardware
• DY4 CHAMP-AV board
  – Contains 4 MPC7400s and 1 MPC 8420
• MPC7400 (G4)
  – 450 MHz
  – 32 KB L1 data cache
  – 2 MB L2 cache
  – 64 MB memory/processor

Software
• VxWorks 5.2
  – Real-time OS
• GCC 2.95.4 (non-official release)
  – GCC 2.95.3 with patches for VxWorks
  – Optimization flags: -O3 -funroll-loops -fstrict-aliasing

Method
• Run many iterations; report average, minimum, and maximum time
  – From 10,000,000 iterations for small data sizes, to 1,000 for large data sizes
• All approaches run on the same data
• Only average times shown here
• Only one G4 processor used
• Use of the VxWorks OS resulted in very low variability in timing; high degree of confidence in results

Page 13: Efficient Radar Processing Via Array and Index Algebras


Experiment: Conv(x,h)

• Cost of temporaries in the regular C++ approach is more pronounced due to the large number of operations
• Cost of expression tree manipulation is also more pronounced

Page 14: Efficient Radar Processing Via Array and Index Algebras


Convolution and Dimension Lifting

• Model Processor and Level 1 cache.
  – Start with 1-d inputs (the input dimension).
  – Envision a 2nd dimension ranging over output values.
  – Envision processors reshaped into a 3rd dimension; the 2nd dimension is partitioned.
  – Envision cache reshaped into a 4th dimension; the 1st dimension is partitioned.
  – “psi” reduce to normal form.

Page 15: Efficient Radar Processing Via Array and Index Algebras


– Envision a 2nd dimension ranging over output values.

[Figure: let tz = N+M-1, with M = size of h (here 4) and N = size of x. A tz-by-tz picture lays out the filter taps h3 h2 h1 h0 against the zero-padded input 0 0 0 x0 … x4, one row per output value.]

Page 16: Efficient Radar Processing Via Array and Index Algebras


– Envision processors reshaped into a 3rd dimension; the 2nd dimension is partitioned.

[Figure: let p = 2. The tz-by-tz array is reshaped to 2 × tz/2 × tz, one tz/2-by-tz slab per processor.]

Page 17: Efficient Radar Processing Via Array and Index Algebras


– Envision cache reshaped into a 4th dimension; the 1st dimension is partitioned.

[Figure: each 2 × tz/2 × tz slab is further split along the input dimension into cache-sized blocks, giving a 4-dimensional array.]

Page 18: Efficient Radar Processing Via Array and Index Algebras


ONF for the Convolution Decomposition with Processors & Cache

Generic form: 4 dimensional, after “psi” reduction. Let tz = N+M-1, M = size of h, N = size of x (time domain).

1. For i0 = 0 to p-1 do:                       (processor loop)
2.   For i1 = 0 to tz/p - 1 do:                (time loop)
3.     sum = 0
4.     For icacherow = 0 to M/cache - 1 do:    (cache loop)
5.       For i3 = 0 to cache - 1 do:
6.         sum = sum + h[(M - (icacherow*cache + i3)) - 1] * x′[(tz/p)*i0 + i1 + icacherow*cache + i3]

sum is calculated for each element of y.
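As a reading aid, here is the loop nest above transcribed into plain C++ — a sketch with assumed names (conv_onf, xp, cache), the processor loop written sequentially. It assumes p divides tz, cache divides M, and x′ is the zero-padded input of length tz + M - 1.

#include <cstddef>
#include <vector>

void conv_onf(const std::vector<double>& h,        // M filter weights
              const std::vector<double>& xp,       // x' (zero-padded input)
              std::vector<double>& y,              // tz = N + M - 1 outputs
              std::size_t p, std::size_t cache) {
    const std::size_t M = h.size(), tz = y.size();
    for (std::size_t i0 = 0; i0 < p; ++i0)                     // processor loop
        for (std::size_t i1 = 0; i1 < tz / p; ++i1) {          // time loop
            double sum = 0.0;
            for (std::size_t ic = 0; ic < M / cache; ++ic)     // cache loop
                for (std::size_t i3 = 0; i3 < cache; ++i3)
                    sum += h[(M - (ic * cache + i3)) - 1]
                         * xp[(tz / p) * i0 + i1 + ic * cache + i3];
            y[(tz / p) * i0 + i1] = sum;                       // one element of y per (i0, i1)
        }
}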

Page 19: Efficient Radar Processing Via Array and Index Algebras


Outline

• Overview
• Array Algebra: MoA and Index Calculus: Psi Calculus
• Time Domain Convolution
• Other algorithms in Radar
  – Modified Gram-Schmidt QR Decomposition: MoA to ONF, Experiments
  – Composition of Matrix Multiplication in Beamforming: MoA to DNF, Experiments
  – FFT
• Benefits of Using MoA and Psi Calculus

Page 20: Efficient Radar Processing Via Array and Index Algebras


Algorithms in Radar

[Figure: MoA & psi calculus workflow for three radar kernels — Time Domain Convolution(x, y), Modified Gram-Schmidt QR(A), and Beamforming A x (B^H x C). Each is given a manual description and derivation for one processor, yielding a DNF; the dimension is then lifted (processor, L1 cache) and the DNF is reformulated to an ONF for one processor. Related efforts: mechanize using expression templates; use to reason about RAW; benchmark at NCSA with LAPACK; compiler optimizations from DNF to ONF; implement DNF/ONF in Fortran 90; thoughts on an abstract machine.]

Page 21: Efficient Radar Processing Via Array and Index Algebras


ONF for the QR Decomposition with Processors & Cache

Modified Gram-Schmidt

[Figure: loop structure — Initialization, then a main loop whose body contains processor loops and processor/cache loops for Compute Norm, Normalize, Dot Product, and Orthogonalize.]

Page 22: Efficient Radar Processing Via Array and Index Algebras


DNF for the Composition of A x (B^H x C) — Beamforming

Generic form: 4 dimensional. Given A, B, X, Z, all n-by-n arrays:

1. Z = 0
2. For i = 0 to n-1 do:
3.   For j = 0 to n-1 do:
4.     For k = 0 to n-1 do:
5.       Z[k;] = Z[k;] + A[k;j] x X[j;i] x B[i;]
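A C++ sketch of the DNF loop above (our illustration; composed_matmul and the row-major layout are assumptions, with matrices stored in flat length-n*n vectors so that M[r;c] is M[r*n + c]). The whole k-th row of Z is updated at once, and no intermediate n-by-n product is materialized.

#include <cstddef>
#include <vector>

void composed_matmul(const std::vector<double>& A,
                     const std::vector<double>& X,
                     const std::vector<double>& B,
                     std::vector<double>& Z, std::size_t n) {
    for (std::size_t t = 0; t < n * n; ++t) Z[t] = 0.0;        // Z = 0
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k) {
                const double c = A[k * n + j] * X[j * n + i];  // A[k;j] * X[j;i]
                for (std::size_t m = 0; m < n; ++m)            // Z[k;] += c * B[i;]
                    Z[k * n + m] += c * B[i * n + m];
            }
}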

Page 23: Efficient Radar Processing Via Array and Index Algebras


Fftpsirad2: Performance Comparisons

Page 24: Efficient Radar Processing Via Array and Index Algebras


Mechanizing MoA and Psi Reduction

• Index theory introduced: Abrams, 1972
• MoA & calculus theory: Mullin, 1988
• Prototype compiler, output C, F90, HPF: Mullin and Thibault, 1994
• HPF compiler, AST manipulations: Mullin et al., 1996
• SAC, functional C: Mullin and Bodo, 1996
• C++ classes: Helal, Sameh, and Mullin, 2001
• C++ expression templates: Mullin, Ruttledge, and Bond, 2002
• PVL with the Portable Expression Template Engine (PETE)
• Parallel and distributed processing
• Abstract machine
• Automate cost and determine optimizations: minimize the search space
• Lifting compiler optimizations to the application programmer interface
• Theory applied to embedded systems

Page 25: Efficient Radar Processing Via Array and Index Algebras


On-going research

• We are implementing the psi calculus using expression templates.
• We are building on work done at MIT, and we are working with MTL library developers (Lumsdaine) at Indiana University and the STL library developer (Musser) at RPI.

Page 26: Efficient Radar Processing Via Array and Index Algebras


Benefits of Using MoA and Psi Calculus

• The processor/memory hierarchy can be modeled by reshaping data using an extra dimension for each level.
• A composition of monolithic operations can be re-expressed as a composition of operations on smaller data granularities
  – Matches memory hierarchy levels
  – Avoids materialization of intermediate arrays
• The algorithm can be automatically (algebraically) transformed to reflect the array reshapings above.
• Facilitates programming expressed at a high level
  – Facilitates intentional program design and analysis
  – Facilitates portability
• This approach is applicable to many other problems in radar.

Page 27: Efficient Radar Processing Via Array and Index Algebras


Email and Questions

• Lenore R. Mullin, [email protected]
• Daniel J. Rosenkrantz, [email protected]
• Harry B. Hunt III, [email protected]
• Xingmin Luo, [email protected]

*The End*

Page 28: Efficient Radar Processing Via Array and Index Algebras


Typical C++ Operator Overloading

Example: A = B + C (vector add)

[Figure: data flow through Main, operator+, and operator= for typical operator overloading]
1. Pass B and C references to operator+.
2. Create a temporary result vector.
3. Calculate the results and store them in the temporary.
4. Return a copy of the temporary.
5. Pass the result reference to operator=.
6. Perform the assignment.

Two temporary vectors are created.

Additional memory use:
• Static memory
• Dynamic memory (also affects execution time)
• Cache misses / page faults

Additional execution time:
• Time to create a new vector
• Time to create a copy of a vector
• Time to destruct both temporaries
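A minimal sketch of the pattern being criticized (illustrative only, using a toy Vec type of our own): operator+ builds and fills a whole temporary vector before operator= ever runs.

#include <cstddef>
#include <vector>

struct Vec {
    std::vector<double> v;
};

Vec operator+(const Vec& b, const Vec& c) {
    Vec temp;                                     // temporary result vector
    temp.v.resize(b.v.size());
    for (std::size_t i = 0; i < b.v.size(); ++i)
        temp.v[i] = b.v[i] + c.v[i];              // results stored in the temporary
    return temp;                                  // handed back to the caller
}

int main() {
    Vec B{{1, 2, 3}}, C{{4, 5, 6}}, A;
    A = B + C;                                    // temporary built, then assigned into A
}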

Page 29: Efficient Radar Processing Via Array and Index Algebras


C++ Expression Templates and PETE

Parse trees, not vectors, are created.

Reduced memory use:
• The parse tree only contains references

Reduced execution time:
• Better cache use
• Loop-fusion-style optimization
• Compile-time expression tree manipulation

PETE: http://www.acl.lanl.gov/pete
• PETE, the Portable Expression Template Engine, is available from the Advanced Computing Laboratory at Los Alamos National Laboratory.
• PETE provides:
  – Expression template capability
  – Facilities to help navigate and evaluate parse trees

Example: A = B + C. The expression type (parse tree) is BinaryNode<OpAdd, Reference<Vector>, Reference<Vector>>.

[Figure: data flow through Main, operator+, and operator= with expression templates]
1. Pass B and C references to operator+.
2. Create the expression parse tree.
3. Return the expression parse tree.
4. Pass the expression tree reference to operator=.
5. Calculate the result and perform the assignment.
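For contrast with the previous slide, a minimal expression-template sketch in the spirit of PETE (our own toy Vec and AddNode types, not PETE's actual classes): operator+ returns a lightweight parse-tree node holding references, and the elements are computed only inside operator=, in one fused loop with no temporary vector.

#include <cstddef>
#include <vector>

template <class L, class R>
struct AddNode {                           // parse-tree node representing L + R
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

struct Vec {
    std::vector<double> v;
    double operator[](std::size_t i) const { return v[i]; }
    std::size_t size() const { return v.size(); }

    template <class Expr>
    Vec& operator=(const Expr& e) {        // single fused evaluation loop
        v.resize(e.size());
        for (std::size_t i = 0; i < e.size(); ++i) v[i] = e[i];
        return *this;
    }
};

template <class L, class R>
AddNode<L, R> operator+(const L& l, const R& r) { return {l, r}; }

int main() {
    Vec B{{1, 2, 3}}, C{{4, 5, 6}}, A;
    A = B + C;                             // builds AddNode<Vec, Vec>; no temporary vector
}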

Page 30: Efficient Radar Processing Via Array and Index Algebras


Implementing Psi Calculus with Expression Templates

Example: A = take(4, drop(3, rev(B)))

B = < 1 2 3 4 5 6 7 8 9 10 >,  A = < 7 6 5 4 >

Recall: psi reduction for 1-d arrays always yields one or more expressions of the form

  x[i] = y[stride*i + offset],   l ≤ i < u

1. Form the expression tree: take(4, drop(3, rev(B))).
2. Add size information: B: size=10; rev: size=10; drop: size=7; take: size=4.
3. Apply psi reduction rules:
     size=10   A[i] = B[i]
     size=10   A[i] = B[-i + B.size - 1] = B[-i + 9]
     size=7    A[i] = B[-(i+3) + 9] = B[-i + 6]
     size=4    A[i] = B[-i + 6]
4. Rewrite as sub-expressions with iterators at the leaves and loop-bounds information at the root:
     size=4, iterator: offset=6, stride=-1

• Iterators are used for efficiency, rather than recalculating indices for each i.
• One “for” loop evaluates each sub-expression.
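The end result of the reduction, written out by hand (our sketch of the single loop the iterator form dictates, not the expression-template machinery itself):

#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> B = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

    // Reduced form of A = take(4, drop(3, rev(B))):
    // A[i] = B[stride*i + offset] with stride = -1, offset = 6, 0 <= i < 4.
    const std::ptrdiff_t stride = -1, offset = 6;
    const std::size_t size = 4;

    std::vector<int> A(size);
    for (std::size_t i = 0; i < size; ++i)
        A[i] = B[offset + stride * static_cast<std::ptrdiff_t>(i)];

    for (int a : A) std::cout << a << ' ';       // prints: 7 6 5 4
    std::cout << '\n';
}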