Programming High Performance Embedded Systems: Tackling the Performance Portability Problem
![Page 1: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/1.jpg)
1
Programming High Performance Embedded Systems:
Tackling the Performance Portability Problem
Alastair Reid
Principal Engineer, R&D
ARM Ltd
![Page 2: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/2.jpg)
2
Programming HP Embedded Systems
High-Performance Energy-Efficient Hardware Example: Ardbeg processor cluster (ARM R&D)
Portable System-level programming Example: SoC-C language extensions (ARM R&D)
Portable Kernel-level programming Example: C+Builtins
Example: Data Parallel Language
Merging System/Kernel-level programming
![Page 3: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/3.jpg)
3 3
Mobile Consumer Electronics Trends
Mobile Application Requirements Still Growing Rapidly
  Still cameras: 2 Mpixel → 10 Mpixel
  Video cameras: VGA → HD → 1080p → …
  Video players: MPEG-2 → H.264
  2D Graphics: QVGA → HVGA → VGA → FWVGA → …
  3D Gaming: > 30 Mtriangle/s, antialiasing, …
  Bandwidth: HSDPA (14.4 Mbps) → WiMax (70 Mbps) → LTE (326 Mbps)
Feature Convergence
  Phone + graphics + UI + games + still camera + video camera + music + WiFi + Bluetooth + 3.5G + 3.9G + WiMax + GPS + …
![Page 4: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/4.jpg)
5 5
Mobile SDR Design Challenges
[Figure: log-log plot of peak performance (Gops, 1 to 1000) against power (Watts, 0.1 to 100), with lines of constant power efficiency at 1, 10, and 100 Mops/mW. Labels on the plot: general-purpose processors, embedded DSPs, high-end DSPs, mobile SDR requirements, Pentium M, TI C6x, IBM Cell.]
SDR Design Objectives for 3G and WiFi
  Throughput requirements: 40+ Gops peak throughput
  Power budget: 100 mW to 500 mW peak power
Slide adapted from M. Woh’s ‘From Scotch to SODA’, MICRO-41, 2008
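As a rough check of what these objectives imply, dividing the throughput target by the power budget gives the required power efficiency. This is a back-of-envelope sketch that assumes the full 40 Gops is delivered at either end of the stated power range:

```latex
\frac{40\ \text{Gops}}{500\ \text{mW}} = \frac{40\,000\ \text{Mops}}{500\ \text{mW}} = 80\ \text{Mops/mW},
\qquad
\frac{40\ \text{Gops}}{100\ \text{mW}} = \frac{40\,000\ \text{Mops}}{100\ \text{mW}} = 400\ \text{Mops/mW}
```

Under that assumption the requirement sits in the neighbourhood of the 100 Mops/mW efficiency line on the chart.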
![Page 5: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/5.jpg)
8 8
Energy Efficient Systems are “Lumpy”
Drop Frequency 10x
  Desktop: 2-4 GHz
  Mobile: 200-400 MHz
Increase Parallelism 100x
  Desktop: 1-2 cores
  Mobile: 32-way SIMD Instruction Set, 4-8 cores
Match Processor Type to Task
  Desktop: homogeneous, general purpose
  Mobile: heterogeneous, specialised
Keep Memory Local
  Desktop: coherent, shared memory
  Mobile: processor-memory clusters linked by DMA
![Page 6: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/6.jpg)
10 10
Ardbeg SDR Processor
[Block diagram. Ardbeg System: control processor, FEC accelerator, PEs (each an execution unit with L1 memory), L2 memory, DMAC, and peripherals, with a 512-bit bus and a 64-bit AMBA 3 AXI interconnect. Ardbeg PE, in three groups: (1) wide SIMD: 512-bit SIMD register file, 512-bit SIMD multiplier, 512-bit SIMD ALU with shuffle, SIMD shuffle network, 1024-bit SIMD accumulator register file, SIMD predicate register file and predicate ALU, SIMD+scalar transfer unit; (2) scalar and AGU: scalar ALU and multiplier, scalar register file and accumulator, AGU and AGU register file, L1 program memory, controller; (3) memory: L1 data memory and L2 memory, reached through interconnects.]
  Application Specific Hardware
  2-level memory hierarchy
  8, 16, 32 bit fixed point support
  512-bit SIMD
  Sparse Connected VLIW
Slide adapted from M. Woh’s ‘From Scotch to SODA’, MICRO-41, 2008
![Page 7: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/7.jpg)
11 11
Summary of Ardbeg SDR Processor
[Figure: log-log plot of achieved throughput (Mbps, 0.01 to 100) against power (Watts, 0.01 to 1000) for Ardbeg, SODA, ASIC, Sandblaster, TigerSHARC, and Pentium M. Workloads shown: W-CDMA voice, W-CDMA data, W-CDMA 2Mbps, 802.11a, DVB-H, and DVB-T; some 802.11a and W-CDMA 2Mbps points are labelled 180nm.]
• Ardbeg is lower power at same throughput
• We are getting closer to ASICs
Slide adapted from M. Woh’s ‘From Scotch to SODA’, MICRO-41, 2008
![Page 8: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/8.jpg)
12 12
How do we program AMP systems?
C doesn’t provide language features to support
  Multiple processors (or multi-ISA systems)
  Distributed memory
  Multiple threads
![Page 9: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/9.jpg)
13 13
Use Indirection (Strawman #1)
Add a layer of indirection
  Operating System
  Layer of middleware
  Device drivers
  Hardware support
All impose a cost in Power/Performance/Area
![Page 10: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/10.jpg)
14 14
Raise Pain Threshold (Strawman #2)
Write efficient code at very low level of abstraction
Problems
  Hard, slow and expensive to write, test, debug and maintain
  Design intent drowns in sea of low level detail
  Not portable across different architectures
  Expensive to try different points in design space
![Page 11: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/11.jpg)
15 15
Our Response
Extend C
  Support Asymmetric Multiprocessors
  SoC-C language raises level of abstraction
  … but take care not to hide expensive operations
![Page 12: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/12.jpg)
16 16
SoC-C Overview
Pocket-Sized Supercomputers
  Energy efficient hardware is “lumpy”
  … and unsupported by C
  … but supported by SoC-C
SoC-C Extensions by Example
  Pipeline Parallelism
  Code Placement
  Data Placement
SoC-C Conclusion
![Page 13: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/13.jpg)
17 17
3 steps in mapping an application
1. Decide how to parallelize
2. Choose processors for each pipeline stage
3. Resolve distributed memory issues
![Page 14: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/14.jpg)
18 18
A Simple Program
int x[100];
int y[100];
int z[100];
while (1) {
get(x);
foo(y,x);
bar(z,y);
baz(z);
put(z);
}
![Page 15: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/15.jpg)
19 19
Simplified System Architecture
[Diagram (artist’s impression): a control processor, data engines with a SIMD instruction set, accelerators, and distributed memories.]
![Page 16: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/16.jpg)
20 20
Step 1: Decide how to parallelize
int x[100];
int y[100];
int z[100];
while (1) {
get(x);
foo(y,x);
bar(z,y);
baz(z);
put(z);
}
50% of work
50% of work
![Page 17: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/17.jpg)
21 21
Step 1: Decide how to parallelize
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x);
FIFO(y);
bar(z,y);
baz(z);
put(z);
}
}
PIPELINE indicates region to parallelize
FIFO indicates boundaries between pipeline stages
![Page 18: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/18.jpg)
22 22
SoC-C Feature #1: Pipeline Parallelism
Annotations express coarse-grained pipeline parallelism
PIPELINE indicates scope of parallelism
FIFO indicates boundaries between pipeline stages
Compiler splits into threads communicating through FIFOs
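The split can be pictured with a small sketch. It assumes two stages and a bounded ring-buffer FIFO, and it steps the stages in turn from a single-threaded loop, whereas the SoC-C compiler generates real threads. The names `fifo_t`, `stage0_step`, `stage1_step`, and `run_pipeline` are hypothetical, and the bodies of `foo`, `bar`, and `baz` are placeholders chosen only to make the result checkable.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define N 4           /* elements per buffer (100 in the slides) */
#define FIFO_DEPTH 2  /* buffers the FIFO can hold (an assumption) */

/* Bounded FIFO of whole buffers, standing in for FIFO(y). */
typedef struct {
    int slots[FIFO_DEPTH][N];
    int head, tail, count;
} fifo_t;

static bool fifo_put(fifo_t *f, const int *buf) {
    if (f->count == FIFO_DEPTH) return false;   /* full: producer must wait */
    memcpy(f->slots[f->tail], buf, sizeof(int) * N);
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return true;
}

static bool fifo_get(fifo_t *f, int *buf) {
    if (f->count == 0) return false;            /* empty: consumer must wait */
    memcpy(buf, f->slots[f->head], sizeof(int) * N);
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}

/* Stage 0: the code before FIFO(y), i.e. get(x); foo(y,x); */
static bool stage0_step(fifo_t *f, int iter) {
    int x[N], y[N];
    for (int i = 0; i < N; ++i) x[i] = iter * N + i;  /* placeholder get(x) */
    for (int i = 0; i < N; ++i) y[i] = 2 * x[i];      /* placeholder foo(y,x) */
    return fifo_put(f, y);
}

/* Stage 1: the code after FIFO(y), i.e. bar(z,y); baz(z); put(z); */
static bool stage1_step(fifo_t *f, long *sink) {
    int y[N], z[N];
    if (!fifo_get(f, y)) return false;
    for (int i = 0; i < N; ++i) z[i] = y[i] + 1;      /* placeholder bar(z,y) */
    for (int i = 0; i < N; ++i) z[i] *= 3;            /* placeholder baz(z) */
    for (int i = 0; i < N; ++i) *sink += z[i];        /* placeholder put(z) */
    return true;
}

/* Round-robin scheduler standing in for two concurrent threads. */
long run_pipeline(int iterations) {
    fifo_t f;
    long sink = 0;
    int produced = 0, consumed = 0;
    memset(&f, 0, sizeof f);
    while (consumed < iterations) {
        if (produced < iterations && stage0_step(&f, produced)) produced++;
        if (stage1_step(&f, &sink)) consumed++;
    }
    return sink;
}
```

Because the stages share nothing except the FIFO, either one could be moved to its own thread or processor without changing the value that `run_pipeline` accumulates.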
![Page 19: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/19.jpg)
23 23
Step 2: Choose Processors
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x);
FIFO(y);
bar(z,y);
baz(z);
put(z);
}
}
![Page 20: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/20.jpg)
24 24
Step 2: Choose Processors
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
@ P indicates processor to execute function
![Page 21: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/21.jpg)
25 25
SoC-C Feature #2: RPC Annotations
Annotations express where code is to execute
  Behaves like Synchronous Remote Procedure Call
  Does not change meaning of program
  Bulk data is not implicitly copied to processor’s local memory
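A minimal sketch of the synchronous-RPC reading of `@ P`, under the assumption that a call simply blocks until the target processor has finished. `proc_id`, `rpc_call`, and `rpc_demo` are hypothetical names; a real runtime would post a request to the target processor and wait for its reply instead of calling the function directly.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { P0, P1, NUM_PROCS } proc_id;   /* hypothetical processor ids */
typedef void (*kernel_fn)(int *dst, const int *src, size_t n);

/* Counts calls dispatched to each processor (stands in for real dispatch). */
static int calls_on[NUM_PROCS];

/* Synchronous RPC: returns only when fn has completed "on" processor p.
   Only pointers cross the call, so bulk data is not copied implicitly. */
void rpc_call(proc_id p, kernel_fn fn, int *dst, const int *src, size_t n) {
    calls_on[p]++;
    fn(dst, src, n);
}

/* Placeholder kernel standing in for foo(y,x). */
static void foo(int *y, const int *x, size_t n) {
    for (size_t i = 0; i < n; ++i) y[i] = x[i] + 1;
}

/* Returns 1 if the annotated call gives the same result as the plain call. */
int rpc_demo(void) {
    int x[3] = {1, 2, 3};
    int local[3], remote[3];
    int before = calls_on[P0];
    foo(local, x, 3);                  /* foo(y,x);      */
    rpc_call(P0, foo, remote, x, 3);   /* foo(y,x) @ P0; */
    for (size_t i = 0; i < 3; ++i)
        if (local[i] != remote[i]) return 0;
    return calls_on[P0] == before + 1;
}
```

The comparison in `rpc_demo` mirrors the claim on the slide: the annotation changes where the call runs, not what it computes.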
![Page 22: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/22.jpg)
26 26
Step 3: Resolve Memory Issues
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
P0 uses x ⇒ x must be in M0
P1 uses z ⇒ z must be in M1
P0 uses y ⇒ y must be in M0
P1 uses y ⇒ y must be in M1
Conflict?!
![Page 23: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/23.jpg)
27 27
Hardware Cache Coherency
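[Diagram: processors P0 and P1 with caches $0 and $1. Accesses shown: write x, read x, write x. Coherence actions shown between the caches: invalidate x, copy x, invalidate x.]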
![Page 24: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/24.jpg)
28 28
Step 3: Resolve Memory Issues
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
Two versions: y@M0, y@M1
foo writes y@M0 ⇒ y@M1 is invalid
bar reads y@M1 ⇒ coherence error
![Page 25: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/25.jpg)
29 29
Step 3: Resolve Memory Issues
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
SYNC(y) @ DMA;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
SYNC(x) @ P copies data from one version of x to another using processor P
read y@M1
y@M1 and y@M0 are valid
![Page 26: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/26.jpg)
30 30
SoC-C Feature #3: Compile Time Coherency
Variables can have multiple coherent versions
  Compiler uses memory topology to determine which version is being accessed
Compiler applies cache coherency protocol
  Writing to a version makes it valid and other versions invalid
  Dataflow analysis propagates validity
  Reading from an invalid version is an error
  SYNC(x) copies from valid version to invalid version
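The protocol rules can be sketched as a small state tracker. This is a run-time simulation for illustration only: the slides describe the compiler applying these rules statically through dataflow analysis. The type `var_state` and the functions `on_write`, `on_read`, `on_sync`, and `check_example` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

enum { M0 = 1, M1 = 2 };   /* one bit per memory (hypothetical encoding) */

typedef struct {
    unsigned placed;  /* memories that hold a version of the variable */
    unsigned valid;   /* versions that currently hold up-to-date data */
} var_state;

/* Writing to a version makes it valid and all other versions invalid. */
void on_write(var_state *v, unsigned mem) { v->valid = mem; }

/* Reading from an invalid version is an error: returns false in that case. */
bool on_read(const var_state *v, unsigned mem) { return (v->valid & mem) != 0; }

/* SYNC copies from a valid version to another placed version. */
bool on_sync(var_state *v, unsigned dst, unsigned src) {
    if (!(v->valid & src) || !(v->placed & dst)) return false;
    v->valid |= dst;
    return true;
}

/* Replays the accesses to y from the running example.
   Returns true if every read sees a valid version. */
bool check_example(bool with_sync) {
    var_state y = { M0 | M1, 0 };
    on_write(&y, M0);                        /* foo(y,x) @ P0 writes y@M0 */
    if (with_sync && !on_sync(&y, M1, M0))   /* SYNC(y) @ DMA */
        return false;
    return on_read(&y, M1);                  /* bar(z,y) @ P1 reads y@M1 */
}
```

Without the SYNC the read of y@M1 is flagged, matching the coherence error on the earlier slide; with it, both versions are valid at the read.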
![Page 27: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/27.jpg)
31
Compiling SoC-C
See paper:
SoC-C: efficient programming abstractions for heterogeneous multicore systems on chip, Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems (CASES) 2008.
(Or view ‘bonus slides’ after talk.)
![Page 28: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/28.jpg)
32
More realistic SoC-C code
DVB-T Inner Receiver
  OFDM receiver
  20 tasks
  500-7000 cycles each
  29000 cycles total
adc_t adc;
ADC_Init(&adc,ADC_BUFSIZE_SAMPLES,adc_Re,adc_Im,13);
SOCC_PIPELINE {
ChannelEstimateInit_DVB_simd(TPS_INFO, CrRe, CrIm) @ DEd;
for(int sym = 0; sym<LOOPS; ++sym) {
cbuffer_t src_r, src_i;
unsigned len = Nguard+asC_MODE[Mode];
ADC_AcquireData(&adc,(sym*len)%ADC_BUFSIZE_SAMPLES,len,&src_r, &src_i);
align(sym_Re,&src_r,len*sizeof(int16_t)) @ DMA_512;
align(sym_Im,&src_i,len*sizeof(int16_t)) @ DMA_512;
ADC_ReleaseRoom(&adc,&src_r,&src_i,len);
RxGuard_DVB_simd(sym_Re,sym_Im,TPS_INFO,Nguard,guarded_Re,guarded_Im) @ DEa;
cscale_DVB_simd(guarded_Re,guarded_Im,23170,avC_MODE[Mode],fft_Re,fft_Im) @ DEa;
fft_DVB_simd(fft_Re,fft_Im,TPS_INFO,ReFFTTwid,ImFFTTwid) @ DEa;
SymUnWrap_DVB_simd(fft_Re,fft_Im,TPS_INFO,unwrapped_Re,unwrapped_Im) @ DEb;
DeMuxSymbol_DVB_simd(unwrapped_Re,unwrapped_Im,TPS_INFO,ISymNum,
demux_Re,demux_Im,PilotsRe,PilotsIm,TPSRe,TPSIm) @ DEb;
DeMuxSymbol_DVB_simd(CrRe,CrIm,TPS_INFO,ISymNum,
demux_CrRe,demux_CrIm,CrPilotsRe,CrPilotsIm,CrTPSRe,CrTPSIm) @ DEb;
cfir1_DVB_simd(demux_Re,demux_Im,demux_CrRe,demux_CrIm,avN_DCPS[Mode],equalized_Re,equalized_Im) @ DEc;
cfir1_DVB_simd(TPSRe,TPSIm,CrTPSRe,CrTPSIm,avN_TPSSCPS[Mode],equalized_TPSRe,equalized_TPSIm) @ DEb;
DemodTPS_DVB_simd(equalized_TPSRe,equalized_TPSIm,TPS_INFO,Pilot,TPSRe) @ DEb;
DemodPilots_DVB_simd(PilotsRe,PilotsIm,TPS_INFO,ISymNum,demod_PilotsRe,demodPilotsIm) @ DEb;
cmagsq_DVB_simd(demux_CrRe,demux_CrIm,12612,avN_DCPS[Mode],MagCr) @ DEc;
int Direction = (ISymNum & 1);
Direction ^= 1;
if (Direction) {
Error=SymInterleave3_DVB_simd2(equalized_Re,equalized_Im,MagCr,
DE_vinterleave_symbol_addr_DVB_T_N,
DE_vinterleave_symbol_addr_DVB_T_OFFSET,
TPS_INFO,Direction,sRe,sIm,sCrMag) @ DEc;
pack3_DVB_simd(sRe,sIm,sCrMag,avN_DCPS[Mode],interleaved_Re,interleaved_Im,Range) @ DEc;
} else {
unpack3_DVB_simd(equalized_Re,equalized_Im,MagCr,avN_DCPS[Mode],sRe,sIm,sCrMag) @ DEc;
Error=SymInterleave3_DVB_simd2(sRe,sIm,sCrMag,
DE_vinterleave_symbol_addr_DVB_T_N,
DE_vinterleave_symbol_addr_DVB_T_OFFSET,
TPS_INFO,Direction,interleaved_Re,interleaved_Im,Range) @ DEc;
}
ChannelEstimate_DVB_simd(interleaved_Re,interleaved_Im,Range,TPS_INFO,CrRe2,CrIm2) @ DEd;
Demod_DVB_simd(interleaved_Re,interleaved_Im,TPS_INFO,Range,demod_softBits) @ DEd;
BitDeInterleave_DVB_simd(demod_softBits,TPS_INFO,deint_softBits) @ DEd;
uint_t err=HardDecoder_DVB_simd(deint_softBits,uvMaxCnt,hardbits) @ DEd;
Bytecpy(&output[p],hardbits,uMaxCnt/8) @ ARM;
p += uMaxCnt/8;
ISymNum = (ISymNum+1) % 4;
}
}
ADC_Fini(&adc);
![Page 29: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/29.jpg)
33 33
Parallel Speedup
Efficient
  Same performance as hand-written code
Near Linear Speedup
  Very efficient use of parallel hardware
[Chart: speedup, on a scale of 0% to 400%, for configurations 1 to 4.]
![Page 30: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/30.jpg)
34 34
What SoC-C Provides
SoC-C language features
  Pipeline to support parallelism
  Coherence to support distributed memory
  RPC to support multiple processors/ISAs
Non-features
  Does not choose boundary between pipeline stages
  Does not resolve coherence problems
  Does not allocate processors
SoC-C is a concise notation to express mapping decisions (not a tool for making them on your behalf)
![Page 31: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/31.jpg)
35 35
Related Work
Language
  OpenMP: SMP data parallelism using ‘C plus annotations’
  StreamIt: Pipeline parallelism using dataflow language
Pipeline parallelism
  J.E. Smith, “Decoupled access/execute computer architectures,” Trans. Computer Systems, 2(4), 1984
  Multiple independent reinventions
Hardware
  Woh et al., “From Soda to Scotch: The Evolution of a Wireless Baseband Processor,” Proc. MICRO-41, Nov. 2008
![Page 32: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/32.jpg)
36 36
More Recent Related Work
Mapping applications onto Embedded SoCs
  Exposing Non-Standard Architectures to Embedded Software using Compile-Time Virtualization, CASES 2009
Pipeline parallelism
  The Paralax Infrastructure: Automatic Parallelization with a Helping Hand, PACT 2010
![Page 33: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/33.jpg)
37 37
The SoC-C Model
Program as if using SMP system
  Single multithreaded processor: RPCs provide a “Migrating thread Model”
  Single memory: Compiler Managed Coherence handles “bookkeeping”
  Annotations change execution, not semantics
Avoid need to restructure code
  Pipeline parallelism
  Compiler managed coherence
Efficiency
  Avoid abstracting expensive operations, so the programmer can optimize and reason about them
![Page 34: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/34.jpg)
38
Kernel Programming
![Page 35: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/35.jpg)
39
Overview
Example: FIR filter
  $y_i = \sum_{j=0}^{T-1} h_j \, x_{i+j}$
Hand-vectorized code
  Optimal performance
  Issues
An Alternative Approach
![Page 36: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/36.jpg)
40
Example Vectorized Code
Very fast, efficient code
  Uses 32-wide SIMD
  Each SIMD multiply performs 32 (useful) multiplies
VLIW compiler overlaps operations
  3 vector operations per cycle
VLIW compiler performs software pipelining
  Multiplier active on every cycle
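void FIR(vint16_t x[], vint16_t y[], int16_t h[]) {
    vint16_t v = x[0];                      // current window of samples
    for (int i=0; i<N/SIMD_WIDTH; ++i) {
        vint16_t w = x[i+1];                // next vector, consumed one lane at a time
        int16_t s;                          // scalar lane taken from w
        int j;                              // declared here: used after the inner loop
        vint32L_t acc = vqdmull(v,h[0]);    // first tap
        s = vget_lane(w,0);
        v = vdown(v,s);                     // shift window, bringing in s
        for(j=1; j<T-1; ++j) {
            acc = vqdmlal(acc,v,h[j]);      // middle taps
            s = vget_lane(w,j);
            v = vdown(v,s);
        }
        y[i] = vqrdmlah(acc,v,h[j]);        // last tap: j == T-1 here
        v = w;
    }
}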
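For comparison, a portable scalar reference of the same filter might look like the sketch below. It assumes the filter computes y[i] as the sum over j of h[j]·x[i+j], which is how the vector code above appears to slide its window, and it uses plain 32-bit accumulation. The intrinsic names in the vector code suggest saturating, doubling, rounding fixed-point arithmetic; that is deliberately left out here. `fir_scalar` and `fir_demo` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* y[i] = sum over j in [0, taps) of h[j] * x[i + j].
   x must hold at least n + taps - 1 samples. */
void fir_scalar(const int16_t *x, int32_t *y, const int16_t *h,
                size_t n, size_t taps) {
    for (size_t i = 0; i < n; ++i) {
        int32_t acc = 0;
        for (size_t j = 0; j < taps; ++j)
            acc += (int32_t)h[j] * x[i + j];
        y[i] = acc;
    }
}

/* Small worked case: x = 1..5, h = {1, 0, -1}. */
int32_t fir_demo(void) {
    const int16_t x[5] = {1, 2, 3, 4, 5};
    const int16_t h[3] = {1, 0, -1};
    int32_t y[3];
    fir_scalar(x, y, h, 3, 3);
    /* y = {1-3, 2-4, 3-5} = {-2, -2, -2} */
    return y[0] + y[1] + y[2];
}
```

This version runs on any C target but leaves all vectorization to the compiler, which is the trade-off the following slides discuss.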
![Page 37: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/37.jpg)
41
Portability Issues
Vendor specific SIMD operations
  vqdmull, vdown, vget_lane
SIMD-width specific
  Assumes SIMD_WIDTH >= T
Doesn’t work/performs badly on
  Many SIMD architectures
  GPGPU
  SMP
![Page 38: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/38.jpg)
42
Flexibility issues
Improve arithmetic intensity
  Merge with adjacent kernel
  E.g., if filtering input to FFT, combine with bit reversal
Parallelize task across two Ardbeg engines
  Requires modification to system-level code
![Page 39: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/39.jpg)
43
Summary
Programming directly to the processor
  Produces very high performance code
  Kernel is not portable to other processor types
  Kernels cannot be remapped to other devices
  Kernels cannot be split/merged to improve scheduling or reduce inter-kernel overheads
Often produces local optimum
  But misses global optimum
![Page 40: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/40.jpg)
44
(Towards)Performance-Portable Kernel Programming
![Page 41: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/41.jpg)
45
Outline
The goal
Quick and dirty demonstration
References to (more complete) versions
What still needs to be done
![Page 42: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/42.jpg)
46
An alternative approach
Compiler
$y_i = \sum_{j=0}^{T-1} h_j \, x_{i+j}$
![Page 43: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/43.jpg)
47
A simple data parallel language
loop(N) {
    V1 = load(a);
    V2 = load(b);
    V3 = add(V1,V2);
    store(c,V3);
}
* Currently implemented as a Haskell EDSL – adapted to C-like notation for presentation.
[Diagram: V1 holds a0, a1, …, a7, …; V2 holds b0, b1, …, b7, …; V3 and c each hold the N element-wise sums a0+b0, a1+b1, …, a7+b7, …]
![Page 44: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/44.jpg)
48
Compiling Vector Expressions
Vector Expression    Init     Generate           Next
V1 = load(a);        p1=a;    V1=vld1(p1);       p1+=32;
V2 = load(b);        p2=b;    V2=vld1(p2);       p2+=32;
V3 = add(V1,V2);              V3=vadd(V1,V2);
store(c,V3);         p3=c;    vst1(p3,V3);       p3+=32;

p1=a; p2=b; p3=c;
for(i=0; i<N; i+=32) {
    V1=vld1(p1);
    V2=vld1(p2);
    V3=vadd(V1,V2);
    vst1(p3,V3);
    p1+=32; p2+=32; p3+=32;
}
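The Init / Generate / Next scheme can be tried out in plain C by emulating the vector operations. In the sketch below, `vec_t` and the loop-based bodies of `vld1`, `vadd`, and `vst1` are stand-ins written for illustration, not the target's real intrinsics, and `n` is assumed to be a multiple of the vector length.

```c
#include <assert.h>

#define VL 32  /* vector length in elements, matching the +=32 strides */

typedef struct { int lane[VL]; } vec_t;   /* hypothetical vector type */

/* Emulated vector load, store, and add. */
static vec_t vld1(const int *p) {
    vec_t v;
    for (int k = 0; k < VL; ++k) v.lane[k] = p[k];
    return v;
}
static void vst1(int *p, vec_t v) {
    for (int k = 0; k < VL; ++k) p[k] = v.lane[k];
}
static vec_t vadd(vec_t a, vec_t b) {
    vec_t r;
    for (int k = 0; k < VL; ++k) r.lane[k] = a.lane[k] + b.lane[k];
    return r;
}

/* loop(N) { V1 = load(a); V2 = load(b); V3 = add(V1,V2); store(c,V3); } */
void vector_add(const int *a, const int *b, int *c, int n) {
    const int *p1 = a;                 /* Init */
    const int *p2 = b;
    int *p3 = c;
    for (int i = 0; i < n; i += VL) {
        vec_t V1 = vld1(p1);           /* Generate */
        vec_t V2 = vld1(p2);
        vec_t V3 = vadd(V1, V2);
        vst1(p3, V3);
        p1 += VL; p2 += VL; p3 += VL;  /* Next */
    }
}

/* Returns 1 if every output element equals a[i] + b[i]. */
int vector_add_demo(void) {
    int a[2 * VL], b[2 * VL], c[2 * VL];
    for (int i = 0; i < 2 * VL; ++i) { a[i] = i; b[i] = 100 - i; }
    vector_add(a, b, c, 2 * VL);
    for (int i = 0; i < 2 * VL; ++i)
        if (c[i] != 100) return 0;
    return 1;
}
```

Each vector expression contributes one fragment to each of the three phases, so a different backend could emit different fragments for the same source expression.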
![Page 45: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/45.jpg)
50
Generating datapath
[Diagram: three address counters (+1) drive MemA, MemB, and MemC; an adder (+) combines the outputs of MemA and MemB.]
* Warning: this circuit does not adhere to any ARM quality standards.
![Page 46: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/46.jpg)
51
Adding control
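[Diagram: the datapath with control added: address counters (+1) for MemA, MemB, and MemC, an adder (+), a down-counter (-1) with a !=0 test producing nDone, and enable (en) signals.]
* Warning: this circuit does not adhere to any ARM quality standards.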
![Page 47: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/47.jpg)
52
Fixing timing
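[Diagram: the same datapath and control (address counters, MemA, MemB, MemC, adder, down-counter, !=0 test, nDone, enable signals), revised to fix timing.]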
![Page 48: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/48.jpg)
53
Related Work
NESL – Nested Data Parallelism (CMU)
  Cray Vector machines, Connection Machine
DpH – Generalization of NESL as a Haskell library (SLPJ++)
  GPGPU
Accelerator – Data parallel library in C#/F# (MSR)
  SMP, DirectX9, FPGA
Array Building Blocks – C++ template library (Intel)
  SMP, SSE
Thrust – C++ template library (NVidia)
  GPGPU
(Also: Parallel Skeletons, Map-reduce, etc. etc.)
![Page 49: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/49.jpg)
54
Summary of approach
(Only) Use highly structured bulk operations
  Bulk operations reason about vectors, not individual elements
  Simple mathematical properties easy to optimize
Single frontend, multiple backends
  SIMD, SMP, GPGPU, FPGA, ...
(Scope for significant platform-dependent optimization)
![Page 50: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/50.jpg)
55
Breaking down boundaries
Hard boundary between system and kernel layers
  Separate languages
  Separate tools
  Separate people writing/optimizing
Need to soften boundary
  Allow kernels to be split across processors
  Allow kernels to be merged across processors
  Allow kernels A and B to agree to use a non-standard memory layout (to run more efficiently)
(This is an open problem)
![Page 51: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/51.jpg)
56
Tackling Performance Portability Problem
High Performance Embedded Systems
  Energy Efficient systems are “lumpy”
  The hardware is the easy bit
Two level approach
  System Programming
    Stitch kernels together
    Inter-kernel parallelism, mapping onto processors/memory
  Kernel Programming
    C+builtins efficient but inflexible, non-portable
    Simple DPL in talk, references to more substantial efforts
    Intra-kernel parallelism expressed
Boundary must be softened
![Page 52: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/52.jpg)
57
Fin
![Page 53: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/53.jpg)
58 58
Language Design Meta Issues
Compiler only uses simple analyses
  Easier to maintain consistency between different compiler versions/implementations
Programmer makes the high-level decisions
  Code and Data Placement
  Inserting SYNC
  Load balancing
Implementation by many source-source transforms
  Programmer can mix high- and low-level features
  90-10 rule: use high-level features when you can, low-level features when you need to
![Page 54: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/54.jpg)
59 59
Compiling SoC-C
1. Data Placement
  a) Infer data placement
  b) Propagate coherence
  c) Split variables with multiple placement
2. Pipeline Parallelism
  a) Identify maximal threads
  b) Split into multiple threads
  c) Apply zero copy optimization
3. RPC (see paper for details)
![Page 55: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/55.jpg)
60 60
Step 1a: Infer Data Placement
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
SYNC(y) @ DMA;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
Solve Set of Constraints
![Page 56: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/56.jpg)
61 61
Step 1a: Infer Data Placement
int x[100];
int y[100];
int z[100];
PIPELINE {
while (1) {
get(x);
foo(y,x) @ P0;
SYNC(y) @ DMA;
FIFO(y);
bar(z,y) @ P1;
baz(z) @ P1;
put(z);
}
}
Solve Set of Constraints
  Memory Topology constrains where variables could live
![Page 57: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/57.jpg)
62 62
Step 1a: Infer Data Placement
Solve Set of Constraints
  Memory Topology constrains where variables could live
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
while (1) {
get(x@?);
foo(y@M0, x@M0) @ P0;
SYNC(y,?,?) @ DMA;
FIFO(y@?);
bar(z@M1, y@M1) @ P1;
baz(z@M1) @ P1;
put(z@?);
}
}
![Page 58: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/58.jpg)
63 63
Step 1b: Propagate Coherence
Solve Set of Constraints
  Memory Topology constrains where variables could live
  Forwards Dataflow propagates availability of valid versions
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
while (1) {
get(x@?);
foo(y@M0, x@M0) @ P0;
SYNC(y,?,?) @ DMA;
FIFO(y@?);
bar(z@M1, y@M1) @ P1;
baz(z@M1) @ P1;
put(z@?);
}
}
![Page 59: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/59.jpg)
64 64
Step 1b: Propagate Coherence
Solve Set of Constraints
  Memory Topology constrains where variables could live
  Forwards Dataflow propagates availability of valid versions
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
while (1) {
get(x@?);
foo(y@M0, x@M0) @ P0;
SYNC(y,?,M0) @ DMA;
FIFO(y@?);
bar(z@M1, y@M1) @ P1;
baz(z@M1) @ P1;
put(z@M1);
}
}
![Page 60: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/60.jpg)
65 65
Step 1b: Propagate Coherence
Solve Set of Constraints
  Memory Topology constrains where variables could live
  Forwards Dataflow propagates availability of valid versions
  Backwards Dataflow propagates need for valid versions
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
while (1) {
get(x@?);
foo(y@M0, x@M0) @ P0;
SYNC(y,?,M0) @ DMA;
FIFO(y@?);
bar(z@M1, y@M1) @ P1;
baz(z@M1) @ P1;
put(z@M1);
}
}
![Page 61: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/61.jpg)
66 66
Step 1b: Propagate Coherence
Solve Set of Constraints
  Memory Topology constrains where variables could live
  Forwards Dataflow propagates availability of valid versions
  Backwards Dataflow propagates need for valid versions
int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x@M0);
    foo(y@M0, x@M0) @ P0;
    SYNC(y,M1,M0) @ DMA;
    FIFO(y@M1);
    bar(z@M1, y@M1) @ P1;
    baz(z@M1) @ P1;
    put(z@M1);
  }
}
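The two propagation passes can be sketched in plain C. This is only an illustration under simplifying assumptions: it handles one variable at a time, treats the loop body as straight-line code (the back-edge of `while (1)` is ignored), and skips the memory-topology check. The `access_t` type, the READ/WRITE/PASS classification, and the function names are hypothetical and not part of SoC-C.

```c
#include <assert.h>
#include <stddef.h>

#define UNKNOWN (-1)  /* stands for the "?" placeholder on the slides */

/* PASS models an operation that reads and rewrites in place, e.g. FIFO(y). */
typedef enum { READ, WRITE, PASS } access_kind;

typedef struct {
    access_kind kind;
    int loc;          /* memory index (0 for M0, 1 for M1) or UNKNOWN */
} access_t;

/* Forwards pass: an unresolved read takes the location of the most
   recent valid version, when that location is known. */
void propagate_forwards(access_t *a, size_t n) {
    int valid = UNKNOWN;
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (a[i].kind != WRITE && a[i].loc == UNKNOWN)
            a[i].loc = valid;
        if (a[i].kind != READ)
            valid = a[i].loc;   /* a write (or pass) defines the valid version */
    }
}

/* Backwards pass: an unresolved write takes the location in which the
   next reader needs a valid version, when that location is known. */
void propagate_backwards(access_t *a, size_t n) {
    int need = UNKNOWN;
    assert(a != NULL || n == 0);
    for (size_t i = n; i-- > 0; ) {
        if (a[i].kind != READ && a[i].loc == UNKNOWN)
            a[i].loc = need;
        /* A plain write satisfies the need; reads and passes hand it upstream. */
        need = (a[i].kind == WRITE) ? UNKNOWN : a[i].loc;
    }
}
```

For `y`, the accesses would be: written by `foo` in M0, read by SYNC at `?`, written by SYNC at `?`, passed through FIFO at `?`, read by `bar` in M1. The forwards pass resolves only the SYNC source; the backwards pass then resolves the SYNC destination and the FIFO.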
![Page 62: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/62.jpg)
67 67
Step 1c: Split Variables
int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Split variables with multiple locations
Replace SYNC with memcpy
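A rough picture of the result in ordinary C, with the two memories simulated by separate arrays. The kernel bodies of `foo` and `bar` are placeholders invented for the example, and the copies are named `y_m0` and `y_m1` rather than `y0` and `y1` because those names clash with the POSIX Bessel functions in hosted C.

```c
#include <assert.h>
#include <string.h>

#define N 100

/* One copy of y per memory; M0 and M1 are simulated by ordinary arrays. */
static int x[N], y_m0[N];   /* would live in M0 */
static int y_m1[N], z[N];   /* would live in M1 */

/* Hypothetical kernels standing in for foo and bar on the slides. */
static void foo(int *y, const int *in) {
    for (int i = 0; i < N; i++) y[i] = 2 * in[i];
}

static void bar(int *out, const int *y) {
    for (int i = 0; i < N; i++) out[i] = y[i] + 1;
}

/* One loop iteration after Step 1c: SYNC(y, M1, M0) has become a memcpy
   from the M0 copy to the M1 copy. */
void iteration(void) {
    assert(sizeof y_m0 == sizeof y_m1);
    foo(y_m0, x);
    memcpy(y_m1, y_m0, sizeof y_m1);
    bar(z, y_m1);
}
```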
![Page 63: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/63.jpg)
68 68
Step 2: Implement Pipeline Annotation
int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Dependency Analysis
![Page 64: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/64.jpg)
69 69
Step 2a: Identify Dependent Operations
int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Dependency Analysis
Split use-def chains at FIFOs
![Page 65: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/65.jpg)
70 70
Step 2b: Identify Maximal Threads
int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};
PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}
Dependency Analysis
Split use-def chains at FIFOs
Identify Thread Operations
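One way to picture the grouping is a union-find over operations, joining each use to its most recent definition and cutting the chain at a FIFO. This is a sketch under assumptions, not the SoC-C compiler's algorithm: each operation has at most one use and one def, the loop body is straight-line, and `op_t` and `identify_threads` are hypothetical names.

```c
#include <assert.h>

#define MAX_OPS 16
#define MAX_VARS 8
#define NONE (-1)

typedef struct {
    int use;      /* variable read, or NONE */
    int def;      /* variable written, or NONE */
    int is_fifo;  /* nonzero for FIFO(v): cuts the use-def chain on v */
} op_t;

static int parent[MAX_OPS];

static int find(int i) {
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];  /* path halving */
        i = parent[i];
    }
    return i;
}

static void unite(int a, int b) { parent[find(a)] = find(b); }

/* Fills thread_of[i] with a thread number for each operation and
   returns the number of threads found. */
int identify_threads(const op_t *ops, int n, int *thread_of) {
    int last_def[MAX_VARS];
    int id_of_root[MAX_OPS];
    int count = 0;

    assert(n >= 0 && n <= MAX_OPS);
    for (int v = 0; v < MAX_VARS; v++) last_def[v] = NONE;
    for (int i = 0; i < n; i++) { parent[i] = i; id_of_root[i] = NONE; }

    for (int i = 0; i < n; i++) {
        assert(ops[i].use < MAX_VARS && ops[i].def < MAX_VARS);
        /* Join this operation to the one that produced its input. */
        if (ops[i].use != NONE && last_def[ops[i].use] != NONE)
            unite(i, last_def[ops[i].use]);
        if (ops[i].is_fifo && ops[i].use != NONE)
            last_def[ops[i].use] = NONE;   /* later readers start a new chain */
        else if (ops[i].def != NONE)
            last_def[ops[i].def] = i;
    }

    for (int i = 0; i < n; i++) {
        int r = find(i);
        if (id_of_root[r] == NONE) id_of_root[r] = count++;
        thread_of[i] = id_of_root[r];
    }
    return count;
}
```

Applied to the seven operations of the example, this yields two groups: `get`, `foo`, `memcpy`, and `FIFO` in one, and `bar`, `baz`, and `put` in the other.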
![Page 66: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/66.jpg)
71 71
Step 2b: Split Into Multiple Threads
int x[100] @ {M0};
int y0[100] @ {M0};
int y1a[100] @ {M1};
int y1b[100] @ {M1};
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      memcpy(y1a,y0,…) @ DMA;
      fifo_put(&f, y1a);
    }
  }
  SECTION {
    while (1) {
      fifo_get(&f, y1b);
      bar(z, y1b) @ P1;
      baz(z) @ P1;
      put(z);
    }
  }
}
Perform Dataflow Analysis
Split use-def chains at FIFOs
Identify Thread Operations
Split into threads
![Page 67: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/67.jpg)
72 72
Step 2c: Zero Copy Optimization
int x[100] @ {M0};
int y0[100] @ {M0};
int y1a[100] @ {M1};
int y1b[100] @ {M1};
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      memcpy(y1a,y0,…) @ DMA;
      fifo_put(&f, y1a);
    }
  }
  SECTION {
    while (1) {
      fifo_get(&f, y1b);
      bar(z, y1b) @ P1;
      baz(z) @ P1;
      put(z);
    }
  }
}
Generate Data
Copy into FIFO
Copy out of FIFO
Consume Data
![Page 68: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/68.jpg)
73 73
Step 2c: Zero Copy Optimization
int x[100] @ {M0};
int y0[100] @ {M0};
int y1a[100] @ {M1};
int y1b[100] @ {M1};
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      memcpy(y1a,y0,…) @ DMA;
      fifo_put(&f, y1a);
    }
  }
  SECTION {
    while (1) {
      fifo_get(&f, y1b);
      bar(z, y1b) @ P1;
      baz(z) @ P1;
      put(z);
    }
  }
}
Calculate Live Range of variables passed through FIFOs
Live Range of y1a
Live Range of y1b
![Page 69: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/69.jpg)
74 74
Step 2c: Zero Copy Optimization
int x[100] @ {M0};
int y0[100] @ {M0};
int *py1a;
int *py1b;
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      fifo_acquireRoom(&f, &py1a);
      memcpy(py1a,y0,…) @ DMA;
      fifo_releaseData(&f, py1a);
    }
  }
  SECTION {
    while (1) {
      fifo_acquireData(&f, &py1b);
      bar(z, py1b) @ P1;
      fifo_releaseRoom(&f, py1b);
      baz(z) @ P1;
      put(z);
    }
  }
}
Calculate Live Range of variables passed through FIFOs
Transform FIFO operations to pass pointers instead of copying data
Acquire empty buffer
Generate data directly into buffer
Pass full buffer to thread 2
Acquire full buffer from thread 1
Consume data directly from buffer
Release empty buffer
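The four FIFO calls can be illustrated with a small ring of fixed-size buffers. The function names follow the slides, but everything else is an assumption made for this sketch: the `fifo_t` layout, the slot count, the added `fifo_init`, and the non-blocking return values (the real operations presumably block instead of returning 0). It is single-threaded and has no synchronization.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_SLOTS 2
#define FIFO_LEN 100

typedef struct {
    int slot[FIFO_SLOTS][FIFO_LEN];
    size_t wr;      /* next slot the producer fills */
    size_t rd;      /* next slot the consumer drains */
    size_t filled;  /* slots holding data not yet released by the consumer */
} fifo_t;

void fifo_init(fifo_t *f) { f->wr = f->rd = f->filled = 0; }

/* Producer: get a pointer to an empty buffer. Returns 0 if the FIFO is full. */
int fifo_acquireRoom(fifo_t *f, int **p) {
    if (f->filled == FIFO_SLOTS) return 0;
    *p = f->slot[f->wr];
    return 1;
}

/* Producer: hand the filled buffer to the consumer. No data is copied. */
void fifo_releaseData(fifo_t *f, int *p) {
    assert(p == f->slot[f->wr]);
    f->wr = (f->wr + 1) % FIFO_SLOTS;
    f->filled++;
}

/* Consumer: get a pointer to the oldest full buffer. Returns 0 if empty. */
int fifo_acquireData(fifo_t *f, int **p) {
    if (f->filled == 0) return 0;
    *p = f->slot[f->rd];
    return 1;
}

/* Consumer: return the drained buffer so the producer can reuse it. */
void fifo_releaseRoom(fifo_t *f, int *p) {
    assert(p == f->slot[f->rd]);
    f->rd = (f->rd + 1) % FIFO_SLOTS;
    f->filled--;
}
```

The producer writes directly into the slot it acquired, and the consumer reads from that same slot, so the only copy left is the DMA transfer that generates the data.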
![Page 70: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/70.jpg)
75 75
Step 3a: Resolve Overloaded RPC
int x[100] @ {M0};
int y0[100] @ {M0};
int *py1a;
int *py1b;
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      DE32_foo(0, y0, x);
      fifo_acquireRoom(&f, &py1a);
      DMA_memcpy(py1a,y0,…);
      fifo_releaseData(&f, py1a);
    }
  }
  SECTION {
    while (1) {
      fifo_acquireData(&f, &py1b);
      DE32_bar(1, z, py1b);
      fifo_releaseRoom(&f, py1b);
      DE32_baz(1, z);
      put(z);
    }
  }
}
Replace RPC by architecture-specific call
bar(…)@P1 → DE32_bar(1,…)
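The rewrite amounts to a table lookup on the processor named after `@`. A sketch of that lookup, working on strings for simplicity: the `processor_t` table and `resolve_rpc` are hypothetical, and the table entries are inferred from the calls shown on the slide (P0 and P1 as DE32 instances 0 and 1, DMA with no instance argument).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *name;    /* processor name used after '@', e.g. "P1" */
    const char *prefix;  /* architecture prefix, e.g. "DE32" */
    int instance;        /* instance number passed first, or -1 for none */
} processor_t;

/* Hypothetical target description matching the calls on the slide. */
static const processor_t processors[] = {
    { "P0",  "DE32", 0 },
    { "P1",  "DE32", 1 },
    { "DMA", "DMA", -1 },
};

/* Writes the architecture-specific call for `fn(args) @ proc` into out.
   Returns 1 on success, 0 if proc is unknown or out is too small. */
int resolve_rpc(char *out, size_t size,
                const char *fn, const char *args, const char *proc) {
    assert(out != NULL && fn != NULL && args != NULL && proc != NULL);
    for (size_t i = 0; i < sizeof processors / sizeof processors[0]; i++) {
        const processor_t *p = &processors[i];
        int len;
        if (strcmp(p->name, proc) != 0) continue;
        if (p->instance < 0)
            len = snprintf(out, size, "%s_%s(%s)", p->prefix, fn, args);
        else
            len = snprintf(out, size, "%s_%s(%d, %s)",
                           p->prefix, fn, p->instance, args);
        return len >= 0 && (size_t)len < size;
    }
    return 0;
}
```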
![Page 71: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/71.jpg)
76 76
Step 3b: Split RPCs
int x[100] @ {M0};
int y0[100] @ {M0};
int *py1a;
int *py1b;
int z[100] @ {M1};
PARALLEL {
  SECTION {
    while (1) {
      get(x);
      start_DE32_foo(0, y0, x);
      wait(semaphore_DE32[0]);
      fifo_acquireRoom(&f, &py1a);
      start_DMA_memcpy(py1a,y0,…);
      wait(semaphore_DMA);
      fifo_releaseData(&f, py1a);
    }
  }
  SECTION {
    while (1) {
      fifo_acquireData(&f, &py1b);
      start_DE32_bar(1, z, py1b);
      wait(semaphore_DE32[1]);
      fifo_releaseRoom(&f, py1b);
      start_DE32_baz(1, z);
      wait(semaphore_DE32[1]);
      put(z);
    }
  }
}
RPCs have two phases
start RPC
wait for RPC to complete
DE32_foo(0,…); → start_DE32_foo(0,…); wait(semaphore_DE32[0]);
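The two phases can be mimicked on a host machine with a simulated device. This is a single-threaded sketch, not the real runtime: `device_t`, `start_rpc`, `wait_rpc`, `device_step`, and the `double_all` kernel are all invented for the example, and the "hardware" only makes progress when `wait_rpc` polls it.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*kernel_fn)(int *out, const int *in, size_t n);

typedef struct {
    kernel_fn fn;       /* pending kernel, valid while busy */
    int *out;
    const int *in;
    size_t n;
    int busy;           /* 1 between start and completion */
    int semaphore;      /* posted once per completed RPC */
} device_t;

/* Hypothetical kernel used only to exercise the sketch. */
void double_all(int *out, const int *in, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = 2 * in[i];
}

/* Phase 1: hand the work to the device and return immediately. */
void start_rpc(device_t *d, kernel_fn fn, int *out, const int *in, size_t n) {
    assert(!d->busy);
    d->fn = fn; d->out = out; d->in = in; d->n = n;
    d->busy = 1;
}

/* Simulates the hardware finishing its work and posting the semaphore. */
void device_step(device_t *d) {
    if (d->busy) {
        d->fn(d->out, d->in, d->n);
        d->busy = 0;
        d->semaphore++;
    }
}

/* Phase 2: block until the device signals completion. */
void wait_rpc(device_t *d) {
    assert(d->busy || d->semaphore > 0);  /* otherwise this would never return */
    while (d->semaphore == 0) device_step(d);
    d->semaphore--;
}
```

Separating the phases leaves room between `start_rpc` and `wait_rpc` for the control processor to do other work while the device runs.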
![Page 72: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/72.jpg)
77 77
Order of transformations
Dataflow-sensitive transformations go first
  Inferring data placement
  Coherence checking within threads
  Dependency analysis for parallelism
Parallelism transformations
  Obscure data and control flow
Thread-local optimizations go last
  Zero-copy optimization of FIFO operations
  Continuation-passing thread implementation
![Page 73: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/73.jpg)
78
Aside: Why hardware companies are fun
You get to play with cool hardware
  Often before it has been debugged
You get to play with powerful debugging tools
  Incredible level of detail visible
  E.g., Palladium traces on next slides
![Page 74: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/74.jpg)
79 79
Unoptimized task scheduling
[Trace figure. Labels: 968, DE0, DE1, fft, demod. Annotated intervals: 195 cycles, 273 cycles.]
![Page 75: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/75.jpg)
80 80
Optimized device driver on ARM
[Trace figure. Labels: 968, DE0, DE1, fft, demod. Annotated intervals: 155 cycles, 257 cycles.]
![Page 76: Programming High Performance Embedded Systems: Tackling the Performance Portability Problem](https://reader035.vdocument.in/reader035/viewer/2022062804/568148f3550346895db61280/html5/thumbnails/76.jpg)
81 81
Task scheduling hardware support
[Trace figure. Labels: 968, DE0, DE1, fft, demod. Annotated intervals: 1 cycle, 303 cycles, 183 cycles.]