Download - Program Synthesis for Low-Power Accelerators
![Page 1: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/1.jpg)
Program Synthesis for Low-Power AcceleratorsRas Bodik Mangpo Phitchaya PhothilimthanaTikhon JelvisRohin ShahNishant Totla
Computer ScienceUC Berkeley
![Page 2: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/2.jpg)
What we do and talk overview
Our main expertise is in program synthesisa modern alternative/complement to compilation
Our applications of synthesis: hard-to-write code
parallel, concurrent, dynamic programming, end-user code
In this project, we explore spatial accelerators
an accelerator programming model aided by synthesis 2
![Page 3: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/3.jpg)
Future Programmable Accelerators
3
![Page 4: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/4.jpg)
![Page 5: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/5.jpg)
![Page 6: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/6.jpg)
no cache coherencelimited interconnect
spatial & temporal partitioningno clock!
crazy ISAsmall memoryback to 16-bit nums
![Page 7: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/7.jpg)
no cache coherencelimited interconnect
spatial & temporal partitioningno clock!
crazy ISAsmall memory
back to 16-bit nums
![Page 8: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/8.jpg)
What we desire from programmable acceleratorsWe want the obvious conflicting properties:
- high performance at low energy- easy to program, port, and autotune
Can’t usually get both- transistors that aid programmability burn
energy- ex: cache coherence, smart interconnects, …
In principle, most decisions can be done in compilers
- which would simplify the hardware- but the compiler power has proven limited
8
![Page 9: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/9.jpg)
We askHow to use fewer “programmability transistors”?
How should a programming framework support this?
Our approach: synthesis-aided programming model
9
![Page 10: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/10.jpg)
GA 144: our stress-test case study
10
![Page 11: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/11.jpg)
Low-Power Architectures
thanks: Per Ljung (Nokia)11
![Page 12: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/12.jpg)
Why is GA low powerThe GA architecture uses several mechanisms to achieve ultra high energy efficiency, including:
– asynchronous operation (no clock tree), – ultra compact operator encoding (4 instructions per
word) to minimize fetching, – minimal operand communication using stack
architecture, – optimized bitwidths and small local resources, – automatic fine-grained power gating when waiting for
inter-node communication, and – node implementation minimizes communication energy.
12
![Page 13: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/13.jpg)
GreenArray GA144
slide from Rimas Avizienis
![Page 14: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/14.jpg)
MSP430 vs GreenArray
![Page 15: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/15.jpg)
Finite Impulse Response BenchmarkPerformance
MSP430 (65nm) GA144 (180nm)
F2274swmult
F2617hwmult
F2617hwmult+dm
ausec / FIR output
688.75 37.125 24.25 2.18
nJ / FIR output 2824.54 233.92 152.80 17.66MSP430 GreenArrays
Data from Rimas Avizienis
GreenArrays 144 is 11x faster and simultaneously 9x more energy efficient than MSP 430.
![Page 16: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/16.jpg)
How to compile to spatial architectures
16
![Page 17: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/17.jpg)
Programming/Compiling Challenges
Algorithm/
Pseudocode
Partition&
DistributeData
Structures
PlaceComp1Comp2Comp3Send XComp4Recv YComp5
Schedule&
Implement code
&Route
Communication
![Page 18: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/18.jpg)
Existing technologiesOptimizing compilers
hard to write optimizing code transformations
FPGAs: synthesis from Cpartition C code, map it and route the communication
Bluespec: synthesis from C or SystemVerilogsynthesizes HW, SW, and HW/SW interfaces
19
![Page 19: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/19.jpg)
What does our synthesis differ?FPGA, BlueSpec:
partition and map to minimize cost
We push synthesis further:– super-optimization: synthesize code that meets
a spec– a different target: accelerator, not FPGA, not
RTL
Benefits: superoptimization: easy to port as there is no need to port code generator and the optimizer to a new ISA 20
![Page 20: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/20.jpg)
Overview of program synthesis
21
![Page 21: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/21.jpg)
Synthesis with “sketches”
22
spec: int foo (int x) { return x + x; }
sketch: int bar (int x) implements foo { return x << ??;
}
result: int bar (int x) implements foo { return x << 1;}
Extend your language with two constructs
22
𝜙 (𝑥 , 𝑦 ) :𝑦=foo (𝑥)
?? substituted with an int constant meeting
instead of implements, assertions over safety properties can be used
![Page 22: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/22.jpg)
Example: 4x4-matrix transpose with SIMDa functional (executable) specification:
int[16] transpose(int[16] M) { int[16] T = 0; for (int i = 0; i < 4; i++) for (int j = 0; j < 4; j++) T[4 * i + j] = M[4 * j + i]; return T;}
This example comes from a Sketch grad-student contest
2323
![Page 23: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/23.jpg)
Implementation idea: parallelize with SIMDIntel SHUFP (shuffle parallel scalars) SIMD
instruction:return = shufps(x1, x2, imm8 :: bitvector8)
24
x1 x2
return
24
imm8[0:1]
![Page 24: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/24.jpg)
High-level insight of the algorithm designerMatrix transposed in two shuffle phases
Phase 1: shuffle into an intermediate matrix with some number of shufps instructions
Phase 2: shuffle into an result matrix with some number of shufps instructions
Synthesis with partial programs helps one to complete their insight. Or prove it wrong.
25
![Page 25: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/25.jpg)
The SIMD matrix transpose, sketchedint[16] trans_sse(int[16] M) implements trans { int[16] S = 0, T = 0;
S[??::4] = shufps(M[??::4], M[??::4], ??); S[??::4] = shufps(M[??::4], M[??::4], ??); … S[??::4] = shufps(M[??::4], M[??::4], ??);
T[??::4] = shufps(S[??::4], S[??::4], ??); T[??::4] = shufps(S[??::4], S[??::4], ??); … T[??::4] = shufps(S[??::4], S[??::4], ??);
return T;} 26
Phase 1
Phase 2
![Page 26: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/26.jpg)
The SIMD matrix transpose, sketchedint[16] trans_sse(int[16] M) implements trans { int[16] S = 0, T = 0; repeat (??) S[??::4] = shufps(M[??::4],
M[??::4], ??); repeat (??) T[??::4] = shufps(S[??::4],
S[??::4], ??); return T;}int[16] trans_sse(int[16] M) implements trans { //
synthesized code S[4::4] = shufps(M[6::4], M[2::4], 11001000b); S[0::4] = shufps(M[11::4], M[6::4], 10010110b); S[12::4] = shufps(M[0::4], M[2::4], 10001101b); S[8::4] = shufps(M[8::4], M[12::4], 11010111b); T[4::4] = shufps(S[11::4], S[1::4], 10111100b); T[12::4] = shufps(S[3::4], S[8::4], 11000011b); T[8::4] = shufps(S[4::4], S[9::4], 11100010b); T[0::4] = shufps(S[12::4], S[0::4], 10110100b);}
27
From the contestant email: Over the summer, I spent about 1/2 a day manually figuring it out. Synthesis time: <5 minutes.
![Page 27: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/27.jpg)
Demo: transpose on Sketch
Try Sketch online at http://bit.ly/sketch-language
28
![Page 28: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/28.jpg)
We propose a programming model for low-power devices
by exploiting program synthesis.
34
![Page 29: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/29.jpg)
Programming/Compiling Challenges
Algorithm/
Pseudocode
Partition&
DistributeData
Structures
PlaceComp1Comp2Comp3Send XComp4Recv YComp5
Schedule&
Implement code
&Route
Communication
![Page 30: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/30.jpg)
Language?
Project Pipeline
Comp1Comp2Comp3Send XComp4Recv YComp5
Language Design
• expressive• flexible (easy to
partition)
Partitioning
• minimize # of msgs
• fit each block in acore
Placement & Routing • minimize
comm cost• reason about
I/O pins
Intracore scheduling & Optimization• overlap comp
and comm• avoid deadlock• find most
energy-efficient code
![Page 31: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/31.jpg)
Programming model abstractions
![Page 32: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/32.jpg)
MD5 case study
38
![Page 33: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/33.jpg)
MD5 Hash
39
ith round
Buffer (before)
Buffer (after)
message
constantfrom lookup table
Figure taken from Wikipedia
Ri
![Page 34: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/34.jpg)
MD5 from Wikipedia
40Figure taken from Wikipedia
![Page 35: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/35.jpg)
Actual MD5 Implementation on GA144
41
![Page 36: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/36.jpg)
MD5 on GA144
42
High order
Low order
105104103102
005004003002
shift value
R
currenthash
message
M
message
M
106
006
rotate&
add with carry
rotate&
add with carry
constant
K
constant
K
Ri
![Page 37: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/37.jpg)
This is how we express MD5
43
![Page 38: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/38.jpg)
Project Pipeline
![Page 39: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/39.jpg)
Annotation at Variable Declarationtypedef pair<int,int> myInt;
vector<myInt>@{[0:16]=(103,3)} message[16];vector<myInt>@{[0:64]=(106,6)} k[64];vector<myInt>@{[0:4] =(104,4)} output[4];vector<myInt>@{[0:4] =(104,4)} hash[4];vector<int> @{[0:64]=102} r[64];
45
@core indicates where data lives.
106
6k[i]
high order
low order
![Page 40: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/40.jpg)
Annotation in Programvoid@(104,4) md5() { for(myInt@any t = 0; t < 16; t++) { myInt@here a = hash[0], b = hash[1], c = hash[2], d = hash[3]; for(myInt@any i = 0; i < 64; i++) { myInt@here temp = d; d = c; c = b; b = round(a, b, c, d, i); a = temp; } hash[0] += a; hash[1] += b; hash[2] += c; hash[3] += d; } output[0] = hash[0]; output[1] = hash[1]; output[2] = hash[2]; output[3] = hash[3];}
46
(104,4) is home for md5()function
@any suggests that any core can have this variable.
@here refers to the function’s home @(104,4)
![Page 41: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/41.jpg)
MD5 in CParttypedef pair<int,int> myInt;
vector<myInt>@{[0:64]=(106,6)} k[64];
myInt@(105,5) sumrotate(myInt@(104,4) buffer, ...) { myInt@h sum = buffer +@here k[i] + message[g]; ...}
![Page 42: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/42.jpg)
MD5 in CParttypedef pair<int,int> myInt;
vector<myInt>@{[0:64]=(106,6)} k[64];
myInt@(105,5) sumrotate(myInt@(104,4) buffer, ...) { myInt@h sum = buffer +@here k[i] + message[g]; ...}
48
buffer is at (104,4)
104
4buffer
high order
low order
![Page 43: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/43.jpg)
MD5 in CParttypedef pair<int,int> myInt;
vector<myInt>@{[0:64]=(106,6)} k[64];
myInt@(105,5) sumrotate(myInt@(104,4) buffer, ...) { myInt@h sum = buffer +@here k[i] + message[g]; ...}
49
k[i] is at (106,6)
106
6k[i]
high order
low order
![Page 44: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/44.jpg)
MD5 in CParttypedef pair<int,int> myInt;
vector<myInt>@{[0:64]=(106,6)} k[64];
myInt@(105,5) sumrotate(myInt@(104,4) buffer, ...) { myInt@h sum = buffer +@here k[i] + message[g]; ...}
50
+ is at (105,5)
105
5
+
+
high order
low order
![Page 45: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/45.jpg)
MD5 in CParttypedef pair<int,int> myInt;
vector<myInt>@{[0:64]=(106,6)} k[64];
myInt@(105,5) sumrotate(myInt@(104,4) buffer, ...) { myInt@h sum = buffer +@here k[i] + message[g]; ...}
51
buffer is at (104,4)+ is at (105,5)k[i] is at (106,6)
104
4
105
5
106
6
+
+
buffer k[i]
high order
low order
Implicit communication in source program. Communication inserted by synthesizer.
![Page 46: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/46.jpg)
MD5 in CParttypedef pair<int,int> myInt;
vector<myInt>@{[0:64]=(205,105)} k[64];
myInt@(204,5) sumrotate(myInt@(104,4) buffer, ...) { myInt@h sum = buffer +@here k[i] + message[g]; ...}
52
buffer is at (104,4)+ is at (204,5)k[i] is at (205,105)
104
4
205
105
204+
5+
buffer
k[i]
low order
high order
![Page 47: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/47.jpg)
MD5 in CParttypedef vector<int> myInt;
vector<myInt>@{[0:64]={306,206,106,6}} k[64];
myInt@{305,205,105,5} sumrotate(myInt@{304,204,104,4} buffer, ...) { myInt@h sum = buffer +@here k[i] + message[g]; ...}
53
buffer is at {304,204,104,4}+ is at {305,205,105,5}k[i] is at {306,206,106,6}
104
4
106
6
105+
5+
buffer
204
304
205+
305+
206
306k[i]
![Page 48: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/48.jpg)
MD5 in CParttypedef pair<int,int> myInt;
vector<myInt>@{[0:64]=(106,6)} k[64];
myInt@(105,5) sumrotate(myInt@(104,4) buffer, ...) { myInt@h sum = buffer +@here k[i] + message[g]; ...}
![Page 49: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/49.jpg)
MD5 in CParttypedef pair<int,int> myInt;
vector<myInt>@{[0:64]=(106,6)} k[64];
myInt@(105,5) sumrotate(myInt@(104,4) buffer, ...) { myInt@h sum = buffer +@here k[i] +@?? message[g]; ...}
+ happens at here which is (105,5)
+ happens at where the synthesizer decides
![Page 50: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/50.jpg)
Summary: Language & CompilerLanguage features:• Specify code and data placement by
annotation• No explicit communication required• Option to not specifying place
Synthesizer fills in holes such that:• number of messages is minimized• code can fit in each core
56
![Page 51: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/51.jpg)
Project Pipeline
![Page 52: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/52.jpg)
Program Synthesis for Code GenerationSynthesizing optimal code
Input: unoptimized code (the spec)Search space of all programs
Synthesizing optimal library codeInput: sketch + specSearch completions of the sketch
Synthesizing communication codeInput: program with virtual channelsInsert actual communication code
58
![Page 53: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/53.jpg)
slower
faster
unoptimized code (spec)
optimal code
synthesizer
1) Synthesizing optimal code
![Page 54: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/54.jpg)
Our Experiment
Register-based processor Stack-based processor
slower
faster
naive
optimized
spec
most optimal
hand
bit trick synthesizer
![Page 55: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/55.jpg)
Our Experiment
Register-based processor Stack-based processor
slower
faster
naive
optimized
spec
most optimal
hand
bit trick synthesizer
![Page 56: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/56.jpg)
Comparison
Register-based processor Stack-based processor
slower
faster
naive
optimized
spec
most optimal
translation
hand
bit trick
hand
synthesizer
![Page 57: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/57.jpg)
Preliminary Synthesis Times
Synthesizing a program with 8 unknown instructions takes 5 second to 5 minutes
Synthesizing a program up to ~25 unknown instructions within 50 minutes
![Page 58: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/58.jpg)
Preliminary ResultsProgram Description Approx.
SpeedupCode
length reduction
x – (x & y) Exclude common bits 5.2x 4x~(x – y) Negate difference 2.3x 2xx | y Inclusive or 1.8x 1.8x(x + 7) & -8 Round up to multiple
of 81.7x 1.8x
(x & m) | (y & ~m) Replace x with y where bits of m are 1’s
2x 2x
(y & m) | (x & ~m) Replace y with x where bits of m are 1’s
2.6x 2.6x
x’ = (x & m) | (y & ~m)y’ = (y & m) | (x & ~m)
Swap x and y where bits of m are 1’s
2x 2x
![Page 59: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/59.jpg)
Code Length
Program Original Length
Output Length
x – (x & y) 8 2~(x – y) 8 4x | y 27 15(x + 7) & -8 9 5(x & m) | (y & ~m) 22 11(y & m) | (x & ~m) 21 8x’ = (x & m) | (y & ~m)y’ = (y & m) | (x & ~m)
43 21
![Page 60: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/60.jpg)
2) Synthesizing optimal library codeInput:
Sketch: program with holes to be filledSpec: program in any programing
language
Output:Complete program with filled holes
![Page 61: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/61.jpg)
Example: Integer Division by ConstantNaïve Implementation:
Subtract divisor until reminder < divisor. # of iterations = output value Inefficient!
Better Implementation:
n - inputM - “magic” numbers - shifting value
M and s depend on the number of bits and constant divisor.
quotient = (M * n) >> s
![Page 62: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/62.jpg)
Example: Integer Division by 3Spec:
n/3
Sketch:quotient = (?? * n) >> ??
![Page 63: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/63.jpg)
Preliminary ResultsProgram Solution Synthesi
sTime (s)
Verification
Time (s)
# of Pairs
x/3 (43691 * x) >> 17
2.3 7.6 4
x/5 (209716 * x) >> 20
3 8.6 6
x/6 (43691 * x) >> 18
3.3 6.6 6
x/7 (149797 * x) >> 20
2 5.5 3
deBruijn: Log2x (x is power of 2)
deBruijn = 46,Table = {7, 0, 1, 3, 6, 2, 5, 4}
3.8 N/A 8Note: these programs work for 18-bit number except Log2x is for 8-bit number.
![Page 64: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/64.jpg)
3) Communication Code for GreenArraySynthesize communication code between nodes
Interleave communication code with computational code such that
There is no deadlock.The runtime or energy comsumption of the synthesized program is minimized.
![Page 65: Program Synthesis for Low-Power Accelerators](https://reader036.vdocument.in/reader036/viewer/2022070422/56816388550346895dd476eb/html5/thumbnails/65.jpg)
The key to the success of low-power computing lies in
inventing better and more effective programming frameworks.
71