an efficient, portable and generic library for successive...
TRANSCRIPT
An Efficient, Portable and Generic Libraryfor Successive Cancellation Decoding of Polar Codes
Adrien Cassagne1,2 Bertrand Le Gal1 Camille Leroux1
Olivier Aumage2 Denis Barthou2
1IMS, Univ. Bordeaux, INP, France
2Inria / Labri, Univ. Bordeaux, INP, France
LCPC, September 2015
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 1 / 20
Context: Error Correction Codes (ECC)
Algorithm that enable reliable delivery of digital dataRedundancy for error correction
Usually implemented in hardwareGrowing interest for software implementation
End-user utilization (low power consumption processors)Less expensive production than dedicated hardware chipsAlgorithms validation (typically Monte-Carlo HPC simulations)
Require performanceFocus on the decoder (most time consuming part)
Comm. Chan.Transmitter Receiver
ChannelEncoderSource Decoder Sink
The communication chain
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 2 / 20
Polar Codes as a New Class of ECC
Explored for upcoming 5G Mobile Phones (Huawei1)Redundancy: adding bits at fixed positions (value is always 0)
0 0 0 01 0 1 0
N = 8
Frozen bits Information bits
1 0 1 0Information bits
Enc. process
K = 4
Example of Polar Code (number of info. bits K = 4, frame size N = 8)
Rate R = N/K : frame size / information bits ratio
1http://www.huawei.com/minisite/has2015/img/5g_radio_whitepaper.pdf
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 3 / 20
The Successive Cancellation (SC) Algorithm
Depth-first binary tree traversal/search algorithm3 key functions:{
λc = f (λa, λb) = sign(λa.λb).min(|λa|, |λb|)λc = g(λa, λb, s) = (1− 2s)λa + λb(sc , sd) = h(sa, sb) = (sa ⊕ sb, sb).
1 3
�d
�d+1 �d+1
f
g
h
2
Per-node downward and upwardcomputations
4
3
2
1
0
Depth
Data layout representation
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 4 / 20
Polar Decoding Tree
?
?
? ? ? ?
?
f() g()
f() f()
f()f()
g()g()
f() f() g()g()g()g()
h() h() h() h()
h()h()
h()
Position of the frozen and information bits in the tree
Same specialized tree for each frameFrames are independent
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 5 / 20
A Wide Optimization Space
Simplification of the computations (tree pruning, rewriting rules)Vectorization of the node functions (f , g , h)Optimization on the decoder binary sizeImplementation of low level kernels: various instruction sets (SSE,AVX, NEON, etc.)
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 6 / 20
Example: Application of the Rewriting Rules
��������
������
������
������
������
����������������
������������
��������������������
? ?
gr()
xor()
f()
xor()
gr()f()
SPCRep
Rate 0 Rate 1
Rep SPC
cut branch
Rewriting rules applied to a N = 8 and K = 4 frame
Rewriting rules are applied recursivelyRepeated application of this rules lead to a simplified tree
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 7 / 20
Rewriting Rules and Tree Pruning
���������
���������
���������
���������
������������
������������ ����
������������
���������
��������� ��
������
������
������ ����
������������
��������
��������
����������������
?
?
? ?
?
?
?
?
?
?
?
?
?
?
?
?
?
?
????
?level >= 2
???
?
f() g()
xor()
f() gr()
xor()
f()
xor()
g1()g0()
xor0()
spc()rep()h()
spc()rep()h()
Rate 1
RepRepetition
SPCSingle Parity Check
Rate 0
Any node
LeafRate 1, left onlyRate 0, left only
Rate 0 Rate 1 Repetition SPC
Level 2+ SPCRep., leaf childrenRate 1, leaf childrensRate 0, leaf childrens
Repetition, left only Standard Case
Sub-tree rewriting rules and tree pruning for processing specialization
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 8 / 20
Vectorization
Intra frame SIMD strategy:n data
f f f f
Inter frame SIMD strategy:
frame 0
frame 1
frame 2
frame 3
frame 0
frame 1
frame 2
frame 3
n data
n fra
mes
f
f
f
f
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 9 / 20
P-EDGE: a Dedicated Framework for Polar CodesFeatures
1 Code generationFlattening recursive callsRewriting rulesGeneration of templated C++
2 C++ specializationLoop unrollingData typesSIMD
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 10 / 20
P-EDGE: Code Generation Example
��������
������
������
������
������
����������������
������������
��������������������
? ?
gr()
xor()
f()
xor()
gr()f()
SPCRep
Rate 0 Rate 1
Rep SPC
cut branch
Generated code for N = 8 and K = 4
1 void Generated_SC_decoder_N8_K4 :: decode ()2 {3 // ------------ template args --------------- -std args -4 // -types - --funcs -- ----offsets ---- -size - --buffs ---5 f < R , F , FI , 0 , 4 , 8 , 4 > :: apply ( l );6 rep < B , R , H , HI , 8 , 0 , 4 > :: apply ( l , s );7 gr < B , R , G , GI , 0 , 4 , 0 , 8 , 4 > :: apply ( l , s );8 spc < B , R , H , HI , 8 , 4 , 4 > :: apply ( l , s );9 xo < B , X , XI , 0 , 4 , 0 , 4 > :: apply ( s );
10 }
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 11 / 20
Reducing the L1I Cache Occupancy
Flattening may generate large binariesBinary size grows with the frame sizePerformance slowdown when the binary exceeds the L1I cacheMoving offsets from template to function arguments
Help the compiler to factorize many function calls
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 12 / 20
Sub-tree Folding Technique
Legend
Default
Rate 0 left
Rate 0
Rate 1
Rep left
Rep
SPC 2
N0 xo()
N1
f()
N22
g()
xo()
N2
f()
N11
g()
xo0()
N3 N4
g0()
xo0()
N5 N6
g0()
xo0()
N7 N8
g0()
xo0()
N9 N10
g0()
h()
xo()
N12
f()
N17
g()
xo0()
N13 N14
g0()
xo()
N15
f()
N16
gr()
rep() spc()
xo()
N18
f()
N21
g()
xo()
N19
f()
N20
gr()
rep() spc()
spc()
xo()
N23
f()
N34
g()
xo()
N24
f()
N29
g()
xo()
N25
f()
N26
gr()
rep() xo()
N27
f()
N28
gr()
rep() spc()
xo()
N30
f()
N33
g()
xo()
N31
f()
N32
gr()
rep() spc()
h()
xo()
N35
f()
N42
g()
xo()
N36
f()
N41
g()
xo()
N37
f()
N40
g()
xo0()
N38 N39
g0()
h()
h()
h()
h()
Full decoding tree representation (N = 128, K = 64).
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 13 / 20
Sub-tree Folding TechniqueEnabling compression
Legend
Default
Rate 0 left
Rate 0
Rate 1
Rep left
Rep
SPC 2
N0
N1 N22
N2 N11
N3 N4
N5N6
N7 N8
N9 N10
N12 N17
N14
N15N16
N21
N23 N34
N24 N29
N25 N33
N35 N42
N36
N40
A single occurrence of a given sub-tree traversal is generated, andreused wherever neededCompression ratio on the example shown: 1.48
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 14 / 20
Binary Sizes Comparison
P-EDGE generated decoder binary sizes depending on the frame size (R=1/2)
2
4
8
16
32
64
128
256
512
1024
2048
4096
6 7 8 9 10 11 12 13 14 15 16
Dec
oder
bin
ary
size
(KB)
Codewords size (n = log2(N))
Without compression
32-bit Inter-SIMD32-bit Intra-SIMD8-bit Intra-SIMD
L1I size
0.5
1
2
4
8
16
32
64
6 7 8 9 10 11 12 13 14 15 16
Codewords size (n = log2(N))
With compression
Enable sub-tree folding
De-templetization: 10-fold binary reductionTree folding: 5-fold binary reduction (for N = 216)
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 15 / 20
Performance results
P-Edge Intel-based P-Edge ARM-based McGill impl.2CPU Intel Xeon E31225 ARM Cortex-A15 Intel Core i7-2600
3.10Ghz MPCore 2.32GHz 3.40GHzCache L1I/L1D 32KB L1I/L1D 32KB L1I/L1D 32KB
L2 256KB L2 1024KB L2 256KBL3 6MB No L3 L3 8MB
Compiler GNU g++ 4.8 GNU g++ 4.8 GNU g++ 4.8
Performance evaluation platforms.
Compiler flags: -std=c++11 -Ofast -funroll-loops
2G. Sarkis, P. Giard, C. Thibeault, and W.J. Gross. Autogenerating software polardecoders. In Signal and Information Processing (GlobalSIP), 2014 IEEE GlobalConference on, pages 6–10, Dec 2014.
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 16 / 20
Intra-SIMD Comparison with State of Art
Intra frame vectorization (32-bit, float)
250
300
350
400
450
500
550
6 7 8 9 10 11 12 13 14 15 16
Cod
ed th
roug
hput
(Mbi
t/s)
Frame size (n = log2(N))
Intel Xeon CPU E31225 (AVX1 SIMD)
P-Edge, rate 5/6McGill, rate 5/6
50
60
70
80
90
100
110
120
6 7 8 9 10 11 12 13 14 15 16
Frame size (n = log2(N))
Nvidia Jetson TK1 CPU A15 (NEON SIMD)
P-Edge, rate 5/6
Cross marks: Sarkis et al.P-Edge up to 25% better
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 17 / 20
Inter-SIMD Comparison with State of Art
Inter frame vectorization (8-bit, char)
800
1000
1200
1400
1600
1800
2000
2200
2400
2600
6 7 8 9 10 11 12 13 14 15 16
Cod
ed th
roug
hput
(Mbi
t/s)
Frame size (n = log2(N))
Intel Xeon CPU E31225 (SSE4.1 SIMD)
P-Edge, rate 5/6Handw., rate 5/6
100
150
200
250
300
350
400
450
500
6 7 8 9 10 11 12 13 14 15 16
Frame size (n = log2(N))
Nvidia Jetson TK1 CPU A15 (NEON SIMD)
Triangles marks: former “handwritten” implementation3
P-Edge up to 25% better3B. Le Gal, C. Leroux, and C. Jego. Multi-gb/s software decoding of polar codes.
IEEE Transactions on Signal Processing, 63(2):349–359, Jan 2015.A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 18 / 20
Exploring Optimization Impacts
Impact of the different optimizations on the throughput
0
100
200
300
400
500
1/5 1/2 5/6 9/10
Cod
ed th
roug
hput
(Mbi
t/s)
Rate (R = K / N)
32-bit intra frame SIMD (AVX1)
412
304
380
511 521492
0
500
1000
1500
2000
1/5 1/2 5/6 9/10
Rate (R = K / N)
8-bit inter frame SIMD (SSE4.1)
spc2spc4spc6
repcut1cut0
ref
1861
1377
1540
1808 1804 1770
Throughput depending on the different optimizations for N = 2048
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 19 / 20
Conclusion and Future Works
Conclusion:P-EDGE: a Polar ECC decoder exploration frameworkClear separation of concerns
Rewriting rules engineLow level building blocks (f , g , h, rep, spc, ...)
Large optimization explorationOutperform state of art decodersPerformance Portability
Future works:In-depth performance analysis
Performance model (Roof-line, Execution-Cache-Memory (ECM))Reduce memory footprintExplore other Polar code decoder variants
A. Cassagne, B. Le Gal, C. Leroux, O. Aumage, D. Barthou IMS, Inria / Labri, Univ. Bordeaux, INPP-EDGE: Polar ECC Decoder Generation Environment 20 / 20