very high radix montgomery multiplication
![Page 1: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/1.jpg)
HMC VLSI Lab
Very High Radix Montgomery Multiplication
David Harris, Kyle Kelley and Ted Jiang
Harvey Mudd College
Claremont, CA
Supported by Intel Circuit Research Labs
![Page 2: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/2.jpg)
Outline
• RSA Encryption
• Montgomery Multiplication
• Radix 2 Implementations
  – Tenca-Koç Radix 2
  – Improved Radix 2
• Very High Radix Implementations
  – Very High Radix
  – Parallel Very High Radix
  – Quotient Pipelining
• Results
• Future Work
![Page 3: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/3.jpg)
RSA Encryption
• Most widely used public key system
  – Good for encryption and signatures
  – Invented by Rivest, Shamir, Adleman (1978)
• Public key e and private key d are long numbers
  – n = 256 to 2048+ bits
  – Satisfy $x^{de} \bmod M = x$ for all x
  – Finding d from e is as hard as factoring M
• Encryption: $B = A^e \bmod M$
• Decryption: $C = B^d \bmod M = A^{ed} \bmod M = A$
![Page 4: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/4.jpg)
RSA Derivation
• Choose two large random primes p, q
• M = pq
• Totient: $\varphi = (p-1)(q-1)$
• Public key e
  – e is coprime to $\varphi$
• Private key d
  – such that $de = 1 \bmod \varphi$
• Then $x^{ed} \bmod M = x$
  – According to Fermat’s Little Theorem (applied modulo p and modulo q)
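As a toy illustration of the derivation above, the following Python sketch builds a key pair from two small primes and round-trips one message. The primes 61 and 53, the exponent 17, and the message 65 are illustrative assumptions, not values from the slides; real keys use far larger primes. The three-argument `pow` with a negative exponent assumes Python 3.8 or later.

```python
from math import gcd

# Toy parameters (assumed for illustration only; far too small to be secure).
p, q = 61, 53
M = p * q                      # modulus
phi = (p - 1) * (q - 1)        # totient
e = 17                         # public exponent, must be coprime to phi
assert gcd(e, phi) == 1

d = pow(e, -1, phi)            # private exponent: d*e = 1 mod phi

A = 65                         # plaintext, 0 <= A < M
B = pow(A, e, M)               # encryption: B = A^e mod M
C = pow(B, d, M)               # decryption: C = B^d mod M

print("d =", d, "ciphertext =", B, "decrypted =", C)
```

Both `pow` calls are modular exponentiations, which is the operation the rest of the talk sets out to accelerate.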
![Page 5: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/5.jpg)
Cryptographic Algorithms
• DES, AES
  – Symmetric key algorithms
    • Require exchange of secret key
  – Computationally efficient
• RSA, ECC
  – Public key algorithms
    • No key exchange needed (e.g. ecommerce)
  – Computationally expensive
  – Use public key to exchange symmetric key
![Page 6: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/6.jpg)
Modular Exponentiation
• Critical operation in RSA and for
  – Digital signature algorithm
  – Diffie-Hellman key exchange
    • SSL, IPSec, IPv6
  – Elliptic curve cryptosystems
• Done with modular multiplications
  – Ex: $A^{27} = ((((A^2 \cdot A)^2)^2 \cdot A)^2) \cdot A$
  – Division after each multiplication to compute modulo
  – Maximum 2n, average 1.5n mults needed
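The $A^{27}$ example corresponds to left-to-right square-and-multiply over the bits of the exponent. A minimal Python sketch follows; the function name `mod_exp` and the multiply counter are illustrative additions, not part of the slides.

```python
def mod_exp(A, E, M):
    """Left-to-right square-and-multiply.

    Returns (A**E mod M, number of modular multiplications used).
    Starts from A for the leading 1 bit, so E must be at least 1.
    """
    assert E >= 1
    bits = bin(E)[2:]          # binary expansion, most significant bit first
    result = A % M
    mults = 0
    for bit in bits[1:]:
        result = (result * result) % M      # square for every remaining bit
        mults += 1
        if bit == "1":
            result = (result * A) % M       # extra multiply for each 1 bit
            mults += 1
    return result, mults


value, mults = mod_exp(5, 27, 1009)
print(value, mults)
```

For E = 27 (binary 11011) this performs four squarings and three multiplies, matching the nesting in the example; an n-bit exponent needs fewer than 2n multiplications in the worst case.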
![Page 7: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/7.jpg)
Binary Extension Fields
• Building blocks are polynomials in x
  – Operations performed modulo some irreducible polynomial f(x) of degree n
  – Arithmetic done modulo 2
  – Called $GF(2^n)$
• Example: $GF(2^3)$
  – 0, 1, x, x+1, $x^2$, $x^2+1$, $x^2+x$, $x^2+x+1$
• Computation is the same as GF(p)
  – Except that no carries are propagated
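To make "no carries are propagated" concrete, here is a small Python sketch of multiplication in $GF(2^n)$ with elements stored as bit masks (bit i is the coefficient of $x^i$). The slides do not name the irreducible polynomial; $f(x) = x^3 + x + 1$ is assumed here for the $GF(2^3)$ example, and `gf2n_mul` is a hypothetical helper name.

```python
def gf2n_mul(a, b, f, n):
    """Multiply a and b in GF(2^n) modulo the degree-n polynomial f."""
    # Carry-less multiply: partial products are combined with XOR, not addition.
    product = 0
    for i in range(n):
        if (b >> i) & 1:
            product ^= a << i
    # Reduce modulo f, clearing bits from the top down to degree n.
    for bit in range(2 * n - 2, n - 1, -1):
        if (product >> bit) & 1:
            product ^= f << (bit - n)
    return product


F = 0b1011                             # assumed f(x) = x^3 + x + 1
print(gf2n_mul(0b010, 0b100, F, 3))    # x * x^2
```

Replacing XOR with integer addition and the reduction with `% p` gives the GF(p) version, which is why a unified datapath only needs to conditionally suppress carries.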
![Page 8: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/8.jpg)
Montgomery Multiplication
• Faster way to do modular exponentiation
  – Operate on Montgomery residues
  – Division becomes a simple shift
  – Requires conversion to and from residues only once per exponentiation
![Page 9: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/9.jpg)
Montgomery Residues
• Let the modulus M be an odd n-bit integer
  – $2^{n-1} < M < 2^n$
• Define $R = 2^n$
• Define the M-residue of an integer A < M as
  – $\bar{A} = AR \bmod M$
• There is a one-to-one correspondence between integers and M-residues for $0 \le A \le M-1$
![Page 10: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/10.jpg)
Montgomery Multiplication
• Define $\bar{Z} = MM(\bar{X}, \bar{Y}) = \bar{X}\bar{Y}R^{-1} \bmod M$
• Where $R^{-1}$ is the inverse of R mod M: $R^{-1}R = 1 \pmod M$
• Montgomery Mult finds the residue of Z = XY mod M:
$$\bar{Z} = \bar{X}\bar{Y}R^{-1} \bmod M = (XR)(YR)R^{-1} \bmod M = XYR \bmod M = ZR \bmod M$$
![Page 11: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/11.jpg)
Montgomery Reduction
Precompute M’ satisfying $RR^{-1} - MM' = 1$
Convert mult and mod to 3 mult and shift
Multiply: Z = X × Y (mult)
Reduce: reduce = Z × M’ mod R (mult, then drop bits)
  Z = [Z + reduce × M] / R (mult, then shift for $R^{-1}$)
Normalize: if Z ≥ M then Z = Z – M
Why is Z + reduce × M divisible by R?
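The multiply, reduce, and normalize steps can be sketched in Python as follows. This is a minimal sketch, assuming an odd modulus M with R = 2^n > M and operands below M; the names `montgomery_constant` and `montmul` are illustrative. M’ is derived as −M⁻¹ mod R, which satisfies the identity above.

```python
def montgomery_constant(M, n):
    """Return M' with R*R^-1 - M*M' = 1 for R = 2^n, i.e. M' = -M^-1 mod R."""
    R = 1 << n
    return (-pow(M, -1, R)) % R          # requires odd M (Python 3.8+)


def montmul(X, Y, M, n, M_prime):
    """Return X * Y * R^-1 mod M for R = 2^n, with 0 <= X, Y < M."""
    R_mask = (1 << n) - 1
    Z = X * Y                            # multiply
    reduce = (Z * M_prime) & R_mask      # mod R just drops the upper bits
    Z = (Z + reduce * M) >> n            # exact division by R is a shift
    if Z >= M:                           # normalize
        Z -= M
    return Z


M, n = 97, 7                             # 2^6 < 97 < 2^7
M_prime = montgomery_constant(M, n)
print(montmul(45, 76, M, n, M_prime))
```

The three wide multiplications are X × Y, Z × M’, and reduce × M; the only division is by a power of two.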
![Page 12: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/12.jpg)
Reduction Proof
$$\begin{aligned}
[Z + \text{reduce} \times M] \bmod R &= [Z + (Z \times M' \bmod R) \times M] \bmod R \\
&= [Z + Z \times M'M] \bmod R \\
&= [Z + Z(RR^{-1} - 1)] \bmod R \\
&= ZRR^{-1} \bmod R \\
&= 0 \bmod R
\end{aligned}$$
So Z + reduce × M is divisible by R
![Page 13: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/13.jpg)
More Comments on M’
• $RR^{-1} - MM' = 1$
  – Implies $M' \equiv -M^{-1} \bmod R$
  – M’ is odd
• M’ is precomputed from M using the extended Euclidean algorithm
  – M is held constant over many mults
• Only the least significant v bits of M’ are needed when computing in radix $2^v$
  – Dussé & Kaliski, Eurocrypt ’90
![Page 14: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/14.jpg)
CPU Crypto Accelerators
• VIA Esther Padlock Hardware Security
  – Montgomery Multiplier < 0.5 mm² die area
  – Accessed by x86 instruction
  – 256b - 32Kb keys in 128 bit granularity
  – Also supports AES
• SmartMIPS Smart Card Extensions
  – RSA, ECC, AES applications
  – $GF(2^n)$ multiply, MAC instructions
  – Carry propagation for multiword adds
  – AES permutations and rotations
• Intel LaGrande Technology
  – Trusted computing
![Page 15: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/15.jpg)
Embedded Crypto Accelerators
3COM Router 5000 Series Encryption Accelerator
IBM PCI SSL Cryptography Accelerator
![Page 16: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/16.jpg)
Radix 2 Algorithm
• In radix 2, process one bit of X per step
  – Reduction becomes trivial because M’ mod 2 = 1
  – Two multiplies and one shift per step
Z = 0
for i = 0 to n-1
  Z = Z + X_i × Y
  reduce = Z_0              (trivial)
  Z = Z + reduce × M        (make Z divisible by 2)
  Z = Z / 2
if Z ≥ M then Z = Z – M     (final mod M)
For comparison, the general algorithm:
Z = X × Y
reduce = Z × M’ mod R
Z = [Z + reduce × M] / R
if Z ≥ M then Z = Z – M
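The bit-serial loop translates almost line for line into Python. A minimal sketch, assuming an odd n-bit modulus M and operands below M; `montmul_radix2` is an illustrative name.

```python
def montmul_radix2(X, Y, M, n):
    """Bit-serial Montgomery multiplication: X * Y * 2^-n mod M."""
    Z = 0
    for i in range(n):
        x_i = (X >> i) & 1
        Z += x_i * Y            # "multiply" by one bit is an AND
        reduce = Z & 1          # Z_0: the low bit decides whether to add M
        Z += reduce * M         # M is odd, so Z is now even
        Z >>= 1                 # exact division by 2
    if Z >= M:                  # final mod M; Z < 2M here
        Z -= M
    return Z


M, n = 0xF123, 16               # odd 16-bit modulus chosen for illustration
print(montmul_radix2(0x1234, 0xABCD, M, n))
```

Each iteration keeps Z below M + Y, so a single conditional subtraction at the end is enough.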
![Page 17: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/17.jpg)
Final Modulo
• Result before last step is in range
  – $0 \le Z < 2M$
  – Reducing with Z – M at the end is a hassle
• Allow $0 \le X, Y < 2M$ to avoid reduction
  – Then if R > 4M, $0 \le Z < 2M$
  – Hence add two bits to R to avoid subtraction at end of each step
Walter, Electronics Letters ’99
![Page 18: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/18.jpg)
Conversion
• Conversion of integers to/from Montgomery residues takes one MM operation (if $r^2 \bmod M$ is precomputed and saved):
$$\bar{x} = MM(x, r^2 \bmod M) = x \cdot r^2 \cdot r^{-1} \bmod M = xr \bmod M$$
$$x = MM(\bar{x}, 1) = xr \cdot 1 \cdot r^{-1} \bmod M = x \bmod M$$
• Modular exponentiation takes two conversion steps and ~2n multiplication steps.
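Putting conversion and multiplication together, a whole exponentiation can be carried out in the residue domain. The sketch below is one possible arrangement in Python, assuming an odd modulus M, 0 ≤ x < M, and R = 2^n with n the bit length of M; `montmul` and `mont_exp` are illustrative names.

```python
def montmul(X, Y, M, n, M_prime):
    """X * Y * R^-1 mod M with R = 2^n, for 0 <= X, Y < M."""
    Z = X * Y
    reduce = (Z * M_prime) & ((1 << n) - 1)
    Z = (Z + reduce * M) >> n
    return Z - M if Z >= M else Z


def mont_exp(x, E, M):
    """x^E mod M using Montgomery multiplication throughout."""
    n = M.bit_length()
    R = 1 << n
    M_prime = (-pow(M, -1, R)) % R        # M' = -M^-1 mod R
    r2 = (R * R) % M                      # precomputed r^2 mod M

    x_bar = montmul(x, r2, M, n, M_prime)   # convert in: x*r mod M
    acc = montmul(1, r2, M, n, M_prime)     # residue of 1
    for bit in bin(E)[2:]:
        acc = montmul(acc, acc, M, n, M_prime)        # square
        if bit == "1":
            acc = montmul(acc, x_bar, M, n, M_prime)  # multiply
    return montmul(acc, 1, M, n, M_prime)   # convert out: MM(acc, 1)


print(mont_exp(65, 17, 3233))
```

Only the first and last calls are conversions; every multiplication in between avoids a true division by M.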
![Page 19: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/19.jpg)
Reconfigurable Hardware
• Building hardwired n-bit unit is limiting
  – Slow for large n
  – Not scalable to different n
• Better to design for w-bit words
  – Break n-bit operand into e w-bit words
    • e = n/w
  – This is called scalable
• Also handle both GF(p) and $GF(2^n)$
  – Requires conditionally killing carries
  – Called unified
![Page 20: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/20.jpg)
Tenca-Koç Montgomery Multiplier
Z = 0
for i = 0 to n-1
  (Ca, Z^(0)) = Z^(0) + X_i × Y^(0)
  reduce = Z^(0)_0
  (Cb, Z^(0)) = Z^(0) + reduce × M^(0)
  for j = 1 to e
    (Ca, Z^(j)) = Z^(j) + Ca + X_i × Y^(j)
    (Cb, Z^(j)) = Z^(j) + Cb + reduce × M^(j)
    Z^(j-1) = (Z^(j)_0, Z^(j-1)_{w-1:1})
M = (M^(e-1), …, M^(1), M^(0)), Y = (Y^(e-1), …, Y^(1), Y^(0)), Z = (Z^(e-1), …, Z^(1), Z^(0)), X = (X_{n-1}, …, X_1, X_0)
Tenca, Koç, Trans. Computers, 2003
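A software model of the word-serial loop helps check the carry and shift bookkeeping. The Python sketch below follows the pseudocode above under two stated assumptions: operands are split into e = ⌈(n+1)/w⌉ words so the intermediate sum fits, and one extra all-zero top word (index e) is kept and cleared after each outer iteration. The function name `tenca_koc_montmul` is illustrative, and w ≥ 2 is assumed.

```python
def tenca_koc_montmul(X, Y, M, n, w):
    """Word-serial radix-2 Montgomery multiply: X * Y * 2^-n mod M."""
    e = -(-(n + 1) // w)                    # assumed word count, ceil((n+1)/w)
    mask = (1 << w) - 1
    Yw = [(Y >> (w * j)) & mask for j in range(e + 1)]   # Yw[e] == 0
    Mw = [(M >> (w * j)) & mask for j in range(e + 1)]   # Mw[e] == 0
    Z = [0] * (e + 1)
    for i in range(n):
        x_i = (X >> i) & 1
        t = Z[0] + x_i * Yw[0]
        ca, Z[0] = t >> w, t & mask
        reduce = Z[0] & 1                   # low bit of word 0
        t = Z[0] + reduce * Mw[0]
        cb, Z[0] = t >> w, t & mask
        for j in range(1, e + 1):
            t = Z[j] + ca + x_i * Yw[j]
            ca, Z[j] = t >> w, t & mask
            t = Z[j] + cb + reduce * Mw[j]
            cb, Z[j] = t >> w, t & mask
            # Right shift by one bit: low bit of word j becomes MSB of word j-1.
            Z[j - 1] = ((Z[j] & 1) << (w - 1)) | (Z[j - 1] >> 1)
        Z[e] = 0                            # its only bit was shifted into Z[e-1]
    result = sum(z << (w * j) for j, z in enumerate(Z))
    return result - M if result >= M else result


print(tenca_koc_montmul(0x1234, 0xABCD, 0xF123, 16, 4))
```

In hardware the two additions per word are carry-save operations inside a processing element; the model uses ordinary integer adds with explicit carries Ca and Cb instead.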
![Page 21: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/21.jpg)
Processing Elements
• Keep Z in carry-save redundant form
  – $T_c = 2t_{AND} + 2t_{CSA} + t_{MUX} + t_{BUF}(w) + t_{REG}$
[Figure: processing element datapath with two w-bit 3:2 CSAs. Signals shown: x_i, reduce, Y_{w-1:0}, M_{w-1:0}, Z_{w-1:0}, Z_0, Z_{w-1}, Ca, Cb, Cin, Cout, reset.]
![Page 22: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/22.jpg)
Parallelism
• Two dimensions of parallelism:
  – Width of processing element w
  – Number of pipelined PEs p
• Multiply takes k = n/p kernel cycles
[Figure: block diagram with X Mem, Y/M Mem, a kernel of PE 1 … PE p, sequence control, FIFO, and Result; signals x, Y, M, Z, Z’.]
![Page 23: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/23.jpg)
Pipeline Timing
[Figure: pipeline timing diagrams (time vs. space) for cycles 1 to 20. Case I: e > 2p-1 (e = 4, p = 2), with PE1 and PE2. Case II: e < 2p-1 (e = 4, p = 4), with PE1 to PE4. Each cell pairs a bit x_i with a word of M, Y (MY_{w-1:0} … MY_{5w-1:4w}) and Z (Z_{w-2:-1} … Z_{5w-2:4w-1}); labels mark kernel cycles 1 and 2 and a kernel stall.]
![Page 24: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/24.jpg)
Queue
• If full PEs cause stall, queue results
• Convert back to nonredundant form
  – Saves queue space
  – CPA needed for final result anyway
[Figure: queue datapath. Z’ (sum and carry, w bits each) feeds a CPA and a FIFO of 0 or more words, with bypass and first-cycle multiplexers producing Z and the Result.]
![Page 25: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/25.jpg)
Improved Design
• Don’t wait two cycles for MSB
• Kick off dependent operation right away on the available bits
• Take extra cycle(s) at the end to handle the extra bits
• For p processing elements, cycle count reduces from 2p to p + (p/w)
Harris, Krishnamurthy, Anders, Mathew, Hsu, Arith 2005.
![Page 26: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/26.jpg)
Improved PE
• Left-shift M and Y
  – Rather than right-shift Z
• Same amount of hardware
[Figure: improved processing element datapath with two w-bit 3:2 CSAs. Signals shown: x_i, reduce, Z_{w-1:0}, M_{w-1:0}, Y_{w-1:0}, Z_{w-2:-1}, Y_{w-2:-1}, M_{w-2:-1}, M_{w-1}, M_{-1}, Y_{w-1}, Y_{-1}, Z_0, Ca, Cb, Cin, Cout, reset.]
![Page 27: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/27.jpg)
Pipeline Timing
[Figure: pipeline timing diagrams (time vs. space) for the improved design, cycles 1 to 14. Case I: e > p+1 (e = 4, p = 2), with PE1 and PE2. Case II: e < p+1 (e = 4, p = 4), with PE1 to PE4. Each cell pairs a bit x_i (x_0 to x_7) with left-shifted words of M, Y (MY_{w-1:0}, MY_{w-2:-1}, MY_{w-3:-2}, MY_{w-4:-3}, …) and Z; labels mark kernel cycles 1 and 2 and a kernel stall.]
![Page 28: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/28.jpg)
Latency
• Tenca-Koç
  – $k(e+1) + 2(p-1)$ for $n > 2pw - w$
  – $k(2p+1) + e - 2$ for $n \le 2pw - w$
• Improved Design
  – $(k+1)(e+1) + p - 2$ for $n > pw$
  – $k(p+1) + 2e - 1$ for $n \le pw$
![Page 29: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/29.jpg)
Break
![Page 32: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/32.jpg)
Very High Radix
• Handle many bits of X at a time
  – Radix $2^v$ processes v bits of X
    • Only f = n/v outer loop iterations needed
    • k = n/pv kernel cycles with p PEs
  – Hardware changes
    • w-bit AND → v × w-bit multiplier
    • Right shift by v bits after each step
      – Use v ≤ w so bits are available to shift
    • Cycle time gets longer
  – Reduce becomes more complicated
    • Must drive v lsbs to 0
![Page 33: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/33.jpg)
Very High Radix Algorithm
Z = 0
for i = 0 to f-1
  Z = Z + X_i × Y
  reduce = (M’ × Z) mod 2^v      (reduce bottom v bits)
  Z = Z + reduce × M
  Z = Z / 2^v
For comparison, the general algorithm:
Z = X × Y
reduce = Z × M’ mod R
Z = [Z + reduce × M] / R
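The radix-2^v loop can be modeled directly in Python. This is a minimal sketch, assuming an odd n-bit modulus M, operands below M, and f = ⌈n/v⌉ digits of X; only the low v bits of M’ are computed, and the final conditional subtraction is added for a fully reduced result. `vhr_montmul` is an illustrative name.

```python
def vhr_montmul(X, Y, M, n, v):
    """Radix-2^v Montgomery multiplication: X * Y * 2^(-v*f) mod M."""
    f = -(-n // v)                              # number of v-bit digits of X
    digit_mask = (1 << v) - 1
    m_prime = (-pow(M, -1, 1 << v)) % (1 << v)  # M' mod 2^v
    Z = 0
    for i in range(f):
        x_i = (X >> (v * i)) & digit_mask       # v bits of X per iteration
        Z += x_i * Y
        reduce = (m_prime * Z) & digit_mask     # depends only on low v bits of Z
        Z += reduce * M                         # drives the v lsbs of Z to 0
        Z >>= v
    return Z - M if Z >= M else Z


M, n, v = 0xF123, 16, 4
print(vhr_montmul(0x1234, 0xABCD, M, n, v))
```

With v = 1 this reduces to the radix 2 algorithm; larger v cuts the iteration count to n/v at the cost of a v-bit multiplication to form `reduce` inside the loop.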
![Page 34: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/34.jpg)
Scalable Very High Radix MM
Z = 0
for i = 0 to f-1
  (CA, Z^(0)) = Z^(0) + X_i × Y^(0)
  reduce = (M’ × Z^(0)) mod 2^v      (only reduce bottom v bits)
  (CB, Z^(0)) = Z^(0) + reduce × M^(0)
  for j = 1 to e + (v + 1)/w - 1
    (CA, Z^(j)) = Z^(j) + CA + X_i × Y^(j)
    (CB, Z^(j)) = Z^(j) + CB + reduce × M^(j)
    Z^(j-1) = (Z^(j)_{v-1:0}, Z^(j-1)_{w-1:v})
2 mul, 1 shift in inner loop
![Page 35: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/35.jpg)
Very High Radix PE
[Figure: very high radix processing element with two MAC units (multipliers marked *). Inputs X, Y, M, M’, Z, and first; reduce is formed from M’ and the lower v bits; carries CA and CB; bus widths v, w, w+1, w-v, and v+w; Y, M, Z, and first are passed to the next PE.]
Kelley, Harris IWSOC 2005
![Page 36: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/36.jpg)
Pipeline Timing
Each MAC is given a full cycle
$T_c = t_{MUL}(v,w) + t_{CPA}(v+w) + t_{MUX} + t_{REG}$
Two MAC columns for each PE
Four cycle latency between PEs:
1) Z^(0) = X_i × Y^(0)
2) reduce = M’ × Z^(0) mod 2^v
3) Z^(0) = Z^(0) + reduce × M^(0)
4) Z^(1) = Z^(1) + reduce × M^(1), shift into Z^(0)
[Figure: pipeline timing diagram over cycles 1 to 18 for PE 1 to PE 3. Each PE has a Y column and an M column of words (Y_{w-1:0} … Y_{6w-1:5w}, M_{w-1:0} … M_{6w-1:5w}) paired with Z words, plus reduce; X digits run from X_{v-1:0} to X_{5v-1:4v}; labels mark kernel cycles 1 and 2 and a kernel stall.]
![Page 37: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/37.jpg)
Very High Radix Latency
  – $k(e+3) + 4(p-1) + 2$ for $n > 4pw - 2w$
  – $k(4p+1) + (e-1)$ for $n \le 4pw - 2w$
• Design limited for small n by 4-cycle latency between PEs
![Page 38: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/38.jpg)
Parallel Very High Radix
• Eliminate two of the cycles
  – Multiplication to compute reduce
    • By precomputing M̃ = M’ × M mod R
  – Dependency of Z_0 on reduce
    • By prescaling X by 2^v so Z_0 = 0
• Math proposed by Orup, Arith 1995
  – But no scalable very high radix HW
![Page 39: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/39.jpg)
Improvement 1: Eliminate Multiply
General algorithm, before:
Z = X × Y
reduce = Z × M’ mod R
Z = [Z + reduce × M] / R
General algorithm, after:
M̃ = M’ × M mod R (precompute)
Z = X × Y
Z = [Z + Z × M̃] / R
Very high radix loop:
Z = 0
for i = 0 to f-1
  Z = Z + X_i × Y
  Z = Z + Z_0 × M̃          where M̃ = (M’ mod 2^v)M mod R
  Z = Z / 2^v
![Page 40: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/40.jpg)
Improvement 2: Prescale X by 2^v
Before:
Z = 0
for i = 0 to f-1
  Z = Z + X_i × Y
  Z = Z + Z_0 × M̃
  Z = Z / 2^v
Prescaled (one more iteration):
Z = 0
for i = 0 to f
  Z = Z + 2^v X_i × Y + Z_0 × M̃
  Z = Z / 2^v
Equivalent form, because Z_0 is independent of 2^v X_i:
Z = 0
for i = 0 to f
  Z = (Z + Z_0 × M̃) / 2^v + X_i × Y
Final result in range $0 \le Z < 2^{n+v+1}$
  - avoid final small mod in successive mults by using larger R
![Page 41: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/41.jpg)
Improvement 3: Avoid LSW add
Before:
Z = 0
for i = 0 to f
  Z = (Z + Z_0 × M̃) / 2^v + X_i × Y
After:
Z = 0
for i = 0 to f
  reduce = Z_0
  Z = Z >> v + reduce × M̂ + X_i × Y
where $\hat{M} = (\tilde{M} + 1) / 2^v$
$$\begin{aligned}
(Z + Z_0 \times \tilde{M}) / 2^v &= (Z \gg v) + (Z_0 \times \tilde{M} + Z \bmod 2^v) / 2^v \\
&= (Z \gg v) + (Z_0 \times (\tilde{M} + 1)) / 2^v \\
&= (Z \gg v) + Z_0 \times \hat{M}
\end{aligned}$$
$\tilde{M} \equiv M'M \equiv -1 \bmod 2^v$, so $\tilde{M} + 1$ is divisible by $2^v$
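The three improvements can be checked end to end with a short Python model. This is a sketch under stated assumptions: M is odd and n bits long, X and Y are below M, f = ⌈n/v⌉, the precomputed multiple of M is taken as (M’ mod 2^v) × M without further reduction, and the shifted constant is written `M_hat` = (M̃ + 1) / 2^v. The loop runs f + 1 times with the extra digit X_f = 0, and the result is left unnormalized (it is congruent to X·Y·2^(−vf) modulo M but may exceed M). The name `parallel_vhr_montmul` is illustrative.

```python
def parallel_vhr_montmul(X, Y, M, n, v):
    """Parallel very high radix loop; returns Z with Z = X*Y*2^(-v*f) (mod M)."""
    f = -(-n // v)
    digit_mask = (1 << v) - 1
    m_prime_v = (-pow(M, -1, 1 << v)) % (1 << v)   # M' mod 2^v
    M_tilde = m_prime_v * M                        # multiple of M, = -1 mod 2^v
    assert (M_tilde + 1) & digit_mask == 0
    M_hat = (M_tilde + 1) >> v                     # exact division by 2^v
    Z = 0
    for i in range(f + 1):                         # one more iteration than before
        x_i = (X >> (v * i)) & digit_mask          # X_f is 0 because X < 2^(v*f)
        reduce = Z & digit_mask                    # no multiply needed for reduce
        # Both products depend only on values known at the start of the step,
        # so they can be computed in parallel.
        Z = (Z >> v) + reduce * M_hat + x_i * Y
    return Z


M, n, v = 0xF123, 16, 4
Z = parallel_vhr_montmul(0x1234, 0xABCD, M, n, v)
print(Z % M)
```

Because `reduce` is just the low digit of Z, the multiplication by M’ disappears from the loop, and the low-word addition is absorbed into the shift.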
![Page 42: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/42.jpg)
Scalable Parallel Very High Radix
Z = 0
for i = 0 to f
  C = 0
  reduce = Z_0
  for j = 0 to e + v/w
    (C, Z^(j)) = Z^(j) + C + reduce × M̂^(j) + X_i × Y^(j)
    Z^(j-1) = (Z^(j)_{v-1:0}, Z^(j-1)_{w-1:v})
![Page 43: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/43.jpg)
Parallel Very High Radix PE
Kelley, Harris Asilomar 2005
![Page 44: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/44.jpg)
Pipeline Timing
[Figure: pipeline timing diagram over cycles 1 to 15 for PE 1 to PE 4. Each PE processes Y and Z words (Y_{w-1:0} … Y_{6w-1:5w}, Z_{w-1:0} … Z_{6w-1:5w}) for one X digit, from X_{v-1:0} to X_{7v-1:6v}; labels mark kernel cycles 1 and 2 and a kernel stall.]
Two cycle latency between PEs:
1) Z^(0) = Z^(0) + X_i × Y^(0) + reduce × M^(0)
2) Z^(1) = Z^(1) + X_i × Y^(1) + reduce × M^(1), shift into Z^(0)
$T_c = t_{MUL}(v,w) + 2t_{CSA} + t_{CPA}(v+w) + t_{REG}$
![Page 45: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/45.jpg)
Parallel Very High Radix Latency
  – $k(e+1) + e + 1 + 2(f \bmod p)$ for $n > 2pw - v$
  – $k(2p+1) + e + 1 + 2(f \bmod p)$ for $n \le 2pw - v$
![Page 46: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/46.jpg)
Quotient Pipelining
• Reduce depends on previous Z
  – Pipeline reduce calculation to avoid reduce being on the critical path
  – Parallel Very High Radix can be viewed as 0-stage Quotient Pipeline architecture
![Page 47: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/47.jpg)
0-stage Quotient Pipelining
• Parallel Very High Radix
  – reduce × M^(j) and X_i × Y^(j) occur simultaneously
  – Requires reduce in non-redundant form
  – Solution: delay the reduce × M^(j) calculation by d PEs: d-stage delay Quotient Pipelining
![Page 48: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/48.jpg)
d-Stage Delay Quotient Pipelining
• Parallel Design
  – M̂ = (M’ mod 2^v) × M >> v
  – reduce produced by PE_i is used by PE_{i+1}
• Quotient Pipelining
  – M̂ = (M’ mod 2^{v(d+1)}) × M >> v(d+1)
  – Where d is the # of delay stages
  – reduce produced by PE_i is used by PE_{i+1+d}
• Parallel: d = 0-Stage Quotient Pipelining
![Page 49: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/49.jpg)
1-Stage Quotient Pipelined Algorithm
Z = 0
oldreduce = 0
for i = 0 to f
  reduce = Z_0
  Z = Z >> v + oldreduce × M̂ + X_i × Y
  oldreduce = reduce
Z = Z << v + oldreduce
![Page 50: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/50.jpg)
1-Stage Scalable Quotient Pipelining
Z = 0
oldreduce = 0
for i = 0 to f
  C = 0
  reduce = Z_0
  for j = 0 to e
    (C, Z^(j)) = (Z^(j)_{v-1:0}, Z^(j-1)_{w-1:v}) + oldreduce × M̂^(j) + X_i × Y^(j) + C
  oldreduce = reduce
Z = Z << v + oldreduce
![Page 51: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/51.jpg)
Quotient Pipelined PE
[Figure: quotient pipelined processing element with two multipliers (X × Y and OldReduce × M) feeding four 3:2 CSAs. Inputs Y, M, Z, X, OldReduce, first; outputs Z and C split into upper and lower parts; bus widths v, v+1, w, w-v, and v+w.]
![Page 52: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/52.jpg)
Quotient Pipelined Performance
• $T_c = t_{MUL}(v,w) + t_{CSA} + t_{REG}$
• Latency
  – $k(e+1) + e + 1 + 2(f \bmod p)$ for $n > 2pw - v$
  – $k(2p+1) + e + 1 + 2(f \bmod p)$ for $n \le 2pw - v$
• k = n’/pv, e = n’/w, f = n’/v, n’ = n + 2v
• Differs from Parallel design:
  – Parallel: n’ = n + v
  – Extra v to account for the extra stage of delay
[Figure: quotient pipelined processing element, repeated from the previous slide.]
![Page 53: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/53.jpg)
Comparison of Latencies
• Tenca-Koç Radix 2
  – $n^2/wp + n/p + 2p - 2$ for $n > 2wp - w$
  – $2n + n/p + n/w - 2$ for $n \le 2wp - w$
• Improved Radix 2
  – $n^2/wp + n/w + n/p + p - 1$ for $n > wp$
  – $n + n/p + 2n/w - 1$ for $n \le wp$
• Very High Radix
  – $n^2/wpv + 3n/pv + 4p - 2$ for $n > 4wp - 2w$
  – $4n/v + n/pv + n/w - 1$ for $n \le 4wp - 2w$
• Parallel Very High Radix
  – $n^2/wpv + n/pv + n/w + 1$ for $n > 2wp - v$
  – $2n/v + n/pv + n/w + 1$ for $n \le 2wp - v$
![Page 54: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/54.jpg)
Latency vs. # of PEs
• Assume w = 16, v = 1 or 16
• Let m = wvp be the amount of HW
• For small m
  – All similar
  – Cycles ∝ 1/m
• For large m
  – Saturates
  – High radix is better
[Figure: log-log plot of cycles (10 to 100000) versus m = wpv (10 to 100000). Series: tk (256), tk (1024), imp (256), imp (1024), high (256), high (1024), parallel (256), parallel (1024).]
![Page 55: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/55.jpg)
Comparison of Cycle Times
• Tenca-Koç / Improved Radix 2
  – $T_c = 2t_{AND} + 2t_{CSA} + t_{MUX} + t_{BUF}(w) + t_{REG}$
• Very High Radix
  – $T_c = t_{MUL}(v,w) + t_{CPA}(v+w) + t_{MUX} + t_{REG}$
• Parallel Very High Radix
  – $T_c = t_{MUL}(v,w) + 2t_{CSA} + t_{CPA}(v+w) + t_{REG}$
• Quotient Pipelining
  – $T_c = t_{MUL}(v,w) + t_{CSA} + t_{REG}$
![Page 56: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/56.jpg)
Synthesis Results
• Xilinx Virtex II Pro XC2V250-6
  – ~5n RAM bits needed for X, Y, M, Z, FIFO

| Arch | Freq (MHz) | LUTs/PE | Regs/PE | 16 × 16 Mults/PE | RAM |
|---|---|---|---|---|---|
| Improved Radix 2 | 144 | 85 | 69 | 0 | 5n |
| Very High Radix | 107 | 178 | 218 | 2 | 5n |
| Parallel Very High Radix | 107 | 133 | 147 | 2 | 5n |
| Quotient Pipelined | 138 | 238 | 226 | 2 | 5n |
![Page 57: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/57.jpg)
Exponentiation Times
• On average, 1.5n + 2 Montgomery multiplies are needed for modular exponentiation
  – $T_{exp} = T_c \times \text{latency}(n, w, v, p) \times (1.5n + 2)$
![Page 58: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/58.jpg)
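Hardware Results

| Arch | Freq (MHz) | p | n | Latency (cycles) | Texp (ms) |
|---|---|---|---|---|---|
| Improved Radix 2, w = 16 | 144 | 16 | 256 | 303 | 0.81 |
| | | 16 | 1024 | 4239 | 45.3 |
| | | 64 | 256 | 291 | 0.78 |
| | | 64 | 1024 | 1167 | 12.5 |
| Very High Radix, w = v = 16 | 107 | 4 | 256 | 90 | 0.32 |
| | | 4 | 1024 | 1086 | 15.6 |
| | | 16 | 256 | 80 | 0.28 |
| | | 16 | 1024 | 330 | 4.74 |
| Parallel Very High Radix, w = v = 16 | 107 | 4 | 256 | 85 | 0.31 |
| | | 4 | 1024 | 1105 | 15.9 |
| | | 16 | 256 | 50 | 0.18 |
| | | 16 | 1024 | 325 | 4.67 |
| Quotient Pipelined, w = v = 16 | 138 | 4 | 256 | 95 | 0.26 |
| | | 4 | 1024 | 1139 | 12.7 |
| | | 16 | 256 | 53 | 0.15 |
| | | 16 | 1024 | 334 | 3.72 |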
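The cycle counts and exponentiation times for the first three architectures can be reproduced from the latency formulas given earlier in the talk. The Python sketch below is a cross-check, not the authors' tooling; the function names are illustrative. One assumption is worth flagging: k, e, and f are computed here from n directly (k = n/p or n/pv, e = n/w, f = n/v), which is the choice that matches the tabulated numbers for these rows.

```python
def ceil_div(a, b):
    return -(-a // b)


def latency_improved_radix2(n, w, p):
    k, e = ceil_div(n, p), ceil_div(n, w)
    if n > p * w:
        return (k + 1) * (e + 1) + p - 2
    return k * (p + 1) + 2 * e - 1


def latency_very_high_radix(n, w, v, p):
    k, e = ceil_div(n, p * v), ceil_div(n, w)
    if n > 4 * p * w - 2 * w:
        return k * (e + 3) + 4 * (p - 1) + 2
    return k * (4 * p + 1) + (e - 1)


def latency_parallel_vhr(n, w, v, p):
    k, e, f = ceil_div(n, p * v), ceil_div(n, w), ceil_div(n, v)
    if n > 2 * p * w - v:
        return k * (e + 1) + e + 1 + 2 * (f % p)
    return k * (2 * p + 1) + e + 1 + 2 * (f % p)


def t_exp_ms(latency_cycles, freq_mhz, n):
    """Texp = Tc * latency * (1.5n + 2), returned in milliseconds."""
    cycle_time_s = 1.0 / (freq_mhz * 1e6)
    return cycle_time_s * latency_cycles * (1.5 * n + 2) * 1e3


cycles = latency_improved_radix2(256, 16, 16)
print(cycles, round(t_exp_ms(cycles, 144, 256), 2))
```

The saturation discussed in the summary is visible here: once n ≤ pw (or the corresponding threshold), adding more PEs stops reducing the cycle count much.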
![Page 59: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/59.jpg)
Athena TeraFire 5008
• 5.5 ms 1024-bit exponentiation
• 95 Kgates IP block
![Page 60: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/60.jpg)
Software Results
• 1024 bit modular exponentiation
  – 2.4 GHz Pentium 4, FLINT/C library
    • 92 ms without Montgomery’s alg
    • 41 ms with Montgomery’s alg
  – 2.4 GHz Pentium 4, GMP library
    • 25 ms with Karatsuba’s algs
  – 80 MHz ARM
    • 876 ms with Montgomery’s alg
![Page 61: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/61.jpg)
Summary
• Latency (cycles)
  – Comparable per full adder for all designs when p is small
  – Saturates when p gets too big
    • Improved radix 2 design 2x better than T-K
    • Very high radix even better (by v/4 or v/2)
• Cycle Time
  – Worse for very high radix
  – But only slightly so on FPGAs with efficient multiplier hardware
• Total Time
  – Radix 2 best when little HW is available
  – Radix $2^{16}$ attractive for FPGAs for minimum latency when plenty of HW is available
![Page 62: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/62.jpg)
Future Work
• Better pipelining of parallel design
• Radix 2 parallel design
• Radix 4/8 with precomputed multiples
• Karatsuba Algorithm
• Side channel attack countermeasures
• Reconfigurable Logic on CPUs
![Page 63: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/63.jpg)
Pipelining Parallel Design
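[Figure: pipelined parallel datapath. Inputs X, Y, Z, and M feed multipliers (*) and a chain of 3:2 CSA stages with Reduce blocks and Upper/Lower outputs; a "first" control signal and carry C are shown. Bus widths are labeled w, v, v+1, and w-v.]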
• $T_c = t_{MUL}(v,w) + 2t_{CSA} + t_{REG}$
![Page 64: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/64.jpg)
Parallel Radix 2
• $T_c = t_{AND} + 2t_{CSA} + t_{REG}$
Z = 0
for i = 0 to n
    C = 0
    reduce = Z_0
    for j = 0 to e
        (C, Z^j) = Z^j + C + reduce × M^j + X_i × Y^j
        Z^{j-1} = (Z^j_0, Z^{j-1}_{w-1:1})
[Figure: two chained 3:2 CSA stages, each w bits wide, with inputs x_i, reduce, Z, M, and Y and carry signals Cin and Cout.]
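As a rough software reference for the radix 2 recurrence, the sketch below uses the textbook bit-serial form of Montgomery multiplication rather than the word-serial datapath on the slide. The function name `montgomery_radix2` and its argument order are assumptions made for illustration; it processes one bit of X per iteration, adds M when the running sum is odd, and shifts right.

```python
def montgomery_radix2(x: int, y: int, m: int, n: int) -> int:
    """Return x * y * 2^(-n) mod m for odd m, assuming 0 <= x, y < m < 2^n.

    Hypothetical helper for illustration: a bit-serial model, not the
    word-serial (w-bit) hardware organization described on the slide.
    """
    if m % 2 == 0:
        raise ValueError("modulus must be odd")
    z = 0
    for i in range(n):
        x_i = (x >> i) & 1      # current multiplier bit
        z += x_i * y            # AND gate selects Y or 0
        reduce = z & 1          # sum is odd, so adding M makes it even
        z += reduce * m
        z >>= 1                 # exact division by 2
    # The loop keeps z < 2m, so one conditional subtraction is enough.
    if z >= m:
        z -= m
    return z


if __name__ == "__main__":
    m, n = 239, 8
    r = montgomery_radix2(100, 200, m, n)
    # Multiplying back by 2^n should recover x * y mod m.
    print((r << n) % m == (100 * 200) % m)
```

The result carries the usual Montgomery factor of 2^(-n), so callers typically convert operands into and out of Montgomery form around a chain of multiplications.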
![Page 65: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/65.jpg)
Radix 4/8 with Precomputed Multiples
• Todorov & Tawalbeh extended T-K to radix 4/8
– Precompute multiples of Y, M
– Use mux instead of AND / MUL
• Improve latency using right shifts
• Does Booth encoding help?
• $T_c = t_{MUX} + 2t_{CSA} + t_{REG}$
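The idea of replacing the multiplier with a mux over precomputed multiples can be modeled in software. The sketch below is a minimal radix 4 illustration under assumed names (`montgomery_radix4`, `y_mult`, `m_mult`); it does not use Booth encoding and is not the Todorov or Tawalbeh design. Table lookups stand in for the muxes.

```python
def montgomery_radix4(x: int, y: int, m: int, n: int) -> int:
    """Return x * y * 4^(-d) mod m, where d = ceil(n / 2) radix 4 digits.

    Assumes m is odd and 0 <= x, y < m < 2^n. Hypothetical helper for
    illustration only.
    """
    if m % 2 == 0:
        raise ValueError("modulus must be odd")
    digits = (n + 1) // 2
    # Precomputed multiples; in hardware a 4:1 mux would select among these.
    y_mult = [k * y for k in range(4)]
    m_mult = [k * m for k in range(4)]
    # For odd m, m * m = 1 (mod 4), so -m^(-1) mod 4 equals (-m) mod 4.
    m_prime = (-m) % 4
    z = 0
    for i in range(digits):
        x_i = (x >> (2 * i)) & 3          # next radix 4 digit of X
        z += y_mult[x_i]
        q = ((z & 3) * m_prime) & 3       # quotient digit clears low 2 bits
        z += m_mult[q]
        z >>= 2                           # exact division by 4
    # The loop keeps z < 2m, so one conditional subtraction is enough.
    if z >= m:
        z -= m
    return z
```

Each iteration retires two bits of X, so the loop runs about half as many times as a radix 2 version, at the cost of storing the 2x and 3x multiples.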
![Page 66: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/66.jpg)
Karatsuba Algorithm
• “The Karatsuba multiplication arouses our curiosity, since it seems simple, and one could pleasantly occupy a (preferably rainy) Sunday afternoon trying it out.”
– M. Welschenbach, Cryptography in C and C++
• $A = A_1 2^n + A_0$; $B = B_1 2^n + B_0$
• Regular Multiplication ($O(n^2)$)
– $AB = 2^{2n} A_1 B_1 + 2^n (A_0 B_1 + A_1 B_0) + A_0 B_0$
• Karatsuba Multiplication ($O(n^{1.585})$)
– $C_0 = A_0 B_0$; $C_1 = A_1 B_1$; $C_2 = (A_0 + A_1)(B_0 + B_1) - C_0 - C_1$
– $AB = 2^{2n} C_1 + 2^n C_2 + C_0$
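The three-multiplication identity above translates directly into a recursive routine. The sketch below is one possible software rendering for non-negative Python integers; the function name `karatsuba` and the `threshold_bits` cutoff are assumptions for illustration, not values taken from the slides.

```python
def karatsuba(a: int, b: int, threshold_bits: int = 64) -> int:
    """Multiply non-negative integers a and b with Karatsuba's identity.

    Below threshold_bits (an arbitrary illustrative cutoff) the built-in
    multiply is used as the base case.
    """
    if a.bit_length() <= threshold_bits or b.bit_length() <= threshold_bits:
        return a * b
    # Split both operands at bit n: A = A1 * 2^n + A0, B = B1 * 2^n + B0.
    n = max(a.bit_length(), b.bit_length()) // 2
    mask = (1 << n) - 1
    a1, a0 = a >> n, a & mask
    b1, b0 = b >> n, b & mask
    c0 = karatsuba(a0, b0, threshold_bits)
    c1 = karatsuba(a1, b1, threshold_bits)
    # One extra multiply replaces the two cross products A0*B1 and A1*B0.
    c2 = karatsuba(a0 + a1, b0 + b1, threshold_bits) - c0 - c1
    return (c1 << (2 * n)) + (c2 << n) + c0
```

Each level performs three half-size multiplications instead of four, which is where the $O(n^{1.585})$ bound comes from.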
![Page 67: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/67.jpg)
Side Channel Attacks
• Monitor chip activity to try deducing private key
– Timing
– Current consumption
– Photon emissions
• How vulnerable is very high radix MM to side channel attacks?
• How can it be improved?
– CVSL, other differential logic families?
– Differential registers
– Inherent symmetry of addition & multiplication
![Page 68: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/68.jpg)
Reconfigurable Logic on CPU
• Add FPGA for dynamically reconfigurable accelerators
– ~1000 transistors / 4-input LUT
• Differentiates Intel from competition by unique reconfigurable logic fabric
![Page 69: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/69.jpg)
Applications
• Montgomery Mult – RSA, ECC, Diffie-Hellman Key Exchange
• AES accelerator – symmetric key crypto
• Viterbi decoder
• Pattern matching – Genome BLAST, Google, network security
• DSP accelerators – Photoshop filters, video encoding
![Page 70: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/70.jpg)
Example
• Montecito
– 2 cores, 28.5M transistors each
– 24 MB L3$, 1550M transistors
• Alternative
– 2 cores
– 20 MB L3$, 1300M transistors
– 250 KLUT FPGA, 250M transistors
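The transistor budget behind this example can be checked with a short calculation. It assumes the slide's 1550M and 1300M figures are L3 cache transistor counts and uses the deck's estimate of roughly 1000 transistors per 4-input LUT.

```latex
\begin{align*}
\text{FPGA cost} &\approx 250 \times 10^{3}\ \text{LUTs} \times 1000\ \frac{\text{transistors}}{\text{LUT}}
  = 250 \times 10^{6}\ \text{transistors} \\
\text{L3 savings} &= 1550\text{M} - 1300\text{M} = 250\text{M transistors}
\end{align*}
```

Under those assumptions, shrinking the L3 cache from 24 MB to 20 MB frees about as many transistors as the 250 KLUT FPGA consumes.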
![Page 71: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/71.jpg)
Superblock Organization
• Each contains substantial resources
– CLBs, memories, multipliers, etc.
• Chip provides one or more superblocks
– Accelerators are compiled for one or more superblocks
– Easy reconfiguration without recompile
– “Memory manager” controls how many can be downloaded at a time
![Page 72: Very High Radix Montgomery Multiplication](https://reader035.vdocument.in/reader035/viewer/2022062422/56813a78550346895da2738d/html5/thumbnails/72.jpg)
Conclusions
• High radix best when multipliers are cheap
– Is this a research direction of relevance to Intel?
• Focus of future research
– Low radix improvements
– Side channel security
– FPGA coprocessor architectures
– Other
Discussion?