de ne = 2 curve : of (2 ) on the elliptic curve + over f · 1968 v eltk amp, 1971 dekk er. sp...
TRANSCRIPT
High-speed
elliptic-curve cryptography
D. J. Bernstein
Thanks to:
University of Illinois at Chicago
NSF CCR–9983950
Alfred P. Sloan Foundation
Define = 2255 � 19; prime.
Define = 358990. Define
Curve : Z 0 � 1 � � � � � � 1 � by� � � coordinate of � th multiple
of (2 � � � � ) on the elliptic curve2 = � 3 + � 2 + � over F � .
Main topic of this talk: Compute
� Curve( ) � Curve( )
in very few CPU cycles.
In particular, use floating point
for fast arithmetic mod .
High-speed
elliptic-curve cryptography
D. J. Bernstein
Thanks to:
University of Illinois at Chicago
NSF CCR–9983950
Alfred P. Sloan Foundation
Define = 2255 � 19; prime.
Define = 358990. Define
Curve : Z 0 � 1 � � � � � � 1 � by� � � coordinate of � th multiple
of (2 � � � � ) on the elliptic curve2 = � 3 + � 2 + � over F � .
Main topic of this talk: Compute
� Curve( ) � Curve( )
in very few CPU cycles.
In particular, use floating point
for fast arithmetic mod .
Why cryptographers care
Each user has secret key ,
public key Curve( ).
Users with secret keys �
exchange Curve( ) � Curve( )
through an authenticated channel;
compute Curve( ); hash it;
use hash as shared secret to
encrypt and authenticate messages.
Curve speed is important
when number of messages is small.
Define = 2255 � 19; prime.
Define = 358990. Define
Curve : Z 0 � 1 � � � � � � 1 � by� � � coordinate of � th multiple
of (2 � � � � ) on the elliptic curve2 = � 3 + � 2 + � over F � .
Main topic of this talk: Compute
� Curve( ) � Curve( )
in very few CPU cycles.
In particular, use floating point
for fast arithmetic mod .
Why cryptographers care
Each user has secret key ,
public key Curve( ).
Users with secret keys �
exchange Curve( ) � Curve( )
through an authenticated channel;
compute Curve( ); hash it;
use hash as shared secret to
encrypt and authenticate messages.
Curve speed is important
when number of messages is small.
Define = 2255 � 19; prime.
Define = 358990. Define
Curve : Z 0 � 1 � � � � � � 1 � by� � � coordinate of � th multiple
of (2 � � � � ) on the elliptic curve2 = � 3 + � 2 + � over F � .
Main topic of this talk: Compute
� Curve( ) � Curve( )
in very few CPU cycles.
In particular, use floating point
for fast arithmetic mod .
Why cryptographers care
Each user has secret key ,
public key Curve( ).
Users with secret keys �
exchange Curve( ) � Curve( )
through an authenticated channel;
compute Curve( ); hash it;
use hash as shared secret to
encrypt and authenticate messages.
Curve speed is important
when number of messages is small.
Analogous system using 2�
mod :
1976 Diffie Hellman.
Using elliptic curves
to avoid index-calculus attacks:
1986 Miller, 1987 Koblitz.
Using � 3 + � 2 + � for speed:
1987 Montgomery (for ECM).
High precision from fp sums:
1968 Veltkamp, 1971 Dekker.
Speedups: 1999–2005 Bernstein.
Why cryptographers care
Each user has secret key ,
public key Curve( ).
Users with secret keys �
exchange Curve( ) � Curve( )
through an authenticated channel;
compute Curve( ); hash it;
use hash as shared secret to
encrypt and authenticate messages.
Curve speed is important
when number of messages is small.
Analogous system using 2�
mod :
1976 Diffie Hellman.
Using elliptic curves
to avoid index-calculus attacks:
1986 Miller, 1987 Koblitz.
Using � 3 + � 2 + � for speed:
1987 Montgomery (for ECM).
High precision from fp sums:
1968 Veltkamp, 1971 Dekker.
Speedups: 1999–2005 Bernstein.
Why cryptographers care
Each user has secret key ,
public key Curve( ).
Users with secret keys �
exchange Curve( ) � Curve( )
through an authenticated channel;
compute Curve( ); hash it;
use hash as shared secret to
encrypt and authenticate messages.
Curve speed is important
when number of messages is small.
Analogous system using 2�
mod :
1976 Diffie Hellman.
Using elliptic curves
to avoid index-calculus attacks:
1986 Miller, 1987 Koblitz.
Using � 3 + � 2 + � for speed:
1987 Montgomery (for ECM).
High precision from fp sums:
1968 Veltkamp, 1971 Dekker.
Speedups: 1999–2005 Bernstein.
Understanding CPU design
Computers are designed for
music, movies, Photoshop, Doom 3,
etc. Heavy use of fp arithmetic,
i.e., approximate real arithmetic.
Example: Athlon, every cycle,
does one add and one multiply
of high-precision fp numbers.
Programmer paying attention
to these CPU features
can use them for cryptography.
Analogous system using 2�
mod :
1976 Diffie Hellman.
Using elliptic curves
to avoid index-calculus attacks:
1986 Miller, 1987 Koblitz.
Using � 3 + � 2 + � for speed:
1987 Montgomery (for ECM).
High precision from fp sums:
1968 Veltkamp, 1971 Dekker.
Speedups: 1999–2005 Bernstein.
Understanding CPU design
Computers are designed for
music, movies, Photoshop, Doom 3,
etc. Heavy use of fp arithmetic,
i.e., approximate real arithmetic.
Example: Athlon, every cycle,
does one add and one multiply
of high-precision fp numbers.
Programmer paying attention
to these CPU features
can use them for cryptography.
Analogous system using 2�
mod :
1976 Diffie Hellman.
Using elliptic curves
to avoid index-calculus attacks:
1986 Miller, 1987 Koblitz.
Using � 3 + � 2 + � for speed:
1987 Montgomery (for ECM).
High precision from fp sums:
1968 Veltkamp, 1971 Dekker.
Speedups: 1999–2005 Bernstein.
Understanding CPU design
Computers are designed for
music, movies, Photoshop, Doom 3,
etc. Heavy use of fp arithmetic,
i.e., approximate real arithmetic.
Example: Athlon, every cycle,
does one add and one multiply
of high-precision fp numbers.
Programmer paying attention
to these CPU features
can use them for cryptography.
A 53-bit fp number
is a real number 2�
with � � Z and� �
253.
Round each real number � to
closest 53-bit fp number, fp53 � .
Round halves to even.
Examples:
fp53(8675309) = 8675309;
fp53(2127 + 8675309) = 2127;
fp53(2127 � 8675309) = 2127.
Understanding CPU design
Computers are designed for
music, movies, Photoshop, Doom 3,
etc. Heavy use of fp arithmetic,
i.e., approximate real arithmetic.
Example: Athlon, every cycle,
does one add and one multiply
of high-precision fp numbers.
Programmer paying attention
to these CPU features
can use them for cryptography.
A 53-bit fp number
is a real number 2�
with � � Z and� �
253.
Round each real number � to
closest 53-bit fp number, fp53 � .
Round halves to even.
Examples:
fp53(8675309) = 8675309;
fp53(2127 + 8675309) = 2127;
fp53(2127 � 8675309) = 2127.
Understanding CPU design
Computers are designed for
music, movies, Photoshop, Doom 3,
etc. Heavy use of fp arithmetic,
i.e., approximate real arithmetic.
Example: Athlon, every cycle,
does one add and one multiply
of high-precision fp numbers.
Programmer paying attention
to these CPU features
can use them for cryptography.
A 53-bit fp number
is a real number 2�
with � � Z and� �
253.
Round each real number � to
closest 53-bit fp number, fp53 � .
Round halves to even.
Examples:
fp53(8675309) = 8675309;
fp53(2127 + 8675309) = 2127;
fp53(2127 � 8675309) = 2127.
Typical CPU: UltraSPARC III.
Every cycle, UltraSPARC III can do
one fp multiplication� ��� � fp53(
� � )
and one fp addition� ��� � fp53(
� + � ),
subject to limits on � .
“4-cycle fp-operation latency”:
Results available after 4 cycles.
Can substitute subtraction
for addition. I’ll count
subtractions as additions.
A 53-bit fp number
is a real number 2�
with � � Z and� �
253.
Round each real number � to
closest 53-bit fp number, fp53 � .
Round halves to even.
Examples:
fp53(8675309) = 8675309;
fp53(2127 + 8675309) = 2127;
fp53(2127 � 8675309) = 2127.
Typical CPU: UltraSPARC III.
Every cycle, UltraSPARC III can do
one fp multiplication� ��� � fp53(
� � )
and one fp addition� ��� � fp53(
� + � ),
subject to limits on � .
“4-cycle fp-operation latency”:
Results available after 4 cycles.
Can substitute subtraction
for addition. I’ll count
subtractions as additions.
A 53-bit fp number
is a real number 2�
with � � Z and� �
253.
Round each real number � to
closest 53-bit fp number, fp53 � .
Round halves to even.
Examples:
fp53(8675309) = 8675309;
fp53(2127 + 8675309) = 2127;
fp53(2127 � 8675309) = 2127.
Typical CPU: UltraSPARC III.
Every cycle, UltraSPARC III can do
one fp multiplication� ��� � fp53(
� � )
and one fp addition� ��� � fp53(
� + � ),
subject to limits on � .
“4-cycle fp-operation latency”:
Results available after 4 cycles.
Can substitute subtraction
for addition. I’ll count
subtractions as additions.
Some variation among CPUs.
PowerPC RS64 IV: One addition
or one multiplication or one
“fused” � ��� ��� � fp53(� � + � ).
Results available after 4 cycles.
Athlon: fp64 instead of fp53;
one multiplication and one addition.
Results available after 4 cycles.
I’ll focus on UltraSPARC III.
Not the most important CPU,
but it’s a good warmup.
Typical CPU: UltraSPARC III.
Every cycle, UltraSPARC III can do
one fp multiplication� ��� � fp53(
� � )
and one fp addition� ��� � fp53(
� + � ),
subject to limits on � .
“4-cycle fp-operation latency”:
Results available after 4 cycles.
Can substitute subtraction
for addition. I’ll count
subtractions as additions.
Some variation among CPUs.
PowerPC RS64 IV: One addition
or one multiplication or one
“fused” � ��� ��� � fp53(� � + � ).
Results available after 4 cycles.
Athlon: fp64 instead of fp53;
one multiplication and one addition.
Results available after 4 cycles.
I’ll focus on UltraSPARC III.
Not the most important CPU,
but it’s a good warmup.
Typical CPU: UltraSPARC III.
Every cycle, UltraSPARC III can do
one fp multiplication� ��� � fp53(
� � )
and one fp addition� ��� � fp53(
� + � ),
subject to limits on � .
“4-cycle fp-operation latency”:
Results available after 4 cycles.
Can substitute subtraction
for addition. I’ll count
subtractions as additions.
Some variation among CPUs.
PowerPC RS64 IV: One addition
or one multiplication or one
“fused” � ��� ��� � fp53(� � + � ).
Results available after 4 cycles.
Athlon: fp64 instead of fp53;
one multiplication and one addition.
Results available after 4 cycles.
I’ll focus on UltraSPARC III.
Not the most important CPU,
but it’s a good warmup.
Exact dot products
If � � � 220 � � � � � 0 � 1 � � � � � 220
then � is a 53-bit fp number
so � = fp53( � ).
If � � ��� � � 220 � � � � � 220
then � ��� � � + � are
53-bit fp numbers so
� = fp53( � ), � = fp53( � ),
� + � = fp53( � + � ).
UltraSPARC III computes
� � ��� � � � + � with
two fp mults, one fp add.
Some variation among CPUs.
PowerPC RS64 IV: One addition
or one multiplication or one
“fused” � ��� ��� � fp53(� � + � ).
Results available after 4 cycles.
Athlon: fp64 instead of fp53;
one multiplication and one addition.
Results available after 4 cycles.
I’ll focus on UltraSPARC III.
Not the most important CPU,
but it’s a good warmup.
Exact dot products
If � � � 220 � � � � � 0 � 1 � � � � � 220
then � is a 53-bit fp number
so � = fp53( � ).
If � � ��� � � 220 � � � � � 220
then � ��� � � + � are
53-bit fp numbers so
� = fp53( � ), � = fp53( � ),
� + � = fp53( � + � ).
UltraSPARC III computes
� � ��� � � � + � with
two fp mults, one fp add.
Some variation among CPUs.
PowerPC RS64 IV: One addition
or one multiplication or one
“fused” � ��� ��� � fp53(� � + � ).
Results available after 4 cycles.
Athlon: fp64 instead of fp53;
one multiplication and one addition.
Results available after 4 cycles.
I’ll focus on UltraSPARC III.
Not the most important CPU,
but it’s a good warmup.
Exact dot products
If � � � 220 � � � � � 0 � 1 � � � � � 220
then � is a 53-bit fp number
so � = fp53( � ).
If � � ��� � � 220 � � � � � 220
then � ��� � � + � are
53-bit fp numbers so
� = fp53( � ), � = fp53( � ),
� + � = fp53( � + � ).
UltraSPARC III computes
� � ��� � � � + � with
two fp mults, one fp add.
Bit extraction
Define � � = 3 � 2� +51,
top �� = fp53(fp53(
� + � � ) � � � ),
bottom � � = fp53(� � top �
� ).
If � is a 53-bit fp number
and� � �
2� +51 then
top �� 2
�Z;
�bottom � �
�2��� 1; and
� = top �� + bottom � � .
Exact dot products
If � � � 220 � � � � � 0 � 1 � � � � � 220
then � is a 53-bit fp number
so � = fp53( � ).
If � � ��� � � 220 � � � � � 220
then � ��� � � + � are
53-bit fp numbers so
� = fp53( � ), � = fp53( � ),
� + � = fp53( � + � ).
UltraSPARC III computes
� � ��� � � � + � with
two fp mults, one fp add.
Bit extraction
Define � � = 3 � 2� +51,
top �� = fp53(fp53(
� + � � ) � � � ),
bottom � � = fp53(� � top �
� ).
If � is a 53-bit fp number
and� � �
2� +51 then
top �� 2
�Z;
�bottom � �
�2��� 1; and
� = top �� + bottom � � .
Exact dot products
If � � � 220 � � � � � 0 � 1 � � � � � 220
then � is a 53-bit fp number
so � = fp53( � ).
If � � ��� � � 220 � � � � � 220
then � ��� � � + � are
53-bit fp numbers so
� = fp53( � ), � = fp53( � ),
� + � = fp53( � + � ).
UltraSPARC III computes
� � ��� � � � + � with
two fp mults, one fp add.
Bit extraction
Define � � = 3 � 2� +51,
top �� = fp53(fp53(
� + � � ) � � � ),
bottom � � = fp53(� � top �
� ).
If � is a 53-bit fp number
and� � �
2� +51 then
top �� 2
�Z;
�bottom � �
�2��� 1; and
� = top �� + bottom � � .
Big integers as fp sums
Every integer mod 2255 � 19
can be written as a sum�
0 + �22 + �
43 + �64 +
�85 + �
107 + �128 + �
149 +�
170 + �192 + �
213 + �234
where � � 2� � 222 � � � � � 222 .
Indices�
are � 255 12 �for 0 � 1 � � � � � 11 .
Representation is not unique;
it’s not the input/output format.
Uniqueness would cost cycles!
Bit extraction
Define � � = 3 � 2� +51,
top �� = fp53(fp53(
� + � � ) � � � ),
bottom � � = fp53(� � top �
� ).
If � is a 53-bit fp number
and� � �
2� +51 then
top �� 2
�Z;
�bottom � �
�2��� 1; and
� = top �� + bottom � � .
Big integers as fp sums
Every integer mod 2255 � 19
can be written as a sum�
0 + �22 + �
43 + �64 +
�85 + �
107 + �128 + �
149 +�
170 + �192 + �
213 + �234
where � � 2� � 222 � � � � � 222 .
Indices�
are � 255 12 �for 0 � 1 � � � � � 11 .
Representation is not unique;
it’s not the input/output format.
Uniqueness would cost cycles!
Bit extraction
Define � � = 3 � 2� +51,
top �� = fp53(fp53(
� + � � ) � � � ),
bottom � � = fp53(� � top �
� ).
If � is a 53-bit fp number
and� � �
2� +51 then
top �� 2
�Z;
�bottom � �
�2��� 1; and
� = top �� + bottom � � .
Big integers as fp sums
Every integer mod 2255 � 19
can be written as a sum�
0 + �22 + �
43 + �64 +
�85 + �
107 + �128 + �
149 +�
170 + �192 + �
213 + �234
where � � 2� � 222 � � � � � 222 .
Indices�
are � 255 12 �for 0 � 1 � � � � � 11 .
Representation is not unique;
it’s not the input/output format.
Uniqueness would cost cycles!
Assume � = � � as above,
and similarly � = � � . Then� � = 0 + 22 + � � � + 468
where 0 = �0 � 0,
22 = �0 � 22 + �
22 � 0,
43 = �0 � 43 + �
22 � 22 + �43 � 0,
etc.
Each � is a 53-bit fp number.
Given � � ’s and � � ’s,
can compute � ’s using
144 fp mults, 121 fp adds.
Big integers as fp sums
Every integer mod 2255 � 19
can be written as a sum�
0 + �22 + �
43 + �64 +
�85 + �
107 + �128 + �
149 +�
170 + �192 + �
213 + �234
where � � 2� � 222 � � � � � 222 .
Indices�
are � 255 12 �for 0 � 1 � � � � � 11 .
Representation is not unique;
it’s not the input/output format.
Uniqueness would cost cycles!
Assume � = � � as above,
and similarly � = � � . Then� � = 0 + 22 + � � � + 468
where 0 = �0 � 0,
22 = �0 � 22 + �
22 � 0,
43 = �0 � 43 + �
22 � 22 + �43 � 0,
etc.
Each � is a 53-bit fp number.
Given � � ’s and � � ’s,
can compute � ’s using
144 fp mults, 121 fp adds.
Big integers as fp sums
Every integer mod 2255 � 19
can be written as a sum�
0 + �22 + �
43 + �64 +
�85 + �
107 + �128 + �
149 +�
170 + �192 + �
213 + �234
where � � 2� � 222 � � � � � 222 .
Indices�
are � 255 12 �for 0 � 1 � � � � � 11 .
Representation is not unique;
it’s not the input/output format.
Uniqueness would cost cycles!
Assume � = � � as above,
and similarly � = � � . Then� � = 0 + 22 + � � � + 468
where 0 = �0 � 0,
22 = �0 � 22 + �
22 � 0,
43 = �0 � 43 + �
22 � 22 + �43 � 0,
etc.
Each � is a 53-bit fp number.
Given � � ’s and � � ’s,
can compute � ’s using
144 fp mults, 121 fp adds.
Furthermore, modulo 2255 � 19,� � �
0 + �22 + � � � + �
234
where �0 = 0 + 19 � 2
� 255255,
�22 = 22 + 19 � 2
� 255277, etc.
Each � � is a 53-bit fp number.
Example: �0 is an integer;
� �0
�381 � 244.
Computing � � ’s from � ’s takes
11 fp mults, 11 fp adds.
Structure: (Z[ � ] Z[2255�12 � ])
(2255 � 12 � 19) Z (2255 � 19).
Assume � = � � as above,
and similarly � = � � . Then� � = 0 + 22 + � � � + 468
where 0 = �0 � 0,
22 = �0 � 22 + �
22 � 0,
43 = �0 � 43 + �
22 � 22 + �43 � 0,
etc.
Each � is a 53-bit fp number.
Given � � ’s and � � ’s,
can compute � ’s using
144 fp mults, 121 fp adds.
Furthermore, modulo 2255 � 19,� � �
0 + �22 + � � � + �
234
where �0 = 0 + 19 � 2
� 255255,
�22 = 22 + 19 � 2
� 255277, etc.
Each � � is a 53-bit fp number.
Example: �0 is an integer;
� �0
�381 � 244.
Computing � � ’s from � ’s takes
11 fp mults, 11 fp adds.
Structure: (Z[ � ] Z[2255�12 � ])
(2255 � 12 � 19) Z (2255 � 19).
Assume � = � � as above,
and similarly � = � � . Then� � = 0 + 22 + � � � + 468
where 0 = �0 � 0,
22 = �0 � 22 + �
22 � 0,
43 = �0 � 43 + �
22 � 22 + �43 � 0,
etc.
Each � is a 53-bit fp number.
Given � � ’s and � � ’s,
can compute � ’s using
144 fp mults, 121 fp adds.
Furthermore, modulo 2255 � 19,� � �
0 + �22 + � � � + �
234
where �0 = 0 + 19 � 2
� 255255,
�22 = 22 + 19 � 2
� 255277, etc.
Each � � is a 53-bit fp number.
Example: �0 is an integer;
� �0
�381 � 244.
Computing � � ’s from � ’s takes
11 fp mults, 11 fp adds.
Structure: (Z[ � ] Z[2255�12 � ])
(2255 � 12 � 19) Z (2255 � 19).
Carries
“Carry from �0 to �
22”:
replace �0 and �
22 by
bottom22�0 and �
22 + top22�0.
This takes 4 fp adds,
and guarantees� �
0�
221.
Series of 13 carries puts all � � ’s
in range for subsequent products:
from �192 to �
213 to �234 to 255;
then from �0 to �
22 to �43 to � � �
to �192 to �
213.
This takes 52 fp adds.
Furthermore, modulo 2255 � 19,� � �
0 + �22 + � � � + �
234
where �0 = 0 + 19 � 2
� 255255,
�22 = 22 + 19 � 2
� 255277, etc.
Each � � is a 53-bit fp number.
Example: �0 is an integer;
� �0
�381 � 244.
Computing � � ’s from � ’s takes
11 fp mults, 11 fp adds.
Structure: (Z[ � ] Z[2255�12 � ])
(2255 � 12 � 19) Z (2255 � 19).
Carries
“Carry from �0 to �
22”:
replace �0 and �
22 by
bottom22�0 and �
22 + top22�0.
This takes 4 fp adds,
and guarantees� �
0�
221.
Series of 13 carries puts all � � ’s
in range for subsequent products:
from �192 to �
213 to �234 to 255;
then from �0 to �
22 to �43 to � � �
to �192 to �
213.
This takes 52 fp adds.
Furthermore, modulo 2255 � 19,� � �
0 + �22 + � � � + �
234
where �0 = 0 + 19 � 2
� 255255,
�22 = 22 + 19 � 2
� 255277, etc.
Each � � is a 53-bit fp number.
Example: �0 is an integer;
� �0
�381 � 244.
Computing � � ’s from � ’s takes
11 fp mults, 11 fp adds.
Structure: (Z[ � ] Z[2255�12 � ])
(2255 � 12 � 19) Z (2255 � 19).
Carries
“Carry from �0 to �
22”:
replace �0 and �
22 by
bottom22�0 and �
22 + top22�0.
This takes 4 fp adds,
and guarantees� �
0�
221.
Series of 13 carries puts all � � ’s
in range for subsequent products:
from �192 to �
213 to �234 to 255;
then from �0 to �
22 to �43 to � � �
to �192 to �
213.
This takes 52 fp adds.
Total 155 mults, 184 adds
to multiply modulo 2255 � 19
in this representation.
184 UltraSPARC III cycles.
= 184 cycles? Two obstacles:
fp-operation latency;
“load/store” latency imposed by
limited number of “registers.”
Schedule instructions carefully
to bring cycles down to 184.
Carries
“Carry from �0 to �
22”:
replace �0 and �
22 by
bottom22�0 and �
22 + top22�0.
This takes 4 fp adds,
and guarantees� �
0�
221.
Series of 13 carries puts all � � ’s
in range for subsequent products:
from �192 to �
213 to �234 to 255;
then from �0 to �
22 to �43 to � � �
to �192 to �
213.
This takes 52 fp adds.
Total 155 mults, 184 adds
to multiply modulo 2255 � 19
in this representation.
184 UltraSPARC III cycles.
= 184 cycles? Two obstacles:
fp-operation latency;
“load/store” latency imposed by
limited number of “registers.”
Schedule instructions carefully
to bring cycles down to 184.
Carries
“Carry from �0 to �
22”:
replace �0 and �
22 by
bottom22�0 and �
22 + top22�0.
This takes 4 fp adds,
and guarantees� �
0�
221.
Series of 13 carries puts all � � ’s
in range for subsequent products:
from �192 to �
213 to �234 to 255;
then from �0 to �
22 to �43 to � � �
to �192 to �
213.
This takes 52 fp adds.
Total 155 mults, 184 adds
to multiply modulo 2255 � 19
in this representation.
184 UltraSPARC III cycles.
= 184 cycles? Two obstacles:
fp-operation latency;
“load/store” latency imposed by
limited number of “registers.”
Schedule instructions carefully
to bring cycles down to 184.
Have developed qhasm,
new programming language
for high-speed computations.
Includes range verification,
guided register allocation, et al.
Lets me write desired code
with much less human time than
traditional asm, C compiler, etc.
Have also used for fast AES,
fast Poly1305, fast Salsa20, etc.;
see, e.g., http://cr.yp.to
/mac/poly1305_athlon.s.
Total 155 mults, 184 adds
to multiply modulo 2255 � 19
in this representation.
184 UltraSPARC III cycles.
= 184 cycles? Two obstacles:
fp-operation latency;
“load/store” latency imposed by
limited number of “registers.”
Schedule instructions carefully
to bring cycles down to 184.
Have developed qhasm,
new programming language
for high-speed computations.
Includes range verification,
guided register allocation, et al.
Lets me write desired code
with much less human time than
traditional asm, C compiler, etc.
Have also used for fast AES,
fast Poly1305, fast Salsa20, etc.;
see, e.g., http://cr.yp.to
/mac/poly1305_athlon.s.
Total 155 mults, 184 adds
to multiply modulo 2255 � 19
in this representation.
184 UltraSPARC III cycles.
= 184 cycles? Two obstacles:
fp-operation latency;
“load/store” latency imposed by
limited number of “registers.”
Schedule instructions carefully
to bring cycles down to 184.
Have developed qhasm,
new programming language
for high-speed computations.
Includes range verification,
guided register allocation, et al.
Lets me write desired code
with much less human time than
traditional asm, C compiler, etc.
Have also used for fast AES,
fast Poly1305, fast Salsa20, etc.;
see, e.g., http://cr.yp.to
/mac/poly1305_athlon.s.
Speedup: Squarings
Often know in advance that � = � .
�0
�64 + �
22�
43 + �43
�22 + �
64�
0
is more efficiently computed as
2( �0
�64 + �
22�
43).
Even better: First compute
2 �0 � 2 �
22 � � � � � 2 �234
and then compute
(2 �0)
�64 + (2 �
22)�
43 etc.
130 fp adds instead of 184.
Makes carry time even more visible.
Have developed qhasm,
new programming language
for high-speed computations.
Includes range verification,
guided register allocation, et al.
Lets me write desired code
with much less human time than
traditional asm, C compiler, etc.
Have also used for fast AES,
fast Poly1305, fast Salsa20, etc.;
see, e.g., http://cr.yp.to
/mac/poly1305_athlon.s.
Speedup: Squarings
Often know in advance that � = � .
�0
�64 + �
22�
43 + �43
�22 + �
64�
0
is more efficiently computed as
2( �0
�64 + �
22�
43).
Even better: First compute
2 �0 � 2 �
22 � � � � � 2 �234
and then compute
(2 �0)
�64 + (2 �
22)�
43 etc.
130 fp adds instead of 184.
Makes carry time even more visible.
Have developed qhasm,
new programming language
for high-speed computations.
Includes range verification,
guided register allocation, et al.
Lets me write desired code
with much less human time than
traditional asm, C compiler, etc.
Have also used for fast AES,
fast Poly1305, fast Salsa20, etc.;
see, e.g., http://cr.yp.to
/mac/poly1305_athlon.s.
Speedup: Squarings
Often know in advance that � = � .
�0
�64 + �
22�
43 + �43
�22 + �
64�
0
is more efficiently computed as
2( �0
�64 + �
22�
43).
Even better: First compute
2 �0 � 2 �
22 � � � � � 2 �234
and then compute
(2 �0)
�64 + (2 �
22)�
43 etc.
130 fp adds instead of 184.
Makes carry time even more visible.
Speedup: Karatsuba’s method
Say 0 = �0 + �
22 � + � � � + �107 � 5,
1 = �128 + �
149 � + � � � + �234 � 5,
0 = � 0 + � � � , 1 = � 128 + � � � .
Original, 184 adds: Product is
0 0 +( 0 1 + 1 0) � 6 + 1 1 � 12.
Karatsuba, 182 adds:
(( 0+ 1)( 0+ 1) �0 0
�1 1) � 6
+ 0 0 + 1 1 � 12.
Improved Karatsuba, 177 adds:
( 0 + 1)( 0 + 1) � 6
+ ( 0 0�
1 1 � 6)(1 � � 6).
Speedup: Squarings
Often know in advance that � = � .
�0
�64 + �
22�
43 + �43
�22 + �
64�
0
is more efficiently computed as
2( �0
�64 + �
22�
43).
Even better: First compute
2 �0 � 2 �
22 � � � � � 2 �234
and then compute
(2 �0)
�64 + (2 �
22)�
43 etc.
130 fp adds instead of 184.
Makes carry time even more visible.
Speedup: Karatsuba’s method
Say 0 = �0 + �
22 � + � � � + �107 � 5,
1 = �128 + �
149 � + � � � + �234 � 5,
0 = � 0 + � � � , 1 = � 128 + � � � .
Original, 184 adds: Product is
0 0 +( 0 1 + 1 0) � 6 + 1 1 � 12.
Karatsuba, 182 adds:
(( 0+ 1)( 0+ 1) �0 0
�1 1) � 6
+ 0 0 + 1 1 � 12.
Improved Karatsuba, 177 adds:
( 0 + 1)( 0 + 1) � 6
+ ( 0 0�
1 1 � 6)(1 � � 6).
Speedup: Squarings
Often know in advance that � = � .
�0
�64 + �
22�
43 + �43
�22 + �
64�
0
is more efficiently computed as
2( �0
�64 + �
22�
43).
Even better: First compute
2 �0 � 2 �
22 � � � � � 2 �234
and then compute
(2 �0)
�64 + (2 �
22)�
43 etc.
130 fp adds instead of 184.
Makes carry time even more visible.
Speedup: Karatsuba’s method
Say 0 = �0 + �
22 � + � � � + �107 � 5,
1 = �128 + �
149 � + � � � + �234 � 5,
0 = � 0 + � � � , 1 = � 128 + � � � .
Original, 184 adds: Product is
0 0 +( 0 1 + 1 0) � 6 + 1 1 � 12.
Karatsuba, 182 adds:
(( 0+ 1)( 0+ 1) �0 0
�1 1) � 6
+ 0 0 + 1 1 � 12.
Improved Karatsuba, 177 adds:
( 0 + 1)( 0 + 1) � 6
+ ( 0 0�
1 1 � 6)(1 � � 6).
The Curve function
Overall strategy to compute
� Curve( ) � Curve( ),
using arithmetic mod = 2255 � 19:
For various integers � ,
find � � � � � such that
Curve( � ) � � � � (mod ),
i.e., � � Curve( � ) � � (mod ).
e.g. �1 = Curve( ), � 1 = 1,
assuming Curve( ) = .
Can easily restrict � Curve( )
to ensure that never appears.
Speedup: Karatsuba’s method
Say 0 = �0 + �
22 � + � � � + �107 � 5,
1 = �128 + �
149 � + � � � + �234 � 5,
0 = � 0 + � � � , 1 = � 128 + � � � .
Original, 184 adds: Product is
0 0 +( 0 1 + 1 0) � 6 + 1 1 � 12.
Karatsuba, 182 adds:
(( 0+ 1)( 0+ 1) �0 0
�1 1) � 6
+ 0 0 + 1 1 � 12.
Improved Karatsuba, 177 adds:
( 0 + 1)( 0 + 1) � 6
+ ( 0 0�
1 1 � 6)(1 � � 6).
The Curve function
Overall strategy to compute
� Curve( ) � Curve( ),
using arithmetic mod = 2255 � 19:
For various integers � ,
find � � � � � such that
Curve( � ) � � � � (mod ),
i.e., � � Curve( � ) � � (mod ).
e.g. �1 = Curve( ), � 1 = 1,
assuming Curve( ) = .
Can easily restrict � Curve( )
to ensure that never appears.
Speedup: Karatsuba’s method
Say 0 = �0 + �
22 � + � � � + �107 � 5,
1 = �128 + �
149 � + � � � + �234 � 5,
0 = � 0 + � � � , 1 = � 128 + � � � .
Original, 184 adds: Product is
0 0 +( 0 1 + 1 0) � 6 + 1 1 � 12.
Karatsuba, 182 adds:
(( 0+ 1)( 0+ 1) �0 0
�1 1) � 6
+ 0 0 + 1 1 � 12.
Improved Karatsuba, 177 adds:
( 0 + 1)( 0 + 1) � 6
+ ( 0 0�
1 1 � 6)(1 � � 6).
The Curve function
Overall strategy to compute
� Curve( ) � Curve( ),
using arithmetic mod = 2255 � 19:
For various integers � ,
find � � � � � such that
Curve( � ) � � � � (mod ),
i.e., � � Curve( � ) � � (mod ).
e.g. �1 = Curve( ), � 1 = 1,
assuming Curve( ) = .
Can easily restrict � Curve( )
to ensure that never appears.
We’ll see how to compute� � � � � � �
2� � � 2
� ; and� � � � � � � �
+1 � � �+1 � Curve( )
� �2
�+1 � � 2
�+1.
Combine to compute� � � � � � � �
+1 � � �+1 � � Curve( )
� � � � � � � � �
+1 � � �
+1
where =� � 2 � , = � mod 2.
Conditional branches and
input-dependent load addresses
can leak via timing.
Replace with arithmetic:
e.g., (1 � ) � � + ( ) � �+1.
The Curve function
Overall strategy to compute
� Curve( ) � Curve( ),
using arithmetic mod = 2255 � 19:
For various integers � ,
find � � � � � such that
Curve( � ) � � � � (mod ),
i.e., � � Curve( � ) � � (mod ).
e.g. �1 = Curve( ), � 1 = 1,
assuming Curve( ) = .
Can easily restrict � Curve( )
to ensure that never appears.
We’ll see how to compute� � � � � � �
2� � � 2
� ; and� � � � � � � �
+1 � � �+1 � Curve( )
� �2
�+1 � � 2
�+1.
Combine to compute� � � � � � � �
+1 � � �+1 � � Curve( )
� � � � � � � � �
+1 � � �
+1
where =� � 2 � , = � mod 2.
Conditional branches and
input-dependent load addresses
can leak via timing.
Replace with arithmetic:
e.g., (1 � ) � � + ( ) � �+1.
The Curve function
Overall strategy to compute
� Curve( ) � Curve( ),
using arithmetic mod = 2255 � 19:
For various integers � ,
find � � � � � such that
Curve( � ) � � � � (mod ),
i.e., � � Curve( � ) � � (mod ).
e.g. �1 = Curve( ), � 1 = 1,
assuming Curve( ) = .
Can easily restrict � Curve( )
to ensure that never appears.
We’ll see how to compute� � � � � � �
2� � � 2
� ; and� � � � � � � �
+1 � � �+1 � Curve( )
� �2
�+1 � � 2
�+1.
Combine to compute� � � � � � � �
+1 � � �+1 � � Curve( )
� � � � � � � � �
+1 � � �
+1
where =� � 2 � , = � mod 2.
Conditional branches and
input-dependent load addresses
can leak via timing.
Replace with arithmetic:
e.g., (1 � ) � � + ( ) � �+1.
Eventually reach � = .
Divide � � by � � modulo
to obtain Curve( ).
Simple division method: Fermat!� � � � � � �
� � 2� .
Euclid-type division methods
are faster but have
input-dependent timings.
Finally convert from
floating-point representation
to byte-string output format.
We’ll see how to compute� � � � � � �
2� � � 2
� ; and� � � � � � � �
+1 � � �+1 � Curve( )
� �2
�+1 � � 2
�+1.
Combine to compute� � � � � � � �
+1 � � �+1 � � Curve( )
� � � � � � � � �
+1 � � �
+1
where =� � 2 � , = � mod 2.
Conditional branches and
input-dependent load addresses
can leak via timing.
Replace with arithmetic:
e.g., (1 � ) � � + ( ) � �+1.
Eventually reach � = .
Divide � � by � � modulo
to obtain Curve( ).
Simple division method: Fermat!� � � � � � �
� � 2� .
Euclid-type division methods
are faster but have
input-dependent timings.
Finally convert from
floating-point representation
to byte-string output format.
We’ll see how to compute� � � � � � �
2� � � 2
� ; and� � � � � � � �
+1 � � �+1 � Curve( )
� �2
�+1 � � 2
�+1.
Combine to compute� � � � � � � �
+1 � � �+1 � � Curve( )
� � � � � � � � �
+1 � � �
+1
where =� � 2 � , = � mod 2.
Conditional branches and
input-dependent load addresses
can leak via timing.
Replace with arithmetic:
e.g., (1 � ) � � + ( ) � �+1.
Eventually reach � = .
Divide � � by � � modulo
to obtain Curve( ).
Simple division method: Fermat!� � � � � � �
� � 2� .
Euclid-type division methods
are faster but have
input-dependent timings.
Finally convert from
floating-point representation
to byte-string output format.
From � to 2 �
In Z :�
2� = ( � 2� � � 2� )2,
� 2� = 4 � � � � ( � 2� + � � � � + � 2� ).
Compute as follows:
( � � � � � )2; ( � � + � � )2;�
2� = ( � � � � � )2( � � + � � )2;
4 � � � � = ( � � + � � )2 � ( � � � � � )2;
( � 2) � � � � = 89747 � 4 � � � � ;
� 2� =
4 � � � � (( � � + � � )2 + ( � 2) � � � � ).
Eventually reach � = .
Divide � � by � � modulo
to obtain Curve( ).
Simple division method: Fermat!� � � � � � �
� � 2� .
Euclid-type division methods
are faster but have
input-dependent timings.
Finally convert from
floating-point representation
to byte-string output format.
From � to 2 �
In Z :�
2� = ( � 2� � � 2� )2,
� 2� = 4 � � � � ( � 2� + � � � � + � 2� ).
Compute as follows:
( � � � � � )2; ( � � + � � )2;�
2� = ( � � � � � )2( � � + � � )2;
4 � � � � = ( � � + � � )2 � ( � � � � � )2;
( � 2) � � � � = 89747 � 4 � � � � ;
� 2� =
4 � � � � (( � � + � � )2 + ( � 2) � � � � ).
Eventually reach � = .
Divide � � by � � modulo
to obtain Curve( ).
Simple division method: Fermat!� � � � � � �
� � 2� .
Euclid-type division methods
are faster but have
input-dependent timings.
Finally convert from
floating-point representation
to byte-string output format.
From � to 2 �
In Z :�
2� = ( � 2� � � 2� )2,
� 2� = 4 � � � � ( � 2� + � � � � + � 2� ).
Compute as follows:
( � � � � � )2; ( � � + � � )2;�
2� = ( � � � � � )2( � � + � � )2;
4 � � � � = ( � � + � � )2 � ( � � � � � )2;
( � 2) � � � � = 89747 � 4 � � � � ;
� 2� =
4 � � � � (( � � + � � )2 + ( � 2) � � � � ).
From � � � + 1 to 2 � + 1
�2
�
+1 = 4( � � � �
+1� � � � �
+1)2,
� 2�
+1 =
4( � � � �
+1� � � � �
+1)2 Curve( ).
Compute as follows:
( � � � � � )( � �
+1 + � �
+1);
( � � + � � )( � �
+1� � �
+1);
2( � � � �
+1� � � � �
+1) = sum;
2( � � � �
+1� � � � �
+1) = difference;�
2�
+1 = (2( � � � �
+1� � � � �
+1))2;
(2( � � � �
+1� � � � �
+1))2;
� 2�
+1 = ( � � � ) Curve( ).
From � to 2 �
In Z :�
2� = ( � 2� � � 2� )2,
� 2� = 4 � � � � ( � 2� + � � � � + � 2� ).
Compute as follows:
( � � � � � )2; ( � � + � � )2;�
2� = ( � � � � � )2( � � + � � )2;
4 � � � � = ( � � + � � )2 � ( � � � � � )2;
( � 2) � � � � = 89747 � 4 � � � � ;
� 2� =
4 � � � � (( � � + � � )2 + ( � 2) � � � � ).
From � � � + 1 to 2 � + 1
�2
�
+1 = 4( � � � �
+1� � � � �
+1)2,
� 2�
+1 =
4( � � � �
+1� � � � �
+1)2 Curve( ).
Compute as follows:
( � � � � � )( � �
+1 + � �
+1);
( � � + � � )( � �
+1� � �
+1);
2( � � � �
+1� � � � �
+1) = sum;
2( � � � �
+1� � � � �
+1) = difference;�
2�
+1 = (2( � � � �
+1� � � � �
+1))2;
(2( � � � �
+1� � � � �
+1))2;
� 2�
+1 = ( � � � ) Curve( ).
From � to 2 �
In Z :�
2� = ( � 2� � � 2� )2,
� 2� = 4 � � � � ( � 2� + � � � � + � 2� ).
Compute as follows:
( � � � � � )2; ( � � + � � )2;�
2� = ( � � � � � )2( � � + � � )2;
4 � � � � = ( � � + � � )2 � ( � � � � � )2;
( � 2) � � � � = 89747 � 4 � � � � ;
� 2� =
4 � � � � (( � � + � � )2 + ( � 2) � � � � ).
From � � � + 1 to 2 � + 1
�2
�
+1 = 4( � � � �
+1� � � � �
+1)2,
� 2�
+1 =
4( � � � �
+1� � � � �
+1)2 Curve( ).
Compute as follows:
( � � � � � )( � �
+1 + � �
+1);
( � � + � � )( � �
+1� � �
+1);
2( � � � �
+1� � � � �
+1) = sum;
2( � � � �
+1� � � � �
+1) = difference;�
2�
+1 = (2( � � � �
+1� � � � �
+1))2;
(2( � � � �
+1� � � � �
+1))2;
� 2�
+1 = ( � � � ) Curve( ).
Total time
Slightly over 1600 fp adds
(520 from carries)
for each bit of .
Total for 256-bit :
413000 fp adds; plus
50000 fp adds for final division.
Aiming for 500000 cycles.
Still have to finish software.
Should end up even faster than
my NIST P-224 software,
despite 14% more bits!
From � � � + 1 to 2 � + 1
�2
�
+1 = 4( � � � �
+1� � � � �
+1)2,
� 2�
+1 =
4( � � � �
+1� � � � �
+1)2 Curve( ).
Compute as follows:
( � � � � � )( � �
+1 + � �
+1);
( � � + � � )( � �
+1� � �
+1);
2( � � � �
+1� � � � �
+1) = sum;
2( � � � �
+1� � � � �
+1) = difference;�
2�
+1 = (2( � � � �
+1� � � � �
+1))2;
(2( � � � �
+1� � � � �
+1))2;
� 2�
+1 = ( � � � ) Curve( ).
Total time
Slightly over 1600 fp adds
(520 from carries)
for each bit of .
Total for 256-bit :
413000 fp adds; plus
50000 fp adds for final division.
Aiming for 500000 cycles.
Still have to finish software.
Should end up even faster than
my NIST P-224 software,
despite 14% more bits!