
Slides: http://fredrikj.net/math/arith2019.pdf

Faster arbitrary-precision dot product and matrix multiplication

Fredrik Johansson, Inria Bordeaux

26th IEEE Symposium on Computer Arithmetic (ARITH 26), Kyoto, Japan, June 10, 2019

1 / 26


Arbitrary-precision arithmetic

Precision: p ≥ 2 bits (can be thousands or millions)

- Floating-point numbers

  3.14159265358979323846264338328

- Ball arithmetic (mid-rad interval arithmetic)

  [3.14159265358979323846264338328 ± 8.65·10^-31]

Why?

- Computational number theory, computer algebra
- Dynamical systems, ill-conditioned problems
- Verifying/testing numerical results/methods

2 / 26
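To make the ball representation concrete, here is a minimal C sketch of the mid-rad product rule, using a hypothetical toy type with double midpoint and radius (Arb itself stores an arbitrary-precision midpoint and a 30-bit radius, and bounds its rounding errors rigorously):

```c
#include <math.h>
#include <float.h>

/* Mid-rad ("ball") interval: represents the set [mid - rad, mid + rad]. */
typedef struct { double mid; double rad; } ball_t;

/* Enclosure of the product of two balls: the midpoint is the rounded
   product of midpoints; the radius collects the propagated input radii
   |m|r' + |m'|r + r r' plus a crude bound on the midpoint rounding error.
   (A rigorous implementation would use directed rounding; here we just
   inflate by a few epsilons of the result's magnitude.) */
ball_t ball_mul(ball_t x, ball_t y)
{
    ball_t z;
    z.mid = x.mid * y.mid;
    z.rad = fabs(x.mid) * y.rad + fabs(y.mid) * x.rad + x.rad * y.rad
          + fabs(z.mid) * 4 * DBL_EPSILON;  /* rounding-error bound */
    return z;
}
```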


This work: faster arithmetic and linear algebra

CPU time (seconds) to multiply two real 1000× 1000 matrices

                   p = 53   p = 106   p = 212   p = 848
  BLAS             0.08     –         –         –
  QD               –        11        111       –
  MPFR             36       44        110       293
  Arb* (classical) 19       25        76        258
  Arb* (block)     3.6      5.6       8.2       27

* With ball coefficients. Arb version 2.16 – http://arblib.org

3 / 26


Two important requirements

- True arbitrary precision; inputs and output can have mixed precision; no restrictions on the exponents

- Preserve structure: near-optimal enclosures for each entry

  [1.23·10^100 ± 10^80]   -1.5              0
  1                       [2.34 ± 10^-20]   [3.45 ± 10^-50]
  0                       2                 [4.56·10^-100 ± 10^-130]

4 / 26


Dot product

∑_{k=1}^{N} a_k b_k,    a_k, b_k ∈ R or C

Kernel in basecase (N ≲ 10 to 100) algorithms for:

- Matrix multiplication
- Triangular solving, recursive LU factorization
- Polynomial multiplication, division, composition
- Power series operations

5 / 26


Dot product as an atomic operation

The old way:

arb_mul(s, a, b, prec);

for (k = 1; k < N; k++)
    arb_addmul(s, a + k, b + k, prec);

The new way:

arb_dot(s, NULL, 0, a, 1, b, 1, N, prec);

(More generally, computes s = s_0 + (-1)^c · ∑_{k=0}^{N-1} a_{k·astep} b_{k·bstep})

arb_dot – ball arithmetic, real
acb_dot – ball arithmetic, complex
arb_approx_dot – floating-point, real
acb_approx_dot – floating-point, complex

6 / 26


Numerical dot product

Approximate (floating-point) dot product:

s = ∑_{k=1}^{N} a_k b_k + ε,    |ε| ≈ 2^-p · ∑_{k=1}^{N} |a_k b_k|

Ball arithmetic dot product:

[m ± r] ⊇ ∑_{k=1}^{N} [m_k ± r_k][m'_k ± r'_k]

m = ∑_{k=1}^{N} m_k m'_k + ε,    r ≥ |ε| + ∑_{k=1}^{N} (|m_k| r'_k + |m'_k| r_k + r_k r'_k)

7 / 26
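The ball dot product rule above can be sketched in C with double midpoints and radii. This is a toy model: Arb evaluates the midpoint sum and the error bound on arbitrary-precision limbs and bounds ε rigorously, whereas here ε is bounded heuristically by N ulps of the running magnitude:

```c
#include <math.h>
#include <float.h>
#include <stddef.h>

/* Ball-arithmetic dot product following the slide's rule:
     m = sum m_k m'_k + eps
     r >= |eps| + sum (|m_k| r'_k + |m'_k| r_k + r_k r'_k)   */
void ball_dot(const double *m1, const double *r1,
              const double *m2, const double *r2,
              size_t n, double *mid, double *rad)
{
    double m = 0.0, r = 0.0, mag = 0.0;
    for (size_t k = 0; k < n; k++)
    {
        m += m1[k] * m2[k];                 /* midpoint accumulation      */
        mag += fabs(m1[k] * m2[k]);         /* magnitude, to bound eps    */
        r += fabs(m1[k]) * r2[k] + fabs(m2[k]) * r1[k] + r1[k] * r2[k];
    }
    r += (double) n * 2 * DBL_EPSILON * mag; /* heuristic bound on |eps|  */
    *mid = m;
    *rad = r;
}
```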


Representation of numbers in Arb (like MPFR)

Arbitrary-precision floating-point numbers:

(-1)^sign · 2^exp · ∑_{k=0}^{n-1} b_k · 2^{64(k-n)}

Limbs b_k are 64-bit words, normalized:

0 ≤ b_k < 2^64,    b_{n-1} ≥ 2^63,    b_0 ≠ 0

All core arithmetic operations are implemented using word manipulations and low-level GMP (mpn layer) function calls

Radius: 30-bit unsigned floating-point

8 / 26


Arbitrary-precision multiplication

[Diagram: two operands of m and n limbs]

Exact multiplication: mpn_mul → m + n limbs

Rounding to p bits and bit alignment: result ≤ p bits

9 / 26
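For the one-limb case, the multiply-and-round step can be sketched without GMP, using the unsigned __int128 extension available in GCC/Clang (an illustrative toy; mpn_mul handles the general m × n limb case, and real rounding supports multiple modes):

```c
#include <stdint.h>

/* One-limb version of the slide's scheme: multiply two normalized 64-bit
   mantissas (top bit set), renormalize the 128-bit product, and truncate
   the result to p bits. Returns the rounded mantissa, top-aligned in 64
   bits; *exp_adjust is 1 when the product needed no left shift. */
uint64_t mul_round(uint64_t a, uint64_t b, int p, int *exp_adjust)
{
    unsigned __int128 prod = (unsigned __int128) a * b;
    /* The product of two values in [2^63, 2^64) lies in [2^126, 2^128):
       shift left by at most one bit so the top bit is set. */
    int shift = (prod >> 127) ? 0 : 1;
    prod <<= shift;
    *exp_adjust = 1 - shift;
    uint64_t hi = (uint64_t) (prod >> 64);
    /* Truncate (round toward zero) to p bits, 1 <= p <= 64. */
    if (p < 64)
        hi &= ~((UINT64_C(1) << (64 - p)) - 1);
    return hi;
}
```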


Arbitrary-precision addition

[Diagram: two operands offset by their exponent difference]

Align limbs: mpn_lshift etc.

Addition: mpn_add_n, mpn_sub_n, mpn_add_1 etc.

Rounding to p bits and bit alignment: result ≤ p bits

10 / 26
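The align/add/renormalize steps can be sketched for one-limb mantissas in plain C. This is a toy model: the real code uses mpn_lshift/mpn_add_n on limb arrays, handles subtraction and rounding modes, and tracks the discarded low bits:

```c
#include <stdint.h>

/* Add x = mx * 2^ex and y = my * 2^ey, mantissas normalized (top bit
   set): align the smaller operand by the exponent difference, add, and
   renormalize on carry-out (which drops one low bit). */
uint64_t fp_add(uint64_t mx, int ex, uint64_t my, int ey, int *ez)
{
    if (ex < ey)  /* ensure x is the operand with the larger exponent */
    {
        uint64_t tm = mx; mx = my; my = tm;
        int te = ex; ex = ey; ey = te;
    }
    int d = ex - ey;
    uint64_t aligned = (d >= 64) ? 0 : (my >> d); /* align, discard low bits */
    uint64_t sum = mx + aligned;
    int carry = sum < mx;                         /* unsigned wrap = carry out */
    if (carry)
        sum = (sum >> 1) | (UINT64_C(1) << 63);   /* renormalize after carry */
    *ez = ex + carry;
    return sum;
}
```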


Dot product

First pass: inspect the terms

- Count nonzero terms
- Bound upper and lower exponents of terms
- Detect Inf/NaN/overflow/underflow (fallback code)

Second pass: compute the dot product!

- Exploit knowledge about exponents
- Single temporary memory allocation
- Single final rounding and normalization

11 / 26


Dot product

[Diagram: the N term products are added into a p-bit two's complement accumulator with ~log2(N) bits of padding (regions A, B) at each end, plus a separate error accumulator]

12 / 26


Technical comments

Radius dot products (for ball arithmetic):

- Dedicated code using 64-bit accumulator

Special sizes:

- Inline ASM instead of GMP function calls for ≤ 2×2 limb product, ≤ 3 limb accumulator
- Mulders' mulhigh (via MPFR) for 25 to 10000 limbs

Complex numbers:

- Essentially done as two length-2N real dot products
- Karatsuba-style multiplication (3 instead of 4 real muls) for ≥ 128 limbs

13 / 26
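The 3-multiplication trick for complex products is the classical Karatsuba-style identity, sketched here with doubles. For doubles it saves nothing; it pays off in Arb when each real multiplication involves ≥ 128 limbs, so trading one multiplication for a few additions is a win:

```c
/* Karatsuba-style complex multiplication: (a+bi)(c+di) with three real
   multiplications instead of four:
     m1 = a*c,  m2 = b*d,  m3 = (a+b)*(c+d)
     real part = m1 - m2,  imaginary part = m3 - m1 - m2  */
void complex_mul3(double a, double b, double c, double d,
                  double *re, double *im)
{
    double m1 = a * c;
    double m2 = b * d;
    double m3 = (a + b) * (c + d);
    *re = m1 - m2;
    *im = m3 - m1 - m2;
}
```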


Dot product performance

[Plot: cycles/term vs. bit precision p (64 to 768), comparing arb_addmul, mpfr_mul/mpfr_add, arb_dot, and arb_approx_dot]

14 / 26


Dot product performance

[Plot: cycles/term vs. bit precision p (64 to 256), comparing arb_addmul, mpfr_mul/mpfr_add, arb_dot, arb_approx_dot, and QD at p = 106 and p = 212]

15 / 26


Dot product: polynomial operations speedup in Arb

[Plot: speedup (1–5×) vs. polynomial degree N (1 to 1000) for mul, mullow, divrem, inv_series, exp_series, sin_cos, compose, revert]

(Complex coefficients, p = 64 bits)

16 / 26


Dot product: matrix operations speedup in Arb

[Plot: speedup (1–5×) vs. matrix size N (1 to 1000) for mul, solve, inv, det, exp, charpoly]

(Complex coefficients, p = 64 bits)

17 / 26


Matrix multiplication (large N )

Same ideas as polynomial multiplication in Arb:

1. [A±a][B±b] via three multiplications AB, |A|b, a(|B|+b)

2. Split + scale matrices into blocks with uniform magnitude

3. Multiply blocks of A,B exactly over Z using FLINT

4. Multiply blocks of |A|,b, a, |B|+b using hardware FP

Where is the gain?

- Integers and hardware FP have less overhead
- Multimodular/RNS arithmetic (60-bit primes in FLINT)
- Strassen O(N^2.81) matrix multiplication in FLINT

18 / 26
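Steps 2–3 can be illustrated on a toy 2×2 double matrix: scale rows of A and columns of B by powers of two so the entries become integers, multiply exactly in 64-bit arithmetic, and undo the scaling, C = E^-1 ((E A)(B F)) F^-1. FLINT's multimodular integer matrix multiplication plays this role for blocks of arbitrary precision:

```c
#include <math.h>
#include <stdint.h>

#define DIM 2

/* Toy scaled-integer block product. Assumes row i of A times 2^{e_i}
   and column j of B times 2^{f_j} are exact small integers (i.e. the
   entries are dyadic with a small per-row/per-column exponent range). */
void mat_mul_scaled(double A[DIM][DIM], double B[DIM][DIM],
                    const int e[DIM], const int f[DIM], double C[DIM][DIM])
{
    int64_t Ai[DIM][DIM], Bi[DIM][DIM];
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
        {
            Ai[i][j] = (int64_t) ldexp(A[i][j], e[i]);  /* E*A: scale rows */
            Bi[i][j] = (int64_t) ldexp(B[i][j], f[j]);  /* B*F: scale cols */
        }
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
        {
            int64_t s = 0;
            for (int k = 0; k < DIM; k++)
                s += Ai[i][k] * Bi[k][j];               /* exact integer dot */
            C[i][j] = ldexp((double) s, -e[i] - f[j]);  /* undo scaling */
        }
}
```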


Matrix multiplication

Blocks A_s, B_s chosen (using greedy search) so that each row of A_s and column of B_s has a small internal exponent range

C = AB,    c_{i,j} = ∑_k a_{i,k} b_{k,j}

[Diagram: row i of A times column j of B]

19 / 26


Block matrix multiplication

Choose blocks A_s, B_s (indices s ⊆ {1, ..., N}) so that each row of A_s and column of B_s has a small internal exponent range

C ← C + A_s B_s

[Diagram: row i of A_s times column j of B_s]

20 / 26


Block matrix multiplication, scaled to integers

Scaling is applied internally to each block A_s, B_s so that each row of A_s and each column of B_s has a small range of exponent differences

C ← C + E_s^-1 ((E_s A_s)(B_s F_s)) F_s^-1

[Diagram: row i of A_s scaled by 2^{e_i,s}; column j of B_s scaled by 2^{f_j,s}]

E_s = diag(2^{e_i,s}),    F_s = diag(2^{f_j,s})

21 / 26


Uniform and non-uniform matrices

Uniform matrix, N = 1000

  p     Classical  Block   Number of blocks  Speedup
  53    19 s       3.6 s   1                 5.3
  212   76 s       8.2 s   1                 9.3
  3392  1785 s     115 s   1                 15.5

Pascal matrix, N = 1000 (entries A_{i,j} = π · binomial(i+j, j))

  p     Classical  Block   Number of blocks  Speedup
  53    12 s       20 s    10                0.6
  212   43 s       35 s    9                 1.2
  3392  1280 s     226 s   2                 5.7

22 / 26


Approximate and certified linear algebra

Three approaches to linear solving Ax = b:

- Gaussian elimination in floating-point arithmetic: stable if A is well-conditioned

- Gaussian elimination in interval/ball arithmetic: unstable for generic well-conditioned A (lose O(N) digits)

- Approx + certification: 3.141 → [3.141 ± 0.001]

Example: Hansen-Smith algorithm

1. Compute R ≈ A−1 approximately

2. Solve (RA)x = Rb in interval/ball arithmetic

23 / 26
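A simplified certification step in this spirit, sketched for a 2×2 system (this is a residual bound closer to Rump's verification than to full Hansen-Smith elimination, and it ignores the rounding errors of the double evaluations; a real implementation evaluates every bound in ball arithmetic):

```c
#include <math.h>

/* Given an approximate inverse R and approximate solution xhat:
   if ||I - R*A||_inf < 1 then A is provably nonsingular and
     ||x - xhat||_inf <= ||R*(A*xhat - b)||_inf / (1 - ||I - R*A||_inf).
   Returns the error bound, or -1 if certification fails. */
double certify2x2(double A[2][2], const double b[2],
                  double R[2][2], const double xhat[2])
{
    double E[2][2], res[2], alpha = 0.0, beta = 0.0;
    for (int i = 0; i < 2; i++)
    {
        for (int j = 0; j < 2; j++)       /* E = I - R*A */
            E[i][j] = (i == j) - (R[i][0]*A[0][j] + R[i][1]*A[1][j]);
        res[i] = A[i][0]*xhat[0] + A[i][1]*xhat[1] - b[i]; /* residual */
    }
    for (int i = 0; i < 2; i++)
    {
        double Rres = R[i][0]*res[0] + R[i][1]*res[1];
        double row = fabs(E[i][0]) + fabs(E[i][1]);
        if (row > alpha) alpha = row;     /* ||I - R*A||_inf   */
        if (fabs(Rres) > beta) beta = fabs(Rres); /* ||R*res||_inf */
    }
    return (alpha < 1.0) ? beta / (1.0 - alpha) : -1.0;
}
```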


Linear solving

Solving a dense real linear system Ax = b (N = 1000, p = 212)

[Bar chart: time in seconds; approx (LU) solving with Eigen/MPFR vs. approx (LU) and certified (Hansen-Smith) solving with Arb]

24 / 26


Eigenvalues

Computing all eigenvalues and eigenvectors of a nonsymmetric complex matrix (N = 100, p = 128)

[Bar chart: time in seconds; approx (QR method) with Julia vs. approx (QR method), certified (van der Hoeven-Mourrain), and certified (Rump) with Arb]

25 / 26


Conclusion

Faster arbitrary-precision arithmetic, linear algebra

- Handle the dot product as an atomic operation, used instead of single add/muls where possible (1–5× speedup)

- Accurate and fast large-N matrix multiplication using scaled integer blocks (≈ 10× speedup)

- Higher operations reduce well to dot product (small N) and matrix multiplication (large N)

Future work ideas

- Correctly rounded dot product, for MPFR (easy)
- Horner scheme (in analogy with dot product)
- Better matrix scaling + splitting algorithm

26 / 26