
Research Matters

February 25, 2009

Nick Higham
Director of Research

School of Mathematics


Exploiting Half Precision Arithmetic in Solving Ax = b

Nick Higham
School of Mathematics
The University of Manchester
www.maths.manchester.ac.uk/~higham
nla-group.org

Slides available at http://bit.ly/squeeze19

Joint work with Srikara Pranesh and Mawussi Zounon


Today’s Floating-Point Arithmetics

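Type                    Bits   Range              u = 2^{-t}
bfloat16   half         16     $10^{\pm 38}$      $2^{-8} \approx 3.9 \times 10^{-3}$
fp16       half         16     $10^{\pm 5}$       $2^{-11} \approx 4.9 \times 10^{-4}$
fp32       single       32     $10^{\pm 38}$      $2^{-24} \approx 6.0 \times 10^{-8}$
fp64       double       64     $10^{\pm 308}$     $2^{-53} \approx 1.1 \times 10^{-16}$
fp128      quadruple    128    $10^{\pm 4932}$    $2^{-113} \approx 9.6 \times 10^{-35}$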

The fp* formats are all in the IEEE standard, but fp16 is defined there as a storage format only.
bfloat16 is used by the Google TPU and the forthcoming Intel Nervana Neural Network Processor.


Why Use Lower Precision in Sci Comp?

Faster flops.
Less communication.
Lower energy consumption.

But need to
  prove low precision (where used) gives sufficient accuracy,
  refine low accuracy quantities.

Focus in this talk on Ax = b.


Harmonic Series

What is the harmonic sum $\sum_{k=1}^{\infty} \frac{1}{k}$?

Arithmetic   Computed sum   No. of terms
fp8          3.5000         16
bfloat16     5.0625         65
fp16         7.0859         513
fp32         15.404         2097152
fp64         34.122         $2.81\cdots \times 10^{14}$

Simulations in MATLAB with chop function (H & Pranesh, 2019).
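The fp16 row of the table can be approximated without MATLAB. The sketch below is not the chop function cited above: it assumes that rounding through Python's struct format "e" (IEEE binary16, round to nearest) is an adequate stand-in, and the names fl16 and harmonic_sum_fp16 are hypothetical.

```python
import struct

def fl16(x):
    """Round a Python float (fp64) to the nearest IEEE fp16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def harmonic_sum_fp16():
    """Add 1/k in simulated fp16 until the running sum stops changing."""
    total, k = 0.0, 0
    while True:
        k += 1
        new_total = fl16(total + fl16(1.0 / k))
        if new_total == total:      # adding 1/k no longer changes the sum
            return total, k
        total = new_total

total, terms = harmonic_sum_fp16()
print(total, terms)   # compare with the fp16 row of the table
```

The sum stagnates once 1/k drops below half the spacing of the fp16 numbers around the running total, which is why the series appears to converge.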


Iterative Refinement in Three Precisions

$A$, $b$ given in precision $u$.

Solve $Ax_0 = b$ by LU factorization in precision $u_f > u$.
$r = b - Ax_0$            precision $u_r < u$
Solve $Ad = r$            precision $u_f$
$x_1 = fl(x_0 + d)$       precision $u$

u_f      u        u_r
half     single   double
half     double   quad
single   double   quad
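The steps above can be sketched for the (half, single, double) row of the table. This is a toy illustration under stated assumptions, not the implementation used in the talk: fp16 and fp32 are simulated by rounding through the struct module, Python's native fp64 plays the role of $u_r$, the residual is scaled by its infinity norm before the fp16 solve so that it stays inside the fp16 range, and the names fl, lu_solver, residual and ir3 are hypothetical.

```python
import struct

HALF, SINGLE = "<e", "<f"   # struct codes for IEEE fp16 and fp32

def fl(x, fmt):
    """Round a Python float (fp64) to the format given by a struct code."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def lu_solver(A, fmt):
    """Factorize PA = LU with every operation rounded to fmt; return a solver."""
    n = len(A)
    LU = [[fl(a, fmt) for a in row] for row in A]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))   # partial pivoting
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = fl(LU[i][k] / LU[k][k], fmt)
            for j in range(k + 1, n):
                LU[i][j] = fl(LU[i][j] - fl(LU[i][k] * LU[k][j], fmt), fmt)

    def solve(b):
        y = [fl(b[p], fmt) for p in perm]
        for i in range(n):                       # forward substitution (unit L)
            for j in range(i):
                y[i] = fl(y[i] - fl(LU[i][j] * y[j], fmt), fmt)
        for i in reversed(range(n)):             # back substitution with U
            for j in range(i + 1, n):
                y[i] = fl(y[i] - fl(LU[i][j] * y[j], fmt), fmt)
            y[i] = fl(y[i] / LU[i][i], fmt)
        return y

    return solve

def residual(A, x, b):
    """r = b - A x evaluated in fp64, standing in for precision u_r."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]

def ir3(A, b, steps=5):
    solve = lu_solver(A, HALF)                   # precision u_f = half
    x = solve(b)
    for _ in range(steps):
        r = residual(A, x, b)
        rnorm = max(abs(ri) for ri in r)
        if rnorm == 0.0:
            break
        d = solve([ri / rnorm for ri in r])      # assumption: scale r into fp16 range
        x = [fl(xi + rnorm * di, SINGLE) for xi, di in zip(x, d)]   # precision u
    return x

A = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]]
b = [1.0, 2.0, 3.0]
x = ir3(A, b)
print(x, max(abs(ri) for ri in residual(A, x, b)))
```

The example matrix is small and well conditioned, so a few refinement steps are enough for the residual to reach the level expected of single precision.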


GMRES-IR

H & Carson (2018, 2019): to compute the update $d_i$ apply GMRES to

$\widetilde{A} d_i \equiv U^{-1} L^{-1} A d_i = U^{-1} L^{-1} r_i$.

                                                  B'error
            u_f   u    u_r   $\kappa_\infty(A)$   nrm   cmp   F'error
LU          H     D    Q     $10^{4}$             D     D     D
GMRES-IR    H     D    Q     $10^{12}$            D     D     D
GMRES-IR    H     D    D     $10^{8}$             D     D     D

Essentially need $\kappa_\infty(A)\,u \ll 1$ for convergence.
Haidar, Tomov, Dongarra & H (2018): on NVIDIA V100 GPU, speedup of 4 and energy reduction of 80%.


Overflow and Underflow

First step of IR3: round fp64 (range $[10^{-324}, 10^{308}]$) → fp16 (range $[10^{-8}, 10^{5}]$).

Can suffer
  overflow,
  underflow,
  elements becoming subnormal: the fp16 interval $[10^{-8}, 10^{-5}]$.

Need to squeeze the range of $A$, $b$ and exploit the whole fp16 range.

Write $x_{\max} = 6.55 \times 10^{4}$ for fp16 overflow level.
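The three failure modes can be seen by rounding individual fp64 numbers to fp16. This is a minimal sketch, assuming Python's struct format "e" as the fp16 rounding; the helper fl16 and the constant names are hypothetical, and overflow is mapped to infinity by hand because struct raises an exception instead.

```python
import math
import struct

XMAX = 65504.0          # largest finite fp16 number, about 6.55e4
XMIN_SUB = 2.0 ** -24   # smallest positive fp16 subnormal, about 5.96e-8

def fl16(x):
    """Round fp64 to fp16, returning +/-inf on overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

print(fl16(1e5))    # above XMAX: overflows to inf
print(fl16(1e-8))   # below XMIN_SUB / 2: underflows to 0.0
print(fl16(1e-6))   # subnormal: stored as 17 * 2**-24, relative error about 1.3%
```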


Example: Vector 2-Norm in fp16

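Evaluate $\|x\|_2$ for $x = [\alpha, \alpha]^T$ as $\sqrt{x_1^2 + x_2^2}$ in fp16.

Recall $u_h = 4.88 \times 10^{-4}$, $r_{\min} = 6.10 \times 10^{-5}$.

$\alpha$                 Relative error          Comment
$10^{-4}$                $1$                     Underflow to 0
$3.3 \times 10^{-4}$     $4.7 \times 10^{-2}$    Subnormal range
$5.5 \times 10^{-4}$     $7.1 \times 10^{-3}$    Subnormal range
$1.1 \times 10^{-2}$     $1.4 \times 10^{-4}$    Perfect rel. err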
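The experiment in the table can be imitated as follows. This is a sketch only: it assumes every intermediate result is rounded to fp16 through struct format "e", the names fl16 and norm2_fp16 are hypothetical, and the printed errors need not match the slide to every digit because the exact evaluation order used there is not stated.

```python
import math
import struct

def fl16(x):
    """Round a Python float (fp64) to the nearest IEEE fp16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def norm2_fp16(alpha):
    """sqrt(x1^2 + x2^2) for x = [alpha, alpha], rounding each step to fp16."""
    a = fl16(alpha)
    sq = fl16(a * a)                 # may underflow or become subnormal
    return fl16(math.sqrt(fl16(sq + sq)))

def rel_err(alpha):
    exact = math.sqrt(2.0) * alpha
    return abs(norm2_fp16(alpha) - exact) / exact

for alpha in (1e-4, 3.3e-4, 5.5e-4, 1.1e-2):
    print(f"alpha = {alpha:.1e}   relative error = {rel_err(alpha):.1e}")
```

For $\alpha = 10^{-4}$ the squares fall below the smallest subnormal, so the computed norm is exactly zero; for the next two values the squares land in the subnormal range, where only a few significant bits survive.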


Simple Conversion Strategies

Algorithm Inf: Round then replace infinities.
1: $A^{(h)} = fl_h(A)$
2: For every $|a^{(h)}_{ij}| \ge \theta x_{\max}$, set $a^{(h)}_{ij} = \mathrm{sign}(a_{ij})\,\theta x_{\max}$.

Algorithm Scale: Scale then round.
1: $a_{\max} = \max_{i,j} |a_{ij}|$
2: $\mu = \theta x_{\max} / a_{\max}$
3: $A^{(h)} = fl_h(\mu A)$

Both algs make $\max_{i,j} |a^{(h)}_{ij}| \le \theta x_{\max}$, where $\theta \in (0, 1]$.

Alg Inf can make large changes to $A$.
Alg Scale: underflow or subnormals if $|a_{ij}| \ll x_{\max}$.
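Both strategies can be written in a few lines for a matrix stored as a list of rows. This is a hedged sketch rather than the authors' code: fp16 rounding is simulated with struct format "e", the names alg_inf and alg_scale are hypothetical, and the example matrix is made up to show the weakness of each method.

```python
import math
import struct

XMAX = 65504.0   # fp16 overflow level x_max

def fl16(x):
    """Round fp64 to fp16, returning +/-inf on overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

def alg_inf(A, theta=0.1):
    """Round to fp16, then clamp entries of magnitude >= theta * XMAX."""
    cap = fl16(theta * XMAX)   # the cap is itself stored in fp16 (6552 for theta = 0.1)
    Ah = [[fl16(a) for a in row] for row in A]
    return [[math.copysign(cap, a) if abs(a) >= theta * XMAX else a for a in row]
            for row in Ah]

def alg_scale(A, theta=0.1):
    """Scale so the largest entry is theta * XMAX, then round to fp16."""
    amax = max(abs(a) for row in A for a in row)
    mu = theta * XMAX / amax
    return [[fl16(mu * a) for a in row] for row in A], mu

A = [[1e10, 1.0], [1e-3, 2.0]]   # made-up badly scaled example
print(alg_inf(A))                # the entry 1e10 is replaced by the cap
print(alg_scale(A)[0])           # the entry 1e-3 underflows to zero
```

In the example, Alg Inf changes the largest entry by six orders of magnitude, while Alg Scale loses the smallest entry entirely and pushes two others into the subnormal range.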


Conversion With Two-Sided Diagonal Scaling

Algorithm 2DS: Two-sided diagonal scaling then round.
1: Obtain diagonal matrices $R$, $S$.
2: $\beta = \max_{i,j} |(RAS)_{ij}|$.
3: $\mu = \theta x_{\max} / \beta$
4: $A^{(h)} = fl_h(\mu (RAS))$   % Max elt $\theta x_{\max}$.

Algorithm: Row and column equilibration.
1: $r_i = \|A(i,:)\|_\infty^{-1}$, $i = 1:n$
2: $R = \mathrm{diag}(r)$
3: $A = RA$   % A is row equilibrated.
4: $s_j = \|A(:,j)\|_\infty^{-1}$, $j = 1:n$
5: $S = \mathrm{diag}(s)$
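The two algorithms above combine naturally into one routine. The following is an illustrative sketch, assuming a dense matrix with no zero rows or columns stored as a list of rows; fp16 rounding is simulated with struct format "e", and the names equilibrate and alg_2ds are hypothetical.

```python
import math
import struct

XMAX = 65504.0   # fp16 overflow level x_max

def fl16(x):
    """Round fp64 to fp16, returning +/-inf on overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

def equilibrate(A):
    """Row then column equilibration; returns the diagonals of R and S."""
    n, m = len(A), len(A[0])
    r = [1.0 / max(abs(a) for a in row) for row in A]            # r_i = 1 / ||A(i,:)||_inf
    RA = [[r[i] * a for a in A[i]] for i in range(n)]
    s = [1.0 / max(abs(RA[i][j]) for i in range(n)) for j in range(m)]
    return r, s

def alg_2ds(A, theta=0.1):
    """Two-sided diagonal scaling, scalar scaling by mu, then rounding to fp16."""
    r, s = equilibrate(A)
    RAS = [[r[i] * a * s[j] for j, a in enumerate(row)] for i, row in enumerate(A)]
    beta = max(abs(a) for row in RAS for a in row)
    mu = theta * XMAX / beta
    return [[fl16(mu * a) for a in row] for row in RAS], r, s, mu

A = [[1e10, 1.0], [1e-3, 2.0]]   # made-up badly scaled example
Ah, r, s, mu = alg_2ds(A)
print(Ah)
```

On this example no nonzero entry is lost to underflow and the largest entries sit at the fp16 rounding of $\theta x_{\max}$, although the smallest entry is still subnormal.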


Expose to the Right (ETTR) Analogy

In digital photography: expose image so histogram just touches the right edge.

Maximizes use of dynamic range of sensor.

Same principle here. fp16 numbers:

[Figure: the fp16 number line, divided into subnormals and normalized numbers.]

Keep computations in the yellow zone.
Likely to need to scale data up.
Need problem analysis to find appropriate scaling that avoids overflow.


Choice of θ

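In $PA = LU$ (partial pivoting): $|\ell_{ij}| \le 1$, $|u_{ij}| \le \rho_n \max_{i,j} |a_{ij}|$, where growth factor $\rho_n$ not large.

Take $\theta = 0.1$ (say).

Can show that if $u_{kk}$ underflows then

$\kappa_\infty(A) \ge \dfrac{\theta x_{\max}}{x_{\min}^{s}}$.

For fp16 and $\theta = 0.1$, this is $\kappa_\infty(A) \ge 1.09 \times 10^{11}$.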
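The fp16 figure can be checked with one line of arithmetic. The sketch assumes that $x_{\min}^{s}$ denotes the smallest positive fp16 subnormal, $2^{-24}$, and that $x_{\max} = 65504$; the variable names are hypothetical.

```python
XMAX = 65504.0          # fp16 overflow level x_max
XMIN_SUB = 2.0 ** -24   # assumed x_min^s: smallest positive fp16 subnormal
theta = 0.1

bound = theta * XMAX / XMIN_SUB
print(f"kappa_inf(A) >= {bound:.4e}")   # agrees with the quoted 1.09e11 to the digits shown
```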


Numerical Experiments

13 badly scaled matrices from SuiteSparse Matrix Collection with $\max_{i,j} |a_{ij}| > x_{\max}$ for fp16.

$\kappa_\infty(A) \le 10^{14}$.
$\max_{i,j} |a_{ij}| = 10^{10}$.
$\min_{i,j} \{ |a_{ij}| : a_{ij} \ne 0 \} = 10^{-25}$.

Precisions = (H,S,D) and (H,D,Q).

MATLAB using Moler’s fp16 class and Advanpix for quad precision.

IR convergence test is b’err $\le nu$.
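A stopping test of this form might look as follows. The slide does not say which backward error is used, so this sketch assumes the normwise backward error $\|b - Ax\|_\infty / (\|A\|_\infty \|x\|_\infty + \|b\|_\infty)$; the function names are hypothetical.

```python
def normwise_backward_error(A, x, b):
    """||b - A x||_inf / (||A||_inf ||x||_inf + ||b||_inf), evaluated in fp64."""
    r = [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]
    norm_r = max(abs(ri) for ri in r)
    norm_A = max(sum(abs(a) for a in row) for row in A)   # max absolute row sum
    norm_x = max(abs(xi) for xi in x)
    norm_b = max(abs(bi) for bi in b)
    return norm_r / (norm_A * norm_x + norm_b)

def converged(A, x, b, u):
    """Convergence test b'err <= n u, with u the working-precision unit roundoff."""
    return normwise_backward_error(A, x, b) <= len(b) * u

A = [[2.0, 0.0], [0.0, 4.0]]
b = [2.0, 4.0]
u_double = 2.0 ** -53
print(converged(A, [1.0, 1.0], b, u_double))   # exact solution passes the test
print(converged(A, [1.1, 1.0], b, u_double))   # perturbed solution fails it
```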


GMRES-IR Iterations (IR Steps)

         (half, single, double)     (half, double, quad)
Index    Alg Inf      Alg Scale     Alg Inf      Alg Scale
1        6 (1)        2 (1)         14 (2)       7 (2)
2        4 (1)        2 (1)         12 (2)       8 (2)
3        35 (3)       6 (3)         84 (4)       14 (4)
4        24 (2)       0 (0)         214 (3)      28 (4)
5        108 (2)      0 (0)         258 (3)      6 (2)
6        37 (3)       2 (1)         180 (4)      3 (1)
7        0 (0)        1 (1)         4 (2)        2 (1)
8        0 (0)        3 (1)         25 (3)       10 (2)
9        120 (4)      0 (0)         116 (3)      16 (4)
10       – (–)        0 (0)         – (–)        18 (4)
11       255 (3)      0 (0)         686 (4)      37 (4)
12       0 (0)        – (–)         13 (2)       – (–)
13       0 (0)        – (–)         11 (2)       – (–)


GMRES-IR Iterations (IR Steps)

         (half, single, double)     (half, double, quad)
Index    Alg 2DS                    Alg 2DS
1        0 (0)                      2 (1)
2        0 (0)                      4 (2)
3        2 (1)                      6 (2)
4        0 (0)                      16 (2)
5        0 (0)                      2 (1)
6        0 (0)                      2 (1)
7        0 (0)                      2 (1)
8        0 (0)                      8 (2)
9        0 (0)                      9 (3)
10       1 (1)                      11 (3)
11       0 (0)                      36 (3)
12       0 (0)                      9 (2)
13       0 (0)                      7 (2)


Conditioning

[Figure: log-scale plot, vertical axis from $10^{0}$ to $10^{15}$, horizontal axis ticks 2 to 12; legend entries (A), (RAS), (MA).]


Notes on Scaling

The purpose of two-sided diagonal scaling is to squeeze A into fp16.

The scaled alg is mathematically equivalent to the unscaled one if the LU pivot sequence doesn’t change

. . . and even numerically equivalent to the unscaled one if the scaling is by powers of 2.

Scaling may change the pivot sequence, though.

Important to work with the unscaled problem as scaling changes norms!
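The remark about powers of 2 can be illustrated on a single number: multiplying by a power of 2 only shifts the exponent, so it commutes with rounding as long as nothing overflows or underflows. A minimal sketch, assuming fp16 simulated by struct format "e"; the helper names fl16 and pow2_floor are hypothetical.

```python
import math
import struct

def fl16(x):
    """Round a Python float (fp64) to the nearest IEEE fp16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def pow2_floor(mu):
    """Largest power of 2 not exceeding mu > 0."""
    _, e = math.frexp(mu)          # mu = m * 2**e with 0.5 <= m < 1
    return math.ldexp(1.0, e - 1)

x = 0.1
mu = 1000.0
mu2 = pow2_floor(mu)               # 512.0

print(fl16(mu2 * x) == mu2 * fl16(x))   # power-of-2 scaling commutes with rounding
print(fl16(mu * x) == mu * fl16(x))     # general scaling does not
```

Replacing a scaling factor by a nearby power of 2 in this way gives up a little of the fp16 range in exchange for a scaling step that introduces no rounding error of its own.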


Conclusions

Must consider overflow/underflow in conversion to fp16.

Two-sided diagonal scaling works well to compress the range.

Further scalar mult needed to move data close to xmax.

Alg 2DS greatly widens the class of problems solvable with IR3 (4x double precision solver).

Slides available at http://bit.ly/squeeze19

For more on performance of GMRES-IR Ax = b solvers see 3:05-3:25, Jack J. Dongarra, Experiments with Mixed Precision Algorithms in Linear Algebra.



James H. Wilkinson (1919–1986) Centenary
https://nla-group.org/

Wilkinson page and blog posts during 2019.
Advances in Numerical Linear Algebra, Manchester, May 29-30, 2019, Celebrating the Centenary of the Birth of James H. Wilkinson.


References I

E. Carson and N. J. Higham. A new analysis of iterative refinement and its application to accurate solution of ill-conditioned sparse linear systems. SIAM J. Sci. Comput., 39(6):A2834–A2856, 2017.

E. Carson and N. J. Higham. Accelerating the solution of linear systems by iterative refinement in three precisions. SIAM J. Sci. Comput., 40(2):A817–A847, 2018.

N. J. Higham. Iterative refinement for linear systems and LAPACK. IMA J. Numer. Anal., 17(4):495–509, 1997.


References II

N. J. Higham and S. Pranesh. Simulating low precision floating-point arithmetic. MIMS EPrint 2019.xx, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, 2019. In preparation.

N. J. Higham, S. Pranesh, and M. Zounon. Squeezing a matrix into half precision, with an application to solving linear systems. MIMS EPrint 2018.37, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, Nov. 2018. 15 pp.

