metalibm - inria

Metalibm

Olga Kupriianova

Sorbonne Universites, UPMC Univ Paris 06,UMR 7606, LIP6,

F-75005 Paris, France

Rencontres Arithmetiques de l’Informatique Mathematique7 April 2015

O. Kupriianova (LIP6, Paris) Metalibm RAIM, 7 April 2015 1 / 25

Outline

1 Introducing libm

2 Problem statement

3 The Metalibm Project

4 General approach to function implementation


Mathematical libraries

The libm

gives a set of maths functions (exp, log, sin, cos, pow, ...)

supports important precisions (float, double, long double)

The existing implementations

glibc libm

libmcr by Suna

crlibm by ENS-Lyon

libultim by IBM

Intel’s proprietary mathimf

aAll other trademarks and copyrights are the property of their respective owners


The glibc libm

Why glibc libm?

Math library running on all linux-powered machines

Written by different developer teams

Some codes were written 20 years ago

Strange naming conventions


/∗∗ IBM Accurate Mathemat ica l L i b r a r y∗ w r i t t e n by I n t e r n a t i o n a l Bu s i n e s s Machines Corp .∗ Copy r i gh t (C) 2001-2013 Free So f tware Foundat ion , I n c . ∗//∗ ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗ ∗//∗ ∗//∗ MODULE NAME: halfulp.c ∗//∗ ∗//∗ FUNCTIONS : halfulp ∗//∗ FILES NEEDED: mydefs . h d l a . h end ian . h ∗//∗ u roo t . c ∗//∗ ∗//∗Rout ine h a l f u l p ( doub l e x , doub l e y ) computes xy where ∗//∗ r e s u l t does not need round ing . I f the r e s u l t i s c l o s e r ∗//∗ to 0 than can be r e p r e s e n t e d i t returns 0. ∗//∗ I n the f o l l o w i n g c a s e s the f u n c t i o n does not compute ∗//∗ any th i ng and r e t u r n s a negative number: ∗//∗ 1 . i f the r e s u l t needs round ing , ∗//∗ 2 . i f y i s o u t s i d e the i n t e r v a l [ 0 , 2ˆ20−1] , ∗//∗ 3 . i f x can be r e p r e s e n t e d by x=2∗∗n f o r some i n t e g e r n ∗//∗ ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗ ∗/


The libm

The restrictions of libm

Limited set of functions: trigonometric, exponentiation andlogarithmic, hyperbolic, special functions

Limited set of precisions (float, double, long double)

The input range: “as far as the input and output formats allow”

The only one implementation, hard-coded long time ago

No special code for exp(x) on [−10; 10] with 42 bits of accuracy


Speed vs. accuracy compromise

Common wisdom

The more accurate you compute, the more expensive it gets

Observation

The binary64 format provides about 16 decimal digits,while physical measurements might use only 3!

Compiler flag?

Why can’t we have an option like

gcc -c code.c -fp-transcendental=15ulps

to get less accuracy when we don’t need more


Outline

1 Introducing libm

2 Problem statement




Flavors of functions

Flavor

A flavor is a particular specification of a function

Absolute error in ulps

|X − x | ≤ α · ulp(x)

Relative Error ∣∣∣∣Xx − 1

∣∣∣∣ ≤ α · β−p+1

Correct Rounding

|X − x | ≤ 0.5 · ulp(x)


Table Maker’s Dilemma

Table Maker’s Dilemma

Sometimes huge accuracy is neededRuntime of function is not bounded or really huge

Recommendations

Precompute TMD worst cases

Avoid correct rounding wherever we do not really need it


Different flavors of the functions

Precision Ulp Error DecimalDigits

NeededAccuracy

Verdict

Single 1 ulp 7 2−24 fastSingle (CR) 0.5 ulps 7 2−50 slowDouble 1 ulp 16 2−53.5 fastDouble (CR) 0.5 ulps 16 2−150 slow

Not fast enough? Still more flavors

Single 25 ulps 5 2−17 fasterSingle 211 ulps 3 2−11 the fastestDouble 211 ulps 12 2−40 fasterDouble 242 ulps 3 2−10 the fastest


Still more flavors

We should get even more flavors!

set floating-point flags or not 2 possibilities

support all rounding modes or not: 4 possibilities

save mode, perform computations, restore modeperform computations to support all the rounding modesperform integer-based computations

⇒ Eight more flavors


Task

Rough Computations

3 precisions

∼ 50 functions in libm

∼ 10 flavors in terms of accuracy

8 additional flavors

I have to implement 12000 functions...

Each function implementation takes ∼ 1 man-monthIt will take me 1000 years to rewrite the libm


Task

Rough Computations

3 precisions





Each function implementation takes ∼ 1 man-month

It will take me 1000 years to rewrite the libm


Task

Rough Computations

3 precisions





Each function implementation takes ∼ 1 man-monthIt will take me 1000 years to rewrite the libm


Outline

1 Introducing libm

2 Problem statement




Solution

Metalibm

Write a tool that produces code for math functions

Analogy

assembly → compilerscode → code generators

. c f i s t a r t p r o csubq $8 , %r s p. c f i d e f c f a o f f s e t 16movl $52 , %r8dmovl $37 , %ecxmovl $15 , %edxmovl $ . LC0 , %e s imovl $1 , %e d ix o r l %eax , %eaxc a l l p r i n t f c h kx o r l %eax , %eaxaddq $8 , %r s p. c f i d e f c f a o f f s e t 8r e t. c f i e n d p r o c

i n c l u d e <s t d i o . h>

i n t main ( ) {i n t a , b , sum ;a = 1 5 ;b = 3 7 ;sum = a + b ;p r i n t f ( ”%d + %d = %d\n” , a , b ,

sum ) ;r e t u r n 0 ;

}


Metalibm prototype

Academic prototype

Produces flexible math functions implementations

Gives a function on a specified domain

It’s free and already available!Try it on http://metalibm.org/

State of the art

Supports almost all libm functions

The documentation has to be finished

We are working on the vectorizable implementations

We are working on support of specific functions (e.g. dickman’s, erf)


http://metalibm.org/

Demo


Timings for exponential function

3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=30

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=31

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=32

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=33

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=34

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=35

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=36

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=37

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=38

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=39

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=40

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=41

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=42

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=43

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=44

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=45

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=46

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=47

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=48

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=49

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=50

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=51

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=52

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=53

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=54

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=55

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=56

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=57

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=58

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=59

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=60

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=61

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=62

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=63

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=64

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=65

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=66

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=67

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=68

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=69

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=70

table

degree

time

0

0.2

0.4

0.6

0.8

1



3 4 5 6 7 8 9 10 11 12 2 4

6 8

10 12

14 16

0

0.2

0.4

0.6

0.8

1

time

precision=70

table

degree

time

0

0.2

0.4

0.6

0.8

1

Making this movie meant implementing 6150 functionsO. Kupriianova (LIP6, Paris) Metalibm RAIM, 7 April 2015 18 / 25

Outline

1 Introducing libm

2 Problem statement




Steps in function implementation

1 Eliminating special cases and values:zeros, infinities, NaNs, etc.

2 Argument reduction: x ∈ [α, β] is asmall interval

3 Polynomial approximation: Remezalgorithm for minimaxapproximations, polynomial of lowdegree

4 Reconstruction

Example

implement f (x) = ex

ex = 2x

log 2 = 2

⌊x

log 2

⌉· 2

xlog 2−⌊

xlog 2

⌉=

= 2E · ex−E log 2 = 2E · er


Range reduction

Range reduction

Step is based on mathematical properties:na+b = na · nb, sin(x + 2π) = sin(x), log(a · b) = log(a) + log(b), . . .

What properties do we know for erf (x), or a function defined by ODE?

When range reduction does not work

Piecewise-polynomial implementation

Algorithm to split the domain

Reconstruction becomes an execution of if-statements


Domain Splitting. Different approaches

f = asin(x), I = [0, 0.85], d ≤ 8, ε = 2−52

Simple and naive: split domain into k equal parts, k is large.

0

0.5

1

1.5

2

0 0.0

5

0.1

0.1

5

0.2

0.2

5

0.3

0.3

5

0.4

0.4

5

0.5

0.5

5

0.6

0.6

5

0.7

0.7

5

0.8

0.8

5

splittingasin(x)

0

1

2

3

4

5

6

7

8

9

0 0.0

5

0.1

0.1

5

0.2

0.2

5

0.3

0.3

5

0.4

0.4

5

0.5

0.5

5

0.6

0.6

5

0.7

0.7

5

0.8

0.8

5

degrees



f = asin(x), I = [0, 0.85], d ≤ 8, ε = 2−52

Bisection (use of a theorem of de la Vallee Poussin)

0

0.5

1

1.5

2

0 0.0

5

0.1

0.1

5

0.2

0.2

5

0.3

0.3

5

0.4

0.4

5

0.5

0.5

5

0.6

0.6

5

0.7

0.7

5

0.8

0.8

5

splittingasin(x)

0

1

2

3

4

5

6

7

8

9

0 0.0

5

0.1

0.1

5

0.2

0.2

5

0.3

0.3

5

0.4

0.4

5

0.5

0.5

5

0.6

0.6

5

0.7

0.7

5

0.8

0.8

5

degrees



f = asin(x), I = [0, 0.85], d ≤ 8, ε = 2−52

Our optimized method (use of a theorem of de la Vallee Poussin)

0

0.5

1

1.5

2

0 0.0

5

0.1

0.1

5

0.2

0.2

5

0.3

0.3

5

0.4

0.4

5

0.5

0.5

5

0.6

0.6

5

0.7

0.7

5

0.8

0.8

5

splittingasin(x)

0

1

2

3

4

5

6

7

8

9

0 0.0

5

0.1

0.1

5

0.2

0.2

5

0.3

0.3

5

0.4

0.4

5

0.5

0.5

5

0.6

0.6

5

0.7

0.7

5

0.8

0.8

5

degrees


Results

name function f target accuracy domain I degree bound

f1 asin ε = 2−52 I = [0, 0.85] dmax = 8

f2 asin ε = 2−45 I = [−0.75, 0.75] dmax = 8

f3 erf ε = 2−51 I = [−0.75, 0.75] dmax = 9

f4 erf ε = 2−45 I = [−0.75, 0.75] dmax = 7

f5 erf ε = 2−50 I = [−0.75, 0.75] dmax = 6

Table: Flavor specifications

measure f1 f2 f3 f4 f5subdomain qty in bisection 24 15 9 12 39

subdomain qty in improved bisection 18 10 5 8 25

subdomains saved 25% 30% 44% 30% 36%

coefficients saved 42 31 27 24 79

memory saved (bytes) 336 248 216 192 632

Table: Table of measurements for several function flavorsO. Kupriianova (LIP6, Paris) Metalibm RAIM, 7 April 2015 23 / 25

Conclusion

writing code → writing code generator

gives functions for the specified domain, specified accuracies, etc.

parametrized function implementation in several minutes

new domain splitting procedure

adding argument reduction procedures

available at http://metalibm.org

work on vectorizable implementations and domain spltting

work on documentation


http://metalibm.org

Q/A

Thank you for your attention!Questions?


metalibm - inria

Documents