1
Floating Point Computation
Jyun-Ming Chen
2
Contents
• Sources of Computational Error
• Computer Representation of (Floating-point) Numbers
• Efficiency Issues
3
Sources of Computational Error
• Converting a mathematical problem to a numerical problem introduces errors due to limited computation resources:
– round-off error (limited precision of representation)
– truncation error (limited time for computation)
• Misc.
– Error in original data
– Blunder (programming/data input error)
– Propagated error
4
Supplement: Error Classification (Hildebrand)
• Gross error: caused by human or mechanical mistakes
• Roundoff error: the consequence of using a number specified by n correct digits to approximate a number which requires more than n digits (generally infinitely many digits) for its exact specification.
• Truncation error: any error which is neither a gross error nor a roundoff error.
• Frequently, a truncation error corresponds to the fact that, whereas an exact result would be afforded (in the limit) by an infinite sequence of steps, the process is truncated after a certain finite number of steps.
5
Common measures of error
• Definitions
– total error = round-off + truncation
– Absolute error = | numerical – exact |
– Relative error = absolute error / | exact |
• If exact is zero, relative error is not defined
6
Ex: Round off error
The representation consists of a finite number of digits
Implication: the representable real numbers are discrete (more later)
7
Watch out for printf !!
• By default, “%f” prints only 6 digits after the decimal point.
8
Ex: Numerical Differentiation
• Evaluating first derivative of f(x)
From the Taylor expansion
 f(x + h) = f(x) + h·f'(x) + (h^2/2)·f''(ξ)
solve for the derivative:
 f'(x) = [f(x + h) – f(x)]/h – (h/2)·f''(ξ)
 f'(x) ≈ [f(x + h) – f(x)]/h, for small h
The dropped term –(h/2)·f''(ξ) is the truncation error.
9
Numerical Differentiation (cont)
• Select a problem with known answer
– so that we can evaluate the error!
 f(x) = x^3, f'(x) = 3x^2
 f'(10) = 300
10
Numerical Differentiation (cont)
• Error analysis
– error vs. h (truncation error ∝ h)
• What happened at h = 0.00001?!
11
Ex: Polynomial Deflation
• F(x) is a polynomial with 20 real roots
• Use any method to numerically solve a root, then deflate the polynomial to 19th degree
• Solve another root, and deflate again, and again, …
• The accuracy of the roots obtained is getting worse each time due to error propagation
 f(x) = (x – 1)(x – 2)…(x – 20)
12
Computer Representation of Floating Point Numbers
Floating point VS. fixed point
Decimal-binary conversion
Standard: IEEE 754 (1985)
13
Floating VS. Fixed Point
• Decimal, 6 digits (positive numbers)
– Fixed point: 5 digits after the decimal point
• 0.00001, … , 9.99999
– Floating point: 2 digits as exponent (base 10); 4 digits for mantissa (accuracy)
• 0.001×10^-99, … , 9.999×10^99
• Comparison:
– Fixed point: fixed accuracy; simple math for computation (sometimes used in graphics programs)
– Floating point: trades accuracy for a larger range of representation
14
Decimal-Binary Conversion
• Ex: 134 (base 10)
• Ex: 0.125 (base 10)
• Ex: 0.1 (base 10)
15
Floating Point Representation
• A number is represented as ±f × β^e
• Fraction (mantissa), f
– usually normalized so that 1/β ≤ f < 1
• Base, β
– 2 for personal computers
– 16 for mainframes
– …
• Exponent, e
16
Understanding Your Platform
17
• How about structs? Padding.
18
IEEE 754-1985
• Purpose: make floating-point systems portable
• Defines: the number representation, how calculations are performed, exceptions, …
• Single-precision (32-bit)
• Double-precision (64-bit)
19
Number Representation
• S: sign of mantissa
• Range (roughly)
– Single: 10^-38 to 10^38
– Double: 10^-307 to 10^307
• Precision (roughly)
– Single: 7 significant decimal digits
– Double: 15 significant decimal digits
• Describe how these are obtained
20
Implication
• When you write your program, make sure the results you printed carry the meaningful significant digits.
21
Implicit One
• The mantissa is normalized to gain one extra bit of precision (the leading 1 is implicit and not stored)
• Ex: –3.5
22
Exponent Bias
• Ex: in single precision, the exponent has 8 bits
– 0000 0000 (0) to 1111 1111 (255)
• Add an offset to represent both positive and negative exponents
– Effective exponent = biased exponent – bias
– Bias value: 32-bit (127); 64-bit (1023)
– Ex: 32-bit
• 1000 0000 (128): effective exp. = 128 – 127 = 1
23
Ex: Convert – 3.5 to 32-bit FP Number
24
Examine Bits of FP Numbers
• Explain how this program works
25
The “Examiner”
• Use the previous program to
– Observe how ME works
– Test subnormal behaviors on your computer/compiler
– Convince yourself why the subtraction of two nearly equal numbers produces lots of error
– NaN: Not-a-Number !?
26
Design Philosophy of IEEE 754
• [s|e|m]
• S first: whether the number is +/– can be tested easily
• E before M: simplifies sorting
• Negative exponents represented by a bias (not 2's complement) for ease of sorting
– [biased rep] –1, 0, 1 = 126, 127, 128
– [2's compl.] –1, 0, 1 = 0xFF, 0x00, 0x01
• 2's complement would need more complicated math for sorting and increment/decrement
27
Exceptions
• Overflow
– ±INF: when a number exceeds the range of representation
• Underflow
– Numbers too close to zero are treated as zero
• Dwarf
– The smallest representable number in the FP system
• Machine Epsilon (ME)
– A number with computational significance (more later)
28
Extremities
• E: (1…1)
– M (0…0): infinity
– M not all zeros: NaN (Not a Number)
• E: (0…0)
– M (0…0): clean zero
– M not all zeros: dirty zero (see next page)
More later
29
Not-a-Number
• Numerical exceptions– Sqrt of a negative number– Invalid domain of trigonometric functions– …
• Often causes the program to stop running
30
Extremities (32-bit)
• Max:
 0 11111110 11111111111111111111111
 = (1.111…1)_2 × 2^(254–127) = (10 – 0.000…1)_2 × 2^127 ≈ 2^128
• Min (w/o stepping into dirty-zero):
 0 00000001 00000000000000000000000
 = (1.000…0)_2 × 2^(1–127) = 2^-126
31
Dirty-Zero (a.k.a. denormals)
• No “Implicit One”
• IEEE 754 did not specify compatibility for denormals
• If you are not sure how to handle them, stay away from them. Scale your problem properly
– “Many problems can be solved by pretending as if they do not exist”
a.k.a.: also known as
32
Dirty-Zero (cont)
00000000 10000000 00000000 00000000 = 2^-126
00000000 01000000 00000000 00000000 = 2^-127
00000000 00100000 00000000 00000000 = 2^-128
00000000 00010000 00000000 00000000 = 2^-129
(Dwarf: the smallest representable)
[Figure: real line near zero — the denormals fill the gap between 0 and 2^-126; the dwarf is the smallest denormal]
33
Dwarf (32-bit)
Value: 2^-149
34
Machine Epsilon (ME)
• Definition
– smallest non-zero number that makes a difference when added to 1.0 on your working platform
• This is not the same as the dwarf
35
Computing ME (32-bit)
• Keep making eps smaller, so that 1 + eps gets closer to 1.0
• ME: (00111111 10000000 00000000 00000001) – 1.0
 = 2^-23 ≈ 1.19 × 10^-7
36
Effect of ME
37
Significance of ME
• Never terminate an iteration by testing whether two FP numbers are equal.
• Instead, test whether |x – y| < ME
38
Numerical Scaling
• Number density: there are as many IEEE 754 numbers between [1.0, 2.0] as there are in [256, 512]
• Revisit:
– “roundoff” error
– ME: a measure of density near 1.0
• Implication:
– Scale your problem so that intermediate results lie between 1.0 and 2.0 (where numbers are dense, and where roundoff error is smallest)
39
Scaling (cont)
• Performing computation on denser portions of the real line minimizes the roundoff error
– but don't overdo it; switching to double precision will easily increase the precision
– The densest part is near the subnormals, if density is defined as numbers per unit length
40
How Subtraction is Performed on Your PC
• Steps:
– convert to base 2
– equalize the exponents by adjusting the mantissa values; truncate the values that do not fit
– subtract mantissas
– normalize
41
Subtraction of Nearly Equal Numbers
• Base 10: 1.24446 – 1.24445 = 0.00001: the leading digits cancel
• In binary, the two mantissas agree in their leading bits, so only a few trailing bits survive the subtraction
• Significant loss of accuracy (most bits are unreliable)
42
Theorem of Loss Precision
• Let x, y be normalized floating-point machine numbers with x > y > 0
• If 2^-p ≤ 1 – y/x ≤ 2^-q, then at most p and at least q significant binary bits are lost in the subtraction x – y.
• Interpretation:
– “When two numbers are very close, their subtraction introduces a lot of numerical error.”
43
Implications
• When you program:
 f(x) = sqrt(x^2 + 1) – 1
 g(x) = ln(x) – 1
• You should write these instead:
 f(x) = [sqrt(x^2 + 1) – 1] · [sqrt(x^2 + 1) + 1] / [sqrt(x^2 + 1) + 1]
  = x^2 / [sqrt(x^2 + 1) + 1]
 g(x) = ln(x) – ln(e) = ln(x/e)
• Every FP operation introduces error, but the subtraction of nearly equal numbers is the worst and should be avoided whenever possible
44
Efficiency Issues
• Horner Scheme
• program examples
45
Horner Scheme
• For polynomial evaluation
• Compare efficiency
46
Accuracy vs. Efficiency
47
Good Coding Practice
48
On Arrays …
49
Issues of PI
• 3.14 is often not accurate enough
– 4.0*atan(1.0) is a good substitute
50
Compare:
51
Exercise
• Explain why
 sum_{i=1}^{100,000} 0.1 ≠ 10,000
• Explain why
 sum_{n=1}^{∞} 1/n = 1 + 1/2 + 1/3 + 1/4 + …
 converges when implemented numerically
52
Exercise
• Why does Me( ) not work as advertised?
• Construct the 64-bit version of everything
– Bit-Examiner
– Dme( )
• 32-bit: int and float. Can every int be represented exactly by a float (if converted)?