
Introduction and Floating-point Numbers

Douglas Wilhelm Harder, M.Math. LEL
Department of Electrical and Computer Engineering
University of Waterloo
Waterloo, Ontario, Canada
ece.uwaterloo.ca
[email protected]

© 2012 by Douglas Wilhelm Harder. Some rights reserved.


Outline

This topic discusses numerical methods:
– We will give a quick overview of the tools used:
• Iteration
• Linear algebra
• Interpolation
• Taylor series
• Bracketing
– Landau symbols and floating-point numbers
– Topics to be covered in NE 216 and NE 217:
• IVPs in NE 216
• PDEs in NE 217


Five Tools for Numerical Methods

For many numerical algorithms, we use one or more of five tools:
– Iteration
– Linear algebra
– Interpolation
– Taylor series
– Bracketing

All numerical algorithms you have seen use these techniques in some combination


Iteration

The first tool used by almost all numerical algorithms is iteration:
– We have a problem for which we are looking for a solution x*
– We start with an initial approximation or guess, call it x0
– We develop an algorithm or formula that takes a given approximation xk and hopefully produces a better approximation xk+1 of x*
– We need mechanisms to indicate:
• When our guess is “good enough”
• When it is pointless to continue using a particular algorithm


Iteration

The easiest example is an application of the fixed-point theorem

Given a problem of the form

x = f(x)

there are certain conditions (local contraction mappings) under which this may be solved by starting with an initial guess x0 and iterating:

xk + 1 = f(xk)
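As a sketch of this idea (written here in Python rather than the MATLAB used later in these slides; the helper name `fixed_point` is ours, not from the course), the iteration xk+1 = f(xk) is just a loop:

```python
import math

def fixed_point(f, x0, n):
    """Iterate x_{k+1} = f(x_k) starting from x0 and return the n-th iterate."""
    x = x0
    for _ in range(n):
        x = f(x)
    return x

# Solve x = cos(x); cos is a contraction near the fixed point, so this converges.
x = fixed_point(math.cos, 0.7388, 100)
```

After 100 iterations, the iterate should agree with x* = 0.739085133215160641655312 to roughly machine precision.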


Iteration

Let’s look at

x = cos(x)

Plotting this, we see that

x0 = 0.7388

is likely a good initial approximation


Iteration

>> x = 0.7388
x = 0.738800000000000
>> for i = 1:20
       x = cos( x )
   end


0.739277172332048
0.738955759728378
0.739172274566653
0.739026430946466
0.739124674496043
0.739058497154931
0.739103075323556
0.739073047076154
0.739093274529801
0.739079649103192
0.739088827367838
0.739082644784437
0.739086809449742
0.739084004082345
0.739085893812137
0.739084620867674
0.739085478338494
0.739084900735888
0.739085289815975
0.739085027726959

x* = 0.739085133215160641655312


Iteration

>> x = 0.7388
x = 0.738800000000000
>> for i = 1:3
       x = x + (cos(x) - x)/(sin(x) + 1)
   end


0.739085151172220
0.739085133215161
0.739085133215161

x* = 0.739085133215160641655312

Note: some algorithms are better than others…


Selecting Initial Points

Note: there is no such thing as a good numerical algorithm that can find a solution from any arbitrary initial point
– Perhaps a mathematician cares about such properties, but if you as an engineer do not have a reasonable idea of what a good starting point is, you have a problem…
– In this course, you will be given initial points where necessary, unless you are looking at a problem where you are explicitly asked to estimate an initial point


Linear Algebra

The next tool used is linear algebra:
– Most interesting systems contain more than one variable
– Usually this results in a system of either non-linear or linear equations
– The only systems that we can easily solve are linear systems
– Systems of non-linear equations can be solved by approximating them by a system of linear equations and iterating


Linear Algebra

In one dimension, this is what Newton’s method does:
– Given a non-linear function f and a point xk,
– Find a tangent to the function at (xk, f(xk))


Linear Algebra

The equation of a line with slope m through the point (xk, yk) is

    y = m(x – xk) + yk

Finding the root of this line requires only simple algebra:

    x = xk – yk / m

In this case the line is the tangent, with slope m = f′(xk) through the point (xk, f(xk)), so:

    xk+1 = xk – f(xk) / f′(xk)
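Putting the tangent-line formula to work (a Python sketch; the helper `newton` is our own illustration, not part of any library discussed here):

```python
import math

def newton(f, fprime, x0, n):
    """Newton's method: replace f by its tangent line and take that line's root."""
    x = x0
    for _ in range(n):
        x = x - f(x) / fprime(x)
    return x

# Recast x = cos(x) as the root-finding problem f(x) = cos(x) - x = 0.
root = newton(lambda x: math.cos(x) - x,
              lambda x: -math.sin(x) - 1.0,
              0.7388, 5)
```

This is exactly the update used in the earlier MATLAB loop, x = x + (cos(x) - x)/(sin(x) + 1), which converged in three iterations.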


Linear Algebra

Once we start getting into larger systems of linear equations, there are fast iterative methods:
– Much better than Gaussian elimination…
– One of the worst performing is the Jacobi method


    M x = b
    (D + Moff) x = b
    D x + Moff x = b
    D x = b – Moff x
    x = D⁻¹ (b – Moff x)

which has the fixed-point form x = f(x)
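A Python/NumPy sketch of this iteration (the function `jacobi` is our illustration; it assumes the matrix is diagonally dominant enough for the iteration to converge):

```python
import numpy as np

def jacobi(M, b, x0, n):
    """Jacobi iteration: x_{k+1} = D^(-1) (b - Moff x_k)."""
    d = np.diag(M)             # the diagonal D, stored as a vector
    M_off = M - np.diag(d)     # M = D + Moff
    x = x0
    for _ in range(n):
        x = (b - M_off @ x) / d
    return x

# The 4-by-4 system from the next slide.
M = np.array([[5.2, 0.3, 0.7, 0.4],
              [0.3, 4.8, -1.3, 0.5],
              [0.7, -1.3, 7.3, -0.8],
              [0.4, 0.5, -0.8, 6.4]])
b = np.array([2.0, 4.0, 5.0, 2.0])
x = jacobi(M, b, b / np.diag(M), 100)   # initial guess solves Dx = b
```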


Linear Algebra

Consider this system of linear equations:

>> M = [5.2 0.3 0.7 0.4;
        0.3 4.8 -1.3 0.5;
        0.7 -1.3 7.3 -0.8;
        0.4 0.5 -0.8 6.4];
>> b = [2 4 5 2]';
>> D = diag( diag( M ) );
>> Moff = M - D;    % M = D + Moff
>> x = D^-1 * b     % initial guess: Dx = b
x = 0.384615384615385
    0.833333333333333
    0.684931506849315
    0.312500000000000


The system is

    [ 5.2   0.3   0.7   0.4 ]       [ 2 ]
    [ 0.3   4.8  –1.3   0.5 ]  x =  [ 4 ]
    [ 0.7  –1.3   7.3  –0.8 ]       [ 5 ]
    [ 0.4   0.5  –0.8   6.4 ]       [ 2 ]

Solution: x ≈ (0.18039333, 1.02772874, 0.88701657, 0.33181118)


Linear Algebra

>> for i = 1:5
       x = D^-1 * (b - Moff*x)
   end
x = 0.220297681770285
    0.962245071566561
    0.830698981383913
    0.308973810151036
x = 0.193509166827092
    1.012360930456767
    0.869025926564131
    0.327393371346209
x = 0.184041581486460
    1.022496722669196
    0.882538012313945
    0.329943220201888
x = 0.181441747403601
    1.026482360721093
    0.885530302546704
    0.331432096237808
x = 0.180694469520357
    1.027300171035569
    0.886652537362349
    0.331657244174278

The iterates approach the solution x ≈ (0.18039333, 1.02772874, 0.88701657, 0.33181118)


Interpolation

The next tool is finding interpolating polynomials:
– Given a set of points, find a polynomial that passes through them
– For example, given (2, 5) and (7, 8), find a line y = ax + b that passes through these points:

    2a + b = 5
    7a + b = 8

– This is a system of two equations in two unknowns:

>> [2 1; 7 1] \ [5 8]'
ans = 0.600000000000000
      3.800000000000000


y = 0.6 x + 3.8


Interpolation

Given n points with unique x values, it is always possible to find a unique interpolating polynomial of degree at most n – 1 passing through the points
– For example, given (–2, 6), (–1, 0), (3, 4), and (5, 7), find a cubic polynomial y = ax³ + bx² + cx + d passing through them
– Substituting each point gives the equations:

    (–2)³ a + (–2)² b + (–2) c + d = 6
    (–1)³ a + (–1)² b + (–1) c + d = 0
       3³ a +    3² b +    3 c + d = 4
       5³ a +    5² b +    5 c + d = 7

– This gives the system

    [  –8   4  –2  1 ] [a]   [6]
    [  –1   1  –1  1 ] [b] = [0]
    [  27   9   3  1 ] [c]   [4]
    [ 125  25   5  1 ] [d]   [7]
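The same Vandermonde construction in Python/NumPy (a sketch; `np.vander` builds the matrix of powers for us):

```python
import numpy as np

# Cubic through (-2, 6), (-1, 0), (3, 4), (5, 7): build the Vandermonde
# matrix with columns x^3, x^2, x, 1 and solve for [a, b, c, d].
xs = np.array([-2.0, -1.0, 3.0, 5.0])
ys = np.array([6.0, 0.0, 4.0, 7.0])
V = np.vander(xs, 4)
coeffs = np.linalg.solve(V, ys)
```

This reproduces the coefficients found with MATLAB's backslash operator on the following slide.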


Interpolation

In Matlab, we would do the following:

>> M = [-8 4 -2 1; -1 1 -1 1; 27 9 3 1; 125 25 5 1]
M =  -8     4    -2     1
     -1     1    -1     1
     27     9     3     1
    125    25     5     1
>> c = M \ [6 0 4 7]'
c = -0.188095238095238
     1.400000000000000
    -0.483333333333333
    -2.071428571428571


y = –0.188 x³ + 1.400 x² – 0.483 x – 2.071


Interpolation

Checking our answer:

>> plot( [-2 -1 3 5], [6 0 4 7], 'ro' )
>> xs = -2.5:0.1:5.5;
>> ys = polyval( c, xs );
>> hold on
>> plot( xs, ys, 'b' );


Taylor Series

The fourth tool used in numerical methods, and one we will use extensively, is the Taylor series
– Normally, these are written as

    f(x) = f(x0) + f′(x0)(x – x0) + (1/2) f″(x0)(x – x0)² + (1/3!) f‴(x0)(x – x0)³ + ⋯

We can truncate the series as

    f(x) = f(x0) + f′(x0)(x – x0) + (1/2) f″(x0)(x – x0)² + (1/3!) f‴(ξ)(x – x0)³

where ξ ∈ (x0, x)


Taylor Series

In numerical analysis, we usually write Taylor series differently:
– Given a point x, we’d like to know what happens at x + h where h is very small
– Thus, the Taylor series will usually be in the form

    f(x + h) = f(x) + f′(x) h + (1/2) f″(x) h² + (1/3!) f‴(x) h³ + ⋯

– The truncated forms, of course, contain an expression for the truncation error:

    f(x + h) = f(x) + f′(x) h + (1/2) f″(ξ) h²

where ξ ∈ (x, x + h)
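We can check the claimed truncation error numerically (a Python sketch; the choice of f = exp and the step sizes are ours): dropping everything after the f′(x) h term leaves an error of roughly (1/2) f″(ξ) h², so halving h should cut the error by about a factor of four.

```python
import math

def linear_truncation_error(f, fprime, x, h):
    """Error of the one-term Taylor approximation f(x+h) ~ f(x) + f'(x) h."""
    return abs(f(x + h) - (f(x) + fprime(x) * h))

e1 = linear_truncation_error(math.exp, math.exp, 1.0, 1e-3)
e2 = linear_truncation_error(math.exp, math.exp, 1.0, 5e-4)
ratio = e1 / e2   # close to 4, as the h^2 error term predicts
```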


Bracketing

The fifth tool is bracketing:
– Determine that the solution is on an interval [a, b]
– Find an algorithm to reduce the size of the interval to either [a, c] or [c, b], where a < c < b
– Iterate until the width of the interval is sufficiently small
• Choose the endpoint that best satisfies the conditions
– The most inefficient of methods…
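The classic bracketing algorithm is bisection; here is a Python sketch (our own minimal implementation) applied to the earlier example cos(x) = x:

```python
import math

def bisect(f, a, b, tol):
    """Bracketing: halve [a, b] repeatedly, keeping the sign change inside."""
    fa = f(a)
    while b - a > tol:
        c = 0.5 * (a + b)
        if fa * f(c) <= 0.0:
            b = c              # the root is in [a, c]
        else:
            a, fa = c, f(c)    # the root is in [c, b]
    return 0.5 * (a + b)

# cos(x) - x changes sign on [0, 1], so a root is bracketed there.
root = bisect(lambda x: math.cos(x) - x, 0.0, 1.0, 1e-12)
```

Each step gains only one bit of accuracy, which is why the slide calls bracketing the most inefficient of the five tools.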


Summary of Numerical Tools

We have quickly summarized five tools used in numerical algorithms:
– Iteration
– Linear algebra
– Interpolation
– Taylor series
– Bracketing

The balance of this topic will discuss:
– Landau symbols
– Floating-point numbers
– Topics to be covered in NE 216 and NE 217:
• IVPs in NE 216
• PDEs in NE 217


Landau Symbols

We will also use Landau symbols
– When iterating, we will very often have the situation:
• There is a value h that describes a quantity about which we may make observations
– For example:
• For some algorithms, if we make h smaller by half, the error of our approximation is reduced by approximately half
• In others (e.g., Newton’s method), if the error is h at one iteration, the error at the next iteration will be approximately h²


Landau Symbols

To state this mathematically, we will use the big-O Landau symbol:
– An algorithm is O(h) if halving h reduces the error by half
– We will use O(h²) to indicate that halving h will reduce the error by a factor of four:

    (h/2)² = h²/4

– We will see one case where the error is O(h⁴)—halving h will reduce the error by a factor of 16:

    (h/2)⁴ = h⁴/16


Floating-point Numbers

Everything we do deals with floating-point numbers
– Unfortunately, there are problems with floating-point numbers:
• We can only store a finite amount of precision
• We lose associativity: (x + y) + z ≠ x + (y + z)
– Both of these require us to carefully design our algorithms…
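The loss of associativity is easy to demonstrate (in Python, which uses the same IEEE 754 doubles as MATLAB):

```python
# The same three values, summed in two groupings, give different results
# because different intermediate roundings occur.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
# left  == 0.6000000000000001
# right == 0.6
```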


Floating-point Numbers

Floating-point operations are specified in IEEE 754
– The effort was led by William Kahan
– It is one of the most successful examples of collaboration:
• Every individual in the working group could have promoted the floating-point format used by his or her company
• Instead, they created a floating-point format that no one was yet using but that was superior to all of those that were

[Photo credit: George M. Bergman]


Floating-point Numbers

Some users will try:

>> 1 - 2/3 - 1/3
ans = 5.551115123125783e-017

and claim floating-point numbers don’t work…


In binary:

    1    = 1.0000000000000000000000000000000000000000000000000000
    2/3  ≈ 0.10101010101010101010101010101010101010101010101010101
    1/3  ≈ 0.010101010101010101010101010101010101010101010101010101

      1.0000000000000000000000000000000000000000000000000000
    - 0.10101010101010101010101010101010101010101010101010101
      0.01010101010101010101010101010101010101010101010101011

      0.01010101010101010101010101010101010101010101010101011
    - 0.010101010101010101010101010101010101010101010101010101
      0.000000000000000000000000000000000000000000000000000001


Floating-point Numbers

Thus,

>> 1 - 2/3 - 1/3
ans = 5.551115123125783e-017
>> 2^-54
ans = 5.551115123125783e-017

Floating-point arithmetic is not exact
– Each operation—including conversion of decimal numbers to binary—may have an error of up to 0.5 in the least-significant bit
– For function evaluations, the allowable error is slightly larger
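The same experiment in Python confirms both the residue and its identification as 2^-54 (Python floats are the same IEEE 754 doubles MATLAB uses):

```python
# 2/3 and 1/3 both round when converted to binary; the leftover after the
# two subtractions is exactly one unit in the 54th binary place.
residue = 1 - 2/3 - 1/3
# residue == 5.551115123125783e-17, which is exactly 2**-54
```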


Absolute and Relative Error

To begin, we need to review absolute and relative error:
– If a approximates the value x, we say that

    |x – a|

is the absolute error of a,

    |x – a| / |x|

is the relative error of a, and

    (|x – a| / |x|) × 100 %

is the percent relative error
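These three definitions translate directly into code (a Python sketch; the helper names are ours):

```python
def absolute_error(x, a):
    """|x - a|: the absolute error of the approximation a."""
    return abs(x - a)

def relative_error(x, a):
    """|x - a| / |x|: the relative error of the approximation a."""
    return abs(x - a) / abs(x)

def percent_relative_error(x, a):
    """The relative error expressed as a percentage."""
    return 100.0 * relative_error(x, a)
```

For example, approximating x = 3.14159 by a = 3.14 gives a relative error of about 0.000506, i.e. about 0.05 %.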


Absolute and Relative Error

Absolute error is not very useful:
– An error of 1 mm may be:
• Insignificant in a trip to Mars
• Possibly significant in designing a Mars rover
• Catastrophic for anything designed at the nanometer scale

We will focus on the relative error:
– A 0.00001 error, or 0.001 % relative error, is usually acceptable in many engineering applications regardless of scale
– This is, however, application specific


Storing Real Numbers

Let us try to store real numbers with a finite number of digits—for example, 3.14

We will impose some constraints:
– Use a fixed amount of memory
– Represent both very large and very small numbers
– Represent numbers with a small relative error
– Easily test equality and relative magnitude


A Simple Example

How large a range can we represent with six decimal digits and a sign?

±NNNNNN

Ideas?


A Reasonable Representation

Here’s one very simple idea:
– Let the six digits

    ±NNNNNN

represent

    ±NNN.NNN

– For example, +039432 represents 39.432
– We store –3.14152 as -003142


A Reasonable Representation

Our range is somewhat limited…
– We can only represent numbers from 0.001 to 999.999
– Also, consider the relative error:

    Value      Representation   Relative Error
    0.0015     +000002          33 %
    999.9985   +999998          0.00005 %

Limited range, some numbers with large relative errors…

Relative magnitude can, however, be found quickly


A Reasonable Representation

Here is another idea:
– Let the six digits

    ±EEMNNN

represent

    M.NNN × 10^(EE – 49)

where we will require that M is non-zero

– For example, +549238 represents 9.238 × 10^5
– We represent 372.863 as +513729


A Reasonable Representation

How does it fare?

±00MNNN represents numbers as small as:

    M.NNN × 10^(00 – 49) = M.NNN × 10^–49

• For example, +005723 represents 5.723 × 10^–49

±99MNNN represents numbers as large as:

    M.NNN × 10^(99 – 49) = M.NNN × 10^50

• For example, +995723 represents 5.723 × 10^50


A Reasonable Representation

Also, no number in the range

    [1.000 × 10^–49, 9.999 × 10^50]

has a representation with a relative error larger than 0.05 %
– For example, 33 476 688 is represented by +563348 with a relative error of 0.0099 %
– For example, 6.626 069 57 × 10^–34 is represented by +156626 with a relative error of 0.0010 %
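A small Python sketch of this hypothetical ±EEMNNN encoding (the function `encode` and its rounding details are our own reconstruction of the scheme described above, for positive values only):

```python
import math

def encode(v):
    """Encode a positive value as +EEMNNN, meaning M.NNN x 10^(EE - 49)."""
    e = math.floor(math.log10(v))
    m = round(v / 10.0**e, 3)        # keep four significant digits: M.NNN
    if m >= 10.0:                    # rounding can carry, e.g. 9.9996 -> 10.00
        m /= 10.0
        e += 1
    return "+%02d%04d" % (e + 49, round(m * 1000))

# The two examples from this slide:
planck = encode(6.62606957e-34)   # "+156626"
big = encode(33476688)            # "+563348"
```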


A Reasonable Representation

The requirement that the digit M is non-zero ensures unique representations:

    ±EEMNNN

Otherwise, all four of

    +491000        +500100        +510010        +520001
    1.000 × 10^0   0.100 × 10^1   0.010 × 10^2   0.001 × 10^3

would represent the same value: 1
– Imagine if simply checking for equality required addition and iteration…


A Reasonable Representation

Also, by choosing the order

    ±EEMNNN

and using a bias (the –49), we have one final advantage:
– Relative comparisons are also fast

These four represent numbers in our format:

    +856729         +389657          +573823        +195737
    6.729 × 10^36   9.657 × 10^–11   3.823 × 10^8   5.737 × 10^–30

Which is the largest in magnitude? Which is smallest?


A Reasonable Representation

We have a few more issues:
– Representing zero: +000000
– Requiring negative zero: -000000

Why?
– Recall that a floating-point zero represents all numbers in the range (–s, s), where s is the smallest magnitude representable by a non-zero floating-point number
– One reason: branch cuts. With the standard branch of the logarithm,

    ln(–1 + 0j) = πj
    ln(–1 – 0j) = –πj


A Reasonable Representation

Other issues:
– 1.153 × 10^–49 can be represented with full precision, but 4.853 × 10^–50 must be represented by +000000

Solution? Denormalized numbers: when EE = 00, we also allow M = 0, so ±000NNN represents ±0.NNN × 10^–49
• For example, 4.853 × 10^–50 = 0.4853 × 10^–49 is represented by +000485

– Representing infinity (for example, 1/0, –1/0):

    +990000 and -990000

– Representing undefined operations (0/0, not-a-number or NaN):

    +990100


IEEE 754

When Dr. Kahan led the committee that eventually produced the IEEE 754 standard, there were numerous conflicts:
– People from numerous corporations were represented, each wanting to advocate for their own representations
– Each corporation had already invested in their own designs
• No one wants to modify existing hardware that has already been tested

Fortunately, this committee overcame these biases and produced an excellent standard
– IEEE 754-2008 contains most of the original standard


IEEE 754

The original standard defines two formats:
– The float, a single-precision floating-point number
– The double, a double-precision floating-point number

For most applications outside of graphics, float is unacceptable
– We will focus on double


IEEE 754

The double uses 64 bits:

    SEEEEEEEEEEENNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

to represent (in binary):

    (–1)^S × 1.NNNN…N₂ × 2^(EEEEEEEEEEE₂ – 01111111111₂)

where 01111111111₂ = 1023
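You can pick the three fields apart in Python with the `struct` module (a sketch; Python floats are the same IEEE 754 doubles):

```python
import struct

x = 3.141592653589793                       # pi, rounded to a double
bits = struct.unpack('<Q', struct.pack('<d', x))[0]

sign = bits >> 63                           # S: 1 bit
exponent = (bits >> 52) & 0x7FF             # EEEEEEEEEEE: 11 bits, biased by 1023
fraction = bits & ((1 << 52) - 1)           # NNN...N: 52 bits

# Reassemble (-1)^S x 1.NNN...N_2 x 2^(E - 1023) and recover x exactly:
value = (-1.0)**sign * (1.0 + fraction / 2.0**52) * 2.0**(exponent - 1023)
```

For pi the 64 bits are 400921fb54442d18 in hexadecimal, matching the MATLAB format hex output shown on a later slide.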


IEEE 754

The smallest positive normalized number is

0000000000010000000000000000000000000000000000000000000000000000

which represents

    1.000…0₂ × 2^(00000000001₂ – 01111111111₂) = 2^–1022 ≈ 2.225 × 10^–308


IEEE 754

Denormalized numbers go as small as

0000000000000000000000000000000000000000000000000000000000000001

which represents

    0.000…01₂ × 2^(1 – 1023) = 2^–1074 ≈ 4.941 × 10^–324

Note:

    2^–52 × 2^(1 – 1023) = 2^–52 × 2^–1022 = 2^–1074


IEEE 754

The largest positive number is

0111111111101111111111111111111111111111111111111111111111111111

which represents

    1.111…1₂ × 2^(11111111110₂ – 01111111111₂) = (2 – 2^–52) × 2^1023 ≈ 1.79769 × 10^308

>> format long
>> 1.999999999999999778 * 2^1023
ans = 1.797693134862316e+308
>> 2^1024
ans = Inf

Recall, in decimal: 1.9999…₁₀ = 2; in binary: 1.1111…₂ = 2


IEEE 754

Infinity is represented by

S111111111110000000000000000000000000000000000000000000000000000

which represents

    (–1)^S × ∞

If the mantissa is not all zero, the pattern instead represents a NaN
– NaN has special properties:
    NaN == NaN returns false (0)—you must use isnan( x )


IEEE 754

You can view the underlying format:

>> format hex
>> pi
ans = 400921fb54442d18
>> exp(1)
ans = 4005bf0a8b14576a
>> 1/0
ans = 7ff0000000000000
>> 0
ans = 0000000000000000
>> 1e-300
ans = 01a56e1fc2f8f359

    Hexadecimal   Binary
    0             0000
    1             0001
    2             0010
    3             0011
    4             0100
    5             0101
    6             0110
    7             0111
    8             1000
    9             1001
    a             1010
    b             1011
    c             1100
    d             1101
    e             1110
    f             1111


Issues with Floating-point Numbers

There are still issues:
– Overflow and underflow
– Subtractive cancellation
– Adding large and small numbers
– Order of operations—not associative

These must be dealt with in our algorithms...


A More Accurate Sum

The Kahan algorithm for adding numbers:

function [s] = Kahan_sum( v )
    s = 0;
    c = 0;
    for x = v
        y = x - c;
        t = s + y;
        c = (t - s) - y;
        s = t;
    end
end


A More Accurate Sum

Consider:

>> v = rand( 1, 10000000 );
>> sum( v )
ans = 5.000908006717473e+006
>> Kahan_sum( v )
ans = 5.000908006717303e+006
>> sum( sort( v, 'ascend' ) )
ans = 5.000908006717453e+006
>> sum( sort( v, 'descend' ) )
ans = 5.000908006717471e+006
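The same experiment can be run in Python, where `math.fsum` tracks every bit of round-off and so can serve as the correctly rounded reference (the test data here is ours):

```python
import math
import random

def kahan_sum(values):
    """Compensated summation: c carries the rounding error of each addition."""
    s = 0.0
    c = 0.0
    for x in values:
        y = x - c
        t = s + y
        c = (t - s) - y
        s = t
    return s

random.seed(1)
v = [random.random() for _ in range(100000)]

naive = sum(v)          # plain left-to-right summation
kahan = kahan_sum(v)
exact = math.fsum(v)    # correctly rounded reference value
```

Typically the compensated sum agrees with the correctly rounded sum to within an ulp or two, while the naive sum drifts further.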


A More Accurate Sum

Are you serious?

> Kahan_sum := proc( v )
      local s, c, y, t, x;
      s := 0;
      c := 0;
      for x in v do
          y := x - c;
          t := s + y;
          c := (t - s) - y;
          s := t;
      end do;
      return s;
  end proc:

> S := [seq( rand()/1e12, i = 1..1000000 )]:


A More Accurate Sum

Are you serious?

> add( i, i = S );
                    5.002264965 · 10^5
> add( i, i = sort( S ) );
                    5.002264658 · 10^5
> add( i, i = sort( S, `>` ) );
                    5.002264784 · 10^5
> Kahan_sum( S );
                    5.002264636 · 10^5

> Digits := 30:
> add( i, i = S );
                    5.00226463567620254 · 10^5


The Laboratories

In the laboratories associated with this course, you will see six problems that arise often in nanotechnology engineering

The laboratories will be divided into two parts:
– A one-hour presentation one week, and
– A one-hour help session the next week for assistance


The Laboratories

This is part of an integrated approach in your nanotechnology courses:
– NE 113 Engineering Computation was your introduction
– NE 216 will focus on numerical solutions to IVPs
– NE 217 focuses on PDEs

This will lead to NE 336 Micro and Nanosystem Computer-aided Design


The Laboratories

NE 216 will look at numerical approximations to:
– Numerical algorithms
– Differentiation
– 1st-order initial value problems (IVPs) using Euler and Heun's methods
– 1st-order IVPs with the better 4th-order Runge-Kutta and the Dormand-Prince methods
– Systems of IVPs and converting higher-order IVPs into a system of 1st-order IVPs
– Boundary-value problems (BVPs) using the shooting method


The Laboratories

NE 217 will look at numerical approximations to:
– Boundary-value problems using finite differences
– The heat-conduction/diffusion equation
– Heat-conduction/diffusion using the Crank-Nicolson method with insulated boundaries
– The wave equation
– Laplace's equation in two and three dimensions
– The heat-conduction/diffusion and wave equations in two and three dimensions


The Laboratories

By the end of this sequence of laboratories, you will be able to produce animations of the solutions to these problems


Outline

In this topic, we saw:
– Five tools used in numerical algorithms:
• Iteration
• Linear algebra
• Interpolation
• Taylor series
• Bracketing
– Landau symbols, floating-point numbers, and IEEE 754
– A summary of the laboratories in NE 216 and NE 217:
• IVPs and then PDEs
