Chapter 13

Numerical Issues

Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002

Learning Objectives

- Numerical issues and data formats.
- Fixed point.
- Fractional number.
- Floating point.
- Comparison of formats and dynamic ranges.


Numerical Issues and Data Formats

C6000 Numerical Representation

Fixed point arithmetic:
- 16-bit (integer or fractional).
- Signed or unsigned.

Floating point arithmetic:
- 32-bit single precision.
- 64-bit double precision.


Fixed Point Arithmetic - Definition

For simplicity a 4-bit representation is used.

Unsigned integer numbers

Bit weights: $2^3$ $2^2$ $2^1$ $2^0$

Binary Number    Decimal Equivalent
0 0 0 0          0
0 0 0 1          1
0 0 1 0          2
0 0 1 1          3
0 1 0 0          4
0 1 0 1          5
0 1 1 0          6
0 1 1 1          7
1 0 0 0          8
1 0 0 1          9
1 0 1 0          10
1 0 1 1          11
1 1 0 0          12
1 1 0 1          13
1 1 1 0          14
1 1 1 1          15


Signed integer numbers

Bit weights: $-2^3$ $2^2$ $2^1$ $2^0$

Binary Number    Decimal Equivalent
0 0 0 0          0
0 0 0 1          1
0 0 1 0          2
0 0 1 1          3
0 1 0 0          4
0 1 0 1          5
0 1 1 0          6
0 1 1 1          7
1 0 0 0          -8
1 0 0 1          -7
1 0 1 0          -6
1 0 1 1          -5
1 1 0 0          -4
1 1 0 1          -3
1 1 1 0          -2
1 1 1 1          -1
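The two readings of a 4-bit pattern differ only in the weight given to the top bit. The sketch below is illustrative rather than part of the original slides; the helper names `u4_value` and `s4_value` are hypothetical.

```c
#include <stdint.h>

/* Value of a 4-bit pattern read as an unsigned integer:
   weights 2^3, 2^2, 2^1, 2^0. */
int u4_value(uint8_t bits)
{
    return bits & 0xF;
}

/* Value of the same pattern read as a signed (two's complement) integer:
   the top bit carries the weight -2^3 instead of +2^3. */
int s4_value(uint8_t bits)
{
    int low3 = bits & 0x7;          /* weights 2^2, 2^1, 2^0 */
    int sign = (bits >> 3) & 0x1;   /* weight -2^3           */
    return low3 - 8 * sign;
}
```

For example, the pattern 1 0 0 1 (0x9) reads as 9 through `u4_value` and as -7 through `s4_value`.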




Fixed Point Arithmetic - Problems

The following equation is the basis of many DSP algorithms (see Chapter 1):

$$y[n] = \sum_{k=0}^{N-1} a[k]\, x[n-k]$$

Two problems arise when using signed and unsigned integers:
- Multiplication overflow.
- Addition overflow.
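The sum of products can be written as a short loop. This is a minimal sketch in double precision, so that overflow is not yet a concern; the function name `fir_output` and the choice to treat samples before `x[0]` as zero are assumptions made for illustration.

```c
/* y[n] = sum over k = 0 .. N-1 of a[k] * x[n-k].
   Samples before x[0] are taken as zero (an assumption of this sketch). */
double fir_output(const double *a, int N, const double *x, int n)
{
    double y = 0.0;
    for (int k = 0; k < N; k++) {
        if (n - k >= 0) {
            y += a[k] * x[n - k];   /* one multiply and one add per coefficient */
        }
    }
    return y;
}
```

Each pass through the loop performs one multiplication and one addition, which is where the two overflow problems arise once the data are held as integers.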



Multiplication Overflow

16-bit x 16-bit = 32-bit

Example: using 4-bit representation

            0 0 1 1        3
        x   1 0 0 0    x   8
    ---------------    -----
    0 0 0 1 1 0 0 0       24

24 cannot be represented with 4-bits.



Addition Overflow

32-bit + 32-bit = 33-bit

Example: using 4-bit representation

        1 0 0 0        8
    +   1 0 0 0    +   8
    -----------    -----
      1 0 0 0 0       16

16 cannot be represented with 4-bits.
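Both overflow cases can be reproduced by masking a result to 4 bits, as a 4-bit register would. This is an illustrative sketch; `mul_u4_wrapped` and `add_u4_wrapped` are hypothetical names.

```c
#include <stdint.h>

/* Multiply two unsigned 4-bit values and keep only the low 4 bits. */
uint8_t mul_u4_wrapped(uint8_t a, uint8_t b)
{
    return (uint8_t)(((a & 0xF) * (b & 0xF)) & 0xF);
}

/* Add two unsigned 4-bit values and keep only the low 4 bits. */
uint8_t add_u4_wrapped(uint8_t a, uint8_t b)
{
    return (uint8_t)(((a & 0xF) + (b & 0xF)) & 0xF);
}
```

With these helpers 3 x 8 yields 8 (the low four bits of 0001 1000) and 8 + 8 yields 0 (the low four bits of 1 0000), so the upper bits of both results are silently lost.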



Fixed Point Arithmetic - Solution

The solutions for reducing the overflow problem are:
- Saturate the result.
- Use double precision result.
- Use fractional arithmetic.
- Use floating point arithmetic.



Solution - Saturate the result

Unsigned numbers:
- If A x B ≤ 15 then result = A x B
- If A x B > 15 then result = 15

            0 0 1 1        3
        x   1 0 0 0    x   8
    ---------------    -----
    0 0 0 1 1 0 0 0       24
            1 1 1 1       15   Saturated



Solution - Saturate the result

Signed numbers:
- If -8 ≤ A x B ≤ 7 then result = A x B
- If A x B > 7 then result = 7
- If A x B < -8 then result = -8

            0 0 1 1        3
        x   1 0 0 0    x  -8
    ---------------    -----
    1 1 1 0 1 0 0 0      -24
            1 0 0 0       -8   Saturated
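The saturation rules for 4-bit operands translate directly into clamping. The sketch below is illustrative only; it computes the exact product in a wider `int` and the names `sat_mul_u4` and `sat_mul_s4` are hypothetical.

```c
/* Saturating multiply for unsigned 4-bit operands: clamp to 15. */
int sat_mul_u4(int a, int b)
{
    int p = a * b;              /* exact product in a wider type */
    return (p > 15) ? 15 : p;
}

/* Saturating multiply for signed 4-bit operands: clamp to [-8, 7]. */
int sat_mul_s4(int a, int b)
{
    int p = a * b;
    if (p > 7)  return 7;
    if (p < -8) return -8;
    return p;
}
```

Here `sat_mul_u4(3, 8)` gives 15 and `sat_mul_s4(3, -8)` gives -8, matching the two worked examples. Saturation limits the error but does not remove it: the true results were 24 and -24.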



Solution - Double precision result

For a 4-bit x 4-bit multiplication hold the result in an 8-bit location.

Problems:
- Uses more memory for storing data.
- If the result is used in another multiplication, the data must first be converted back to single precision format (e.g. prod = prod x sum).
- Results need to be scaled down if they are to be sent to a D/A converter.



Solution - Fractional arithmetic

If A and B are fractional (magnitude less than 1) then:

|A x B| ≤ min(|A|, |B|)

i.e. the magnitude of the result never exceeds that of either operand, hence the multiplication will never overflow.

Examples:
- 0.6 x 0.2 = 0.12 (0.12 < 0.6 and 0.12 < 0.2)
- 0.9 x 0.9 = 0.81 (0.81 < 0.9)
- 0.1 x 0.1 = 0.01 (0.01 < 0.1)



Fractional numbers

Definition: an N-bit fractional number has the bit weights

$-2^{0}$  $2^{-1}$  $2^{-2}$  ...  $2^{-(N-1)}$

i.e. -1, 0.5, 0.25, ...

What is the largest number?

    0 1 1 ... 1 = MAX
    0 0 0 ... 1 = $2^{-(N-1)}$

Adding the two gives the pattern 1 0 0 ... 0, whose magnitude is MAX + $2^{-(N-1)}$ = 1.

Largest number:

MAX = $1 - 2^{-(N-1)}$



What is the smallest number?

    1 0 0 ... 0 = MIN = -1

Smallest number:

MIN = -1

For 16-bit representation:
- MAX = $1 - 2^{-15}$ = 0.999969
- MIN = -1
- -1 ≤ x < 1
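For the 16-bit case, the value of a fractional number is simply the two's complement integer divided by $2^{15}$. A minimal sketch, assuming the raw pattern is held in an `int16_t`; the name `q15_to_double` is hypothetical.

```c
#include <stdint.h>

/* Value of a 16-bit fractional number: the raw two's complement integer
   divided by 2^15, which gives the weights -2^0, 2^-1, ..., 2^-15. */
double q15_to_double(int16_t raw)
{
    return (double)raw / 32768.0;
}
```

The raw value 0x7FFF then decodes to $1 - 2^{-15}$ and the raw value -32768 (pattern 0x8000) decodes to -1, which are the MAX and MIN of the 16-bit format.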



Fractional numbers - Sign Extension

    a = 0 1 1 0 = 0.5 + 0.25 = 0.75
    b = 1 1 1 0 = -1 + 0.5 + 0.25 = -0.25

                0 1 1 0
            x   1 1 1 0
        ---------------
        0 0 0 0 0 0 0 0
        0 0 0 0 1 1 0 .
        0 0 0 1 1 0 . .
        1 1 0 1 0 . . .
        ---------------
        1 1 1 1 0 1 0 0

The leading bits of each partial product are sign extension bits. The last partial product is negative because the top bit of b has the weight -1.

To keep the same resolution as the operands we need to select these 4-bits: 1 1 1 0 (bits 6 to 3 of the 8-bit product).

The way to do it is to shift left by one bit and store the upper 4-bits, or right shift by three and store the lower 4-bits:

    shift left by one:     1 1 1 0 1 0 0 0    upper 4-bits = 1 1 1 0
    shift right by three:  1 1 1 1 1 1 1 0    lower 4-bits = 1 1 1 0

The stored result 1 1 1 0 = -0.25 is the exact product -0.1875 truncated to three fraction bits.
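The 4-bit worked example can be checked in C. This is a sketch under stated assumptions: operands are 4-bit patterns with one sign bit and three fraction bits, and the helper names `sext4` and `q3_mul` are hypothetical.

```c
#include <stdint.h>

/* Sign-extend a 4-bit two's complement pattern to a full int. */
int sext4(uint8_t bits)
{
    int v = bits & 0xF;
    return (v & 0x8) ? v - 16 : v;
}

/* Multiply two 4-bit fractional numbers (1 sign bit, 3 fraction bits).
   The 8-bit product has 2 sign bits and 6 fraction bits; dividing by 8
   (a right shift by three) keeps 3 fraction bits and one sign bit. */
uint8_t q3_mul(uint8_t a, uint8_t b)
{
    int prod = sext4(a) * sext4(b);   /* e.g. 6 * -2 = -12 = 1111 0100b */
    /* Floor division by 8 written out explicitly, because >> applied to
       a negative int is implementation-defined in C. */
    int q = (prod >= 0) ? prod / 8 : -((-prod + 7) / 8);
    return (uint8_t)(q & 0xF);
}
```

Calling `q3_mul(0x6, 0xE)` returns 0xE, i.e. the pattern 1 1 1 0 selected in the worked example.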



15-bit * 15-bit Multiplication

CPU:

    MPY  A3, A4, A6
    NOP

          Q15    s. x x x x x x x x x x x x x x x
    x     Q15    s. y y y y y y y y y y y y y y y
    =     Q30    s s. z z z z z z z z z z z z z z z z z z z z z z z z z z z z z z

Store to Data Memory:

    SHR  A6, 15, A6
    STH  A6, *A7

          Q15    s. z z z z z z z z z z z z z z z



'C6000 C Data Types

Type                 Size       Representation
char, signed char    8 bits     ASCII
unsigned char        8 bits     ASCII
short                16 bits    2's complement
unsigned short       16 bits    binary
int, signed int      32 bits    2's complement
unsigned int         32 bits    binary
long, signed long    40 bits    2's complement
unsigned long        40 bits    binary
enum                 32 bits    2's complement
float                32 bits    IEEE 32-bit
double               64 bits    IEEE 64-bit
long double          64 bits    IEEE 64-bit
pointers             32 bits    binary



Pseudo assembly language:

    A0 = 0x80000000        ; initial value
    A1 = 0.5               ; initial value
    A2 = 0.5               ; initial value
    A3 = 0                 ; initial value

    MPY  A1, A2, A3        ; A3 = 0x10000000
    SHL  A3, 1, A3         ; A3 = 0x20000000
    STH  A3, *A0           ; upper 16 bits: 0x2000 -> 0x80000000

or

    MPY  A1, A2, A3        ; A3 = 0x10000000
    SHR  A3, 15, A3        ; A3 = 0x00002000
    STH  A3, *A0           ; lower 16 bits: 0x2000 -> 0x80000000

Pseudo 'C' language:

    short a, b, result;
    int prod;

    prod = a * b;
    prod = prod >> 15;
    result = (short) prod;
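The pseudo 'C' sequence can be wrapped in a function and tried on a host compiler. This is a sketch, assuming `>>` on a negative `int32_t` is an arithmetic shift (true of common compilers, but implementation-defined in C); `q15_mul` and `q15_from_double` are hypothetical helper names.

```c
#include <stdint.h>

/* Q15 x Q15 -> Q15, following the pseudo 'C' sequence. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;   /* Q30 with two sign bits */
    prod = prod >> 15;                        /* back to Q15 (assumes arithmetic shift) */
    return (int16_t)prod;
}

/* Convert a fraction in [-1, 1) to Q15 by scaling with 2^15 (truncates). */
int16_t q15_from_double(double x)
{
    return (int16_t)(x * 32768.0);
}
```

With a = b = 0.5 (0x4000) the product is 0x2000, i.e. 0.25, which agrees with the value stored by the assembly sequence.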



Fractional numbers - Problems

There are some problems that need to be resolved when using fractional numbers.

These are:
- Result of -1 x -1 = 1
- Accumulative overflow.



Problem of -1 x -1

We have seen that:
- -1 ≤ x < 1
- -1 x -1 = 1 which cannot be represented.

Solution:
There are two instructions that saturate the result if you have -1 x -1:
- SMPY
- SMPYH



Problem of -1 x -1

In one cycle these instructions do the following:
- Multiply.
- Shift left by 1-bit.
- Saturate if the sign bits are 01.

It can be shown that:

                       Positive Result    Negative Result    -1 x -1 Result
    Result of MPY(H)   00.xxx-xb          11.xxx-xb          01.xxx-xb
    Result of SMPY(H)  0.xxx-xb           1.xxx-xb           0.xxx-xb
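The three steps can be emulated portably. This sketch is not the TI instruction or compiler intrinsic itself; `smpy_emul` is a hypothetical name and the code only mirrors the multiply, shift and saturate behaviour listed for SMPY.

```c
#include <stdint.h>

/* Multiply two Q15 operands, shift left by one bit, and saturate the
   single case in which the two sign bits of the product are "01". */
int32_t smpy_emul(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;   /* Q30 product */
    if (prod == 0x40000000) {                 /* only -1 x -1 produces this value */
        return INT32_MAX;                     /* saturate to 0x7FFFFFFF */
    }
    return prod * 2;                          /* shift left by 1 -> Q31 */
}
```

Passing -32768 for both operands (i.e. -1 x -1) returns 0x7FFFFFFF, the largest positive value, instead of wrapping to a negative number.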



Problem of Accumulative Overflow

In this case the overflow is due to the summation.

$$y[n] = \sum_{k=0}^{99} a[k]\, x[n-k]$$

Examples of overflow:

- 0x7fff + 0x0002 = 0x8001
  (positive number + positive number = negative number!)
- 0xffff + 0x0002 = 0x0001
  (the sum wraps past zero: read as signed values this is -1 + 2 = 1, which is correct; read as unsigned values, 65535 + 2 overflows to 1)

[Figure: diagram marking the values 0x0000, 0x7ffe, 0x7fff, 0x8001 and 0xffff]




Solutions:

(1) Saturate the intermediate results by using these add instructions:
- SADD
- SSUB

If saturation occurs the SAT bit in the CSR is set to 1. You must clear it.

(2) Use guard bits:

    e.g. ADD A1, A2, A1:A0
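Both remedies can be sketched on a host compiler. The code below is an emulation under stated assumptions, not the C6000 instructions: `sadd_emul` clamps a 32-bit sum, and `accumulate_guarded` uses a 64-bit accumulator to stand in for the wider A1:A0 register pair. Both names are hypothetical.

```c
#include <stdint.h>

/* Saturating 32-bit addition in the spirit of SADD. */
int32_t sadd_emul(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;   /* exact sum in a wider type */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Guard bits: accumulate 32-bit terms in a wider accumulator so that
   intermediate sums cannot overflow. */
int64_t accumulate_guarded(const int32_t *terms, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += terms[i];
    }
    return acc;
}
```

Saturation keeps the result in range at the cost of accuracy, whereas guard bits keep the exact sum until the final result is scaled back to the storage width.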




Solutions:

(3) Do nothing if the system is Non-Gain:

With a non-gain system the final result never exceeds unity in magnitude.

Example system:

$$y[n] = \sum_{k=0}^{99} a[k]\, x[n-k]$$

This will be non-gain if:

$$\sum_{k=0}^{99} \left| a[k] \right| \le 1 \quad \text{and} \quad \left| x[i] \right| \le 1$$
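Whether a set of coefficients gives a non-gain system can be checked before deployment by summing their magnitudes. A minimal sketch, assuming the coefficients are available as doubles and that the condition is "sum of magnitudes no greater than one"; the name `is_non_gain` is hypothetical.

```c
/* Returns 1 when the sum of |a[k]| does not exceed 1, so that inputs of
   magnitude at most 1 cannot drive the output beyond magnitude 1. */
int is_non_gain(const double *a, int n)
{
    double sum = 0.0;
    for (int k = 0; k < n; k++) {
        sum += (a[k] < 0.0) ? -a[k] : a[k];   /* |a[k]| */
    }
    return sum <= 1.0;
}
```

If the check fails, one of the other two solutions (saturation or guard bits) is still needed.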



Floating Point Arithmetic

The C67xx supports both single and double precision floating point formats.

The single precision format is as follows:

    bit:     31     30 ... 23     22 ... 0
    field:   s      e  ...  e     m  ... m
    width:   1-bit  8-bits        23-bits

$$\text{value} = (-1)^{s} \times (1.m) \times 2^{(e-127)}$$

- s = sign bit
- e = exponent (8-bit, biased by 127)
- m = mantissa (23-bit normalised fraction)



Floating Point Arithmetic Example

Example: Conversion between integer and floating point.

Convert ‘dd’ to the IEEE floating point format:

    int dd = 0x60000000;
    float flot1;

    flot1 = (float) dd;

To view the value of “flot1” use:

View: Memory: Address = &flot1

We find:

    flot1 = 0x4EC0 0000

Let us check to see if we have the same number:

    hex:     4    E    C    0    0    0    0    0
    binary:  0100 1110 1100 0000 0000 0000 0000 0000
    fields:  s = bit 31, exponent = bits 30 to 23, mantissa = bits 22 to 0

    s = 0
    e = 10011101b = 128 + 16 + 8 + 4 + 1 = 157
    m = 0.100b = 0.5

    flot1 = $(-1)^{0} \times 1.5 \times 2^{(157-127)} = 1.5 \times 2^{30}$
          = 1610612736 decimal
          = 0x6000 0000
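The same check can be made in code by copying the bits of the float into an integer and splitting the fields. A minimal sketch, assuming `float` is the IEEE 32-bit format; the type `float_fields` and the functions `float_bits` and `float_split` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Fields of a single precision number. */
typedef struct {
    uint32_t sign;      /* 1 bit                 */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t mantissa;  /* 23 bits               */
} float_fields;

/* Raw 32-bit pattern of a float. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* well-defined way to reinterpret the bits */
    return u;
}

/* Split the pattern into sign, exponent and mantissa. */
float_fields float_split(float f)
{
    uint32_t u = float_bits(f);
    float_fields r;
    r.sign     = u >> 31;
    r.exponent = (u >> 23) & 0xFF;
    r.mantissa = u & 0x7FFFFF;
    return r;
}
```

For `(float)0x60000000` the pattern is 0x4EC00000, with sign 0, exponent 157 and mantissa 0x400000 (the fraction 0.5).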




The previous example can be seen in:
- numerical.pjt
- Numerical_.ws

Use the mixed mode display to see the assembly code.



Floating Point IEEE Standard

Special values:

    s    e              m       Number
    0    0              0       0
    1    0              0       -0
    s    0              ≠ 0     $(-1)^{s} \times 0.m \times 2^{-126}$
    s    0 < e < 255    m       $(-1)^{s} \times 1.m \times 2^{e-127}$
    0    255            0       $+\infty$
    1    255            0       $-\infty$
    s    255            ≠ 0     NaN (not a number)
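The table of special values maps onto a small classifier over the exponent and mantissa fields. This is an illustrative sketch operating on a raw 32-bit pattern; the enum `fp_kind` and the function `classify_bits` are hypothetical names.

```c
#include <stdint.h>

typedef enum {
    KIND_ZERO,        /* e = 0,   m = 0      */
    KIND_DENORMAL,    /* e = 0,   m not 0    */
    KIND_NORMAL,      /* 0 < e < 255         */
    KIND_INFINITY,    /* e = 255, m = 0      */
    KIND_NAN          /* e = 255, m not 0    */
} fp_kind;

/* Classify a single precision bit pattern by its exponent and mantissa. */
fp_kind classify_bits(uint32_t u)
{
    uint32_t e = (u >> 23) & 0xFF;
    uint32_t m = u & 0x7FFFFF;
    if (e == 0)   return (m == 0) ? KIND_ZERO : KIND_DENORMAL;
    if (e == 255) return (m == 0) ? KIND_INFINITY : KIND_NAN;
    return KIND_NORMAL;
}
```

The sign bit does not affect the class: 0x7F800000 and 0xFF800000 are both infinities, differing only in sign.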




Dynamic range:

Largest positive number:
- e(max) = 254 (e = 255 is reserved for infinity and NaN), m(max) = $1 - 2^{-23}$
- max = $[1 + (1 - 2^{-23})] \times 2^{254-127} \approx 3.4 \times 10^{38}$

Smallest positive (normalised) number:
- e(min) = 1 (e = 0 is reserved for zero and denormalised numbers), m(min) = 0
- min = $1.0 \times 2^{1-127} = 2^{-126} \approx 1.18 \times 10^{-38}$

Largest negative number (largest magnitude):
- e(max) = 254, m(max) = $1 - 2^{-23}$
- max = $-[1 + (1 - 2^{-23})] \times 2^{254-127} \approx -3.4 \times 10^{38}$

Smallest negative (normalised) number (smallest magnitude):
- e(min) = 1, m(min) = 0
- min = $-1.0 \times 2^{1-127} = -2^{-126} \approx -1.18 \times 10^{-38}$
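The limits of the normalised single precision range can be cross-checked against the constants `FLT_MAX` and `FLT_MIN` in `<float.h>`. A minimal sketch, assuming `float` is IEEE 754 single precision; `two_to_the`, `largest_single` and `smallest_normal_single` are hypothetical helper names.

```c
#include <float.h>

/* 2^n as a float, built by repeated doubling or halving.
   Every step is exact because only the exponent changes. */
static float two_to_the(int n)
{
    float r = 1.0f;
    int steps = (n < 0) ? -n : n;
    for (int i = 0; i < steps; i++) {
        r = (n < 0) ? r * 0.5f : r * 2.0f;
    }
    return r;
}

/* Largest finite value: mantissa all ones, exponent field 254. */
float largest_single(void)
{
    return (2.0f - two_to_the(-23)) * two_to_the(254 - 127);
}

/* Smallest positive normalised value: mantissa zero, exponent field 1. */
float smallest_normal_single(void)
{
    return two_to_the(1 - 127);
}

/* Returns 1 when both values agree with the <float.h> constants. */
int range_matches_float_h(void)
{
    return largest_single() == FLT_MAX && smallest_normal_single() == FLT_MIN;
}
```

On an IEEE 754 platform `range_matches_float_h()` returns 1, confirming the exponent fields 254 and 1 as the ends of the normalised range.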




Floating/Fixed Point Summary

Floating point single precision:

    bit:     31     30 ... 23     22 ... 0
    field:   s      e  ...  e     m  ... m

$$\text{value} = (-1)^{s} \times 1.m \times 2^{e-127}$$

Floating point double precision (held in odd:even registers):

    bit:     63     62 ... 52     51 ... 0
    field:   s      e  ...  e     m  ... m

$$\text{value} = (-1)^{s} \times 1.m \times 2^{e-1023}$$



Floating/Fixed Point Summary (Short: N = 16; Int: N = 32)

Unsigned integer:

    weight:  $2^{N-1}$   ...   $2^{1}$   $2^{0}$
    bit:     x           ...   x         x

Signed integer:

    weight:  $-2^{N-1}$  ...   $2^{1}$   $2^{0}$
    bit:     x           ...   x         x

Signed fractional:

    weight:  $-2^{0}$   $2^{-1}$   $2^{-2}$   ...   $2^{-(N-1)}$
    bit:     x          x          x          ...   x



Floating/Fixed Point Dynamic Range

    Format                             Largest Number (positive)   Smallest Number (positive)   Smallest Number (negative)
    Floating point, single precision   $3.4 \times 10^{38}$        $1.2 \times 10^{-38}$        $-3.4 \times 10^{38}$
    Fixed point integer, 16-bit        $2^{15} - 1$                1                            $-2^{15}$
    Fixed point integer, 32-bit        $2^{31} - 1$                1                            $-2^{31}$
    Fixed point fractional, 16-bit     $1 - 2^{-15}$               $2^{-15}$                    -1
    Fixed point fractional, 32-bit     $1 - 2^{-31}$               $2^{-31}$                    -1



Numerical Issues - Useful Tips

- Multiply by 2: Use shift left
- Divide by 2: Use shift right
- $\log_2 N$: Use shift
- Sine, Cosine, Log: Use look up tables
- To convert a fractional number to hex: compute Num x $2^{15}$, then convert to hex

e.g. convert 0.5 to hex:

    0.5 x $2^{15}$ = 16384
    (16384) dec = (0x4000) hex
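The conversion rule can be written as a small helper. This is a sketch under stated assumptions: the result is truncated rather than rounded, and inputs outside the representable range are clamped, which is a choice made here and not something the tip itself specifies. The name `frac_to_q15_hex` is hypothetical.

```c
#include <stdint.h>

/* Num x 2^15, returned as the 16-bit pattern to be written in hex. */
uint16_t frac_to_q15_hex(double num)
{
    int32_t scaled = (int32_t)(num * 32768.0);   /* truncates towards zero */
    if (scaled > 32767)  scaled = 32767;         /* +1.0 is not representable */
    if (scaled < -32768) scaled = -32768;
    return (uint16_t)scaled;                     /* two's complement pattern */
}
```

`frac_to_q15_hex(0.5)` returns 0x4000, matching the worked example, and negative fractions come out as two's complement patterns (for instance -0.5 gives 0xC000).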



Numerical Issues - 32-bit Multiplication

It is possible to perform 32-bit multiplication using 16-bit multipliers.

Example: c = a x b (with 32-bit values).

    a = | a_h | a_l |    (32-bits)
    b = | b_h | b_l |    (32-bits)

$$a \times b = [(a_h \ll 16) + a_l] \times [(b_h \ll 16) + b_l]$$

$$a \times b = [(a_h \times b_h) \ll 32] + [(a_l \times b_h) \ll 16] + [(a_h \times b_l) \ll 16] + [a_l \times b_l]$$
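The decomposition can be verified on a host machine. The sketch below assumes unsigned operands for simplicity (signed operands need the sign of the upper halves to be handled as well) and uses a 64-bit result to hold the full product; `mul32_from_16` is a hypothetical name.

```c
#include <stdint.h>

/* Unsigned 32 x 32 -> 64-bit product built from four 16 x 16 multiplies. */
uint64_t mul32_from_16(uint32_t a, uint32_t b)
{
    uint32_t ah = a >> 16, al = a & 0xFFFF;   /* upper and lower halves of a */
    uint32_t bh = b >> 16, bl = b & 0xFFFF;   /* upper and lower halves of b */

    /* Each partial product of two 16-bit values fits in 32 bits. */
    uint64_t hh = (uint64_t)(ah * bh);
    uint64_t lh = (uint64_t)(al * bh);
    uint64_t hl = (uint64_t)(ah * bl);
    uint64_t ll = (uint64_t)(al * bl);

    return (hh << 32) + (lh << 16) + (hl << 16) + ll;
}
```

The result equals the direct product `(uint64_t)a * b` for every pair of inputs, since the four terms are exactly the expansion given in the equation.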



Links

Further reading:
- Understanding TMS320C62xx DSP Single-precision Floating-Point Functions: \Links\spra515.pdf
- TMS320C6000 Integer Division: \Links\spra707.pdf

Chapter 13

Numerical Issues

- End -
