ece291 lecture 13 advanced mathematics. ece 291 lecture 13page 2 of 40 lecture outline floating...

ECE291ECE291Lecture 13Lecture 13

Advanced MathematicsAdvanced Mathematics

ECE 291 Lecture 13ECE 291 Lecture 13

Page 2 of 40 of 40

Lecture outlineLecture outline

Floating point arithmeticFloating point arithmetic80x87 Floating Point Unit80x87 Floating Point UnitMMX InstructionsMMX InstructionsSSE InstructionsSSE Instructions


Page 3 of 40 of 40

Floating-point numbersFloating-point numbers

Components of a floating point number Components of a floating point number Sign:Sign: 0=pos(+); 1=neg(-) 0=pos(+); 1=neg(-) Exponent:Exponent: (with Constant bias added) (with Constant bias added) Mantissa:Mantissa: 1 <= M < 2 1 <= M < 2

Number = (Mantissa) * 2Number = (Mantissa) * 2(exponent-bias)(exponent-bias) Example: 10.25 (decimal) Example: 10.25 (decimal)

= 1010.01 (binary) = 1010.01 (binary) = + 1.01001 * 2= + 1.01001 * 233 (normalized) (normalized)


Page 4 of 40 of 40

Floating-point formatFloating-point format

Precision Size Precision Size Exp. Exp. Bias Bias Mant. Mant. Mant. Mant. (bits) (bits) (bits) (bits) (bits) 1st Digit (bits) 1st Digit

Single 32 8 127 (7Fh) 23Single 32 8 127 (7Fh) 23 Implied Implied

Double 64 11 1023 (3FFh) 52 ImpliedDouble 64 11 1023 (3FFh) 52 Implied Extended 80 15 16383 (3FFFh) 64 ExplicitExtended 80 15 16383 (3FFFh) 64 Explicit


Page 5 of 40 of 40

Converting to 32-bit floating-point formConverting to 32-bit floating-point formConvert the decimal number into binaryConvert the decimal number into binaryNormalize the binary numberNormalize the binary numberCalculate the biased exponentCalculate the biased exponentStore the number in floating point formatStore the number in floating point formatExample:Example: 1.1. 100.25 = 01100100.01100.25 = 01100100.012.2. 1100100.01 = 1.10010001 x 21100100.01 = 1.10010001 x 266

3.3. 110 + 01111111(7Fh) = 110 + 01111111(7Fh) = 10000101(85h)10000101(85h)4. 4. SignSign = 0 = 0

ExponentExponent = 10000101 = 10000101MantissaMantissa = 100 1000 1000 0000 0000 = 100 1000 1000 0000 0000

00000000

Notemantissa of a number 1.XXXX is the XXXXThe 1. is an implied one-bitthat is only stored in the extended precision formof the floating-point numberas an explicit-one bit


Page 6 of 40 of 40

Converting to floating-point formConverting to floating-point form

FormatFormat SignSign ExponentExponent MantissaMantissa

SingleSingle (+)(+) 7F+6 = 85h (8 bits)7F+6 = 85h (8 bits) Implied 1 + (23 bits)Implied 1 + (23 bits)00 1000,01011000,0101 100 1000 1000 0000 0000 0000100 1000 1000 0000 0000 0000

DoubleDouble (+)(+) 3FFh + 6 = 405h (11 bits)3FFh + 6 = 405h (11 bits) Implied 1 + (52 bits)Implied 1 + (52 bits)00 100,0000,0101100,0000,0101 100 1000 1000 0000 … 0000100 1000 1000 0000 … 0000

ExtendedExtended (+)(+) 3FFFh+6 = 4005h (15 bits)3FFFh+6 = 4005h (15 bits) Explicit 1 + (63 bits)Explicit 1 + (63 bits)00 100,0000,0000,0101100,0000,0000,0101 1 100 1000 1000 0000 … 00001 100 1000 1000 0000 … 0000


Page 7 of 40 of 40

Special representationsSpecial representations

Zero:Zero: Exp = All Zero, Mant = All ZeroExp = All Zero, Mant = All Zero

NAN (not a number)NAN (not a number) an invalid floating-point result an invalid floating-point result Exp = All Ones, Mant non-zeroExp = All Ones, Mant non-zero

+ Infinity:+ Infinity: Sign = 0, Exp = All Ones, Mant = 0Sign = 0, Exp = All Ones, Mant = 0

- Infinity:- Infinity: Sign = 1, Exp = All Ones, Mant = 0Sign = 1, Exp = All Ones, Mant = 0


Page 8 of 40 of 40

Converting fromConverting fromFloating-Point Form (example) Floating-Point Form (example)

SignSign = 1 = 1ExponentExponent = 1000,0011 = 1000,0011MantissaMantissa = 100,1001,0000,0000,0000 = 100,1001,0000,0000,0000

100 = 1000,0011 - 0111,1111 (convert biased exponent)100 = 1000,0011 - 0111,1111 (convert biased exponent)1.1001001 x 21.1001001 x 244 (a normalized binary number)(a normalized binary number)11001.00111001.001 (a de-normalized binary number)(a de-normalized binary number)-25.125-25.125 ( a decimal number)( a decimal number)


Page 9 of 40 of 40

Variable declaration examplesVariable declaration examples

Declaring and Initializing Floating Point VariablesDeclaring and Initializing Floating Point VariablesFpvar1Fpvar1 dddd -7.5-7.5 ; single precision (4 bytes); single precision (4 bytes)Fpvar2Fpvar2 dqdq 0.56250.5625 ; double precision (8 bytes); double precision (8 bytes)Fpvar3Fpvar3 dtdt -247.6-247.6 ; extended precision (10 bytes); extended precision (10 bytes)Fpvar4Fpvar4 dtdt 345.2345.2 ; extended precision (10 bytes); extended precision (10 bytes)Fpvar5Fpvar5 dtdt 1.01.0 ; extended precision (10 bytes); extended precision (10 bytes)


Page 10 of 40 of 40

The 80x87 floating point unit The 80x87 floating point unit On-chip hardware for fast computation on floating point On-chip hardware for fast computation on floating point numbers numbers FPU operations run in parallel with integer operations FPU operations run in parallel with integer operations Historically implemented as separate chip Historically implemented as separate chip

8087-80387 Coprocessor 8087-80387 Coprocessor

80486DX-Pentium III contain their own internal and fully 80486DX-Pentium III contain their own internal and fully compatible versions of the 80387compatible versions of the 80387CPU/FPU Exchange data through memoryCPU/FPU Exchange data through memory

X87 always converts Single and Double to Extended-precision X87 always converts Single and Double to Extended-precision


Page 11 of 40 of 40

Arithmetic Coprocessor ArchitectureArithmetic Coprocessor ArchitectureControl unit Control unit

interfaces the coprocessor to the microprocessor-system data businterfaces the coprocessor to the microprocessor-system data bus

Numeric execution unit (NEU)Numeric execution unit (NEU) executes all coprocessor instructionsexecutes all coprocessor instructions

NEU has eight-register stack NEU has eight-register stack each register is 80-bit wide (extended-precision numbers)each register is 80-bit wide (extended-precision numbers) data appear as any other form when they reside in the memory systemdata appear as any other form when they reside in the memory system holds operands for arithmetic instructions and the results of arithmetic holds operands for arithmetic instructions and the results of arithmetic

instructionsinstructions

Other registers - status, control, tag, exceptionOther registers - status, control, tag, exception


Page 12 of 40 of 40

Arithmetic Coprocessor FPU RegistersArithmetic Coprocessor FPU Registers

The eight 80-bit data registers areThe eight 80-bit data registers areorganized as a stackorganized as a stackAs data items are pushed into the As data items are pushed into the top register, previous data itemstop register, previous data itemsmove into higher-numbered registers, move into higher-numbered registers, which are lower on the stackwhich are lower on the stackAll coprocessor data are stored All coprocessor data are stored in registers in the 10-byte real formatin registers in the 10-byte real formatInternally, all calculations are Internally, all calculations are done on the greatest precision numbersdone on the greatest precision numbersInstructions that transfer numbersInstructions that transfer numbersbetween memory and coprocessorbetween memory and coprocessorautomatically convert numbers to automatically convert numbers to and from the 10-byte real formatand from the 10-byte real format


Page 13 of 40 of 40

Sample 80x87 FPU Opcodes Sample 80x87 FPU Opcodes FLD:FLD:Load (lets you load a memory address into an fp reg)Load (lets you load a memory address into an fp reg) FST:FST:Store (moves fp reg to memory address)Store (moves fp reg to memory address) FADD: Addition FADD: Addition FMUL: Multiplication FMUL: Multiplication FDIV: Division (divides the destination by the source)FDIV: Division (divides the destination by the source) FDIVR: Division (divides the source by the destination)FDIVR: Division (divides the source by the destination) FSIN: Sine (uses radians) FSIN: Sine (uses radians) FCOS: Cosine (uses radians) FCOS: Cosine (uses radians) FSQRT: Square Root FSQRT: Square Root FSUB: Subtraction FSUB: Subtraction FABS: Absolute Value FABS: Absolute Value FYL2X: Calculates Y * logFYL2X: Calculates Y * log22 (X) (Really) (X) (Really) FYL2XP1: Calculates Y * logFYL2XP1: Calculates Y * log22(X+1) (Really) (X+1) (Really)


Page 14 of 40 of 40

Programming the FPUProgramming the FPU

FPU Registers = ST0 … ST7 FPU Registers = ST0 … ST7 Format: Format: OPCODE source OPCODE source

Restrictions: Source can usually be ST# or Restrictions: Source can usually be ST# or memory location memory location

Use of ST0 as destination is implicit when not Use of ST0 as destination is implicit when not specified otherwisespecified otherwise

FINIT -FINIT - Execute before starting to initialize FPU Execute before starting to initialize FPU registers!registers!


Page 15 of 40 of 40

Programming the FPU (cont.)Programming the FPU (cont.)

FLD size[MemVar]FLD size[MemVar] Load ST0 with MemVar &Load ST0 with MemVar & Push all other STi to ST(i+1) for i=0..6 Push all other STi to ST(i+1) for i=0..6

FSTP size[MemVar]FSTP size[MemVar] Move top of stack (STO) to memory & Move top of stack (STO) to memory & Pop all other STi to ST(i-1) for I = 1..7 Pop all other STi to ST(i-1) for I = 1..7

FST size[MemVar]FST size[MemVar] Move top of stack to memory Move top of stack to memory Leave result in ST0 Leave result in ST0


Page 16 of 40 of 40

FPU ProgrammingFPU ProgrammingClassical-Stack FormatClassical-Stack Format

Instructions treat the coprocessor registers like items on the stackInstructions treat the coprocessor registers like items on the stackST0 (the top of the stack) is the source operandST0 (the top of the stack) is the source operandST1, the second register, is the destinationST1, the second register, is the destination

ST0

ST1

fld1 fldpi fadd st1

1.0

1.0

3.14 4.14

Instructions

StackContents1.0


Page 17 of 40 of 40

FPU ProgrammingFPU ProgrammingMemory FormatMemory Format

The stack top (ST0) is always the implied The stack top (ST0) is always the implied destinationdestination

ST0

ST1

fld m1 fld m2 fadd m1

1.0 1.0

1.0

fstp m1 fst m2

m1

m2

1.0

2.0

1.0

2.0

1.0

2.0

1.0

2.0

3.0

2.0

3.0

1.0

1.0

1.0

2.0 3.0

SEGMENT DATAm1 DD 1.0m2 DD 2.0SEGMENT CODE….fld dword [m1]fld dword [m2]fadd dword [m1]fstp dword [m1]fst dword [m2]


Page 18 of 40 of 40

FPU ProgrammingFPU ProgrammingRegister FormatRegister Format

Instructions treat coprocessor registers as registers rather than Instructions treat coprocessor registers as registers rather than stack elementsstack elements

ST0

ST1

fadd to st1

1.0

3.0

4.0

fadd st2 fxch st1

ST2

2.0

1.0

3.0

3.0

3.0 3.0

4.0

3.0

3.0


Page 19 of 40 of 40

FPU ProgrammingFPU ProgrammingRegister-Pop FormatRegister-Pop Format

Coprocessor registers are treated as a Coprocessor registers are treated as a modified stackmodified stack the source register must always be a stack topthe source register must always be a stack top

ST0

ST1

faddp st2, st0

2.0

ST2

2.0

1.0

3.0

4.0


Page 20 of 40 of 40

FPU ProgrammingFPU ProgrammingTiming IssuesTiming Issues

A processor instruction following a coprocessor instrA processor instruction following a coprocessor instr coordinated by assembler for 8086 and 8088coordinated by assembler for 8086 and 8088

coordinated by processor on 80186 - Pentiumcoordinated by processor on 80186 - Pentium

A processor instruction that accesses memory following A processor instruction that accesses memory following a coprocessor instruction that accesses the same a coprocessor instruction that accesses the same memorymemory

for 8087 need to include for 8087 need to include WAIT or FWAITWAIT or FWAIT to ensure that to ensure that coprocessor finishes before the processor beginscoprocessor finishes before the processor begins

assembler did this automaticallyassembler did this automatically


Page 21 of 40 of 40

Calculating the area of a circleCalculating the area of a circleRAD DD 2.34, 5.66, 9.33, 234.5, 23.4RAD DD 2.34, 5.66, 9.33, 234.5, 23.4AREA RESDAREA RESD 55

..START..STARTmovmov si, 0si, 0 ;source element 0;source element 0movmov di, 0di, 0 ;destination element 0;destination element 0mov mov cx, 5cx, 5 ;count of 5;count of 5finitfinit

Main1Main1fld fld dword [RAD + si]dword [RAD + si] ;radius to ST0;radius to ST0fmulfmul st0st0 ; square radius; square radius

fldpifldpi ; pi to ST0; pi to ST0

fmul st1 fmul st1 ; multiply ST0 = ST0 x ST1; multiply ST0 = ST0 x ST1

fstp dword [AREA + di]fstp dword [AREA + di]add si, 4add si, 4add di, 4add di, 4loop loop Main1Main1

mov ax, 4C00hmov ax, 4C00h intint 21h21h


Page 22 of 40 of 40

Floating Point OperationsFloating Point Operations Calculating Calculating Sqrt(xSqrt(x22 + y + y22))

StoreFloat resbStoreFloat resb 9494 X X dd dd 4.04.0 Y Y dd dd 3.03.0 Z Z resd resd

%macro%%macro% ShowFP ShowFP fsavefsave StoreFloatStoreFloat frstorfrstor StoreFloat StoreFloat %endmacro%%endmacro% ..start..start movmov ax, cs ; Initialize DS=CS ax, cs ; Initialize DS=CS movmov ds, axds, ax

finit finit ShowFPShowFP fldfld dword [X] dword [X] ShowFPShowFP fldfld st0st0 ShowFPShowFP

fmulpfmulp st1, st0st1, st0 ShowFPShowFP fldfld dword [Y]dword [Y] ShowFPShowFP fldfld st0st0 ShowFPShowFP fmulpfmulp st1, st0st1, st0 ShowFPShowFP faddfadd st0, st1st0, st1 ShowFPShowFP fsqrtfsqrt ShowFPShowFP fstfst dword [Z]dword [Z] ShowFPShowFP movmov ax, 4C00hax, 4C00h ;Exit to DOS ;Exit to DOS intint 21h21h


Page 23 of 40 of 40

MMX MMX MMultiultiMMedia eedia eXXtentionstentions

Designed to accelerate multimedia and communication Designed to accelerate multimedia and communication applicationsapplications

motion video, image processing, audio synthesis, speech motion video, image processing, audio synthesis, speech synthesis and compression, video conferencing, 2D and 3D synthesis and compression, video conferencing, 2D and 3D graphicsgraphics

Includes new instructions and data types to significantly Includes new instructions and data types to significantly improve application performanceimprove application performanceExploits the parallelism inherent in many multimedia and Exploits the parallelism inherent in many multimedia and communications algorithmscommunications algorithmsMaintains full compatibility with existing operating Maintains full compatibility with existing operating systems and applicationssystems and applications


Page 24 of 40 of 40

MMX MMX MMultiultiMMedia eedia eXXtentionstentionsProvides a set of basic, general purpose integer instructionsProvides a set of basic, general purpose integer instructionsSingle Instruction, Multiple Data (SIMD) techniqueSingle Instruction, Multiple Data (SIMD) technique

allows many pieces of information to be processed with a single allows many pieces of information to be processed with a single instruction, providing parallelism that greatly increases performance instruction, providing parallelism that greatly increases performance

57 new instructions57 new instructionsFour new data typesFour new data typesEight 64-bit wide MMX registersEight 64-bit wide MMX registersFirst available in 1997First available in 1997Supported on:Supported on:

Intel Pentium-MMX, Pentium II, Pentium III (and later) Intel Pentium-MMX, Pentium II, Pentium III (and later) AMD K6, K6-2, K6-3, K7 (and later) AMD K6, K6-2, K6-3, K7 (and later) Cyrix M2, MMX-enhanced MediaGX, Jalapeno (and later) Cyrix M2, MMX-enhanced MediaGX, Jalapeno (and later)


Page 25 of 40 of 40

Internal Register Set Internal Register Set of MMX Technology of MMX Technology

Uses 64-bit mantissa portion of 80-bit FPU registersUses 64-bit mantissa portion of 80-bit FPU registersThis technique is called aliasing because This technique is called aliasing because the floating-point registers are shared the floating-point registers are shared as the MMX registersas the MMX registersAliasing the MMX state upon the Aliasing the MMX state upon the floating point state does not precludefloating point state does not precludeapplications from executing both MMX applications from executing both MMX technology instructions and technology instructions and floating point instructionsfloating point instructionsThe same techniques used by FPUThe same techniques used by FPUto interface with the operatingto interface with the operatingsystem are used by MMX technologysystem are used by MMX technology

preserves FSAVE/FRSTOR instructionspreserves FSAVE/FRSTOR instructions


Page 26 of 40 of 40

Data TypesData Types

MMX architecture introduces new packed data typesMMX architecture introduces new packed data typesMultiple integer words are grouped into a single 64-bit quantityMultiple integer words are grouped into a single 64-bit quantity

Eight 8-bit packed bytes (B) Eight 8-bit packed bytes (B) Four 16-bit packed words (W) Four 16-bit packed words (W) Two 32-bit packed doublewords (D) Two 32-bit packed doublewords (D) One 64-bit quadword (Q)One 64-bit quadword (Q)

Example: consider graphics pixel Example: consider graphics pixel data represented as bytes.data represented as bytes.

with MMX, eight of these pixels can be packed together in a 64-bit with MMX, eight of these pixels can be packed together in a 64-bit quantity and moved into an MMX registerquantity and moved into an MMX register

MMX instruction performs the arithmetic or logical operation on all MMX instruction performs the arithmetic or logical operation on all eight elements in parallel eight elements in parallel


Page 27 of 40 of 40

Arithmetic InstructionsArithmetic Instructions

PADD(B/W/D): AdditionPADD(B/W/D): Addition

PADDB MM1, MM2PADDB MM1, MM2 adds 64-bit contents of MM2 to MM1, adds 64-bit contents of MM2 to MM1, byte-by-byte any carries generated byte-by-byte any carries generated are dropped, e.g., byte A0h + 70h = 10hare dropped, e.g., byte A0h + 70h = 10h

PSUB(B/W/D): Subtraction PSUB(B/W/D): Subtraction


Page 28 of 40 of 40

Arithmetic InstructionsArithmetic InstructionsPMUL(L/H)W: Multiplication (Low/High Result)PMUL(L/H)W: Multiplication (Low/High Result)

multiplies four pairs of 16-bit operands, producing 32 bit resultmultiplies four pairs of 16-bit operands, producing 32 bit result

PMADDWD: Multiply and AddPMADDWD: Multiply and Add

Key instruction to many signal processing algorithms like matrix Key instruction to many signal processing algorithms like matrix multiplies or FFTsmultiplies or FFTs

a3 a2 a1 a0

b3 b2 b1 b0

a3*b3 + a2*b2 a1*b1 + a0*b0


Page 29 of 40 of 40

Logical, Shifting, and Compare InstructionsLogical, Shifting, and Compare InstructionsLogical Logical PAND: Logical AND (64-bit) PAND: Logical AND (64-bit) POR: Logical OR (64-bit) POR: Logical OR (64-bit) PXOR: Logical Exclusive OR (64-bit) PXOR: Logical Exclusive OR (64-bit) PANDN: Destination = (NOT Destination) AND SourcePANDN: Destination = (NOT Destination) AND Source

Shifting Shifting PSLL(W/D/Q): Packed Shift Left (Logical) PSLL(W/D/Q): Packed Shift Left (Logical) PSRL(W/D/Q): Packed Shift Rigth (Logical) PSRL(W/D/Q): Packed Shift Rigth (Logical) PSRA(W/D/Q): Packed Shift Right (Arithmetic)PSRA(W/D/Q): Packed Shift Right (Arithmetic) CompareCompare

PCMPEQ: Compare for Equality PCMPEQ: Compare for Equality PCMPGT: Compare for Greater Than PCMPGT: Compare for Greater Than

Sets Result Register to ALL 0's or ALL 1's Sets Result Register to ALL 0's or ALL 1's


Page 30 of 40 of 40

Conversion Instructions UnpackingConversion Instructions Unpacking

Unpacking (Increasing data size by 2*n bits) Unpacking (Increasing data size by 2*n bits) PUNPCKLBW: Reg1, Reg2: PUNPCKLBW: Reg1, Reg2:

Unpacks lower four bytes to create Unpacks lower four bytes to create four words. four words.

PUNPCKLWD: Reg1, Reg2: PUNPCKLWD: Reg1, Reg2: Interleaves lower two words to create two doubles Interleaves lower two words to create two doubles

PUNPCKLDQ: Reg1, Reg2: PUNPCKLDQ: Reg1, Reg2: Interleaves lower double to create Quadword Interleaves lower double to create Quadword


Page 31 of 40 of 40

Conversion Instructions PackingConversion Instructions Packing

Packing (Reducing data size by 2*n bits) Packing (Reducing data size by 2*n bits) PACKSSDW Reg1, Reg2: Pack Double to WordPACKSSDW Reg1, Reg2: Pack Double to Word Four doubles in Reg2:Reg1 compressed to Four words in Reg1 Four doubles in Reg2:Reg1 compressed to Four words in Reg1 PACKSSWB Reg1, Reg2: Pack Word to Byte PACKSSWB Reg1, Reg2: Pack Word to Byte Eight words in Reg2:Reg1 compressed to Eight bytes in Reg1Eight words in Reg2:Reg1 compressed to Eight bytes in Reg1

The pack and unpack instructions are especially important when an The pack and unpack instructions are especially important when an algorithm needs higher precision in its intermediate calculations, algorithm needs higher precision in its intermediate calculations, e.g., an image filteringe.g., an image filtering


Page 32 of 40 of 40

Data Transfer InstructionsData Transfer Instructions

MOVQ Dest, Source : 64-bit move MOVQ Dest, Source : 64-bit move One or both arguments must be a MMX One or both arguments must be a MMX

register register

MOVD Dest, Source : 32-bit move MOVD Dest, Source : 32-bit move Zeros loaded to upper MMX bits for 32-bit Zeros loaded to upper MMX bits for 32-bit

move move


Page 33 of 40 of 40

Saturation/Wraparound Saturation/Wraparound ArithmeticArithmetic

Wraparound: carry bit lost (significant Wraparound: carry bit lost (significant portion lost) (PADD)portion lost) (PADD)

a3 a2 a1 FFFFh

b3 b2 b1 8000h

a3+b3 a2+b2 a1+b1 7FFFh

+


Page 34 of 40 of 40

Saturation/Wraparound Saturation/Wraparound Arithmetic (cont.)Arithmetic (cont.)

Unsigned Saturation: add with unsigned saturation (PADDUS)Unsigned Saturation: add with unsigned saturation (PADDUS)

a3 a2 a1 FFF4h

b3 b2 b1 1123h

a3+b3 a2+b2 a1+b1 FFFFh

+

Saturation if addition results in overflow or subtraction results in underflow,the result is clamped to the largest or the smallest value representable

• for an unsigned, 16-bit word the values are FFFFh and 0000h• for a signed 16-bit word the values are 7FFFh and 8000h


Page 35 of 40 of 40

Saturation/Wraparound Saturation/Wraparound Arithmetic (cont.)Arithmetic (cont.)

Saturation: add with signed saturation Saturation: add with signed saturation (PADDS)(PADDS)

a3 a2 a1 7FF4h

b3 b2 b1 0050h

a3+b3 a2+b2 a1+b1 7FFFh

+


Page 36 of 40 of 40

Adding 8 8-bit Integers Adding 8 8-bit Integers with Saturationwith Saturation

X0 dq X0 dq 80808080555555558080808055555555 X1 dq X1 dq 000090903333FFFF000090903333FFFF

MOVQ MOVQ mm0,X0mm0,X0 PADDSB PADDSB mm0,X1 mm0,X1

Result mm0 = Result mm0 = 808080807F7F5454808080807F7F545480h+00h=80h (addition with zero)80h+90h=80h (saturation to maximum negative value)55h+33h=7fh (saturation to maximum positive value)55h+FFh=54h (subtraction by one)


Page 37 of 40 of 40

Parallel CompareParallel Compare

PCMPGT[B/W/D] mmxreg, r/m64PCMPGT[B/W/D] mmxreg, r/m64

PCMPEQ[B/W/D] mmxreg, r/m64PCMPEQ[B/W/D] mmxreg, r/m642323 4545 1616 3434

Eq?Eq? Eq?Eq? Eq?Eq? Eq?Eq?

3131 77 1616 6767

0000h0000h 0000h0000h FFFFhFFFFh 0000h0000h

2323 4545 1616 3434

Gt?Gt? Gt?Gt? gt/?gt/? Gt?Gt?

3131 77 1616 6767

0000h0000h FFFFhFFFFh 0000h0000h 0000h0000h


Page 38 of 40 of 40

Use of FPU Registers for Use of FPU Registers for Storing MMX DataStoring MMX Data

The EMMS (empty MMX-state) instruction sets (11) all The EMMS (empty MMX-state) instruction sets (11) all the tags in the FPU, so the floating-point registers are the tags in the FPU, so the floating-point registers are listed as emptylisted as emptyEMMS must be executed before the return instruction at EMMS must be executed before the return instruction at the end of any MMX procedurethe end of any MMX procedure

otherwise any subsequent floating point operation will cause a otherwise any subsequent floating point operation will cause a floating point interrupt error, potentially crashing your applicationfloating point interrupt error, potentially crashing your application

if you use floating point within MMX procedure, you must use if you use floating point within MMX procedure, you must use EMMS instruction before executing the floating point instructionEMMS instruction before executing the floating point instruction

Any MMX instruction resets (00) all FPU tag bits, so the Any MMX instruction resets (00) all FPU tag bits, so the floating-point registers are listed as validfloating-point registers are listed as valid


Page 39 of 40 of 40

Streaming SIMD ExtensionsStreaming SIMD ExtensionsIntel Pentium IIIIntel Pentium III

Streaming SIMD defines a new architecture for Streaming SIMD defines a new architecture for floating point operationsfloating point operationsOperates on IEEE-754 Single-precision 32-bit Operates on IEEE-754 Single-precision 32-bit Real Numbers Real Numbers Uses eight new 128-bit wide general-purpose Uses eight new 128-bit wide general-purpose registers (XMM0 - XMM7)registers (XMM0 - XMM7)Introduced in Pentium III in March 1999 Introduced in Pentium III in March 1999 Pentium III includes floating point, MMX technology, Pentium III includes floating point, MMX technology,

and XMM registersand XMM registers


Page 40 of 40 of 40

Streaming SIMD ExtensionsStreaming SIMD ExtensionsIntel Pentium III (cont.)Intel Pentium III (cont.)

Supports packed and scalar operations on the Supports packed and scalar operations on the new packed single precision floating point data new packed single precision floating point data typestypesPackedPacked instructions operate vertically on four instructions operate vertically on four pairs of floating point data elements in parallelpairs of floating point data elements in parallel instructions have suffix instructions have suffix psps, e.g., addps, e.g., addps

ScalarScalar instructions operate on the least- instructions operate on the least-significant data elements of the two operands significant data elements of the two operands instructions have suffix instructions have suffix ssss, e.g., addss, e.g., addss

ece291 lecture 13 advanced mathematics. ece 291 lecture 13page 2 of 40 lecture outline floating...

Documents