ece291 lecture 13 advanced mathematics. ece 291 lecture 13page 2 of 40 lecture outline floating...
DESCRIPTION
ECE 291 Lecture 13Page 3 of 40 Floating-point numbers Components of a floating point number Sign: 0=pos(+); 1=neg(-) Sign: 0=pos(+); 1=neg(-) Exponent: (with Constant bias added) Exponent: (with Constant bias added) Mantissa: 1TRANSCRIPT
ECE291ECE291Lecture 13Lecture 13
Advanced MathematicsAdvanced Mathematics
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 22 of 40 of 40
Lecture outlineLecture outline
Floating point arithmeticFloating point arithmetic80x87 Floating Point Unit80x87 Floating Point UnitMMX InstructionsMMX InstructionsSSE InstructionsSSE Instructions
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 33 of 40 of 40
Floating-point numbersFloating-point numbers
Components of a floating point number Components of a floating point number Sign:Sign: 0=pos(+); 1=neg(-) 0=pos(+); 1=neg(-) Exponent:Exponent: (with Constant bias added) (with Constant bias added) Mantissa:Mantissa: 1 <= M < 2 1 <= M < 2
Number = (Mantissa) * 2Number = (Mantissa) * 2(exponent-bias)(exponent-bias) Example: 10.25 (decimal) Example: 10.25 (decimal)
= 1010.01 (binary) = 1010.01 (binary) = + 1.01001 * 2= + 1.01001 * 233 (normalized) (normalized)
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 44 of 40 of 40
Floating-point formatFloating-point format
Precision Size Precision Size Exp. Exp. Bias Bias Mant. Mant. Mant. Mant. (bits) (bits) (bits) (bits) (bits) 1st Digit (bits) 1st Digit
Single 32 8 127 (7Fh) 23Single 32 8 127 (7Fh) 23 Implied Implied
Double 64 11 1023 (3FFh) 52 ImpliedDouble 64 11 1023 (3FFh) 52 Implied Extended 80 15 16383 (3FFFh) 64 ExplicitExtended 80 15 16383 (3FFFh) 64 Explicit
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 55 of 40 of 40
Converting to 32-bit floating-point formConverting to 32-bit floating-point formConvert the decimal number into binaryConvert the decimal number into binaryNormalize the binary numberNormalize the binary numberCalculate the biased exponentCalculate the biased exponentStore the number in floating point formatStore the number in floating point formatExample:Example: 1.1. 100.25 = 01100100.01100.25 = 01100100.012.2. 1100100.01 = 1.10010001 x 21100100.01 = 1.10010001 x 266
3.3. 110 + 01111111(7Fh) = 110 + 01111111(7Fh) = 10000101(85h)10000101(85h)4. 4. SignSign = 0 = 0
ExponentExponent = 10000101 = 10000101MantissaMantissa = 100 1000 1000 0000 0000 = 100 1000 1000 0000 0000
00000000
Notemantissa of a number 1.XXXX is the XXXXThe 1. is an implied one-bitthat is only stored in the extended precision formof the floating-point numberas an explicit-one bit
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 66 of 40 of 40
Converting to floating-point formConverting to floating-point form
FormatFormat SignSign ExponentExponent MantissaMantissa
SingleSingle (+)(+) 7F+6 = 85h (8 bits)7F+6 = 85h (8 bits) Implied 1 + (23 bits)Implied 1 + (23 bits)00 1000,01011000,0101 100 1000 1000 0000 0000 0000100 1000 1000 0000 0000 0000
DoubleDouble (+)(+) 3FFh + 6 = 405h (11 bits)3FFh + 6 = 405h (11 bits) Implied 1 + (52 bits)Implied 1 + (52 bits)00 100,0000,0101100,0000,0101 100 1000 1000 0000 … 0000100 1000 1000 0000 … 0000
ExtendedExtended (+)(+) 3FFFh+6 = 4005h (15 bits)3FFFh+6 = 4005h (15 bits) Explicit 1 + (63 bits)Explicit 1 + (63 bits)00 100,0000,0000,0101100,0000,0000,0101 1 100 1000 1000 0000 … 00001 100 1000 1000 0000 … 0000
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 77 of 40 of 40
Special representationsSpecial representations
Zero:Zero: Exp = All Zero, Mant = All ZeroExp = All Zero, Mant = All Zero
NAN (not a number)NAN (not a number) an invalid floating-point result an invalid floating-point result Exp = All Ones, Mant non-zeroExp = All Ones, Mant non-zero
+ Infinity:+ Infinity: Sign = 0, Exp = All Ones, Mant = 0Sign = 0, Exp = All Ones, Mant = 0
- Infinity:- Infinity: Sign = 1, Exp = All Ones, Mant = 0Sign = 1, Exp = All Ones, Mant = 0
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 88 of 40 of 40
Converting fromConverting fromFloating-Point Form (example) Floating-Point Form (example)
SignSign = 1 = 1ExponentExponent = 1000,0011 = 1000,0011MantissaMantissa = 100,1001,0000,0000,0000 = 100,1001,0000,0000,0000
100 = 1000,0011 - 0111,1111 (convert biased exponent)100 = 1000,0011 - 0111,1111 (convert biased exponent)1.1001001 x 21.1001001 x 244 (a normalized binary number)(a normalized binary number)11001.00111001.001 (a de-normalized binary number)(a de-normalized binary number)-25.125-25.125 ( a decimal number)( a decimal number)
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 99 of 40 of 40
Variable declaration examplesVariable declaration examples
Declaring and Initializing Floating Point VariablesDeclaring and Initializing Floating Point VariablesFpvar1Fpvar1 dddd -7.5-7.5 ; single precision (4 bytes); single precision (4 bytes)Fpvar2Fpvar2 dqdq 0.56250.5625 ; double precision (8 bytes); double precision (8 bytes)Fpvar3Fpvar3 dtdt -247.6-247.6 ; extended precision (10 bytes); extended precision (10 bytes)Fpvar4Fpvar4 dtdt 345.2345.2 ; extended precision (10 bytes); extended precision (10 bytes)Fpvar5Fpvar5 dtdt 1.01.0 ; extended precision (10 bytes); extended precision (10 bytes)
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 1010 of 40 of 40
The 80x87 floating point unit The 80x87 floating point unit On-chip hardware for fast computation on floating point On-chip hardware for fast computation on floating point numbers numbers FPU operations run in parallel with integer operations FPU operations run in parallel with integer operations Historically implemented as separate chip Historically implemented as separate chip
8087-80387 Coprocessor 8087-80387 Coprocessor
80486DX-Pentium III contain their own internal and fully 80486DX-Pentium III contain their own internal and fully compatible versions of the 80387compatible versions of the 80387CPU/FPU Exchange data through memoryCPU/FPU Exchange data through memory
X87 always converts Single and Double to Extended-precision X87 always converts Single and Double to Extended-precision
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 1111 of 40 of 40
Arithmetic Coprocessor ArchitectureArithmetic Coprocessor ArchitectureControl unit Control unit
interfaces the coprocessor to the microprocessor-system data businterfaces the coprocessor to the microprocessor-system data bus
Numeric execution unit (NEU)Numeric execution unit (NEU) executes all coprocessor instructionsexecutes all coprocessor instructions
NEU has eight-register stack NEU has eight-register stack each register is 80-bit wide (extended-precision numbers)each register is 80-bit wide (extended-precision numbers) data appear as any other form when they reside in the memory systemdata appear as any other form when they reside in the memory system holds operands for arithmetic instructions and the results of arithmetic holds operands for arithmetic instructions and the results of arithmetic
instructionsinstructions
Other registers - status, control, tag, exceptionOther registers - status, control, tag, exception
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 1212 of 40 of 40
Arithmetic Coprocessor FPU RegistersArithmetic Coprocessor FPU Registers
The eight 80-bit data registers areThe eight 80-bit data registers areorganized as a stackorganized as a stackAs data items are pushed into the As data items are pushed into the top register, previous data itemstop register, previous data itemsmove into higher-numbered registers, move into higher-numbered registers, which are lower on the stackwhich are lower on the stackAll coprocessor data are stored All coprocessor data are stored in registers in the 10-byte real formatin registers in the 10-byte real formatInternally, all calculations are Internally, all calculations are done on the greatest precision numbersdone on the greatest precision numbersInstructions that transfer numbersInstructions that transfer numbersbetween memory and coprocessorbetween memory and coprocessorautomatically convert numbers to automatically convert numbers to and from the 10-byte real formatand from the 10-byte real format
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 1313 of 40 of 40
Sample 80x87 FPU Opcodes Sample 80x87 FPU Opcodes FLD:FLD:Load (lets you load a memory address into an fp reg)Load (lets you load a memory address into an fp reg) FST:FST:Store (moves fp reg to memory address)Store (moves fp reg to memory address) FADD: Addition FADD: Addition FMUL: Multiplication FMUL: Multiplication FDIV: Division (divides the destination by the source)FDIV: Division (divides the destination by the source) FDIVR: Division (divides the source by the destination)FDIVR: Division (divides the source by the destination) FSIN: Sine (uses radians) FSIN: Sine (uses radians) FCOS: Cosine (uses radians) FCOS: Cosine (uses radians) FSQRT: Square Root FSQRT: Square Root FSUB: Subtraction FSUB: Subtraction FABS: Absolute Value FABS: Absolute Value FYL2X: Calculates Y * logFYL2X: Calculates Y * log22 (X) (Really) (X) (Really) FYL2XP1: Calculates Y * logFYL2XP1: Calculates Y * log22(X+1) (Really) (X+1) (Really)
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 1414 of 40 of 40
Programming the FPUProgramming the FPU
FPU Registers = ST0 … ST7 FPU Registers = ST0 … ST7 Format: Format: OPCODE source OPCODE source
Restrictions: Source can usually be ST# or Restrictions: Source can usually be ST# or memory location memory location
Use of ST0 as destination is implicit when not Use of ST0 as destination is implicit when not specified otherwisespecified otherwise
FINIT -FINIT - Execute before starting to initialize FPU Execute before starting to initialize FPU registers!registers!
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 1515 of 40 of 40
Programming the FPU (cont.)Programming the FPU (cont.)
FLD size[MemVar]FLD size[MemVar] Load ST0 with MemVar &Load ST0 with MemVar & Push all other STi to ST(i+1) for i=0..6 Push all other STi to ST(i+1) for i=0..6
FSTP size[MemVar]FSTP size[MemVar] Move top of stack (STO) to memory & Move top of stack (STO) to memory & Pop all other STi to ST(i-1) for I = 1..7 Pop all other STi to ST(i-1) for I = 1..7
FST size[MemVar]FST size[MemVar] Move top of stack to memory Move top of stack to memory Leave result in ST0 Leave result in ST0
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 1616 of 40 of 40
FPU ProgrammingFPU ProgrammingClassical-Stack FormatClassical-Stack Format
Instructions treat the coprocessor registers like items on the stackInstructions treat the coprocessor registers like items on the stackST0 (the top of the stack) is the source operandST0 (the top of the stack) is the source operandST1, the second register, is the destinationST1, the second register, is the destination
ST0
ST1
fld1 fldpi fadd st1
1.0
1.0
3.14 4.14
Instructions
StackContents1.0
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 1717 of 40 of 40
FPU ProgrammingFPU ProgrammingMemory FormatMemory Format
The stack top (ST0) is always the implied The stack top (ST0) is always the implied destinationdestination
ST0
ST1
fld m1 fld m2 fadd m1
1.0 1.0
1.0
fstp m1 fst m2
m1
m2
1.0
2.0
1.0
2.0
1.0
2.0
1.0
2.0
3.0
2.0
3.0
1.0
1.0
1.0
2.0 3.0
SEGMENT DATAm1 DD 1.0m2 DD 2.0SEGMENT CODE….fld dword [m1]fld dword [m2]fadd dword [m1]fstp dword [m1]fst dword [m2]
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 1818 of 40 of 40
FPU ProgrammingFPU ProgrammingRegister FormatRegister Format
Instructions treat coprocessor registers as registers rather than Instructions treat coprocessor registers as registers rather than stack elementsstack elements
ST0
ST1
fadd to st1
1.0
3.0
4.0
fadd st2 fxch st1
ST2
2.0
1.0
3.0
3.0
3.0 3.0
4.0
3.0
3.0
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 1919 of 40 of 40
FPU ProgrammingFPU ProgrammingRegister-Pop FormatRegister-Pop Format
Coprocessor registers are treated as a Coprocessor registers are treated as a modified stackmodified stack the source register must always be a stack topthe source register must always be a stack top
ST0
ST1
faddp st2, st0
2.0
ST2
2.0
1.0
3.0
4.0
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 2020 of 40 of 40
FPU ProgrammingFPU ProgrammingTiming IssuesTiming Issues
A processor instruction following a coprocessor instrA processor instruction following a coprocessor instr coordinated by assembler for 8086 and 8088coordinated by assembler for 8086 and 8088
coordinated by processor on 80186 - Pentiumcoordinated by processor on 80186 - Pentium
A processor instruction that accesses memory following A processor instruction that accesses memory following a coprocessor instruction that accesses the same a coprocessor instruction that accesses the same memorymemory
for 8087 need to include for 8087 need to include WAIT or FWAITWAIT or FWAIT to ensure that to ensure that coprocessor finishes before the processor beginscoprocessor finishes before the processor begins
assembler did this automaticallyassembler did this automatically
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 2121 of 40 of 40
Calculating the area of a circleCalculating the area of a circleRAD DD 2.34, 5.66, 9.33, 234.5, 23.4RAD DD 2.34, 5.66, 9.33, 234.5, 23.4AREA RESDAREA RESD 55
..START..STARTmovmov si, 0si, 0 ;source element 0;source element 0movmov di, 0di, 0 ;destination element 0;destination element 0mov mov cx, 5cx, 5 ;count of 5;count of 5finitfinit
Main1Main1fld fld dword [RAD + si]dword [RAD + si] ;radius to ST0;radius to ST0fmulfmul st0st0 ; square radius; square radius
fldpifldpi ; pi to ST0; pi to ST0
fmul st1 fmul st1 ; multiply ST0 = ST0 x ST1; multiply ST0 = ST0 x ST1
fstp dword [AREA + di]fstp dword [AREA + di]add si, 4add si, 4add di, 4add di, 4loop loop Main1Main1
mov ax, 4C00hmov ax, 4C00h intint 21h21h
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 2222 of 40 of 40
Floating Point OperationsFloating Point Operations Calculating Calculating Sqrt(xSqrt(x22 + y + y22))
StoreFloat resbStoreFloat resb 9494 X X dd dd 4.04.0 Y Y dd dd 3.03.0 Z Z resd resd
%macro%%macro% ShowFP ShowFP fsavefsave StoreFloatStoreFloat frstorfrstor StoreFloat StoreFloat %endmacro%%endmacro% ..start..start movmov ax, cs ; Initialize DS=CS ax, cs ; Initialize DS=CS movmov ds, axds, ax
finit finit ShowFPShowFP fldfld dword [X] dword [X] ShowFPShowFP fldfld st0st0 ShowFPShowFP
fmulpfmulp st1, st0st1, st0 ShowFPShowFP fldfld dword [Y]dword [Y] ShowFPShowFP fldfld st0st0 ShowFPShowFP fmulpfmulp st1, st0st1, st0 ShowFPShowFP faddfadd st0, st1st0, st1 ShowFPShowFP fsqrtfsqrt ShowFPShowFP fstfst dword [Z]dword [Z] ShowFPShowFP movmov ax, 4C00hax, 4C00h ;Exit to DOS ;Exit to DOS intint 21h21h
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 2323 of 40 of 40
MMX MMX MMultiultiMMedia eedia eXXtentionstentions
Designed to accelerate multimedia and communication Designed to accelerate multimedia and communication applicationsapplications
motion video, image processing, audio synthesis, speech motion video, image processing, audio synthesis, speech synthesis and compression, video conferencing, 2D and 3D synthesis and compression, video conferencing, 2D and 3D graphicsgraphics
Includes new instructions and data types to significantly Includes new instructions and data types to significantly improve application performanceimprove application performanceExploits the parallelism inherent in many multimedia and Exploits the parallelism inherent in many multimedia and communications algorithmscommunications algorithmsMaintains full compatibility with existing operating Maintains full compatibility with existing operating systems and applicationssystems and applications
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 2424 of 40 of 40
MMX MMX MMultiultiMMedia eedia eXXtentionstentionsProvides a set of basic, general purpose integer instructionsProvides a set of basic, general purpose integer instructionsSingle Instruction, Multiple Data (SIMD) techniqueSingle Instruction, Multiple Data (SIMD) technique
allows many pieces of information to be processed with a single allows many pieces of information to be processed with a single instruction, providing parallelism that greatly increases performance instruction, providing parallelism that greatly increases performance
57 new instructions57 new instructionsFour new data typesFour new data typesEight 64-bit wide MMX registersEight 64-bit wide MMX registersFirst available in 1997First available in 1997Supported on:Supported on:
Intel Pentium-MMX, Pentium II, Pentium III (and later) Intel Pentium-MMX, Pentium II, Pentium III (and later) AMD K6, K6-2, K6-3, K7 (and later) AMD K6, K6-2, K6-3, K7 (and later) Cyrix M2, MMX-enhanced MediaGX, Jalapeno (and later) Cyrix M2, MMX-enhanced MediaGX, Jalapeno (and later)
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 2525 of 40 of 40
Internal Register Set Internal Register Set of MMX Technology of MMX Technology
Uses 64-bit mantissa portion of 80-bit FPU registersUses 64-bit mantissa portion of 80-bit FPU registersThis technique is called aliasing because This technique is called aliasing because the floating-point registers are shared the floating-point registers are shared as the MMX registersas the MMX registersAliasing the MMX state upon the Aliasing the MMX state upon the floating point state does not precludefloating point state does not precludeapplications from executing both MMX applications from executing both MMX technology instructions and technology instructions and floating point instructionsfloating point instructionsThe same techniques used by FPUThe same techniques used by FPUto interface with the operatingto interface with the operatingsystem are used by MMX technologysystem are used by MMX technology
preserves FSAVE/FRSTOR instructionspreserves FSAVE/FRSTOR instructions
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 2626 of 40 of 40
Data TypesData Types
MMX architecture introduces new packed data typesMMX architecture introduces new packed data typesMultiple integer words are grouped into a single 64-bit quantityMultiple integer words are grouped into a single 64-bit quantity
Eight 8-bit packed bytes (B) Eight 8-bit packed bytes (B) Four 16-bit packed words (W) Four 16-bit packed words (W) Two 32-bit packed doublewords (D) Two 32-bit packed doublewords (D) One 64-bit quadword (Q)One 64-bit quadword (Q)
Example: consider graphics pixel Example: consider graphics pixel data represented as bytes.data represented as bytes.
with MMX, eight of these pixels can be packed together in a 64-bit with MMX, eight of these pixels can be packed together in a 64-bit quantity and moved into an MMX registerquantity and moved into an MMX register
MMX instruction performs the arithmetic or logical operation on all MMX instruction performs the arithmetic or logical operation on all eight elements in parallel eight elements in parallel
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 2727 of 40 of 40
Arithmetic InstructionsArithmetic Instructions
PADD(B/W/D): AdditionPADD(B/W/D): Addition
PADDB MM1, MM2PADDB MM1, MM2 adds 64-bit contents of MM2 to MM1, adds 64-bit contents of MM2 to MM1, byte-by-byte any carries generated byte-by-byte any carries generated are dropped, e.g., byte A0h + 70h = 10hare dropped, e.g., byte A0h + 70h = 10h
PSUB(B/W/D): Subtraction PSUB(B/W/D): Subtraction
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 2828 of 40 of 40
Arithmetic InstructionsArithmetic InstructionsPMUL(L/H)W: Multiplication (Low/High Result)PMUL(L/H)W: Multiplication (Low/High Result)
multiplies four pairs of 16-bit operands, producing 32 bit resultmultiplies four pairs of 16-bit operands, producing 32 bit result
PMADDWD: Multiply and AddPMADDWD: Multiply and Add
Key instruction to many signal processing algorithms like matrix Key instruction to many signal processing algorithms like matrix multiplies or FFTsmultiplies or FFTs
a3 a2 a1 a0
b3 b2 b1 b0
a3*b3 + a2*b2 a1*b1 + a0*b0
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 2929 of 40 of 40
Logical, Shifting, and Compare InstructionsLogical, Shifting, and Compare InstructionsLogical Logical PAND: Logical AND (64-bit) PAND: Logical AND (64-bit) POR: Logical OR (64-bit) POR: Logical OR (64-bit) PXOR: Logical Exclusive OR (64-bit) PXOR: Logical Exclusive OR (64-bit) PANDN: Destination = (NOT Destination) AND SourcePANDN: Destination = (NOT Destination) AND Source
Shifting Shifting PSLL(W/D/Q): Packed Shift Left (Logical) PSLL(W/D/Q): Packed Shift Left (Logical) PSRL(W/D/Q): Packed Shift Rigth (Logical) PSRL(W/D/Q): Packed Shift Rigth (Logical) PSRA(W/D/Q): Packed Shift Right (Arithmetic)PSRA(W/D/Q): Packed Shift Right (Arithmetic) CompareCompare
PCMPEQ: Compare for Equality PCMPEQ: Compare for Equality PCMPGT: Compare for Greater Than PCMPGT: Compare for Greater Than
Sets Result Register to ALL 0's or ALL 1's Sets Result Register to ALL 0's or ALL 1's
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 3030 of 40 of 40
Conversion Instructions UnpackingConversion Instructions Unpacking
Unpacking (Increasing data size by 2*n bits) Unpacking (Increasing data size by 2*n bits) PUNPCKLBW: Reg1, Reg2: PUNPCKLBW: Reg1, Reg2:
Unpacks lower four bytes to create Unpacks lower four bytes to create four words. four words.
PUNPCKLWD: Reg1, Reg2: PUNPCKLWD: Reg1, Reg2: Interleaves lower two words to create two doubles Interleaves lower two words to create two doubles
PUNPCKLDQ: Reg1, Reg2: PUNPCKLDQ: Reg1, Reg2: Interleaves lower double to create Quadword Interleaves lower double to create Quadword
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 3131 of 40 of 40
Conversion Instructions PackingConversion Instructions Packing
Packing (Reducing data size by 2*n bits) Packing (Reducing data size by 2*n bits) PACKSSDW Reg1, Reg2: Pack Double to WordPACKSSDW Reg1, Reg2: Pack Double to Word Four doubles in Reg2:Reg1 compressed to Four words in Reg1 Four doubles in Reg2:Reg1 compressed to Four words in Reg1 PACKSSWB Reg1, Reg2: Pack Word to Byte PACKSSWB Reg1, Reg2: Pack Word to Byte Eight words in Reg2:Reg1 compressed to Eight bytes in Reg1Eight words in Reg2:Reg1 compressed to Eight bytes in Reg1
The pack and unpack instructions are especially important when an The pack and unpack instructions are especially important when an algorithm needs higher precision in its intermediate calculations, algorithm needs higher precision in its intermediate calculations, e.g., an image filteringe.g., an image filtering
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 3232 of 40 of 40
Data Transfer InstructionsData Transfer Instructions
MOVQ Dest, Source : 64-bit move MOVQ Dest, Source : 64-bit move One or both arguments must be a MMX One or both arguments must be a MMX
register register
MOVD Dest, Source : 32-bit move MOVD Dest, Source : 32-bit move Zeros loaded to upper MMX bits for 32-bit Zeros loaded to upper MMX bits for 32-bit
move move
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 3333 of 40 of 40
Saturation/Wraparound Saturation/Wraparound ArithmeticArithmetic
Wraparound: carry bit lost (significant Wraparound: carry bit lost (significant portion lost) (PADD)portion lost) (PADD)
a3 a2 a1 FFFFh
b3 b2 b1 8000h
a3+b3 a2+b2 a1+b1 7FFFh
+
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 3434 of 40 of 40
Saturation/Wraparound Saturation/Wraparound Arithmetic (cont.)Arithmetic (cont.)
Unsigned Saturation: add with unsigned saturation (PADDUS)Unsigned Saturation: add with unsigned saturation (PADDUS)
a3 a2 a1 FFF4h
b3 b2 b1 1123h
a3+b3 a2+b2 a1+b1 FFFFh
+
Saturation if addition results in overflow or subtraction results in underflow,the result is clamped to the largest or the smallest value representable
• for an unsigned, 16-bit word the values are FFFFh and 0000h• for a signed 16-bit word the values are 7FFFh and 8000h
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 3535 of 40 of 40
Saturation/Wraparound Saturation/Wraparound Arithmetic (cont.)Arithmetic (cont.)
Saturation: add with signed saturation Saturation: add with signed saturation (PADDS)(PADDS)
a3 a2 a1 7FF4h
b3 b2 b1 0050h
a3+b3 a2+b2 a1+b1 7FFFh
+
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 3636 of 40 of 40
Adding 8 8-bit Integers Adding 8 8-bit Integers with Saturationwith Saturation
X0 dq X0 dq 80808080555555558080808055555555 X1 dq X1 dq 000090903333FFFF000090903333FFFF
MOVQ MOVQ mm0,X0mm0,X0 PADDSB PADDSB mm0,X1 mm0,X1
Result mm0 = Result mm0 = 808080807F7F5454808080807F7F545480h+00h=80h (addition with zero)80h+90h=80h (saturation to maximum negative value)55h+33h=7fh (saturation to maximum positive value)55h+FFh=54h (subtraction by one)
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 3737 of 40 of 40
Parallel CompareParallel Compare
PCMPGT[B/W/D] mmxreg, r/m64PCMPGT[B/W/D] mmxreg, r/m64
PCMPEQ[B/W/D] mmxreg, r/m64PCMPEQ[B/W/D] mmxreg, r/m642323 4545 1616 3434
Eq?Eq? Eq?Eq? Eq?Eq? Eq?Eq?
3131 77 1616 6767
0000h0000h 0000h0000h FFFFhFFFFh 0000h0000h
2323 4545 1616 3434
Gt?Gt? Gt?Gt? gt/?gt/? Gt?Gt?
3131 77 1616 6767
0000h0000h FFFFhFFFFh 0000h0000h 0000h0000h
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 3838 of 40 of 40
Use of FPU Registers for Use of FPU Registers for Storing MMX DataStoring MMX Data
The EMMS (empty MMX-state) instruction sets (11) all The EMMS (empty MMX-state) instruction sets (11) all the tags in the FPU, so the floating-point registers are the tags in the FPU, so the floating-point registers are listed as emptylisted as emptyEMMS must be executed before the return instruction at EMMS must be executed before the return instruction at the end of any MMX procedurethe end of any MMX procedure
otherwise any subsequent floating point operation will cause a otherwise any subsequent floating point operation will cause a floating point interrupt error, potentially crashing your applicationfloating point interrupt error, potentially crashing your application
if you use floating point within MMX procedure, you must use if you use floating point within MMX procedure, you must use EMMS instruction before executing the floating point instructionEMMS instruction before executing the floating point instruction
Any MMX instruction resets (00) all FPU tag bits, so the Any MMX instruction resets (00) all FPU tag bits, so the floating-point registers are listed as validfloating-point registers are listed as valid
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 3939 of 40 of 40
Streaming SIMD ExtensionsStreaming SIMD ExtensionsIntel Pentium IIIIntel Pentium III
Streaming SIMD defines a new architecture for Streaming SIMD defines a new architecture for floating point operationsfloating point operationsOperates on IEEE-754 Single-precision 32-bit Operates on IEEE-754 Single-precision 32-bit Real Numbers Real Numbers Uses eight new 128-bit wide general-purpose Uses eight new 128-bit wide general-purpose registers (XMM0 - XMM7)registers (XMM0 - XMM7)Introduced in Pentium III in March 1999 Introduced in Pentium III in March 1999 Pentium III includes floating point, MMX technology, Pentium III includes floating point, MMX technology,
and XMM registersand XMM registers
ECE 291 Lecture 13ECE 291 Lecture 13
Page Page 4040 of 40 of 40
Streaming SIMD ExtensionsStreaming SIMD ExtensionsIntel Pentium III (cont.)Intel Pentium III (cont.)
Supports packed and scalar operations on the Supports packed and scalar operations on the new packed single precision floating point data new packed single precision floating point data typestypesPackedPacked instructions operate vertically on four instructions operate vertically on four pairs of floating point data elements in parallelpairs of floating point data elements in parallel instructions have suffix instructions have suffix psps, e.g., addps, e.g., addps
ScalarScalar instructions operate on the least- instructions operate on the least-significant data elements of the two operands significant data elements of the two operands instructions have suffix instructions have suffix ssss, e.g., addss, e.g., addss