Introduction to MMX, XMM, SSE and SSE2 Technology
Multimedia Extension, Streaming SIMD Extension
11/23/98, 5/6/99, 2/5/03, 5/10/04, 5/4/05

SISD - Single Instruction, Single Data
Traditional computers
In general, one instruction processes one data item
[Diagram: Control Unit, Execution Unit, Memory]

SIMD - Single Instruction, Multiple Data
One instruction can process multiple data items
Useful when large amounts of regularly organized data are processed
Example: Matrix and vector calculations
This is the basis of MMX and XMM
[Diagram: Control Unit, Execution Units, Memory]

MISD
[Diagram: Memory, Control Unit, Execution Units]
MISD: Multiple instructions process one data item.

MIMD
[Diagram: Memory and four Control Unit / Execution Unit pairs]
MIMD: Multiple instructions process multiple data items.

Your Turn
How would you classify a traditional computer under this system?
How would you classify a Shemp which has multiple processors?
How would you classify a computer having an Intel Dual Core processor?

Potential Applications of MMX and SSE
graphics
MPEG video/image processing
music synthesis
speech compression/recognition
video conferencing
matrix and vector calculations
Advanced 3D graphics (SSE2)
Speech recognition (SSE2)
Scientific and engineering applications (SSE2)

MMX
4 new data types
New instructions
Uses the 8 existing floating point registers (64 bits of each 80 bit register)

The floating point registers
Floating point is processed by eight 80 bit registers ST(0), ST(1), …, ST(7) in the floating point unit.
When doing floating point arithmetic, these registers are organized in a stack.
Programming floating point is quite different from programming integer arithmetic.
Floating point calculations are done using 80 bits even when the program specifies storing 32 or 64 bit data values.

Advantages of using the floating point registers in MMX
The registers already exist. Only logic had to be added to the chip.
The operating system already knows about the floating point registers.
When a computer switches from one program to another, the state (registers) of the current program must be saved so that the state can be restored when the program becomes the active program once again.
The floating point registers are automatically saved as part of the state of a program.
MMX worked under existing operating systems!

New data types for MMX
64 bits long. One data item can store:
8 one byte integers
4 two byte integers
2 four byte integers
1 eight byte integer
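
The packed-byte view can be modeled in a few lines of Python. This is only a sketch of the idea, not MMX code: the helper names `pack_bytes`, `unpack_bytes`, and `packed_add_bytes` are made up for illustration, and lane 0 is assumed to sit in the lowest byte.

```python
LANES = 8      # eight one byte lanes in one 64 bit value
MASK = 0xFF    # one byte

def pack_bytes(values):
    """Pack eight integers (0..255) into one 64 bit integer, lane 0 lowest."""
    assert len(values) == LANES
    word = 0
    for i, v in enumerate(values):
        word |= (v & MASK) << (8 * i)
    return word

def unpack_bytes(word):
    """Split a 64 bit integer back into its eight one byte lanes."""
    return [(word >> (8 * i)) & MASK for i in range(LANES)]

def packed_add_bytes(a, b):
    """Add lane by lane with wraparound; a carry never crosses into the next lane."""
    return pack_bytes([(x + y) & MASK
                       for x, y in zip(unpack_bytes(a), unpack_bytes(b))])

a = pack_bytes([1, 2, 3, 4, 5, 6, 7, 255])
b = pack_bytes([10, 20, 30, 40, 50, 60, 70, 1])
print(unpack_bytes(packed_add_bytes(a, b)))  # last lane wraps: 255 + 1 -> 0
```

The point of the model is that the eight sums are independent: the overflow in the last lane does not disturb its neighbors, which is what distinguishes a packed add from an ordinary 64 bit add.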

SSE and SSE2
SSE – Streaming SIMD Extensions
SSE introduced eight 128 bit XMM registers
These registers are disjoint from the floating point/MMX registers
SSE (Pentium III) can handle 4 single precision floating point numbers
SSE2 (Pentium 4) can also handle 2 double precision floating point numbers

New data types for XMM
128 bits. Can be used as:
16 one byte integers
8 two byte integers
4 doubleword integers or single precision floating point values
2 quadword integers or double precision floating point values
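
One way to see that these are different views of the same 128 bits is to reinterpret one 16 byte buffer with Python's standard `struct` module. This is an illustrative sketch only; the buffer contents and the little-endian layout are assumptions chosen for the example.

```python
import struct

raw = bytes(range(16))  # 16 arbitrary bytes = 128 bits

as_bytes = struct.unpack("<16B", raw)   # 16 one byte integers
as_words = struct.unpack("<8H", raw)    # 8 two byte integers
as_dwords = struct.unpack("<4I", raw)   # 4 doubleword integers
as_qwords = struct.unpack("<2Q", raw)   # 2 quadword integers
as_floats = struct.unpack("<4f", raw)   # 4 single precision values
as_doubles = struct.unpack("<2d", raw)  # 2 double precision values

for name, view in [("bytes", as_bytes), ("words", as_words),
                   ("doublewords", as_dwords), ("quadwords", as_qwords),
                   ("floats", as_floats), ("doubles", as_doubles)]:
    print(name, len(view))
```

The bits never change; only the instruction chosen decides how many lanes they are split into.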

Your turn
Your program uses 3 arrays of 160,000 one byte integers each. We need to add the elements in the first two arrays to calculate the third array.
Using a standard Pentium, how many “operations” are needed? (One operation includes loading 2 values into the CPU, adding, storing the result, and the associated loop processing.)
How many XMM operations would be needed?

New instructions
Process the new data types 16, 8, 4, or 2 data items (64 bits or 128 bits) at a time.
Types of instructions:
Add / Subtract
Multiply / Multiply and add
Shift
Logical (AND, AND NOT, OR, XOR)
Pack and unpack
Move
Shuffle and unpack (SSE)

Saturation
Handling overflow when adding 16, 8, 4, or 2 values at a time is a problem. Programmers can specify that when overflow occurs, the “sum” should be replaced by the maximum legal value.
Example: Unsigned byte addition 80h + A0h = 120h ===> overflow. Instead the machine stores FFh.
Likewise when subtracting: a result below the minimum legal value is replaced by the minimum (00h for unsigned bytes).
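
The difference between wraparound and saturating arithmetic on unsigned bytes can be sketched as follows. The function names are hypothetical; they model the behavior described on the slide (the MMX instructions of this kind are PADDUSB and PSUBUSB) rather than calling any real SIMD instruction.

```python
def add_wraparound_u8(a, b):
    """Ordinary byte addition: the carry out of bit 7 is simply lost."""
    return (a + b) & 0xFF

def add_saturate_u8(a, b):
    """Saturating byte addition: overflow is clamped to the maximum, FFh."""
    return min(a + b, 0xFF)

def sub_saturate_u8(a, b):
    """Saturating byte subtraction: underflow is clamped to the minimum, 00h."""
    return max(a - b, 0x00)

def packed_add_saturate_u8(xs, ys):
    """Apply the saturating add independently to every lane."""
    return [add_saturate_u8(x, y) for x, y in zip(xs, ys)]

print(hex(add_wraparound_u8(0x80, 0xA0)))  # 0x20
print(hex(add_saturate_u8(0x80, 0xA0)))    # 0xff
print(hex(sub_saturate_u8(0x10, 0x20)))    # 0x0
```

For pixel data, saturation is usually the wanted behavior: a brightness that overflows should stay white rather than wrap around to a dark value.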

Comparison operations
Consider <, >, <=, >=, =, and <> operations.
Consider comparing two 64 bit quantities each holding 8, 4, or 2 values.
Comparing multiple values at a time is a problem. So the MMX instructions store 0 for false and -1 (all bits set) for true for each individual data item.
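
A small Python model shows why an all-ones mask is a convenient encoding of "true": the mask can be ANDed with the data to select values lane by lane without any branch. The names `cmp_gt_bytes` and `select` are invented for this sketch; the comparison is modeled on a signed byte greater-than (as in PCMPGTB).

```python
def to_signed8(x):
    """Interpret a byte value 0..255 as a signed 8 bit integer."""
    return x - 256 if x >= 128 else x

def cmp_gt_bytes(xs, ys):
    """Per lane: FFh (all ones, i.e. -1) if x > y as signed bytes, else 00h."""
    return [0xFF if to_signed8(x) > to_signed8(y) else 0x00
            for x, y in zip(xs, ys)]

def select(mask, xs, ys):
    """Per lane: take x where the mask is all ones, y where it is zero."""
    return [(x & m) | (y & ~m & 0xFF) for m, x, y in zip(mask, xs, ys)]

a = [5, 200, 7, 0]   # 200 is -56 when read as a signed byte
b = [3, 100, 7, 1]
mask = cmp_gt_bytes(a, b)
print([hex(m) for m in mask])  # ['0xff', '0x0', '0x0', '0x0']
print(select(mask, a, b))      # lane-wise signed maximum: [5, 100, 7, 1]
```

The compare followed by AND/OR is a branch-free way to compute a per-lane maximum, which is the usual reason the result is stored as a mask instead of setting flags.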

Example 1: Calculating Dot Products
Consider calculating $S = \sum_{i=0}^{7} A_i B_i$ using MMX
Assume Ai and Bi are stored as signed 16 bit integers.
Assume that the products and sums should be calculated using 32 bits.
Assume that all values have two “binary” places.

Example 1: Calculating Dot Products
Storing A and B (64 bit vectors)
Byte offset: 0 2 4 6 8 10 12 14
Subscript: 0 1 2 3 4 5 6 7
(The same layout is used for A and for B.)
We store each Ai and Bi item as a 16 bit integer, 4 per 64 bit data item. Assume each value has 2 binary places.

Example 1: Calculating Dot Products
Multiply and add instruction
[Diagram: eight multiplications (*) feed four additions (+)]
Example values: (2, 20) times (3, 40) gives 2*3 + 20*40 = 806; (4, 30) times (5, 50) gives 4*5 + 30*50 = 1520

Example 1: MMX: Calculating Dot Products
Packed Multiply and add instruction: eight multiplications (*), four additions (+)
Packed Add: two additions (+)
(Normal) Add: one addition (+)
Example values: 2*3 + 20*40 = 806; 4*5 + 30*50 = 1520; 806 + 1520 = 2326

Example 1: Calculating Dot Products
Approximate algorithm
– Load left half of A into a FP register.
– Multiply and add by left half of B.
– Shift products right 2 bits. (Products should have only two binary places.)
– Repeat with right halves of A and B using a different register.
– Add the second sum to the first.
– Store the result.
[Slide annotations: "4 words at a time", "Two doublewords at a time"]

Example 1: Calculating Dot Products
Approximate algorithm (Conclusion)
– Add the two sums together in EAX to get the final sum.
[Slide annotation: "1 doubleword at a time"]
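
The approximate algorithm can be traced with a scalar Python model. This is a sketch under stated assumptions, not the Intel code: `pmaddwd` here is a plain function that imitates the packed multiply and add (4 words in, 2 doublewords out), 32 bit overflow is ignored, and the sample vectors are invented. Values carry two binary places, so a real value v is stored as v * 4.

```python
def pmaddwd(a4, b4):
    """Model of packed multiply and add: 4 words x 4 words -> 2 doublewords."""
    return [a4[0] * b4[0] + a4[1] * b4[1],
            a4[2] * b4[2] + a4[3] * b4[3]]

def shift_right(d2, bits):
    """Shift both doublewords right (Python's >> on ints is arithmetic)."""
    return [d >> bits for d in d2]

def packed_add(x2, y2):
    """Add two pairs of doublewords lane by lane."""
    return [x2[0] + y2[0], x2[1] + y2[1]]

def dot_product(a8, b8, frac_bits=2):
    """Follow the slide's steps for 8 element vectors of fixed point words."""
    left = shift_right(pmaddwd(a8[:4], b8[:4]), frac_bits)   # left halves
    right = shift_right(pmaddwd(a8[4:], b8[4:]), frac_bits)  # right halves
    sums = packed_add(left, right)   # two doublewords at a time
    return sums[0] + sums[1]         # final scalar add (EAX on the slide)

# The example values from the slides, treated as plain integers.
print(pmaddwd([2, 20, 4, 30], [3, 40, 5, 50]))  # [806, 1520]

# Hypothetical fixed point data: real values times 4 (two binary places).
A_FIXED = [4, 8, 12, 16, 2, 1, 6, 8]   # 1, 2, 3, 4, 0.5, 0.25, 1.5, 2
B_FIXED = [4, 4, 4, 4, 8, 16, 8, 4]    # 1, 1, 1, 1, 2, 4, 2, 1
s = dot_product(A_FIXED, B_FIXED)
print(s, s / 4)  # 68 in fixed point, i.e. 17.0
```

The shift is needed because multiplying two values with two binary places each yields a product with four; shifting right by 2 brings the result back to the common format.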

Example 1: Calculating Dot Products
Intel claims that standard Pentiums would require 40 instructions to carry this out. Using MMX technology, only 13 instructions are needed. Speed improves by an even greater ratio.

Example 2: 24-bit color video blending
Suppose we are displaying 640 by 480 pixel video that uses 24 bit colors - 8 bits for red, 8 for green, and 8 for blue.
Suppose we are currently showing one picture which we want to fade out and replace by “fading” in a second picture.
Suppose that we want to do the fade out/in in 255 steps.

Example 2: 24-bit color video blending
For each step, for each of 3 colors and for each of the 640 by 480 pixels we must calculate:
Result_pixel = NewPicture_pixel * (i/255) + OldPicture_pixel * (1 - (i/255))
where “i” is the step counter.
This formula must be calculated 640 * 480 * 3 * 255 = 235,008,000 times on 8 bit data!
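
The blending formula and the count above can be checked with a short Python sketch. The function name `blend` and the integer form of the formula (multiply first, divide by 255 last, truncating) are assumptions made for this example.

```python
def blend(old, new, i, steps=255):
    """Result = new * (i/steps) + old * (1 - i/steps), using integers only."""
    return (new * i + old * (steps - i)) // steps

WIDTH, HEIGHT, CHANNELS, STEPS = 640, 480, 3, 255
evaluations = WIDTH * HEIGHT * CHANNELS * STEPS

print(blend(200, 50, 0))     # step 0 shows only the old picture: 200
print(blend(200, 50, 51))    # one fifth of the way: 170
print(blend(200, 50, 255))   # last step shows only the new picture: 50
print(f"{evaluations:,}")    # 235,008,000
```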

Example 2: 24-bit color video blending
Intel calculates that this requires execution of 1.4 billion instructions on a standard PC even ignoring the calculation of i/255 and (1-i/255) and loop control.
With MMX, we can calculate 4 values in parallel. The number of MMX instructions would be 525 million. (Because the multiply instruction only applies to word data, the byte data must be unpacked into words and repacked after the calculation.)
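
The unpack, multiply, repack flow can be imitated four values at a time in Python. This is only a model of the data flow: `blend4` and `blend_row` are hypothetical names, and the exact division by 255 is kept for clarity even though real MMX code would have to replace it with multiplies and shifts.

```python
WORD_MAX = 0xFFFF  # largest unsigned 16 bit word

def blend4(old4, new4, i, steps=255):
    """Blend four byte values as if they were held in four 16 bit word lanes."""
    # Unpack: widen the four bytes to four words (zero extension).
    old_w = [b & 0xFF for b in old4]
    new_w = [b & 0xFF for b in new4]
    # Multiply and add on words; the largest case, 255 * 255 = 65025, still fits.
    mixed = [n * i + o * (steps - i) for o, n in zip(old_w, new_w)]
    assert all(m <= WORD_MAX for m in mixed)
    # Scale back down, then pack the four words into four bytes again.
    return [min(m // steps, 0xFF) for m in mixed]

def blend_row(old, new, i):
    """Process a row of byte values in groups of four."""
    out = []
    for k in range(0, len(old), 4):
        out.extend(blend4(old[k:k + 4], new[k:k + 4], i))
    return out

old_row = [200, 0, 255, 10, 1, 2, 3, 4]
new_row = [50, 255, 0, 10, 4, 3, 2, 1]
print(blend_row(old_row, new_row, 51))
```

The widening step is what limits the speedup to 4 values per instruction instead of 8: each byte needs a 16 bit lane while it is being multiplied.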

Also included in MMX
Intel increased cache size when MMX was introduced (necessary for SIMD machines)
Programs run faster on MMX machines even if the SIMD instructions are not used
Excellent marketing:
– Programs run faster on MMX machine
– People want/buy MMX
– Software publishers are encouraged to rewrite programs to take advantage of the new instructions

Information source
http://www.intel.com/drg/mmx/manuals/overview/index.htm#intro (no longer available)
http://developer.intel.com/drg/mmx/manuals/ (no longer available)
http://www.intel.com/design/Pentium4/manuals/24547012.pdf (IA-32 Intel Architecture Software Developer’s Manual, vol. 1)
This slide show is MMX.PPT

Your Turn
1. Characterize the kinds of problems where SIMD is helpful.
2. Give examples of problems where SIMD is useful.