amd 3dnow!tm technology: architecture and...
TRANSCRIPT
![Page 1: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/1.jpg)
AMD 3DNow!AMD 3DNow! TMTM Technology: Technology:Architecture and ImplementationsArchitecture and Implementations
Stuart F. Obermane-mail: [email protected]
California Microprocessor DivisionAdvanced Micro Devices
![Page 2: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/2.jpg)
OUTLINEOUTLINE
• Motivation for 3DNow! Technology• Features of the 3DNow! instruction set• K6-2 and K7 implementations• 3D graphics performance• Future
![Page 3: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/3.jpg)
Acceleration of Multimedia ApplicationsAcceleration of Multimedia Applications
• Multimedia applications have become an integralpart of the PC platform– Multimedia algorithms are computationally intensive
Application orGame Physics
floating pointintensive
Geometry transform,Clipping, & Lighting
TriangleSet up
PixelRendering
floating pointintensive
floating point &integer intensive
integerintensive
![Page 4: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/4.jpg)
Motivation for 3DNow! TechnologyMotivation for 3DNow! Technology
• Why a new technology?– Previous focus has been on integer intensive pixel
rendering tasks: MMX and 3D graphics hardware– 3D graphics performance now limited by floating-point
intensive front-end of graphics pipeline– New applications require realistic physical modeling
• What is it?– New set of instructions to accelerate FP computation– Defined in collaboration with leading ISV’s– Maximizes the performance of graphics accelerator
cards by improving the front-end of graphics pipeline
![Page 5: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/5.jpg)
3DNow! Technology3DNow! Technology
• Benefits– Accelerates most floating-point intensive multimedia
operations• Graphics pipeline (physics, geometry, setup)• Audio processing
Graphic Cards Accelerates
Application orGame Physics
Geometry transform,Clipping, & Lighting
TriangleSet up
PixelRendering
floating pointintensive
floating pointintensive
floating point &integer intensive
integerintensive
3DNow! Technology Accelerates
![Page 6: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/6.jpg)
3DNow! Technology3DNow! Technology
• 21 new instructions– “LEAN and MEAN” design philosophy– Includes only performance critical features
• SIMD floating-point instructions– Compatible with IEEE single precision data type– Two 32-bit FP values per 64-bit reg/mem operand– Uses MMX registers -> avoids x87 register stack– No exceptions– Limited rounding modes– No switching overhead between 3DNow! and MMX– Peak throughput of 4 FLOPS per cycle
• No core OS support required
![Page 7: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/7.jpg)
Classes of Instructions - FPClasses of Instructions - FP
• Basic arithmetic– PFADD, PFSUB, PFSUBR, PFACC, PFMUL
• Comparisons– PFCMPEQ, PFCMPGT, PFCMPGE
• Min/max– PFMIN, PFMAX
• Conversions– PF2ID, PI2FD
• Reciprocal and reciprocal square root– PFRCP, PFRSQRT– PFRCPIT1, PFRSQIT1, PFRCPIT2
![Page 8: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/8.jpg)
Classes of Instructions - Non FPClasses of Instructions - Non FP
• Integer– PAVGUSB, PMULHRW
• Data movement– PREFETCH/PREFETCHW
• Overhead reduction– FEMMS
![Page 9: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/9.jpg)
Reciprocal and Reciprocal Square RootReciprocal and Reciprocal Square Root
• Alternative to “classical” DIV and SQRT– Reciprocal and reciprocal square root frequently used
in graphics applications– Higher performance through reuse of common
divisors and radicands
• Choice of reduced (14-15b) or full precision– Reduced precision sufficient for many applications
and is higher performance– Avoid all long latency operations; full precision
synthesized from fully-pipelined Newton-Raphsoniterations ops
![Page 10: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/10.jpg)
Reciprocal Iteration InstructionsReciprocal Iteration Instructions
• Reciprocal Newton-Raphson iteration– To compute full-precision reciprocal of b using initial
approximation R0:• Rfull = R0 x (2 - b x R0)
– R0 is accurate to about 14 bits, b is a 24 bit number– PFRCPIT1 performs b x R0 rounded to 32 bits, inverts
the result (one’s complement), and compresses outleading 8 bits known to be identical, leaving 24 bits
– PFRCPIT2 expands the previous result to 32 bits,multiplies by R0, adds a fixed bias, and rounds to 24bits
![Page 11: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/11.jpg)
Reciprocal AccuracyReciprocal Accuracy
• Fast approximations– PFRCP accurate to 14.9 bits– PFRSQRT accurate to 15.8 bits
• Full precision– PFRCP, PFRCPIT1, PFRCPIT2 sequence provides
IEEE RN result for > 99% of all operands; remainingdiffer by 1 unit-in-the-last-place
– PFRSQRT, PFMUL, PFRSQIT1, PFRCPIT2sequence provides IEEE RN result for > 87% of alloperands; remaining differ by 1 unit-in-the-last-place
![Page 12: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/12.jpg)
AMD-K6-2 MicroprocessorAMD-K6-2 Microprocessor
• Worldwide launch May 28, 1998• Implemented in 0.25um CMOS process• 9.3M transistors on a die of 80 mm²• New features of AMD-K6-2
– Superscalar 3DNow! and MMX units• Dual decode and dual execution pipelines• No decode pairing restrictions• Only one cycle misalignment penalty on memory accesses
– 100 MHz Front Side bus• Increases local bus and L2 cache bandwidth by 50%• Redesigned I/O timing to allow for low cost 100 MHz
motherboard
![Page 13: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/13.jpg)
AMD-K6-2 Block DiagramAMD-K6-2 Block Diagram
StoreUnit
StoreQueue
InstructionControl Unit
SchedulerBuffer
(24 RISC86)Six RISC86Six RISC86Operation IssueOperation Issue
Four RISC86Four RISC86DecodeDecode
Out-of-OrderOut-of-OrderExecution EngineExecution Engine
Level-One Dual-Port Data Cache(32KByte) 128 Entry DTLB
32KByte Level-One Instruction Cache
20KByte Predecode Cache
Dual Instruction Decodersx86 to RISC86
16 Byte Fetch16 Byte Fetch
LoadUnit
Floating- PointUnit
BranchResolution Unit
PredecodeLogic
100 MhzSuper7
BusInterface
Register Unit X(Integer/MMX/3D)
Register Unit Y(Integer/MMX/3D)
Branch Logic(8192-Entry BHT)
(16-Entry BTC)(16-Entry RAS)
64 Entry ITLB
Level-One CacheController
StoreUnit
StoreQueue
InstructionControl Unit
SchedulerBuffer
(24 RISC86)Six RISC86Six RISC86Operation IssueOperation Issue
Four RISC86Four RISC86DecodeDecode
Out-of-OrderOut-of-OrderExecution EngineExecution Engine
Level-One Dual-Port Data Cache(32KByte) 128 Entry DTLB
32KByte Level-One Instruction Cache
20KByte Predecode Cache
Dual Instruction Decodersx86 to RISC86
16 Byte Fetch16 Byte Fetch
LoadUnit
Floating- PointUnit
BranchResolution Unit
PredecodeLogic
100 MhzSuper7
BusInterface
Register Unit X(Integer/MMX/3D)
Register Unit Y(Integer/MMX/3D)
Branch Logic(8192-Entry BHT)(16-Entry BTC)(16-Entry RAS)
64 Entry ITLB
Level-One CacheController
![Page 14: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/14.jpg)
K6 / K6-2 K6 / K6-2 MicroarchitecturalMicroarchitectural Features Features
• Out-of-order superscalar pipeline– Fetch and decode up to 4 internal ops per cycle– Issue up to 6 ops per cycle to 10 functional units– Loads, stores, integer ops, FPU, MMX/3DNow! Ops can be
performed out-of-order with respect to each other
• Pipeline is 6 to 8 stages deepIF DEC SC RF EX1 EX2* EX3* RET
• L1 Cache -- 32KByte I and 32KByte D write-back• Socket 7 processor bus at 66 MHz (K6) or 100 MHz (K6-2)
![Page 15: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/15.jpg)
AMD-K6-2 Multimedia UnitsAMD-K6-2 Multimedia Units
MMXALU
MMXALU
FP Add
Recip /RecipSqrt
MMX+FPMultiplier
MMXShifter
Register X ExecutionPipeline
Register Y ExecutionPipeline
Shared Resources
![Page 16: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/16.jpg)
AMD-K6-2 Multimedia PerformanceAMD-K6-2 Multimedia Performance
Instruction TypeLatency(cycles)
Throughput(cycles)
3DNow! FP 2 1
3DNow! / MMXinteger ALU 1 1
MMX multiply 2 1
![Page 17: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/17.jpg)
Recip Recip / / Recip Sqrt Recip Sqrt PerformancePerformance
K6-2Performance
PIIPerformance
14 bit reciprocal 2 cyclespipelined
-
15 bit reciprocalsquare root
2 cyclespipelined
-
24 bit reciprocal 6 cyclespipelined
~ 17 cyclesnon-pipelined
24 bit reciprocalsquare root
8 cyclespipelined
~ 28 cyclesnon-pipelined
![Page 18: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/18.jpg)
Shared 3DNow! and MMX MultiplierShared 3DNow! and MMX Multiplier
Data Format Selection
Booth 2 Encoders
Booth Muxes
4-2 CSA4-2 CSA4-2 CSA3,2 CSA
17 PP
4-2 CSA 3,2 CSA 3,2 CSA
MUX
64 bits
32 bits
64b CPA 64b CPA
Final Result Selection (32b)
Bin
ary
Tre
e
MMX Data Alignment
AL x BL
AH x BHZeroed out
SignExtended
EX1
EX2
16b16b
![Page 19: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/19.jpg)
3D3D Winbench Winbench 98 Performance 98 Performance
![Page 20: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/20.jpg)
Quake II Platform ConfigurationsQuake II Platform Configurations
K6-2 300/100 PII 300/66
OS: Win95 OS: Win95
Motherboard: Epox 51MVP3E-M Motherboard: Asus PB2 (BX)
BIOS Date: 6/1/98 BIOS Date: 4/2/98
Memory: 64MB Memory: 64MB
2D Video: Diamond Viper 330 (RIVA 128) 2D Video: Diamond Viper 330 (RIVA 128)
3D Video: Diamond VD2 (12MB) 3D Video: Diamond VD2 (12MB)
HDD: Maxtor DM 4.3GB HDD: Maxtor DM 4.3GB
![Page 21: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/21.jpg)
Quake II Performance - DEMO1Quake II Performance - DEMO1Highest Performance without Sound
Timedemo: DEMO1.DM2
80.079.1
68.2
77.1
60.2
72.573.7
64.0
57.1
72.1
56.0
59.0
62.0
65.0
68.0
71.0
74.0
77.0
80.0
83.0
86.0
640x480 800x600 1024x768
Display Resolution (horizontal x vertical )
Ave
rage
Fra
me
s P
er
Se
cond
(F
PS
)
AMD-K6-2, Dual Voodoo2
AMD-K6-2, Single Voodoo2
Intel Pentium-II, Dual Voodoo2
Intel Pentium-II, Single Voodoo2
![Page 22: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/22.jpg)
Quake II Performance - MASSIVE1Quake II Performance - MASSIVE1
Hig hest Perform ance without SoundTim edem o: MASSIVE1.DM2
53.3
57.357.5
48.4
56.9
48.7
51.4 51.3
53
45.5
44
46
48
50
52
54
56
58
60
62
640x480 800x600 1024x768
D is p la y R eso lution (horiz ontal x vertical )
Ave
rage
Fra
me
s P
er S
eco
nd (
FP
S)
AM D-K6-2 , Dua l V oodoo2
AM D-K6-2 , S in g le V oodoo2
In te l P e n tium -II , Dua l V oodoo2
In te l P e n tium -II , S in g le V oodoo2
![Page 23: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/23.jpg)
K6 Family and FutureK6 Family and Future
AMD-K6
AMD-K6
AMD-K6-2
AMD-K6-3
1H’97 2H’97 1H’98 1Q’99 2Q’99
Per
form
ance
MH
z
200
300
400
.25 µ process 100 MHz BusOn-chip, Processor Speed Backside L2100 MHz Frontside L3Superscalar MMX3DNow! Technology21.3M Transistors118 mm²
.35 µ processMMX Enhanced8.8M Transistors162 mm²
.25 µ processLower PowerHigher Speeds8.8M Transistors68 mm²
.25 µ process Bus 100 MHz 100 MHz Frontside L2Superscalar MMX TM
3DNow! TM Technology9.3M Transistors80 mm²
µP Forum ‘98October 13ISSCC ‘99
K7
![Page 24: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/24.jpg)
K7 K7 MicroarchitectureMicroarchitecture
![Page 25: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/25.jpg)
K7 Floating Point and Multimedia UnitK7 Floating Point and Multimedia Unit
![Page 26: AMD 3DNow!TM Technology: Architecture and Implementationsweb.stanford.edu/class/ee382/MISC/amd3dnow.pdf · – Uses MMX registers -> avoids x87 register stack – No exceptions –](https://reader033.vdocument.in/reader033/viewer/2022042216/5ebedfcd44f5df061714580c/html5/thumbnails/26.jpg)
3DNow! Floating Point 3DNow! Floating Point DatatypeDatatype
D1 D0
63 32 31 0
S
31 23 22 0
Biased Exponent Significand
(32 bits x 2) Two packed, single-precision, floating-point doublewords
D0 and D1 each hold an IEEE 32-bit single precision value: