434dds 45 w79
TRANSCRIPT
-
8/16/2019 434dds 45 w79
1/276
Advanced Micro Devices
AMD64 Technology
AMD64 Architecture
Programmer’s Manual
Volume 6:
128-Bit and 256-Bit
XOP and FMA4 Instructions
Publication No. Revision Date
43479 3.04 November 2009
Trademarks
AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc.
MMX is a trademark of Intel Corporation.
Other product names used in this publication are for identification purposes only and may be trademarks of their
respective companies.
©2009 Advanced Micro Devices, Inc. All rights reserved.
The contents of this document are provided in connection with Advanced Micro
Devices, Inc. (“AMD”) products. AMD makes no representations or warranties with
respect to the accuracy or completeness of the contents of this publication and
reserves the right to make changes to specifications and product descriptions at
any time without notice. The information contained herein may be of a preliminary
or advance nature and is subject to change without notice. No license, whether
express, implied, arising by estoppel or otherwise, to any intellectual property rights
is granted by this publication. Except as set forth in AMD’s Standard Terms and
Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any
express or implied warranty, relating to its products including, but not limited to, the
implied warranty of merchantability, fitness for a particular purpose, or infringement
of any intellectual property right.
AMD’s products are not designed, intended, authorized or warranted for use as
components in systems intended for surgical implant into the body, or in other
applications intended to support or sustain life, or in any other application in which the
failure of AMD’s product could create a situation where personal injury, death, or
severe property or environmental damage may occur. AMD reserves the right to
discontinue or make changes to its products at any time without notice.
43479—Rev. 3.04—November 2009 AMD64 Technology Documentation Updates
Contents
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11
1 New 128-Bit and 256-Bit Instructions . . . . . . . . . . . . . . . . . . . . . . . . . .25
1.1 New Instruction Format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .25
1.2 Opcode Byte . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .28
1.3 Destination XMM registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .29
1.4 Four-Operand Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .29
1.5 Three-Operand Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .29
1.6 Two Operand Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .30
1.7 XOP Integer Multiply (Add) and Accumulate Instructions . . . . . . . . . . . . . . .31
1.8 Packed Integer Horizontal Add and Subtract . . . . . . . . . . . . . . . . . . . . .33
1.9 Vector Conditional Moves. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .34
1.10 Packed Integer Rotates and Shifts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .34
1.11 Packed Integer Comparison and Predicate Generation . . . . . . . . . . . . . .35
1.12 Fraction Extract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .36
2 AMD XOP and FMA4 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . .39
2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .39
2.2 Operand Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .40
2.3 Instruction Reference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .41
VFMADDPD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .42
VFMADDPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .46
VFMADDSD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .50
VFMADDSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .53
VFMADDSUBPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .56
VFMADDSUBPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .60
VFMSUBADDPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .64
VFMSUBADDPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .68
VFMSUBPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .72
VFMSUBPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .76
VFMSUBSD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .80
VFMSUBSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .84
VFNMADDPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .88
VFNMADDPS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .91
VFNMADDSD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .94
VFNMADDSS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .97
VFNMSUBPD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .100
VFNMSUBPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .104
VFNMSUBSD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .108
VFNMSUBSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .112
VFRCZPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .116
VFRCZPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .119
VFRCZSD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .122
AMD Confidential-Advance Information
VFRCZSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .126
VPCMOV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .130
VPCOMB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .133
VPCOMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .136
VPCOMQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .139
VPCOMUB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .142
VPCOMUD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .145
VPCOMUQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .148
VPCOMUW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .151
VPCOMW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .154
VPERMIL2PD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .157
VPERMIL2PS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .163
VPHADDBD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .169
VPHADDBQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .171
VPHADDBW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .173
VPHADDDQ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .175
VPHADDUBD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .177
VPHADDUBQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .179
VPHADDUBW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .181
VPHADDUDQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .183
VPHADDUWD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .185
VPHADDUWQ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .187
VPHADDWD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .189
VPHADDWQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .191
VPHSUBBW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .193
VPHSUBDQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .195
VPHSUBWD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .197
VPMACSDD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .199
VPMACSDQH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .202
VPMACSDQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .205
VPMACSSDD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .208
VPMACSSDQH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .211
VPMACSSDQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .214
VPMACSSWD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .217
VPMACSSWW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .220
VPMACSWD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .223
VPMACSWW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .226
VPMADCSSWD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .229
VPMADCSWD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .232
VPPERM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .235
VPROTB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .239
VPROTD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .242
VPROTQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .245
VPROTW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .248
VPSHAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .251
VPSHAD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .254
VPSHAQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .257
VPSHAW. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .260
VPSHLB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .263
VPSHLD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .266
VPSHLQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .269
VPSHLW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .272
Tables
Table 1-1. VEX.pp Prefix Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .28
Table 1-2. Operand Element Size—OES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .29
Table 1-3. Operand Configurations for FMA4, VPCMOV and VPPERM Instructions . . . . .29
Table 1-4. Operand Configurations for Three Operand Instructions . . . . . . . . . . . . . .30
Table 1-5. Immediate Operand Values for Unsigned Vector Comparison Operations . . . .35
Table 2-1. VPCOMB Comparison Operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .133
Table 2-2. VPCOMD Comparison Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . .136
Table 2-3. VPCOMQ Comparison Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . .139
Table 2-4. VPCOMUB Comparison Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . .142
Table 2-5. VPCOMUD Comparison Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . .145
Table 2-6. VPCOMUQ Comparison Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . .148
Table 2-7. VPCOMUW Comparison Operations. . . . . . . . . . . . . . . . . . . . . . . . . . . .151
Table 2-8. VPCOMW Comparison Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . .154
Table 2-9. Selector and Source Selected . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .158
Table 2-10. Interaction of Selector Match Bit and Immediate Operand Match Field .159
Table 2-11. Selector and Source Selected . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .164
Table 2-12. Interaction of Selector Match Bit and Immediate Operand Match Field .165
Table 2-13. VPPERM Control Byte. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .236
Revision History
Date            Revision  Description
November 2009   3.04      Removed #UD CR0.EM exception from all exception tables.
                          Added VPERMIL2PD and VPERMIL2PS instructions.
                          Corrected many small factual errors and typos.
Preface
About This Book
The instructions described in this book are part of a multivolume work entitled the AMD64
Architecture Programmer’s Manual. The following table lists each volume and its order number.
Audience
This document is intended for all programmers writing application or system software for a processor
that implements the AMD64 architecture.
Organization
Volumes 3 through 6 describe the AMD64 architecture’s instruction set in detail. Together, they cover
each instruction’s mnemonic syntax, opcodes, functions, affected flags, and possible exceptions.
The AMD64 instruction set is divided into seven subsets:
• General-purpose instructions
• System instructions
• 128-bit media instructions
• 64-bit media instructions
• x87 floating-point instructions
• 128-bit and 256-bit XOP media instructions
Several instructions belong to—and are described identically in—multiple instruction subsets.
Title Order No.
Volume 1: Application Programming 24592
Volume 2: System Programming 24593
Volume 3: General-Purpose and System Instructions 24594
Volume 4: 128-Bit Media Instructions 26568
Volume 5: 64-Bit Media and x87 Floating-Point Instructions 26569
Volume 6: 128-Bit and 256-Bit XOP and FMA4 Instructions 43479
This volume describes the 128-bit and 256-bit XOP and FMA4 instruction extensions. The index at the
end cross-references topics within this volume. For other topics relating to the AMD64 architecture,
and for information on instructions in other subsets, see the tables of contents and indexes of the other
volumes.
Definitions
Many of the following definitions assume an in-depth knowledge of the legacy x86 architecture. See
“Related Documents” on page 22 for descriptions of the legacy x86 architecture.
Terms and Notation
In addition to the notation described below, “Opcode-Syntax Notation” in Volume 3 describes notation
relating specifically to opcodes.
1011b
A binary value—in this example, a 4-bit value.
F0EAh
A hexadecimal value—in this example a 2-byte value.
[1,2)
A range that includes the left-most value (in this case, 1) but excludes the right-most value (in this
case, 2).
7–4
A bit range, from bit 7 to 4, inclusive. The high-order bit is shown first.
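The bit-range notation can be illustrated with a small sketch. The helper name bit_field is hypothetical and not part of this manual; it simply treats the first number as the high-order bit and the second as the low-order bit, both inclusive.

```python
def bit_field(value: int, hi: int, lo: int) -> int:
    """Return bits hi..lo of value (inclusive) as an unsigned integer."""
    if not 0 <= lo <= hi:
        raise ValueError("expected 0 <= lo <= hi")
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


# Bits 7-4 of EAh (1110_1010b) are 1110b.
upper_nibble = bit_field(0xEA, 7, 4)
```

The same helper covers a single bit by passing the same position twice, for example bit_field(value, 0, 0).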
128-bit media instructions
Instructions that use the 128-bit XMM registers. These are a combination of the SSE and SSE2
instruction sets.
64-bit media instructions
Instructions that use the 64-bit MMX registers. These are primarily a combination of MMX™ and
3DNow!™ instruction sets, with some additional instructions from the SSE and SSE2 instruction
sets.
16-bit mode
Legacy mode or compatibility mode in which a 16-bit address size is active. See legacy mode and
compatibility mode.
32-bit mode
Legacy mode or compatibility mode in which a 32-bit address size is active. See legacy mode and
compatibility mode.
64-bit mode
A submode of long mode. In 64-bit mode, the default address size is 64 bits and new features, such
as register extensions, are supported for system and application software.
#GP(0)
Notation indicating a general-protection exception (#GP) with error code of 0.
absolute
Said of a displacement that references the base of a code segment rather than an instruction pointer.
Contrast with relative.
ASID
Address space identifier.
biased exponent
The sum of a floating-point value’s exponent and a constant bias for a particular floating-point
data type. The bias makes the range of the biased exponent always positive, which allows
reciprocation without overflow.
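As a minimal sketch of this definition, the following reads the biased exponent field of an IEEE 754 single-precision value and removes the bias. The bias of 127 and the field position (bits 30–23) come from the IEEE 754 single-precision format rather than from this volume, and the function names are illustrative only.

```python
import struct

SINGLE_BIAS = 127  # IEEE 754 single-precision exponent bias


def biased_exponent_single(x: float) -> int:
    """Return the 8-bit biased exponent field of x encoded as a single."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF


def unbiased_exponent_single(x: float) -> int:
    """Remove the bias; meaningful for normalized values only."""
    return biased_exponent_single(x) - SINGLE_BIAS
```

Zero, denormals, infinities, and NaNs use reserved exponent encodings (all zeros or all ones), so the subtraction above applies only to normalized numbers.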
byte
Eight bits.
clear
To write a bit value of 0. Compare set.
compatibility mode
A submode of long mode. In compatibility mode, the default address size is 32 bits, and legacy
16-bit and 32-bit applications run without modification.
commit
To irreversibly write, in program order, an instruction’s result to software-visible storage, such as a
register (including flags), the data cache, an internal write buffer, or memory.
CPL
Current privilege level.
CR0–CR4
A register range, from register CR0 through CR4, inclusive, with the low-order register first.
CR0.PE = 1
Notation indicating that the PE bit of the CR0 register has a value of 1.
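The register.field notation, together with the terms set and clear, can be sketched on a plain integer. PE is bit 0 of CR0; the register contents used here are a made-up value for illustration, and the helper names are hypothetical.

```python
CR0_PE = 0  # bit position of PE (protection enable) within CR0


def is_set(register_value: int, bit: int) -> bool:
    """True when the given bit of the register value is 1."""
    return (register_value >> bit) & 1 == 1


def set_bit(register_value: int, bit: int) -> int:
    """Write a bit value of 1."""
    return register_value | (1 << bit)


def clear_bit(register_value: int, bit: int) -> int:
    """Write a bit value of 0."""
    return register_value & ~(1 << bit)


cr0 = 0x80000010            # hypothetical contents with PE clear
cr0 = set_bit(cr0, CR0_PE)  # the state written as "CR0.PE = 1"
```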
direct
Referencing a memory location whose address is included in the instruction’s syntax as an
immediate operand. The address may be an absolute or relative address. Compare indirect.
dirty data
Data held in the processor’s caches or internal buffers that is more recent than the copy held in
main memory.
displacement
A signed value that is added to the base of a segment (absolute addressing) or an instruction pointer
(relative addressing). Same as offset.
doubleword
Two words, or four bytes, or 32 bits.
double quadword
Eight words, or 16 bytes, or 128 bits. Also called octword.
DS:rSI
The contents of a memory location whose segment address is in the DS register and whose offset
relative to that segment is in the rSI register.
EFER.LME = 0
Notation indicating that the LME bit of the EFER register has a value of 0.
effective address size
The address size for the current instruction after accounting for the default address size and any
address-size override prefix.
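A minimal sketch of how the effective address size follows from the default size and the override prefix. The prefix byte value (67h) and the size toggled by the prefix in each mode are taken from the general AMD64 architecture description, not from this section, and the function name is hypothetical.

```python
ADDRESS_SIZE_OVERRIDE = 0x67  # address-size override prefix byte

# Size selected when the override prefix is present, keyed by default size.
_OVERRIDDEN = {64: 32, 32: 16, 16: 32}


def effective_address_size(default_size: int, prefixes: bytes = b"") -> int:
    """Return the address size in bits for one instruction."""
    if default_size not in _OVERRIDDEN:
        raise ValueError("default size must be 16, 32, or 64")
    if ADDRESS_SIZE_OVERRIDE in prefixes:
        return _OVERRIDDEN[default_size]
    return default_size
```

For example, in 64-bit mode (default 64 bits) an instruction carrying the 67h prefix uses 32-bit addressing; a 16-bit effective address size is not reachable in that mode.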
effective operand size
The operand size for the current instruction after accounting for the default operand size and any
operand-size override prefix.
element
See vector.
exception
An abnormal condition that occurs as the result of executing an instruction. The processor’s
response to an exception depends on the type of the exception. For all exceptions except 128-bit
media SIMD floating-point exceptions and x87 floating-point exceptions, control is transferred to
the handler (or service routine) for that exception, as defined by the exception’s vector. For
floating-point exceptions defined by the IEEE 754 standard, there are both masked and unmasked
responses. When unmasked, the exception handler is called, and when masked, a default response
is provided instead of calling the handler.
FF /0
Notation indicating that FF is the first byte of an opcode, and a subfield in the second byte has a
value of 0.
flush
An often ambiguous term meaning (1) writeback, if modified, and invalidate, as in “flush the cache
line,” or (2) invalidate, as in “flush the pipeline,” or (3) change a value, as in “flush to zero.”
GDT
Global descriptor table.
GIF
Global interrupt flag.
IDT
Interrupt descriptor table.
IGN
Ignore. Field is ignored.
indirect
Referencing a memory location whose address is in a register or other memory location. The
address may be an absolute or relative address. Compare direct.
IRB
The virtual-8086 mode interrupt-redirection bitmap.
IST
The long-mode interrupt-stack table.
IVT
The real-address mode interrupt-vector table.
LDT
Local descriptor table.
legacy x86
The legacy x86 architecture. See “Related Documents” on page 22 for descriptions of the legacy
x86 architecture.
legacy mode
An operating mode of the AMD64 architecture in which existing 16-bit and 32-bit applications and
operating systems run without modification. A processor implementation of the AMD64
architecture can run in either long mode or legacy mode. Legacy mode has three submodes, real
mode, protected mode, and virtual-8086 mode.
long mode
An operating mode unique to the AMD64 architecture. A processor implementation of the
AMD64 architecture can run in either long mode or legacy mode. Long mode has two submodes,
64-bit mode and compatibility mode.
lsb
Least-significant bit.
LSB
Least-significant byte.
main memory
Physical memory, such as RAM and ROM (but not cache memory) that is installed in a particular
computer system.
mask
(1) A control bit that prevents the occurrence of a floating-point exception from invoking an
exception-handling routine. (2) A field of bits used for a control purpose.
MBZ
Must be zero. If software attempts to set an MBZ bit to 1, a general-protection exception (#GP)
occurs.
memory
Unless otherwise specified, main memory.
ModRM
A byte following an instruction opcode that specifies address calculation based on mode (Mod),
register (R), and memory (M) variables.
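The three ModRM fields can be pulled apart as sketched below: mod in bits 7–6, the register field in bits 5–3, and r/m in bits 2–0. This is an illustration only; it ignores the extension bits that REX and the three-byte prefixes contribute, and the names decode_modrm and matches_slash_digit are hypothetical. The second helper shows how a "/digit" opcode notation such as FF /0 maps onto the register field.

```python
from typing import NamedTuple


class ModRM(NamedTuple):
    mod: int  # bits 7-6
    reg: int  # bits 5-3
    rm: int   # bits 2-0


def decode_modrm(byte: int) -> ModRM:
    """Split a ModRM byte into its three fields."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("ModRM is a single byte")
    return ModRM(mod=(byte >> 6) & 0b11, reg=(byte >> 3) & 0b111, rm=byte & 0b111)


def matches_slash_digit(byte: int, digit: int) -> bool:
    """True when the ModRM register field equals the /digit of an opcode."""
    return decode_modrm(byte).reg == digit
```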
moffset
A 16-bit, 32-bit, or 64-bit offset that specifies a memory operand directly, without using a ModRM
or SIB byte.
msb
Most-significant bit.
MSB
Most-significant byte.
multimedia instructions
A combination of 128-bit media instructions and 64-bit media instructions.
octword
Same as double quadword.
offset
Same as displacement.
overflow
The condition in which a floating-point number is larger in magnitude than the largest, finite,
positive or negative number that can be represented in the data-type format being used.
packed
See vector.
PAE
Physical-address extensions.
physical memory
Actual memory, consisting of main memory and cache.
probe
A check for an address in a processor’s caches or internal buffers. External probes originate
outside the processor, and internal probes originate within the processor.
protected mode
A submode of legacy mode.
quadword
Four words, or eight bytes, or 64 bits.
reserved
Fields marked as reserved may be used at some future time.
To preserve compatibility with future processors, reserved fields require special handling when
read or written by software.
Reserved fields may be further qualified as MBZ, RAZ, SBZ or IGN (see definitions).
Software must not depend on the state of a reserved field, nor upon the ability of such fields to
return to a previously written state.
If a reserved field is not marked with one of the above qualifiers, software must not change the
state of that field; it must reload that field with the same values returned from a prior read.
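The rule for unqualified reserved fields amounts to a read-modify-write that touches only the defined field. The sketch below models this on a plain integer; the register value, the mask, and the helper name update_field are assumptions for illustration and do not describe any particular register.

```python
def update_field(old_value: int, field_mask: int, new_field_bits: int) -> int:
    """Change only the bits covered by field_mask.

    Every other bit, including reserved ones, is written back exactly as
    it was read.
    """
    if new_field_bits & ~field_mask:
        raise ValueError("new bits fall outside the field being updated")
    return (old_value & ~field_mask) | new_field_bits


# Hypothetical register: only bits 7-4 are being changed; the rest is preserved.
previous = 0xABCD
updated = update_field(previous, 0x00F0, 0x0050)
```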
RAZ
Read as zero (0), regardless of what is written.
real-address mode
A submode of legacy mode with 16-bit addressing and operand size and a simple form of
segmentation, lacking the segment and privilege protection mechanisms of protected mode. See
real mode.
real mode
A short name for real-address mode, a submode of legacy mode.
relative
Referencing with a displacement (also called offset) from an instruction pointer rather than the
base of a code segment. Contrast with absolute.
REX
An instruction prefix that specifies a 64-bit operand size and provides access to additional
registers.
RIP-relative addressing
Addressing relative to the 64-bit RIP instruction pointer.
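A rough sketch of the address arithmetic, under the assumption (from the general AMD64 architecture, not stated in this definition) that the signed 32-bit displacement is added to the RIP of the next instruction. The function name and the example addresses are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def rip_relative_target(next_rip: int, disp32: int) -> int:
    """Add a sign-extended 32-bit displacement to the next instruction's RIP."""
    disp32 &= 0xFFFFFFFF
    if disp32 & 0x80000000:      # negative displacement: sign-extend
        disp32 -= 1 << 32
    return (next_rip + disp32) & MASK64
```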
set
To write a bit value of 1. Compare clear.
SIB
A byte following an instruction opcode that specifies address calculation based on scale (S), index
(I), and base (B).
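As an illustration of the scale-index-base calculation, the sketch below splits an SIB byte (scale in bits 7–6, index in bits 5–3, base in bits 2–0) and combines register contents into an address. It deliberately ignores the special encodings (no-index and no-base cases) and prefix extension bits; the function names and register values are assumptions.

```python
def decode_sib(byte: int):
    """Return (scale factor, index register number, base register number)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("SIB is a single byte")
    scale = 1 << ((byte >> 6) & 0b11)   # 1, 2, 4, or 8
    index = (byte >> 3) & 0b111
    base = byte & 0b111
    return scale, index, base


def scaled_address(base_value: int, index_value: int, scale: int,
                   displacement: int = 0) -> int:
    """base + index * scale + displacement, truncated to 64 bits."""
    return (base_value + index_value * scale + displacement) & 0xFFFFFFFFFFFFFFFF
```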
SIMD
Single instruction, multiple data. See vector.
SSEn and SSSEn
Various extensions to the SSE instruction set. See 128-bit media instructions and 64-bit media
instructions.
sticky bit
A bit that is set or cleared by hardware and that remains in that state until explicitly changed by
software.
TOP
The x87 top-of-stack pointer.
TSS
Task-state segment.
underflow
The condition in which a floating-point number is smaller in magnitude than the smallest nonzero,
positive or negative number that can be represented in the data-type format being used.
vector
(1) A set of integer or floating-point values, called elements, that are packed into a single operand.
Most of the 128-bit and 64-bit media instructions use vectors as operands. Vectors are also called
packed or SIMD (single-instruction multiple-data) operands.
(2) An index into an interrupt descriptor table (IDT), used to access exception handlers. Compare
exception.
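Definition (1) can be made concrete with a small model of a packed operand: four signed 32-bit elements stored in one 128-bit value, with an element-wise add. This is a software illustration only, not the behavior of any specific instruction in this volume; the helper names are hypothetical.

```python
import struct


def pack_dwords(elements):
    """Pack four signed 32-bit integers into a 16-byte (128-bit) operand."""
    if len(elements) != 4:
        raise ValueError("a 128-bit operand holds exactly four doublewords")
    return struct.pack("<4i", *elements)


def unpack_dwords(operand: bytes):
    """Return the four signed 32-bit elements of a 128-bit operand."""
    return list(struct.unpack("<4i", operand))


def packed_add(a: bytes, b: bytes) -> bytes:
    """Add corresponding elements independently, wrapping at 32 bits."""
    sums = [((x + y + 2**31) % 2**32) - 2**31
            for x, y in zip(unpack_dwords(a), unpack_dwords(b))]
    return pack_dwords(sums)
```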
virtual-8086 mode
A submode of legacy mode.
VMCB
Virtual machine control block.
VMM
Virtual machine monitor.
word
Two bytes, or 16 bits.
x86
See legacy x86.
Registers
In the following list of registers, the names are used to refer either to a given register or to the contents
of that register:
AH–DH
The high 8-bit AH, BH, CH, and DH registers. Compare AL–DL.
AL–DL
The low 8-bit AL, BL, CL, and DL registers. Compare AH–DH.
AL–r15B
The low 8-bit AL, BL, CL, DL, SIL, DIL, BPL, SPL, and R8B–R15B registers, available in 64-bit
mode.
BP
Base pointer register.
CRn
Control register number n.
CS
Code segment register.
eAX–eSP
The 16-bit AX, BX, CX, DX, DI, SI, BP, and SP registers or the 32-bit EAX, EBX, ECX, EDX,
EDI, ESI, EBP, and ESP registers. Compare rAX–rSP.
EBP
Extended base pointer register.
EFER
Extended features enable register.
eFLAGS
16-bit or 32-bit flags register. Compare rFLAGS.
EFLAGS
32-bit (extended) flags register.
eIP
16-bit or 32-bit instruction-pointer register. Compare rIP.
EIP
32-bit (extended) instruction-pointer register.
FLAGS
16-bit flags register.
GDTR
Global descriptor table register.
GPRs
General-purpose registers. For the 16-bit data size, these are AX, BX, CX, DX, DI, SI, BP, and SP.
For the 32-bit data size, these are EAX, EBX, ECX, EDX, EDI, ESI, EBP, and ESP. For the 64-bit
data size, these include RAX, RBX, RCX, RDX, RDI, RSI, RBP, RSP, and R8–R15.
IDTR
Interrupt descriptor table register.
IP
16-bit instruction-pointer register.
LDTR
Local descriptor table register.
MSR
Model-specific register.
r8–r15
The 8-bit R8B–R15B registers, or the 16-bit R8W–R15W registers, or the 32-bit R8D–R15D
registers, or the 64-bit R8–R15 registers.
rAX–rSP
The 16-bit AX, BX, CX, DX, DI, SI, BP, and SP registers, or the 32-bit EAX, EBX, ECX, EDX,
EDI, ESI, EBP, and ESP registers, or the 64-bit RAX, RBX, RCX, RDX, RDI, RSI, RBP, and RSP
registers. Replace the placeholder r with nothing for 16-bit size, “E” for 32-bit size, or “R” for
64-bit size.
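The placeholder substitution can be written out as a short sketch. The function name expand_register is hypothetical; the mapping itself (nothing, "E", or "R") is the one given in the definition above.

```python
_SIZE_PREFIX = {16: "", 32: "E", 64: "R"}


def expand_register(name: str, size_bits: int) -> str:
    """Replace the leading placeholder r with the prefix for the given size."""
    if not name.startswith("r") or len(name) < 2:
        raise ValueError("expected a placeholder name such as 'rAX'")
    if size_bits not in _SIZE_PREFIX:
        raise ValueError("size must be 16, 32, or 64 bits")
    return _SIZE_PREFIX[size_bits] + name[1:]
```

The same substitution applies to rFLAGS and rIP, which expand to FLAGS, EFLAGS, RFLAGS and IP, EIP, RIP respectively.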
RAX
64-bit version of the EAX register.
RBP
64-bit version of the EBP register.
RBX
64-bit version of the EBX register.
RCX
64-bit version of the ECX register.
RDI
64-bit version of the EDI register.
RDX
64-bit version of the EDX register.
rFLAGS
16-bit, 32-bit, or 64-bit flags register. Compare RFLAGS.
RFLAGS
64-bit flags register. Compare rFLAGS.
rIP
16-bit, 32-bit, or 64-bit instruction-pointer register. Compare RIP.
RIP
64-bit instruction-pointer register.
RSI
64-bit version of the ESI register.
RSP
64-bit version of the ESP register.
SP
Stack pointer register.
SS
Stack segment register.
TPR
Task priority register (CR8), a new register introduced in the AMD64 architecture to speed
interrupt management.
TR
Task register.
XMM0–XMM15
The 128-bit XMM registers; each is the lower half of a corresponding 256-bit YMM register.
YMM0–YMM15
The 256-bit YMM registers; the lower half of each of these is the corresponding 128-bit XMM
register.
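The aliasing between the XMM and YMM registers can be modeled on a 256-bit integer: the XMM view is simply the low 128 bits. The helper names and the example value are assumptions for illustration.

```python
XMM_MASK = (1 << 128) - 1


def xmm_view(ymm_value: int) -> int:
    """Return the lower 128 bits of a 256-bit YMM value (the XMM register)."""
    return ymm_value & XMM_MASK


def ymm_upper_half(ymm_value: int) -> int:
    """Return the upper 128 bits of a 256-bit YMM value."""
    return (ymm_value >> 128) & XMM_MASK


# Hypothetical YMM contents: upper half all ones, lower half equal to 1234h.
ymm0 = (XMM_MASK << 128) | 0x1234
```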
Endian Order
The x86 and AMD64 architectures address memory using little-endian byte-ordering. Multibyte
values are stored with their least-significant byte at the lowest byte address, and they are illustrated
with their least significant byte at the right side. Strings are illustrated in reverse order, because the
addresses of their bytes increase from right to left.
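A short sketch of little-endian storage, using an arbitrary 32-bit example value. The byte at the lowest address is the least-significant byte, and printing the bytes from the highest address down to the lowest reproduces the right-to-left illustration convention described above.

```python
import struct

value = 0x12345678
stored = struct.pack("<I", value)   # bytes in increasing address order

lowest_address_byte = stored[0]     # least-significant byte, 78h
# Highest address on the left, lowest on the right, as in the illustrations.
illustrated = " ".join(f"{b:02X}" for b in reversed(stored))
```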
Related Documents
• Peter Abel, IBM PC Assembly Language and Programming, Prentice-Hall, Englewood Cliffs, NJ,
1995.
• Rakesh Agarwal, 80x86 Architecture & Programming: Volume II, Prentice-Hall, Englewood
Cliffs, NJ, 1991.
• AMD, AMD-K6™ MMX™ Enhanced Processor Multimedia Technology, Sunnyvale, CA, 2000.
• AMD, 3DNow!™ Technology Manual, Sunnyvale, CA, 2000.
• AMD, AMD Extensions to the 3DNow!™ and MMX™ Instruction Sets, Sunnyvale, CA, 2000.
• Don Anderson and Tom Shanley, Pentium Processor System Architecture, Addison-Wesley, New
York, 1995.
• Nabajyoti Barkakati and Randall Hyde, Microsoft Macro Assembler Bible, Sams, Carmel, Indiana,
1992.
• Barry B. Brey, 8086/8088, 80286, 80386, and 80486 Assembly Language Programming,
Macmillan Publishing Co., New York, 1994.
• Barry B. Brey, Programming the 80286, 80386, 80486, and Pentium Based Personal Computer,
Prentice-Hall, Englewood Cliffs, NJ, 1995.
• Ralf Brown and Jim Kyle, PC Interrupts, Addison-Wesley, New York, 1994.
• Penn Brumm and Don Brumm, 80386/80486 Assembly Language Programming, Windcrest
McGraw-Hill, 1993.
• Geoff Chappell, DOS Internals, Addison-Wesley, New York, 1994.
• Chips and Technologies, Inc., Super386 DX Programmer’s Reference Manual, Chips and
Technologies, Inc., San Jose, 1992.
• John Crawford and Patrick Gelsinger, Programming the 80386, Sybex, San Francisco, 1987.
• Cyrix Corporation, 5x86 Processor BIOS Writer's Guide, Cyrix Corporation, Richardson, TX, 1995.
• Cyrix Corporation, M1 Processor Data Book, Cyrix Corporation, Richardson, TX, 1996.
• Cyrix Corporation, MX Processor MMX Extension Opcode Table, Cyrix Corporation, Richardson,
TX, 1996.
• Cyrix Corporation, MX Processor Data Book, Cyrix Corporation, Richardson, TX, 1997.
• Ray Duncan, Extending DOS: A Programmer's Guide to Protected-Mode DOS, Addison Wesley,
NY, 1991.
• William B. Giles, Assembly Language Programming for the Intel 80xxx Family, Macmillan, New
York, 1991.
• Frank van Gilluwe, The Undocumented PC, Addison-Wesley, New York, 1994.
• John L. Hennessy and David A. Patterson, Computer Architecture, Morgan Kaufmann Publishers,
San Mateo, CA, 1996.
• Thom Hogan, The Programmer’s PC Sourcebook, Microsoft Press, Redmond, WA, 1991.
• Hal Katircioglu, Inside the 486, Pentium, and Pentium Pro, Peer-to-Peer Communications, Menlo
Park, CA, 1997.
• IBM Corporation, 486SLC Microprocessor Data Sheet, IBM Corporation, Essex Junction, VT,
1993.
• IBM Corporation, 486SLC2 Microprocessor Data Sheet, IBM Corporation, Essex Junction, VT,
1993.
• IBM Corporation, 80486DX2 Processor Floating Point Instructions, IBM Corporation, Essex
Junction, VT, 1995.
• IBM Corporation, 80486DX2 Processor BIOS Writer's Guide, IBM Corporation, Essex Junction,
VT, 1995.
• IBM Corporation, Blue Lightning 486DX2 Data Book, IBM Corporation, Essex Junction, VT,
1994.
• Institute of Electrical and Electronics Engineers, IEEE Standard for Binary Floating-Point
Arithmetic, ANSI/IEEE Std 754-1985.
• Institute of Electrical and Electronics Engineers, IEEE Standard for Radix-Independent
Floating-Point Arithmetic, ANSI/IEEE Std 854-1987.
• Muhammad Ali Mazidi and Janice Gillispie Mazidi, 80X86 IBM PC and Compatible Computers,
Prentice-Hall, Englewood Cliffs, NJ, 1997.
• Hans-Peter Messmer, The Indispensable Pentium Book, Addison-Wesley, New York, 1995.
• Karen Miller, An Assembly Language Introduction to Computer Architecture: Using the Intel
Pentium, Oxford University Press, New York, 1999.
• Stephen Morse, Eric Isaacson, and Douglas Albert, The 80386/387 Architecture, John Wiley &
Sons, New York, 1987.
• NexGen Inc., Nx586 Processor Data Book, NexGen Inc., Milpitas, CA, 1993.
• NexGen Inc., Nx686 Processor Data Book, NexGen Inc., Milpitas, CA, 1994.
• Bipin Patwardhan, Introduction to the Streaming SIMD Extensions in the Pentium III,
www.x86.org/articles/sse_pt1/simd1.htm, June, 2000.
• Peter Norton, Peter Aitken, and Richard Wilton, PC Programmer’s Bible, Microsoft Press,
Redmond, WA, 1993.
• PharLap 386|ASM Reference Manual, Pharlap, Cambridge MA, 1993.
• PharLap TNT DOS-Extender Reference Manual, Pharlap, Cambridge MA, 1995.
• Sen-Cuo Ro and Sheau-Chuen Her, i386/i486 Advanced Programming, Van Nostrand Reinhold,
New York, 1993.
• Jeffrey P. Royer, Introduction to Protected Mode Programming, course materials for an onsite
class, 1992.
• Tom Shanley, Protected Mode System Architecture, Addison Wesley, NY, 1996.
• SGS-Thomson Corporation, 80486DX Processor SMM Programming Manual, SGS-Thomson
Corporation, 1995.
• Walter A. Triebel, The 80386DX Microprocessor, Prentice-Hall, Englewood Cliffs, NJ, 1992.
• John Wharton, The Complete x86, MicroDesign Resources, Sebastopol, California, 1994.
• Web sites and newsgroups:
- www.amd.com
- news.comp.arch
- news.comp.lang.asm.x86
- news.intel.microprocessors
- news.microsoft
1 New 128-Bit and 256-Bit Instructions
This release of the AMD64 architecture covers the XOP and FMA4 instruction set extensions. These
128-bit and 256-bit instructions complement the AMD64 128-bit media instructions described in
detail in the AMD64 Architecture Programmer’s Manual Volume 4: 128-Bit Media Instructions,
order# 26568. This document describes new instructions that are designed to:
• Improve performance by increasing the work per instruction, and
• Reduce the need to copy and move around register operands.
These instruction set extensions include:
• Floating-point multiply accumulate instructions
• Floating-point fraction extract
• Integer horizontal add instructions
• Integer multiply accumulate instructions
• Byte permutation and bit granularity conditional move instructions
• Packed integer compare and individual-partition shift/rotate instructions
These instructions all use the new XOP instruction format, which takes advantage of the three- and
four-operand non-destructive capability, 256-bit operand size, and instruction length efficiency
provided by this encoding. These instructions operate on either the lower 128- or full 256-bits of the
new YMM registers. Context handling of the YMM register set is supported by the new
XSAVE/XRSTOR instructions in conjunction with the XSETBV and XGETBV instructions. Support
for YMM context handling must be provided by the operating system and must be indicated by setting
CR4.OSXSAVE to 1.
Support for the new instructions is indicated by use of the CPUID instruction:
XOP—ECX bit 11 as returned by CPUID function 8000_0001h.
FMA4—ECX bit 16 as returned by CPUID function 8000_0001h.
Attempting to execute these instructions causes a #UD exception either if they are not present in the
hardware or if operating system support for YMM context switching is not indicated by setting
CR4.OSXSAVE to 1.
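The feature test described above reduces to two bit checks plus the operating-system condition. The following Python sketch shows only that decoding step; the function name `feature_flags` and its parameters are hypothetical, and obtaining the ECX value and the OSXSAVE state from real hardware is outside the scope of the example.

```python
XOP_BIT = 11    # ECX bit 11 as returned by CPUID function 8000_0001h
FMA4_BIT = 16   # ECX bit 16 as returned by CPUID function 8000_0001h

def feature_flags(ecx_8000_0001, cr4_osxsave):
    """Report which extensions can be used without raising #UD.

    ecx_8000_0001: ECX value returned by CPUID function 8000_0001h.
    cr4_osxsave:   True if the operating system has set CR4.OSXSAVE to 1.
    """
    xop_present = bool((ecx_8000_0001 >> XOP_BIT) & 1)
    fma4_present = bool((ecx_8000_0001 >> FMA4_BIT) & 1)
    # Both the hardware bit and the OS support bit are required.
    return {
        "xop": xop_present and bool(cr4_osxsave),
        "fma4": fma4_present and bool(cr4_osxsave),
    }

example = feature_flags((1 << XOP_BIT) | (1 << FMA4_BIT), cr4_osxsave=True)
```

With both feature bits set but OSXSAVE clear, the sketch reports neither extension as usable, mirroring the #UD condition in the text.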
1.1 New Instruction Format
The XOP instructions utilize a new three-byte XOP prefix preceding the opcode byte. This prefix
replaces the use of the 0F, 66, F2 and F3 prefix bytes and the REX prefix and encodes additional
information as well. The FMA4 instructions utilize the new AVX VEX prefix which provides similar
encoding capabilities.
Figure 1-1 shows the byte order of the instruction format.
Figure 1-1. Instruction Byte-Order
[Legacy Prefix] | XOP Prefix (3 bytes) | Opcode | ModRM | SIB | Displacement (1, 2, or 4 bytes) | Immediate (1 byte)
1.1.1 Legacy Prefix
The optional legacy prefixes include operand-size override, address-size override, segment override,
Lock and REP prefixes. For additional information, see section 1.2, “Instruction Prefixes” in the
AMD64 Architecture Programmer’s Manual Volume 3: General Purpose and System Instructions,
order# 24594.
1.1.2 Three-byte Prefix Format
The format of the three-byte form of the XOP and FMA4 instruction prefixes is shown in Figure 1-2.
Figure 1-2. Three-byte XOP Format
        Byte 0   Byte 1             Byte 2
Bits    7–0      7  6  5  4–0       7  6–3   2  1–0
Field   8Fh      R  X  B  mmmmm     W  vvvv  L  pp

Byte  Bits  Mnemonic  Description
0     7–0   8Fh       XOP Prefix Byte for 3-byte XOP Prefix
1     7     R         Inverted one bit extension to ModRM.reg field
1     6     X         Inverted one bit extension of the SIB index field
1     5     B         Inverted one bit extension of the ModRM r/m field or the SIB base field
1     4–0   mmmmm     XOP opcode map select: 08h—instructions with immediate byte; 09h—instructions with no immediate
2     7     W         Default operand size override for a general purpose register to 64-bit size in 64-bit mode; operand configuration specifier for certain XMM/YMM-based operations.
2     6–3   vvvv      Source or destination register specifier in inverted 1’s complement format.
2     2     L         Vector length for XMM/YMM-based operations.
2     1–0   pp        Specifies whether there's an implied 66, F2, or F3 opcode extension
Prefix Byte 0
Byte 0 of the XOP prefix is set to 8Fh. This signifies an XOP prefix only in conjunction with the
mmmmm field of the following byte being greater than or equal to 8; if the mmmmm field is less than
8 then these two bytes are a form of the POP instruction rather than an XOP prefix.
Prefix Byte 1
Byte 1 of the XOP prefix has four fields.
R Bit (Prefix Byte 1, Bit 7). This bit provides a one bit extension of the ModRM.reg field in 64-bit
mode, permitting access to all 16 YMM/XMM and GPR registers. In 32-bit protected and
compatibility modes, this bit must be set to 1. This bit is the bit-inverted equivalent of the REX.R bit.
X Bit (Prefix Byte 1, Bit 6). This bit provides a one bit extension of the SIB.index field in 64-bit
mode, permitting access to 16 YMM/XMM and GPR registers. In 32-bit protected and compatibility
modes, this bit must be set to 1. This bit is the bit-inverted equivalent of the REX.X bit.
B Bit (Prefix Byte 1, Bit 5). This bit provides a one-bit extension of either the ModRM.r/m field to
specify a GPR or XMM register or to the SIB base field to specify a GPR. This permits access to 16
registers. In 32-bit protected and compatibility modes, this bit is ignored. This bit is the bit-inverted
equivalent of the REX.B bit and is available only in the 3-byte prefix format.
mmmmm (Prefix Byte 1, Bits 4–0). A five-bit field encoding a one- or two-byte opcode prefix.
Prefix Byte 2
Byte 2 of the three-byte prefix has four fields.
W Bit (Prefix Byte 2, Bit 7). The meaning of the W bit is opcode specific. This bit toggles source
operand order or is ignored, depending upon the opcode.
vvvv (Prefix Byte 2, Bits 6–3). Encodes a source XMM or YMM register in inverted 1s complement
form.
L (Prefix Byte 2, Bit 2). If L is 0, encodes a vector length of 128-bits or indicates scalar operands; if
L is 1, the vector length is 256-bits. The register operands for a given instruction are either all 128-bit
XMM registers or all 256-bit YMM registers.
pp (Prefix Byte 2, Bits 1–0). Specifies an implied 66, F2, or F3 opcode extension as defined by the
following table.
Table 1-1. VEX.pp Prefix Mapping
pp    Implied Prefix
00b   None
01b   66h
10b   F3h
11b   F2h
1.2 Opcode Byte
The format of the opcode byte is shown in Figure 1-3. For most instructions, the operand element size
(OES) is specified by the two least-significant opcode bits, as shown in Table 1-2.
Figure 1-3. Opcode Byte Format
Bits    7–2      1–0
Field   Opcode   OES
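The prefix fields of Section 1.1.2 and the OES field of the opcode byte can be pulled apart with a few shifts and masks. The following Python sketch is illustrative only: the function names and the returned dictionary are inventions of this example, and the sample bytes 8Fh E8h 70h were assembled by hand from the field descriptions rather than produced by an assembler.

```python
# Table 1-1: implied legacy prefix selected by the pp field.
PP_IMPLIED_PREFIX = {0b00: None, 0b01: 0x66, 0b10: 0xF3, 0b11: 0xF2}

def decode_xop_prefix(b0, b1, b2):
    """Split a three-byte XOP prefix into its fields, or return None."""
    mmmmm = b1 & 0x1F
    # 8Fh followed by mmmmm < 8 is a form of POP, not an XOP prefix.
    if b0 != 0x8F or mmmmm < 8:
        return None
    return {
        # R, X, B and vvvv are stored inverted, so flip them back.
        "R": ((b1 >> 7) & 1) ^ 1,
        "X": ((b1 >> 6) & 1) ^ 1,
        "B": ((b1 >> 5) & 1) ^ 1,
        "mmmmm": mmmmm,
        "W": (b2 >> 7) & 1,
        "vvvv": (~(b2 >> 3)) & 0xF,
        "vector_bits": 256 if (b2 >> 2) & 1 else 128,
        "implied_prefix": PP_IMPLIED_PREFIX[b2 & 0b11],
    }

def operand_element_size(opcode):
    """Return the OES field: the two least-significant opcode bits."""
    return opcode & 0b11

fields = decode_xop_prefix(0x8F, 0xE8, 0x70)
```

For the sample bytes, the decoded fields are opcode map 08h, W = 0, vvvv naming register 1, a 128-bit vector length, and no implied prefix.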
1.3 Destination XMM registers
The destination of XOP and FMA4 instructions may be a 128-bit XMM register or a 256-bit YMM
register. When a 128-bit result is written to a destination XMM register, the upper 128 bits of the
corresponding YMM register are cleared.
1.4 Four-Operand Instructions
Some new instructions require three input operands and one destination register. This is accomplished
by using the Prefix.vvvv field and Imm8[7:4] along with the MODRM.reg and MODRM.r/m fields.
VPCMOV is an example of a four operand instruction:
VPCMOV dest, src1, src2, src3; dest = (src1 & src3) | (src2 & ~src3)
The first operand is the destination operand and is an XMM or YMM register addressed by the
ModRM.reg field.
The second, third and fourth operands are sources. The first source operand is an XMM register
specified by the vvvv field. The second and third source operands are specified by the MODRM.r/m
and Imm8[7:4] fields, respectively, when VEX.W is set to 0. The FMA4, VPCMOV and VPPERM
instructions provide the option of swapping the second and third source operands by setting W to 1, as
shown in Table 1-3. This allows either the second data operand or the control operand to be memory
based.
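The VPCMOV expression and the W-controlled operand swap can be modeled directly on integers. This Python sketch treats each 128-bit register as an unbounded integer masked to 128 bits; the helper names are hypothetical, and `source_fields` simply restates Table 1-3.

```python
MASK128 = (1 << 128) - 1

def vpcmov(src1, src2, src3):
    """dest = (src1 & src3) | (src2 & ~src3), evaluated bit by bit."""
    return ((src1 & src3) | (src2 & ~src3)) & MASK128

def source_fields(w):
    """Return which encoding field supplies each source for a given W bit."""
    if w == 0:
        return {"src1": "vvvv", "src2": "ModRM.r/m", "src3": "imm8[7:4]"}
    return {"src1": "vvvv", "src2": "imm8[7:4]", "src3": "ModRM.r/m"}

# Where a selector bit in src3 is 1 the result takes the src1 bit, otherwise the src2 bit.
selected = vpcmov(0xAAAA, 0x5555, 0xFF00)
```

Because only the ModRM.r/m field can name memory, choosing W decides whether src2 or src3 may be the memory operand.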
1.5 Three-Operand Instructions
Some instructions have two source operands and a destination operand.
Table 1-2. Operand Element Size—OES
Opcode.OES   Integer Operation   Floating-Point Operation
00           Byte                PS
01           Word                PD
10           Doubleword          SS
11           Quadword            SD
Table 1-3. Operand Configurations for FMA4, VPCMOV and VPPERM Instructions
XOP.W   dest        src1           src2        src3
0       ModRM.reg   VEX/XOP.vvvv   ModRM.r/m   imm8[7:4]
1       ModRM.reg   VEX/XOP.vvvv   imm8[7:4]   ModRM.r/m
VPROTB is an example of a three operand instruction:
VPROTB dest, src, count; dest = src rotated by count
The first operand is the destination operand, and is an XMM register addressed by the ModRM.reg
field. The second and third operands are source operands. One source operand is an XMM register
addressed by the XOP.vvvv field, the other source operand is an XMM register or memory operand
addressed by the ModRM.r/m field.
For certain instructions, in the three-operand format the XOP.W bit determines which source operand
is specified by which operand field, as shown in Table 1-4.
Table 1-4. Operand Configurations for Three Operand Instructions
VEX.W   dest        src         count
0       ModRM.reg   ModRM.r/m   VEX.vvvv
1       ModRM.reg   VEX.vvvv    ModRM.r/m
1.6 Two Operand Instructions
Two-operand instructions use the normal ModRM-based operand assignment. For most instructions,
the first operand is the destination, addressed by the ModRM.reg field and the second operand is either
an XMM or YMM register or a memory operand, as determined by the ModRM.mod field. For all of
these instructions, the XOP.vvvv field is not applicable and must be set to 1111b.
The VFRCZPD instruction is an example of a two operand instruction.
VFRCZPD xmm1, xmm2/mem128
1.7 XOP Integer Multiply (Add) and Accumulate Instructions
The multiply and accumulate and multiply, add and accumulate instructions operate on and produce
packed signed integer values. These instructions allow the accumulation of results from (possibly)
many iterations of similar operations without a separate intermediate addition operation to update the
accumulator register.
1.7.1 Saturation
Some instructions limit the result of an operation to the maximum or minimum value representable by
the data type of the destination—an operation known as saturation. Many of the integer multiply and
accumulate instructions saturate the cumulative results of the multiplication and addition
(accumulation) operations before writing the final results to the destination (accumulator) register.
Note, however, that not all multiply and accumulate instructions saturate results. (For further
discussion of saturation, see the AMD64 Architecture Programmer’s Manual Volume 1: Application
Programming, order# 24592.)
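Saturation as described above amounts to clamping a result to the range of the destination data type. A minimal Python sketch, assuming signed two's-complement destinations; the helper name `saturate_signed` is hypothetical.

```python
def saturate_signed(value, bits):
    """Clamp value to the range of a signed integer that is `bits` wide."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    return max(lowest, min(highest, value))

# A signed word holds -32768..32767, so out-of-range results stick at the limits.
word_overflow = saturate_signed(40000, 16)
word_underflow = saturate_signed(-40000, 16)
word_in_range = saturate_signed(1234, 16)
```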
1.7.2 Multiply and Accumulate Instructions
The operation of a typical XOP integer multiply and accumulate instruction is shown in Figure 1-4 on
page 32.
The multiply and accumulate instructions operate on and produce packed signed integer values. These
instructions first multiply the value in the first source operand by the corresponding value in the
second source operand. Each signed integer product is then added to the corresponding value in the
third source operand, which is the accumulator and is identical to the destination operand. The results
may or may not be saturated prior to being written to the destination register, depending on the
instruction.
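The multiply and accumulate step can be sketched element by element. The Python model below works on short lists of signed integers rather than full 128-bit registers, and it assumes that the non-saturating forms keep the low-order bits of the sum (wrap-around); that assumption, and all helper names, belong to this example rather than to the instruction definitions.

```python
def wrap_signed(value, bits):
    """Keep the low `bits` bits of value and reinterpret them as signed."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def saturate_signed(value, bits):
    """Clamp value to the signed range of a `bits`-wide element."""
    return max(-(1 << (bits - 1)), min((1 << (bits - 1)) - 1, value))

def multiply_accumulate(src1, src2, src3, bits=16, saturate=True):
    """Element-wise dest[i] = src1[i] * src2[i] + src3[i]."""
    finish = saturate_signed if saturate else wrap_signed
    return [finish(a * b + acc, bits) for a, b, acc in zip(src1, src2, src3)]

# The accumulator is both the third source and the destination.
acc = [0, 100, 32000, -5]
acc = multiply_accumulate([2, 3, 200, 4], [5, -7, 200, 4], acc)
```

The third element shows saturation: 200 * 200 + 32000 exceeds the signed word range and is clamped to 32767.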
Figure 1-4. Operation of Multiply and Accumulate Instructions
The XOP instruction extensions provide the following integer multiply and accumulate instructions.
VPMACSSWW—Packed Multiply Accumulate Signed Word to Signed Word with Saturation
VPMACSWW—Packed Multiply Accumulate Signed Word to Signed Word
VPMACSSWD—Packed Multiply Accumulate Signed Word to Signed Doubleword with
Saturation
VPMACSWD—Packed Multiply Accumulate Signed Word to Signed Doubleword
VPMACSSDD—Packed Multiply Accumulate Signed Doubleword to Signed Doubleword with
Saturation
VPMACSDD—Packed Multiply Accumulate Signed Doubleword to Signed Doubleword
VPMACSSDQL—Packed Multiply Accumulate Signed Low Doubleword to Signed Quadword
with Saturation
VPMACSSDQH—Packed Multiply Accumulate Signed High Doubleword to Signed Quadword
with Saturation
VPMACSDQL—Packed Multiply Accumulate Signed Low Doubleword to Signed Quadword
VPMACSDQH—Packed Multiply Accumulate Signed High Doubleword to Signed Quadword
1.7.3 Integer Multiply, Add and Accumulate Instructions
The operation of the multiply, add and accumulate instructions is illustrated in Figure 1-5.
The multiply, add and accumulate instructions first multiply each packed signed integer value in the
first source operand by the corresponding packed signed integer value in the second source operand.
The odd and even adjacent resulting products are then added. Each resulting sum is then added to the
corresponding packed signed integer value in the third source operand.
Figure 1-5. Operation of Multiply, Add and Accumulate Instructions
The XOP instruction set provides the following integer multiply, add and accumulate instructions.
VPMADCSSWD—Packed Multiply Add and Accumulate Signed Word to Signed Doubleword
with Saturation
VPMADCSWD—Packed Multiply Add and Accumulate Signed Word to Signed Doubleword
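The multiply, add and accumulate sequence can be modeled as follows. This Python sketch assumes eight signed words per source and four signed doublewords in the accumulator, and it assumes that the non-saturating form wraps to 32 bits; the function names are hypothetical.

```python
def wrap_signed(value, bits):
    """Keep the low `bits` bits of value and reinterpret them as signed."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def saturate_signed(value, bits):
    """Clamp value to the signed range of a `bits`-wide element."""
    return max(-(1 << (bits - 1)), min((1 << (bits - 1)) - 1, value))

def multiply_add_accumulate(src1_words, src2_words, src3_dwords, saturate=True):
    """Multiply words, add adjacent even/odd products, then add the accumulator."""
    products = [a * b for a, b in zip(src1_words, src2_words)]
    finish = saturate_signed if saturate else wrap_signed
    results = []
    for i, acc in enumerate(src3_dwords):
        # Elements are numbered from 0, so 2*i is even and 2*i + 1 is odd.
        total = products[2 * i] + products[2 * i + 1] + acc
        results.append(finish(total, 32))
    return results

dest = multiply_add_accumulate([1, 2, 3, 4, 5, 6, 7, 8], [10] * 8, [1, 2, 3, 4])
```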
1.8 Packed Integer Horizontal Add and Subtract
The packed horizontal add and subtract signed byte instructions successively add adjacent pairs of
signed integer values from the source XMM register or 128-bit memory operand and pack the (sign-
extended) integer result of each addition in the destination.
VPHADDBW—Packed Horizontal Add Signed Byte to Signed Word
VPHADDBD—Packed Horizontal Add Signed Byte to Signed Doubleword
VPHADDBQ—Packed Horizontal Add Signed Byte to Signed Quadword
VPHADDDQ—Packed Horizontal Add Signed Doubleword to Signed Quadword
VPHADDUBW—Packed Horizontal Add Unsigned Byte to Word
VPHADDUBD—Packed Horizontal Add Unsigned Byte to Doubleword
VPHADDUBQ—Packed Horizontal Add Unsigned Byte to Quadword
VPHADDUWD—Packed Horizontal Add Unsigned Word to Doubleword
VPHADDUWQ—Packed Horizontal Add Unsigned Word to Quadword
VPHADDUDQ—Packed Horizontal Add Unsigned Doubleword to Quadword
VPHADDWD—Packed Horizontal Add Signed Word to Signed Doubleword
VPHADDWQ—Packed Horizontal Add Signed Word to Signed Quadword
VPHSUBBW—Packed Horizontal Subtract Signed Byte to Signed Word
VPHSUBWD—Packed Horizontal Subtract Signed Word to Signed Doubleword
VPHSUBDQ—Packed Horizontal Subtract Signed Doubleword to Signed Quadword
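The horizontal add operations can be pictured as summing runs of adjacent elements into wider results. The Python sketch below is a simplified model on a list of sixteen signed bytes; the function name and the `group` parameter are hypothetical, and the subtract variants are not modeled.

```python
def horizontal_add(elements, group):
    """Sum each run of `group` adjacent elements into one wider result."""
    return [sum(elements[i:i + group]) for i in range(0, len(elements), group)]

src_bytes = [1, -2, 3, -4, 127, 127, -128, -128, 0, 0, 5, 5, -1, -1, 10, -10]

# Byte-to-word style: adjacent pairs produce eight results.
words = horizontal_add(src_bytes, 2)
# Byte-to-doubleword style: groups of four produce four results.
dwords = horizontal_add(src_bytes, 4)
```

Because each result is wider than its inputs, sums such as 127 + 127 = 254 are held without overflow.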
1.9 Vector Conditional Moves
XOP instructions include vector conditional move instructions:
VPCMOV—Vector Conditional Moves
VPPERM—Packed Permute Bytes
The VPCMOV instruction implements the C/C++ language ternary ‘?’ operator at the bit level. This
instruction operates on individual bits and requires a bitwise predicate in one XMM or YMM register
and the two source operands in two more XMM or YMM registers.
The VPPERM instruction performs vector permutation on a packed array of 32 bytes composed of two
16-byte input operands. The VPPERM instruction replaces each destination byte with 00h, FFh, or one
of the 32 bytes of the packed array. A byte selected from the array may have an additional operation
such as NOT or bit reversal applied to it, before it is written to the destination. The action for each
destination byte is determined by a corresponding control byte. The VPPERM instruction allows
either the second 16-byte input array or the control array to be memory based, per the XOP.W bit.
1.10 Packed Integer Rotates and Shifts
These instructions rotate/shift the elements of the vector in the first source XMM or 128-bit memory
operand by the amount specified by a control byte. The rotates and shifts differ in the way they handle
the control byte.
1.10.1 Packed Integer Shifts
The packed integer shift instructions shift each element of the vector in the first source XMM or 128-
bit memory operand by the amount specified by a control byte contained in the least significant byte of
the corresponding element of the second source operand. The result of each shift operation is returned
in the destination XMM register. This allows load-and-shift from memory operations, with either the
source operand or the shift-count operand being memory-based, as indicated by the XOP.W bit. The
XOP instruction set provides the following packed integer shift instructions:
VPSHLB—Packed Shift Logical Bytes
VPSHLW—Packed Shift Logical Words
VPSHLD—Packed Shift Logical Doublewords
VPSHLQ—Packed Shift Logical Quadwords
VPSHAB—Packed Shift Arithmetic Bytes
VPSHAW—Packed Shift Arithmetic Words
VPSHAD—Packed Shift Arithmetic Doublewords
VPSHAQ—Packed Shift Arithmetic Quadwords
1.10.2 Packed Integer Rotate
There are two variants of the packed integer rotate instructions. The first is identical to that described
above (see “Packed Integer Shifts”). In the second variant, the control byte is supplied as an 8-bit
immediate operand that specifies a single rotate amount for every element in the first source operand.
The XOP instruction set provides the following packed integer rotate instructions:
VPROTB—Packed Rotate Bytes
VPROTW—Packed Rotate Words
VPROTD—Packed Rotate Doublewords
VPROTQ—Packed Rotate Quadwords
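The two rotate variants differ only in where the rotate amount comes from. The following Python sketch models both on lists of unsigned elements. It treats every count as a left-rotate amount taken modulo the element width, which is an assumption of this example; the exact interpretation of the control byte is given in the individual instruction descriptions.

```python
def rotate_left(value, count, bits):
    """Rotate an unsigned `bits`-wide element left by count modulo bits."""
    count %= bits
    mask = (1 << bits) - 1
    return ((value << count) | (value >> (bits - count))) & mask

def packed_rotate_variable(elements, counts, bits=8):
    """First variant: each element has its own control byte."""
    return [rotate_left(e, c, bits) for e, c in zip(elements, counts)]

def packed_rotate_immediate(elements, imm8, bits=8):
    """Second variant: one immediate amount applies to every element."""
    return [rotate_left(e, imm8, bits) for e in elements]

by_immediate = packed_rotate_immediate([0x81, 0x01, 0xF0], 1)
by_vector = packed_rotate_variable([0x81, 0x01], [4, 7])
```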
1.11 Packed Integer Comparison and Predicate Generation
The XOP comparison instructions compare packed integer values in the first source XMM register
with corresponding packed integer values in the second source XMM register or 128-bit memory. The
type of comparison is specified by the immediate-byte operand. The resulting predicate is placed in the
destination XMM register. If the condition is true, all bits in the corresponding field in the destination
register are set to 1s; otherwise all bits in the field are set to 0s.
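Predicate generation can be sketched as a lookup on the three low-order immediate bits, using the encodings of Table 1-5. In this Python model the elements are plain integers, so the caller decides whether they represent signed or unsigned values; the names `COMPARISONS` and `packed_compare` are hypothetical.

```python
import operator

# Comparison selected by immediate bits 2:0 (Table 1-5).
COMPARISONS = {
    0b000: operator.lt,
    0b001: operator.le,
    0b010: operator.gt,
    0b011: operator.ge,
    0b100: operator.eq,
    0b101: operator.ne,
    0b110: lambda a, b: False,
    0b111: lambda a, b: True,
}

def packed_compare(src1, src2, imm8, bits=8):
    """Return an all-1s or all-0s field for each pair of elements."""
    predicate = COMPARISONS[imm8 & 0b111]
    all_ones = (1 << bits) - 1
    return [all_ones if predicate(a, b) else 0 for a, b in zip(src1, src2)]

less_than = packed_compare([1, 200, 7], [2, 100, 7], imm8=0b000)
equal = packed_compare([1, 200, 7], [2, 100, 7], imm8=0b100)
```

The resulting all-1s and all-0s fields are in the form that a bitwise select such as VPCMOV takes as its predicate.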
Table 1-5. Immediate Operand Values for Unsigned Vector Comparison Operations
Immediate Operand Byte
Bits 7:3   Bits 2:0   Comparison Operation
00000b     000b       Less Than
00000b     001b       Less Than or Equal
00000b     010b       Greater Than
00000b     011b       Greater Than or Equal
00000b     100b       Equal
00000b     101b       Not Equal
00000b     110b       False
00000b     111b       True
The integer comparison and predicate generation instructions compare corresponding packed signed
or unsigned bytes in the first and second source operands and write the result of each comparison in the
corresponding element of the destination. The result of each comparison is a value of all 1s (TRUE) or
all 0s (FALSE). The type of comparison is specified by the three low-order bits of the immediate-byte
operand. The XOP instruction set provides the following integer comparison instructions.
VPCOMUB—Compare Vector Unsigned Bytes
VPCOMUW—Compare Vector Unsigned Words
VPCOMUD—Compare Vector Unsigned Doublewords
VPCOMUQ—Compare Vector Unsigned Quadwords
VPCOMB—Compare Vector Signed Bytes
VPCOMW—Compare Vector Signed Words
VPCOMD—Compare Vector Signed Doublewords
VPCOMQ—Compare Vector Signed Quadwords
1.12 Fraction Extract
The fraction extract instructions isolate the fractional portions of vector or scalar floating point
operands. The result of _PD and _PS instructions is a vector of integer numbers. The result of _SD and
_SS instructions is always a scalar integer number. XOP provides the following fraction extract
instructions:
VFRCZPD—Extract Fraction Packed Double-Precision Floating-Point
VFRCZPS—Extract Fraction Packed Single-Precision Floating-Point
VFRCZSD—Extract Fraction Scalar Double-Precision Floating-Point
VFRCZSS—Extract Fraction Scalar Single-Precision Floating-Point
The VFRCZPD and VFRCZPS instructions extract the fractional portions of a vector of double-
/single-precision floating-point values in an XMM or YMM register or a 128- or 256-bit memory
location and write the results in the corresponding field in the destination register.
The VFRCZSS and VFRCZSD instructions extract the fractional portion of the single-/double-
precision scalar floating-point value in an XMM register or 32- or 64-bit memory location and write
the result in the lower element of the destination register. The upper elements of the destination XMM
register are unaffected by the operation, while the upper 128 bits of the corresponding YMM register
are cleared to zeros.
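The idea of fraction extraction can be illustrated with ordinary double-precision arithmetic. This Python sketch uses `math.modf` and assumes that the fraction keeps the sign of the input; it does not model special values (NaNs, infinities, denormals) or the exception behavior of the instructions.

```python
import math

def extract_fraction(x):
    """Return the fractional part of x, keeping the sign of x."""
    fraction, _integer_part = math.modf(x)
    return fraction

# Packed form: every element is processed independently.
packed = [extract_fraction(v) for v in (2.75, -1.5, 8.0)]

# Scalar form: only the lowest element is processed.
scalar = extract_fraction(3.25)
```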
2 AMD XOP and FMA4 Instructions
The following section describes the complete set of XOP 128-bit media instructions. Instructions are
listed alphabetically by mnemonic.
2.1 Notation
The notation used to denote the size and type of source and destination operands in both mnemonics
and opcodes is discussed in detail in Section 2.5, “Notation,” on page 37 in the AMD64 Architecture
Programmer’s Manual Volume 3: General Purpose and System Instructions. Mnemonic conventions
that are idiosyncratic to the XOP instruction set have been included in Chapter 1, “New 128-Bit and
256-Bit Instructions”, in this document.
2.1.1 Opcode Syntax
Opcode specification for the XOP and FMA4 instruction sets, with their two, three and four operand
syntax, requires a slightly different approach from that used to specify the opcodes for previous
generation 64- and 128-bit instructions (documented in the AMD64 Architecture Programmer’s
Manual Volume 4: 128-Bit Media Instructions, order# 26568, and AMD64 Architecture Programmer’s
Manual Volume 5: 64-Bit Media and x87 Floating-Point Instructions, order# 26569). In the following
pages, opcodes are specified using the order of fields and bits as they occur in a complete opcode
specification as outlined in Section 1.1, “New Instruction Format,” on page 25. The following opcode
specification is typical:
Mnemonic                                 Encoding
                                         XOP   RXB.mmmmm   W.vvvv.L.pp   Opcode
VPCMOV ymm1, ymm2, ymm3/mem256, ymm4     8F    RXB.08      0.src1.1.00   A2 /r /imm[7:4]
The parts of this specification are:
VPCMOV ymm1, ymm2, ymm3/mem256, ymm4     assembly language representation
8F                                       XOP prefix
RXB                                      3-bit field representing R, X, B bit values
08                                       5-bit encoding for opcode prefix
0                                        W bit
src1                                     vvvv field
1                                        L bit
00                                       pp field
A2                                       opcode
/r                                       register/memory type specifier
/imm[7:4]                                immediate operand
Most of the terms and symbols used in the following pages are defined in Section 1.1, “New
Instruction Format,” on page 25. The following notations and conventions are used in this volume, in
addition to the opcode notational conventions specified in Section 2.5.2, “Opcode Syntax,” on page 39
in the AMD64 Architecture Programmer’s Manual Volume 3: General Purpose and System
Instructions:
cntr
Control bits (for comparison instructions); immediate byte bits 3–0.
is4
Destination register specifier; immediate byte bits 7:4.
RXB
Bit field specifying the R, X and B bit values. Specified in one’s complement form.
VEX.W
The meaning of the W bit is opcode specific. This bit toggles source operand order or is ignored,
depending upon the opcode.
VEX.L
Vector length specifier
VEX.vvvv
Additional operand register specifier.
XOP
Indicates the XOP prefix byte (8Fh).
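To connect this notation to actual bytes, the sketch below assembles the register-only, W = 0 form of the VPCMOV specification 8F RXB.08 0.src1.L.00 A2 /r /imm[7:4]. It is a hand-worked Python illustration: the function name is hypothetical, the ModRM mod field is assumed to be 11b for a register operand, the low four immediate bits are assumed to be zero, and the output has not been checked against an assembler.

```python
def encode_vpcmov(dest, src1, src2, src3, use_ymm=False):
    """Build the six bytes of VPCMOV dest, src1, src2, src3 (all registers)."""
    for reg in (dest, src1, src2, src3):
        if not 0 <= reg <= 15:
            raise ValueError("register numbers must be 0..15")
    r = ((dest >> 3) & 1) ^ 1      # inverted extension of ModRM.reg
    x = 1                          # no SIB byte, so the inverted X bit stays 1
    b = ((src2 >> 3) & 1) ^ 1      # inverted extension of ModRM.r/m
    byte1 = (r << 7) | (x << 6) | (b << 5) | 0x08   # RXB.mmmmm with map 08h
    vvvv = (~src1) & 0xF           # register number in inverted form
    byte2 = (0 << 7) | (vvvv << 3) | (int(use_ymm) << 2) | 0b00   # W.vvvv.L.pp
    modrm = 0xC0 | ((dest & 7) << 3) | (src2 & 7)   # /r with mod = 11b
    imm8 = (src3 & 0xF) << 4       # is4: register number in bits 7:4
    return bytes([0x8F, byte1, byte2, 0xA2, modrm, imm8])

encoded = encode_vpcmov(0, 1, 2, 3)
```

For registers 0 through 3 the sketch yields 8F E8 70 A2 C2 30; setting `use_ymm=True` changes only the L bit, giving 74h in place of 70h.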
2.2 Operand Specification
The packed values in an operand are numbered starting with 0, which is considered to be even-
numbered.
2.3 Instruction Reference
Multiplies each packed double-precision floating-point value in the first source by the corresponding
packed double-precision floating-point value in the second source, then adds each product to the
corresponding packed double-precision floating-point value in the third source and writes the rounded
results to the destination register.
The VFMADDPD instruction requires four operands:
VFMADDPD dest, src1, src2, src3; dest = (src1 * src2) + src3
The 128-bit version multiplies each of the two double-precision values in the first source XMM
register by its corresponding double-precision value in the second source. It then adds each
intermediate product to the corresponding double-precision value in the third source and places the
result in the destination XMM register.
The 256-bit version multiplies each of the four double-precision values in the first source YMM
register by its corresponding double-precision value in the second source. It then adds each product to
the corresponding double-precision value in the third source and places the results in the destination
YMM register.
If VEX.W is 0, the second source is either a register or memory and the third source is a register. If
VEX.W is 1, the second source is a register and the third source is a register or memory location.
The destination is always either an XMM register or a YMM register, depending on the vector size, as
determined by the value of VEX.L. When the destination is a 128-bit XMM register, the upper 128 bits
of the corresponding YMM register are cleared to zeros.
The intermediate products are not rounded; the infinitely precise products are used in the addition. The
results of the addition are rounded, as specified by the rounding mode in MXCSR.
The VFMADDPD instruction is an FMA4 instruction. The presence of this instruction set is indicated
by a CPUID feature bit. (See the CPUID Specification, order# 25481.)
VFMADDPD Multiply and Add Packed Double-Precision
Floating-Point
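The single rounding step described for VFMADDPD can be illustrated with exact rational arithmetic. The following Python sketch is not AMD reference code: the names fmadd_lane and vfmaddpd are hypothetical, only finite inputs and the round-to-nearest mode are modeled, and exceptions, MXCSR flags, and the sign of a zero result are ignored.

```python
from fractions import Fraction

def fmadd_lane(a: float, b: float, c: float) -> float:
    """Return a*b + c with a single rounding (round-to-nearest-even); finite inputs only."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # infinitely precise intermediate
    return float(exact)                              # the only rounding step

def vfmaddpd(src1, src2, src3):
    """Hypothetical model of VFMADDPD on lists of 2 (XMM) or 4 (YMM) doubles."""
    if len(src1) not in (2, 4) or not (len(src1) == len(src2) == len(src3)):
        raise ValueError("expected three vectors of 2 or 4 elements")
    return [fmadd_lane(a, b, c) for a, b, c in zip(src1, src2, src3)]

# The exact product (1 + 2^-30)(1 - 2^-30) = 1 - 2^-60 is not representable as a double.
a, b, c = 1.0 + 2.0**-30, 1.0 - 2.0**-30, -1.0
fused = vfmaddpd([a, 2.0], [b, 3.0], [c, 4.0])
unfused = [a * b + c, 2.0 * 3.0 + 4.0]  # product is rounded before the addition
```

In this example fused[0] keeps the -2^-60 term, while unfused[0] is 0.0 because the product was rounded to 1.0 before the addition.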
Mnemonic                                  Encoding
                                          VEX  RXB.mmmmm  W.vvvv.L.pp   Opcode
VFMADDPD xmm1, xmm2, xmm3/mem128, xmm4    C4   RXB.03     0.xsrc1.0.01  69 /r /is4
VFMADDPD ymm1, ymm2, ymm3/mem256, ymm4    C4   RXB.03     0.ysrc1.1.01  69 /r /is4
VFMADDPD xmm1, xmm2, xmm3, xmm4/mem128    C4   RXB.03     1.xsrc1.0.01  69 /r /is4
VFMADDPD ymm1, ymm2, ymm3, ymm4/mem256    C4   RXB.03     1.ysrc1.1.01  69 /r /is4
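The encoding rows above can be turned into bytes for the register-only VEX.W = 0 form. The sketch below is illustrative and rests on assumptions that are not stated in this table: the R, X, B, and vvvv fields are stored in inverted (ones-complement) form, the destination is carried in ModRM.reg, the second source in ModRM.r/m, and the third source in bits 7:4 of the /is4 immediate byte. The function name encode_vfmaddpd_w0 is hypothetical. Verify the result against an assembler before relying on it.

```python
def encode_vfmaddpd_w0(dest, src1, src2, src3, vex_l=0):
    """Encode VFMADDPD dest, src1, src2, src3 (all registers, VEX.W = 0).

    Register numbers are 0..15; vex_l is 0 for XMM and 1 for YMM operands.
    Field placement is an assumption, see the accompanying text.
    """
    if any(not 0 <= r <= 15 for r in (dest, src1, src2, src3)) or vex_l not in (0, 1):
        raise ValueError("register numbers must be 0..15 and vex_l must be 0 or 1")
    r_bit = dest >> 3   # extends ModRM.reg
    x_bit = 0           # no SIB index in the register-only form
    b_bit = src2 >> 3   # extends ModRM.r/m
    # Byte 1: inverted R, X, B followed by mmmmm = 03h.
    byte1 = ((r_bit ^ 1) << 7) | ((x_bit ^ 1) << 6) | ((b_bit ^ 1) << 5) | 0x03
    # Byte 2: W = 0, inverted vvvv = src1, L, pp = 01b.
    byte2 = (0 << 7) | ((~src1 & 0xF) << 3) | (vex_l << 2) | 0x01
    modrm = 0xC0 | ((dest & 7) << 3) | (src2 & 7)  # mod = 11b: register operand
    is4 = src3 << 4                                # third source in bits 7:4
    return bytes([0xC4, byte1, byte2, 0x69, modrm, is4])

xmm_form = encode_vfmaddpd_w0(0, 1, 2, 3)            # VFMADDPD xmm0, xmm1, xmm2, xmm3
ymm_form = encode_vfmaddpd_w0(0, 1, 2, 3, vex_l=1)   # VFMADDPD ymm0, ymm1, ymm2, ymm3
```

Under these assumptions the two forms differ only in the VEX.L bit of the third byte.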
Related Instructions
VFMADDPS, VFMADDSD, VFMADDSS
rFLAGS Affected
None
[Figure: VFMADDPD operation. Each 64-bit element of src1 (xmm2 | ymm2) is multiplied (mul) by the
corresponding element of src2 (xmm3/mem128 | ymm3/mem256), added (add) to the corresponding element of
src3 (xmm4/mem128 | ymm4/mem256), and rounded (rnd) into dest (xmm1 | ymm1). Bits 255:128 take part
only when VEX.L=1; when VEX.L=0, bits 255:128 of the destination are written with 0s.]
MXCSR Flags Affected
MM  FZ  RC     PM  UM  OM  ZM  DM  IM  DAZ  PE  UE  OE  ZE  DE  IE
                                            M   M   M       M   M
17  15  14 13  12  11  10  9   8   7   6    5   4   3   2   1   0
Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.
Exceptions
Exception                           Real  Virtual 8086  Protected  Cause of Exception
Invalid opcode, #UD                  X         X                   FMA4 instructions are only recognized in protected mode.
                                                            X      The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.
                                                            X      The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.
                                                            X      The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.
                                                            X      There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 0. See SIMD Floating-Point Exceptions, below, for details.
Device not available, #NM                                   X      The task-switch bit (TS) of CR0 was set to 1.
Stack, #SS                                                  X      A memory address exceeded the stack segment limit or was non-canonical.
General protection, #GP                                     X      A memory address exceeded a data segment limit or was non-canonical.
                                                            X      A null data segment was used to reference memory.
Page fault, #PF                                             X      A page fault resulted from the execution of the instruction.
Alignment Check, #AC                                        X      An unaligned memory reference was performed while alignment checking was enabled.
SIMD Floating-Point Exception, #XF                          X      There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 1. See SIMD Floating-Point Exceptions, below, for details.
SIMD Floating-Point Exceptions
Invalid-operation exception (IE)                            X      A source operand was an SNaN value.
                                                            X      +/-zero was multiplied by +/-infinity.
                                                            X      +infinity was added to -infinity.
Denormalized-operand exception (DE)                         X      A source operand was a denormal value.
Overflow exception (OE)                                     X      A rounded result was too large to fit into the format of the destination operand.
Underflow exception (UE)                                    X      A rounded result was too small to fit into the format of the destination operand.
Precision exception (PE)                                    X      A result could not be represented exactly in the destination format.
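The #UD conditions listed above suggest the order in which software would test for FMA4 availability. The following Python sketch only evaluates register values that the caller has already obtained (for example, through CPUID and XGETBV in native code); it does not execute those instructions. The function name fma4_usable is hypothetical, and the bit positions of the XMM and YMM state bits in XFEATURE_ENABLED_MASK (bits 1 and 2) are assumptions not stated in this section.

```python
FMA4_BIT = 1 << 16      # CPUID function 8000_0001h, ECX bit 16
OSXSAVE_BIT = 1 << 27   # CPUID function 0000_0001h, ECX bit 27
XMM_STATE = 1 << 1      # XFEATURE_ENABLED_MASK XMM bit (assumed position)
YMM_STATE = 1 << 2      # XFEATURE_ENABLED_MASK YMM bit (assumed position)

def fma4_usable(ecx_8000_0001: int, ecx_0000_0001: int, xfeature_enabled_mask: int) -> bool:
    """Return True if none of the feature-related #UD conditions applies."""
    if not ecx_8000_0001 & FMA4_BIT:
        return False    # FMA4 instructions are not supported
    if not ecx_0000_0001 & OSXSAVE_BIT:
        return False    # operating system has not enabled XSAVE (CR4.OSXSAVE = 0)
    needed = XMM_STATE | YMM_STATE
    return (xfeature_enabled_mask & needed) == needed  # both state bits must be set

supported = fma4_usable(1 << 16, 1 << 27, 0b111)
no_ymm_state = fma4_usable(1 << 16, 1 << 27, 0b011)
```

The OSXSAVE test comes before the mask test because, in native code, XFEATURE_ENABLED_MASK can be read only when the operating system has enabled XSAVE.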
VFMADDPS Multiply and Add Packed Single-Precision Floating-Point
Multiplies each packed single-precision floating-point value in the first source by the corresponding
single-precision floating-point value in the second source, then adds each product to the corresponding
packed single-precision floating-point value in the third source and writes the rounded results to the
destination register.
The VFMADDPS instruction requires four operands:
VFMADDPS dest, src1, src2, src3        dest = src1 * src2 + src3
The 128-bit version multiplies each of the four single-precision values in the first source XMM
register by its corresponding single-precision value in the second source. It then adds each product to
the corresponding single-precision value in the third source and places the results in the destination
XMM register.
The 256-bit version multiplies each of the eight single-precision values in the first source YMM
register by its corresponding single-precision value in the second source. It then adds each product to
the corresponding single-precision value in the third source and places the results in the destination
YMM register.
If VEX.W is 0, the second source is either a register or a memory location and the third source is a
register. If VEX.W is 1, the second source is a register and the third source is either a register or a
memory location.
The destination is always either an XMM register or a YMM register, depending on the vector size, as
determined by the value of VEX.L. When the destination is a 128-bit XMM register, the upper 128 bits
of the corresponding YMM register are cleared to zeros.
The intermediate products are not rounded; the infinitely precise products are used in the addition. The
results of the addition are rounded, as specified by the rounding mode in MXCSR.
The VFMADDPS instruction is an FMA4 instruction. The presence of this instruction set is indicated
by a CPUID feature bit. (See the CPUID Specification, order# 25481.)
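The lane structure and the zeroing of the upper 128 bits described for VFMADDPS can be modeled as follows. This is a sketch, not AMD reference code: round_to_binary32 and vfmaddps are hypothetical names, only round-to-nearest is modeled, inputs are assumed to be finite values that are already representable in single precision, and exceptions and the sign of a zero result are ignored.

```python
from fractions import Fraction

def round_to_binary32(x: Fraction) -> float:
    """Round an exact value to the nearest single-precision value (ties to even)."""
    if x == 0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    # e = floor(log2(mag)), clamped to the smallest normal exponent for denormal results.
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    e = max(e, -126)
    quantum = Fraction(2) ** (e - 23)   # spacing of single-precision values near mag
    n = round(mag / quantum)            # Fraction rounding resolves ties to even
    value = n * quantum
    if value >= Fraction(2) ** 128:
        return sign * float("inf")      # rounded result exceeds the largest finite value
    return sign * float(value)          # exact: value is representable

def vfmaddps(src1, src2, src3):
    """Model VFMADDPS on 4 lanes (VEX.L=0) or 8 lanes (VEX.L=1); returns 8 YMM lanes."""
    n = len(src1)
    if n not in (4, 8) or not (n == len(src2) == len(src3)):
        raise ValueError("expected three vectors of 4 or 8 elements")
    lanes = [round_to_binary32(Fraction(a) * Fraction(b) + Fraction(c))
             for a, b, c in zip(src1, src2, src3)]
    return lanes + [0.0] * (8 - n)      # 128-bit form clears bits 255:128

a, b = 1.0 + 2.0**-13, 1.0 - 2.0**-13   # both exactly representable in single precision
ymm = vfmaddps([a, 1.0, 1.0, 1.0], [b, 2.0, 2.0, 2.0], [-1.0, 3.0, 3.0, 3.0])
```

Lane 0 of ymm is -2^-26, a value that would be lost if the product 1 - 2^-26 were first rounded to single precision, and lanes 4 through 7 are zero because the 128-bit form was modeled.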
Mnemonic                                  Encoding
                                          VEX  RXB.mmmmm  W.vvvv.L.pp   Opcode
VFMADDPS xmm1, xmm2, xmm3/mem128, xmm4    C4   RXB.03     0.xsrc1.0.01  68 /r /is4
VFMADDPS ymm1, ymm2, ymm3/mem256, ymm4    C4   RXB.03     0.ysrc1.1.01  68 /r /is4
VFMADDPS xmm1, xmm2, xmm3, xmm4/mem128    C4   RXB.03     1.xsrc1.0.01  68 /r /is4
VFMADDPS ymm1, ymm2, ymm3, ymm4/mem256    C4   RXB.03     1.ysrc1.1.01  68 /r /is4
Related Instructions
VFMADDPD, VFMADDSD, VFMADDSS
rFLAGS Affected
None
[Figure: VFMADDPS operation. Each 32-bit element of src1 (xmm2 | ymm2) is multiplied (mul) by the
corresponding element of src2 (xmm3/mem128 | ymm3/mem256), added (add) to the corresponding element of
src3 (xmm4/mem128 | ymm4/mem256), and rounded (rnd) into dest (xmm1 | ymm1). Bits 255:128 take part
only when VEX.L=1; when VEX.L=0, bits 255:128 of the destination are written with 0s.]
MXCSR Flags Affected
MM  FZ  RC     PM  UM  OM  ZM  DM  IM  DAZ  PE  UE  OE  ZE  DE  IE
                                            M   M   M       M   M
17  15  14 13  12  11  10  9   8   7   6    5   4   3   2   1   0
Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.
Exceptions
Exception                           Real  Virtual 8086  Protected  Cause of Exception
Invalid opcode, #UD                  X         X                   FMA4 instructions are only recognized in protected mode.
                                                            X      The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.
                                                            X      The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.
                                                            X      The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.
                                                            X      There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 0. See SIMD Floating-Point Exceptions, below, for details.
Device not available, #NM                                   X      The task-switch bit (TS) of CR0 was set to 1.
Stack, #SS                                                  X      A memory address exceeded the stack segment limit or was non-canonical.
General protection, #GP                                     X      A memory address exceeded a data segment limit or was non-canonical.
                                                            X      A null data segment was used to reference memory.
Page fault, #PF                                             X      A page fault resulted from the execution of the instruction.
Alignment Check, #AC                                        X      An unaligned memory reference was performed while alignment checking was enabled.
SIMD Floating-Point Exception, #XF                          X      There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 1. See SIMD Floating-Point Exceptions, below, for details.
SIMD Floating-Point Exceptions
Invalid-operation exception (IE)                            X      A source operand was an SNaN value.
                                                            X      +/-zero was multiplied by +/-infinity.
                                                            X      +infinity was added to -infinity.
Denormalized-operand exception (DE)                         X      A source operand was a denormal value.
Overflow exception (OE)                                     X      A rounded result was too large to fit into the format of the destination operand.
Underflow exception (UE)                                    X      A rounded result was too small to fit into the format of the destination operand.
Precision exception (PE)                                    X      A result could not be represented exactly in the destination format.
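The three invalid-operation (IE) causes listed above can be expressed as a predicate over single-precision bit patterns. The sketch below is illustrative only: fmadd_invalid and its helpers are hypothetical names, the DAZ setting is not modeled, and the treatment of a QNaN operand (assumed here to propagate without raising IE, even when zero is multiplied by infinity) is an assumption that the table does not address.

```python
EXP_MASK = 0x7F800000    # exponent field of a single-precision value
FRAC_MASK = 0x007FFFFF   # fraction field
QUIET_BIT = 0x00400000   # most significant fraction bit: set for QNaN, clear for SNaN
SIGN_BIT = 0x80000000

def is_nan(bits: int) -> bool:
    return (bits & EXP_MASK) == EXP_MASK and (bits & FRAC_MASK) != 0

def is_snan(bits: int) -> bool:
    return is_nan(bits) and not bits & QUIET_BIT

def is_inf(bits: int) -> bool:
    return (bits & ~SIGN_BIT & 0xFFFFFFFF) == EXP_MASK

def is_zero(bits: int) -> bool:
    return (bits & ~SIGN_BIT & 0xFFFFFFFF) == 0

def fmadd_invalid(a: int, b: int, c: int) -> bool:
    """True if a*b + c meets one of the listed IE conditions (bit-pattern inputs)."""
    if is_snan(a) or is_snan(b) or is_snan(c):
        return True                      # a source operand was an SNaN value
    if is_nan(a) or is_nan(b) or is_nan(c):
        return False                     # assumption: a QNaN propagates without IE
    if (is_zero(a) and is_inf(b)) or (is_inf(a) and is_zero(b)):
        return True                      # +/-zero multiplied by +/-infinity
    if (is_inf(a) or is_inf(b)) and is_inf(c):
        product_sign = (a ^ b) & SIGN_BIT
        return product_sign != (c & SIGN_BIT)  # infinities of opposite sign are added
    return False

PINF, NINF, ZERO, ONE = 0x7F800000, 0xFF800000, 0x00000000, 0x3F800000
SNAN, QNAN = 0x7F800001, 0x7FC00000
```

Because the product is not rounded before the addition, a finite product never becomes infinite, so the last test needs to consider only infinite source operands.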
VFMADDSD Multiply and Add Scalar Double-Precision Floating-Point
Multiplies the double-precision floating-point value in the low-order quadword of the first source by
the double-precision floating-point value in the low-order quadword of the second source, then adds
the product to the double-precision floating-point value in the low-order quadword of the third source.
The low-order quadword result is written to the destination.
The VFMADDSD instruction requires four operands:
VFMADDSD dest, src1, src2, src3        dest = src1 * src2 + src3
If VEX.W is 0, the second source is either a register or a 64-bit memory location and the third source
is a register. If VEX.W is 1, the second source is a register and the third source is either a register
or a 64-bit memory location.
The destination is an XMM register. When the result is written to the destination XMM register, the
upper quadword of the destination register (bits 64–127) and the upper 128 bits of the corresponding
YMM register are cleared to zeros.
The intermediate product is not rounded; the infinitely precise product is used in the addition. The
result of the addition is rounded, as specified by the rounding mode in MXCSR.
The VFMADDSD instruction is an FMA4 instruction. The presence of this instruction set is indicated
by a CPUID feature bit. (See the CPUID Specification, order# 25481.)
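The effect of the MXCSR rounding mode on the single rounding step of VFMADDSD can be sketched as follows. This is not AMD reference code: round_exact and vfmaddsd are hypothetical names, the RC encodings (0 nearest, 1 down, 2 up, 3 truncate) are an assumption not spelled out in this section, only finite inputs without overflow are handled, and exceptions and the sign of a zero result are ignored. math.nextafter requires Python 3.9 or later.

```python
from fractions import Fraction
import math

RC_NEAREST, RC_DOWN, RC_UP, RC_TRUNCATE = 0, 1, 2, 3  # assumed MXCSR.RC encodings

def round_exact(x: Fraction, rc: int) -> float:
    """Round an exact value to double precision under the given rounding mode."""
    nearest = float(x)                    # round-to-nearest-even
    if rc == RC_NEAREST or Fraction(nearest) == x:
        return nearest
    if rc == RC_TRUNCATE:                 # toward zero: down if positive, up if negative
        rc = RC_DOWN if x > 0 else RC_UP
    if rc == RC_DOWN and Fraction(nearest) > x:
        return math.nextafter(nearest, -math.inf)
    if rc == RC_UP and Fraction(nearest) < x:
        return math.nextafter(nearest, math.inf)
    return nearest

def vfmaddsd(src1, src2, src3, rc=RC_NEAREST):
    """Operands are [low, high] XMM images; only the low quadword takes part."""
    exact = Fraction(src1[0]) * Fraction(src2[0]) + Fraction(src3[0])
    return [round_exact(exact, rc), 0.0]  # bits 127:64 of the destination are cleared

x = 1.0 + 2.0**-52                        # exact square is 1 + 2^-51 + 2^-104
near = vfmaddsd([x, 9.0], [x, 9.0], [0.0, 9.0])
up = vfmaddsd([x, 9.0], [x, 9.0], [0.0, 9.0], RC_UP)
```

The model returns only the XMM image; the clearing of bits 255:128 of the corresponding YMM register is not represented.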
Mnemonic                                  Encoding
                                          VEX  RXB.mmmmm  W.vvvv.L.pp   Opcode
VFMADDSD xmm1, xmm2, xmm3/mem64, xmm4     C4   RXB.03     0.xsrc1.0.01  6B /r /is4
VFMADDSD xmm1, xmm2, xmm3, xmm4/mem64     C4   RXB.03     1.xsrc1.0.01  6B /r /is4
Related Instructions
VFMADDPD, VFMADDPS, VFMADDSS
rFLAGS Affected
None
MXCSR Flags Affected
MM  FZ  RC     PM  UM  OM  ZM  DM  IM  DAZ  PE  UE  OE  ZE  DE  IE
                                            M   M   M       M   M
17  15  14 13  12  11  10  9   8   7   6    5   4   3   2   1   0
Note: A flag that may be set to one or cleared to zero is M (modified). Unaffected flags are blank.
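[Figure: VFMADDSD operation. Bits 63:0 of src1 (xmm2) are multiplied (mul) by bits 63:0 of src2
(xmm3/mem64), added (add) to bits 63:0 of src3 (xmm4/mem64), and rounded (rnd) into bits 63:0 of dest
(xmm1). Bits 127:64 of the destination and bits 255:128 of the corresponding YMM register are written
with 0s.]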
Exceptions
Exception                           Real  Virtual 8086  Protected  Cause of Exception
Invalid opcode, #UD                  X         X                   FMA4 instructions are only recognized in protected mode.
                                                            X      The FMA4 instructions are not supported, as indicated by ECX bit 16 of CPUID function 8000_0001h.
                                                            X      The operating-system XSAVE support bit (OSXSAVE) of CR4 was cleared to 0, as indicated by ECX bit 27 of CPUID function 0000_0001h.
                                                            X      The operating-system YMM support bits (YMM and XMM) of XFEATURE_ENABLED_MASK were not both set.
                                                            X      There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 0. See SIMD Floating-Point Exceptions, below, for details.
Device not available, #NM                                   X      The task-switch bit (TS) of CR0 was set to 1.
Stack, #SS                                                  X      A memory address exceeded the stack segment limit or was non-canonical.
General protection, #GP                                     X      A memory address exceeded a data segment limit or was non-canonical.
                                                            X      A null data segment was used to reference memory.
Page fault, #PF                                             X      A page fault resulted from the execution of the instruction.
Alignment Check, #AC                                        X      An unaligned memory reference was performed while alignment checking was enabled.
SIMD Floating-Point Exception, #XF                          X      There was an unmasked SIMD floating-point exception while CR4.OSXMMEXCPT = 1. See SIMD Floating-Point Exceptions, below, for details.
SIMD Floating-Point Exceptions
Invalid-operation exception (IE)                            X      A source operand was an SNaN value.
                                                            X      +/-zero was multiplied by +/-infinity.
                                                            X      +infinity was added to -infinity.
Denormalized-operand exception (DE)                         X      A source operand was a denormal value.
Overflow exception (OE)                                     X      A rounded result was too large to fit into the format of the destination operand.
Underflow exception (UE)                                    X      A rounded result was too small to fit into the format of the destination operand.
Precision exception (PE)                                    X      A result could not be represented exactly in the destination format.
Multiplies the single-precision floating-point value in the low-order doubleword of the first source by
the low-order single-precision floating-point value in the second source, then adds the product to the
low-order single-precision floating-point value in the third source. The low-order doubleword result is
written to the destination.
The VFMADDSS instruction requires four operands:
VFMADDSS dest, src1, src2, s