© Daniel Kusswurm 2018. D. Kusswurm, Modern X86 Assembly Language Programming, https://doi.org/10.1007/978-1-4842-4063-2
Appendix A
Appendix A includes supplemental material about the following items:
• Software utilities for x86 processors
• Visual Studio
• References
Software Utilities for x86 Processors
The following utilities can be used to determine which x86 instruction set extensions are supported by the processor in your computer:
CPUID CPU-Z (https://www.cpuid.com)
HWiNFO Diagnostic Software (https://www.hwinfo.com)
Piriform SPECCY (https://www.ccleaner.com/speccy)
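If you prefer to check the processor's feature flags programmatically, the following sketch queries the cpuid instruction through compiler intrinsics. It is a minimal illustration rather than a complete detection routine: the helper names (CpuidRegs, QueryCpuid, TestBit, GetCpuFeatures) are made up for this example, and the code reads only the CPUID feature bits. A full AVX check should also confirm that the operating system has enabled the extended register state, which involves the OSXSAVE flag and the xgetbv instruction.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>             // __cpuidex (Visual C++)
#define HAVE_X86_CPUID 1
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>              // __get_cpuid_count (GCC, Clang)
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0        // not an x86 build; all queries return zeros
#endif

// Hypothetical helper type: raw register values from one CPUID query.
struct CpuidRegs { uint32_t eax, ebx, ecx, edx; };

// Executes CPUID for the given leaf and sub-leaf.
inline CpuidRegs QueryCpuid(uint32_t leaf, uint32_t subleaf)
{
    CpuidRegs r = { 0, 0, 0, 0 };
    (void)leaf; (void)subleaf;
#if HAVE_X86_CPUID
#if defined(_MSC_VER)
    int info[4] = { 0, 0, 0, 0 };
    __cpuidex(info, static_cast<int>(leaf), static_cast<int>(subleaf));
    r.eax = static_cast<uint32_t>(info[0]);
    r.ebx = static_cast<uint32_t>(info[1]);
    r.ecx = static_cast<uint32_t>(info[2]);
    r.edx = static_cast<uint32_t>(info[3]);
#else
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid_count(leaf, subleaf, &a, &b, &c, &d))
    {
        r.eax = a; r.ebx = b; r.ecx = c; r.edx = d;
    }
#endif
#endif
    return r;
}

// Returns true if the specified bit of value is set.
inline bool TestBit(uint32_t value, unsigned int bit)
{
    return ((value >> bit) & 1u) != 0;
}

// Hypothetical summary of a few feature flags discussed in the main text.
struct CpuFeatures { bool osxsave, fma, avx, avx2, avx512f; };

inline CpuFeatures GetCpuFeatures()
{
    CpuidRegs leaf1 = QueryCpuid(1, 0);     // basic feature flags
    CpuidRegs leaf7 = QueryCpuid(7, 0);     // structured extended feature flags

    CpuFeatures f;
    f.fma     = TestBit(leaf1.ecx, 12);     // CPUID.01H:ECX[12]
    f.osxsave = TestBit(leaf1.ecx, 27);     // CPUID.01H:ECX[27]
    f.avx     = TestBit(leaf1.ecx, 28);     // CPUID.01H:ECX[28]
    f.avx2    = TestBit(leaf7.ebx, 5);      // CPUID.(EAX=07H,ECX=0):EBX[5]
    f.avx512f = TestBit(leaf7.ebx, 16);     // CPUID.(EAX=07H,ECX=0):EBX[16]
    return f;
}
```

Calling GetCpuFeatures() and printing each field gives a quick cross-check against the output of the utilities listed above.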
Visual Studio
In this section, you’ll learn how to use Microsoft’s Visual Studio development tool to run the source code examples that are described in the main text. You’ll also learn how to create a simple Visual Studio C++ project. Before proceeding, you may want to refer to the Introduction for additional information regarding Visual Studio and the recommended hardware platforms for running the source code examples. The Introduction also contains important details about downloading the source code ZIP files for each chapter.
Visual Studio uses logical entities called solutions and projects to help simplify application development. A solution is a collection of one or more projects that are used to build an application. A project is a container object that organizes an application’s files. A Visual Studio project is usually created for each buildable component of an application (e.g., executable file, dynamic-link library, static library, and so on).
A standard Visual Studio C++ project includes two solution configurations named Debug and Release. As implied by their names, these configurations support separate executable builds for initial development and final release. A standard Visual Studio C++ project also incorporates solution platforms. The default solution platforms are named Win32 and x64, which contain the necessary settings to build 32-bit and 64-bit executables, respectively. The Visual Studio solution and project files for this book’s source code examples include only the x64 platform.
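The active configuration and platform can also be confirmed from inside a program. The sketch below is illustrative only: the function names PointerBits and BuildConfiguration are hypothetical, and it assumes the conventional preprocessor definitions (_DEBUG in Debug builds, NDEBUG in Release builds) that default Visual Studio C++ projects normally supply.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>

// Pointer width of the current build in bits (64 for an x64 build).
inline std::size_t PointerBits()
{
    return sizeof(void*) * CHAR_BIT;
}

// Best-effort configuration name based on common preprocessor definitions.
// Assumption: the project keeps the default _DEBUG / NDEBUG definitions.
inline std::string BuildConfiguration()
{
#if defined(_DEBUG)
    return "Debug";
#elif defined(NDEBUG)
    return "Release";
#else
    return "Unknown";
#endif
}
```

For the book’s solution files, which include only the x64 platform, PointerBits() would be expected to return 64.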
Running a Source Code Example
You can use the following steps to run any of the book’s source code examples:
1. Using File Explorer, double-click on the chapter’s Visual Studio solution (.sln) file. The solution file is included in the chapter source code ZIP file.
2. From the menu bar, select Build | Configuration Manager. In the Configuration Manager dialog box, set Active Solution Configuration to Release. Then set Active Solution Platform to x64. Note that these options may already be selected.
3. If necessary, select View | Solution Explorer to open the Solution Explorer window.
4. In the Solution Explorer window, right-click on a project to run and choose Set as StartUp Project.
5. Select Debug | Start Without Debugging to run the program.
Some of the source code examples reference data files in different folders using fixed path names. To run the corresponding executables using a different folder structure than the one used for Visual Studio development, you may need to change the path name strings in the C++ source code.
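One way to make such path names easier to change is to build them from a single base folder string. The following is a sketch under stated assumptions, not the approach used by the book’s examples: the function name MakeDataPath and the folder and file names in the usage note are hypothetical.

```cpp
#include <cassert>
#include <string>

// Joins a base folder and a file name, adding a separator only when needed.
// An empty base folder yields the bare file name (current working folder).
inline std::string MakeDataPath(const std::string& base_dir, const std::string& file_name)
{
    if (base_dir.empty())
        return file_name;

    const char last = base_dir.back();
    if (last == '/' || last == '\\')
        return base_dir + file_name;        // separator already present

#if defined(_WIN32)
    return base_dir + '\\' + file_name;     // Windows-style separator
#else
    return base_dir + '/' + file_name;
#endif
}
```

With this helper, a call such as MakeDataPath(data_dir, "TestImage.bmp") leaves only data_dir to update when the executable is moved to a different folder structure.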
Creating a Visual Studio C++ Project
In this section, you’ll learn how to create a simple Visual Studio project that includes both C++ and assembly language source code files. The ensuing paragraphs describe the same basic procedure that was used to create the source code examples in the main text and include the following phases:
• Create a C++ project
• Enable MASM support
• Add an assembly language file
• Set project properties
• Edit the source code
• Build and run the project
Create a C++ Project
Use the following steps to create a Visual Studio C++ project:
1. Start Visual Studio.
2. Select File | New Project.
3. In the New Project dialog box control tree, select Installed | Visual C++ | Windows Desktop.
4. Select Windows Console Application for the project type.
5. In the Name text box, enter Example1.
6. In the Location text box, enter a folder name for the project location. You can also use the Browse button to choose a folder or leave the text unchanged to use the default location.
7. In the Solution text box, enter TestSolution.
8. Verify that the New Project dialog box settings are the same as the ones shown in Figure A-1 (the Location can be different). Click OK.
9. If necessary, select View | Solution Explorer to open the Solution Explorer window.
10. In the Solution Explorer tree control, right-click on the top-level text that’s labeled Solution ‘Example 1’ (1 Project) and select Rename. Change the solution name to TestSolution.
11. Select Build | Configuration Manager. In the Configuration Manager dialog box, choose <Edit…> under Active Solution Platforms (see Figure A-2).
12. In the Edit Solution Platforms dialog box, select x86 and click Remove (see Figure A-3). Click Close to close the Edit Solution Platforms dialog box; click Close to close the Configuration Manager dialog box.
Figure A-1. New Project dialog box
Figure A-2. Configuration Manager dialog box
Figure A-3. Edit Solution Platforms dialog box
Enable MASM Support
Use the following steps to enable support for Microsoft Macro Assembler:
1. In the Solution Explorer tree control, right-click on Example1 and select Build Dependencies | Build Customizations.
2. In the Visual C++ Build Customizations dialog box, check masm(.targets, .props).
3. Click OK.
Add an Assembly Language File
Use the following steps to add an assembly language source code file (.asm) to a Visual Studio C++ project:
1. In the Solution Explorer tree control, right-click on Example1 and select Add | New Item.
2. Select C++ File (.cpp) for the file type.
3. In the Name text box, change the name to Example1_.asm, as shown in Figure A-4. Note that the trailing underscore is required since all C++ and assembly language source code files in a project must have a unique base name.
4. Click Add.
Figure A-4. Add New Item dialog box
Set Project Properties
Use the following steps to set the project’s properties. The properties that control listing file generation (Steps 5–7) are optional.
1. In the Solution Explorer tree control, right-click on Example1 and select Properties.
2. In the Property Pages dialog box, change the Configuration setting to All Configurations and the Platform setting to All Platforms. Note that one or both options may already be set.
3. In the tree control, select Configuration Properties | General. Change the setting Whole Program Optimization to No Whole Program Optimization (see Figure A-5).
4. Select Configuration Properties | C/C++ | Code Generation. Change the setting Enable Enhanced Instruction Set to Advanced Vector Extensions (/arch:AVX) (see Figure A-6).
5. Select Configuration Properties | C/C++ | Output Files. Change the setting Assembler Output to Assembly Machine and Source Code (/FAcs) (see Figure A-7).
6. Select Configuration Properties | Microsoft Macro Assembler | Listing File. Change the setting Enable Assembly Generated Code Listing to Yes (/Sg) (see Figure A-8).
7. Change the Assembled Code Listing File text field to $(IntDir)\%(filename).lst (see Figure A-8). This macro text specifies the project’s intermediate directory, which is a subfolder of the main project folder.
8. Click OK.
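To confirm that the Enable Enhanced Instruction Set property took effect, you can test the predefined macros that the compiler sets for each /arch level. This is a small sketch; the function name CompiledArchLevel is hypothetical. Visual C++ defines __AVX__ for /arch:AVX and higher, __AVX2__ for /arch:AVX2 and higher, and __AVX512F__ for /arch:AVX512.

```cpp
#include <cassert>
#include <string>

// Reports the highest instruction set level the compiler was told to target.
// This reflects compiler settings only, not what the processor supports.
inline const char* CompiledArchLevel()
{
#if defined(__AVX512F__)
    return "AVX-512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#else
    return "baseline";
#endif
}
```

With the setting from Step 4 applied, CompiledArchLevel() would be expected to return "AVX" in a Visual C++ build.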
Figure A-5. Property Pages dialog box (Whole Program Optimization)
Figure A-6. Property Pages dialog box (Enable Enhanced Instruction Set)
Figure A-7. Property Pages dialog box (Assembler Output)
Edit the Source Code
Use the following steps to edit the project source code:
1. In the Editor window, click on the tab named Example1.cpp.
2. Edit the C++ source code to match the code that’s shown in Listing A-1.
3. Click on the tab named Example1_.asm.
4. Edit the assembly language source code to match the code that’s shown in Listing A-2.
5. Select File | Save All.
Figure A-8. Property Pages dialog box (Microsoft Macro Assembler Listing File)
Listing A-1. Example1.cpp
// Example1.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <iostream>

using namespace std;

extern "C" int CalcResult1_(int val1, int val2, int* quo, int* rem);

int main()
{
    int val1 = 42;
    int val2 = 9;
    int quo;
    int rem;
    int prod = CalcResult1_(val1, val2, &quo, &rem);

    cout << "Results for Example1\n";
    cout << "val1 = " << val1 << '\n';
    cout << "val2 = " << val2 << '\n';
    cout << "quo = " << quo << '\n';
    cout << "rem = " << rem << '\n';
    cout << "prod = " << prod << '\n';
    return 0;
}
Listing A-2. Example1_.asm
; extern "C" int CalcResult1_(int val1, int val2, int* quo, int* rem);

        .code
CalcResult1_ proc
        mov r10d,ecx                    ;r10d = val1
        mov r11d,edx                    ;r11d = val2

        mov eax,ecx                     ;eax = val1
        cdq                             ;edx:eax = val1
        idiv r11d                       ;calc val1 / val2
        mov dword ptr [r8],eax          ;save quotient
        mov dword ptr [r9],edx          ;save remainder

        imul r10d,r11d                  ;r10d = val1 * val2
        mov eax,r10d                    ;eax = val1 * val2
        ret
CalcResult1_ endp
        end
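When writing an assembly language function, it often helps to keep a plain C++ version on hand for cross-checking results. The following sketch mirrors the calculations in Listing A-2; the name CalcResult1Cpp is hypothetical and is not part of the book’s source code. Like idiv, C++ integer division truncates toward zero, so the quotient and remainder should agree for any nonzero val2.

```cpp
#include <cassert>

// C++ counterpart of CalcResult1_ for cross-checking (hypothetical helper).
// Assumes val2 is nonzero, as the assembly language version also does.
inline int CalcResult1Cpp(int val1, int val2, int* quo, int* rem)
{
    *quo = val1 / val2;         // corresponds to eax after idiv
    *rem = val1 % val2;         // corresponds to edx after idiv
    return val1 * val2;         // corresponds to the imul result
}
```

For the values used in Listing A-1 (val1 = 42, val2 = 9), this function produces a quotient of 4, a remainder of 6, and a product of 378.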
Build and Run the Project
Use the following steps to build and run the project:
1. Select Build | Build Solution.
2. If necessary, fix any reported C++ compiler or MASM errors and repeat Step 1.
3. Select Debug | Start Without Debugging.
4. Verify that the output matches the console window shown in Figure A-9.
5. Press Enter to close the console window.
References
This section contains a list of references that were consulted during preparation of the main text. It also includes additional references and resources that provide worthwhile information. The references have been grouped into the following categories:
• X86 programming reference manuals
• X86 programming and microarchitecture references
• Ancillary resources
• Algorithm references
• C++ references
X86 Programming Reference Manuals
The following is a list of x86 programming reference manuals published by AMD and Intel:
AMD64 Architecture Programmer’s Manual Volume 1: Application Programming, https://support.amd.com/TechDocs/24592.pdf
AMD64 Architecture Programmer’s Manual Volume 3: General Purpose and System Instructions, https://support.amd.com/TechDocs/24594.pdf
AMD64 Architecture Programmer’s Manual Volume 4: 128-bit and 256-bit Media Instructions, https://support.amd.com/TechDocs/26568.pdf
Figure A-9. Console window output
Software Optimization Guide for AMD Family 17h Processors, Publication Number 55723, June 2017, https://developer.amd.com/resources/developer-guides-manuals
Intel 64 and IA-32 Architectures Software Developer’s Manual, Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4, https://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
Intel 64 and IA-32 Architectures Optimization Reference Manual, https://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
Intel Architecture Instruction Set Extensions and Future Features Programming Reference, https://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
X86 Programming and Microarchitecture References
The following resources contain useful information about x86 assembly language programming, processors, and microarchitectures:
Guy Ben-Haim, Itai Neoran, and Ishay Tubi, Practical Intel AVX Optimization on 2nd Generation Intel Core Processors, https://software.intel.com/sites/default/files/m/d/4/1/d/8/Practical_Optimization_with_AVX.pdf
Ian Cutress, The Intel Skylake Mobile and Desktop Launch, with Architecture Analysis, September 2015, https://www.anandtech.com/show/9582/intel-skylake-mobile-desktop-launch-architecture-analysis
Ian Cutress, The Intel Skylake-X Review: Core i9-7900X, i7-7820X and i7-7800X Tested, June 2017, https://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested
Agner Fog, The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers, August 2018, https://agner.org/optimize/#manuals
Agner Fog, Optimizing subroutines in assembly language: An optimization guide for x86 platforms, April 2018, https://agner.org/optimize/#manuals
Chris Kirkpatrick, Intel AVX State Transitions: Migrating SSE Code to AVX, https://software.intel.com/en-us/articles/intel-avx-state-transitions-migrating-sse-code-to-avx
Patrick Konsor, Avoiding AVX-SSE Transition Penalties, https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties
Patrick Konsor, Performance Benefits of Half-Precision Floats, https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats
Daniel Kusswurm, Modern x86 Assembly Language Programming, Apress, ISBN 978-1-4842-0065-0, 2014.
Max Locktyukhin, How to Detect New Instruction Support in the 4th Generation Intel Core Processor Family, August 2013, https://software.intel.com/en-us/node/405250
John Morgan, Microsoft Visual Studio 2017 Supports Intel AVX-512, https://blogs.msdn.microsoft.com/vcblog/2017/07/11/microsoft-visual-studio-2017-supports-intel-avx-512
Erdinc Ozturk, James Guilford, Vinodh Gopal, and Wajdi Feghali, New Instructions Supporting Large Integer Arithmetic on Intel Architecture Processors, August 2012, https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
James Reinders, AVX-512 May Be a Hidden Gem in Intel Xeon Scalable Processors, June 2017, https://www.hpcwire.com/2017/06/29/reinders-avx-512-may-hidden-gem-intel-xeon-scalable-processors
Anand Lal Shimpi, Intel's Haswell Architecture Analyzed: Building a New PC and a New Intel, October 2012, http://www.anandtech.com/show/6355/intels-haswell-architecture
Ancillary Resources
The following resources contain useful information about x86 processors and microarchitectures:
Processors for Desktops, AMD, https://www.amd.com/en/products/processors-desktop
List of AMD Accelerated Processing Unit Microprocessors, Wikipedia, https://en.wikipedia.org/wiki/List_of_AMD_Accelerated_Processing_Unit_microprocessors
List of AMD CPU Microarchitectures, Wikipedia, https://en.wikipedia.org/wiki/List_of_AMD_CPU_microarchitectures
List of AMD Microprocessors, Wikipedia, https://en.wikipedia.org/wiki/List_of_AMD_processors
Product Information Website, Intel, https://ark.intel.com
List of Intel CPU Microarchitectures, Wikipedia, https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures
List of Intel Microprocessors, Wikipedia, https://en.wikipedia.org/wiki/Intel_processor
List of Intel Xeon Microprocessors, Wikipedia, https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors
Register Renaming, Wikipedia, https://en.wikipedia.org/wiki/Register_renaming
Algorithm References
The following resources were consulted to develop the algorithms used in the source code examples:
Forman S. Acton, REAL Computing Made REAL – Preventing Errors in Scientific and Engineering Calculations, ISBN 978-0486442211, Dover Publications, 2005
Tony Chan, Gene Golub, Randall LeVeque, Algorithms for Computing the Sample Variance: Analysis and Recommendations, The American Statistician, Volume 37 Number 3 (1983), p. 242-247
James F. Epperson, An Introduction to Numerical Methods and Analysis, Second Edition, ISBN 978-1-118-36759-9, Wiley, 2013
David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic, ACM Computing Surveys, Volume 23 Issue 1 (March 1991), p. 5 – 48
Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing, Fourth Edition, ISBN 978-0-133-35672-4, 2018
James E. Miller, David G. Moursund, Charles S. Duris, Elementary Theory & Application of Numerical Analysis, Revised Edition, ISBN 978-0486479064, Dover Publications, 2011
Anthony Pettofrezzo, Matrices and Transformations, ISBN 0-486-63634-8, Dover Publications, 1978
Hans Schneider and George Barker, Matrices and Linear Algebra, ISBN 0-486-66014-1, Dover Publications, 1989
Eric W. Weisstein, Convolution, MathWorld, http://mathworld.wolfram.com/Convolution.html
Eric W. Weisstein, Correlation Coefficient, MathWorld, http://mathworld.wolfram.com/CorrelationCoefficient.html
Eric W. Weisstein, Cross Product, MathWorld, http://mathworld.wolfram.com/CrossProduct.html
Eric W. Weisstein, Least Squares Fitting, MathWorld, http://mathworld.wolfram.com/LeastSquaresFitting.html
Eric W. Weisstein, Matrix Multiplication, MathWorld, http://mathworld.wolfram.com/MatrixMultiplication.html
David M. Young and Robert Todd Gregory, A Survey of Numerical Mathematics, Volume 1, ISBN 0-486-65691-8, Dover Publications, 1988
Algorithms for calculating variance, Wikipedia, https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
Body Surface Area Calculator, http://www.globalrph.com/bsa2.htm
Grayscale, Wikipedia, https://en.wikipedia.org/wiki/Grayscale
Linked List, Wikipedia, https://en.wikipedia.org/wiki/Linked_list
C++ References
The following resources contain valuable information about C++ programming, the C++ Standard Template Libraries, and C++ programming using multiple threads.
Ivor Horton, Using the C++ Standard Template Libraries, Apress, ISBN 978-1-4842-0005-6, 2015
Nicolai M. Josuttis, The C++ Standard Library – A Tutorial and Reference, Second Edition, Addison Wesley, ISBN 978-0-321-62321-8, 2012
Bjarne Stroustrup, The C++ Programming Language, Fourth Edition, Addison Wesley, ISBN 978-0-321-56384-2, 2013
Anthony Williams, C++ Concurrency in Action – Practical Multithreading, ISBN 978-1-933-98877-1, Manning Publications, 2012
cplusplus.com, http://www.cplusplus.com
Index

A
Advanced Vector Extensions (AVX)
  data types
    packed floating-point, 100–102
    packed integer, 103–105
    scalar floating-point, 94–99
  differences between x86-SSE
    execution lanes, 105
    intermixing x86-AVX and x86-SSE code, 107
    operand alignment, 106
    vzeroupper, 107
    YMM register high-order bit zeroing, 106
  instruction syntax
    non-destructive source operand, 93
  registers
    MXCSR, 92, 94, 97–98
    XMM registers, 91–94, 97, 104, 106–107
    YMM registers, 91–93, 99, 105–107
AVX2
  MXCSR, 277, 281
  non-destructive source operand, 277
  operand alignment, 277
  packed floating-point, 278–279
  packed integer
    variable bit shift, 282
  XMM registers, 277–279, 281
  YMM registers, 277–279, 281
AVX-512
  conditional execution and merging
    merge masking, 424–425
    zero masking, 424–425
  data types, 423–424
  embedded broadcast, 426
  instruction-level rounding
    round down, 427, 443
    round to nearest, 427, 443
    round to zero, 427, 443
    round up, 427, 443
    suppress all exceptions, 427, 443
  instruction set extensions
    AVX512CD, 430, 433, 491
    AVX512BW, 430–431, 433, 491, 495
    AVX512DQ, 421, 422, 427, 431, 433, 491
    AVX512F, 427–429, 431, 433, 436, 464, 491
    AVX512VL, 421–422, 427, 433, 491
  instruction syntax
    conditional execution and merging, 424–426
    embedded broadcast, 426
    instruction-level rounding, 427
  merge masking, 433–436, 444, 495, 517
  predicate mask, 422, 424–425
  register sets
    MXCSR, 427
    opmask registers, 422
    XMM registers, 422
    YMM registers, 422
    ZMM registers, 422
  zero masking, 433, 437–440, 495, 502, 517
Array of structures (AOS), 473, 476, 526–527
Array operations
  column means
    row-major ordering, 302
  least squares, 193–198
  min-max, 188–192
  simple calculations, 292–298
  square roots, 184–187
Arrays
  accessing elements, 51–53, 58–61
  comparing, 79–82
  one-dimensional, 51
  reversal, 82–86
  row-major ordering, 58
  two-dimensional
    row and column indices, 58, 66

B
Benchmark timing measurements
  csv file, 207
  TRIMMEAN, 207

C
C++
  classes
    AlignedArray, 238, 261, 368, 375, 404, 473, 502
    AlignedMem, 238
    array, 343
    BmThreadTimer, 206
    default_random_engine, 238
    ImageMatrix, 261
    matrix, 203
    mutex, 581
    OS, 206
    thread, 580–581
    uniform_int_distribution, 238
    unique_ptr, 302, 397, 473
  lvalue, 355
  rvalue, 355
  size_t, 77
  specifiers
    alignas, 171–172, 177, 198, 220, 290, 296, 351, 355–356, 450, 455, 495
Cache
  cache line, 531, 539
  L1 data (D-Cache), 530–531
  L1 instruction (I-Cache), 530–532
  L2, 530–532
  L3, 530–532
  non-temporal data, 557, 562
  pollution, 557, 562
  slice, 531
  temporal data, 557, 562
Conditional jump, 21, 44, 47–48
Conditional move, 21, 44, 48
Condition codes, 17–18, 44–48
Convolution
  discrete equation, 386–387, 397–398, 404
  input signal
    padding, 396–397, 404
  kernel
    fixed size, 388
    variable size, 388
  output signal, 385–387, 397, 404–405
  response signal, 385–386
  SIMD equations, 404–405
  theory, 386
  YMM registers, 489
  ZMM registers, 489
Correlation coefficient, 305–312
CPU Identification (CPUID)
  AVX-512 feature flags, 556
  feature flag, 555–556
  host operating system
    OSXSAVE, 555
  leaf value, 555
  memory caches, 554–557
  return results, 554
  serializing instruction, 554
  sub-leaf value, 554
  xgetbv, 554–555

D
Data blend, 285, 333
Data gather
  indices
    doubleword, 343
    quadword, 343
  merge control mask, 339, 343–344
  vector scale-index-base, 278
Data permute
  indices, 337
Data prefetch
  hint, 562
  linked list, 563, 569–570
Differences between x86-32 and x86-64 programming
  byte register restrictions, 15
  deprecated instructions, 15
  immediate operands
    32-bit, 13–14
  invalid instructions, 15
  operand sizes, 13

E
Enhanced bit manipulation
  leading zero bits, 414
  trailing zero bits, 414

F, G
Feature set identification. See CPUID
Flagless operations
  multiplication, 406–411
  shift, 406–411
FMA. See Fused-Multiply-Add (FMA)
FMA3. See Fused-Multiply-Add (FMA)
FMA4. See Fused-Multiply-Add (FMA)
Fundamental data types
  byte, 3–4
  double quadword, 3
  doubleword, 3–4
  little endian ordering, 3
  proper alignment, 4
  quadword, 3–4
  word, 3
Fused-Multiply-Add (FMA)
  arithmetic, 281
  convolution functions
    packed, 398–406
    scalar, 388–398
  data dependencies
    multiple registers, 397
  operand ordering scheme, 282
  packed, 281
  rounding
    MXCSR.RC, 281
  scalar, 281
  value discrepancies, 406

H
Half-precision floating-point
  encoding
    exponent, 280
    sign bit, 280
    significand, 280
  F16C, 280
Half-precision floating-point conversions
  rounding mode, 418

I
IEEE 754
  binary encoding
    exponent, 95–96
    sign bit, 95
    significand, 95–96
  special values
    denormal, 96
    floating-point zero, 96
    infinity, 96
    NaN, 96
    QNaN, 96
    SNaN, 96
Image processing
  image histogram, 255–262
  image statistics
    mean, 510, 517, 519
    standard deviation, 510, 517, 519
  image thresholding
    mask image, 262–263, 271–272
  pixel clipping, 363–369
  pixel conversions, 246–254
    instruction-level rounding, 503
    size reduction, 503
  pixel mean, 240–246
  pixel minimum-maximum, 232–240
  RGB pixel min-max values
    macro text string, 375
  RGB to grayscale conversion
    color conversion coefficients, 382
    size reduction, 527
    weighted sum, 381–382
  thresholding
    mask image, 504, 508
Instruction operands
  immediate, 11
  memory, 11
  register, 11
Instruction pipeline
  allocate rename block, 532
  branch prediction unit, 532, 536–537
  decoded instruction cache, 532
  execution engine
    execution unit, 532–533
  instruction decoder, 529, 532
  instruction fetch and pre-decode, 532
  instruction queue, 532
  loop stream detector, 532
  micro-op instruction queue, 532
  retire unit, 532–533
  scheduler, 529, 532
Instruction set extensions
  ADX, 280, 282–283
  BMI1, 280, 282–283, 406, 412, 415
  BMI2, 280, 282–283, 406, 412, 415
  F16C, 280, 385
  FMA, 280–282
  LZCNT, 280, 282–283, 406
  POPCNT, 280, 282–283
Integer arithmetic
  addition, 22–24
  division, 31–34
  logical operations, 24–27
  mixed sizes, 35–40
  multiplication, 31–34
  shift operations, 27–30
  subtraction, 22–24

J, K
Jump table, 182

L
Linked list
  node
    data, 569
    end-of-list terminator, 570
    link, 569
Loop unrolling, 239

M
MASM. See Microsoft Macro Assembler (MASM)
Matrix operations
  inverse
    Cayley-Hamilton theorem, 329
  multiplication, 207–213, 312–320
  transposition, 199–204, 206–207, 312–320
Matrix-vector multiplication
  equations, 476, 482
  permutation of vector components, 482
Memory addressing modes
  base register, 12, 41
  base register + disp, 12
  base register + index register, 12, 42
  base register + index register + disp, 12
  base register + index register * scale factor, 12, 42
  base register + index register * scale factor + disp, 12, 42
  effective address calculation, 11
  index * scale factor + disp, 12
  RIP + disp (RIP relative), 12
  RIP relative, 42
Microarchitecture
  Coffee Lake, 529
  Haswell, 529, 534
  Kaby Lake, 529
  Skylake, 529–531, 533
  Skylake Server, 421, 433, 529, 534
Micro-op
  macro-fusion, 532
  micro-fusion, 532
Microsoft Macro Assembler (MASM)
  comment line, 23
  custom segment, 290, 330
  directive
    =, 147
    align, 171, 262, 290, 330, 450
    .allocstack, 146, 151
    bcst, 463
    byte ptr, 38
    catstr, 375
    .code, 23
    .const, 42
    .data, 43
    dup, 290
    dword, 26, 42, 290
    dword ptr, 38
    endp, 23
    .endprolog, 56, 146–147, 151
    ends, 330
    equ, 127, 147
    .erridni, 375
    macro, 203
    proc, 23
    proc frame, 56, 164
    .pushreg, 56, 151
    qword, 262, 290
    qword ptr, 38
    readonly, 330
    real4, 111
    real8, 114
    .savexmm128, 157
    segment, 330
    .setframe, 146
    substr, 375
    word ptr, 38
    xmmword ptr, 172
    ymmword ptr, 290, 351
    zmmword ptr, 450
  label, 47
  location counter ($), 42
  macro text string, 375
Miscellaneous data types
  bit field, 6
  bit string, 6
  string, 6
Multithreading
  data arrays, 570
MXCSR
  control flags, 97, 98, 128, 133–134
  rounding control, 98, 100
  rounding mode, 128, 133–134
  status flags, 92, 97, 100, 122

N
Non-temporal memory store
  arrays, 557, 561
  hint, 562
Numeric data types
  floating-point
    double-precision, 5
    single-precision, 5
  signed integers, 5
  unsigned integers, 5

O
Optimization
  basic techniques, 534–536
  data alignment
    multi-byte values, 538
    packed floating-point, 538
    packed integer, 538
  floating-point arithmetic
    denormals, 536
    loop unrolling, 536
    precision, 536
    register dependencies, 536
  program branches
    backward conditional, 537
    branch prediction, 537
    forward conditional, 537
    loop unrolling, 537
  SIMD techniques
    register spills, 539

P, Q
Packed floating-point arithmetic
  common operations
    addition, 170
    compares, 173–179
    conversions, 179–183
    division, 170–171
    multiplication, 170–171
    subtraction, 170
  compares, 452–457
  conversions
    unsigned integer, 440, 443
  logical decisions, 295, 346
  operations
    absolute value, 290–291
    addition, 288–289, 448–449
    division, 289, 448–449
    multiplication, 289, 448–449
    square root, 289–290, 448–449
    subtraction, 288–289, 448–449
Packed integer arithmetic
  basic arithmetic
    doubleword, 502–503, 517–519, 526–527
    word, 495
  common operations
    addition, 215–221
    multiplication, 226–232
    shifts, 221–226
    subtraction, 215–221
  operations
    addition, 350
    shifts, 350
    subtraction, 350
  pack and unpack, 352–357
  size promotions, 244, 253
    sign extended, 358, 362
    zero extended, 358, 362

R
Registers
  general purpose
    8-bit, 8
    16-bit, 8
    32-bit, 8
    64-bit, 8–9
  MXCSR, 6–7
  RFLAGS
    carry, 9–10
    direction, 9–10
    overflow, 9–10
    parity, 9–10
    sign, 9–10
    zero, 9–10
  RIP (instruction pointer), 10
  RSP (stack pointer), 9
  XMM, 6–7, 9
  YMM, 6–7
  ZMM, 6
RFLAGS. See Registers
Ring interconnect, 531

S, T, U
Scalar floating-point arithmetic
  arrays, 135
  double-precision, 109, 112
  matrices, 138–143
  operations
    addition, 117
    compares, 118
    conversions, 128
    division, 109
    multiplication, 117
    square root, 117
    subtraction, 117
  single-precision, 110–112
SIMD. See Single Instruction Multiple Data (SIMD)
Single Instruction Multiple Data (SIMD)
  arithmetic
    horizontal addition, 101
    horizontal subtraction, 101
    packed floating-point, 100–102
    packed integer, 103–105
    saturated, 90–91
    wraparound, 90–91
  data types
    xmmword, 6
    ymmword, 6
    zmmword, 6
  programming concepts, 88–89
Smoothing operator
  Gaussian filter
    coefficients, 386
Strings
  concatenation, 74–79
  counting characters, 71–73
  direction flag, 82, 85
  end-of-string character, 73
Structure
  member alignment, 68
  padding, 68
Structure of arrays (SOA), 473, 476, 526–527
System agent, 531

V, W
Vector cross product
  component equation, 473
  gather, 466, 473
  opmask register, 473–474
  scatter, 466, 473
Vector scale-index-base (VSIB). See AVX2
Visual C++
  calling convention
    epilog macros, 144, 159–166
    floating-point argument, 141
    floating-point return value, 112, 153, 165
    function epilog, 56, 152–153
    function prolog, 56, 159
    general-purpose register, 143, 148–153
    integer argument, 146
    leaf function, 143, 147
    local storage, 147, 159, 262
    non-leaf function, 143–144, 146, 159, 163–164
    non-volatile register, 56, 143, 146–147, 151, 157, 163, 165
    prolog macros, 144, 159–166
    register arguments, 23, 38–39
    returning structures by value, 356
    return value, 33, 70
    stack alignment, 146
    stack arguments, 23, 38–39
    stack frame, 143–148, 151, 157, 163–165
    stack layout, 146–147, 152, 158–159, 163–164
    volatile register, 56, 143
    XMM register, 143, 151, 153, 164–165
    ZMM registers, 436, 449
  decorated name, 26
  extern “C” modifier, 26

X
XmmVal, 171–172, 177, 182, 219–220, 225, 230, 232, 290

Y
YmmVal, 290, 337, 351, 355

Z
ZmmVal, 449–450, 455–456, 495