software utilities for x86 processors visual studio

585© Daniel Kusswurm 2018D. Kusswurm, Modern X86 Assembly Language Programming, https://doi.org/10.1007/978-1-4842-4063-2

Appendix A

Appendix A includes supplemental material about the following items:

• Software utilities for x86 processors

• Visual Studio

• References

Software Utilities for x86 ProcessorsThe following utilities can be used to determine which x86 instruction set extensions are supported by the processor in your computer:

CPUID CPU-Z (https://www.cpuid.com)

HWiNFO Diagnostic Software (https://www.hwinfo.com)

Piriform SPECCY (https://www.ccleaner.com/speccy)

Visual StudioIn this section, you’ll learn how to use Microsoft’s Visual Studio development tool to run the source code examples that are described in the main text. You’ll also learn how to create a simple Visual Studio C++ project. Before proceeding, you may want to refer to the Introduction for additional information regarding Visual Studio and the recommended hardware platforms for running the source code examples. The Introduction also contains important details about downloading the source code ZIP files for each chapter.

Visual Studio uses logical entities called solutions and projects to help simplify application development. A solution is a collection of one or more projects that are used to build an application. A project is container object that organizes an application’s files. A Visual Studio project is usually created for each buildable component of an application (e.g., executable file, dynamic-linked library, static library, and so on).

A standard Visual Studio C++ project includes two solution configurations named Debug and Release. As implied by their names, these configurations support separate executable builds for initial development and final release. A standard Visual Studio C++ project also incorporates solution platforms. The default solution platforms are named Win32 and x64, which contain the necessary settings to build 32-bit and 64-bit executables, respectively. The Visual Studio solution and project files for this book’s source code examples include only the x64 platform.

https://doi.org/10.1007/978-1-4842-4063-2

https://www.cpuid.com

https://www.hwinfo.com

https://www.ccleaner.com/speccy

Appendix A

586

Running a Source Code ExampleYou can use the following steps to run any of the book’s source code examples:

1. Using File Explorer, double-click on the chapter’s Visual Studio solution (.sln) file. The solution file is included in the chapter source code ZIP file.

2. From the menu bar, select Build | Configuration Manager. In the Configuration Manager dialog box, set Active Solution Configuration to Release. Then set Active Solution Platform to x64. Note that these options may already be selected.

3. If necessary, select View | Solution Explorer to open the Solution Explorer window.

4. In the Solution Explorer window, right-click on a project to run and choose Set as StartUp Project.

5. Select Debug | Start Without Debugging to run the program.

Some of the source code examples reference data files in different folders using fixed path names. To run the corresponding executables using a different folder structure than the one used for Visual Studio development, you may need to change the path name strings in the C++ source code.

Creating a Visual Studio C++ ProjectIn this section, you’ll learn how to create a simple Visual Studio project that includes both C++ and assembly language source code files. The ensuing paragraphs describe the same basic procedure that was used to create the source code examples in the main text and includes the following phases:

• Create a C++ project

• Enable MASM support

• Add an assembly language file

• Set project properties

• Edit the source code

• Build and run the project

Create a C++ ProjectUse the following steps to create a Visual Studio C++ project:

1. Start Visual Studio.

2. Select File | New Project.

3. In the New Project dialog box control tree, select Installed | Visual C++ | Windows Desktop.

4. Select Windows Console Application for the project type.

5. In the Name text box, enter Example1.

Appendix A

587

6. In the Location text box, enter a folder name for the project location. You can also use the Browse button to choose a folder or leave the text unchanged to use the default location.

7. In the Solution text box, enter TestSolution.

8. Verify that the New Project dialog box settings are the same as the ones shown in Figure A-1 (the Location can be different). Click OK.

9. If necessary, select View | Solution Explorer to open the Solution Explorer window.

10. In the Solution Explorer tree control, right-click on the top-level text that’s labeled Solution ‘Example 1’ (1 Project) and select Rename. Change the solution name to TestSolution.

11. Select Build | Configuration Manager. In the Configuration Manager dialog box, choose <Edit…> under Active Solution Platforms (see Figure A-2).

12. In the Edit Solution Platforms dialog box, select x86 and click Remove (see Figure A-3). Click Close to close the Edit Solutions Platforms dialog box; click Close to close the Configuration Manager dialog box.

Figure A-1. New Project dialog box

Appendix A

588

Figure A-2. Configuration Manager dialog box

Figure A-3. Edit Solution Platforms dialog box

Appendix A

589

Enable MASM SupportUse the following steps to enable support for Microsoft Macro Assembler:

1. In the Solution Explorer tree control, right-click on Example1 and select Build Dependencies | Build Customizations.

2. In the Visual C++ Build Customizations dialog box, check masm(.targets, .props).

3. Click OK.

Add an Assembly Language FileUse the following steps to add an assembly language source code file (.asm) to a Visual Studio C++ project:

1. In the Solution Explorer tree control, right-click on Example1 and select Add | New Item.

2. Select C++ File (.cpp) for the file type.

3. In the Name text box, change the name to Example1_.asm, as shown in Figure A-4. Note that the trailing underscore is required since all C++ and assembly language source code files in a project must have a unique base name.

4. Click Add.

Figure A-4. Add New Item dialog box

Appendix A

590

Set Project PropertiesUse the following steps to set the project’s properties. The properties that control listing file generation (Steps 5 - 8) are optional.

1. In the Solution Explorer tree control, right-click on Example1 and select Properties.

2. In the Property Pages dialog box, change the Configuration setting to All Configurations and the Platform setting to All Platforms. Note that one or both options may already be set.

3. In the tree control, select Configuration Properties | General. Change the setting Whole Program Optimization to No Whole Program Optimization (see Figure A-5).

4. Select Configuration Properties | C/C++ | Code Generation. Change the setting Enable Enhanced Instruction Set to Advanced Vector Extensions (/arch:AVX) (see Figure A-6)

5. Select Configuration Properties | C/C++ | Output Files. Change the setting Assembler Output to Assembly Machine and Source Code (/FAcs) (see Figure A-7).

6. Select Configuration Properties | Microsoft Macro Assembler | Listing File. Change the setting Enable Assembly Generated Code Listing to Yes (/Sg) (see Figure A-8).

7. Change the Assembled Code Listing File text field to $(IntDir)\%(filename).lst (see Figure A-8). This macro text specifies the project’s intermediate directory, which is a subfolder of the main project folder.

8. Click OK.

Figure A-5. Property Pages dialog box (Whole Program Optimization)

Appendix A

591

Figure A-6. Property Pages dialog box (Enable Enhanced Instruction Set)

Figure A-7. Property Pages dialog box (Assembler Output)

Appendix A

592

Edit the Source CodeUse the following steps to edit the project source code:

1. In the Editor window, click on the tab named Example1.cpp.

2. Edit the C++ source code to match the code that’s shown in Listing A-1.

3. Click on the tab named Example1_.asm.

4. Edit the assembly language source code to match the code that’s shown in Listing A-2.

5. Select File | Save All.

Listing A-1. Example1.cpp

// Example1.cpp : Defines the entry point for the console application.//

#include "stdafx.h"#include <iostream>

using namespace std;

extern "C" int CalcResult1_(int val1, int val2, int* quo, int* rem);

int main(){ int val1 = 42; int val2 = 9;

Figure A-8. Property Pages dialog box (Microsoft Macro Assembler Listing File)

Appendix A

593

int quo; int rem; int prod = CalcResult1_(val1, val2, &quo, &rem);

cout << "Results for Example1\n"; cout << "val1 = " << val1 << '\n'; cout << "val2 = " << val2 << '\n'; cout << "quo = " << quo << '\n'; cout << "rem = " << rem << '\n'; cout << "prod = " << prod << '\n'; return 0;}

Listing A-2. Example1_.asm

; extern "C" int CalcResult1_(int val1, int val2, int* quo, int* rem);

.codeCalcResult1_ proc mov r10d,ecx ;r10d = val1 mov r11d,edx ;r11d = val2

mov eax,ecx ;eax = val1 cdq ;edx:eax = val1

idiv r11d ;calc val1 / val2 mov dword ptr [r8],eax ;save quotient mov dword ptr [r9],edx ;save remainder

imul r10d,r11d ;r10d = val1 * val2 mov eax,r10d ;eax = val1 * val2 retCalcResult1_ endp end

Build and Run the ProjectUse the following steps to build and run the project:

1. Select Build | Build Solution.

2. If necessary, fix any reported C++ compiler or MASM errors and repeat Step 1.

3. Select Debug | Start Without Debugging.

4. Verify that the output matches the console window shown in Figure A-9.

5. Press Enter to close the console window.

Appendix A

594

ReferencesThis section contains a list of references that were consulted during preparation of the main text. It also includes additional references and resources that provide worthwhile information. The references have been grouped into the following categories:

X86 programming reference manuals

X86 programming and microarchitecture references

Ancillary resources

Algorithm references

C++ references

X86 Programming Reference ManualsThe following is a list of x86 programming reference manuals published by AMD and Intel:

AMD64 Architecture Programmer’s Manual Volume 1: Application Programming https://support.amd.com/TechDocs/24592.pdf

AMD64 Architecture Programmer’s Manual Volume 3: General Purpose and System Instructions, https://support.amd.com/TechDocs/24594.pdf

AMD64 Architecture Programmer’s Manual Volume 4: 128-bit and 256-bit Media Instructions, https://support.amd.com/TechDocs/26568.pdf

Figure A-9. Console window output

https://support.amd.com/TechDocs/24592.pdf



Appendix A

595

Software Optimization Guide for AMD Family 17h Processors, Publication Number 55723, June 2017, https://developer.amd.com/resources/developer-guides-manuals

Intel 64 and IA-32 Architectures Software Developer’s Manual, Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4, https://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

Intel 64 and IA-32 Architectures Optimization Reference Manual, https://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

Intel Architecture Instruction Set Extensions and Future Features Programming Reference, https://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

X86 Programming and Microarchitecture ReferencesThe follow resources contain informative information about x86 assembly language programming, processors, and microarchitectures.

Guy Ben-Haim, Itai Neoran, and Ishay Tubi, Practical Intel AVX Optimization on 2nd Generation Intel Core Processors, https://software.intel.com/sites/default/files/m/d/4/1/d/8/Practical_Optimization_with_AVX.pdf

Ian Cutress, The Intel Skylake Mobile and Desktop Launch, with Architecture Analysis, September 2015, https://www.anandtech.com/show/9582/intel-skylake-mobile-desktop-launch-architecture-analysis

Ian Cutress, The Intel Skylake-X Review: Core i9-7900X, i7-7820X and i7-7800X Tested, June 2017, https://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested

Anger Fog, The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers, August 2018, https://agner.org/optimize/#manuals

Agner Fog, Optimizing subroutines in assembly language: An optimization guide for x86 platforms, April 2018, https://agner.org/optimize/#manuals

Chris Kirkpatrick, Intel AVX State Transitions: Migrating SSE Code to AVX, https://software.intel.com/en-us/articles/intel-avx-state-transitions-migrating-sse-code-to-avx

Patrick Konsor, Avoiding AVX-SSE Transition Penalties, https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties

Patrick Konsor, Performance Benefits of Half-Precision Floats, https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats

Daniel Kusswurm, Modern x86 Assembly Language Programming, Apress, ISBN 978-1-4842-0065-0, 2014.

https://developer.amd.com/resources/developer-guides-manuals

https://developer.amd.com/resources/developer-guides-manuals

https://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html








https://software.intel.com/sites/default/files/m/d/4/1/d/8/Practical_Optimization_with_AVX.pdf

https://software.intel.com/sites/default/files/m/d/4/1/d/8/Practical_Optimization_with_AVX.pdf

https://www.anandtech.com/show/9582/intel-skylake-mobile-desktop-launch-architecture-analysis

https://www.anandtech.com/show/9582/intel-skylake-mobile-desktop-launch-architecture-analysis

https://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested

https://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested

https://agner.org/optimize/#manuals



https://software.intel.com/en-us/articles/intel-avx-state-transitions-migrating-sse-code-to-avx

https://software.intel.com/en-us/articles/intel-avx-state-transitions-migrating-sse-code-to-avx

https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties

https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties

https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats



Appendix A

596

Max Locktyukhin, How to Detect New Instruction Support in the 4th Generation Intel Core Processor Family, August 2013, https://software.intel.com/en-us/node/405250

John Morgan, Microsoft Visual Studio 2017 Supports Intel AVX-512, https://blogs.msdn.microsoft.com/vcblog/2017/07/11/microsoft-visual-studio-2017-supports-intel-avx-512

Erdinc Ozturk, James Guilford, Vinodh Gopal, and Wajdi Feghal, New Instructions Supporting Large Integer Arithmetic on Intel Architecture Processors, August 2012, https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf

James Reinders, AVX-512 May Be a Hidden Gem in Intel Xeon Scalable Processors, June 2017, https://www.hpcwire.com/2017/06/29/reinders-avx-512-may-hidden-gem-intel-xeon-scalable-processors

Anand Lal Shimpi, Intel's Haswell Architecture Analyzed: Building a New PC and a New Intel, October 2012, http://www.anandtech.com/show/6355/intels-haswell-architecture

Ancillary ResourcesThe following resources contain useful information about x86 processors and microarchitectures:

Processors for Desktops, AMD, https://www.amd.com/en/products/processors-desktop

List of AMD Accelerated Processing Unit Microprocessors, Wikipedia, https://en.wikipedia.org/wiki/List_of_AMD_Accelerated_Processing_Unit_microprocessors

List of AMD CPU Microarchitectures, Wikipedia, https://en.wikipedia.org/wiki/List_of_AMD_CPU_microarchitectures

List of AMD Microprocessors, Wikipedia, https://en.wikipedia.org/wiki/List_of_AMD_processors

Product Information Website, Intel, https://ark.intel.com

List of Intel CPU Microarchitectures, Wikipedia, https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures

List of Intel Microprocessors, Wikipedia, https://en.wikipedia.org/wiki/Intel_processor

List of Intel Xeon Microprocessors, Wikipedia, https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors

Register Renaming, Wikipedia, https://en.wikipedia.org/wiki/Register_renaming

https://software.intel.com/en-us/node/405250

https://software.intel.com/en-us/node/405250

https://blogs.msdn.microsoft.com/vcblog/2017/07/11/microsoft-visual-studio-2017-supports-intel-avx-512

https://blogs.msdn.microsoft.com/vcblog/2017/07/11/microsoft-visual-studio-2017-supports-intel-avx-512

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf

https://www.hpcwire.com/2017/06/29/reinders-avx-512-may-hidden-gem-intel-xeon-scalable-processors

https://www.hpcwire.com/2017/06/29/reinders-avx-512-may-hidden-gem-intel-xeon-scalable-processors

http://www.anandtech.com/show/6355/intels-haswell-architecture

http://www.anandtech.com/show/6355/intels-haswell-architecture

https://www.amd.com/en/products/processors-desktop

https://www.amd.com/en/products/processors-desktop

https://en.wikipedia.org/wiki/List_of_AMD_Accelerated_Processing_Unit_microprocessors

https://en.wikipedia.org/wiki/List_of_AMD_Accelerated_Processing_Unit_microprocessors

https://en.wikipedia.org/wiki/List_of_AMD_CPU_microarchitectures

https://en.wikipedia.org/wiki/List_of_AMD_CPU_microarchitectures

https://en.wikipedia.org/wiki/List_of_AMD_processors

https://en.wikipedia.org/wiki/List_of_AMD_processors

https://ark.intel.com

https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures

https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures

https://en.wikipedia.org/wiki/Intel_processor

https://en.wikipedia.org/wiki/Intel_processor

https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors

https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors

https://en.wikipedia.org/wiki/Register_renaming

https://en.wikipedia.org/wiki/Register_renaming

Appendix A

597

Algorithm ReferencesThe following resources were consulted to develop the algorithms used in the source code examples:

Forman S. Acton, REAL Computing Made REAL – Preventing Errors in Scientific and Engineering Calculations, ISBN 978-0486442211, Dover Publications, 2005

Tony Chan, Gene Golub, Randall LeVeque, Algorithms for Computing the Sample Variance: Analysis and Recommendations, The American Statistician, Volume 37 Number 3 (1983), p. 242-247

James F. Epperson, An Introduction to Numerical Methods and Analysis, Second Edition, ISBN 978-1-118-36759-9, Wiley, 2013

David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic, ACM Computing Surveys, Volume 23 Issue 1 (March 1991), p. 5 – 48

Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing, Fourth Edition, ISBN 978-0-133-35672-4, 2018

James E. Miller, David G. Moursund, Charles S. Duris, Elementary Theory & Application of Numerical Analysis, Revised Edition, ISBN 978-0486479064, Dover Publications, 2011

Anthony Pettofrezzo, Matrices and Transformations, ISBN 0-486-63634-8, Dover Publications, 1978

Hans Schneider and George Barker, Matrices and Linear Algebra, ISBN 0-486-66014-1, Dover Publications, 1989

Eric W. Weisstein, Convolution, MathWorld, http://mathworld.wolfram.com/Convolution.html

Eric W. Weisstein, Correlation Coefficient, MathWorld, http://mathworld.wolfram.com/CorrelationCoefficient.html

Eric W. Weisstein, Cross Product, MathWorld, http://mathworld.wolfram.com/CrossProduct.html

Eric W. Weisstein, Least Squares Fitting, MathWorld, http://mathworld.wolfram.com/LeastSquaresFitting.html

Eric W. Weisstein, Matrix Multiplication, MathWorld, http://mathworld.wolfram.com/MatrixMultiplication.html

David M. Young and Robert Todd Gregory, A Survey of Numerical Mathematics, Volume 1, ISBN 0-486-65691-8, Dover Publications, 1988

Algorithms for calculating variance, Wikipedia, https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

Body Surface Area Calculator, http://www.globalrph.com/bsa2.htm

Grayscale, Wikipedia, https://en.wikipedia.org/wiki/Grayscale

Linked List, Wikipedia, https://en.wikipedia.org/wiki/Linked_list

http://mathworld.wolfram.com/Convolution.html

http://mathworld.wolfram.com/Convolution.html

http://mathworld.wolfram.com/CorrelationCoefficient.html

http://mathworld.wolfram.com/CorrelationCoefficient.html

http://mathworld.wolfram.com/CrossProduct.html

http://mathworld.wolfram.com/CrossProduct.html

http://mathworld.wolfram.com/LeastSquaresFitting.html

http://mathworld.wolfram.com/LeastSquaresFitting.html

http://mathworld.wolfram.com/MatrixMultiplication.html

http://mathworld.wolfram.com/MatrixMultiplication.html

https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

http://www.globalrph.com/bsa2.htm

https://en.wikipedia.org/wiki/Grayscale

https://en.wikipedia.org/wiki/Linked_list

Appendix A

598

C++ ReferencesThe following resources contain valuable information about C++ programming, the C++ Standard Template Libraries, and C++ programming using multiple threads.

Ivor Horton, Using the C++ Standard Template Libraries, Apress, ISBN 978-1-4842-0005-6, 2015

Nicolai M. Josuttis, The C++ Standard Library – A Tutorial and Reference, Second Edition, Addison Wesley, ISBN 978-0-321-62321-8, 2012

Bjarne Stroustrup, The C++ Programming Language, Fourth Edition, Addison Wesley, ISBN 978-0-321-56384-2, 2013

Anthony Williams, C++ Concurrency in Action – Practical Multithreading, ISBN 978-1-933-98877-1, Manning Publications, 2012

cplusplus.com, http://www.cplusplus.com

http://www.cplusplus.com

599© Daniel Kusswurm 2018 D. Kusswurm, Modern X86 Assembly Language Programming, https://doi.org/10.1007/978-1-4842-4063-2

�� AAdvanced Vector Extensions (AVX)

data typespacked floating-point, 100–102packed integer, 103–105scalar floating-point, 94–99

differences between x86-SSEexecution lanes, 105intermixing x86-AVX and x86-SSE

code, 107operand alignment, 106vzeroupper, 107YMM register high-order bit zeroing, 106

instruction syntaxnon-destructive source operand, 93

registersMXCSR, 92, 94, 97–98XMM registers, 91–94, 97, 104, 106–107YMM registers, 91–93, 99, 105–107

AVX2MXCSR, 277, 281non-destructive source

operand, 277operand alignment, 277packed floating-point, 278–279packed integer

variable bit shift, 282XMM registers, 277–279, 281YMM registers, 277–279, 281

AVX-512conditional execution and merging

merge masking, 424–425zero masking, 424–425

data types, 423–424embedded broadcast, 426instruction-level rounding

round down, 427, 443round to nearest, 427, 443round to zero, 427, 443

round up, 427, 443suppress all exceptions, 427, 443

instruction set extensionsAV512CD, 430, 433, 491AVX512BW, 430–431, 433, 491, 495AVX512DQ, 421, 422, 427, 431, 433, 491AVX512F, 427–429, 431, 433, 436, 464, 491AVX512VL, 421–422, 427, 433, 491

instruction syntaxconditional execution and

merging, 424–426embedded broadcast, 426instruction-level rounding, 427

merge masking, 433–436, 444, 495, 517predicate mask, 422, 424–425register sets

MXCSR, 427opmask registers, 422XMM registers, 422YMM registers, 422ZMM registers, 422

zero masking, 433, 437–440, 495, 502, 517

Array of structures (AOS), 473, 476, 526–527Array operations

column meansrow-major ordering, 302

least squares, 193–198min-max, 188–192simple calculations, 292–298square roots, 184–187

Arraysaccessing elements, 51–53, 58–61comparing, 79–82one-dimensional, 51reversal, 82–86row-major ordering, 58two-dimensional

row and column indices, 58, 66

Index

https://doi.org/10.1007/978-1-4842-4063-2

■ INDEX

600

�� BBenchmark timing measurements

csv file, 207TRIMMEAN, 207

�� CC++

classesAlignedArray, 238, 261, 368, 375, 404,

473, 502AlignedMem, 238array, 343BmThreadTimer, 206default_random_engine, 238ImageMatrix, 261matrix, 203mutex, 581OS, 206thread, 580–581uniform_int_distribution, 238unique_ptr, 302, 397, 473

lvalue, 355rvalue, 355size_t, 77specifiers

alignas, 171–172, 177, 198, 220, 290, 296, 351, 355–356, 450, 455, 495

Cachecache line, 531, 539L1 data (D-Cache), 530–531L1 instruction (I-Cache), 530–532L2, 530–532L3, 530–532non-temporal data, 557, 562pollution, 557, 562slice, 531temporal data, 557, 562

Conditional jump, 21, 44, 47–48Conditional move, 21, 44, 48Condition codes, 17–18, 44–48Convolution

discrete equation, 386–387, 397–398, 404input signal

padding, 396–397, 404kernel

fixed size, 388variable size, 388

output signal, 385–387, 397, 404–405response signal, 385–386SIMD equations, 404–405theory, 386YMM registers, 489ZMM registers, 489

Correlation coefficient, 305–312CPU Identification (CPUID)

AVX-512 feature flags, 556feature flag, 555–556host operating system

OSXSAVE, 555leaf value, 555memory caches, 554–557return results, 554serializing instruction, 554sub-leaf value, 554xgetbv, 554–555

�� DData blend, 285, 333Data gather

indicesdoubleword, 343quadword, 343

merge control mask, 339, 343–344vector scale-index-base, 278

Data permuteindices, 337

Data prefetchhint, 562linked list, 563, 569–570

Differences between x86-32 and x86-64 programming

byte register restrictions, 15deprecated instructions, 15immediate operands

32-bit, 13–14invalid instructions, 15operand sizes, 13

�� EEnhanced bit manipulation

leading zero bits, 414trailing zero bits, 414

�� F, GFeature set identification. See CPUIDFlagless operations

multiplication, 406–411shift, 406–411

FMA. See Fused-Multiply-Add (FMA)FMA3. See Fused-Multiply-Add (FMA)FMA4. See Fused-Multiply-Add (FMA)Fundamental data types

byte, 3–4double quadword, 3doubleword, 3–4

■ INDEX

601

little endian ordering, 3proper alignment, 4quadword, 3–4word, 3

Fused-Multiply-Add (FMA)arithmetic, 281convolution functions

packed, 398–406scalar, 388–398

data dependenciesmultiple registers, 397

operand ordering scheme, 282packed, 281rounding

MXCSR.RC, 281scalar, 281value discrepancies, 406

�� HHalf-precision floating-point

encodingexponent, 280sign bit, 280significand, 280

F16C, 280Half-precision floating-point conversions

rounding mode, 418

�� IIEEE 754

binary encodingexponent, 95–96sign bit, 95significand, 95–96

special valuesdenormal, 96floating-point zero, 96infinity, 96NaN, 96QNaN, 96SNaN, 96

Image processingimage histogram, 255–262image statistics

mean, 510, 517, 519standard deviation, 510, 517, 519

image thresholdingmask image, 262–263, 271–272

pixel clipping, 363–369pixel conversions, 246–254

instruction-level rounding, 503size reduction, 503

pixel mean, 240–246

pixel minimum-maximum, 232–240RGB pixel min-max values

macro text string, 375RGB to grayscale conversion

color conversion coefficients, 382size reduction, 527weighted sum, 381–382

thresholdingmask image, 504, 508

Instruction operandsimmediate, 11memory, 11register, 11

Instruction pipelineallocate rename block, 532branch prediction unit, 532, 536–537decoded instruction cache, 532execution engine

execution unit, 532–533instruction decoder, 529, 532instruction fetch and pre-decode, 532instruction queue, 532loop stream detector, 532micro-op instruction queue, 532retire unit, 532–533scheduler, 529, 532

Instruction set extensionsADX, 280, 282–283BMI1, 280, 282–283, 406, 412, 415BMI2, 280, 282–283, 406, 412, 415F16C, 280, 385FMA, 280–282LZCNT, 280, 282–283, 406POPCNT, 280, 282–283

Integer arithmeticaddition, 22–24division, 31–34logical operations, 24–27mixed sizes, 35–40multiplication, 31–34shift operations, 27–30subtraction, 22–24

�� J, KJump table, 182

�� LLinked list

nodedata, 569end-of-list terminator, 570link, 569

Loop unrolling, 239

■ INDEX

602

�� MMASM. See Microsoft Macro Assembler (MASM)Matrix operations

inverseCayley-Hamilton theorem, 329

multiplication, 207–213, 312–320transposition, 199–204, 206–207, 312–320

Matrix-vector multiplicationequations, 476, 482permutation of vector components, 482

Memory addressing modesbase register, 12, 41base register + disp, 12base register + index register, 12, 42base register + index register + disp, 12base register + index register * scale factor, 12, 42base register + index register * scale factor +

disp, 12, 42effective address calculation, 11index * scale factor + disp, 12RIP + disp (RIP relative), 12RIP relative, 42

MicroarchitectureCoffee Lake, 529Haswell, 529, 534Kaby Lake, 529Skylake, 529–531, 533Skylake Server, 421, 433, 529, 534

Micro-opmacro-fusion, 532micro-fusion, 532

Microsoft Macro Assembler (MASM)comment line, 23custom segment, 290, 330directive

=, 147align, 171, 262, 290, 330, 450 .allocstack, 146, 151bcst, 463byte ptr, 38catstr, 375.code, 23.const, 42.data, 43dup, 290dword, 26, 42, 290dword ptr, 38endp, 23.endprolog, 56, 146–147, 151ends, 330equ, 127, 147.erridni, 375macro, 203proc, 23

proc frame, 56, 164.pushreg, 56, 151qword, 262, 290qword ptr, 38readonly, 330real4, 111real8, 114.savexmm128, 157segment, 330.setframe, 146substr, 375word ptr, 38xmmword ptr, 172ymmword ptr, 290, 351zmmword ptr, 450

label, 47location counter ($), 42macro text string, 375

Miscellaneous data typesbit field, 6bit string, 6string, 6

Multithreadingdata arrays, 570

MXCSRcontrol flags, 97, 98, 128, 133–134rounding control, 98, 100rounding mode, 128, 133–134status flags, 92, 97, 100, 122

�� NNon-temporal memory store

arrays, 557, 561hint, 562

Numeric data typesfloating-point

double-precision, 5single-precision, 5

signed integers, 5unsigned integers, 5

�� OOptimization

basic techniques, 534–536data alignment

multi-byte values, 538packed floating-point, 538packed integer, 538

floating-point arithmeticdenormals, 536loop unrolling, 536precision, 536register dependencies, 536

■ INDEX

603

program branchesbackward conditional, 537branch prediction, 537forward conditional, 537loop unrolling, 537

SIMD techniquesregister spills, 539

�� P, QPacked floating-point arithmetic

common operationsaddition, 170compares, 173–179conversions, 179–183division, 170–171multiplication, 170–171subtraction, 170

compares, 452–457conversions

unsigned integer, 440, 443logical decisions, 295, 346operations

absolute value, 290–291addition, 288–289, 448–449division, 289, 448–449multiplication, 289, 448–449square root, 289–290, 448–449subtraction, 288–289, 448–449

Packed integer arithmeticbasic arithmetic

doubleword, 502–503, 517–519, 526–527

word, 495common operations

addition, 215–221multiplication, 226–232shifts, 221–226subtraction, 215–221

operationsaddition, 350shifts, 350subtraction, 350

pack and unpack, 352–357size promotions, 244, 253

sign extended, 358, 362zero extended, 358, 362

�� RRegisters

general purpose8-bit, 816-bit, 832-bit, 8

64-bit, 8–9MXCSR, 6–7RFLAGS

carry, 9–10direction, 9–10overflow, 9–10parity, 9–10sign, 9–10zero, 9–10

RIP (instruction pointer), 10RSP (stack pointer), 9XMM, 6–7, 9YMM, 6–7ZMM, 6

RFLAGS. See RegistersRing interconnect, 531

�� S, T, UScalar floating-point arithmetic

arrays, 135double-precision, 109, 112matrices, 138–143operations

addition, 117compares, 118conversions, 128division, 109multiplication, 117square root, 117subtraction, 117

single-precision, 110–112SIMD. See Single Instruction

Multiple Data (SIMD)Single Instruction Multiple Data (SIMD)

arithmetichorizontal addition, 101horizontal subtraction, 101packed floating-point, 100–102packed integer, 103–105saturated, 90–91wrapround, 90–91

data typesxmmword, 6ymmword, 6zmmword, 6

programming concepts, 88–89Smoothing operator

Gaussian filtercoefficients, 386

Stringsconcatenation, 74–79counting characters, 71–73direction flag, 82, 85end-of-string character, 73

■ INDEX

604

Structuremember alignment, 68padding, 68

Structure of arrays (SOA), 473, 476, 526–527System agent, 531

�� V, WVector cross product

component equation, 473gather, 466, 473opmask register, 473–474scatter, 466, 473

Vector scale-index-base (VSIB). See AVX2Visual C++

calling conventionepilog macros, 144, 159–166floating-point argument, 141floating-point return value, 112, 153, 165function epilog, 56, 152–153function prolog, 56, 159general-purpose register,

143, 148–153integer argument, 146leaf function, 143, 147local storage, 147, 159, 262non-leaf function, 143–144, 146, 159,

163–164

non-volatile register, 56, 143, 146–147, 151, 157, 163, 165

prolog macros, 144, 159–166register arguments, 23, 38–39returning structures by value, 356return value, 33, 70stack alignment, 146stack arguments, 23, 38–39stack frame, 143–148, 151, 157, 163–165stack layout, 146–147, 152, 158–159, 163–164volatile register, 56, 143XMM register, 143, 151, 153, 164–165ZMM registers, 436, 449

decorated name, 26extern “C” modifier, 26

�� XXmmVal, 171–172, 177, 182, 219–220, 225,

230, 232, 290

�� YYmmVal, 290, 337, 351, 355

�� ZZmmVal, 449–450, 455–456, 495

software utilities for x86 processors visual studio

Documents