IA-64 Application Instruction Set Architecture Guide

Revision 1.0

HP/Intel

IA-64 Application ISA Guide 1.0

THIS DOCUMENT IS PROVIDED “AS IS” WITH NO WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY, NONINFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION OR SAMPLE.

Information in this document is provided in connection with Intel/Hewlett-Packard products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel/Hewlett-Packard's Terms and Conditions of Sale for such products, Intel/Hewlett-Packard assumes no liability whatsoever, and Intel/Hewlett-Packard disclaims any express or implied warranty, relating to sale and/or use of Intel/Hewlett-Packard products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel/Hewlett-Packard products are not intended for use in medical, life saving, or life sustaining applications.

Intel/Hewlett-Packard may make changes to specifications and product descriptions at any time, without notice.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel/Hewlett-Packard reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

IA-64 processors may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Copyright © Intel Corporation / Hewlett-Packard Company, 1999

*Third-party brands and names are the property of their respective owners.

Table of Contents

1 About the IA-64 Application ISA Guide
1.1 Overview of IA-64 Application Instruction Set Architecture (ISA) Guide
1.2 Terminology

2 Introduction to the IA-64 Processor Architecture
2.1 IA-64 Operating Environments
2.2 Instruction Set Transition Model Overview
2.3 PA-RISC Compatibility
2.4 IA-64 Instruction Set Features
2.5 Instruction Level Parallelism
2.6 Compiler to Processor Communication
2.7 Speculation
2.7.1 Control Speculation
2.7.2 Data Speculation
2.8 Predication
2.9 Register Stack
2.10 Branching
2.11 Register Rotation
2.12 Floating-point Architecture
2.13 Multimedia Support

3 IA-64 Execution Environment
3.1 Application Register State
3.1.1 Reserved and Ignored Registers
3.1.2 General Registers
3.1.3 Floating-Point Registers
3.1.4 Predicate Registers
3.1.5 Branch Registers
3.1.6 Instruction Pointer
3.1.7 Current Frame Marker
3.1.8 Application Registers
3.1.9 Performance Monitor Data Registers (PMD)
3.1.10 User Mask (UM)
3.1.11 Processor Identification Registers
3.2 Memory
3.2.1 Application Memory Addressing Model
3.2.2 Addressable Units and Alignment
3.2.3 Byte Ordering
3.3 Instruction Encoding Overview
3.4 Instruction Sequencing

4 IA-64 Application Programming Model
4.1 Register Stack
4.1.1 Register Stack Operation
4.1.2 Register Stack Instructions
4.2 Integer Computation Instructions
4.2.1 Arithmetic Instructions
4.2.2 Logical Instructions
4.2.3 32-Bit Addresses and Integers
4.2.4 Bit Field and Shift Instructions
4.2.5 Large Constants
4.3 Compare Instructions and Predication
4.3.1 Predication
4.3.2 Compare Instructions
4.3.3 Compare Types
4.3.4 Predicate Register Transfers
4.4 Memory Access Instructions
4.4.1 Load Instructions
4.4.2 Store Instructions
4.4.3 Semaphore Instructions
4.4.4 Control Speculation
4.4.5 Data Speculation
4.4.6 Memory Hierarchy Control and Consistency
4.4.7 Memory Access Ordering
4.5 Branch Instructions
4.5.1 Modulo-Scheduled Loop Support
4.5.2 Branch Prediction Hints
4.6 Multimedia Instructions
4.6.1 Parallel Arithmetic
4.6.2 Parallel Shifts
4.6.3 Data Arrangement
4.7 Register File Transfers
4.8 Character Strings and Population Count
4.8.1 Character Strings
4.8.2 Population Count

5 IA-64 Floating-point Programming Model
5.1 Data Types and Formats
5.1.1 Real Types
5.1.2 Floating-point Register Format
5.1.3 Representation of Values in Floating-point Registers
5.2 Floating-point Status Register
5.3 Floating-point Instructions
5.3.1 Memory Access Instructions
5.3.2 Floating-Point Register to/from General Register Transfer Instructions
5.3.3 Arithmetic Instructions
5.3.4 Non-Arithmetic Instructions
5.3.5 Floating-point Status Register (FPSR) Status Field Instructions
5.3.6 Integer Multiply and Add Instructions
5.4 Additional IEEE Considerations
5.4.1 Definition of SNaNs, QNaNs, and Propagation of NaNs
5.4.2 IEEE Standard Mandated Operations Deferred to Software
5.4.3 Additions beyond the IEEE Standard

6 IA-64 Instruction Reference
6.1 Instruction Page Conventions
6.2 Instruction Descriptions

A Instruction Sequencing Considerations
A.1 RAW Ordering Exceptions
A.2 WAW Ordering Exceptions
A.3 WAR Ordering Exceptions

B IA-64 Pseudo-Code Functions

C IA-64 Instruction Formats
C.1 Format Summary
C.2 A-Unit Instruction Encodings
C.2.1 Integer ALU
C.2.2 Integer Compare
C.2.3 Multimedia
C.3 I-Unit Instruction Encodings
C.3.1 Multimedia and Variable Shifts
C.3.2 Integer Shifts
C.3.3 Test Bit
C.3.4 Miscellaneous I-Unit Instructions
C.3.5 GR/BR Moves
C.3.6 GR/Predicate/IP Moves
C.3.7 GR/AR Moves (I-Unit)
C.3.8 Sign/Zero Extend/Compute Zero Index
C.4 M-Unit Instruction Encodings
C.4.1 Loads and Stores
C.4.2 Line Prefetch
C.4.3 Semaphores
C.4.4 Set/Get FR
C.4.5 Speculation and Advanced Load Checks
C.4.6 Cache/Synchronization/RSE/ALAT
C.4.7 GR/AR Moves (M-Unit)
C.4.8 Miscellaneous M-Unit Instructions
C.4.9 Memory Management
C.5 B-Unit Instruction Encodings
C.5.1 Branches
C.5.2 Nop
C.5.3 Miscellaneous B-Unit Instructions
C.6 F-Unit Instruction Encodings
C.6.1 Arithmetic
C.6.2 Parallel Floating-point Select
C.6.3 Compare and Classify
C.6.4 Approximation
C.6.5 Minimum/Maximum and Parallel Compare
C.6.6 Merge and Logical
C.6.7 Conversion
C.6.8 Status Field Manipulation
C.6.9 Miscellaneous F-Unit Instructions
C.7 X-Unit Instruction Encodings
C.7.1 Miscellaneous X-Unit Instructions
C.7.2 Move Long Immediate64
C.8 Immediate Formation

List of Figures

Figure 3-1. Application Register Model
Figure 3-2. Frame Marker Format
Figure 3-3. RSC Format
Figure 3-4. BSP Register Format
Figure 3-5. BSPSTORE Register Format
Figure 3-6. RNAT Register Format
Figure 3-7. PFS Format
Figure 3-8. Epilog Count Register Format
Figure 3-9. User Mask Format
Figure 3-10. CPUID Registers 0 and 1 – Vendor Information
Figure 3-11. CPUID Register 2 – Processor Serial Number
Figure 3-12. CPUID Register 3 – Version Information
Figure 3-13. CPUID Register 4 – General Features/Capability Bits
Figure 3-14. Little-endian Loads
Figure 3-15. Big-endian Loads
Figure 3-16. Bundle Format
Figure 4-1. Register Stack Behavior on Procedure Call and Return
Figure 4-2. Data Speculation Recovery Using ld.c
Figure 4-3. Data Speculation Recovery Using chk.a
Figure 4-4. Memory Hierarchy
Figure 4-5. Allocation Paths Supported in the Memory Hierarchy
Figure 5-1. Floating-point Register Format
Figure 5-2. Floating-point Status Register Format
Figure 5-3. Floating-point Status Field Format
Figure 5-4. Memory to Floating-point Register Data Translation – Single Precision
Figure 5-5. Memory to Floating-point Register Data Translation – Double Precision
Figure 5-6. Memory to Floating-point Register Data Translation – Double Extended, Integer and Fill
Figure 5-7. Floating-point Register to Memory Data Translation
Figure 5-8. Spill/Fill and Double-Extended (80-bit) Floating-point Memory Formats
Figure 6-1. Add Pointer
Figure 6-2. Stack Frame
Figure 6-3. Operation of br.ctop and br.cexit
Figure 6-4. Operation of br.wtop and br.wexit
Figure 6-5. Deposit Example
Figure 6-6. Extract Example
Figure 6-7. Floating-point Merge Negative Sign Operation
Figure 6-8. Floating-point Merge Sign Operation
Figure 6-9. Floating-point Merge Sign and Exponent Operation
Figure 6-10. Floating-point Mix Left
Figure 6-11. Floating-point Mix Right
Figure 6-12. Floating-point Mix Left-Right
Figure 6-13. Floating-point Pack
Figure 6-14. Floating-point Merge Negative Sign Operation
Figure 6-15. Floating-point Merge Sign Operation
Figure 6-16. Floating-point Merge Sign and Exponent Operation
Figure 6-17. Floating-point Swap
Figure 6-18. Floating-point Swap Negate Left or Right
Figure 6-19. Floating-point Sign Extend Left
Figure 6-20. Floating-point Sign Extend Right
Figure 6-21. Function of getf.exp
Figure 6-22. Function of getf.sig
Figure 6-23. Mix Example
Figure 6-24. Mux1 Operation (8-bit elements)
Figure 6-25. Mux2 Examples (16-bit elements)
Figure 6-26. Pack Operation
Figure 6-27. Parallel Add Examples
Figure 6-28. Parallel Average Example
Figure 6-29. Parallel Average with Round Away from Zero Example
Figure 6-30. Parallel Average Subtract Example
Figure 6-31. Parallel Compare Example
Figure 6-32. Parallel Maximum Example
Figure 6-33. Parallel Minimum Example
Figure 6-34. Parallel Multiply Operation
Figure 6-35. Parallel Multiply and Shift Right Operation
Figure 6-36. Parallel Sum of Absolute Difference Example
Figure 6-37. Parallel Shift Left Example
Figure 6-38. Parallel Subtract Example
Figure 6-39. Function of setf.exp
Figure 6-40. Function of setf.sig
Figure 6-41. Shift Left and Add Pointer
Figure 6-42. Shift Right Pair
Figure 6-43. Unpack Operation
Figure C-1. Bundle Format

List of Tables

Table 2-1. IA-64 Processor Operating Environments
Table 3-1. Reserved and Ignored Registers and Fields
Table 3-2. Frame Marker Field Description
Table 3-3. Application Registers
Table 3-4. RSC Field Description
Table 3-5. PFS Field Description
Table 3-6. User Mask Field Descriptions
Table 3-7. CPUID Register 3 Fields
Table 3-8. Relationship Between Instruction Type and Execution Unit Type
Table 3-9. Template Field Encoding and Instruction Slot Mapping
Table 4-1. Architectural Visible State Related to the Register Stack
Table 4-2. Register Stack Management Instructions
Table 4-3. Integer Arithmetic Instructions
Table 4-4. Integer Logical Instructions
Table 4-5. 32-bit Pointer and 32-bit Integer Instructions
Table 4-6. Bit Field and Shift Instructions
Table 4-7. Instructions to Generate Large Constants
Table 4-8. Compare Instructions
Table 4-9. Compare Type Function
Table 4-10. Compare Outcome with NaT Source Input
Table 4-11. Instructions and Compare Types Provided
Table 4-12. Memory Access Instructions
Table 4-13. State Relating to Memory Access
Table 4-14. State Related to Control Speculation
Table 4-15. Instructions Related to Control Speculation
Table 4-16. State Relating to Data Speculation
Table 4-17. Instructions Relating to Data Speculation
Table 4-18. Locality Hints Specified by Each Instruction Class
Table 4-19. Memory Hierarchy Control Instructions and Hint Mechanisms
Table 4-20. Memory Ordering Rules
Table 4-21. Memory Ordering Instructions
Table 4-22. Branch Types
Table 4-23. State Relating to Branching
Table 4-24. Instructions Relating to Branching
Table 4-25. Instructions that Modify RRBs
Table 4-26. Whether Prediction Hint on Branches
Table 4-27. Sequential Prefetch Hint on Branches
Table 4-28. Predictor Deallocation Hint
Table 4-29. Parallel Arithmetic Instructions
Table 4-30. Parallel Shift Instructions
Table 4-31. Parallel Data Arrangement Instructions
Table 4-32. Register File Transfer Instructions
Table 4-33. String Support Instructions
Table 5-1. IEEE Real-Type Properties
Table 5-2. Floating-point Register Encodings
Table 5-3. Floating-point Status Register Field Description
Table 5-4. Floating-point Status Register’s Status Field Description
Table 5-5. Floating-point Rounding Control Definitions
Table 5-6. Floating-point Computation Model Control Definitions
Table 5-7. Floating-point Memory Access Instructions
Table 5-8. Floating-point Register Transfer Instructions
Table 5-9. General Register (Integer) to Floating-point Register Data Translation
Table 5-10. Floating-point Register to General Register (Integer) Data Translation
Table 5-11. Floating-point Instruction Status Field Specifier Definition
Table 5-12. Floating-point Arithmetic Instructions
Table 5-13. Floating-point Pseudo-Operations
Table 5-14. Non-Arithmetic Floating-point Instructions
Table 5-15. FPSR Status Field Instructions
Table 5-16. Integer Multiply and Add Instructions
Table 6-1. Instruction Page Description
Table 6-2. Instruction Page Font Conventions
Table 6-3. Register File Notation
Table 6-4. C Syntax Differences
Table 6-5. Branch Types
Table 6-6. Branch Whether Hint
Table 6-7. Sequential Prefetch Hint
Table 6-8. Branch Cache Deallocation Hint
Table 6-9. ALAT Clear Completer
Table 6-10. Comparison Types
Table 6-11. 64-bit Comparison Relations for Normal and unc Compares
Table 6-12. 64-bit Comparison Relations for Parallel Compares
Table 6-13. Immediate Range for 32-bit Compares
Table 6-14. Memory Compare and Exchange Size
Table 6-15. Compare and Exchange Semaphore Types
Table 6-16. Result Ranges for czx
Table 6-17. Specified pc Mnemonic Values
Table 6-18. sf Mnemonic Values
Table 6-19. Floating-point Class Relations
Table 6-20. Floating-point Classes
Table 6-21. Floating-point Comparison Types
Table 6-22. Floating-point Comparison Relations
Table 6-23. Fetch and Add Semaphore Types
Table 6-24. Floating-point Parallel Comparison Results
Table 6-25. Floating-point Parallel Comparison Relations
Table 6-26. sz Completers
Table 6-27. Load Types
Table 6-28. Load Hints
Table 6-29. fsz Completers
Table 6-30. FP Load Types
Table 6-31. lftype Mnemonic Values
Table 6-32. lfhint Mnemonic Values
Table 6-33. Indirect Register File Mnemonics
Table 6-34. Mux Permutations for 8-bit Elements
Table 6-35. Pack Saturation Limits
Table 6-36. Parallel Add Saturation Completers
Table 6-37. Parallel Add Saturation Limits
Table 6-38. Pcmp Relations
Table 6-39. PMPYSHR Shift Options
Table 6-40. Parallel Subtract Saturation Completers
Table 6-41. Parallel Subtract Saturation Limits
Table 6-42. Store Types
Table 6-43. Store Hints
Table 6-44. xsz Mnemonic Values
Table 6-45. Test Bit Relations for Normal and unc tbits
Table 6-46. Test Bit Relations for Parallel tbits
Table 6-47. Test NaT Relations for Normal and unc tnats
Table 6-48. Test NaT Relations for Parallel tnats
Table 6-49. Memory Exchange Size
Table B-1. Pseudo-Code Functions
Table C-1. Relationship Between Instruction Type and Execution Unit Type
Table C-2. Template Field Encoding and Instruction Slot Mapping

Table C-3. Major Opcode Assignments
Table C-4. Instruction Format Summary
Table C-5. Instruction Field Color Key
Table C-6. Instruction Field Names
Table C-7. Special Instruction Notations
Table C-8. Integer ALU 2-bit+1-bit Opcode Extensions
Table C-9. Integer ALU 4-bit+2-bit Opcode Extensions
Table C-10. Integer Compare Opcode Extensions
Table C-11. Integer Compare Immediate Opcode Extensions
Table C-12. Multimedia ALU 2-bit+1-bit Opcode Extensions
Table C-13. Multimedia ALU Size 1 4-bit+2-bit Opcode Extensions
Table C-14. Multimedia ALU Size 2 4-bit+2-bit Opcode Extensions
Table C-15. Multimedia ALU Size 4 4-bit+2-bit Opcode Extensions
Table C-16. Multimedia and Variable Shift 1-bit Opcode Extensions
Table C-17. Multimedia Max/Min/Mix/Pack/Unpack Size 1 2-bit Opcode Extensions
Table C-18. Multimedia Multiply/Shift/Max/Min/Mix/Pack/Unpack Size 2 2-bit Opcode Extensions
Table C-19. Multimedia Shift/Mix/Pack/Unpack Size 4 2-bit Opcode Extensions
Table C-20. Variable Shift 2-bit Opcode Extensions
Table C-21. Integer Shift/Test Bit/Test NaT 2-bit Opcode Extensions
Table C-22. Deposit Opcode Extensions
Table C-23. Test Bit Opcode Extensions
Table C-24. Misc I-Unit 3-bit Opcode Extensions
Table C-25. Misc I-Unit 6-bit Opcode Extensions
Table C-26. Integer Load/Store/Semaphore/Get FR 1-bit Opcode Extensions
Table C-27. Floating-point Load/Store/Load Pair/Set FR 1-bit Opcode Extensions
Table C-28. Integer Load/Store Opcode Extensions
Table C-29. Integer Load +Reg Opcode Extensions
Table C-30. Integer Load/Store +Imm Opcode Extensions
Table C-31. Semaphore/Get FR Opcode Extensions
Table C-32. Floating-point Load/Store/Lfetch Opcode Extensions
Table C-33. Floating-point Load/Lfetch +Reg Opcode Extensions
Table C-34. Floating-point Load/Store/Lfetch +Imm Opcode Extensions
Table C-35. Floating-point Load Pair/Set FR Opcode Extensions
Table C-36. Floating-point Load Pair +Imm Opcode Extensions
Table C-37. Load Hint Completer
Table C-38. Store Hint Completer
Table C-39. Line Prefetch Hint Completer
Table C-40. Opcode 0 Memory Management 3-bit Opcode Extensions
Table C-41. Opcode 0 Memory Management 4-bit+2-bit Opcode Extensions
Table C-42. Opcode 1 Memory Management 3-bit Opcode Extensions
Table C-43. Opcode 1 Memory Management 6-bit Opcode Extensions
Table C-44. IP-Relative Branch Types
Table C-45. Indirect/Miscellaneous Branch Opcode Extensions
Table C-46. Indirect Branch Types
Table C-47. Indirect Return Branch Types
Table C-48. Sequential Prefetch Hint Completer
Table C-49. Branch Whether Hint Completer
Table C-50. Indirect Call Whether Hint Completer
Table C-51. Branch Cache Deallocation Hint Completer
Table C-52. Indirect Predict/Nop Opcode Extensions
Table C-53. Miscellaneous Floating-point 1-bit Opcode Extensions
Table C-54. Opcode 0 Miscellaneous Floating-point 6-bit Opcode Extensions
Table C-55. Opcode 1 Miscellaneous Floating-point 6-bit Opcode Extensions
Table C-56. Reciprocal Approximation 1-bit Opcode Extensions
Table C-57. Floating-point Status Field Completer
Table C-58. Floating-point Arithmetic 1-bit Opcode Extensions
Table C-59. Fixed-point Multiply Add and Select Opcode Extensions

Table C-60. Floating-point Compare Opcode Extensions
Table C-61. Floating-point Class 1-bit Opcode Extensions
Table C-62. Misc X-Unit 3-bit Opcode Extensions
Table C-63. Misc X-Unit 6-bit Opcode Extensions
Table C-64. Move Long 1-bit Opcode Extensions
Table C-65. Immediate Formation


1 About the IA-64 Application ISA Guide

The Intel Architecture – 64-bit (IA-64) is a unique combination of innovative features, such as explicit parallelism, predication, speculation and much more. The architecture is designed to be highly scalable to fill the ever-increasing performance requirements of various server and workstation market segments. The IA-64 architecture features a revolutionary 64-bit instruction set architecture (ISA) which applies a new processor architecture technology called EPIC, or Explicitly Parallel Instruction Computing. A key feature of the IA-64 architecture is IA-32 instruction set compatibility.

This document provides a comprehensive description of the IA-64 architecture as exposed to application software. This includes information on application-level resources (registers, etc.), the application environment, and detailed application (non-privileged) instruction descriptions, formats and encodings.

1.1 Overview of IA-64 Application Instruction Set Architecture (ISA) Guide

Chapter 1, “About the IA-64 Application ISA Guide”. Gives an overview of this guide.

Chapter 2, “Introduction to the IA-64 Processor Architecture”. Provides an overview of key features of the IA-64 architecture.

Chapter 3, “IA-64 Execution Environment”. Describes the IA-64 application architectural state (registers, memory, etc.).

Chapter 4, “IA-64 Application Programming Model”. Describes the IA-64 architecture from the perspective of the application programmer. IA-64 instructions are grouped into related functions and an overview of their behavior is given.

Chapter 5, “IA-64 Floating-point Programming Model”. Describes the IA-64 floating-point registers, data types and formats, and floating-point instructions.

Chapter 6, “IA-64 Instruction Reference”. Provides a detailed description of IA-64 application instructions, their operation, and their instruction format.

Appendix A, “Instruction Sequencing Considerations”. Describes the details of instruction sequencing in the IA-64 architecture.

Appendix B, “IA-64 Pseudo-Code Functions”. Describes the pseudo-code functions used in Chapter 6, “IA-64 Instruction Reference”.

Appendix C, “IA-64 Instruction Formats”. Describes the encoding and instruction format of the instructions covered in Chapter 6, “IA-64 Instruction Reference”.

1.2 Terminology

The following definitions are for terms related to the IA-64 architecture and will be used in the rest of this document:

• Instruction Set Architecture (ISA) – Defines application and system level resources. These resources include instructions and registers.

• IA-64 Architecture – The architectural extensions defined in the new ISA, including 64-bit instruction capabilities, new performance-enhancing features and support for the IA-32 instruction set.

• IA-32 Architecture – The 32-bit and 16-bit Intel Architecture as described in the Intel Architecture Software Developer’s Manual.

• IA-64 Processor – An Intel 64-bit processor that implements both the IA-64 and the IA-32 instruction sets.

• IA-64 System Environment – IA-64 operating system privileged environment that supports the execution of both IA-64 and IA-32 code.

• IA-32 System Environment – Operating system privileged environment and resources as defined by the Intel Architecture Software Developer’s Manual. Resources include virtual paging, control registers, debugging, performance monitoring, machine checks, and the set of privileged instructions.

2 Introduction to the IA-64 Processor Architecture

The IA-64 architecture was designed to overcome the performance limitations of traditional architectures and provide maximum headroom for the future. To achieve this, IA-64 was designed with an array of innovative features to extract greater instruction level parallelism, including speculation, predication, large register files, a register stack, an advanced branch architecture, and many others. 64-bit memory addressability was added to meet the increasingly large memory footprint requirements of data warehousing, E-business, and other high performance server and workstation applications.

The IA-64 architecture also provides binary compatibility with the IA-32 instruction set. IA-64 processors can run IA-32 applications on an IA-64 operating system that supports execution of IA-32 applications. IA-64 processors can run IA-32 application binaries on IA-32 legacy operating systems, assuming the platform and firmware support exists in the system. The IA-64 architecture also provides the capability to support mixed IA-32 and IA-64 code execution.

The IA-64 architecture was designed with the understanding that compatibility with IA-32 and PA-RISC is a key requirement. Significant effort has been applied in the architectural definition to maximize IA-64 scalability, performance and architectural longevity. As a result, IA-64 has been designed with an array of performance optimization techniques that extend the architecture to 64 bits and enable higher performance. These features include speculation, predication, large register files, a register stack, and an advanced branch architecture, and are designed to extract greater instruction level parallelism.

2.1 IA-64 Operating Environments

The IA-64 architecture supports two operating system environments:

• IA-32 System Environment: supports IA-32 32-bit operating systems, and

• IA-64 System Environment: supports IA-64 operating systems.

The architectural model also supports a mixture of IA-32 and IA-64 applications within a single IA-64 operating system. Table 2-1 defines the major operating environments supported on IA-64 processors.

Table 2-1. IA-64 Processor Operating Environments

System        Application
Environment   Environment             Usage
-----------   ---------------------   ------------------------------------------------------
IA-32         IA-32 Instruction Set   IA-32 Protected Mode, Real Mode and Virtual 8086 Mode
                                      application and operating system environment.
                                      Compatible with IA-32 Pentium®, Pentium Pro and
                                      Pentium II processors.
IA-64         IA-32 Protected Mode    IA-32 Protected Mode applications in the IA-64 system
                                      environment, if supported by OS.
IA-64         IA-32 Real Mode         IA-32 Real Mode applications in the IA-64 system
                                      environment, if supported by OS.
IA-64         IA-32 Virtual Mode      IA-32 Virtual 86 Mode applications in the IA-64 system
                                      environment, if supported by OS.
IA-64         IA-64 Instruction Set   IA-64 Applications on IA-64 operating systems.

2.2 Instruction Set Transition Model Overview

Within the IA-64 System Environment, the processor can execute either IA-32 or IA-64 instructions at any time. Three special instructions and interruptions are defined to transition the processor between the IA-32 and the IA-64 instruction sets.

• jmpe (IA-32 instruction) Jump to an IA-64 target instruction, and change the instruction set to IA-64.

• br.ia (IA-64 instruction) IA-64 branch to an IA-32 target instruction, and change the instruction set to IA-32.

• Interruptions transition the processor to the IA-64 instruction set for handling all interruption conditions.

• rfi (IA-64 instruction) “return from interruption” is defined to return to an IA-32 or IA-64 instruction.

The jmpe and br.ia instructions provide a low overhead mechanism to transfer control between the instruction sets. These instructions are typically incorporated into “thunks” or “stubs” that implement the required call linkage and calling conventions to call dynamically or statically linked libraries.

2.3 PA-RISC Compatibility

Binary compatibility between PA-RISC and IA-64 is handled through dynamic object code translation. This process is very efficient because there is such a high degree of correspondence between PA-RISC and IA-64 instructions. HP’s performance studies show that on average the dynamic translator only spends 1-2% of its time in translation with 98-99% of the time spent executing native code. The dynamic translator actually performs optimizations on the translated code to take advantage of IA-64’s wider instructions, and performance features such as predication, speculation and large register sets. In addition, if an application has been aggressively optimized for PA-RISC, some of the benefit of the optimizations will carry over to IA-64. In fact, an aggressively optimized PA-RISC application may actually perform faster on IA-64 using the dynamic translator than the same application recompiled at a low level of optimization on an IA-64 compiler. Of course, the best performance will result from a high level of optimization using a good native compiler.

The dynamic translator is designed to run all non-kernel intrusive code, handling both 64-bit and 32-bit instructions. This means operating systems and device drivers typically would not be supported, but all other applications will run. HP’s dynamic translator will be bundled with all versions of HP-UX sold for IA-64 systems. When HP-UX encounters code compiled for PA-RISC, it will automatically and transparently invoke the dynamic translator, which will allow the code to run on IA-64 without any intervention. Correctness of the dynamic translator has been verified with the same testing regimen used to validate PA-RISC processors.

2.4 IA-64 Instruction Set Features

IA-64 incorporates architectural features which enable high sustained performance and remove barriers to further performance increases. The IA-64 architecture is based on the following principles:

• Explicit parallelism

— Mechanisms for synergy between the compiler and the processor

— Massive resources to take advantage of instruction level parallelism

— 128 Integer and Floating point registers, 64 1-bit predicate registers, 8 branch registers

— Support for many execution units and memory ports

• Features that enhance instruction level parallelism

— Speculation (which minimizes memory latency impact).

— Predication (which removes branches).

— Software pipelining of loops with low overhead

— Branch prediction to minimize the cost of branches



• Focussed enhancements for improved software performance

— Special support for software modularity

— High performance floating-point architecture

— Specific multimedia instructions

The following sections highlight these important features of IA-64.

2.5 Instruction Level Parallelism

Instruction Level Parallelism (ILP) is the ability to execute multiple instructions at the same time. The IA-64 architecture allows issuing of independent instructions in bundles (three instructions per bundle) for parallel execution and can issue multiple bundles per clock. Supported by a large number of parallel resources such as large register files and multiple execution units, the IA-64 architecture enables the compiler to manage work in progress and schedule simultaneous threads of computation.

The IA-64 architecture incorporates mechanisms to take advantage of ILP. Compilers for traditional architectures are often limited in their ability to utilize speculative information because it cannot always be guaranteed to be correct. The IA-64 architecture enables the compiler to exploit speculative information without sacrificing the correct execution of an application (see Section 2.7, “Speculation”). In traditional architectures, procedure calls limit performance since registers need to be spilled and filled. IA-64 enables procedures to communicate register usage to the processor. This allows the processor to schedule procedure register operations even when there is a low degree of ILP. See Section 2.9, “Register Stack”.

2.6 Compiler to Processor Communication

The IA-64 architecture provides mechanisms, such as instruction templates, branch hints, and cache hints, to enable the compiler to communicate compile-time information to the processor. In addition, IA-64 allows compiled code to manage the processor hardware using run-time information. These communication mechanisms are vital in minimizing the performance penalties associated with branches and cache misses.

Every memory load and store in IA-64 has a 2-bit cache hint field in which the compiler encodes its prediction of the spatial and/or temporal locality of the memory area being accessed. An IA-64 processor can use this information to determine the placement of cache lines in the cache hierarchy. This leads to better utilization of the hierarchy, which matters more as the relative cost of cache misses continues to grow.

2.7 Speculation

There are two types of speculation: control and data. In both control and data speculation, the compiler exposes ILP by issuing an operation early and removing the latency of this operation from the critical path. The compiler will issue an operation speculatively if it is reasonably sure that the speculation will be beneficial. For speculation to be beneficial, two conditions should hold: the speculation must be correct often enough that the probability of requiring recovery is small, and issuing the operation early should expose further ILP-enhancing optimization. Speculation is one of the primary mechanisms for the compiler to exploit statistical ILP by overlapping, and therefore tolerating, the latencies of operations.

2.7.1 Control Speculation

Control speculation is the execution of an operation before the branch which guards it. Consider the code sequence below:

if (a>b) load(ld_addr1,target1)
else load(ld_addr2,target2)

If the operation load(ld_addr1,target1) were to be performed prior to the determination of (a>b), then the operation would be control speculative with respect to the controlling condition (a>b). Under normal execution, the operation load(ld_addr1,target1) may or may not execute. If the new control speculative load causes an exception, then the exception should only be serviced if (a>b) is true. When the compiler uses control speculation, it leaves a check operation at the original location. The check verifies whether an exception has occurred and if so it branches to recovery code. The code sequence above now translates into:



/* off critical path */
sload(ld_addr1,target1)
sload(ld_addr2,target2)

/* other operations including uses of target1/target2 */
if (a>b) scheck(target1,recovery_addr1)
else scheck(target2,recovery_addr2)
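
The transformation above can be modeled in a few lines of Python. This is only an illustrative sketch, not IA-64 code: the names Deferred, sload, scheck and select are hypothetical, a Python dictionary stands in for memory, and a missing key stands in for a faulting address. The sketch assumes that recovery simply repeats the load non-speculatively.

```python
class Deferred:
    """Marker standing in for an exception deferred by a speculative load."""

    def __init__(self, error):
        self.error = error


def sload(memory, addr):
    """Speculative load: never raises; a fault is recorded in the result."""
    try:
        return memory[addr]
    except KeyError as exc:
        return Deferred(exc)


def scheck(value, recover):
    """Check left at the original load site: recover only if the load faulted."""
    return recover() if isinstance(value, Deferred) else value


def select(memory, a, b, addr1, addr2):
    # Both loads are hoisted above the branch, off the critical path.
    target1 = sload(memory, addr1)
    target2 = sload(memory, addr2)
    # The checks stay where the original loads were, under the branch.
    if a > b:
        return scheck(target1, lambda: memory[addr1])
    return scheck(target2, lambda: memory[addr2])
```

With memory = {0x10: 7}, the call select(memory, 2, 1, 0x10, 0x99) returns 7 even though the hoisted load from 0x99 faulted, because that fault is never checked on the taken path. With the condition reversed, the check fires and the repeated load raises the exception at the original location.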

2.7.2 Data Speculation

Data speculation is the execution of a memory load prior to a store that precedes it in program order and that may alias with it. Data speculative loads are also referred to as “advanced loads”. Consider the code sequence below:

store(st_addr,data)
load(ld_addr,target)
use(target)

The process of determining at compile time the relationship between memory addresses is called disambiguation. In the example above, if ld_addr and st_addr cannot be disambiguated, and if the load were to be performed prior to the store, then the load would be data speculative with respect to the store. If memory addresses overlap during execution, a data-speculative load issued before the store might return a different value than a regular load issued after the store. Therefore, analogous to control speculation, when the compiler data speculates a load, it leaves a check instruction at the original location of the load. The check verifies whether an overlap has occurred and if so it branches to recovery code. The code sequence above now translates into:

/* off critical path */
aload(ld_addr,target)

/* other operations including uses of target */
store(st_addr,data)
acheck(target,recovery_addr)
use(target)
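
As a rough model of this sequence, the Python sketch below tracks outstanding advanced loads in a small table and invalidates an entry when a later store hits the same address. The Machine class, its advanced table and the run function are assumptions made for illustration; the sketch matches whole addresses only and does not describe the actual hardware structure or its matching rules.

```python
class Machine:
    """Toy memory plus a table of outstanding advanced loads."""

    def __init__(self):
        self.memory = {}
        self.advanced = {}  # target name -> address of its advanced load

    def aload(self, addr, target):
        """Advanced load: read early and remember which address was read."""
        self.advanced[target] = addr
        return self.memory.get(addr, 0)

    def store(self, addr, data):
        """Store, invalidating any advanced load from the same address."""
        self.memory[addr] = data
        for target in [t for t, a in self.advanced.items() if a == addr]:
            del self.advanced[target]

    def acheck(self, target):
        """Return True if the advanced load for target is still valid."""
        return target in self.advanced


def run(machine, st_addr, ld_addr, data):
    value = machine.aload(ld_addr, "target")    # hoisted above the store
    machine.store(st_addr, data)
    if not machine.acheck("target"):            # overlap detected
        value = machine.memory.get(ld_addr, 0)  # recovery: repeat the load
    return value
```

When the two addresses differ, run returns the value read early and no recovery is needed. When they are equal, the check fails and the repeated load returns the stored data, which is the value a regular load after the store would have seen.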

2.8 Predication

Predication is the conditional execution of instructions. Conditional execution is implemented through branches in traditional architectures. IA-64 implements this function through the use of predicated instructions. Predication removes branches used for conditional execution, resulting in larger basic blocks and the elimination of associated mispredict penalties.

To illustrate, an unpredicated instruction

r1 = r2 + r3

when predicated, would be of the form

if (p5) r1 = r2 + r3

In this example p5 is the controlling predicate that decides whether or not the instruction executes and updates state. If the predicate value is true, then the instruction updates state. Otherwise it generally behaves like a nop. Predicates are assigned values by compare instructions.

Predicated execution avoids branches, and simplifies compiler optimizations by converting a control dependence to a data dependence. Consider the original code:

if (a>b) c = c + 1
else d = d * e + f

The branch at (a>b) can be avoided by converting the code above to the predicated code:

pT, pF = compare(a>b)
if (pT) c = c + 1
if (pF) d = d * e + f



The predicate pT is set to 1 if the condition evaluates to true, and to 0 if the condition evaluates to false. The predicate pF is the complement of pT. The control dependence of the instructions c = c + 1 and d = d * e + f on the branch with the condition (a>b) is now converted into a data dependence on compare(a>b) through predicates pT and pF (the branch is eliminated). An added benefit is that the compiler can schedule the instructions under pT and pF to execute in parallel. It is also worth noting that there are several different types of compare instructions that write predicates in different ways, including unconditional compares and parallel compares.
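
The equivalence between the branching form and the predicated form can be checked with a small Python sketch. The function names compare_gt, branchy and predicated are hypothetical, and the conditional expressions merely model an instruction that either updates state or behaves like a nop; they are not a claim about how the hardware evaluates predicates.

```python
def compare_gt(a, b):
    """Return the predicate pair (pT, pF) for the condition a > b."""
    p_true = a > b
    return p_true, not p_true


def branchy(a, b, c, d, e, f):
    """Original form: a branch guards each assignment."""
    if a > b:
        c = c + 1
    else:
        d = d * e + f
    return c, d


def predicated(a, b, c, d, e, f):
    """If-converted form: both operations are issued, predicates gate updates."""
    p_true, p_false = compare_gt(a, b)
    # Neither computation depends on the other, so both could issue together.
    new_c = c + 1
    new_d = d * e + f
    # A false predicate leaves the destination unchanged (nop behavior).
    c = new_c if p_true else c
    d = new_d if p_false else d
    return c, d
```

For example, both branchy(3, 2, 10, 4, 5, 6) and predicated(3, 2, 10, 4, 5, 6) return (11, 4), and with a = 1 both return (10, 26).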

2.9 Register Stack

IA-64 avoids the unnecessary spilling and filling of registers at procedure call and return interfaces through compiler-controlled renaming. At a call site, a new frame of registers is available to the called procedure without the need for register spill and fill (either by the caller or by the callee). Register access occurs by renaming the virtual register identifiers in the instructions through a base register into the physical registers. The callee can freely use available registers without having to spill and eventually restore the caller’s registers. The callee executes an alloc instruction specifying the number of registers it expects to use in order to ensure that enough registers are available. If sufficient registers are not available (stack overflow), the alloc stalls the processor and spills the caller’s registers until the requested number of registers are available.

At the return site, the base register is restored to the value that the caller was using to access registers prior to the call. Some of the caller’s registers may have been spilled by the hardware and not yet restored. In this case (stack underflow), the return stalls the processor until the processor has restored an appropriate number of the caller’s registers. The hardware can exploit the explicit register stack frame information to spill and fill registers from the register stack to memory at the best opportunity (independent of the calling and called procedures).

2.10 Branching

In addition to removing branches through the use of predication, several mechanisms are provided to decrease the branch misprediction rate and the cost of the remaining mispredicted branches. These mechanisms provide ways for the compiler to communicate information about branch conditions to the processor.

For indirect branches, a branch register is used to hold the target address.

Special loop-closing branches are provided to accelerate counted loops and modulo-scheduled loops. These branches provide information that allows for perfect prediction of loop termination, thereby eliminating costly mispredict penalties and reducing the loop overhead.

2.11 Register Rotation

Modulo scheduling of a loop is analogous to hardware pipelining of a functional unit since the next iteration of the loop starts before the previous iteration has finished. The iteration is split into stages similar to the stages of an execution pipeline. Modulo scheduling allows the compiler to execute loop iterations in parallel rather than sequentially. The concurrent execution of multiple iterations traditionally requires unrolling of the loop and software renaming of registers. IA-64 allows a renaming of registers that provides every iteration with its own set of registers, avoiding the need for unrolling. This kind of register renaming is called register rotation. The result is that software pipelining can be applied to a much wider variety of loops, both small and large, with significantly reduced overhead.
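The idea that each iteration sees its own registers without unrolling can be illustrated with a toy rotating register file. This sketch is an assumption-laden model, not the architectural definition: the `RotatingFile` class, the single `base` value, and the choice to decrement it once per iteration with wrap-around are all hypothetical simplifications.

```python
class RotatingFile:
    """Toy rotating register file: virtual names are renamed through a base value."""

    def __init__(self, size):
        self.physical = [0] * size
        self.base = 0  # rename base (assumed to change once per loop iteration)

    def _index(self, virtual):
        return (virtual + self.base) % len(self.physical)

    def read(self, virtual):
        return self.physical[self._index(virtual)]

    def write(self, virtual, value):
        self.physical[self._index(virtual)] = value

    def rotate(self):
        """Advance to the next iteration: every value appears one name higher."""
        self.base = (self.base - 1) % len(self.physical)
```

In this model a value written to virtual register 0 in one iteration is read as virtual register 1 in the next, so the same loop body can be reused for every stage of the software pipeline.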

2.12 Floating-point Architecture

IA-64 defines a floating-point architecture with full IEEE support for the single, double, and double-extended (80-bit) data types. Some extensions, such as a fused multiply and add operation, minimum and maximum functions, and a register file format with a larger range than the double-extended memory format, are also included. 128 floating-point registers are defined. Of these, 96 registers are rotating (not stacked) and can be used to modulo schedule loops compactly. Multiple floating-point status registers are provided for speculation.

IA-64 has parallel FP instructions which operate on two 32-bit single precision numbers, resident in a single floating-point register, in parallel and independently. These instructions significantly increase the single precision floating-point computation throughput and enhance the performance of 3D intensive applications and games.



2.13 Multimedia Support

IA-64 has multimedia instructions which treat the general registers as concatenations of eight 8-bit, four 16-bit, or two 32-bit elements. These instructions operate on each element in parallel, independent of the others. IA-64 multimedia instructions are semantically compatible with HP’s MAX-2 multimedia technology and Intel’s MMX technology instructions and Streaming SIMD Extensions instruction technology.
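Treating a 64-bit register as independent lanes can be sketched as follows. The helper `parallel_add` is hypothetical, and wrap-around (modular) addition per lane is an assumption made for the example; this section does not say which arithmetic modes the real instructions provide.

```python
def parallel_add(a, b, lane_bits):
    """Add two 64-bit values lane by lane; carries never cross a lane boundary."""
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane width must be 8, 16, or 32 bits")
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 64, lane_bits):
        # Only the low lane_bits of each shifted operand affect the masked sum.
        lane = ((a >> shift) + (b >> shift)) & mask
        result |= lane << shift
    return result
```

With 16-bit lanes, a lane holding 0xFFFF wraps to 0x0000 when 1 is added, and the neighboring lane is unaffected.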



3 IA-64 Execution Environment

The architectural state consists of registers and memory. The results of instruction execution become architecturally visible according to a set of execution sequencing rules. This chapter describes the IA-64 application architectural state and the rules for execution sequencing.

3.1 Application Register State

The following is a list of the registers available to application programs (see Figure 3-1):

• General Registers (GRs) – General purpose 64-bit register file, GR0 – GR127. IA-32 integer and segment registers are contained in GR8 - GR31 when executing IA-32 instructions.

• Floating-Point Registers (FRs) – Floating-point register file, FR0 – FR127. IA-32 floating-point and multi-media registers are contained in FR8 - FR31 when executing IA-32 instructions.

• Predicate Registers (PRs) – Single-bit registers, used in IA-64 predication and branching, PR0 – PR63.

• Branch Registers (BRs) – Registers used in IA-64 branching, BR0 – BR7.

• Instruction Pointer (IP) – Register which holds the bundle address of the currently executing IA-64 instruction, or byte address of the currently executing IA-32 instruction.

• Current Frame Marker (CFM) – State that describes the current general register stack frame, and FR/PR rotation.

• Application Registers (ARs) – A collection of special-purpose IA-64 and IA-32 application registers.

• Performance Monitor Data Registers (PMD) – Data registers for performance monitor hardware.

• User Mask (UM) – A set of single-bit values used for alignment traps, performance monitors, and to monitor floating-point register usage.

• Processor Identifiers (CPUID) – Registers that describe processor implementation-dependent IA-64 features.

IA-32 application register state is entirely contained within the larger IA-64 application register set and is accessible by IA-64 instructions. IA-32 instructions cannot access the IA-64 specific register set.

3.1.1 Reserved and Ignored Registers

Registers which are not defined are either reserved or ignored. An access to a reserved register raises an Illegal Operation fault. A read of an ignored register returns zero. Software may write any value to an ignored register and the hardware will ignore the value written. In variable-sized register sets, registers which are unimplemented in a particular processor are also reserved registers. An access to one of these unimplemented registers causes a Reserved Register/Field fault.

Within defined registers, fields which are not defined are either reserved or ignored. For reserved fields, hardware will always return a zero on a read. Software must always write zeros to these fields. Any attempt to write a non-zero value into a reserved field will raise a Reserved Register/Field fault. Reserved fields may have a possible future use.

For ignored fields, hardware will return a 0 on a read, unless noted otherwise. Software may write any value to these fields since the hardware will ignore any value written. Except where noted otherwise, some IA-32 ignored fields may have a possible future use.

Table 3-1 summarizes how the processor treats reserved and ignored registers and fields.



For defined fields in registers, values which are not defined are reserved. Software must always write defined values to these fields. Any attempt to write a reserved value will raise a Reserved Register/Field fault. Certain registers are read-only registers. A write to a read-only register raises an Illegal Operation fault.

When fields are marked as reserved, it is essential for compatibility with future processors that software treat these fields as having a future, though unknown, effect. Software should follow these guidelines when dealing with reserved fields:

• Do not depend on the state of any reserved fields. Mask all reserved fields before testing.

• Do not depend on the states of any reserved fields when storing to memory or a register.

• Do not depend on the ability to retain information written into reserved or ignored fields.

• Where possible, reload reserved or ignored fields with values previously returned from the same register; otherwise load zeros.

Table 3-1. Reserved and Ignored Registers and Fields

Type               Read                        Write
Reserved register  Illegal Operation fault     Illegal Operation fault
Ignored register   0                           Value written is discarded
Reserved field     0                           Write of non-zero causes Reserved Register/Field fault
Ignored field      0 (unless noted otherwise)  Value written is discarded

Figure 3-1. Application Register Model

[Figure: the application register set. It shows the general registers gr0 - gr127 (64 bits plus a NaT bit each; gr0 reads as 0), the floating-point registers fr0 - fr127 (82 bits; fr0 reads as +0.0 and fr1 as +1.0), the predicates pr0 - pr63, the branch registers br0 - br7 (64 bits), the Instruction Pointer (64 bits), the Current Frame Marker (38 bits), the User Mask (6 bits), the performance monitor data registers pmd0 - pmdm (64 bits), the processor identifiers cpuid0 - cpuidn (64 bits), and the application registers ar0 - ar127 (64 bits), among them KR0 - KR7, RSC, BSP, BSPSTORE, RNAT, FCR, EFLAG, CSD, SSD, CFLG, FSR, FIR, FDR, CCV, UNAT, FPSR, ITC, PFS, LC, and EC.]



3.1.2 General Registers

A set of 128 (64-bit) general registers provide the central resource for all integer and integer multimedia computation. They are numbered GR0 through GR127, and are available to all programs at all privilege levels. Each general register has 64 bits of normal data storage plus an additional bit, the NaT bit (Not a Thing), which is used to track deferred speculative exceptions.

The general registers are partitioned into two subsets. General registers 0 through 31 are termed the static general registers. Of these, GR0 is special in that it always reads as zero when sourced as an operand, and attempting to write to GR0 causes an Illegal Operation fault. General registers 32 through 127 are termed the stacked general registers. The stacked registers are made available to a program by allocating a register stack frame consisting of a programmable number of local and output registers. See Chapter 4.1, “Register Stack” for a description. A portion of the stacked registers can be programmatically renamed to accelerate loops. See “Modulo-Scheduled Loop Support” on page 4-20.

General registers 8 through 31 contain the IA-32 integer, segment selector and segment descriptor registers when executing IA-32 instructions.

3.1.3 Floating-Point Registers

A set of 128 (82-bit) floating-point registers are used for all floating-point computation. They are numbered FR0 through FR127, and are available to all programs at all privilege levels. The floating-point registers are partitioned into two subsets. Floating-point registers 0 through 31 are termed the static floating-point registers. Of these, FR0 and FR1 are special. FR0 always reads as +0.0 when sourced as an operand, and FR1 always reads as +1.0. When either of these is used as a destination, a fault is raised. Deferred speculative exceptions are recorded with a special register value called NaTVal (Not a Thing Value).

Floating-point registers 32 through 127 are termed the rotating floating-point registers. These registers can be programmatically renamed to accelerate loops. See “Modulo-Scheduled Loop Support” on page 4-20.

Floating-point registers 8 through 31 contain the IA-32 floating-point and multi-media registers when executing IA-32 instructions.

3.1.4 Predicate Registers

A set of 64 (1-bit) predicate registers are used to hold the results of IA-64 compare instructions. These registers are numbered PR0 through PR63, and are available to all programs at all privilege levels. These registers are used for conditional execution of instructions.

The predicate registers are partitioned into two subsets. Predicate registers 0 through 15 are termed the static predicate registers. Of these, PR0 always reads as ‘1’ when sourced as an operand, and when used as a destination, the result is discarded. The static predicate registers are also used in conditional branching. See “Predication” on page 4-6.

Predicate registers 16 through 63 are termed the rotating predicate registers. These registers can be programmatically renamed to accelerate loops. See “Modulo-Scheduled Loop Support” on page 4-20.

3.1.5 Branch Registers

A set of 8 (64-bit) branch registers are used to hold IA-64 branching information. They are numbered BR0 through BR7, and are available to all programs at all privilege levels. The branch registers are used to specify the branch target addresses for indirect branches. For more information see “Branch Instructions” on page 4-19.

3.1.6 Instruction Pointer

The Instruction Pointer (IP) holds the address of the bundle which contains the currently executing IA-64 instruction. The IP can be read directly with a mov ip instruction. The IP cannot be directly written, but is incremented as instructions are executed, and can be set to a new value with a branch. Because IA-64 instruction bundles are 16 bytes, and are 16-byte aligned, the least significant 4 bits of IP are always zero. See “Instruction Encoding Overview” on page 3-11. For IA-32 instruction set execution, IP holds the zero extended 32-bit virtual linear address of the currently executing IA-32 instruction. IA-32 instructions are byte-aligned, therefore the least significant 4 bits of IP are preserved for IA-32 instruction set execution.



3.1.7 Current Frame Marker

Each general register stack frame is associated with a frame marker. The frame marker describes the state of the IA-64 general register stack. The Current Frame Marker (CFM) holds the state of the current stack frame. The CFM cannot be directly read or written (see “Register Stack” on page 4-1).

The frame markers contain the sizes of the various portions of the stack frame, plus three Register Rename Base values (used in register rotation). The layout of the frame markers is shown in Figure 3-2 and the fields are described in Table 3-2.

On a call, the CFM is copied to the Previous Frame Marker field in the Previous Function State register (see Section 3.1.8.10). A new value is written to the CFM, creating a new stack frame with no locals or rotating registers, but with a set of output registers which are the caller’s output registers. Additionally, all Register Rename Base registers (RRBs) are set to 0. See “Modulo-Scheduled Loop Support” on page 4-20.

3.1.8 Application Registers

The application register file includes special-purpose data registers and control registers for application-visible processor functions for both the IA-32 and IA-64 instruction sets. These registers can be accessed by IA-64 application software (except where noted). Table 3-3 contains a list of the application registers.

Bits:   37:32   31:25   24:18   17:14  13:7  6:0
Field:  rrb.pr  rrb.fr  rrb.gr  sor    sol   sof
Width:  6       7       7       4      7     7

Figure 3-2. Frame Marker Format

Table 3-2. Frame Marker Field Description

Field   Bit Range  Description
sof     6:0        Size of stack frame
sol     13:7       Size of locals portion of stack frame
sor     17:14      Size of rotating portion of stack frame (the number of rotating registers is 8 * sor)
rrb.gr  24:18      Register Rename Base for general registers
rrb.fr  31:25      Register Rename Base for floating-point registers
rrb.pr  37:32      Register Rename Base for predicate registers
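The frame marker layout of Table 3-2 translates directly into shifts and masks. The Python sketch below is illustrative only: `decode_frame_marker` and the field-name spellings with underscores are assumptions, and because the CFM cannot be read directly, the raw value here stands for a frame marker obtained some other way (for example the pfm field of the PFS).

```python
# Field name -> (high bit, low bit), taken from Table 3-2.
FRAME_MARKER_FIELDS = {
    "sof": (6, 0),
    "sol": (13, 7),
    "sor": (17, 14),
    "rrb_gr": (24, 18),
    "rrb_fr": (31, 25),
    "rrb_pr": (37, 32),
}


def extract(value, high, low):
    """Return bits high..low (inclusive) of value."""
    return (value >> low) & ((1 << (high - low + 1)) - 1)


def decode_frame_marker(raw):
    """Split a 38-bit frame marker into its fields."""
    fields = {name: extract(raw, hi, lo) for name, (hi, lo) in FRAME_MARKER_FIELDS.items()}
    # The sor field counts groups of eight rotating registers.
    fields["rotating_registers"] = 8 * fields["sor"]
    return fields
```

For example, a frame marker with sof = 12, sol = 8, and sor = 1 describes a 12-register frame with 8 locals and 8 rotating registers.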



Application registers can only be accessed by either an M or an I execution unit. This is specified in the last column of the table. The ignored registers are for future backward-compatible extensions.

3.1.8.1 Kernel Registers (KR 0-7 – AR 0-7)

Eight user-visible IA-64 64-bit data kernel registers are provided to convey information from the operating system to the application. These registers can be read at any privilege level but are writable only at the most privileged level. KR0 - KR2 are also used to hold additional IA-32 register state when the IA-32 instruction set is executing.

3.1.8.2 Register Stack Configuration Register (RSC – AR 16)

The Register Stack Configuration (RSC) Register is a 64-bit register used to control the operation of the IA-64 Register Stack Engine (RSE). The RSC format is shown in Figure 3-3 and the field description is contained in Table 3-4. Instructions that modify the RSC can never set the privilege level field to a more privileged level than the currently executing process.

Table 3-3. Application Registers

Register         Name        Description                                   Execution Unit Type
AR 0-7           KR 0-7 (a)  Kernel Registers 0-7                          M
AR 8-15                      Reserved                                      M
AR 16            RSC         Register Stack Configuration Register         M
AR 17            BSP         Backing Store Pointer (read-only)             M
AR 18            BSPSTORE    Backing Store Pointer for Memory Stores       M
AR 19            RNAT        RSE NAT Collection Register                   M
AR 20                        Reserved                                      M
AR 21            FCR         IA-32 Floating-point Control Register         M
AR 22 – AR 23                Reserved                                      M
AR 24            EFLAG (b)   IA-32 EFLAG register                          M
AR 25            CSD         IA-32 Code Segment Descriptor                 M
AR 26            SSD         IA-32 Stack Segment Descriptor                M
AR 27            CFLG (a)    IA-32 Combined CR0 and CR4 register           M
AR 28            FSR         IA-32 Floating-point Status Register          M
AR 29            FIR         IA-32 Floating-point Instruction Register     M
AR 30            FDR         IA-32 Floating-point Data Register            M
AR 31                        Reserved                                      M
AR 32            CCV         Compare and Exchange Compare Value Register   M
AR 36            UNAT        User NAT Collection Register                  M
AR 40            FPSR        Floating-Point Status Register                M
AR 44            ITC         Interval Time Counter                         M
AR 48 – AR 63                Ignored                                       M or I
AR 64            PFS         Previous Function State                       I
AR 65            LC          Loop Count Register                           I
AR 66            EC          Epilog Count Register                         I
AR 112 – AR 127              Ignored                                       M or I

a. Writes to these registers when the privilege level is not zero result in a Privileged Register fault. Reads are always allowed.
b. Some IA-32 EFLAG field writes are silently ignored if the privilege level is not zero.

3.1.8.3 RSE Backing Store Pointer (BSP – AR 17)

The RSE Backing Store Pointer is a 64-bit read-only register (Figure 3-4). It holds the address of the location in memory which is the save location for GR 32 in the current stack frame.

3.1.8.4 RSE Backing Store Pointer for Memory Stores (BSPSTORE – AR 18)

The RSE Backing Store Pointer for memory stores is a 64-bit register (Figure 3-5). It holds the address of the location in memory to which the RSE will spill the next value.

3.1.8.5 RSE NAT Collection Register (RNAT – AR 19)

The RSE NaT Collection Register is a 64-bit register (Figure 3-6) used by the RSE to temporarily hold NaT bits when it is spilling general registers. Bit 63 always reads as zero and ignores all writes.

3.1.8.6 Compare and Exchange Value Register (CCV – AR 32)

The Compare and Exchange Value Register is a 64-bit register that contains the compare value used as the third source operand in the IA-64 cmpxchg instruction.

Bits:   63:30  29:16   15:5  4   3:2  1:0
Field:  rv     loadrs  rv    be  pl   mode
Width:  34     14      11    1   2    2

Figure 3-3. RSC Format

Table 3-4. RSC Field Description

Field   Bit Range    Description
mode    1:0          RSE mode – controls how aggressively the RSE saves and restores register frames. Eager and intensive settings are hints and can be implemented as lazy.

                     Bit Pattern  RSE Mode         Bit 1: eager loads  Bit 0: eager stores
                     00           enforced lazy    disabled            disabled
                     10           load intensive   enabled             disabled
                     01           store intensive  disabled            enabled
                     11           eager            enabled             enabled

pl      3:2          RSE privilege level – loads and stores issued by the RSE are at this privilege level
be      4            RSE endian mode – loads and stores issued by the RSE use this byte ordering (0: little endian; 1: big endian)
loadrs  29:16        RSE load distance to tear point – value used in the loadrs instruction for synchronizing the RSE to a tear point
rv      15:5, 63:30  Reserved

Bits:   63:3     2:0
Field:  pointer  ig
Width:  61       3

Figure 3-4. BSP Register Format

Bits:   63:3     2:0
Field:  pointer  ig
Width:  61       3

Figure 3-5. BSPSTORE Register Format

Bits:   63  62:0
Field:  ig  RSE NaT Collection
Width:  1   63

Figure 3-6. RNAT Register Format
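As an illustration of Table 3-4, the following Python sketch decodes an RSC value into its fields. The function name `decode_rsc` and the dictionary keys are assumptions chosen for the example; only the bit positions and mode names come from the table.

```python
# Mode bit pattern -> RSE mode name, from Table 3-4.
RSE_MODES = {
    0b00: "enforced lazy",
    0b10: "load intensive",
    0b01: "store intensive",
    0b11: "eager",
}


def decode_rsc(raw):
    """Decode the defined fields of a 64-bit RSC value (reserved bits are not returned)."""
    mode = raw & 0b11
    return {
        "mode": RSE_MODES[mode],
        "eager_loads": bool(mode & 0b10),    # bit 1
        "eager_stores": bool(mode & 0b01),   # bit 0
        "pl": (raw >> 2) & 0b11,             # bits 3:2
        "big_endian": bool((raw >> 4) & 1),  # bit 4
        "loadrs": (raw >> 16) & 0x3FFF,      # bits 29:16 (14 bits)
    }
```

A value with mode bits 10 decodes as "load intensive" with eager loads enabled and eager stores disabled, matching the second row of the mode table.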



3.1.8.7 User NAT Collection Register (UNAT – AR 36)

The User NaT Collection Register is a 64-bit register used to temporarily hold NaT bits when saving and restoring general registers with the IA-64 ld8.fill and st8.spill instructions.

3.1.8.8 Floating-Point Status Register (FPSR – AR 40)

The floating-point status register (FPSR) controls traps, rounding mode, precision control, flags, and other control bits for IA-64 floating-point instructions. FPSR does not control or reflect the status of IA-32 floating-point instructions. For more details on the FPSR, see Section 5.2.

3.1.8.9 Interval Time Counter (ITC – AR 44)

The Interval Time Counter (ITC) is a 64-bit register which counts up at a fixed relationship to the processor clock frequency. Applications can directly sample the ITC for time-based calculations and performance measurements. System software can secure the interval time counter from non-privileged IA-64 access. When secured, a read of the ITC at any privilege level other than the most privileged causes a Privileged Register fault. The ITC can be written only at the most privileged level. The IA-32 Time Stamp Counter (TSC) is equivalent to ITC. ITC can directly be read by the IA-32 rdtsc (read time stamp counter) instruction. System software can secure the ITC from non-privileged IA-32 access. When secured, an IA-32 read of the ITC at any privilege level other than the most privileged raises an IA-32_Exception(GPfault).

3.1.8.10 Previous Function State (PFS – AR 64)

The IA-64 Previous Function State register (PFS) contains multiple fields: Previous Frame Marker (pfm), Previous Epilog Count (pec), and Previous Privilege Level (ppl). Figure 3-7 diagrams the PFS format and Table 3-5 describes the PFS fields. These values are copied automatically on a call from the CFM register, Epilog Count Register (EC) and PSR.cpl (Current Privilege Level in the Processor Status Register) to accelerate procedure calling.

When an IA-64 br.call is executed, the CFM, EC, and PSR.cpl are copied to the PFS and the old contents of the PFS are discarded. When an IA-64 br.ret is executed, the PFS is copied to the CFM and EC. PFS.ppl is copied to PSR.cpl, unless this action would increase the privilege level.

The PFS.pfm has the same layout as the CFM (see Section 3.1.7), and the PFS.pec has the same layout as the EC (see Section 3.1.8.12).

3.1.8.11 Loop Count Register (LC – AR 65)

The Loop Count register (LC) is a 64-bit register used in IA-64 counted loops. LC is decremented by counted-loop-type branches.

3.1.8.12 Epilog Count Register (EC – AR 66)

The Epilog Count register (EC) is a 6-bit register used for counting the final (epilog) stages in IA-64 modulo-scheduled loops. See “Modulo-Scheduled Loop Support” on page 4-20. A diagram of the EC register is shown in Figure 3-8.

Bits:   63:62  61:58  57:52  51:38  37:0
Field:  ppl    rv     pec    rv     pfm
Width:  2      4      6      14     38

Figure 3-7. PFS Format

Table 3-5. PFS Field Description

Field  Bit Range     Description
pfm    37:0          Previous Frame Marker
pec    57:52         Previous Epilog Count
ppl    63:62         Previous Privilege Level
rv     51:38, 61:58  Reserved
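The PFS layout in Table 3-5 can be exercised with a pack/unpack pair. This is a sketch under assumptions: `pack_pfs` and `unpack_pfs` are hypothetical helper names, and packing simply places each value in its field with the reserved bits left at zero, as the reserved-field rules in Section 3.1.1 require of software.

```python
PFM_BITS, PEC_BITS, PPL_BITS = 38, 6, 2  # field widths from Figure 3-7


def pack_pfs(pfm, pec, ppl):
    """Build a 64-bit PFS value; reserved bits 51:38 and 61:58 stay zero."""
    if not 0 <= pfm < (1 << PFM_BITS):
        raise ValueError("pfm does not fit in 38 bits")
    if not 0 <= pec < (1 << PEC_BITS):
        raise ValueError("pec does not fit in 6 bits")
    if not 0 <= ppl < (1 << PPL_BITS):
        raise ValueError("ppl does not fit in 2 bits")
    return (ppl << 62) | (pec << 52) | pfm


def unpack_pfs(raw):
    """Extract pfm (37:0), pec (57:52), and ppl (63:62) from a PFS value."""
    return {
        "pfm": raw & ((1 << PFM_BITS) - 1),
        "pec": (raw >> 52) & ((1 << PEC_BITS) - 1),
        "ppl": (raw >> 62) & ((1 << PPL_BITS) - 1),
    }
```

Packing and then unpacking returns the original three values, which mirrors the save on a call and the restore on a return at the level of the bit layout.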



3.1.9 Performance Monitor Data Registers (PMD)

A set of performance monitoring registers can be configured by privileged software to be accessible at all privilege levels. Performance monitor data can be directly sampled from within the application. The operating system is allowed to secure user-configured performance monitors. Secured performance counters return zeros when read, regardless of the current privilege level. The performance monitors can only be written at the most privileged level. Performance monitors can be used to gather performance information for both IA-32 and IA-64 instruction set execution.

3.1.10 User Mask (UM)

The user mask is a subset of the Processor Status Register and is accessible to IA-64 application programs. The user mask controls memory access alignment, byte-ordering and user-configured performance monitors. It also records the modification state of IA-64 floating-point registers. Figure 3-9 shows the user mask format and Table 3-6 describes the user mask fields.

3.1.11 Processor Identification Registers

Application level processor identification information is available in an IA-64 register file termed CPUID. This register file is divided into a fixed region, registers 0 to 4, and a variable region, register 5 and above. The CPUID[3].number field indicates the maximum number of 8-byte registers containing processor specific information.

The CPUID registers are unprivileged and accessed using the indirect mov (from) instruction. All registers beyond register CPUID[3].number are reserved and raise a Reserved Register/Field fault if they are accessed. Writes are not permitted and no instruction exists for such an operation.

Bits:   63:6  5:0
Field:  ig    epilog count
Width:  58    6

Figure 3-8. Epilog Count Register Format

Bits:   5    4    3   2   1   0
Field:  mfh  mfl  ac  up  be  rv
Width:  1    1    1   1   1   1

Figure 3-9. User Mask Format

Table 3-6. User Mask Field Descriptions

Field  Bit Range  Description
rv     0          Reserved
be     1          IA-64 big-endian memory access enable (controls loads and stores but not RSE memory accesses).
                  0: accesses are done little-endian
                  1: accesses are done big-endian
                  This bit is ignored for IA-32 data memory accesses. IA-32 data references are always performed little-endian.
up     2          User performance monitor enable for IA-32 and IA-64 instruction set execution.
                  0: user performance monitors are disabled
                  1: user performance monitors are enabled
ac     3          Alignment check for IA-32 and IA-64 data memory references.
                  0: unaligned data memory references may cause an Unaligned Data Reference fault.
                  1: all unaligned data memory references cause an Unaligned Data Reference fault.
mfl    4          Lower (f2 .. f31) floating-point registers written – This bit is set to one when an IA-64 instruction that uses register f2..f31 as a target register completes. This bit is sticky and is only cleared by an explicit write of the user mask.
mfh    5          Upper (f32 .. f127) floating-point registers written – This bit is set to one when an IA-64 instruction that uses register f32..f127 as a target register completes. This bit is sticky and is only cleared by an explicit write of the user mask.
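The six-bit user mask of Table 3-6 can be decoded with a simple bit test per field. In this Python sketch the name `decode_user_mask` is an assumption, and the reserved bit 0 is deliberately masked out rather than reported, following the guideline to mask reserved fields before testing.

```python
# Field name -> bit position, from Table 3-6 (bit 0 is reserved and skipped).
USER_MASK_BITS = {"be": 1, "up": 2, "ac": 3, "mfl": 4, "mfh": 5}


def decode_user_mask(um):
    """Return the defined user mask bits as booleans."""
    return {name: bool((um >> bit) & 1) for name, bit in USER_MASK_BITS.items()}


def floating_point_state_modified(um):
    """True if either the lower (mfl) or upper (mfh) FP register set was written."""
    flags = decode_user_mask(um)
    return flags["mfl"] or flags["mfh"]
```

A caller could use the second helper to decide whether floating-point registers need saving, since the mfl and mfh bits are sticky until the user mask is explicitly written.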



Vendor information is located in CPUID registers 0 and 1 and specifies a vendor name, in ASCII, for the processor implementation (Figure 3-10). All bytes after the end of the string up to the 16th byte are zero. Earlier ASCII characters are placed in lower numbered registers and lower numbered byte positions.

A Processor Serial Number is located in CPUID register 2. If Processor Serial Numbers are supported by the processor model and are not disabled, this register returns a 64-bit Processor Serial Number (Figure 3-11); otherwise zero is returned. The Processor Serial Number (64 bits) must be combined with the 32-bit version information (CPUID register 3; processor archrev, family, model, and revision numbers) to form a 96-bit Processor Identifier.

The 96-bit Processor Identifier is designed to be unique.

CPUID register 3 contains several fields indicating version information related to the processor implementation. Figure 3-12 and Table 3-7 specify the definitions of each field.

CPUID register 4 provides general application level information about IA-64 features. As shown in Figure 3-13, it is a set of flag bits used to indicate if a given IA-64 feature is supported in the processor model. When a bit is one the feature is supported; when 0 the feature is not supported. This register does not contain IA-32 instruction set features. IA-32 instruction set features can be acquired by the IA-32 cpuid instruction. There are no defined feature bits in the current architecture. As new features are added (or removed) from future processor models the presence (or removal) of new features will be indicated by new feature bits. A value of zero in this register indicates all features defined in the first IA-64 architectural revision are implemented.

Register  Bits 63:0 (width 64)
CPUID[0]  vendor string, byte 0 in the least-significant byte position
CPUID[1]  vendor string, byte 15 in the most-significant byte position

Figure 3-10. CPUID Registers 0 and 1 – Vendor Information

Bits:   63:0
Field:  Processor Serial Number
Width:  64

Figure 3-11. CPUID Register 2 – Processor Serial Number

Bits:   63:40  39:32    31:24   23:16  15:8      7:0
Field:  rv     archrev  family  model  revision  number
Width:  24     8        8       8      8         8

Figure 3-12. CPUID Register 3 – Version Information

Table 3-7. CPUID Register 3 Fields

Field     Bits   Description
number    7:0    The index of the largest implemented CPUID register (one less than the number of implemented CPUID registers). This value will be at least 4.
revision  15:8   Processor revision number. An 8-bit value that represents the revision or stepping of this processor implementation within the processor model.
model     23:16  Processor model number. A unique 8-bit value representing the processor model within the processor family.
family    31:24  Processor family number. A unique 8-bit value representing the processor family.
archrev   39:32  Architecture revision. An 8-bit value that represents the architecture revision number that the processor implements.
rv        63:40  Reserved.

Bits:   63:0
Field:  rv
Width:  64

Figure 3-13. CPUID Register 4 – General Features/Capability Bits
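The vendor string and version fields can be recovered from raw CPUID register values as sketched below. The helper names `vendor_string` and `decode_version` are assumptions, the register values are taken as plain Python integers (how they are read from hardware is outside this sketch), and the sample vendor name in the usage note is invented for illustration.

```python
def vendor_string(cpuid0, cpuid1):
    """Rebuild the ASCII vendor name from CPUID[0] and CPUID[1].

    Earlier characters sit in the lower numbered register and the lower
    numbered byte positions, so each register is read least-significant
    byte first. Trailing zero bytes pad the string to 16 bytes.
    """
    raw = cpuid0.to_bytes(8, "little") + cpuid1.to_bytes(8, "little")
    return raw.rstrip(b"\x00").decode("ascii")


def decode_version(cpuid3):
    """Split CPUID[3] into the fields of Table 3-7 (reserved bits 63:40 are dropped)."""
    return {
        "number": cpuid3 & 0xFF,
        "revision": (cpuid3 >> 8) & 0xFF,
        "model": (cpuid3 >> 16) & 0xFF,
        "family": (cpuid3 >> 24) & 0xFF,
        "archrev": (cpuid3 >> 32) & 0xFF,
    }
```

For instance, if CPUID[0] holds the bytes of "ExampleV" and CPUID[1] holds "endor" followed by three zero bytes, `vendor_string` yields "ExampleVendor".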



3.2 Memory

This section describes an IA-64 application program’s view of memory. This includes a description of how memory is accessed, for both 32-bit and 64-bit applications. The size and alignment of addressable units in memory is also given, along with a description of how byte ordering is handled.

3.2.1 Application Memory Addressing Model

Memory is byte addressable and is accessed with 64-bit pointers. A 32-bit pointer model without a hardware mode is supported architecturally. Pointers which are 32 bits in memory are loaded and manipulated in 64-bit registers. Software must explicitly convert 32-bit pointers into 64-bit pointers before use.

3.2.2 Addressable Units and Alignment

Memory can be addressed in units of 1, 2, 4, 8, 10 and 16 bytes.

It is recommended that all addressable units be stored on their naturally aligned boundaries. Hardware and/or operating system software may have support for unaligned accesses, possibly with some performance cost. 10-byte floating-point values should be stored on 16-byte aligned boundaries.

Bits within larger units are always numbered from 0 starting with the least-significant bit. Quantities loaded from memory to general registers are always placed in the least-significant portion of the register (loaded values are placed right justified in the target general register).

Instruction bundles (3 IA-64 instructions per bundle) are 16-byte units that are always aligned on 16-byte boundaries.
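The alignment recommendations above reduce to a single modulus test. The following Python sketch uses hypothetical helper names and encodes only what this section states: natural alignment for 1-, 2-, 4-, 8- and 16-byte units, and 16-byte alignment for 10-byte floating-point values.

```python
ADDRESSABLE_UNIT_SIZES = (1, 2, 4, 8, 10, 16)


def recommended_alignment(size):
    """Return the recommended alignment in bytes for an addressable unit."""
    if size not in ADDRESSABLE_UNIT_SIZES:
        raise ValueError("not an addressable unit size")
    # 10-byte values are the one case where alignment differs from size.
    return 16 if size == 10 else size


def is_recommended_aligned(address, size):
    """True if address satisfies the recommended alignment for this unit size."""
    return address % recommended_alignment(size) == 0
```

Instruction bundles follow the 16-byte case: an address such as 0x1010 qualifies, while 0x1008 does not.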

3.2.3 Byte Ordering

The UM.be bit in the User Mask controls whether loads and stores use little-endian or big-endian byte ordering for IA-64 references. When the UM.be bit is 0, larger-than-byte loads and stores are little endian (lower-addressed bytes in memory correspond to the lower-order bytes in the register). When the UM.be bit is 1, larger-than-byte loads and stores are big endian (lower-addressed bytes in memory correspond to the higher-order bytes in the register). Load byte and store byte are not affected by the UM.be bit. The UM.be bit does not affect instruction fetch, IA-32 references, or the RSE. IA-64 instructions are always accessed by the processor as little-endian units. When instructions are referenced as big-endian data, the instruction will appear reversed in a register.

Figure 3-14 shows various loads in little-endian format. Figure 3-15 shows various loads in big-endian format. Stores are not shown but behave similarly.

Figure 3-14. Little-endian Loads

Memory:
Address:  0  1  2  3  4  5  6  7
Byte:     a  b  c  d  e  f  g  h

Registers (bytes listed from bit 63 down to bit 0):
LD8 [0] =>  h g f e d c b a
LD1 [1] =>  0 0 0 0 0 0 0 b
LD2 [2] =>  0 0 0 0 0 0 d c
LD4 [4] =>  0 0 0 0 h g f e
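The effect of the UM.be bit on loads can be reproduced with Python's integer byte conversions. This is a sketch, not a processor model: the `load` function and its `big_endian` parameter (standing in for UM.be) are assumptions, and memory is an eight-byte string holding the characters a through h at addresses 0 through 7.

```python
MEMORY = b"abcdefgh"  # byte 'a' at address 0 ... byte 'h' at address 7


def load(memory, address, size, big_endian=False):
    """Load size bytes at address, zero-extended and right justified in a 64-bit value."""
    if size not in (1, 2, 4, 8):
        raise ValueError("general register loads are 1, 2, 4, or 8 bytes here")
    chunk = memory[address:address + size]
    return int.from_bytes(chunk, "big" if big_endian else "little")
```

A two-byte load at address 2 returns the bytes d, c (0x6463) in little-endian mode and c, d (0x6364) in big-endian mode, while a one-byte load returns the same value either way.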



3.3 Instruction Encoding OverviewEach IA-64 instruction is categorized into one of six types; each instruction type may be executed on one or more execu-tion unit types. Table 3-8 lists the instruction types and the execution unit type on which they are executed:

Three instructions are grouped together into 128-bit sized and aligned containers called bundles. Each bundle contains three 41-bit instruction slots and a 5-bit template field. The format of a bundle is depicted in Figure 3-16.

During execution, architectural stops in the program indicate to the hardware that one or more instructions before the stop may have certain kinds of resource dependencies with one or more instructions after the stop. A stop is present after each slot having a double line to the right of it in Table 3-9. For example, template 00 has no stops, while template 03 has a stop after slot 1 and another after slot 2.

In addition to the location of stops, the template field specifies the mapping of instruction slots to execution unit types. Not all possible mappings of instructions to units are available. Table 3-9 indicates the defined combinations. The three rightmost columns correspond to the three instruction slots in a bundle. Listed within each column is the execution unit type controlled by that instruction slot.

Figure 3-15. Big-endian Loads

Table 3-8. Relationship Between Instruction Type and Execution Unit Type

Instruction Type   Description        Execution Unit Type
A                  Integer ALU        I-unit or M-unit
I                  Non-ALU integer    I-unit
M                  Memory             M-unit
F                  Floating-point     F-unit
B                  Branch             B-unit
L+X                Extended           I-unit

Bits     127 ... 87           86 ... 46            45 ... 5             4 ... 0
Field    instruction slot 2   instruction slot 1   instruction slot 0   template
Width    41                   41                   41                   5

Figure 3-16. Bundle Format
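The bundle layout can be expressed as a few shifts and masks. This is an illustrative sketch only; decode_bundle and encode_bundle are hypothetical helper names, and the 16 bytes are interpreted as a little-endian 128-bit value, so the template field falls in byte 0.

```python
SLOT_MASK = (1 << 41) - 1  # each instruction slot is 41 bits wide


def decode_bundle(raw):
    """Split a 16-byte bundle into (template, slot0, slot1, slot2)."""
    if len(raw) != 16:
        raise ValueError("a bundle is exactly 16 bytes")
    value = int.from_bytes(raw, "little")
    template = value & 0x1F            # bits 4..0
    slot0 = (value >> 5) & SLOT_MASK   # bits 45..5
    slot1 = (value >> 46) & SLOT_MASK  # bits 86..46
    slot2 = (value >> 87) & SLOT_MASK  # bits 127..87
    return template, slot0, slot1, slot2


def encode_bundle(template, slot0, slot1, slot2):
    """Inverse of decode_bundle; the field values are not validated against Table 3-9."""
    if template >> 5 or any(s >> 41 for s in (slot0, slot1, slot2)):
        raise ValueError("field does not fit")
    value = template | (slot0 << 5) | (slot1 << 46) | (slot2 << 87)
    return value.to_bytes(16, "little")
```

The sketch only separates the fields. Interpreting a slot requires the template-to-unit mapping and the per-unit instruction formats, which are outside its scope.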

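Big-endian loads (Figure 3-15): memory holds bytes a, b, c, d, e, f, g, h at addresses 0 through 7. Register contents are shown from bit 63 (left) to bit 0 (right), one byte per column.

Load          Register (63 ... 0)
LD8 [0] =>    a b c d e f g h
LD1 [1] =>    0 0 0 0 0 0 0 b
LD2 [2] =>    0 0 0 0 0 0 c d
LD4 [4] =>    0 0 0 0 e f g h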



Extended instructions, used for long-immediate integer instructions, occupy two instruction slots.

3.4 Instruction Sequencing

An IA-64 program consists of a sequence of instructions and stops packed in bundles. Instruction execution is ordered as follows:

• Bundles are ordered from lowest to highest memory address. Instructions in bundles with lower memory addresses are considered to precede instructions in bundles with higher memory addresses. The byte order of each bundle in memory is little-endian (the template field is contained in byte 0 of a bundle).

• Within a bundle, instructions are ordered from instruction slot 0 to instruction slot 2 as specified in Figure 3-16 on page 3-11.

For additional details on instruction sequencing, refer to Appendix A, “Instruction Sequencing Considerations”.

Table 3-9. Template Field Encoding and Instruction Slot Mapping

Template   Slot 0   Slot 1   Slot 2
00         M-unit   I-unit   I-unit
01         M-unit   I-unit   I-unit
04         M-unit   L-unit   X-unit
06
07
08         M-unit   M-unit   I-unit
0A         M-unit   M-unit   I-unit
0B         M-unit   M-unit   I-unit
0C         M-unit   F-unit   I-unit
0D         M-unit   F-unit   I-unit
0E         M-unit   M-unit   F-unit
0F         M-unit   M-unit   F-unit
10         M-unit   I-unit   B-unit
12         M-unit   B-unit   B-unit
14
15
16         B-unit   B-unit   B-unit
18         M-unit   M-unit   B-unit
1A
1B
1C         M-unit   F-unit   B-unit
1D         M-unit   F-unit   B-unit
1E
1F



4 IA-64 Application Programming Model

This section describes the IA-64 architectural functionality from the perspective of the application programmer. IA-64 instructions are grouped into related functions and an overview of their behavior is given. Unless otherwise noted, all immediates are sign extended to 64 bits before use. The floating-point programming model is described separately in Chapter 5, “IA-64 Floating-point Programming Model”.

The main features of the IA-64 programming model covered here are:

• General Register Stack

• Integer Computation Instructions

• Compare Instructions and Predication

• Memory Access Instructions and Speculation

• Branch Instructions and Branch Prediction

• Multimedia Instructions

• Register File Transfer Instructions

• Character Strings and Population Count

4.1 Register Stack

As described in “General Registers” on page 3-3, the general register file is divided into static and stacked subsets. The static subset is visible to all procedures and consists of the 32 registers from GR 0 through GR 31. The stacked subset is local to each procedure and may vary in size from zero to 96 registers beginning at GR 32. The register stack mechanism is implemented by renaming register addresses as a side-effect of procedure calls and returns. The implementation of this rename mechanism is not otherwise visible to application programs. The register stack is disabled during IA-32 instruction set execution.

The static subset must be saved and restored at procedure boundaries according to software convention. The stacked subset is automatically saved and restored by the Register Stack Engine (RSE) without explicit software intervention. All other register files are visible to all procedures and must be saved/restored by software according to software convention.

4.1.1 Register Stack Operation

The registers in the stacked subset visible to a given procedure are called a register stack frame. The frame is further partitioned into two variable-size areas: the local area and the output area. Immediately after a call, the size of the local area of the newly activated frame is zero and the size of the output area is equal to the size of the caller’s output area and overlays the caller’s output area.

The local and output areas of a frame can be re-sized using the alloc instruction which specifies immediates that determine the size of frame (sof) and size of locals (sol). (Note that in the assembly language, alloc specifies three operands: the size of inputs immediate; the size of locals immediate; and the size of outputs immediate. The value of sol is determined by adding the size of inputs immediate and the size of locals immediate; the value of sof is determined by adding all three immediates.) The value of sof specifies the size of the entire stacked subset visible to the current procedure; the value of sol specifies the size of the local area. The size of the output area is determined by the difference between sof and sol. The values of these parameters for the currently active procedure are maintained in the Current Frame Marker (CFM).

Reading a stacked register outside the current frame will return an undefined result. Writing a stacked register outside the current frame will cause an Illegal Operation fault.



When a call-type branch is executed, the CFM is copied to the Previous Frame Marker (PFM) field in the Previous Function State application register (PFS), and the callee’s frame is created as follows:

• The stacked registers are renamed such that the first register in the caller’s output area becomes GR 32 for the callee

• The size of the local area is set to zero

• The size of the callee’s frame (sofb1) is set to the size of the caller’s output area (sofa – sola)

Values in the output area of the caller’s register stack frame are visible to the callee. This overlap permits parameter and return value passing between procedures to take place entirely in registers.

Procedure frames may be dynamically re-sized by issuing an alloc instruction. An alloc instruction causes no renaming, but only changes the size of the register stack frame and the partitioning between local and output areas. Typically, when a procedure is called, it will allocate some number of local registers for its use (which will include the parameters passed to it in the caller’s output registers), plus an output area (for passing parameters to procedures it will call). Newly allocated registers (including their NaT bits) have undefined values.

When a return-type branch is executed, CFM is restored from PFM and the register renaming is restored to the caller’s configuration. The PFM is procedure local state and must be saved and restored by non-leaf procedures. The CFM is not directly accessible in application programs and is updated only through the execution of calls, returns, alloc, and clrrrb.

Figure 4-1 depicts the behavior of the register stack on a procedure call from procA (caller) to procB (callee). The state of the register stack is shown at four points: prior to the call, immediately following the call, after procB has executed an alloc, and after procB returns to procA.

The majority of application programs need only issue alloc instructions and save/restore PFM in order to effectively utilize the register stack. A detailed knowledge of the RSE (Register Stack Engine) is required only by certain specialized application software such as user-level thread packages, debuggers, etc.

Figure 4-1. Register Stack Behavior on Procedure Call and Return

Instruction   State          Frame                    Stacked GRs                              CFM sol, sof   PFM sol, sof
              before call    Caller’s frame (procA)   Local A GR 32-45, Output A GR 46-52      14, 21         x, x
call          after call     Callee’s frame (procB)   Output B1 GR 32-38                       0, 7           14, 21
alloc         after alloc    Callee’s frame (procB)   Local B GR 32-47, Output B2 GR 48-50     16, 19         14, 21
return        after return   Caller’s frame (procA)   Local A GR 32-45, Output A GR 46-52      14, 21         14, 21

Frame sizes in the figure: sofa=21, sola=14; sofb1=7; sofb2=19, solb2=16.
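The frame-marker bookkeeping for call, alloc and return can be traced with a small model. This is a sketch under stated assumptions: RegisterStackModel is a hypothetical name, only the sol and sof fields of CFM and PFM are tracked, and register renaming, the RSE and software save/restore of PFM across nested calls are not modeled.

```python
class RegisterStackModel:
    """Tracks (sol, sof) for the Current and Previous Frame Markers only."""

    def __init__(self):
        self.cfm = (0, 0)   # (sol, sof) of the active frame
        self.pfm = None     # undefined until the first call

    def alloc(self, inputs, local, outputs):
        """alloc: sol = inputs + locals, sof = inputs + locals + outputs."""
        sol = inputs + local
        sof = sol + outputs
        if sof > 96:
            raise ValueError("the stacked subset holds at most 96 registers")
        self.cfm = (sol, sof)

    def output_size(self):
        sol, sof = self.cfm
        return sof - sol

    def call(self):
        """Call-type branch: CFM is copied to PFM; callee frame is the caller's output area."""
        self.pfm = self.cfm
        self.cfm = (0, self.output_size())

    def ret(self):
        """Return-type branch: CFM is restored from PFM."""
        self.cfm = self.pfm
```

Calling alloc(0, 14, 7), then call(), alloc(7, 9, 3) and ret() reproduces a sequence with a 21-register caller frame, a 7-register frame immediately after the call, and a 19-register callee frame with 16 locals. The split of 16 into 7 inputs and 9 locals is an assumption made for the example.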



4.1.2 Register Stack Instructions

The alloc instruction is used to change the size of the current register stack frame. An alloc instruction must be the first instruction in an instruction group; otherwise, the results are undefined. An alloc instruction affects the register stack frame seen by all instructions in an instruction group, including the alloc itself. An alloc cannot be predicated. An alloc does not affect the values or NaT bits of the allocated registers. When a register stack frame is expanded, newly allocated registers may have their NaT bit set.

In addition, there are three instructions which provide explicit control over the state of the register stack. These instructions are used in thread and context switching which necessitate a corresponding switch of the backing store for the register stack.

The flushrs instruction is used to force all previous stack frames out to backing store memory. It stalls instruction execution until all active frames in the physical register stack up to, but not including the current frame are spilled to the backing store by the RSE. A flushrs instruction must be the first instruction in an instruction group; otherwise, the results are undefined. A flushrs cannot be predicated.

Table 4-1 lists the architectural visible state relating to the register stack. Table 4-2 summarizes the register stack management instructions. Call- and return-type branches, which affect the stack, are described in “Branch Instructions” on page 4-19.

4.2 Integer Computation Instructions

The integer execution units provide a set of arithmetic, logical, shift and bit-field-manipulation instructions. Additionally, they provide a set of instructions to accelerate operations on 32-bit data and pointers.

Arithmetic, logical and 32-bit acceleration instructions can be executed on both I- and M-units.

4.2.1 Arithmetic Instructions

Addition and subtraction (add, sub) are supported with regular two input forms and special three input forms. The three input addition form adds one to the sum of two input registers. The three input subtraction form subtracts one from the difference of two input registers. The three input forms share the same mnemonics as the two input forms and are specified by appending a “1” as a third source operand.

Immediate forms of addition and subtraction use a register and a 14-bit immediate. The immediate form is obtained simply by specifying an immediate rather than a register as the first operand. Also, addition can be performed between a register and a 22-bit immediate; however, the source register must be GR 0, 1, 2 or 3.

A shift left and add instruction (shladd) shifts one register operand to the left by 1 to 4 bits and adds the result to a second register operand. Table 4-3 summarizes the integer arithmetic instructions.
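The arithmetic forms above reduce to simple 64-bit modular operations. The following is a minimal sketch with hypothetical function names; registers are modeled as unsigned 64-bit integers, and NaT bits and immediate-range checks are left out.

```python
MASK64 = (1 << 64) - 1  # results wrap to 64 bits


def add(a, b, plus_one=False):
    """Two input add, or the three input form (a + b + 1) when plus_one is set."""
    return (a + b + (1 if plus_one else 0)) & MASK64


def sub(a, b, minus_one=False):
    """Two input subtract, or the three input form (a - b - 1) when minus_one is set."""
    return (a - b - (1 if minus_one else 0)) & MASK64


def shladd(a, count, b):
    """Shift a left by 1 to 4 bits and add b."""
    if not 1 <= count <= 4:
        raise ValueError("shift count must be 1 to 4")
    return ((a << count) + b) & MASK64
```

A typical use of shladd is array indexing: shladd(index, 3, base) computes base + index * 8 in one operation.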

Table 4-1. Architectural Visible State Related to the Register Stack

Register        Description
AR[PFS].pfm     Previous Frame Marker field
AR[RSC]         Register Stack Configuration application register
AR[BSP]         Backing store pointer application register
AR[BSPSTORE]    Backing store pointer application register for memory stores
AR[RNAT]        RSE NaT collection application register

Table 4-2. Register Stack Management Instructions

Mnemonic   Operation
alloc      Allocate register stack frame
flushrs    Flush register stack to backing store



Note that an integer multiply instruction is defined which uses the floating-point registers. See “Integer Multiply and Add Instructions” on page 5-14 for details. Integer divide is performed in software similarly to floating-point divide.

4.2.2 Logical Instructions

Instructions to perform logical AND (and), OR (or), and exclusive OR (xor) between two registers or between a register and an immediate are defined. The andcm instruction performs a logical AND of a register or an immediate with the complement of another register. Table 4-4 summarizes the integer logical instructions.

4.2.3 32-Bit Addresses and Integers

Support for IA-64 32-bit addresses is provided in the form of add instructions that perform region bit copying. This supports the virtual address translation model. The add 32-bit pointer instruction (addp) adds two registers or a register and an immediate, zeroes the most significant 32 bits of the result, and copies bits 31:30 of the second source to bits 62:61 of the result. The shladdp instruction operates similarly but shifts the first source to the left by 1 to 4 bits before performing the add, and is provided only in the two-register form.
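The region bit copying performed by the 32-bit pointer adds can be sketched as follows. The function names and the (first source, second source) argument order are illustrative assumptions; only the arithmetic described in the text is modeled.

```python
def addp(src1, src2):
    """32-bit pointer add: keep the low 32 bits of the sum, then copy
    bits 31:30 of the second source into bits 62:61 of the result."""
    low = (src1 + src2) & 0xFFFFFFFF
    region = (src2 >> 30) & 0x3
    return low | (region << 61)


def shladdp(src1, count, src2):
    """As addp, but the first source is shifted left by 1 to 4 bits first."""
    if not 1 <= count <= 4:
        raise ValueError("shift count must be 1 to 4")
    return addp(src1 << count, src2)
```

With a second source of 0xC0000000, bits 31:30 are both 1, so bits 62:61 of the result are set, while a carry out of bit 31 of the sum is discarded.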

In addition, support for 32-bit integers is provided through 32-bit compare instructions and instructions to perform sign and zero extension. Compare instructions are described in “Compare Instructions and Predication” on page 4-5. The sign and zero extend (sxt, zxt) instructions take an 8-bit, 16-bit, or 32-bit value in a register, and produce a properly extended 64-bit result.
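Sign and zero extension can be illustrated with a short sketch. The helper names and the size_bytes parameter are hypothetical; the size is given in bytes (1, 2 or 4) to match the 8-bit, 16-bit and 32-bit forms.

```python
MASK64 = (1 << 64) - 1


def zxt(value, size_bytes):
    """Zero extend the low 8, 16 or 32 bits of value to 64 bits."""
    if size_bytes not in (1, 2, 4):
        raise ValueError("size must be 1, 2 or 4 bytes")
    return value & ((1 << (8 * size_bytes)) - 1)


def sxt(value, size_bytes):
    """Sign extend the low 8, 16 or 32 bits of value to 64 bits."""
    bits = 8 * size_bytes
    field = zxt(value, size_bytes)
    if field >> (bits - 1):                      # sign bit of the field is set
        field |= MASK64 & ~((1 << bits) - 1)     # fill the upper bits with ones
    return field
```

Both helpers ignore the upper bits of the input, so sxt(0x1234FF80, 1) and sxt(0x80, 1) give the same result.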

Table 4-5 summarizes 32-bit pointer and 32-bit integer instructions.

4.2.4 Bit Field and Shift Instructions

Four classes of instructions are defined for shifting and operating on bit fields within a general register: variable shifts, fixed shift-and-mask instructions, a 128-bit-input funnel shift, and special compare operations to test an individual bit within a general register. The compare instructions for testing a single bit (tbit), or for testing the NaT bit (tnat) are described in “Compare Instructions and Predication” on page 4-5.

The variable shift instructions shift the contents of a general register by an amount specified by another general register. The shift right signed (shr) and shift right unsigned (shr.u) instructions shift the contents of a register to the right with the vacated bit positions filled with the sign bit or zeroes respectively. The shift left (shl) instruction shifts the contents of a register to the left.

Table 4-3. Integer Arithmetic Instructions

Mnemonic     Operation
add          Addition
add ...,1    Three input addition
sub          Subtraction
sub ...,1    Three input subtraction
shladd       Shift left and add

Table 4-4. Integer Logical Instructions

Mnemonic   Operation
and        Logical and
or         Logical or
andcm      Logical and complement
xor        Logical exclusive or

Table 4-5. 32-bit Pointer and 32-bit Integer Instructions

Mnemonic   Operation
addp       32-bit pointer addition
shladdp    Shift left and add 32-bit pointer
sxt        Sign extend
zxt        Zero extend

The fixed shift-and-mask instructions (extr, dep) are generalized forms of fixed shifts. The extract instruction (extr) copies an arbitrary bit field from a general register to the least-significant bits of the target register. The remaining bits of the target are written with either the sign of the bit field (extr) or with zero (extr.u). The length and starting position of the field are specified by two immediates. This is essentially a shift-right-and-mask operation. A simple right shift by a fixed amount can be specified by using shr with an immediate value for the shift amount. This is just an assembly pseudo-op for an extract instruction where the field to be extracted extends all the way to the left-most register bit.

The deposit instruction (dep) takes a field from either the least-significant bits of a general register, or from an immediate value of all zeroes or all ones, places it at an arbitrary position, and fills the result to the left and right of the field with either bits from a second general register (dep) or with zeroes (dep.z). The length and starting position of the field are specified by two immediates. This is essentially a shift-left-mask-merge operation. A simple left shift by a fixed amount can be specified by using shl with an immediate value for the shift amount. This is just an assembly pseudo-op for dep.z where the deposited field extends all the way to the left-most register bit.
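The extract and deposit operations can be sketched as shift-and-mask functions. The function names and argument order are assumptions chosen for readability, not the assembly operand order, and the sketch requires the field to lie entirely within the register.

```python
MASK64 = (1 << 64) - 1


def _check(pos, length):
    if not (0 <= pos < 64 and 1 <= length <= 64 and pos + length <= 64):
        raise ValueError("field must lie within bits 63..0")


def extr_u(src, pos, length):
    """Extract unsigned: shift right by pos and mask to length bits."""
    _check(pos, length)
    return (src >> pos) & ((1 << length) - 1)


def extr(src, pos, length):
    """Extract signed: as extr_u, then fill the upper bits with the field's sign."""
    field = extr_u(src, pos, length)
    if field >> (length - 1):
        field |= MASK64 & ~((1 << length) - 1)
    return field


def dep(field_src, background, pos, length):
    """Deposit the low `length` bits of field_src into background at bit pos."""
    _check(pos, length)
    mask = ((1 << length) - 1) << pos
    return (background & ~mask & MASK64) | ((field_src << pos) & mask)


def dep_z(field_src, pos, length):
    """Deposit into a background of zeroes."""
    return dep(field_src, 0, pos, length)
```

Under these definitions, a fixed right shift by n is extr_u(x, n, 64 - n) and a fixed left shift by n is dep_z(x, n, 64 - n), which mirrors the pseudo-op relationship described above.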

The shift right pair (shrp) instruction performs a 128-bit-input funnel shift. It extracts an arbitrary 64-bit field from a 128-bit field formed by concatenating two source general registers. The starting position is specified by an immediate. This can be used to accelerate the adjustment of unaligned data. A bit rotate operation can be performed by using shrp and specifying the same register for both operands.
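The funnel shift can be modeled by forming the 128-bit value explicitly. This is a sketch only: the names are hypothetical, and placing the first operand in the upper 64 bits of the concatenation is an assumption, since the text above does not fix the operand order.

```python
MASK64 = (1 << 64) - 1


def shrp(high, low, count):
    """Concatenate high:low into 128 bits, shift right by count, keep the low 64 bits."""
    if not 0 <= count <= 63:
        raise ValueError("count must be 0 to 63")
    combined = ((high & MASK64) << 64) | (low & MASK64)
    return (combined >> count) & MASK64


def rotate_right(value, count):
    """Bit rotate built from shrp by using the same register for both operands."""
    return shrp(value, value, count)
```

For unaligned data, two adjacent aligned 8-byte words can be passed as high and low with count set to eight times the byte offset, which yields the 8 bytes that straddle them (in the little-endian view).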

Table 4-6 summarizes the bit field and shift instructions.

4.2.5 Large Constants

A special instruction is defined for generating large constants (see Table 4-7). For constants up to 22 bits in size, the add instruction can be used, or the mov pseudo-op (pseudo-op of add with GR 0, which always reads 0). For larger constants, the move long immediate instruction (movl) is defined to write a 64-bit immediate into a general register. This instruction occupies two instruction slots within the same bundle, and is the only such instruction.
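The choice between the two constant-generation forms depends only on whether the value fits in 22 bits. A minimal sketch, assuming the 22-bit immediate is treated as a signed quantity (immediates are sign extended before use); the function names are hypothetical.

```python
def fits_signed(value, bits):
    """True if value is representable as a signed two's-complement field of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def constant_instruction(value):
    """Pick the form an assembler-like tool might use to load a constant."""
    if fits_signed(value, 22):
        return "mov"    # pseudo-op for add of a 22-bit immediate with GR 0
    if -(1 << 63) <= value < (1 << 64):
        return "movl"   # 64-bit immediate, occupies two instruction slots
    raise ValueError("constant does not fit in 64 bits")
```

The trade-off is code size: movl consumes two slots of a bundle, so small constants are better served by the single-slot form.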

4.3 Compare Instructions and Predication

A set of compare instructions provides the ability to test for various conditions and affect the dynamic execution of instructions. A compare instruction tests for a single specified condition and generates a boolean result. These results are written to predicate registers. The predicate registers can then be used to affect dynamic execution in two ways: as conditions for conditional branches, or as qualifying predicates for predication.

Table 4-6. Bit Field and Shift Instructions

Mnemonic   Operation
shr        Shift right signed
shr.u      Shift right unsigned
shl        Shift left
extr       Extract signed (shift right and mask)
extr.u     Extract unsigned (shift right and mask)
dep        Deposit (shift left, mask and merge)
dep.z      Deposit in zeroes (shift left and mask)
shrp       Shift right pair

Table 4-7. Instructions to Generate Large Constants

Mnemonic   Operation
mov        Move 22-bit immediate
movl       Move 64-bit immediate

4.3.1 Predication

Predication is the conditional execution of instructions. The execution of most IA-64 instructions is gated by a qualifying predicate. If the predicate is true, the instruction executes normally; if the predicate is false, the instruction does not modify architectural state (except for the unconditional type of compare instructions, floating-point approximation instructions and while-loop branches). Predicates are one-bit values and are stored in the predicate register file. A zero predicate is interpreted as false and a one predicate is interpreted as true (predicate register PR 0 is hardwired to one).

A few IA-64 instructions cannot be predicated. These instructions are: allocate stack frame (alloc), clear rrb (clrrrb), flush register stack (flushrs), and counted branches (cloop, ctop, cexit).

4.3.2 Compare Instructions

Predicate registers are written by the following instructions: general register compare (cmp, cmp4), floating-point register compare (fcmp), test bit and test NaT (tbit, tnat), floating-point class (fclass), and floating-point reciprocal approximation and reciprocal square root approximation (frcpa, frsqrta). Most of these compare instructions (all but frcpa and frsqrta) set two predicate registers based on the outcome of the comparison. The setting of the two target registers is described below in “Compare Types” on page 4-6. Compare instructions are summarized in Table 4-8.

The 64-bit (cmp) and 32-bit (cmp4) compare instructions compare two registers, or a register and an immediate, for one of ten relations (e.g., >, <=). The compare instructions set two predicate targets according to the result. The cmp4 instruction compares the least-significant 32 bits of both sources (the most significant 32 bits are ignored).

The test bit (tbit) instruction sets two predicate registers according to the state of a single bit in a general register (the position of the bit is specified by an immediate). The test NaT (tnat) instruction sets two predicate registers according to the state of the NaT bit corresponding to a general register.

The fcmp instruction compares two floating-point registers and sets two predicate targets according to one of eight relations. The fclass instruction sets two predicate targets according to the classification of the number contained in the floating-point register source.

The frcpa and frsqrta instructions set a single predicate target if their floating-point register sources are such that a valid approximation can be produced, otherwise the predicate target is cleared.

4.3.3 Compare Types

Compare instructions can have as many as five compare types: Normal, Unconditional, AND, OR, or DeMorgan. The type defines how the instruction writes its target predicate registers based on the outcome of the comparison and on the qualifying predicate. The description of these types is contained in Table 4-9. In the table, “qp” refers to the value of the qualifying predicate of the compare and “result” refers to the outcome of the compare relation (one if the compare relation is true and zero if the compare relation is false).

Table 4-8. Compare Instructions

Mnemonic            Operation
cmp, cmp4           GR compare
tbit                Test bit in a GR
tnat                Test GR NaT bit
fcmp                FR compare
fclass              FR class
frcpa, fprcpa       Floating-point reciprocal approximation
frsqrta, fprsqrta   Floating-point reciprocal square root approximation

The Normal compare type simply writes the compare result to the first predicate target and the complement of the result to the second predicate target.

The Unconditional compare type behaves the same as the Normal type, except that if the qualifying predicate is 0, both predicate targets are written with 0. This can be thought of as an initialization of the predicate targets, combined with a Normal compare. Note that compare instructions with the Unconditional type modify architectural state when their qualifying predicate is false.

The AND, OR and DeMorgan types are termed “parallel” compare types because they allow multiple simultaneous compares (of the same type) to target a single predicate register. This provides the ability to compute a logical equation such as p5 = (r4 == 0) || (r5 == r6) in a single cycle (assuming p5 was initialized to 0 in an earlier cycle). The DeMorgan compare type is just a combination of an OR type to one predicate target and an AND type to the other predicate target. Multiple OR-type compares (including the OR part of the DeMorgan type) may specify the same predicate target in the same instruction group. Multiple AND-type compares (including the AND part of the DeMorgan type) may also specify the same predicate target in the same instruction group.

For all compare instructions (except for tnat and fclass), if one or both of the source registers contains a deferred exception token (NaT or NaTVal; see “Control Speculation” on page 4-10), the result of the compare is different. Both predicate targets are treated the same, and are either written to 0 or left unchanged. In combination with speculation, this allows predicated code to be turned off in the presence of a deferred exception. (fclass behaves this way as well if NaTVal is not one of the classes being tested for.) Table 4-10 describes the behavior.

Only a subset of the compare types are provided for some of the compare instructions. Table 4-11 lists the compare types which are available for each of the instructions.

Table 4-9. Compare Type Function

                            Operation
Compare Type    Completer   First Predicate Target             Second Predicate Target
Normal          none        if (qp) {target = result}          if (qp) {target = !result}
Unconditional   unc         if (qp) {target = result}          if (qp) {target = !result}
                            else {target = 0}                  else {target = 0}
AND             and         if (qp && !result) {target = 0}    if (qp && !result) {target = 0}
                andcm       if (qp && result) {target = 0}     if (qp && result) {target = 0}
OR              or          if (qp && result) {target = 1}     if (qp && result) {target = 1}
                orcm        if (qp && !result) {target = 1}    if (qp && !result) {target = 1}
DeMorgan        or.andcm    if (qp && result) {target = 1}     if (qp && result) {target = 0}
                and.orcm    if (qp && !result) {target = 0}    if (qp && !result) {target = 1}
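The compare type rules can be written as a single function over the old predicate values. This is an illustrative sketch; the function name compare and its calling convention are hypothetical, and predicates are modeled as the integers 0 and 1.

```python
def compare(completer, qp, result, p1, p2):
    """Return the new (first target, second target) values for one compare.

    qp and result are 0 or 1; p1 and p2 are the targets' values before the compare.
    """
    not_result = 1 - result
    if completer == "none":                 # Normal
        if qp:
            p1, p2 = result, not_result
    elif completer == "unc":                # Unconditional: writes even when qp is 0
        p1, p2 = (result, not_result) if qp else (0, 0)
    elif completer == "and":
        if qp and not result:
            p1, p2 = 0, 0
    elif completer == "andcm":
        if qp and result:
            p1, p2 = 0, 0
    elif completer == "or":
        if qp and result:
            p1, p2 = 1, 1
    elif completer == "orcm":
        if qp and not result:
            p1, p2 = 1, 1
    elif completer == "or.andcm":           # DeMorgan
        if qp and result:
            p1, p2 = 1, 0
    elif completer == "and.orcm":           # DeMorgan
        if qp and not result:
            p1, p2 = 0, 1
    else:
        raise ValueError("unknown completer: " + completer)
    return p1, p2
```

An equation such as p5 = (r4 == 0) || (r5 == r6) corresponds to initializing p5 to 0 and applying two "or" compares to it. Because an OR-type compare only ever writes 1, the order in which the two are applied does not matter, which is what allows them to share a target within one instruction group.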

Table 4-10. Compare Outcome with NaT Source Input

Compare Type    Operation
Normal          if (qp) {target = 0}
Unconditional   target = 0
AND             if (qp) {target = 0}
OR              (not written)
DeMorgan        (not written)
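The NaT-source behavior can be sketched in the same style. The function name and the lowercase type labels are hypothetical; both predicate targets are treated identically, as the text states.

```python
def compare_with_nat_source(compare_type, qp, p1, p2):
    """New (p1, p2) when a compare source holds a deferred exception token."""
    if compare_type == "unconditional":
        return 0, 0                       # written regardless of qp
    if compare_type in ("normal", "and"):
        return (0, 0) if qp else (p1, p2)
    if compare_type in ("or", "demorgan"):
        return p1, p2                     # targets are not written
    raise ValueError("unknown compare type: " + compare_type)
```

In every case the targets end up cleared or unchanged and are never set to 1 by the compare itself, which is how code guarded by those predicates can be turned off when a speculative value is invalid.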

Table 4-11. Instructions and Compare Types Provided

Instruction   Relation                          Types Provided
cmp, cmp4     a == b, a != b,                   Normal, Unconditional, AND, OR, DeMorgan
              a > 0, a >= 0, a < 0, a <= 0,
              0 > a, 0 >= a, 0 < a, 0 <= a
              All other relations               Normal, Unconditional
tbit, tnat    All                               Normal, Unconditional, AND, OR, DeMorgan



4.3.4 Predicate Register Transfers

Instructions are provided to transfer between the predicate register file and a general register. These instructions operate in a “broadside” manner whereby multiple predicate registers are transferred in parallel, such that predicate register N is transferred to/from bit N of a general register.

The move to predicates instruction (mov pr=) loads multiple predicate registers from a general register according to a mask specified by an immediate. The mask contains one bit for each of PR 1 through PR 15 (PR 0 is hardwired to 1) and one bit for all of PR 16 through PR 63 (the rotating predicates). A predicate register is written from the corresponding bit in a general register if the corresponding mask bit is 1; if the mask bit is 0 the predicate register is not modified.
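The masked broadside transfer can be sketched with the predicate file held as a 64-bit integer (bit N is PR N). The function name and the mask layout used here, bits 1 through 15 for PR 1 through PR 15 and bit 16 for all rotating predicates, are assumptions for illustration; the actual immediate encoding is not given in this section.

```python
MASK64 = (1 << 64) - 1
ROTATING = ((1 << 48) - 1) << 16   # PR 16 through PR 63


def mov_to_pr(pr, gr, mask):
    """Return the new predicate file after a masked move from general register gr."""
    select = mask & 0xFFFE          # bits 1..15 select PR 1..15; PR 0 is never written
    if mask & (1 << 16):
        select |= ROTATING          # one mask bit covers all rotating predicates
    new = (pr & ~select) | (gr & select)
    return (new & MASK64) | 1       # PR 0 is hardwired to 1
```

The reverse transfer (mov =pr) needs no mask: the general register simply receives the whole 64-bit predicate image.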

The move to rotating predicates instruction (mov pr.rot=) copies 48 bits from an immediate value into the 48 rotating predicates (PR 16 through PR 63). The immediate value includes 28 bits, and is sign-extended. Thus PR 16 through PR 42 can be independently set to new values, and PR 43 through PR 63 are all set to either 0 or 1.
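The effect of the sign extension can be made concrete with a short model. Here imm28 stands for the 28 significant immediate bits, with bit 0 corresponding to PR 16; that numbering and the function name are assumptions for illustration.

```python
def mov_to_pr_rot(pr, imm28):
    """Return the new predicate file: PR 0..15 kept, PR 16..63 from the sign-extended immediate."""
    imm28 &= (1 << 28) - 1
    value48 = imm28
    if imm28 >> 27:                                    # sign bit (destined for PR 43)
        value48 |= ((1 << 48) - 1) & ~((1 << 28) - 1)  # replicate it into PR 44..63
    static = pr & 0xFFFF                               # PR 0..15 are not modified
    return static | (value48 << 16)
```

Bits 0 through 26 of the immediate therefore control PR 16 through PR 42 individually, while bit 27 sets PR 43 through PR 63 as a group.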

The move from predicates instruction (mov =pr) transfers the entire predicate register file into a general register target.

For all of these predicate register transfers, the predicate registers are accessed as though the register rename base (CFM.rrb.pr) were 0. Typically, therefore, software should clear CFM.rrb.pr before initializing rotating predicates.

4.4 Memory Access Instructions

Memory is accessed by simple load, store and semaphore instructions, which transfer data to and from general registers or floating-point registers. The memory address is specified by the contents of a general register.

Most load and store instructions can also specify base-address-register update. Base update adds either an immediate value or the contents of a general register to the address register, and places the result back in the address register. The update is done after the load or store operation, i.e., it is performed as an address post-increment.
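Post-increment addressing can be illustrated with a small sketch. The memory image, the regs dictionary and the function name are hypothetical, and the load is modeled as little-endian (UM.be = 0).

```python
MASK64 = (1 << 64) - 1


def load_post_increment(memory, regs, base, size, increment):
    """Load `size` bytes from the address in regs[base], then add increment to the base.

    The access uses the old address; the base register is updated afterwards.
    """
    addr = regs[base]
    value = int.from_bytes(memory[addr:addr + size], "little")
    regs[base] = (addr + increment) & MASK64
    return value
```

Walking an array then needs no separate add instruction: each call returns the current element and leaves the base register pointing at the next one.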

For highest performance, data should be aligned on natural boundaries. Within a 4K-byte boundary, accesses misaligned with respect to their natural boundaries will always fault if UM.ac (alignment check bit in the User Mask register) is 1. If UM.ac is 0, then an unaligned access will succeed if it is supported by the implementation; otherwise it will cause an Unaligned Data Reference fault. All memory accesses that cross a 4K-byte boundary will cause an Unaligned Data Reference fault independent of UM.ac. Additionally, all semaphore instructions will cause an Unaligned Data Reference fault if the access is not aligned to its natural boundary, independent of UM.ac.
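The alignment rules form a small decision procedure. The sketch below uses hypothetical names and restricts itself to power-of-two access sizes, for which natural alignment means the address is a multiple of the size; impl_supports_unaligned stands for the implementation-dependent behavior.

```python
def access_outcome(addr, size, um_ac, impl_supports_unaligned, is_semaphore=False):
    """Return "ok" or "fault" (Unaligned Data Reference) for a data access."""
    if size not in (1, 2, 4, 8, 16):
        raise ValueError("sketch handles power-of-two sizes only")
    if addr % size == 0:
        return "ok"                                   # naturally aligned
    crosses_4k = addr // 4096 != (addr + size - 1) // 4096
    if crosses_4k or is_semaphore or um_ac:
        return "fault"                                # independent of implementation support
    return "ok" if impl_supports_unaligned else "fault"
```

Software that must run on every implementation cannot rely on the final branch, so keeping data naturally aligned is the only portable choice.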

Accesses to memory quantities larger than a byte may be done in a big-endian or little-endian fashion. The byte ordering for all memory access instructions is determined by UM.be in the User Mask register for IA-64 memory references. All IA-32 memory references are performed little-endian.

Load, store and semaphore instructions are summarized in Table 4-12.

Table 4-11. Instructions and Compare Types Provided (Continued)

Instruction        Relation         Types Provided
fcmp, fclass       All              Normal, Unconditional
frcpa, frsqrta,    Not Applicable   Unconditional
fprcpa, fprsqrta



4.4.1 Load Instructions

Load instructions transfer data from memory to a general register, a floating-point register or a pair of floating-point registers.

For general register loads, access sizes of 1, 2, 4, and 8 bytes are defined. For sizes less than eight bytes, the loaded value is zero extended to 64 bits.

For floating-point loads, five access sizes are defined: single precision (4 bytes), double precision (8 bytes), double-extended precision (10 bytes), single precision pair (8 bytes), and double precision pair (16 bytes). The value(s) loaded from memory are converted into floating-point register format (see “Memory Access Instructions” on page 5-6 for details). The floating-point load pair instructions load two adjacent single or double precision numbers into two independent floating-point registers (see the ldfp[s/d] instruction description for restrictions on target register specifiers). The floating-point load pair instructions cannot specify base register update.

Variants of both general and floating-point register loads are defined for supporting compiler-directed control and data speculation. These use the general register NaT bits and the ALAT. See “Control Speculation” on page 4-10 and “Data Speculation” on page 4-12.

Variants are also provided for controlling the memory/cache subsystem. An ordered load can be used to force ordering in memory accesses. See “Memory Access Ordering” on page 4-18. A biased load provides a hint to acquire exclusive ownership of the accessed line. See “Memory Hierarchy Control and Consistency” on page 4-16.

Special-purpose loads are defined for restoring register values that were spilled to memory. The ld8.fill instruction loads a general register and the corresponding NaT bit (defined for an 8-byte access only). The ldf.fill instruction loads a value in floating-point register format from memory without conversion (defined for 16-byte access only). See “Register Spill and Fill” on page 4-12.

4.4.2 Store Instructions

Store instructions transfer data from a general or floating-point register to memory. Store instructions are always non-speculative. Store instructions can specify base-address-register update, but only by an immediate value. A variant is also provided for controlling the memory/cache subsystem. An ordered store can be used to force ordering in memory accesses.

Both general and floating-point register stores are defined with the same access sizes as their load counterparts. The only exception is that there are no floating-point store pair instructions.

Table 4-12. Memory Access Instructions

Mnemonic
General              Floating-point Normal   Floating-point Load Pair   Operation
ld                   ldf                     ldfp                       Load
ld.s                 ldf.s                   ldfp.s                     Speculative load
ld.a                 ldf.a                   ldfp.a                     Advanced load
ld.sa                ldf.sa                  ldfp.sa                    Speculative advanced load
ld.c.nc, ld.c.clr    ldf.c.nc, ldf.c.clr     ldfp.c.nc, ldfp.c.clr      Check load
ld.c.clr.acq                                                            Ordered check load
ld.acq                                                                  Ordered load
ld.bias                                                                 Biased load
ld8.fill             ldf.fill                                           Fill
st                   stf                                                Store
st.rel                                                                  Ordered store
st8.spill            stf.spill                                          Spill
cmpxchg                                                                 Compare and exchange
xchg                                                                    Exchange memory and GR
fetchadd                                                                Fetch and add



Special purpose stores are defined for spilling register values to memory. The st8.spill instruction stores a general register and the corresponding NaT bit (defined for 8-byte access only). This allows the result of a speculative calculation to be spilled to memory and restored. The stf.spill instruction stores a floating-point register in memory in the floating-point register format without conversion. This allows register spill and restore code to be written to be compatible with possible future extensions to the floating-point register format. The stf.spill instruction also does not fault if the register contains a NaTVal, and is defined for 16-byte access only. See “Register Spill and Fill” on page 4-12.

4.4.3 Semaphore Instructions

Semaphore instructions atomically load a general register from memory, perform an operation and then store a result to the same memory location. Semaphore instructions are always non-speculative. No base register update is provided.

Three types of atomic semaphore operations are defined: exchange (xchg); compare and exchange (cmpxchg); and fetch and add (fetchadd).

The xchg target is loaded with the zero-extended contents of the memory location addressed by the first source and then the second source is stored into the same memory location.

The cmpxchg target is loaded with the zero-extended contents of the memory location addressed by the first source; if the zero-extended value is equal to the contents of the Compare and Exchange Compare Value application register (CCV), then the second source is stored into the same memory location.

The fetchadd instruction specifies one general register source, one general register target, and an immediate. The fetchadd target is loaded with the zero-extended contents of the memory location addressed by the source and then the immediate is added to the loaded value and the result is stored into the same memory location.
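The three semaphore operations described above can be sketched as a small behavioural model. This is an illustrative sketch only, not the architectural definition: the dictionary-based memory, the function signatures, and the `size` parameter (access size in bytes) are assumptions made for the example, and atomicity is implied by each function running to completion.

```python
def _mask(size):
    """Bit mask for an access of `size` bytes (assumed sizes: 1, 2, 4, 8)."""
    return (1 << (8 * size)) - 1

def xchg(mem, addr, size, new_value):
    """Return the zero-extended old contents, then store the second source."""
    old = mem.get(addr, 0) & _mask(size)
    mem[addr] = new_value & _mask(size)
    return old

def cmpxchg(mem, addr, size, new_value, ccv):
    """Return the zero-extended old contents; store only if they equal CCV."""
    old = mem.get(addr, 0) & _mask(size)
    if old == ccv:
        mem[addr] = new_value & _mask(size)
    return old

def fetchadd(mem, addr, size, increment):
    """Return the zero-extended old contents, then store old + immediate."""
    old = mem.get(addr, 0) & _mask(size)
    mem[addr] = (old + increment) & _mask(size)
    return old

# Hypothetical usage: a lock word at address 0x100, a counter at 0x108.
memory = {0x100: 0, 0x108: 7}
previous_lock = cmpxchg(memory, 0x100, 8, new_value=1, ccv=0)  # acquires the lock
previous_count = fetchadd(memory, 0x108, 4, 1)                 # counter becomes 8
```

In every case the target receives the value that was in memory before the operation, which is what lets software decide afterwards whether a compare and exchange succeeded.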

4.4.4 Control Speculation

Special mechanisms are provided to allow for compiler-directed speculation. This speculation takes two forms, control speculation and data speculation, with a separate mechanism to support each. See also “Data Speculation” on page 4-12.

4.4.4.1 Control Speculation Concepts

Control speculation describes the compiler optimization where an instruction or a sequence of instructions is executed before it is known that the dynamic control flow of the program will actually reach the point in the program where the sequence of instructions is needed. This is done with instruction sequences that have long execution latencies. Starting the execution early allows the compiler to overlap the execution with other work, increasing the parallelism and decreasing overall execution time. The compiler performs this optimization when it determines that it is very likely that the dynamic control flow of the program will eventually require this calculation. In cases where the control flow is such that the calculation turns out not to be needed, its results are simply discarded (the results in processor registers are simply not used).

Since the speculative instruction sequence may not be required by the program, no exceptions encountered that would be visible to the program can be signalled until it is determined that the program’s control flow does require the execution of this instruction sequence. For this reason, a mechanism is provided for recording the occurrence of an exception so that it can be signalled later if and when it is necessary. In such a situation, the exception is said to be deferred. When an exception is deferred by an instruction, a special token is written into the target register to indicate the existence of a deferred exception in the program.

Deferred exception tokens are represented differently in the general and floating-point register files. In general registers, an additional bit is defined for each register called the NaT bit (Not a Thing). Thus general registers are 65 bits wide. A NaT bit equal to 1 indicates that the register contains a deferred exception token, and that its 64-bit data portion contains an implementation specific value that software cannot rely upon. In floating-point registers, a deferred exception is indicated by a specific pseudo-zero encoding called the NaTVal (see “Representation of Values in Floating-point Registers” on page 5-2 for details).

Table 4-13. State Relating to Memory Access

Register   Function
UM.be      User mask byte ordering
UM.ac      User mask Unaligned Data Reference fault enable
UNAT       GR NaT collection
CCV        Compare and Exchange Compare Value application register

4.4.4.2 Control Speculation and Instructions

Instructions are divided into two categories: speculative (instructions which can be used speculatively) and non-speculative (instructions which cannot). Non-speculative instructions will raise exceptions if they occur and are therefore unsafe to schedule before they are known to be executed. Speculative instructions defer exceptions (they do not raise them) and are therefore safe to schedule before they are known to be executed.

Loads to general and floating-point registers have both non-speculative (ld, ldf, ldfp) and speculative (ld.s, ldf.s, ldfp.s) variants. Generally, all computation instructions which write their results to general or floating-point registers are speculative. Any instruction that modifies state other than a general or floating-point register is non-speculative, since there would be no way to represent the deferred exception (there are a few exceptions).

Deferred exception tokens propagate through the program in a dataflow manner. A speculative instruction that reads a register containing a deferred exception token will propagate a deferred exception token into its target. Thus a chain of instructions can be executed speculatively, and only the result register need be checked for a deferred exception token to determine whether any exceptions occurred.

At the point in the program when it is known that the result of a speculative calculation is needed, a speculation check (chk.s) instruction is used. This instruction tests for a deferred exception token. If none is found, then the speculative calculation was successful, and execution continues normally. If a deferred exception token is found, then the speculative calculation was unsuccessful and must be re-done. In this case, the chk.s instruction branches to a new address (specified by an immediate offset in the chk.s instruction). Software can use this mechanism to invoke code that contains a copy of the speculative calculation (but with non-speculative loads). Since it is now known that the calculation is required, any exceptions which now occur can be signalled and handled normally.

Since computational instructions do not generally cause exceptions, the only instructions which generate deferred exception tokens are speculative loads. (IEEE floating-point exceptions are handled specially through a set of alternate status fields. See “Floating-point Status Register” on page 5-4.) Other speculative instructions propagate deferred exception tokens, but do not generate them.
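The dataflow propagation of deferred exception tokens and the role of chk.s can be illustrated with a short model. This is a sketch under stated assumptions: the `GR` class, the use of a missing dictionary key as a stand-in for a faulting address, and the choice of `MemoryError` are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1

@dataclass(frozen=True)
class GR:
    """A 65-bit general register: 64 data bits plus the NaT bit."""
    value: int = 0
    nat: bool = False

def ld_s(memory, addr):
    """Speculative load: defer the exception by setting NaT instead of faulting."""
    if addr not in memory:          # assumed stand-in for any load exception
        return GR(0, True)          # data portion is implementation specific
    return GR(memory[addr] & MASK64)

def ld(memory, addr):
    """Non-speculative load: the exception is raised immediately."""
    if addr not in memory:
        raise MemoryError(f"fault at {addr:#x}")
    return GR(memory[addr] & MASK64)

def add(a, b):
    """Speculative computation: propagates a token, never generates one."""
    if a.nat or b.nat:
        return GR(0, True)
    return GR((a.value + b.value) & MASK64)

def chk_s(reg):
    """Return True when the check fails and recovery code must run."""
    return reg.nat

def compute(memory, addr, addend):
    r6 = ld_s(memory, addr)         # hoisted above the point of need
    r5 = add(r6, GR(addend))        # dependent work, also speculative
    if chk_s(r5):                   # only the final result is checked
        r6 = ld(memory, addr)       # recovery: same work, non-speculative load
        r5 = add(r6, GR(addend))
    return r5.value
```

Only the last register in the chain is checked, and the recovery path is the point at which a genuine exception finally becomes visible to the program.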

4.4.4.3 Control Speculation and Compares

As stated earlier, most instructions that write a register file other than the general registers or the floating-point registers are non-speculative. The compare (cmp, cmp4, fcmp), test bit (tbit), floating-point class (fclass), and floating-point approximation (frcpa, frsqrta) instructions are special cases. These instructions read general or floating-point registers and write one or two predicate registers.

For these instructions, if any source contains a deferred exception token, all predicate targets are either cleared or left unchanged, depending on the compare type (see Table 4-10 on page 4-7). Software can use this behavior to ensure that any dependent conditional branches are not taken and any dependent predicated instructions are nullified. See “Predication” on page 4-6.

Deferred exception tokens can also be tested for with certain compare instructions. The test NaT (tnat) instruction tests the NaT bit corresponding to the specified general register and writes two predicate results. The floating-point class (fclass) instruction can be used to test for a NaTVal in a floating-point register and write the result to two predicate registers. (fclass does not clear both predicate targets in the presence of a NaTVal input if NaTVal is one of the classes being tested for.)

4.4.4.4 Control Speculation without Recovery

A non-speculative instruction that reads a register containing a deferred exception token will raise a Register NaT Consumption fault. Such instructions can be thought of as performing a non-recoverable speculation check operation. In some compilation environments, it may be true that the only exceptions that are deferred are fatal errors. In such a program, if the result of a speculative calculation is checked and a deferred exception token is found, execution of the program is terminated. For such a program, the results of speculative calculations can be checked simply by using non-speculative instructions.



4.4.4.5 Register Spill and Fill

Special store and load instructions are provided for spilling a register to memory and preserving any deferred exception token, and for restoring a spilled register.

The spill and fill general register instructions (st8.spill, ld8.fill) are defined to save/restore a general register along with the corresponding NaT bit.

The st8.spill instruction writes a general register’s NaT bit into the User NaT Collection application register (UNAT), and, if the NaT bit was 0, writes the register’s 64-bit data portion to memory. If the register’s NaT bit was 1, the UNAT is updated, but the memory update is implementation specific, and must consistently follow one of three spill behaviors:

• The st8.spill may not update memory with the register’s 64-bit data portion, or

• The st8.spill may write a zero to the specified memory location, or

• The st8.spill may write the register’s 64-bit data portion to memory, only if that implementation returns a zero into the target register of all NaTed speculative loads, and that implementation also guarantees that all NaT propagating instructions perform all computations as specified by the instruction pages.

Bits 8:3 of the memory address determine which bit in the UNAT register is written.

The ld8.fill instruction loads a general register from memory taking the corresponding NaT bit from the bit in the UNAT register addressed by bits 8:3 of the memory address. The UNAT register must be saved and restored by software. It is the responsibility of software to ensure that the contents of the UNAT register are correct while executing st8.spill and ld8.fill instructions.
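The mapping from a spill address to a UNAT bit can be worked through in a few lines. The sketch below is illustrative only: the function names and the dictionary-based memory are assumptions, and of the three permitted behaviors for a NaTed register it arbitrarily models the one that writes a zero.

```python
def unat_bit_index(addr):
    """Bits 8:3 of the memory address select one of the 64 UNAT bits."""
    return (addr >> 3) & 0x3F

def st8_spill(mem, unat, addr, value, nat):
    """Spill a general register; return the updated UNAT value."""
    bit = unat_bit_index(addr)
    unat = (unat & ~(1 << bit)) | (int(nat) << bit)
    # Assumed behavior for a NaTed register: write a zero to memory.
    mem[addr] = 0 if nat else value
    return unat

def ld8_fill(mem, unat, addr):
    """Fill a general register; return (value, NaT bit)."""
    bit = unat_bit_index(addr)
    return mem[addr], bool((unat >> bit) & 1)

# Hypothetical usage: spill a NaTed register and a valid one, then restore both.
memory, unat = {}, 0
unat = st8_spill(memory, unat, 0x1000, value=123, nat=True)
unat = st8_spill(memory, unat, 0x1008, value=456, nat=False)
restored_a = ld8_fill(memory, unat, 0x1000)
restored_b = ld8_fill(memory, unat, 0x1008)
```

Because only six address bits are used, two 8-byte slots that are 512 bytes apart select the same UNAT bit. This is one reason software must save and restore UNAT itself when a spill area is larger than 64 registers.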

The floating-point spill and fill instructions (stf.spill, ldf.fill) are defined to save/restore a floating-point register (saved as 16 bytes) without surfacing an exception if the FR contains a NaTVal (these instructions do not affect the UNAT register).

The general and floating-point spill/fill instructions allow spilling/filling of registers that are targets of a speculative instruction and may therefore contain a deferred exception token. Note also that transfers between the general and floating-point register files cause a conversion between the two deferred exception token formats.

Table 4-14 lists the state relating to control speculation. Table 4-15 summarizes the instructions related to control speculation.

4.4.5 Data Speculation

Just as control speculative loads and checks allow the compiler to schedule instructions across control dependences, data speculative loads and checks allow the compiler to schedule instructions across some types of ambiguous data dependences. This section details the usage model and semantics of data speculation and related instructions.

Table 4-14. State Related to Control Speculation

Register   Description
NaT bits   65th bit associated with each GR indicating a deferred exception
NaTVal     Pseudo-Zero encoding for FR indicating a deferred exception
UNAT       User NaT collection application register

Table 4-15. Instructions Related to Control Speculation

Mnemonic               Operation
ld.s, ldf.s, ldfp.s    GR and FR speculative loads
ld8.fill, ldf.fill     Fill GR with NaT collection, fill FR
st8.spill, stf.spill   Spill GR with NaT collection, spill FR
chk.s                  Test GR or FR for deferred exception token
tnat                   Test GR NaT bit and set predicate



4.4.5.1 Data Speculation Concepts

An ambiguous memory dependence is said to exist between a store (or any operation that may update memory state) and a load when it cannot be statically determined whether the load and store might access overlapping regions of memory. For convenience, a store that cannot be statically disambiguated relative to a particular load is said to be ambiguous relative to that load. In such cases, the compiler cannot change the order in which the load and store instructions were originally specified in the program. To overcome this scheduling limitation, a special kind of load instruction called an advanced load can be scheduled to execute earlier than one or more stores that are ambiguous relative to that load.

As with control speculation, the compiler can also speculate operations that are dependent upon the advanced load and later insert a check instruction that will determine whether the speculation was successful or not. For data speculation, the check can be placed anywhere the original non-data speculative load could have been scheduled.

Thus, a data-speculative sequence of instructions consists of an advanced load, zero or more instructions dependent on the value of that load, and a check instruction. This means that any sequence of stores followed by a load can be transformed into an advanced load followed by a sequence of stores followed by a check. The decision to perform such a transformation is highly dependent upon the likelihood and cost of recovering from an unsuccessful data speculation.

4.4.5.2 Data Speculation and Instructions

Advanced loads are available in integer (ld.a), floating-point (ldf.a), and floating-point pair (ldfp.a) forms. When an advanced load is executed, it allocates an entry in a structure called the Advanced Load Address Table (ALAT). Later, when a corresponding check instruction is executed, the presence of an entry indicates that the data speculation succeeded; otherwise, the speculation failed and one of two kinds of compiler-generated recovery is performed:

1. The check load instruction (ld.c, ldf.c, or ldfp.c) is used for recovery when the only instruction scheduled before a store that is ambiguous relative to the advanced load is the advanced load itself. The check load searches the ALAT for a matching entry. If found, the speculation was successful; if a matching entry was not found, the speculation was unsuccessful and the check load reloads the correct value from memory. Figure 4-2 shows this transformation.

2. The advanced load check (chk.a) is used when an advanced load and several instructions that depend on the loaded value are scheduled before a store that is ambiguous relative to the advanced load. The advanced load check works like the speculation check (chk.s) in that, if the speculation was successful, execution continues inline and no recovery is necessary; if speculation was unsuccessful, the chk.a branches to compiler-generated recovery code. The recovery code contains instructions that will re-execute all the work that was dependent on the failed data speculative load up to the point of the check instruction. As with the check load, the success of a data speculation using an advanced load check is determined by searching the ALAT for a matching entry. This transformation is shown in Figure 4-3.

Before Data Speculation

    // other instructions
    st8 [r4] = r12
    ld8 r6 = [r8];;
    add r5 = r6, r7;;
    st8 [r18] = r5

After Data Speculation

    ld8.a r6 = [r8];;       // advanced load
    // other instructions
    st8 [r4] = r12
    ld8.c.clr r6 = [r8]     // check load
    add r5 = r6, r7;;
    st8 [r18] = r5

Figure 4-2. Data Speculation Recovery Using ld.c



Recovery code may use either a normal or advanced load to obtain the correct value for the failed advanced load. An advanced load is used only when it is advantageous to have an ALAT entry reallocated after a failed speculation. The last instruction in the recovery code should branch to the instruction following the chk.a.

4.4.5.3 Detailed Functionality of the ALAT and Related Instructions

The ALAT is the structure that holds the state necessary for advanced loads and checks to operate correctly. The ALAT is searched in two different ways: by physical addresses and by ALAT register tags. An ALAT register tag is a unique number derived from the physical target register number and type in conjunction with other implementation-specific state. Implementation-specific state might include register stack wrap-around information to distinguish one instance of a physical register that may have been spilled by the RSE from the current instance of that register, thus avoiding the need to purge the ALAT on all register stack wrap-arounds.

IA-32 instruction set execution leaves the contents of the ALAT undefined. Software cannot rely on ALAT values being preserved across an instruction set transition. On entry to the IA-32 instruction set, existing entries in the ALAT are ignored.

Allocating and Checking ALAT Entries

Advanced loads perform the following actions:

1. The ALAT register tag for the advanced load is computed. (For ldfp.a, a tag is computed only for the first target register.)

2. If an entry with a matching ALAT register tag exists, it is removed.

3. A new entry is allocated in the ALAT which contains the new ALAT register tag, the load access size, and a tag derived from the physical memory address.

4. The value at the address specified in the advanced load is loaded into the target register and, if specified, the base register is updated and an implicit prefetch is performed.

Since the success of a check is determined by finding a matching register tag in the ALAT, both the chk.a and the target register of a ld.c must specify the same register as their corresponding advanced load. Additionally, the check load must use the same address and operand size as the corresponding advanced load; otherwise, the value written into the target register by the check load is undefined.

An advanced load check performs the following actions:

1. It looks for a matching ALAT entry and if found, falls through to the next instruction.

2. If no matching entry is found, the chk.a branches to the specified address.

An implementation may choose to implement a failing advanced load check directly as a branch or as a fault where the fault-handler emulates the branch. Although the expected mode of operation is for an implementation to detect matching entries in the ALAT during checks, an implementation may fail a check instruction even when an entry with a matching ALAT register tag exists. This will be a rare occurrence but software must not assume that the ALAT does not contain the entry.

Before Data Speculation

    // other instructions
    st8 [r4] = r12
    ld8 r6 = [r8];;
    add r5 = r6, r7;;
    st8 [r18] = r5

After Data Speculation

    ld8.a r6 = [r8];;
    // other instructions
    add r5 = r6, r7;;
    // other instructions
    st8 [r4] = r12
    chk.a.clr r6, recover
    back:
    st8 [r18] = r5

    // somewhere else in program
    recover:
    ld8 r6 = [r8];;
    add r5 = r6, r7
    br back

Figure 4-3. Data Speculation Recovery Using chk.a



A check load checks for a matching entry in the ALAT. If no matching entry is found, it reloads the value from memory and any faults that occur during the memory reference are raised. When a matching entry is found, the target register is left unchanged.

If the check load was an ordered check load (ld.c.clr.acq), then it is performed with the semantics of an ordered load (ld.acq). ALAT register tag lookups by advanced load checks and check loads are subject to memory ordering constraints as outlined in “Memory Access Ordering” on page 4-18.

In addition to the flexibility described above, the size, organization, matching algorithm, and replacement algorithm of the ALAT are implementation dependent. Thus, the success or failure of specific advanced loads and checks in a program may change: when the program is run on different processor implementations, within the execution of a single program on the same implementation, or between different runs on the same implementation.
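The allocation and checking steps can be made concrete with a toy ALAT that replays the two transformed sequences from Figures 4-2 and 4-3. This is a sketch under several assumptions: the register name stands in for the ALAT register tag, the table is unbounded and never fails spuriously, memory is a dictionary of 8-byte values keyed by address, and all class and function names are invented for the example.

```python
class ALAT:
    """Toy ALAT keyed by target register name (standing in for the tag)."""

    def __init__(self):
        self.entries = {}  # register tag -> (address, size in bytes)

    def advanced_load(self, mem, reg, addr, size=8):
        self.entries.pop(reg, None)        # remove any entry with the same tag
        self.entries[reg] = (addr, size)   # allocate the new entry
        return mem.get(addr, 0)            # load the value

    def store(self, mem, addr, value, size=8):
        mem[addr] = value
        # Remove every entry whose memory region overlaps the stored bytes.
        self.entries = {
            reg: (a, s)
            for reg, (a, s) in self.entries.items()
            if a + s <= addr or addr + size <= a
        }

    def check_load_clr(self, mem, reg, addr, current):
        """ld.c.clr: keep the register on a hit, reload on a miss."""
        hit = self.entries.pop(reg, None) is not None
        return current if hit else mem.get(addr, 0)

    def chk_a_clr(self, reg):
        """chk.a.clr: True falls through, False branches to recovery."""
        return self.entries.pop(reg, None) is not None


def figure_4_2(mem, r4, r8, r12, r7, r18):
    alat = ALAT()
    r6 = alat.advanced_load(mem, "r6", r8)        # ld8.a r6 = [r8]
    alat.store(mem, r4, r12)                      # st8 [r4] = r12
    r6 = alat.check_load_clr(mem, "r6", r8, r6)   # ld8.c.clr r6 = [r8]
    r5 = r6 + r7                                  # add r5 = r6, r7
    alat.store(mem, r18, r5)                      # st8 [r18] = r5
    return r5


def figure_4_3(mem, r4, r8, r12, r7, r18):
    alat = ALAT()
    r6 = alat.advanced_load(mem, "r6", r8)        # ld8.a r6 = [r8]
    r5 = r6 + r7                                  # dependent work, hoisted
    alat.store(mem, r4, r12)                      # st8 [r4] = r12
    if not alat.chk_a_clr("r6"):                  # chk.a.clr r6, recover
        r6 = mem.get(r8, 0)                       # recover: ld8 r6 = [r8]
        r5 = r6 + r7                              #          add r5 = r6, r7
    alat.store(mem, r18, r5)                      # back: st8 [r18] = r5
    return r5
```

When r4 and r8 differ, the store leaves the entry in place and both sequences use the early value. When they are equal, the store removes the entry and both sequences end up with the value the original program order would have produced.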

Invalidating ALAT Entries

In addition to entries removed by advanced loads, ALAT entry invalidations can occur implicitly by events that alter memory state or explicitly by any of the following instructions: ld.c.clr, ld.c.clr.acq, chk.a.clr, invala, invala.e. Events that may implicitly invalidate ALAT entries include those that change memory state or memory translation state such as:

1. The execution of stores or semaphores on other processors in the coherence domain.

2. The execution of store or semaphore instructions issued on the local processor.

When one of these events occurs, hardware checks each memory region represented by an entry in the ALAT to see if it overlaps with the locations affected by the invalidation event. ALAT entries whose memory regions overlap with the invalidation event locations are removed.

4.4.5.4 Combining Control and Data Speculation

Control speculation and data speculation are not mutually exclusive; a given load may be both control and data speculative. Both control speculative (ld.sa, ldf.sa, ldfp.sa) and non-control speculative (ld.a, ldf.a, ldfp.a) variants of advanced loads are defined for general and floating-point registers. If a speculative advanced load generates a deferred exception token then:

1. Any existing ALAT entry with the same ALAT register tag is invalidated.

2. No new ALAT entry is allocated.

3. If the target of the load was a general-purpose register, its NaT bit is set.

4. If the target of the load was a floating-point register, then NaTVal is written to the target register.

If a speculative advanced load does not generate a deferred exception, then its behavior is the same as the corresponding non-control speculative advanced load.

Since there can be no matching entry in the ALAT after a deferred fault, a single advanced load check or check load is sufficient to check both for data speculation failures and to detect deferred exceptions.
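The reason a single check covers both kinds of speculation can be seen in a small model of a speculative advanced load into a general register. This is only a sketch: the ALAT is a plain dictionary keyed by register name, a missing memory key is an assumed stand-in for any exception condition, and the function names are hypothetical.

```python
def ld_sa(alat, mem, reg, addr, size=8):
    """Speculative advanced load; returns (value, NaT bit)."""
    alat.pop(reg, None)            # any entry with the same tag is invalidated
    if addr not in mem:            # assumed stand-in for a deferred exception
        return 0, True             # NaT set, and no new ALAT entry allocated
    alat[reg] = (addr, size)       # otherwise behave like a plain advanced load
    return mem[addr], False

def check_load(alat, mem, reg, addr, current_value):
    """Check load: a miss re-executes the load non-speculatively."""
    if reg in alat:
        return current_value       # speculation succeeded
    if addr not in mem:
        raise MemoryError(f"fault at {addr:#x}")   # exception now surfaces
    return mem[addr]
```

A deferred exception leaves no entry behind, so the later check misses exactly as it would after an intervening store, and the reload on the miss path is where the original exception is finally raised.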

4.4.5.5 Instruction Completers for ALAT Management

To help the compiler manage the allocation and deallocation of ALAT entries, two variants of advanced load checks and check loads are provided: variants with clear (chk.a.clr, ld.c.clr, ld.c.clr.acq, ldf.c.clr, ldfp.c.clr) and variants with no clear (chk.a.nc, ld.c.nc, ldf.c.nc, ldfp.c.nc).

The clear variants are used when the compiler knows that the ALAT entry will not be used again and wants the entry explicitly removed. This allows software to indicate when entries are unneeded, making it less likely that a useful entry will be unnecessarily forced out because all entries are currently allocated.

For the clear variants of check load, any ALAT entry with the same ALAT register tag is invalidated independently of whether the address or size fields of the check load and the corresponding advanced load match. For chk.a.clr, the entry is guaranteed to be invalidated only when the instruction falls through (the recovery code is not executed). Thus, a failing chk.a.clr may or may not clear any matching ALAT entries. In such cases, the recovery code must explicitly invalidate the entry in question if program correctness depends on the entry being absent after a failed chk.a.clr.



Non-clear variants of both kinds of data speculation checks act as a hint to the processor that an existing entry should be maintained in the ALAT or that a new entry should be allocated when a matching ALAT entry doesn’t exist. Such variants can be used within loops to check advanced loads which were presumed loop-invariant and moved out of the loop by the compiler. This behavior ensures that if the check load fails on one iteration, then the check load will not necessarily fail on all subsequent iterations. Whenever a new entry is inserted into the ALAT or when the contents of an entry are updated, the information written into the ALAT only uses information from the check load and does not use any residual information from a prior entry. The non-clear variant of chk.a, chk.a.nc, does not allocate entries and the ‘nc’ completer acts as a hint to the processor that the entry should not be cleared.

Table 4-16 and Table 4-17 summarize state and instructions relating to data speculation.

4.4.6 Memory Hierarchy Control and Consistency

4.4.6.1 Hierarchy Control and Hints

IA-64 memory access instructions are defined to specify whether the data being accessed possesses temporal locality. In addition, memory access instructions can specify which levels of the memory hierarchy are affected by the access. This leads to an architectural view of the memory hierarchy depicted in Figure 4-4 composed of zero or more levels of cache between the register files and memory where each level may consist of two parallel structures: a temporal structure and a non-temporal structure. Note that this view applies to data accesses and not instruction accesses.

Table 4-16. State Relating to Data Speculation

Structure   Function
ALAT        Advanced load address table

Table 4-17. Instructions Relating to Data Speculation

Mnemonic                                        Operation
ld.a, ldf.a, ldfp.a                             GR and FR advanced load
st, st.rel, st8.spill, stf, stf.spill           GR and FR store
cmpxchg, fetchadd, xchg                         GR semaphore
ld.c.clr, ld.c.clr.acq, ldf.c.clr, ldfp.c.clr   GR and FR check load, clear on ALAT hit
ld.c.nc, ldf.c.nc, ldfp.c.nc                    GR and FR check load, re-allocate on ALAT miss
ld.sa, ldf.sa, ldfp.sa                          GR and FR speculative advanced load
chk.a.clr, chk.a.nc                             GR and FR advanced load check
invala                                          Invalidate all ALAT entries
invala.e                                        Invalidate individual ALAT entry for GR or FR

Figure 4-4. Memory Hierarchy

(Diagram labels: Register Files; Cache Level 1, Level 2, ... Level N, with Temporal Structure and Non-temporal Structure; Memory.)


The temporal structures cache memory accessed with temporal locality; the non-temporal structures cache memory accessed without temporal locality. Both structures assume that memory accesses possess spatial locality. The existence of separate temporal and non-temporal structures, as well as the number of levels of cache, is implementation dependent.

Three mechanisms are defined for allocation control: locality hints; explicit prefetch; and implicit prefetch. Locality hints are specified by load, store, and explicit prefetch (lfetch) instructions. A locality hint specifies a hierarchy level (e.g., 1, 2, all). An access that is temporal with respect to a given hierarchy level is treated as temporal with respect to all lower (higher numbered) levels. An access that is non-temporal with respect to a given hierarchy level is treated as temporal with respect to all lower levels. Finding a cache line closer in the hierarchy than specified in the hint does not demote the line. This enables the precise management of lines using lfetch and then subsequent uses by .nta loads and stores to retain that level in the hierarchy. For example, specifying the .nt2 hint by a prefetch indicates that the data should be cached at level 3. Subsequent loads and stores can specify .nta and have the data remain at level 3.

Locality hints do not affect the functional behavior of the program and may be ignored by the implementation. The locality hints available to loads, stores, and explicit prefetch instructions are given in Table 4-18. Instruction accesses are considered to possess both temporal and spatial locality with respect to level 1.

Each locality hint implies a particular allocation path in the memory hierarchy. The allocation paths corresponding to the locality hints are depicted in Figure 4-5. The allocation path specifies the structures in which the line containing the data being referenced would best be allocated. If the line is already at the same or higher level in the hierarchy no movement occurs. Hinting that a datum should be cached in a temporal structure indicates that it is likely to be read in the near future.

Explicit prefetch is defined in the form of the line prefetch instruction (lfetch, lfetch.fault). The lfetch instruction moves the line containing the addressed byte to a location in the memory hierarchy specified by the locality hint. If the line is already at the same or higher level in the hierarchy, no movement occurs. Both immediate and register post-increment are defined for lfetch and lfetch.fault. The lfetch instruction does not cause any exceptions, does not affect program behavior, and may be ignored by the implementation. The lfetch.fault instruction affects the memory hierarchy in exactly the same way as lfetch but takes exceptions as if it were a 1-byte load instruction.

Table 4-18. Locality Hints Specified by Each Instruction Class

                                        Instruction Type
Mnemonic   Locality Hint                Load   Store   lfetch, lfetch.fault
none       Temporal, level 1            x      x       x
nt1        Non-temporal, level 1        x              x
nt2        Non-temporal, level 2                       x
nta        Non-temporal, all levels     x      x       x

Figure 4-5. Allocation Paths Supported in the Memory Hierarchy

(Diagram labels: Cache Level 1, Level 2, Level 3, each with a Temporal Structure and a Non-temporal Structure; Memory. Allocation paths: temporal, level 1; non-temporal, level 1; non-temporal, level 2; non-temporal, all levels.)




Implicit prefetch is based on the address post-increment of loads, stores, lfetch and lfetch.fault. The line containing the post-incremented address is moved in the memory hierarchy based on the locality hint of the originating load, store, lfetch or lfetch.fault. If the line is already at the same or higher level in the hierarchy then no movement occurs. Implicit prefetch does not cause any exceptions, does not affect program behavior, and may be ignored by the implementation.

Another form of hint that can be provided on loads is the ld.bias load type. This is a hint to the implementation to acquire exclusive ownership of the line containing the addressed data. The bias hint does not affect program functionality and may be ignored by the implementation.

The fc instruction invalidates the cache line in all levels of the memory hierarchy above memory. If the cache line is not consistent with memory, then it is copied into memory before invalidation.

Table 4-19 summarizes the memory hierarchy control instructions and hint mechanisms.

4.4.6.2 Memory Consistency

IA-64 instruction accesses made by a processor are not coherent with respect to instruction and/or data accesses made by any other processor, nor are instruction accesses made by a processor coherent with respect to data accesses made by that same processor. Therefore, hardware is not required to keep a processor’s instruction caches consistent with respect to any processor’s data caches, including that processor’s own data caches; nor is hardware required to keep a processor’s instruction caches consistent with respect to any other processor’s instruction caches. Data accesses from different processors in the same coherence domain are coherent with respect to each other; this consistency is provided by the hardware. Data accesses from the same processor are subject to data dependency rules; see Section 4.4.7, “Memory Access Ordering” below.

The mechanism(s) by which coherence is maintained is implementation dependent. Separate or unified structures for caching data and instructions are not architecturally visible. Within this context there are two categories of data memory hierarchy control: allocation and flush. Allocation refers to movement towards the processor in the hierarchy (lower numbered levels) and flush refers to movement away from the processor in the hierarchy (higher numbered levels). Allocation and flush occur in line-sized units; the minimum architecturally visible line size is 32-bytes (aligned on a 32-byte boundary). The line size in an implementation may be smaller in which case the implementation will need to move multiple lines for each allocation and flush event. An implementation may allocate and flush in units larger than 32-bytes.

In order to guarantee that a write from a given processor becomes visible to the instruction stream of that same, and other, processors, the affected line(s) must be flushed to memory. Software may use the fc instruction for this purpose. Memory updates by DMA devices are coherent with respect to instruction and data accesses of processors. The consistency between instruction and data caches of processors with respect to memory updates by DMA devices is provided by the hardware. In case a program modifies its own instructions, the sync.i and srlz.i instructions are used to ensure that prior coherency actions are observed by a given point in the program. Refer to the description of sync.i on page 6-172 for an example of self-modifying code.
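When software flushes a modified range of bytes, it needs one flush per line that the range touches. The arithmetic can be sketched as below, using the 32-byte minimum architecturally visible line size from the text. The function name and parameters are hypothetical, and the sketch only computes the addresses; it does not model the fc instruction itself.

```python
MIN_LINE_SIZE = 32  # minimum architecturally visible line size, in bytes

def lines_to_flush(addr, length, line_size=MIN_LINE_SIZE):
    """Return the aligned address of every line touched by [addr, addr+length).

    Assumes length >= 1 and that line_size is a power of two.
    """
    first = addr & ~(line_size - 1)
    last = (addr + length - 1) & ~(line_size - 1)
    return list(range(first, last + 1, line_size))

# Hypothetical usage: 32 bytes of modified code starting at 0x1010
# straddle two 32-byte lines, so two flushes are needed.
flush_addresses = lines_to_flush(0x1010, 32)
```

Stepping by the 32-byte minimum remains safe on an implementation that flushes in larger units, since every larger line is then addressed at least once.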

Table 4-19. Memory Hierarchy Control Instructions and Hint Mechanisms

Mnemonic                                                      Operation
.nt1 and .nta completer on loads                              Load usage hints
.nta completer on stores                                      Store usage hints
prefetch line at post-increment address on loads and stores   Prefetch hint
lfetch, lfetch.fault with .nt1, .nt2, and .nta hints          Prefetch line
fc                                                            Flush cache

4.4.7 Memory Access Ordering

Memory data access ordering must satisfy read-after-write (RAW), write-after-write (WAW), and write-after-read (WAR) data dependencies to the same memory location. In addition, memory writes and flushes must observe control dependencies. Except for these restrictions, reads, writes, and flushes may occur in an order different from the specified program order. Note that no ordering exists between instruction accesses and data accesses or between any two instruction accesses. The mechanisms described below are defined to enforce a particular memory access order. In the following discussion, the terms “previous” and “subsequent” are used to refer to the program specified order. The term “visible” is used to refer to all architecturally visible effects of performing a memory access (at a minimum this involves reading or writing memory).

Memory accesses follow one of four memory ordering semantics: unordered, release, acquire or fence. Unordered data accesses may become visible in any order. Release data accesses guarantee that all previous data accesses are made visible prior to being made visible themselves. Acquire data accesses guarantee that they are made visible prior to all subsequent data accesses. Fence operations combine the release and acquire semantics into a bi-directional fence, i.e., they guarantee that all previous data accesses are made visible prior to any subsequent data accesses being made visible.

Explicit memory ordering takes the form of a set of instructions: ordered load and ordered check load (ld.acq, ld.c.clr.acq), ordered store (st.rel), semaphores (cmpxchg, xchg, fetchadd), and memory fence (mf). The ld.acq and ld.c.clr.acq instructions follow acquire semantics. The st.rel instruction follows release semantics. The mf instruction is a fence operation. The xchg, fetchadd.acq, and cmpxchg.acq instructions have acquire semantics. The cmpxchg.rel and fetchadd.rel instructions have release semantics. The semaphore instructions also have implicit ordering. If there is a write, it will always follow the read. In addition, the read and write will be performed atomically with no intervening accesses to the same memory region.

Table 4-20 illustrates the ordering interactions between memory accesses with different ordering semantics. “O” indicates that the first and second reference are performed in order with respect to each other. A “-” indicates that no ordering is implied other than data dependencies (and control dependencies for writes and flushes).

Table 4-21 summarizes memory ordering instructions related to cacheable memory.
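The entries of Table 4-20 follow from the definitions of the four ordering semantics. The sketch below is one illustrative way to derive them; the names SEMANTICS, is_ordered, and ordering_table are hypothetical and are not part of the architecture.

```python
# Illustrative model of the ordering rules; all names here are hypothetical.
SEMANTICS = ("fence", "acquire", "release", "unordered")


def is_ordered(first: str, second: str) -> bool:
    """Return True if `first` must become visible before `second`."""
    if first not in SEMANTICS or second not in SEMANTICS:
        raise ValueError("unknown ordering semantics")
    # Acquire (and fence) accesses become visible before all subsequent accesses.
    first_orders_later = first in ("acquire", "fence")
    # Release (and fence) accesses wait until all previous accesses are visible.
    second_orders_earlier = second in ("release", "fence")
    return first_orders_later or second_orders_earlier


def ordering_table() -> dict:
    """Build the full first-reference by second-reference table."""
    return {
        first: {second: "O" if is_ordered(first, second) else "-"
                for second in SEMANTICS}
        for first in SEMANTICS
    }
```

A "-" result only means that no ordering is implied by the semantics themselves; data dependencies, and control dependencies for writes and flushes, still apply.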

Table 4-20. Memory Ordering Rules

                    Second Reference
First Reference     Fence     Acquire   Release   Unordered
fence               O         O         O         O
acquire             O         O         O         O
release             O         -         O         -
unordered           O         -         O         -

Table 4-21. Memory Ordering Instructions

Mnemonic                     Operation
ld.acq, ld.c.clr.acq         Ordered load and ordered check load
st.rel                       Ordered store
xchg                         Exchange memory and general register
cmpxchg.acq, cmpxchg.rel     Conditional exchange of memory and general register
fetchadd.acq, fetchadd.rel   Add immediate to memory
mf                           Memory ordering fence

4.5 Branch Instructions

Branch instructions effect a transfer of control flow to a new address. Branch targets are bundle-aligned, which means control is always passed to the first instruction slot of the target bundle (slot 0). Branch instructions are not required to be the last instruction in an instruction group. In fact, an instruction group can contain arbitrarily many branches (provided that the normal RAW and WAW dependency requirements are met). If a branch is taken, only instructions up to the taken branch will be executed. After a taken branch, the next instruction executed will be at the target of the branch.

There are two categories of branches: IP-relative branches, and indirect branches. IP-relative branches specify their target with a signed 21-bit displacement, which is added to the IP of the bundle containing the branch to give the address of the target bundle. The displacement allows a branch reach of ±16 MBytes and is bundle-aligned. Indirect branches use the branch registers to specify the target address.

There are several branch types, as shown in Table 4-22. The conditional branch br is a branch which is taken if the specified predicate is 1, and not-taken otherwise. The conditional call branch br.call does the same thing, and in addition, writes a link address to a specified branch register and adjusts the general register stack (see “Register Stack” on page 4-1). The conditional return br.ret does the same thing as an indirect conditional branch, plus it adjusts the general register stack. Unconditional branches, calls and returns are executed by specifying PR 0 (which is always 1) as the predicate for the branch instruction.
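The IP-relative target computation can be sketched as follows. This is a minimal model, assuming a 16-byte bundle (which is what makes a signed 21-bit bundle displacement reach ±16 MBytes) and a 64-bit address space; the names BUNDLE_BYTES, sign_extend, and ip_relative_target are hypothetical.

```python
BUNDLE_BYTES = 16  # assumption: one bundle occupies 16 bytes
ADDRESS_MASK = (1 << 64) - 1  # assumption: 64-bit addresses


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def ip_relative_target(ip: int, imm21: int) -> int:
    """Add a signed 21-bit bundle displacement to the IP of the branch's bundle."""
    bundle_ip = ip & ~(BUNDLE_BYTES - 1)  # drop the slot bits of the IP
    displacement = sign_extend(imm21, 21) * BUNDLE_BYTES
    return (bundle_ip + displacement) & ADDRESS_MASK
```

Under these assumptions the largest backward displacement is 2^20 bundles (16 MBytes) and the largest forward displacement is one bundle short of 16 MBytes.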

The counted loop type (CLOOP) uses the Loop Count (LC) application register. If LC is non-zero then it is decremented and the branch is taken. If LC is zero, the branch falls through. The modulo-scheduled loop type branches (CTOP, CEXIT, WTOP, WEXIT) are described in “Modulo-Scheduled Loop Support” on page 4-20. The loop type branches (CLOOP, CTOP, CEXIT, WTOP, WEXIT) are allowed only in slot 2 of a bundle. A loop type branch executed in slot 0 or 1 will cause an Illegal Operation fault.
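The CLOOP behavior above can be modeled in a few lines. The sketch assumes the branch sits at the bottom of the loop body, so the body runs once before the first test; br_cloop and run_counted_loop are hypothetical names used only for illustration.

```python
def br_cloop(lc: int):
    """Model one execution of a counted loop branch; returns (taken, new_lc)."""
    if lc != 0:
        return True, lc - 1  # LC non-zero: decrement and take the branch
    return False, lc         # LC zero: fall through, LC unchanged


def run_counted_loop(initial_lc: int) -> int:
    """Count how many times a loop body ending in br.cloop is executed."""
    lc = initial_lc
    executions = 0
    while True:
        executions += 1  # the loop body runs, then the branch is evaluated
        taken, lc = br_cloop(lc)
        if not taken:
            return executions
```

In this model a loop body runs LC + 1 times, so software would load LC with one less than the desired trip count.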

Instructions are provided to move data between branch registers and general registers (mov =br, mov br=). Table 4-23 and Table 4-24 summarize the state and instructions relating to branching.

Table 4-22. Branch Types

Mnemonic            Function                           Branch Condition                        Target Address
br.cond or br       Conditional branch                 Qualifying predicate                    IP-rel or Indirect
br.call             Conditional procedure call         Qualifying predicate                    IP-rel or Indirect
br.ret              Conditional procedure return       Qualifying predicate                    Indirect
br.ia               Invoke the IA-32 instruction set   Unconditional                           Indirect
br.cloop            Counted loop branch                Loop count                              IP-rel
br.ctop, br.cexit   Modulo-scheduled counted loop      Loop count and Epilog count             IP-rel
br.wtop, br.wexit   Modulo-scheduled while loop        Qualifying predicate and Epilog count   IP-rel

Table 4-23. State Relating to Branching

Register   Function
BRs        Branch registers
PRs        Predicate registers
CFM        Current Frame Marker
PFS        Previous Function State application register
LC         Loop Count application register
EC         Epilog Count application register

Table 4-24. Instructions Relating to Branching

Mnemonic   Operation
br         Branch
mov =br    Move from BR to GR
mov br=    Move from GR to BR

4.5.1 Modulo-Scheduled Loop Support

Support for software-pipelined loops is provided through rotating registers and loop branch types. Software pipelining of a loop is analogous to hardware pipelining of a functional unit. The loop body is partitioned into multiple “stages” with zero or more instructions in each stage. Modulo-scheduled loops have 3 phases: prolog, kernel, and epilog. During the prolog phase, new loop iterations are started each time around (filling the software pipeline). During the kernel phase, the pipeline is full. A new loop iteration is started, and another is finished each time around. During the epilog phase, no new iterations are started, but previous iterations are completed (draining the software pipeline).

A predicate is assigned to each stage to control the activation of the instructions in that stage (this predicate is called the “stage predicate”). To support the pipelining effect of stage predicates and registers in a software-pipelined loop, a fixed sized area of the predicate and floating-point register files (PR16-PR63 and FR32-FR127), and a programmable sized area of the general register file, are defined to “rotate.” The size of the rotating area in the general register file is determined by an immediate in the alloc instruction. This immediate must be either zero or a multiple of 8. The general register rotating area is defined to start at GR32 and overlay the local and output areas, depending on their relative sizes. The stage predicates are allocated in the rotating area of the predicate register file. For counted loops, PR16 is architecturally defined to be the first stage predicate with subsequent stage predicates extending to higher predicate register numbers. For while loops, the first stage predicate may be any rotating predicate with subsequent stage predicates extending to higher predicate register numbers. Software is required to initialize the stage (rotating) predicates prior to entering the loop. An alloc instruction may not change the size of the rotating portion of the register stack frame unless all rotating register bases (rrb’s) in the CFM are zero. All rrb’s can be set to zero with the clrrrb instruction. The clrrrb.pr form can be used to clear just the rrb for the predicate registers. The clrrrb instruction must be the last instruction in an instruction group.

Rotation by one register position occurs when a software-pipelined loop type branch is executed. Registers are rotated towards larger register numbers in a wrap-around fashion. For example, the value in register X will be located in register X+1 after one rotation. If X is the highest addressed rotating register its value will wrap to the lowest addressed rotating register. Rotation is implemented by renaming register numbers based upon the value of a rotating register base (rrb) contained in CFM. An rrb is defined for each of the three rotating register files: CFM.rrb.gr for the general registers; CFM.rrb.fr for the floating-point registers; CFM.rrb.pr for the predicate registers. General registers only rotate when the size of the rotating region is not equal to zero. Floating-point and predicate registers always rotate. When rotation occurs, two or all three rrb’s are decremented in unison. Each rrb is decremented modulo the size of its respective rotating region (e.g., 96 for rrb.fr). The operation of the rotating register rename mechanism is not otherwise visible to software. The instructions that modify the rrb’s are listed in Table 4-25.
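The renaming described above can be illustrated with a small model. This is a sketch, assuming the rename adds the rrb to the register's offset within the rotating region, modulo the region size; the class RotatingRegisters and its methods are hypothetical and do not describe any particular implementation.

```python
class RotatingRegisters:
    """Toy model of one rotating region, e.g. FR32-FR127 (base=32, size=96)."""

    def __init__(self, base: int, size: int):
        self.base = base
        self.size = size
        self.rrb = 0                 # rotating register base, starts cleared
        self.physical = [0] * size   # backing storage that never moves

    def _index(self, reg: int) -> int:
        if not self.base <= reg < self.base + self.size:
            raise IndexError("register is outside the rotating region")
        # Rename: offset within the region plus rrb, modulo the region size.
        return (reg - self.base + self.rrb) % self.size

    def read(self, reg: int):
        return self.physical[self._index(reg)]

    def write(self, reg: int, value) -> None:
        self.physical[self._index(reg)] = value

    def rotate(self) -> None:
        # One rotation decrements the rrb modulo the region size.
        self.rrb = (self.rrb - 1) % self.size

    def clear_rrb(self) -> None:
        self.rrb = 0
```

With this model, a value written to register 32 is read back from register 33 after one rotation, and a value in the highest register of the region wraps to the lowest, without any data being copied.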

There are two categories of software-pipelined loop branch types: counted and while. Both categories have two forms: top and exit. The “top” variant is used when the loop decision is located at the bottom of the loop body. A taken branch will continue the loop while a not-taken branch will exit the loop. The “exit” variant is used when the loop decision is located somewhere other than the bottom of the loop. A not-taken branch will continue the loop and a taken branch will exit the loop. The “exit” variant is also used at intermediate points in an unrolled pipelined loop.

The branch condition of a counted loop branch is determined by the specific counted loop type (ctop or cexit), the value of the loop count application register (LC), and the value of the epilog count application register (EC). Note that the counted loop branches do not use a qualifying predicate. LC is initialized to one less than the number of iterations for the counted loop and EC is initialized to the number of stages into which the loop body has been partitioned. While LC is greater than zero, the branch direction will continue the loop, LC will be decremented, registers will be rotated (rrb’s are decremented), and PR 16 will be set to 1 after rotation. (For each of the loop-type branches, PR 63 is written by the branch, and after rotation this value will be in PR 16.)

Execution of a counted loop branch with LC equal to zero signals the start of the epilog. While in the epilog and while EC is greater than one, the branch direction will continue the loop, EC will be decremented, registers will be rotated, and PR 16 will be set to 0 after rotation. Execution of a counted loop branch with LC equal to zero and EC equal to one signals the end of the loop; the branch direction will exit the loop, EC will be decremented, registers will be rotated, and PR 16 will be set to 0 after rotation. A counted loop type branch executed with both LC and EC equal to zero will have a branch direction to exit the loop. LC, EC, and the rrb’s will not be modified (no rotation) and PR 63 will be set to 0. LC and EC equal to zero can occur in some types of optimized, unrolled software-pipelined loops if the target of a cexit branch is set to the next sequential bundle and the loop trip count is not evenly divisible by the unroll amount.
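The counted loop rules can be collected into a small state machine. The sketch below models only the branch decision and the LC/EC/PR 63 updates; counted_loop_branch and trace_counted_loop are hypothetical names, and "continue" means taken for ctop and not-taken for cexit.

```python
def counted_loop_branch(lc: int, ec: int) -> dict:
    """Model one br.ctop/br.cexit execution using the rules described above."""
    if lc > 0:    # kernel or prolog: another iteration is started
        return {"continue": True, "rotate": True, "pr63": 1, "lc": lc - 1, "ec": ec}
    if ec > 1:    # epilog: drain the pipeline
        return {"continue": True, "rotate": True, "pr63": 0, "lc": lc, "ec": ec - 1}
    if ec == 1:   # final epilog stage: exit the loop
        return {"continue": False, "rotate": True, "pr63": 0, "lc": lc, "ec": 0}
    # LC == 0 and EC == 0: exit, nothing is modified and no rotation occurs
    return {"continue": False, "rotate": False, "pr63": 0, "lc": lc, "ec": ec}


def trace_counted_loop(iterations: int, stages: int) -> list:
    """Return (continue, pr63) for each branch execution of a pipelined loop."""
    if iterations < 1 or stages < 1:
        raise ValueError("need at least one iteration and one stage")
    lc, ec = iterations - 1, stages  # initialization described in the text
    trace = []
    while True:
        step = counted_loop_branch(lc, ec)
        lc, ec = step["lc"], step["ec"]
        trace.append((step["continue"], step["pr63"]))
        if not step["continue"]:
            return trace
```

For a loop of 3 iterations split into 2 stages, the model executes the branch 3 + 2 - 1 = 4 times: two passes that start new iterations, one epilog pass, and the exit.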

Table 4-25. Instructions that Modify RRBs

Mnemonic                                 Operation
clrrrb                                   Clears all rrb’s
clrrrb.pr                                Clears rrb.pr
br.call                                  Clears all rrb’s
br.ret                                   Restores CFM.rrb’s from PFM.rrb’s
br.ctop, br.cexit, br.wtop, and br.wexit Decrements all rrb’s

The direction of a while loop branch is determined by the specific while loop type (wtop or wexit), the value of the qualifying predicate, and the value of EC. The while loop branches do not use LC. While the qualifying predicate is one, the branch direction will continue the loop, registers will be rotated, and PR 16 will be set to 0 after rotation. While the qualifying predicate is zero and EC is greater than one, the branch direction will continue the loop, EC will be decremented, registers will be rotated, and PR 16 will be set to 0 after rotation. The qualifying predicate is one during the kernel and zero during the epilog. During the prolog, the qualifying predicate may be zero or one depending upon the scheme used to program the pipelined while loop. Execution of a while loop branch with qualifying predicate equal to zero and EC equal to one signals the end of the loop; the branch direction will exit the loop, EC will be decremented, registers will be rotated, and PR 16 will be set to 0 after rotation. A while loop branch executed with a zero qualifying predicate and with EC equal to zero has a branch direction to exit the loop. EC and the rrb’s will not be modified (no rotation) and PR 63 will be set to 0.
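The while loop rules differ from the counted loop rules mainly in that the qualifying predicate replaces LC and PR 63 is always written with 0. A minimal sketch, with the hypothetical names while_loop_branch and epilog_passes:

```python
def while_loop_branch(qp: int, ec: int):
    """Model one br.wtop/br.wexit execution.

    Returns (continue_loop, rotate, new_ec). PR 63 is written with 0 in
    every case, so it is not part of the result.
    """
    if qp == 1:
        return True, True, ec       # predicate true: keep looping, EC untouched
    if ec > 1:
        return True, True, ec - 1   # epilog: drain the pipeline
    if ec == 1:
        return False, True, 0       # last epilog stage: exit the loop
    return False, False, ec         # qp == 0 and EC == 0: exit, no rotation


def epilog_passes(ec: int) -> int:
    """Count loop-continuing branch executions once the predicate is zero."""
    passes = 0
    while True:
        continue_loop, _, ec = while_loop_branch(0, ec)
        if not continue_loop:
            return passes
        passes += 1
```

In this model, initializing EC to one more than the number of epilog stages yields exactly that many draining passes after the qualifying predicate becomes zero.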

For while loops, the initialization of EC depends upon the scheme used to program the pipelined while loop. Often, the first valid condition for the while loop branch is not computed until several stages into the prolog. Therefore, software pipelines for while loops often have several speculative prolog stages. During these stages, the qualifying predicate can be set to zero or one depending upon the scheme used to program the loop. If the qualifying predicate is one throughout the prolog, EC will be decremented only during the epilog phase and is initialized to one more than the number of epilog stages. If the qualifying predicate is zero during the speculative stages of the prolog, EC will be decremented during this part of the prolog, and the initialization value for EC is increased accordingly.

4.5.2 Branch Prediction Hints

Information about branch behavior can be provided to the processor to improve branch prediction. This information can be encoded with branch hints as part of a branch instruction (referred to as hints). Hints do not affect the functional behavior of the program and may be ignored by the processor.

Branch instructions can provide three types of hints:

• Whether prediction strategy: This describes (for COND, CALL and RET type branches) how the processor should predict the branch condition. (For the loop type branches, prediction is based on LC and EC.) The suggested strategies that can be hinted are shown in Table 4-26.

• Sequential prefetch: This indicates how much code the processor should prefetch at the branch target (shown in Table 4-27).

• Predictor deallocation: This provides re-use information to allow the hardware to better manage branch prediction resources. Normally, prediction resources keep track of the most-recently executed branches. However, sometimes the most-recently executed branch is not useful to remember, either because it will not be re-visited any time soon or because a hint instruction will re-supply the information prior to re-visiting the branch. In such cases, this hint can be used to free up the prediction resources.

Table 4-26. Whether Prediction Hint on Branches

Completer   Strategy            Operation
spnt        Static Not-Taken    Ignore this branch, do not allocate prediction resources for this branch.
sptk        Static Taken        Always predict taken, do not allocate prediction resources for this branch.
dpnt        Dynamic Not-Taken   Use dynamic prediction hardware. If no dynamic history information exists for this branch, predict not-taken.
dptk        Dynamic Taken       Use dynamic prediction hardware. If no dynamic history information exists for this branch, predict taken.

Table 4-27. Sequential Prefetch Hint on Branches

Completer   Sequential Prefetch Hint   Operation
few         Prefetch few lines         When prefetching code at the branch target, stop prefetching after a few (implementation-dependent number of) lines.
many        Prefetch many lines        When prefetching code at the branch target, prefetch more lines (also an implementation-dependent number).

Table 4-28. Predictor Deallocation Hint

Completer   Operation
none        Don’t deallocate
clr         Deallocate branch information



4.6 Multimedia Instructions

Multimedia instructions (see Table 4-29) treat the general registers as concatenations of eight 8-bit, four 16-bit, or two 32-bit elements. They operate on each element independently and in parallel. The elements are always aligned on their natural boundaries within a general register. Most multimedia instructions are defined to operate on multiple element sizes. Three classes of multimedia instructions are defined: arithmetic, shift and data arrangement.

4.6.1 Parallel Arithmetic

There are three forms of parallel addition and subtraction: modulo (padd, psub), signed saturation (padd.sss, psub.sss), and unsigned saturation (padd.uuu, padd.uus, psub.uuu, psub.uus). The modulo forms have the result wrap around the largest or smallest representable value in the range of the result element. In the saturating forms, results larger than the largest representable value of the range of the result element, or smaller than the smallest representable value of the range, are clamped to the largest or smallest value in the range of the result element respectively. The signed saturation form treats both sources as signed and clamps the result to the limits of a signed range. The unsigned saturation form treats the first source as unsigned and clamps the result to the limits of an unsigned range. Two variants are defined that treat the second source as either signed (.uus) or unsigned (.uuu).
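The difference between the modulo and saturating forms is easiest to see on concrete values. The following sketch models parallel addition on a 64-bit register value; the function padd, its form argument, and the helpers are hypothetical, and the restriction of the saturating forms to 1-byte and 2-byte elements is taken from Table 4-29.

```python
def to_signed(value: int, bits: int) -> int:
    """Reinterpret an unsigned element as a two's-complement number."""
    return value - (1 << bits) if value >> (bits - 1) else value


def clamp(value: int, low: int, high: int) -> int:
    return max(low, min(high, value))


def padd(a: int, b: int, size_bytes: int, form: str = "modulo") -> int:
    """Model parallel addition of two 64-bit values, element by element."""
    if size_bytes not in (1, 2, 4):
        raise ValueError("element size must be 1, 2, or 4 bytes")
    if form != "modulo" and size_bytes == 4:
        raise ValueError("saturating forms are modeled for 1- and 2-byte elements only")
    bits = 8 * size_bytes
    mask = (1 << bits) - 1
    result = 0
    for i in range(8 // size_bytes):
        x = (a >> (bits * i)) & mask
        y = (b >> (bits * i)) & mask
        if form == "modulo":
            r = x + y  # wraps when masked below
        elif form == "sss":  # both sources signed, signed result range
            r = clamp(to_signed(x, bits) + to_signed(y, bits),
                      -(1 << (bits - 1)), (1 << (bits - 1)) - 1)
        elif form == "uuu":  # both sources unsigned, unsigned result range
            r = clamp(x + y, 0, mask)
        elif form == "uus":  # second source signed, unsigned result range
            r = clamp(x + to_signed(y, bits), 0, mask)
        else:
            raise ValueError("unknown form")
        result |= (r & mask) << (bits * i)  # no carry crosses element boundaries
    return result
```

For example, adding 0x01 to a byte holding 0xFF gives 0x00 in the modulo form but stays at 0xFF in the .uuu form, and neighboring elements are unaffected in both cases.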

The parallel average instruction (pavg, pavg.raz) adds corresponding elements from each source and right shifts each result by one bit. In the simple form of the instruction, the carry out of the most-significant bit of each sum is written into the most significant bit of the result element. In the round-away-from-zero form, a 1 is added to each sum before shifting. The parallel average subtract instruction (pavgsub) performs a similar operation on the difference of the sources.

The parallel shift left and add instruction (pshladd) performs a left shift on the elements of the first source and then adds them to the corresponding elements from the second source. Signed saturation is performed on both the shift and the add operations. The parallel shift right and add instruction (pshradd) is similar to pshladd. Both of these instructions are defined for 2-byte elements only.

The parallel compare instruction (pcmp) compares the corresponding elements of both sources and writes all ones (if true) or all zeroes (if false) into the corresponding elements of the target according to one of two relations (== or >).

The parallel multiply right instruction (pmpy.r) multiplies the corresponding two even-numbered signed 2-byte elements of both sources and writes the results into two 4-byte elements in the target. The pmpy.l instruction performs a similar operation on odd-numbered 2-byte elements. The parallel multiply and shift right instruction (pmpyshr, pmpyshr.u) multiplies the corresponding 2-byte elements of both sources producing four 4-byte results. The 4-byte results are shifted right by 0, 7, 15, or 16 bits as specified by the instruction. The least-significant 2 bytes of the 4-byte shifted results are then stored in the target register.

The parallel sum of absolute difference instruction (psad) accumulates the absolute difference of corresponding 1-byte elements and writes the result in the target.
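As an illustration of the accumulation, the sketch below computes the sum of absolute differences over the eight 1-byte elements of two 64-bit values. It assumes the elements are treated as unsigned; the name psad1 is used here only as a label for the model.

```python
def psad1(a: int, b: int) -> int:
    """Sum the absolute differences of the eight 1-byte elements of a and b."""
    total = 0
    for i in range(8):
        x = (a >> (8 * i)) & 0xFF  # assumption: elements are unsigned bytes
        y = (b >> (8 * i)) & 0xFF
        total += abs(x - y)
    return total
```

Under this assumption the largest possible result is 8 × 255 = 2040, so the sum always fits comfortably in the target register.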

The parallel minimum (pmin.u, pmin) and the parallel maximum (pmax.u, pmax) instructions deliver the minimum or maximum, respectively, of the corresponding 1-byte or 2-byte elements in the target. The 1-byte elements are treated as unsigned values and the 2-byte elements are treated as signed values.

Table 4-29. Parallel Arithmetic Instructions

Mnemonic             Operation                                               1-byte   2-byte   4-byte
padd                 Parallel modulo addition                                x        x        x
padd.sss             Parallel addition with signed saturation                x        x
padd.uuu, padd.uus   Parallel addition with unsigned saturation              x        x
psub                 Parallel modulo subtraction                             x        x        x
psub.sss             Parallel subtraction with signed saturation             x        x
psub.uuu, psub.uus   Parallel subtraction with unsigned saturation           x        x
pavg                 Parallel arithmetic average                             x        x
pavg.raz             Parallel arithmetic average with round away from zero   x        x
pavgsub              Parallel average of a difference                        x        x
pshladd              Parallel shift left and add with signed saturation               x
pshradd              Parallel shift right and add with signed saturation              x



4.6.2 Parallel Shifts

The parallel shift left instruction (pshl) individually shifts each element of the first source by a count contained in either a general register or an immediate. The parallel shift right instruction (pshr) performs an individual arithmetic right shift of each element of one source by a count contained in either a general register or an immediate. The pshr.u instruction performs an unsigned right shift. Table 4-30 summarizes the types of parallel shift instructions.

4.6.3 Data Arrangement

The mix right instruction (mix.r) interleaves the even-numbered elements from both sources into the target. The mix left instruction (mix.l) interleaves the odd-numbered elements. The unpack low instruction (unpack.l) interleaves the elements in the least-significant 4 bytes of each source into the target register. The unpack high instruction (unpack.h) interleaves elements from the most significant 4 bytes. The pack instructions (pack.sss, pack.uss) convert from 32-bit or 16-bit elements to 16-bit or 8-bit elements respectively. The least-significant half of larger elements in both sources are extracted and written into smaller elements in the target register. The pack.sss instruction treats the extracted elements as signed values and performs signed saturation on them. The pack.uss instruction performs unsigned saturation. The mux instruction (mux) copies individual 2-byte or 1-byte elements in the source to arbitrary positions in the target according to a specified function. For 2-byte elements, an 8-bit immediate allows all possible permutations to be specified. For 1-byte elements the copy function is selected from one of five possibilities (reverse, mix, shuffle, alternate, broadcast). Table 4-31 describes the various types of parallel data arrangement instructions.

Table 4-29. Parallel Arithmetic Instructions (Continued)

Mnemonic    Operation                                  1-byte   2-byte   4-byte
pcmp        Parallel compare                           x        x        x
pmpy.l      Parallel signed multiply of odd elements            x
pmpy.r      Parallel signed multiply of even elements           x
pmpyshr     Parallel signed multiply and shift right            x
pmpyshr.u   Parallel unsigned multiply and shift right          x
psad        Parallel sum of absolute difference        x
pmin        Parallel minimum                           x        x
pmax        Parallel maximum                           x        x

Table 4-30. Parallel Shift Instructions

Mnemonic   Operation                       1-byte   2-byte   4-byte
pshl       Parallel shift left                      x        x
pshr       Parallel signed shift right              x        x
pshr.u     Parallel unsigned shift right            x        x

Table 4-31. Parallel Data Arrangement Instructions

Mnemonic   Operation                                                          1-byte   2-byte   4-byte
mix.l      Interleave odd elements from both sources                          x        x        x
mix.r      Interleave even elements from both sources                         x        x        x
mux        Arbitrary copy of individual source elements                       x        x
pack.sss   Convert from larger to smaller elements with signed saturation              x        x
pack.uss   Convert from larger to smaller elements with unsigned saturation            x
unpack.l   Interleave least-significant elements from both sources            x        x        x
unpack.h   Interleave most significant elements from both sources             x        x        x

4.7 Register File Transfers

Table 4-32 shows the instructions defined to move values between the general register file and the floating-point, branch, predicate, performance monitor, processor identification, and application register files. Several of the transfer instructions share the same mnemonic (mov). The value of the operand identifies which register file is accessed.

Memory access instructions only target or source the general and floating-point register files. It is necessary to use the general register file as an intermediary for transfers between memory and all other register files except the floating-point register file.

Two classes of move are defined between the general registers and the floating-point registers. The first type moves the significand or the sign/exponent (getf.sig, setf.sig, getf.exp, setf.exp). The second type moves entire single or double precision numbers (getf.s, setf.s, getf.d, setf.d). These instructions also perform a conversion between the deferred exception token formats.

Instructions are provided to transfer between the branch registers and the general registers.

Instructions are defined to transfer between the predicate register file and a general register. These instructions operate in a “broadside” manner whereby multiple predicate registers are transferred in parallel (predicate register N is transferred to and from bit N of a general register). The move to predicate instruction (mov pr=) transfers a general register to multiple predicate registers according to a mask specified by an immediate. The mask contains one bit for each of the static predicate registers (PR 1 through PR 15; PR 0 is hardwired to 1) and one bit for all of the rotating predicates (PR 16 through PR 63). A predicate register is written from the corresponding bit in a general register if the corresponding mask bit is set. If the mask bit is clear then the predicate register is not modified. The rotating predicates are transferred as if CFM.rrb.pr were zero. The actual value in CFM.rrb.pr is ignored and remains unchanged. The move from predicate instruction (mov =pr) transfers the entire predicate register file into a general register target.
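The broadside transfer can be modeled with plain bit operations. In the sketch below the predicate file is held as a 64-bit integer with PR N in bit N; the assumption that mask bit 16 is the single bit controlling all rotating predicates, and the function names mov_to_pr and mov_from_pr, are illustrative only. Rotation is not modeled, which matches the statement that the transfer behaves as if CFM.rrb.pr were zero.

```python
MASK64 = (1 << 64) - 1
ROTATING_BITS = MASK64 & ~0xFFFF  # bits 16..63, i.e. PR 16 through PR 63


def mov_to_pr(pr: int, gr: int, mask: int) -> int:
    """Model mov pr=: copy selected bits of gr into the predicate file."""
    new_pr = pr
    for n in range(1, 16):  # one mask bit per static predicate PR 1..PR 15
        if (mask >> n) & 1:
            new_pr = (new_pr & ~(1 << n)) | (gr & (1 << n))
    if (mask >> 16) & 1:    # assumption: bit 16 selects all rotating predicates
        new_pr = (new_pr & ~ROTATING_BITS) | (gr & ROTATING_BITS)
    return (new_pr | 1) & MASK64  # PR 0 always reads as 1


def mov_from_pr(pr: int) -> int:
    """Model mov =pr: the whole predicate file lands in a general register."""
    return (pr | 1) & MASK64
```

Predicates whose mask bit is clear keep their previous value, and no mask bit can change PR 0.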

The mov =pmd[] instruction is defined to move from a performance monitor data (PMD) register to a general register. If the operating system has not enabled reading of performance monitor data registers in user level then all zeroes are returned. The mov =cpuid[] instruction is defined to move from a processor identification register to a general register.

The mov =ip instruction is provided for copying the current value of the instruction pointer (IP) into a general register.

4.8 Character Strings and Population Count

A small set of special instructions accelerate operations on character and bit-field data.

4.8.1 Character Strings

The compute zero index instructions (czx.l, czx.r) treat the general register source as either eight 1-byte or four 2-byte elements and write the general register target with the index of the first zero element found. If there are no zero elements in the source, the target is written with a constant one higher than the largest possible index (8 for the 1-byte form, 4 for the 2-byte form). The czx.l instruction scans the source from left to right with the left-most element having an index of zero. The czx.r instruction scans from right to left with the right-most element having an index of zero. Table 4-33 summarizes the compute zero index instructions.
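The scan can be sketched as follows, assuming that the "left-most" element is the most significant one in the 64-bit source. The function czx and its direction argument are hypothetical names for this model.

```python
def czx(value: int, size_bytes: int, direction: str) -> int:
    """Return the index of the first zero element, or the element count if none."""
    if size_bytes not in (1, 2):
        raise ValueError("element size must be 1 or 2 bytes")
    bits = 8 * size_bytes
    count = 8 // size_bytes
    mask = (1 << bits) - 1
    # Elements listed right-most (least significant) first.
    elements = [(value >> (bits * i)) & mask for i in range(count)]
    if direction == "l":
        elements.reverse()  # left-to-right scan starts at the most significant
    elif direction != "r":
        raise ValueError("direction must be 'l' or 'r'")
    for index, element in enumerate(elements):
        if element == 0:
            return index
    return count  # no zero element: one higher than the largest index
```

This is the kind of primitive a string-length routine could use: load eight bytes at a time and stop at the first word for which the result is less than 8.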

Table 4-32. Register File Transfer Instructions

Mnemonic               Operation
getf.exp, getf.sig     Move FR exponent or significand to GR
getf.s, getf.d         Move single/double precision memory format from FR to GR
setf.s, setf.d         Move single/double precision memory format from GR to FR
setf.exp, setf.sig     Move from GR to FR exponent or significand
mov =br                Move from BR to GR
mov br=                Move from GR to BR
mov =pr                Move from predicates to GR
mov pr=, mov pr.rot=   Move from GR to predicates
mov ar=                Move from GR to AR
mov =ar                Move from AR to GR
sum, rum               Set and reset user mask
mov =pmd[...]          Move from performance monitor data register to GR
mov =cpuid[...]        Move from processor identification register to GR
mov =ip                Move from Instruction Pointer



4.8.2 Population Count

The population count instruction (popcnt) writes the number of bits which have a value of 1 in the source register into the target register.
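For reference, the same result can be computed in software. This is a minimal sketch on a 64-bit value; the Python function name popcnt simply mirrors the mnemonic.

```python
MASK64 = (1 << 64) - 1


def popcnt(value: int) -> int:
    """Count the bits set to 1 in the low 64 bits of value."""
    return bin(value & MASK64).count("1")
```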

Table 4-33. String Support Instructions

Mnemonic   Operation                                  1-byte   2-byte
czx.l      Locate first zero element, left to right   x        x
czx.r      Locate first zero element, right to left   x        x



5 IA-64 Floating-point Programming Model

The IA-64 floating-point architecture is fully compliant with the ANSI/IEEE Standard for Binary Floating-Point Arithmetic (Std. 754-1985). There is full IEEE support for single, double, and double-extended real formats. The two IEEE methods for controlling rounding precision are supported. The first method converts results to the double-extended exponent range. The second method converts results to the destination precision. Some IEEE extensions such as fused multiply and add, minimum and maximum operations, and a register file format with a larger range than the minimum double-extended format are also included.

5.1 Data Types and Formats

Six data types are supported directly: single, double, double-extended real (IEEE real types); 64-bit signed integer, 64-bit unsigned integer, and the 82-bit floating-point register format. A “Parallel FP” format where a pair of IEEE single precision values occupy a floating-point register’s significand is also supported. A seventh data type, IEEE-style quad-precision, is supported by software routines. A future architecture extension may include additional support for the quad-precision real type.

5.1.1 Real Types

The parameters for the supported IEEE real types are summarized in Table 5-1.
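Several entries of Table 5-1 are related to one another by the usual IEEE conventions, which the following sketch checks. The field widths are copied from the table; the relationships (bias = 2^(w-1) - 1 for an exponent field of w bits, Emax = bias, Emin = 1 - bias, and one implicit integer bit for every format except double-extended) are standard IEEE 754 conventions rather than statements from this section, and the names FORMATS and derive are hypothetical.

```python
# name: (exponent field bits, significand field bits, explicit integer bit?)
FORMATS = {
    "single": (8, 23, False),
    "double": (11, 52, False),
    "double-extended": (15, 64, True),
    "quad-precision": (15, 112, False),
}


def derive(name: str) -> dict:
    """Derive the real-type parameters from the memory-format field widths."""
    exponent_bits, significand_bits, explicit_integer_bit = FORMATS[name]
    bias = (1 << (exponent_bits - 1)) - 1
    return {
        "bias": bias,
        "emax": bias,
        "emin": 1 - bias,
        # Formats without an explicit integer bit gain one implicit bit.
        "precision": significand_bits + (0 if explicit_integer_bit else 1),
        "total_width": 1 + exponent_bits + significand_bits,
    }
```

The same bias formula applied to a 17-bit exponent field gives 65535, which is the bias of the floating-point register format.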

Table 5-1. IEEE Real-Type Properties

                                   Single    Double    Double-Extended   Quad-Precision
IEEE Real-Type Parameters
Sign                               + or −    + or −    + or −            + or −
Emax                               +127      +1023     +16383            +16383
Emin                               −126      −1022     −16382            −16382
Exponent bias                      +127      +1023     +16383            +16383
Precision (bits)                   24        53        64                113
IEEE Memory Formats
Total memory format width (bits)   32        64        80                128
Sign field width (bits)            1         1         1                 1
Exponent field width (bits)        8         11        15                15
Significand field width (bits)     23        52        64                112

5.1.2 Floating-point Register Format

Data contained in the floating-point registers can be either integer or real type. The format of data in the floating-point registers is designed to accommodate both of these types with no loss of information.

Real numbers reside in 82-bit floating-point registers in a three-field binary format (see Figure 5-1). The three fields are:

• The 64-bit significand field, b63 . b62 b61 ... b1 b0, contains the number's significant digits. This field is composed of an explicit integer bit (significand{63}), and 63 bits of fraction (significand{62:0}). For Parallel FP data, the significand field holds a pair of 32-bit IEEE single real numbers.

• The 17-bit exponent field locates the binary point within or beyond the significant digits (i.e., it determines the number's magnitude). The exponent field is biased by 65535 (0xFFFF). An exponent field of all ones is used to encode the special values for IEEE signed infinity and NaNs. An exponent field of all zeros and a significand field of all zeros is used to encode the special values for IEEE signed zeros. An exponent field of all zeros and a non-zero significand field encodes the double-extended real denormals and double-extended real pseudo-denormals.

• The 1-bit sign field indicates whether the number is positive (sign=0) or negative (sign=1). For Parallel FP data, this bit is always 0.

The value of a finite floating-point number encoded with a non-zero exponent field can be calculated using the expression:

(-1)^sign * 2^(exponent - 65535) * (significand{63}.significand{62:0})_2

The value of a finite floating-point number encoded with a zero exponent field can be calculated using the expression:

(-1)^sign * 2^(-16382) * (significand{63}.significand{62:0})_2

Integers (64-bit signed/unsigned) and Parallel FP numbers reside in the 64-bit significand field. In their canonical form, the exponent field is set to 0x1003E (biased 63) and the sign field is set to 0.
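The two value expressions for finite encodings can be illustrated with a short C sketch. It is only an illustration under stated assumptions: the function names fr_value and scale_pow2 are hypothetical, the three register fields are passed as separate arguments, and the result is returned as a C double, so encodings whose value does not fit in a double lose precision, overflow, or underflow.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply x by 2^e using repeated doubling/halving (avoids a libm dependency). */
static double scale_pow2(double x, int e)
{
    for (; e > 0; e--) x *= 2.0;
    for (; e < 0; e++) x *= 0.5;
    return x;
}

/* Value of a finite register encoding (hypothetical helper, not an architected
 * operation). sign is 1 bit, exponent is the 17-bit biased field, and bit 63 of
 * significand is the explicit integer bit. */
static double fr_value(unsigned sign, uint32_t exponent, uint64_t significand)
{
    int e;
    double m, v;

    assert(sign <= 1 && exponent < 0x1FFFF);   /* finite encodings only */

    /* A zero exponent field uses the fixed exponent -16382. */
    e = (exponent == 0) ? -16382 : (int)exponent - 65535;

    /* significand{63}.significand{62:0} is the 64-bit integer scaled by 2^-63. */
    m = scale_pow2((double)significand, -63);
    v = scale_pow2(m, e);
    return sign ? -v : v;
}
```

For example, fr_value(0, 0x0FFFF, 0x8000000000000000) yields 1.0 (the FR 1 encoding), and the canonical integer encoding fr_value(0, 0x1003E, 42) yields 42.0, because the biased exponent 0x1003E scales the significand by 2^63.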

5.1.3 Representation of Values in Floating-point Registers

The floating-point register encodings are grouped into classes and subclasses and listed below in Table 5-2 (shaded encodings are unsupported). The last two table entries contain the values of the constant floating-point registers, FR 0 and FR 1. The constant value in FR 1 does not change for the parallel single precision instructions or for the integer multiply accumulate instruction and would not generally be useful.

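 81   80          64 63                                          0
+----+--------------+---------------------------------------------+
|sign|   exponent   |   significand (with explicit integer bit)   |
+----+--------------+---------------------------------------------+
  1         17                            64

Figure 5-1. Floating-point Register Format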

Table 5-2. Floating-point Register Encodings

Class or Subclass                               Sign     Biased Exponent           Significand i.bb...bb
                                                (1 bit)  (17 bits)                 (64 bits; explicit integer bit is shown)

NaNs                                            0/1      0x1FFFF                   1.000...01 through 1.111...11
  Quiet NaNs                                    0/1      0x1FFFF                   1.100...00 through 1.111...11
  Quiet NaN Indefinite (a)                      1        0x1FFFF                   1.100...00
  Signaling NaNs                                0/1      0x1FFFF                   1.000...01 through 1.011...11
Infinity                                        0/1      0x1FFFF                   1.000...00
Pseudo-NaNs [unsupported]                       0/1      0x1FFFF                   0.000...01 through 0.111...11
Pseudo-Infinity [unsupported]                   0/1      0x1FFFF                   0.000...00
Normalized Numbers (Floating-point Register     0/1      0x00001 through 0x1FFFE   1.000...00 through 1.111...11
Format Normals)
  Integers or Parallel FP (large unsigned or    0        0x1003E                   1.000...00 through 1.111...11
  negative signed integers)
  Integer Indefinite (b)                        0        0x1003E                   1.000...00
  IEEE Single Real Normals                      0/1      0x0FF81 through 0x1007E   1.000...00...(40)0s through 1.111...11...(40)0s
  IEEE Double Real Normals                      0/1      0x0FC01 through 0x103FE   1.000...00...(11)0s through 1.111...11...(11)0s
  IEEE Double-Extended Real Normals             0/1      0x0C001 through 0x13FFE   1.000...00 through 1.111...11
  Normal numbers with the same value as         0/1      0x0C001                   1.000...00 through 1.111...11
  Double-Extended Real Pseudo-Denormals
  IA-32 Stack Single Real Normals (produced     0/1      0x0C001 through 0x13FFE   1.000...00...(40)0s through 1.111...11...(40)0s
  when the computation model is IA-32 Stack
  Single)
  IA-32 Stack Double Real Normals (produced     0/1      0x0C001 through 0x13FFE   1.000...00...(11)0s through 1.111...11...(11)0s
  when the computation model is IA-32 Stack
  Double)
Unnormalized Numbers (Floating-point Register   0/1      0x00000                   0.000...01 through 1.111...11
Format unnormalized numbers)                             0x00001 through 0x1FFFE   0.000...01 through 0.111...11
                                                         0x00001 through 0x1FFFD   0.000...00
                                                1        0x1FFFE                   0.000...00
  Integers or Parallel FP (positive             0        0x1003E                   0.000...00 through 0.111...11
  signed/unsigned integers)
  Single Real Denormals                         0/1      0x0FF81                   0.000...01...(40)0s through 0.111...11...(40)0s
  Double Real Denormals                         0/1      0x0FC01                   0.000...01...(11)0s through 0.111...11...(11)0s
  Register Format Denormals                     0/1      0x00001                   0.000...01 through 0.111...11
  Double-Extended Real Denormals                0/1      0x00000                   0.000...01 through 0.111...11
  Unnormal numbers with the same value as       0/1      0x0C001                   0.000...01 through 0.111...11
  Double-Extended Real Denormals
  Double-Extended Real Pseudo-Denormals         0/1      0x00000                   1.000...00 through 1.111...11
  (IA-32 stack and memory format)
  IA-32 Stack Single Real Denormals (produced   0/1      0x00000                   0.000...01...(40)0s through 0.111...11...(40)0s
  when computation model is IA-32 Stack
  Single)
  IA-32 Stack Double Real Denormals (produced   0/1      0x00000                   0.000...01...(11)0s through 0.111...11...(11)0s
  when computation model is IA-32 Stack
  Double)
Pseudo-Zeros                                    0/1      0x00001 through 0x1FFFD   0.000...00
                                                1        0x1FFFE                   0.000...00
NaTVal (c)                                      0        0x1FFFE                   0.000...00
Zero                                            0/1      0x00000                   0.000...00
FR 0 (positive zero)                            0        0x00000                   0.000...00
FR 1 (positive one)                             0        0x0FFFF                   1.000...00

a. Default response on a masked real invalid operation.
b. Default response on a masked integer invalid operation.
c. Created by unsuccessful speculative memory operation.

All register file encodings are allowed as inputs to arithmetic operations. The result of an arithmetic operation is always the most normalized register file representation of the computed value, with the exponent range limited from Emin to Emax of the destination type, and the significand precision limited to the number of precision bits of the destination type.



Computed values, such as zeros, infinities, and NaNs, that are outside these bounds are represented by the corresponding unique register file encoding. Double-extended real denormal results are mapped to the register file exponent of 0x00000 (instead of 0x0C001). Unsupported encodings (Pseudo-NaNs and Pseudo-Infinities), Pseudo-zeros, and Double-extended Real Pseudo-denormals are never produced as a result of an arithmetic operation.

Arithmetic on pseudo-zeros operates exactly as on an equivalently signed zero, with one exception: pseudo-zero multiplied by infinity returns the correctly signed infinity instead of an Invalid Operation Floating-Point Exception fault (and QNaN). Also, pseudo-zeros are classified as unnormalized numbers, not zeros.
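The top-level classes of Table 5-2 can be told apart from the three register fields alone. The following C sketch shows one way to do so; the enum and function names are hypothetical, only the top-level classes are distinguished (subclasses such as the IEEE single or double normals are not), and it is not a model of the fclass instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical class labels for the top-level rows of Table 5-2. */
typedef enum {
    FR_NATVAL, FR_QNAN, FR_SNAN, FR_INFINITY, FR_UNSUPPORTED,
    FR_ZERO, FR_NORMALIZED, FR_UNNORMALIZED
} fr_class;

static fr_class fr_classify(unsigned sign, uint32_t exponent, uint64_t significand)
{
    const uint64_t int_bit   = 1ULL << 63;  /* explicit integer bit           */
    const uint64_t quiet_bit = 1ULL << 62;  /* most significant fraction bit  */

    assert(sign <= 1 && exponent <= 0x1FFFF);

    if (exponent == 0x1FFFF) {
        if (!(significand & int_bit))
            return FR_UNSUPPORTED;          /* pseudo-NaN or pseudo-infinity  */
        if (significand == int_bit)
            return FR_INFINITY;
        return (significand & quiet_bit) ? FR_QNAN : FR_SNAN;
    }
    if (sign == 0 && exponent == 0x1FFFE && significand == 0)
        return FR_NATVAL;
    if (exponent == 0 && significand == 0)
        return FR_ZERO;
    if (exponent != 0 && (significand & int_bit))
        return FR_NORMALIZED;               /* 0x00001..0x1FFFE, integer bit set */

    /* Everything else: denormals, pseudo-denormals, pseudo-zeros, and
     * encodings with a non-zero exponent but a clear integer bit. */
    return FR_UNNORMALIZED;
}
```

Note that a canonical positive integer such as (0, 0x1003E, 42) lands in the unnormalized class, and that the encoding (1, 0x1FFFE, 0) is a pseudo-zero while (0, 0x1FFFE, 0) is NaTVal, as the table specifies.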

5.2 Floating-point Status Register

The Floating-Point Status Register (FPSR) contains the dynamic control and status information for floating-point operations. There is one main set of control and status information (FPSR.sf0), and three alternate sets (FPSR.sf1, FPSR.sf2, FPSR.sf3). The FPSR layout is shown in Figure 5-2 and its fields are defined in Table 5-3. Table 5-4 describes the fields of each status field and Figure 5-3 shows their layout.

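 63    58 57      45 44      32 31      19 18       6 5      0
+--------+----------+----------+----------+----------+--------+
|   rv   |   sf3    |   sf2    |   sf1    |   sf0    | traps  |
+--------+----------+----------+----------+----------+--------+
    6         13         13         13         13        6

Figure 5-2. Floating-point Status Register Format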

Table 5-3. Floating-point Status Register Field Description

Field      Bits    Description
traps.vd   0       Invalid Operation Floating-Point Exception fault (IEEE Trap) disabled when this bit is set
traps.dd   1       Denormal/Unnormal Operand Floating-Point Exception fault disabled when this bit is set
traps.zd   2       Zero Divide Floating-Point Exception fault (IEEE Trap) disabled when this bit is set
traps.od   3       Overflow Floating-Point Exception trap (IEEE Trap) disabled when this bit is set
traps.ud   4       Underflow Floating-Point Exception trap (IEEE Trap) disabled when this bit is set
traps.id   5       Inexact Floating-Point Exception trap (IEEE Trap) disabled when this bit is set
sf0        18:6    Main status field
sf1        31:19   Alternate status field 1
sf2        44:32   Alternate status field 2
sf3        57:45   Alternate status field 3
rv         63:58   Reserved

FPSR.sfx

  12   11   10    9    8    7    6    5  4    3  2     1     0
   i    u    o    z    d    v   td     rc      pc     wre   ftz
 |--------- flags (6) --------|--------- controls (7) ----------|

Figure 5-3. Floating-point Status Field Format

Table 5-4. Floating-point Status Register’s Status Field Description

Field   Bits   Description
ftz     0      Flush-to-Zero mode
wre     1      Widest range exponent (see Table 5-6)
pc      3:2    Precision control (see Table 5-6)
rc      5:4    Rounding control (see Table 5-5)
td      6      Traps disabled (a)
v       7      Invalid Operation (IEEE Flag)
d       8      Denormal/Unnormal Operand
z       9      Zero Divide (IEEE Flag)
o       10     Overflow (IEEE Flag)
u       11     Underflow (IEEE Flag)
i       12     Inexact (IEEE Flag)

a. td is a reserved bit in the main status field, FPSR.sf0.

The Denormal/Unnormal Operand status flag is an IEEE-style sticky flag that is set if the value is used in an arithmetic instruction and in an arithmetic calculation; e.g., unorm*NaN doesn’t set the d flag. Canonical single/double/double-extended denormal/double-extended pseudo-denormal/register format denormal encodings are a subset of the floating-point register format unnormalized numbers.

Note that the Floating-Point Exception fault/trap occurs only if an enabled floating-point exception occurs during the processing of the instruction. Hence, setting a flag bit of a status field to 1 in software will not cause an interruption. The status field flags are merely indications of the occurrence of floating-point exceptions.

Flush-to-Zero (FTZ) mode causes results which encounter “tininess” to be truncated to the correctly signed zero. Flush-to-Zero mode can be enabled only if Underflow is disabled. This can be accomplished by disabling all traps (FPSR.sfx.td being set to 1), or by disabling it individually (FPSR.traps.ud set to 1). If Underflow is enabled then it takes priority and Flush-to-Zero mode is ignored. Note that the software exception handler could examine the Flush-to-Zero mode bit and choose to emulate the Flush-to-Zero operation when an enabled Underflow exception arises.

The FPSR.sfx.u and FPSR.sfx.i bits will be set to 1 when a result is flushed to the correctly signed zero because of Flush-to-Zero mode. If enabled, an inexact result exception is signaled.

A floating-point result is rounded based on the instruction’s .pc completer and the status field’s wre, pc, and rc control fields. The result’s significand precision and exponent range are determined as described in Table 5-6 "Floating-point Computation Model Control Definitions". If the result isn’t exact, FPSR.sfx.rc specifies the rounding direction (see Table 5-5).

Table 5-5. Floating-point Rounding Control Definitions

               Nearest (or even)   – Infinity (down)   + Infinity (up)   Zero (truncate/chop)
FPSR.sfx.rc    00                  01                  10                11

Table 5-6. Floating-point Computation Model Control Definitions

Computation Model Control Fields                  Computation Model Selected
Instruction’s       FPSR.sfx’s    FPSR.sfx’s      Significand   Exponent   Computational Style
.pc Completer       Dynamic       Dynamic         Precision     Range
                    pc Field      wre Field

.s                  ignored       0               24 bits       8 bits     IEEE real single
.d                  ignored       0               53 bits       11 bits    IEEE real double
.s                  ignored       1               24 bits       17 bits    Register file range, single precision
.d                  ignored       1               53 bits       17 bits    Register file range, double precision
none                00            0               24 bits       15 bits    IA-32 stack single
none                01            0               N.A.          N.A.       Reserved
none                10            0               53 bits       15 bits    IA-32 stack double
none                11            0               64 bits       15 bits    IA-32 double-extended
none                00            1               24 bits       17 bits    Register file range, single precision
none                01            1               N.A.          N.A.       Reserved
none                10            1               53 bits       17 bits    Register file range, double precision
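The selection rules of Table 5-6 for instructions that accept a .pc completer can be written as a small lookup. The sketch below is illustrative only: the type and function names are hypothetical, and it does not cover instructions that have no .pc completer at all (the "not applicable" rows of the table).

```c
#include <assert.h>

/* Hypothetical encoding of the instruction's .pc completer. */
typedef enum { PC_NONE, PC_S, PC_D } pc_completer;

typedef struct {
    int reserved;          /* 1 if the control combination is reserved */
    int significand_bits;  /* significand precision of the result      */
    int exponent_bits;     /* exponent range of the result             */
} fp_model;

/* pc and wre are the dynamic fields of the selected FPSR status field. */
static fp_model select_model(pc_completer completer, unsigned pc, unsigned wre)
{
    fp_model m = { 0, 0, 0 };

    assert(pc <= 3 && wre <= 1);

    if (completer == PC_S) {             /* .s: dynamic pc field is ignored */
        m.significand_bits = 24;
        m.exponent_bits = wre ? 17 : 8;
        return m;
    }
    if (completer == PC_D) {             /* .d: dynamic pc field is ignored */
        m.significand_bits = 53;
        m.exponent_bits = wre ? 17 : 11;
        return m;
    }

    switch (pc) {                        /* no completer: use FPSR.sfx.pc */
    case 0:  m.significand_bits = 24; break;
    case 2:  m.significand_bits = 53; break;
    case 3:  m.significand_bits = 64; break;
    default: m.reserved = 1; return m;   /* pc == 01 is reserved */
    }
    m.exponent_bits = wre ? 17 : 15;
    return m;
}
```

With no completer, pc = 11 and wre = 0, the function reports 64 significand bits and a 15-bit exponent range, which is the IA-32 double-extended row of the table.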




The trap disable (sfx.td) control bit allows one to easily set up a local IEEE exception trap default environment. If FPSR.sfx.td is clear (enabled), the FPSR.traps bits are used. If FPSR.sfx.td is set, the FPSR.traps bits are treated as if they are all set (disabled). Note that FPSR.sf0.td is a reserved field which returns 0 when read.
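As an illustration of the field layout and of the td rule, the following C sketch extracts a status field from a 64-bit FPSR image and computes the trap-disable bits that effectively apply to it. The helper names are hypothetical; the bit positions are taken from Figure 5-2, Table 5-3, and Table 5-4.

```c
#include <assert.h>
#include <stdint.h>

/* Extract the 13-bit status field sfN (N = 0..3) from an FPSR image.
 * sf0 occupies bits 18:6 and each following field is 13 bits higher. */
static inline unsigned fpsr_status_field(uint64_t fpsr, unsigned n)
{
    assert(n <= 3);
    return (unsigned)((fpsr >> (6 + 13 * n)) & 0x1FFF);
}

/* Accessors for the fields within one status field (Table 5-4). */
static inline unsigned sf_ftz(unsigned sf)   { return sf & 1; }
static inline unsigned sf_wre(unsigned sf)   { return (sf >> 1) & 1; }
static inline unsigned sf_pc(unsigned sf)    { return (sf >> 2) & 3; }
static inline unsigned sf_rc(unsigned sf)    { return (sf >> 4) & 3; }
static inline unsigned sf_td(unsigned sf)    { return (sf >> 6) & 1; }
static inline unsigned sf_flags(unsigned sf) { return (sf >> 7) & 0x3F; }

/* Trap-disable bits (vd, dd, zd, od, ud, id in bits 0..5) that apply to an
 * instruction using status field n: if sfN.td is set, all traps are treated
 * as disabled; otherwise FPSR.traps is used as is. */
static inline unsigned effective_trap_disables(uint64_t fpsr, unsigned n)
{
    unsigned traps = (unsigned)(fpsr & 0x3F);
    return sf_td(fpsr_status_field(fpsr, n)) ? 0x3F : traps;
}
```

A status field with td set therefore behaves as if every bit of FPSR.traps were 1, regardless of the actual contents of FPSR.traps.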

5.3 Floating-point Instructions

This section describes the IA-64 floating-point instructions.

5.3.1 Memory Access Instructions

There are floating-point load and store instructions for the single, double, and double-extended floating-point real data types, and for the Parallel FP or signed/unsigned integer data type. The addressing modes for floating-point load and store instructions are the same as for integer load and store instructions, except for floating-point load pair instructions, which can have an implicit base-register post increment. The memory hint options for floating-point load and store instructions are the same as those for integer load and store instructions. (See “Memory Hierarchy Control and Consistency” on page 4-16.) Table 5-7 lists the types of floating-point load and store instructions. The floating-point load pair instructions require the two target registers to be odd/even or even/odd. The floating-point store instructions (stfs, stfd, stfe) require the value in the floating-point register to have the same type as the store for the format conversion to be correct.

Unsuccessful speculative loads write a NaTVal into the destination register or registers (see Section 4.4.4). Storing a NaTVal to memory will cause a Register NaT Consumption fault, except for the spill instruction (stf.spill).

Saving and restoring floating-point registers is accomplished by the spill and fill instructions (stf.spill, ldf.fill) using a 16-byte memory container. These are the only instructions that can be used for saving and restoring the actual register contents since they do not fault on NaTVal. They save and restore all types (single, double, double-extended, register format and integer or Parallel FP) and will ensure compatibility with possible future architecture extensions.

Figure 5-4, Figure 5-5, Figure 5-6 and Figure 5-7 describe how single precision, double precision, double-extended precision, and spill/fill data is translated during transfers between floating-point registers and memory.
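As a worked example of such a translation, the single-precision load case can be sketched in C. This is a reading of Figure 5-4 together with the single-precision rows of Table 5-2, not a definitive model of ldfs: the struct and function names are hypothetical, and the input is the 32-bit IEEE single memory image.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three floating-point register fields. */
typedef struct {
    unsigned sign;         /* 1 bit                                   */
    uint32_t exponent;     /* 17 bits, biased by 65535                */
    uint64_t significand;  /* bit 63 is the explicit integer bit      */
} fr_fields;

static fr_fields load_single(uint32_t mem)
{
    fr_fields f;
    uint32_t e    = (mem >> 23) & 0xFF;   /* 8-bit memory exponent   */
    uint64_t frac = mem & 0x7FFFFF;       /* 23-bit memory fraction  */

    f.sign = mem >> 31;
    if (e == 0xFF) {                      /* infinities and NaNs     */
        f.exponent = 0x1FFFF;
        f.significand = (1ULL << 63) | (frac << 40);
    } else if (e != 0) {                  /* normal numbers: rebias 127 -> 65535 */
        f.exponent = e + (65535 - 127);
        f.significand = (1ULL << 63) | (frac << 40);
    } else if (frac != 0) {               /* denormal numbers        */
        f.exponent = 0x0FF81;
        f.significand = frac << 40;       /* integer bit stays 0     */
    } else {                              /* zeros                   */
        f.exponent = 0;
        f.significand = 0;
    }
    return f;
}
```

The rebias maps the memory exponent range 1..254 onto 0x0FF81..0x1007E, which matches the IEEE Single Real Normals row of Table 5-2, and the 23 fraction bits are followed by 40 zero bits in the register significand.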

Table 5-6. Floating-point Computation Model Control Definitions (Continued)

Computation Model Control Fields                  Computation Model Selected
Instruction’s       FPSR.sfx’s    FPSR.sfx’s      Significand   Exponent   Computational Style
.pc Completer       Dynamic       Dynamic         Precision     Range
                    pc Field      wre Field

none                11            1               64 bits       17 bits    Register file range, double-extended precision
not applicable (a)  ignored       ignored         24 bits       8 bits     A pair of IEEE real singles
not applicable (b)  ignored       ignored         64 bits       17 bits    Register file range, double-extended precision

a. For parallel FP instructions which have no .pc completer (e.g., fpma).
b. For non-parallel FP instructions which have no .pc completer (e.g., fmerge).

Table 5-7. Floating-point Memory Access Instructions

Operations            Load to FR   Load Pair to FR   Store from FR
Single                ldfs         ldfps             stfs
Integer/Parallel FP   ldf8         ldfp8             stf8
Double                ldfd         ldfpd             stfd
Double-extended       ldfe                           stfe
Spill/fill            ldf.fill                       stf.spill



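Figure 5-4. Memory to Floating-point Register Data Translation – Single Precision

[Figure: four panels, each mapping the single-precision memory format to the FR sign, exponent, integer bit, and significand fields.
 • Single-precision Load – normal numbers: integer bit = 1; low-order significand bits = 0.
 • Single-precision Load – infinities and NaNs: memory exponent all ones; FR exponent = 0x1FFFF; integer bit = 1; low-order significand bits = 0.
 • Single-precision Load – zeros: memory exponent and significand all zeros; FR exponent = 0; integer bit = 0; significand = 0.
 • Single-precision Load – denormal numbers: memory exponent all zeros; FR exponent = 0x0FF81; integer bit = 0; low-order significand bits = 0.]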



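Figure 5-5. Memory to Floating-point Register Data Translation – Double Precision

[Figure: four panels, each mapping the double-precision memory format to the FR sign, exponent, integer bit, and significand fields.
 • Double-precision Load – normal numbers: integer bit = 1; low-order significand bits = 0.
 • Double-precision Load – infinities and NaNs: memory exponent all ones; FR exponent = 0x1FFFF; integer bit = 1; low-order significand bits = 0.
 • Double-precision Load – zeros: memory exponent and significand all zeros; FR exponent = 0; integer bit = 0; significand = 0.
 • Double-precision Load – denormal numbers: memory exponent all zeros; FR exponent = 0x0FC01; integer bit = 0; low-order significand bits = 0.]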



Figure 5-6. Memory to Floating-point Register Data Translation – Double Extended, Integer and Fill

[Figure: five panels, each mapping a memory format to the FR sign, exponent, integer bit, and significand fields.
 • Double-extended-precision Load – normal/unnormal numbers.
 • Double-extended-precision Load – infinities and NaNs: memory exponent all ones; FR exponent = 0x1FFFF.
 • Double-extended-precision Load – denormal/pseudo-denormals and zeros: memory exponent all zeros; FR exponent = 0.
 • Integer Load: FR sign = 0; FR exponent = 0x1003E; significand taken from memory.
 • Register Fill: sign, exponent, and significand (including the integer bit) taken from memory.]



Figure 5-7. Floating-point Register to Memory Data Translation

[Figure: five panels, each mapping the FR sign, exponent, integer bit, and significand fields to a memory format.
 • Single-precision Store
 • Double-precision Store
 • Double-extended-precision Store
 • Register Spill
 • Integer Store]



Both little-endian and big-endian byte ordering are supported on floating-point loads and stores. For both single and double memory formats, the byte ordering is identical to the 32-bit and 64-bit integer data types (see Section 3.2.3). The byte ordering for the spill/fill memory and double-extended formats is shown in Figure 5-8.

5.3.2 Floating-Point Register to/from General Register Transfer Instructions

The setf and getf instructions (see Table 5-8) transfer data between floating-point registers (FR) and general registers (GR). These instructions will translate a general register NaT to/from a floating-point register NaTVal. For all other operands, the .s and .d variants of the setf and getf instructions translate to/from FR as per Figure 5-4, Figure 5-5 and Figure 5-7. The memory representation is read from or written to the GR. The .exp and .sig variants of the setf and getf instructions operate on the sign/exponent and significand portions of a floating-point register, respectively, and their translation formats are described in Table 5-9 and Table 5-10.

Table 5-8. Floating-point Register Transfer Instructions

Operations            GR to FR    FR to GR
Single                setf.s      getf.s
Double                setf.d      getf.d
Sign and Exponent     setf.exp    getf.exp
Significand/Integer   setf.sig    getf.sig

Figure 5-8. Spill/Fill and Double-Extended (80-bit) Floating-point Memory Formats

Floating-Point Register Format (82-bit):

 81             64 63                                    0
+-----------------+---------------------------------------+
| se2 | e1  | e0  | s7 | s6 | s5 | s4 | s3 | s2 | s1 | s0 |
+-----------------+---------------------------------------+
  sign, exponent                significand

Double-Extended (80-bit) interpretation: se1’ e0’ followed by s7 ... s0.

Memory Formats (register byte stored at each byte offset):

Offset   Spill/Fill (128-bit)    Double-Extended (80-bit)
         LE        BE            LE        BE
0        s0        0             s0        se1’
1        s1        0             s1        e0’
2        s2        0             s2        s7
3        s3        0             s3        s6
4        s4        0             s4        s5
5        s5        se2           s5        s4
6        s6        e1            s6        s3
7        s7        e0            s7        s2
8        e0        s7            e0’       s1
9        e1        s6            se1’      s0
10       se2       s5
11       0         s4
12       0         s3
13       0         s2
14       0         s1
15       0         s0

5.3.3 Arithmetic Instructions

All of the arithmetic floating-point instructions (except fcvt.xf, which is always exact) have a .sf specifier. This indicates which of the FPSR’s four status fields will both control and record the status of the execution of the instruction (see Table 5-11). The status field specifies: enabled exceptions, rounding mode, exponent width, precision control, and which status field’s flags to update. See “Floating-point Status Register” on page 5-4.

Most arithmetic floating-point instructions can specify the precision of the result statically by using a .pc completer, or dynamically using the pc field of the FPSR status field (see Table 5-6). Arithmetic instructions that do not have a .pc completer use the floating-point register file range and precision.

Table 5-12 lists the floating-point arithmetic instructions and Table 5-13 lists the pseudo-operation definitions.

Table 5-9. General Register (Integer) to Floating-point Register Data Translation

General Register               Floating-Point Register (.sig)     Floating-Point Register (.exp)
Class      NaT   Integer       Sign   Exponent   Significand      Sign          Exponent        Significand

NaT        1     ignore        NaTVal                             NaTVal
integers   0     000...00      0      0x1003E    integer          integer{17}   integer{16:0}   0x8000000000000000
                 through
                 111...11

Table 5-10. Floating-point Register to General Register (Integer) Data Translation

Floating-Point Register                                General Register (.sig)     General Register (.exp)
Class               Sign     Exponent   Significand    NaT   Integer               NaT   Integer

NaTVal              0        0x1FFFE    0.000...00     1     0x0000000000000000    1     0x1FFFE
integers or         0        0x1003E    0.000...00     0     significand           0     0x1003E
parallel FP                             through
                                        1.111...11
other               ignore   ignore     ignore         0     significand           0     ((sign<<17) | exponent)
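The .sig and .exp translations of Table 5-9 and Table 5-10 can be expressed compactly in C. The sketch below is an illustration under assumptions: the struct types and function names are hypothetical, a general register is modeled as a 64-bit value plus a NaT flag, and the .s and .d variants are not covered.

```c
#include <stdint.h>

/* Hypothetical models of a floating-point register and a general register. */
typedef struct { unsigned sign; uint32_t exponent; uint64_t significand; } fr_fields;
typedef struct { uint64_t value; int nat; } gr_reg;

static const fr_fields NATVAL = { 0, 0x1FFFE, 0 };

static int is_natval(fr_fields f)
{
    return f.sign == 0 && f.exponent == 0x1FFFE && f.significand == 0;
}

/* setf.sig: the integer becomes the significand of a canonical integer. */
static fr_fields setf_sig(uint64_t value, int nat)
{
    fr_fields f = { 0, 0x1003E, value };
    return nat ? NATVAL : f;
}

/* setf.exp: bit 17 supplies the sign, bits 16:0 the exponent, and the
 * significand is set to 0x8000000000000000 (integer bit only). */
static fr_fields setf_exp(uint64_t value, int nat)
{
    fr_fields f = { (unsigned)((value >> 17) & 1),
                    (uint32_t)(value & 0x1FFFF),
                    0x8000000000000000ULL };
    return nat ? NATVAL : f;
}

/* getf.sig: returns the significand; NaTVal sets the NaT bit. */
static gr_reg getf_sig(fr_fields f)
{
    gr_reg r = { f.significand, is_natval(f) };
    return r;
}

/* getf.exp: returns (sign << 17) | exponent; NaTVal sets the NaT bit. */
static gr_reg getf_exp(fr_fields f)
{
    gr_reg r = { ((uint64_t)f.sign << 17) | f.exponent, is_natval(f) };
    return r;
}
```

The NaTVal rows of Table 5-10 need no special case for the returned integer: the NaTVal significand is zero and its sign and exponent combine to 0x1FFFE.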

Table 5-11. Floating-point Instruction Status Field Specifier Definition

.sf Specifier   .s0        .s1        .s2        .s3
Status field    FPSR.sf0   FPSR.sf1   FPSR.sf2   FPSR.sf3

Table 5-12. Floating-point Arithmetic Instructions

Operation                                             Normal FP Mnemonic(s)   Parallel FP Mnemonic(s)
Floating-point multiply and add                       fma.pc.sf               fpma.sf
Floating-point multiply and subtract                  fms.pc.sf               fpms.sf
Floating-point negate multiply and add                fnma.pc.sf              fpnma.sf
Floating-point reciprocal approximation               frcpa.sf                fprcpa.sf
Floating-point reciprocal square root approximation   frsqrta.sf              fprsqrta.sf
Floating-point compare                                fcmp.frel.fctype.sf     fpcmp.frel.sf
Floating-point minimum                                fmin.sf                 fpmin.sf
Floating-point maximum                                fmax.sf                 fpmax.sf
Floating-point absolute minimum                       famin.sf                fpamin.sf
Floating-point absolute maximum                       famax.sf                fpamax.sf
Convert floating-point to signed integer              fcvt.fx.sf              fpcvt.fx.sf
                                                      fcvt.fx.trunc.sf        fpcvt.fx.trunc.sf
Convert floating-point to unsigned integer            fcvt.fxu.sf             fpcvt.fxu.sf
                                                      fcvt.fxu.trunc.sf       fpcvt.fxu.trunc.sf
Convert signed integer to floating-point              fcvt.xf                 N.A.



There are no pseudo-operations for Parallel FP addition, subtraction, negation or normalization since FR 1 does not contain a packed pair of single precision 1.0 values. A parallel FP addition can be performed by first forming a pair of 1.0 values in a register (using the fpack instruction) and then using the fpma instruction. Similarly, an integer add operation can be generated by first forming an integer 1 in a floating-point register and then using the xma instruction.

5.3.4 Non-Arithmetic Instructions

Table 5-14 lists the non-arithmetic floating-point instructions. The fclass instruction is used to classify the contents of a floating-point register. The fmerge instruction is used to merge data from two floating-point registers into one floating-point register. The fmix, fsxt, fpack, and fswap instructions are used to manipulate the Parallel FP data in the floating-point significand. The fand, fandcm, for, and fxor instructions are used to perform logical operations on the floating-point significand. The fselect instruction is used for conditional selects.

The non-arithmetic floating-point instructions always use the floating-point register (82-bit) precision since they have neither a .pc completer nor a .sf specifier.

Table 5-13. Floating-point Pseudo-Operations

Operation                                     Mnemonic         Operation Used
Floating-point multiplication (IEEE)          fmpy.pc.sf       fma, using FR 0 for addend
Parallel FP multiplication                    fpmpy.sf         fpma, using FR 0 for addend
Floating-point negate multiplication (IEEE)   fnmpy.pc.sf      fnma, using FR 0 for addend
Parallel FP negate multiplication             fpnmpy.sf        fpnma, using FR 0 for addend
Floating-point addition (IEEE)                fadd.pc.sf       fma, using FR 1 for multiplicand
Floating-point subtraction (IEEE)             fsub.pc.sf       fms, using FR 1 for multiplicand
Floating-point negation (IEEE)                fnma.pc.sf       fnma, using FR 1 for multiplicand and FR 0 for addend
Floating-point absolute value                 fabs             fmerge.s, with sign from FR 0
Parallel FP absolute value                    fpabs            fpmerge.s, with sign from FR 0
Floating-point negate                         fneg             fmerge.ns
Parallel FP negate                            fpneg            fpmerge.ns
Floating-point negate absolute value          fnegabs          fmerge.ns, with sign from FR 0
Parallel FP negate absolute value             fpnegabs         fpmerge.ns, with sign from FR 0
Floating-point normalization                  fnorm.pc.sf      fma, using FR 1 for multiplicand and FR 0 for addend
Convert unsigned integer to floating-point    fcvt.xuf.pc.sf   fma, using FR 1 for multiplicand and FR 0 for addend

Table 5-14. Non-Arithmetic Floating-point Instructions

Operation                                 Mnemonic(s)
Floating-point classify                   fclass.fcrel.fctype
Floating-point merge sign                 fmerge.s
Parallel FP merge sign                    fpmerge.s
Floating-point merge negative sign        fmerge.ns
Parallel FP merge negative sign           fpmerge.ns
Floating-point merge sign and exponent    fmerge.se
Parallel FP merge sign and exponent       fpmerge.se
Floating-point mix left                   fmix.l
Floating-point mix right                  fmix.r
Floating-point mix left-right             fmix.lr
Floating-point sign-extend left           fsxt.l
Floating-point sign-extend right          fsxt.r
Floating-point pack                       fpack
Floating-point swap                       fswap



5.3.5 Floating-point Status Register (FPSR) Status Field Instructions

Speculation of floating-point operations requires that the status flags be stored temporarily in one of the alternate status fields (not FPSR.sf0). After a speculative execution chain has been committed, a fchkf instruction can be used to update the normal flags (FPSR.sf0.flags). This operation will preserve the correctness of the IEEE flags. The fchkf instruction does this by comparing the flags of the status field with the FPSR.sf0.flags and FPSR.traps. If the flags of the alternate status field indicate the occurrence of an event that corresponds to an enabled floating-point exception in FPSR.traps, or an event that is not already registered in the FPSR.sf0.flags (i.e., the flag for that event in FPSR.sf0.flags is clear), then the fchkf instruction causes a Speculative Operation fault. If neither of these cases arises then the fchkf instruction does nothing.

The fsetc instruction allows bit-wise modification of a status field’s control bits. The FPSR.sf0.controls are ANDed with a 7-bit immediate and-mask and ORed with a 7-bit immediate or-mask to produce the control bits for the status field. The fclrf instruction clears all of the status field’s flags to zero.
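The three status field operations described above can be sketched on a 64-bit FPSR image as follows. This is a reading of the prose, not the architected Operation code: the function names mirror the mnemonics for readability only, the field positions come from Table 5-3 and Table 5-4, and the sketch relies on the flag bits (v, d, z, o, u, i) appearing in the same order as the trap-disable bits (vd, dd, zd, od, ud, id).

```c
#include <assert.h>
#include <stdint.h>

#define SF_SHIFT(n)  (6 + 13 * (n))   /* bit position of status field n */

static unsigned sf_flags(uint64_t fpsr, unsigned n)
{
    return (unsigned)(fpsr >> (SF_SHIFT(n) + 7)) & 0x3F;
}

static unsigned sf_controls(uint64_t fpsr, unsigned n)
{
    return (unsigned)(fpsr >> SF_SHIFT(n)) & 0x7F;
}

/* Returns nonzero when fchkf on alternate status field n would raise a
 * Speculative Operation fault. */
static int fchkf_faults(uint64_t fpsr, unsigned n)
{
    unsigned alt, enabled, main_flags;

    assert(n >= 1 && n <= 3);
    alt        = sf_flags(fpsr, n);
    enabled    = ~(unsigned)fpsr & 0x3F;   /* FPSR.traps bits are disables */
    main_flags = sf_flags(fpsr, 0);

    return (alt & enabled) != 0            /* event with an enabled exception */
        || (alt & ~main_flags) != 0;       /* event not yet registered in sf0 */
}

/* fsetc: controls of field n = (sf0.controls AND amask7) OR omask7. */
static uint64_t fsetc(uint64_t fpsr, unsigned n, unsigned amask7, unsigned omask7)
{
    unsigned controls = ((sf_controls(fpsr, 0) & amask7) | omask7) & 0x7F;
    uint64_t field_mask = (uint64_t)0x7F << SF_SHIFT(n);
    return (fpsr & ~field_mask) | ((uint64_t)controls << SF_SHIFT(n));
}

/* fclrf: clear all six flags of status field n. */
static uint64_t fclrf(uint64_t fpsr, unsigned n)
{
    return fpsr & ~((uint64_t)0x3F << (SF_SHIFT(n) + 7));
}
```

Because the controls of an alternate field are derived from FPSR.sf0.controls, an and-mask of all ones and an or-mask of zero copy the main controls unchanged, which is a common starting point for a speculative chain that should round the same way as non-speculative code.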

5.3.6 Integer Multiply and Add Instructions

Integer (fixed-point) multiply is executed in the floating-point unit using the three-operand xma instructions. The operands and result of these instructions are floating-point registers. The xma instructions ignore the sign and exponent fields of the floating-point register, except for a NaTVal check. The product of two 64-bit source significands is added to the third 64-bit significand (zero extended) to produce a 128-bit result. The low and high versions of the instruction select the appropriate low/high 64 bits of the 128-bit result, respectively, and write it into the destination register as a canonical integer. The signed and unsigned versions of the instructions treat the input registers as signed and unsigned 64-bit integers respectively.
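The arithmetic performed on the significands can be sketched in C. The sketch assumes a compiler that provides the non-standard __int128 type (for example GCC or Clang on a 64-bit target); the function names are hypothetical, the operand names f3, f4 (multiplicands) and f2 (addend) are illustrative, and the NaTVal check and the canonical-integer sign and exponent of the result are omitted.

```c
#include <stdint.h>

/* Low 64 bits of f3 * f4 + f2. The low half is identical for the signed
 * and unsigned interpretations, which is why xma.lu can be a pseudo-op. */
static uint64_t xma_l(uint64_t f3, uint64_t f4, uint64_t f2)
{
    return f3 * f4 + f2;
}

/* High 64 bits, treating the multiplicands as unsigned. */
static uint64_t xma_hu(uint64_t f3, uint64_t f4, uint64_t f2)
{
    unsigned __int128 r = (unsigned __int128)f3 * f4 + f2;
    return (uint64_t)(r >> 64);
}

/* High 64 bits, treating the multiplicands as signed. The addend is
 * zero extended, as stated in the text. */
static uint64_t xma_h(uint64_t f3, uint64_t f4, uint64_t f2)
{
    __int128 r = (__int128)(int64_t)f3 * (int64_t)f4 + (__int128)f2;
    return (uint64_t)((unsigned __int128)r >> 64);
}
```

A full 64 x 64 to 128-bit signed multiply-add is then the pair (xma_h, xma_l) applied to the same three operands.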

Table 5-14. Non-Arithmetic Floating-point Instructions (Continued)

Operation                                 Mnemonic(s)
Floating-point swap and negate left       fswap.nl
Floating-point swap and negate right      fswap.nr
Floating-point And                        fand
Floating-point And Complement             fandcm
Floating-point Or                         for
Floating-point Xor                        fxor
Floating-point Select                     fselect

Table 5-15. FPSR Status Field Instructions

Operation                                 Mnemonic(s)
Floating-point check flags                fchkf.sf
Floating-point clear flags                fclrf.sf
Floating-point set controls               fsetc.sf

Table 5-16. Integer Multiply and Add Instructions

Integer Multiply and Add   Low                   High
Signed                     xma.l                 xma.h
Unsigned                   xma.lu (pseudo-op)    xma.hu


5.4 Additional IEEE Considerations

5.4.1 Definition of SNaNs, QNaNs, and Propagation of NaNs

Signaling NaNs have a zero in the most significant fractional bit of the significand. Quiet NaNs have a one in the most significant fractional bit of the significand. This definition of signaling and quiet NaNs easily preserves “NaNness” when converting between different precisions. When propagating NaNs in operations that have more than one NaN operand, the result NaN is chosen from one of the operand NaNs in the following priority based on register encoding fields: first f4, then f2, and lastly f3.

5.4.2 IEEE Standard Mandated Operations Deferred to Software

The following IEEE mandated operations will be implemented in software:

• String to floating-point conversion.

• Floating-point to string conversion.

• Divide (with help from frcpa or fprcpa instruction).

• Square root (with help from frsqrta or fprsqrta instruction).

• Remainder (with help from frcpa or fprcpa instruction).

• Floating-point to integer valued floating-point conversion.

• Correctly wrapping the exponent for single, double, and double-extended overflow and underflow values, as recommended by the IEEE standard.

5.4.3 Additions beyond the IEEE Standard

• The fused multiply and add (fma, fms, fnma, fpma, fpms, fpnma) operations enable efficient software divide, square root, and remainder algorithms.

• The extended range of the 17-bit exponent in the register file format allows simplified implementation of many basic numeric algorithms by the careful numeric programmer.

• The NaTVal is a natural extension of the IEEE concept of NaNs. It is used to support speculative execution.

• Flush-to-Zero mode is an industry standard addition.

• The minimum and maximum instructions allow the efficient execution of the common Fortran Intrinsic Functions: MIN(), MAX(), AMIN(), AMAX(); and C language idioms such as a<b?a:b.

• All mixed precision operations are allowed. The IEEE standard suggests that implementations allow lower precision operands to produce higher precision results; this is supported. The IEEE standard also suggests that implementations not allow higher precision operands to produce lower precision results; this suggestion is not followed.

• An IEEE-style quad-precision real type is supported in software.



6 IA-64 Instruction Reference

This chapter describes the function of each IA-64 instruction. The pages of this chapter are sorted alphabetically by assembly language mnemonic.

6.1 Instruction Page Conventions

The instruction pages are divided into multiple sections as listed in Table 6-1. The first four sections are present on all instruction pages. The last three sections are present only when necessary. Table 6-2 lists the font conventions which are used by the instruction pages.

In the Format section, register addresses are specified using the assembly mnemonic field names given in the third column of Table 6-3. For instructions that are predicated, the Description section assumes that the qualifying predicate is true (except for instructions that modify architectural state when their qualifying predicate is false). The test of the qualifying predicate is included in the Operation section (when applicable).

In the Operation section, registers are addressed using the notation reg[addr].field. The register file being accessed is specified by reg, and has a value chosen from the second column of Table 6-3. The addr field specifies a register address as an assembly language field name or a register mnemonic. For the general, floating-point, and predicate register files which undergo register renaming, addr is the register address prior to renaming and the renaming is not shown. The field option specifies a named bit field within the register. If field is absent, then all fields of the register are accessed. The only exception is when referencing the data field of the general registers (64 bits, not including the NaT bit), where the notation GR[addr] is used. The syntactical differences between the code found in the Operation section and standard C are listed in Table 6-4.

Table 6-1. Instruction Page Description

Section Name    Contents
Format          Assembly language syntax, instruction type and encoding format
Description     Instruction function in English
Operation       Instruction function in C code
FP Exceptions   IEEE floating-point traps

Table 6-2. Instruction Page Font Conventions

Font                              Interpretation
regular (Format section)          Required characters in an assembly language mnemonic
italic (Format section)           Assembly language field name that must be filled with one of a range of legal values listed in the Description section
code (Operation section)          C code specifying instruction behavior
code_italic (Operation section)   Assembly language field name corresponding to an italic field listed in the Format section

Table 6-3. Register File Notation

Register File                         C Notation   Assembly Mnemonic   Indirect Access
Application registers                 AR           ar
Branch registers                      BR           b
CPU identification registers          CPUID        cpuid               Y
Floating-point registers              FR           f
General registers                     GR           r
Performance monitor data registers    PMD          pmd                 Y
Predicate registers                   PR           p

Table 6-4. C Syntax Differences

Syntax              Function
{msb:lsb}, {bit}    Bit field specifier. When appended to a variable, denotes a bit field extending from the most significant bit specified by “msb” to the least significant bit specified by “lsb”, including bits “msb” and “lsb”. If “msb” and “lsb” are equal, then a single bit is accessed. The second form denotes a single bit.
u>, u>=, u<, u<=    Unsigned inequality relations. Variables on either side of the operator are treated as unsigned.
u>>, u>>=           Unsigned right shift. Zeroes are shifted into the most significant bit position.
u+                  Unsigned addition. Operands are treated as unsigned, and zero-extended.
u*                  Unsigned multiplication. Operands are treated as unsigned.

6.2 Instruction Descriptions

The remainder of this chapter provides a description of each IA-64 instruction.
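The Operation sections on the following pages rely on the Table 6-4 notation. As a rough guide for readers who want to try the pseudo-code in standard C, the sketch below approximates the bit-field specifier and two of the unsigned operators. The helper names (bits, bit, u_less, u_shr) are illustrative assumptions and are not part of the architecture definition.

```c
#include <assert.h>
#include <stdint.h>

/* x{msb:lsb}: extract an inclusive bit field, assuming 0 <= lsb <= msb <= 63. */
uint64_t bits(uint64_t x, unsigned msb, unsigned lsb)
{
    unsigned width = msb - lsb + 1;
    uint64_t mask = (width == 64) ? ~UINT64_C(0)
                                  : ((UINT64_C(1) << width) - 1);
    return (x >> lsb) & mask;
}

/* x{bit}: extract a single bit, assuming b <= 63. */
unsigned bit(uint64_t x, unsigned b)
{
    return (unsigned)((x >> b) & 1);
}

/* a u< b: compare two values as unsigned even if they are held in
 * signed variables. */
int u_less(int64_t a, int64_t b)
{
    return (uint64_t)a < (uint64_t)b;
}

/* a u>> n: right shift with zeroes shifted into the most significant
 * bit positions, assuming n <= 63. */
uint64_t u_shr(int64_t a, unsigned n)
{
    return (uint64_t)a >> n;
}
```

For example, bits(x, 31, 30) corresponds to x{31:30}, the field that addp4 copies from GR r3.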


IA-64 Application ISA Guide 1.0 add

Add

Format: (qp) add r1 = r2, r3         register_form                A1
        (qp) add r1 = r2, r3, 1      plus1_form, register_form    A1
        (qp) add r1 = imm, r3        pseudo-op
        (qp) adds r1 = imm14, r3     imm14_form                   A4
        (qp) addl r1 = imm22, r3     imm22_form                   A5

Description: The two source operands (and an optional constant 1) are added and the result placed in GR r1. In the register_form the first operand is GR r2; in the imm14_form the first operand is taken from the sign-extended imm14 encoding field; in the imm22_form the first operand is taken from the sign-extended imm22 encoding field. In the imm22_form, GR r3 can specify only GRs 0, 1, 2 and 3.

The plus1_form is available only in the register_form (although the equivalent effect in the immediate forms can be achieved by adjusting the immediate).

The immediate-form pseudo-op chooses the imm14_form or imm22_form based upon the size of the immediate operand and the value in GR r3.

Operation:
    if (PR[qp]) {
        check_target_register(r1);

        if (register_form)                      // register form
            tmp_src = GR[r2];
        else if (imm14_form)                    // 14-bit immediate form
            tmp_src = sign_ext(imm14, 14);
        else                                    // 22-bit immediate form
            tmp_src = sign_ext(imm22, 22);

        tmp_nat = (register_form ? GR[r2].nat : 0);

        if (plus1_form)
            GR[r1] = tmp_src + GR[r3] + 1;
        else
            GR[r1] = tmp_src + GR[r3];

        GR[r1].nat = tmp_nat || GR[r3].nat;
    }
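
The following is a minimal C sketch of the add semantics, intended only to make the NaT propagation and the sign extension of the immediate concrete. The GReg type and the function names are assumptions introduced for illustration; the qualifying predicate and the target-register check are omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a general register: 64 data bits plus a NaT bit. */
typedef struct {
    uint64_t value;
    int nat;
} GReg;

/* Sign-extend the low `nbits` bits of v (1 <= nbits <= 64). */
int64_t sign_ext(uint64_t v, unsigned nbits)
{
    uint64_t m = UINT64_C(1) << (nbits - 1);
    v &= (m << 1) - 1;              /* keep only the low nbits bits */
    return (int64_t)((v ^ m) - m);  /* propagate the sign bit upward */
}

/* register_form, with the optional plus1_form constant. */
GReg add_reg(GReg r2, GReg r3, int plus1)
{
    GReg r1;
    r1.value = r2.value + r3.value + (plus1 ? 1 : 0);
    r1.nat = r2.nat || r3.nat;      /* NaT of either source taints the result */
    return r1;
}

/* imm14_form: an immediate never carries a NaT. */
GReg add_imm14(uint64_t imm14, GReg r3)
{
    GReg r1;
    r1.value = (uint64_t)sign_ext(imm14, 14) + r3.value;
    r1.nat = r3.nat;
    return r1;
}
```

With these definitions, an imm14 encoding of 0x3FFF behaves as -1, so adding it to a register holding 10 yields 9.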


addp4 IA-64 Application ISA Guide 1.0

Add Pointer

Format: (qp) addp4 r1 = r2, r3        register_form    A1
        (qp) addp4 r1 = imm14, r3     imm14_form       A4

Description: The two source operands are added. The upper 32 bits of the result are forced to zero, and then bits {31:30} of GR r3 are copied to bits {62:61} of the result. This result is placed in GR r1. In the register_form the first operand is GR r2; in the imm14_form the first operand is taken from the sign-extended imm14 encoding field.


Operation:
    if (PR[qp]) {
        check_target_register(r1);

        tmp_src = (register_form ? GR[r2] : sign_ext(imm14, 14));
        tmp_nat = (register_form ? GR[r2].nat : 0);

        tmp_res = tmp_src + GR[r3];
        tmp_res = zero_ext(tmp_res{31:0}, 32);
        tmp_res{62:61} = GR[r3]{31:30};
        GR[r1] = tmp_res;
        GR[r1].nat = tmp_nat || GR[r3].nat;
    }

Figure 6-1. Add Pointer

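[Figure not reproduced. It shows bits {31:0} of GR r2 and GR r3 being added to form bits {31:0} of GR r1, bits {31:30} of GR r3 being copied to bits {62:61} of GR r1, and all other upper bits of GR r1 being zero.]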
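
As a hedged illustration of the pointer arithmetic described above, the sketch below computes the data result of addp4 in C. The function name is an assumption for illustration, and NaT handling and predication are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Data result of addp4: 32-bit sum in the low half, bits {31:30} of the
 * second operand copied to bits {62:61}, everything else zero. */
uint64_t addp4_value(uint64_t src, uint64_t r3)
{
    uint64_t res = (src + r3) & UINT64_C(0xFFFFFFFF); /* zero_ext(sum{31:0}, 32) */
    uint64_t top = (r3 >> 30) & 3;                    /* GR[r3]{31:30} */
    return res | (top << 61);                         /* res{62:61} = top */
}
```

For instance, addp4_value(0x10, 0xC0000000) gives 0x60000000C0000010: the low word is the 32-bit sum and bits 62 and 61 are both set because bits 31 and 30 of the second operand are set.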


IA-64 Application ISA Guide 1.0 alloc

Allocate Stack Frame

Format: alloc r1 = ar.pfs, i, l, o, r M34

Description: A new stack frame is allocated on the general register stack, and the Previous Function State register (PFS) is copied to GR r1. The change of frame size is immediate. The write of GR r1 and subsequent instructions in the same instruction group use the new frame. This instruction cannot be predicated.

The four parameters, i (size of inputs), l (size of locals), o (size of outputs), and r (size of rotating) specify the sizes of the regions of the stack frame.

The size of the frame (sof) is determined by i + l + o. Note that this instruction may grow or shrink the size of the current register stack frame. The size of the local region (sol) is given by i + l. There is no real distinction between inputs and locals. They are given as separate operands in the instruction only as a hint to the assembler about how the local registers are to be used.

The rotating registers must fit within the stack frame and be a multiple of 8 in number. If this instruction attempts to change the size of CFM.sor, and the register rename base registers (CFM.rrb.gr, CFM.rrb.fr, CFM.rrb.pr) are not all zero, then the instruction will cause a Reserved Register/Field fault.

Although the assembler does not allow illegal combinations of operands for alloc, illegal combinations can be encoded in the instruction. Attempting to allocate a stack frame larger than 96 registers, or with the rotating region larger than the stack frame, or with the size of locals larger than the stack frame, will cause an Illegal Operation fault. An alloc instruction must be the first instruction in an instruction group. Otherwise, the results are undefined.

If insufficient registers are available to allocate the desired frame, alloc will stall the processor until enough dirty registers are written to the backing store. Such mandatory RSE stores may cause the data related faults listed below.

Operation:
    tmp_sof = i + l + o;
    tmp_sol = i + l;
    tmp_sor = r u>> 3;
    check_target_register_sof(r1, tmp_sof);
    if (tmp_sof u> 96 || r u> tmp_sof || tmp_sol u> tmp_sof)
        illegal_operation_fault();
    if (tmp_sor != CFM.sor &&
        (CFM.rrb.gr != 0 || CFM.rrb.fr != 0 || CFM.rrb.pr != 0))
        reserved_register_field_fault();

    alat_frame_update(0, tmp_sof - CFM.sof);
    rse_new_frame(CFM.sof, tmp_sof);    // Make room for new registers; Mandatory RSE
                                        // stores can raise faults listed below.
    CFM.sof = tmp_sof;
    CFM.sol = tmp_sol;
    CFM.sor = tmp_sor;

    GR[r1] = AR[PFS];
    GR[r1].nat = 0;
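
The size arithmetic and the legality check performed by alloc can be sketched in C as follows. This is an illustrative model only: the FrameSizes type and the function name are assumptions, and the RSE, ALAT, and rename-base interactions are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: ok is 0 when the operand combination would
 * raise an Illegal Operation fault. */
typedef struct {
    int ok;
    unsigned sof;   /* size of frame   = i + l + o */
    unsigned sol;   /* size of locals  = i + l     */
    unsigned sor;   /* size of rotating region, in units of 8 registers */
} FrameSizes;

FrameSizes alloc_sizes(unsigned i, unsigned l, unsigned o, unsigned r)
{
    FrameSizes f = {0, 0, 0, 0};
    unsigned sof = i + l + o;
    unsigned sol = i + l;

    /* Same three conditions as the Operation section. With these operand
     * types sol can never exceed sof, but the test is kept to mirror it. */
    if (sof > 96 || r > sof || sol > sof)
        return f;

    f.ok = 1;
    f.sof = sof;
    f.sol = sol;
    f.sor = r >> 3;
    return f;
}
```

For example, alloc_sizes(2, 3, 4, 8) describes a 9-register frame with 5 locals and one group of 8 rotating registers, while alloc_sizes(50, 40, 10, 0) is rejected because the frame would exceed 96 registers.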

Figure 6-2. Stack Frame

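[Figure not reproduced. It shows a stack frame starting at GR32, with the local region (of size sol) followed by the output region, and sof spanning both regions.]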


and IA-64 Application ISA Guide 1.0

Logical And

Format: (qp) and r1 = r2, r3       register_form    A1
        (qp) and r1 = imm8, r3     imm8_form        A3

Description: The two source operands are logically ANDed and the result placed in GR r1. In the register_form the first operand is GR r2; in the imm8_form the first operand is taken from the imm8 encoding field.



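Operation:
    if (PR[qp]) {
        check_target_register(r1);

        tmp_src = (register_form ? GR[r2] : sign_ext(imm8, 8));
        tmp_nat = (register_form ? GR[r2].nat : 0);

        GR[r1] = tmp_src & GR[r3];
        GR[r1].nat = tmp_nat || GR[r3].nat;
    }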


IA-64 Application ISA Guide 1.0 andcm

And Complement

Format: (qp) andcm r1 = r2, r3       register_form    A1
        (qp) andcm r1 = imm8, r3     imm8_form        A3

Description: The first source operand is logically ANDed with the 1’s complement of the second source operand and the result placed in GR r1. In the register_form the first operand is GR r2; in the imm8_form the first operand is taken from the imm8 encoding field.



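Operation:
    if (PR[qp]) {
        check_target_register(r1);

        tmp_src = (register_form ? GR[r2] : sign_ext(imm8, 8));
        tmp_nat = (register_form ? GR[r2].nat : 0);

        GR[r1] = tmp_src & ~GR[r3];
        GR[r1].nat = tmp_nat || GR[r3].nat;
    }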


br IA-64 Application ISA Guide 1.0

Branch

Format: (qp) br.btype.bwh.ph.dh target25         ip_relative_form                  B1
        (qp) br.btype.bwh.ph.dh b1 = target25    call_form, ip_relative_form       B3
             br.btype.bwh.ph.dh target25         counted_form, ip_relative_form    B2
             br.ph.dh target25                   pseudo-op
        (qp) br.btype.bwh.ph.dh b2               indirect_form                     B4
        (qp) br.btype.bwh.ph.dh b1 = b2          call_form, indirect_form          B5
             br.ph.dh b2                         pseudo-op

Description: A branch calculation is evaluated, and either a branch is taken, or execution continues with the next sequential instruction. The execution of a branch logically follows the execution of all previous non-branch instructions in the same instruction group. On a taken branch, execution begins at slot 0.

Branches can be either IP-relative or indirect. For IP-relative branches, the target25 operand, in assembly, specifies a label to branch to. This is encoded in the branch instruction as a signed immediate displacement (imm21) between the target bundle and the bundle containing this instruction (imm21 = (target25 – IP) >> 4). For indirect branches, the target address is taken from BR b2.
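
The displacement encoding described above can be illustrated with a small C sketch. It assumes 16-byte aligned bundle addresses and a target within the reach of a 21-bit signed bundle displacement; the function names are illustrative assumptions, not architectural names.

```c
#include <assert.h>
#include <stdint.h>

/* Encode the bundle displacement: imm21 = (target - ip) >> 4, kept to
 * 21 bits. Both addresses are assumed to be 16-byte aligned, so the
 * division by 16 is exact. */
uint32_t encode_imm21(uint64_t ip, uint64_t target)
{
    int64_t bundles = (int64_t)(target - ip) / 16;
    return (uint32_t)((uint64_t)bundles & 0x1FFFFF);
}

/* Decode as in the Operation section:
 * tmp_IP = IP + sign_ext(imm21 << 4, 25), with the low 4 bits ignored. */
uint64_t branch_target(uint64_t ip, uint32_t imm21)
{
    uint64_t disp = (uint64_t)(imm21 & 0x1FFFFF) << 4; /* 25-bit byte offset */
    uint64_t sign = UINT64_C(1) << 24;
    disp = (disp ^ sign) - sign;                       /* sign-extend from 25 bits */
    return (ip + disp) & ~UINT64_C(0xF);
}
```

A forward branch from 0x4000 to 0x4040 encodes as imm21 = 4, and a backward branch from 0x4000 to 0x3FF0 encodes as imm21 = 0x1FFFFF (that is, -1 bundle); decoding either value returns the original target.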

There are two pseudo-ops for unconditional branches. These are encoded like a conditional branch (btype = cond), with the qp field specifying PR 0, and with the bwh hint of sptk.

The branch type determines how the branch condition is calculated and whether the branch has other effects (such as writing a link register). For the basic branch types, the branch condition is simply the value of the specified predicate register. These basic branch types are:

• cond: If the qualifying predicate is 1, the branch is taken. Otherwise it is not taken.

• call: If the qualifying predicate is 1, the branch is taken and several other actions occur:

    • The current values of the Current Frame Marker (CFM), the EC application register and the current privilege level are saved in the Previous Function State application register.

    • The caller’s stack frame is effectively saved and the callee is provided with a frame containing only the caller’s output region.

    • The rotation rename base registers in the CFM are reset to 0.

    • A return link value is placed in BR b1.

• return: If the qualifying predicate is 1, the branch is taken and the following occurs:

    • CFM, EC, and the current privilege level are restored from PFS. (The privilege level is restored only if this does not increase privilege.)

    • The caller’s stack frame is restored.

    • If the return lowers the privilege, and PSR.lp is 1, then a Lower-privilege Transfer trap is taken.

• ia: The branch is taken unconditionally, if it is not intercepted by the OS. The effect of the branch is to invoke the IA-32 instruction set (by setting PSR.is to 1) and begin processing IA-32 instructions at the virtual linear target address contained in BR b2{31:0}. If the qualifying predicate is not PR 0, an Illegal Operation fault is raised.

Table 6-5. Branch Types

btype           Function                        Branch Condition                         Target Address
cond or none    Conditional branch              Qualifying predicate                     IP-rel or Indirect
call            Conditional procedure call      Qualifying predicate                     IP-rel or Indirect
ret             Conditional procedure return    Qualifying predicate                     Indirect
ia              Invoke IA-32 instruction set    Unconditional                            Indirect
cloop           Counted loop branch             Loop count                               IP-rel
ctop, cexit     Mod-scheduled counted loop      Loop count and epilog count              IP-rel
wtop, wexit     Mod-scheduled while loop        Qualifying predicate and epilog count    IP-rel



The IA-32 target effective address is calculated relative to the current code segment, i.e. EIP{31:0} = BR b2{31:0} – CSD.base. The IA-32 instruction set can be entered at any privilege level, provided instruction set transitions are not disabled.

Software must ensure the code segment descriptor (CSD) and selector (CS) are loaded before issuing the branch. If the target EIP value exceeds the code segment limit or has a code segment privilege violation, an IA-32_Exception(GPFault) is raised on the target IA-32 instruction. For entry into 16-bit IA-32 code, if BR b2 is not within 64K-bytes of CSD.base, a GPFault is raised on the target instruction. EFLAG.rf is unmodified, and in particular is not cleared, until the target IA-32 instruction successfully completes.

Software must issue a mf instruction before the branch if memory ordering is required between IA-32 processor consistent and IA-64 unordered memory references. The processor does not ensure IA-64-instruction-set-generated writes into the instruction stream are seen by subsequent IA-32 instruction fetches. br.ia does not perform an instruction serialization operation. The processor does ensure that prior writes (even in the same instruction group) to GRs and FRs are observed by the first IA-32 instruction. Writes to ARs within the same instruction group as br.ia are not allowed, since br.ia may implicitly read all ARs. If an illegal RAW dependency is present between an AR write and br.ia, the first IA-32 instruction fetch and execution may or may not see the updated AR value.

IA-32 instruction set execution leaves the contents of the ALAT undefined. Software cannot rely on ALAT values being preserved across an instruction set transition. On entry to IA-32 code, existing entries in the ALAT are ignored. If the register stack contains any dirty registers, an Illegal Operation fault is raised on the br.ia instruction. All registers left in the current register stack frame are left undefined during IA-32 instruction set execution. The current register stack frame is forced to zero. To flush the register file of dirty registers, the flushrs instruction must be issued in an instruction group preceding the br.ia instruction. To enhance the performance of the instruction set transition, software can start the IA-64 register stack flush in parallel with starting the IA-32 instruction set by 1) ensuring flushrs is exactly one instruction group before the br.ia, and 2) placing br.ia in the first B-slot. br.ia should always be executed in the first B-slot with a hint of “static-taken” (default), otherwise processor performance will be degraded.

Another branch type is provided for simple counted loops. This branch type uses the Loop Count application register (LC) to determine the branch condition, and does not use a qualifying predicate:

• cloop: If the LC register is not equal to zero, it is decremented and the branch is taken.

In addition to these simple branch types, there are four types which are used for accelerating modulo-scheduled loops. Two of these are for counted loops (which use the LC register), and two for while loops (which use the qualifying predicate). These loop types use register rotation to provide register renaming, and they use predication to turn off instructions that correspond to empty pipeline stages.

The Epilog Count application register (EC) is used to count epilog stages and, for some while loops, a portion of the prolog stages. In the epilog phase, EC is decremented each time around and, for most loops, when EC is one, the pipeline has been drained, and the loop is exited. For certain types of optimized, unrolled software-pipelined loops, the target of a br.cexit or br.wexit is set to the next sequential bundle. In this case, the pipeline may not be fully drained when EC is one, and continues to drain while EC is zero.

For these modulo-scheduled loop types, the calculation of whether the branch is taken or not depends on the kernel branch condition (LC for counted types, and the qualifying predicate for while types) and on the epilog condition (whether EC is greater than one or not).

These branch types are of two categories: top and exit. The top types (ctop and wtop) are used when the loop decision is located at the bottom of the loop body and therefore a taken branch will continue the loop while a fall through branch will exit the loop. The exit types (cexit and wexit) are used when the loop decision is located somewhere other than the bottom of the loop and therefore a fall through branch will continue the loop and a taken branch will exit the loop. The exit types are also used at intermediate points in an unrolled pipelined loop.



The modulo-scheduled loop types are:

• ctop and cexit: These branch types behave identically, except in the determination of whether to branch or not. For br.ctop, the branch is taken if either LC is non-zero or EC is greater than one. For br.cexit, the opposite is true. It is not taken if either LC is non-zero or EC is greater than one and is taken otherwise.

These branch types also use LC and EC to control register rotation and predicate initialization. During the prolog and kernel phase, when LC is non-zero, LC counts down. When br.ctop or br.cexit is executed with LC equal to zero, the epilog phase is entered, and EC counts down. When br.ctop or br.cexit is executed with LC equal to zero and EC equal to one, a final decrement of EC and a final register rotation are done. If LC and EC are equal to zero, register rotation stops. These other effects are the same for the two branch types, and are described in Figure 6-3.

• wtop and wexit: These branch types behave identically, except in the determination of whether to branch or not. For br.wtop, the branch is taken if either the qualifying predicate is one or EC is greater than one. For br.wexit, the opposite is true. It is not taken if either the qualifying predicate is one or EC is greater than one, and is taken otherwise.

These branch types also use the qualifying predicate and EC to control register rotation and predicate initialization. During the prolog phase, the qualifying predicate is either zero or one, depending upon the scheme used to program the loop. During the kernel phase, the qualifying predicate is one. During the epilog phase, the qualifying predicate is zero, and EC counts down. When br.wtop or br.wexit is executed with the qualifying predicate equal to zero and EC equal to one, a final decrement of EC and a final register rotation are done. If the qualifying predicate and EC are zero, register rotation stops. These other effects are the same for the two branch types, and are described in Figure 6-4.

Figure 6-3. Operation of br.ctop and br.cexit

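[Flowchart not reproduced. For ctop and cexit it selects on LC and then EC:
  LC != 0 (prolog / kernel):                 LC--, EC = EC, PR[63] = 1, RRB--;             ctop: branch, cexit: fall-thru
  LC == 0 (epilog), EC > 1:                  LC = LC, EC--, PR[63] = 0, RRB--;             ctop: branch, cexit: fall-thru
  LC == 0 (epilog), EC == 1:                 LC = LC, EC--, PR[63] = 0, RRB--;             ctop: fall-thru, cexit: branch
  LC == 0, EC == 0 (special unrolled loops): LC = LC, EC = EC, PR[63] = 0, RRB = RRB;      ctop: fall-thru, cexit: branch]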
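
To make the LC/EC bookkeeping of br.ctop concrete, the sketch below steps a small software model of the counted-loop state. It is an illustration under simplifying assumptions: the LoopState type and function names are hypothetical, and register rotation is represented only by a counter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the state touched by br.ctop. */
typedef struct {
    uint64_t lc;         /* Loop Count application register   */
    uint64_t ec;         /* Epilog Count application register */
    int pr63;            /* predicate written by the branch   */
    unsigned rotations;  /* number of register rotations done */
} LoopState;

/* Execute one br.ctop; returns 1 if the branch is taken. */
int br_ctop_step(LoopState *s)
{
    int taken = (s->lc != 0) || (s->ec > 1);

    if (s->lc != 0) {            /* prolog / kernel */
        s->lc--;
        s->pr63 = 1;
        s->rotations++;
    } else if (s->ec != 0) {     /* epilog */
        s->ec--;
        s->pr63 = 0;
        s->rotations++;
    } else {                     /* LC == 0 and EC == 0: rotation stops */
        s->pr63 = 0;
    }
    return taken;
}

/* Number of times the loop body runs when br.ctop closes the loop. */
unsigned ctop_trip_count(uint64_t lc, uint64_t ec)
{
    LoopState s = { lc, ec, 0, 0 };
    unsigned trips = 1;
    while (br_ctop_step(&s))
        trips++;
    return trips;
}
```

With LC = 3 and EC = 2 the body runs five times: four taken branches (three kernel iterations and one epilog stage) followed by the final fall-through that decrements EC to zero.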



The loop-type branches (br.cloop, br.ctop, br.cexit, br.wtop, and br.wexit) are only allowed in instruction slot 2 within a bundle. Executing such an instruction in either slot 0 or 1 will cause an Illegal Operation fault, whether the branch would have been taken or not.

Read after Write (RAW) and Write after Read (WAR) dependency requirements are slightly different for branch instructions. Changes to BRs, PRs, and PFS by non-branch instructions are visible to a subsequent branch instruction in the same instruction group (i.e., a limited RAW is allowed for these resources). This allows for a low-latency compare-branch sequence, for example. The normal RAW requirements apply to the LC and EC application registers, and the RRBs.

Within an instruction group, a WAR dependency on PR 63 is not allowed if both the reading and writing instructions are branches. For example, a br.wtop or br.wexit may not use PR[63] as its qualifying predicate and PR[63] cannot be the qualifying predicate for any branch preceding a br.wtop or br.wexit in the same instruction group.

For dependency purposes, the loop-type branches effectively always write their associated resources, whether they are taken or not. The cloop type effectively always writes LC. When LC is 0, a cloop branch leaves it unchanged, but hardware may implement this as a re-write of LC with the same value. Similarly, br.ctop and br.cexit effectively always write LC, EC, the RRBs, and PR[63]. br.wtop and br.wexit effectively always write EC, the RRBs, and PR[63].

Values for various branch hint completers are shown in the following tables. Whether Prediction Strategy hints are shown in Table 6-6. Sequential Prefetch hints are shown in Table 6-7. Branch Cache Deallocation hints are shown in Table 6-8.

Figure 6-4. Operation of br.wtop and br.wexit

Table 6-6. Branch Whether Hint

bwh Completer    Branch Whether Hint
spnt             Static Not-Taken
sptk             Static Taken
dpnt             Dynamic Not-Taken
dptk             Dynamic Taken

Table 6-7. Sequential Prefetch Hint

ph Completer    Sequential Prefetch Hint
few or none     Few lines
many            Many lines

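[Flowchart for Figure 6-4 not reproduced. For wtop and wexit it selects on PR[qp] and then EC:
  PR[qp] == 1 (prolog / kernel):                      EC = EC, PR[63] = 0, RRB--;         wtop: branch, wexit: fall-thru
  PR[qp] == 0 (prolog / epilog), EC > 1:              EC--, PR[63] = 0, RRB--;            wtop: branch, wexit: fall-thru
  PR[qp] == 0 (epilog), EC == 1:                      EC--, PR[63] = 0, RRB--;            wtop: fall-thru, wexit: branch
  PR[qp] == 0, EC == 0 (special unrolled loops):      EC = EC, PR[63] = 0, RRB = RRB;     wtop: fall-thru, wexit: branch]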

Table 6-8. Branch Cache Deallocation Hint

dh Completer    Branch Cache Deallocation Hint
none            Don’t deallocate
clr             Deallocate branch information

Operation:
    if (ip_relative_form)                       // determine branch target
        tmp_IP = IP + sign_ext((imm21 << 4), 25);
    else                                        // indirect_form
        tmp_IP = BR[b2];

    if (btype != ‘ia’)                          // for IA-64 branches,
        tmp_IP = tmp_IP & ~0xf;                 // ignore bottom 4 bits of target

    lower_priv_transition = 0;

    switch (btype) {
    case ‘cond’:                                // simple conditional branch
        tmp_taken = PR[qp];
        break;

    case ‘call’:                                // call saves a return link
        tmp_taken = PR[qp];
        if (tmp_taken) {
            BR[b1] = IP + 16;

            AR[PFS].pfm = CFM;                  // ... and saves the stack frame
            AR[PFS].pec = AR[EC];
            AR[PFS].ppl = PSR.cpl;

            alat_frame_update(CFM.sol, 0);
            rse_preserve_frame(CFM.sol);
            CFM.sof -= CFM.sol;                 // new frame size is size of outs
            CFM.sol = 0;
            CFM.sor = 0;
            CFM.rrb.gr = 0;
            CFM.rrb.fr = 0;
            CFM.rrb.pr = 0;
        }
        break;

    case ‘ret’:                                 // return restores stack frame
        tmp_taken = PR[qp];
        if (tmp_taken) {
            // tmp_growth indicates the amount to move logical TOP *up*:
            // tmp_growth = sizeof(previous out) - sizeof(current frame)
            // a negative amount indicates a shrinking stack
            tmp_growth = (AR[PFS].pfm.sof - AR[PFS].pfm.sol) - CFM.sof;
            alat_frame_update(-AR[PFS].pfm.sol, 0);
            rse_fatal = rse_restore_frame(AR[PFS].pfm.sol, tmp_growth, CFM.sof);
            if (rse_fatal) {
                CFM.sof = 0;
                CFM.sol = 0;
                CFM.sor = 0;
                CFM.rrb.gr = 0;
                CFM.rrb.fr = 0;
                CFM.rrb.pr = 0;
            } else                              // normal branch return
                CFM = AR[PFS].pfm;

            rse_enable_current_frame_load();
            AR[EC] = AR[PFS].pec;
            if (PSR.cpl u< AR[PFS].ppl) {       // ... and restores privilege
                PSR.cpl = AR[PFS].ppl;
                lower_priv_transition = 1;
            }
        }



        break;

    case ‘ia’:                                  // switch to IA mode
        tmp_taken = 1;
        if (qp != 0)
            illegal_operation_fault();
        if (AR[BSPSTORE] != AR[BSP])
            illegal_operation_fault();
        if (PSR.di)
            disabled_instruction_set_transition_fault();
        PSR.is = 1;                             // set IA-32 Instruction Set Mode
        CFM.sof = 0;                            // force current stack frame
        CFM.sol = 0;                            // to zero
        CFM.sor = 0;
        CFM.rrb.gr = 0;
        CFM.rrb.fr = 0;
        CFM.rrb.pr = 0;
        rse_invalidate_non_current_regs();
        // Note the register stack is disabled during IA-32 instruction set execution
        break;

    case ‘cloop’:                               // simple counted loop
        if (slot != 2)
            illegal_operation_fault();
        tmp_taken = (AR[LC] != 0);
        if (AR[LC] != 0)
            AR[LC]--;
        break;

    case ‘ctop’:
    case ‘cexit’:                               // SW pipelined counted loop
        if (slot != 2)
            illegal_operation_fault();
        if (btype == ‘ctop’)  tmp_taken = ((AR[LC] != 0) || (AR[EC] u> 1));
        if (btype == ‘cexit’) tmp_taken = !((AR[LC] != 0) || (AR[EC] u> 1));
        if (AR[LC] != 0) {
            AR[LC]--;
            AR[EC] = AR[EC];
            PR[63] = 1;
            rotate_regs();
        } else if (AR[EC] != 0) {
            AR[LC] = AR[LC];
            AR[EC]--;
            PR[63] = 0;
            rotate_regs();
        } else {
            AR[LC] = AR[LC];
            AR[EC] = AR[EC];
            PR[63] = 0;
            CFM.rrb.gr = CFM.rrb.gr;
            CFM.rrb.fr = CFM.rrb.fr;
            CFM.rrb.pr = CFM.rrb.pr;
        }
        break;

    case ‘wtop’:
    case ‘wexit’:                               // SW pipelined while loop
        if (slot != 2)
            illegal_operation_fault();
        if (btype == ‘wtop’)  tmp_taken = (PR[qp] || (AR[EC] u> 1));
        if (btype == ‘wexit’) tmp_taken = !(PR[qp] || (AR[EC] u> 1));
        if (PR[qp]) {
            AR[EC] = AR[EC];
            PR[63] = 0;
            rotate_regs();
        } else if (AR[EC] != 0) {
            AR[EC]--;
            PR[63] = 0;
            rotate_regs();
        } else {
            AR[EC] = AR[EC];
            PR[63] = 0;
            CFM.rrb.gr = CFM.rrb.gr;
            CFM.rrb.fr = CFM.rrb.fr;
            CFM.rrb.pr = CFM.rrb.pr;
        }
        break;
    }

    if (tmp_taken) {
        taken_branch = 1;
        IP = tmp_IP;                            // set the new value for IP
        if ((PSR.it && unimplemented_virtual_address(tmp_IP))
            || (!PSR.it && unimplemented_physical_address(tmp_IP)))
            unimplemented_instruction_address_trap(lower_priv_transition, tmp_IP);
        if (lower_priv_transition && PSR.lp)
            lower_privilege_transfer_trap();
        if (PSR.tb)
            taken_branch_trap();
    }
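
The frame bookkeeping of a taken call can be sketched in C as shown below. This is a simplified illustration: the Cfm type and the function names are assumptions, and the saving of PFS, the ALAT update, and the RSE interaction are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the Current Frame Marker fields used here. */
typedef struct {
    unsigned sof, sol, sor;
    unsigned rrb_gr, rrb_fr, rrb_pr;
} Cfm;

/* Frame seen by the callee after a taken br.call: only the caller's
 * output region remains, with no locals, no rotating region, and the
 * rename bases reset. */
Cfm callee_frame(Cfm caller)
{
    Cfm callee;
    callee.sof = caller.sof - caller.sol;   /* new frame = caller's outputs */
    callee.sol = 0;
    callee.sor = 0;
    callee.rrb_gr = 0;
    callee.rrb_fr = 0;
    callee.rrb_pr = 0;
    return callee;
}

/* Return link written to BR b1: the bundle after the one holding the call. */
uint64_t return_link(uint64_t ip)
{
    return ip + 16;
}
```

For a caller frame with sof = 9 and sol = 5, the callee starts with a 4-register frame that it would normally resize with alloc.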


IA-64 Application ISA Guide 1.0 break

Break

Format: (qp) break imm21      pseudo-op
        (qp) break.i imm21    i_unit_form    I19
        (qp) break.b imm21    b_unit_form    B9
        (qp) break.m imm21    m_unit_form    M37
        (qp) break.f imm21    f_unit_form    F15
        (qp) break.x imm62    x_unit_form    X1

Description: A Break Instruction fault is taken. For the i_unit_form, f_unit_form and m_unit_form, the value specified by imm21 is zero-extended and placed in the Interruption Immediate control register (IIM).

For the b_unit_form, imm21 is ignored and the value zero is placed in the Interruption Immediate control register (IIM).

For the x_unit_form, the lower 21 bits of the value specified by imm62 are zero-extended and placed in the Interruption Immediate control register (IIM). The L slot of the bundle contains the upper 41 bits of imm62.

This instruction has five forms, each of which can be executed only on a particular execution unit type. The pseudo-op can be used if the unit type to execute on is unimportant.

Operation:
    if (PR[qp]) {
        if (b_unit_form)
            immediate = 0;
        else if (x_unit_form)
            immediate = zero_ext(imm62, 21);
        else                                    // i_unit_form || m_unit_form || f_unit_form
            immediate = zero_ext(imm21, 21);

        break_instruction_fault(immediate);
    }
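
The selection of the value delivered to IIM can be sketched in C as follows. The BreakForm enumeration and the function name are assumptions made for this illustration; the fault delivery itself is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tags for the five execution-unit forms of break. */
typedef enum { BREAK_I, BREAK_B, BREAK_M, BREAK_F, BREAK_X } BreakForm;

/* Value placed in IIM: zero for the B-unit form, otherwise the low
 * 21 bits of the immediate (imm21, or imm62 for the X-unit form). */
uint64_t break_iim(BreakForm form, uint64_t imm)
{
    if (form == BREAK_B)
        return 0;                   /* imm21 is ignored */
    return imm & 0x1FFFFF;          /* zero_ext(imm, 21) */
}
```

Note that for the X-unit form only the low 21 bits of imm62 reach IIM; the upper 41 bits live in the L slot and would have to be read from the bundle by the handler.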


chk IA-64 Application ISA Guide 1.0

Speculation Check

Format: (qp) chk.s r2, target25         pseudo-op
        (qp) chk.s.i r2, target25       control_form, i_unit_form, gr_form    I20
        (qp) chk.s.m r2, target25       control_form, m_unit_form, gr_form    M20
        (qp) chk.s f2, target25         control_form, fr_form                 M21
        (qp) chk.a.aclr r1, target25    data_form, gr_form                    M22
        (qp) chk.a.aclr f1, target25    data_form, fr_form                    M23

Description: The result of a control- or data-speculative calculation is checked for success or failure. If the check fails, a branch to target25 is taken.

In the control_form, success is determined by a NaT indication for the source register. If the NaT bit corresponding to GR r2 is 1 (in the gr_form), or FR f2 contains a NaTVal (in the fr_form), the check fails.

In the data_form, success is determined by the ALAT. The ALAT is queried using the general register specifier r1 (in the gr_form), or the floating-point register specifier f1 (in the fr_form). If no ALAT entry matches, the check fails. An implementation may optionally cause the check to fail independent of whether an ALAT entry matches.

The target25 operand, in assembly, specifies a label to branch to. This is encoded in the instruction as a signed immediate displacement (imm21) between the target bundle and the bundle containing this instruction (imm21 = (target25 – IP) >> 4).

The control_form of this instruction for checking general registers can be encoded on either an I-unit or an M-unit. The pseudo-op can be used if the unit type to execute on is unimportant.

For the data_form, if an ALAT entry matches, the matching ALAT entry can be optionally invalidated, based on the value of the aclr completer (see Table 6-9).

Note that if the clr value of the aclr completer is used and the check succeeds, the matching ALAT entry is invalidated. However, if the check fails (which may happen even if there is a matching ALAT entry), any matching ALAT entry may optionally be invalidated, but this is not required. Recovery code for data speculation, therefore, cannot rely on the absence of a matching ALAT entry.

Table 6-9. ALAT Clear Completer

aclr Completer    Effect on ALAT
clr               Invalidate matching ALAT entry
nc                Don’t invalidate


Operation:
    if (PR[qp]) {
        if (control_form) {
            if (fr_form && (tmp_isrcode = fp_reg_disabled(f2, 0, 0, 0)))
                disabled_fp_register_fault(tmp_isrcode, 0);
            check_type = gr_form ? CHKS_GENERAL : CHKS_FLOAT;
            fail = (gr_form && GR[r2].nat) || (fr_form && FR[f2] == NATVAL);
        } else {                                // data_form
            reg_type = gr_form ? GENERAL : FLOAT;
            alat_index = gr_form ? r1 : (data_form ? f1 : f2);
            check_type = gr_form ? CHKA_GENERAL : CHKA_FLOAT;
            fail = !alat_cmp(reg_type, alat_index);
        }
        if (fail) {
            taken_branch = 1;
            IP = IP + sign_ext((imm21 << 4), 25);
            if ((PSR.it && unimplemented_virtual_address(IP))
                || (!PSR.it && unimplemented_physical_address(IP)))
                unimplemented_instruction_address_trap(0, IP);
        }
        if (!fail && data_form && (aclr == ‘clr’))
            alat_inval_single_entry(reg_type, alat_index);
    }
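
The interaction between the data_form check and the aclr completer can be illustrated with a deliberately small C model. The ALAT here is reduced to one valid flag per general register, which is an assumption made for illustration only: a real ALAT also tracks addresses and is invalidated by conflicting stores, and an implementation may fail the check even when an entry matches. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_GR 128

/* Toy ALAT: one valid flag per general register number. */
static int alat_valid[NUM_GR];

/* Record an entry, as an advanced load targeting `reg` would. */
int alat_insert(unsigned reg)
{
    alat_valid[reg % NUM_GR] = 1;
    return 0;
}

/* chk.a on a general register. Returns 1 when the check fails (the
 * branch to recovery code would be taken), 0 when it succeeds.
 * `clr` selects the clr completer; nc is modeled by clr == 0. */
int chk_a(unsigned reg, int clr)
{
    int fail = !alat_valid[reg % NUM_GR];
    if (!fail && clr)
        alat_valid[reg % NUM_GR] = 0;   /* invalidate only on success */
    return fail;
}
```

In this model a check with nc can be repeated successfully, whereas a successful check with clr consumes the entry so that a later check on the same register fails.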


clrrrb IA-64 Application ISA Guide 1.0

Clear RRB

Format: clrrrb       all_form     B8
        clrrrb.pr    pred_form    B8

Description: In the all_form, the register rename base registers (CFM.rrb.gr, CFM.rrb.fr, and CFM.rrb.pr) are cleared. In the pred_form, the single register rename base register for the predicates (CFM.rrb.pr) is cleared.

This instruction must be the last instruction in an instruction group, or an Illegal Operation fault is taken.

This instruction cannot be predicated.

Operation:
    if (!followed_by_stop())
        illegal_operation_fault();

    if (all_form) {
        CFM.rrb.gr = 0;
        CFM.rrb.fr = 0;
        CFM.rrb.pr = 0;
    } else {                                    // pred_form
        CFM.rrb.pr = 0;
    }

IA-64 Application ISA Guide 1.0 cmp

Compare

Format: (qp) cmp.crel.ctype p1, p2 = r2, r3      register_form               A6
        (qp) cmp.crel.ctype p1, p2 = imm8, r3    imm8_form                   A8
        (qp) cmp.crel.ctype p1, p2 = r0, r3      parallel_inequality_form    A7
        (qp) cmp.crel.ctype p1, p2 = r3, r0      pseudo-op

Description: The two source operands are compared for one of ten relations specified by crel. This produces a boolean result which is 1 if the comparison condition is true, and 0 otherwise. This result is written to the two predicate register destinations, p1 and p2. The way the result is written to the destinations is determined by the compare type specified by ctype.

The compare types describe how the predicate targets are updated based on the result of the comparison. The normal type simply writes the compare result to one target, and the complement to the other. The parallel types update the targets only for a particular comparison result. This allows multiple simultaneous OR-type or multiple simultaneous AND-type compares to target the same predicate register.

The unc type is special in that it first initializes both predicate targets to 0, independent of the qualifying predicate. It then operates the same as the normal type. The behavior of the compare types is described in Table 6-10. A blank entry indicates the predicate target is left unchanged.

In the register_form the first operand is GR r2; in the imm8_form the first operand is taken from the sign-extended imm8 encoding field; and in the parallel_inequality_form the first operand must be GR 0. The parallel_inequality_form is only used when the compare type is one of the parallel types, and the relation is an inequality (>, >=, <, <=). See below.

If the two predicate register destinations are the same (p1 and p2 specify the same predicate register), the instruction will take an Illegal Operation fault if the qualifying predicate is set, or if the compare type is unc.

Of the ten relations, not all are directly implemented in hardware. Some are actually pseudo-ops. For these, the assembler simply switches the source operand specifiers and/or switches the predicate target specifiers and uses an implemented relation. For some of the pseudo-op compares in the imm8_form, the assembler subtracts 1 from the immediate value, making the allowed immediate range slightly different. Of the six parallel compare types, three of the types are actually pseudo-ops. The assembler simply uses the negative relation with an implemented type. The implemented relations and how the pseudo-ops map onto them are shown in Table 6-11 (for normal and unc type compares), and Table 6-12 (for parallel type compares).

Table 6-10. Comparison Types

                        PR[qp]==0       PR[qp]==1
                                        result==0,      result==1,      One or More
                                        No Source NaTs  No Source NaTs  Source NaTs
ctype     Pseudo-op of  PR[p1] PR[p2]   PR[p1] PR[p2]   PR[p1] PR[p2]   PR[p1] PR[p2]
none                                    0      1        1      0        0      0
unc                     0      0        0      1        1      0        0      0
or                                                      1      1
and                                     0      0                        0      0
or.andcm                                                1      0
orcm      or                            1      1
andcm     and                                           0      0        0      0
and.orcm  or.andcm                      0      1


The parallel compare types can be used only with a restricted set of relations and operands. They can be used with equal and not-equal comparisons between two registers or between a register and an immediate, or they can be used with inequality comparisons between a register and GR 0. Unsigned relations are not provided, since they are not of much use when one of the operands is zero. For the parallel inequality comparisons, hardware only directly implements the ones where the first operand (GR r2) is GR 0. Comparisons where the second operand is GR 0 are pseudo-ops for which the assembler switches the register specifiers and uses the opposite relation.

Table 6-11. 64-bit Comparison Relations for Normal and unc Compares

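crel   Compare Relation    Register Form is a     Immediate Form is a    Immediate Range
       (a rel b)           Pseudo-op of           Pseudo-op of
eq     a == b                                                            -128 .. 127
ne     a != b              eq  p1 ↔ p2            eq  p1 ↔ p2            -128 .. 127
lt     a < b (signed)                                                    -128 .. 127
le     a <= b              lt  a ↔ b, p1 ↔ p2     lt  a-1                -127 .. 128
gt     a > b               lt  a ↔ b              lt  a-1, p1 ↔ p2       -127 .. 128
ge     a >= b              lt  p1 ↔ p2            lt  p1 ↔ p2            -128 .. 127
ltu    a < b (unsigned)                                                  0 .. 127, 2^64-128 .. 2^64-1
leu    a <= b              ltu a ↔ b, p1 ↔ p2     ltu a-1                1 .. 128, 2^64-127 .. 2^64
gtu    a > b               ltu a ↔ b              ltu a-1, p1 ↔ p2       1 .. 128, 2^64-127 .. 2^64
geu    a >= b              ltu p1 ↔ p2            ltu p1 ↔ p2            0 .. 127, 2^64-128 .. 2^64-1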
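
The pseudo-op rewrites described in the text (swapping the source operands, swapping the predicate targets, and subtracting 1 from the immediate) can be checked with a short C sketch. It assumes, as the ne and ge rows indicate, that signed lt is an implemented relation; for a normal compare, swapping p1 and p2 is modeled as negating the boolean result. The function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the implemented signed relation "lt". */
static int hw_lt(int64_t a, int64_t b)
{
    return a < b;
}

/* gt, register form: swap the source operands. */
int cmp_gt_reg(int64_t a, int64_t b)
{
    return hw_lt(b, a);
}

/* ge: same operands, swap the predicate targets (negated result). */
int cmp_ge(int64_t a, int64_t b)
{
    return !hw_lt(a, b);
}

/* le, immediate form: subtract 1 from the immediate. */
int cmp_le_imm(int64_t imm, int64_t b)
{
    return hw_lt(imm - 1, b);
}

/* gt, immediate form: subtract 1 and swap the predicate targets. */
int cmp_gt_imm(int64_t imm, int64_t b)
{
    return !hw_lt(imm - 1, b);
}
```

Because the assembler encodes imm - 1 in the 8-bit field, the immediate accepted by these forms shifts from -128 .. 127 to -127 .. 128.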

Table 6-12. 64-bit Comparison Relations for Parallel Compares

crel   Compare Relation    Register Form is a    Immediate Range
       (a rel b)           Pseudo-op of
eq     a == b                                    -128 .. 127
ne     a != b                                    -128 .. 127
lt     0 < b
lt     a < 0               gt a ↔ b
le     0 <= b
le     a <= 0              ge a ↔ b
gt     0 > b
gt     a > 0               lt a ↔ b
ge     0 >= b
ge     a >= 0              le a ↔ b



Operation:
    if (PR[qp]) {
        if (p1 == p2)
            illegal_operation_fault();

        tmp_nat = (register_form ? GR[r2].nat : 0) || GR[r3].nat;

        if (register_form)
            tmp_src = GR[r2];
        else if (imm8_form)
            tmp_src = sign_ext(imm8, 8);
        else // parallel_inequality_form
            tmp_src = 0;

        if      (crel == ‘eq’)  tmp_rel = tmp_src == GR[r3];
        else if (crel == ‘ne’)  tmp_rel = tmp_src != GR[r3];
        else if (crel == ‘lt’)  tmp_rel = lesser_signed(tmp_src, GR[r3]);
        else if (crel == ‘le’)  tmp_rel = lesser_equal_signed(tmp_src, GR[r3]);
        else if (crel == ‘gt’)  tmp_rel = greater_signed(tmp_src, GR[r3]);
        else if (crel == ‘ge’)  tmp_rel = greater_equal_signed(tmp_src, GR[r3]);
        else if (crel == ‘ltu’) tmp_rel = lesser(tmp_src, GR[r3]);
        else if (crel == ‘leu’) tmp_rel = lesser_equal(tmp_src, GR[r3]);
        else if (crel == ‘gtu’) tmp_rel = greater(tmp_src, GR[r3]);
        else                    tmp_rel = greater_equal(tmp_src, GR[r3]); // ‘geu’

        switch (ctype) {
            case ‘and’: // and-type compare
                if (tmp_nat || !tmp_rel) {
                    PR[p1] = 0;
                    PR[p2] = 0;
                }
                break;

            case ‘or’: // or-type compare
                if (!tmp_nat && tmp_rel) {
                    PR[p1] = 1;
                    PR[p2] = 1;
                }
                break;

            case ‘or.andcm’: // or.andcm-type compare
                if (!tmp_nat && tmp_rel) {
                    PR[p1] = 1;
                    PR[p2] = 0;
                }
                break;

            case ‘unc’: // unc-type compare
            default: // normal compare
                if (tmp_nat) {
                    PR[p1] = 0;
                    PR[p2] = 0;
                } else {
                    PR[p1] = tmp_rel;
                    PR[p2] = !tmp_rel;
                }
                break;
        }
    } else {
        if (ctype == ‘unc’) {
            if (p1 == p2)
                illegal_operation_fault();
            PR[p1] = 0;
            PR[p2] = 0;
        }
    }
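The way each compare type writes its two predicate targets can be summarized in a few lines of Python. This is a minimal sketch that mirrors the switch statement of the Operation section; the function name write_predicates and its argument layout are assumptions made for the example, and the p1 == p2 illegal-operation check is left out.

```python
# Sketch only: predicate update rules of the cmp compare types.
# ctype is "and", "or", "or.andcm", "unc", or None for a normal compare.

def write_predicates(ctype, qp, tmp_rel, tmp_nat, p1, p2):
    """Return the new (PR[p1], PR[p2]) given their old values p1 and p2."""
    if qp:
        if ctype == "and":
            # Clears both targets when the relation fails or a source is NaT.
            return (0, 0) if (tmp_nat or not tmp_rel) else (p1, p2)
        if ctype == "or":
            # Sets both targets only when the relation holds and no source is NaT.
            return (1, 1) if (not tmp_nat and tmp_rel) else (p1, p2)
        if ctype == "or.andcm":
            return (1, 0) if (not tmp_nat and tmp_rel) else (p1, p2)
        # unc and normal compares behave identically when qp is true.
        if tmp_nat:
            return (0, 0)
        return (int(bool(tmp_rel)), int(not tmp_rel))
    # qp is false: only the unc type writes its targets.
    if ctype == "unc":
        return (0, 0)
    return (p1, p2)
```

Because the parallel types either leave both targets untouched or write the same fixed values, several of them can target the same predicate in one instruction group without the final value depending on their order.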

cmp4

Compare Word

Format: (qp) cmp4.crel.ctype p1, p2 = r2, r3      register_form               A6
        (qp) cmp4.crel.ctype p1, p2 = imm8, r3    imm8_form                   A8
        (qp) cmp4.crel.ctype p1, p2 = r0, r3      parallel_inequality_form    A7
        (qp) cmp4.crel.ctype p1, p2 = r3, r0      pseudo-op

Description: The least significant 32 bits from each of two source operands are compared for one of ten relations specified by crel. This produces a boolean result which is 1 if the comparison condition is true, and 0 otherwise. This result is written to the two predicate register destinations, p1 and p2. The way the result is written to the destinations is determined by the compare type specified by ctype. See the Compare instruction and Table 6-10 on page 6-19.

In the register_form the first operand is GR r2; in the imm8_form the first operand is taken from the sign-extended imm8 encoding field; and in the parallel_inequality_form the first operand must be GR 0. The parallel_inequality_form is only used when the compare type is one of the parallel types, and the relation is an inequality (>, >=, <, <=). See the Compare instruction and Table 6-12 on page 6-20.

Of the ten relations, not all are directly implemented in hardware. Some are actually pseudo-ops. See the Compare instruction and Table 6-11 and Table 6-12 on page 6-20. The range for immediates is given below.



Table 6-13. Immediate Range for 32-bit Compares

crel  Compare Relation (a rel b)  Immediate Range
eq    a == b                      -128 .. 127
ne    a != b                      -128 .. 127
lt    a < b                       -128 .. 127
le    a <= b                      -127 .. 128
gt    a > b                       -127 .. 128
ge    a >= b                      -128 .. 127
ltu   a < b                       0 .. 127, 2^32-128 .. 2^32-1
leu   a <= b                      1 .. 128, 2^32-127 .. 2^32
gtu   a > b                       1 .. 128, 2^32-127 .. 2^32
geu   a >= b                      0 .. 127, 2^32-128 .. 2^32-1

Operation:
    if (PR[qp]) {
        if (p1 == p2)
            illegal_operation_fault();

        tmp_nat = (register_form ? GR[r2].nat : 0) || GR[r3].nat;

        if (register_form)
            tmp_src = GR[r2];
        else if (imm8_form)
            tmp_src = sign_ext(imm8, 8);
        else // parallel_inequality_form
            tmp_src = 0;

        if (crel == ‘eq’)
            tmp_rel = tmp_src{31:0} == GR[r3]{31:0};
        else if (crel == ‘ne’)
            tmp_rel = tmp_src{31:0} != GR[r3]{31:0};
        else if (crel == ‘lt’)
            tmp_rel = lesser_signed(sign_ext(tmp_src, 32), sign_ext(GR[r3], 32));
        else if (crel == ‘le’)
            tmp_rel = lesser_equal_signed(sign_ext(tmp_src, 32), sign_ext(GR[r3], 32));
        else if (crel == ‘gt’)
            tmp_rel = greater_signed(sign_ext(tmp_src, 32), sign_ext(GR[r3], 32));
        else if (crel == ‘ge’)
            tmp_rel = greater_equal_signed(sign_ext(tmp_src, 32), sign_ext(GR[r3], 32));
        else if (crel == ‘ltu’)
            tmp_rel = lesser(zero_ext(tmp_src, 32), zero_ext(GR[r3], 32));
        else if (crel == ‘leu’)
            tmp_rel = lesser_equal(zero_ext(tmp_src, 32), zero_ext(GR[r3], 32));
        else if (crel == ‘gtu’)
            tmp_rel = greater(zero_ext(tmp_src, 32), zero_ext(GR[r3], 32));
        else // ‘geu’
            tmp_rel = greater_equal(zero_ext(tmp_src, 32), zero_ext(GR[r3], 32));

        switch (ctype) {
            case ‘and’: // and-type compare
                if (tmp_nat || !tmp_rel) {
                    PR[p1] = 0;
                    PR[p2] = 0;
                }
                break;

            case ‘or’: // or-type compare
                if (!tmp_nat && tmp_rel) {
                    PR[p1] = 1;
                    PR[p2] = 1;
                }
                break;

            case ‘or.andcm’: // or.andcm-type compare
                if (!tmp_nat && tmp_rel) {
                    PR[p1] = 1;
                    PR[p2] = 0;
                }
                break;

            case ‘unc’: // unc-type compare
            default: // normal compare
                if (tmp_nat) {
                    PR[p1] = 0;
                    PR[p2] = 0;
                } else {
                    PR[p1] = tmp_rel;
                    PR[p2] = !tmp_rel;
                }
                break;
        }
    } else {
        if (ctype == ‘unc’) {
            if (p1 == p2)
                illegal_operation_fault();
            PR[p1] = 0;
            PR[p2] = 0;
        }
    }


cmpxchg

Compare And Exchange

Format: (qp) cmpxchgsz.sem.ldhint r1 = [r3], r2, ar.ccv M16

Description: A value consisting of sz bytes is read from memory starting at the address specified by the value in GR r3. The value is zero extended and compared with the contents of the cmpxchg Compare Value application register (AR[CCV]). If the two are equal, then the least significant sz bytes of the value in GR r2 are written to memory starting at the address specified by the value in GR r3. The zero-extended value read from memory is placed in GR r1 and the NaT bit corresponding to GR r1 is cleared.

The values of the sz completer are given in Table 6-14. The sem completer specifies the type of semaphore operation. These operations are described in Table 6-15.

If the address specified by the value in GR r3 is not naturally aligned to the size of the value being accessed in memory, an Unaligned Data Reference fault is taken independent of the state of the User Mask alignment checking bit, UM.ac (PSR.ac in the Processor Status Register).

The memory read and write are guaranteed to be atomic.

Both read and write access privileges for the referenced page are required. The write access privilege check is performed whether or not the memory write is performed.

The value of the ldhint completer specifies the locality of the memory access. The values of the ldhint completer are given in Table 6-28 on page 6-102. Locality hints do not affect program functionality and may be ignored by the implementation. See “Memory Hierarchy Control and Consistency” on page 4-16 for details.

Table 6-14. Memory Compare and Exchange Size

sz Completer  Bytes Accessed
1             1
2             2
4             4
8             8

Table 6-15. Compare and Exchange Semaphore Types

sem Completer  Ordering Semantics  Semaphore Operation
acq            Acquire             The memory read/write is made visible prior to all subsequent data memory accesses.
rel            Release             The memory read/write is made visible after all previous data memory accesses.



Operation:
    if (PR[qp]) {
        check_target_register(r1, SEMAPHORE);

        if (GR[r3].nat || GR[r2].nat)
            register_nat_consumption_fault(SEMAPHORE);

        paddr = tlb_translate(GR[r3], sz, SEMAPHORE, PSR.cpl, &mattr, &tmp_unused);

        if (!ma_supports_semaphores(mattr))
            unsupported_data_reference_fault(SEMAPHORE, GR[r3]);

        if (sem == ‘acq’) {
            val = mem_xchg_cond(AR[CCV], GR[r2], paddr, sz, UM.be, mattr, ACQUIRE, ldhint);
        } else { // ‘rel’
            val = mem_xchg_cond(AR[CCV], GR[r2], paddr, sz, UM.be, mattr, RELEASE, ldhint);
        }
        val = zero_ext(val, sz * 8);

        if (AR[CCV] == val)
            alat_inval_multiple_entries(paddr, sz);

        GR[r1] = val;
        GR[r1].nat = 0;
    }
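The data flow of cmpxchg can be illustrated with a small Python model operating on a bytearray. This is only a sketch under simplifying assumptions: the function name and arguments are hypothetical, address translation, NaT checks, ordering semantics and the ALAT are ignored, and a single-threaded Python function does not demonstrate the atomicity the instruction guarantees.

```python
# Sketch only: functional model of cmpxchg<sz> on a flat byte array.

def cmpxchg(mem, addr, sz, ccv, r2, big_endian=False):
    """Return the zero-extended old value; store the low sz bytes of r2
    to memory only when the old value equals ccv (the AR[CCV] contents)."""
    if sz not in (1, 2, 4, 8):
        raise ValueError("sz must be 1, 2, 4 or 8")
    if addr % sz:
        # Models the Unaligned Data Reference fault.
        raise ValueError("address is not naturally aligned")
    order = "big" if big_endian else "little"
    old = int.from_bytes(mem[addr:addr + sz], order)  # zero extended
    if old == ccv:
        low = r2 & ((1 << (8 * sz)) - 1)
        mem[addr:addr + sz] = low.to_bytes(sz, order)
    return old
```

Note that the comparison uses the full AR[CCV] value against the zero-extended memory value, so for sz smaller than 8 the upper bits of AR[CCV] must be zero for the exchange to succeed.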


czx

Compute Zero Index

Format: (qp) czx1.l r1 = r3    one_byte_form, left_form     I29
        (qp) czx1.r r1 = r3    one_byte_form, right_form    I29
        (qp) czx2.l r1 = r3    two_byte_form, left_form     I29
        (qp) czx2.r r1 = r3    two_byte_form, right_form    I29

Description: GR r3 is scanned for a zero element. The element is either an 8-bit aligned byte (one_byte_form) or a 16-bit aligned pair of bytes (two_byte_form). The index of the first zero element is placed in GR r1. If there are no zero elements in GR r3, a default value is placed in GR r1. Table 6-16 gives the possible result values. In the left_form, the source is scanned from most significant element to least significant element, and in the right_form it is scanned from least significant element to most significant element.


Operation:
    if (PR[qp]) {
        check_target_register(r1);

        if (one_byte_form) {
            if (left_form) { // scan from most significant down
                if      ((GR[r3] & 0xff00000000000000) == 0) GR[r1] = 0;
                else if ((GR[r3] & 0x00ff000000000000) == 0) GR[r1] = 1;
                else if ((GR[r3] & 0x0000ff0000000000) == 0) GR[r1] = 2;
                else if ((GR[r3] & 0x000000ff00000000) == 0) GR[r1] = 3;
                else if ((GR[r3] & 0x00000000ff000000) == 0) GR[r1] = 4;
                else if ((GR[r3] & 0x0000000000ff0000) == 0) GR[r1] = 5;
                else if ((GR[r3] & 0x000000000000ff00) == 0) GR[r1] = 6;
                else if ((GR[r3] & 0x00000000000000ff) == 0) GR[r1] = 7;
                else GR[r1] = 8;
            } else { // right_form scan from least significant up
                if      ((GR[r3] & 0x00000000000000ff) == 0) GR[r1] = 0;
                else if ((GR[r3] & 0x000000000000ff00) == 0) GR[r1] = 1;
                else if ((GR[r3] & 0x0000000000ff0000) == 0) GR[r1] = 2;
                else if ((GR[r3] & 0x00000000ff000000) == 0) GR[r1] = 3;
                else if ((GR[r3] & 0x000000ff00000000) == 0) GR[r1] = 4;
                else if ((GR[r3] & 0x0000ff0000000000) == 0) GR[r1] = 5;
                else if ((GR[r3] & 0x00ff000000000000) == 0) GR[r1] = 6;
                else if ((GR[r3] & 0xff00000000000000) == 0) GR[r1] = 7;
                else GR[r1] = 8;
            }
        } else { // two_byte_form
            if (left_form) { // scan from most significant down
                if      ((GR[r3] & 0xffff000000000000) == 0) GR[r1] = 0;
                else if ((GR[r3] & 0x0000ffff00000000) == 0) GR[r1] = 1;
                else if ((GR[r3] & 0x00000000ffff0000) == 0) GR[r1] = 2;
                else if ((GR[r3] & 0x000000000000ffff) == 0) GR[r1] = 3;
                else GR[r1] = 4;
            } else { // right_form scan from least significant up
                if      ((GR[r3] & 0x000000000000ffff) == 0) GR[r1] = 0;
                else if ((GR[r3] & 0x00000000ffff0000) == 0) GR[r1] = 1;
                else if ((GR[r3] & 0x0000ffff00000000) == 0) GR[r1] = 2;
                else if ((GR[r3] & 0xffff000000000000) == 0) GR[r1] = 3;
                else GR[r1] = 4;
            }
        }
        GR[r1].nat = GR[r3].nat;
    }

Table 6-16. Result Ranges for czx

Size  Element Width  Range of Result if    Default Result if
                     Zero Element Found    No Zero Element Found
1     8 bit          0-7                   8
2     16 bit         0-3                   4
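The chain of masks in the Operation section amounts to a loop over the elements of the register. A compact Python sketch of the same scan follows; the function name czx and its parameters are assumptions for illustration, and NaT propagation is not modeled.

```python
# Sketch only: loop form of the czx scan over a 64-bit value.

def czx(value, element_bytes, left):
    """Index of the first zero element, or the element count if none is zero.

    element_bytes is 1 (czx1) or 2 (czx2); left selects the .l form,
    which scans from the most significant element down.
    """
    width = 8 * element_bytes
    count = 64 // width            # 8 elements for czx1, 4 for czx2
    mask = (1 << width) - 1
    for index in range(count):
        shift = (count - 1 - index) * width if left else index * width
        if (value >> shift) & mask == 0:
            return index
    return count                   # default result from Table 6-16
```

A typical use of the right form on little-endian data is locating a terminating zero byte within eight bytes loaded at once.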


dep

Deposit

Format: (qp) dep r1 = r2, r3, pos6, len4      merge_form, register_form    I15
        (qp) dep r1 = imm1, r3, pos6, len6    merge_form, imm_form         I14
        (qp) dep.z r1 = r2, pos6, len6        zero_form, register_form     I12
        (qp) dep.z r1 = imm8, pos6, len6      zero_form, imm_form          I13

Description: In the merge_form, a right justified bit field taken from the first source operand is deposited into the value in GR r3 at an arbitrary bit position and the result is placed in GR r1. In the register_form the first source operand is GR r2; and in the imm_form it is the sign-extended value specified by imm1 (either all ones or all zeroes). The deposited bit field begins at the bit position specified by the pos6 immediate and extends to the left (towards the most significant bit) a number of bits specified by the len immediate. Note that len has a range of 1-16 in the register_form and 1-64 in the imm_form. The pos6 immediate has a range of 0 to 63.

In the zero_form, a right justified bit field taken from either the value in GR r2 (in the register_form) or the sign-extended value in imm8 (in the imm_form) is deposited into GR r1 and all other bits in GR r1 are cleared to zero. The deposited bit field begins at the bit position specified by the pos6 immediate and extends to the left (towards the most significant bit) a number of bits specified by the len immediate. The len immediate has a range of 1-64 and the pos6 immediate has a range of 0 to 63.

In the event that the deposited bit field extends beyond bit 63 of the target, i.e., len + pos6 > 64, the most significant len + pos6 - 64 bits of the deposited bit field are truncated. The len immediate is encoded as len minus 1 in the instruction.

The operation of dep t = s, r, 36, 16 is illustrated in Figure 6-5.


Operation:
    if (PR[qp]) {
        check_target_register(r1);

        if (imm_form) {
            tmp_src = (merge_form ? sign_ext(imm1, 1) : sign_ext(imm8, 8));
            tmp_nat = merge_form ? GR[r3].nat : 0;
            tmp_len = len6;
        } else { // register_form
            tmp_src = GR[r2];
            tmp_nat = (merge_form ? GR[r3].nat : 0) || GR[r2].nat;
            tmp_len = merge_form ? len4 : len6;
        }
        if (pos6 + tmp_len u> 64)
            tmp_len = 64 - pos6;

        if (merge_form)
            GR[r1] = GR[r3];
        else // zero_form
            GR[r1] = 0;

        GR[r1]{(pos6 + tmp_len - 1):pos6} = tmp_src{(tmp_len - 1):0};
        GR[r1].nat = tmp_nat;
    }
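The bit manipulation performed by dep can be sketched with ordinary shifts and masks. The Python below is a minimal model, assuming 64-bit unsigned values held in Python integers; the names dep, dep_z and MASK64 are chosen for this example only, and NaT handling is omitted.

```python
# Sketch only: shift-and-mask model of dep (merge_form) and dep.z (zero_form).

MASK64 = (1 << 64) - 1


def dep(src, target, pos, length):
    """Deposit the low `length` bits of src into target starting at bit pos."""
    if pos + length > 64:
        length = 64 - pos              # field is truncated at bit 63
    field_mask = ((1 << length) - 1) << pos
    cleared = target & ~field_mask & MASK64
    return cleared | ((src << pos) & field_mask)


def dep_z(src, pos, length):
    """Zero form: every bit outside the deposited field is cleared."""
    return dep(src, 0, pos, length)
```

For instance, dep(s, r, 36, 16) replaces bits 51..36 of r with bits 15..0 of s, which is the case drawn in Figure 6-5.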

Figure 6-5. Deposit Example
[Figure: bits 15..0 of GR s are deposited into bits 51..36 of GR r to form GR t.]


extr

Extract

Format: (qp) extr r1 = r3, pos6, len6      signed_form      I11
        (qp) extr.u r1 = r3, pos6, len6    unsigned_form    I11

Description: A field is extracted from GR r3, either zero extended or sign extended, and placed right-justified in GR r1. The field begins at the bit position specified by the pos6 immediate and extends len6 bits to the left. The extracted field is sign extended in the signed_form or zero extended in the unsigned_form. The sign is taken from the most significant bit of the extracted field. If the specified field extends beyond the most significant bit of GR r3, the sign is taken from the most significant bit of GR r3. The immediate value len6 can be any number in the range 1 to 64, and is encoded as len6-1 in the instruction. The immediate value pos6 can be any value in the range 0 to 63.

The operation of extr t = r, 7, 50 is illustrated in Figure 6-6.


Operation:
    if (PR[qp]) {
        check_target_register(r1);

        tmp_len = len6;

        if (pos6 + tmp_len u> 64)
            tmp_len = 64 - pos6;

        if (unsigned_form)
            GR[r1] = zero_ext(shift_right_unsigned(GR[r3], pos6), tmp_len);
        else // signed_form
            GR[r1] = sign_ext(shift_right_unsigned(GR[r3], pos6), tmp_len);

        GR[r1].nat = GR[r3].nat;
    }
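A short Python model may help make the sign-extension and truncation rules of extr concrete. It is a sketch under the assumption that registers are 64-bit unsigned Python integers; extr and MASK64 are names chosen for the example, and the NaT bit is not modeled.

```python
# Sketch only: model of extr (signed=True) and extr.u (signed=False).

MASK64 = (1 << 64) - 1


def extr(value, pos, length, signed=True):
    """Extract `length` bits of value starting at bit pos, right-justified."""
    if pos + length > 64:
        length = 64 - pos              # field ends at bit 63 of the source
    low_mask = (1 << length) - 1
    field = (value >> pos) & low_mask
    if signed and (field >> (length - 1)) & 1:
        # Replicate the field's top bit into all higher result bits.
        field |= MASK64 & ~low_mask
    return field
```

When the field is truncated, its most significant bit is bit 63 of the source, which matches the rule that the sign is then taken from the most significant bit of GR r3.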

Figure 6-6. Extract Example
[Figure: bits 56..7 of GR r are placed in bits 49..0 of GR t; bits 63..50 of GR t are filled with the sign.]


fabs

Floating-Point Absolute Value

Format: (qp) fabs f1 = f3 pseudo-op of: (qp) fmerge.s f1 = f0, f3

Description: The absolute value of the value in FR f3 is computed and placed in FR f1.

If FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.

Operation: See “Floating-Point Merge” on page 6-49.


fadd

Floating-Point Add

Format: (qp) fadd.pc.sf f1 = f3, f2    pseudo-op of: (qp) fma.pc.sf f1 = f3, f1, f2

Description: FR f3 and FR f2 are added (computed to infinite precision), rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc, and placed in FR f1. If either FR f3 or FR f2 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.

The mnemonic values for the opcode’s pc are given in Table 6-17. The mnemonic values for sf are given in Table 6-18. For the encodings and interpretation of the status field’s pc, wre, and rc, refer to Table 5-5 and Table 5-6 on page 5-5.

Operation: See “Floating-Point Multiply Add” on page 6-47.

Table 6-17. Specified pc Mnemonic Values

pc Mnemonic  Precision Specified
.s           single
.d           double
none         dynamic (i.e., use pc value in status field)

Table 6-18. sf Mnemonic Values

sf Mnemonic  Status Field Accessed
.s0 or none  sf0
.s1          sf1
.s2          sf2
.s3          sf3


famax

Floating-Point Absolute Maximum

Format: (qp) famax.sf f1 = f2, f3 F8

Description: The operand with the larger absolute value is placed in FR f1. If the magnitude of FR f2 equals the magnitude of FR f3, FR f1 gets FR f3.

If either FR f2 or FR f3 is a NaN, FR f1 gets FR f3.

If either FR f2 or FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.

This operation does not propagate NaNs the same way as other floating-point arithmetic operations. The Invalid Operation is signaled in the same manner as the fcmp.lt operation.

The mnemonic values for sf are given in Table 6-18 on page 6-30.

Operation:
    if (PR[qp]) {
        fp_check_target_register(f1);
        if (tmp_isrcode = fp_reg_disabled(f1, f2, f3, 0))
            disabled_fp_register_fault(tmp_isrcode, 0);

        if (fp_is_natval(FR[f2]) || fp_is_natval(FR[f3])) {
            FR[f1] = NATVAL;
        } else {
            fminmax_exception_fault_check(f2, f3, sf, &tmp_fp_env);
            if (fp_raise_fault(tmp_fp_env))
                fp_exception_fault(fp_decode_fault(tmp_fp_env));

            tmp_right = fp_reg_read(FR[f2]);
            tmp_left = fp_reg_read(FR[f3]);
            tmp_right.sign = FP_SIGN_POSITIVE;
            tmp_left.sign = FP_SIGN_POSITIVE;
            tmp_bool_res = fp_less_than(tmp_left, tmp_right);
            FR[f1] = tmp_bool_res ? FR[f2] : FR[f3];

            fp_update_fpsr(sf, tmp_fp_env);
        }

        fp_update_psr(f1);
    }

FP Exceptions: Invalid Operation (V)
               Denormal/Unnormal Operand (D)
               Software Assist (SWA) fault
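The selection rule of famax, including its tie and NaN behavior, can be mimicked with Python floats. This is a simplified sketch: the function name is an assumption, Python floats are IEEE doubles rather than the register format, and NaTVal and exception signaling are not modeled.

```python
# Sketch only: operand selection performed by famax, using Python floats.

def famax(f2, f3):
    """Return f2 only when |f3| < |f2|; otherwise return f3.

    A comparison involving a NaN is false, so f3 is returned whenever
    either operand is a NaN, and also when the magnitudes are equal.
    """
    return f2 if abs(f3) < abs(f2) else f3
```

Note that the result keeps the sign of the selected operand; only the comparison uses absolute values.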


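famin

Floating-Point Absolute Minimum

Format: (qp) famin.sf f1 = f2, f3 F8

Description: The operand with the smaller absolute value is placed in FR f1. If the magnitude of FR f2 equals the magnitude of FR f3, FR f1 gets FR f3.

If either FR f2 or FR f3 is a NaN, FR f1 gets FR f3.

If either FR f2 or FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.

This operation does not propagate NaNs the same way as other floating-point arithmetic operations. The Invalid Operation is signaled in the same manner as the fcmp.lt operation.

The mnemonic values for sf are given in Table 6-18 on page 6-30.

Operation:
    if (PR[qp]) {
        fp_check_target_register(f1);
        if (tmp_isrcode = fp_reg_disabled(f1, f2, f3, 0))
            disabled_fp_register_fault(tmp_isrcode, 0);

        if (fp_is_natval(FR[f2]) || fp_is_natval(FR[f3])) {
            FR[f1] = NATVAL;
        } else {
            fminmax_exception_fault_check(f2, f3, sf, &tmp_fp_env);
            if (fp_raise_fault(tmp_fp_env))
                fp_exception_fault(fp_decode_fault(tmp_fp_env));

            tmp_left = fp_reg_read(FR[f2]);
            tmp_right = fp_reg_read(FR[f3]);
            tmp_left.sign = FP_SIGN_POSITIVE;
            tmp_right.sign = FP_SIGN_POSITIVE;
            tmp_bool_res = fp_less_than(tmp_left, tmp_right);
            FR[f1] = tmp_bool_res ? FR[f2] : FR[f3];

            fp_update_fpsr(sf, tmp_fp_env);
        }

        fp_update_psr(f1);
    }

FP Exceptions: Invalid Operation (V)
               Denormal/Unnormal Operand (D)
               Software Assist (SWA) fault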



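fand

Floating-Point Logical And

Format: (qp) fand f1 = f2, f3 F9

Description: The bit-wise logical AND of the significand fields of FR f2 and FR f3 is computed. The resulting value is stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).

If either FR f2 or FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.

Operation:
    if (PR[qp]) {
        fp_check_target_register(f1);
        if (tmp_isrcode = fp_reg_disabled(f1, f2, f3, 0))
            disabled_fp_register_fault(tmp_isrcode, 0);

        if (fp_is_natval(FR[f2]) || fp_is_natval(FR[f3])) {
            FR[f1] = NATVAL;
        } else {
            FR[f1].significand = FR[f2].significand & FR[f3].significand;
            FR[f1].exponent = FP_INTEGER_EXP;
            FR[f1].sign = FP_SIGN_POSITIVE;
        }
        fp_update_psr(f1);
    }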

FP Exceptions: None


fandcm

Floating-Point And Complement

Format: (qp) fandcm f1 = f2, f3 F9

Description: The bit-wise logical AND of the significand field of FR f2 with the bit-wise complemented significand field of FR f3 is computed. The resulting value is stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).

If either FR f2 or FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.

Operation:
    if (PR[qp]) {
        fp_check_target_register(f1);
        if (tmp_isrcode = fp_reg_disabled(f1, f2, f3, 0))
            disabled_fp_register_fault(tmp_isrcode, 0);

        if (fp_is_natval(FR[f2]) || fp_is_natval(FR[f3])) {
            FR[f1] = NATVAL;
        } else {
            FR[f1].significand = FR[f2].significand & ~FR[f3].significand;
            FR[f1].exponent = FP_INTEGER_EXP;
            FR[f1].sign = FP_SIGN_POSITIVE;
        }
        fp_update_psr(f1);
    }

FP Exceptions: None


fc

Flush Cache

Format: (qp) fc r3 M28

Description: The cache line associated with the address specified by the value of GR r3 is invalidated from all levels of the processor cache hierarchy. The invalidation is broadcast throughout the coherence domain. If, at any level of the cache hierarchy, the line is inconsistent with memory it is written to memory before invalidation.

The line size affected is at least 32 bytes (aligned on a 32-byte boundary). An implementation may flush a larger region.

This instruction follows data dependency rules; it is ordered with respect to preceding and following memory references to the same line. fc has data dependencies in the sense that any prior stores by this processor will be included in the data written back to memory. fc is an unordered operation, and is not affected by a memory fence (mf) instruction. It is ordered with respect to the sync.i instruction.

Operation:
    if (PR[qp]) {
        itype = NON_ACCESS|FC|READ;
        if (GR[r3].nat)
            register_nat_consumption_fault(itype);
        tmp_paddr = tlb_translate_nonaccess(GR[r3], itype);
        mem_flush(tmp_paddr);
    }

fchkf

Floating-Point Check Flags

Format: (qp) fchkf.sf target25 F14

Description: The flags in FPSR.sf.flags are compared with FPSR.s0.flags and FPSR.traps. If any flags set in FPSR.sf.flags correspond to FPSR.traps which are enabled, or if any flags set in FPSR.sf.flags are not set in FPSR.s0.flags, then a branch to target25 is taken.

The target25 operand specifies a label to branch to. This is encoded in the instruction as a signed immediate displacement (imm21) between the target bundle and the bundle containing this instruction (imm21 = target25 - IP >> 4).


Operation:
    if (PR[qp]) {
        switch (sf) {
            case ‘s0’:
                tmp_flags = AR[FPSR].sf0.flags;
                break;
            case ‘s1’:
                tmp_flags = AR[FPSR].sf1.flags;
                break;
            case ‘s2’:
                tmp_flags = AR[FPSR].sf2.flags;
                break;
            case ‘s3’:
                tmp_flags = AR[FPSR].sf3.flags;
                break;
        }
        if ((tmp_flags & ~AR[FPSR].traps) || (tmp_flags & ~AR[FPSR].sf0.flags)) {
            if (check_branch_implemented(FCHKF)) {
                taken_branch = 1;
                IP = IP + sign_ext((imm21 << 4), 25);
                if ((PSR.it && unimplemented_virtual_address(IP))
                    || (!PSR.it && unimplemented_physical_address(IP)))
                    unimplemented_instruction_address_trap(0, IP);

            } else
                speculation_fault(FCHKF, zero_ext(imm21, 21));
        }
    }

FP Exceptions: None


fclass

Floating-Point Class

Format: (qp) fclass.fcrel.fctype p1, p2 = f2, fclass9 F5

Description: The contents of FR f2 are classified according to the fclass9 completer as shown in Table 6-20. This produces a boolean result based on whether the contents of FR f2 agrees with the floating-point number format specified by fclass9, as specified by the fcrel completer. This result is written to the two predicate register destinations, p1 and p2. The result written to the destinations is determined by the compare type specified by fctype.

The allowed types are Normal (or none) and unc. See Table 6-21 on page 6-40. The assembly syntax allows the specification of membership or non-membership and the assembler swaps the target predicates to achieve the desired effect.

A number agrees with the pattern specified by fclass9 if:

• the number is NaTVal and fclass9 {8} is 1, or

• the number is a quiet NaN and fclass9 {7} is 1, or

• the number is a signaling NaN and fclass9 {6} is 1, or

• the sign of the number agrees with the sign specified by one of the two low-order bits of fclass9, and the type of the number (disregarding the sign) agrees with the number-type specified by the next 4 bits of fclass9, as shown in Table 6-20.

Note: a fclass9 of 0x1FF is equivalent to testing for any supported operand.

The class names used in Table 6-20 are defined in Table 5-2 on page 5-2.

Table 6-19. Floating-point Class Relations

fcrel  Test Relation
m      FR f2 agrees with the pattern specified by fclass9 (is a member)
nm     FR f2 does not agree with the pattern specified by fclass9 (is not a member)

Table 6-20. Floating-point Classes

fclass9  Class           Mnemonic
Either these cases can be tested for
0x100    NaTVal          @nat
0x080    Quiet NaN       @qnan
0x040    Signaling NaN   @snan
or the OR of the following two cases
0x001    Positive        @pos
0x002    Negative        @neg
AND’ed with OR of the following 4 cases
0x004    Zero            @zero
0x008    Unnormalized    @unorm
0x010    Normalized      @norm
0x020    Infinity        @inf


Operation:
    if (PR[qp]) {
        if (p1 == p2)
            illegal_operation_fault();

        if (tmp_isrcode = fp_reg_disabled(f2, 0, 0, 0))
            disabled_fp_register_fault(tmp_isrcode, 0);

        tmp_rel = ((fclass9{0} && !FR[f2].sign || fclass9{1} && FR[f2].sign)
                   && ((fclass9{2} && fp_is_zero(FR[f2])) ||
                       (fclass9{3} && fp_is_unorm(FR[f2])) ||
                       (fclass9{4} && fp_is_normal(FR[f2])) ||
                       (fclass9{5} && fp_is_inf(FR[f2])))
                  )
                  || (fclass9{6} && fp_is_snan(FR[f2]))
                  || (fclass9{7} && fp_is_qnan(FR[f2]))
                  || (fclass9{8} && fp_is_natval(FR[f2]));

        tmp_nat = fp_is_natval(FR[f2]) && (!fclass9{8});

        if (tmp_nat) {
            PR[p1] = 0;
            PR[p2] = 0;
        } else {
            PR[p1] = tmp_rel;
            PR[p2] = !tmp_rel;
        }
    } else {
        if (fctype == ‘unc’) {
            if (p1 == p2)
                illegal_operation_fault();
            PR[p1] = 0;
            PR[p2] = 0;
        }
    }

FP Exceptions: None
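The membership test encoded by fclass9 can be expressed as a small Python predicate. In this sketch the operand is described symbolically (a class name plus a sign flag) instead of as a register value, since Python floats cannot represent NaTVal or unnormalized register-format numbers; the function name and the class-name strings are assumptions for the example.

```python
# Sketch only: fclass9 membership test on a symbolic operand description.

_TYPE_BIT = {"zero": 2, "unorm": 3, "norm": 4, "inf": 5}


def fclass_member(fclass9, kind, negative=False):
    """True when an operand of the given kind and sign matches fclass9.

    kind is one of "nat", "qnan", "snan", "zero", "unorm", "norm", "inf".
    """
    def bit(n):
        return (fclass9 >> n) & 1

    if kind == "nat":
        return bool(bit(8))
    if kind == "qnan":
        return bool(bit(7))
    if kind == "snan":
        return bool(bit(6))
    # Remaining kinds need both a sign bit and a number-type bit to match.
    sign_ok = bit(1) if negative else bit(0)
    return bool(sign_ok and bit(_TYPE_BIT[kind]))
```

For example, a pattern of 0x001 | 0x002 | 0x020 (@pos, @neg, @inf) matches infinities of either sign and nothing else.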


fclrf

Floating-Point Clear Flags

Format: (qp) fclrf.sf F13

Description: The status field’s 6-bit flags field is reset to zero.

The mnemonic values for sf are given in Table 6-18 on page 6-30.

Operation:
    if (PR[qp]) {
        fp_set_sf_flags(sf, 0);
    }

FP Exceptions: None

fcmp

Floating-Point Compare

Format: (qp) fcmp.frel.fctype.sf p1, p2 = f2, f3 F4

Description: The two source operands are compared for one of twelve relations specified by frel. This produces a boolean result which is 1 if the comparison condition is true, and 0 otherwise. This result is written to the two predicate register destinations, p1 and p2. The way the result is written to the destinations is determined by the compare type specified by fctype. The allowed types are Normal (or none) and unc.

The relations are defined for each of the comparison types in Table 6-22. Of the twelve relations, not all are directly implemented in hardware. Some are actually pseudo-ops. For these, the assembler simply switches the source operand specifiers and/or switches the predicate target specifiers and uses an implemented relation.

Table 6-21. Floating-point Comparison Types

fctype  PR[qp]==0        PR[qp]==1
                         result==0,          result==1,          One or More
                         No Source NaTVals   No Source NaTVals   Source NaTVals
        PR[p1]  PR[p2]   PR[p1]  PR[p2]      PR[p1]  PR[p2]      PR[p1]  PR[p2]
none                     0       1           1       0           0       0
unc     0       0        0       1           1       0           0       0

Table 6-22. Floating-point Comparison Relations

frel   frel Completer Unabbreviated  Relation      Pseudo-op of              Quiet NaN as Operand Signals Invalid
eq     equal                         f2 == f3                                No
lt     less than                     f2 < f3                                 Yes
le     less than or equal            f2 <= f3                                Yes
gt     greater than                  f2 > f3       lt  f2 ↔ f3               Yes
ge     greater than or equal         f2 >= f3      le  f2 ↔ f3               Yes
unord  unordered                     f2 ? f3                                 No
neq    not equal                     !(f2 == f3)   eq  p1 ↔ p2               No
nlt    not less than                 !(f2 < f3)    lt  p1 ↔ p2               Yes
nle    not less than or equal        !(f2 <= f3)   le  p1 ↔ p2               Yes
ngt    not greater than              !(f2 > f3)    lt  f2 ↔ f3  p1 ↔ p2      Yes
nge    not greater than or equal     !(f2 >= f3)   le  f2 ↔ f3  p1 ↔ p2      Yes
ord    ordered                       !(f2 ? f3)    unord  p1 ↔ p2            No


Operation:
    if (PR[qp]) {
        if (p1 == p2)
            illegal_operation_fault();

        if (tmp_isrcode = fp_reg_disabled(f2, f3, 0, 0))
            disabled_fp_register_fault(tmp_isrcode, 0);

        if (fp_is_natval(FR[f2]) || fp_is_natval(FR[f3])) {
            PR[p1] = 0;
            PR[p2] = 0;
        } else {
            fcmp_exception_fault_check(f2, f3, frel, sf, &tmp_fp_env);
            if (fp_raise_fault(tmp_fp_env))
                fp_exception_fault(fp_decode_fault(tmp_fp_env));

            tmp_fr2 = fp_reg_read(FR[f2]);
            tmp_fr3 = fp_reg_read(FR[f3]);

            if      (frel == ‘eq’)    tmp_rel = fp_equal(tmp_fr2, tmp_fr3);
            else if (frel == ‘lt’)    tmp_rel = fp_less_than(tmp_fr2, tmp_fr3);
            else if (frel == ‘le’)    tmp_rel = fp_lesser_or_equal(tmp_fr2, tmp_fr3);
            else if (frel == ‘gt’)    tmp_rel = fp_less_than(tmp_fr3, tmp_fr2);
            else if (frel == ‘ge’)    tmp_rel = fp_lesser_or_equal(tmp_fr3, tmp_fr2);
            else if (frel == ‘unord’) tmp_rel = fp_unordered(tmp_fr2, tmp_fr3);
            else if (frel == ‘neq’)   tmp_rel = !fp_equal(tmp_fr2, tmp_fr3);
            else if (frel == ‘nlt’)   tmp_rel = !fp_less_than(tmp_fr2, tmp_fr3);
            else if (frel == ‘nle’)   tmp_rel = !fp_lesser_or_equal(tmp_fr2, tmp_fr3);
            else if (frel == ‘ngt’)   tmp_rel = !fp_less_than(tmp_fr3, tmp_fr2);
            else if (frel == ‘nge’)   tmp_rel = !fp_lesser_or_equal(tmp_fr3, tmp_fr2);
            else                      tmp_rel = !fp_unordered(tmp_fr2, tmp_fr3); // ‘ord’

            PR[p1] = tmp_rel;
            PR[p2] = !tmp_rel;

            fp_update_fpsr(sf, tmp_fp_env);
        }
    } else {
        if (fctype == ‘unc’) {
            if (p1 == p2)
                illegal_operation_fault();
            PR[p1] = 0;
            PR[p2] = 0;
        }
    }
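The pseudo-op column of Table 6-22 can be exercised with a short Python sketch that implements only the four hardware relations (eq, lt, le, unord) and derives the other eight by swapping sources and/or predicate targets. The dictionaries and the function name fcmp are assumptions for this illustration; Python floats stand in for register values, and NaTVal, the unc type and exception signaling are not modeled.

```python
# Sketch only: deriving all twelve frel relations from eq, lt, le and unord.
import math

_HW = {
    "eq": lambda a, b: a == b,
    "lt": lambda a, b: a < b,
    "le": lambda a, b: a <= b,
    "unord": lambda a, b: math.isnan(a) or math.isnan(b),
}

# frel -> (hardware relation, swap f2/f3, swap p1/p2), following Table 6-22
_LOWERING = {
    "eq": ("eq", False, False),   "neq": ("eq", False, True),
    "lt": ("lt", False, False),   "nlt": ("lt", False, True),
    "le": ("le", False, False),   "nle": ("le", False, True),
    "gt": ("lt", True, False),    "ngt": ("lt", True, True),
    "ge": ("le", True, False),    "nge": ("le", True, True),
    "unord": ("unord", False, False), "ord": ("unord", False, True),
}


def fcmp(frel, f2, f3):
    """Return (p1, p2) as named by the programmer, for a normal compare."""
    hw_rel, swap_src, swap_pred = _LOWERING[frel]
    a, b = (f3, f2) if swap_src else (f2, f3)
    rel = _HW[hw_rel](a, b)
    p1, p2 = int(rel), int(not rel)
    return (p2, p1) if swap_pred else (p1, p2)
```

The negated relations differ from their positive counterparts only for unordered operands: with a NaN source, gt is false while ngt is true.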



fcvt.fx

Convert Floating-Point to Integer

Format: (qp) fcvt.fx.sf f1 = f2           signed_form                 F10
        (qp) fcvt.fx.trunc.sf f1 = f2     signed_form, trunc_form     F10
        (qp) fcvt.fxu.sf f1 = f2          unsigned_form               F10
        (qp) fcvt.fxu.trunc.sf f1 = f2    unsigned_form, trunc_form   F10

Description: FR f2 is treated as a register format floating-point value and converted to a signed (signed_form) or unsigned integer (unsigned_form) using either the rounding mode specified in the FPSR.sf.rc, or using Round-to-Zero if the trunc_form of the instruction is used. The result is placed in the 64-bit significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).



Operation:
    if (PR[qp]) {
        fp_check_target_register(f1);
        if (tmp_isrcode = fp_reg_disabled(f1, f2, 0, 0))
            disabled_fp_register_fault(tmp_isrcode, 0);

        if (fp_is_natval(FR[f2])) {
            FR[f1] = NATVAL;
            fp_update_psr(f1);
        } else {
            tmp_default_result = fcvt_exception_fault_check(f2, sf,
                                     signed_form, trunc_form, &tmp_fp_env);
            if (fp_raise_fault(tmp_fp_env))
                fp_exception_fault(fp_decode_fault(tmp_fp_env));

            if (fp_is_nan(tmp_default_result)) {
                FR[f1].significand = INTEGER_INDEFINITE;
                FR[f1].exponent = FP_INTEGER_EXP;
                FR[f1].sign = FP_SIGN_POSITIVE;
            } else {
                tmp_res = fp_ieee_rnd_to_int(fp_reg_read(FR[f2]), &tmp_fp_env);
                if (tmp_res.exponent)
                    tmp_res.significand = fp_U64_rsh(
                        tmp_res.significand, (FP_INTEGER_EXP - tmp_res.exponent));
                if (signed_form && tmp_res.sign)
                    tmp_res.significand = (~tmp_res.significand) + 1;

                FR[f1].significand = tmp_res.significand;
                FR[f1].exponent = FP_INTEGER_EXP;
                FR[f1].sign = FP_SIGN_POSITIVE;
            }

            fp_update_fpsr(sf, tmp_fp_env);
            fp_update_psr(f1);
            if (fp_raise_traps(tmp_fp_env))
                fp_exception_trap(fp_decode_trap(tmp_fp_env));
        }
    }

FP Exceptions: Invalid Operation (V)
               Denormal/Unnormal Operand (D)
               Software Assist (SWA) fault
               Inexact (I)


fcvt.xf

Convert Signed Integer to Floating-point

Format: (qp) fcvt.xf f1 = f2 F11

Description: The 64-bit significand of FR f2 is treated as a signed integer and its register file precision floating-point representation is placed in FR f1.


This operation is always exact and is unaffected by the rounding mode.



Operation:
    if (PR[qp]) {
        fp_check_target_register(f1);
        if (tmp_isrcode = fp_reg_disabled(f1, f2, 0, 0))
            disabled_fp_register_fault(tmp_isrcode, 0);

        if (fp_is_natval(FR[f2])) {
            FR[f1] = NATVAL;
        } else {
            tmp_res = FR[f2];
            if (tmp_res.significand{63}) {
                tmp_res.significand = (~tmp_res.significand) + 1;
                tmp_res.sign = 1;
            } else
                tmp_res.sign = 0;

            tmp_res.exponent = FP_INTEGER_EXP;
            tmp_res = fp_normalize(tmp_res);

            FR[f1].significand = tmp_res.significand;
            FR[f1].exponent = tmp_res.exponent;
            FR[f1].sign = tmp_res.sign;
        }
        fp_update_psr(f1);
    }

FP Exceptions: None


fcvt.xuf

Convert Unsigned Integer to Floating-point

Format: (qp) fcvt.xuf.pc.sf f1 = f3    (unsigned form)    pseudo-op of: (qp) fma.pc.sf f1 = f3, f1, f0

Description: FR f3 is multiplied with FR 1, rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc, and placed in FR f1.

Note: Multiplying FR f3 with FR 1 (a 1.0) normalizes the canonical representation of an integer in the floating-point register file producing a normal floating-point value.


The mnemonic values for the opcode’s pc are given in Table 6-17 on page 6-30. The mnemonic values for sf are given in Table 6-18 on page 6-30. For the encodings and interpretation of the status field’s pc, wre, and rc, refer to Table 5-5 and Table 5-6 on page 5-5.

Operation: See “Floating-Point Multiply Add” on page 6-47


fetchadd

Fetch And Add Immediate

Format: (qp) fetchadd4.sem.ldhint r1 = [r3], inc3    four_byte_form     M17
        (qp) fetchadd8.sem.ldhint r1 = [r3], inc3    eight_byte_form    M17

Description: A value consisting of four or eight bytes is read from memory starting at the address specified by the value in GR r3. The value is zero extended and added to the sign-extended immediate value specified by inc3. The values that may be specified by inc3 are: -16, -8, -4, -1, 1, 4, 8, 16. The least significant four or eight bytes of the sum are then written to memory starting at the address specified by the value in GR r3. The zero-extended value read from memory is placed in GR r1 and the NaT bit corresponding to GR r1 is cleared.

The sem completer specifies the type of semaphore operation. These operations are described in Table 6-23.



Both read and write access privileges for the referenced page are required. The write access privilege check is performed whether or not the memory write is performed.

The value of the ldhint completer specifies the locality of the memory access. The values of the ldhint completer are given in Table 6-28 on page 6-102. Locality hints do not affect program functionality and may be ignored by the implementation.


Operation:
    if (PR[qp]) {
        check_target_register(r1, SEMAPHORE);

        if (GR[r3].nat)
            register_nat_consumption_fault(SEMAPHORE);

        size = four_byte_form ? 4 : 8;

        paddr = tlb_translate(GR[r3], size, SEMAPHORE, PSR.cpl, &mattr, &tmp_unused);
        if (!ma_supports_fetchadd(mattr))
            unsupported_data_reference_fault(SEMAPHORE, GR[r3]);

        if (sem == ‘acq’)
            val = mem_xchg_add(inc3, paddr, size, UM.be, mattr, ACQUIRE, ldhint);
        else // ‘rel’
            val = mem_xchg_add(inc3, paddr, size, UM.be, mattr, RELEASE, ldhint);

        alat_inval_multiple_entries(paddr, size);

        GR[r1] = zero_ext(val, size * 8);
        GR[r1].nat = 0;
    }
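The arithmetic performed by fetchadd, in particular the wrap-around of the stored sum, is shown in the Python sketch below. The function name, the VALID_INC tuple and the bytearray memory are assumptions made for the example; translation, ordering, the ALAT and atomicity are outside its scope.

```python
# Sketch only: functional model of fetchadd4 / fetchadd8 on a byte array.

VALID_INC = (-16, -8, -4, -1, 1, 4, 8, 16)


def fetchadd(mem, addr, size, inc, big_endian=False):
    """Add inc to the size-byte value at addr and return the old value."""
    if size not in (4, 8):
        raise ValueError("size must be 4 or 8")
    if inc not in VALID_INC:
        raise ValueError("inc3 cannot encode this increment")
    if addr % size:
        raise ValueError("address is not naturally aligned")
    order = "big" if big_endian else "little"
    old = int.from_bytes(mem[addr:addr + size], order)   # zero extended
    new = (old + inc) % (1 << (8 * size))                # keep low bytes only
    mem[addr:addr + size] = new.to_bytes(size, order)
    return old
```

Because the old value is returned zero extended, a four-byte counter holding 0xFFFFFFFF is reported as 4294967295 rather than -1, and incrementing it by 1 stores zero.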

Table 6-23. Fetch and Add Semaphore Types

sem Completer  Ordering Semantics  Semaphore Operation
acq            Acquire             The memory read/write is made visible prior to all subsequent data memory accesses.
rel            Release             The memory read/write is made visible after all previous data memory accesses.


flushrs

Flush Register Stack

Format: flushrs M25

Description: All stacked general registers in the dirty partition of the register stack are written to the backing store before execution continues. The dirty partition contains registers from previous procedure frames that have not yet been saved to the backing store.

After this instruction completes execution AR[BSPSTORE] is equal to AR[BSP].

This instruction must be the first instruction in an instruction group. Otherwise, the results are undefined. This instruction cannot be predicated.

Operation:
    while (AR[BSPSTORE] != AR[BSP]) {
        rse_store(MANDATORY); // increments AR[BSPSTORE]
        deliver_unmasked_pending_external_interrupt();
    }


fma

Floating-Point Multiply Add

Format: (qp) fma.pc.sf f1 = f3, f4, f2 F1

Description: The product of FR f3 and FR f4 is computed to infinite precision and then FR f2 is added to this product, again in infinite precision. The resulting value is then rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc. The rounded result is placed in FR f1.

If any of FR f3, FR f4, or FR f2 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.

If f2 is f0, an IEEE multiply operation is performed instead of a multiply and add. See “Floating-Point Multiply” on page 6-54.
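The effect of rounding only once, after the infinitely precise product and sum, can be illustrated with a small Python sketch. It assumes double precision as the target and round-to-nearest as the rounding mode; `fused_mul_add` and `separate_mul_add` are hypothetical helper names, and exact rational arithmetic stands in for the infinite-precision intermediate.

```python
from fractions import Fraction

def fused_mul_add(a, b, c):
    # Exact product and exact sum, then a single rounding to double.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

def separate_mul_add(a, b, c):
    # The product is rounded to double first, then the sum is rounded again.
    return a * b + c

a = 1.0 + 2.0**-30          # a*a = 1 + 2**-29 + 2**-60 exactly
c = -(1.0 + 2.0**-29)

fused = fused_mul_add(a, a, c)        # keeps the 2**-60 term
separate = separate_mul_add(a, a, c)  # the 2**-60 term is lost when a*a is rounded
```

With these inputs the fused form returns 2.0**-60 while the two-step form returns 0.0, because the low-order term of the product does not survive the intermediate rounding.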


Operation: if (PR[qp]) {
    fp_check_target_register(f1);
    if (tmp_isrcode = fp_reg_disabled(f1, f2, f3, f4))


    if (fp_is_natval(FR[f2]) || fp_is_natval(FR[f3]) || fp_is_natval(FR[f4])) {
        FR[f1] = NATVAL;
        fp_update_psr(f1);
    } else {
        tmp_default_result = fma_exception_fault_check(f2, f3, f4,
                                 pc, sf, &tmp_fp_env);
        if (fp_raise_fault(tmp_fp_env))


        if (fp_is_nan_or_inf(tmp_default_result)) {
            FR[f1] = tmp_default_result;
        } else {
            tmp_res = fp_mul(fp_reg_read(FR[f3]), fp_reg_read(FR[f4]));
            if (f2 != 0)
                tmp_res = fp_add(tmp_res, fp_reg_read(FR[f2]), tmp_fp_env);
            FR[f1] = fp_ieee_round(tmp_res, &tmp_fp_env);
        }



}

FP Exceptions: Invalid Operation (V), Overflow (O), Denormal/Unnormal Operand (D), Inexact (I), Software Assist (SWA) fault, Software Assist (SWA) trap, Underflow (U)


fmax

Floating-Point Maximum

Format: (qp) fmax.sf f1 = f2, f3 F8

Description: The operand with the larger value is placed in FR f1. If FR f2 equals FR f3, FR f1 gets FR f3.
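The selection rule can be sketched in Python as follows. This is only an illustration of which operand is chosen: `fmax` here is a hypothetical helper acting on Python floats, and NaTVal handling, status flags, and faults are left out.

```python
import math

def fmax(f2, f3):
    # f2 is chosen only when f3 < f2; in every other case (equal values,
    # or an unordered comparison involving a NaN) the result is f3.
    return f2 if f3 < f2 else f3

tie = fmax(0.0, -0.0)            # +0 and -0 compare equal, so f3 is returned
nan_in_f2 = fmax(math.nan, 1.0)  # the comparison is false, so f3 is returned
```

One consequence of this rule is that the operation is not symmetric in its operands when the values are equal or when one of them is a NaN.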










        tmp_bool_res = fp_less_than(fp_reg_read(FR[f3]), fp_reg_read(FR[f2]));
        FR[f1] = (tmp_bool_res ? FR[f2] : FR[f3]);

        fp_update_fpsr(sf, tmp_fp_env);
    }
    fp_update_psr(f1);

}



fmerge

Floating-Point Merge

Format: (qp) fmerge.ns f1 = f2, f3    neg_sign_form    F9
        (qp) fmerge.s f1 = f2, f3     sign_form        F9
        (qp) fmerge.se f1 = f2, f3    sign_exp_form    F9

Description: Sign, exponent and significand fields are extracted from FR f2 and FR f3, combined, and the result is placed in FR f1.

For the neg_sign_form, the sign of FR f2 is negated and concatenated with the exponent and the significand of FR f3. This form can be used to negate a floating-point number by using the same register for FR f2 and FR f3.

For the sign_form, the sign of FR f2 is concatenated with the exponent and the significand of FR f3.

For the sign_exp_form, the sign and exponent of FR f2 are concatenated with the significand of FR f3.

For all forms, if either FR f2 or FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.
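The three forms differ only in which register supplies each field, which a short Python sketch can make explicit. The `FR` tuple and the `fmerge` function are hypothetical illustrations, and the NaTVal case is deliberately not modelled.

```python
from collections import namedtuple

# Hypothetical stand-in for a floating-point register: three separate fields.
FR = namedtuple("FR", "sign exponent significand")

def fmerge(form, f2, f3):
    if form == "ns":   # negated sign of f2, exponent and significand of f3
        return FR(f2.sign ^ 1, f3.exponent, f3.significand)
    if form == "s":    # sign of f2, exponent and significand of f3
        return FR(f2.sign, f3.exponent, f3.significand)
    if form == "se":   # sign and exponent of f2, significand of f3
        return FR(f2.sign, f2.exponent, f3.significand)
    raise ValueError("unknown form: " + form)

x = FR(sign=0, exponent=0xFFFF, significand=1 << 63)
negated = fmerge("ns", x, x)   # same register for both sources negates the value
```

Passing the same register as both sources to the "ns" form flips only the sign, which is the negation idiom the description mentions.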

Figure 6-7. Floating-point Merge Negative Sign Operation
[Diagram: FR f2 and FR f3 (bit fields 81, 80:64, 63:0) are combined into FR f1; the negated sign bit comes from FR f2.]

Figure 6-8. Floating-point Merge Sign Operation
[Diagram: FR f2 and FR f3 (bit fields 81, 80:64, 63:0) are combined into FR f1.]

Figure 6-9. Floating-point Merge Sign and Exponent Operation
[Diagram: FR f2 and FR f3 (bit fields 81, 80:64, 63:0) are combined into FR f1.]


    } else {
        FR[f1].significand = FR[f3].significand;
        if (neg_sign_form) {
            FR[f1].exponent = FR[f3].exponent;
            FR[f1].sign = !FR[f2].sign;
        } else if (sign_form) {
            FR[f1].exponent = FR[f3].exponent;
            FR[f1].sign = FR[f2].sign;
        } else { // sign_exp_form
            FR[f1].exponent = FR[f2].exponent;
            FR[f1].sign = FR[f2].sign;
        }
    }

    fp_update_psr(f1);
}

FP Exceptions: None


fmin

Floating-Point Minimum

Format: (qp) fmin.sf f1 = f2, f3 F8

Description: The operand with the smaller value is placed in FR f1. If FR f2 equals FR f3, FR f1 gets FR f3.










        tmp_bool_res = fp_less_than(fp_reg_read(FR[f2]), fp_reg_read(FR[f3]));
        FR[f1] = tmp_bool_res ? FR[f2] : FR[f3];


}



fmix

Floating-Point Parallel Mix

Format: (qp) fmix.l f1 = f2, f3     mix_l_form     F9
        (qp) fmix.r f1 = f2, f3     mix_r_form     F9
        (qp) fmix.lr f1 = f2, f3    mix_lr_form    F9

Description: For the mix_l_form (mix_r_form), the left (right) single precision value in FR f2 is concatenated with the left (right) single precision value in FR f3. For the mix_lr_form, the left single precision value in FR f2 is concatenated with the right single precision value in FR f3.

For all forms, the exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).
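On the 64-bit significand alone, the three forms reduce to selecting 32-bit halves. The following Python sketch shows this; `fmix` is a hypothetical helper that takes and returns significands as integers, and the fixed exponent and sign fields are not represented.

```python
def fmix(form, sig2, sig3):
    hi2, lo2 = sig2 >> 32, sig2 & 0xFFFFFFFF   # left and right halves of f2
    hi3, lo3 = sig3 >> 32, sig3 & 0xFFFFFFFF   # left and right halves of f3
    if form == "l":
        hi, lo = hi2, hi3      # left of f2, left of f3
    elif form == "r":
        hi, lo = lo2, lo3      # right of f2, right of f3
    elif form == "lr":
        hi, lo = hi2, lo3      # left of f2, right of f3
    else:
        raise ValueError("unknown form: " + form)
    return (hi << 32) | lo

mixed = fmix("lr", 0xAAAAAAAABBBBBBBB, 0xCCCCCCCCDDDDDDDD)
```

In every form the half taken from f2 lands in the upper 32 bits of the result and the half taken from f3 in the lower 32 bits.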


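Figure 6-10. Floating-point Mix Left
[Diagram: bits 63:32 of f2 and bits 63:32 of f3 form bits 63:32 and 31:0 of f1.]

Figure 6-11. Floating-point Mix Right
[Diagram: bits 31:0 of f2 and bits 31:0 of f3 form bits 63:32 and 31:0 of f1.]

Figure 6-12. Floating-point Mix Left-Right
[Diagram: bits 63:32 of f2 and bits 31:0 of f3 form bits 63:32 and 31:0 of f1.]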


    } else {
        if (mix_l_form) {
            tmp_res_hi = FR[f2].significand{63:32};
            tmp_res_lo = FR[f3].significand{63:32};
        } else if (mix_r_form) {
            tmp_res_hi = FR[f2].significand{31:0};
            tmp_res_lo = FR[f3].significand{31:0};
        } else { // mix_lr_form
            tmp_res_hi = FR[f2].significand{63:32};
            tmp_res_lo = FR[f3].significand{31:0};
        }
        FR[f1].significand = fp_concatenate(tmp_res_hi, tmp_res_lo);
        FR[f1].exponent = FP_INTEGER_EXP;
        FR[f1].sign = FP_SIGN_POSITIVE;
    }

    fp_update_psr(f1);
}

FP Exceptions: None


fmpy

Floating-Point Multiply

Format: (qp) fmpy.pc.sf f1 = f3, f4    pseudo-op of: (qp) fma.pc.sf f1 = f3, f4, f0

Description: The product of FR f3 and FR f4 is computed to infinite precision. The resulting value is then rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc. The rounded result is placed in FR f1.





fms

Floating-Point Multiply Subtract

Format: (qp) fms.pc.sf f1 = f3, f4, f2 F1

Description: The product of FR f3 and FR f4 is computed to infinite precision and then FR f2 is subtracted from this product, again in infinite precision. The resulting value is then rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc. The rounded result is placed in FR f1.

If any of FR f3, FR f4, or FR f2 is a NaTVal, a NaTVal is placed in FR f1 instead of the computed result.

If f2 is f0, an IEEE multiply operation is performed instead of a multiply and subtract.





    } else {
        tmp_default_result = fms_fnma_exception_fault_check(f2, f3, f4,



        } else {
            tmp_res = fp_mul(fp_reg_read(FR[f3]), fp_reg_read(FR[f4]));
            tmp_fr2 = fp_reg_read(FR[f2]);
            tmp_fr2.sign = !tmp_fr2.sign;
            if (f2 != 0)
                tmp_res = fp_add(tmp_res, tmp_fr2, tmp_fp_env);
            FR[f1] = fp_ieee_round(tmp_res, &tmp_fp_env);
        }



}



fneg

Floating-Point Negate

Format: (qp) fneg f1 = f3 pseudo-op of: (qp) fmerge.ns f1 = f3, f3

Description: The value in FR f3 is negated and placed in FR f1.




fnegabs

Floating-Point Negate Absolute Value

Format: (qp) fnegabs f1 = f3 pseudo-op of: (qp) fmerge.ns f1 = f0, f3

Description: The absolute value of the value in FR f3 is computed, negated, and placed in FR f1.




fnma

Floating-Point Negative Multiply Add

Format: (qp) fnma.pc.sf f1 = f3, f4, f2 F1

Description: The product of FR f3 and FR f4 is computed to infinite precision, negated, and then FR f2 is added to this product, again in infinite precision. The resulting value is then rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc. The rounded result is placed in FR f1.


If f2 is f0, an IEEE multiply operation is performed, followed by negation of the product.





    } else {
        tmp_default_result = fms_fnma_exception_fault_check(f2, f3, f4,



        } else {
            tmp_res = fp_mul(fp_reg_read(FR[f3]), fp_reg_read(FR[f4]));
            tmp_res.sign = !tmp_res.sign;
            if (f2 != 0)
                tmp_res = fp_add(tmp_res, fp_reg_read(FR[f2]), tmp_fp_env);
            FR[f1] = fp_ieee_round(tmp_res, &tmp_fp_env);
        }



}



fnmpy

Floating-Point Negative Multiply

Format: (qp) fnmpy.pc.sf f1 = f3, f4    pseudo-op of: (qp) fnma.pc.sf f1 = f3, f4, f0

Description: The product of FR f3 and FR f4 is computed to infinite precision and then negated. The resulting value is then rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc. The rounded result is placed in FR f1.



Operation: See “Floating-Point Negative Multiply Add” on page 6-58.


fnorm

Floating-Point Normalize

Format: (qp) fnorm.pc.sf f1 = f3    pseudo-op of: (qp) fma.pc.sf f1 = f3, f1, f0

Description: FR f3 is normalized and rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc, and placed in FR f1.





for

Floating-Point Logical Or

Format: (qp) for f1 = f2, f3 F9

Description: The bit-wise logical OR of the significand fields of FR f2 and FR f3 is computed. The resulting value is stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).






    } else {
        FR[f1].significand = FR[f2].significand | FR[f3].significand;
        FR[f1].exponent = FP_INTEGER_EXP;
        FR[f1].sign = FP_SIGN_POSITIVE;
    }

    fp_update_psr(f1);
}

FP Exceptions: None


fpabs

Floating-Point Parallel Absolute Value

Format: (qp) fpabs f1 = f3    pseudo-op of: (qp) fpmerge.s f1 = f0, f3

Description: The absolute values of the pair of single precision values in the significand field of FR f3 are computed and stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).


Operation: See “Floating-Point Parallel Merge” on page 6-73.


fpack

Floating-Point Pack

Format: (qp) fpack f1 = f2, f3    pack_form    F9

Description: The register format numbers in FR f2 and FR f3 are converted to single precision memory format. These two single precision numbers are concatenated and stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).
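As a rough illustration of the packing step, the Python sketch below converts two values to IEEE single precision bit patterns and concatenates them. It assumes Python floats as stand-ins for register format values and uses the standard `struct` module for the conversion; `fpack` is a hypothetical helper, and only in-range values are considered.

```python
import struct

def single_bits(x):
    """Bit pattern of x after conversion to IEEE single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def fpack(f2, f3):
    # f2 supplies the upper 32 bits of the significand, f3 the lower 32 bits.
    return (single_bits(f2) << 32) | single_bits(f3)

packed = fpack(1.0, -2.0)
```

Here 1.0 becomes 0x3F800000 in the upper half and -2.0 becomes 0xC0000000 in the lower half.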





    } else {
        tmp_res_hi = fp_single(FR[f2]);
        tmp_res_lo = fp_single(FR[f3]);

        FR[f1].significand = fp_concatenate(tmp_res_hi, tmp_res_lo);
        FR[f1].exponent = FP_INTEGER_EXP;
        FR[f1].sign = FP_SIGN_POSITIVE;
    }
    fp_update_psr(f1);

}

FP Exceptions: None

Figure 6-13. Floating-point Pack
[Diagram: f2 and f3 (bits 81:0) each pass through an 82-bit FR to single memory format conversion and are placed in bits 63:32 and 31:0 of f1.]


fpamax

Floating-Point Parallel Absolute Maximum

Format: (qp) fpamax.sf f1 = f2, f3 F8

Description: The paired single precision values in the significands of FR f2 and FR f3 are compared. The operands with the larger absolute value are returned in the significand field of FR f1.

If the magnitude of high (low) FR f3 is less than the magnitude of high (low) FR f2, high (low) FR f1 gets high (low) FR f2. Otherwise high (low) FR f1 gets high (low) FR f3.

If high (low) FR f2 or high (low) FR f3 is a NaN, and neither FR f2 nor FR f3 is a NaTVal, high (low) FR f1 gets high (low) FR f3.

The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).


This operation does not propagate NaNs the same way as other floating-point arithmetic operations. The Invalid Operation is signaled in the same manner as for the fpcmp.lt operation.





    } else {
        fpminmax_exception_fault_check(f2, f3, sf, &tmp_fp_env);
        if (fp_raise_fault(tmp_fp_env))


        tmp_fr2 = tmp_right = fp_reg_read_hi(f2);
        tmp_fr3 = tmp_left = fp_reg_read_hi(f3);
        tmp_right.sign = FP_SIGN_POSITIVE;
        tmp_left.sign = FP_SIGN_POSITIVE;
        tmp_bool_res = fp_less_than(tmp_left, tmp_right);
        tmp_res_hi = fp_single(tmp_bool_res ? tmp_fr2 : tmp_fr3);

        tmp_fr2 = tmp_right = fp_reg_read_lo(f2);
        tmp_fr3 = tmp_left = fp_reg_read_lo(f3);
        tmp_right.sign = FP_SIGN_POSITIVE;
        tmp_left.sign = FP_SIGN_POSITIVE;
        tmp_bool_res = fp_less_than(tmp_left, tmp_right);
        tmp_res_lo = fp_single(tmp_bool_res ? tmp_fr2 : tmp_fr3);



}



fpamin

Floating-Point Parallel Absolute Minimum

Format: (qp) fpamin.sf f1 = f2, f3 F8

Description: The paired single precision values in the significands of FR f2 and FR f3 are compared. The operands with the smaller absolute value are returned in the significand of FR f1.

If the magnitude of high (low) FR f2 is less than the magnitude of high (low) FR f3, high (low) FR f1 gets high (low) FR f2. Otherwise high (low) FR f1 gets high (low) FR f3.

If high (low) FR f2 or high (low) FR f3 is a NaN, and neither FR f2 nor FR f3 is a NaTVal, high (low) FR f1 gets high (low) FR f3.


If either FR f2 or FR f3 is NaTVal, FR f1 is set to NaTVal instead of the computed result.








        tmp_fr2 = tmp_left = fp_reg_read_hi(f2);
        tmp_fr3 = tmp_right = fp_reg_read_hi(f3);
        tmp_left.sign = FP_SIGN_POSITIVE;
        tmp_right.sign = FP_SIGN_POSITIVE;
        tmp_bool_res = fp_less_than(tmp_left, tmp_right);
        tmp_res_hi = fp_single(tmp_bool_res ? tmp_fr2 : tmp_fr3);

        tmp_fr2 = tmp_left = fp_reg_read_lo(f2);
        tmp_fr3 = tmp_right = fp_reg_read_lo(f3);
        tmp_left.sign = FP_SIGN_POSITIVE;
        tmp_right.sign = FP_SIGN_POSITIVE;
        tmp_bool_res = fp_less_than(tmp_left, tmp_right);
        tmp_res_lo = fp_single(tmp_bool_res ? tmp_fr2 : tmp_fr3);



}


fpcmp

Floating-Point Parallel Compare

Format: (qp) fpcmp.frel.sf f1 = f2, f3 F8

Description: The two pairs of single precision source operands in the significand fields of FR f2 and FR f3 are compared for one of twelve relations specified by frel. This produces a boolean result which is a mask of 32 1’s if the comparison condition is true, and a mask of 32 0’s otherwise. This result is written to a pair of 32-bit integers in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).



The relations are defined for each of the comparison types in Table 6-24. Of the twelve relations, not all are directly implemented in hardware. Some are actually pseudo-ops. For these, the assembler simply switches the source operand specifiers and/or switches the predicate type specifiers and uses an implemented relation.


Table 6-24. Floating-point Parallel Comparison Results

PR[qp]==0    PR[qp]==1
             result==false,       result==true,        One or More
             No Source NaTVals    No Source NaTVals    Source NaTVals
unchanged    0...0                1...1                NaTVal

Table 6-25. Floating-point Parallel Comparison Relations

frel     frel Completer Unabbreviated    Relation        Pseudo-op of    Quiet NaN as Operand Signals Invalid
eq       equal                           f2 == f3                        No
lt       less than                       f2 < f3                         Yes
le       less than or equal              f2 <= f3                        Yes
gt       greater than                    f2 > f3         lt f2 ↔ f3      Yes
ge       greater than or equal           f2 >= f3        le f2 ↔ f3      Yes
unord    unordered                       f2 ? f3                         No
neq      not equal                       !(f2 == f3)                     No
nlt      not less than                   !(f2 < f3)                      Yes
nle      not less than or equal          !(f2 <= f3)                     Yes
ngt      not greater than                !(f2 > f3)      nlt f2 ↔ f3     Yes
nge      not greater than or equal       !(f2 >= f3)     nle f2 ↔ f3     Yes
ord      ordered                         !(f2 ? f3)                      No
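The per-lane mask result can be sketched in Python for a single relation. The sketch assumes the `lt` relation only, treats each 32-bit half of a significand as an IEEE single precision bit pattern, and ignores NaTVal inputs and exception signaling; `pack_pair`, `unpack_pair`, and `fpcmp_lt` are hypothetical helper names.

```python
import struct

def pack_pair(hi, lo):
    """Build a 64-bit significand from two single precision values."""
    bits = lambda x: struct.unpack(">I", struct.pack(">f", x))[0]
    return (bits(hi) << 32) | bits(lo)

def unpack_pair(sig):
    """Split a 64-bit significand into (high, low) single precision values."""
    val = lambda b: struct.unpack(">f", struct.pack(">I", b))[0]
    return val(sig >> 32), val(sig & 0xFFFFFFFF)

def fpcmp_lt(sig2, sig3):
    mask = 0
    # High lane first, then low lane; each lane yields 32 ones or 32 zeros.
    for a, b in zip(unpack_pair(sig2), unpack_pair(sig3)):
        mask = (mask << 32) | (0xFFFFFFFF if a < b else 0x00000000)
    return mask

result = fpcmp_lt(pack_pair(1.0, 5.0), pack_pair(2.0, 3.0))
```

In this example the high lane compares true (1.0 < 2.0) and the low lane false (5.0 < 3.0), so only the upper 32 bits of the result are set. A NaN in either lane makes the `lt` comparison false for that lane.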


Operation: if (PR[qp]) {
    fp_check_target_register(f1);
    if (tmp_isrcode = fp_reg_disabled(f1, f2, f3, 0))


    } else {
        fpcmp_exception_fault_check(f2, f3, frel, sf, &tmp_fp_env);

        if (fp_raise_fault(tmp_fp_env))
            fp_exception_fault(fp_decode_fault(tmp_fp_env));

        tmp_fr2 = fp_reg_read_hi(f2);
        tmp_fr3 = fp_reg_read_hi(f3);

        if      (frel == ‘eq’)    tmp_rel = fp_equal(tmp_fr2, tmp_fr3);
        else if (frel == ‘lt’)    tmp_rel = fp_less_than(tmp_fr2, tmp_fr3);
        else if (frel == ‘le’)    tmp_rel = fp_lesser_or_equal(tmp_fr2, tmp_fr3);
        else if (frel == ‘gt’)    tmp_rel = fp_less_than(tmp_fr3, tmp_fr2);
        else if (frel == ‘ge’)    tmp_rel = fp_lesser_or_equal(tmp_fr3, tmp_fr2);
        else if (frel == ‘unord’) tmp_rel = fp_unordered(tmp_fr2, tmp_fr3);
        else if (frel == ‘neq’)   tmp_rel = !fp_equal(tmp_fr2, tmp_fr3);
        else if (frel == ‘nlt’)   tmp_rel = !fp_less_than(tmp_fr2, tmp_fr3);
        else if (frel == ‘nle’)   tmp_rel = !fp_lesser_or_equal(tmp_fr2, tmp_fr3);
        else if (frel == ‘ngt’)   tmp_rel = !fp_less_than(tmp_fr3, tmp_fr2);
        else if (frel == ‘nge’)   tmp_rel = !fp_lesser_or_equal(tmp_fr3, tmp_fr2);
        else                      tmp_rel = !fp_unordered(tmp_fr2, tmp_fr3); // ‘ord’

        tmp_res_hi = (tmp_rel ? 0xFFFFFFFF : 0x00000000);

        tmp_fr2 = fp_reg_read_lo(f2);
        tmp_fr3 = fp_reg_read_lo(f3);

        if      (frel == ‘eq’)    tmp_rel = fp_equal(tmp_fr2, tmp_fr3);
        else if (frel == ‘lt’)    tmp_rel = fp_less_than(tmp_fr2, tmp_fr3);
        else if (frel == ‘le’)    tmp_rel = fp_lesser_or_equal(tmp_fr2, tmp_fr3);
        else if (frel == ‘gt’)    tmp_rel = fp_less_than(tmp_fr3, tmp_fr2);
        else if (frel == ‘ge’)    tmp_rel = fp_lesser_or_equal(tmp_fr3, tmp_fr2);
        else if (frel == ‘unord’) tmp_rel = fp_unordered(tmp_fr2, tmp_fr3);
        else if (frel == ‘neq’)   tmp_rel = !fp_equal(tmp_fr2, tmp_fr3);
        else if (frel == ‘nlt’)   tmp_rel = !fp_less_than(tmp_fr2, tmp_fr3);
        else if (frel == ‘nle’)   tmp_rel = !fp_lesser_or_equal(tmp_fr2, tmp_fr3);
        else if (frel == ‘ngt’)   tmp_rel = !fp_less_than(tmp_fr3, tmp_fr2);
        else if (frel == ‘nge’)   tmp_rel = !fp_lesser_or_equal(tmp_fr3, tmp_fr2);
        else                      tmp_rel = !fp_unordered(tmp_fr2, tmp_fr3); // ‘ord’

        tmp_res_lo = (tmp_rel ? 0xFFFFFFFF : 0x00000000);


        fp_update_fpsr(sf, tmp_fp_env);
    }
    fp_update_psr(f1);

}



fpcvt.fx

Convert Parallel Floating-Point to Integer

Format: (qp) fpcvt.fx.sf f1 = f2           signed_form                  F10
        (qp) fpcvt.fx.trunc.sf f1 = f2     signed_form, trunc_form      F10
        (qp) fpcvt.fxu.sf f1 = f2          unsigned_form                F10
        (qp) fpcvt.fxu.trunc.sf f1 = f2    unsigned_form, trunc_form    F10

Description: The pair of single precision values in the significand field of FR f2 is converted to a pair of 32-bit signed integers (signed_form) or unsigned integers (unsigned_form) using either the rounding mode specified in the FPSR.sf.rc, or using Round-to-Zero if the trunc_form of the instruction is used. The result is written as a pair of 32-bit integers into the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0). If the result of the conversion does not fit in a 32-bit integer, the 32-bit integer indefinite value 0x80000000 is used as the result if the IEEE Invalid Operation Floating-Point Exception fault is disabled.

If FR f2 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.
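For one lane, the signed truncating form can be sketched in Python as below. The sketch assumes the Invalid Operation fault is disabled (so the integer indefinite value is delivered), covers only the signed_form with trunc_form, and uses a Python float in place of a single precision lane; `fpcvt_fx_trunc` is a hypothetical helper name.

```python
import math

INTEGER_INDEFINITE_32_BIT = 0x80000000

def fpcvt_fx_trunc(x):
    """Signed, round-to-zero conversion of one lane to a 32-bit pattern."""
    if math.isnan(x) or math.isinf(x):
        return INTEGER_INDEFINITE_32_BIT
    n = math.trunc(x)                      # Round-to-Zero
    if not -2**31 <= n < 2**31:            # does not fit in a signed 32-bit integer
        return INTEGER_INDEFINITE_32_BIT
    return n & 0xFFFFFFFF                  # two's-complement bit pattern

minus_one = fpcvt_fx_trunc(-1.75)          # truncates toward zero to -1
too_big = fpcvt_fx_trunc(3e9)              # out of range for a signed 32-bit integer
```

The final masking step plays the same role as the `(~significand) + 1` negation in the pseudocode: a negative result is stored as its two's-complement bit pattern.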



    if (fp_is_natval(FR[f2])) {
        FR[f1] = NATVAL;
        fp_update_psr(f1);
    } else {
        tmp_default_result_pair = fpcvt_exception_fault_check(f2, sf,
                                      signed_form, trunc_form, &tmp_fp_env);
        if (fp_raise_fault(tmp_fp_env))


        if (fp_is_nan(tmp_default_result_pair.hi)) {
            tmp_res_hi = INTEGER_INDEFINITE_32_BIT;
        } else {
            tmp_res = fp_ieee_rnd_to_int_sp(fp_reg_read_hi(f2), HIGH, &tmp_fp_env);
            if (tmp_res.exponent)


            if (signed_form && tmp_res.sign)
                tmp_res.significand = (~tmp_res.significand) + 1;

            tmp_res_hi = tmp_res.significand{31:0};
        }

        if (fp_is_nan(tmp_default_result_pair.lo)) {
            tmp_res_lo = INTEGER_INDEFINITE_32_BIT;
        } else {
            tmp_res = fp_ieee_rnd_to_int_sp(fp_reg_read_lo(f2), LOW, &tmp_fp_env);
            if (tmp_res.exponent)


            if (signed_form && tmp_res.sign)
                tmp_res.significand = (~tmp_res.significand) + 1;

            tmp_res_lo = tmp_res.significand{31:0};
        }




}

FP Exceptions: Invalid Operation (V), Inexact (I), Denormal/Unnormal Operand (D), Software Assist (SWA) fault


fpma

Floating-Point Parallel Multiply Add

Format: (qp) fpma.sf f1 = f3, f4, f2 F1

Description: The pair of products of the pairs of single precision values in the significand fields of FR f3 and FR f4 are computed to infinite precision and then the pair of single precision values in the significand field of FR f2 is added to these products, again in infinite precision. The resulting values are then rounded to single precision using the rounding mode specified by FPSR.sf.rc. The pair of rounded results are stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).

If any of FR f3, FR f4, or FR f2 is a NaTVal, FR f1 is set to NaTVal instead of the computed results.

Note: If f2 is f0 in the fpma instruction, just the IEEE multiply operation is performed. (See “Floating-Point Parallel Multiply” on page 6-76.) FR f1, as an operand, is not a packed pair of 1.0 values; it is just the register file format’s 1.0 value.

The mnemonic values for sf are given in Table 6-18 on page 6-30. The encodings and interpretation for the status field’s rc are given in Table 5-6 on page 5-5.


    } else {
        tmp_default_result_pair = fpma_exception_fault_check(f2,
                                      f3, f4, sf, &tmp_fp_env);
        if (fp_raise_fault(tmp_fp_env))


        if (fp_is_nan_or_inf(tmp_default_result_pair.hi)) {
            tmp_res_hi = fp_single(tmp_default_result_pair.hi);
        } else {
            tmp_res = fp_mul(fp_reg_read_hi(f3), fp_reg_read_hi(f4));
            if (f2 != 0)
                tmp_res = fp_add(tmp_res, fp_reg_read_hi(f2), tmp_fp_env);
            tmp_res_hi = fp_ieee_round_sp(tmp_res, HIGH, &tmp_fp_env);
        }

        if (fp_is_nan_or_inf(tmp_default_result_pair.lo)) {
            tmp_res_lo = fp_single(tmp_default_result_pair.lo);
        } else {
            tmp_res = fp_mul(fp_reg_read_lo(f3), fp_reg_read_lo(f4));
            if (f2 != 0)
                tmp_res = fp_add(tmp_res, fp_reg_read_lo(f2), tmp_fp_env);
            tmp_res_lo = fp_ieee_round_sp(tmp_res, LOW, &tmp_fp_env);
        }




}

FP Exceptions: Invalid Operation (V), Underflow (U), Denormal/Unnormal Operand (D), Overflow (O), Software Assist (SWA) fault, Inexact (I), Software Assist (SWA) trap


fpmax

Floating-Point Parallel Maximum

Format: (qp) fpmax.sf f1 = f2, f3 F8

Description: The paired single precision values in the significands of FR f2 and FR f3 are compared. The operands with the larger value are returned in the significand of FR f1.

If the value of high (low) FR f3 is less than the value of high (low) FR f2, high (low) FR f1 gets high (low) FR f2. Otherwise high (low) FR f1 gets high (low) FR f3.

If high (low) FR f2 or high (low) FR f3 is a NaN, high (low) FR f1 gets high (low) FR f3.


If either FR f2 or FR f3 is NaTVal, FR f1 is set to NaTVal instead of the computed result.








        tmp_fr2 = tmp_right = fp_reg_read_hi(f2);
        tmp_fr3 = tmp_left = fp_reg_read_hi(f3);
        tmp_bool_res = fp_less_than(tmp_left, tmp_right);
        tmp_res_hi = fp_single(tmp_bool_res ? tmp_fr2 : tmp_fr3);

        tmp_fr2 = tmp_right = fp_reg_read_lo(f2);
        tmp_fr3 = tmp_left = fp_reg_read_lo(f3);
        tmp_bool_res = fp_less_than(tmp_left, tmp_right);
        tmp_res_lo = fp_single(tmp_bool_res ? tmp_fr2 : tmp_fr3);



}



fpmerge

Floating-Point Parallel Merge

Format: (qp) fpmerge.ns f1 = f2, f3    neg_sign_form    F9
        (qp) fpmerge.s f1 = f2, f3     sign_form        F9
        (qp) fpmerge.se f1 = f2, f3    sign_exp_form    F9

Description: For the neg_sign_form, the signs of the pair of single precision values in the significand field of FR f2 are negated and concatenated with the exponents and the significands of the pair of single precision values in the significand field of FR f3 and stored in the significand field of FR f1. This form can be used to negate a pair of single precision floating-point numbers by using the same register for f2 and f3.

For the sign_form, the signs of the pair of single precision values in the significand field of FR f2 are concatenated with the exponents and the significands of the pair of single precision values in the significand field of FR f3 and stored in FR f1.

For the sign_exp_form, the signs and exponents of the pair of single precision values in the significand field of FR f2 are concatenated with the pair of single precision significands in the significand field of FR f3 and stored in the significand field of FR f1.
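On the 64-bit significand, each form amounts to masking fields of two 32-bit single precision bit patterns per lane. The Python sketch below illustrates this; `fpmerge` is a hypothetical helper operating on integer significands, and NaTVal handling and the fixed exponent and sign fields of FR f1 are not modelled.

```python
SIGN = 0x80000000       # bit 31 of a single precision pattern
SIGN_EXP = 0xFF800000   # bits 31:23 (sign and exponent)

def fpmerge(form, sig2, sig3):
    def lane(a, b):
        if form == "ns":   # negated sign of a, exponent and significand of b
            return ((a ^ SIGN) & SIGN) | (b & ~SIGN & 0xFFFFFFFF)
        if form == "s":    # sign of a, exponent and significand of b
            return (a & SIGN) | (b & ~SIGN & 0xFFFFFFFF)
        if form == "se":   # sign and exponent of a, significand of b
            return (a & SIGN_EXP) | (b & ~SIGN_EXP & 0xFFFFFFFF)
        raise ValueError("unknown form: " + form)
    hi = lane(sig2 >> 32, sig3 >> 32)
    lo = lane(sig2 & 0xFFFFFFFF, sig3 & 0xFFFFFFFF)
    return (hi << 32) | lo

pair = (0x3F800000 << 32) | 0xC0000000     # lanes hold 1.0 and -2.0
negated = fpmerge("ns", pair, pair)        # lanes become -1.0 and 2.0
absolute = fpmerge("s", 0, pair)           # zero signs from f2: lanes become 1.0 and 2.0
```

The last line mirrors how fpabs is listed in this guide as a pseudo-op of fpmerge.s with f0 as the first source, here represented by an all-zero significand.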



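Figure 6-14. Floating-point Merge Negative Sign Operation
[Diagram: in each 32-bit half, the negated sign bit (bits 63 and 31) of f2 is combined with bits 62:32 and 30:0 of f3 to form f1.]

Figure 6-15. Floating-point Merge Sign Operation
[Diagram: in each 32-bit half, the sign bit (bits 63 and 31) of f2 is combined with bits 62:32 and 30:0 of f3 to form f1.]

Figure 6-16. Floating-point Merge Sign and Exponent Operation
[Diagram: in each 32-bit half, the sign and exponent (bits 63:55 and 31:23) of f2 are combined with bits 54:32 and 22:0 of f3 to form f1.]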

    } else {
        if (neg_sign_form) {
            tmp_res_hi = (!FR[f2].significand{63} << 31)
                         | (FR[f3].significand{62:32});
            tmp_res_lo = (!FR[f2].significand{31} << 31)
                         | (FR[f3].significand{30:0});
        } else if (sign_form) {
            tmp_res_hi = (FR[f2].significand{63} << 31)
                         | (FR[f3].significand{62:32});
            tmp_res_lo = (FR[f2].significand{31} << 31)
                         | (FR[f3].significand{30:0});
        } else { // sign_exp_form
            tmp_res_hi = (FR[f2].significand{63:55} << 23)
                         | (FR[f3].significand{54:32});
            tmp_res_lo = (FR[f2].significand{31:23} << 23)
                         | (FR[f3].significand{22:0});
        }


    }

    fp_update_psr(f1);
}

FP Exceptions: None


fpmin

Floating-Point Parallel Minimum

Format: (qp) fpmin.sf f1 = f2, f3 F8

Description: The paired single precision values in the significands of FR f2 and FR f3 are compared. The operands with the smaller value are returned in the significand of FR f1.

If the value of high (low) FR f2 is less than the value of high (low) FR f3, high (low) FR f1 gets high (low) FR f2. Otherwise high (low) FR f1 gets high (low) FR f3.

If high (low) FR f2 or high (low) FR f3 is a NaN, high (low) FR f1 gets high (low) FR f3.










        tmp_fr2 = tmp_left = fp_reg_read_hi(f2);
        tmp_fr3 = tmp_right = fp_reg_read_hi(f3);
        tmp_bool_res = fp_less_than(tmp_left, tmp_right);
        tmp_res_hi = fp_single(tmp_bool_res ? tmp_fr2 : tmp_fr3);

        tmp_fr2 = tmp_left = fp_reg_read_lo(f2);
        tmp_fr3 = tmp_right = fp_reg_read_lo(f3);
        tmp_bool_res = fp_less_than(tmp_left, tmp_right);
        tmp_res_lo = fp_single(tmp_bool_res ? tmp_fr2 : tmp_fr3);



}



fpmpy

Floating-Point Parallel Multiply

Format: (qp) fpmpy.sf f1 = f3, f4    pseudo-op of: (qp) fpma.sf f1 = f3, f4, f0

Description: The pair of products of the pairs of single precision values in the significand fields of FR f3 and FR f4 are computed to infinite precision. The resulting values are then rounded to single precision using the rounding mode specified by FPSR.sf.rc. The pair of rounded results are stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).

If either FR f3 or FR f4 is a NaTVal, FR f1 is set to NaTVal instead of the computed results.


Operation: See “Floating-Point Parallel Multiply Add” on page 6-70.


fpms

Floating-Point Parallel Multiply Subtract

Format: (qp) fpms.sf f1 = f3, f4, f2 F1

Description: The pair of products of the pairs of single precision values in the significand fields of FR f3 and FR f4 are computed to infinite precision and then the pair of single precision values in the significand field of FR f2 is subtracted from these products, again in infinite precision. The resulting values are then rounded to single precision using the rounding mode specified by FPSR.sf.rc. The pair of rounded results are stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).

If any of FR f3, FR f4, or FR f2 is a NaTVal, FR f1 is set to NaTVal instead of the computed results.

Note: If f2 is f0 in the fpms instruction, just the IEEE multiply operation is performed.



    } else {
        tmp_default_result_pair = fpms_fpnma_exception_fault_check(f2, f3, f4,
                                      sf, &tmp_fp_env);
        if (fp_raise_fault(tmp_fp_env))



        } else {
            tmp_res = fp_mul(fp_reg_read_hi(f3), fp_reg_read_hi(f4));
            if (f2 != 0) {
                tmp_sub = fp_reg_read_hi(f2);
                tmp_sub.sign = !tmp_sub.sign;
                tmp_res = fp_add(tmp_res, tmp_sub, tmp_fp_env);
            }
            tmp_res_hi = fp_ieee_round_sp(tmp_res, HIGH, &tmp_fp_env);
        }


        } else {
            tmp_res = fp_mul(fp_reg_read_lo(f3), fp_reg_read_lo(f4));
            if (f2 != 0) {
                tmp_sub = fp_reg_read_lo(f2);
                tmp_sub.sign = !tmp_sub.sign;
                tmp_res = fp_add(tmp_res, tmp_sub, tmp_fp_env);
            }
            tmp_res_lo = fp_ieee_round_sp(tmp_res, LOW, &tmp_fp_env);
        }




}

FP Exceptions: Invalid Operation (V), Underflow (U), Denormal/Unnormal Operand (D), Overflow (O), Software Assist (SWA) fault, Inexact (I)



fpneg

Floating-Point Parallel Negate

Format: (qp) fpneg f1 = f3    pseudo-op of: (qp) fpmerge.ns f1 = f3, f3

Description: The pair of single precision values in the significand field of FR f3 are negated and stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).




fpnegabs

Floating-Point Parallel Negate Absolute Value

Format: (qp) fpnegabs f1 = f3    pseudo-op of: (qp) fpmerge.ns f1 = f0, f3

Description: The absolute values of the pair of single precision values in the significand field of FR f3 are computed, negated and stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).




fpnma

Floating-Point Parallel Negative Multiply Add

Format: (qp) fpnma.sf f1 = f3, f4, f2 F1

Description: The pair of products of the pairs of single precision values in the significand fields of FR f3 and FR f4 are computed to infinite precision, negated, and then the pair of single precision values in the significand field of FR f2 are added to these (negated) products, again in infinite precision. The resulting values are then rounded to single precision using the rounding mode specified by FPSR.sf.rc. The pair of rounded results are stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).


Note: If f2 is f0 in the fpnma instruction, just the IEEE multiply operation (with the product being negated before rounding) is performed.



    } else {
        tmp_default_result_pair = fpms_fpnma_exception_fault_check(f2, f3, f4,
                                      sf, &tmp_fp_env);
        if (fp_raise_fault(tmp_fp_env))



        } else {
            tmp_res = fp_mul(fp_reg_read_hi(f3), fp_reg_read_hi(f4));
            tmp_res.sign = !tmp_res.sign;
            if (f2 != 0)
                tmp_res = fp_add(tmp_res, fp_reg_read_hi(f2), tmp_fp_env);
            tmp_res_hi = fp_ieee_round_sp(tmp_res, HIGH, &tmp_fp_env);
        }


        } else {
            tmp_res = fp_mul(fp_reg_read_lo(f3), fp_reg_read_lo(f4));
            tmp_res.sign = !tmp_res.sign;
            if (f2 != 0)
                tmp_res = fp_add(tmp_res, fp_reg_read_lo(f2), tmp_fp_env);
            tmp_res_lo = fp_ieee_round_sp(tmp_res, LOW, &tmp_fp_env);
        }




}

FP Exceptions: Invalid Operation (V), Underflow (U), Denormal/Unnormal Operand (D), Overflow (O), Software Assist (SWA) fault, Inexact (I)



fpnmpy

Floating-Point Parallel Negative Multiply

Format: (qp) fpnmpy.sf f1 = f3, f4    pseudo-op of: (qp) fpnma.sf f1 = f3, f4, f0

Description: The pair of products of the pairs of single precision values in the significand fields of FR f3 and FR f4 are computed to infinite precision and then negated. The resulting values are then rounded to single precision using the rounding mode specified by FPSR.sf.rc. The pair of rounded results are stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).

If either FR f3 or FR f4 is a NaTVal, FR f1 is set to NaTVal instead of the computed results.


Operation: See “Floating-Point Parallel Negative Multiply Add” on page 6-81.

fprcpa

Floating-Point Parallel Reciprocal Approximation

Format: (qp) fprcpa.sf f1, p2 = f2, f3 F6

Description: If PR qp is 0, PR p2 is cleared and FR f1 remains unchanged.

If PR qp is 1, the following will occur:

• Each half of the significand of FR f1 is either set to an approximation (with a relative error < 2^-8.886) of the reciprocal of the corresponding half of FR f3, or set to the IEEE-754 mandated response for the quotient FR f2/FR f3 of the corresponding half, if that half of FR f2 or of FR f3 is in the set {-Infinity, -0, +0, +Infinity, NaN}.

• If either half of FR f1 is set to the IEEE-754 mandated quotient, or is set to an approximation of the reciprocal which may cause the Newton-Raphson iterations to fail to produce the correct IEEE-754 divide result, then PR p2 is set to 0, otherwise it is set to 1.

For correct IEEE divide results, when PR p2 is cleared, user software is expected to compute the quotient (FR f2/FR f3) for each half (using the non-parallel frcpa instruction), and merge the results into FR f1, keeping PR p2 cleared.

• The exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0).

• If either FR f2 or FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result, and PR p2 is cleared.




Operation: if (PR[qp]) {
    fp_check_target_register(f1);
    if (tmp_isrcode = fp_reg_disabled(f1, f2, f3, 0))
        disabled_fp_register_fault(tmp_isrcode, 0);

    if (fp_is_natval(FR[f2]) || fp_is_natval(FR[f3])) {
        FR[f1] = NATVAL;
        PR[p2] = 0;
    } else {
        tmp_default_result_pair = fprcpa_exception_fault_check(f2, f3, sf,
                                      &tmp_fp_env, &limits_check);
        if (fp_raise_fault(tmp_fp_env))
            fp_exception_fault(fp_decode_fault(tmp_fp_env));

        // high half
        if (fp_is_nan_or_inf(tmp_default_result_pair.hi) || limits_check.hi_fr3) {
            tmp_res_hi = fp_single(tmp_default_result_pair.hi);
            tmp_pred_hi = 0;
        } else {
            num = fp_normalize(fp_reg_read_hi(f2));
            den = fp_normalize(fp_reg_read_hi(f3));
            if (fp_is_inf(num) && fp_is_finite(den)) {
                tmp_res = FP_INFINITY;
                tmp_res.sign = num.sign ^ den.sign;
                tmp_pred_hi = 0;
            } else if (fp_is_finite(num) && fp_is_inf(den)) {
                tmp_res = FP_ZERO;
                tmp_res.sign = num.sign ^ den.sign;
                tmp_pred_hi = 0;
            } else if (fp_is_zero(num) && fp_is_finite(den)) {
                tmp_res = FP_ZERO;
                tmp_res.sign = num.sign ^ den.sign;
                tmp_pred_hi = 0;
            } else {
                tmp_res = fp_ieee_recip(den);
                if (limits_check.hi_fr2_or_quot)
                    tmp_pred_hi = 0;
                else
                    tmp_pred_hi = 1;
            }
            tmp_res_hi = fp_single(tmp_res);
        }

        // low half
        if (fp_is_nan_or_inf(tmp_default_result_pair.lo) || limits_check.lo_fr3) {
            tmp_res_lo = fp_single(tmp_default_result_pair.lo);
            tmp_pred_lo = 0;
        } else {
            num = fp_normalize(fp_reg_read_lo(f2));
            den = fp_normalize(fp_reg_read_lo(f3));
            if (fp_is_inf(num) && fp_is_finite(den)) {
                tmp_res = FP_INFINITY;
                tmp_res.sign = num.sign ^ den.sign;
                tmp_pred_lo = 0;
            } else if (fp_is_finite(num) && fp_is_inf(den)) {
                tmp_res = FP_ZERO;
                tmp_res.sign = num.sign ^ den.sign;
                tmp_pred_lo = 0;
            } else if (fp_is_zero(num) && fp_is_finite(den)) {
                tmp_res = FP_ZERO;
                tmp_res.sign = num.sign ^ den.sign;
                tmp_pred_lo = 0;
            } else {
                tmp_res = fp_ieee_recip(den);
                if (limits_check.lo_fr2_or_quot)
                    tmp_pred_lo = 0;
                else
                    tmp_pred_lo = 1;
            }
            tmp_res_lo = fp_single(tmp_res);
        }

        FR[f1].significand = fp_concatenate(tmp_res_hi, tmp_res_lo);
        FR[f1].exponent = FP_INTEGER_EXP;
        FR[f1].sign = FP_SIGN_POSITIVE;
        PR[p2] = tmp_pred_hi && tmp_pred_lo;

        fp_update_fpsr(sf, tmp_fp_env);
    }
    fp_update_psr(f1);
} else {
    PR[p2] = 0;
}

FP Exceptions: Invalid Operation (V), Zero Divide (Z), Denormal/Unnormal Operand (D), Software Assist (SWA) fault

fprsqrta IA-64 Application ISA Guide 1.0

Floating-Point Parallel Reciprocal Square Root Approximation

Format: (qp) fprsqrta.sf f1, p2 = f3        F7

Description: If PR qp is 0, PR p2 is cleared and FR f1 remains unchanged.

If PR qp is 1, the following will occur:

• Each half of the significand of FR f1 is either set to an approximation (with a relative error < 2^-8.831) of the reciprocal square root of the corresponding half of FR f3, or set to the IEEE-754 compliant response for the reciprocal square root of the corresponding half of FR f3, if that half of FR f3 is in the set {-Infinity, -Finite, -0, +0, +Infinity, NaN}.

• If either half of FR f1 is set to the IEEE-754 mandated reciprocal square root, or is set to an approximation of the reciprocal square root which may cause the Newton-Raphson iterations to fail to produce the correct IEEE-754 square root result, then PR p2 is set to 0; otherwise it is set to 1.

  For correct IEEE square root results, when PR p2 is cleared, user software is expected to compute the square root for each half (using the non-parallel frsqrta instruction), and merge the results in FR f1, keeping PR p2 cleared.

• The exponent field of FR f1 is set to the biased exponent for 2^63 (0x1003E) and the sign field of FR f1 is set to positive (0).

• If FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result, and PR p2 is cleared.




Operation: if (PR[qp]) {
    fp_check_target_register(f1);
    if (tmp_isrcode = fp_reg_disabled(f1, f3, 0, 0))
        disabled_fp_register_fault(tmp_isrcode, 0);

    if (fp_is_natval(FR[f3])) {
        FR[f1] = NATVAL;
        PR[p2] = 0;
    } else {
        tmp_default_result_pair = fprsqrta_exception_fault_check(f3, sf,
                                      &tmp_fp_env, &limits_check);
        if (fp_raise_fault(tmp_fp_env))
            fp_exception_fault(fp_decode_fault(tmp_fp_env));

        // high half
        if (fp_is_nan(tmp_default_result_pair.hi)) {
            tmp_res_hi = fp_single(tmp_default_result_pair.hi);
            tmp_pred_hi = 0;
        } else {
            tmp_fr3 = fp_normalize(fp_reg_read_hi(f3));
            if (fp_is_zero(tmp_fr3)) {
                tmp_res = FP_INFINITY;
                tmp_res.sign = tmp_fr3.sign;
                tmp_pred_hi = 0;
            } else if (fp_is_pos_inf(tmp_fr3)) {
                tmp_res = FP_ZERO;
                tmp_pred_hi = 0;
            } else {
                tmp_res = fp_ieee_recip_sqrt(tmp_fr3);
                if (limits_check.hi)
                    tmp_pred_hi = 0;
                else
                    tmp_pred_hi = 1;
            }
            tmp_res_hi = fp_single(tmp_res);
        }

        // low half
        if (fp_is_nan(tmp_default_result_pair.lo)) {
            tmp_res_lo = fp_single(tmp_default_result_pair.lo);
            tmp_pred_lo = 0;
        } else {
            tmp_fr3 = fp_normalize(fp_reg_read_lo(f3));
            if (fp_is_zero(tmp_fr3)) {
                tmp_res = FP_INFINITY;
                tmp_res.sign = tmp_fr3.sign;
                tmp_pred_lo = 0;
            } else if (fp_is_pos_inf(tmp_fr3)) {
                tmp_res = FP_ZERO;
                tmp_pred_lo = 0;
            } else {
                tmp_res = fp_ieee_recip_sqrt(tmp_fr3);
                if (limits_check.lo)
                    tmp_pred_lo = 0;
                else
                    tmp_pred_lo = 1;
            }
            tmp_res_lo = fp_single(tmp_res);
        }

        FR[f1].significand = fp_concatenate(tmp_res_hi, tmp_res_lo);
        FR[f1].exponent = FP_INTEGER_EXP;
        FR[f1].sign = FP_SIGN_POSITIVE;
        PR[p2] = tmp_pred_hi && tmp_pred_lo;

        fp_update_fpsr(sf, tmp_fp_env);
    }
    fp_update_psr(f1);
} else {
    PR[p2] = 0;
}


frcpa IA-64 Application ISA Guide 1.0

Floating-Point Reciprocal Approximation

Format: (qp) frcpa.sf f1, p2 = f2, f3        F6

Description: If PR qp is 0, PR p2 is cleared and FR f1 remains unchanged.

If PR qp is 1, the following will occur:

• FR f1 is either set to an approximation (with a relative error < 2^-8.886) of the reciprocal of FR f3, or to the IEEE-754 mandated quotient of FR f2/FR f3, if either FR f2 or FR f3 is in the set {-Infinity, -0, Pseudo-zero, +0, +Infinity, NaN, Unsupported}.

• If FR f1 is set to the approximation of the reciprocal of FR f3, then PR p2 is set to 1; otherwise, it is set to 0.

• If FR f2 and FR f3 are such that the approximation of FR f3's reciprocal may cause the Newton-Raphson iterations to fail to produce the correct IEEE-754 result of FR f2/FR f3, then a Floating-point Exception fault for Software Assist occurs.

  System software is expected to compute the IEEE-754 quotient (FR f2/FR f3), return the result in FR f1, and set PR p2 to 0.

• If either FR f2 or FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result, and PR p2 is cleared.
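The description refers to Newton-Raphson iterations that refine the initial approximation. The following Python sketch illustrates why a seed with a relative error near 2^-9 is sufficient: with the textbook recurrence y <- y * (2 - b*y), the relative error is squared at every step. The recurrence, the function names and the seed value are illustrative assumptions; this section does not specify the actual divide sequences.

```python
from fractions import Fraction

def refine_reciprocal(b, y, steps):
    """Apply `steps` Newton-Raphson iterations y <- y * (2 - b*y), exactly."""
    for _ in range(steps):
        y = y * (2 - b * y)
    return y

def relative_error(b, y):
    """Return |1 - b*y|, the relative error of y as an approximation of 1/b."""
    return abs(1 - b * y)

b = Fraction(3)
# Hypothetical seed whose relative error is 2^-9, inside the 2^-8.886 bound.
y0 = (1 - Fraction(1, 2**9)) / b

# Relative error after 0, 1, 2 and 3 iterations.
errors = [relative_error(b, refine_reciprocal(b, y0, n)) for n in range(4)]
for n, err in enumerate(errors):
    print(n, err)
```

Because exact rationals are used, the squaring of the error is visible without rounding effects; a real sequence also has to account for rounding in each fused multiply-add.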




Operation: if (PR[qp]) {
    fp_check_target_register(f1);
    if (tmp_isrcode = fp_reg_disabled(f1, f2, f3, 0))
        disabled_fp_register_fault(tmp_isrcode, 0);

    if (fp_is_natval(FR[f2]) || fp_is_natval(FR[f3])) {
        FR[f1] = NATVAL;
        PR[p2] = 0;
    } else {
        tmp_default_result = frcpa_exception_fault_check(f2, f3, sf, &tmp_fp_env);
        if (fp_raise_fault(tmp_fp_env))
            fp_exception_fault(fp_decode_fault(tmp_fp_env));

        if (fp_is_nan_or_inf(tmp_default_result)) {
            FR[f1] = tmp_default_result;
            PR[p2] = 0;
        } else {
            num = fp_normalize(fp_reg_read(FR[f2]));
            den = fp_normalize(fp_reg_read(FR[f3]));
            if (fp_is_inf(num) && fp_is_finite(den)) {
                FR[f1] = FP_INFINITY;
                FR[f1].sign = num.sign ^ den.sign;
                PR[p2] = 0;
            } else if (fp_is_finite(num) && fp_is_inf(den)) {
                FR[f1] = FP_ZERO;
                FR[f1].sign = num.sign ^ den.sign;
                PR[p2] = 0;
            } else if (fp_is_zero(num) && fp_is_finite(den)) {
                FR[f1] = FP_ZERO;
                FR[f1].sign = num.sign ^ den.sign;
                PR[p2] = 0;
            } else {
                FR[f1] = fp_ieee_recip(den);
                PR[p2] = 1;
            }
        }
        fp_update_fpsr(sf, tmp_fp_env);
    }
    fp_update_psr(f1);
} else {
    PR[p2] = 0;
}

// fp_ieee_recip()

fp_ieee_recip(den)
{
    const EM_uint_t RECIP_TABLE[256] = {
        0x3fc, 0x3f4, 0x3ec, 0x3e4, 0x3dd, 0x3d5, 0x3cd, 0x3c6,
        0x3be, 0x3b7, 0x3af, 0x3a8, 0x3a1, 0x399, 0x392, 0x38b,
        0x384, 0x37d, 0x376, 0x36f, 0x368, 0x361, 0x35b, 0x354,
        0x34d, 0x346, 0x340, 0x339, 0x333, 0x32c, 0x326, 0x320,
        0x319, 0x313, 0x30d, 0x307, 0x300, 0x2fa, 0x2f4, 0x2ee,
        0x2e8, 0x2e2, 0x2dc, 0x2d7, 0x2d1, 0x2cb, 0x2c5, 0x2bf,
        0x2ba, 0x2b4, 0x2af, 0x2a9, 0x2a3, 0x29e, 0x299, 0x293,
        0x28e, 0x288, 0x283, 0x27e, 0x279, 0x273, 0x26e, 0x269,
        0x264, 0x25f, 0x25a, 0x255, 0x250, 0x24b, 0x246, 0x241,
        0x23c, 0x237, 0x232, 0x22e, 0x229, 0x224, 0x21f, 0x21b,
        0x216, 0x211, 0x20d, 0x208, 0x204, 0x1ff, 0x1fb, 0x1f6,
        0x1f2, 0x1ed, 0x1e9, 0x1e5, 0x1e0, 0x1dc, 0x1d8, 0x1d4,
        0x1cf, 0x1cb, 0x1c7, 0x1c3, 0x1bf, 0x1bb, 0x1b6, 0x1b2,
        0x1ae, 0x1aa, 0x1a6, 0x1a2, 0x19e, 0x19a, 0x197, 0x193,
        0x18f, 0x18b, 0x187, 0x183, 0x17f, 0x17c, 0x178, 0x174,
        0x171, 0x16d, 0x169, 0x166, 0x162, 0x15e, 0x15b, 0x157,
        0x154, 0x150, 0x14d, 0x149, 0x146, 0x142, 0x13f, 0x13b,
        0x138, 0x134, 0x131, 0x12e, 0x12a, 0x127, 0x124, 0x120,
        0x11d, 0x11a, 0x117, 0x113, 0x110, 0x10d, 0x10a, 0x107,
        0x103, 0x100, 0x0fd, 0x0fa, 0x0f7, 0x0f4, 0x0f1, 0x0ee,
        0x0eb, 0x0e8, 0x0e5, 0x0e2, 0x0df, 0x0dc, 0x0d9, 0x0d6,
        0x0d3, 0x0d0, 0x0cd, 0x0ca, 0x0c8, 0x0c5, 0x0c2, 0x0bf,
        0x0bc, 0x0b9, 0x0b7, 0x0b4, 0x0b1, 0x0ae, 0x0ac, 0x0a9,
        0x0a6, 0x0a4, 0x0a1, 0x09e, 0x09c, 0x099, 0x096, 0x094,
        0x091, 0x08e, 0x08c, 0x089, 0x087, 0x084, 0x082, 0x07f,
        0x07c, 0x07a, 0x077, 0x075, 0x073, 0x070, 0x06e, 0x06b,
        0x069, 0x066, 0x064, 0x061, 0x05f, 0x05d, 0x05a, 0x058,
        0x056, 0x053, 0x051, 0x04f, 0x04c, 0x04a, 0x048, 0x045,
        0x043, 0x041, 0x03f, 0x03c, 0x03a, 0x038, 0x036, 0x033,
        0x031, 0x02f, 0x02d, 0x02b, 0x029, 0x026, 0x024, 0x022,
        0x020, 0x01e, 0x01c, 0x01a, 0x018, 0x015, 0x013, 0x011,
        0x00f, 0x00d, 0x00b, 0x009, 0x007, 0x005, 0x003, 0x001,
    };

    tmp_index = den.significand{62:55};
    tmp_res.significand = (1 << 63) | (RECIP_TABLE[tmp_index] << 53);
    tmp_res.exponent = FP_REG_EXP_ONES - 2 - den.exponent;
    tmp_res.sign = den.sign;
    return (tmp_res);
}

FP Exceptions: Invalid Operation (V), Zero Divide (Z), Denormal/Unnormal Operand (D), Software Assist (SWA) fault
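As an illustration of the table lookup in fp_ieee_recip(), the following Python sketch reproduces the index, significand and exponent computations for an operand equal to +1.0. The constant values FP_REG_EXP_ONES = 0x1FFFF and FP_REG_BIAS = 0xFFFF are assumptions (they are not defined in this section), the helper names are hypothetical, and only the first eight table entries are copied.

```python
# Assumed constants: a 17-bit exponent of all ones and the register-format bias.
FP_REG_EXP_ONES = 0x1FFFF
FP_REG_BIAS = 0xFFFF

# First eight entries of RECIP_TABLE only; enough for operands close to 1.0.
RECIP_TABLE_HEAD = [0x3fc, 0x3f4, 0x3ec, 0x3e4, 0x3dd, 0x3d5, 0x3cd, 0x3c6]

def recip_seed(sign, exponent, significand):
    """Mirror fp_ieee_recip() for a normalized operand whose index is below 8."""
    index = (significand >> 55) & 0xFF            # den.significand{62:55}
    res_sig = (1 << 63) | (RECIP_TABLE_HEAD[index] << 53)
    res_exp = FP_REG_EXP_ONES - 2 - exponent
    return sign, res_exp, res_sig

def to_float(sign, exponent, significand):
    """Interpret (sign, biased exponent, 64-bit significand) as a Python float."""
    value = (significand / 2**63) * 2.0 ** (exponent - FP_REG_BIAS)
    return -value if sign else value

one = (0, FP_REG_BIAS, 1 << 63)                   # register-format encoding of +1.0
seed = recip_seed(*one)
approx = to_float(*seed)
print(hex(seed[1]), hex(seed[2]), approx)
```

Under these assumptions the seed for 1.0 is 0.998046875, a relative error of 2^-9, which is within the stated bound of 2^-8.886.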

frsqrta IA-64 Application ISA Guide 1.0

Floating-Point Reciprocal Square Root Approximation

Format: (qp) frsqrta.sf f1, p2 = f3        F7

Description: If PR qp is 0, PR p2 is cleared and FR f1 remains unchanged.

If PR qp is 1, the following will occur:

• FR f1 is either set to an approximation (with a relative error < 2^-8.831) of the reciprocal square root of FR f3, or set to the IEEE-754 mandated square root of FR f3, if FR f3 is in the set {-Infinity, -Finite, -0, Pseudo-zero, +0, +Infinity, NaN, Unsupported}.

• If FR f1 is set to an approximation of the reciprocal square root of FR f3, then PR p2 is set to 1; otherwise, it is set to 0.

• If FR f3 is such that the approximation of its reciprocal square root may cause the Newton-Raphson iterations to fail to produce the correct IEEE-754 square root result, then a Floating-point Exception fault for Software Assist occurs.

  System software is expected to compute the IEEE-754 square root, return the result in FR f1, and set PR p2 to 0.

• If FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result, and PR p2 is cleared.




Operation: if (PR[qp]) {
    fp_check_target_register(f1);
    if (tmp_isrcode = fp_reg_disabled(f1, f3, 0, 0))
        disabled_fp_register_fault(tmp_isrcode, 0);

    if (fp_is_natval(FR[f3])) {
        FR[f1] = NATVAL;
        PR[p2] = 0;
    } else {
        tmp_default_result = frsqrta_exception_fault_check(f3, sf, &tmp_fp_env);
        if (fp_raise_fault(tmp_fp_env))
            fp_exception_fault(fp_decode_fault(tmp_fp_env));

        if (fp_is_nan(tmp_default_result)) {
            FR[f1] = tmp_default_result;
            PR[p2] = 0;
        } else {
            tmp_fr3 = fp_normalize(fp_reg_read(FR[f3]));
            if (fp_is_zero(tmp_fr3)) {
                FR[f1] = tmp_fr3;
                PR[p2] = 0;
            } else if (fp_is_pos_inf(tmp_fr3)) {
                FR[f1] = tmp_fr3;
                PR[p2] = 0;
            } else {
                FR[f1] = fp_ieee_recip_sqrt(tmp_fr3);
                PR[p2] = 1;
            }
        }
        fp_update_fpsr(sf, tmp_fp_env);
    }
    fp_update_psr(f1);
} else {
    PR[p2] = 0;
}

// fp_ieee_recip_sqrt()

fp_ieee_recip_sqrt(root)
{
    const EM_uint_t RECIP_SQRT_TABLE[256] = {
        0x1a5, 0x1a0, 0x19a, 0x195, 0x18f, 0x18a, 0x185, 0x180,
        0x17a, 0x175, 0x170, 0x16b, 0x166, 0x161, 0x15d, 0x158,
        0x153, 0x14e, 0x14a, 0x145, 0x140, 0x13c, 0x138, 0x133,
        0x12f, 0x12a, 0x126, 0x122, 0x11e, 0x11a, 0x115, 0x111,
        0x10d, 0x109, 0x105, 0x101, 0x0fd, 0x0fa, 0x0f6, 0x0f2,
        0x0ee, 0x0ea, 0x0e7, 0x0e3, 0x0df, 0x0dc, 0x0d8, 0x0d5,
        0x0d1, 0x0ce, 0x0ca, 0x0c7, 0x0c3, 0x0c0, 0x0bd, 0x0b9,
        0x0b6, 0x0b3, 0x0b0, 0x0ad, 0x0a9, 0x0a6, 0x0a3, 0x0a0,
        0x09d, 0x09a, 0x097, 0x094, 0x091, 0x08e, 0x08b, 0x088,
        0x085, 0x082, 0x07f, 0x07d, 0x07a, 0x077, 0x074, 0x071,
        0x06f, 0x06c, 0x069, 0x067, 0x064, 0x061, 0x05f, 0x05c,
        0x05a, 0x057, 0x054, 0x052, 0x04f, 0x04d, 0x04a, 0x048,
        0x045, 0x043, 0x041, 0x03e, 0x03c, 0x03a, 0x037, 0x035,
        0x033, 0x030, 0x02e, 0x02c, 0x029, 0x027, 0x025, 0x023,
        0x020, 0x01e, 0x01c, 0x01a, 0x018, 0x016, 0x014, 0x011,
        0x00f, 0x00d, 0x00b, 0x009, 0x007, 0x005, 0x003, 0x001,
        0x3fc, 0x3f4, 0x3ec, 0x3e5, 0x3dd, 0x3d5, 0x3ce, 0x3c7,
        0x3bf, 0x3b8, 0x3b1, 0x3aa, 0x3a3, 0x39c, 0x395, 0x38e,
        0x388, 0x381, 0x37a, 0x374, 0x36d, 0x367, 0x361, 0x35a,
        0x354, 0x34e, 0x348, 0x342, 0x33c, 0x336, 0x330, 0x32b,
        0x325, 0x31f, 0x31a, 0x314, 0x30f, 0x309, 0x304, 0x2fe,
        0x2f9, 0x2f4, 0x2ee, 0x2e9, 0x2e4, 0x2df, 0x2da, 0x2d5,
        0x2d0, 0x2cb, 0x2c6, 0x2c1, 0x2bd, 0x2b8, 0x2b3, 0x2ae,
        0x2aa, 0x2a5, 0x2a1, 0x29c, 0x298, 0x293, 0x28f, 0x28a,
        0x286, 0x282, 0x27d, 0x279, 0x275, 0x271, 0x26d, 0x268,
        0x264, 0x260, 0x25c, 0x258, 0x254, 0x250, 0x24c, 0x249,
        0x245, 0x241, 0x23d, 0x239, 0x235, 0x232, 0x22e, 0x22a,
        0x227, 0x223, 0x220, 0x21c, 0x218, 0x215, 0x211, 0x20e,
        0x20a, 0x207, 0x204, 0x200, 0x1fd, 0x1f9, 0x1f6, 0x1f3,
        0x1f0, 0x1ec, 0x1e9, 0x1e6, 0x1e3, 0x1df, 0x1dc, 0x1d9,
        0x1d6, 0x1d3, 0x1d0, 0x1cd, 0x1ca, 0x1c7, 0x1c4, 0x1c1,
        0x1be, 0x1bb, 0x1b8, 0x1b5, 0x1b2, 0x1af, 0x1ac, 0x1aa,
    };

    tmp_index = (root.exponent{0} << 7) | root.significand{62:56};
    tmp_res.significand = (1 << 63) | (RECIP_SQRT_TABLE[tmp_index] << 53);
    tmp_res.exponent = FP_REG_EXP_HALF - ((root.exponent - FP_REG_BIAS) >> 1);
    tmp_res.sign = FP_SIGN_POSITIVE;
    return (tmp_res);
}
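The index used by fp_ieee_recip_sqrt() combines the low bit of the biased exponent with seven significand bits, so that operands whose unbiased exponents differ in parity select different halves of the table. The Python sketch below works this through for 1.0 and 2.0. FP_REG_BIAS = 0xFFFF and FP_REG_EXP_HALF = 0xFFFE are assumed values, the helper names are hypothetical, and only two table entries (index 0 and index 128) are copied.

```python
import math

FP_REG_BIAS = 0xFFFF        # assumed register-format exponent bias
FP_REG_EXP_HALF = 0xFFFE    # assumed biased exponent of 2^-1

# Two entries copied from RECIP_SQRT_TABLE: index 0 and index 128.
RECIP_SQRT_SAMPLES = {0: 0x1a5, 128: 0x3fc}

def rsqrt_seed(exponent, significand):
    """Mirror fp_ieee_recip_sqrt() for operands that hit the sampled entries."""
    index = ((exponent & 1) << 7) | ((significand >> 56) & 0x7F)
    res_sig = (1 << 63) | (RECIP_SQRT_SAMPLES[index] << 53)
    # Python's >> on negative integers is an arithmetic shift.
    res_exp = FP_REG_EXP_HALF - ((exponent - FP_REG_BIAS) >> 1)
    return index, res_exp, res_sig

def to_float(exponent, significand):
    """Interpret (biased exponent, 64-bit significand) as a positive float."""
    return (significand / 2**63) * 2.0 ** (exponent - FP_REG_BIAS)

idx_one, exp_one, sig_one = rsqrt_seed(0xFFFF, 1 << 63)     # operand 1.0
idx_two, exp_two, sig_two = rsqrt_seed(0x10000, 1 << 63)    # operand 2.0
seed_one = to_float(exp_one, sig_one)
seed_two = to_float(exp_two, sig_two)
print(idx_one, seed_one, idx_two, seed_two, 1 / math.sqrt(2))
```

With these assumed constants, the seed for 2.0 differs from 1/sqrt(2) by a relative error just under the stated 2^-8.831 bound.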


fselect IA-64 Application ISA Guide 1.0

Floating-Point Select

Format: (qp) fselect f1 = f3, f4, f2        F3

Description: The significand field of FR f3 is logically AND-ed with the significand field of FR f2 and the significand field of FR f4 is logically AND-ed with the one's complement of the significand field of FR f2. The two results are logically OR-ed together. The result is placed in the significand field of FR f1.

The exponent field of FR f1 is set to the biased exponent for 2^63 (0x1003E). The sign bit field of FR f1 is set to positive (0).

If any of FR f2, FR f3, or FR f4 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.

Operation: if (PR[qp]) {
    fp_check_target_register(f1);
    if (tmp_isrcode = fp_reg_disabled(f1, f2, f3, f4))
        disabled_fp_register_fault(tmp_isrcode, 0);

    if (fp_is_natval(FR[f2]) || fp_is_natval(FR[f3]) || fp_is_natval(FR[f4])) {
        FR[f1] = NATVAL;
    } else {
        FR[f1].significand = (FR[f3].significand & FR[f2].significand)
                           | (FR[f4].significand & ~FR[f2].significand);
        FR[f1].exponent = FP_INTEGER_EXP;
        FR[f1].sign = FP_SIGN_POSITIVE;
    }

    fp_update_psr(f1);
}

FP Exceptions: None
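The select operation is a bit-wise multiplexer on the 64-bit significands. A minimal Python sketch follows, with hypothetical operand values; NaTVal handling and the exponent and sign fields are not modeled.

```python
MASK64 = (1 << 64) - 1

def fselect(f3_sig, f4_sig, f2_sig):
    """Take bits from f3_sig where f2_sig has ones, from f4_sig where it has zeros."""
    return ((f3_sig & f2_sig) | (f4_sig & ~f2_sig)) & MASK64

# A mask of all ones in the upper half selects the left value of f3
# and the right value of f4.
mask = 0xFFFFFFFF00000000
result = fselect(0x1111111122222222, 0x3333333344444444, mask)
print(hex(result))
```

A mask of this kind is typically produced by a parallel compare, which is why the select is expressed per bit rather than per element.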


IA-64 Application ISA Guide 1.0 fsetc

Floating-Point Set Controls

Format: (qp) fsetc.sf amask7, omask7 F12

Description: The status field's control bits are initialized to the value obtained by logically AND-ing the sf0.controls and amask7 immediate field and logically OR-ing the omask7 immediate field.

Operation: if (PR[qp]) {
    tmp_controls = (AR[FPSR].sf0.controls & amask7) | omask7;
    if (is_reserved_field(FSETC, sf, tmp_controls))
        reserved_register_field_fault();
    fp_set_sf_controls(sf, tmp_controls);
}

FP Exceptions: None
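The mask arithmetic can be sketched in Python as follows. The function name and the sample control values are hypothetical, and the reserved-field check from the Operation section is not modeled.

```python
def fsetc_controls(sf0_controls, amask7, omask7):
    """Return the 7-bit control value (sf0.controls AND amask7) OR omask7."""
    return ((sf0_controls & amask7) | omask7) & 0x7F

# amask7 = 0x7F, omask7 = 0: copy the sf0 controls unchanged.
copied = fsetc_controls(0x0C, 0x7F, 0x00)
# amask7 = 0x00: ignore sf0 and load omask7 directly.
loaded = fsetc_controls(0x0C, 0x00, 0x41)
# Mixed case: keep the upper five bits of sf0 and force bit 0 to one.
mixed = fsetc_controls(0x0C, 0x7C, 0x01)
print(hex(copied), hex(loaded), hex(mixed))
```

The AND mask chooses which sf0 control bits are inherited, and the OR mask forces selected bits to one.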


fsub IA-64 Application ISA Guide 1.0

Floating-Point Subtract

Format: (qp) fsub.pc.sf f1 = f3, f2        pseudo-op of: (qp) fms.pc.sf f1 = f3, f1, f2

In the fms form, the second source operand is the literal register f1 (which reads as +1.0), not the destination operand.

Description: FR f2 is subtracted from FR f3 (computed to infinite precision), rounded to the precision indicated by pc (and possibly FPSR.sf.pc and FPSR.sf.wre) using the rounding mode specified by FPSR.sf.rc, and placed in FR f1.



Operation: See “Floating-Point Multiply Subtract” on page 6-55.


IA-64 Application ISA Guide 1.0 fswap

Floating-Point Swap

Format: (qp) fswap f1 = f2, f3           swap_form       F9
        (qp) fswap.nl f1 = f2, f3        swap_nl_form    F9
        (qp) fswap.nr f1 = f2, f3        swap_nr_form    F9

Description: For the swap_form, the left single precision value in FR f2 is concatenated with the right single precision value in FR f3. The concatenated pair is then swapped.

For the swap_nl_form, the left single precision value in FR f2 is concatenated with the right single precision value in FR f3. The concatenated pair is then swapped, and the left single precision value is negated.

For the swap_nr_form, the left single precision value in FR f2 is concatenated with the right single precision value in FR f3. The concatenated pair is then swapped, and the right single precision value is negated.

The exponent field of FR f1 is set to the biased exponent for 2^63 (0x1003E) and the sign field of FR f1 is set to positive (0).

If either FR f2 or FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.

Figure 6-17. Floating-point Swap

Figure 6-18. Floating-point Swap Negate Left or Right

Operation: if (PR[qp]) {
    fp_check_target_register(f1);
    if (tmp_isrcode = fp_reg_disabled(f1, f2, f3, 0))
        disabled_fp_register_fault(tmp_isrcode, 0);

    if (fp_is_natval(FR[f2]) || fp_is_natval(FR[f3])) {
        FR[f1] = NATVAL;
    } else {
        if (swap_form) {
            tmp_res_hi = FR[f3].significand{31:0};
            tmp_res_lo = FR[f2].significand{63:32};
        } else if (swap_nl_form) {
            tmp_res_hi = (!FR[f3].significand{31} << 31)
                       | (FR[f3].significand{30:0});
            tmp_res_lo = FR[f2].significand{63:32};
        } else { // swap_nr_form
            tmp_res_hi = FR[f3].significand{31:0};
            tmp_res_lo = (!FR[f2].significand{63} << 31)
                       | (FR[f2].significand{62:32});
        }

        FR[f1].significand = fp_concatenate(tmp_res_hi, tmp_res_lo);
        FR[f1].exponent = FP_INTEGER_EXP;
        FR[f1].sign = FP_SIGN_POSITIVE;
    }

    fp_update_psr(f1);
}

FP Exceptions: None
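The three swap forms can be sketched in Python on raw 64-bit significands, with negation implemented as a flip of the single precision sign bit. The function name and the form strings are hypothetical; NaTVal handling and the exponent and sign fields of the result are not modeled.

```python
def fswap(f2_sig, f3_sig, form="swap"):
    """Return the result significand for fswap ("swap"), fswap.nl ("nl") or fswap.nr ("nr")."""
    hi = f3_sig & 0xFFFFFFFF             # right value of f3 becomes the left result
    lo = (f2_sig >> 32) & 0xFFFFFFFF     # left value of f2 becomes the right result
    if form == "nl":
        hi ^= 0x80000000                 # negate the left result (flip its sign bit)
    elif form == "nr":
        lo ^= 0x80000000                 # negate the right result
    return (hi << 32) | lo

f2 = 0x3F80000040000000   # single precision pair (1.0, 2.0)
f3 = 0x4040000040800000   # single precision pair (3.0, 4.0)
swapped = fswap(f2, f3)
negated_left = fswap(f2, f3, "nl")
negated_right = fswap(f2, f3, "nr")
print(hex(swapped), hex(negated_left), hex(negated_right))
```

Passing the same register as both sources exchanges the two halves of that register.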


IA-64 Application ISA Guide 1.0 fsxt

Floating-Point Sign Extend

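Format: (qp) fsxt.l f1 = f2, f3        sxt_l_form    F9
        (qp) fsxt.r f1 = f2, f3        sxt_r_form    F9

Description: For the sxt_l_form (sxt_r_form), the sign of the left (right) single precision value in FR f2 is extended to 32 bits and is concatenated with the left (right) single precision value in FR f3.

The exponent field of FR f1 is set to the biased exponent for 2^63 (0x1003E) and the sign field of FR f1 is set to positive (0).

If either FR f2 or FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.

Operation: if (PR[qp]) {
    fp_check_target_register(f1);
    if (tmp_isrcode = fp_reg_disabled(f1, f2, f3, 0))
        disabled_fp_register_fault(tmp_isrcode, 0);

    if (fp_is_natval(FR[f2]) || fp_is_natval(FR[f3])) {
        FR[f1] = NATVAL;
    } else {
        if (sxt_l_form) {
            tmp_res_hi = (FR[f2].significand{63} ? 0xFFFFFFFF : 0x00000000);
            tmp_res_lo = FR[f3].significand{63:32};
        } else { // sxt_r_form
            tmp_res_hi = (FR[f2].significand{31} ? 0xFFFFFFFF : 0x00000000);
            tmp_res_lo = FR[f3].significand{31:0};
        }

        FR[f1].significand = fp_concatenate(tmp_res_hi, tmp_res_lo);
        FR[f1].exponent = FP_INTEGER_EXP;
        FR[f1].sign = FP_SIGN_POSITIVE;
    }

    fp_update_psr(f1);
}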

FP Exceptions: None
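A minimal Python sketch of the two sign-extend forms on raw 64-bit significands follows. The function name and form strings are hypothetical; NaTVal handling and the result's exponent and sign fields are not modeled.

```python
def fsxt(f2_sig, f3_sig, form):
    """Return the result significand for fsxt.l (form "l") or fsxt.r (form "r")."""
    if form == "l":
        sign = (f2_sig >> 63) & 1            # sign bit of the left value of f2
        lo = (f3_sig >> 32) & 0xFFFFFFFF     # left value of f3
    else:
        sign = (f2_sig >> 31) & 1            # sign bit of the right value of f2
        lo = f3_sig & 0xFFFFFFFF             # right value of f3
    hi = 0xFFFFFFFF if sign else 0x00000000  # sign replicated across 32 bits
    return (hi << 32) | lo

left = fsxt(0x8000000000000000, 0x123456789ABCDEF0, "l")
right_neg = fsxt(0x0000000080000000, 0x123456789ABCDEF0, "r")
right_pos = fsxt(0x0000000000000000, 0x123456789ABCDEF0, "r")
print(hex(left), hex(right_neg), hex(right_pos))
```

When both sources name the same register, the result is the selected 32-bit half sign-extended to 64 bits.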

Figure 6-19. Floating-point Sign Extend Left

Figure 6-20. Floating-point Sign Extend Right


fxor IA-64 Application ISA Guide 1.0

Floating-Point Exclusive Or

Format: (qp) fxor f1 = f2, f3 F9

Description: The bit-wise logical exclusive-OR of the significand fields of FR f2 and FR f3 is computed. The resulting value is stored in the significand field of FR f1. The exponent field of FR f1 is set to the biased exponent for 2^63 (0x1003E) and the sign field of FR f1 is set to positive (0).

If either of FR f2 or FR f3 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.

Operation: if (PR[qp]) {
    fp_check_target_register(f1);
    if (tmp_isrcode = fp_reg_disabled(f1, f2, f3, 0))
        disabled_fp_register_fault(tmp_isrcode, 0);

    if (fp_is_natval(FR[f2]) || fp_is_natval(FR[f3])) {
        FR[f1] = NATVAL;
    } else {
        FR[f1].significand = FR[f2].significand ^ FR[f3].significand;
        FR[f1].exponent = FP_INTEGER_EXP;
        FR[f1].sign = FP_SIGN_POSITIVE;
    }

    fp_update_psr(f1);
}

FP Exceptions: None


IA-64 Application ISA Guide 1.0 getf

Get Floating-Point Value or Exponent or Significand

Format: (qp) getf.s r1 = f2          single_form         M19
        (qp) getf.d r1 = f2          double_form         M19
        (qp) getf.exp r1 = f2        exponent_form       M19
        (qp) getf.sig r1 = f2        significand_form    M19

Description: In the single and double forms, the value in FR f2 is converted into a single precision (single_form) or double precision (double_form) memory representation and placed in GR r1. In the single_form, the most-significant 32 bits of GR r1 are set to 0.

In the exponent_form, the exponent field of FR f2 is copied to bits 16:0 of GR r1 and the sign bit of the value in FR f2 is copied to bit 17 of GR r1. The most-significant 46 bits of GR r1 are set to zero.

In the significand_form, the significand field of the value in FR f2 is copied to GR r1.

For all forms, if FR f2 contains a NaTVal, then the NaT bit corresponding to GR r1 is set to 1.

Operation: if (PR[qp]) {
    check_target_register(r1);
    if (tmp_isrcode = fp_reg_disabled(f2, 0, 0, 0))
        disabled_fp_register_fault(tmp_isrcode, 0);

    if (single_form) {
        GR[r1]{31:0} = fp_fr_to_mem_format(FR[f2], 4, 0);
        GR[r1]{63:32} = 0;
    } else if (double_form) {
        GR[r1] = fp_fr_to_mem_format(FR[f2], 8, 0);
    } else if (exponent_form) {
        GR[r1]{63:18} = 0;
        GR[r1]{16:0} = FR[f2].exponent;
        GR[r1]{17} = FR[f2].sign;
    } else // significand_form
        GR[r1] = FR[f2].significand;

    if (fp_is_natval(FR[f2]))
        GR[r1].nat = 1;
    else
        GR[r1].nat = 0;
}
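The bit layout produced by the exponent_form can be sketched in Python as follows. The helper names are hypothetical, and the inverse helper is included only to show how software might split the result again.

```python
def getf_exp(sign, exponent):
    """Pack the sign into bit 17 and the 17-bit exponent into bits 16:0; bits 63:18 are 0."""
    return ((sign & 1) << 17) | (exponent & 0x1FFFF)

def split_exp(gr_value):
    """Recover (sign, exponent) from a getf.exp result."""
    return (gr_value >> 17) & 1, gr_value & 0x1FFFF

# A negative register value whose exponent field holds the 2^63 bias pattern 0x1003E.
packed = getf_exp(1, 0x1003E)
print(hex(packed), split_exp(packed))
```

For this operand the general register receives 0x3003E.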

Figure 6-21. Function of getf.exp (the sign of FR f2 is copied to bit 17 of GR r1, the exponent to bits 16:0, and bits 63:18 are set to 0)

Figure 6-22. Function of getf.sig (the 64-bit significand of FR f2 is copied to bits 63:0 of GR r1)


invala IA-64 Application ISA Guide 1.0

Invalidate ALAT

Format: (qp) invala             complete_form          M24
        (qp) invala.e r1        gr_form, entry_form    M26
        (qp) invala.e f1        fr_form, entry_form    M27

Description: The selected entry or entries in the ALAT are invalidated.

In the complete_form, all ALAT entries are invalidated. In the entry_form, the ALAT is queried using the general register specifier r1 (gr_form), or the floating-point register specifier f1 (fr_form), and if any ALAT entry matches, it is invalidated.

Operation: if (PR[qp]) {
    if (complete_form)
        alat_inval();
    else { // entry_form
        if (gr_form)
            alat_inval_single_entry(GENERAL, r1);
        else // fr_form
            alat_inval_single_entry(FLOAT, f1);
    }
}


IA-64 Application ISA Guide 1.0 ld

Load

Format: (qp) ldsz.ldtype.ldhint r1 = [r3]          no_base_update_form                M1
        (qp) ldsz.ldtype.ldhint r1 = [r3], r2      reg_base_update_form               M2
        (qp) ldsz.ldtype.ldhint r1 = [r3], imm9    imm_base_update_form               M3
        (qp) ld8.fill.ldhint r1 = [r3]             fill_form, no_base_update_form     M1
        (qp) ld8.fill.ldhint r1 = [r3], r2         fill_form, reg_base_update_form    M2
        (qp) ld8.fill.ldhint r1 = [r3], imm9       fill_form, imm_base_update_form    M3

Description: A value consisting of sz bytes is read from memory starting at the address specified by the value in GR r3. The value is then zero extended and placed in GR r1. The values of the sz completer are given in Table 6-26. The NaT bit corresponding to GR r1 is cleared, except as described below for speculative loads. The ldtype completer specifies special load operations, which are described in Table 6-27.

For the fill_form, an 8-byte value is loaded, and a bit in the UNAT application register is copied into the target register NaT bit. This instruction is used for reloading a spilled register/NaT pair. See “Control Speculation” on page 4-10 for details.

In the base update forms, the value in GR r3 is added to either a signed immediate value (imm9) or a value from GR r2, and the result is placed back in GR r3. This base register update is done after the load, and does not affect the load address. In the reg_base_update_form, if the NaT bit corresponding to GR r2 is set, then the NaT bit corresponding to GR r3 is set and no fault is raised.
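The base register update described above can be sketched in Python as follows. The function names are hypothetical, register values are modeled as 64-bit unsigned integers, and NaT bits as booleans.

```python
MASK64 = (1 << 64) - 1

def sign_ext(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def base_update_imm(r3, r3_nat, imm9):
    """imm_base_update_form: add the sign-extended 9-bit immediate; NaT is unchanged."""
    return (r3 + sign_ext(imm9, 9)) & MASK64, r3_nat

def base_update_reg(r3, r3_nat, r2, r2_nat):
    """reg_base_update_form: add GR r2; a NaT in r2 propagates to r3 without a fault."""
    return (r3 + r2) & MASK64, r3_nat or r2_nat

forward = base_update_imm(0x1000, False, 0x008)      # post-increment by 8
backward = base_update_imm(0x1000, False, 0x1F8)     # 0x1F8 encodes -8 in 9 bits
poisoned = base_update_reg(0x1000, False, 0x10, True)
print(forward, backward, poisoned)
```

The load itself always uses the value GR r3 held before the update, which is why the sketch computes only the new base.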

Table 6-26. sz Completers

sz Completer    Bytes Accessed
1               1 byte
2               2 bytes
4               4 bytes
8               8 bytes

Table 6-27. Load Types

ldtype Completer    Interpretation
                    Special Load Operation

none                Normal load

s                   Speculative load
                    Certain exceptions may be deferred rather than generating a fault. Deferral causes the target register's NaT bit to be set. The NaT bit is later used to detect deferral.

a                   Advanced load
                    An entry is added to the ALAT. This allows later instructions to check for colliding stores. If the referenced data page has a non-speculative attribute, the target register and NaT bit are cleared, and the processor ensures that no ALAT entry exists for the target register. The absence of an ALAT entry is later used to detect deferral or collision.

sa                  Speculative Advanced load
                    An entry is added to the ALAT, and certain exceptions may be deferred. Deferral causes the target register's NaT bit to be set, and the processor ensures that no ALAT entry exists for the target register. The absence of an ALAT entry is later used to detect deferral or collision.

c.nc                Check load, no clear
                    The ALAT is searched for a matching entry. If found, no load is done and the target register is unchanged. Regardless of ALAT hit or miss, base register updates are performed, if specified. An implementation may optionally cause the ALAT lookup to fail independent of whether an ALAT entry matches. If not found, a load is performed, and an entry is added to the ALAT (unless the referenced data page has a non-speculative attribute, in which case no ALAT entry is allocated).

c.clr               Check load, clear
                    The ALAT is searched for a matching entry. If found, the entry is removed, no load is done and the target register is unchanged. Regardless of ALAT hit or miss, base register updates are performed, if specified. An implementation may optionally cause the ALAT lookup to fail independent of whether an ALAT entry matches. If not found, a clear check load behaves like a normal load.

c.clr.acq           Ordered check load, clear
                    This type behaves the same as the unordered clear form, except that the ALAT lookup (and resulting load, if no ALAT entry is found) is performed with acquire semantics.

acq                 Ordered load
                    An ordered load is performed with acquire semantics.

bias                Biased load
                    A hint is provided to the implementation to acquire exclusive ownership of the accessed cache line.

For more details on ordered, biased, speculative, advanced and check loads see “Control Speculation” on page 4-10 and “Data Speculation” on page 4-12. For more details on ordered loads see “Memory Access Ordering” on page 4-18. See “Memory Hierarchy Control and Consistency” on page 4-16 for details on biased loads.

For the non-speculative load types, if the NaT bit associated with GR r3 is 1, a Register NaT Consumption fault is taken. For speculative and speculative advanced loads, no fault is raised, and the exception is deferred. For the base-update calculation, if the NaT bit associated with GR r2 is 1, the NaT bit associated with GR r3 is set to 1 and no fault is raised.

The value of the ldhint completer specifies the locality of the memory access. The values of the ldhint completer are given in Table 6-28. A prefetch hint is implied in the base update forms. The address specified by the value in GR r3 after the base update acts as a hint to prefetch the indicated cache line. This prefetch uses the locality hints specified by ldhint. Prefetch and locality hints do not affect program functionality and may be ignored by the implementation. See “Memory Hierarchy Control and Consistency” on page 4-16 for details.

In the no_base_update form, the value in GR r3 is not modified and no prefetch hint is implied.

For the base update forms, specifying the same register address in r1 and r3 will cause an Illegal Operation fault.

Table 6-28. Load Hints

ldhint Completer    Interpretation
none                Temporal locality, level 1
nt1                 No temporal locality, level 1
nta                 No temporal locality, all levels
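The interaction between advanced loads, stores and check loads can be illustrated with a toy Python model of the ALAT. This is only a sketch under stated assumptions: the class and method names are hypothetical, one entry is kept per target register, a store is assumed to remove every entry whose address range it overlaps, and implementation freedoms such as optional lookup failure, non-speculative pages and deferral are ignored.

```python
class ToyALAT:
    """Toy model: one entry per target register, holding (address, size)."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, reg, addr, size):
        # ld.a: any old entry for the register is replaced by the new access.
        self.entries[reg] = (addr, size)

    def store(self, addr, size):
        # Assumed collision rule: a store removes every entry it overlaps.
        self.entries = {
            reg: (a, s)
            for reg, (a, s) in self.entries.items()
            if a + s <= addr or addr + size <= a
        }

    def check_load(self, reg, addr, size, clear):
        # Returns True on an ALAT hit (no load is done), False when a reload is needed.
        hit = reg in self.entries
        if hit and clear:
            del self.entries[reg]                # ld.c.clr removes the entry on a hit
        elif not hit and not clear:
            self.entries[reg] = (addr, size)     # ld.c.nc allocates an entry after reloading
        return hit

alat = ToyALAT()
alat.advanced_load("r4", 0x100, 8)
alat.store(0x200, 8)                             # no overlap: entry survives
first = alat.check_load("r4", 0x100, 8, clear=False)
alat.store(0x104, 4)                             # overlaps bytes 0x104..0x107
second = alat.check_load("r4", 0x100, 8, clear=True)
print(first, second)
```

In this model the first check hits and leaves the register alone, while the second misses after the colliding store and would reload the value.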



Operation: if (PR[qp]) {
    size = fill_form ? 8 : sz;

    speculative = (ldtype == 's' || ldtype == 'sa');
    advanced = (ldtype == 'a' || ldtype == 'sa');
    check_clear = (ldtype == 'c.clr' || ldtype == 'c.clr.acq');
    check_no_clear = (ldtype == 'c.nc');
    check = check_clear || check_no_clear;
    acquire = (ldtype == 'acq' || ldtype == 'c.clr.acq');
    bias = (ldtype == 'bias') ? BIAS : 0;

    itype = READ;
    if (speculative) itype |= SPEC;
    if (advanced) itype |= ADVANCE;

    if ((reg_base_update_form || imm_base_update_form) && (r1 == r3))
        illegal_operation_fault();

    check_target_register(r1, itype);
    if (reg_base_update_form || imm_base_update_form)
        check_target_register(r3);

    if (reg_base_update_form) {
        tmp_r2 = GR[r2];
        tmp_r2nat = GR[r2].nat;
    }

    if (!speculative && GR[r3].nat)                    // fault on NaT address
        register_nat_consumption_fault(itype);

    defer = speculative && (GR[r3].nat || PSR.ed);     // defer exception if spec

    if (check && alat_cmp(GENERAL, r1)) {              // no load on ld.c & ALAT hit
        if (check_clear)                               // remove entry on ld.c.clr or ld.c.clr.acq
            alat_inval_single_entry(GENERAL, r1);
    } else {
        if (!defer) {
            paddr = tlb_translate(GR[r3], size, itype, PSR.cpl, &mattr, &defer);
            if (!defer) {
                otype = acquire ? ACQUIRE : UNORDERED;
                val = mem_read(paddr, size, UM.be, mattr, otype, bias | ldhint);
            }
        }
        if (check_clear || advanced)                   // remove any old ALAT entry
            alat_inval_single_entry(GENERAL, r1);
        if (defer) {
            if (speculative) {
                GR[r1] = natd_gr_read(paddr, size, UM.be, mattr, otype,
                                      bias | ldhint);
                GR[r1].nat = 1;
            } else {
                GR[r1] = 0;                            // ld.a to sequential memory
                GR[r1].nat = 0;
            }
        } else {                                       // execute load normally
            if (fill_form) {                           // fill NaT on ld8.fill
                bit_pos = GR[r3]{8:3};
                GR[r1] = val;
                GR[r1].nat = AR[UNAT]{bit_pos};
            } else {                                   // clear NaT on other types
                GR[r1] = zero_ext(val, size * 8);
                GR[r1].nat = 0;
            }
            if ((check_no_clear || advanced) && ma_is_speculative(mattr))
                alat_write(GENERAL, r1, paddr, size);  // add entry to ALAT
        }
    }

    if (imm_base_update_form) {                        // update base register
        GR[r3] = GR[r3] + sign_ext(imm9, 9);
        GR[r3].nat = GR[r3].nat;
    } else if (reg_base_update_form) {
        GR[r3] = GR[r3] + tmp_r2;
        GR[r3].nat = GR[r3].nat || tmp_r2nat;
    }

    if ((reg_base_update_form || imm_base_update_form) && !GR[r3].nat)
        mem_implicit_prefetch(GR[r3], bias | ldhint);
}
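For the fill_form, the Operation section selects the UNAT bit from bits 8:3 of the load address. A minimal Python sketch of that selection follows; the function names and the sample UNAT value are hypothetical.

```python
def unat_bit_pos(address):
    """Return GR[r3]{8:3}: the UNAT bit index tied to an 8-byte spill slot."""
    return (address >> 3) & 0x3F

def fill_nat(unat, address):
    """Return the NaT bit that ld8.fill copies into the target register."""
    return (unat >> unat_bit_pos(address)) & 1

unat = 0b10                       # hypothetical UNAT value with only bit 1 set
nat_at_1008 = fill_nat(unat, 0x1008)
nat_at_1000 = fill_nat(unat, 0x1000)
print(unat_bit_pos(0x1008), nat_at_1008, nat_at_1000)
```

Consecutive 8-byte slots map to consecutive UNAT bits, and the mapping repeats every 512 bytes of address space.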


IA-64 Application ISA Guide 1.0 ldf

Floating-Point Load

Format: (qp) ldffsz.fldtype.ldhint f1 = [r3]          no_base_update_form                   M6
        (qp) ldffsz.fldtype.ldhint f1 = [r3], r2      reg_base_update_form                  M7
        (qp) ldffsz.fldtype.ldhint f1 = [r3], imm9    imm_base_update_form                  M8
        (qp) ldf8.fldtype.ldhint f1 = [r3]            integer_form, no_base_update_form     M6
        (qp) ldf8.fldtype.ldhint f1 = [r3], r2        integer_form, reg_base_update_form    M7
        (qp) ldf8.fldtype.ldhint f1 = [r3], imm9      integer_form, imm_base_update_form    M8
        (qp) ldf.fill.ldhint f1 = [r3]                fill_form, no_base_update_form        M6
        (qp) ldf.fill.ldhint f1 = [r3], r2            fill_form, reg_base_update_form       M7
        (qp) ldf.fill.ldhint f1 = [r3], imm9          fill_form, imm_base_update_form       M8

Description: A value consisting of fsz bytes is read from memory starting at the address specified by the value in GR r3. The value is then converted into the floating-point register format and placed in FR f1. See “Data Types and Formats” on page 5-1 for details on conversion to floating-point register format. The values of the fsz completer are given in Table 6-29. The fldtype completer specifies special load operations, which are described in Table 6-30.

For the integer_form, an 8-byte value is loaded and placed in the significand field of FR f1 without conversion. The exponent field of FR f1 is set to the biased exponent for 2^63 (0x1003E) and the sign field of FR f1 is set to positive (0).

For the fill_form, a 16-byte value is loaded, and the appropriate fields are placed in FR f1 without conversion. This instruction is used for reloading a spilled register. See “Control Speculation” on page 4-10 for details.

In the base update forms, the value in GR r3 is added to either a signed immediate value (imm9) or a value from GR r2, and the result is placed back in GR r3. This base register update is done after the load, and does not affect the load address. In the reg_base_update_form, if the NaT bit corresponding to GR r2 is set, then the NaT bit corresponding to GR r3 is set and no fault is raised.

Table 6-29. fsz Completers

fsz Completer    Bytes Accessed    Memory Format
s                4 bytes           Single precision
d                8 bytes           Double precision
e                10 bytes          Extended precision

Table 6-30. FP Load Types

fldtypeCompleter


none Normal load

s Speculative loadCertain exceptions may be deferred rather than generating a fault. Deferral causes NaTVal to be placed in the target register. The NaTVal value is later used to detect deferral.

a Advanced load

An entry is added to the ALAT. This allows later instructions to check for colliding stores. If the referenced data page has a non-speculative attribute, no ALAT entry is added to the ALAT and the target register is set as fol-lows: for the integer_form, the exponent is set to 0x1003E and the sign and significand are set to zero; for all other forms, the sign, exponent and sig-nificand are set to zero. The absence of an ALAT entry is later used to detect deferral or collision.

saSpeculative

Advanced load

An entry is added to the ALAT, and certain exceptions may be deferred. Deferral causes NaTVal to be placed in the target register, and the proces-sor ensures that no ALAT entry exists for the target register. The absence of an ALAT entry is later used to detect deferral or collision.


ldf IA-64 Application ISA Guide 1.0

For more details on speculative, advanced and check loads see “Control Speculation” on page 4-10 and“Data Speculation” on page 4-12.

For the non-speculative load types, if NaT bit associated with GR r3 is 1, a Register NaT Consumptionfault is taken. For speculative and speculative advanced loads, no fault is raised, and the exception isdeferred. For the base-update calculation, if the NaT bit associated with GR r2 is 1, the NaT bit associatedwith GR r3 is set to 1 and no fault is raised.

The value of the ldhint modifier specifies the locality of the memory access. The mnemonic values ofldhint are given in Table 6-28 on page 6-102. A prefetch hint is implied in the base update forms. Theaddress specified by the value in GR r3 after the base update acts as a hint to prefetch the indicated cacheline. This prefetch uses the locality hints specified by ldhint. Prefetch and locality hints do not affect pro-gram functionality and may be ignored by the implementation. See “Memory Hierarchy Control and Con-sistency” on page 4-16 for details.


The PSR.mfl and PSR.mfh bits are updated to reflect the modification of FR f1.

Table 6-30. FP Load Types (Continued)

fldtype Completer

c.nc: Check load, no clear. The ALAT is searched for a matching entry. If found, no load is done and the target register is unchanged. Regardless of ALAT hit or miss, base register updates are performed, if specified. An implementation may optionally cause the ALAT lookup to fail independent of whether an ALAT entry matches. If not found, a load is performed, and an entry is added to the ALAT (unless the referenced data page has a non-speculative attribute, in which case no ALAT entry is allocated).

c.clr: Check load, clear. The ALAT is searched for a matching entry. If found, the entry is removed, no load is done and the target register is unchanged. Regardless of ALAT hit or miss, base register updates are performed, if specified. An implementation may optionally cause the ALAT lookup to fail independent of whether an ALAT entry matches. If not found, a clear check load behaves like a normal load.




Operation: if (PR[qp]) {
    size = (fill_form ? 16 : (integer_form ? 8 : fsz));
    speculative = (fldtype == 's' || fldtype == 'sa');
    advanced = (fldtype == 'a' || fldtype == 'sa');
    check_clear = (fldtype == 'c.clr');
    check_no_clear = (fldtype == 'c.nc');
    check = check_clear || check_no_clear;

    itype = READ;
    if (speculative) itype |= SPEC;
    if (advanced) itype |= ADVANCE;

    if (reg_base_update_form || imm_base_update_form)
        check_target_register(r3);

    fp_check_target_register(f1);
    if (tmp_isrcode = fp_reg_disabled(f1, 0, 0, 0))
        disabled_fp_register_fault(tmp_isrcode, itype);

    if (!speculative && GR[r3].nat) // fault on NaT address
        register_nat_consumption_fault(itype);

    defer = speculative && (GR[r3].nat || PSR.ed); // defer exception if spec

    if (check && alat_cmp(FLOAT, f1)) { // no load on ldf.c & ALAT hit
        if (check_clear) // remove entry on ldf.c.clr
            alat_inval_single_entry(FLOAT, f1);
    } else {
        if (!defer) {
            paddr = tlb_translate(GR[r3], size, itype, PSR.cpl, &mattr, &defer);
            if (!defer)
                val = mem_read(paddr, size, UM.be, mattr, UNORDERED, ldhint);
        }
        if (check_clear || advanced) // remove any old ALAT entry
            alat_inval_single_entry(FLOAT, f1);
        if (speculative && defer) {
            FR[f1] = NATVAL;
        } else if (advanced && !speculative && defer) {
            FR[f1] = (integer_form ? FP_INT_ZERO : FP_ZERO);
        } else { // execute load normally
            FR[f1] = fp_mem_to_fr_format(val, size, integer_form);

            if ((check_no_clear || advanced) && ma_is_speculative(mattr))
                // add entry to ALAT
                alat_write(FLOAT, f1, paddr, size);
        }
    }

    if (imm_base_update_form) { // update base register
        GR[r3] = GR[r3] + sign_ext(imm9, 9);
        GR[r3].nat = GR[r3].nat;
    } else if (reg_base_update_form) {
        GR[r3] = GR[r3] + GR[r2];
        GR[r3].nat = GR[r3].nat || GR[r2].nat;
    }

    if ((reg_base_update_form || imm_base_update_form) && !GR[r3].nat)
        mem_implicit_prefetch(GR[r3], ldhint);

    fp_update_psr(f1);
}
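As a reading aid, the choice of value written to FR f1 on the load path of the Operation above can be sketched in Python. This is only an illustrative sketch: the function name and the string labels it returns are hypothetical, and it covers only the path where a load is attempted (it does not model an ALAT hit on a check load, where the target register is left unchanged).

```python
def ldf_target_value(fldtype, integer_form, defer):
    """Label the value that ldf writes to FR f1 (hypothetical helper).

    fldtype is the completer string ('', 's', 'a', 'sa', 'c.nc', 'c.clr');
    defer is True when the exception was deferred.
    """
    speculative = fldtype in ("s", "sa")
    advanced = fldtype in ("a", "sa")
    if speculative and defer:
        return "NATVAL"                      # deferral is detected later via NaTVal
    if advanced and not speculative and defer:
        # Advanced (non-speculative) load that could not complete
        return "FP_INT_ZERO" if integer_form else "FP_ZERO"
    return "LOADED_VALUE"                    # load executes normally


print(ldf_target_value("sa", False, True))
print(ldf_target_value("a", True, True))
```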


ldfp IA-64 Application ISA Guide 1.0

Floating-Point Load Pair

Format: (qp) ldfps.fldtype.ldhint f1, f2 = [r3]        single_form, no_base_update_form    M11
        (qp) ldfps.fldtype.ldhint f1, f2 = [r3], 8     single_form, base_update_form       M12
        (qp) ldfpd.fldtype.ldhint f1, f2 = [r3]        double_form, no_base_update_form    M11
        (qp) ldfpd.fldtype.ldhint f1, f2 = [r3], 16    double_form, base_update_form       M12
        (qp) ldfp8.fldtype.ldhint f1, f2 = [r3]        integer_form, no_base_update_form   M11
        (qp) ldfp8.fldtype.ldhint f1, f2 = [r3], 16    integer_form, base_update_form      M12

Description: Eight (single_form) or sixteen (double_form/integer_form) bytes are read from memory starting at the address specified by the value in GR r3. The value read is treated as a contiguous pair of floating-point numbers for the single_form/double_form and as integer/Parallel FP data for the integer_form. Each number is converted into the floating-point register format. The value at the lowest address is placed in FR f1, and the value at the highest address is placed in FR f2. See “Data Types and Formats” on page 5-1 for details on conversion to floating-point register format. The fldtype completer specifies special load operations, which are described in Table 6-30 on page 6-105.

For more details on speculative, advanced and check loads see “Control Speculation” on page 4-10 and “Data Speculation” on page 4-12.

For the non-speculative load types, if the NaT bit associated with GR r3 is 1, a Register NaT Consumption fault is taken. For speculative and speculative advanced loads, no fault is raised, and the exception is deferred.

In the base_update_form, the value in GR r3 is added to an implied immediate value (equal to double the data size) and the result is placed back in GR r3. This base register update is done after the load, and does not affect the load address.

The value of the ldhint modifier specifies the locality of the memory access. The mnemonic values of ldhint are given in Table 6-28 on page 6-102. A prefetch hint is implied in the base update form. The address specified by the value in GR r3 after the base update acts as a hint to prefetch the indicated cache line. This prefetch uses the locality hints specified by ldhint. Prefetch and locality hints do not affect program functionality and may be ignored by the implementation. See “Memory Hierarchy Control and Consistency” on page 4-16 for details.


The PSR.mfl and PSR.mfh bits are updated to reflect the modification of FR f1 and FR f2.

There is a restriction on the choice of target registers. Register specifiers f1 and f2 must specify one odd-numbered physical FR and one even-numbered physical FR. Specifying two odd or two even registers will cause an Illegal Operation fault to be raised. The restriction is on physical register numbers after register rotation. This means that if f1 and f2 both specify static registers or both specify rotating registers, then f1 and f2 must be odd/even or even/odd. If f1 and f2 specify one static and one rotating register, the restriction depends on CFM.rrb.fr. If CFM.rrb.fr is even, the restriction is the same; f1 and f2 must be odd/even or even/odd. If CFM.rrb.fr is odd, then f1 and f2 must be even/even or odd/odd. Specifying one static and one rotating register should only be done when CFM.rrb.fr will have a predictable value (such as 0).
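The parity rule above can be checked with a short Python sketch. It rests on assumptions that this section does not spell out: that the rotating floating-point registers are f32 through f127 (96 registers) and that rotation adds CFM.rrb.fr to the register offset modulo 96. The helper names are hypothetical; only the odd/even requirement itself comes from the text.

```python
ROT_BASE = 32   # assumed first rotating FR
ROT_SIZE = 96   # assumed number of rotating FRs (f32..f127)


def fr_physical(virt, rrb_fr):
    """Map a register specifier to a physical FR number (assumed mapping)."""
    if virt < ROT_BASE:
        return virt                       # static registers do not rotate
    return ROT_BASE + (virt - ROT_BASE + rrb_fr) % ROT_SIZE


def fp_reg_bank_conflict(f1, f2, rrb_fr=0):
    """True when both targets land on physical registers of the same parity."""
    return fr_physical(f1, rrb_fr) % 2 == fr_physical(f2, rrb_fr) % 2


# Static/rotating pair: legal pairings flip when rrb.fr is odd.
print(fp_reg_bank_conflict(2, 33, rrb_fr=0))  # even/odd with even rrb.fr
print(fp_reg_bank_conflict(2, 33, rrb_fr=1))  # even/odd with odd rrb.fr
```

Because the assumed rotating region has an even size, rotation changes the parity of a rotating register exactly when CFM.rrb.fr is odd, which is why the legal pairings for a mixed static/rotating pair flip in that case.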

Operation: if (PR[qp]) {
    size = single_form ? 8 : 16;

    speculative = (fldtype == 's' || fldtype == 'sa');
    advanced = (fldtype == 'a' || fldtype == 'sa');
    check_clear = (fldtype == 'c.clr');
    check_no_clear = (fldtype == 'c.nc');
    check = check_clear || check_no_clear;

    itype = READ;
    if (speculative) itype |= SPEC;
    if (advanced) itype |= ADVANCE;

    if (fp_reg_bank_conflict(f1, f2))
        illegal_operation_fault();

    if (base_update_form)
        check_target_register(r3);

    fp_check_target_register(f1);
    fp_check_target_register(f2);
    if (tmp_isrcode = fp_reg_disabled(f1, f2, 0, 0))
        disabled_fp_register_fault(tmp_isrcode, itype);

    if (!speculative && GR[r3].nat) // fault on NaT address
        register_nat_consumption_fault(itype);

    defer = speculative && (GR[r3].nat || PSR.ed); // defer exception if spec

    if (check && alat_cmp(FLOAT, f1)) { // no load on ldfp.c & ALAT hit
        if (check_clear) // remove entry on ldfp.c.clr
            alat_inval_single_entry(FLOAT, f1);
    } else {
        if (!defer) {
            paddr = tlb_translate(GR[r3], size, itype, PSR.cpl, &mattr, &defer);
            if (!defer)
                val = mem_read(paddr, size, UM.be, mattr, UNORDERED, ldhint);
        }
        if (check_clear || advanced) // remove any old ALAT entry
            alat_inval_single_entry(FLOAT, f1);
        if (speculative && defer) {
            FR[f1] = NATVAL;
            FR[f2] = NATVAL;
        } else if (advanced && !speculative && defer) {
            FR[f1] = (integer_form ? FP_INT_ZERO : FP_ZERO);
            FR[f2] = (integer_form ? FP_INT_ZERO : FP_ZERO);
        } else { // execute load normally
            if (UM.be) {
                FR[f1] = fp_mem_to_fr_format(val u>> (size/2*8), size/2, integer_form);
                FR[f2] = fp_mem_to_fr_format(val, size/2, integer_form);
            } else {
                FR[f1] = fp_mem_to_fr_format(val, size/2, integer_form);
                FR[f2] = fp_mem_to_fr_format(val u>> (size/2*8), size/2, integer_form);
            }

            if ((check_no_clear || advanced) && ma_is_speculative(mattr))
                // add entry to ALAT
                alat_write(FLOAT, f1, paddr, size);
        }
    }

    if (base_update_form) { // update base register
        GR[r3] = GR[r3] + size;
        GR[r3].nat = GR[r3].nat;
        if (!GR[r3].nat)
            mem_implicit_prefetch(GR[r3], ldhint);
    }

    fp_update_psr(f1);
    fp_update_psr(f2);
}


lfetch IA-64 Application ISA Guide 1.0

Line Prefetch

Format: (qp) lfetch.lftype.lfhint [r3]               no_base_update_form                    M13
        (qp) lfetch.lftype.lfhint [r3], r2           reg_base_update_form                   M14
        (qp) lfetch.lftype.lfhint [r3], imm9         imm_base_update_form                   M15
        (qp) lfetch.lftype.excl.lfhint [r3]          no_base_update_form, exclusive_form    M13
        (qp) lfetch.lftype.excl.lfhint [r3], r2      reg_base_update_form, exclusive_form   M14
        (qp) lfetch.lftype.excl.lfhint [r3], imm9    imm_base_update_form, exclusive_form   M15

Description: The line containing the address specified by the value in GR r3 is moved to the highest level of the data memory hierarchy. The value of the lfhint modifier specifies the locality of the memory access. The mnemonic values of lfhint are given in Table 6-32.

The behavior of the memory read is also determined by the memory attribute associated with the accessed page. Line size is implementation dependent but must be a power of two greater than or equal to 32 bytes. In the exclusive form, the cache line is allowed to be marked in an exclusive state. This qualifier is used when the program expects soon to modify a location in that line. If the memory attribute for the page containing the line is not cacheable, then no reference is made.

The completer, lftype, specifies whether or not the instruction raises faults normally associated with a regular load. Table 6-31 defines these two options.

In the base update forms, after being used to address memory, the value in GR r3 is incremented by either the sign extended value in imm9 (in the imm_base_update_form) or the value in GR r2 (in the reg_base_update_form). In the reg_base_update_form, if the NaT bit corresponding to GR r2 is set, then the NaT bit corresponding to GR r3 is set; no fault is raised.

In the reg_base_update_form and the imm_base_update_form, if the NaT bit corresponding to GR r3 is clear, then the address specified by the value in GR r3 after the post-increment acts as a hint to implicitly prefetch the indicated cache line. This implicit prefetch uses the locality hints specified by lfhint. The implicit prefetch does not affect program functionality, does not raise any faults, and may be ignored by the implementation.

In the no_base_update_form, the value in GR r3 is not modified and no implicit prefetch hint is implied.

If the NaT bit corresponding to GR r3 is set then the state of memory is not affected. In the reg_base_update_form and imm_base_update_form, the post increment of GR r3 is performed and prefetch is hinted as described above.
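The post-increment and NaT propagation rules described above can be sketched in Python. This is a minimal model under stated assumptions: registers are plain 64-bit integers, NaT bits are booleans, and the function names and the returned tuple are hypothetical conveniences rather than anything defined by the architecture.

```python
MASK64 = (1 << 64) - 1


def sign_ext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def lfetch_base_update(r3, r3_nat, imm9=None, r2=None, r2_nat=False):
    """Return (new_r3, new_r3_nat, implicit_prefetch_hinted)."""
    if imm9 is not None:                       # imm_base_update_form
        new_r3 = (r3 + sign_ext(imm9, 9)) & MASK64
        new_nat = r3_nat
    elif r2 is not None:                       # reg_base_update_form
        new_r3 = (r3 + r2) & MASK64
        new_nat = r3_nat or r2_nat             # NaT propagates, no fault
    else:                                      # no_base_update_form
        return r3, r3_nat, False
    # The implicit prefetch is hinted only when the updated GR r3 is not NaT.
    return new_r3, new_nat, not new_nat


print(lfetch_base_update(0x1000, False, imm9=0x1F0))
print(lfetch_base_update(0x1000, False, r2=0x40, r2_nat=True))
```

In the first call the 9-bit immediate 0x1F0 sign-extends to -16, so the base moves backward; in the second, the NaT bit of GR r2 propagates to GR r3 and suppresses the implicit prefetch hint.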

Table 6-31. lftype Mnemonic Values

lftype Mnemonic    Interpretation
none               Ignore faults
fault              Raise faults

Table 6-32. lfhint Mnemonic Values

lfhint Mnemonic    Interpretation
none               Temporal locality, level 1
nta                No temporal locality, all levels



Operation: if (PR[qp]) {
    itype = READ|NON_ACCESS;
    itype |= (lftype == 'fault') ? LFETCH_FAULT : LFETCH;

    if (reg_base_update_form || imm_base_update_form)
        check_target_register(r3);

    if (lftype == 'fault') { // faulting form
        if (GR[r3].nat && !PSR.ed) // fault on NaT address
            register_nat_consumption_fault(itype);
    }

    if (exclusive_form)
        excl_hint = EXCLUSIVE;
    else
        excl_hint = 0;

    if (!GR[r3].nat && !PSR.ed) { // faulting form already faulted if r3 is nat'ed
        paddr = tlb_translate(GR[r3], 1, itype, PSR.cpl, &mattr, &defer);
        if (!defer)
            mem_promote(paddr, mattr, lfhint | excl_hint);
    }

    if (imm_base_update_form) {
        GR[r3] = GR[r3] + sign_ext(imm9, 9);
        GR[r3].nat = GR[r3].nat;
    } else if (reg_base_update_form) {
        GR[r3] = GR[r3] + GR[r2];
        GR[r3].nat = GR[r2].nat || GR[r3].nat;
    }

    if ((reg_base_update_form || imm_base_update_form) && !GR[r3].nat)
        mem_implicit_prefetch(GR[r3], lfhint | excl_hint);
}


mf IA-64 Application ISA Guide 1.0

Memory Fence

Format: (qp) mf      ordering_form      M24
        (qp) mf.a    acceptance_form    M24

Description: This instruction forces ordering between prior and subsequent memory accesses. The ordering_form ensures all prior data memory accesses are made visible prior to any subsequent data memory accesses being made visible. It does not ensure prior data memory references have been accepted by the external platform, nor that prior data memory references are visible.

The acceptance_form prevents any subsequent data memory accesses by the processor from initiating transactions to the external platform until:

• all prior loads have returned data, and

• all prior stores have been accepted by the external platform.

The definition of “acceptance” is platform dependent. The acceptance_form is typically used to ensure the processor has “waited” until a memory-mapped IO transaction has been “accepted”, before initiating additional external transactions. The acceptance_form does not ensure ordering.

Operation: if (PR[qp]) {
    if (acceptance_form)
        acceptance_fence();
    else
        ordering_fence();
}


IA-64 Application ISA Guide 1.0 mix

Mix

Format: (qp) mix1.l r1 = r2, r3    one_byte_form, left_form      I2
        (qp) mix2.l r1 = r2, r3    two_byte_form, left_form      I2
        (qp) mix4.l r1 = r2, r3    four_byte_form, left_form     I2
        (qp) mix1.r r1 = r2, r3    one_byte_form, right_form     I2
        (qp) mix2.r r1 = r2, r3    two_byte_form, right_form     I2
        (qp) mix4.r r1 = r2, r3    four_byte_form, right_form    I2

Description: The data elements of GR r2 and r3 are mixed as shown in Figure 6-23, and the result placed in GR r1. The data elements in the source registers are grouped in pairs, and one element from each pair is selected for the result. In the left_form, the result is formed from the leftmost elements from each of the pairs. In the right_form, the result is formed from the rightmost elements. Elements are selected alternately from the two source registers.


Figure 6-23. Mix Example (panels: mix1.l, mix1.r, mix2.l, mix2.r, mix4.l, mix4.r; each panel shows elements of GR r2 and GR r3 combined into GR r1)


Operation: if (PR[qp]) {
    check_target_register(r1);

    if (one_byte_form) { // one-byte elements
        x[0] = GR[r2]{7:0};   y[0] = GR[r3]{7:0};
        x[1] = GR[r2]{15:8};  y[1] = GR[r3]{15:8};
        x[2] = GR[r2]{23:16}; y[2] = GR[r3]{23:16};
        x[3] = GR[r2]{31:24}; y[3] = GR[r3]{31:24};
        x[4] = GR[r2]{39:32}; y[4] = GR[r3]{39:32};
        x[5] = GR[r2]{47:40}; y[5] = GR[r3]{47:40};
        x[6] = GR[r2]{55:48}; y[6] = GR[r3]{55:48};
        x[7] = GR[r2]{63:56}; y[7] = GR[r3]{63:56};

        if (left_form)
            GR[r1] = concatenate8(x[7], y[7], x[5], y[5],
                                  x[3], y[3], x[1], y[1]);
        else
            GR[r1] = concatenate8(x[6], y[6], x[4], y[4],
                                  x[2], y[2], x[0], y[0]);
    } else if (two_byte_form) { // two-byte elements
        x[0] = GR[r2]{15:0};  y[0] = GR[r3]{15:0};
        x[1] = GR[r2]{31:16}; y[1] = GR[r3]{31:16};
        x[2] = GR[r2]{47:32}; y[2] = GR[r3]{47:32};
        x[3] = GR[r2]{63:48}; y[3] = GR[r3]{63:48};

        if (left_form)
            GR[r1] = concatenate4(x[3], y[3], x[1], y[1]);
        else
            GR[r1] = concatenate4(x[2], y[2], x[0], y[0]);
    } else { // four-byte elements
        x[0] = GR[r2]{31:0};  y[0] = GR[r3]{31:0};
        x[1] = GR[r2]{63:32}; y[1] = GR[r3]{63:32};

        if (left_form)
            GR[r1] = concatenate2(x[1], y[1]);
        else
            GR[r1] = concatenate2(x[0], y[0]);
    }
    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}
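The element selection performed by mix can be modeled compactly in Python. The sketch below is illustrative only: the function name and its parameters are hypothetical, registers are plain 64-bit integers, NaT propagation is not modeled, and the first argument of the concatenate functions is assumed to be the most significant element.

```python
def mix(r2, r3, elem_bytes, left):
    """Model mix1/mix2/mix4 (.l when left is True, .r otherwise)."""
    bits = elem_bytes * 8
    count = 8 // elem_bytes                  # elements per 64-bit register
    mask = (1 << bits) - 1
    x = [(r2 >> (i * bits)) & mask for i in range(count)]
    y = [(r3 >> (i * bits)) & mask for i in range(count)]
    result = 0
    # Walk the pairs from most to least significant; the left form takes the
    # odd-indexed element of each pair, the right form the even-indexed one.
    for pair in range(count // 2 - 1, -1, -1):
        idx = 2 * pair + (1 if left else 0)
        result = (result << bits) | x[idx]   # element from GR r2
        result = (result << bits) | y[idx]   # then the element from GR r3
    return result


print(hex(mix(0x1111111122222222, 0x3333333344444444, 4, left=True)))
print(hex(mix(0x1111111122222222, 0x3333333344444444, 4, left=False)))
```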


mov ar IA-64 Application ISA Guide 1.0

Move Application Register

Format: (qp) mov r1 = ar3        pseudo-op
        (qp) mov ar3 = r2        pseudo-op
        (qp) mov ar3 = imm8      pseudo-op
        (qp) mov.i r1 = ar3      i_form, from_form                 I28
        (qp) mov.i ar3 = r2      i_form, register_form, to_form    I26
        (qp) mov.i ar3 = imm8    i_form, immediate_form, to_form   I27
        (qp) mov.m r1 = ar3      m_form, from_form                 M31
        (qp) mov.m ar3 = r2      m_form, register_form, to_form    M29
        (qp) mov.m ar3 = imm8    m_form, immediate_form, to_form   M30

Description: The source operand is copied to the destination register.

In the from_form, the application register specified by ar3 is copied into GR r1 and the corresponding NaT bit is cleared.

In the to_form, the value in GR r2 (in the register_form), or the sign extended value in imm8 (in the immediate_form), is placed in AR ar3. In the register_form if the NaT bit corresponding to GR r2 is set, then a Register NaT Consumption fault is raised.

Only a subset of the application registers can be accessed by each execution unit (M or I). Table 3-3 on page 3-5 indicates which application registers may be accessed from which execution unit type. An access to an application register from the wrong unit type causes an Illegal Operation fault.

This instruction has multiple forms with the pseudo operation eliminating the need for specifying the execution unit. Accesses of the ARs are always implicitly serialized. While implicitly serialized, read-after-write and write-after-write dependencies must be avoided (e.g., setting CCV, followed by cmpxchg in the same instruction group, or simultaneous writes to the UNAT register by ld.fill and mov to UNAT).

Operation: if (PR[qp]) {
    tmp_type = (i_form ? AR_I_TYPE : AR_M_TYPE);
    if (is_reserved_reg(tmp_type, ar3))
        illegal_operation_fault();

    if (from_form) {
        check_target_register(r1);
        if (((ar3 == BSPSTORE) || (ar3 == RNAT)) && (AR[RSC].mode != 0))
            illegal_operation_fault();

        if (ar3 == ITC && PSR.si && PSR.cpl != 0)
            privileged_register_fault();

        GR[r1] = (is_ignored_reg(ar3)) ? 0 : AR[ar3];
        GR[r1].nat = 0;
    } else { // to_form
        tmp_val = (register_form) ? GR[r2] : sign_ext(imm8, 8);

        if (ar3 == BSP)
            illegal_operation_fault();

        if (((ar3 == BSPSTORE) || (ar3 == RNAT)) && (AR[RSC].mode != 0))
            illegal_operation_fault();

        if (register_form && GR[r2].nat)
            register_nat_consumption_fault(0);

        if (is_reserved_field(AR_TYPE, ar3, tmp_val))
            reserved_register_field_fault();

        if ((is_kernel_reg(ar3) || ar3 == ITC) && (PSR.cpl != 0))
            privileged_register_fault();

        if (!is_ignored_reg(ar3)) {
            tmp_val = ignored_field_mask(AR_TYPE, ar3, tmp_val);
            // check for illegal promotion
            if (ar3 == RSC && tmp_val{3:2} u< PSR.cpl)
                tmp_val{3:2} = PSR.cpl;
            AR[ar3] = tmp_val;

            if (ar3 == BSPSTORE) {
                AR[BSP] = rse_update_internal_stack_pointers(tmp_val);
                AR[RNAT] = undefined();
            }
        }
    }
}


mov br IA-64 Application ISA Guide 1.0

Move Branch Register

Format: (qp) mov r1 = b2        from_form               I22
        (qp) mov b1 = r2        to_form                 I21
        (qp) mov.ret b1 = r2    return_form, to_form    I21


Description: In the from_form, the branch register specified by b2 is copied into GR r1. The NaT bit corresponding to GR r1 is cleared.

In the to_form, the value in GR r2 is copied into BR b1. If the NaT bit corresponding to GR r2 is 1, then a Register NaT Consumption fault is taken.

Operation: if (PR[qp]) {
    if (from_form) {
        check_target_register(r1);
        GR[r1] = BR[b2];
        GR[r1].nat = 0;
    } else { // to_form
        if (GR[r2].nat)
            register_nat_consumption_fault(0);
        BR[b1] = GR[r2];
    }
}


IA-64 Application ISA Guide 1.0 mov fr

Move Floating-Point Register

Format: (qp) mov f1 = f3 pseudo-op of: (qp) fmerge.s f1 = f3, f3

Description: The value of FR f3 is copied to FR f1.



mov gr IA-64 Application ISA Guide 1.0

Move General Register

Format: (qp) mov r1 = r3 pseudo-op of: (qp) adds r1 = 0, r3

Description: The value of GR r3 is copied to GR r1.

Operation: See “Add” on page 6-3.


IA-64 Application ISA Guide 1.0 mov imm

Move Immediate

Format: (qp) mov r1 = imm22 pseudo-op of: (qp) addl r1 = imm22, r0

Description: The immediate value, imm22, is sign extended to 64 bits and placed in GR r1.

Operation: See “Add” on page 6-3.


mov indirect IA-64 Application ISA Guide 1.0

Move Indirect Register

Format: (qp) mov r1 = ireg[r3] from_form M43


Description: For move from indirect register, GR r3 is read and the value used as an index into the register file specified by ireg (see Table 6-33 below). The indexed register is read and its value is copied into GR r1.

Bits {7:0} of GR r3 are used as the index. The remainder of the bits are ignored.

Apart from the PMD register file, access of a non-existent register results in a Reserved Register/Field fault. All accesses to the implementation-dependent portion of the PMD register file result in implementation dependent behavior but do not fault.

Operation: if (PR[qp]) {
    tmp_index = GR[r3]{7:0};

    if (from_form) {
        check_target_register(r1);

        if (GR[r3].nat)
            register_nat_consumption_fault(0);

        if (is_reserved_reg(ireg, tmp_index))
            reserved_register_field_fault();

        if (ireg == PMD_TYPE) {
            GR[r1] = pmd_read(tmp_index);
        } else
            switch (ireg) {
                case CPUID_TYPE: GR[r1] = CPUID[tmp_index]; break;
            }

        GR[r1].nat = 0;
    }
}

Table 6-33. Indirect Register File Mnemonics

ireg     Register File
cpuid    Processor Identification Register
pmd      Performance Monitor Data Register


IA-64 Application ISA Guide 1.0 mov ip

Move Instruction Pointer

Format: (qp) mov r1 = ip I25

Description: The Instruction Pointer (IP) for the bundle containing this instruction is copied into GR r1.


Operation: if (PR[qp]) {
    check_target_register(r1);
    GR[r1] = IP;
    GR[r1].nat = 0;
}

mov pr IA-64 Application ISA Guide 1.0

Move Predicates

Format: (qp) mov r1 = pr            from_form         I25
        (qp) mov pr = r2, mask17    to_form           I23
        (qp) mov pr.rot = imm44     to_rotate_form    I24


Description: For moving the predicates to a GR, PR i is copied to bit position i within GR r1.

For moving to the predicates, the source can either be a general register, or an immediate value. In the to_form, the source operand is GR r2 and only those predicates specified by the immediate value mask17 are written. The value mask17 is encoded in the instruction in an imm16 field such that: imm16 = mask17 >> 1. Predicate register 0 is always one. The mask17 value is sign extended. The most significant bit of mask17, therefore, is the mask bit for all of the rotating predicates. If there is a deferred exception for GR r2 (the NaT bit is 1), a Register NaT Consumption fault is taken.

In the to_rotate_form, only the 48 rotating predicates can be written. The source operand is taken from the imm44 operand (which is encoded in the instruction in an imm28 field, such that: imm28 = imm44 >> 16). The low 16 bits correspond to the static predicates. The immediate is sign extended to set the top 21 predicates. Bit position i in the source operand is copied to PR i.

This instruction operates as if the predicate rotation base in the Current Frame Marker (CFM.rrb.pr) were zero.


Operation: if (PR[qp]) {
    if (from_form) {
        check_target_register(r1);
        GR[r1] = 1; // PR[0] is always 1
        for (i = 1; i <= 63; i++) {
            GR[r1]{i} = PR[pr_phys_to_virt(i)];
        }
        GR[r1].nat = 0;
    } else if (to_form) {
        if (GR[r2].nat)
            register_nat_consumption_fault(0);
        tmp_src = sign_ext(mask17, 17);
        for (i = 1; i <= 63; i++) {
            if (tmp_src{i})
                PR[pr_phys_to_virt(i)] = GR[r2]{i};
        }
    } else { // to_rotate_form
        tmp_src = sign_ext(imm44, 44);
        for (i = 16; i <= 63; i++) {
            PR[pr_phys_to_virt(i)] = tmp_src{i};
        }
    }
}
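The masking behavior of the to_form, and the relationship between mask17 and the encoded imm16 field, can be illustrated with a small Python sketch. The function names are hypothetical, the predicate file is modeled as a single 64-bit integer with bit i holding PR i, and predicate rotation is ignored (the instruction operates as if CFM.rrb.pr were zero).

```python
MASK64 = (1 << 64) - 1


def sign_ext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def mov_to_pr(pr, r2, mask17):
    """Return the predicate bits after `mov pr = r2, mask17`."""
    mask = sign_ext(mask17, 17) & MASK64   # bit 16 covers PR 16..63
    mask &= ~1                             # PR 0 is never written
    updated = (pr & ~mask) | (r2 & mask)
    return (updated | 1) & MASK64          # PR 0 always reads as 1


def encode_mask17(mask17):
    """imm16 field as stated in the text: imm16 = mask17 >> 1."""
    return (mask17 & 0x1FFFF) >> 1


# Setting only the top mask bit rewrites all 48 rotating predicates.
print(hex(mov_to_pr(0x1, MASK64, 0x10000)))
print(hex(encode_mask17(0x10000)))
```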


IA-64 Application ISA Guide 1.0 mov um

Move User Mask

Format: (qp) mov r1 = psr.um    from_form    M36
        (qp) mov psr.um = r2    to_form      M35


Description: For move from user mask, PSR{5:0} is read, zero-extended, and copied into GR r1.

For move to user mask, PSR{5:0} is written by bits {5:0} of GR r2.


Operation: if (PR[qp]) {
    if (from_form) {
        check_target_register(r1);

        GR[r1] = zero_ext(PSR{5:0}, 6);
        GR[r1].nat = 0;
    } else { // to_form
        if (GR[r2].nat)
            register_nat_consumption_fault(0);

        if (is_reserved_field(PSR_TYPE, PSR_UM, GR[r2]))
            reserved_register_field_fault();

        PSR{5:0} = GR[r2]{5:0};
    }
}


movl IA-64 Application ISA Guide 1.0

Move Long Immediate

Format: (qp) movl r1 = imm64 X2

Description: The immediate value imm64 is copied to GR r1. The L slot of the bundle contains 41 bits of imm64.


Operation: if (PR[qp]) {
    check_target_register(r1);
    GR[r1] = imm64;
    GR[r1].nat = 0;
}


IA-64 Application ISA Guide 1.0 mux

Mux

Format: (qp) mux1 r1 = r2, mbtype4    one_byte_form    I3
        (qp) mux2 r1 = r2, mhtype8    two_byte_form    I4

Description: A permutation is performed on the packed elements in a single source register, GR r2, and the result is placed in GR r1. For 8-bit elements, only some of all possible permutations can be specified. The five possible permutations are given in Table 6-34 and shown in Figure 6-24.

For 16-bit elements, all possible permutations, with and without repetitions, can be specified. They are expressed with an 8-bit mhtype8 field, which encodes the indices of the four 16-bit data elements. The indexed 16-bit elements of GR r2 are copied to corresponding 16-bit positions in the target register GR r1. The indices are encoded in little-endian order. (The 8 bits of mhtype8[7:0] are grouped in pairs of bits and named mhtype8[3], mhtype8[2], mhtype8[1], mhtype8[0] in the operation section.)

Table 6-34. Mux Permutations for 8-bit Elements

mbtype4    Function
@rev       Reverse the order of the bytes
@mix       Perform a Mix operation on the two halves of GR r2
@shuf      Perform a Shuffle operation on the two halves of GR r2
@alt       Perform an Alternate operation on the two halves of GR r2
@brcst     Perform a Broadcast operation on the least significant byte of GR r2

Figure 6-24. Mux1 Operation (8-bit elements) (panels: mux1 r1 = r2, @rev; mux1 r1 = r2, @mix; mux1 r1 = r2, @shuf; mux1 r1 = r2, @alt; mux1 r1 = r2, @brcst; each panel shows the bytes of GR r2 permuted into GR r1)


Figure 6-25. Mux2 Examples (16-bit elements) (panels: mux2 r1 = r2, 0x8d (shuffle 10 00 11 01); mux2 r1 = r2, 0x1b (reverse 00 01 10 11); mux2 r1 = r2, 0xaa (broadcast 10 10 10 10); mux2 r1 = r2, 0xd8 (alternate 11 01 10 00); each panel shows the 16-bit elements of GR r2 permuted into GR r1)


Operation: if (PR[qp]) {
    check_target_register(r1);

    if (one_byte_form) {
        x[0] = GR[r2]{7:0};
        x[1] = GR[r2]{15:8};
        x[2] = GR[r2]{23:16};
        x[3] = GR[r2]{31:24};
        x[4] = GR[r2]{39:32};
        x[5] = GR[r2]{47:40};
        x[6] = GR[r2]{55:48};
        x[7] = GR[r2]{63:56};

        switch (mbtype) {
            case '@rev':
                GR[r1] = concatenate8(x[0], x[1], x[2], x[3],
                                      x[4], x[5], x[6], x[7]);
                break;

            case '@mix':
                GR[r1] = concatenate8(x[7], x[3], x[5], x[1],
                                      x[6], x[2], x[4], x[0]);
                break;

            case '@shuf':
                GR[r1] = concatenate8(x[7], x[3], x[6], x[2],
                                      x[5], x[1], x[4], x[0]);
                break;

            case '@alt':
                GR[r1] = concatenate8(x[7], x[5], x[3], x[1],
                                      x[6], x[4], x[2], x[0]);
                break;

            case '@brcst':
                GR[r1] = concatenate8(x[0], x[0], x[0], x[0],
                                      x[0], x[0], x[0], x[0]);
                break;
        }
    } else { // two_byte_form
        x[0] = GR[r2]{15:0};
        x[1] = GR[r2]{31:16};
        x[2] = GR[r2]{47:32};
        x[3] = GR[r2]{63:48};

        res[0] = x[mhtype8{1:0}];
        res[1] = x[mhtype8{3:2}];
        res[2] = x[mhtype8{5:4}];
        res[3] = x[mhtype8{7:6}];

        GR[r1] = concatenate4(res[3], res[2], res[1], res[0]);
    }
    GR[r1].nat = GR[r2].nat;
}
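The little-endian index encoding used by mux2 is easy to misread, so a short Python sketch of the two_byte_form may help. The function name is hypothetical and the register is modeled as a plain 64-bit integer; NaT propagation is not modeled.

```python
def mux2(r2, mhtype8):
    """Model `mux2 r1 = r2, mhtype8` for a 64-bit integer r2."""
    x = [(r2 >> (16 * i)) & 0xFFFF for i in range(4)]
    result = 0
    for pos in range(4):
        # Bits {2*pos+1 : 2*pos} of mhtype8 pick the source element
        # that lands in 16-bit position `pos` of the result.
        index = (mhtype8 >> (2 * pos)) & 0x3
        result |= x[index] << (16 * pos)
    return result


src = 0x3333222211110000             # element i holds the value i * 0x1111
print(hex(mux2(src, 0x1B)))          # reverse:   00 01 10 11
print(hex(mux2(src, 0xAA)))          # broadcast: 10 10 10 10
```

With this source value, the reverse encoding swaps the element order end to end and the broadcast encoding replicates element 2 into every position.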


nop IA-64 Application ISA Guide 1.0

No Operation

Format: (qp) nop imm21      pseudo-op
        (qp) nop.i imm21    i_unit_form    I19
        (qp) nop.b imm21    b_unit_form    B9
        (qp) nop.m imm21    m_unit_form    M37
        (qp) nop.f imm21    f_unit_form    F15
        (qp) nop.x imm62    x_unit_form    X1

Description: No operation is done.

The immediate, imm21 or imm62, can be used by software as a marker in program code. It is ignored by hardware.

For the x_unit_form, the L slot of the bundle contains the upper 41 bits of imm62.

This instruction has five forms, each of which can be executed only on a particular execution unit type. The pseudo-op can be used if the unit type to execute on is unimportant.

Operation: if (PR[qp]) {
    ; // no operation
}


IA-64 Application ISA Guide 1.0 or

Logical Or

Format: (qp) or r1 = r2, r3      register_form    A1
        (qp) or r1 = imm8, r3    imm8_form        A3

Description: The two source operands are logically ORed and the result placed in GR r1. In the register form the first operand is GR r2; in the immediate form the first operand is taken from the imm8 encoding field.



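Operation: if (PR[qp]) {
    check_target_register(r1);

    tmp_src = (register_form ? GR[r2] : sign_ext(imm8, 8));
    tmp_nat = (register_form ? GR[r2].nat : 0);

    GR[r1] = tmp_src | GR[r3];
    GR[r1].nat = tmp_nat || GR[r3].nat;
}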


pack IA-64 Application ISA Guide 1.0

Pack

Format: (qp) pack2.sss r1 = r2, r3    two_byte_form, signed_saturation_form      I2
        (qp) pack2.uss r1 = r2, r3    two_byte_form, unsigned_saturation_form    I2
        (qp) pack4.sss r1 = r2, r3    four_byte_form, signed_saturation_form     I2

Description: 32-bit or 16-bit elements from GR r2 and GR r3 are converted into 16-bit or 8-bit elements respectively, and the results are placed in GR r1. The source elements are treated as signed values. If a source element cannot be represented in the result element, then saturation clipping is performed. The saturation can either be signed or unsigned. If an element is larger than the upper limit value, the result is the upper limit value. If it is smaller than the lower limit value, the result is the lower limit value. The saturation limits are given in Table 6-35.

Table 6-35. Pack Saturation Limits

Size    Source Element Width    Result Element Width    Saturation    Upper Limit    Lower Limit
2       16 bit                  8 bit                   signed        0x7f           0x80
2       16 bit                  8 bit                   unsigned      0xff           0x00
4       32 bit                  16 bit                  signed        0x7fff         0x8000

Figure 6-26. Pack Operation (panels: pack4, pack2; each panel shows the elements of GR r3 and GR r2 packed into GR r1)

Operation: if (PR[qp]) {
    check_target_register(r1);

    if (two_byte_form) { // two_byte_form
        if (signed_saturation_form) { // signed_saturation_form
            max = sign_ext(0x7f, 8);
            min = sign_ext(0x80, 8);
        } else { // unsigned_saturation_form
            max = 0xff;
            min = 0x00;
        }
        temp[0] = sign_ext(GR[r2]{15:0}, 16);
        temp[1] = sign_ext(GR[r2]{31:16}, 16);
        temp[2] = sign_ext(GR[r2]{47:32}, 16);
        temp[3] = sign_ext(GR[r2]{63:48}, 16);
        temp[4] = sign_ext(GR[r3]{15:0}, 16);
        temp[5] = sign_ext(GR[r3]{31:16}, 16);
        temp[6] = sign_ext(GR[r3]{47:32}, 16);
        temp[7] = sign_ext(GR[r3]{63:48}, 16);

        for (i = 0; i < 8; i++) {
            if (temp[i] > max)
                temp[i] = max;

            if (temp[i] < min)
                temp[i] = min;
        }

        GR[r1] = concatenate8(temp[7], temp[6], temp[5], temp[4],
                              temp[3], temp[2], temp[1], temp[0]);
    } else { // four_byte_form
        max = sign_ext(0x7fff, 16); // signed_saturation_form
        min = sign_ext(0x8000, 16);
        temp[0] = sign_ext(GR[r2]{31:0}, 32);
        temp[1] = sign_ext(GR[r2]{63:32}, 32);
        temp[2] = sign_ext(GR[r3]{31:0}, 32);
        temp[3] = sign_ext(GR[r3]{63:32}, 32);

        for (i = 0; i < 4; i++) {
            if (temp[i] > max)
                temp[i] = max;

            if (temp[i] < min)
                temp[i] = min;
        }

        GR[r1] = concatenate4(temp[3], temp[2], temp[1], temp[0]);
    }
    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}
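The saturation clipping in pack can be exercised with a small Python model. This is a sketch under stated assumptions: registers are plain 64-bit integers, the function and parameter names are hypothetical, and NaT propagation is not modeled. The limits follow Table 6-35.

```python
def sign_ext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def pack(r2, r3, src_bytes, signed_sat=True):
    """Model pack2 (src_bytes=2) or pack4 (src_bytes=4)."""
    src_bits = src_bytes * 8
    dst_bits = src_bits // 2
    count = 8 // src_bytes                     # elements per source register
    if signed_sat:
        hi, lo = (1 << (dst_bits - 1)) - 1, -(1 << (dst_bits - 1))
    else:
        hi, lo = (1 << dst_bits) - 1, 0
    # Source elements are always treated as signed; GR r2 supplies the
    # low half of the result and GR r3 the high half.
    elems = [sign_ext(r2 >> (i * src_bits), src_bits) for i in range(count)]
    elems += [sign_ext(r3 >> (i * src_bits), src_bits) for i in range(count)]
    result = 0
    for i, elem in enumerate(elems):
        clipped = max(lo, min(hi, elem))       # saturation clipping
        result |= (clipped & ((1 << dst_bits) - 1)) << (i * dst_bits)
    return result


print(hex(pack(0xFFFF80007FFF0001, 0, 2, signed_sat=True)))   # pack2.sss
print(hex(pack(0xFFFF80007FFF0001, 0, 2, signed_sat=False)))  # pack2.uss
```

The same four 16-bit inputs (1, 32767, -32768, -1) clip differently under the two modes: signed saturation keeps the sign, while unsigned saturation clamps negative inputs to zero.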


padd IA-64 Application ISA Guide 1.0

Parallel Add

Format: (qp) padd1 r1 = r2, r3        one_byte_form, modulo_form            A9
        (qp) padd1.sss r1 = r2, r3    one_byte_form, sss_saturation_form    A9
        (qp) padd1.uus r1 = r2, r3    one_byte_form, uus_saturation_form    A9
        (qp) padd1.uuu r1 = r2, r3    one_byte_form, uuu_saturation_form    A9
        (qp) padd2 r1 = r2, r3        two_byte_form, modulo_form            A9
        (qp) padd2.sss r1 = r2, r3    two_byte_form, sss_saturation_form    A9
        (qp) padd2.uus r1 = r2, r3    two_byte_form, uus_saturation_form    A9
        (qp) padd2.uuu r1 = r2, r3    two_byte_form, uuu_saturation_form    A9
        (qp) padd4 r1 = r2, r3        four_byte_form, modulo_form           A9

Description: The sets of elements from the two source operands are added, and the results placed in GR r1.

If a sum of two elements cannot be represented in the result element and a saturation completer is specified, then saturation clipping is performed. The saturation can either be signed or unsigned, as given in Table 6-36. If the sum of two elements is larger than the upper limit value, the result is the upper limit value. If it is smaller than the lower limit value, the result is the lower limit value. The saturation limits are given in Table 6-37.

Table 6-36. Parallel Add Saturation Completers

Completer    Result r1 Treated as    Source r2 Treated as    Source r3 Treated as
sss          signed                  signed                  signed
uus          unsigned                unsigned                signed
uuu          unsigned                unsigned                unsigned

Table 6-37. Parallel Add Saturation Limits

                         Result r1 Signed              Result r1 Unsigned
Size    Element Width    Upper Limit    Lower Limit    Upper Limit    Lower Limit
1       8 bit            0x7f           0x80           0xff           0x00
2       16 bit           0x7fff         0x8000         0xffff         0x0000

Figure 6-27. Parallel Add Examples (panels: padd1, padd2; each panel shows the elements of GR r2 and GR r3 added and placed in GR r1)

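Operation: if (PR[qp]) {
    check_target_register(r1);

    if (one_byte_form) { // one-byte elements
        x[0] = GR[r2]{7:0};   y[0] = GR[r3]{7:0};
        x[1] = GR[r2]{15:8};  y[1] = GR[r3]{15:8};
        x[2] = GR[r2]{23:16}; y[2] = GR[r3]{23:16};
        x[3] = GR[r2]{31:24}; y[3] = GR[r3]{31:24};
        x[4] = GR[r2]{39:32}; y[4] = GR[r3]{39:32};
        x[5] = GR[r2]{47:40}; y[5] = GR[r3]{47:40};
        x[6] = GR[r2]{55:48}; y[6] = GR[r3]{55:48};
        x[7] = GR[r2]{63:56}; y[7] = GR[r3]{63:56};

        if (sss_saturation_form) { // sss_saturation_form
            max = sign_ext(0x7f, 8);
            min = sign_ext(0x80, 8);

            for (i = 0; i < 8; i++) {
                temp[i] = sign_ext(x[i], 8) + sign_ext(y[i], 8);
            }
        } else if (uus_saturation_form) { // uus_saturation_form
            max = 0xff;
            min = 0x00;

            for (i = 0; i < 8; i++) {
                temp[i] = zero_ext(x[i], 8) + sign_ext(y[i], 8);
            }
        } else if (uuu_saturation_form) { // uuu_saturation_form
            max = 0xff;
            min = 0x00;

            for (i = 0; i < 8; i++) {
                temp[i] = zero_ext(x[i], 8) + zero_ext(y[i], 8);
            }
        } else { // modulo_form
            for (i = 0; i < 8; i++) {
                temp[i] = zero_ext(x[i], 8) + zero_ext(y[i], 8);
            }
        }

        if (sss_saturation_form || uus_saturation_form || uuu_saturation_form) {
            for (i = 0; i < 8; i++) {
                if (temp[i] > max)
                    temp[i] = max;

                if (temp[i] < min)
                    temp[i] = min;
            }
        }
        GR[r1] = concatenate8(temp[7], temp[6], temp[5], temp[4],
                              temp[3], temp[2], temp[1], temp[0]);
    } else if (two_byte_form) { // 2-byte elements
        x[0] = GR[r2]{15:0};  y[0] = GR[r3]{15:0};
        x[1] = GR[r2]{31:16}; y[1] = GR[r3]{31:16};
        x[2] = GR[r2]{47:32}; y[2] = GR[r3]{47:32};
        x[3] = GR[r2]{63:48}; y[3] = GR[r3]{63:48};

        if (sss_saturation_form) { // sss_saturation_form
            max = sign_ext(0x7fff, 16);
            min = sign_ext(0x8000, 16);

            for (i = 0; i < 4; i++) {
                temp[i] = sign_ext(x[i], 16) + sign_ext(y[i], 16);
            }
        } else if (uus_saturation_form) { // uus_saturation_form
            max = 0xffff;
            min = 0x0000;

            for (i = 0; i < 4; i++) {
                temp[i] = zero_ext(x[i], 16) + sign_ext(y[i], 16);
            }
        } else if (uuu_saturation_form) { // uuu_saturation_form
            max = 0xffff;
            min = 0x0000;

            for (i = 0; i < 4; i++) {
                temp[i] = zero_ext(x[i], 16) + zero_ext(y[i], 16);
            }
        } else { // modulo_form
            for (i = 0; i < 4; i++) {
                temp[i] = zero_ext(x[i], 16) + zero_ext(y[i], 16);
            }
        }

        if (sss_saturation_form || uus_saturation_form || uuu_saturation_form) {
            for (i = 0; i < 4; i++) {
                if (temp[i] > max)
                    temp[i] = max;

                if (temp[i] < min)
                    temp[i] = min;
            }
        }
        GR[r1] = concatenate4(temp[3], temp[2], temp[1], temp[0]);
    } else { // four-byte elements
        x[0] = GR[r2]{31:0};  y[0] = GR[r3]{31:0};
        x[1] = GR[r2]{63:32}; y[1] = GR[r3]{63:32};

        for (i = 0; i < 2; i++) { // modulo_form
            temp[i] = zero_ext(x[i], 32) + zero_ext(y[i], 32);
        }

        GR[r1] = concatenate2(temp[1], temp[0]);
    }

    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}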


IA-64 Application ISA Guide 1.0 pavg

Parallel Average

Format: (qp) pavg1 r1 = r2, r3        normal_form, one_byte_form   A9
        (qp) pavg1.raz r1 = r2, r3    raz_form, one_byte_form      A9
        (qp) pavg2 r1 = r2, r3        normal_form, two_byte_form   A9
        (qp) pavg2.raz r1 = r2, r3    raz_form, two_byte_form      A9

Description: The unsigned data elements of GR r2 are added to the unsigned data elements of GR r3. The results of the add are then each independently shifted to the right by one bit position. The high-order bits of each element are filled with the carry bits of the sums. To prevent cumulative round-off errors, an averaging is performed. The unsigned results are placed in GR r1.

The averaging operation works as follows. In the normal_form, the low-order bit of each result is set to 1 if at least one of the two least significant bits of the corresponding sum is 1. In the raz_form, the average rounds away from zero by adding 1 to each of the sums.
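The two rounding behaviors can be compared with a small C model of the one-byte form. This is a sketch under stated assumptions: pavg1_model and its raz flag are names invented for this example, and NaT handling is omitted.

```c
#include <stdint.h>

/* Hypothetical model of pavg1 (raz == 0) and pavg1.raz (raz != 0). */
uint64_t pavg1_model(uint64_t r2, uint64_t r3, int raz) {
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        uint32_t x = (uint32_t)(r2 >> (8 * i)) & 0xff;
        uint32_t y = (uint32_t)(r3 >> (8 * i)) & 0xff;
        uint32_t sum = x + y;                 /* 9-bit sum including the carry */
        uint32_t res = raz
            ? (sum + 1) >> 1                  /* round away from zero */
            : (sum >> 1) | (sum & 1);         /* OR the dropped bit into bit 0 */
        result |= (uint64_t)(res & 0xff) << (8 * i);
    }
    return result;
}
```

With element values 1 and 2 (sum 3), the normal form yields 1 and the raz form yields 2; with 2 and 3 (sum 5), both forms yield 3.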

Figure 6-28. Parallel Average Example
[Diagram not reproduced: pavg2 forms a 16-bit sum plus carry bit for each element, then shifts right 1 bit with the average in the low-order bit.]

Figure 6-29. Parallel Average with Round Away from Zero Example
[Diagram not reproduced: pavg2.raz adds 1 to each 16-bit sum plus carry bit, then shifts right 1 bit.]


Operation: if (PR[qp]) {
    check_target_register(r1);

    if (one_byte_form) { // one_byte_form
        x[0] = GR[r2]{7:0};   y[0] = GR[r3]{7:0};
        x[1] = GR[r2]{15:8};  y[1] = GR[r3]{15:8};
        x[2] = GR[r2]{23:16}; y[2] = GR[r3]{23:16};
        x[3] = GR[r2]{31:24}; y[3] = GR[r3]{31:24};
        x[4] = GR[r2]{39:32}; y[4] = GR[r3]{39:32};
        x[5] = GR[r2]{47:40}; y[5] = GR[r3]{47:40};
        x[6] = GR[r2]{55:48}; y[6] = GR[r3]{55:48};
        x[7] = GR[r2]{63:56}; y[7] = GR[r3]{63:56};

        if (raz_form) {
            for (i = 0; i < 8; i++) {
                temp[i] = zero_ext(x[i], 8) + zero_ext(y[i], 8) + 1;
                res[i] = shift_right_unsigned(temp[i], 1);
            }
        } else { // normal form
            for (i = 0; i < 8; i++) {
                temp[i] = zero_ext(x[i], 8) + zero_ext(y[i], 8);
                res[i] = shift_right_unsigned(temp[i], 1) | (temp[i]{0});
            }
        }
        GR[r1] = concatenate8(res[7], res[6], res[5], res[4],
                              res[3], res[2], res[1], res[0]);

    } else { // two_byte_form
        x[0] = GR[r2]{15:0};  y[0] = GR[r3]{15:0};
        x[1] = GR[r2]{31:16}; y[1] = GR[r3]{31:16};
        x[2] = GR[r2]{47:32}; y[2] = GR[r3]{47:32};
        x[3] = GR[r2]{63:48}; y[3] = GR[r3]{63:48};

        if (raz_form) {
            for (i = 0; i < 4; i++) {
                temp[i] = zero_ext(x[i], 16) + zero_ext(y[i], 16) + 1;
                res[i] = shift_right_unsigned(temp[i], 1);
            }
        } else { // normal form
            for (i = 0; i < 4; i++) {
                temp[i] = zero_ext(x[i], 16) + zero_ext(y[i], 16);
                res[i] = shift_right_unsigned(temp[i], 1) | (temp[i]{0});
            }
        }
        GR[r1] = concatenate4(res[3], res[2], res[1], res[0]);
    }
    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}


pavgsub IA-64 Application ISA Guide 1.0

Parallel Average Subtract

Format: (qp) pavgsub1 r1 = r2, r3    one_byte_form   A9
        (qp) pavgsub2 r1 = r2, r3    two_byte_form   A9

Description: The unsigned data elements of GR r3 are subtracted from the unsigned data elements of GR r2. The results of the subtraction are then each independently shifted to the right by one bit position. The high-order bits of each element are filled with the borrow bits of the subtraction (the complements of the ALU carries). To prevent cumulative round-off errors, an averaging is performed. The low-order bit of each result is set to 1 if at least one of the two least significant bits of the corresponding difference is 1. The signed results are placed in GR r1.
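A C sketch of the one-byte form may help show how the borrow bit becomes the sign of the result. The name pavgsub1_model is an assumption for this example, and NaT handling is omitted.

```c
#include <stdint.h>

/* Hypothetical model of pavgsub1. */
uint64_t pavgsub1_model(uint64_t r2, uint64_t r3) {
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        int32_t x = (int32_t)((r2 >> (8 * i)) & 0xff);
        int32_t y = (int32_t)((r3 >> (8 * i)) & 0xff);
        /* Keep 9 bits of the difference: bit 8 is the borrow. */
        uint32_t diff9 = (uint32_t)(x - y) & 0x1ff;
        /* Shift the borrow into bit 7 and OR the dropped bit into bit 0. */
        uint32_t res = (diff9 >> 1) | (diff9 & 1);
        result |= (uint64_t)(res & 0xff) << (8 * i);
    }
    return result;
}
```

For element values 2 and 5 the difference is -3, and the result element is 0xff, which reads as -1 when interpreted as a signed byte.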

Figure 6-30. Parallel Average Subtract Example
[Diagram not reproduced: pavgsub2 forms a 16-bit difference plus borrow bit for each element, then shifts right 1 bit with the average in the low-order bit.]


Operation: if (PR[qp]) {
    check_target_register(r1);

    if (one_byte_form) { // one_byte_form
        x[0] = GR[r2]{7:0};   y[0] = GR[r3]{7:0};
        x[1] = GR[r2]{15:8};  y[1] = GR[r3]{15:8};
        x[2] = GR[r2]{23:16}; y[2] = GR[r3]{23:16};
        x[3] = GR[r2]{31:24}; y[3] = GR[r3]{31:24};
        x[4] = GR[r2]{39:32}; y[4] = GR[r3]{39:32};
        x[5] = GR[r2]{47:40}; y[5] = GR[r3]{47:40};
        x[6] = GR[r2]{55:48}; y[6] = GR[r3]{55:48};
        x[7] = GR[r2]{63:56}; y[7] = GR[r3]{63:56};

        for (i = 0; i < 8; i++) {
            temp[i] = zero_ext(x[i], 8) - zero_ext(y[i], 8);
            res[i] = (temp[i]{8:0} u>> 1) | (temp[i]{0});
        }
        GR[r1] = concatenate8(res[7], res[6], res[5], res[4],
                              res[3], res[2], res[1], res[0]);

    } else { // two_byte_form
        x[0] = GR[r2]{15:0};  y[0] = GR[r3]{15:0};
        x[1] = GR[r2]{31:16}; y[1] = GR[r3]{31:16};
        x[2] = GR[r2]{47:32}; y[2] = GR[r3]{47:32};
        x[3] = GR[r2]{63:48}; y[3] = GR[r3]{63:48};

        for (i = 0; i < 4; i++) {
            temp[i] = zero_ext(x[i], 16) - zero_ext(y[i], 16);
            res[i] = (temp[i]{16:0} u>> 1) | (temp[i]{0});
        }
        GR[r1] = concatenate4(res[3], res[2], res[1], res[0]);
    }
    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}


pcmp IA-64 Application ISA Guide 1.0

Parallel Compare

Format: (qp) pcmp1.prel r1 = r2, r3    one_byte_form    A9
        (qp) pcmp2.prel r1 = r2, r3    two_byte_form    A9
        (qp) pcmp4.prel r1 = r2, r3    four_byte_form   A9

Description: The two source operands are compared for one of the two relations shown in Table 6-38. If the comparison condition is true for corresponding data elements of GR r2 and GR r3, then the corresponding data element in GR r1 is set to all ones. If the comparison condition is false, the corresponding data element in GR r1 is set to all zeros. For the ‘>’ relation, both operands are interpreted as signed.

Table 6-38. Pcmp Relations

prel   Compare Relation (r2 prel r3)
eq     r2 == r3
gt     r2 > r3 (signed)
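The mask-producing behavior can be sketched in C for the two-byte form. The name pcmp2_model and the gt flag are assumptions made for this example; NaT handling is not modeled.

```c
#include <stdint.h>

/* Hypothetical model of pcmp2.eq (gt == 0) and pcmp2.gt (gt != 0). */
uint64_t pcmp2_model(uint64_t r2, uint64_t r3, int gt) {
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t x = (uint16_t)(r2 >> (16 * i));
        uint16_t y = (uint16_t)(r3 >> (16 * i));
        /* The '>' relation compares the elements as signed values. */
        int rel = gt ? ((int16_t)x > (int16_t)y) : (x == y);
        if (rel)
            result |= (uint64_t)0xffff << (16 * i);  /* all ones when true */
    }
    return result;
}
```

The resulting element masks are typically combined with logical operations to select between two vectors without branching.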

Figure 6-31. Parallel Compare Example
[Diagram not reproduced: pcmp2.eq, pcmp1.gt, and pcmp4.eq compare corresponding elements of GR r2 and GR r3; each true comparison writes all ones (0xffff, 0xff, 0xffffffff) to the element of GR r1 and each false comparison writes all zeros.]


Operation: if (PR[qp]) {
    check_target_register(r1);

    if (one_byte_form) { // one-byte elements
        x[0] = GR[r2]{7:0};   y[0] = GR[r3]{7:0};
        x[1] = GR[r2]{15:8};  y[1] = GR[r3]{15:8};
        x[2] = GR[r2]{23:16}; y[2] = GR[r3]{23:16};
        x[3] = GR[r2]{31:24}; y[3] = GR[r3]{31:24};
        x[4] = GR[r2]{39:32}; y[4] = GR[r3]{39:32};
        x[5] = GR[r2]{47:40}; y[5] = GR[r3]{47:40};
        x[6] = GR[r2]{55:48}; y[6] = GR[r3]{55:48};
        x[7] = GR[r2]{63:56}; y[7] = GR[r3]{63:56};
        for (i = 0; i < 8; i++) {
            if (prel == ‘eq’)
                tmp_rel = x[i] == y[i];
            else
                tmp_rel = greater_signed(sign_ext(x[i], 8), sign_ext(y[i], 8));

            if (tmp_rel)
                res[i] = 0xff;
            else
                res[i] = 0x00;
        }
        GR[r1] = concatenate8(res[7], res[6], res[5], res[4],
                              res[3], res[2], res[1], res[0]);
    } else if (two_byte_form) { // two-byte elements
        x[0] = GR[r2]{15:0};  y[0] = GR[r3]{15:0};
        x[1] = GR[r2]{31:16}; y[1] = GR[r3]{31:16};
        x[2] = GR[r2]{47:32}; y[2] = GR[r3]{47:32};
        x[3] = GR[r2]{63:48}; y[3] = GR[r3]{63:48};
        for (i = 0; i < 4; i++) {
            if (prel == ‘eq’)
                tmp_rel = x[i] == y[i];
            else
                tmp_rel = greater_signed(sign_ext(x[i], 16), sign_ext(y[i], 16));

            if (tmp_rel)
                res[i] = 0xffff;
            else
                res[i] = 0x0000;
        }
        GR[r1] = concatenate4(res[3], res[2], res[1], res[0]);
    } else { // four-byte elements
        x[0] = GR[r2]{31:0};  y[0] = GR[r3]{31:0};
        x[1] = GR[r2]{63:32}; y[1] = GR[r3]{63:32};
        for (i = 0; i < 2; i++) {
            if (prel == ‘eq’)
                tmp_rel = x[i] == y[i];
            else
                tmp_rel = greater_signed(sign_ext(x[i], 32), sign_ext(y[i], 32));

            if (tmp_rel)
                res[i] = 0xffffffff;
            else
                res[i] = 0x00000000;
        }
        GR[r1] = concatenate2(res[1], res[0]);
    }
    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}

pmax IA-64 Application ISA Guide 1.0

Parallel Maximum

Format: (qp) pmax1.u r1 = r2, r3    one_byte_form   I2
        (qp) pmax2 r1 = r2, r3      two_byte_form   I2

Description: The maximum of the two source operands is placed in the result register. In the one_byte_form, each unsigned 8-bit element of GR r2 is compared with the corresponding unsigned 8-bit element of GR r3 and the greater of the two is placed in the corresponding 8-bit element of GR r1. In the two_byte_form, each signed 16-bit element of GR r2 is compared with the corresponding signed 16-bit element of GR r3 and the greater of the two is placed in the corresponding 16-bit element of GR r1.
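The difference between the unsigned one-byte form and the signed two-byte form can be illustrated in C. The names pmax1u_model and pmax2_model are assumptions for this sketch, and NaT handling is omitted.

```c
#include <stdint.h>

/* Hypothetical model of pmax1.u: unsigned 8-bit maximum per element. */
uint64_t pmax1u_model(uint64_t r2, uint64_t r3) {
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t x = (uint8_t)(r2 >> (8 * i));
        uint8_t y = (uint8_t)(r3 >> (8 * i));
        result |= (uint64_t)(x < y ? y : x) << (8 * i);
    }
    return result;
}

/* Hypothetical model of pmax2: signed 16-bit maximum per element. */
uint64_t pmax2_model(uint64_t r2, uint64_t r3) {
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t x = (uint16_t)(r2 >> (16 * i));
        uint16_t y = (uint16_t)(r3 >> (16 * i));
        uint16_t m = ((int16_t)x < (int16_t)y) ? y : x;
        result |= (uint64_t)m << (16 * i);
    }
    return result;
}
```

The element 0x80 is larger than 0x7f under pmax1.u, while 0x8000 is smaller than 0x7fff under pmax2 because it is read as a negative value.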



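Operation: if (PR[qp]) {
    check_target_register(r1);

    if (one_byte_form) { // one-byte elements
        x[0] = GR[r2]{7:0};   y[0] = GR[r3]{7:0};
        x[1] = GR[r2]{15:8};  y[1] = GR[r3]{15:8};
        x[2] = GR[r2]{23:16}; y[2] = GR[r3]{23:16};
        x[3] = GR[r2]{31:24}; y[3] = GR[r3]{31:24};
        x[4] = GR[r2]{39:32}; y[4] = GR[r3]{39:32};
        x[5] = GR[r2]{47:40}; y[5] = GR[r3]{47:40};
        x[6] = GR[r2]{55:48}; y[6] = GR[r3]{55:48};
        x[7] = GR[r2]{63:56}; y[7] = GR[r3]{63:56};
        for (i = 0; i < 8; i++) {
            res[i] = (zero_ext(x[i],8) < zero_ext(y[i],8)) ? y[i] : x[i];
        }
        GR[r1] = concatenate8(res[7], res[6], res[5], res[4],
                              res[3], res[2], res[1], res[0]);
    } else { // two-byte elements
        x[0] = GR[r2]{15:0};  y[0] = GR[r3]{15:0};
        x[1] = GR[r2]{31:16}; y[1] = GR[r3]{31:16};
        x[2] = GR[r2]{47:32}; y[2] = GR[r3]{47:32};
        x[3] = GR[r2]{63:48}; y[3] = GR[r3]{63:48};
        for (i = 0; i < 4; i++) {
            res[i] = (sign_ext(x[i],16) < sign_ext(y[i],16)) ? y[i] : x[i];
        }
        GR[r1] = concatenate4(res[3], res[2], res[1], res[0]);
    }
    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}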

Figure 6-32. Parallel Maximum Example
[Diagram not reproduced: pmax2 and pmax1.u compare corresponding elements of GR r2 and GR r3 and place the greater of each pair in GR r1.]

IA-64 Application ISA Guide 1.0 pmin

Parallel Minimum

Format: (qp) pmin1.u r1 = r2, r3    one_byte_form   I2
        (qp) pmin2 r1 = r2, r3      two_byte_form   I2

Description: The minimum of the two source operands is placed in the result register. In the one_byte_form, each unsigned 8-bit element of GR r2 is compared with the corresponding unsigned 8-bit element of GR r3 and the smaller of the two is placed in the corresponding 8-bit element of GR r1. In the two_byte_form, each signed 16-bit element of GR r2 is compared with the corresponding signed 16-bit element of GR r3 and the smaller of the two is placed in the corresponding 16-bit element of GR r1.



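Operation: if (PR[qp]) {
    check_target_register(r1);

    if (one_byte_form) { // one-byte elements
        x[0] = GR[r2]{7:0};   y[0] = GR[r3]{7:0};
        x[1] = GR[r2]{15:8};  y[1] = GR[r3]{15:8};
        x[2] = GR[r2]{23:16}; y[2] = GR[r3]{23:16};
        x[3] = GR[r2]{31:24}; y[3] = GR[r3]{31:24};
        x[4] = GR[r2]{39:32}; y[4] = GR[r3]{39:32};
        x[5] = GR[r2]{47:40}; y[5] = GR[r3]{47:40};
        x[6] = GR[r2]{55:48}; y[6] = GR[r3]{55:48};
        x[7] = GR[r2]{63:56}; y[7] = GR[r3]{63:56};
        for (i = 0; i < 8; i++) {
            res[i] = (zero_ext(x[i],8) < zero_ext(y[i],8)) ? x[i] : y[i];
        }
        GR[r1] = concatenate8(res[7], res[6], res[5], res[4],
                              res[3], res[2], res[1], res[0]);
    } else { // two-byte elements
        x[0] = GR[r2]{15:0};  y[0] = GR[r3]{15:0};
        x[1] = GR[r2]{31:16}; y[1] = GR[r3]{31:16};
        x[2] = GR[r2]{47:32}; y[2] = GR[r3]{47:32};
        x[3] = GR[r2]{63:48}; y[3] = GR[r3]{63:48};
        for (i = 0; i < 4; i++) {
            res[i] = (sign_ext(x[i],16) < sign_ext(y[i],16)) ? x[i] : y[i];
        }
        GR[r1] = concatenate4(res[3], res[2], res[1], res[0]);
    }
    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}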

Figure 6-33. Parallel Minimum Example
[Diagram not reproduced: pmin2 and pmin1.u compare corresponding elements of GR r2 and GR r3 and place the smaller of each pair in GR r1.]


pmpy IA-64 Application ISA Guide 1.0

Parallel Multiply

Format: (qp) pmpy2.r r1 = r2, r3    right_form   I2
        (qp) pmpy2.l r1 = r2, r3    left_form    I2

Description: Two signed 16-bit data elements of GR r2 are multiplied by the corresponding two signed 16-bit data elements of GR r3 as shown in Figure 6-34. The two 32-bit results are placed in GR r1.
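Which elements each form selects can be sketched in C. The name pmpy2_model and the left flag are assumptions for this example; the element positions follow the Operation pseudocode for this instruction, and NaT handling is omitted.

```c
#include <stdint.h>

/* Hypothetical model of pmpy2.r (left == 0) and pmpy2.l (left != 0). */
uint64_t pmpy2_model(uint64_t r2, uint64_t r3, int left) {
    int sh = left ? 16 : 0;  /* right form uses elements 0 and 2, left form 1 and 3 */
    int32_t lo = (int32_t)(int16_t)(uint16_t)(r2 >> sh) *
                 (int32_t)(int16_t)(uint16_t)(r3 >> sh);
    int32_t hi = (int32_t)(int16_t)(uint16_t)(r2 >> (32 + sh)) *
                 (int32_t)(int16_t)(uint16_t)(r3 >> (32 + sh));
    /* Each signed 16x16 product fits in 32 bits, so no overflow can occur. */
    return ((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo;
}
```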


Operation: if (PR[qp]) {
    check_target_register(r1);

    if (right_form) {
        GR[r1]{31:0} = sign_ext(GR[r2]{15:0}, 16) * sign_ext(GR[r3]{15:0}, 16);
        GR[r1]{63:32} = sign_ext(GR[r2]{47:32}, 16) * sign_ext(GR[r3]{47:32}, 16);
    } else { // left_form
        GR[r1]{31:0} = sign_ext(GR[r2]{31:16}, 16) * sign_ext(GR[r3]{31:16}, 16);
        GR[r1]{63:32} = sign_ext(GR[r2]{63:48}, 16) * sign_ext(GR[r3]{63:48}, 16);
    }
    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}


Figure 6-34. Parallel Multiply Operation
[Diagram not reproduced: pmpy2.r and pmpy2.l each multiply two pairs of 16-bit elements of GR r2 and GR r3 and place the two 32-bit products in GR r1.]

IA-64 Application ISA Guide 1.0 pmpyshr

Parallel Multiply and Shift Right

Format: (qp) pmpyshr2 r1 = r2, r3, count2      signed_form     I1
        (qp) pmpyshr2.u r1 = r2, r3, count2    unsigned_form   I1

Description: The four 16-bit data elements of GR r2 are multiplied by the corresponding four 16-bit data elements of GR r3 as shown in Figure 6-35. This multiplication can either be signed (pmpyshr2), or unsigned (pmpyshr2.u). Each product is then shifted to the right count2 bits, and the least-significant 16 bits of each shifted product form four 16-bit results, which are placed in GR r1. A count2 of 0 gives the 16 low bits of the results; a count2 of 16 gives the 16 high bits of the results. The allowed values for count2 are given in Table 6-39.

Operation: if (PR[qp]) {
    check_target_register(r1);

    x[0] = GR[r2]{15:0};  y[0] = GR[r3]{15:0};
    x[1] = GR[r2]{31:16}; y[1] = GR[r3]{31:16};
    x[2] = GR[r2]{47:32}; y[2] = GR[r3]{47:32};
    x[3] = GR[r2]{63:48}; y[3] = GR[r3]{63:48};
    for (i = 0; i < 4; i++) {
        if (unsigned_form) // unsigned multiplication
            temp[i] = zero_ext(x[i], 16) * zero_ext(y[i], 16);
        else // signed multiplication
            temp[i] = sign_ext(x[i], 16) * sign_ext(y[i], 16);

        res[i] = temp[i]{(count2 + 15):count2};
    }

    GR[r1] = concatenate4(res[3], res[2], res[1], res[0]);
    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}

Table 6-39. PMPYSHR Shift Options

count2   Selected Bit Field from each 32-bit Product
0        15:0
7        22:7
15       30:15
16       31:16
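The bit-field selection in Table 6-39 can be sketched in C. The name pmpyshr2_model and the is_unsigned flag are assumptions for this example; the caller is assumed to pass one of the allowed count2 values (0, 7, 15, or 16), and NaT handling is omitted.

```c
#include <stdint.h>

/* Hypothetical model of pmpyshr2 (is_unsigned == 0) and pmpyshr2.u. */
uint64_t pmpyshr2_model(uint64_t r2, uint64_t r3, unsigned count2, int is_unsigned) {
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t x = (uint16_t)(r2 >> (16 * i));
        uint16_t y = (uint16_t)(r3 >> (16 * i));
        uint32_t prod = is_unsigned
            ? (uint32_t)x * (uint32_t)y
            : (uint32_t)((int32_t)(int16_t)x * (int32_t)(int16_t)y);
        /* Select bits (count2 + 15):count2 of the 32-bit product. */
        uint16_t res = (uint16_t)(prod >> count2);
        result |= (uint64_t)res << (16 * i);
    }
    return result;
}
```

With a count2 of 15, the signed form behaves like a fractional multiply on values scaled by 2^15: 0x4000 (0.5) times 0x4000 gives 0x2000 (0.25).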

Figure 6-35. Parallel Multiply and Shift Right Operation
[Diagram not reproduced: pmpyshr2 multiplies the four 16-bit source elements into 32-bit products, shifts each product right count2 bits, and keeps four 16-bit result elements.]

popcnt IA-64 Application ISA Guide 1.0

Population Count

Format: (qp) popcnt r1 = r3 I9

Description: The number of bits in GR r3 having the value 1 is counted, and the resulting sum is placed in GR r1.
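As a point of reference, the count can be expressed directly in C. The name popcnt_model is an assumption for this sketch; it mirrors the bit-by-bit loop of the Operation pseudocode rather than any particular hardware implementation.

```c
#include <stdint.h>

/* Hypothetical model of popcnt: count the bits of r3 that are 1. */
uint64_t popcnt_model(uint64_t r3) {
    uint64_t res = 0;
    for (int i = 0; i < 64; i++) {
        res += (r3 >> i) & 1;
    }
    return res;
}
```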


Operation: if (PR[qp]) {
    check_target_register(r1);

    res = 0;
    // Count up all the one bits
    for (i = 0; i < 64; i++) {
        res += GR[r3]{i};
    }

    GR[r1] = res;
    GR[r1].nat = GR[r3].nat;
}

IA-64 Application ISA Guide 1.0 psad

Parallel Sum of Absolute Difference

Format: (qp) psad1 r1 = r2, r3 I2

Description: The unsigned 8-bit elements of GR r2 are subtracted from the unsigned 8-bit elements of GR r3. The absolute value of each difference is accumulated across the elements and placed in GR r1.
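The accumulation can be sketched in C as follows. The name psad1_model is an assumption for this example, and NaT handling is omitted. Because the absolute value is taken, the order of subtraction does not affect the result.

```c
#include <stdint.h>

/* Hypothetical model of psad1: sum of absolute byte differences. */
uint64_t psad1_model(uint64_t r2, uint64_t r3) {
    uint64_t sum = 0;
    for (int i = 0; i < 8; i++) {
        int32_t x = (int32_t)((r2 >> (8 * i)) & 0xff);
        int32_t y = (int32_t)((r3 >> (8 * i)) & 0xff);
        int32_t diff = x - y;
        sum += (uint64_t)(diff < 0 ? -diff : diff);
    }
    return sum;  /* at most 8 * 255 = 2040 */
}
```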


Operation: if (PR[qp]) {
    check_target_register(r1);

    x[0] = GR[r2]{7:0};   y[0] = GR[r3]{7:0};
    x[1] = GR[r2]{15:8};  y[1] = GR[r3]{15:8};
    x[2] = GR[r2]{23:16}; y[2] = GR[r3]{23:16};
    x[3] = GR[r2]{31:24}; y[3] = GR[r3]{31:24};
    x[4] = GR[r2]{39:32}; y[4] = GR[r3]{39:32};
    x[5] = GR[r2]{47:40}; y[5] = GR[r3]{47:40};
    x[6] = GR[r2]{55:48}; y[6] = GR[r3]{55:48};
    x[7] = GR[r2]{63:56}; y[7] = GR[r3]{63:56};

    GR[r1] = 0;
    for (i = 0; i < 8; i++) {
        temp[i] = zero_ext(x[i], 8) - zero_ext(y[i], 8);
        if (temp[i] < 0)
            temp[i] = -temp[i];
        GR[r1] += temp[i];
    }
    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}


Figure 6-36. Parallel Sum of Absolute Difference Example
[Diagram not reproduced: psad1 subtracts the eight byte elements of GR r2 and GR r3, takes the absolute value of each difference, and sums the eight values into GR r1.]

pshl IA-64 Application ISA Guide 1.0

Parallel Shift Left

Format: (qp) pshl2 r1 = r2, r3        two_byte_form, variable_form    I7
        (qp) pshl2 r1 = r2, count5    two_byte_form, fixed_form       I8
        (qp) pshl4 r1 = r2, r3        four_byte_form, variable_form   I7
        (qp) pshl4 r1 = r2, count5    four_byte_form, fixed_form      I8

Description: The data elements of GR r2 are each independently shifted to the left by the scalar shift count in GR r3, or in the immediate field count5. The low-order bits of each element are filled with zeros. The shift count is interpreted as unsigned. Shift counts greater than 15 (for 16-bit quantities) or 31 (for 32-bit quantities) yield all zero results. The results are placed in GR r1.
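The handling of large shift counts can be sketched in C for the two-byte form. The name pshl2_model is an assumption for this example; the count parameter stands in for either GR r3 or count5, and NaT handling is omitted.

```c
#include <stdint.h>

/* Hypothetical model of pshl2. */
uint64_t pshl2_model(uint64_t r2, uint64_t count) {
    uint64_t result = 0;
    if (count > 16)
        count = 16;  /* any count above 15 clears every element */
    for (int i = 0; i < 4; i++) {
        uint32_t e = (uint32_t)(r2 >> (16 * i)) & 0xffff;
        uint16_t res = (uint16_t)(e << count);  /* bits shifted out are dropped */
        result |= (uint64_t)res << (16 * i);
    }
    return result;
}
```

Clamping the count before shifting also keeps the C shift amount within the width of the 32-bit temporary, which matters because an oversized shift is undefined in C.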


Operation: if (PR[qp]) {
    check_target_register(r1);

    shift_count = (variable_form ? GR[r3] : count5);
    tmp_nat = (variable_form ? GR[r3].nat : 0);

    if (two_byte_form) { // two_byte_form
        if (shift_count u> 16)
            shift_count = 16;
        GR[r1]{15:0} = GR[r2]{15:0} << shift_count;
        GR[r1]{31:16} = GR[r2]{31:16} << shift_count;
        GR[r1]{47:32} = GR[r2]{47:32} << shift_count;
        GR[r1]{63:48} = GR[r2]{63:48} << shift_count;
    } else { // four_byte_form
        if (shift_count u> 32)
            shift_count = 32;
        GR[r1]{31:0} = GR[r2]{31:0} << shift_count;
        GR[r1]{63:32} = GR[r2]{63:32} << shift_count;
    }

    GR[r1].nat = GR[r2].nat || tmp_nat;
}

Figure 6-37. Parallel Shift Left Example
[Diagram not reproduced: pshl2 and pshl4 shift each element of GR r2 left, filling the low-order bits of each element of GR r1 with zeros.]

IA-64 Application ISA Guide 1.0 pshladd

Parallel Shift Left and Add

Format: (qp) pshladd2 r1 = r2, count2, r3 A10

Description: The four signed 16-bit data elements of GR r2 are each independently shifted to the left by count2 bits (shifting zeros into the low-order bits), and added to the four signed 16-bit data elements of GR r3. Both the left shift and the add operations are saturating: if the result of either the shift or the add is not representable as a signed 16-bit value, the final result is saturated. The four signed 16-bit results are placed in GR r1. The first operand can be shifted by 1, 2 or 3 bits.
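The two saturation points can be sketched in C. The name pshladd2_model is an assumption for this example, count2 is assumed to be 1, 2, or 3, and NaT handling is omitted. Note that once the shift saturates, the add is skipped, as in the Operation pseudocode.

```c
#include <stdint.h>

/* Hypothetical model of pshladd2. */
uint64_t pshladd2_model(uint64_t r2, unsigned count2, uint64_t r3) {
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        int32_t x = (int16_t)(uint16_t)(r2 >> (16 * i));
        int32_t y = (int16_t)(uint16_t)(r3 >> (16 * i));
        /* Multiply instead of shifting: left-shifting a negative value is
           undefined in C, and the product fits easily in 32 bits. */
        int32_t t = x * (1 << count2);
        int32_t res;
        if (t > 32767) {
            res = 32767;            /* shift saturated high */
        } else if (t < -32768) {
            res = -32768;           /* shift saturated low */
        } else {
            res = t + y;
            if (res > 32767) res = 32767;
            if (res < -32768) res = -32768;
        }
        result |= (uint64_t)(uint16_t)res << (16 * i);
    }
    return result;
}
```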


Operation: if (PR[qp]) {
    check_target_register(r1);

    x[0] = GR[r2]{15:0};  y[0] = GR[r3]{15:0};
    x[1] = GR[r2]{31:16}; y[1] = GR[r3]{31:16};
    x[2] = GR[r2]{47:32}; y[2] = GR[r3]{47:32};
    x[3] = GR[r2]{63:48}; y[3] = GR[r3]{63:48};

    max = sign_ext(0x7fff, 16);
    min = sign_ext(0x8000, 16);

    for (i = 0; i < 4; i++) {
        temp[i] = sign_ext(x[i], 16) << count2;

        if (temp[i] > max)
            res[i] = max;
        else if (temp[i] < min)
            res[i] = min;
        else {
            res[i] = temp[i] + sign_ext(y[i], 16);
            if (res[i] > max)
                res[i] = max;
            if (res[i] < min)
                res[i] = min;
        }
    }

    GR[r1] = concatenate4(res[3], res[2], res[1], res[0]);
    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}


pshr IA-64 Application ISA Guide 1.0

Parallel Shift Right

Format: (qp) pshr2 r1 = r3, r2          signed_form, two_byte_form, variable_form      I5
        (qp) pshr2 r1 = r3, count5      signed_form, two_byte_form, fixed_form         I6
        (qp) pshr2.u r1 = r3, r2        unsigned_form, two_byte_form, variable_form    I5
        (qp) pshr2.u r1 = r3, count5    unsigned_form, two_byte_form, fixed_form       I6
        (qp) pshr4 r1 = r3, r2          signed_form, four_byte_form, variable_form     I5
        (qp) pshr4 r1 = r3, count5      signed_form, four_byte_form, fixed_form        I6
        (qp) pshr4.u r1 = r3, r2        unsigned_form, four_byte_form, variable_form   I5
        (qp) pshr4.u r1 = r3, count5    unsigned_form, four_byte_form, fixed_form      I6

Description: The data elements of GR r3 are each independently shifted to the right by the scalar shift count in GR r2, or in the immediate field count5. The high-order bits of each element are filled with either the initial value of the sign bits of the data elements in GR r3 (arithmetic shift) or zeros (logical shift). The shift count is interpreted as unsigned. Shift counts greater than 15 (for 16-bit quantities) or 31 (for 32-bit quantities) yield all zero or all one results depending on the initial values of the sign bits of the data elements in GR r3 and whether a signed or unsigned shift is done. The results are placed in GR r1.
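The arithmetic and logical variants can be compared with a C sketch of the two-byte form. The name pshr2_model and the is_unsigned flag are assumptions for this example; count stands in for either GR r2 or count5, and NaT handling is omitted.

```c
#include <stdint.h>

/* Hypothetical model of pshr2 (is_unsigned == 0) and pshr2.u. */
uint64_t pshr2_model(uint64_t r3, uint64_t count, int is_unsigned) {
    uint64_t result = 0;
    if (count > 16)
        count = 16;
    for (int i = 0; i < 4; i++) {
        uint16_t e = (uint16_t)(r3 >> (16 * i));
        uint16_t res;
        if (is_unsigned) {
            res = (uint16_t)((uint32_t)e >> count);   /* zero fill */
        } else {
            int32_t s = (int16_t)e;
            /* Portable arithmetic shift: shift the complement of a negative
               value so that only non-negative values are shifted. */
            res = (uint16_t)(s < 0 ? ~(~s >> count) : (s >> count));
        }
        result |= (uint64_t)res << (16 * i);
    }
    return result;
}
```

An element of 0x8000 shifted by 4 becomes 0xf800 in the signed form and 0x0800 in the unsigned form; with a count of 20 it becomes 0xffff and 0x0000 respectively.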


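Operation: if (PR[qp]) {
    check_target_register(r1);

    shift_count = (variable_form ? GR[r2] : count5);
    tmp_nat = (variable_form ? GR[r2].nat : 0);

    if (two_byte_form) { // two_byte_form
        if (shift_count u> 16)
            shift_count = 16;
        if (unsigned_form) { // unsigned shift
            GR[r1]{15:0} = shift_right_unsigned(zero_ext(GR[r3]{15:0}, 16),
                                                shift_count);
            GR[r1]{31:16} = shift_right_unsigned(zero_ext(GR[r3]{31:16}, 16),
                                                 shift_count);
            GR[r1]{47:32} = shift_right_unsigned(zero_ext(GR[r3]{47:32}, 16),
                                                 shift_count);
            GR[r1]{63:48} = shift_right_unsigned(zero_ext(GR[r3]{63:48}, 16),
                                                 shift_count);
        } else { // signed shift
            GR[r1]{15:0} = shift_right_signed(sign_ext(GR[r3]{15:0}, 16),
                                              shift_count);
            GR[r1]{31:16} = shift_right_signed(sign_ext(GR[r3]{31:16}, 16),
                                               shift_count);
            GR[r1]{47:32} = shift_right_signed(sign_ext(GR[r3]{47:32}, 16),
                                               shift_count);
            GR[r1]{63:48} = shift_right_signed(sign_ext(GR[r3]{63:48}, 16),
                                               shift_count);
        }
    } else { // four_byte_form
        if (shift_count u> 32)
            shift_count = 32;
        if (unsigned_form) { // unsigned shift
            GR[r1]{31:0} = shift_right_unsigned(zero_ext(GR[r3]{31:0}, 32),
                                                shift_count);
            GR[r1]{63:32} = shift_right_unsigned(zero_ext(GR[r3]{63:32}, 32),
                                                 shift_count);
        } else { // signed shift
            GR[r1]{31:0} = shift_right_signed(sign_ext(GR[r3]{31:0}, 32),
                                              shift_count);
            GR[r1]{63:32} = shift_right_signed(sign_ext(GR[r3]{63:32}, 32),
                                               shift_count);
        }
    }

    GR[r1].nat = GR[r3].nat || tmp_nat;
}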

IA-64 Application ISA Guide 1.0 pshradd

Parallel Shift Right and Add

Format: (qp) pshradd2 r1 = r2, count2, r3 A10

Description: The four signed 16-bit data elements of GR r2 are each independently shifted to the right by count2 bits, and added to the four signed 16-bit data elements of GR r3. The right shift operation fills the high-order bits of each element with the initial value of the sign bits of the data elements in GR r2. The add operation is performed with signed saturation. The four signed 16-bit results of the add are placed in GR r1. The first operand can be shifted by 1, 2 or 3 bits.
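A C sketch of the shift-then-saturating-add sequence follows. The name pshradd2_model is an assumption for this example, count2 is assumed to be 1, 2, or 3, and NaT handling is omitted.

```c
#include <stdint.h>

/* Hypothetical model of pshradd2. */
uint64_t pshradd2_model(uint64_t r2, unsigned count2, uint64_t r3) {
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        int32_t x = (int16_t)(uint16_t)(r2 >> (16 * i));
        int32_t y = (int16_t)(uint16_t)(r3 >> (16 * i));
        /* Arithmetic right shift written so that only non-negative
           values are shifted, which is fully defined in C. */
        int32_t shifted = x < 0 ? ~(~x >> count2) : (x >> count2);
        int32_t sum = shifted + y;
        if (sum > 32767) sum = 32767;      /* signed saturation */
        if (sum < -32768) sum = -32768;
        result |= (uint64_t)(uint16_t)sum << (16 * i);
    }
    return result;
}
```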


Operation: if (PR[qp]) {
    check_target_register(r1);

    x[0] = GR[r2]{15:0};  y[0] = GR[r3]{15:0};
    x[1] = GR[r2]{31:16}; y[1] = GR[r3]{31:16};
    x[2] = GR[r2]{47:32}; y[2] = GR[r3]{47:32};
    x[3] = GR[r2]{63:48}; y[3] = GR[r3]{63:48};

    max = sign_ext(0x7fff, 16);
    min = sign_ext(0x8000, 16);

    for (i = 0; i < 4; i++) {
        temp[i] = shift_right_signed(sign_ext(x[i], 16), count2);

        res[i] = temp[i] + sign_ext(y[i], 16);
        if (res[i] > max)
            res[i] = max;
        if (res[i] < min)
            res[i] = min;
    }

    GR[r1] = concatenate4(res[3], res[2], res[1], res[0]);
    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}


psub IA-64 Application ISA Guide 1.0

Parallel Subtract

Format: (qp) psub1 r1 = r2, r3        one_byte_form, modulo_form           A9
        (qp) psub1.sss r1 = r2, r3    one_byte_form, sss_saturation_form   A9
        (qp) psub1.uus r1 = r2, r3    one_byte_form, uus_saturation_form   A9
        (qp) psub1.uuu r1 = r2, r3    one_byte_form, uuu_saturation_form   A9
        (qp) psub2 r1 = r2, r3        two_byte_form, modulo_form           A9
        (qp) psub2.sss r1 = r2, r3    two_byte_form, sss_saturation_form   A9
        (qp) psub2.uus r1 = r2, r3    two_byte_form, uus_saturation_form   A9
        (qp) psub2.uuu r1 = r2, r3    two_byte_form, uuu_saturation_form   A9
        (qp) psub4 r1 = r2, r3        four_byte_form, modulo_form          A9

Description: The sets of elements from the two source operands are subtracted, and the results are placed in GR r1.

If the difference between two elements cannot be represented in the result element and a saturation completer is specified, then saturation clipping is performed. The saturation can either be signed or unsigned, as given in Table 6-40. If the difference of two elements is larger than the upper limit value, the result is the upper limit value. If it is smaller than the lower limit value, the result is the lower limit value. The saturation limits are given in Table 6-41.

Table 6-40. Parallel Subtract Saturation Completers

Completer   Result r1 Treated as   Source r2 Treated as   Source r3 Treated as
sss         signed                 signed                 signed
uus         unsigned               unsigned               signed
uuu         unsigned               unsigned               unsigned

Table 6-41. Parallel Subtract Saturation Limits

                       Result r1 Signed            Result r1 Unsigned
Size   Element Width   Upper Limit   Lower Limit   Upper Limit   Lower Limit
1      8 bit           0x7f          0x80          0xff          0x00
2      16 bit          0x7fff        0x8000        0xffff        0x0000

Figure 6-38. Parallel Subtract Example
[Diagram not reproduced: psub1 and psub2 subtract corresponding elements of GR r3 from GR r2 and place the differences in GR r1.]



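Operation: if (PR[qp]) {
    check_target_register(r1);

    if (one_byte_form) { // one-byte elements
        x[0] = GR[r2]{7:0};   y[0] = GR[r3]{7:0};
        x[1] = GR[r2]{15:8};  y[1] = GR[r3]{15:8};
        x[2] = GR[r2]{23:16}; y[2] = GR[r3]{23:16};
        x[3] = GR[r2]{31:24}; y[3] = GR[r3]{31:24};
        x[4] = GR[r2]{39:32}; y[4] = GR[r3]{39:32};
        x[5] = GR[r2]{47:40}; y[5] = GR[r3]{47:40};
        x[6] = GR[r2]{55:48}; y[6] = GR[r3]{55:48};
        x[7] = GR[r2]{63:56}; y[7] = GR[r3]{63:56};

        if (sss_saturation_form) { // sss_saturation_form
            max = sign_ext(0x7f, 8);
            min = sign_ext(0x80, 8);
            for (i = 0; i < 8; i++) {
                temp[i] = sign_ext(x[i], 8) - sign_ext(y[i], 8);
            }
        } else if (uus_saturation_form) { // uus_saturation_form
            max = 0xff;
            min = 0x00;
            for (i = 0; i < 8; i++) {
                temp[i] = zero_ext(x[i], 8) - sign_ext(y[i], 8);
            }
        } else if (uuu_saturation_form) { // uuu_saturation_form
            max = 0xff;
            min = 0x00;
            for (i = 0; i < 8; i++) {
                temp[i] = zero_ext(x[i], 8) - zero_ext(y[i], 8);
            }
        } else { // modulo_form
            for (i = 0; i < 8; i++) {
                temp[i] = zero_ext(x[i], 8) - zero_ext(y[i], 8);
            }
        }

        if (sss_saturation_form || uus_saturation_form || uuu_saturation_form) {
            for (i = 0; i < 8; i++) {
                if (temp[i] > max)
                    temp[i] = max;
                if (temp[i] < min)
                    temp[i] = min;
            }
        }

        GR[r1] = concatenate8(temp[7], temp[6], temp[5], temp[4],
                              temp[3], temp[2], temp[1], temp[0]);
    } else if (two_byte_form) { // two-byte elements
        x[0] = GR[r2]{15:0};  y[0] = GR[r3]{15:0};
        x[1] = GR[r2]{31:16}; y[1] = GR[r3]{31:16};
        x[2] = GR[r2]{47:32}; y[2] = GR[r3]{47:32};
        x[3] = GR[r2]{63:48}; y[3] = GR[r3]{63:48};

        if (sss_saturation_form) { // sss_saturation_form
            max = sign_ext(0x7fff, 16);
            min = sign_ext(0x8000, 16);
            for (i = 0; i < 4; i++) {
                temp[i] = sign_ext(x[i], 16) - sign_ext(y[i], 16);
            }
        } else if (uus_saturation_form) { // uus_saturation_form
            max = 0xffff;
            min = 0x0000;
            for (i = 0; i < 4; i++) {
                temp[i] = zero_ext(x[i], 16) - sign_ext(y[i], 16);
            }
        } else if (uuu_saturation_form) { // uuu_saturation_form
            max = 0xffff;
            min = 0x0000;
            for (i = 0; i < 4; i++) {
                temp[i] = zero_ext(x[i], 16) - zero_ext(y[i], 16);
            }
        } else { // modulo_form
            for (i = 0; i < 4; i++) {
                temp[i] = zero_ext(x[i], 16) - zero_ext(y[i], 16);
            }
        }

        if (sss_saturation_form || uus_saturation_form || uuu_saturation_form) {
            for (i = 0; i < 4; i++) {
                if (temp[i] > max)
                    temp[i] = max;
                if (temp[i] < min)
                    temp[i] = min;
            }
        }

        GR[r1] = concatenate4(temp[3], temp[2], temp[1], temp[0]);
    } else { // four-byte elements
        x[0] = GR[r2]{31:0};  y[0] = GR[r3]{31:0};
        x[1] = GR[r2]{63:32}; y[1] = GR[r3]{63:32};

        for (i = 0; i < 2; i++) { // modulo_form
            temp[i] = zero_ext(x[i], 32) - zero_ext(y[i], 32);
        }

        GR[r1] = concatenate2(temp[1], temp[0]);
    }

    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}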


IA-64 Application ISA Guide 1.0 rum

Reset User Mask

Format: (qp) rum imm24 M44

Description: The complement of the imm24 operand is ANDed with the user mask (PSR{5:0}) and the result is placed in the user mask.

PSR.up is only cleared if the secure performance monitor bit (PSR.sp) is zero. Otherwise PSR.up is not modified.

Operation: if (PR[qp]) {
    if (is_reserved_field(PSR_TYPE, PSR_UM, imm24))
        reserved_register_field_fault();

    if (imm24{1}) PSR{1} = 0;
    if (imm24{2} && PSR.sp == 0) // non-secure perf monitor
        PSR{2} = 0;
    if (imm24{3}) PSR{3} = 0;
    if (imm24{4}) PSR{4} = 0;
    if (imm24{5}) PSR{5} = 0;
}


setf IA-64 Application ISA Guide 1.0

Set Floating-Point Value, Exponent, or Significand

Format: (qp) setf.s f1 = r2      single_form        M18
        (qp) setf.d f1 = r2      double_form        M18
        (qp) setf.exp f1 = r2    exponent_form      M18
        (qp) setf.sig f1 = r2    significand_form   M18

Description: In the single and double forms, GR r2 is treated as a single precision (in the single_form) or double precision (in the double_form) memory representation, converted into floating-point register format, and placed in FR f1.

In the exponent_form, bits 16:0 of GR r2 are copied to the exponent field of FR f1 and bit 17 of GR r2 is copied to the sign bit of FR f1. The significand field of FR f1 is set to one (0x800...000).

In the significand_form, the value in GR r2 is copied to the significand field of FR f1.

For all forms, if the NaT bit corresponding to r2 is equal to 1, FR f1 is set to NaTVal instead of the computed result.

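Figure 6-39. Function of setf.exp
[Diagram not reproduced: bit 17 and bits 16:0 of the source GR supply the sign and exponent fields of FR f1; the significand is set to 1000...000.]

Figure 6-40. Function of setf.sig
[Diagram not reproduced: bits 63:0 of the source GR supply the significand of FR f1; the exponent is 0x1003E and the sign is 0.]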

Operation: if (PR[qp]) {
    fp_check_target_register(f1);
    if (tmp_isrcode = fp_reg_disabled(f1, 0, 0, 0))
        disabled_fp_register_fault(tmp_isrcode, 0);

    if (!GR[r2].nat) {
        if (single_form)
            FR[f1] = fp_mem_to_fr_format(GR[r2], 4, 0);
        else if (double_form)
            FR[f1] = fp_mem_to_fr_format(GR[r2], 8, 0);
        else if (significand_form) {
            FR[f1].significand = GR[r2];
            FR[f1].exponent = FP_INTEGER_EXP;
            FR[f1].sign = 0;
        } else { // exponent_form
            FR[f1].significand = 0x8000000000000000;
            FR[f1].exponent = GR[r2]{16:0};
            FR[f1].sign = GR[r2]{17};
        }
    } else
        FR[f1] = NATVAL;

    fp_update_psr(f1);
}

shl IA-64 Application ISA Guide 1.0

Shift Left

Format: (qp) shl r1 = r2, r3        I7
        (qp) shl r1 = r2, count6    pseudo-op of: (qp) dep.z r1 = r2, count6, 64–count6

Description: The value in GR r2 is shifted to the left, with the vacated bit positions filled with zeroes, and placed in GR r1. The number of bit positions to shift is specified by the value in GR r3 or by an immediate value count6. The shift count is interpreted as an unsigned number. If the value in GR r3 is greater than 63, then the result is all zeroes.

For the immediate form, see “Deposit” on page 6-27.


Operation: if (PR[qp]) {
    check_target_register(r1);

    count = GR[r3];
    GR[r1] = (count > 63) ? 0 : GR[r2] << count;

    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}

IA-64 Application ISA Guide 1.0 shladd

Shift Left and Add

Format: (qp) shladd r1 = r2, count2, r3 A2

Description: The first source operand is shifted to the left by count2 bits and then added to the second source operand and the result placed in GR r1. The first operand can be shifted by 1, 2, 3, or 4 bits.


Operation: if (PR[qp]) {
    check_target_register(r1);

    GR[r1] = (GR[r2] << count2) + GR[r3];
    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}

shladdp4 IA-64 Application ISA Guide 1.0

Shift Left and Add Pointer

Format: (qp) shladdp4 r1 = r2, count2, r3 A2

Description: The first source operand is shifted to the left by count2 bits and then is added to the second source operand. The upper 32 bits of the result are forced to zero, and then bits {31:30} of GR r3 are copied to bits {62:61} of the result. This result is placed in GR r1. The first operand can be shifted by 1, 2, 3, or 4 bits.
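The pointer conversion can be sketched in C. The name shladdp4_model is an assumption for this example, count2 is assumed to be in the range 1 to 4, and NaT handling is omitted.

```c
#include <stdint.h>

/* Hypothetical model of shladdp4. */
uint64_t shladdp4_model(uint64_t r2, unsigned count2, uint64_t r3) {
    /* Add, then keep only the low 32 bits of the sum. */
    uint64_t res = ((r2 << count2) + r3) & 0xffffffffULL;
    /* Copy bits 31:30 of r3 into bits 62:61 of the result. */
    res |= ((r3 >> 30) & 0x3) << 61;
    return res;
}
```

For example, scaling an index of 1 by 4 and adding a base of 0xC0000000 gives 0x60000000C0000004: the low word holds the 32-bit sum, and the top two bits of the 32-bit base reappear in bits 62:61.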


Operation: if (PR[qp]) {
    check_target_register(r1);

    tmp_res = (GR[r2] << count2) + GR[r3];
    tmp_res = zero_ext(tmp_res{31:0}, 32);
    tmp_res{62:61} = GR[r3]{31:30};
    GR[r1] = tmp_res;
    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}

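Figure 6-41. Shift Left and Add Pointer
[Diagram not reproduced: the shifted GR r2 is added to GR r3; the upper 32 bits of GR r1 are zero except for bits 62:61, which come from bits 31:30 of GR r3.]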


IA-64 Application ISA Guide 1.0 shr

Shift Right

Format: (qp) shr r1 = r3, r2          signed_form     I5
        (qp) shr.u r1 = r3, r2        unsigned_form   I5
        (qp) shr r1 = r3, count6      pseudo-op of: (qp) extr r1 = r3, count6, 64–count6
        (qp) shr.u r1 = r3, count6    pseudo-op of: (qp) extr.u r1 = r3, count6, 64–count6

Description: The value in GR r3 is shifted to the right and placed in GR r1. In the signed_form the vacated bit positions are filled with bit 63 of GR r3; in the unsigned_form the vacated bit positions are filled with zeroes. The number of bit positions to shift is specified by the value in GR r2 or by an immediate value count6. The shift count is interpreted as an unsigned number. If the value in GR r2 is greater than 63, then the result is all zeroes (for the unsigned_form, or if bit 63 of GR r3 was 0) or all ones (for the signed_form if bit 63 of GR r3 was 1).

If the .u completer is specified, the shift is unsigned (logical), otherwise it is signed (arithmetic).

For the immediate forms, see “Extract” on page 6-28.
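The behavior for shift counts above 63 differs from a plain C shift, where such counts are undefined. A C sketch of the register forms follows; the name shr_model and the is_signed flag are assumptions for this example, and NaT handling is omitted.

```c
#include <stdint.h>

/* Hypothetical model of shr (is_signed != 0) and shr.u (is_signed == 0). */
uint64_t shr_model(uint64_t r3, uint64_t count, int is_signed) {
    if (!is_signed)
        return count > 63 ? 0 : r3 >> count;

    if (count > 63)
        count = 63;                     /* a shift by 63 already replicates bit 63 */
    uint64_t res = r3 >> count;
    if ((r3 >> 63) && count > 0)
        res |= ~0ULL << (64 - count);   /* fill vacated positions with the sign bit */
    return res;
}
```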


Operation: if (PR[qp]) {
    check_target_register(r1);

    if (signed_form) {
        count = (GR[r2] > 63) ? 63 : GR[r2];
        GR[r1] = shift_right_signed(GR[r3], count);
    } else {
        count = GR[r2];
        GR[r1] = (count > 63) ? 0 : shift_right_unsigned(GR[r3], count);
    }

    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}


shrp IA-64 Application ISA Guide 1.0

Shift Right Pair

Format: (qp) shrp r1 = r2, r3, count6 I10

Description: The two source operands, GR r2 and GR r3, are concatenated to form a 128-bit value and shifted to the right count6 bits. The least-significant 64 bits of the result are placed in GR r1.

The immediate value count6 can be any number in the range 0 to 63.
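The 128-bit shift can be sketched in C with two 64-bit shifts. The name shrp_model is an assumption for this example, and NaT handling is omitted. GR r2 supplies the upper 64 bits of the concatenated value, as in the Operation pseudocode.

```c
#include <stdint.h>

/* Hypothetical model of shrp; count6 is assumed to be in the range 0 to 63. */
uint64_t shrp_model(uint64_t r2, uint64_t r3, unsigned count6) {
    if (count6 == 0)
        return r3;  /* avoid a 64-bit shift by 64, which is undefined in C */
    return (r3 >> count6) | (r2 << (64 - count6));
}
```

Passing the same value for both sources yields a rotate right by count6 bits.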


Operation: if (PR[qp]) {
    check_target_register(r1);

    temp1 = shift_right_unsigned(GR[r3], count6);
    temp2 = GR[r2] << (64 - count6);
    GR[r1] = zero_ext(temp1, 64 - count6) | temp2;
    GR[r1].nat = GR[r2].nat || GR[r3].nat;
}

Figure 6-42. Shift Right Pair
[Diagram not reproduced: GR r2 and GR r3 are concatenated and shifted right; the low 64 bits are placed in GR r1.]


IA-64 Application ISA Guide 1.0 srlz

Serialize

Format: (qp) srlz.i M24

Description: Instruction serialization (srlz.i) ensures:

• prior modifications to processor register resources that affect fetching of subsequent instruction groups are observed,

• prior modifications to processor register resources that affect subsequent execution or data memory accesses are observed,

• prior memory synchronization (sync.i) operations have taken effect on the local processor instruction cache,

• subsequent instruction group fetches are re-initiated after srlz.i completes.

The srlz.i instruction must be in an instruction group after the instruction group containing the operation that is to be serialized. Operations dependent on the serialization must be in an instruction group after the instruction group containing the srlz.i.

Operation: if (PR[qp]) {
    instruction_serialize();
}


st IA-64 Application ISA Guide 1.0

Store

Format: (qp) stsz.sttype.sthint [r3] = r2          normal_form, no_base_update_form    M4
        (qp) stsz.sttype.sthint [r3] = r2, imm9    normal_form, imm_base_update_form   M5
        (qp) st8.spill.sthint [r3] = r2            spill_form, no_base_update_form     M4
        (qp) st8.spill.sthint [r3] = r2, imm9      spill_form, imm_base_update_form    M5

Description: A value consisting of the least significant sz bytes of the value in GR r2 is written to memory starting at the address specified by the value in GR r3. The values of the sz completer are given in Table 6-26 on page 6-101. The sttype completer specifies special store operations, which are described in Table 6-42. If the NaT bit corresponding to GR r3 is 1 (or in the normal_form, if the NaT bit corresponding to GR r2 is 1), a Register NaT Consumption fault is taken.

In the spill_form, an 8-byte value is stored, and the NaT bit corresponding to GR r2 is copied to a bit in the UNAT application register. This instruction is used for spilling a register/NaT pair. See “Control Speculation” on page 4-10 for details.

In the imm_base_update form, the value in GR r3 is added to a signed immediate value (imm9) and the result is placed back in GR r3. This base register update is done after the store, and does not affect the store address, nor the value stored (for the case where r2 and r3 specify the same register).

For more details on ordered stores see “Memory Access Ordering” on page 4-18.

The ALAT is queried using the physical memory address and the access size, and all overlapping entries are invalidated.

The value of the sthint completer specifies the locality of the memory access. The values of the sthint completer are given in Table 6-43. See “Memory Hierarchy Control and Consistency” on page 4-16.

Table 6-42. Store Types

sttype Completer   Interpretation   Special Store Operation
none               Normal store
rel                Ordered store    An ordered store is performed with release semantics.

Table 6-43. Store Hints

sthint Completer   Interpretation
none               Temporal locality, level 1
nta                Non-temporal locality, all levels



Operation: if (PR[qp]) {
    size = spill_form ? 8 : sz;
    otype = (sttype == ‘rel’) ? RELEASE : UNORDERED;

    if (imm_base_update_form)
        check_target_register(r3);

    if (GR[r3].nat || (normal_form && GR[r2].nat))
        register_nat_consumption_fault(WRITE);

    paddr = tlb_translate(GR[r3], size, WRITE, PSR.cpl, &mattr, &tmp_unused);

    if (spill_form && GR[r2].nat)
        natd_gr_write(GR[r2], paddr, size, UM.be, mattr, otype, sthint);
    else
        mem_write(GR[r2], paddr, size, UM.be, mattr, otype, sthint);

    if (spill_form) {
        bit_pos = GR[r3]{8:3};
        AR[UNAT]{bit_pos} = GR[r2].nat;
    }

    if (imm_base_update_form) {
        GR[r3] = GR[r3] + sign_ext(imm9, 9);
        GR[r3].nat = 0;
    }
}
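The arithmetic in the store operation can be illustrated with a small Python model. This is only a sketch under stated assumptions: the helper names (store_bytes, unat_bit_pos, base_update) are hypothetical, registers are plain 64-bit integers, and faults, address translation, and the ALAT are not modeled.

```python
MASK64 = (1 << 64) - 1


def sign_ext(value, bits):
    """Sign-extend the low `bits` bits of value (mirrors the sign_ext pseudo-function)."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def store_bytes(gr_value, sz, big_endian=False):
    """Bytes written by st<sz>: the sz least significant bytes of the register value."""
    low = gr_value & ((1 << (8 * sz)) - 1)
    return low.to_bytes(sz, "big" if big_endian else "little")


def unat_bit_pos(address):
    """UNAT bit selected by st8.spill: bits {8:3} of the store address."""
    return (address >> 3) & 0x3F


def base_update(r3_value, imm9):
    """Post-store base update: r3 + sign_ext(imm9, 9), wrapped to 64 bits (assumed wrap)."""
    return (r3_value + sign_ext(imm9, 9)) & MASK64
```

For example, under these assumptions a spill to address 0x1008 would record the NaT bit in UNAT bit 1, and an imm9 encoding of 0x1F8 moves the base register back by 8 bytes.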



Floating-Point Store

Format: (qp) stffsz.sthint [r3] = f2           normal_form, no_base_update_form     M9
        (qp) stffsz.sthint [r3] = f2, imm9     normal_form, imm_base_update_form    M10
        (qp) stf8.sthint [r3] = f2             integer_form, no_base_update_form    M9
        (qp) stf8.sthint [r3] = f2, imm9       integer_form, imm_base_update_form   M10
        (qp) stf.spill.sthint [r3] = f2        spill_form, no_base_update_form      M9
        (qp) stf.spill.sthint [r3] = f2, imm9  spill_form, imm_base_update_form     M10

Description: A value, consisting of fsz bytes, is generated from the value in FR f2 and written to memory starting at the address specified by the value in GR r3. In the normal_form, the value in FR f2 is converted to the memory format and then stored. In the integer_form, the significand of FR f2 is stored. The values of the fsz completer are given in Table 6-29 on page 6-105. In the normal_form or the integer_form, if the NaT bit corresponding to GR r3 is 1 or if FR f2 contains NaTVal, a Register NaT Consumption fault is taken. See “Data Types and Formats” on page 5-1 for details on conversion from floating-point register format.

In the spill_form, a 16-byte value from FR f2 is stored without conversion. This instruction is used for spilling a register. See “Control Speculation” on page 4-10 for details.

In the imm_base_update form, the value in GR r3 is added to a signed immediate value (imm9) and the result is placed back in GR r3. This base register update is done after the store, and does not affect the store address.

The ALAT is queried using the physical memory address and the access size, and all overlapping entries are invalidated.

The value of the sthint completer specifies the locality of the memory access. The values of the sthint completer are given in Table 6-43 on page 6-166. See “Memory Hierarchy Control and Consistency” on page 4-16.

Operation: if (PR[qp]) {
    if (imm_base_update_form)
        check_target_register(r3);
    if (tmp_isrcode = fp_reg_disabled(f2, 0, 0, 0))
        disabled_fp_register_fault(tmp_isrcode, WRITE);

    if (GR[r3].nat || (!spill_form && (FR[f2] == NATVAL)))
        register_nat_consumption_fault(WRITE);

    size = spill_form ? 16 : (integer_form ? 8 : fsz);

    paddr = tlb_translate(GR[r3], size, WRITE, PSR.cpl, &mattr, &tmp_unused);
    val = fp_fr_to_mem_format(FR[f2], size, integer_form);
    mem_write(val, paddr, size, UM.be, mattr, UNORDERED, sthint);

    if (imm_base_update_form) {
        GR[r3] = GR[r3] + sign_ext(imm9, 9);
        GR[r3].nat = 0;
    }
}



Subtract

Format: (qp) sub r1 = r2, r3       register_form                 A1
        (qp) sub r1 = r2, r3, 1    minus1_form, register_form    A1
        (qp) sub r1 = imm8, r3     imm8_form                     A3

Description: The second source operand (and an optional constant 1) are subtracted from the first operand and the result placed in GR r1. In the register form the first operand is GR r2; in the immediate form the first operand is taken from the sign extended imm8 encoding field.

The minus1_form is available only in the register_form (although the equivalent effect can be achieved by adjusting the immediate).



    if (minus1_form)
        GR[r1] = tmp_src - GR[r3] - 1;
    else
        GR[r1] = tmp_src - GR[r3];

    GR[r1].nat = tmp_nat || GR[r3].nat;
}
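As a rough illustration of the subtract forms and the NaT propagation, the following Python sketch models a register as a (value, nat) pair. The function names are hypothetical, results are assumed to wrap to 64 bits, and the immediate operand is assumed to carry no NaT bit.

```python
MASK64 = (1 << 64) - 1


def sign_ext(value, bits):
    """Sign-extend the low `bits` bits of value."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def sub(first, r3, minus1=False):
    """first and r3 are (value, nat) pairs; returns the (value, nat) pair written to r1."""
    (a, a_nat), (b, b_nat) = first, r3
    result = (a - b - (1 if minus1 else 0)) & MASK64
    return result, a_nat or b_nat


def sub_imm8(imm8, r3):
    """imm8_form: the first operand is the sign-extended 8-bit immediate (assumed NaT-free)."""
    return sub((sign_ext(imm8, 8) & MASK64, False), r3)
```

With this model, sub_imm8(0xFF, ...) subtracts from -1, which is one way to see how adjusting the immediate can stand in for the minus1_form.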



Set User Mask

Format: (qp) sum imm24 M44

Description: The imm24 operand is ORed with the user mask (PSR{5:0}) and the result is placed in the user mask.

PSR.up can only be set if the secure performance monitor bit (PSR.sp) is zero. Otherwise PSR.up is not modified.

Operation: if (PR[qp]) {
    if (is_reserved_field(PSR_TYPE, PSR_UM, imm24))
        reserved_register_field_fault();

    if (imm24{1}) PSR{1} = 1;
    if (imm24{2} && PSR.sp == 0)    //non-secure perf monitor
        PSR{2} = 1;
    if (imm24{3}) PSR{3} = 1;
    if (imm24{4}) PSR{4} = 1;
    if (imm24{5}) PSR{5} = 1;
}
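The bit-by-bit operation above amounts to a masked OR. A minimal Python sketch, assuming PSR{2} is PSR.up (as the description and the pseudo-code comment suggest) and passing PSR.sp in as a separate hypothetical argument; the reserved-field fault is not modeled:

```python
UM_SETTABLE = 0b111110  # PSR bits 1..5, the bits the Operation text can set
UP_BIT = 1 << 2         # PSR{2}, assumed to be PSR.up


def sum_user_mask(psr, imm24, psr_sp):
    """Return the PSR value after `sum imm24`, given the current PSR.sp bit."""
    bits = imm24 & UM_SETTABLE
    if psr_sp:
        bits &= ~UP_BIT  # PSR.up is left unmodified when PSR.sp is 1
    return psr | bits
```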



Sign Extend

Format: (qp) sxtxsz r1 = r3 I29

Description: The value in GR r3 is sign extended from the bit position specified by xsz and the result is placed in GR r1. The mnemonic values for xsz are given in Table 6-44.


    GR[r1] = sign_ext(GR[r3], xsz * 8);
    GR[r1].nat = GR[r3].nat;
}

Table 6-44. xsz Mnemonic Values

xsz Mnemonic    Bit Position
1               7
2               15
4               31
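The bit positions in Table 6-44 can be exercised with a short Python sketch. The function names sxt and zxt are used here only as illustrative Python helpers operating on 64-bit integers; NaT handling is omitted.

```python
MASK64 = (1 << 64) - 1
BIT_POSITION = {1: 7, 2: 15, 4: 31}  # xsz mnemonic -> bit position (Table 6-44)


def sxt(value, xsz):
    """Sign-extend value from the bit position selected by xsz to 64 bits."""
    sign_bit = BIT_POSITION[xsz]
    low_mask = (1 << (sign_bit + 1)) - 1
    low = value & low_mask
    if low >> sign_bit:
        low |= MASK64 & ~low_mask  # replicate the sign bit into the upper bits
    return low


def zxt(value, xsz):
    """Zero-extend: clear every bit above the position selected by xsz."""
    return value & ((1 << (BIT_POSITION[xsz] + 1)) - 1)
```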



Memory Synchronization

Format: (qp) sync.i M24

Description: sync.i ensures that when previously initiated Flush Cache (fc) operations issued by the local processor become visible to local data memory references, prior Flush Cache operations are also observed by the local processor instruction fetch stream. sync.i also ensures that at the time previously initiated Flush Cache (fc) operations are observed on a remote processor by data memory references they are also observed by instruction memory references on the remote processor. sync.i is ordered with respect to all cache flush operations as observed by another processor. A sync.i and a previous fc must be in separate instruction groups. If semantically required, the programmer must explicitly insert ordered data references (acquire, release or fence type) to appropriately constrain sync.i (and hence fc) visibility to the data stream on other processors.

sync.i is used to maintain an ordering relationship between instruction and data caches on local and remote processors. An instruction serialize operation must be used to ensure synchronization initiated by sync.i on the local processor has been observed by a given point in program execution.

An example of self-modifying code (local processor):

    st [L1] = data    //store into local instruction stream
    fc L1             //flush stale datum from instruction/data cache
    ;;                //require instruction boundary between fc and sync.i
    sync.i            //ensure local and remote data/inst caches are synchronized
    ;;
    srlz.i            //ensure sync has been observed by the local processor,
    ;;                //ensure subsequent instructions observe modified memory

L1: target            //instruction modified

Operation: if (PR[qp]) {
    instruction_synchronize();
}



Test Bit

Format: (qp) tbit.trel.ctype p1, p2 = r3, pos6 I16

Description: The bit specified by the pos6 immediate is selected from GR r3. The selected bit forms a single bit result either complemented or not depending on the trel completer. This result is written to the two predicate register destinations p1 and p2. The way the result is written to the destinations is determined by the compare type specified by ctype. See the Compare instruction and Table 6-10 on page 6-19.

The trel completer values .nz and .z indicate non-zero and zero sense of the test. For normal and unc types, only the .z value is directly implemented in hardware; the .nz value is actually a pseudo-op. For it, the assembler simply switches the predicate target specifiers and uses the implemented relation. For the parallel types, both relations are implemented in hardware.


Table 6-45. Test Bit Relations for Normal and unc tbits

trel    Test Relation        Pseudo-op of
nz      selected bit == 1    z    p1 ↔ p2
z       selected bit == 0

Table 6-46. Test Bit Relations for Parallel tbits

trel    Test Relation
nz      selected bit == 1
z       selected bit == 0
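The pseudo-op relationship in Table 6-45 can be checked with a small Python sketch. It assumes the normal compare type writes the test result to p1 and its complement to p2 (see the Compare instruction), and it ignores the NaT bit and the qualifying predicate; the function names are hypothetical.

```python
def tbit_z(value, pos6):
    """Normal-type tbit.z: returns (p1, p2) with p1 = (selected bit == 0)."""
    rel = ((value >> pos6) & 1) == 0
    return int(rel), int(not rel)


def tbit_nz_direct(value, pos6):
    """What tbit.nz is meant to compute: p1 = (selected bit == 1)."""
    rel = ((value >> pos6) & 1) == 1
    return int(rel), int(not rel)


def tbit_nz_via_swap(value, pos6):
    """tbit.nz as the assembler emits it: tbit.z with p1 and p2 switched."""
    p2, p1 = tbit_z(value, pos6)  # the hardware's p1 result lands in the user's p2
    return p1, p2
```

Under these assumptions the two tbit.nz formulations agree for every bit position, which is why only the .z relation needs a hardware encoding for the normal and unc types.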





    if (trel == ‘nz’)                       // ‘nz’ - test for 1
        tmp_rel = GR[r3]{pos6};
    else                                    // ‘z’ - test for 0
        tmp_rel = !GR[r3]{pos6};


            if (GR[r3].nat || !tmp_rel) {
                PR[p1] = 0;
                PR[p2] = 0;
            }
            break;
        case ‘or’:                          // or-type compare
            if (!GR[r3].nat && tmp_rel) {
                PR[p1] = 1;
                PR[p2] = 1;
            }
            break;
        case ‘or.andcm’:                    // or.andcm-type compare
            if (!GR[r3].nat && tmp_rel) {
                PR[p1] = 1;
                PR[p2] = 0;
            }
            break;


            if (GR[r3].nat) {
                PR[p1] = 0;
                PR[p2] = 0;


            }
            break;
    }
} else {
    if (ctype == ‘unc’) {
        if (p1 == p2)


    }
}



Test NaT

Format: (qp) tnat.trel.ctype p1, p2 = r3 I17

Description: The NaT bit from GR r3 forms a single bit result, either complemented or not depending on the trel completer. This result is written to the two predicate register destinations, p1 and p2. The way the result is written to the destinations is determined by the compare type specified by ctype. See the Compare instruction and Table 6-10 on page 6-19.

The trel completer values .nz and .z indicate non-zero and zero sense of the test. For normal and unc types, only the .z value is directly implemented in hardware; the .nz value is actually a pseudo-op. For it, the assembler simply switches the predicate target specifiers and uses the implemented relation. For the parallel types, both relations are implemented in hardware.


Table 6-47. Test NaT Relations for Normal and unc tnats

trel    Test Relation        Pseudo-op of
nz      selected bit == 1    z    p1 ↔ p2
z       selected bit == 0

Table 6-48. Test NaT Relations for Parallel tnats

trel    Test Relation
nz      selected bit == 1
z       selected bit == 0





    if (trel == ‘nz’)                       // ‘nz’ - test for 1
        tmp_rel = GR[r3].nat;
    else                                    // ‘z’ - test for 0
        tmp_rel = !GR[r3].nat;


            if (!tmp_rel) {
                PR[p1] = 0;
                PR[p2] = 0;
            }
            break;
        case ‘or’:                          // or-type compare
            if (tmp_rel) {
                PR[p1] = 1;
                PR[p2] = 1;
            }
            break;
        case ‘or.andcm’:                    // or.andcm-type compare
            if (tmp_rel) {
                PR[p1] = 1;
                PR[p2] = 0;
            }
            break;


            PR[p1] = tmp_rel;
            PR[p2] = !tmp_rel;
            break;
    }
} else {
    if (ctype == ‘unc’) {
        if (p1 == p2)


    }
}



Unpack

Format: (qp) unpack1.h r1 = r2, r3    one_byte_form, high_form     I2
        (qp) unpack2.h r1 = r2, r3    two_byte_form, high_form     I2
        (qp) unpack4.h r1 = r2, r3    four_byte_form, high_form    I2
        (qp) unpack1.l r1 = r2, r3    one_byte_form, low_form      I2
        (qp) unpack2.l r1 = r2, r3    two_byte_form, low_form      I2
        (qp) unpack4.l r1 = r2, r3    four_byte_form, low_form     I2

Description: The data elements of GR r2 and r3 are unpacked, and the result placed in GR r1. In the high_form, the most significant elements of each source register are selected, while in the low_form the least significant elements of each source register are selected. Elements are selected alternately from the source registers.



Figure 6-43. Unpack Operation (diagram not reproduced; panels: unpack1.h, unpack1.l, unpack2.h, unpack2.l, unpack4.h, unpack4.l, each relating GR r2 and GR r3 to GR r1)




        if (high_form)
            GR[r1] = concatenate8(x[7], y[7], x[6], y[6],
                                  x[5], y[5], x[4], y[4]);
        else
            GR[r1] = concatenate8(x[3], y[3], x[2], y[2],
                                  x[1], y[1], x[0], y[0]);


        if (high_form)
            GR[r1] = concatenate4(x[3], y[3], x[2], y[2]);
        else
            GR[r1] = concatenate4(x[1], y[1], x[0], y[0]);


        if (high_form)
            GR[r1] = concatenate2(x[1], y[1]);
        else
            GR[r1] = concatenate2(x[0], y[0]);


}
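The interleaving performed by unpack can be sketched in Python. The assumptions here are loud ones: x holds the elements of GR r2 and y those of GR r3, element 0 is the least significant, and the concatenate functions place their first argument in the most significant position. The function names are hypothetical.

```python
def elements(value, width_bytes):
    """Split a 64-bit value into elements; index 0 is the least significant."""
    bits = 8 * width_bytes
    mask = (1 << bits) - 1
    return [(value >> (bits * i)) & mask for i in range(8 // width_bytes)]


def unpack(r2, r3, width_bytes, high):
    """Model of unpack{1,2,4}.{h,l}: interleave elements of r2 (x) and r3 (y)."""
    bits = 8 * width_bytes
    n = 8 // width_bytes
    x, y = elements(r2, width_bytes), elements(r3, width_bytes)
    # high_form uses the upper half of the element indices, low_form the lower half
    indices = range(n - 1, n // 2 - 1, -1) if high else range(n // 2 - 1, -1, -1)
    result = 0
    for i in indices:
        for element in (x[i], y[i]):  # x first, so it lands in the more significant slot
            result = (result << bits) | element
    return result
```

For instance, with r2 = 0xA7A6A5A4A3A2A1A0 and r3 = 0xB7B6B5B4B3B2B1B0, this model of unpack1.h yields 0xA7B7A6B6A5B5A4B4.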



Exchange

Format: (qp) xchgsz.ldhint r1 = [r3], r2 M16

Description: A value consisting of sz bytes is read from memory starting at the address specified by the value in GR r3. The least significant sz bytes of the value in GR r2 are written to memory starting at the address specified by the value in GR r3. The value read from memory is then zero extended and placed in GR r1 and the NaT bit corresponding to GR r1 is cleared. The values of the sz completer are given in Table 6-49.

Both read and write access privileges for the referenced page are required.

The exchange is performed with acquire semantics, i.e., the memory read/write is made visible prior to all subsequent data memory accesses.

The value of the ldhint completer specifies the locality of the memory access. The values of the ldhint completer are given in Table 6-28 on page 6-102. Locality hints do not affect program functionality and may be ignored by the implementation. See “Memory Hierarchy Control and Consistency” on page 4-16 for details.


    if (GR[r3].nat || GR[r2].nat)
        register_nat_consumption_fault(SEMAPHORE);

    paddr = tlb_translate(GR[r3], sz, SEMAPHORE, PSR.cpl, &mattr, &tmp_unused);

    if (!ma_supports_semaphores(mattr))
        unsupported_data_reference_fault(SEMAPHORE, GR[r3]);

    val = mem_xchg(GR[r2], paddr, sz, UM.be, mattr, ACQUIRE, ldhint);

    alat_inval_multiple_entries(paddr, sz);

    GR[r1] = zero_ext(val, sz * 8);
    GR[r1].nat = 0;
}

Table 6-49. Memory Exchange Size

sz Completer    Bytes Accessed
1               1 byte
2               2 bytes
4               4 bytes
8               8 bytes
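The data movement of the exchange can be illustrated with a Python sketch over a bytearray standing in for memory. This is a single-threaded model under stated assumptions: atomicity, acquire ordering, faults, and ALAT invalidation are not modeled, and the function name and parameters are hypothetical.

```python
def xchg(memory, address, r2_value, sz, big_endian=False):
    """Swap sz bytes at `address` with the low sz bytes of r2_value.

    Returns the old memory contents, zero-extended (the value placed in r1).
    """
    order = "big" if big_endian else "little"
    old = int.from_bytes(memory[address:address + sz], order)
    new = r2_value & ((1 << (8 * sz)) - 1)  # least significant sz bytes of r2
    memory[address:address + sz] = new.to_bytes(sz, order)
    return old
```

A simple spin lock is a typical use of such an exchange: a caller writes 1 to the lock word and owns the lock if the returned old value was 0.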



Fixed-Point Multiply Add

Format: (qp) xma.l f1 = f3, f4, f2     low_form                                    F2
        (qp) xma.lu f1 = f3, f4, f2    pseudo-op of: (qp) xma.l f1 = f3, f4, f2
        (qp) xma.h f1 = f3, f4, f2     high_form                                   F2
        (qp) xma.hu f1 = f3, f4, f2    high_unsigned_form                          F2

Description: Two source operands (FR f3 and FR f4) are treated as either signed or unsigned integers and multiplied. The third source operand (FR f2) is zero extended and added to the product. The upper or lower 64 bits of the resultant sum are selected and placed in FR f1.

In the high_unsigned_form, the significand fields of FR f3 and FR f4 are treated as unsigned integers and multiplied to produce a full 128-bit unsigned result. The significand field of FR f2 is zero extended and added to the product. The most significant 64 bits of the resultant sum are placed in the significand field of FR f1.

In the high_form, the significand fields of FR f3 and FR f4 are treated as signed integers and multiplied to produce a full 128-bit signed result. The significand field of FR f2 is zero extended and added to the product. The most significant 64 bits of the resultant sum are placed in the significand field of FR f1.

In the other forms, the significand fields of FR f3 and FR f4 are treated as signed integers and multiplied to produce a full 128-bit signed result. The significand field of FR f2 is zero extended and added to the product. The least significant 64 bits of the resultant sum are placed in the significand field of FR f1.

In all forms, the exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0). Note: f1 as an operand is not an integer 1; it is just the register file format’s 1.0 value.

In all forms, if any of FR f3, FR f4, or FR f2 is a NaTVal, FR f1 is set to NaTVal instead of the computed result.



    if (fp_is_natval(FR[f2]) || fp_is_natval(FR[f3]) || fp_is_natval(FR[f4])) {
        FR[f1] = NATVAL;
    } else {
        if (low_form || high_form)
            tmp_res_128 =
                fp_I64_x_I64_to_I128(FR[f3].significand, FR[f4].significand);
        else // high_unsigned_form
            tmp_res_128 =
                fp_U64_x_U64_to_U128(FR[f3].significand, FR[f4].significand);

        tmp_res_128 =
            fp_U128_add(tmp_res_128, fp_U64_to_U128(FR[f2].significand));

        if (high_form || high_unsigned_form)
            FR[f1].significand = tmp_res_128.hi;
        else // low_form
            FR[f1].significand = tmp_res_128.lo;

        FR[f1].exponent = FP_INTEGER_EXP;
        FR[f1].sign = FP_SIGN_POSITIVE;
    }

    fp_update_psr(f1);
}
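The significand arithmetic of the three encoded forms can be sketched in Python using arbitrary-precision integers. Only the 64-bit significands are modeled; the exponent, sign, and NaTVal handling are omitted, and the function name and the form strings ("l", "h", "hu") are hypothetical.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def to_signed64(value):
    """Reinterpret a 64-bit pattern as a two's-complement signed integer."""
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


def xma(f3, f4, f2, form):
    """Return the 64-bit significand written to f1 for form 'l', 'h', or 'hu'."""
    if form == "hu":
        product = (f3 & MASK64) * (f4 & MASK64)       # unsigned multiply
    else:
        product = to_signed64(f3) * to_signed64(f4)   # signed multiply
    total = (product + (f2 & MASK64)) & MASK128       # f2 is zero extended
    return (total >> 64) & MASK64 if form in ("h", "hu") else total & MASK64
```

In this model, multiplying the all-ones pattern by 2 gives an upper half of all ones in the signed high form but 1 in the unsigned high form, while the lower halves agree, which is consistent with xma.lu being a pseudo-op of xma.l.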



Fixed-Point Multiply

Format: (qp) xmpy.l f1 = f3, f4     pseudo-op of: (qp) xma.l f1 = f3, f4, f0
        (qp) xmpy.lu f1 = f3, f4    pseudo-op of: (qp) xma.l f1 = f3, f4, f0
        (qp) xmpy.h f1 = f3, f4     pseudo-op of: (qp) xma.h f1 = f3, f4, f0
        (qp) xmpy.hu f1 = f3, f4    pseudo-op of: (qp) xma.hu f1 = f3, f4, f0

Description: Two source operands (FR f3 and FR f4) are treated as either signed or unsigned integers and multiplied. The upper or lower 64 bits of the resultant product are selected and placed in FR f1.

In the high_unsigned_form, the significand fields of FR f3 and FR f4 are treated as unsigned integers and multiplied to produce a full 128-bit unsigned result. The most significant 64 bits of the resultant product are placed in the significand field of FR f1.

In the high_form, the significand fields of FR f3 and FR f4 are treated as signed integers and multiplied to produce a full 128-bit signed result. The most significant 64 bits of the resultant product are placed in the significand field of FR f1.

In the other forms, the significand fields of FR f3 and FR f4 are treated as signed integers and multiplied to produce a full 128-bit signed result. The least significant 64 bits of the resultant product are placed in the significand field of FR f1.

In all forms, the exponent field of FR f1 is set to the biased exponent for 2.0^63 (0x1003E) and the sign field of FR f1 is set to positive (0). Note: f1 as an operand is not an integer 1; it is just the register file format’s 1.0 value.

Operation: See “Fixed-Point Multiply Add” on page 6-181.



Exclusive Or

Format: (qp) xor r1 = r2, r3      register_form    A1
        (qp) xor r1 = imm8, r3    imm8_form        A3

Description: The two source operands are logically XORed and the result placed in GR r1. In the register_form the first operand is GR r2; in the imm8_form the first operand is taken from the imm8 encoding field.



    GR[r1] = tmp_src ^ GR[r3];
    GR[r1].nat = tmp_nat || GR[r3].nat;
}



Zero Extend

Format: (qp) zxtxsz r1 = r3 I29

Description: The value in GR r3 is zero extended above the bit position specified by xsz and the result is placed in GR r1. The mnemonic values for xsz are given in Table 6-44 on page 6-171.


    GR[r1] = zero_ext(GR[r3], xsz * 8);
    GR[r1].nat = GR[r3].nat;
}



A Instruction Sequencing Considerations

Instruction execution consists of four phases:

1. Read the instruction from memory (fetch)

2. Read architectural state, if necessary (read)

3. Perform the specified operation (execute)

4. Update architectural state, if necessary (update).

An instruction group is a sequence of instructions starting at a given bundle address and slot number and including all instructions at sequentially increasing slot numbers and bundle addresses up to the first stop or taken branch. For the instructions in an instruction group to have well-defined behavior, they must meet the ordering and dependency requirements described below.

If the instructions in instruction groups meet the resource-dependency requirements, then the behavior of a program will be as though each individual instruction is sequenced through these phases in the order listed above. The order of a phase of a given instruction relative to any phase of a previous instruction is prescribed by the instruction sequencing rules below.

• There is no a priori relationship between the fetch of an instruction and the read, execute, or update of any dynamically previous instruction. The sync.i and srlz.i instructions can be used to enforce a sequential relationship between the fetch of all succeeding instructions and the update of all previous instructions.

• Between instruction groups, every instruction in a given instruction group will behave as though its read occurred after the update of all the instructions from the previous instruction group. All instructions are assumed to have unit latency. Instructions on opposing sides of a stop are architecturally considered to be separated by at least one unit of latency.

Some system state updates require more stringent requirements than those described here.

• Within an instruction group, every instruction will behave as though its read of the memory and ALAT state occurred after the update of the memory and ALAT state of all prior instructions in that instruction group.

• Within an instruction group, every instruction will behave as though its read of the register state occurred before the update of the register state by any instruction (prior or later) in that instruction group, except as noted in the dependency restrictions section below.

The ordering rules above form the context for register dependency restrictions, memory dependency restrictions and the order of exception reporting. These dependency restrictions apply only between instructions whose resource reads and writes are not dynamically disabled by predication.

• Register dependencies: Within an instruction group, read-after-write (RAW) and write-after-write (WAW) register dependencies are not allowed (except as noted in “RAW Ordering Exceptions” on page A-2 and “WAW Ordering Exceptions” on page A-3). Write-after-read (WAR) register dependencies are allowed (except as noted in “WAR Ordering Exceptions” on page A-3).

These dependency restrictions apply to both explicit register accesses (from the instruction’s operands) and implicit register accesses (such as application and control registers implicitly accessed by certain instructions). Predicate register PR0 is excluded from these register dependency restrictions, since writes to PR0 are ignored and reads always return 1 (one).

• Memory dependencies: Within an instruction group, RAW, WAW, and WAR memory dependencies and ALAT dependencies are allowed. A load will observe the results of the most recent store to the same memory address. In the event that multiple stores to the same address are present in the same instruction group, memory will contain the result of the latest store after execution of the instruction group. A store following a load to the same address will not affect the data loaded by the load. Advanced loads, check loads, advanced load checks, stores, and memory semaphore instructions implicitly access the ALAT. RAW, WAW, and WAR ALAT dependencies are allowed within an instruction group and behave as described for memory dependencies.

The net effect of the dependency restrictions stated above is that a processor may execute all (or any subset) of the instructions within a legal instruction group concurrently or serially with the end result being identical. If these dependency restrictions are not met, the behavior of the program is undefined.
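The basic register dependency rule can be illustrated with a small checker. This is a simplified sketch, not a complete validator: each instruction is a hypothetical (name, reads, writes) tuple over register names, predication is ignored, memory and ALAT resources are left out, and none of the RAW, WAW, or WAR ordering exceptions are modeled. Only the PR0 exclusion is applied.

```python
def find_violations(group):
    """Report RAW and WAW register dependencies inside one instruction group.

    group: list of (name, reads, writes) in program order.
    Returns a list of (kind, register, earlier_instruction, later_instruction).
    """
    violations = []
    first_writer = {}  # register -> name of the first instruction that wrote it
    for name, reads, writes in group:
        for reg in sorted(reads):
            if reg != "p0" and reg in first_writer:
                violations.append(("RAW", reg, first_writer[reg], name))
        for reg in sorted(writes):
            if reg != "p0" and reg in first_writer:
                violations.append(("WAW", reg, first_writer[reg], name))
        # Record this instruction's writes only after checking it, so an
        # instruction that reads and writes the same register is not flagged.
        for reg in writes:
            first_writer.setdefault(reg, name)
    return violations
```

A write-after-read pair produces no report in this model, matching the statement that WAR register dependencies are generally allowed.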

The instruction sequencing resulting from the rules stated above is termed sequential execution.

The ordering rules and the dependency restrictions allow the processor to dynamically re-order instructions, execute instructions with non-unit latency, or even concurrently execute instructions on opposing sides of a stop or taken branch, provided that correct sequencing is enforced and the appearance of sequential execution is presented to the programmer.

IP is a special resource in that reads and writes of IP behave as though the instruction stream was being executed serially, rather than in parallel. RAW dependencies on IP are allowed, and the reader gets the IP of the bundle in which it is contained. So, each bundle being executed in parallel logically reads IP, increments it and writes it back. WAW is also allowed.

Ignored ARs are not exceptional for dependency checking purposes. RAW and WAW dependencies to ignored ARs are not allowed.

A.1 RAW Ordering Exceptions

There are four exceptions to the rule prohibiting RAW register dependencies within an instruction group. These exceptions are the alloc instruction, check load instructions, instructions that affect branching, and the ld8.fill and st8.spill instructions.

• The alloc instruction implicitly writes the Current Frame Marker (CFM) which is implicitly read by all instructions accessing the stacked subset of the general register file. Instructions that access the stacked subset of the general register file may appear in the same instruction group as alloc and will see the stack frame specified by the alloc. Note that some instructions have RAW or WAW dependences on resources other than CFM affected by alloc and are thus not allowed in the same instruction group after an alloc: flushrs, move from AR[BSPSTORE], move from AR[RNAT], br.cexit, br.ctop, br.wexit, br.wtop, br.call, br.ia, br.ret, clrrrb. Note that alloc is required to be the first instruction in an instruction group.

• A check load instruction may or may not perform a load since it is dependent upon its corresponding advanced load. If the check load misses the ALAT it will execute a load from memory. A check load and a subsequent instruction that reads the target of the check load may exist in the same instruction group. The dependent instruction will get the new value loaded by the check load.

• A branch may read branch registers and may implicitly read predicate registers, the LC, EC, and PFS application registers, as well as CFM. Except for LC, EC and predicate registers, writes to any of these registers by a non-branch instruction will be visible to a subsequent branch in the same instruction group. Writes to predicate registers by any non-floating-point instruction will be visible to a subsequent branch in the same instruction group. RAW register dependencies within the same instruction group are not allowed for LC and EC. Dynamic RAW dependencies where the predicate writer is a floating-point instruction and the reader is a branch are also not allowed within the same instruction group. Branches br.cond, br.call, br.ret and br.ia work like other instructions for the purposes of register dependency; i.e., if their qualifying predicate is 0, they are not considered readers or writers of other resources. Branches br.cloop, br.cexit, br.ctop, br.wexit, and br.wtop are exceptional in that they are always readers or writers of their resources, regardless of the value of their qualifying predicate.

• The ld8.fill and st8.spill instructions implicitly access the User NaT Collection application register (UNAT). For these instructions the restriction on dynamic RAW register dependencies with respect to UNAT applies at the bit level. These instructions may appear in the same instruction group provided they do not access the same bit of UNAT. RAW UNAT dependencies between ld8.fill or st8.spill instructions and mov ar= or mov =ar instructions accessing UNAT must not occur within the same instruction group.

For the purposes of resource dependencies, CFM is treated as a single resource.



A.2 WAW Ordering Exceptions

There are three exceptions to the rule prohibiting WAW register dependencies within an instruction group. The exceptions are compare-type instructions, floating-point instructions, and the st8.spill instruction.

• The set of compare-type instructions includes: cmp, cmp4, tbit, tnat, fcmp, frsqrta, frcpa, and fclass. Compare-type instructions in the same instruction group may target the same predicate register provided:

• The compare-type instructions are either all AND-type compares or all OR-type compares (AND-type compares correspond to “.and” and “.andcm” completers; OR-type compares correspond to “.or” and “.orcm” completers), or

• The compare-type instructions all target PR0. All WAW dependencies for PR0 are allowed; the compares can be of any types and can be of differing types.

All other WAW dependencies within an instruction group are disallowed, including dynamic WAW register dependencies with move to PR instructions that access the same predicate registers as another writer. Note that the move to PR instruction only writes those PRs indicated by its mask, but the move from PR instruction always reads all the predicate registers.

• Floating-point instructions implicitly write the Floating-Point Status Register (FPSR) and the Processor Status Register (PSR). Multiple floating-point instructions may appear in the same instruction group since the restriction on WAW register dependencies with respect to the FPSR and PSR does not apply. The state of FPSR and PSR after executing the instruction group will be the logical OR of all writes.

• The st8.spill instruction implicitly writes the UNAT register. For this instruction the restriction on WAW register dependencies with respect to UNAT applies at the bit level. Multiple st8.spill instructions may appear in the same instruction group provided they do not write the same bit of UNAT. WAW register dependencies between st8.spill instructions and mov ar= instructions targeting UNAT must not occur within the same instruction group.

WAW dependencies to ignored ARs are not allowed.
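The compare-type exception can be expressed as a small predicate. The sketch below is illustrative only: it takes a hypothetical target register name and the list of ctype completers of the compare-type instructions that write it within one instruction group, and it applies just the two conditions stated for compare-type instructions.

```python
AND_TYPES = {"and", "andcm"}
OR_TYPES = {"or", "orcm"}


def compare_waw_allowed(target, ctypes):
    """True if several compare-type writers of `target` may share an instruction group."""
    if len(ctypes) < 2:
        return True  # a single writer has no WAW dependency
    if target == "p0":
        return True  # all WAW dependencies for PR0 are allowed
    all_and = all(c in AND_TYPES for c in ctypes)
    all_or = all(c in OR_TYPES for c in ctypes)
    return all_and or all_or
```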

A.3 WAR Ordering Exceptions

WAR dependence between the reading of PR63 by a branch instruction and the subsequent writing of PR63 by a loop closing branch (br.ctop, br.cexit, br.wtop, or br.wexit) in the same instruction group is not allowed. Otherwise, WAR dependencies are allowed.





B IA-64 Pseudo-Code Functions

This appendix contains a table of pseudo-code functions used in Chapter 6, "IA-64 Instruction Reference".

Table B-1. Pseudo-Code Functions

Function
    Operation

xxx_fault(parameters ...)
    There are several fault functions. Each fault function accepts parameters specific to the fault, e.g., exception code values, virtual addresses, etc. If the fault is deferred for speculative load exceptions the fault function will return with a deferral indication. Otherwise, fault routines do not return and terminate the instruction sequence.

xxx_trap(parameters ...)
    There are several trap functions. Each trap function accepts parameters specific to the trap, e.g., trap code values, virtual addresses, etc. Trap routines do not return.

acceptance_fence()
    Ensures prior data memory references to uncached ordered-sequential memory pages are “accepted”, before subsequent data memory references are performed by the processor.

alat_cmp(rtype, raddr)
    Returns a one if the implementation finds an ALAT entry which matches the register type specified by rtype and the register address specified by raddr, else returns zero. This function is implementation specific. Note that an implementation may optionally choose to return zero (indicating no match) even if a matching entry exists in the ALAT. This provides implementation flexibility in designing fast ALAT lookup circuits.

alat_frame_update(delta_bof, delta_sof)
    Notifies the ALAT of a change in the bottom of frame and/or size of frame. This allows management of the ALAT’s tag bits or other management functions it might need.

alat_inval()
    Invalidate all entries in the ALAT.

alat_inval_multiple_entries(paddr, size)
    The ALAT is queried using the physical memory address specified by paddr and the access size specified by size. All matching ALAT entries are invalidated. No value is returned.

alat_inval_single_entry(rtype, rega)
    The ALAT is queried using the register type specified by rtype and the register address specified by rega. At most one matching ALAT entry is invalidated. No value is returned.

alat_write(rtype, raddr, paddr, size)
    Allocates a new ALAT entry using the register type specified by rtype, the register address specified by raddr, the physical memory address specified by paddr, and the access size specified by size. No value is returned. This function guarantees that only one ALAT entry exists for a given raddr. If a ld.c.nc, ldf.c.nc, or ldfp.c.nc instruction’s raddr matches an existing ALAT entry’s register tag, but the instruction’s size and/or paddr differ from those of the existing entry, then this function may either preserve the existing entry, or invalidate it and write a new entry with the instruction’s specified size and paddr.

check_target_register(r1)
    If r1 targets an out-of-frame stacked register (as defined by CFM), an illegal operation fault is delivered, and this function does not return.

check_target_register_sof(r1, newsof)
    If r1 targets an out-of-frame stacked register (as defined by the newsof parameter), an illegal operation fault is delivered, and this function does not return.



concatenate2(x1, x2) Concatenates the lower 32 bits of the 2 arguments, and returns the 64-bit result.

concatenate4(x1, x2, x3, x4) Concatenates the lower 16 bits of the 4 arguments, and returns the 64-bit result.

concatenate8(x1, x2, x3, x4, x5, x6, x7, x8)

Concatenates the lower 8 bits of the 8 arguments, and returns the 64-bit result.

fadd(fp_dp, fr2) Adds a floating-point register value to the infinitely precise product and returns the infinitely precise sum, ready for rounding.

fcmp_exception_fault_check(fr2, fr3, frel, sf, *tmp_fp_env)

Checks for all floating-point faulting conditions for the fcmp instruction.

fcvt_fx_exception_fault_check(fr2, trunc, sf, *tmp_fp_env)

Checks for all floating-point faulting conditions for the fcvt.fx and fcvt.fx.trunc instructions. It propagates NaNs, and NaTVals.

fcvt_fxu_exception_fault_check(fr2, trunc, sf, *tmp_fp_env)

Checks for all floating-point faulting conditions for the fcvt.fxu and fcvt.fxu.trunc instructions. It propagates NaNs, and NaTVals.

fma_exception_fault_check(fr2, fr3, fr4, pc, sf, *tmp_fp_env)

Checks for all floating-point faulting conditions for the fma instruction. It propagates NaNs, NaTVals, and special IEEE results.

fminmax_exception_fault_check(fr2, fr3, sf, *tmp_fp_env)

Checks for all floating-point faulting conditions for the famax, famin, fmax, and fmin instructions.

fms_fnma_exception_fault_check(fr2, fr3, fr4, pc, sf, *tmp_fp_env)

Checks for all floating-point faulting conditions for the fms and fnma instructions. It propagates NaNs, NaTVals, and special IEEE results.

fmul(fr3, fr4) Performs an infinitely precise multiply of two floating-point register values.

followed_by_stop() Returns TRUE if the current instruction is followed by a stop; otherwise, returns FALSE.

fp_check_target_register(f1) If the specified floating-point register identifier is 0 or 1, this function causes an illegal operation fault.

fp_decode_fault(tmp_fp_env) Returns floating-point exception fault code values for ISR.code.

fp_decode_traps(tmp_fp_env) Returns floating-point trap code values for ISR.code.

fp_is_nan_or_inf(freg) Returns true if the floating-point exception_fault_check functions returned an IEEE fault disabled default result or a propagated NaN.

fp_equal(fr1, fr2) IEEE standard equality relationship test.

fp_ieee_recip(num, den) Returns the true quotient for special sets of operands, or an approximation to the reciprocal of the divisor to be used in the software divide algorithm.

fp_ieee_recip_sqrt(root) Returns the true square root result for special operands, or an approximation to the reciprocal square root to be used in the software square root algorithm.

fp_is_nan(freg) Returns true when floating register contains a NaN.

fp_is_natval(freg) Returns true when floating register contains a NaTVal.

fp_is_normal(freg) Returns true when floating register contains a normal number.

fp_is_pos_inf(freg) Returns true when floating register contains a positive infinity.

fp_is_qnan(freg) Returns true when floating register contains a quiet NaN.

fp_is_snan(freg) Returns true when floating register contains a signalling NaN.

fp_is_unorm(freg) Returns true when floating register contains an unnormalized number.

fp_is_unsupported(freg) Returns true when floating register contains an unsupported format.

fp_less_than(fr1, fr2) IEEE standard less-than relationship test.

fp_lesser_or_equal(fr1, fr2) IEEE standard less-than or equal-to relationship test.

fp_normalize(fr1) Normalizes an unnormalized fp value. This function flushes to zero any unnormal values which cannot be represented in the register file.

fp_raise_fault(tmp_fp_env) Checks the local instruction state for any faulting conditions which require an interruption to be raised.




fp_raise_traps(tmp_fp_env) Checks the local instruction state for any trapping conditions which require an interruption to be raised.

fp_reg_bank_conflict(f1, f2) Returns true if the two specified FRs are in the same bank.

fp_reg_disabled(f1, f2, f3, f4) Check for possible disabled floating-point register faults.

fp_reg_read(freg) Reads the FR and gives canonical double-extended denormals (and pseudo-denormals) their true mathematical exponent. Other classes of operands are unaltered.

fp_unordered(fr1, fr2) IEEE standard unordered relationship.

fp_fr_to_mem_format(freg, size) Converts a floating-point value in register format to floating-point memory format. It assumes that the floating-point value in the register has been previously rounded to the correct precision which corresponds with the size parameter.

frcpa_exception_fault_check(fr2, fr3, sf, *tmp_fp_env)

Checks for all floating-point faulting conditions for the frcpa instruction. It propagates NaNs, NaTVals, and special IEEE results.

frsqrta_exception_fault_check(fr3, sf, *tmp_fp_env)

Checks for all floating-point faulting conditions for the frsqrta instruction. It propagates NaNs, NaTVals, and special IEEE results.

ignored_field_mask(regclass, reg, value)

Boolean function that returns value with bits cleared to 0 corresponding to ignored bits for the specified register and register type.

instruction_serialize() Ensures all prior register updates with side-effects are observed before subsequent instruction and data memory references are performed. Also ensures prior SYNC.i operations have been observed by the instruction cache.

instruction_synchronize Synchronizes the instruction and data stream for Flush Cache operations. This function ensures that when prior FC operations are observed by the local data cache they are observed by the local instruction cache, and when prior FC operations are observed by another processor’s data cache they are observed within the same processor’s instruction cache.

is_finite(freg) Returns true when floating register contains a finite number.

is_ignored_reg(regnum) Boolean function that returns true if regnum is an ignored application register, otherwise false.

is_inf(freg) Returns true when floating register contains an infinite number.

is_kernel_reg(ar_addr) Returns a one if ar_addr is the address of a kernel register application register.

is_reserved_field(regclass, arg2, arg3)

Returns true if the specified data would write a one in a reserved field.

is_reserved_reg(regclass, regnum) Returns true if register regnum is reserved in the regclass register file.

mem_flush(paddr) The line addressed by the physical address paddr is invalidated in all levels of the memory hierarchy above memory and written back to memory if it is inconsistent with memory.

mem_implicit_prefetch(vaddr, hint) Moves the line addressed by vaddr to the location of the memory hierarchy specified by hint. This function is implementation dependent and can be ignored.

mem_promote(paddr, mtype, hint) Moves the line addressed by paddr to the highest level of the memory hierarchy conditioned by the access hints specified by hint. Implementation dependent and can be ignored.

mem_read(paddr, size, border, mattr, otype, hint)

Returns the size bytes starting at the physical memory location specified by paddr with byte order specified by border, memory attributes specified by mattr, and access hint specified by hint. otype specifies the memory ordering attribute of this access, and must be UNORDERED or ACQUIRE.

fp_mem_to_fr_format(mem, size) Converts a floating-point value in memory format to floating-point register format.





mem_write(value, paddr, size, border, mattr, otype, hint)

Writes the least significant size bytes of value into memory starting at the physical memory address specified by paddr with byte order specified by border, memory attributes specified by mattr, and access hint specified by hint. otype specifies the memory ordering attribute of this access, and must be UNORDERED or RELEASE. No value is returned.

mem_xchg(data, paddr, size, byte_order, mattr, otype, hint)

Returns size bytes from memory starting at the physical address specified by paddr. The read is conditioned by the locality hint specified by hint. After the read, the least significant size bytes of data are written to size bytes in memory starting at the physical address specified by paddr. The read and write are performed atomically. Both the read and the write are conditioned by the memory attribute specified by mattr and the byte ordering in memory is specified by byte_order. otype specifies the memory ordering attribute of this access, and must be ACQUIRE.

mem_xchg_add(add_val, paddr, size, byte_order, mattr, otype, hint)

Returns size bytes from memory starting at the physical address specified by paddr. The read is conditioned by the locality hint specified by hint. The least significant size bytes of the sum of the value read from memory and add_val is then written to size bytes in memory starting at the physical address specified by paddr. The read and write are performed atomically. Both the read and the write are conditioned by the memory attribute specified by mattr and the byte ordering in memory is specified by byte_order. otype specifies the memory ordering attribute of this access, and has the value ACQUIRE or RELEASE.

mem_xchg_cond(cmp_val, data, paddr, size, byte_order, mattr, otype, hint)

Returns size bytes from memory starting at the physical address specified by paddr. The read is conditioned by the locality hint specified by hint. If the value read from memory is equal to cmp_val, then the least significant size bytes of data are written to size bytes in memory starting at the physical address specified by paddr. If the write is performed, the read and write are performed atomically. Both the read and the write are conditioned by the memory attribute specified by mattr and the byte ordering in memory is specified by byte_order. otype specifies the memory ordering attribute of this access, and has the value ACQUIRE or RELEASE.

ordering_fence() Ensures prior data memory references are made visible before future data memory references are made visible by the processor.

pr_phys_to_virt(phys_id) Returns the virtual register id of the predicate from the physical register id, phys_id of the predicate.

rotate_regs() Decrements the Register Rename Base registers, effectively rotating the register files. CFM.rrb.gr is decremented only if CFM.sor is non-zero.

rse_enable_current_frame_load() If the RSE load pointer (RSE.BSPLoad) is greater than AR[BSP], the RSE.CFLE bit is set to indicate that mandatory RSE loads are allowed to restore registers in the current frame (in no other case does the RSE spill or fill registers in the current frame). This function does not perform mandatory RSE loads. This procedure does not cause any interruptions.

rse_invalidate_non_current_regs() All registers outside the current frame are invalidated.

rse_new_frame(current_frame_size, new_frame_size)

A new frame is defined without changing any register renaming. The new frame size is completely defined by the new_frame_size parameter (successive calls are not cumulative). If new_frame_size is larger than current_frame_size and the number of registers in the invalid and clean partitions is less than the size of frame growth, then mandatory RSE stores are issued until enough registers are available. The resulting sequence of RSE stores may be interrupted. Mandatory RSE stores may cause interruptions; see rse_store for a list.





rse_preserve_frame(preserved_frame_size)

The number of registers specified by preserved_frame_size are marked to be preserved by the RSE. Register renaming causes the preserved_frame_size registers after GR[32] to be renamed to GR[32]. AR[BSP] is updated to contain the backing store address where the new GR[32] will be stored.

rse_store(type) Saves a register or NaT collection to the backing store (store_address = AR[BSPSTORE]). If store_address{8:3} is equal to 0x3f then the NaT collection AR[RNAT] is stored. If store_address{8:3} is not equal to 0x3f then the register RSE.StoreReg is stored and the NaT bit from that register is deposited in AR[RNAT]{store_address{8:3}}. If the store is successful AR[BSPSTORE] is incremented by 8. If the store is successful and a register was stored RSE.StoreReg is incremented by 1 (possibly wrapping in the stacked registers). This store moves a register from the dirty partition to the clean partition. The privilege level of the store is obtained from AR[RSC].pl. The byte order of the store is obtained from AR[RSC].be. For mandatory RSE stores, type is MANDATORY. RSE stores do not invalidate ALAT entries.

rse_update_internal_stack_pointers(new_store_pointer)

Given a new value for AR[BSPSTORE] (new_store_pointer) this function computes the new value for AR[BSP]. This value is equal to new_store_pointer plus the number of dirty registers plus the number of intervening NaT collections. This means that the size of the dirty partition is the same before and after a write to AR[BSPSTORE]. All clean registers are moved to the invalid partition.

sign_ext(value, pos) Returns a 64 bit number with bits pos-1 through 0 taken from value and bit pos-1 of value replicated in bit positions pos through 63. If pos is greater than or equal to 64, value is returned.

tlb_translate(vaddr, size, type, cpl, *attr, *defer)

Returns the translated data physical address for the specified virtual memory address (vaddr) when translation enabled; otherwise, returns vaddr. size specifies the size of the access, type specifies the type of access (e.g., read, write, advance, spec). cpl specifies the privilege level for access checking purposes. *attr returns the mapped physical memory attribute. If any fault conditions are detected and deferred, tlb_translate returns with *defer set. If a fault is generated but the fault is not deferred, tlb_translate does not return.

tlb_translate_nonaccess(vaddr, type) Returns the translated data physical address for the specified virtual memory address (vaddr). type specifies the type of access (e.g., FC). If a fault is generated, tlb_translate_nonaccess does not return.

unimplemented_physical_address(paddr)

Returns TRUE if the presented physical address is unimplemented on this processor model; FALSE otherwise. This function is model-specific.

impl_undefined_natd_gr_read(paddr, size, be, mattr, otype, ldhint)

Defines register return data for a speculative load to a NaTed address. This function may return data from another address space.

unimplemented_virtual_address(vaddr)

Returns TRUE if the presented virtual address is unimplemented on this processor model; FALSE otherwise. This function is model-specific.

fp_update_fpsr(sf, tmp_fp_env) Copies a floating-point instruction’s local state into the global FPSR.

zero_ext(value, pos) Returns a 64 bit unsigned number with bits pos-1 through 0 taken from value and zeroes in bit positions pos through 63. If pos is greater than or equal to 64, value is returned.
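The bit-manipulation helpers in Table B-1 (sign_ext, zero_ext, and the concatenate family) can be modeled directly. The following is a minimal Python sketch, not the architecture's reference code. Two points are assumptions that the table does not state: the first argument of each concatenate function is placed in the most significant position, and sign_ext with pos equal to 0 returns 0.

```python
MASK64 = (1 << 64) - 1


def zero_ext(value, pos):
    """Keep bits pos-1..0 of value and clear bits pos..63."""
    value &= MASK64
    if pos >= 64:
        return value
    return value & ((1 << pos) - 1)


def sign_ext(value, pos):
    """Keep bits pos-1..0 of value and replicate bit pos-1 into bits pos..63."""
    value &= MASK64
    if pos >= 64:
        return value
    low = value & ((1 << pos) - 1)
    # Assumption: pos == 0 has no sign bit to replicate, so the result is 0.
    if pos > 0 and (low >> (pos - 1)) & 1:
        low |= MASK64 & ~((1 << pos) - 1)
    return low


def _concat(width, *parts):
    """Pack the low `width` bits of each part, first part most significant (assumed)."""
    result = 0
    for part in parts:
        result = (result << width) | (part & ((1 << width) - 1))
    return result


def concatenate2(x1, x2):
    return _concat(32, x1, x2)


def concatenate4(x1, x2, x3, x4):
    return _concat(16, x1, x2, x3, x4)


def concatenate8(x1, x2, x3, x4, x5, x6, x7, x8):
    return _concat(8, x1, x2, x3, x4, x5, x6, x7, x8)
```

For instance, sign_ext(0x80, 8) sets bits 8 through 63 because bit 7 of the input is 1, while zero_ext(0xFFFF, 8) keeps only the low byte.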




C IA-64 Instruction Formats

Each IA-64 instruction is categorized into one of six types; each instruction type may be executed on one or more execution unit types. Table C-1 lists the instruction types and the execution unit type on which they are executed.

Three instructions are grouped together into 128-bit sized and aligned containers called bundles. Each bundle contains three 41-bit instruction slots and a 5-bit template field. The format of a bundle is depicted in Figure C-1.

The template field specifies two properties: stops within the current bundle, and the mapping of instruction slots to execution unit types. Not all combinations of these two properties are allowed; Table C-2 indicates the defined combinations. The three rightmost columns correspond to the three instruction slots in a bundle; listed within each column is the execution unit type controlled by that instruction slot for each encoding of the template field. A double line to the right of an instruction slot indicates that a stop occurs at that point within the current bundle. See “Instruction Encoding Overview” on page 3-11 for the definition of a stop. Within a bundle, execution order proceeds from slot 0 to slot 2. Unused template values (appearing as empty rows in Table C-2) are reserved and cause an Illegal Operation fault.

Extended instructions, used for long immediate integer instructions, occupy two instruction slots.

Table C-1. Relationship Between Instruction Type and Execution Unit Type

Instruction Type   Description        Execution Unit Type
A                  Integer ALU        I-unit or M-unit
I                  Non-ALU integer    I-unit
M                  Memory             M-unit
F                  Floating-point     F-unit
B                  Branch             B-unit
L+X                Extended           I-unit

Bits:    127:87               86:46                45:5                 4:0
Field:   instruction slot 2   instruction slot 1   instruction slot 0   template
Width:   41                   41                   41                   5

Figure C-1. Bundle Format
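The layout in Figure C-1 can be illustrated with a short Python sketch that splits a 128-bit bundle, held as a plain integer, into its template and three instruction slots. The function names are hypothetical, and how the 16 bundle bytes are read from memory into that integer is outside the scope of the sketch.

```python
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = 0x1F


def split_bundle(bundle):
    """Return (template, (slot0, slot1, slot2)) for a 128-bit bundle value."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & TEMPLATE_MASK          # bits 4:0
    slot0 = (bundle >> 5) & SLOT_MASK          # bits 45:5
    slot1 = (bundle >> 46) & SLOT_MASK         # bits 86:46
    slot2 = (bundle >> 87) & SLOT_MASK         # bits 127:87
    return template, (slot0, slot1, slot2)


def make_bundle(template, slots):
    """Inverse of split_bundle: pack a template and three 41-bit slots."""
    slot0, slot1, slot2 = slots
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    for slot in slots:
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
    return template | (slot0 << 5) | (slot1 << 46) | (slot2 << 87)
```

The widths add up as expected: 5 + 3 × 41 = 128, so packing and splitting are exact inverses of each other.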



C.1 Format Summary

All instructions in the instruction set are 41 bits in length. The leftmost 4 bits (40:37) of each instruction are the major opcode. Table C-3 shows the major opcode assignments for each of the 5 instruction types: ALU (A), Integer (I), Memory (M), Floating-point (F), and Branch (B). Bundle template bits are used to distinguish among the 4 columns, so the same major op values can be reused in each column.

Unused major ops (appearing as blank entries in Table C-3) behave in one of three ways:

• Ignored major ops (white entries in Table C-3) execute as nop instructions.

• Reserved major ops (light gray in the grayscale version of Table C-3, brown in the color version) cause an Illegal Operation fault.

• Reserved if PR[qp] is 1 major ops (dark gray in the grayscale version of Table C-3, purple in the color version) cause an Illegal Operation fault if the predicate register specified by the qp field of the instruction (bits 5:0) is 1 and execute as a nop instruction if 0.
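The three behaviors listed above can be summarized in a small Python sketch. It extracts the major opcode (bits 40:37) and the qp field (bits 5:0) from a 41-bit instruction slot and models what an unused major op does. The kind labels and function names are hypothetical, chosen only for this illustration; which kind applies to a given major op comes from the shading in Table C-3.

```python
def major_opcode(slot):
    """Major opcode: bits 40:37 of a 41-bit instruction slot."""
    return (slot >> 37) & 0xF


def qp_field(slot):
    """Qualifying predicate register number: bits 5:0 of the slot."""
    return slot & 0x3F


def unused_major_op_behavior(kind, pr_qp):
    """Model the outcome for an unused major op.

    kind  -- "ignored", "reserved", or "reserved_if_qp" (hypothetical labels)
    pr_qp -- current value (0 or 1) of the predicate register named by qp
    """
    if kind == "ignored":
        return "nop"
    if kind == "reserved":
        return "illegal operation fault"
    if kind == "reserved_if_qp":
        return "illegal operation fault" if pr_qp else "nop"
    raise ValueError("unknown kind: %r" % (kind,))
```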

Table C-2. Template Field Encoding and Instruction Slot Mapping

Template Slot 0 Slot 1 Slot 2

00 M-unit I-unit I-unit






06

07



0A M-unit M-unit I-unit

0B M-unit M-unit I-unit

0C M-unit F-unit I-unit

0D M-unit F-unit I-unit

0E M-unit M-unit F-unit

0F M-unit M-unit F-unit





14

15





1A

1B

1C M-unit F-unit B-unit

1D M-unit F-unit B-unit

1E

1F



Table C-4 on page C-4 summarizes all the instruction formats. The instruction fields are color-coded for ease of identification, as described in Table C-5 on page C-6.

The instruction field names, used throughout this chapter, are described in Table C-6 on page C-7. The set of special notations (such as whether an instruction must be first in an instruction group) are listed in Table C-7 on page C-7. These notations appear in the “Instruction” column of the opcode tables.

Most instructions containing immediates encode those immediates in more than one instruction field. For example, the 14-bit immediate in the Add Imm14 instruction (format A4) is formed from the imm7b, imm6d, and s fields. Table C-65 on page C-57 shows how the immediates are formed from the instruction fields for each instruction which has an immediate.
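As an illustration of forming an immediate from several fields, the following Python sketch assembles the 14-bit Add Imm14 immediate. It assumes that s supplies the sign (most significant) bit, imm6d the next six bits, and imm7b the low seven bits, which is what the field widths and the "immediate sign bit" description of s suggest; Table C-65 remains the authoritative definition. The function name is hypothetical.

```python
def form_imm14(s, imm6d, imm7b):
    """Combine the A4 immediate fields into a signed 14-bit value.

    Assumed layout: bit 13 = s, bits 12:7 = imm6d, bits 6:0 = imm7b.
    """
    if s not in (0, 1):
        raise ValueError("s is a 1-bit field")
    if not 0 <= imm6d < (1 << 6):
        raise ValueError("imm6d is a 6-bit field")
    if not 0 <= imm7b < (1 << 7):
        raise ValueError("imm7b is a 7-bit field")
    raw = (s << 13) | (imm6d << 7) | imm7b
    # Sign-extend from 14 bits: a set sign bit subtracts 2**14.
    return raw - (1 << 14) if s else raw
```

With this layout the representable range is -8192 through 8191, as expected for a signed 14-bit quantity.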

Table C-3. Major Opcode Assignments

Major Op      Instruction Type
(bits 40:37)  I/A              M/A                    F             B                      L+X
0             Misc             Mem Mgmt               FP Misc       Misc/Indirect Branch   Misc
1                              Mem Mgmt               FP Misc       Indirect Call
2                                                                   Nop
3
4             Deposit          Int Ld +Reg/getf       FP Compare    IP-relative Branch
5             Shift/Test Bit   Int Ld/St +Imm         FP Class      IP-rel Call
6                              FP Ld/St +Reg/setf                                          movl
7             MM Mpy/Shift     FP Ld/St +Imm
8             ALU/MM ALU       ALU/MM ALU             fma
9             Add Imm22        Add Imm22              fma
A                                                     fms
B                                                     fms
C             Compare          Compare                fnma
D             Compare          Compare                fnma
E             Compare          Compare                fselect/xma
F



Table C-4. Instruction Format Summary

Fields are listed from bit 40 (left) down to bit 0 (right).

Instruction class      Format  Major op  Fields
ALU                    A1      8         x2a ve x4 x2b r3 r2 r1 qp
Shift L and Add        A2      8         x2a ve x4 ct2d r3 r2 r1 qp
ALU Imm8               A3      8         s x2a ve x4 x2b r3 imm7b r1 qp
Add Imm14              A4      8         s x2a ve imm6d r3 imm7b r1 qp
Add Imm22              A5      9         s imm9d imm5c r3 imm7b r1 qp
Compare                A6      C - E     tb x2 ta p2 r3 r2 c p1 qp
Compare to Zero        A7      C - E     tb x2 ta p2 r3 0 c p1 qp
Compare Imm8           A8      C - E     s x2 ta p2 r3 imm7b c p1 qp
MM ALU                 A9      8         za x2a zb x4 x2b r3 r2 r1 qp
MM Shift and Add       A10     8         za x2a zb x4 ct2d r3 r2 r1 qp
MM Multiply Shift      I1      7         za x2a zb ve ct2d x2b r3 r2 r1 qp
MM Mpy/Mix/Pack        I2      7         za x2a zb ve x2c x2b r3 r2 r1 qp
MM Mux1                I3      7         za x2a zb ve x2c x2b mbt4c r2 r1 qp
MM Mux2                I4      7         za x2a zb ve x2c x2b mht8c r2 r1 qp
Shift R Variable       I5      7         za x2a zb ve x2c x2b r3 r2 r1 qp
MM Shift R Fixed       I6      7         za x2a zb ve x2c x2b r3 count5b r1 qp
Shift L Variable       I7      7         za x2a zb ve x2c x2b r3 r2 r1 qp
MM Shift L Fixed       I8      7         za x2a zb ve x2c x2b ccount5c r2 r1 qp
Popcount               I9      7         za x2a zb ve x2c x2b r3 0 r1 qp
Shift Right Pair       I10     5         x2 x count6d r3 r2 r1 qp
Extract                I11     5         x2 x len6d r3 pos6b y r1 qp
Dep.Z                  I12     5         x2 x len6d y cpos6c r2 r1 qp
Dep.Z Imm8             I13     5         s x2 x len6d y cpos6c imm7b r1 qp
Deposit Imm1           I14     5         s x2 x len6d r3 cpos6b r1 qp
Deposit                I15     4         cpos6d len4d r3 r2 r1 qp
Test Bit               I16     5         tb x2 ta p2 r3 pos6b y c p1 qp
Test NaT               I17     5         tb x2 ta p2 r3 y c p1 qp
Break/Nop              I19     0         i x3 x6 imm20a qp
Int Spec Check         I20     0         s x3 imm13c r2 imm7a qp
Move to BR             I21     0         x3 x r2 b1 qp
Move from BR           I22     0         x3 x6 b2 r1 qp
Move to Pred           I23     0         s x3 mask8c r2 mask7a qp
Move to Pred Imm44     I24     0         s x3 imm27a qp
Move from Pred/IP      I25     0         x3 x6 r1 qp
Move to AR             I26     0         x3 x6 ar3 r2 qp
Move to AR Imm8        I27     0         s x3 x6 ar3 imm7b qp
Move from AR           I28     0         x3 x6 ar3 r1 qp
Sxt/Zxt/Czx            I29     0         x3 x6 r3 r1 qp
Int Load               M1      4         m x6 hint x r3 r1 qp
Int Load +Reg          M2      4         m x6 hint x r3 r2 r1 qp
Int Load +Imm          M3      5         s x6 hint i r3 imm7b r1 qp
Int Store              M4      4         m x6 hint x r3 r2 qp
Int Store +Imm         M5      5         s x6 hint i r3 r2 imm7a qp
FP Load                M6      6         m x6 hint x r3 f1 qp
FP Load +Reg           M7      6         m x6 hint x r3 r2 f1 qp
FP Load +Imm           M8      7         s x6 hint i r3 imm7b f1 qp
FP Store               M9      6         m x6 hint x r3 f2 qp
FP Store +Imm          M10     7         s x6 hint i r3 f2 imm7a qp
FP Load Pair           M11     6         m x6 hint x r3 f2 f1 qp
FP Load Pair +Imm      M12     6         m x6 hint x r3 f2 f1 qp
Line Prefetch          M13     6         m x6 hint x r3 qp
Line Prefetch +Reg     M14     6         m x6 hint x r3 r2 qp
Line Prefetch +Imm     M15     7         s x6 hint i r3 imm7b qp
(Cmp &) Exchg          M16     4         m x6 hint x r3 r2 r1 qp
Fetch & Add            M17     4         m x6 hint x r3 s i2b r1 qp
Set FR                 M18     6         m x6 x r2 f1 qp
Get FR                 M19     4         m x6 x f2 r1 qp
Int Spec Check         M20     1         s x3 imm13c r2 imm7a qp
FP Spec Check          M21     1         s x3 imm13c f2 imm7a qp
Int ALAT Check         M22     0         s x3 imm20b r1 qp
FP ALAT Check          M23     0         s x3 imm20b f1 qp
Sync/Srlz/ALAT         M24     0         x3 x2 x4 qp
RSE Control            M25     0         x3 x2 x4 0
Int ALAT Inval         M26     0         x3 x2 x4 r1 qp
FP ALAT Inval          M27     0         x3 x2 x4 f1 qp
Flush Cache            M28     1         x3 x6 r3 qp
Move to AR             M29     1         x3 x6 ar3 r2 qp
Move to AR Imm8        M30     0         s x3 x2 x4 ar3 imm7b qp
Move from AR           M31     1         x3 x6 ar3 r1 qp
Alloc                  M34     1         x3 sor sol sof r1 0
Move to PSR            M35     1         x3 x6 r2 qp
Move from PSR          M36     1         x3 x6 r1 qp
Break/Nop              M37     0         i x3 x2 x4 imm20a qp
Mv from Ind            M43     1         x3 x6 r3 r1 qp
Set/Reset Mask         M44     0         i x3 i2d x4 imm21a qp
IP-Relative Branch     B1      4         s d wh imm20b p btype qp
Counted Branch         B2      4         s d wh imm20b p btype 0
IP-Relative Call       B3      5         s d wh imm20b p b1 qp
Indirect Branch        B4      0         d wh x6 b2 p btype qp
Indirect Call          B5      1         d wh b2 p b1 qp
Misc                   B8      0         x6 0
Break/Nop              B9      0/2       i x6 imm20a qp
FP Arithmetic          F1      8 - D     x sf f4 f3 f2 f1 qp
Fixed Multiply Add     F2      E         x x2 f4 f3 f2 f1 qp
FP Select              F3      E         x f4 f3 f2 f1 qp
FP Compare             F4      4         rb sf ra p2 f3 f2 ta p1 qp
FP Class               F5      5         fc2 p2 fclass7c f2 ta p1 qp
FP Recip Approx        F6      0 - 1     q sf x p2 f3 f2 f1 qp
FP Recip Sqrt App      F7      0 - 1     q sf x p2 f3 f1 qp
FP Min/Max/Pcmp        F8      0 - 1     sf x x6 f3 f2 f1 qp
FP Merge/Logical       F9      0 - 1     x x6 f3 f2 f1 qp
Convert FP to Fixed    F10     0 - 1     sf x x6 f2 f1 qp
Convert Fixed to FP    F11     0         x x6 f2 f1 qp
FP Set Controls        F12     0         sf x x6 omask7c amask7b qp
FP Clear Flags         F13     0         sf x x6 qp
FP Check Flags         F14     0         s sf x x6 imm20a qp
Break/Nop              F15     0         i x x6 imm20a qp
Break/Nop              X1      0         i x3 x6 imm20a qp imm41
Move Imm64             X2      6         i imm9d imm5c ic vc imm7b r1 qp imm41

Table C-5. Instruction Field Color Key

Field & Color
ALU Instruction              Opcode Extension
Integer Instruction          Opcode Hint Extension
Memory Instruction           Immediate
Branch Instruction           Indirect Source
Floating-point Instruction   Predicate Destination
Integer Source               Integer Destination
Memory Source                Memory Source & Destination
Shift Source                 Shift Immediate
Special Register Source      Special Register Destination
Floating-point Source        Floating-point Destination
Branch Source                Branch Destination
Address Source               Reserved Instruction
Qualifying Predicate         Reserved Inst if PR[qp] is 1
Ignored Field/Instruction




The remaining sections of this chapter present the detailed encodings of all instructions. The “A-Unit Instruction Encodings” are presented first, followed by the “I-Unit Instruction Encodings” on page C-16, “M-Unit Instruction Encodings” on page C-26, “B-Unit Instruction Encodings” on page C-45, “F-Unit Instruction Encodings” on page C-49, and “X-Unit Instruction Encodings” on page C-56.

Within each section, the instructions are grouped by function, and appear with their instruction format in the same order as in Table C-4, “Instruction Format Summary,” on page C-4. The opcode extension fields are briefly described and tables present the opcode extension assignments. Unused instruction encodings (appearing as blank entries in the opcode extensions tables) behave in one of three ways:

Table C-6. Instruction Field Names

Field Name Description

ar3 application register source/target

b1, b2 branch register source/target

btype branch type opcode extension

c complement compare relation opcode extension

ccount5c multimedia shift left complemented shift count immediate

count5b, count6d multimedia shift right/shift right pair shift count immediate

cposx deposit complemented bit position immediate

ct2d multimedia multiply shift/shift and add shift count immediate

d branch cache deallocation hint opcode extension

fn floating-point register source/target

fc2, fclass7c floating-point class immediate

hint memory reference hint opcode extension

i, i2b, i2d, immx immediate of length 1, 2, or x

len4d, len6d extract/deposit length immediate

m memory reference post-modify opcode extension

maskx predicate immediate mask

mbt4c, mht8c multimedia mux1/mux2 immediate

p sequential prefetch hint opcode extension

p1, p2 predicate register target

pos6b test bit/extract bit position immediate

q floating-point reciprocal/reciprocal square-root opcode extension

qp qualifying predicate register source

rn general register source/target

s immediate sign bit

sf floating-point status field opcode extension

sof, sol, sor alloc size of frame, size of locals, size of rotating immediates

ta, tb compare type opcode extension

vx reserved opcode extension field

wh branch whether hint opcode extension

x, xn opcode extension of length 1 or n

y extract/deposit/test bit/test NaT opcode extension

za, zb multimedia operand size opcode extension

Table C-7. Special Instruction Notations

Notation Description

f instruction must be the first in an instruction group

l instruction must be the last in an instruction group

t instruction is only allowed in instruction slot 2



• Ignored instructions (white entries in the tables) execute as nop instructions.

• Reserved instructions (light gray in the grayscale version of the tables, brown in the color version) cause an Illegal Operation fault.

• Reserved if PR[qp] is 1 instructions (dark gray in the grayscale version of the tables, purple in the color version) cause an Illegal Operation fault if the predicate register specified by the qp field of the instruction (bits 5:0) is 1 and execute as a nop instruction if 0.

Constant 0 fields in instructions must be 0 or undefined operation results. The undefined operation may include checking that the constant field is 0 and causing an Illegal Operation fault if it is not. If an instruction having a constant 0 field also has a qualifying predicate (qp field), the fault or other undefined operation must not occur if PR[qp] is 0. For constant 0 fields in instruction bits 5:0 (normally used for qp), the fault or other undefined operation may or may not depend on the PR addressed by those bits.

Ignored (white space) fields in instructions should be coded as 0. Although ignored in this revision of the architecture, future architecture revisions may define these fields as hint extensions. These hint extensions will be defined such that the 0 value in each field corresponds to the default hint. It is expected that assemblers will automatically set these fields to zero by default.

C.2 A-Unit Instruction Encodings

C.2.1 Integer ALU

All integer ALU instructions are encoded within major opcode 8 using a 2-bit opcode extension field in bits 35:34 (x2a) and most have a second 2-bit opcode extension field in bits 28:27 (x2b), a 4-bit opcode extension field in bits 32:29 (x4), and a 1-bit reserved opcode extension field in bit 33 (ve). Table C-8 shows the 2-bit x2a and 1-bit ve assignments, Table C-9 shows the integer ALU 4-bit+2-bit assignments, and Table C-12 on page C-13 shows the multimedia ALU 1-bit+2-bit assignments (which also share major opcode 8).

Table C-8. Integer ALU 2-bit+1-bit Opcode Extensions

Opcode        x2a          ve
Bits 40:37    Bits 35:34   Bit 33   Assignment
8             0            0        Integer ALU 4-bit+2-bit Ext (Table C-9)
8             1            0, 1     Multimedia ALU 1-bit+2-bit Ext (Table C-12)
8             2            0        adds – imm14 A4
8             3            0        addp4 – imm14 A4

Table C-9. Integer ALU 4-bit+2-bit Opcode Extensions

Opcode (bits 40:37) = 8, x2a (bits 35:34) = 0, ve (bit 33) = 0

x4            x2b (bits 28:27)
Bits 32:29    0                  1                   2                3
0             add A1             add +1 A1
1             sub –1 A1          sub A1
2             addp4 A1
3             and A1             andcm A1            or A1            xor A1
4             shladd A2 (bits 28:27 hold ct2d)
5
6             shladdp4 A2 (bits 28:27 hold ct2d)
7
8
9                                sub – imm8 A3
A
B             and – imm8 A3      andcm – imm8 A3     or – imm8 A3     xor – imm8 A3
C
D
E
F



C.2.1.1 Integer ALU – Register-Register

A1

Bits:    40:37  36  35:34  33  32:29  28:27  26:20  19:13  12:6  5:0
Field:   8          x2a    ve  x4     x2b    r3     r2     r1    qp
Width:   4      1   2      1   4      2      7      7      7     6

Instruction  Operands          Opcode  x2a  ve  x4  x2b
add          r1 = r2, r3       8       0    0   0   0
add          r1 = r2, r3, 1    8       0    0   0   1
sub          r1 = r2, r3       8       0    0   1   1
sub          r1 = r2, r3, 1    8       0    0   1   0
addp4        r1 = r2, r3       8       0    0   2   0
and          r1 = r2, r3       8       0    0   3   0
andcm        r1 = r2, r3       8       0    0   3   1
or           r1 = r2, r3       8       0    0   3   2
xor          r1 = r2, r3       8       0    0   3   3

C.2.1.2 Shift Left and Add

A2

Bits:    40:37  36  35:34  33  32:29  28:27  26:20  19:13  12:6  5:0
Field:   8          x2a    ve  x4     ct2d   r3     r2     r1    qp
Width:   4      1   2      1   4      2      7      7      7     6

Instruction  Operands               Opcode  x2a  ve  x4
shladd       r1 = r2, count2, r3    8       0    0   4
shladdp4     r1 = r2, count2, r3    8       0    0   6

C.2.1.3 Integer ALU – Immediate 8-Register

A3

Bits:    40:37  36  35:34  33  32:29  28:27  26:20  19:13  12:6  5:0
Field:   8      s   x2a    ve  x4     x2b    r3     imm7b  r1    qp
Width:   4      1   2      1   4      2      7      7      7     6

Instruction  Operands         Opcode  x2a  ve  x4  x2b
sub          r1 = imm8, r3    8       0    0   9   1
and          r1 = imm8, r3    8       0    0   B   0
andcm        r1 = imm8, r3    8       0    0   B   1
or           r1 = imm8, r3    8       0    0   B   2
xor          r1 = imm8, r3    8       0    0   B   3

C.2.1.4 Add Immediate 14

A4

Bits:    40:37  36  35:34  33  32:27  26:20  19:13  12:6  5:0
Field:   8      s   x2a    ve  imm6d  r3     imm7b  r1    qp
Width:   4      1   2      1   6      7      7      7     6

Instruction  Operands          Opcode  x2a  ve
adds         r1 = imm14, r3    8       2    0
addp4        r1 = imm14, r3    8       3    0

C.2.1.5 Add Immediate 22

A5

Bits:    40:37  36  35:27  26:22  21:20  19:13  12:6  5:0
Field:   9      s   imm9d  imm5c  r3     imm7b  r1    qp
Width:   4      1   9      5      2      7      7     6

Instruction  Operands          Opcode
addl         r1 = imm22, r3    9
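To make the A1 layout concrete, the following Python sketch packs and unpacks the fields of format A1 using the bit positions given in section C.2.1.1. The function names and the field-table representation are hypothetical; bit 36, which carries no named field in A1, is left as 0.

```python
# (name, lowest bit position, width) for each named field of format A1.
A1_FIELDS = (
    ("op", 37, 4),
    ("x2a", 34, 2),
    ("ve", 33, 1),
    ("x4", 29, 4),
    ("x2b", 27, 2),
    ("r3", 20, 7),
    ("r2", 13, 7),
    ("r1", 6, 7),
    ("qp", 0, 6),
)


def encode_a1(**fields):
    """Pack named A1 fields into a 41-bit slot; omitted fields default to 0."""
    word = 0
    for name, shift, width in A1_FIELDS:
        value = fields.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError("%s does not fit in %d bits" % (name, width))
        word |= value << shift
    return word


def decode_a1(word):
    """Unpack a 41-bit slot into a dict of A1 field values."""
    return {name: (word >> shift) & ((1 << width) - 1)
            for name, shift, width in A1_FIELDS}


def encode_add(r1, r2, r3, qp=0):
    """add r1 = r2, r3: major opcode 8 with x2a, ve, x4, and x2b all 0."""
    return encode_a1(op=8, x2a=0, ve=0, x4=0, x2b=0, r1=r1, r2=r2, r3=r3, qp=qp)
```

Under these assumptions, encode_add(1, 2, 3) yields 0x10000304040: the major opcode contributes 8 << 37, and the three register numbers land at bits 26:20, 19:13, and 12:6.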



C.2.2 Integer Compare

The integer compare instructions are encoded within major opcodes C – E using a 2-bit opcode extension field (x2) in bits 35:34 and three 1-bit opcode extension fields in bits 33 (ta), 36 (tb), and 12 (c), as shown in Table C-10. The integer compare immediate instructions are encoded within major opcodes C – E using a 2-bit opcode extension field (x2) in bits 35:34 and two 1-bit opcode extension fields in bits 33 (ta) and 12 (c), as shown in Table C-11.
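The field positions named in this paragraph can be sketched as a small Python decoder for the register-register compare form. The lookup table holds only four entries transcribed from Table C-10 (major opcode C, x2 = 0, tb = 0) and is illustrative, not complete; the names below are hypothetical.

```python
def compare_ext_fields(slot):
    """Extract the major opcode and compare opcode extension fields."""
    return {
        "opcode": (slot >> 37) & 0xF,  # bits 40:37
        "tb": (slot >> 36) & 1,        # bit 36
        "x2": (slot >> 34) & 0x3,      # bits 35:34
        "ta": (slot >> 33) & 1,        # bit 33
        "c": (slot >> 12) & 1,         # bit 12
    }


# Partial lookup keyed by (x2, tb, ta, c) for major opcode C only.
OPCODE_C_COMPARES = {
    (0, 0, 0, 0): "cmp.lt",
    (0, 0, 0, 1): "cmp.lt.unc",
    (0, 0, 1, 0): "cmp.eq.and",
    (0, 0, 1, 1): "cmp.ne.and",
}


def compare_mnemonic(slot):
    """Return the mnemonic if the slot matches one of the sample entries, else None."""
    f = compare_ext_fields(slot)
    if f["opcode"] != 0xC:
        return None
    return OPCODE_C_COMPARES.get((f["x2"], f["tb"], f["ta"], f["c"]))
```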

Table C-10. Integer Compare Opcode Extensions

x2          tb      ta      c       Opcode (bits 40:37)
Bits 35:34  Bit 36  Bit 33  Bit 12  C                 D                  E
0           0       0       0       cmp.lt A6         cmp.ltu A6         cmp.eq A6
0           0       0       1       cmp.lt.unc A6     cmp.ltu.unc A6     cmp.eq.unc A6
0           0       1       0       cmp.eq.and A6     cmp.eq.or A6       cmp.eq.or.andcm A6
0           0       1       1       cmp.ne.and A6     cmp.ne.or A6       cmp.ne.or.andcm A6
0           1       0       0       cmp.gt.and A7     cmp.gt.or A7       cmp.gt.or.andcm A7
0           1       0       1       cmp.le.and A7     cmp.le.or A7       cmp.le.or.andcm A7
0           1       1       0       cmp.ge.and A7     cmp.ge.or A7       cmp.ge.or.andcm A7
0           1       1       1       cmp.lt.and A7     cmp.lt.or A7       cmp.lt.or.andcm A7
1           0       0       0       cmp4.lt A6        cmp4.ltu A6        cmp4.eq A6
1           0       0       1       cmp4.lt.unc A6    cmp4.ltu.unc A6    cmp4.eq.unc A6
1           0       1       0       cmp4.eq.and A6    cmp4.eq.or A6      cmp4.eq.or.andcm A6
1           0       1       1       cmp4.ne.and A6    cmp4.ne.or A6      cmp4.ne.or.andcm A6
1           1       0       0       cmp4.gt.and A7    cmp4.gt.or A7      cmp4.gt.or.andcm A7
1           1       0       1       cmp4.le.and A7    cmp4.le.or A7      cmp4.le.or.andcm A7
1           1       1       0       cmp4.ge.and A7    cmp4.ge.or A7      cmp4.ge.or.andcm A7
1           1       1       1       cmp4.lt.and A7    cmp4.lt.or A7      cmp4.lt.or.andcm A7

Table C-11. Integer Compare Immediate Opcode Extensions

x2Bits

35:34

taBit 33

cBit 12

OpcodeBits 40:37

C D E

20 0 cmp.lt – imm8 A8 cmp.ltu – imm8 A8 cmp.eq – imm8 A8

1 cmp.lt.unc – imm8 A8 cmp.ltu.unc – imm8 A8 cmp.eq.unc – imm8 A8

1 0 cmp.eq.and – imm8 A8 cmp.eq.or – imm8 A8 cmp.eq.or.andcm – imm8 A81 cmp.ne.and – imm8 A8 cmp.ne.or – imm8 A8 cmp.ne.or.andcm – imm8 A8

30 0 cmp4.lt – imm8 A8 cmp4.ltu – imm8 A8 cmp4.eq – imm8 A8

1 cmp4.lt.unc – imm8 A8 cmp4.ltu.unc – imm8 A8 cmp4.eq.unc – imm8 A8

1 0 cmp4.eq.and – imm8 A8 cmp4.eq.or – imm8 A8 cmp4.eq.or.andcm – imm8 A81 cmp4.ne.and – imm8 A8 cmp4.ne.or – imm8 A8 cmp4.ne.or.andcm – imm8 A8



C.2.2.1 Integer Compare – Register-Register

A6

40 37 36 35 34 33 32 27 26 20 19 13 12 11 6 5 0

C - E tb x2 ta p2 r3 r2 c p1 qp4 1 2 1 6 7 7 1 6 6


x2 tb ta ccmp.lt

p1, p2 = r2, r3

C

0 0

0

0cmp.ltu Dcmp.eq Ecmp.lt.unc C

1cmp.ltu.unc Dcmp.eq.unc Ecmp.eq.and C

1

0cmp.eq.or Dcmp.eq.or.andcm Ecmp.ne.and C

1cmp.ne.or Dcmp.ne.or.andcm Ecmp4.lt C

1 0

0

0cmp4.ltu Dcmp4.eq Ecmp4.lt.unc C

1cmp4.ltu.unc Dcmp4.eq.unc Ecmp4.eq.and C

1

0cmp4.eq.or Dcmp4.eq.or.andcm Ecmp4.ne.and C

1cmp4.ne.or Dcmp4.ne.or.andcm E



C.2.2.2 Integer Compare to Zero – Register

A7

40 37 36 35 34 33 32 27 26 20 19 13 12 11 6 5 0

C - E tb x2 ta p2 r3 0 c p1 qp4 1 2 1 6 7 7 1 6 6


x2 tb ta ccmp.gt.and

p1, p2 = r0, r3

C

0

1

0

0cmp.gt.or Dcmp.gt.or.andcm Ecmp.le.and C

1cmp.le.or Dcmp.le.or.andcm Ecmp.ge.and C

1

0cmp.ge.or Dcmp.ge.or.andcm Ecmp.lt.and C

1cmp.lt.or Dcmp.lt.or.andcm Ecmp4.gt.and C

1

0

0cmp4.gt.or Dcmp4.gt.or.andcm Ecmp4.le.and C

1cmp4.le.or Dcmp4.le.or.andcm Ecmp4.ge.and C

1

0cmp4.ge.or Dcmp4.ge.or.andcm Ecmp4.lt.and C

1cmp4.lt.or Dcmp4.lt.or.andcm E



C.2.2.3 Integer Compare – Immediate-Register

A8

C.2.3 Multimedia

All multimedia ALU instructions are encoded within major opcode 8 using two 1-bit opcode extension fields in bits 36 (za) and 33 (zb) and a 2-bit opcode extension field in bits 35:34 (x2a) as shown in Table C-12. The multimedia ALU instructions also have a 4-bit opcode extension field in bits 32:29 (x4), and a 2-bit opcode extension field in bits 28:27 (x2b) as shown in Table C-13 on page C-14.
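A minimal sketch of how these fields combine, assuming the A9 layout given in C.2.3.1 (opcode, za, x2a, zb, x4, x2b, r3, r2, r1, qp with widths 4, 1, 2, 1, 4, 2, 7, 7, 7, 6). It packs a plain padd1, padd2, or padd4 into a 41-bit slot. The names `A9_FIELDS`, `pack`, and `encode_padd` are hypothetical.

```python
# Format A9 fields, most significant first: (name, width in bits).
A9_FIELDS = [("opcode", 4), ("za", 1), ("x2a", 2), ("zb", 1), ("x4", 4),
             ("x2b", 2), ("r3", 7), ("r2", 7), ("r1", 7), ("qp", 6)]


def pack(fields, values):
    """Concatenate the field values, most significant field first."""
    slot = 0
    for name, width in fields:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        slot = (slot << width) | value
    return slot


def encode_padd(size, r1, r2, r3, qp=0):
    """Encode padd1/padd2/padd4 r1 = r2, r3 (no saturation completer)."""
    # Table C-12: (za, zb) selects the element size.
    za, zb = {1: (0, 0), 2: (0, 1), 4: (1, 0)}[size]
    return pack(A9_FIELDS, dict(opcode=8, za=za, x2a=1, zb=zb, x4=0, x2b=0,
                                r3=r3, r2=r2, r1=r1, qp=qp))


slot = encode_padd(2, r1=1, r2=2, r3=3)
```

The field widths sum to 41, so the opcode lands in bits 40:37 without any explicit shift constants.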

40 37 36 35 34 33 32 27 26 20 19 13 12 11 6 5 0

C - E s x2 ta p2 r3 imm7b c p1 qp4 1 2 1 6 7 7 1 6 6


x2 ta ccmp.lt

p1, p2 = imm8, r3

C

2

0

0cmp.ltu Dcmp.eq Ecmp.lt.unc C

1cmp.ltu.unc Dcmp.eq.unc Ecmp.eq.and C

1

0cmp.eq.or Dcmp.eq.or.andcm Ecmp.ne.and C

1cmp.ne.or Dcmp.ne.or.andcm Ecmp4.lt C

3

0

0cmp4.ltu Dcmp4.eq Ecmp4.lt.unc C

1cmp4.ltu.unc Dcmp4.eq.unc Ecmp4.eq.and C

1

0cmp4.eq.or Dcmp4.eq.or.andcm Ecmp4.ne.and C

1cmp4.ne.or Dcmp4.ne.or.andcm E

Table C-12. Multimedia ALU 2-bit+1-bit Opcode Extensions

Opcode Bits 40:37   x2a Bits 35:34   za Bit 36   zb Bit 33
8                   1                0           0           Multimedia ALU Size 1 (Table C-13)
8                   1                0           1           Multimedia ALU Size 2 (Table C-14)
8                   1                1           0           Multimedia ALU Size 4 (Table C-15)
8                   1                1           1



Table C-13. Multimedia ALU Size 1 4-bit+2-bit Opcode Extensions

OpcodeBits

40:37

x2aBits

35:34

zaBit 36

zbBit 33

x4Bits

32:29

x2bBits 28:27

0 1 2 3

8 1 0 0

0 padd1 A9 padd1.sss A9 padd1.uuu A9 padd1.uus A91 psub1 A9 psub1.sss A9 psub1.uuu A9 psub1.uus A92 pavg1 A9 pavg1.raz A93 pavgsub1 A9456789 pcmp1.eq A9 pcmp1.gt A9ABCDEF


OpcodeBits

40:37

x2aBits

35:34

zaBit 36

zbBit 33

x4Bits

32:29

x2bBits 28:27

0 1 2 3

8 1 0 1

0 padd2 A9 padd2.sss A9 padd2.uuu A9 padd2.uus A91 psub2 A9 psub2.sss A9 psub2.uuu A9 psub2.uus A92 pavg2 A9 pavg2.raz A93 pavgsub2 A94 pshladd2 A1056 pshradd2 A10789 pcmp2.eq A9 pcmp2.gt A9ABCDEF



C.2.3.1 Multimedia ALU

A9


OpcodeBits

40:37

x2aBits

35:34

zaBit 36

zbBit 33

x4Bits

32:29

x2bBits 28:27

0 1 2 3

8 1 1 0

0 padd4 A91 psub4 A923456789 pcmp4.eq A9 pcmp4.gt A9ABCDEF

40 37 36 35 34 33 32 29 28 27 26 20 19 13 12 6 5 0

8 za x2a zb x4 x2b r3 r2 r1 qp4 1 2 1 4 2 7 7 7 6


x2a za zb x4 x2bpadd1

r1 = r2, r3 8 1

0 0

0

0padd2 1padd4 1 0padd1.sss 0 0 1padd2.sss 1padd1.uuu 0 0 2padd2.uuu 1padd1.uus 0 0 3padd2.uus 1psub1 0 0

1

0psub2 1psub4 1 0psub1.sss 0 0 1psub2.sss 1psub1.uuu 0 0 2psub2.uuu 1psub1.uus 0 0 3psub2.uus 1pavg1 0 0

22pavg2 1

pavg1.raz 0 0 3pavg2.raz 1pavgsub1 0 0 3 2pavgsub2 1pcmp1.eq 0 0

9

0pcmp2.eq 1pcmp4.eq 1 0pcmp1.gt 0 0

1pcmp2.gt 1pcmp4.gt 1 0



C.2.3.2 Multimedia Shift and Add

A10

C.3 I-Unit Instruction Encodings

C.3.1 Multimedia and Variable Shifts

All multimedia multiply/shift/max/min/mix/mux/pack/unpack and variable shift instructions are encoded within major opcode 7 using two 1-bit opcode extension fields in bits 36 (za) and 33 (zb) and a 1-bit reserved opcode extension in bit 32 (ve) as shown in Table C-16. They also have a 2-bit opcode extension field in bits 35:34 (x2a) and a 2-bit field in bits 29:28 (x2b) and most have a 2-bit field in bits 31:30 (x2c) as shown in Table C-17.
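The first-level selection in Table C-16 can be sketched as a small lookup on za and zb. The function name `classify_major7` is hypothetical, and treating ve = 1 as "reserved" is an interpretation of the empty ve = 1 column in that table.

```python
# (za, zb) -> second-level table, per Table C-16 with ve = 0.
MAJOR7_GROUPS = {
    (0, 0): "Multimedia Size 1 (Table C-17)",
    (0, 1): "Multimedia Size 2 (Table C-18)",
    (1, 0): "Multimedia Size 4 (Table C-19)",
    (1, 1): "Variable Shift (Table C-20)",
}


def classify_major7(slot: int) -> str:
    """Name the opcode-extension table that applies to a major opcode 7 slot."""
    if (slot >> 37) & 0xF != 7:
        raise ValueError("not a major opcode 7 instruction")
    za = (slot >> 36) & 1
    zb = (slot >> 33) & 1
    ve = (slot >> 32) & 1
    if ve:
        return "reserved"  # assumption: no assignments are listed for ve = 1
    return MAJOR7_GROUPS[(za, zb)]


group = classify_major7((7 << 37) | (1 << 36) | (1 << 33))
```

A full decoder would go on to index the selected table with x2a, x2b, and x2c.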

40 37 36 35 34 33 32 29 28 27 26 20 19 13 12 6 5 0

8 za x2a zb x4 ct2d r3 r2 r1 qp4 1 2 1 4 2 7 7 7 6


x2a za zb x4pshladd2

r1 = r2, count2, r3 8 1 0 1 4pshradd2 6

Table C-16. Multimedia and Variable Shift 1-bit Opcode Extensions

Opcode Bits 40:37   za Bit 36   zb Bit 33   ve Bit 32 = 0                     ve Bit 32 = 1
7                   0           0           Multimedia Size 1 (Table C-17)
7                   0           1           Multimedia Size 2 (Table C-18)
7                   1           0           Multimedia Size 4 (Table C-19)
7                   1           1           Variable Shift (Table C-20)

Table C-17. Multimedia Max/Min/Mix/Pack/Unpack Size 1 2-bit Opcode Extensions

OpcodeBits

40:37

zaBit 36

zbBit 33

veBit 32

x2aBits

35:34

x2bBits

29:28

x2cBits 31:30

0 1 2 3

7 0 0 0

0

0123

1

0123

2

0 unpack1.h I2 mix1.r I21 pmin1.u I2 pmax1.u I22 unpack1.l I2 mix1.l I23 psad1 I2

3

012 mux1 I33



Table C-18. Multimedia Multiply/Shift/Max/Min/Mix/Pack/Unpack Size 2 2-bit Opcode Extensions

OpcodeBits

40:37

zaBit 36

zbBit 33

veBit 32

x2aBits

35:34

x2bBits

29:28

x2cBits 31:30

0 1 2 3

7 0 1 0

0

0 pshr2.u – var I5 pshl2 – var I71 pmpyshr2.u I12 pshr2 – var I53 pmpyshr2 I1

1

01 pshr2.u – fixed I6 popcnt I923 pshr2 – fixed I6

2

0 pack2.uss I2 unpack2.h I2 mix2.r I21 pmpy2.r I22 pack2.sss I2 unpack2.l I2 mix2.l I23 pmin2 I2 pmax2 I2 pmpy2.l I2

3

01 pshl2 – fixed I82 mux2 I43

Table C-19. Multimedia Shift/Mix/Pack/Unpack Size 4 2-bit Opcode Extensions

OpcodeBits

40:37

zaBit 36

zbBit 33

veBit 32

x2aBits

35:34

x2bBits

29:28

x2cBits 31:30

0 1 2 3

7 1 0 0

0

0 pshr4.u – var I5 pshl4 – var I712 pshr4 – var I53

1

01 pshr4.u – fixed I623 pshr4 – fixed I6

2

0 unpack4.h I2 mix4.r I212 pack4.sss I2 unpack4.l I2 mix4.l I23

3

01 pshl4 – fixed I823



C.3.1.1 Multimedia Multiply and Shift

I1

C.3.1.2 Multimedia Multiply/Mix/Pack/Unpack

I2

Table C-20. Variable Shift 2-bit Opcode Extensions

OpcodeBits

40:37

zaBit 36

zbBit 33

veBit 32

x2aBits

35:34

x2bBits

29:28

x2cBits 31:30

0 1 2 3

7 1 1 0

0

0 shr.u – var I5 shl – var I712 shr – var I53

1

0123

2

0123

3

0123

40 37 36 35 34 33 32 31 30 29 28 27 26 20 19 13 12 6 5 0

7 za x2a zbvect2d x2b r3 r2 r1 qp4 1 2 1 1 2 2 1 7 7 7 6


za zb ve x2a x2bpmpyshr2

r1 = r2, r3, count2 7 0 1 0 0 3pmpyshr2.u 1

40 37 36 35 34 33 32 31 30 29 28 27 26 20 19 13 12 6 5 0

7 za x2a zbve x2c x2b r3 r2 r1 qp4 1 2 1 1 2 2 1 7 7 7 6


za zb ve x2a x2b x2cpmpy2.r

r1 = r2, r3 7

0 1

0 2

1 3pmpy2.l 3mix1.r 0 0

0

2

mix2.r 0 1mix4.r 1 0mix1.l 0 0

2mix2.l 0 1mix4.l 1 0pack2.uss 0 1 0

0pack2.sss 0 1 2pack4.sss 1 0unpack1.h 0 0

0

1

unpack2.h 0 1unpack4.h 1 0unpack1.l 0 0

2unpack2.l 0 1unpack4.l 1 0pmin1.u 0 0 1 0pmax1.u 1pmin2 0 1 3 0pmax2 1psad1 0 0 3 2



C.3.1.3 Multimedia Mux1

I3

C.3.1.4 Multimedia Mux2

I4

C.3.1.5 Shift Right – Variable

I5

C.3.1.6 Multimedia Shift Right – Fixed

I6

C.3.1.7 Shift Left – Variable

I7

40 37 36 35 34 33 32 31 30 29 28 27 24 23 20 19 13 12 6 5 0

7 za x2a zbve x2c x2b mbt4c r2 r1 qp4 1 2 1 1 2 2 4 4 7 7 6


za zb ve x2a x2b x2cmux1 r1 = r2, mbtype4 7 0 0 0 3 2 2

40 37 36 35 34 33 32 31 30 29 28 27 20 19 13 12 6 5 0

7 za x2a zbve x2c x2b mht8c r2 r1 qp4 1 2 1 1 2 2 8 7 7 6


za zb ve x2a x2b x2cmux2 r1 = r2, mhtype8 7 0 1 0 3 2 2

40 37 36 35 34 33 32 31 30 29 28 27 26 20 19 13 12 6 5 0



za zb ve x2a x2b x2cpshr2

r1 = r3, r2 7

0 1

0 0

2

0

pshr4 1 0shr 1 1pshr2.u 0 1

0pshr4.u 1 0shr.u 1 1

40 37 36 35 34 33 32 31 30 29 28 27 26 20 19 18 14 13 12 6 5 0

7 za x2a zbve x2c x2b r3 count5b r1 qp4 1 2 1 1 2 2 1 7 1 5 1 7 6


za zb ve x2a x2b x2cpshr2

r1 = r3, count5 7

0 1

0 13

0pshr4 1 0pshr2.u 0 1 1pshr4.u 1 0

40 37 36 35 34 33 32 31 30 29 28 27 26 20 19 13 12 6 5 0



za zb ve x2a x2b x2cpshl2

r1 = r2, r3 70 1

0 0 0 1pshl4 1 0shl 1 1



C.3.1.8 Multimedia Shift Left – Fixed

I8

C.3.1.9 Population Count

I9

C.3.2 Integer Shifts

The integer shift, test bit, and test NaT instructions are encoded within major opcode 5 using a 2-bit opcode extension field in bits 35:34 (x2) and a 1-bit opcode extension field in bit 33 (x). The extract and test bit instructions also have a 1-bit opcode extension field in bit 13 (y). Table C-21 shows the test bit, extract, and shift right pair assignments.

Most deposit instructions also have a 1-bit opcode extension field in bit 26 (y). Table C-22 shows these assignments.
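The two tables can be read together as a single decision on x, x2, and y. The sketch below follows Tables C-21 and C-22 as printed; the function name `classify_major5` and the returned strings are hypothetical labels, not architected names.

```python
def classify_major5(slot: int) -> str:
    """Classify a major opcode 5 slot using Tables C-21 and C-22."""
    if (slot >> 37) & 0xF != 5:
        raise ValueError("not a major opcode 5 instruction")
    x2 = (slot >> 34) & 0x3
    x = (slot >> 33) & 1
    if x2 == 0:
        # Both tables defer to Table C-23 for this row.
        return "test bit / test NaT"
    if x == 0:
        y = (slot >> 13) & 1       # Table C-21: y is bit 13
        if x2 == 1:
            return "extr" if y else "extr.u"
        if x2 == 3:
            return "shrp"
    else:
        y = (slot >> 26) & 1       # Table C-22: y is bit 26
        if x2 == 1:
            return "dep.z (imm8)" if y else "dep.z"
        if x2 == 3:
            return "dep (imm1)"
    return "unassigned"


kind = classify_major5((5 << 37) | (1 << 34) | (1 << 13))
```

Note that the general deposit form (I15) lives in major opcode 4 and is therefore outside this function.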

C.3.2.1 Shift Right Pair

I10

40 37 36 35 34 33 32 31 30 29 28 27 25 24 20 19 13 12 6 5 0

7 za x2a zbve x2c x2b ccount5c r2 r1 qp4 1 2 1 1 2 2 3 5 7 7 6


za zb ve x2a x2b x2cpshl2

r1 = r2, count5 7 0 1 0 3 1 1pshl4 1 0

40 37 36 35 34 33 32 31 30 29 28 27 26 20 19 13 12 6 5 0

7 za x2a zbve x2c x2b r3 0 r1 qp4 1 2 1 1 2 2 1 7 7 7 6


za zb ve x2a x2b x2cpopcnt r1 = r3 7 0 1 0 1 1 2

Table C-21. Integer Shift/Test Bit/Test NaT 2-bit Opcode Extensions

Opcodebits

40:37

x2bits

35:34

xbit 33

ybit 13

0 1

5

0

0

Test Bit (Table C-23) Test NaT (Table C-23)1 extr.u I11 extr I1123 shrp I10

Table C-22. Deposit Opcode Extensions

Opcodebits

40:37

x2bits

35:34

xbit 33

ybit 26

0 1

5

0

1

Test Bit/Test NaT (Table C-23)1 dep.z I12 dep.z – imm8 I1323 dep – imm1 I14

Bits    Field     Width
40:37   5         4
36                1
35:34   x2        2
33      x         1
32:27   count6d   6
26:20   r3        7
19:13   r2        7
12:6    r1        7
5:0     qp        6

Instruction  Operands                Opcode  x2  x
shrp         r1 = r2, r3, count6     5       3   0



C.3.2.2 Extract

I11

C.3.2.3 Zero and Deposit

I12

C.3.2.4 Zero and Deposit Immediate8

I13

C.3.2.5 Deposit Immediate1

I14

C.3.2.6 Deposit

I15

40 37 36 35 34 33 32 27 26 20 19 14 13 12 6 5 0

5 x2 x len6d r3 pos6b y r1 qp4 1 2 1 6 7 6 1 7 6


x2 x yextr.u

r1 = r3, pos6, len6 5 1 0 0extr 1

40 37 36 35 34 33 32 27 26 25 20 19 13 12 6 5 0

5 x2 x len6d y cpos6c r2 r1 qp4 1 2 1 6 1 6 7 7 6


x2 x ydep.z r1 = r2, pos6, len6 5 1 1 0

40 37 36 35 34 33 32 27 26 25 20 19 13 12 6 5 0

5 s x2 x len6d y cpos6c imm7b r1 qp4 1 2 1 6 1 6 7 7 6


x2 x ydep.z r1 = imm8, pos6, len6 5 1 1 1

40 37 36 35 34 33 32 27 26 20 19 14 13 12 6 5 0

5 s x2 x len6d r3 cpos6b r1 qp4 1 2 1 6 7 6 1 7 6


dep r1 = imm1, r3, pos6, len6 5 3 1

40 37 36 31 30 27 26 20 19 13 12 6 5 0

4 cpos6d len4d r3 r2 r1 qp4 6 4 7 7 7 6

Instruction Operands Opcodedep r1 = r2, r3, pos6, len4 4



C.3.3 Test Bit

All test bit instructions are encoded within major opcode 5 using a 2-bit opcode extension field in bits 35:34 (x2) plus four 1-bit opcode extension fields in bits 33 (ta), 36 (tb), 12 (c), and 13 (y). Table C-23 summarizes these assignments.
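A sketch of the Table C-23 lookup follows. It reads y from bit 13, which is where Table C-23 and the I16 format diagram place it. The names `COMPLETERS` and `tbit_mnemonic` are hypothetical.

```python
# (ta, tb, c) -> completer, transcribed from Table C-23.
COMPLETERS = {
    (0, 0, 0): "z",          (0, 0, 1): "z.unc",
    (0, 1, 0): "z.and",      (0, 1, 1): "nz.and",
    (1, 0, 0): "z.or",       (1, 0, 1): "nz.or",
    (1, 1, 0): "z.or.andcm", (1, 1, 1): "nz.or.andcm",
}


def tbit_mnemonic(slot: int) -> str:
    """Return the tbit/tnat mnemonic for a major opcode 5, x2 = 0 slot."""
    if (slot >> 37) & 0xF != 5 or (slot >> 34) & 0x3 != 0:
        raise ValueError("not a test bit / test NaT encoding")
    ta = (slot >> 33) & 1
    tb = (slot >> 36) & 1
    c = (slot >> 12) & 1
    y = (slot >> 13) & 1
    base = "tnat" if y else "tbit"   # y = 0: test bit, y = 1: test NaT
    return f"{base}.{COMPLETERS[(ta, tb, c)]}"


name = tbit_mnemonic((5 << 37) | (1 << 36) | (1 << 12))
```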

C.3.3.1 Test Bit

I16

C.3.3.2 Test NaT

I17

Table C-23. Test Bit Opcode Extensions

Opcodebits

40:37

x2bits

35:34

tabit 33

tbbit 36

cbit 12

ybit 13

0 1

5 0

00 0 tbit.z I16 tnat.z I17

1 tbit.z.unc I16 tnat.z.unc I17

1 0 tbit.z.and I16 tnat.z.and I171 tbit.nz.and I16 tnat.nz.and I17

10 0 tbit.z.or I16 tnat.z.or I17

1 tbit.nz.or I16 tnat.nz.or I17

1 0 tbit.z.or.andcm I16 tnat.z.or.andcm I171 tbit.nz.or.andcm I16 tnat.nz.or.andcm I17

40 37 36 35 34 33 32 27 26 20 19 14 13 12 11 6 5 0

5 tb x2 ta p2 r3 pos6b y c p1 qp4 1 2 1 6 7 6 1 1 6 6


x2 ta tb y ctbit.z

p1, p2 = r3, pos6 5 0

00

0

0tbit.z.unc 1tbit.z.and 1 0tbit.nz.and 1tbit.z.or

10 0

tbit.nz.or 1tbit.z.or.andcm 1 0tbit.nz.or.andcm 1

40 37 36 35 34 33 32 27 26 20 19 14 13 12 11 6 5 0

5 tb x2 ta p2 r3 y c p1 qp4 1 2 1 6 7 6 1 1 6 6


x2 ta tb y ctnat.z

p1, p2 = r3 5 0

00

1

0tnat.z.unc 1tnat.z.and 1 0tnat.nz.and 1tnat.z.or

10 0

tnat.nz.or 1tnat.z.or.andcm 1 0tnat.nz.or.andcm 1



C.3.4 Miscellaneous I-Unit Instructions

The miscellaneous I-unit instructions are encoded in major opcode 0 using a 3-bit opcode extension field (x3) in bits 35:33. Some also have a 6-bit opcode extension field (x6) in bits 32:27. Table C-24 shows the 3-bit assignments and Table C-25 summarizes the 6-bit assignments.
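The sketch below encodes break.i and nop.i using the I19 layout from C.3.4.1 and then recovers x3 and x6. It assumes that the imm21 operand is split with its top bit in field i (bit 36) and its low 20 bits in imm20a; that split is not stated in this section. The function names are hypothetical.

```python
def encode_i19(x6: int, imm21: int, qp: int = 0) -> int:
    """Encode break.i (x6 = 0x00) or nop.i (x6 = 0x01) in format I19."""
    if not 0 <= imm21 < (1 << 21):
        raise ValueError("imm21 out of range")
    i = imm21 >> 20              # assumed: most significant bit of imm21
    imm20a = imm21 & 0xFFFFF     # assumed: low 20 bits of imm21
    # Major opcode 0 and x3 = 0 contribute no set bits.
    return (i << 36) | (x6 << 27) | (imm20a << 6) | qp


def misc_i_extensions(slot: int):
    """Return (x3, x6) for a major opcode 0 I-unit slot."""
    return (slot >> 33) & 0x7, (slot >> 27) & 0x3F


nop = encode_i19(0x01, 0)
x3, x6 = misc_i_extensions(nop)
```

In Table C-25 the row is x6 bits 30:27 and the column is x6 bits 32:31, so x6 = 0x10 (zxt1) sits in row 0, column 1.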

C.3.4.1 Break/Nop (I-Unit)

I19

C.3.4.2 Integer Speculation Check (I-Unit)

I20

Table C-24. Misc I-Unit 3-bit Opcode Extensions

OpcodeBits 40:37

x3Bits 35:33

0

0 6-bit Ext (Table C-25)1 chk.s.i – int I202 mov to pr.rot – imm44 I243 mov to pr I234567 mov to b I22

Table C-25. Misc I-Unit 6-bit Opcode Extensions

OpcodeBits

40:37

x3Bits

35:33

x6Bits

30:27Bits 32:31

0 1 2 3

0 0

0 break.i I19 zxt1 I29 mov from ip I251 nop.i I19 zxt2 I29 mov from b I222 zxt4 I29 mov.i from ar I283 mov from pr I254 sxt1 I295 sxt2 I296 sxt4 I2978 czx1.l I299 czx2.l I29A mov.i to ar – imm8 I27 mov.i to ar I26BC czx1.r I29D czx2.r I29EF

Bits    Field    Width
40:37   0        4
36      i        1
35:33   x3       3
32:27   x6       6
26               1
25:6    imm20a   20
5:0     qp       6

Instruction  Operands  Opcode  x3  x6
break.i      imm21     0       0   00
nop.i        imm21     0       0   01

40 37 36 35 33 32 20 19 13 12 6 5 0

0 s x3 imm13c r2 imm7a qp4 1 3 13 7 7 6


x3chk.s.i r2, target25 0 1



C.3.5 GR/BR Moves

The GR/BR move instructions are encoded in major opcode 0. See “Miscellaneous I-Unit Instructions” on page C-23 for a summary of the opcode extensions. The mov to BR instruction uses a 1-bit opcode extension field (x) in bit 22 to distinguish the return form from the normal form.

C.3.5.1 Move to BR

I21

C.3.5.2 Move from BR

I22

C.3.6 GR/Predicate/IP Moves

The GR/Predicate/IP move instructions are encoded in major opcode 0. See “Miscellaneous I-Unit Instructions” on page C-23 for a summary of the opcode extensions.

C.3.6.1 Move to Predicates – Register

I23

C.3.6.2 Move to Predicates – Immediate 44

I24

C.3.6.3 Move from Predicates/IP

I25

40 37 36 35 33 32 24 23 22 21 20 19 13 12 9 8 6 5 0

0 x3 x r2 b1 qp4 1 3 10 1 2 7 4 3 6


mov b1 = r2 0 7 0

mov.ret 1

40 37 36 35 33 32 27 26 16 15 13 12 6 5 0

0 x3 x6 b2 r1 qp4 1 3 6 11 3 7 6


mov r1 = b2 0 0 31

40 37 36 35 33 32 31 24 23 20 19 13 12 6 5 0

0 s x3 mask8c r2 mask7a qp4 1 3 1 8 4 7 7 6


x3mov pr = r2, mask17 0 3

40 37 36 35 33 32 6 5 0

0 s x3 imm27a qp4 1 3 27 6


x3mov pr.rot = imm44 0 2

40 37 36 35 33 32 27 26 13 12 6 5 0

0 x3 x6 r1 qp4 1 3 6 14 7 6


mov r1 = ip 0 0 30r1 = pr 33



C.3.7 GR/AR Moves (I-Unit)

The I-Unit GR/AR move instructions are encoded in major opcode 0. (Some ARs are accessed using system/memory management instructions on the M-unit. See “GR/AR Moves (M-Unit)” on page C-42.) See “Miscellaneous I-Unit Instructions” on page C-23 for a summary of the I-Unit GR/AR opcode extensions.

C.3.7.1 Move to AR – Register (I-Unit)

I26

C.3.7.2 Move to AR – Immediate 8 (I-Unit)

I27

C.3.7.3 Move from AR (I-Unit)

I28

C.3.8 Sign/Zero Extend/Compute Zero Index

I29

40 37 36 35 33 32 27 26 20 19 13 12 6 5 0

0 x3 x6 ar3 r2 qp4 1 3 6 7 7 7 6


mov.i ar3 = r2 0 0 2A

40 37 36 35 33 32 27 26 20 19 13 12 6 5 0

0 s x3 x6 ar3 imm7b qp4 1 3 6 7 7 7 6


mov.i ar3 = imm8 0 0 0A

40 37 36 35 33 32 27 26 20 19 13 12 6 5 0

0 x3 x6 ar3 r1 qp4 1 3 6 7 7 7 6


mov.i r1 = ar3 0 0 32

40 37 36 35 33 32 27 26 20 19 13 12 6 5 0

0 x3 x6 r3 r1 qp4 1 3 6 7 7 7 6


zxt1

r1 = r3 0 0

10zxt2 11zxt4 12sxt1 14sxt2 15sxt4 16czx1.l 18czx2.l 19czx1.r 1Cczx2.r 1D



C.4 M-Unit Instruction Encodings

C.4.1 Loads and Stores

All load and store instructions are encoded within major opcodes 4, 5, 6, and 7 using a 6-bit opcode extension field in bits 35:30 (x6). Instructions in major opcode 4 (integer load/store, semaphores, and get FR) use two 1-bit opcode extension fields in bit 36 (m) and bit 27 (x) as shown in Table C-26. Instructions in major opcode 6 (floating-point load/store, load pair, and set FR) use two 1-bit opcode extension fields in bit 36 (m) and bit 27 (x) as shown in Table C-27.

The integer load/store opcode extensions are summarized in Table C-28 on page C-26, Table C-29 on page C-27, and Table C-30 on page C-27, and the semaphore and get FR opcode extensions in Table C-31 on page C-28. The floating-point load/store opcode extensions are summarized in Table C-32 on page C-28, Table C-33 on page C-29, and Table C-34 on page C-29, and the floating-point load pair and set FR opcode extensions in Table C-35 on page C-30 and Table C-36 on page C-30.
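The integer load tables are indexed by the upper four bits of x6 (row) and the lower two bits (column), so a mnemonic can be derived arithmetically. The following sketch covers only the integer load rows of Table C-28; the names `ROW_COMPLETER` and `integer_load_name` are hypothetical.

```python
# Row (x6 bits 35:32) -> load completer, per Table C-28.
ROW_COMPLETER = {
    0x0: "", 0x1: ".s", 0x2: ".a", 0x3: ".sa", 0x4: ".bias", 0x5: ".acq",
    0x8: ".c.clr", 0x9: ".c.nc", 0xA: ".c.clr.acq",
}
SIZES = ["1", "2", "4", "8"]  # column (x6 bits 31:30) -> access size in bytes


def integer_load_name(x6: int):
    """Map a 6-bit x6 value to an integer load mnemonic, or None."""
    row, col = x6 >> 2, x6 & 0x3
    if row == 0x6:
        # Only the 8-byte column of row 6 is assigned.
        return "ld8.fill" if col == 3 else None
    if row in ROW_COMPLETER:
        return "ld" + SIZES[col] + ROW_COMPLETER[row]
    return None  # stores (rows C to E) and unassigned rows


name = integer_load_name(0x2A)
```

The same row/column split applies to the +Reg and +Imm tables, which differ only in major opcode and in the m and x bits.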

Table C-26. Integer Load/Store/Semaphore/Get FR 1-bit Opcode Extensions

Opcode Bits 40:37   m Bit 36   x Bit 27
4                   0          0          Load/Store (Table C-28)
4                   0          1          Semaphore/get FR (Table C-31)
4                   1          0          Load +Reg (Table C-29)
4                   1          1

Table C-27. Floating-point Load/Store/Load Pair/Set FR 1-bit Opcode Extensions

Opcode Bits 40:37   m Bit 36   x Bit 27
6                   0          0          FP Load/Store (Table C-32)
6                   0          1          FP Load Pair/set FR (Table C-35)
6                   1          0          FP Load +Reg (Table C-33)
6                   1          1          FP Load Pair +Imm (Table C-36)

Table C-28. Integer Load/Store Opcode Extensions

OpcodeBits

40:37

mBit 36

xBit 27

x6Bits

35:32Bits 31:30

0 1 2 3

4 0 0

0 ld1 M1 ld2 M1 ld4 M1 ld8 M11 ld1.s M1 ld2.s M1 ld4.s M1 ld8.s M12 ld1.a M1 ld2.a M1 ld4.a M1 ld8.a M13 ld1.sa M1 ld2.sa M1 ld4.sa M1 ld8.sa M14 ld1.bias M1 ld2.bias M1 ld4.bias M1 ld8.bias M15 ld1.acq M1 ld2.acq M1 ld4.acq M1 ld8.acq M16 ld8.fill M178 ld1.c.clr M1 ld2.c.clr M1 ld4.c.clr M1 ld8.c.clr M19 ld1.c.nc M1 ld2.c.nc M1 ld4.c.nc M1 ld8.c.nc M1A ld1.c.clr.acq M1 ld2.c.clr.acq M1 ld4.c.clr.acq M1 ld8.c.clr.acq M1BC st1 M4 st2 M4 st4 M4 st8 M4D st1.rel M4 st2.rel M4 st4.rel M4 st8.rel M4E st8.spill M4F



Table C-29. Integer Load +Reg Opcode Extensions

OpcodeBits

40:37

mBit 36

xBit 27

x6Bits

35:32Bits 31:30

0 1 2 3

4 1 0

0 ld1 M2 ld2 M2 ld4 M2 ld8 M21 ld1.s M2 ld2.s M2 ld4.s M2 ld8.s M22 ld1.a M2 ld2.a M2 ld4.a M2 ld8.a M23 ld1.sa M2 ld2.sa M2 ld4.sa M2 ld8.sa M24 ld1.bias M2 ld2.bias M2 ld4.bias M2 ld8.bias M25 ld1.acq M2 ld2.acq M2 ld4.acq M2 ld8.acq M26 ld8.fill M278 ld1.c.clr M2 ld2.c.clr M2 ld4.c.clr M2 ld8.c.clr M29 ld1.c.nc M2 ld2.c.nc M2 ld4.c.nc M2 ld8.c.nc M2A ld1.c.clr.acq M2 ld2.c.clr.acq M2 ld4.c.clr.acq M2 ld8.c.clr.acq M2BCDEF

Table C-30. Integer Load/Store +Imm Opcode Extensions

OpcodeBits

40:37

x6Bits

35:32Bits 31:30

0 1 2 3

5

0 ld1 M3 ld2 M3 ld4 M3 ld8 M31 ld1.s M3 ld2.s M3 ld4.s M3 ld8.s M32 ld1.a M3 ld2.a M3 ld4.a M3 ld8.a M33 ld1.sa M3 ld2.sa M3 ld4.sa M3 ld8.sa M34 ld1.bias M3 ld2.bias M3 ld4.bias M3 ld8.bias M35 ld1.acq M3 ld2.acq M3 ld4.acq M3 ld8.acq M36 ld8.fill M378 ld1.c.clr M3 ld2.c.clr M3 ld4.c.clr M3 ld8.c.clr M39 ld1.c.nc M3 ld2.c.nc M3 ld4.c.nc M3 ld8.c.nc M3A ld1.c.clr.acq M3 ld2.c.clr.acq M3 ld4.c.clr.acq M3 ld8.c.clr.acq M3BC st1 M5 st2 M5 st4 M5 st8 M5D st1.rel M5 st2.rel M5 st4.rel M5 st8.rel M5E st8.spill M5F



Table C-31. Semaphore/Get FR Opcode Extensions

OpcodeBits

40:37

mBit 36

xBit 27

x6Bits

35:32Bits 31:30

0 1 2 3

4 0 1

0 cmpxchg1.acq M16 cmpxchg2.acq M16 cmpxchg4.acq M16 cmpxchg8.acq M161 cmpxchg1.rel M16 cmpxchg2.rel M16 cmpxchg4.rel M16 cmpxchg8.rel M162 xchg1 M16 xchg2 M16 xchg4 M16 xchg8 M1634 fetchadd4.acq M17 fetchadd8.acq M175 fetchadd4.rel M17 fetchadd8.rel M1767 getf.sig M19 getf.exp M19 getf.s M19 getf.d M1989ABCDEF

Table C-32. Floating-point Load/Store/Lfetch Opcode Extensions

OpcodeBits

40:37

mBit 36

xBit 27

x6Bits

35:32Bits 31:30

0 1 2 3

6 0 0

0 ldfe M6 ldf8 M6 ldfs M6 ldfd M61 ldfe.s M6 ldf8.s M6 ldfs.s M6 ldfd.s M62 ldfe.a M6 ldf8.a M6 ldfs.a M6 ldfd.a M63 ldfe.sa M6 ldf8.sa M6 ldfs.sa M6 ldfd.sa M6456 ldf.fill M678 ldfe.c.clr M6 ldf8.c.clr M6 ldfs.c.clr M6 ldfd.c.clr M69 ldfe.c.nc M6 ldf8.c.nc M6 ldfs.c.nc M6 ldfd.c.nc M6AB lfetch M13 lfetch.excl M13 lfetch.fault M13 lfetch.fault.excl M13C stfe M9 stf8 M9 stfs M9 stfd M9DE stf.spill M9F



Table C-33. Floating-point Load/Lfetch +Reg Opcode Extensions

OpcodeBits

40:37

mBit 36

xBit 27

x6Bits

35:32Bits 31:30

0 1 2 3

6 1 0

0 ldfe M7 ldf8 M7 ldfs M7 ldfd M71 ldfe.s M7 ldf8.s M7 ldfs.s M7 ldfd.s M72 ldfe.a M7 ldf8.a M7 ldfs.a M7 ldfd.a M73 ldfe.sa M7 ldf8.sa M7 ldfs.sa M7 ldfd.sa M7456 ldf.fill M778 ldfe.c.clr M7 ldf8.c.clr M7 ldfs.c.clr M7 ldfd.c.clr M79 ldfe.c.nc M7 ldf8.c.nc M7 ldfs.c.nc M7 ldfd.c.nc M7AB lfetch M14 lfetch.excl M14 lfetch.fault M14 lfetch.fault.excl M14CDEF

Table C-34. Floating-point Load/Store/Lfetch +Imm Opcode Extensions

OpcodeBits

40:37

x6Bits

35:32Bits 31:30

0 1 2 3

7

0 ldfe M8 ldf8 M8 ldfs M8 ldfd M81 ldfe.s M8 ldf8.s M8 ldfs.s M8 ldfd.s M82 ldfe.a M8 ldf8.a M8 ldfs.a M8 ldfd.a M83 ldfe.sa M8 ldf8.sa M8 ldfs.sa M8 ldfd.sa M8456 ldf.fill M878 ldfe.c.clr M8 ldf8.c.clr M8 ldfs.c.clr M8 ldfd.c.clr M89 ldfe.c.nc M8 ldf8.c.nc M8 ldfs.c.nc M8 ldfd.c.nc M8AB lfetch M15 lfetch.excl M15 lfetch.fault M15 lfetch.fault.excl M15C stfe M10 stf8 M10 stfs M10 stfd M10DE stf.spill M10F



The load and store instructions all have a 2-bit opcode extension field in bits 29:28 (hint) which encodes locality hint information. Table C-37 and Table C-38 summarize these assignments.
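As a sketch of where the hint sits relative to the other fields, the code below assembles a format M1 integer load (layout from C.4.1.1: opcode 4, m, x6, hint, x, r3, an unused 7-bit field, r1, qp). The names `LDHINT` and `encode_m1` are hypothetical.

```python
# Load hint completer -> hint field value (Table C-37).
LDHINT = {"": 0, ".nt1": 1, ".nta": 3}


def encode_m1(x6: int, hint: str, r1: int, r3: int, qp: int = 0) -> int:
    """Encode an integer load r1 = [r3] in format M1 (m = 0, x = 0)."""
    for value, width in ((x6, 6), (r1, 7), (r3, 7), (qp, 6)):
        if not 0 <= value < (1 << width):
            raise ValueError("field value out of range")
    return ((4 << 37) | (x6 << 30) | (LDHINT[hint] << 28)
            | (r3 << 20) | (r1 << 6) | qp)


# ld8.nta r8 = [r9]: x6 = 0x03 selects ld8, hint = 3 selects .nta.
slot = encode_m1(0x03, ".nta", r1=8, r3=9)
hint_bits = (slot >> 28) & 0x3
```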

Table C-35. Floating-point Load Pair/Set FR Opcode Extensions

OpcodeBits

40:37

mBit 36

xBit 27

x6Bits

35:32Bits 31:30

0 1 2 3

6 0 1

0 ldfp8 M11 ldfps M11 ldfpd M111 ldfp8.s M11 ldfps.s M11 ldfpd.s M112 ldfp8.a M11 ldfps.a M11 ldfpd.a M113 ldfp8.sa M11 ldfps.sa M11 ldfpd.sa M114567 setf.sig M18 setf.exp M18 setf.s M18 setf.d M188 ldfp8.c.clr M11 ldfps.c.clr M11 ldfpd.c.clr M119 ldfp8.c.nc M11 ldfps.c.nc M11 ldfpd.c.nc M11ABCDEF

Table C-36. Floating-point Load Pair +Imm Opcode Extensions

OpcodeBits

40:37

mBit 36

xBit 27

x6Bits

35:32Bits 31:30

0 1 2 3

6 1 1

0 ldfp8 M12 ldfps M12 ldfpd M121 ldfp8.s M12 ldfps.s M12 ldfpd.s M122 ldfp8.a M12 ldfps.a M12 ldfpd.a M123 ldfp8.sa M12 ldfps.sa M12 ldfpd.sa M1245678 ldfp8.c.clr M12 ldfps.c.clr M12 ldfpd.c.clr M129 ldfp8.c.nc M12 ldfps.c.nc M12 ldfpd.c.nc M12ABCDEF

Table C-37. Load Hint Completer

Hint Bits 29:28   ldhint
0                 none
1                 .nt1
2
3                 .nta

Table C-38. Store Hint Completer

Hint Bits 29:28   sthint
0                 none
1
2
3                 .nta



C.4.1.1 Integer Load

M1

40 37 36 35 30 29 28 27 26 20 19 13 12 6 5 0

4 m x6 hint x r3 r1 qp4 1 6 2 1 7 7 7 6


m x x6 hintld1.ldhint

r1 = [r3] 4 0 0

00

See Table C-37

on page C-30

ld2.ldhint 01ld4.ldhint 02ld8.ldhint 03ld1.s.ldhint 04ld2.s.ldhint 05ld4.s.ldhint 06ld8.s.ldhint 07ld1.a.ldhint 08ld2.a.ldhint 09ld4.a.ldhint 0Ald8.a.ldhint 0Bld1.sa.ldhint 0Cld2.sa.ldhint 0Dld4.sa.ldhint 0Eld8.sa.ldhint 0Fld1.bias.ldhint 10ld2.bias.ldhint 11ld4.bias.ldhint 12ld8.bias.ldhint 13ld1.acq.ldhint 14ld2.acq.ldhint 15ld4.acq.ldhint 16ld8.acq.ldhint 17ld8.fill.ldhint 1Bld1.c.clr.ldhint 20ld2.c.clr.ldhint 21ld4.c.clr.ldhint 22ld8.c.clr.ldhint 23ld1.c.nc.ldhint 24ld2.c.nc.ldhint 25ld4.c.nc.ldhint 26ld8.c.nc.ldhint 27ld1.c.clr.acq.ldhint 28ld2.c.clr.acq.ldhint 29ld4.c.clr.acq.ldhint 2Ald8.c.clr.acq.ldhint 2B



C.4.1.2 Integer Load – Increment by Register

M2

40 37 36 35 30 29 28 27 26 20 19 13 12 6 5 0

4 m x6 hint x r3 r2 r1 qp4 1 6 2 1 7 7 7 6


m x x6 hintld1.ldhint

r1 = [r3], r2 4 1 0

00

See Table C-37

on page C-30




C.4.1.3 Integer Load – Increment by Immediate

M3

C.4.1.4 Integer Store

M4

40 37 36 35 30 29 28 27 26 20 19 13 12 6 5 0

5 s x6 hint i r3 imm7b r1 qp4 1 6 2 1 7 7 7 6


x6 hintld1.ldhint

r1 = [r3], imm9 5

00

See Table C-37

on page C-30


40 37 36 35 30 29 28 27 26 20 19 13 12 6 5 0

4 m x6 hint x r3 r2 qp4 1 6 2 1 7 7 7 6


m x x6 hintst1.sthint

[r3] = r2 4 0 0

30

See Table C-38

on page C-30

st2.sthint 31st4.sthint 32st8.sthint 33st1.rel.sthint 34st2.rel.sthint 35st4.rel.sthint 36st8.rel.sthint 37st8.spill.sthint 3B



C.4.1.5 Integer Store – Increment by Immediate

M5

C.4.1.6 Floating-point Load

M6

40 37 36 35 30 29 28 27 26 20 19 13 12 6 5 0

5 s x6 hint i r3 r2 imm7a qp4 1 6 2 1 7 7 7 6


x6 hintst1.sthint

[r3] = r2, imm9 5

30

See Table C-38

on page C-30

st2.sthint 31st4.sthint 32st8.sthint 33st1.rel.sthint 34st2.rel.sthint 35st4.rel.sthint 36st8.rel.sthint 37st8.spill.sthint 3B

40 37 36 35 30 29 28 27 26 20 19 13 12 6 5 0

6 m x6 hint x r3 f1 qp4 1 6 2 1 7 7 7 6


m x x6 hintldfs.ldhint

f1 = [r3] 6 0 0

02

See Table C-37

on page C-30

ldfd.ldhint 03ldf8.ldhint 01ldfe.ldhint 00ldfs.s.ldhint 06ldfd.s.ldhint 07ldf8.s.ldhint 05ldfe.s.ldhint 04ldfs.a.ldhint 0Aldfd.a.ldhint 0Bldf8.a.ldhint 09ldfe.a.ldhint 08ldfs.sa.ldhint 0Eldfd.sa.ldhint 0Fldf8.sa.ldhint 0Dldfe.sa.ldhint 0Cldf.fill.ldhint 1Bldfs.c.clr.ldhint 22ldfd.c.clr.ldhint 23ldf8.c.clr.ldhint 21ldfe.c.clr.ldhint 20ldfs.c.nc.ldhint 26ldfd.c.nc.ldhint 27ldf8.c.nc.ldhint 25ldfe.c.nc.ldhint 24



C.4.1.7 Floating-point Load – Increment by Register

M7

40 37 36 35 30 29 28 27 26 20 19 13 12 6 5 0

6 m x6 hint x r3 r2 f1 qp4 1 6 2 1 7 7 7 6


m x x6 hintldfs.ldhint

f1 = [r3], r2 6 1 0

02

See Table C-37

on page C-30




C.4.1.8 Floating-point Load – Increment by Immediate

M8

C.4.1.9 Floating-point Store

M9

C.4.1.10 Floating-point Store – Increment by Immediate

M10

40 37 36 35 30 29 28 27 26 20 19 13 12 6 5 0

7 s x6 hint i r3 imm7b f1 qp4 1 6 2 1 7 7 7 6


x6 hintldfs.ldhint

f1 = [r3], imm9 7

02

See Table C-37

on page C-30


40 37 36 35 30 29 28 27 26 20 19 13 12 6 5 0

6 m x6 hint x r3 f2 qp4 1 6 2 1 7 7 7 6


m x x6 hintstfs.sthint

[r3] = f2 6 0 0

32See

Table C-38 on page C-30

stfd.sthint 33stf8.sthint 31stfe.sthint 30stf.spill.sthint 3B

40 37 36 35 30 29 28 27 26 20 19 13 12 6 5 0

7 s x6 hint i r3 f2 imm7a qp4 1 6 2 1 7 7 7 6


x6 hintstfs.sthint

[r3] = f2, imm9 7

32See


stfd.sthint 33stf8.sthint 31stfe.sthint 30stf.spill.sthint 3B



C.4.1.11 Floating-point Load Pair

M11

C.4.1.12 Floating-point Load Pair – Increment by Immediate

M12

40 37 36 35 30 29 28 27 26 20 19 13 12 6 5 0

6 m x6 hint x r3 f2 f1 qp4 1 6 2 1 7 7 7 6


m x x6 hintldfps.ldhint

f1, f2 = [r3] 6 0 1

02

See Table C-37

on page C-30

ldfpd.ldhint 03ldfp8.ldhint 01ldfps.s.ldhint 06ldfpd.s.ldhint 07ldfp8.s.ldhint 05ldfps.a.ldhint 0Aldfpd.a.ldhint 0Bldfp8.a.ldhint 09ldfps.sa.ldhint 0Eldfpd.sa.ldhint 0Fldfp8.sa.ldhint 0Dldfps.c.clr.ldhint 22ldfpd.c.clr.ldhint 23ldfp8.c.clr.ldhint 21ldfps.c.nc.ldhint 26ldfpd.c.nc.ldhint 27ldfp8.c.nc.ldhint 25

40 37 36 35 30 29 28 27 26 20 19 13 12 6 5 0

6 m x6 hint x r3 f2 f1 qp4 1 6 2 1 7 7 7 6


m x x6 hintldfps.ldhint f1, f2 = [r3], 8

6 1 1

02

See Table C-37

on page C-30

ldfpd.ldhintf1, f2 = [r3], 16 03

ldfp8.ldhint 01ldfps.s.ldhint f1, f2 = [r3], 8 06ldfpd.s.ldhint

f1, f2 = [r3], 16 07ldfp8.s.ldhint 05ldfps.a.ldhint f1, f2 = [r3], 8 0Aldfpd.a.ldhint

f1, f2 = [r3], 16 0Bldfp8.a.ldhint 09ldfps.sa.ldhint f1, f2 = [r3], 8 0Eldfpd.sa.ldhint

f1, f2 = [r3], 16 0Fldfp8.sa.ldhint 0Dldfps.c.clr.ldhint f1, f2 = [r3], 8 22ldfpd.c.clr.ldhint

f1, f2 = [r3], 16 23ldfp8.c.clr.ldhint 21ldfps.c.nc.ldhint f1, f2 = [r3], 8 26ldfpd.c.nc.ldhint

f1, f2 = [r3], 16 27ldfp8.c.nc.ldhint 25



C.4.2 Line Prefetch

The line prefetch instructions are encoded in major opcodes 6 and 7 along with the floating-point load/store instructions. See “Loads and Stores” on page C-26 for a summary of the opcode extensions.

The line prefetch instructions all have a 2-bit opcode extension field in bits 29:28 (hint) which encodes locality hint information as shown in Table C-39.

C.4.2.1 Line Prefetch

M13

C.4.2.2 Line Prefetch – Increment by Register

M14

C.4.2.3 Line Prefetch – Increment by Immediate

M15

Table C-39. Line Prefetch Hint Completer

Hint Bits 29:28   lfhint
0                 none
1                 .nt1
2                 .nt2
3                 .nta

40 37 36 35 30 29 28 27 26 20 19 6 5 0

6 m x6 hint x r3 qp4 1 6 2 1 7 14 6


m x x6 hintlfetch.lfhint

[r3] 6 0 0

2C See Table C-39 on

page C-38

lfetch.excl.lfhint 2Dlfetch.fault.lfhint 2Elfetch.fault.excl.lfhint 2F

40 37 36 35 30 29 28 27 26 20 19 13 12 6 5 0

6 m x6 hint x r3 r2 qp4 1 6 2 1 7 7 7 6


m x x6 hintlfetch.lfhint

[r3], r2 6 1 0


page C-38


40 37 36 35 30 29 28 27 26 20 19 13 12 6 5 0

7 s x6 hint i r3 imm7b qp4 1 6 2 1 7 7 7 6


x6 hintlfetch.lfhint

[r3], imm9 7


page C-38




C.4.3 Semaphores

The semaphore instructions are encoded in major opcode 4 along with the integer load/store instructions. See “Loads and Stores” on page C-26 for a summary of the opcode extensions.

C.4.3.1 Exchange/Compare and Exchange

M16

C.4.3.2 Fetch and Add – Immediate

M17

C.4.4 Set/Get FR

The set FR instructions are encoded in major opcode 6 along with the floating-point load/store instructions. The get FR instructions are encoded in major opcode 4 along with the integer load/store instructions. See “Loads and Stores” on page C-26 for a summary of the opcode extensions.

C.4.4.1 Set FR

M18

40 37 36 35 30 29 28 27 26 20 19 13 12 6 5 0

4 m x6 hint x r3 r2 r1 qp4 1 6 2 1 7 7 7 6


m x x6 hintcmpxchg1.acq.ldhint

r1 = [r3], r2, ar.ccv

4 0 1

00

See Table C-37

on page C-30

cmpxchg2.acq.ldhint 01cmpxchg4.acq.ldhint 02cmpxchg8.acq.ldhint 03cmpxchg1.rel.ldhint 04cmpxchg2.rel.ldhint 05cmpxchg4.rel.ldhint 06cmpxchg8.rel.ldhint 07xchg1.ldhint

r1 = [r3], r2

08xchg2.ldhint 09xchg4.ldhint 0Axchg8.ldhint 0B

40 37 36 35 30 29 28 27 26 20 19 16 15 14 13 12 6 5 0

4 m x6 hint x r3 s i2b r1 qp4 1 6 2 1 7 4 1 2 7 6


m x x6 hintfetchadd4.acq.ldhint

r1 = [r3], inc3 4 0 1

12 See Table C-37

on page C-30

fetchadd8.acq.ldhint 13fetchadd4.rel.ldhint 16fetchadd8.rel.ldhint 17

Bits    Field    Width
40:37   6        4
36      m        1
35:30   x6       6
29:28            2
27      x        1
26:20            7
19:13   r2       7
12:6    f1       7
5:0     qp       6

Instruction  Operands  Opcode  m  x  x6
setf.sig     f1 = r2   6       0  1  1C
setf.exp     f1 = r2   6       0  1  1D
setf.s       f1 = r2   6       0  1  1E
setf.d       f1 = r2   6       0  1  1F



C.4.4.2 Get FR

M19

C.4.5 Speculation and Advanced Load Checks

The speculation and advanced load check instructions are encoded in major opcodes 0 and 1 along with the system/memory management instructions. See “Memory Management” on page C-43 for a summary of the opcode extensions.

C.4.5.1 Integer Speculation Check (M-Unit)

M20

C.4.5.2 Floating-point Speculation Check

M21

C.4.5.3 Integer Advanced Load Check

M22

C.4.5.4 Floating-point Advanced Load Check

M23

Bits    Field    Width
40:37   4        4
36      m        1
35:30   x6       6
29:28            2
27      x        1
26:20            7
19:13   f2       7
12:6    r1       7
5:0     qp       6

Instruction  Operands  Opcode  m  x  x6
getf.sig     r1 = f2   4       0  1  1C
getf.exp     r1 = f2   4       0  1  1D
getf.s       r1 = f2   4       0  1  1E
getf.d       r1 = f2   4       0  1  1F

40 37 36 35 33 32 20 19 13 12 6 5 0

1 s x3 imm13c r2 imm7a qp4 1 3 13 7 7 6


x3chk.s.m r2, target25 1 1

40 37 36 35 33 32 20 19 13 12 6 5 0

1 s x3 imm13c f2 imm7a qp4 1 3 13 7 7 6


x3chk.s f2, target25 1 3

40 37 36 35 33 32 13 12 6 5 0

0 s x3 imm20b r1 qp4 1 3 20 7 6


x3chk.a.nc

r1, target25 0 4chk.a.clr 5

40 37 36 35 33 32 13 12 6 5 0

0 s x3 imm20b f1 qp4 1 3 20 7 6


x3chk.a.nc

f1, target25 0 6chk.a.clr 7



C.4.6 Cache/Synchronization/RSE/ALAT

The cache/synchronization/RSE/ALAT instructions are encoded in major opcode 0 along with the memory management instructions. See “Memory Management” on page C-43 for a summary of the opcode extensions.

C.4.6.1 Sync/Fence/Serialize/ALAT Control

M24

C.4.6.2 RSE Control

M25

C.4.6.3 Integer ALAT Entry Invalidate

M26

C.4.6.4 Floating-point ALAT Entry Invalidate

M27

C.4.6.5 Flush Cache

M28

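Format M24 (Sync/Fence/Serialize/ALAT Control)

Bits   40:37  36  35:33  32:31  30:27  26:6  5:0
Field  0      -   x3     x2     x4     -     qp
Width  4      1   3      2      4      21    6

Instruction  Opcode  x3  x4  x2
invala       0       0   0   1
mf           0       0   2   2
mf.a         0       0   3   2
srlz.i       0       0   1   3
sync.i       0       0   3   3

Format M25 (RSE Control)

Bits   40:37  36  35:33  32:31  30:27  26:6  5:0
Field  0      -   x3     x2     x4     -     0
Width  4      1   3      2      4      21    6

Instruction  Opcode  x3  x4  x2
flushrs f    0       0   C   0

Format M26 (Integer ALAT Entry Invalidate)

Bits   40:37  36  35:33  32:31  30:27  26:13  12:6  5:0
Field  0      -   x3     x2     x4     -      r1    qp
Width  4      1   3      2      4      14     7     6

Instruction  Operands  Opcode  x3  x4  x2
invala.e     r1        0       0   2   1

Format M27 (Floating-point ALAT Entry Invalidate)

Bits   40:37  36  35:33  32:31  30:27  26:13  12:6  5:0
Field  0      -   x3     x2     x4     -      f1    qp
Width  4      1   3      2      4      14     7     6

Instruction  Operands  Opcode  x3  x4  x2
invala.e     f1        0       0   3   1

Format M28 (Flush Cache)

Bits   40:37  36  35:33  32:27  26:20  19:6  5:0
Field  1      -   x3     x6     r3     -     qp
Width  4      1   3      6      7      14    6

Instruction  Operands  Opcode  x3  x6
fc           r3        1       0   30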



C.4.7 GR/AR Moves (M-Unit)

The M-Unit GR/AR move instructions are encoded in major opcode 0 along with the system/memory management instructions. (Some ARs are accessed using system control instructions on the I-unit. See “GR/AR Moves (I-Unit)” on page C-25.) See “Memory Management” on page C-43 for a summary of the M-Unit GR/AR opcode extensions.

C.4.7.1 Move to AR – Register (M-Unit)

M29

C.4.7.2 Move to AR – Immediate 8 (M-Unit)

M30

C.4.7.3 Move from AR (M-Unit)

M31

C.4.8 Miscellaneous M-Unit Instructions

The miscellaneous M-unit instructions are encoded in major opcode 0 along with the memory management instructions. See “Memory Management” on page C-43 for a summary of the opcode extensions.

C.4.8.1 Allocate Register Stack Frame

M34

NOTE: The three immediates in the instruction encoding are formed from the operands as follows:

sof = i + l + o
sol = i + l
sor = r >> 3
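
As a rough illustration of the note above, the following Python sketch computes the three encoded immediates from the assembly operands and then inverts the mapping the way Table C-65 gives it for format M34. The function names are hypothetical, and the check that r is a multiple of 8 is an assumption of this sketch, made so that the right shift loses no information.

```python
def alloc_immediates(i, l, o, r):
    """Return (sof, sol, sor) for the alloc operands i, l, o, r."""
    if r % 8 != 0:
        # Assumption of this sketch: r is a multiple of 8, so r >> 3 is exact.
        raise ValueError("r must be a multiple of 8")
    sof = i + l + o
    sol = i + l
    sor = r >> 3
    return sof, sol, sor


def alloc_operands(sof, sol, sor):
    """Invert the mapping: return (il, o, r), where il = i + l."""
    return sol, sof - sol, sor << 3


fields = alloc_immediates(2, 6, 4, 8)
print(fields)                   # (12, 8, 1)
print(alloc_operands(*fields))  # (8, 4, 8)
```

Only the sum i + l can be recovered from the encoding, which is why the inverse returns il and not i and l separately.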

C.4.8.2 Move to PSR

M35

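Format M29 (Move to AR – Register, M-Unit)

Bits   40:37  36  35:33  32:27  26:20  19:13  12:6  5:0
Field  1      -   x3     x6     ar3    r2     -     qp
Width  4      1   3      6      7      7      7     6

Instruction  Operands  Opcode  x3  x6
mov.m        ar3 = r2  1       0   2A

Format M30 (Move to AR – Immediate 8, M-Unit)

Bits   40:37  36  35:33  32:31  30:27  26:20  19:13  12:6  5:0
Field  0      s   x3     x2     x4     ar3    imm7b  -     qp
Width  4      1   3      2      4      7      7      7     6

Instruction  Operands    Opcode  x3  x4  x2
mov.m        ar3 = imm8  0       0   8   2

Format M31 (Move from AR, M-Unit)

Bits   40:37  36  35:33  32:27  26:20  19:13  12:6  5:0
Field  1      -   x3     x6     ar3    -      r1    qp
Width  4      1   3      6      7      7      7     6

Instruction  Operands  Opcode  x3  x6
mov.m        r1 = ar3  1       0   22

Format M34 (Allocate Register Stack Frame)

Bits   40:37  36  35:33  32:31  30:27  26:20  19:13  12:6  5:0
Field  1      -   x3     -      sor    sol    sof    r1    0
Width  4      1   3      2      4      7      7      7     6

Instruction  Operands                 Opcode  x3
alloc f      r1 = ar.pfs, i, l, o, r  1       6

Format M35 (Move to PSR)

Bits   40:37  36  35:33  32:27  26:20  19:13  12:6  5:0
Field  1      -   x3     x6     -      r2     -     qp
Width  4      1   3      6      7      7      7     6

Instruction  Operands     Opcode  x3  x6
mov          psr.um = r2  1       0   29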



C.4.8.3 Move from PSR

M36

C.4.8.4 Break/Nop (M-Unit)

M37

C.4.9 Memory Management

All system/memory management instructions are encoded within major opcodes 0 and 1 using a 3-bit opcode extension field (x3) in bits 35:33. Some instructions also have a 4-bit opcode extension field (x4) in bits 30:27, or a 6-bit opcode extension field (x6) in bits 32:27. Most of the instructions having a 4-bit opcode extension field also have a 2-bit extension field (x2) in bits 32:31. Table C-40 shows the 3-bit assignments for opcode 0, Table C-41 summarizes the 4-bit+2-bit assignments for opcode 0, Table C-42 shows the 3-bit assignments for opcode 1, and Table C-43 summarizes the 6-bit assignments for opcode 1.
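
The field positions described above can be sketched in Python as follows. This is only an illustration: the helper names field and mm_extensions are hypothetical, and the example word is hand-assembled from the M28 (fc) layout with an arbitrarily chosen r3 = 5 and qp = 0.

```python
def field(slot, hi, lo):
    """Extract bits hi:lo (inclusive) from a 41-bit instruction slot."""
    return (slot >> lo) & ((1 << (hi - lo + 1)) - 1)


def mm_extensions(slot):
    """Return the opcode and every candidate extension field.

    x2/x4 and x6 overlap in bits 32:27, so the caller has to pick the
    ones that apply to the opcode and x3 value at hand.
    """
    return {
        "opcode": field(slot, 40, 37),
        "x3": field(slot, 35, 33),
        "x2": field(slot, 32, 31),
        "x4": field(slot, 30, 27),
        "x6": field(slot, 32, 27),
    }


# fc r3 (format M28): opcode 1, x3 = 0, x6 = 0x30, r3 in bits 26:20.
fc_slot = (1 << 37) | (0 << 33) | (0x30 << 27) | (5 << 20)
ext = mm_extensions(fc_slot)
print(ext["opcode"], ext["x3"], hex(ext["x6"]))  # 1 0 0x30
```

For x6 = 0x30 the same bits read as x2 = 3 and x4 = 0, which is how the 6-bit tables are laid out: bits 32:31 select the column and bits 30:27 select the row.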

Format M36 (Move from PSR)

Bits   40:37  36  35:33  32:27  26:13  12:6  5:0
Field  1      -   x3     x6     -      r1    qp
Width  4      1   3      6      14     7     6

Instruction  Operands     Opcode  x3  x6
mov          r1 = psr.um  1       0   21

Format M37 (Break/Nop, M-Unit)

Bits   40:37  36  35:33  32:31  30:27  26  25:6    5:0
Field  0      i   x3     x2     x4     -   imm20a  qp
Width  4      1   3      2      4      1   20      6

Instruction  Operands  Opcode  x3  x4  x2
break.m      imm21     0       0   0   0
nop.m        imm21     0       0   1   0

Table C-40. Opcode 0 Memory Management 3-bit Opcode Extensions

Opcode bits 40:37 = 0

x3 (bits 35:33)  Instruction
0                System/Memory Management 4-bit+2-bit Ext (Table C-41)
1                -
2                -
3                -
4                chk.a.nc – int M22
5                chk.a.clr – int M22
6                chk.a.nc – fp M23
7                chk.a.clr – fp M23



Table C-41. Opcode 0 Memory Management 4-bit+2-bit Opcode Extensions

Opcode bits 40:37 = 0, x3 bits 35:33 = 0

x4 (bits 30:27)  x2 = 0       x2 = 1               x2 = 2                    x2 = 3
0                break.m M37  invala M24           -                         -
1                nop.m M37    -                    -                         srlz.i M24
2                -            invala.e – int M26   mf M24                    -
3                -            invala.e – fp M27    mf.a M24                  sync.i M24
4                sum M44      -                    -                         -
5                rum M44      -                    -                         -
8                -            -                    mov.m to ar – imm8 M30    -
C                flushrs M25  -                    -                         -

The x2 columns are bits 32:31. Rows 6, 7, 9 through B, and D through F have no entries.


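Table C-42. Opcode 1 Memory Management 3-bit Opcode Extensions

Opcode bits 40:37 = 1

x3 (bits 35:33)  Instruction
0                Memory Management 6-bit Ext (Table C-43)
1                chk.s.m – int M20
2                -
3                chk.s – fp M21
4                -
5                -
6                alloc M34
7                -

Table C-43. Opcode 1 Memory Management 6-bit Opcode Extensions

Opcode bits 40:37 = 1, x3 bits 35:33 = 0

x6 bits 30:27  Bits 32:31 = 0  Bits 32:31 = 1      Bits 32:31 = 2        Bits 32:31 = 3
0              -               -                   -                     fc M28
1              -               -                   mov from psr.um M36   -
2              -               -                   mov.m from ar M31     -
5              -               mov from pmd M43    -                     -
7              -               mov from cpuid M43  -                     -
9              -               -                   mov to psr.um M35     -
A              -               -                   mov.m to ar M29       -

Rows 3, 4, 6, 8, and B through F have no entries.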



C.4.9.1 Move from Indirect Register

M43

C.4.9.2 Set/Reset User Mask

M44

C.5 B-Unit Instruction Encodings

The branch-unit includes branch and miscellaneous instructions.

C.5.1 Branches

Opcode 0 is used for indirect branch, opcode 1 for indirect call, opcode 4 for IP-relative branch, and opcode 5 for IP-relative call.

The IP-relative branch instructions encoded within major opcode 4 use a 3-bit opcode extension field in bits 8:6 (btype) to distinguish the branch types as shown in Table C-44.

The indirect branch, indirect return, and miscellaneous branch-unit instructions are encoded within major opcode 0 using a 6-bit opcode extension field in bits 32:27 (x6). Table C-45 summarizes these assignments.

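Format M43 (Move from Indirect Register)

Bits   40:37  36  35:33  32:27  26:20  19:13  12:6  5:0
Field  1      -   x3     x6     r3     -      r1    qp
Width  4      1   3      6      7      7      7     6

Instruction  Operands        Opcode  x3  x6
mov          r1 = pmd[r3]    1       0   15
mov          r1 = cpuid[r3]  1       0   17

Format M44 (Set/Reset User Mask)

Bits   40:37  36  35:33  32:31  30:27  26:6    5:0
Field  0      i   x3     i2d    x4     imm21a  qp
Width  4      1   3      2      4      21      6

Instruction  Operands  Opcode  x3  x4
sum          imm24     0       0   4
rum          imm24     0       0   5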

Table C-44. IP-Relative Branch Types

Opcode bits 40:37 = 4

btype (bits 8:6)  Instruction
0                 br.cond B1
1                 -
2                 br.wexit B1
3                 br.wtop B1
4                 -
5                 br.cloop B2
6                 br.cexit B2
7                 br.ctop B2



The indirect branch instructions encoded within major opcode 0 use a 3-bit opcode extension field in bits 8:6 (btype) to distinguish the branch types as shown in Table C-46.

The indirect return branch instructions encoded within major opcode 0 use a 3-bit opcode extension field in bits 8:6 (btype) to distinguish the branch types as shown in Table C-47.

All of the branch instructions have a 1-bit opcode extension field, p, in bit 12 which provides a sequential prefetch hint. Table C-48 summarizes these assignments.

Table C-45. Indirect/Miscellaneous Branch Opcode Extensions

Opcode bits 40:37 = 0

x6 bits 30:27  Bits 32:31 = 0  Bits 32:31 = 1  Bits 32:31 = 2                Bits 32:31 = 3
0              break.b B9      -               Indirect Branch (Table C-46)  -
1              -               -               Indirect Return (Table C-47)  -
4              clrrrb B8       -               -                             -
5              clrrrb.pr B8    -               -                             -

Rows 2, 3, and 6 through F have no entries.

Table C-46. Indirect Branch Types

Opcode bits 40:37 = 0, x6 bits 32:27 = 20

btype (bits 8:6)  Instruction
0                 br.cond B4
1                 br.ia B4
2 through 7       -

Table C-47. Indirect Return Branch Types

Opcode bits 40:37 = 0, x6 bits 32:27 = 21

btype (bits 8:6)  Instruction
0 through 3       -
4                 br.ret B4
5 through 7       -

Table C-48. Sequential Prefetch Hint Completer

p (bit 12)  ph
0           .few
1           .many



The IP-relative and indirect branch instructions all have a 2-bit opcode extension field in bits 34:33 (wh) which encodes branch prediction “whether” hint information as shown in Table C-49. Indirect call instructions have a 3-bit opcode extension field in bits 34:32 (wh) for “whether” hint information as shown in Table C-50.

The branch instructions also have a 1-bit opcode extension field in bit 35 (d) which encodes a branch cache deallocation hint as shown in Table C-51.
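
To show how the btype, p, wh, and d fields combine, here is a small Python sketch that rebuilds the mnemonic of an IP-relative branch in major opcode 4 from Tables C-44, C-48, C-49, and C-51. The function name and the example word are hypothetical; only the field positions and completer values come from this section.

```python
BTYPE = {0: "br.cond", 2: "br.wexit", 3: "br.wtop",
         5: "br.cloop", 6: "br.cexit", 7: "br.ctop"}   # Table C-44
BWH = [".sptk", ".spnt", ".dptk", ".dpnt"]             # Table C-49
PH = [".few", ".many"]                                 # Table C-48
DH = ["", ".clr"]                                      # Table C-51


def ip_relative_branch_mnemonic(slot):
    """Return the br.btype.bwh.ph.dh mnemonic for an opcode 4 slot."""
    if (slot >> 37) & 0xF != 4:
        raise ValueError("not an IP-relative branch (major opcode 4)")
    btype = (slot >> 6) & 0x7     # bits 8:6
    if btype not in BTYPE:
        raise ValueError("unassigned btype")
    wh = (slot >> 33) & 0x3       # bits 34:33
    p = (slot >> 12) & 0x1        # bit 12
    d = (slot >> 35) & 0x1        # bit 35
    return BTYPE[btype] + BWH[wh] + PH[p] + DH[d]


# Opcode 4, d = 1, wh = 2, p = 1, btype = 5; other fields left at zero.
slot = (4 << 37) | (1 << 35) | (2 << 33) | (1 << 12) | (5 << 6)
print(ip_relative_branch_mnemonic(slot))  # br.cloop.dptk.many.clr
```

Indirect calls would need a different lookup, since their wh field is 3 bits wide and uses the values in Table C-50.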

C.5.1.1 IP-Relative Branch

B1

C.5.1.2 IP-Relative Counted Branch

B2

Table C-49. Branch Whether Hint Completer

wh (bits 34:33)  bwh
0                .sptk
1                .spnt
2                .dptk
3                .dpnt

Table C-50. Indirect Call Whether Hint Completer

wh (bits 34:32)  bwh
0                -
1                .sptk
2                -
3                .spnt
4                -
5                .dptk
6                -
7                .dpnt

Table C-51. Branch Cache Deallocation Hint Completer

d (bit 35)  dh
0           none
1           .clr

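Format B1 (IP-Relative Branch)

Bits   40:37  36  35  34:33  32:13   12  11:9  8:6    5:0
Field  4      s   d   wh     imm20b  p   -     btype  qp
Width  4      1   1   2      20      1   3     3      6

Instruction           Operands  Opcode  btype
br.cond.bwh.ph.dh     target25  4       0
br.wexit.bwh.ph.dh t  target25  4       2
br.wtop.bwh.ph.dh t   target25  4       3

For p, see Table C-48 on page C-46. For wh, see Table C-49 on page C-47. For d, see Table C-51 on page C-47.

Format B2 (IP-Relative Counted Branch)

Bits   40:37  36  35  34:33  32:13   12  11:9  8:6    5:0
Field  4      s   d   wh     imm20b  p   -     btype  0
Width  4      1   1   2      20      1   3     3      6

Instruction           Operands  Opcode  btype
br.cloop.bwh.ph.dh t  target25  4       5
br.cexit.bwh.ph.dh t  target25  4       6
br.ctop.bwh.ph.dh t   target25  4       7

For p, see Table C-48 on page C-46. For wh, see Table C-49 on page C-47. For d, see Table C-51 on page C-47.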



C.5.1.3 IP-Relative Call

B3

C.5.1.4 Indirect Branch

B4

C.5.1.5 Indirect Call

B5

C.5.2 Nop

The nop instruction is encoded in major opcode 2 using a 6-bit opcode extension field (x6) in bits 32:27. Table C-52 summarizes these assignments.

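Format B3 (IP-Relative Call)

Bits   40:37  36  35  34:33  32:13   12  11:9  8:6  5:0
Field  5      s   d   wh     imm20b  p   -     b1   qp
Width  4      1   1   2      20      1   3     3    6

Instruction        Operands       Opcode
br.call.bwh.ph.dh  b1 = target25  5

For p, see Table C-48 on page C-46. For wh, see Table C-49 on page C-47. For d, see Table C-51 on page C-47.

Format B4 (Indirect Branch)

Bits   40:37  36  35  34:33  32:27  26:16  15:13  12  11:9  8:6    5:0
Field  0      -   d   wh     x6     -      b2     p   -     btype  qp
Width  4      1   1   2      6      11     3      1   3     3      6

Instruction        Operands  Opcode  x6  btype
br.cond.bwh.ph.dh  b2        0       20  0
br.ia.bwh.ph.dh    b2        0       20  1
br.ret.bwh.ph.dh   b2        0       21  4

For p, see Table C-48 on page C-46. For wh, see Table C-49 on page C-47. For d, see Table C-51 on page C-47.

Format B5 (Indirect Call)

Bits   40:37  36  35  34:32  31:16  15:13  12  11:9  8:6  5:0
Field  1      -   d   wh     -      b2     p   -     b1   qp
Width  4      1   1   3      16     3      1   3     3    6

Instruction        Operands  Opcode
br.call.bwh.ph.dh  b1 = b2   1

For p, see Table C-48 on page C-46. For wh, see Table C-50 on page C-47. For d, see Table C-51 on page C-47.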

Table C-52. Indirect Predict/Nop Opcode Extensions

Opcode bits 40:37 = 2

x6 bits 30:27  Bits 32:31 = 0  Bits 32:31 = 1  Bits 32:31 = 2  Bits 32:31 = 3
0              nop.b B9        -               -               -

Rows 1 through F have no entries.



C.5.3 Miscellaneous B-Unit Instructions

The miscellaneous branch-unit instructions include a number of instructions encoded within major opcode 0 using a 6-bit opcode extension field in bits 32:27 (x6) as described in Table C-45 on page C-46.

C.5.3.1 Miscellaneous (B-Unit)

B8

C.5.3.2 Break/Nop (B-Unit)

B9

C.6 F-Unit Instruction Encodings

The floating-point instructions are encoded in major opcodes 8 – E for floating-point and fixed-point arithmetic, opcode 4 for floating-point compare, opcode 5 for floating-point class, and opcodes 0 and 1 for miscellaneous floating-point instructions.

The miscellaneous and reciprocal approximation floating-point instructions are encoded within major opcodes 0 and 1 using a 1-bit opcode extension field (x) in bit 33 and either a second 1-bit extension field in bit 36 (q) or a 6-bit opcode extension field (x6) in bits 32:27. Table C-53 shows the 1-bit x assignments, Table C-56 shows the additional 1-bit q assignments for the reciprocal approximation instructions, and Table C-54 and Table C-55 summarize the 6-bit x6 assignments.

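Format B8 (Miscellaneous, B-Unit)

Bits   40:37  36:33  32:27  26:6  5:0
Field  0      -      x6     -     0
Width  4      4      6      21    6

Instruction  Opcode  x6
clrrrb l     0       04
clrrrb.pr l  0       05

Format B9 (Break/Nop, B-Unit)

Bits   40:37  36  35:33  32:27  26  25:6    5:0
Field  0/2    i   -      x6     -   imm20a  qp
Width  4      1   3      6      1   20      6

Instruction  Operands  Opcode  x6
break.b      imm21     0       00
nop.b        imm21     2       00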

Table C-53. Miscellaneous Floating-point 1-bit Opcode Extensions

Opcode (bits 40:37)  x (bit 33)  Extension
0                    0           6-bit Ext (Table C-54)
0                    1           Reciprocal Approximation (Table C-56)
1                    0           6-bit Ext (Table C-55)
1                    1           Reciprocal Approximation (Table C-56)



Table C-54. Opcode 0 Miscellaneous Floating-point 6-bit Opcode Extensions

Opcode bits 40:37 = 0, x bit 33 = 0

x6 bits 30:27  Bits 32:31 = 0  Bits 32:31 = 1      Bits 32:31 = 2  Bits 32:31 = 3
0              break.f F15     fmerge.s F9         -               -
1              nop.f F15       fmerge.ns F9        -               -
2              -               fmerge.se F9        -               -
3              -               -                   -               -
4              fsetc F12       fmin F8             -               fswap F9
5              fclrf F13       fmax F8             -               fswap.nl F9
6              -               famin F8            -               fswap.nr F9
7              -               famax F8            -               -
8              fchkf F14       fcvt.fx F10         fpack F9        -
9              -               fcvt.fxu F10        -               fmix.lr F9
A              -               fcvt.fx.trunc F10   -               fmix.r F9
B              -               fcvt.fxu.trunc F10  -               fmix.l F9
C              -               fcvt.xf F11         fand F9         fsxt.r F9
D              -               -                   fandcm F9       fsxt.l F9
E              -               -                   for F9          -
F              -               -                   fxor F9         -

Table C-55. Opcode 1 Miscellaneous Floating-point 6-bit Opcode Extensions

Opcode bits 40:37 = 1, x bit 33 = 0

x6 bits 30:27  Bits 32:31 = 0  Bits 32:31 = 1       Bits 32:31 = 2  Bits 32:31 = 3
0              -               fpmerge.s F9         -               fpcmp.eq F8
1              -               fpmerge.ns F9        -               fpcmp.lt F8
2              -               fpmerge.se F9        -               fpcmp.le F8
3              -               -                    -               fpcmp.unord F8
4              -               fpmin F8             -               fpcmp.neq F8
5              -               fpmax F8             -               fpcmp.nlt F8
6              -               fpamin F8            -               fpcmp.nle F8
7              -               fpamax F8            -               fpcmp.ord F8
8              -               fpcvt.fx F10         -               -
9              -               fpcvt.fxu F10        -               -
A              -               fpcvt.fx.trunc F10   -               -
B              -               fpcvt.fxu.trunc F10  -               -

Rows C through F have no entries.

Table C-56. Reciprocal Approximation 1-bit Opcode Extensions

Opcode (bits 40:37)  x (bit 33)  q (bit 36)  Instruction
0                    1           0           frcpa F6
0                    1           1           frsqrta F7
1                    1           0           fprcpa F6
1                    1           1           fprsqrta F7



Most floating-point instructions have a 2-bit opcode extension field in bits 35:34 (sf) which encodes the FPSR status field to be used. Table C-57 summarizes these assignments.

C.6.1 Arithmetic

The floating-point arithmetic instructions are encoded within major opcodes 8 – D using a 1-bit opcode extension field (x) in bit 36 and a 2-bit opcode extension field (sf) in bits 35:34. The opcode and x assignments are shown in Table C-58.

The fixed-point arithmetic and parallel floating-point select instructions are encoded within major opcode E using a 1-bit opcode extension field (x) in bit 36. The fixed-point arithmetic instructions also have a 2-bit opcode extension field (x2) in bits 35:34. These assignments are shown in Table C-59.
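
The opcode, x, and sf lookup for the floating-point arithmetic instructions might be sketched in Python as below. The dictionary transcribes Table C-58 and the status field completers follow Table C-57; the function name and the example word are hypothetical.

```python
# (major opcode, x) -> base mnemonic, transcribed from Table C-58.
FP_ARITH = {
    (0x8, 0): "fma",    (0x8, 1): "fma.s",
    (0x9, 0): "fma.d",  (0x9, 1): "fpma",
    (0xA, 0): "fms",    (0xA, 1): "fms.s",
    (0xB, 0): "fms.d",  (0xB, 1): "fpms",
    (0xC, 0): "fnma",   (0xC, 1): "fnma.s",
    (0xD, 0): "fnma.d", (0xD, 1): "fpnma",
}


def fp_arith_mnemonic(slot):
    """Return the mnemonic with its .s0 to .s3 status field completer."""
    opcode = (slot >> 37) & 0xF   # bits 40:37
    x = (slot >> 36) & 0x1        # bit 36
    sf = (slot >> 34) & 0x3       # bits 35:34
    if (opcode, x) not in FP_ARITH:
        raise ValueError("not a floating-point arithmetic encoding")
    return f"{FP_ARITH[(opcode, x)]}.s{sf}"


# Opcode 9, x = 1, sf = 2; register fields left at zero.
print(fp_arith_mnemonic((0x9 << 37) | (1 << 36) | (2 << 34)))  # fpma.s2
```

Major opcode E is deliberately left out of the dictionary, because there bits 35:34 hold x2 and not sf.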

C.6.1.1 Floating-point Multiply Add

F1

Table C-57. Floating-point Status Field Completer

sf (bits 35:34)  sf
0                .s0
1                .s1
2                .s2
3                .s3

Table C-58. Floating-point Arithmetic 1-bit Opcode Extensions

x (bit 36)  Opcode 8   Opcode 9   Opcode A   Opcode B   Opcode C   Opcode D
0           fma F1     fma.d F1   fms F1     fms.d F1   fnma F1    fnma.d F1
1           fma.s F1   fpma F1    fms.s F1   fpms F1    fnma.s F1  fpnma F1

Table C-59. Fixed-point Multiply Add and Select Opcode Extensions

Opcode bits 40:37 = E

x (bit 36)  x2 = 0      x2 = 1  x2 = 2     x2 = 3
0           fselect F3 (no x2 field)
1           xma.l F2    -       xma.hu F2  xma.h F2

The x2 columns are bits 35:34.

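Format F1 (Floating-point Multiply Add)

Bits   40:37  36  35:34  33:27  26:20  19:13  12:6  5:0
Field  8 - D  x   sf     f4     f3     f2     f1    qp
Width  4      1   2      7      7      7      7     6

Instruction  Operands         Opcode  x
fma.sf       f1 = f3, f4, f2  8       0
fma.s.sf     f1 = f3, f4, f2  8       1
fma.d.sf     f1 = f3, f4, f2  9       0
fpma.sf      f1 = f3, f4, f2  9       1
fms.sf       f1 = f3, f4, f2  A       0
fms.s.sf     f1 = f3, f4, f2  A       1
fms.d.sf     f1 = f3, f4, f2  B       0
fpms.sf      f1 = f3, f4, f2  B       1
fnma.sf      f1 = f3, f4, f2  C       0
fnma.s.sf    f1 = f3, f4, f2  C       1
fnma.d.sf    f1 = f3, f4, f2  D       0
fpnma.sf     f1 = f3, f4, f2  D       1

For sf, see Table C-57 on page C-51.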



C.6.1.2 Fixed-point Multiply Add

F2

C.6.2 Parallel Floating-point Select

F3

C.6.3 Compare and Classify

The predicate setting floating-point compare instructions are encoded within major opcode 4 using three 1-bit opcode extension fields in bits 33 (ra), 36 (rb), and 12 (ta), and a 2-bit opcode extension field (sf) in bits 35:34. The opcode, ra, rb, and ta assignments are shown in Table C-60. The sf assignments are shown in Table C-57 on page C-51.

The parallel floating-point compare instructions are described on page C-54.

The floating-point class instructions are encoded within major opcode 5 using a 1-bit opcode extension field in bit 12 (ta) as shown in Table C-61.
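
As an illustration of how ra, rb, ta, and sf select a compare instruction, the following Python sketch transcribes Table C-60 into a lookup keyed by (ra, rb). The function name and the example word are hypothetical; the bit positions are the ones given in the paragraph on major opcode 4 above.

```python
# (ra, rb) -> base mnemonic, transcribed from Table C-60.
FCMP = {
    (0, 0): "fcmp.eq",
    (0, 1): "fcmp.lt",
    (1, 0): "fcmp.le",
    (1, 1): "fcmp.unord",
}


def fcmp_mnemonic(slot):
    """Return the fcmp mnemonic, e.g. fcmp.lt.unc.s3, for an opcode 4 slot."""
    if (slot >> 37) & 0xF != 4:
        raise ValueError("not a floating-point compare (major opcode 4)")
    ra = (slot >> 33) & 0x1       # bit 33
    rb = (slot >> 36) & 0x1       # bit 36
    ta = (slot >> 12) & 0x1       # bit 12
    sf = (slot >> 34) & 0x3       # bits 35:34
    unc = ".unc" if ta else ""
    return f"{FCMP[(ra, rb)]}{unc}.s{sf}"


# Opcode 4, rb = 1, sf = 3, ta = 1; predicate and register fields left at zero.
print(fcmp_mnemonic((4 << 37) | (1 << 36) | (3 << 34) | (1 << 12)))  # fcmp.lt.unc.s3
```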

C.6.3.1 Floating-point Compare

F4

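Format F2 (Fixed-point Multiply Add)

Bits   40:37  36  35:34  33:27  26:20  19:13  12:6  5:0
Field  E      x   x2     f4     f3     f2     f1    qp
Width  4      1   2      7      7      7      7     6

Instruction  Operands         Opcode  x  x2
xma.l        f1 = f3, f4, f2  E       1  0
xma.h        f1 = f3, f4, f2  E       1  3
xma.hu       f1 = f3, f4, f2  E       1  2

Format F3 (Parallel Floating-point Select)

Bits   40:37  36  35:34  33:27  26:20  19:13  12:6  5:0
Field  E      x   -      f4     f3     f2     f1    qp
Width  4      1   2      7      7      7      7     6

Instruction  Operands         Opcode  x
fselect      f1 = f3, f4, f2  E       0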

Table C-60. Floating-point Compare Opcode Extensions

Opcode bits 40:37 = 4

ra (bit 33)  rb (bit 36)  ta (bit 12) = 0  ta (bit 12) = 1
0            0            fcmp.eq F4       fcmp.eq.unc F4
0            1            fcmp.lt F4       fcmp.lt.unc F4
1            0            fcmp.le F4       fcmp.le.unc F4
1            1            fcmp.unord F4    fcmp.unord.unc F4

Table C-61. Floating-point Class 1-bit Opcode Extensions

Opcode bits 40:37 = 5

ta (bit 12)  Instruction
0            fclass.m F5
1            fclass.m.unc F5

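Format F4 (Floating-point Compare)

Bits   40:37  36  35:34  33  32:27  26:20  19:13  12  11:6  5:0
Field  4      rb  sf     ra  p2     f3     f2     ta  p1    qp
Width  4      1   2      1   6      7      7      1   6     6

Instruction        Operands         Opcode  ra  rb  ta
fcmp.eq.sf         p1, p2 = f2, f3  4       0   0   0
fcmp.lt.sf         p1, p2 = f2, f3  4       0   1   0
fcmp.le.sf         p1, p2 = f2, f3  4       1   0   0
fcmp.unord.sf      p1, p2 = f2, f3  4       1   1   0
fcmp.eq.unc.sf     p1, p2 = f2, f3  4       0   0   1
fcmp.lt.unc.sf     p1, p2 = f2, f3  4       0   1   1
fcmp.le.unc.sf     p1, p2 = f2, f3  4       1   0   1
fcmp.unord.unc.sf  p1, p2 = f2, f3  4       1   1   1

For sf, see Table C-57 on page C-51.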



C.6.3.2 Floating-point Class

F5

C.6.4 Approximation

C.6.4.1 Floating-point Reciprocal Approximation

There are two Reciprocal Approximation instructions. The first, in major op 0, encodes the full register variant. The second, in major op 1, encodes the parallel variant.

F6

C.6.4.2 Floating-point Reciprocal Square Root Approximation

There are two Reciprocal Square Root Approximation instructions. The first, in major op 0, encodes the full register variant. The second, in major op 1, encodes the parallel variant.

F7

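Format F5 (Floating-point Class)

Bits   40:37  36:35  34:33  32:27  26:20     19:13  12  11:6  5:0
Field  5      -      fc2    p2     fclass7c  f2     ta  p1    qp
Width  4      2      2      6      7         7      1   6     6

Instruction   Operands              Opcode  ta
fclass.m      p1, p2 = f2, fclass9  5       0
fclass.m.unc  p1, p2 = f2, fclass9  5       1

Format F6 (Floating-point Reciprocal Approximation)

Bits   40:37  36  35:34  33  32:27  26:20  19:13  12:6  5:0
Field  0 - 1  q   sf     x   p2     f3     f2     f1    qp
Width  4      1   2      1   6      7      7      7     6

Instruction  Operands         Opcode  x  q
frcpa.sf     f1, p2 = f2, f3  0       1  0
fprcpa.sf    f1, p2 = f2, f3  1       1  0

For sf, see Table C-57 on page C-51.

Format F7 (Floating-point Reciprocal Square Root Approximation)

Bits   40:37  36  35:34  33  32:27  26:20  19:13  12:6  5:0
Field  0 - 1  q   sf     x   p2     f3     -      f1    qp
Width  4      1   2      1   6      7      7      7     6

Instruction  Operands     Opcode  x  q
frsqrta.sf   f1, p2 = f3  0       1  1
fprsqrta.sf  f1, p2 = f3  1       1  1

For sf, see Table C-57 on page C-51.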



C.6.5 Minimum/Maximum and Parallel Compare

There are two groups of Minimum/Maximum instructions. The first group, in major op 0, encodes the full register variants. The second group, in major op 1, encodes the parallel variants. The parallel compare instructions are all encoded in major op 1.

F8

C.6.6 Merge and Logical

F9

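Format F8 (Minimum/Maximum and Parallel Compare)

Bits   40:37  36  35:34  33  32:27  26:20  19:13  12:6  5:0
Field  0 - 1  -   sf     x   x6     f3     f2     f1    qp
Width  4      1   2      1   6      7      7      7     6

Instruction     Operands     Opcode  x  x6
fmin.sf         f1 = f2, f3  0       0  14
fmax.sf         f1 = f2, f3  0       0  15
famin.sf        f1 = f2, f3  0       0  16
famax.sf        f1 = f2, f3  0       0  17
fpmin.sf        f1 = f2, f3  1       0  14
fpmax.sf        f1 = f2, f3  1       0  15
fpamin.sf       f1 = f2, f3  1       0  16
fpamax.sf       f1 = f2, f3  1       0  17
fpcmp.eq.sf     f1 = f2, f3  1       0  30
fpcmp.lt.sf     f1 = f2, f3  1       0  31
fpcmp.le.sf     f1 = f2, f3  1       0  32
fpcmp.unord.sf  f1 = f2, f3  1       0  33
fpcmp.neq.sf    f1 = f2, f3  1       0  34
fpcmp.nlt.sf    f1 = f2, f3  1       0  35
fpcmp.nle.sf    f1 = f2, f3  1       0  36
fpcmp.ord.sf    f1 = f2, f3  1       0  37

For sf, see Table C-57 on page C-51.

Format F9 (Merge and Logical)

Bits   40:37  36:34  33  32:27  26:20  19:13  12:6  5:0
Field  0 - 1  -      x   x6     f3     f2     f1    qp
Width  4      3      1   6      7      7      7     6

Instruction  Operands     Opcode  x  x6
fmerge.s     f1 = f2, f3  0       0  10
fmerge.ns    f1 = f2, f3  0       0  11
fmerge.se    f1 = f2, f3  0       0  12
fmix.lr      f1 = f2, f3  0       0  39
fmix.r       f1 = f2, f3  0       0  3A
fmix.l       f1 = f2, f3  0       0  3B
fsxt.r       f1 = f2, f3  0       0  3C
fsxt.l       f1 = f2, f3  0       0  3D
fpack        f1 = f2, f3  0       0  28
fswap        f1 = f2, f3  0       0  34
fswap.nl     f1 = f2, f3  0       0  35
fswap.nr     f1 = f2, f3  0       0  36
fand         f1 = f2, f3  0       0  2C
fandcm       f1 = f2, f3  0       0  2D
for          f1 = f2, f3  0       0  2E
fxor         f1 = f2, f3  0       0  2F
fpmerge.s    f1 = f2, f3  1       0  10
fpmerge.ns   f1 = f2, f3  1       0  11
fpmerge.se   f1 = f2, f3  1       0  12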



C.6.7 Conversion

C.6.7.1 Convert Floating-point to Fixed-point

F10

C.6.7.2 Convert Fixed-point to Floating-point

F11

C.6.8 Status Field Manipulation

C.6.8.1 Floating-point Set Controls

F12

C.6.8.2 Floating-point Clear Flags

F13

C.6.8.3 Floating-point Check Flags

F14

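Format F10 (Convert Floating-point to Fixed-point)

Bits   40:37  36  35:34  33  32:27  26:20  19:13  12:6  5:0
Field  0 - 1  -   sf     x   x6     -      f2     f1    qp
Width  4      1   2      1   6      7      7      7     6

Instruction         Operands  Opcode  x  x6
fcvt.fx.sf          f1 = f2   0       0  18
fcvt.fxu.sf         f1 = f2   0       0  19
fcvt.fx.trunc.sf    f1 = f2   0       0  1A
fcvt.fxu.trunc.sf   f1 = f2   0       0  1B
fpcvt.fx.sf         f1 = f2   1       0  18
fpcvt.fxu.sf        f1 = f2   1       0  19
fpcvt.fx.trunc.sf   f1 = f2   1       0  1A
fpcvt.fxu.trunc.sf  f1 = f2   1       0  1B

For sf, see Table C-57 on page C-51.

Format F11 (Convert Fixed-point to Floating-point)

Bits   40:37  36:34  33  32:27  26:20  19:13  12:6  5:0
Field  0      -      x   x6     -      f2     f1    qp
Width  4      3      1   6      7      7      7     6

Instruction  Operands  Opcode  x  x6
fcvt.xf      f1 = f2   0       0  1C

Format F12 (Floating-point Set Controls)

Bits   40:37  36  35:34  33  32:27  26:20    19:13    12:6  5:0
Field  0      -   sf     x   x6     omask7c  amask7b  -     qp
Width  4      1   2      1   6      7        7        7     6

Instruction  Operands        Opcode  x  x6
fsetc.sf     amask7, omask7  0       0  04

For sf, see Table C-57 on page C-51.

Format F13 (Floating-point Clear Flags)

Bits   40:37  36  35:34  33  32:27  26:6  5:0
Field  0      -   sf     x   x6     -     qp
Width  4      1   2      1   6      21    6

Instruction  Opcode  x  x6
fclrf.sf     0       0  05

For sf, see Table C-57 on page C-51.

Format F14 (Floating-point Check Flags)

Bits   40:37  36  35:34  33  32:27  26  25:6    5:0
Field  0      s   sf     x   x6     -   imm20a  qp
Width  4      1   2      1   6      1   20      6

Instruction  Operands  Opcode  x  x6
fchkf.sf     target25  0       0  08

For sf, see Table C-57 on page C-51.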



C.6.9 Miscellaneous F-Unit Instructions

C.6.9.1 Break/Nop (F-Unit)

F15

C.7 X-Unit Instruction Encodings

The X-unit instructions occupy two instruction slots, L+X. The major opcode, opcode extensions and hints, qp, and small immediate fields occupy the X instruction slot. For movl, break.x, and nop.x, the imm41 field occupies the L instruction slot.

C.7.1 Miscellaneous X-Unit Instructions

The miscellaneous X-unit instructions are encoded in major opcode 0 using a 3-bit opcode extension field (x3) in bits 35:33 and a 6-bit opcode extension field (x6) in bits 32:27. Table C-62 shows the 3-bit assignments and Table C-63 summarizes the 6-bit assignments. These instructions are executed by an I-unit.

Format F15 (Break/Nop, F-Unit)

Bits   40:37  36  35:34  33  32:27  26  25:6    5:0
Field  0      i   -      x   x6     -   imm20a  qp
Width  4      1   2      1   6      1   20      6

Instruction  Operands  Opcode  x  x6
break.f      imm21     0       0  00
nop.f        imm21     0       0  01

Table C-62. Misc X-Unit 3-bit Opcode Extensions

Opcode bits 40:37 = 0

x3 (bits 35:33)  Extension
0                6-bit Ext (Table C-63)
1 through 7      -

Table C-63. Misc X-Unit 6-bit Opcode Extensions

Opcode bits 40:37 = 0, x3 bits 35:33 = 0

x6 bits 30:27  Bits 32:31 = 0  Bits 32:31 = 1  Bits 32:31 = 2  Bits 32:31 = 3
0              break.x X1      -               -               -
1              nop.x X1        -               -               -

Rows 2 through F have no entries.

C.7.1.1 Break/Nop (X-Unit)

X1

C.7.2 Move Long Immediate64

The move long immediate instruction is encoded within major opcode 6 using a 1-bit reserved opcode extension in bit 20 (vc) as shown in Table C-64. This instruction is executed by an I-unit.

X2

C.8 Immediate Formation

The following table shows, for each instruction format that has one or more immediates, how those immediates are formed. In each equation, the symbol to the left of the equals is the assembly language name for the immediate. The symbols to the right are the field names in the instruction encoding.

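Format X1 (Break/Nop, X-Unit)

X slot
Bits   40:37  36  35:33  32:27  26  25:6    5:0
Field  0      i   x3     x6     -   imm20a  qp
Width  4      1   3      6      1   20      6

L slot
Bits   40:0
Field  imm41
Width  41

Instruction  Operands  Opcode  x3  x6
break.x      imm62     0       0   00
nop.x        imm62     0       0   01

Table C-64. Move Long 1-bit Opcode Extensions

Opcode bits 40:37 = 6

vc (bit 20)  Instruction
0            movl X2
1            -

Format X2 (Move Long Immediate64)

X slot
Bits   40:37  36  35:27  26:22  21  20  19:13  12:6  5:0
Field  6      i   imm9d  imm5c  ic  vc  imm7b  r1    qp
Width  4      1   9      5      1   1   7      7     6

L slot
Bits   40:0
Field  imm41
Width  41

Instruction  Operands    Opcode  vc
movl         r1 = imm64  6       0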

Table C-65. Immediate Formation

Instruction Format    Immediate Formation

A2 count2 = ct2d + 1

A3 A8 I27 M30 imm8 = sign_ext(s << 7 | imm7b, 8)

A4 imm14 = sign_ext(s << 13 | imm6d << 7 | imm7b, 14)

A5 imm22 = sign_ext(s << 21 | imm5c << 16 | imm9d << 7 | imm7b, 22)

A10 count2 = (ct2d > 2) ? reservedQPa : ct2d + 1

I1 count2 = (ct2d == 0) ? 0 : (ct2d == 1) ? 7 : (ct2d == 2) ? 15 : 16

I3 mbtype4 = (mbt4c == 0) ? @brcst : (mbt4c == 8) ? @mix : (mbt4c == 9) ? @shuf : (mbt4c == 0xA) ? @alt : (mbt4c == 0xB) ? @rev : reservedQPa

I4 mhtype8 = mht8c

I6 count5 = count5b

I8 count5 = 31 – ccount5c

I10 count6 = count6d

I11 len6 = len6d + 1
    pos6 = pos6b

I12 len6 = len6d + 1
    pos6 = 63 – cpos6c

I13 len6 = len6d + 1
    pos6 = 63 – cpos6c
    imm8 = sign_ext(s << 7 | imm7b, 8)

I14 len6 = len6d + 1
    pos6 = 63 – cpos6b
    imm1 = sign_ext(s, 1)

I15 len4 = len4d + 1
    pos6 = 63 – cpos6d

I16 pos6 = pos6b

I19 M37 imm21 = i << 20 | imm20a

I23 mask17 = sign_ext(s << 16 | mask8c << 8 | mask7a << 1, 17)

I24 imm44 = sign_ext(s << 43 | imm27a << 16, 44)

M3 M8 M15 imm9 = sign_ext(s << 8 | i << 7 | imm7b, 9)

M5 M10 imm9 = sign_ext(s << 8 | i << 7 | imm7a, 9)

M17 inc3 = sign_ext(((s) ? –1 : 1) * ((i2b == 3) ? 1 : 1 << (4 – i2b)), 6)

I20 M20 M21 target25 = IP + (sign_ext(s << 20 | imm13c << 7 | imm7a, 21) << 4)

M22 M23 target25 = IP + (sign_ext(s << 20 | imm20b, 21) << 4)

M34 il = sol
    o = sof – sol
    r = sor << 3

M44 imm24 = i << 23 | i2d << 21 | imm21a

B1 B2 B3 target25 = IP + (sign_ext(s << 20 | imm20b, 21) << 4)

B9 imm21 = i << 20 | imm20a

F5 fclass9 = fclass7c << 2 | fc2

F12 amask7 = amask7b
    omask7 = omask7c

F14 target25 = IP + (sign_ext(s << 20 | imm20a, 21) << 4)

F15 imm21 = i << 20 | imm20a

X1 imm62 = imm41 << 21 | i << 20 | imm20a

X2 imm64 = i << 63 | imm41 << 22 | ic << 21 | imm5c << 16 | imm9d << 7 | imm7b

a. This encoding causes an Illegal Operation fault if the value of the qualifying predicate is 1.
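
Two of the formations in Table C-65 are sketched below in Python: the branch target for formats B1, B2, B3, M22, and M23, and the 64-bit immediate for format X2. The function names and the sample values are hypothetical, and sign_ext is written here on the assumption that it interprets its first argument as a two's-complement number of the given width.

```python
def sign_ext(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    value &= (1 << width) - 1
    sign_bit = 1 << (width - 1)
    return (value ^ sign_bit) - sign_bit


def target25(ip, s, imm20b):
    """target25 = IP + (sign_ext(s << 20 | imm20b, 21) << 4)."""
    return ip + (sign_ext(s << 20 | imm20b, 21) << 4)


def imm64(i, imm41, ic, imm5c, imm9d, imm7b):
    """imm64 for format X2 (movl), assembled from its six fields."""
    return (i << 63 | imm41 << 22 | ic << 21
            | imm5c << 16 | imm9d << 7 | imm7b)


print(hex(target25(0x4000, 0, 2)))         # 0x4020
print(hex(target25(0x4000, 1, 0xFFFFF)))   # 0x3ff0
print(hex(imm64(1, (1 << 41) - 1, 1, 0x1F, 0x1FF, 0x7F)))  # 0xffffffffffffffff
```

The shift by 4 scales the signed displacement by 16, so a displacement of 2 moves the target forward by 32 bytes and a displacement of -1 moves it back by 16.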
