Intel Confidential

BDW + FPGA

Beta Release 5.0.3 Core Cache Interface (CCI-P)

Interface Specification

2-Sep-16 Document Version 1.0

Notices and Disclaimers

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation.

Updates

This document belongs to the group of documents provided for the BDW + FPGA product.

Identify the latest copy by the date printed in the footer on each page.

Questions and Feedback

Intel solicits and appreciates feedback. Provide input through Intel® Premier Support (IPS). To obtain IPS access, customers should work with their respective Account Manager or FAE.

Revision History

Date        Version                  Doc Modifications
24-Nov-14   0.15                     Edits to A2C interface
12-Feb-15   0.25
19-Jul-15   0.5                      Major edits to CCI-P section. Defined separate CFG Read/Write channel
6-Aug-15    0.55                     Define internal interfaces to IPs: CCI-U and csr sideband
3-Sep-15    0.56                     Formatting and minor editing
21-Sep-15   0.56                     Minor editing
28-Oct-15   0.6                      Draft version. Updates to CCI-U cfg header, and CCI-P control signals. Cfg channel renamed to MMIO and signals regrouped in CCI-P section
01-Dec-15   0.6                      Removed internal information. Reformat for external distribution.
15-Dec-15   Pre-Alpha                Format for pre-alpha; add some clarification
20-Dec-15   0.6                      0.6 version with internal information
30-Dec-15   Pre-Alpha                External sections updated; called Pre-Alpha
24-Jan-16   Pre-Alpha (CCI-P 0.7)    Edit and format
16-Mar-16   5.0.2                    Update for Beta.
11-Jun-16   5.0.2 v1.0               Added Intel Confidential to footer.
23-Aug-16   5.0.3



Contents

Notices and Disclaimers
Updates
Questions and Feedback
Revision History
About this Document
Intended Audience
Conventions
Related Documentation
Glossary
1 Introduction
1.1 Xeon® Processor + FPGA Block Diagram
1.2 Development models
1.3 Memory hierarchy
2 CCI-P Interface
2.1 Features
2.2 Signaling information
2.3 Read from/Write to Main Memory
2.4 UMsg
2.5 MMIO Cycles to IO Memory
2.6 CCI-P Tx Signals
2.7 Tx Header Format
2.8 CCI-P Rx Signals
2.8.1 Rx Header and RxData Format
2.9 Multi-Cacheline Memory Requests
2.10 Additional Control Signals
Protocol Flow
2.10.1 Upstream Requests
2.10.2 Downstream Requests
2.11 Ordering Rules
2.11.1 Memory Requests
2.11.1.1 Write Fence usage
2.11.1.2 Memory Consistency Explained
2.11.1.2.1 Two Writes on Different VCs
2.11.1.2.2 Two Writes on the Same VC
2.11.1.2.3 Two Reads on Different VCs
2.11.1.2.4 Two Reads on the Same VC
2.11.1.2.5 Read-after-Write on Same VC
2.11.1.2.6 Read-after-Write on Different VCs
2.11.1.2.7 Write-after-Read on Same or Different VCs
2.11.1.2.8 Some example scenarios
2.11.2 MMIO Requests
2.12 Timing diagrams
2.13 Clock Frequency
2.14 CCI-P Guidance
3 AFU Requirements
3.1 Mandatory AFU CSR Definitions
3.2 AFU Discovery Flow
3.3 AFU_ID
3.3.1 How to Create an AFU_ID / GUID
3.3.2 How to Use an AFU_ID
4 Basic Building Blocks
5 Device Feature List

Code

Code 1: ccip_std_afu port map
Code 2: Tx interface structure inside ccip_if_pkg.sv
Code 3: Tx channel structures inside ccip_if_pkg.sv
Code 4: Rx interface structure inside ccip_if_pkg.sv
Code 5: Rx channel structure inside ccip_if_pkg.sv
Code 6: Set the Mandatory AFU Registers in the AFU
Code 7: AAL Reads the AFU ID

Figures

Figure 1: High-Level Block Diagram of Xeon®+ FPGA Logic
Figure 2: Xeon+FPGA system memory hierarchy, 1 Processor topology
Figure 3: CCI-P Signals
Figure 4: UMsg initialization and usage flow
Figure 5: Multi-CL Memory Write Requests
Figure 6: Multi-CL Memory Write Responses
Figure 7: Multi-CL Memory Read Responses
Figure 8: Write Out of Order Commit
Figure 9: Use WrFence to Enforce Write Ordering
Figure 10: Two Writes on Same VC, Only One Outstanding
Figure 11: Read Re-Ordering to Same Address, Different VCs
Figure 12: Read Re-Ordering to Same Address, Same VC
Figure 13: Tx Channel 0 & 1 almost full threshold
Figure 14: Write Fence Behavior
Figure 15: C0 Rx Channel Interleaved between MMIO Requests and Memory Responses
Figure 16: Rd Response Timeout
Figure 17: AFU discovery flow
Figure 18: Example feature hierarchy
Figure 19: Device Feature Conceptual View

Tables

Table 1: CCI-P Features
Table 2: Comparison of Platform Capabilities
Table 3: AFU Memory Read paths
Table 4: CCI-P Features summary
Table 5: Tx Channel Signal Description
Table 6: Tx Header Field Definitions
Table 7: Tx Request Encodings & Mapping to Header Formats
Table 8: C0 Read Memory Request Header Format
Table 9: C1 Write Memory Request Header Format
Table 10: C1 Fence Header Format
Table 11: C2 MMIO Response Header Format
Table 12: Rx Channel Signal Description
Table 13: Rx Header Field Definitions
Table 14: AFU Rx Response Encodings and Channels Mapping
Table 15: C0 Memory Read Response Header Format
Table 16: MMIO Request Header Format
Table 17: C1 Memory Write Response Header Format
Table 18: UMsg Header Format
Table 19: WrFence Header Format
Table 20: Clock and Reset
Table 21: Protocol Flow for upstream requests from AFU to FIU
Table 22: CCI-P VL0 protocol flows
Table 23: Protocol Flow for Downstream Requests from CPU to AFU
Table 24: Ordering rules for upstream requests from AFU
Table 25: MMIO Ordering Rules
Table 26: Clock Frequency
Table 27: Recommended Choices for Memory Requests
Table 28: Register Attribute Definition
Table 29: Mandatory AFU CSRs
Table 30: Feature Header CSR Definition
Table 31: AFU_ID_L CSR Definition
Table 32: AFU_ID_H CSR Definition
Table 33: DFH_RSVD0 CSR Definition
Table 35: Differences between AFU, Private Features, and BBBs
Table 36: Device Feature Header CSR
Table 37: Next DFH Byte offset example
Table 38: Mandatory AFU DFH register map
Table 39: AFU_ID_L CSR definition
Table 40: AFU_ID_H CSR definition
Table 41: Next AFU CSR
Table 43: Mandatory BBB DFH Register Map
Table 44: BBB_ID_L CSR Definition
Table 45: BB_ID_H CSR Definition



About this Document

This document describes the Core Cache Interface (CCI-P) specification, which defines the interface between the Accelerated Function Unit (AFU) and the BDW + FPGA IP.

Intended Audience

The intended audience is system engineers, platform architects, and software developers. Users must design the hardware AFU to be compliant with the CCI-P specification.

Conventions

Conventions used in this document include the following:

# preceding a command indicates that the command is to be entered as root.

$ preceding a command indicates that the command is to be entered as a user.

Filenames, commands, and keywords are printed in a distinct font, as are long command lines. Although a very long command line may wrap to the next line, the line break is not part of the command; do not enter it.

<variable_name> indicates that the placeholder text between the angle brackets is to be replaced with an appropriate value. Do not enter the angle brackets.


Related Documentation

Item Description

BDW + FPGA Beta Release 5.0.3 Read This First

This document summarizes the available documentation and suggests how users might navigate through it.

BDW + FPGA Beta Release 5.0.3 Release Notes

This document lists the key features, limitations, changes from the previous release, and possible future changes.

BDW + FPGA Beta Release 5.0.3 Software Installation Guide

This document lists the software prerequisites needed by the AAL SDK and provides instructions on how to install the AAL SDK.

BDW + FPGA Beta Release 5.0.3 AFU Simulation Environment User’s Guide

This document provides instructions on how to use the Accelerated Function Unit (AFU) Simulation Environment (ASE).

BDW + FPGA Beta Release 5.0.3 Software Architecture Guide

This document presents the rationale behind AAL and the concepts upon which AAL is based.

BDW + FPGA Beta Release 5.0.3 Programmer’s Guide

This document shows how the concepts described in BDW + FPGA Beta Release 5.0.3 Software Architecture Guide can be implemented in code. It does not assume the existence of an AFU.

BDW + FPGA Beta Release 5.0.3 Core Cache Interface (CCI-P) Specification

This document describes the Core Cache Interface (CCI-P) specification which is the interface between the Accelerated Function Unit (AFU) and the FPGA Interface Unit.

BDW + FPGA Beta Release 5.0.3 How to Build, Load, and Debug a Bitstream

This document lists the steps to build, load, and debug a green bitstream with Quartus and the BDW + FPGA product.

BDW + FPGA Beta Release 5.0.3 Sample Programs Guide

This document describes how to run the sample programs provided with Release 5.0.3 of the BDW + FPGA Accelerator Abstraction Layer and the BDW + FPGA platform.

Arria 10 Avalon-ST Interface with SR-IOV PCIe Solutions User Guide

https://documentation.altera.com/#/00014789-AA$NT00089097

Intel Software Developers Manual

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

Intel Virtualization Technology for Directed-IO

http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf



Glossary

Acronym Expansion Description

AAL Accelerator Abstraction Layer A set of runtime and software development tools that facilitate the deployment of systems consisting of a collection of non-uniform, asymmetric compute resources. The AALSDK is the AAL Software Development Kit.

AFU Accelerated Function Unit Hardware Accelerator implemented in FPGA logic that accelerates or intends to accelerate an application kernel.

ALI AFU Link Interface This is the software interface between AAL and CCI-P.

ASE AFU Simulation Environment A co-development and simulation area available in Intel® QuickAssist AALSDK consisting of hardware and software.

CA Caching Agent A Caching Agent (CA) makes read and write requests to the coherent memory in the system. It is also responsible for servicing snoops generated by other agents in the system.

CCI-P Core Cache Interface Interface between the AFU and the FPGA Interface Unit (FIU).

CL Cache Line 64-byte cache line

DPI Direct Programming Interface A set of features in SystemVerilog that allows export/import of parameters to/from a C function

FIU FPGA Interface Unit The Intel UPI & PCIe on FPGA together form the FIU sub-block.

FPGA Field Programmable Gate Array http://en.wikipedia.org/wiki/Fpga

PA Physical Address Physical address of the host machine

IPC Inter-Process Communication Refers to Linux constructs such as shared memory (/dev/shm) and message queues (/dev/mqueue); these are leveraged for ASE core functionality.

KiB 1024 bytes KiB denotes 1024 bytes and KB denotes 1000 bytes. When referring to memory, KB is often written and KiB is implied. When referring to clock frequency, kHz is used, and here k is 1000.


Mdata Message Tag Data A user-defined field that is relayed from the Tx header to the Rx header. It may be used to tag requests with a transaction ID or channel ID.

Msg Message A control notification.

NLB Native Loopback Adapter Sample RTL

PAR Place & Route In this context, refers to a stage in building a bitstream. Placement decides where to place components on the FPGA; routing determines how to connect the placed components.

RdLine_I Read Line Invalid Memory Read Request, with FPGA cache hint set to Invalid, i.e. do not cache it. The line will not be cached in the FPGA, but the request may cause FPGA cache pollution (see footnote 1).

RdLine_S Read Line Shared Memory Read Request, with FPGA cache hint set to Shared. An attempt will be made to keep it in FPGA cache in Shared state.

Rx Receive Receive or input from AFU’s perspective

Tx Transmit Transmit or output from AFU’s perspective

Upstream Direction up to CPU Logical direction towards the CPU. For example, an upstream port is a port going to the CPU.

UMsg Unordered Message from CPU to AFU An unordered notification with a 64-byte payload

UMsgH Unordered Message Hint from CPU to AFU This is a Hint to a subsequent UMsg. No data payload.

UPI Intel® Ultra Path Interconnect Intel’s proprietary coherent interconnect protocol between Intel cores or other IP.

WrLine_I Write Line Invalid Memory Write Request, with FPGA cache hint set to Invalid. FIU will write the data with no intention of keeping the data in FPGA cache.

WrLine_M Write Line Modified Memory Write Request, with FPGA cache hint set to Modified. FIU will write the data and leave it in the FPGA cache in Modified state.

WrPush_I Write Push Invalid Memory Write Request, with FPGA cache hint set to Invalid. FIU writes the data into the processor’s last level cache (LLC) with no intention of keeping the data in FPGA cache. The LLC it writes to is always the LLC associated with the processor where the DRAM address is homed.

Footnote 1: The cache tag is used to track the request status for all outstanding requests on UPI. Therefore, even though RdLine_I is marked Invalid upon completion, it consumes the cache tag temporarily to track the request status over UPI. This action may result in the eviction of a cache line, resulting in cache pollution. The advantage of using RdLine_I is that it is not tracked by the CPU directory; thus it prevents snooping from the CPU.


1 Introduction

CCI-P is the hardware-side signaling interface between the Accelerated Function Unit (AFU) and the FPGA Interface Unit (FIU). This document defines the signaling interface. It specifies the access types, the request format and the memory model, and the mandatory AFU CSRs. It provides timing diagrams and AFU design guidelines.

CCI-P provides an abstraction of the physical links between the FPGA and CPU. An AFU sees a unified interface with four virtual channels and a unified address space. CCI-P supports data payloads of up to four cachelines (4 CL). Table 1 lists some key CCI-P features.

Table 1: CCI-P Features

Feature                                    CCI-P
Data transfer size                         64, 128, 256B
Addressing Mode                            Physical Addressing Mode
Addressing Width (CL aligned addresses)    42 bits
Caching Hints                              Yes
Virtual Channels                           VA, VL0, VH0, VH1
Response Ordering                          Out of order responses
MMIO Read & Write                          Supported
FPGA to CPU Interrupt                      Supported
Interface Clk frequency                    400MHz
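
Table 1 gives the address width for cacheline-aligned addresses and the supported transfer sizes. The following Python sketch works through that arithmetic. It assumes 64-byte cachelines (per the Glossary) and assumes that the 42-bit width counts cacheline addresses rather than byte addresses. The helper names are hypothetical and are not part of any CCI-P package.

```python
CL_BYTES = 64                              # cacheline size from the Glossary
CL_ADDR_BITS = 42                          # Table 1: width of CL-aligned addresses
OFFSET_BITS = CL_BYTES.bit_length() - 1    # 6 bits select a byte within a line


def byte_to_cl_address(byte_addr: int) -> int:
    """Convert a cacheline-aligned byte address to a cacheline address.

    Hypothetical helper for illustration; assumes the 42-bit width in
    Table 1 applies to the cacheline address, not the byte address.
    """
    if byte_addr < 0:
        raise ValueError("address must be non-negative")
    if byte_addr % CL_BYTES != 0:
        raise ValueError("address is not cacheline aligned")
    cl_addr = byte_addr >> OFFSET_BITS
    if cl_addr >= (1 << CL_ADDR_BITS):
        raise ValueError("address exceeds the 42-bit cacheline address range")
    return cl_addr


def transfer_cachelines(num_bytes: int) -> int:
    """Return the number of cachelines in a Table 1 transfer size."""
    if num_bytes not in (64, 128, 256):
        raise ValueError("transfer size must be 64, 128, or 256 bytes")
    return num_bytes // CL_BYTES
```

For example, `byte_to_cl_address(0x1000)` returns `0x40`, and `transfer_cachelines(256)` returns `4`. Under the stated assumption, 42 cacheline address bits plus 6 offset bits span a 48-bit byte address space.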

CCI-P introduces two architectural concepts: Device Feature Lists (DFL) and Basic Building Blocks (BBBs).

DFL defines a structure for grouping like functionalities and enumerating them.

BBB defines an architecture for wrapping features into building blocks. You can incorporate these building blocks into your AFUs.

BBBs are source-visible reference designs; other than a few mandatory registers, there are no other requirements imposed on a BBB. For example, the Memory Properties Factory (MPF) is a BBB that translates virtual memory addresses to physical memory addresses for memory shared between the Xeon Processor and the FPGA. MPF also does read response ordering and provides data hazard resolution. Section 4 provides more information on BBBs.
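
Table 1 notes that responses return out of order, and the Glossary describes Mdata as a user-defined tag relayed from the Tx header to the Rx header. The sketch below models, in Python rather than RTL, how such a tag could be used to hand read responses back in request order. It illustrates the general idea only; it is not the MPF implementation, and the class and method names are hypothetical.

```python
from collections import deque


class ReorderBuffer:
    """Software model: deliver responses in issue order using a tag.

    Hypothetical illustration; the tag plays the role of Mdata.
    """

    def __init__(self, num_tags: int):
        self.free_tags = deque(range(num_tags))
        self.issue_order = deque()   # tags in the order requests were sent
        self.arrived = {}            # tag -> response data waiting for delivery

    def issue(self) -> int:
        """Allocate a tag for a new request."""
        if not self.free_tags:
            raise RuntimeError("no free tags; the requester must wait")
        tag = self.free_tags.popleft()
        self.issue_order.append(tag)
        return tag

    def respond(self, tag: int, data) -> list:
        """Record a response and return all data now deliverable in order."""
        if tag not in self.issue_order or tag in self.arrived:
            raise ValueError("unexpected or duplicate response tag")
        self.arrived[tag] = data
        ready = []
        # Drain from the oldest outstanding request while its data is present.
        while self.issue_order and self.issue_order[0] in self.arrived:
            head = self.issue_order.popleft()
            ready.append(self.arrived.pop(head))
            self.free_tags.append(head)
        return ready
```

In this model, a response that arrives early is held until every older request has completed, so the number of tags bounds the number of outstanding requests.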

1.1 Xeon® Processor + FPGA Block Diagram

FPGA logic (as shown in Figure 1) is divided into two parts: the Intel-provided FPGA Interface Unit (FIU) represented by the blue box (called the blue bitstream) and the user-developed AFU represented by the green box (called the green bitstream).

Note that although the FIU is called the blue bitstream, it is not actually a bitstream. A bitstream is a file that can be loaded onto an FPGA. The blue bitstream is not a file; it is the set of RTL files that make up the Intel IP. You must combine these RTL files with a green bitstream to get a bitstream that can be loaded onto the FPGA. Such a loadable bitstream is called a base bitstream or a full-chip bitstream, and it contains both blue and green parts.

The green bitstream is also not a base or full-chip bitstream; but you can replace the green part of a previously loaded base bitstream with another green part. This green part exists as a separate file.

The FIU implements all the key features required for deployment and manageability of FPGA in a Xeon datacenter. The FIU implements the interface protocols for links between the CPU and FPGA. The FIU also provides platform capabilities such as VT-d, security, error monitoring, performance monitoring, power and thermal management, partial reconfiguration of AFUs, etc.

Note the three physical links: PCIe0, PCIe1, and UPI. These physical links are presented as virtual channels on the CCI-P interface. Refer to Section 1.3 for more information about physical and virtual channels.

The SMBus interface running between the Xeon processor and the FPGA is SMBus-like; it does not follow the published SMBus specifications. It is used for out-of-band temperature monitoring, configuration during the bootstrap process, and platform debug purposes.



[Figure 1 diagram omitted. Labels recoverable from the diagram: Xeon; Intel IP: FPGA Interface Unit (FIU); FPGA Management Engine (FME) with thermal monitor, power monitor, performance monitor, Partial Reconfiguration, and global errors; PCIe Gen3x8 EP0 and EP1; QPI 6.4G; UPI 9.2G; Coherent intf; Cache controller; IOMMU & Device TLB; Fabric; CCI-U 64B@200MHz and 64B@250MHz; PR Unit; SMBus slave; CCI-P Port0 with SignalTap, UMsg, port reset, and port errors; CCI-P; AFU 0; Control Channel; Data Channel. Legend entries: BDX only blocks; SKX only blocks; Optional, parameterized.]

Figure 1: High-Level Block Diagram of Xeon®+ FPGA Logic

Refer to Table 2 for a list of platform capabilities.

Unified Address space

Even though FIU has three physical links going to the CPU, the AFU maintains a single view of the system address space. A write to address X directed over Coherent Interface or PCIe goes to the same cacheline in the system memory.

Intel Virtualization for Directed IO (VT-d) support

SKX+FPGA has hardware support for memory isolation.





Partial Reconfiguration (PR) of AFU

PR uses Altera FPGA technology to allow a user to reconfigure parts of the FPGA device dynamically, while the remainder of the FPGA continues to operate. Each CCI-P interface port supports one PR enabled AFU.

Remote Debug

The Xeon + FPGA product enables the PSG (formerly Altera) in-system debug tool, Remote Access SignalTap (RSTP), via the Xeon processor. The remote access capability obviates the need for physical access to the machine when debugging an FPGA design.

Table 2: Comparison of Platform Capabilities

Capability               BDW+FPGA             SKX+FPGA
Unified Address space    Yes                  Yes
VT-d support for AFU     No                   Yes
Partial Reconfiguration  Yes                  Yes
Support for two AFUs     No                   No
Remote Debug             Yes                  Yes
FPGA Cache size          64KiB direct mapped  128KiB direct mapped




1.2 Development models

The two AFU development models supported are HDL design and OpenCL design.

1. HDL design

This is the traditional FPGA development flow, where users design an AFU in an HDL such as Verilog, SystemVerilog, or VHDL, adhering to the CCI-P interface specification. Users then compile their code (the RTL) through the Quartus tool chain to generate an AFU bitstream.

2. OpenCL design

The PSG OpenCL SDK is a framework for writing programs at a higher level of C-like abstraction. Users develop an AFU in OpenCL C and compile it along with the Xeon + FPGA BSP to generate an FPGA bitstream and a software executable. For best performance, the OpenCL code must be optimized for the Xeon + FPGA platform.





1.3 Memory hierarchy

This section explains the memory hierarchy in the Xeon + FPGA system. Refer to Figure 2. The green dotted box shows the multi-processor coherence domain. The FIU on the FPGA extends the coherence domain from the processor to the FPGA, encompassing a cache implemented on the FPGA (called the FPGA cache).

The FIU implements a cache controller and UPI Caching Agent (CA). The CA makes read and write requests to coherent system memory and services snoop requests to the FIU cache.

[Figure 2 diagram. Labels recovered from the figure: Processor (N Cores, Last Level Cache, PCIe RP, DDR, DRAM); FPGA (FIU with cache and VC steering, CCI-P, AFU); links UPI and PCIe 1; Multi-processor Coherence Domain; numbered markers 1, 2, 3. Legend: PCIe RP = PCIe root port; VC = Virtual Channel.]

Figure 2: Xeon + FPGA system memory hierarchy, single-processor topology

The CCI-P interface abstracts the physical links to the processor and provides simple load/store semantics to the AFU for accessing system memory.

The physical links are presented as virtual channels on the CCI-P interface. Each request can select the virtual channel. The virtual channels are called VL0, VH0, and VH1. There is a fourth, called VA (Virtual Auto), where the FIU chooses one of the other three. Refer to Table 3. The response header identifies which VC was selected by the FIU.

For a single-processor system, AFU sees a three-level memory hierarchy: (1) FIU Cache (2) Processor Last Level Cache (LLC) (3) DRAM.

The memory access latency increases as you go from (1) to (3).

Note that the AFU accesses 2nd and 3rd level memory along two independent paths, each with a different latency. Table 3 lists the different possible AFU Memory Read operations in increasing order of latency. Each row shows the request path, and the node that services the request is highlighted in GREEN.




Table 3 AFU Memory Read paths

Request              FPGA Cache  Virtual Channel  Processor LLC  DRAM
FPGA Cache Hit       Hit
Processor Cache Hit  Miss        VL0              Hit
                                 VH*
All Cache Miss       Miss        VL0              Miss           Read
                                 VH*

VH* means either VH0 or VH1.

If you are still gaining experience with the CCI-P interface, choose the VA channel. This channel is optimized for maximum bandwidth and for producer-consumer data flows. Refer to Section 2.11 for ordering rules. When you choose VA, the FIU steers your request to a physical link based on the following:

Caching hint: Cacheable requests are biased towards the UPI link.

Data payload size: 64B requests are biased towards the UPI link. A cache line is 64 bytes. A multi-cacheline read or write is NOT split; it is guaranteed to be processed by a single physical link.

Link utilization: VA attempts to balance the load across the virtual channels.

The cache is along the VL0 data path. The VC steering decision is made before the cache lookup. You could incur a high memory latency if the requested cache line is cached in the FPGA and the request is steered to VH*. In this case, the processor has to snoop the FPGA cache in order to complete the VH* request.




2 CCI-P Interface

CCI-P provides access to two types of memory: main memory and IO memory.

Main Memory: After this section, main memory is referred to simply as memory. This is the memory attached to the processor and exposed to the operating system. Requests from the AFU to main memory are called upstream requests.

IO Memory: IO memory is implemented within the IO device, which in our case is the AFU. How this memory is implemented and organized is up to the AFU. The AFU may choose flip-flops, M20Ks or MLABs.

The CCI-P interface defines a request format to access IO memory using Memory Mapped IO (MMIO) requests. Requests from the processor to IO Memory are called downstream requests.

The AFU’s MMIO address space is 256KiB.

Figure 3 shows all CCI-P signals grouped into three Tx Channels, two Rx Channels and some additional control signals.

Tx/Rx: The flow direction is from the AFU point of view. Tx flows from AFU to FIU. Rx flows from FIU to AFU.

Channels: A grouping of signals that together completely defines a request or response.

Figure 3 reflects the organization shown in the files ccip_std_afu.sv and ccip_if_pkg.sv.



Figure 3: CCI-P Signals





2.1 Features

Table 4 summarizes the features unique to the CCI-P interface for AFUs.

Table 4: CCI-P Features summary

Virtual Channels: Physical links are presented to the AFU as virtual channels. The AFU can select the virtual channel for each memory request.

VL0: Low latency virtual channel (mapped to UPI).

VH0: High latency virtual channel (mapped to PCIe0). Protocol efficiency is better for larger data payloads.

VH1: High latency virtual channel (mapped to PCIe1). Protocol efficiency is better for larger data payloads.

VA: Virtual Auto. The FIU auto-selects the link based on link utilization, request caching hint, and payload size. Latency: expect to see high variance. BW: expect to see high steady-state BW.

Memory Request: AFU read/write to memory.

Addressing Mode: Physical address

Address Width: 42 bits (CL address)

Data Lengths: 64B, 128B, 256B

Byte Addressing: Not supported

FPGA Caching Hint: The AFU can ask the FIU to cache the CL in a specific state. For requests directed to VL0, the FIU attempts to cache the data in the requested state, given as a hint. Except for WrPush_I, cache hint requests on VH0/1 are ignored.

Note that the caching hint is only a hint and provides no guarantee of final cache state. Ignoring a cache hint impacts performance but does not impact functionality.

<request>_I: No intention to cache

<request>_S: Desire to cache in S state

<request>_M: Desire to cache in M state

MMIO Request: CPU read/write to AFU IO Memory.

MMIO Read payload: 4B, 8B

MMIO Write payload: 4B, 8B, 64B. MMIO writes could be combined by the x86 Write Combining buffer.

UMsg: Unordered Message. This is a spin loop optimization. It is an improvement over the AFU polling an address location in main memory. When the CPU writes to the memory, the AFU receives a UMsg.

UMsg data payload: 64B

Number of UMsgs supported: 8 per AFU

2.2 Signaling information

All CCI-P signals must be synchronous to pClk.

All signals are active high, unless explicitly mentioned. Active low signals use a suffix _n.

We recommend using the CCI-P structures defined inside ccip_if_pkg.sv file. This is included in the RTL package.

All AFU output signals must be registered.

AFU output bits marked as RSVD are reserved and must be driven to 0.

AFU output bits marked as RSVD-DNC, are don’t care bits. The AFU can drive either 0 or 1.

All AFU input signals must also be registered.

AFU input bits marked as RSVD must be treated as don’t care (X) by the AFU.

Code 1 shows the port map for the ccip_std_afu module. The AFU must be instantiated inside this module. The subsequent sections explain the interface signals.

Code 1: ccip_std_afu port map

module ccip_std_afu(
  // CCI-P Clocks and Resets
  input  logic        pClk,                 // 400MHz - CCI-P clock domain. Primary interface clock
  input  logic        pClkDiv2,             // 200MHz - CCI-P clock domain.
  input  logic        pClkDiv4,             // 100MHz - CCI-P clock domain.
  input  logic        uClk_usr,             // User clock domain.
  input  logic        uClk_usrDiv2,         // User clock domain. Half the programmed frequency
  input  logic        pck_cp2af_softReset,  // CCI-P ACTIVE HIGH Soft Reset
  input  logic [1:0]  pck_cp2af_pwrState,   // CCI-P AFU Power State
  input  logic        pck_cp2af_error,      // CCI-P Protocol Error Detected

  // Interface structures
  input  t_if_ccip_Rx pck_cp2af_sRx,        // CCI-P Rx Port
  output t_if_ccip_Tx pck_af2cp_sTx         // CCI-P Tx Port
);





2.3 Read from/Write to Main Memory

The AFU makes a memory read request to the FIU over C0, using Tx signals, and receives the response over C0, using Rx signals.

The AFU drives the C0 valid signal to indicate that C0 Hdr contains a request. The c0_ReqMemHdr structure provides a convenient mapping from flat bit-vector to read request fields. The req_type signal provides a cache hint (RDLINE_I for Invalid, or RDLINE_S for Shared). The mdata field is a user-defined request ID.

Then, the FIU responds over C0. The resp_type signal in the c0_RspMemHdr structure indicates response type (Memory Read or UMsg Received). The data field in C0 contains the data that were read. The mdata field in the c0_RspMemHdr structure contains the same value that went out with the request.

The AFU makes a memory write request to the FIU over C1, using Tx signals, and receives the response over C1, using Rx signals.

AFU drives the C1 valid signal to indicate that C1 Hdr contains a request. The c1_ReqMemHdr structure provides a convenient mapping from flat bit-vector to write request fields. The req_type signal provides request type and cache hint.

Then, the FIU responds over C1 using Rx signals. The resp_type field in the c1_RspMemHdr structure indicates whether the response is for a memory write. The mdata field in the c1_RspMemHdr structure contains the same value that went out with the write request.

Write memory requests need explicit synchronization using WrFence.

2.4 UMsg

UMsg provides the same functionality as a spin loop from the AFU, without burning the CCI-P read bandwidth. Think of it as a spin loop optimization, where a monitoring agent inside the FPGA cache controller is monitoring snoops to cachelines allocated by the driver. When it sees a snoop to the cacheline, it reads the data back and sends a UMsg to the AFU.

UMsg flow makes use of the cache coherency protocol to implement a high speed unordered messaging path from CPU to AFU. This process consists of two stages as shown in Figure 4.

The first stage is initialization, where SW pins the UMsg Address Space (UMAS) and shares the UMAS start address with the FPGA cache controller. Once this is done, the FPGA cache controller reads each cache line in the UMAS and places it in the FPGA cache in Shared state.

The second stage is actual usage, where the CPU writes to the UMAS. A CPU write to the UMAS generates a snoop to the FPGA cache. The FPGA responds to the snoop and marks the line as invalid. The CPU write request completes, and the data become globally visible. A snoop in the UMAS address range triggers the monitoring agent (MA), which in turn sends out a read request to the CPU for the cache line (CL) and optionally sends out a UMsg with Hint (UMsgH) to the AFU. When the read request completes, a UMsg with 64B data is sent to the AFU.




Figure 4 : UMsg initialization and usage flow

Functionally, UMsg is equivalent to a spin loop or to the MONITOR and MWAIT instructions on a Xeon.

Some key characteristics of UMsgs:

1. Just as spin loops to different addresses in a multi-threaded application have no relative ordering guarantee, UMsgs to different addresses have no ordering guarantee between them.

2. Not every CPU write to a UMAS CL results in a corresponding UMsg. The AFU may miss an intermediate change in the value of a CL, but it is guaranteed to see the newest data in the CL. Again, it helps to think of this like a spin loop: if the producer thread updates the flag CL multiple times, the polling thread may miss an intermediate change in value, but it is guaranteed to see the newest value.

Here is an example usage. Software updates to a descriptor queue pointer may be mapped to a UMsg. The pointer is always expected to increment. UMsg guarantees that the AFU sees the final value of the pointer. It may miss intermediate updates to the pointer, which is acceptable.

3. UMsg uses the FPGA cache. As a result, it could cause cache pollution, a situation in which a program unnecessarily loads data into the cache and causes other needed data to be evicted, thus degrading performance.

4. Because the CPU may exhibit false snooping, UMsgH should be treated as a hint. That is, you can start speculative execution or pre-fetch based on UMsgH, but you should wait for UMsg before committing the results.

5. UMsg provides the same latency as an AFU read polling using RdLine_S, but it saves CCI-P channel bandwidth, which can be used for read traffic.
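The "newest value, possibly missing intermediate values" behavior described in item 2 can be pictured with a small software model. The Python sketch below is only an analogy and makes no claim about how the FPGA cache controller is built; the class UMsgLineModel and its method names are hypothetical and introduced purely for illustration.

```python
class UMsgLineModel:
    """Toy model of one UMAS cache line as observed by the AFU (hypothetical)."""

    def __init__(self):
        self._value = None      # latest data written by the CPU
        self._pending = False   # True when a UMsg is owed to the AFU

    def cpu_write(self, value):
        # Each CPU write overwrites the line; earlier values are not queued.
        self._value = value
        self._pending = True

    def deliver_umsg(self):
        # Stands in for the monitoring agent reading the line back and
        # sending a UMsg. Returns the newest value, or None if nothing changed.
        if not self._pending:
            return None
        self._pending = False
        return self._value
```

In this model, three back-to-back calls to cpu_write followed by a single deliver_umsg return only the last value written, which mirrors the descriptor queue pointer example above.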

[Figure 4 diagram. Participants: CPU, Memory, FPGA QPI Agent, AFU. Initialization: set up the UMAS (pinned memory); inform the FPGA of the UMAS location. Usage: the CPU writes to the UMAS; the CPU write causes a snoop to the FPGA; for ultra low latency, the snoop itself is used as a UMsgH; the FPGA gets the read data; the snoop plus read data is sent as a UMsg with 64B data.]

2.5 MMIO Cycles to IO Memory

MMIO Write requests: posted. The AFU must not return a response.

MMIO Read requests: non-posted. The AFU must return a response.

Key points:

Read data widths supported = 4B, 8B

Write data widths supported = 4B, 8B

The AFU must support 8B MMIO accesses to IO memory and register file.

4B accesses are optional. They can be avoided by coordinating with the SW application developer.

Maximum outstanding MMIO read requests is limited to 64.

MMIO read request timeout value = 512 pClk cycles

Maximum MMIO request rate = 1 request per 2 pClks

MMIO Reads to undefined AFU registers should still return a response.

The FIU makes an MMIO read request to the AFU over C0, using Rx signals. mmioRdValid indicates that C0 Hdr contains an MMIO read request. The c0_ReqMmioHdr structure provides a convenient mapping from flat bit-vector to MMIO read request fields: {address, length, tid}.

Then, the AFU drives a response over C2 using Tx signals. The C2 signal mmioRdValid indicates that the C2 Hdr and data fields contain the MMIO Read response. The c2_RspMmioHdr.tid field must match the value provided in c0_ReqMmioHdr.tid; it is used to match the response against the request.

It is illegal to split an 8B MMIO Read request into two 4B MMIO Read responses.

The FIU makes an MMIO write request to the AFU over C0, using Rx signals. mmioWrValid indicates that the c0_ReqMmioHdr structure is an MMIO write request and contains the IO address to be written. The C0 data field contains the data to be written.

For generating 64B MMIO Writes to the AFU, refer to Section 11.3.1 in the Intel Software Developers Manual Volume 3. The relevant excerpt follows.

11.3.1 Buffering of Write Combining Memory Locations

:

:

Once the processor has started to evict data from the WC buffer into system memory, it will make a bus-transaction style decision based on how much of the buffer contains valid data. If the buffer is full (for example, all bytes are valid), the processor will execute a burst-write transaction on the bus. This results in all 32 bytes (P6 family processors) or 64 bytes (Pentium 4 and more recent processors) being transmitted on the data bus in a single burst transaction. If one or more of the WC buffer’s bytes are invalid (for example, have not been written by software), the processor will transmit the data to memory using “partial write” transactions (one chunk at a time, where a “chunk” is 8 bytes).

This will result in a maximum of 4 partial write transactions (for P6 family processors) or 8 partial write transactions (for the Pentium 4 and more recent processors) for one WC buffer of data sent to memory.

2.6 CCI-P Tx Signals

There are 3 Tx channels:

The C0 and C1 Tx channels are used for memory requests. They provide independent flow control. The C0 Tx channel is used for memory read requests; the C1 Tx channel is used for memory write requests.

The C2 Tx channel is used to return MMIO Read responses to the FIU. The CCI-P port guarantees to accept responses on C2; therefore, it has no flow control.

Code 2: Tx interface structure inside ccip_if_pkg.sv

typedef struct packed {
  t_if_ccip_c0_Tx c0;
  t_if_ccip_c1_Tx c1;
  t_if_ccip_c2_Tx c2;
} t_if_ccip_Tx;





Each Tx channel has a valid signal to qualify the corresponding hdr and data signals within the structure.

Table 5 describes the signals that make up the CCI-P Tx interface.

Code 3: Tx channel structures inside ccip_if_pkg.sv

// Channel 0 : Memory Reads
typedef struct packed {
  t_ccip_c0_ReqMemHdr  hdr;          // Request Header
  logic                valid;        // Request Valid
} t_if_ccip_c0_Tx;
// corresponding AlmostFull inside t_if_ccip_Rx.c0TxAlmFull

// Channel 1 : Memory Writes
typedef struct packed {
  t_ccip_c1_ReqMemHdr  hdr;          // Request Header
  t_ccip_clData        data;         // Request Data
  logic                valid;        // Request Wr Valid
} t_if_ccip_c1_Tx;
// corresponding AlmostFull inside t_if_ccip_Rx.c1TxAlmFull

// Channel 2 : MMIO Read response
typedef struct packed {
  t_ccip_c2_RspMmioHdr hdr;          // Response Header
  logic                mmioRdValid;  // Response Read Valid
  t_ccip_mmioData      data;         // Response Data
} t_if_ccip_c2_Tx;




Table 5: Tx Channel Signal Description

Signal Width Direction Description

pck_af2cp_sTx.c0.hdr 74b Output Channel 0 request header. Refer to Table 6 Tx Header Field Definitions.

pck_af2cp_sTx.c0.valid 1b Output When set to 1, it indicates that the channel 0 hdr is valid.

pck_cp2af_sRx.c0TxAlmFull 1b Input When set to 1, Tx Channel 0 is almost full. After this signal is set, the AFU is allowed to send a maximum of 8 requests. When set to 0, the AFU can start sending requests immediately.

pck_af2cp_sTx.c1.hdr 80b Output Channel 1 request header. Refer to Table 6 Tx Header Field Definitions.

pck_af2cp_sTx.c1.data 512b Output Channel 1 data.

pck_af2cp_sTx.c1.valid 1b Output When set to 1, it indicates that the channel 1 hdr and data are valid.

pck_cp2af_sRx.c1TxAlmFull 1b Input When set to 1, Tx Channel 1 is almost full. After this signal is set, the AFU is allowed to send a maximum of 8 requests or data. When set to 0, the AFU can start sending requests immediately.

pck_af2cp_sTx.c2.hdr 9b Output Channel 2 response header. Refer to Table 6 Tx Header Field Definitions.

pck_af2cp_sTx.c2.mmioRdValid 1b Output When set to 1, it indicates that the Channel 2 hdr and data are valid.

pck_af2cp_sTx.c2.data 64b Output MMIO Rd Data bus, used to read AFU registers. For 32b reads, data must be driven on bits [31:0]. For 64b reads, the AFU must drive one 64b data response. The response cannot be split into two 32b responses.
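The almost-full rule for c0TxAlmFull and c1TxAlmFull can be checked with a simple per-cycle model. The Python sketch below is one possible reading of the rule, assuming the allowance of 8 applies to requests issued while the signal stays high and that the count restarts once the signal returns to 0; the class name AlmostFullChecker is hypothetical and not part of the RTL package.

```python
class AlmostFullChecker:
    """Checks a Tx channel against the almost-full allowance (assumed model)."""

    MAX_AFTER_ALMFULL = 8  # allowance stated in Table 5

    def __init__(self):
        self.sent_while_full = 0

    def cycle(self, alm_full: bool, valid: bool) -> bool:
        """Advance one pClk cycle; return True if the allowance is respected."""
        if not alm_full:
            # Channel has room again: the AFU may send immediately.
            self.sent_while_full = 0
            return True
        if valid:
            self.sent_while_full += 1
        return self.sent_while_full <= self.MAX_AFTER_ALMFULL
```

A checker like this could sit in a software testbench that replays captured valid and almost-full traces; an AFU that stops issuing as soon as it observes the registered almost-full signal stays well inside the allowance.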





2.7 Tx Header Format

Table 6 Tx Header Field Definitions

Field Description

mdata: Metadata. User-defined request ID that is returned unmodified from the request to the response hdr. For multi-CL writes on C1 Tx, mdata is only valid for the hdr when sop=1.

tid: Transaction ID. The AFU must return the tid from the MMIO Read request in the response hdr. It is used to match the response against the request.

vc_sel: Virtual channel selected.
2’h0 – VA
2’h1 – VL0
2’h2 – VH0
2’h3 – VH1
All CLs that form a multi-CL write request are routed over the same VC.

req_type: Request types listed in Table 7.

sop: Start of Packet for multi-CL memory write.
1’b1 – marks the first hdr. Must write in increasing address order.
1’b0 – subsequent hdrs

cl_len: Length for memory requests.
2’b00 – 64B
2’b01 – 128B
2’b11 – 256B

address: 64B-aligned physical address, that is, byte_address>>6. The address must be self-aligned with respect to the cl_len field. For example, for cl_len=2’b01 the byte address must be divisible by 128B; similarly, for cl_len=2’b11 the byte address must be divisible by 256B.
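The address rules in Table 6 reduce to a shift and a modulo check. The following Python sketch illustrates them under the stated encodings; the helper names byte_to_cl_address and is_self_aligned are hypothetical and do not exist in the CCI-P RTL package.

```python
CL_BYTES = 64

# cl_len encodings from Table 6, mapped to the number of cache lines.
CL_LEN_TO_NUM_CLS = {0b00: 1, 0b01: 2, 0b11: 4}


def byte_to_cl_address(byte_address: int) -> int:
    """Return the 42-bit CL address (byte_address >> 6) for a 64B-aligned address."""
    if byte_address % CL_BYTES != 0:
        raise ValueError("byte address is not 64B aligned")
    cl_address = byte_address >> 6
    if cl_address >= 1 << 42:
        raise ValueError("CL address does not fit in 42 bits")
    return cl_address


def is_self_aligned(cl_address: int, cl_len: int) -> bool:
    """True when the CL address is a multiple of the request size in CLs."""
    num_cls = CL_LEN_TO_NUM_CLS[cl_len]  # KeyError for the unused encoding 2'b10
    return cl_address % num_cls == 0
```

Checking divisibility of the CL address by 2 or 4 is equivalent to checking divisibility of the byte address by 128B or 256B, because the CL address is the byte address divided by 64.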




Table 7: Tx Request Encodings & Mapping to Header Formats

Request Type  Encoding  Data  Description

t_if_ccip_c0_tx: enum t_ccip_c0_req. Hdr format: C0 Memory Request Header. Refer to Table 8.

eREQ_RDLINE_I  4’h0  No  Memory read request with no intention to cache.

eREQ_RDLINE_S  4’h1  No  Memory read request with caching hint set to Shared.

t_if_ccip_c1_tx: enum t_ccip_c1_req. Hdr format: C1 Memory Request Hdr (refer to Table 9) for write requests; Fence Hdr (refer to Table 10) for eREQ_WRFENCE.

eREQ_WRLINE_I  4’h0  Yes  Memory write request with no intention of keeping the data in FPGA cache.

eREQ_WRLINE_M  4’h1  Yes  Memory write request with caching hint set to Modified.

eREQ_WRPush_I  4’h2  Yes  Memory write request with caching hint set to Invalid. The FIU writes the data into the processor’s last level cache (LLC) with no intention of keeping the data in FPGA cache. The LLC it writes to is always the LLC associated with the processor where the DRAM address is homed.

eREQ_WRFENCE  4’h4  No  Memory write fence. This request does not have a data payload.

t_if_ccip_c2_tx: does not have a request type field. Hdr format: MMIO Rd Response Hdr. Refer to Table 11.

MMIO Rd  N.A.  Yes  MMIO read response

All unused encodings are considered Reserved.





Table 8:C0 Read Memory Request Header Format Structure: t_ccip_c0_ReqMemHdr

Bit # bits Field

[73:72] 2 vc_sel

[71:70] 2 RSVD

[69:68] 2 cl_len

[67:64] 4 req_type

[63:58] 6 RSVD

[57:16] 42 address

[15:0] 16 mdata
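To make the Table 8 layout concrete, the following Python sketch packs and unpacks the 74-bit C0 read request header as a plain integer. It is a software illustration only; the function names and the dictionary-based encodings are assumptions made for this example, while the bit positions and encodings are taken from Table 6, Table 7, and Table 8.

```python
VC_SEL = {"VA": 0b00, "VL0": 0b01, "VH0": 0b10, "VH1": 0b11}
C0_REQ_TYPE = {"RDLINE_I": 0x0, "RDLINE_S": 0x1}
CL_LEN = {1: 0b00, 2: 0b01, 4: 0b11}  # number of CLs -> cl_len encoding


def pack_c0_read_hdr(vc_sel, num_cls, req_type, cl_address, mdata):
    """Pack a C0 read request header; RSVD bits [71:70] and [63:58] stay 0."""
    if not 0 <= cl_address < (1 << 42):
        raise ValueError("address must fit in 42 bits")
    if not 0 <= mdata < (1 << 16):
        raise ValueError("mdata must fit in 16 bits")
    return ((VC_SEL[vc_sel] << 72)            # [73:72] vc_sel
            | (CL_LEN[num_cls] << 68)         # [69:68] cl_len
            | (C0_REQ_TYPE[req_type] << 64)   # [67:64] req_type
            | (cl_address << 16)              # [57:16] address
            | mdata)                          # [15:0]  mdata


def unpack_c0_read_hdr(hdr):
    """Split a packed C0 read request header back into its fields."""
    return {
        "vc_sel": (hdr >> 72) & 0x3,
        "cl_len": (hdr >> 68) & 0x3,
        "req_type": (hdr >> 64) & 0xF,
        "address": (hdr >> 16) & ((1 << 42) - 1),
        "mdata": hdr & 0xFFFF,
    }
```

For example, pack_c0_read_hdr("VL0", 2, "RDLINE_S", 0x1040, 0x10) describes a 128B read on VL0 with a Shared caching hint; unpacking the result returns the same field values.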

Table 9: C1 Write Memory Request Header Format Structure: t_ccip_c1_ReqMemHdr

Bit      # bits  Field, SOP=1   Field, SOP=0
[79:74]  6       RSVD           RSVD
[73:72]  2       vc_sel         RSVD-DNC
[71]     1       sop=1          sop=0
[70]     1       RSVD           RSVD
[69:68]  2       cl_len         RSVD-DNC
[67:64]  4       req_type       req_type
[63:58]  6       RSVD           RSVD
[57:18]  40      address        RSVD-DNC
[17:16]  2       address        address
[15:0]   16      mdata          RSVD-DNC

Table 10: C1 Fence Header Format Structure: t_ccip_c1_ReqFenceHdr




Bit # bits Field

[79:74] 6 RSVD

[73:72] 2 vc_sel

[71:68] 4 RSVD

[67:64] 4 req_type

[63:16] 48 RSVD

[15:0] 16 mdata

Table 11: C2 MMIO Response Header Format

Bit # bits Field

[8:0] 9 tid





2.8 CCI-P Rx Signals

Code 4: Rx interface structure inside ccip_if_pkg.sv

typedef struct packed {
  logic           c0TxAlmFull;  // C0 Request Channel Almost Full
  logic           c1TxAlmFull;  // C1 Request Channel Almost Full
  t_if_ccip_c0_Rx c0;
  t_if_ccip_c1_Rx c1;
} t_if_ccip_Rx;

There are 2 Rx channels.

Channel 0 interleaves memory responses, MMIO requests and UMsgs.

Channel 1 returns responses for AFU requests initiated on Tx Channel 1.

The c0TxAlmFull and c1TxAlmFull signals are inputs to the AFU. Although they are declared with the Rx signals structure, they logically belong to the Tx interface and so were described in the previous section.

Rx Channels have no flow control. The AFU must accept responses for memory requests it generated. The AFU must pre-allocate buffers before generating a memory request. The AFU must also accept MMIO requests.

Code 5: Rx channel structure inside ccip_if_pkg.sv

// Channel 0: Memory Read response, MMIO Request
typedef struct packed {
  t_ccip_c0_RspMemHdr hdr;          // Rd Response / MMIO / UMsg req Header
  t_ccip_clData       data;         // Rd Data / MMIO / UMsg req Data
  logic               rspValid;     // Rd Response / UMsg Valid
  logic               mmioRdValid;  // MMIO Read Valid
  logic               mmioWrValid;  // MMIO Write Valid
} t_if_ccip_c0_Rx;

// Channel 1: Memory Writes
typedef struct packed {
  t_ccip_c1_RspMemHdr hdr;          // Response Header
  logic               rspValid;     // Response Valid
} t_if_ccip_c1_Rx;

Rx Channel 0 has separate valid signals for memory responses and MMIO requests. Only one of those valid signals can be set in a cycle. The MMIO request further has two valid signals, one for MMIO Rd and the other for MMIO Wr. When either is set, the hdr must be interpreted as an MMIO hdr instead of a memory response header.




Table 12: Rx Channel Signal Description


Signal Width Direction Description

pck_cp2af_sRx.c0.hdr 28b Input Channel 0 response header or MMIO request header. Refer to Table 13 Rx Header Field Definitions.

pck_cp2af_sRx.c0.data 512b Input Channel 0 data bus. Memory Read Response and UMsg: returns 64B data. MMIO Write Request: for a 32b write, data is driven on bits [31:0]; for a 64b write, data is driven on bits [63:0].

pck_cp2af_sRx.c0.rspValid 1b Input When set to 1, it indicates that the hdr and data on Channel 0 are valid. The hdr must be interpreted as a memory response; decode the resp_type field.

pck_cp2af_sRx.c0.mmioRdValid 1b Input When set to 1, it indicates an MMIO Rd request on Channel 0.

pck_cp2af_sRx.c0.mmioWrValid 1b Input When set to 1, it indicates an MMIO Wr request on Channel 0.

pck_cp2af_sRx.c1.hdr 28b Input Channel 1 response header. Refer to Table 13 Rx Header Field Definitions.

pck_cp2af_sRx.c1.rspValid 1b Input When set to 1, it indicates that the hdr on Channel 1 is a valid response.





2.8.1 Rx Header and RxData Format

Table 13 Rx Header Field Definitions

Field Description

mdata: Metadata. User-defined request ID, returned unmodified from the memory request to the response header. For a multi-CL memory response, the same mdata is returned for each CL.

vc_used: Virtual channel used. When using VA, this field identifies the virtual channel selected for the request by the FIU. For other VCs it returns the request VC.

format: When using multi-CL memory write requests, the FIU may return a single response for the entire payload or a response per CL in the payload.
1’b0 – Unpacked write response: returns a response per CL. Look up the cl_num field to identify the cache line.
1’b1 – Packed write response: returns a single response for the entire payload. The cl_num field gives the payload size, that is, 1 CL, 2 CLs, or 4 CLs.

cl_num, format=0: For a response with a data payload of more than 1 CL, this field identifies the CL. Responses may be returned out of order.
2’h0 – 1st CL. Lowest Address
2’h1 – 2nd CL
2’h2 – 3rd CL
2’h3 – 4th CL. Highest Address

cl_num, format=1: This field identifies the data payload size.
2’h0 – 1 CL or 64B
2’h1 – 2 CL or 128B
2’h3 – 4 CL or 256B

hit_miss: Cache Hit/Miss status. The AFU can use this to generate fine-grained hit/miss statistics for various modules.
1’h0 – Cache Miss
1’h1 – Cache Hit

MMIO Length: Length for MMIO requests.
2’h0 – 4B
2’h1 – 8B

MMIO Address: DWord-aligned MMIO address offset, that is, byte address>>2.

UMsg ID: Identifies the CL corresponding to the UMsg.

UMsg Type: Two types of UMsg are supported.
1’b1 – UMsgH (Hint) without data
1’b0 – UMsg with Data




Table 14: AFU Rx Response Encodings and Channels Mapping

Response Type  Encoding  Data Payload  Header Format

t_if_ccip_c0_Rx: enum t_ccip_c0_rsp

eRSP_RDLINE  4’h0  Yes  Memory Response Header. Refer to Table 15. Qualified with c0.rspValid.

MMIO Read  N.A.  No  MMIO Request Header. Refer to Table 16.

MMIO Write  N.A.  Yes  MMIO Request Header. Refer to Table 16.

eRSP_UMSG  4’h4  Yes/No  UMsg Response Header. Refer to Table 18. Qualified with c0.rspValid.

t_if_ccip_c1_Rx: enum t_ccip_c1_rsp

eRSP_WRLINE  4’h0  No  Memory Response Header. Refer to Table 17.

eRSP_WRFENCE  4’h4  No  WrFence Response Header. Refer to Table 19.

Table 15: C0 Memory Read Response Header Format Structure: t_ccip_c0_RspMemHdr

Bit # bits Field

[27:26] 2 vc_used

[25] 1 RSVD

[24] 1 hit_miss

[23:22] 2 RSVD

[21:20] 2 cl_num

[19:16] 4 resp_type

[15:0] 16 mdata

Table 16: MMIO Request Header Format

Bit # bits Field

[27:12] 16 address

[11:10] 2 length

[9] 1 RSVD

[8:0] 9 TID
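The MMIO request header in Table 16 can be decoded with a few shifts and masks. The Python sketch below is illustrative only; the function name decode_mmio_req_hdr is hypothetical, and the length encodings are limited to the two listed in Table 13.

```python
MMIO_LENGTH_BYTES = {0x0: 4, 0x1: 8}  # MMIO Length encodings from Table 13


def decode_mmio_req_hdr(hdr: int) -> dict:
    """Decode a 28-bit MMIO request header laid out as in Table 16."""
    address = (hdr >> 12) & 0xFFFF   # [27:12] DWord-aligned address
    length = (hdr >> 10) & 0x3       # [11:10] length
    tid = hdr & 0x1FF                # [8:0]   transaction ID
    if length not in MMIO_LENGTH_BYTES:
        raise ValueError("length encoding not listed in Table 13")
    return {
        "byte_offset": address << 2,  # DWord address -> byte offset
        "num_bytes": MMIO_LENGTH_BYTES[length],
        "tid": tid,
    }
```

The 16-bit DWord address covers 2^16 x 4B = 256KiB, which matches the MMIO address space size stated in Section 2. For an MMIO read, the decoded tid is the value the AFU returns in its C2 response header.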





Table 17: C1 Memory Write Response Header Format Structure: t_ccip_c1_RspMemHdr

Bit # bits Field

[27:26] 2 vc_used

[25] 1 RSVD

[24] 1 hit_miss

[23] 1 format

[22] 1 RSVD

[21:20] 2 cl_num

[19:16] 4 resp_type

[15:0] 16 mdata
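Because a multi-CL write may complete with either one packed response or several unpacked responses, an AFU typically counts acknowledged cache lines per request. The Python sketch below models that bookkeeping using the Table 17 layout; the names decode_c1_write_rsp and WriteTracker are hypothetical, and the model assumes that each outstanding write uses a distinct mdata value.

```python
def decode_c1_write_rsp(hdr: int) -> dict:
    """Decode a 28-bit C1 write response header laid out as in Table 17."""
    return {
        "vc_used": (hdr >> 26) & 0x3,
        "hit_miss": (hdr >> 24) & 0x1,
        "format": (hdr >> 23) & 0x1,
        "cl_num": (hdr >> 20) & 0x3,
        "resp_type": (hdr >> 16) & 0xF,
        "mdata": hdr & 0xFFFF,
    }


class WriteTracker:
    """Tracks outstanding write requests by mdata (assumed unique per request)."""

    def __init__(self):
        self._remaining = {}  # mdata -> number of CLs not yet acknowledged

    def issue(self, mdata: int, num_cls: int) -> None:
        self._remaining[mdata] = num_cls

    def on_response(self, hdr: int) -> bool:
        """Account for one response; return True when the request is complete."""
        rsp = decode_c1_write_rsp(hdr)
        # Packed (format=1): cl_num encodes the payload size (0, 1, 3 -> 1, 2, 4 CLs).
        # Unpacked (format=0): one response per CL, in any order.
        acked = rsp["cl_num"] + 1 if rsp["format"] else 1
        self._remaining[rsp["mdata"]] -= acked
        done = self._remaining[rsp["mdata"]] == 0
        if done:
            del self._remaining[rsp["mdata"]]
        return done
```

Counting cache lines rather than responses lets the same logic handle both response formats without knowing in advance which one the FIU will choose.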

Table 18: UMsg Header Format

Bit # bits Field

[27:20] 8 RSVD

[19:16] 4 resp_type

[15] 1 UMsg Type

[14:3] 12 RSVD

[2:0] 3 UMsg ID
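Responses qualified by c0.rspValid share one 28-bit header, so the AFU first decodes resp_type and then applies either the Table 15 or the Table 18 layout. The following Python sketch shows that dispatch; the function name decode_c0_rsp_hdr and the returned dictionary keys are assumptions made for illustration.

```python
RSP_RDLINE = 0x0  # eRSP_RDLINE encoding from Table 14
RSP_UMSG = 0x4    # eRSP_UMSG encoding from Table 14


def decode_c0_rsp_hdr(hdr: int) -> dict:
    """Decode a C0 Rx header that was qualified by rspValid."""
    resp_type = (hdr >> 16) & 0xF
    if resp_type == RSP_RDLINE:
        # Table 15: memory read response.
        return {
            "kind": "read",
            "vc_used": (hdr >> 26) & 0x3,
            "hit_miss": (hdr >> 24) & 0x1,
            "cl_num": (hdr >> 20) & 0x3,
            "mdata": hdr & 0xFFFF,
        }
    if resp_type == RSP_UMSG:
        # Table 18: bit 15 is 1 for UMsgH (hint, no data), 0 for UMsg with data.
        is_hint = (hdr >> 15) & 0x1
        return {
            "kind": "umsg_hint" if is_hint else "umsg_data",
            "umsg_id": hdr & 0x7,
        }
    raise ValueError("reserved resp_type encoding")
```

MMIO requests are not handled here because they are qualified by mmioRdValid or mmioWrValid rather than by rspValid.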

Table 19: WrFence Header Format Structure: t_ccip_c1_RspFenceHdr

Bit # bits Field

[27:20] 8 RSVD

[19:16] 4 resp_type

[15:0] 16 mdata




2.9 Multi-Cacheline Memory Requests

To achieve highest link efficiency, pack the memory requests into large transfer sizes. Use the multi-CL requests for this. Listed below are the characteristics of multi-CL memory requests:

VH0, VH1 and VA attain highest memory BW when using a data payload of 4CLs.

Memory write request should always begin with the lowest address first. SOP=1 in the c1_ReqMemHdr marks the first CL. All subsequent headers in the multi-CL request must drive the corresponding CL address.

An N CL memory write request takes N cycles on Channel 1. It is legal to have bubbles between the cycles that form a multi-CL request, but it cannot be interleaved with another request. It is illegal to start a new request without completing the entire data payload for a multi-CL write request.

FIU guarantees to complete the multi-CL VA requests on a single VC.

The memory request address must be self-aligned. A 2CL request should start on a 2CL boundary. Its CL address must be divisible by 2. A 4CL request should be aligned on a 4CL boundary. Its CL address must be divisible by 4.
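The rules above determine the sequence of headers that an N-CL write places on Channel 1. The Python sketch below generates that sequence as a list of dictionaries, one per cycle; the function name multi_cl_write_beats and the dictionary representation are hypothetical, and fields marked RSVD-DNC in Table 9 are simply omitted from the later beats.

```python
def multi_cl_write_beats(cl_address, num_cls, req_type, vc_sel, mdata):
    """Return one header description per CL of a multi-CL write request."""
    if num_cls not in (1, 2, 4):
        raise ValueError("payload must be 1, 2, or 4 CLs")
    if cl_address % num_cls != 0:
        raise ValueError("CL address is not self-aligned for this payload")
    cl_len = num_cls - 1  # 1, 2, 4 CLs -> 2'b00, 2'b01, 2'b11
    beats = []
    for i in range(num_cls):
        # Every beat drives sop, req_type, and its own CL address,
        # in increasing address order starting from the lowest address.
        beat = {"sop": 1 if i == 0 else 0,
                "req_type": req_type,
                "address": cl_address + i}
        if i == 0:
            # vc_sel, cl_len, and mdata are only meaningful when sop=1.
            beat.update({"vc_sel": vc_sel, "cl_len": cl_len, "mdata": mdata})
        beats.append(beat)
    return beats
```

The caller remains responsible for not interleaving beats of different requests; bubbles between beats are permitted, so the list says nothing about cycle spacing.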

Figure 5 is an example of a multi-CL Memory Write request.

[Figure 5 waveform. Signals shown against pClk: pck_af2cp_sTx.c1.valid, pck_af2cp_sTx.c1.data (D0 through D8), and the pck_af2cp_sTx.c1.hdr fields sop, cl_len, addr[41:2], addr[1:0], req_type, vc_sel, and mdata. Four write requests are shown, with the following header values:

cl_len  addr[41:2]  req_type  vc_sel  mdata
‘h3     ‘h1040      WrLine_I  VA      ‘h10
‘h1     ‘h1041      WrLine_M  VH0     ‘h11
‘h0     ‘h1043      WrLine_M  VL0     ‘h12
‘h1     ‘h1044      WrLine_I  VH1     ‘h13
]

Figure 5 : Multi-CL Memory Write Requests





Figure 6 is an example of Memory Write Response cycles. For unpacked responses, the individual CLs may return out of order.

[Figure 6 waveform. Signals shown against pClk: pck_cp2af_sRx.c1.valid and the pck_cp2af_sRx.c1.hdr fields hit_miss, cl_num, resp_type (WrLine), vc_used, mdata, and format.]

Figure 6: Multi-CL Memory Write Responses

Figure 7 is an example of a Memory Read Response cycle. The read response can be re-ordered within itself; that is, there is no guaranteed ordering between individual CLs of a multi-CL Read. All CLs within a multi-CL response have the same mdata and the same vc_used. Individual CLs of a multi-CL Read are identified using the cl_num field.

[Figure 7 waveform. Signals shown against pClk: the pck_cp2af_sRx.c0.hdr fields hit_miss, cl_num, resp_type (RdLine), vc_used, and mdata.]

Figure 7: Multi-CL Memory Read Responses




2.10 Additional Control Signals

Unless otherwise mentioned, all signals are active high.

Table 20: Clock and Reset

Signal                 Width  Direction  Description

pck_cp2af_softReset    1b     Input
    Synchronous ACTIVE HIGH soft reset. When set to 1, the AFU must reset all logic. The minimum reset pulse width is 256 pClk cycles. All outstanding CCI-P requests will be flushed before soft reset is de-asserted. A soft reset will not reset the FIU.

pClk                   1b     Input
    Primary interface clock. All CCI-P interface signals are synchronous to this clock. Clock frequency is listed in Section 2.13.

pClkDiv2               1b     Input
    Synchronous and in phase with pClk. 0.5x clock frequency.

pClkDiv4               1b     Input
    Synchronous and in phase with pClk. 0.25x clock frequency.

uClk_usr               1b     Input
    The user defined clock is not synchronous with pClk. The AFU must synchronize the signals to the pClk domain before driving the CCI-P interface. Default frequency is 312.5 MHz. The Quartus partial reconfiguration flow does not allow PLLs to be instantiated in the reconfigurable region (that is, the AFU). The AFU load utility will program the user defined clock frequency before de-asserting pck_cp2af_softReset.

uClk_usrDiv2           1b     Input
    Synchronous with uClk_usr and 0.5x the frequency.

pck_cp2af_pwrState     2b     Input
    Indicates the current AFU power state request. In response to this, the AFU must attempt to reduce its power consumption. If sufficient power reduction is not achieved, the AFU may be reset.
        2'h0 - AP0 - Normal operation mode
        2'h1 - AP1 - Request for 50% power reduction
        2'h2 - Reserved, illegal
        2'h3 - AP2 - Request for 90% power reduction
    When pck_cp2af_pwrState is set to AP1, the FIU will start throttling the memory request path to achieve a 50% throughput reduction. The AFU is also expected to reduce its power utilization by 50%, by throttling back accesses to FPGA internal memory resources and its compute engines. Similarly, upon transition to AP2, the FIU will throttle the memory request paths to achieve a 90% throughput reduction relative to the normal state, and the AFU in turn is expected to reduce its power utilization by 90%.

pck_cp2af_error        1b     Input
    A CCI-P protocol error has been detected and logged in the PORT Error register. This register is visible to the AFU. The signal can be used as a trigger for signal taps. When such an error is detected, the CCI-P interface stops accepting new requests and sets AlmFull to 1. There is no expectation to complete outstanding requests. The AFU is not reset.
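
The pck_cp2af_pwrState encoding in Table 20 can be summarized as a small lookup. This is a sketch of the encoding only, not of any FIU throttling mechanism; the names POWER_STATES and decode_pwr_state are hypothetical.

```python
# Encoding taken from Table 20: value -> (state name, requested power reduction in %)
POWER_STATES = {
    0b00: ("AP0", 0),    # normal operation
    0b01: ("AP1", 50),   # request for 50% power reduction
    0b11: ("AP2", 90),   # request for 90% power reduction
}

def decode_pwr_state(value: int):
    """Decode a 2-bit pwrState value; 2'h2 is reserved and treated as an error."""
    if value not in POWER_STATES:
        raise ValueError(f"illegal or reserved pwrState value: {value:#04b}")
    return POWER_STATES[value]
```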




Protocol Flow

2.10.1 Upstream Requests

Table 21: Protocol Flow for upstream requests from AFU to FIU

Type              Tx Request                    Tx Data  Rx Response  Rx Data
Memory Write      WrLine_I, WrLine_M, WrPush_I  Yes      WrLine       No
Memory Read       RdLine_I, RdLine_S            No       RdLine       Yes
Special Messages  WrFence                       No       WrFence      No

Tx Data (column 3) identifies whether the request expects a Tx Data payload.
Rx Data (column 5) identifies whether the response returns a data payload.





Table 22 CCI-P VL0 protocol flows

Columns: CCI-P Request; FPGA cache Hit/Miss and State; then, for each of Phase 1, Phase 2, and Phase 3: UPI Cycle, FPGA cache Next State, CCI-P Response.

CCI-P     FPGA cache       Phase 1                        Phase 2                         Phase 3
Request   Hit/Miss  State  UPI Cycle  Next State Response UPI Cycle   Next State Response UPI Cycle   Next State

WrLine_I  Hit       M      None       M          WrLine   WbMtoI      I
          Hit       S      InvItoE
          Miss      S, I

WrLine_M  Hit       M      None       M          WrLine   N.A.
          Hit       S      InvItoE
          Miss      S, I

WrLine_I  Miss      M      WbMtoI     I                   InvItoE     M          WrLine   WbMtoI      I
WrLine_M
WrPush_I                                                                                  WbPushMtoI  I

WrPush_I  Hit       M      None       M          WrLine   WbPushMtoI  I
          Hit       S, I   InvItoE
          Miss      S, I

RdLine_S  Hit       S, M   None       No Change  RdLine   N.A.
          Miss      S, I   RdCode     S          RdLine

RdLine_I  Hit       S, M   None       No Change  RdLine   N.A.
          Miss      S, I   RdCur      I          RdLine

RdLine_I  Miss      M      WbMtoI     I                   RdCur       I          RdLine
RdLine_S                                                  RdCode      S

WrLine_I: Requires special handling, because it must first write to the CL and then evict it from the cache. The eviction forms Phase 2 of the request.

RdLine_I: Recommended as the default read type.

RdLine_S: Use sparingly, only for cases where you have identified highly referenced CLs.

RdCode: Updates the CPU directory and lets the FPGA cache the line in Shared state.

RdCur: Does NOT update the CPU directory; the FPGA will not cache this line. A future access to this line from the CPU will not snoop the FPGA.




2.10.2 Downstream Requests

Table 23: Protocol Flow for Downstream Requests from CPU to AFU

Rx Request   Rx Data  Tx Response     Tx Data
MMIO Read    No       MMIO Read Data  Yes
MMIO Write   Yes      None            N.A.
UMsg         Yes      None            N.A.
UMsgH        No       None            N.A.





2.11 Ordering Rules

2.11.1 Memory Requests

The CCI-P memory consistency model is different from the PCIe consistency model. CCI-P implements a “relaxed” memory consistency model.

It relaxes ordering requirements for requests to:

Same address

Different addresses

Table 24 below defines the ordering relationship between two memory requests on CCI-P. The same rules apply for requests to the “same” address or “different” addresses. The table entries are defined as follows:

Yes: the second (row) request is allowed to pass the first (column) request.
No: the second (row) request is not allowed to pass the first (column) request.

Table 24 Ordering rules for upstream requests from AFU

Row bypass column?  Read (col 2)  Write (col 3)  WrFence (col 4)
Read (row 2)        Yes           Yes            Yes
Write (row 3)       Yes           Yes            No
WrFence (row 4)     Yes           No             No

Interpret Table 24 as follows.

1. The Read (2nd row) can bypass an earlier Read (2nd column), a write (3rd column), and a WrFence (4th column).

2. The Write (3rd row) can bypass an earlier Read (2nd column) and a write (3rd column). It cannot bypass a WrFence (4th column).

3. The WrFence (4th row) can bypass an earlier Read (2nd column). It cannot bypass a Write (3rd column) and a WrFence (4th column).
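
Table 24 can also be written as a lookup keyed by (later request, earlier request). The sketch below simply transcribes the table; the names CAN_PASS and may_bypass are hypothetical.

```python
# Outer key: the later (row) request. Inner key: the earlier (column) request.
CAN_PASS = {
    "Read":    {"Read": True, "Write": True,  "WrFence": True},
    "Write":   {"Read": True, "Write": True,  "WrFence": False},
    "WrFence": {"Read": True, "Write": False, "WrFence": False},
}

def may_bypass(later: str, earlier: str) -> bool:
    """Return True if the later request is allowed to pass the earlier one."""
    return CAN_PASS[later][earlier]
```

For instance, may_bypass("Read", "WrFence") is true, which is the reason a read issued after a WrFence can still return old data.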




Intra-VC Write observability

Upon receiving a memory write response, the write has reached a local observability point.

- All future reads from the AFU to the same VC will get the new data.
- All future writes on the same VC will replace the data.

Inter-VC Write observability

A memory write response does NOT mean the data are globally observable across the VCs. A subsequent read on a different VC may return old data. Use a WrFence VA to synchronize across VCs. A WrFence VA does a broadcast.

- It goes beyond waiting for write responses. It pushes all earlier writes to the global observability point.
- Upon receiving a WrFence response, all future reads from the AFU get the new data.

2.11.1.1 Write Fence usage

To enforce ordering between memory writes, use a WrFence. Because using a WrFence is an expensive operation, restrict its use to synchronization points.

WrFence guarantees that all writes preceding the fence are committed to memory before any writes following the Write Fence are processed.

A WrFence will not be re-ordered with other memory writes or WrFence requests.

WrFence provides no ordering assurances with respect to Read requests.

A WrFence does NOT block the reads. In other words, memory reads can bypass a WrFence. This is shown as item 1 in the interpretation of Table 24.

A WrFence request has a vc_sel field. This field selects the virtual channel to which the WrFence applies. For example, if you move a data block on VL0, you only need to serialize with respect to other write requests on VL0; that is, you must use WrFence with VL0. Similarly, if you use memory writes with VA, then use WrFence with VA.

A WrFence request returns a response. The response is delivered to the AFU over C1 and identified by the resp_type field. Recall that a read can bypass a WrFence. However, if you want to ensure that you read the latest written data, you can issue a WrFence and then wait for the WrFence response before driving a read.
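
The vc_sel guidance above can be sketched as a helper that picks the fence channel from the channels used by the writes being ordered. This is one reading of the text, not a normative rule; the function name is hypothetical, and VCs are represented as plain strings.

```python
def fence_vc_sel(write_vcs):
    """Pick vc_sel for a WrFence that must order writes sent on write_vcs."""
    vcs = set(write_vcs)
    if not vcs:
        raise ValueError("no writes to fence")
    if len(vcs) == 1:
        # All writes used one channel (this includes the case where they all
        # used VA): fence on that same channel.
        return next(iter(vcs))
    # Writes were spread over several channels: a VA fence covers all VCs.
    return "VA"
```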

2.11.1.2 Memory Consistency Explained

CCI-P can re-order requests to the same and different addresses. It does not implement logic to identify data hazards for requests to same address.





2.11.1.2.1 Two Writes on Different VCs

Example 1: Figure 8 shows that two writes on different VCs may be committed to memory in an order different from the order in which they were issued.

AFU:        VH1: Write1 X, Data=A    VL0: Write2 X, Data=B
Processor:  Read1 X, Data=B          Read2 X, Data=A

Figure 8: Write Out of Order Commit

AFU writes to X twice, Data=A over VH1 and Data=B over VL0. The processor polls on X and may see updates to X in reverse order; that is, the CPU may see Data=B, followed by Data=A. In summary, the write order seen by the processor may be different from the order in which AFU completed the writes.

To enforce write ordering, the AFU must explicitly identify the ordering boundary and add a WrFence between the Writes.

Example 2: Figure 9 shows the use of WrFence to enforce write ordering.

AFU:        VH1: Write1 X, Data=A    VA: WrFence    VL0: Write2 X, Data=B
Processor:  Read1 X, Data=A          Read2 X, Data=B

Figure 9: Use WrFence to Enforce Write Ordering

This time AFU adds a VA WrFence between the two writes. The WrFence ensures that the processor sees the writes before the WrFence followed by the writes after the WrFence. Hence, the processor sees Data=A and then Data=B. VA WrFence was used here, because the Writes to be serialized were sent on different VCs.




2.11.1.2.2 Two Writes on the Same VC

Memory may see two writes to the same VC in a different order from their execution, unless the second write request was generated after the first write response was received.

Example 1: Figure 10 shows two writes on the same VC, where the second write is issued only after the first write response is received.

AFU:        VH1: Write1 X, Data=A    Resp 1    VH1: Write2 X, Data=B    Resp 2
Processor:  Read1 X, Data=A          Read2 X, Data=B

Figure 10: Two Writes on Same VC, Only One Outstanding

The AFU writes to X twice on the same VC, but it sends the second write only after the first write response is received. This ensures that the first write was sent out on the link before the next one goes out. CCI-P guarantees that these writes will be seen by the processor in the right order. The processor will see Data A, followed by Data B.

You may also use a WrFence to enforce ordering between writes to the same VC. Note, however, that WrFence has stronger semantics: it stalls processing of all writes after the fence until all previous writes have completed.





2.11.1.2.3 Two Reads on Different VCs

Two reads on different VCs may complete out of order; the last read response may return old data.

Example 1: Figure 11 shows how reads from the same address over different VCs may result in re-ordering.

Processor:     Store X=1, then Store X=2
AFU request:   VH1: Read1 X, then VL0: Read2 X
AFU response:  VL0: Resp2 X, Data=2, then VH1: Resp1 X, Data=1

Figure 11: Read Re-Ordering to Same Address, Different VCs

The processor writes X=1 and then X=2. The AFU reads X twice over different VCs; in Figure 11, Read1 is sent on VH1 and Read2 on VL0. CCI-P may re-order the responses and return data out of order. The AFU may see X=2, followed by X=1. This is different from the processor write order.




2.11.1.2.4 Two Reads on the Same VC

Reads to the same VC may complete out of order; the last read response will always return the “new” data.

Note, however, that VA reads behave like two reads on different VCs.

Example 1: Figure 12 shows how reads from the same address over the same VC may result in re-ordering. However, the AFU sees updates in the same order in which they were written.

Processor:    Store X=1, then Store X=2
AFU request:  VL0: Read1 X, then VL0: Read2 X

Figure 12: Read Re-Ordering to Same Address, Same VC

Processor writes X=1 and then X=2. The AFU reads X twice over the same VC; in Figure 12 both Read1 and Read2 are sent on VL0. The CCI-P may still re-order the Read responses, but CCI-P guarantees to return the newest data last; that is, AFU will see updates to X in the order in which Processor writes to it.

When using VA, CCI-P may return data out of order, because VA request may get directed on VL0, VH0 or VH1.

2.11.1.2.5 Read-after-Write on Same VC

CCI-P does not order read and write requests, even to the same address. The AFU must explicitly resolve such dependencies. To do this, the AFU must meet two requirements:

1. The AFU must use the same VC for the write and read requests. Do not use VA.
2. The AFU must send the read request only after the write response is received.

2.11.1.2.6 Read-after-Write on Different VCs

The AFU cannot resolve a read-after-write dependency when different VCs are used.

2.11.1.2.7 Write-after-Read on Same or Different VCs

CCI-P does not order write after read requests even when they are to the same address. The AFU must explicitly resolve such dependencies. The AFU must send the write request only after read response is received.
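
The response-based rules of Sections 2.11.1.2.5 through 2.11.1.2.7 can be collected in one predicate. This is a simplified sketch: it models only the "wait for the response" technique (not the WrFence alternative), and the function and argument names are hypothetical.

```python
def can_issue_dependent(first, second, first_vc, second_vc, first_resp_received):
    """Return True if the dependent second request may be sent now.

    first/second are "read" or "write"; VCs are strings such as "VL0" or "VA".
    """
    if not first_resp_received:
        return False                      # always wait for the earlier response
    if (first, second) == ("write", "read"):
        # Read-after-write: same VC for both requests, and never VA.
        return first_vc == second_vc and first_vc != "VA"
    if (first, second) == ("read", "write"):
        # Write-after-read: the read response alone is sufficient, on any VC.
        return True
    raise ValueError("only read-after-write and write-after-read are modelled")
```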





2.11.1.2.8 Some example scenarios:

1. More than one outstanding read/write requests to an address results in non-deterministic behavior.

Example 1: Two writes to the same address X can be completed out of order. The final value at address X is non-deterministic. To enforce ordering, add a WrFence between the write requests.

Example 2: Two reads from the same address X may be completed out of order. This is not a data hazard, but an AFU developer should make no ordering assumptions.

Example 3: A write to address X followed by a read from address X. The result is non-deterministic; that is, the read may return either the new data (data after the write) or the old data (data before the write) at address X.

Example 4: A read from address X followed by a write to address X. The result is non-deterministic; that is, the read may return either the new data (data after the write) or the old data (data before the write) at address X.

Use the read responses to resolve read dependencies.

Use a write Fence to implement a write memory barrier.

2. Read/write requests to different addresses may be completed out of order.

Example 1 AFU writes the data to address Z and then wants to notify the SW thread by updating a value of flag at address X.

To implement this, the AFU must use a write fence between write to Z and write to X. The write fence will ensure that Z is globally visible before write to X is processed.

Example 2 AFU reads data starting from address Z and then wants to notify the SW thread by updating the value of flag at address X.

To implement this, the AFU must perform the read from Z, wait for the read response and then perform the write to X.
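
To see why the write fence is needed in the flag-update example, the sketch below enumerates every completion order that the Table 24 rules allow for a short program. It is a toy model under simplifying assumptions (all requests share one fence scope, and only writes and fences are modelled); the names are hypothetical.

```python
from itertools import permutations

# (later, earlier) pairs that Table 24 forbids from passing.
FORBIDDEN = {("Write", "WrFence"), ("WrFence", "Write"), ("WrFence", "WrFence")}

def legal_orders(program):
    """program: list of (kind, label) in issue order. Returns allowed label orders."""
    n = len(program)
    result = []
    for order in permutations(range(n)):
        pos = {idx: p for p, idx in enumerate(order)}
        # A later request j may complete before an earlier request i only if allowed.
        ok = all(
            (program[j][0], program[i][0]) not in FORBIDDEN
            for i in range(n) for j in range(i + 1, n)
            if pos[j] < pos[i]
        )
        if ok:
            result.append([program[k][1] for k in order])
    return result
```

Without a fence, legal_orders([("Write", "Z"), ("Write", "X")]) yields both orders, so the flag X may become visible before the data Z. With ("WrFence", "fence") placed between the two writes, only the issue order remains.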




2.11.2 MMIO Requests

MMIO memory is exposed as pre-fetchable memory to the OS. This means that accesses to the MMIO region should have no read side-effects. This is the same as the 64b pre-fetchable BAR defined in the PCIe Specification.

MMIO Read cycles follow the UC (uncacheable) ordering rules. Refer to the Intel Software Developers Manual for more information on UC ordering rules.

MMIO Write cycles may follow either WC (write combining) or UC ordering rules.

Table 25: MMIO Ordering Rules

Request     Memory Attribute  Payload size               Memory Ordering   Comments
MMIO Write  UC                4B or 8B, or 64B using AVX Strongly ordered  Common case - AAL behavior
MMIO Write  WC                4B or 8B or 64B            Weakly ordered    Special case
MMIO Read   UC                4B or 8B                   Strongly ordered  Common case - AAL behavior
MMIO Read   WC                4B or 8B                   Weakly ordered    Special case - streaming read (MOVNTDQA) can cause wider reads. NOT supported.

MMIO requests within the FIU maintain the ordering set forth by the CPU.

MMIO read responses within the FIU are not ordered with respect to memory read or write requests. The AFU must resolve ordering dependencies with respect to memory requests before returning the MMIO Read response.





2.12 Timing diagrams

This section provides the timing diagrams for CCI-P interface signals.

[Figure 13 waveform, "Tx Channel 0 & 1 timing". Signals shown: pClk, pck_af2cp_sTx.c1.hdr (H0 to H9), pck_af2cp_sTx.c1.data (D0 to D9), pck_cp2af_sRx.c1.TxAlmostFull. Annotation: up to 8 valid cycles.]

Figure 13: Tx Channel 0 & 1 almost full threshold

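[Figure 14 waveform, "WrFence behavior". Signals shown: pClk, pck_af2cp_sTx.c1.hdr (requests Wr0, Wr1, Wr2, WrF, Wr3, Wr4), pck_af2cp_sTx.c1.data (D0 to D4), pck_cp2af_sRx.c1.valid (responses Wr1, Wr2, Wr0, WrF, then Wr4, Wr3). The write barrier is marked on both the request and the response sequence. WrF = Write Fence.]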

Figure 14: Write Fence Behavior

WrFence is inserted between WrLine requests. A WrFence response returns on the Rx channel. Note that in Figure 14, all the writes generated before the write fence get response completions before the writes after the write fence are completed.

WrFence fences only the writes on the selected VC. Choose VA if you want to fence across all VCs.




[Figure 15 waveform, "C0 channel interleaved between MMIO Requests & Memory Responses". Signals shown: pClk, pck_cp2af_sRx.c0.hdr (Rd0, Wr0, Wr1, Rd1, Rd2, Wr2, Wr3, Wr4, Rsp0), pck_cp2af_sRx.c0.mmioWrValid, pck_cp2af_sRx.c0.mmioRdValid, pck_cp2af_sRx.c0.rspValid, pck_cp2af_sRx.c0.data[63:0] (D0 to D4). Color legend: MMIO Wr Request, MMIO Rd Request, Memory Rd Response.]

Figure 15: C0 Rx Channel Interleaved between MMIO Requests and Memory Responses

[Figure 16 waveform, "MMIO Rd Response timeout". Signals shown: pClk, pck_cp2af_sRx.c0.hdr (Req), pck_cp2af_sRx.c0.mmioRdValid, pck_af2cp_sTx.c2.hdr (Rsp), pck_af2cp_sTx.c2.mmioRdValid, pck_af2cp_sTx.c2.data (Data). Annotation: maximum response latency from request to response is 512 pClk cycles.]

Figure 16: Rd Response Timeout

2.13 Clock Frequency

Table 26: Clock Frequency

CPU        pClk (MHz), "Interface Clock"  pClkDiv2 (MHz)  pClkDiv4 (MHz)
SKL+FPGA   400                            200             100
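
Cycle counts quoted elsewhere in this specification can be converted to wall-clock time with the pClk frequency from Table 26. The helper below is only a worked arithmetic example; its name is hypothetical.

```python
def cycles_to_ns(cycles: int, clk_mhz: float) -> float:
    """Convert a number of clock cycles to nanoseconds at clk_mhz MHz."""
    return cycles * 1000.0 / clk_mhz
```

At the 400 MHz pClk listed in Table 26, the 512-cycle MMIO read response limit corresponds to 1280 ns, and the 256-cycle minimum soft reset pulse to 640 ns.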





2.14 CCI-P Guidance

This section suggests techniques and settings that are useful when you are just beginning to use the BDW + FPGA system.

The CCI-P interface provides several advanced features for fine grained control of FPGA caching states and virtual channels. When used correctly, you can get optimal performance through the interface; if used incorrectly, you may see significant degradation in performance.

Table 27 lists some suggested parameters for request fields.

Table 27 Recommended Choices for Memory Requests

Field                     Use case                      Recommended option
vc_sel                    Producer-consumer type flows  VA
                          Latency sensitive flows       VL0
                          Data dependent flows          Use 1 VC, except VA
Length                    Maximum bandwidth             2'b11 - 256B
Request Type              Memory Reads                  RdLine_I
                          Memory Writes                 WrLine_M
CPU-to-FPGA notification                                Use MMIO Write for control notification only.
FPGA-to-CPU notification                                Implement a polling loop on the SW thread, reading MMIO Register in AFU.

When setting the size of the request buffers in the AFU, follow this guidance:

64 outstanding requests each on VH0 and VH1 for a total of 128 requests.

Typical 128 outstanding requests with a maximum of 256 outstanding requests on VL0.

Total number of outstanding requests on VA is 128 + 128 (or 256) = 256 (or 384) outstanding requests.
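
The buffer-sizing arithmetic above, written out as a small calculation. The constant and function names are hypothetical; the numbers are the ones given in the guidance.

```python
VH0_OUTSTANDING = 64
VH1_OUTSTANDING = 64
VL0_TYPICAL = 128
VL0_MAXIMUM = 256

def va_outstanding(vl0_outstanding: int) -> int:
    """Total outstanding requests on VA: VH0 + VH1 + VL0."""
    return VH0_OUTSTANDING + VH1_OUTSTANDING + vl0_outstanding
```

With the typical VL0 depth the total is 256 requests; with the maximum VL0 depth it is 384.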

In some cases UMsg may give better performance for CPU-to-FPGA notification. However, using UMsgs is an advanced technique that introduces additional complexity. It is best used by experienced users.



3 AFU Requirements

This section defines the AFU initialization flow upon power on, and mandatory AFU CSRs.

3.1 Mandatory AFU CSR Definitions

The following requirements are defined for software access to AFU CSRs.

1. Software is expected to access 64-bit CSRs as aligned quad words. For example, to modify a field (for example, bit or byte) in a 64-bit CSRs, the entire quad word is read, the appropriate field(s) are modified, and the entire quad word is written back.

2. Similarly for AFUs supporting 32-bit CSRs, software is expected to access them as aligned double words.

3. Locked operations to AFU CSRs are not supported. Software must not issue locked operations to access AFU CSRs.
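
The read-modify-write access pattern from requirement 1 can be sketched as follows. The MMIO space is modelled as a plain dictionary from byte offset to 64-bit value, which is an assumption made for illustration; a real program would use its platform's 64-bit MMIO read and write calls. The function name is hypothetical.

```python
MASK64 = (1 << 64) - 1

def modify_csr_field(mmio, byte_offset, lsb, width, value):
    """Update one field of a 64-bit CSR using whole, aligned quad-word accesses."""
    if byte_offset % 8 != 0:
        raise ValueError("64-bit CSRs must be accessed at 8-byte aligned offsets")
    if value >> width:
        raise ValueError("value does not fit in the field")
    field_mask = ((1 << width) - 1) << lsb
    old = mmio[byte_offset]                                # one 64-bit read
    new = ((old & ~field_mask) | (value << lsb)) & MASK64  # change only the field
    mmio[byte_offset] = new                                # one 64-bit write
    return new
```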

Each CCI-P-compliant AFU is required to implement the mandatory registers defined in Table 29. If you do not implement these registers, or if you implement them incorrectly, AFU discovery could fail or other unexpected behavior may occur.

Table 28: Register Attribute Definition

Attribute  Expansion  Description
RO         Read Only  The bit is set by hardware only. Software can only read this bit. Writes do not have any effect.
Rsvd       Reserved   Reserved for future definition. AFU must set them to 0s. SW must ignore these fields.

Table 29 shows both byte and DWORD offsets for the mandatory AFU CSRs. The base address is set by the platform and need not be specified by the AFU.

Table 29: Mandatory AFU CSRs

DWORD Address Offset (CCI-P)  Byte Address Offset (AAL)  Width  Attr  Name                   Definition
0x0000                        0x0000                     64b    RO    DEV_FEATURE_HDR (DFH)
0x0002                        0x0008                     64b    RO    AFU_ID_L               Refer to Table 31
0x0004                        0x0010                     64b    RO    AFU_ID_H               Refer to Table 32
0x0006                        0x0018                     64b    Rsvd  DFH_RSVD0              Refer to Table 33
0x0008                        0x0020                     64b    Rsvd  DFH_RSVD1              Refer to Table 34
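
The two offset columns of Table 29 differ by a factor of four, because a DWORD is four bytes. A minimal sketch of the conversion (the function names are hypothetical):

```python
def dword_to_byte_offset(dword_offset: int) -> int:
    """Convert a CCI-P DWORD address offset to a byte address offset."""
    return dword_offset * 4

def byte_to_dword_offset(byte_offset: int) -> int:
    """Convert a byte address offset to a CCI-P DWORD address offset."""
    if byte_offset % 4 != 0:
        raise ValueError("byte offset is not DWORD aligned")
    return byte_offset // 4
```

For example, AFU_ID_L sits at DWORD offset 0x0002, which is byte offset 0x0008.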

Code 6 shows how the AFU might set the mandatory AFU CSRs. You must define your own AFU ID. Note that the AFU uses DWORD addresses. Code 7 shows how an AAL program might read the AFU ID.

Code 6: Set the Mandatory AFU Registers in the AFU

    t_ccip_c0_ReqMmioHdr mmioHdr;
    :
    :
    case (mmioHdr.address)
      // AFU header
      16'h0000 : af2cp_sTxPort.c2.data <= {   // DFH
                   4'b0001,   // Feature Type = AFU
                   8'b0,      // Reserved
                   4'b0,      // AFU Minor Revision = 0
                   7'b0,      // Reserved
                   1'b1,      // End of DFH list = 1
                   24'b0,     // Next DFH offset = 0
                   4'b0,      // AFU Major version = 0
                   12'b0      // Bits 11:0 (CCI-P version # per Table 30; 0 in this example)
                 };
      16'h0002 : af2cp_sTxPort.c2.data <= 64'ha12e_bb32_8f7d_d35c;   // AFU_ID_L (arbitrary example)
      16'h0004 : af2cp_sTxPort.c2.data <= 64'ha455_783a_3e90_43b9;   // AFU_ID_H (arbitrary example)
      16'h0006 : af2cp_sTxPort.c2.data <= 64'h0;                     // Next AFU
      16'h0008 : af2cp_sTxPort.c2.data <= 64'h0;                     // Reserved

The AAL software and the AFU RTL must reference the same AFU ID.

Code 7: AAL Reads the AFU ID

    btUnsigned32bitInt AFUID_H, AFUID_L;
    :
    :
    IALIMMIO *m_pALIMMIOService;   //< Pointer to MMIO Service
    :
    :
    // The AFUID to be passed to the Resource Manager.
    // It will be used to locate the appropriate device.
    ConfigRecord.Add(keyRegAFU_ID,"A455783A-3E90-43B9-A12E-BB328F7DD35C");
    :
    // Byte offsets are used here; each 32-bit read returns the low DWORD of the CSR.
    m_pALIMMIOService->mmioRead32(0x0008, &AFUID_L);
    printf("Read AFUID_L= 0x%08x\n", AFUID_L);
    m_pALIMMIOService->mmioRead32(0x0010, &AFUID_H);
    printf("Read AFUID_H= 0x%08x\n", AFUID_H);




Table 30: Feature Header CSR Definition

Register Name: Device Feature Header (DFH)
Address Offset: 0x0

Bit    Attr  Default  Description
63:60  RO    0x1      Type: AFU
59:52  Rsvd  0x0      Reserved
51:48  RO    0x0      AFU Minor version #. User defined value.
47:41  Rsvd  0x0      Reserved
40     RO    N.A.     End of List
                      1'b0: There is another feature header beyond this (see "Next DFH Byte Offset")
                      1'b1: This is the last feature header for this AFU
39:16  RO    0x0      Byte offset to the Next Device Feature Header; that is, offset from the current address.
                      Example:
                      Feature 0 @ Address 0x0    Next Feature offset = 0x100
                      Feature 1 @ Address 0x100  Next Feature offset = 0x100
                      Feature 2 @ Address 0x200  Next Feature offset = N.A. for last feature
15:12  RO    N.A.     AFU Major version #. User defined value.
11:0   RO    0x070    CCI-P version #. Use the CCIP_VERSION_NUMBER parameter from ccip_if_pkg.sv.
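
The bit layout in Table 30 can be exercised with a small pack/unpack pair. This is a sketch for checking field positions, not driver code; the field names in DFH_FIELDS are hypothetical labels for the table rows, and reserved fields are left at 0.

```python
# name: (least significant bit, width), following Table 30
DFH_FIELDS = {
    "feature_type": (60, 4),
    "minor":        (48, 4),
    "eol":          (40, 1),
    "next_offset":  (16, 24),
    "major":        (12, 4),
    "version":      (0, 12),
}

def pack_dfh(**fields) -> int:
    """Build a 64-bit DFH value; unspecified fields default to 0."""
    value = 0
    for name, (lsb, width) in DFH_FIELDS.items():
        field = fields.get(name, 0)
        if field >> width:
            raise ValueError(f"{name} does not fit in {width} bits")
        value |= field << lsb
    return value

def unpack_dfh(value: int) -> dict:
    """Split a 64-bit DFH value into its named fields."""
    return {name: (value >> lsb) & ((1 << width) - 1)
            for name, (lsb, width) in DFH_FIELDS.items()}
```

For example, pack_dfh(feature_type=1, eol=1, version=0x070) describes a single AFU header with no further features and the default CCI-P version value.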





Table 31: AFU_ID_L CSR Definition

Register Name: AFU_ID_L
Address Offset: 0x8

Bit    Attr  Default  Description
63:0   RO    0h       Lower 64 bits of the AFU_ID GUID. Refer to Section 3.3.2.

Table 32: AFU_ID_H CSR Definition

Register Name: AFU_ID_H
Address Offset: 0x10

Bit    Attr  Default  Description
63:0   RO    0h       Upper 64 bits of the AFU_ID GUID. Refer to Section 3.3.2.

Table 33: DFH_RSVD0 CSR Definition

Register Name: DFH_RSVD0
Address Offset: 0x18

Bit    Attr  Default  Description
63:0   Rsvd  0x0      Reserved for future definition.

Table 34: DFH_RSVD1 CSR Definition

Register Name: DFH_RSVD1
Address Offset: 0x20

Bit    Attr  Default  Description
63:0   Rsvd  0x0      Reserved for future definition.




3.2 AFU Discovery Flow

A CCI-P compliant AFU must implement the mandatory AFU CSRs. Figure 17 shows the initial transactions immediately after pck_cp2af_softReset is de-asserted. The AFU must accept MMIO Read cycles immediately after soft reset is de-asserted.

[Figure 17 sequence diagram. Participants: Driver, FIU, User AFU, User Application. Steps labeled in the figure:
- Install Driver.
- De-assert Port Reset (pck_cp2af_softReset=0); AFU out of reset.
- MMIO Rd to FIU CSR; response with reset status. The driver checks whether all old requests are drained.
- Enumerate AFU DFH: MMIO Rd (DFH) reaches the AFU as MMIO Rd (0x0); the AFU returns Rsp(DFH type=AFU).
- Read AFU_ID: MMIO Rd (AFU_ID_L) reaches the AFU as MMIO Rd (0x8) and MMIO Rd (AFU_ID_H) as MMIO Rd (0x10); the AFU returns Rsp(AFU_ID_L) and Rsp(AFU_ID_H).
- Publish AFU resource / allocate AFU.
- Driver hands over AFU control to the application.]

Figure 17 : AFU discovery flow

3.3 AFU_ID

The purpose of an AFU_ID is to precisely identify the architectural interface of an AFU. This interface is the contract that the AFU makes with the software.

Multiple instantiations of an AFU can have the same AFU_ID value, but if the architectural interface of the AFU changes, then it needs a new AFU_ID.

The architectural interface of an AFU comprises the syntax and semantics of the AFU design, consisting of the AFU’s functionality, its CSR definitions, the protocol expected by the AFU when manipulating its CSRs, and all implicit or explicit assumptions or guarantees about its buffers.

The AAL framework and the application software use the AFU_ID to ensure that they are matched to the correct AFU; that is, that they are obeying the same architectural interface.

Technically, the AFU_ID is a 128 bit GUID, and can be generated using standard GUID creation tools (see below).





3.3.1 How to Create an AFU_ID / GUID

Linux: Use the command uuidgen.

    $ uuidgen
    1ad7bb9f-1371-4b3c-ab68-aaaa657f130b

Microsoft:

1. Use PowerShell. From a PowerShell console, enter

    > [guid]::NewGuid()
    cf545c57-9e6a-46cf-8601-32d3be765f4a

2. Or execute that PowerShell command from a standard Windows CMD shell:

    > powershell -Command "[guid]::NewGuid()"
    63ae3df9-1204-49ff-9144-45d23e27a4d3

3.3.2 How to Use an AFU_ID

Assume that you get a GUID of the following form:

00112233-4455-6677-8899-aabbccddeeff

The first 16 hex digits form AFU_ID_H and the last 16 hex digits form AFU_ID_L. The examples below use the GUID A455783A-3E90-43B9-A12E-BB328F7DD35C from Code 6 and Code 7.

In the RTL, use this format (an underscore every four hex digits):

    16'h0002 : af2cp_sTxPort.c2.data <= 64'ha12e_bb32_8f7d_d35c; // AFU_ID_L
    16'h0004 : af2cp_sTxPort.c2.data <= 64'ha455_783a_3e90_43b9; // AFU_ID_H

In the software (for example, when constructing a configuration record), use this format (note the location of the dashes):

ConfigRecord.Add(keyRegAFU_ID,"A455783A-3E90-43B9-A12E-BB328F7DD35C");

These two AFU_ID values match each other. This allows the AAL runtime framework to match the hardware AFU with the software application using the AFU_ID.
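
The correspondence between the dashed GUID string and the two 64-bit RTL constants can be checked mechanically. The sketch below uses Python's standard uuid module; the helper names are hypothetical.

```python
import uuid

def afu_id_halves(guid: str):
    """Split a GUID string into (AFU_ID_H, AFU_ID_L) as 64-bit integers."""
    value = uuid.UUID(guid).int          # the GUID as one 128-bit integer
    return value >> 64, value & ((1 << 64) - 1)

def sv_literal(value: int) -> str:
    """Format a 64-bit value as a SystemVerilog literal with '_' every 4 hex digits."""
    digits = f"{value:016x}"
    return "64'h" + "_".join(digits[i:i + 4] for i in range(0, 16, 4))
```

Applied to the GUID used in this section, afu_id_halves returns the two values that appear in the RTL above, upper half first.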



4 Basic Building Blocks

Basic Building Blocks are Intel-provided reference designs that users can instantiate in their AFU. There are two types of Basic Building Blocks (BBBs): software-visible (exposes a register interface and requires software interaction) and software-invisible (does not require software interaction). In both cases, it is your responsibility to integrate the hardware and software into your AFU.

Examples:

SW visible: Shared Virtual Memory (SVM) using the VTP feature of the Memory Properties Factory (MPF).
SW invisible: Reorder buffer; asynchronous CCI-P interface.

An example of a BBB is the Memory Properties Factory (MPF). MPF is an optional, parameterized basic building block to implement shared virtual memory in a proprietary manner. MPF is a collection of shims that transform CCI to CCI, each adding some property. VTP (Virtual to Physical) is the translation shim. Other MPF properties include read response sorting and ordering guarantees within a cache line.

BBBs have mandatory registers defined in the next section.




5 Device Feature List

This section defines a feature list structure that creates a linked list of feature headers within MMIO space, thus providing an extensible way of adding features. The software can walk through the feature headers to enumerate the following:

AFUs

Basic Building Blocks (BBBs)

Private features

Table 35:Differences between AFU, Private Features, and BBBs

AFU
    Must implement the mandatory AFU registers, including the AFU ID. An AFU is compliant with the CCI-P interface and connected directly to the CCI-P Port.
    It is a primary unit of allocation, PR, and reset from the SW PoV.

Private Feature
    These are a linked list of features within the AFU, which provides a way of organizing functions within an AFU. It is the AFU developer's responsibility to enumerate and manage them.
    They are not required to implement a GUID.

BBB
    BBBs are special features within the AFU, which are meant to be reusable building blocks (design once, reuse many times). SW visible BBBs typically come with a corresponding software service to enumerate and configure the BBB, and possibly provide a higher-level SW interface to the BBB.
    BBBs do not have strong HW interface requirements like an AFU, but they must have strong architectural semantics from the SW PoV.
    They must implement a GUID.

A feature region (sometimes referred to simply as a “feature”) is a group of related CSRs. For example, two different features of a DMA engine are queue management and QoS functions. You can group queue management and QoS functions into two different feature regions.

BBB and private features are always children of an AFU and as such must be contained within an AFU.

Figure 18 shows an example of a feature hierarchy and the relationship of the AFU, BBB, and private features.

[Figure 18 diagram. A chain of feature headers starting at Address 0x0: DFH Type=AFU (mandatory, EOL=0), Private Feature 1 (EOL=0), BBB Feature 2 (EOL=0), Private Feature 3 (EOL=1).]

Figure 18 Example feature hierarchy

A Device Feature Header (DFH) register (shown in Table 36) marks the start of the feature region.





Table 36 : Device Feature Header CSR

Device Feature Header

Bit    Description
63:60  Feature Type
       4'h1 - AFU
       4'h2 - BBB
       4'h3 - Private Features
59:52  Reserved
51:48  Type=AFU: AFU Minor version #. User defined value.
       Other types: Reserved
47:41  Reserved
40     End of List
       1'b0: There is another feature header beyond this (see "Next DFH Byte Offset")
       1'b1: This is the last feature header for this AFU
39:16  Next DFH Byte offset
       Next DFH Address = Current DFH Address + Next DFH Byte offset
       Refer to the example in Table 37.
15:12  Type=AFU: AFU Major Version #. User defined.
       Other types: Feature Revision. User defined.
11:0   Type=AFU: CCI-P Version #. Refer to Table 38 for the AFU DFH register map.
       Other types: Feature ID. Contains a user defined ID to identify features within an AFU.

Table 37 Next DFH Byte offset example

Feature                         DFH Address  EOL   Next DFH Byte offset
0                               0x0          0x0   0x100
1                               0x100        0x0   0x180
2 - Last feature                0x280        0x1   0x80
Unallocated MMIO space, no DFH  0x300        N.A.  N.A.
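
Software walks the feature list by following the Next DFH Byte offset until it reaches a header with EOL set. The sketch below reproduces the Table 37 example; the MMIO space is modelled as a dictionary from byte address to 64-bit value, and the names dfh and walk_dfh_list are hypothetical.

```python
def dfh(eol: int, next_offset: int) -> int:
    """Build a DFH value carrying only the EOL (bit 40) and Next DFH offset (39:16)."""
    return (eol << 40) | (next_offset << 16)

def walk_dfh_list(mmio, start=0x0):
    """Return the byte addresses of all DFHs in the list that starts at `start`."""
    addresses = []
    addr = start
    while True:
        header = mmio[addr]
        addresses.append(addr)
        if (header >> 40) & 1:                 # EOL: last feature header
            return addresses
        next_offset = (header >> 16) & 0xFFFFFF
        if next_offset == 0:                   # guard against looping forever
            raise ValueError("EOL is 0 but the next offset is 0")
        addr += next_offset

# The feature list from Table 37.
TABLE_37 = {0x0: dfh(0, 0x100), 0x100: dfh(0, 0x180), 0x280: dfh(1, 0x80)}
```

Walking TABLE_37 visits addresses 0x0, 0x100, and 0x280 and stops there because EOL is set; the offset stored in the last header is never followed.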




Table 38 Mandatory AFU DFH register map

Byte Address offset w.r.t. DFH  Register Name
0x0000                          DFH Type=AFU
0x0008                          AFU_ID_L
0x0010                          AFU_ID_H
0x0018                          Next AFU
0x0020                          Reserved

Table 39 AFU_ID_L CSR definition

Register Name: AFU_ID_L

Bit    Attr  Description
63:0   RO    Lower 64 bits of the AFU_ID GUID

Table 40 AFU_ID_H CSR definition

Register Name: AFU_ID_H

Bit    Attr  Description
63:0   RO    Upper 64 bits of the AFU_ID GUID

Table 41 Next AFU CSR

Register Name: Next AFU

Bit    Attr  Description
63:24  Rsvd  Reserved
23:0   RO    Next AFU DFH Byte offset
             Next AFU DFH address = current address + offset
             A value of 0 implies that this is the last AFU in the list.
             Example:
             AFU 0 @ Address 0x0    Next AFU offset = 0x100
             AFU 1 @ Address 0x100  Next AFU offset = 0x100
             AFU 2 @ Address 0x200  Next AFU offset = 0x0 (indicates end of AFU list)








63:0 Rsvd Reserved

A DFH with Type=BBB must be followed by the mandatory BBB registers.

Table 43: Mandatory BBB DFH Register Map

Byte Address offset w.r.t. DFH  Register Name
0x0000                          DFH Type=BBB
0x0008                          BBB_ID_L
0x0010                          BBB_ID_H

The mandatory BBB registers are defined below.

Table 44: BBB_ID_L CSR Definition

Register Name: BBB_ID_L

Bit    Attr  Description
63:0   RO    Lower 64 bits of the BBB_ID GUID

Table 45: BBB_ID_H CSR Definition

Register Name: BBB_ID_H

Bit    Attr  Description
63:0   RO    Upper 64 bits of the BBB_ID GUID

The BBB_ID is a GUID, similar in concept to an AFU_ID. It is defined so that each BBB has a unique identifier from the SW PoV; this allows the AAL to identify the SW service associated with the BBB RTL.

Figure 19 shows how a logical feature hierarchy (shown on left-hand side) can be expressed using DFH registers defined in this section.



[Figure 19 diagram. Left, "Logical View": AFU DFH (mandatory, EOL=0), Private Feature 1 (EOL=0), BBB Feature 2 (EOL=0), Private Feature 3 (EOL=1). Right, "Register Map" (bits 63 to 0):
- AFU: DFH (Type=AFU, AFU minor #, Next DFH Byte offset, AFU major #, CCI-P version #), AFU_ID_L, AFU_ID_H, Reserved, Reserved.
- Private Feature 1 and Private Feature 3: DFH (Type=Priv, Reserved, Next DFH Byte offset, Feature Rev, Feature ID), followed by Feature CSRs.
- BBB Feature 2: DFH (Type=BBB, Reserved, Next DFH Byte offset, Feature Rev, Feature ID), GUID_L, GUID_H, followed by Feature CSRs.
At each header, EOL is tested: if it is not 1, the Next DFH Byte offset is added to the current address to obtain the next feature address (Feature 1 Addr, Feature 2 Addr, Feature 3 Addr); if it is 1, the end of the feature list has been reached.]

Figure 19: Device Feature Conceptual View
