Upload: movilforum — posted 03-Aug-2015

New HW Features – CAT, COD, Haswell & other topics

Network Platforms Group

TRANSFORMING NETWORKING & STORAGE


Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations.

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Celeron, Intel, Intel logo, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel SpeedStep, Intel XScale, Itanium, Pentium, Pentium Inside, VTune, Xeon, and Xeon Inside are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Intel® Active Management Technology requires the platform to have an Intel® AMT-enabled chipset, network hardware and software, as well as connection with a power source and a corporate network connection. With regard to notebooks, Intel AMT may not be available or certain capabilities may be limited over a host OS-based VPN or when connecting wirelessly, on battery power, sleeping, hibernating or powered off. For more information, see http://www.intel.com/technology/iamt.

64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology is a security technology under development by Intel and requires for operation a computer system with Intel® Virtualization Technology, an Intel Trusted Execution Technology-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel or other compatible measured virtual machine monitor. In addition, Intel Trusted Execution Technology requires the system to contain a TPMv1.2 as defined by the Trusted Computing Group and specific software for some uses. See http://www.intel.com/technology/security/ for more information.

†Hyper-Threading Technology (HT Technology) requires a computer system with an Intel® Pentium® 4 Processor supporting HT Technology and an HT Technology-enabled chipset, BIOS, and operating system. Performance will vary depending on the specific hardware and software you use. See www.intel.com/products/ht/hyperthreading_more.htm for more information including details on which processors support HT Technology.

Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and, for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and software configurations and may require a BIOS update. Software applications may not be compatible with all operating systems. Please check with your application vendor.

* Other names and brands may be claimed as the property of others.

Other vendors are listed by Intel as a convenience to Intel's general customer base, but Intel does not make any representations or warranties whatsoever regarding quality, reliability, functionality, or compatibility of these devices. This list and/or these devices may be subject to change without notice.

Copyright © 2013, Intel Corporation. All rights reserved.


Topics

• Run to completion vs pipeline models

• Lockless queues

• Cache Allocation Technology

• DPPD intro


Run to Completion vs. Pipeline model


PCIe* connectivity and core usage, using run-to-completion or pipeline software models

[Diagram: a dual-socket system. Processor 0 runs a Linux* control plane on physical core 0, with Intel® DPDK PMD packet I/O plus packet/flow work and flow classification (Apps A, B, C) on physical cores 1–5, each with Rx/Tx to 10 GbE ports. Processor 1 runs Intel® DPDK on physical cores 0–5: one core does PMD packet I/O and hashing (RSS mode), dispersing packets to application cores running Apps A, B, C. NUMA pool caches, queues/rings, and buffers back each socket; the sockets connect via QPI, and the NICs attach via PCIe.]

Run to Completion model
• I/O and application workload can be handled on a single core
• I/O can be scaled over multiple cores

Pipeline model
• The I/O application disperses packets to other cores
• Application work is performed on other cores

Look at more I/O on fewer cores with vectorization


When to Choose Run-to-Completion vs. Pipeline

Applications will generally employ both models.

Technical questions to consider:

• How many cycles/packet do I need for my algorithms?

• Are there large data structures that need to be shared with read/write access across packets?

• Will I support timer / packet-ordering functions?

• Can I take advantage of a specific optimization if I restrict an algorithm to one core?

• How much data would I need to exchange between software modules?
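The cycles-per-packet question above can be made concrete with a quick budget calculation: at 10 GbE line rate with minimum-size frames, a core has only a hundred-odd cycles per packet. A minimal sketch (the 2.0 GHz clock in the test is an assumed example, not a figure from the slides):

```c
#include <assert.h>

/* 10 GbE line rate for minimum-size frames:
 * 64-byte frame + 20 bytes of per-frame overhead on the wire
 * (preamble 7 + SFD 1 + inter-frame gap 12) = 84 bytes per packet. */
static double max_pps_10gbe(void)
{
    const double line_rate_bps = 10e9;
    const double bytes_on_wire = 64.0 + 20.0;
    return line_rate_bps / (bytes_on_wire * 8.0); /* ~14.88 Mpps */
}

/* Per-packet cycle budget for one core at a given clock frequency. */
static double cycles_per_packet(double core_hz)
{
    return core_hz / max_pps_10gbe();
}
```

At an assumed 2.0 GHz that is roughly 134 cycles per 64-byte packet — the constraint that drives the run-to-completion vs. pipeline decision.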


More Run-to-Completion vs. Pipeline…

General architecture questions to consider:

• Do some cores have easier/faster access to a hardware resource?

• Do you want to view cores as offload engines?

Development environment questions to consider:

• Do you need to employ legacy software modules?

• Does ease of code maintenance trump performance?


Example: Building a More Complicated Pipeline

• Applications can be distributed/pipelined across as many cores as needed to achieve throughput

• Trade-offs will vary on when to distribute applications vs. consolidate

• The queue/ring API serves as the communication mechanism

• Current focus is a static (boot-time) configuration of queues

• The NIC driver pushes data to the flow classifier

• The classifier branches each packet out to the appropriate handler depending on packet inspection

• IPsec packets could be sent to the CPM via the CPM PMD, or handled on-CPU for non-accelerated platforms

• This is just an example

[Diagram: NIC → Poll Mode Driver Rx → Flow Classification → {Inbound IPsec Pre-Processing, L3 Forwarding Application, Discard Application} → IPsec Post-Processing → Poll Mode Driver Tx, with a separate Poll Mode Driver Rx/Tx pair for the Cave Creek CPM.]
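In code, each pipeline stage pulls from an input ring, does its work, and pushes to the next stage's ring. A minimal single-threaded sketch of the classify-and-branch step — the packet fields and handler names are invented for illustration, not taken from DPDK:

```c
#include <assert.h>

/* Downstream handlers from the example pipeline. */
enum handler { H_IPSEC, H_L3FWD, H_DISCARD };

/* Branch a packet to the appropriate downstream stage, mirroring the
 * Flow Classification box in the diagram. Inputs are the two (invented)
 * fields the classifier inspects. */
static enum handler classify(int is_ipsec, int ttl)
{
    if (ttl <= 1)
        return H_DISCARD;   /* expired: discard application */
    if (is_ipsec)
        return H_IPSEC;     /* inbound IPsec pre-processing */
    return H_L3FWD;         /* plain L3 forwarding application */
}
```

A real stage would then enqueue the packet on the ring feeding the chosen handler.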


Lockless Queues
Used to share data between cores, threads, etc.


Connection Between DPDK Elements – Rings

• Primary mechanism to move data between software units, or between software and I/O sources or hardware accelerators

[Diagram: a dispatch loop of DPDK components — Poll Mode Driver Rx → Flow Classification → {Inbound IPsec Pre-Processing, L3 Forwarding Application (with FIB), Discard Application} → IPsec Post-Processing → Poll Mode Driver Tx — connected by rings, plus a free list, a ring to forward packets to another core, and a PMD Rx/Tx pair for a NIC/accelerator. Customer application stages sit alongside the DPDK components.]


Queue/Ring Management API
Effectively a FIFO implementation in software

• Lockless implementations for single- or multi-producer, single- or multi-consumer enqueue/dequeue

• Supports bulk enqueue/dequeue to support packet bunching

• Implements watermark thresholds for back-pressure/flow control

• Used to decouple stages of a pipeline

Essential to optimizing throughput
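The two properties called out above — bulk enqueue/dequeue and a watermark threshold — can be illustrated with a deliberately simplified single-producer/single-consumer FIFO. This is not DPDK's rte_ring; sizes and names are invented:

```c
#include <assert.h>
#include <stdint.h>

#define RING_SZ 8       /* must be a power of two */
#define WATERMARK 6     /* signal back-pressure above this fill level */

struct ring {
    uint32_t head, tail;        /* head: next free slot; tail: next out */
    void *slots[RING_SZ];
};

static uint32_t ring_count(const struct ring *r) { return r->head - r->tail; }

/* Bulk enqueue: all-or-nothing, returns the number enqueued (0 or n).
 * *above_wm reports whether the fill level exceeds the watermark. */
static uint32_t ring_enqueue_bulk(struct ring *r, void **objs, uint32_t n,
                                  int *above_wm)
{
    if (ring_count(r) + n > RING_SZ)
        return 0;                               /* not enough room */
    for (uint32_t i = 0; i < n; i++)
        r->slots[(r->head + i) & (RING_SZ - 1)] = objs[i];
    r->head += n;
    *above_wm = ring_count(r) > WATERMARK;      /* back-pressure hint */
    return n;
}

/* Bulk dequeue: all-or-nothing, returns the number dequeued (0 or n). */
static uint32_t ring_dequeue_bulk(struct ring *r, void **objs, uint32_t n)
{
    if (ring_count(r) < n)
        return 0;
    for (uint32_t i = 0; i < n; i++)
        objs[i] = r->slots[(r->tail + i) & (RING_SZ - 1)];
    r->tail += n;
    return n;
}

static int ring_selftest(void)
{
    struct ring r = {0};
    void *in[4] = {(void *)1, (void *)2, (void *)3, (void *)4}, *out[4];
    int wm = 0;
    if (ring_enqueue_bulk(&r, in, 4, &wm) != 4 || wm) return 0;   /* 4 <= 6 */
    if (ring_enqueue_bulk(&r, in, 4, &wm) != 4 || !wm) return 0;  /* 8 > 6 */
    if (ring_dequeue_bulk(&r, out, 4) != 4 || out[0] != (void *)1) return 0;
    return 1;
}
```

The all-or-nothing bulk semantics are what make packet bunching cheap: one head/tail update amortizes over the whole burst.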


How "Lockless" Operations Are Implemented
Multiple-producer enqueue example

Head/tail indexes are 32-bit binary values, so arithmetic wraps naturally in a space of 2^32.

Steps:

1. ring->prod_head and ring->cons_tail are copied to local variables (prod_head, cons_tail)

2. Use a compare-and-swap to advance ring->prod_head, which succeeds only if ring->prod_head still equals the local prod_head

3. Write the enqueued object(s) into the claimed slot(s)

4. Update ring->prod_tail to publish the slots, once ring->prod_tail equals the local prod_head (i.e., once earlier producers have finished)
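The four steps can be sketched with C11 atomics. This is a simplified one-object-at-a-time version for illustration — DPDK's actual rte_ring differs in details (bulk operations, the exact tail-update strategy) — and the self-test exercises it from a single thread only:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MPRING_SZ 8                  /* power of two */

struct mpring {
    _Atomic uint32_t prod_head;      /* next slot any producer may claim */
    _Atomic uint32_t prod_tail;      /* last published slot */
    uint32_t cons_tail;              /* single consumer in this sketch */
    void *slots[MPRING_SZ];
};

/* Multi-producer enqueue of one object, following the slide's steps. */
static int mpring_enqueue(struct mpring *r, void *obj)
{
    uint32_t head, next;
    do {
        head = atomic_load(&r->prod_head);          /* step 1: snapshot */
        if (head - r->cons_tail >= MPRING_SZ)
            return 0;                               /* ring full */
        next = head + 1;
        /* step 2: claim the slot only if no other producer moved head */
    } while (!atomic_compare_exchange_weak(&r->prod_head, &head, next));

    r->slots[head & (MPRING_SZ - 1)] = obj;         /* step 3: write */

    /* step 4: wait for earlier producers, then publish our slot */
    while (atomic_load(&r->prod_tail) != head)
        ;                                           /* spin */
    atomic_store(&r->prod_tail, next);
    return 1;
}

static void *mpring_dequeue(struct mpring *r)
{
    if (atomic_load(&r->prod_tail) == r->cons_tail)
        return 0;                                   /* empty */
    return r->slots[r->cons_tail++ & (MPRING_SZ - 1)];
}

static int mpring_selftest(void)
{
    struct mpring r = {0};
    for (long i = 1; i <= 8; i++)
        if (!mpring_enqueue(&r, (void *)i)) return 0;
    if (mpring_enqueue(&r, (void *)9)) return 0;    /* full: rejected */
    for (long i = 1; i <= 8; i++)
        if (mpring_dequeue(&r) != (void *)i) return 0;
    return 1;
}
```

The key property: no mutex anywhere — producers coordinate only through the CAS on prod_head and the ordered publish of prod_tail.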


Haswell: Cache Allocation Technology
Enables more deterministic VNF performance


Platform Quality of Service

Cache Monitoring Technology – the ability to monitor Last Level Cache occupancy for a set of RMIDs (Resource Monitoring IDs). Extensible architecture for future monitoring events.

Cache Allocation Technology – the ability to partition the Last Level Cache, with enforcement on a per-core basis through Class of Service mapping.

https://software.intel.com/en-us/blogs/2014/12/11/intels-cache-monitoring-technology-software-support-and-tools
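The Class of Service mapping is programmed through MSRs (later Linux kernels expose it via resctrl instead). The helpers below only compute the MSR numbers and register values involved; the addresses and layout (IA32_PQR_ASSOC = 0xC8F with RMID in bits 9:0 and COS in bits 63:32, IA32_L3_QOS_MASK_0 = 0xC90) follow Intel's SDM, but treat this as an illustrative sketch, not a driver:

```c
#include <assert.h>
#include <stdint.h>

#define MSR_IA32_PQR_ASSOC    0xC8F  /* per-thread RMID + COS association */
#define MSR_IA32_L3_QOS_MASK0 0xC90  /* COS0 way mask; COSn lives at 0xC90 + n */

/* MSR address holding the L3 capacity bitmask for a class of service. */
static uint32_t l3_mask_msr(uint32_t cos)
{
    return MSR_IA32_L3_QOS_MASK0 + cos;
}

/* IA32_PQR_ASSOC value: RMID in bits 9:0, COS in bits 63:32. */
static uint64_t pqr_assoc(uint32_t rmid, uint32_t cos)
{
    return ((uint64_t)cos << 32) | (rmid & 0x3FF);
}
```

A tool like wrmsr from msr-tools could then write these values, e.g. the COS=2 association the next slide's flow shows.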


Cache Allocation Technology – flow

[Diagram, left to right:]

1. QoS enumeration/configuration: enumerate the QoS capabilities and configure each class of service with a capacity bitmask.

2. QoS association: on context switch, set the application's class of service in the PQR register (e.g. COS=2).

3. Enforcement: the QoS-aware cache subsystem tags each memory request with its class of service and applies the matching way mask (COS 1–4 → WayMask1–4) across the cache's sets and ways (Set 1 … Set n, Way 1 … Way 16).

(The COS-to-mask association is architectural; how the mask is enforced in the cache is implementation dependent.)


Cache Allocation Technology
Bitmask examples (8-bit): only masks with contiguous '1's are allowed. Apps can be separated or can share LLC space.
• Isolated: determinism benefit
• Shared/overlapped: throughput benefit

Isolated bitmasks:

        M7  M6  M5  M4  M3  M2  M1  M0
COS 1    A   A   A   A                   50%
COS 2                    A   A           25%
COS 3                            A       12.5%
COS 4                                A   12.5%

Overlapped bitmasks:

        M7  M6  M5  M4  M3  M2  M1  M0
COS 1    A   A   A   A   A   A   A   A   100%
COS 2                    A   A   A   A   50%
COS 3                            A   A   25%
COS 4                                A   12.5%
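The contiguity rule and the percentage shares in these tables can be checked mechanically. A small sketch (mask values in the test chosen to match the isolated table; the bit-scan/popcount builtins are GCC/Clang extensions):

```c
#include <assert.h>
#include <stdint.h>

/* A CAT capacity bitmask is valid only if it is non-empty and its
 * set bits are contiguous. */
static int cat_mask_valid(uint64_t mask)
{
    if (mask == 0)
        return 0;
    mask >>= __builtin_ctzll(mask);   /* strip trailing zeros */
    return (mask & (mask + 1)) == 0;  /* remaining bits all ones? */
}

/* Share of the LLC a mask grants, as a percentage of the total bits. */
static double cat_mask_share(uint64_t mask, int total_bits)
{
    return 100.0 * __builtin_popcountll(mask) / total_bits;
}
```

For example, COS 1 in the isolated table is mask 0xF0 (M7–M4), a valid contiguous mask covering 50% of an 8-way LLC; 0xA0 would be rejected.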


CAT Benefit: Increase Determinism

• Real-time applications require determinism
• Shared platform resources reduce determinism
• A sample "fork bomb" makes interrupt latency unpredictable (left)
• Cache QoS partitioning can solve this issue (right)

[Chart: percent distribution vs. interrupt latency (µs, roughly 7–11 µs), without and with CQoS.]

CAT restores performance determinism → critical for RTOS/comms


Haswell: Cluster on Die (COD)
On HSW, all L3 cache is not the same


IVB EP Architecture

[Diagram only: the Ivy Bridge-EP ring/LLC layout, shown for comparison with Haswell COD.]


Haswell Cluster on Die (COD) Mode

[Diagram: an 18-core HSW-EP die split into Cluster 0 and Cluster 1, each with its own cores and Cbo/LLC slices and its own home agent (HA0/HA1), plus Sbo switches, QPI 0/1 links, and the IIO.]

COD mode for 18C HSW-EP: on Haswell CPUs, all L3 cache is not on the same ring.

• Some L3 cache has higher latency to access

• Similar to NUMA, but for the L3 cache

Supported on 2S HSW-EP SKUs with two Home Agents (10+ cores)

• Targeted at NUMA workloads where latency is more important than sharing data across caching agents (Cbo)

• Reduces average LLC hit and local memory latencies

• Each HA mostly sees requests from a reduced set of threads, which can lead to higher memory bandwidth

• OS/VMM own NUMA and process-affinity decisions


Intel® 40Gb Ethernet Controllers


40GbE Fortville family (XL710/X710)

Comparing controller typical power:

Controller   Configuration   Typical power
XL710        1 x 40GbE       3.3 watts²
82599EB      2 x 10GbE       5.2 watts¹

Source as of Aug 2014: 1: 82599 Datasheet rev 2.0, Table 11.5, 2x10GbE Twinax typical power [W]. 2: XL710 Datasheet rev 1.21, Table 14-7, typical active power, 1x40GbE [W].

Power efficiency improvements:

• Up to 30% reduction in typical power

• Up to 65% reduction in power per gigabit

• 2x increase in total bandwidth
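The headline percentages follow from the table: going from 5.2 W for 20 Gb/s to 3.3 W for 40 Gb/s cuts both absolute power and power per gigabit. A quick check (datasheet figures as cited above; exact percentages depend on rounding):

```c
#include <assert.h>

/* Typical power figures from the comparison table. */
static const double XL710_WATTS  = 3.3;   /* 1 x 40GbE */
static const double I82599_WATTS = 5.2;   /* 2 x 10GbE */
static const double XL710_GBPS   = 40.0;
static const double I82599_GBPS  = 20.0;

/* Reduction in typical power, percent. */
static double power_reduction_pct(void)
{
    return 100.0 * (I82599_WATTS - XL710_WATTS) / I82599_WATTS;
}

/* Reduction in watts per gigabit, percent. */
static double watts_per_gbit_reduction_pct(void)
{
    double before = I82599_WATTS / I82599_GBPS;  /* 0.26 W/Gb  */
    double after  = XL710_WATTS / XL710_GBPS;    /* 0.0825 W/Gb */
    return 100.0 * (before - after) / before;
}
```

Both results land at or above the slide's "up to 30%" and "up to 65%" claims (roughly 37% and 68% with these figures).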


40GbE Fortville family (XL710/X710)

Configurations: 2x10, 4x10, 1x40, 2x40

• Low-power single-chip design for PCI Express 3.0

• Intelligent load balancing for high-performance traffic flows

• Network virtualization overlay stateless offloads for VXLAN, NVGRE, Geneve

• Flexible pipeline processing – add new features after production by upgrading firmware


Intro to DPPD
Data Plane Performance Demonstrators


DPPD: What is it?

• Data Plane Performance Demonstrators

• An open-source DPDK application

• 3-clause BSD (BSD-3-Clause) license

• Available on 01.org (https://01.org/intel-data-plane-performance-demonstrators/downloads)

• Runs on the host, in a VM, and on OVS


DPPD – What is it? (continued)

• The config file defines:

  • Which cores are used

  • Which interfaces are used

  • Which tasks are executed and how they are configured

• This allows you to:

  • Find bottlenecks and measure performance

  • Try and compare different core layouts without changing code

  • Reuse a config file on different systems (CPUs, hyper-threads, sockets, interfaces)


Example – main idea

[Diagram: each core (Core 1 … Core 5) runs one or more tasks; packets enter via interfaces, flow between tasks on different cores, and leave via interfaces.]


Supported tasks

• Load balance position

• QinQ encap/decap IPv4/IPv6

• ARP

• QoS

• Routing

• Unmpls

• Policing

• ACL

• Classify

• Drop

• Basic forwarding (no touch)

• L2 forwarding (change MAC)

• GRE encap/decap

• Load balance network

• Load balance QinQ


Configuration and design

• Easily reconfigurable (parses a config file)

• Different pipelines through configuration:

  • WiFi gateway

  • BNG

  • QoS

  • A combination or part of the above

• Assign work to different cores

• Cache QoS management

• Configuration follows design

• Each core is assigned to execute a (set of) task(s)

• Tasks are executed in round-robin fashion

• Tasks communicate through rings
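The per-core execution model described above — a fixed set of tasks polled in round-robin fashion — boils down to a loop over function pointers. A minimal sketch (the task names and the fixed-size table are invented, not DPPD's actual structures):

```c
#include <assert.h>
#include <stddef.h>

/* A task polls its input and returns how many packets it handled. */
typedef int (*task_fn)(void *state);

struct core_cfg {
    task_fn tasks[4];    /* tasks assigned to this core */
    void *state[4];      /* per-task state */
    size_t n_tasks;
};

/* One pass of the round-robin dispatch loop a core would run forever;
 * returns the total number of packets handled in the pass. */
static int run_one_round(struct core_cfg *c)
{
    int handled = 0;
    for (size_t i = 0; i < c->n_tasks; i++)
        handled += c->tasks[i](c->state[i]);
    return handled;
}

/* Two toy tasks standing in for, e.g., classify and forward. */
static int task_a(void *s) { (void)s; return 2; }
static int task_b(void *s) { (void)s; return 3; }

static int dispatch_selftest(void)
{
    struct core_cfg c = { { task_a, task_b }, { 0, 0 }, 2 };
    return run_one_round(&c);
}
```

A real deployment would keep calling run_one_round in a tight loop; the config file decides which tasks land in which core's table.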


DPPD: Very Simple Port Forwarding

ETH1 → FWD → ETH2

    [port 0]              ;DPDK port number
    name=cpe0
    mac=00:00:00:00:00:01

    [port 1]              ;DPDK port number
    name=cpe1
    mac=00:00:00:00:00:02

    [core 1]
    name=FWD
    task=0
    mode=none
    rx port=cpe0
    tx port=cpe1