TRANSCRIPT
TRANSFORMING NETWORKING & STORAGE
Legal DisclaimerINFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations.
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
Celeron, Intel, Intel logo, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel SpeedStep, Intel XScale, Itanium, Pentium, Pentium Inside, VTune, Xeon, and Xeon Inside are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Intel® Active Management Technology requires the platform to have an Intel® AMT-enabled chipset, network hardware and software, as well as connection with a power source and a corporate network connection. With regard to notebooks, Intel AMT may not be available or certain capabilities may be limited over a host OS-based VPN or when connecting wirelessly, on battery power, sleeping, hibernating or powered off. For more information, see http://www.intel.com/technology/iamt.
64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.
No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology is a security technology under development by Intel and requires for operation a computer system with Intel® Virtualization Technology, an Intel Trusted Execution Technology-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel or other compatible measured virtual machine monitor. In addition, Intel Trusted Execution Technology requires the system to contain a TPMv1.2 as defined by the Trusted Computing Group and specific software for some uses. See http://www.intel.com/technology/security/ for more information.
†Hyper-Threading Technology (HT Technology) requires a computer system with an Intel® Pentium® 4 Processor supporting HT Technology and an HT Technology-enabled chipset, BIOS, and operating system. Performance will vary depending on the specific hardware and software you use. See www.intel.com/products/ht/hyperthreading_more.htm for more information including details on which processors support HT Technology.
Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and, for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and software configurations and may require a BIOS update. Software applications may not be compatible with all operating systems. Please check with your application vendor.
* Other names and brands may be claimed as the property of others.
Other vendors are listed by Intel as a convenience to Intel's general customer base, but Intel does not make any representations or warranties whatsoever regarding quality, reliability, functionality, or compatibility of these devices. This list and/or these devices may be subject to change without notice.
Copyright © 2013, Intel Corporation. All rights reserved.
Topics
• Run to completion vs pipeline models
• Lockless queues
• Cache Allocation Technology
• DPPD intro
PCIe* connectivity and core usage
Using run-to-completion or pipeline software models

[Diagram: two processors linked by QPI, each with NUMA pools, caches, queues/rings and buffers, and 10 GbE ports attached over PCIe*.
Run-to-completion example: physical core 0 runs the Linux* control plane; physical cores 1-5 each run Intel® DPDK PMD packet I/O (Rx/Tx) together with packet work, flow work, or flow classification for Apps A, B and C.
Pipeline example: one core runs PMD packet I/O with RSS-mode hashing and disperses packets to other cores, which run Apps A, B and C.]

Run-to-completion model
• I/O and application workload can be handled on a single core
• I/O can be scaled over multiple cores

Pipeline model
• I/O application disperses packets to other cores
• Application work performed on other cores

Look at more I/O on fewer cores with vectorization
Applications will generally employ both models
Technical questions to consider:
How many cycles/packet do I need for my algorithms?
Are there large data structures that need to be shared with read/write access across packets?
Will I support timer / packet ordering functions?
Can I take advantage of a specific optimization if I restrict an algorithm to one core?
How much data would I need to exchange between software modules?
When to Choose Run-to-Completion vs. Pipeline
General architecture questions to consider:
Do some cores have easier/faster access to a hw resource?
Do you want to view cores as offload engines?
Development environment questions to consider:
Do you need to employ legacy software modules?
Does ease-of-code-maintenance trump performance?
More Run-to-Completion vs. Pipeline…
Example: Building a More Complicated Pipeline

Applications can be distributed/pipelined across as many cores as needed to achieve throughput
Trade-offs will vary on when to distribute applications vs. consolidate
Queue/ring API serves as the communication mechanism
Current focus is a static (boot-time) configuration of queues
NIC driver pushes data to flow classifier
Classifier branches packet out to appropriate handler depending on packet inspection
IPSEC packets could be sent to CPM via CPM PMD or handled on-CPU for non–accelerated platforms
This is just an example
[Diagram: NIC → Poll Mode Driver Rx → Flow Classification, which branches to Inbound IPsec Pre-Processing (with IPsec Post-Processing reached via Poll Mode Driver Rx/Tx to the Cave Creek CPM), the L3 Forwarding Application, or a Discard Application, then out through Poll Mode Driver Tx.]
Connection Between DPDK Elements -- Rings
• Primary mechanism to move data between software units, or
between software and I/O sources or hardware accelerators
[Diagram: a dispatch loop connects DPDK components (Poll Mode Driver Rx/Tx, Flow Classification with its FIB, Inbound IPsec Pre-Processing, IPsec Post-Processing, L3 Forwarding Application, Discard Application, Free List) and customer applications via rings; rings also forward packets to other cores and connect to the NIC and accelerators.]
Queue/Ring Management API
Effectively a FIFO implementation in software
• Lockless implementations for single- or multi-producer, single- or multi-consumer enqueue/dequeue
• Supports bulk enqueue/dequeue to support packet-bunching
• Implements watermark thresholds for back-pressure/flow control
Essential to optimizing throughput
• Used to decouple stages of a pipeline
Ring indices are 32-bit values; index arithmetic wraps modulo 2^32.
Steps:
1. ring->prod_head and ring->cons_tail are copied to local variables
2. Use a compare-and-swap to advance ring->prod_head; it succeeds only if ring->prod_head still equals the local prod_head copy
3. Write the enqueued object(s) into the ring
4. Use a compare-and-swap to advance ring->prod_tail; it succeeds only if ring->prod_tail equals the local prod_head copy (i.e., earlier producers have completed)
How “lockless” Operations Are Implemented
Multiple-producer enqueue Example
Haswell: Cache Allocation Technology
Enables more deterministic VNF performance
Platform Quality of Service
Cache Monitoring Technology – Ability to monitor Last Level Cache occupancy for a set of RMIDs (Resource Monitoring IDs). Extensible architecture for future monitoring events.
Cache Allocation Technology – Ability to partition Last Level Cache, enforcement on a per Core basis through Class of Service mapping.
https://software.intel.com/en-us/blogs/2014/12/11/intels-cache-monitoring-technology-software-support-and-tools
Cache Allocation Technology - flow

[Diagram: three stages.]
1. QoS enum/config: enumerate the QoS capability, then configure each class of service with a bitmask (COS 1-4 mapped to WayMask1-4).
2. QoS association: on context switch, set the running thread's class of service in the PQR register (e.g., config COS=2).
3. Enforcement: each application memory request is tagged with its cache class of service, and the QoS-aware cache subsystem enforces the corresponding way mask when allocating into the LLC (sets 1..n, ways 1..16 per set).

Architecturally, each transaction's COS selects a capacity bitmask (BitMask1-4); how that bitmask maps to physical cache ways (WayMask1-4) is implementation dependent.
Cache Allocation Technology
Bitmask examples: only masks with contiguous '1's are allowed. Apps can be separated or can share LLC space.
Isolated: determinism benefit. Shared/overlapped: throughput benefit.

Examples of overlap and isolation (8-bit masks, ways M7..M0):

Isolated bitmasks:
COS 1: M7-M4 (50%)
COS 2: M3-M2 (25%)
COS 3: M1 (12.5%)
COS 4: M0 (12.5%)

Overlapped bitmasks:
COS 1: M7-M0 (100%)
COS 2: M3-M0 (50%)
COS 3: M1-M0 (25%)
COS 4: M0 (12.5%)
CAT Benefit: Increase Determinism
Real-time applications require determinism, and shared platform resources reduce it: a sample "fork bomb" aggressor makes interrupt latency unpredictable (left), while cache QoS partitioning solves the issue (right).

[Chart: percent distribution of interrupt latency (us), roughly 7-11 us, with and without CQoS; with CQoS the distribution is narrow and predictable.]
CAT Restores Performance Determinism --> Critical for RTOS/Comms
Haswell: Cluster on Die (COD)
On HSW, all L3 cache is not the same
Haswell Cluster on Die (COD) Mode
[Diagram: an 18-core HSW-EP die split into Cluster0 and Cluster1, each with its own cores and Cbo/LLC slices; Sbo units, home agents HA0/HA1, QPI 0/1 and the IIO connect the clusters.]

COD Mode for 18C HSW-EP
On Haswell CPUs, all L3 cache is not on the same ring.
• Some L3 cache has higher latency to access
• Similar to NUMA, but for L3 cache

Supported on 2S HSW-EP SKUs with 2 home agents (10+ cores)
• Targeted at NUMA workloads where latency is more important than sharing data across caching agents (Cbo)
• Reduces average LLC hit and local memory latencies
• Each HA mostly sees requests from a reduced set of threads, which can lead to higher memory bandwidth
• OS/VMM own NUMA and process affinity decisions
40GbE Fortville family (XL710/X710)
Comparing controller typical power (source as of Aug 2014):
• 82599EB: 2 x 10GbE, 5.2 watts typical power [1]
• XL710: 1 x 40GbE, 3.3 watts typical power [2]

1: 82599 Datasheet rev 2.0, Table 11.5, 2x10GbE Twinax typical power [W]
2: XL710 Datasheet rev 1.21, Table 14-7, typical active power, 1x40GbE [W]

Power efficiency improvements: up to 30% reduction in typical power, up to 65% reduction in power per gigabit, and a 2x increase in total bandwidth.
40GbE Fortville family (XL710/X710)
2x10 4x10 1x40 2x40
• Low-power single-chip design for PCI Express 3.0
• Intelligent load balancing for high-performance traffic flows
• Network virtualization overlay stateless offloads for VXLAN, NVGRE, Geneve
• Flexible pipeline processing – new features can be added after production by upgrading the firmware
DPPD: What is it?
• Data Plane Performance Demonstrators
• An open source DPDK application
• BSD 3-clause license
• Available on 01.org (https://01.org/intel-data-plane-performance-demonstrators/downloads)
• Runs on the host, in a VM, and with OVS
• Config file defines
• Which cores are used
• Which interfaces are used
• Which tasks are executed and how they are configured
• Allows you to
• Find bottlenecks and measure performance
• Try and compare different core layouts without changing code
• Reuse config file on different systems (CPUs, hyper-threads, sockets, interfaces)
DPPD – What is it? (continued)
Example

Main idea
[Diagram: cores 1-5 each run one or more tasks; tasks receive from and transmit to interfaces, and pass packets on to tasks running on other cores.]
Supported tasks
• Load balance position
• QinQ encap/decap IPv4/IPv6
• ARP
• QoS
• Routing
• Unmpls
• Policing
• ACL
• Classify
• Drop
• Basic Forwarding (no touch)
• L2 Forwarding (change MAC)
• GRE encap/decap
• Load balance network
• Load balance QinQ
• Easily reconfigurable (parses config file)
• Different pipelines through configuration
• WiFi gateway
• BNG
• QoS
• Combination or part of the above
• Assign work to different cores
• Cache QoS Management
• Configuration follows design
• Each core is assigned to execute a (set of) task(s)
• Tasks are executed in round-robin fashion
• Tasks communicate through rings
Configuration and design
DPPD: Very simple Port Forwarding
[Diagram: ETH1 → FWD → ETH2]
[port 0] ;DPDK port number
name=cpe0
mac=00:00:00:00:00:01
[port 1] ;DPDK port number
name=cpe1
mac=00:00:00:00:00:02
[core 1]
name=FWD
task=0
mode=none
rx port=cpe0
tx port=cpe1