Implementing Policy Control as a Virtual Network Function: Challenges and Considerations

Version 0.8

An Industry Whitepaper

Contents

Executive Summary
Introduction to Policy Control and NFV
Considerations and Challenges
    Maximizing Core Performance
        Core Affinity
        Intelligent Load Balancing
        Load Balancer Options
        DPDK and Core Performance
    Maximizing System Performance
        Amdahl's Law
        Preserving Core Affinity across Sockets
        Memory Writes and Reads
    Partitioning Functions
        Example: Aggregating Statistics
        Example: Network-Level Traffic Shaping
        Example: Location-Specific Congestion Management
Conclusions
    Summary of Solution Requirements
Additional Resources
Invitation to Provide Feedback

Executive Summary

Traditionally, the processing demands of policy control (e.g., stateful packet processing, complex decision-making, etc.) required proprietary hardware solutions, but technology advances mean that virtualization now provides, or soon will provide, an alternative.

Transitioning from a purpose-built, proprietary hardware

component – one in which a vendor likely controls every aspect –

to a virtualized commercial off-the-shelf (COTS) model in which performance is dependent

on clock speed and available cores, and in which drivers vary by

hardware manufacturer is a formidable challenge.

Vendors who embark on this transition face a number of

considerations and must overcome many challenges in order to

preserve network policy control functionality in a virtualized

environment.

By understanding these considerations and challenges,

communications service providers gain an informed position from

which they can effectively evaluate alternatives.

To explore these topics, this paper asks and answers the

questions:

How can a solution maximize the performance of each

individual core?

How can a solution maximize the performance of the

overall system (i.e., of all the cores working together)?

How can a solution effectively coordinate aggregate

functions across many cores?


Introduction to Policy Control and NFV

Network policy control (also called policy management) refers to technology that enables the definition

and application of business and operational policies in networks. Policy control works by identifying

conditions (e.g., subscriber entitlement, current network conditions, data traffic identity, etc.),

evaluating decisions (e.g., determining if the network is congested, deciding whether certain traffic

constitutes a distributed denial of service attack, etc.), and enforcing actions (e.g., record the usage

into a database, decrement from a prepaid wallet, mitigate attack traffic, manage congestion, etc.).

Policy control powers many innovative subscriber services, network management actions, and business

intelligence (e.g., big data, customer experience management, analytics, etc.) initiatives.

Traditionally, the processing demands of policy control (e.g., stateful packet processing, complex

decision-making, etc.) required proprietary hardware solutions, but technology advances mean that virtualization now provides, or soon will provide, an alternative.

Network functions virtualization (NFV) is a carrier-led effort to move away from proprietary hardware,

motivated by desires to reduce costs by dramatically increasing agility and simplifying deployment. In

an NFV environment, software applications performing network functions share execution, storage, and

network resources on COTS hardware.

By using standard x86 commercial off-the-shelf (COTS) hardware for everything – that is, by running all

vendor solutions on the same hardware – an operator needs fewer spare parts, can standardize the

provisioning systems, and can simplify their supply chain.

This paper explores some of the challenges and considerations of implementing policy control functions

in virtualized environments.

To enable the discussion, it is worthwhile to quickly review some related terminology:

Socket: a physical connector on a motherboard that accepts a single processor chip

Core: a logical execution unit. In a multi-core processor, there are many cores that are each

able to execute threads independently.

QuickPath Interconnect (QPI): an Intel-specific point-to-point processor interconnect that

allows processors to access each other’s memory

Hyper-threading: an Intel technology that makes a single core appear logically as multiple

cores on the same chip (usually as two threads per core)

Hypervisor: software, firmware, or hardware that creates and runs virtual machines

Virtual Machine: an operating system or application environment that is installed on software

and imitates dedicated hardware

Bare Metal: a computer without an operating system installed. In the context of virtualization, ‘running on bare metal’ means installing a solution directly on the hardware (i.e., without a hypervisor or host operating system in between)

Data Plane Development Kit (DPDK): an API consisting of a collection of C code libraries that

live in userland (also known as “user space”). The primary function of DPDK is to memory map

hardware into userland, thereby removing the need to copy from kernel to userland and

achieving performance increases as a result. DPDK is not, strictly speaking, a virtualization

technology, but it is a technology that has significant benefits for virtualization.


Considerations and Challenges

Transitioning from a purpose-built, proprietary hardware component – one in which a vendor likely

controls every aspect – to a virtualized COTS model in which performance is dependent on clock speed

and available cores, and in which drivers vary by hardware manufacturer is a formidable challenge.

An additional degree of complexity arises when one recognizes that the hardware is shared by many vendors' workloads simultaneously, so the sizing and capacity of one workload can depend on another.

Vendors who embark on this transition face a number of considerations and must overcome many

challenges in order to preserve network policy control functionality and high performance density in a

virtualized environment.

By understanding these considerations and challenges, communications service providers gain an

informed position from which they can effectively evaluate alternatives.

The subsections that follow examine key subjects, and seek to answer several questions:

How can a solution maximize the performance of each individual core?

How can a solution maximize the performance of the overall system (i.e., of all the cores

working together)?

How can a solution effectively coordinate aggregate functions across many cores?

Maximizing Core Performance

Getting the maximum performance out of each available core provides the building blocks out of which a scalable and efficient complete system is constructed.

In order to achieve the maximum performance, particular conditions must be met and specific

problems must be solved.

Core Affinity

To maximize packet-processing performance in multicore and multiprocessor environments, a system must avoid costly memory lookups. The time to access memory varies widely depending on that memory’s location, and core performance can be severely impacted. For instance, here are the different types of memory available to a processor core, listed from fastest to slowest:1

Level 1 (L1) cache

Level 2 (L2) cache

Last-level cache (LLC)

Local memory (on-socket RAM)

Remote memory (RAM on a different socket)

Memory access impacts performance in two ways: first, in the actual time it takes to look up and to

retrieve something from memory into the processor; second, by causing bottlenecks on the

interconnection paths that link cores and sockets together, which cause cores to wait until the

bottleneck is relieved.

1 Actual values (e.g., cycles and time) for these accesses are available online, but vary by processor. For instance, here is a

discussion on StackOverflow: http://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory


To maximize packet-processing performance in multiprocessor environments, memory look-ups that use

core and socket interconnections must be kept to a minimum.

In the worlds of policy control and packet-processing (whether on proprietary hardware or in network

functions virtualization), the only way to completely avoid foreign memory access is to maintain core

affinity by ensuring all packets associated with a flow, session, and subscriber are processed by the

same core, and memory associated with the flow, session, and subscriber is also bound to the same

socket as the core. In this design, each core only needs to access its own dedicated memory cache.

Today’s architectures attempt to minimize memory checks (for instance, Intel’s Flow Director

technology on the network interface tries to ensure that all packets from the same flow are assigned to

the same processor), but these attempts are insufficient for applications that need to work across

flows.

In fact, there is only one way to ensure core affinity, and that is through the use of an intelligent (i.e.,

session-, flow-, and subscriber-aware) load balancer.2

As an added benefit that will be explored later, ensuring core affinity in a shared-nothing (i.e., no

shared state memory) architecture is also an enabler of maximal overall system scalability.
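As a concrete illustration of what binding work to a core looks like at the operating-system level, the minimal Linux sketch below pins one worker thread to each of a handful of cores using pthread_attr_setaffinity_np, so that whatever state a worker touches stays warm in that core's local caches. The thread count, core numbering, and empty worker body are illustrative assumptions only; a real packet-processing element would combine this with the flow-to-core steering described in the next subsection.

    /* Minimal sketch (Linux/glibc): pin one worker thread per core so that any
     * state the worker touches stays warm in that core's local caches.
     * The worker body is a placeholder; thread/core counts are illustrative. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NUM_WORKERS 4   /* assumption: one worker per core, cores 0..3 */

    static void *worker(void *arg)
    {
        long core = (long)arg;
        /* Placeholder for the per-core packet-processing loop. */
        printf("worker for core %ld running on CPU %d\n", core, sched_getcpu());
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_WORKERS];

        for (long core = 0; core < NUM_WORKERS; core++) {
            cpu_set_t cpus;
            CPU_ZERO(&cpus);
            CPU_SET((int)core, &cpus);

            /* Restrict the thread to exactly one core before it starts. */
            pthread_attr_t attr;
            pthread_attr_init(&attr);
            pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);
            pthread_create(&threads[core], &attr, worker, (void *)core);
            pthread_attr_destroy(&attr);
        }

        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }

Compile with -pthread; pinning only keeps a thread on a core, and the memory it uses must also be allocated on that core's socket for the affinity to pay off.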

Intelligent Load Balancing

At present, the only way to completely avoid foreign memory access in a virtualized packet-processing

application is to ensure that all packets associated with a flow, session, and subscriber are processed

by the same core.

To achieve this result, two conditions must be met:

1. There must be an aggregate solution to resolve network asymmetry by ensuring all packets

relating to a particular flow, session, and subscriber go to the same virtualized packet-

processing system (it is sufficient if the single system is actually made up of smaller,

connected, sub-systems)

2. The virtualized packet-processing system must include functionality that specifically directs

associated packets to a common processor core

The first requirement is a system-level design, so will not be examined in this paper.3

The second requirement calls for an intelligent load balancer that makes up part of the virtualized

solution.

This load balancer is the first point of inspection for incoming packets, and is dedicated to maintaining

flow, session and subscriber affinity for maximum element throughput.

The load balancer automatically removes local asymmetry within a packet-processing element by

steering packets from the same flow (and session and subscriber) to a single core, and then back out

through the appropriate exit port.

Functionally, this is how the load balancer works:

2 This topic is explored and explained in great detail in the whitepaper QuickPath Interconnect: Considerations in Packet Processing, which is available at www.sandvine.com

3 …but for those who are interested, the whitepaper Applying Network Policy Control to Asymmetric Traffic: Considerations and Solutions, available at www.sandvine.com, explains how this issue is solved in the physical world of proprietary hardware


1. Incoming packets are first examined to determine whether the traffic even needs to be

inspected (i.e., passed to a core). For example, depending on the policy, traffic belonging to

certain VLANs may not be inspected, which may be desired if the service provider chooses not

to inspect traffic that belongs to a wholesale customer or business customer. Simply

performing this task in the load balancer already achieves performance advantages over

equipment that requires core examination of all traffic.

2. For those packets that should be sent to a core, the load balancer creates and relies upon a

map that determines which core will process particular flows, sessions, and subscribers, and

directs the packets appropriately. This mapping ensures that the same core is always used for

all packet-processing relating to a specific flow, session, and subscriber. To preserve

performance, the map must scale by the number of cores in the system, rather than packets

per second.

3. Once the core has completed its tasks, the load balancer returns the packet through the

appropriate exit path.

The load balancing solution as a whole works as a two-stage pipeline, with the first stage having 100%

of the performance needed to perform its task under all circumstances (i.e., inspecting packets to

appropriately direct them) and the second stage having a scale-out property to perform the packet

processing and policy management.

In essence, the load balancer can be thought of as a Flow Director that is specifically designed for

policy control and packet processing applications, and which completely eliminates foreign memory

checks and maximizes device throughput.
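To make the mapping idea concrete, the sketch below shows one common way such a map can be built; it is an illustrative sketch under our own assumptions, not any vendor's implementation. The flow's 5-tuple (or the subscriber identity, when subscriber affinity is the goal) is hashed and reduced to a core index, so any state the load balancer keeps scales with the number of cores rather than with the packet rate.

    /* Illustrative sketch: steer packets to cores by hashing the flow 5-tuple.
     * The FNV-1a hash and the struct layout are examples only; a production
     * load balancer would typically key on subscriber identity as well so that
     * all of a subscriber's flows land on the same core. */
    #include <stdint.h>
    #include <stdio.h>

    struct five_tuple {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  protocol;
    };

    /* Fold a field into a running FNV-1a hash. */
    static uint32_t fnv1a_accum(uint32_t h, const void *data, size_t len)
    {
        const uint8_t *p = data;
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Same 5-tuple always yields the same core index, preserving core affinity. */
    static unsigned core_for_flow(const struct five_tuple *ft, unsigned num_cores)
    {
        uint32_t h = 2166136261u;   /* FNV-1a offset basis */
        h = fnv1a_accum(h, &ft->src_ip,   sizeof ft->src_ip);
        h = fnv1a_accum(h, &ft->dst_ip,   sizeof ft->dst_ip);
        h = fnv1a_accum(h, &ft->src_port, sizeof ft->src_port);
        h = fnv1a_accum(h, &ft->dst_port, sizeof ft->dst_port);
        h = fnv1a_accum(h, &ft->protocol, sizeof ft->protocol);
        return h % num_cores;
    }

    int main(void)
    {
        struct five_tuple a = { 0x0a000001, 0xc0a80101, 40000, 443, 6 };
        struct five_tuple b = a;   /* another packet of the same flow */

        unsigned cores = 16;       /* illustrative core count */
        printf("packet 1 -> core %u\n", core_for_flow(&a, cores));
        printf("packet 2 -> core %u\n", core_for_flow(&b, cores));
        return 0;
    }

Because the mapping is a pure function of the flow identity, the only table that grows is the per-core state owned by each core, which is exactly the scaling property called for in step 2 above.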

Figure 1 shows a simplified representation of the physical hardware being used by the virtualized

solution. This example uses a single socket for simplicity (a configuration with multiple sockets will be

examined later).

Figure 1 - Simplified representation of the virtualization hardware


As a packet travels through the data plane, it hits a physical interface (e.g., 1 GE, 10 GE, 40 GE), and

the associated network interface card (NIC) places the packet directly in the socket RAM, from which

the CPU can access it for processing.4

Functionally, this is the point at which the intelligent load balancer comes into play (Figure 2). The

load balancer examines the packet in RAM, and directs it to the appropriate core for processing. In this

manner, the core that is processing any existing flow always has the state of that flow in its dedicated

cache, and foreign memory access is entirely avoided.

Only by fulfilling this intelligent load balancing requirement can a virtualized policy control solution

achieve maximum core performance.

Figure 2 – The role of the intelligent load balancer: to avoid latency-inducing foreign memory access, the load balancer must direct packets to the appropriate core

Note, too, that the load balancer process itself consumes some processing capacity, and the amount of

consumption varies by implementation.

Load Balancer Options

Broadly, there are two approaches to creating such an intelligent load balancer function:

1. Configure and modify Open vSwitch (OVS)5

2. Purpose-build a proprietary solution

Each approach has advantages and disadvantages, and network operators would do well to thoroughly

quiz their solution vendors to understand the implementation.

4 Note that while it is possible to have the NIC place the packet directly in a core cache via Intel’s Direct Data I/O (http://www.intel.com/content/www/us/en/io/direct-data-i-o.html), doing so places the packet into the last level of cache, because the NIC has no way of placing the packets in the correct L1 or L2 core cache (i.e., that of the core that will maintain affinity); neither RSS nor Flow Director can fulfill this requirement.

5 Open vSwitch is a production-quality open-source implementation of a distributed virtual multilayer switch, the main purpose of which is to provide a switching stack for hardware virtualization environments. More information is available at http://openvswitch.org/


DPDK and Core Performance

The Data Plane Development Kit (DPDK) plays an important role in maximizing per-core performance by optimizing memory accesses.

In short, DPDK provides a map of the PCI memory so that userland can quickly access packets without

needing costly kernel interrupts and many memory copies across the kernel/userland boundary.

This approach results in massive performance increases and is a prerequisite for maximizing the

performance of any single processing core.
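For readers unfamiliar with the programming model, the sketch below shows the general shape of a DPDK poll-mode receive loop. It is a fragment under stated assumptions, not a complete application: EAL, memory-pool, and port/queue initialization (via rte_pktmbuf_pool_create, rte_eth_dev_configure, rte_eth_rx_queue_setup, and rte_eth_dev_start) are assumed to have been done elsewhere, and the per-packet work is reduced to a placeholder.

    /* Sketch of a DPDK poll-mode receive loop. Assumes rte_eal_init() has run
     * and that port 0 / queue 0 were configured and started elsewhere. */
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    static void rx_loop(void)
    {
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            /* Poll the NIC queue: packets are already in user-space mbufs,
             * so there is no per-packet interrupt and no kernel-to-userland copy. */
            uint16_t nb = rte_eth_rx_burst(0 /* port */, 0 /* queue */,
                                           bufs, BURST_SIZE);

            for (uint16_t i = 0; i < nb; i++) {
                /* ... classify the packet and hand it to the owning core ... */
                rte_pktmbuf_free(bufs[i]);   /* placeholder: release the mbuf */
            }
        }
    }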

Maximizing System Performance

Maximizing overall system performance demands, as a prerequisite, that the performance of the individual cores is maximized; next, those cores must be made to work together effectively and efficiently.

Combined, these many cores across many sockets are responsible for executing tasks that are simply

too large for any one core or socket – and the manner in which the cores are combined has enormous

implications on the total system performance.

Amdahl’s Law

When dividing processing between multiple nodes, the architects must decide whether or not any information will be shared between these nodes. Broadly, designs can be considered to be either ‘shared-nothing’ (i.e., literally nothing is shared) or ‘shared-something’ (e.g., subscriber state, 5-tuples6, etc.). The less that is shared, and the less frequently there are references across the shared context, the less locking/waiting will occur, and the greater the overall system performance as instances are added.

In the specific context of system (i.e., horizontal) scaling, a key consideration with regards to

information sharing is Amdahl’s Law7, which is a law of diminishing returns in multi-system

architectures. Put simply, this means that if information is shared between processors then the return

derived from adding additional processors decreases with each subsequent processor – eventually,

adding a new processor will yield no additional processing capacity.

More specifically, each processor added to a system adds less usable power than the previous one; each time the number of processors is doubled, the speedup ratio diminishes, and the total speedup heads toward the limit of 1/(1-P), where P is the fraction of the workload that can be parallelized.
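Expressed as a formula (the standard formulation, with P the parallelizable fraction of the work and N the number of processors), the achievable speedup is:

    S(N) = \frac{1}{(1 - P) + \frac{P}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - P}

For example, if 95% of the work can be parallelized (P = 0.95), going from 16 to 32 processors only raises the speedup from roughly 9.1x to roughly 12.6x, and no number of processors can ever deliver more than 20x.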

In contrast, a shared-nothing architecture scales linearly to infinity; that is, each new processor added

to a group adds its entire capacity to that of the group.

Implementing a shared-nothing architecture is challenging, but worthwhile, and the benefits are

extensive. For instance, sharing nothing means that a core never needs to access another core’s

memory, and as a consequence foreign memory look-ups are avoided and per-core performance is

maximized.

6 The set of five different values that comprise an Internet flow: source IP address, source port number, destination IP address, destination port number, and protocol. Strictly speaking, ‘connection’ is insufficient because it technically only applies to stateful protocols.

7 More information is available here: http://en.wikipedia.org/wiki/Amdahl%27s_law


In fact, the shared-nothing architecture is so much better suited to achieving efficient horizontal

system scale that this whitepaper considers it by its nature to be the ideal design; alternatives have

already been condemned to inefficiency because a shared-something model requires a mesh of

communication that increases with the square of the number of processors in the system.

Practically, though, it may not be possible to design a horizontally scalable system with no sharing, so

it is important to understand a subtlety of sharing: it is the frequency of sharing that degrades

performance, more so than the amount shared, because sharing means waiting. That is, a system that

must occasionally share something large will have higher performance than a system that frequently

shares small things.

The question then becomes, how does one build a scalable shared-nothing architecture, or at least how

does one build something that shares very infrequently?

Preserving Core Affinity across Sockets

To explore this topic, let’s use Figure 3 as a guide. Figure 3 takes the example from Figure 1 and

extends it to a higher-capacity network; now, a single socket is insufficient to provide the required

performance, and the system extends to two or more sockets.

Figure 3 – Simplified representation of multi-socket virtualization hardware


The packet follows a familiar path: on-the-wire, through an interface, and into a socket RAM. However,

the packet is written into the RAM associated with whatever interface it happened to traverse, and

there is no guarantee that this socket houses the particular processing core to which this packet is

destined.

Consequently, in a multi-socket environment the intelligent load balancer must be able to direct a

packet to a core on another socket, as depicted in Figure 4.

To facilitate packet movement between sockets there must be a mechanism that allows such transfers.

One option, but by no means the only one, is to use the DPDK queue, which is a shared ring. An option

that should explicitly be avoided is QuickPath Interconnect (QPI)8.

Figure 4 - The intelligent load balancer in a multi-socket environment
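As an illustration of the shared-ring option, the sketch below hands packet pointers from a load-balancer core to a worker core, possibly on another socket, through a DPDK rte_ring. The ring name, size, single-producer/single-consumer flags, and drop-on-full policy are illustrative assumptions, and EAL initialization is assumed to have happened elsewhere.

    /* Sketch: pass mbuf pointers between cores (possibly on different sockets)
     * through a DPDK ring instead of letting each core chase remote memory.
     * Names, sizes, and flags are illustrative. */
    #include <rte_ring.h>
    #include <rte_mbuf.h>

    static struct rte_ring *to_worker;

    /* Called once at startup: allocate the ring on the worker's NUMA socket. */
    static int setup_ring(int worker_socket_id)
    {
        to_worker = rte_ring_create("lb_to_worker", 4096, worker_socket_id,
                                    RING_F_SP_ENQ | RING_F_SC_DEQ);
        return to_worker ? 0 : -1;
    }

    /* Load-balancer core: enqueue is a non-blocking, write-like operation. */
    static void hand_off(struct rte_mbuf *pkt)
    {
        if (rte_ring_enqueue(to_worker, pkt) != 0)
            rte_pktmbuf_free(pkt);   /* ring full: drop (illustrative policy) */
    }

    /* Worker core: dequeue and process packets whose state it owns locally. */
    static void worker_poll(void)
    {
        void *obj;
        while (rte_ring_dequeue(to_worker, &obj) == 0) {
            struct rte_mbuf *pkt = obj;
            /* ... per-flow, per-subscriber policy processing ... */
            rte_pktmbuf_free(pkt);
        }
    }

The design point is that only a pointer crosses the socket boundary, at a rate the producer controls, rather than having the worker repeatedly reach into remote memory for packet and flow state.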

Memory Writes and Reads

An additional consideration when dealing with memory access in a virtualized environment is the cost of reads versus writes. Memory reads are comparatively slow: a read request is issued, and the processor must wait until the data is returned. Writes, on the other hand, are very fast9: the write is issued (buffered) and the processor keeps on processing.

This important and often-overlooked disparity can have enormous implications for the overall system

performance, particularly when reading or writing across sockets.

The most frequent activity performed in a network policy control system is flow-lookup. Consequently, to maximize performance, it is imperative to keep flow-state memory strictly local to a core.
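The read/write disparity is easy to observe. The toy benchmark below is a sketch with arbitrarily chosen sizes and constants; it times a dependent pointer-chasing read loop against a plain sequential write loop over arrays of the same size. On typical hardware the reads are dramatically slower, because each one must complete before the next can be issued, whereas the writes are simply buffered.

    /* Toy benchmark (a sketch): dependent reads via pointer chasing versus
     * independent, buffered writes. Compile with optimizations (e.g., -O2);
     * absolute results vary by machine. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1u << 22)   /* 2^22 elements (~32 MB of size_t on 64-bit) */

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        size_t *chain = malloc(N * sizeof *chain);
        volatile size_t *out = malloc(N * sizeof *out);  /* volatile: keep the stores */
        if (!chain || !out) return 1;

        /* Full-period LCG permutation: each read depends on the previous one. */
        for (size_t i = 0; i < N; i++)
            chain[i] = (i * 2654435761u + 1) % N;

        double t0 = now_sec();
        size_t idx = 0;
        for (size_t i = 0; i < N; i++)
            idx = chain[idx];          /* dependent reads: the core stalls on each */
        double t1 = now_sec();

        for (size_t i = 0; i < N; i++)
            out[i] = i;                /* independent writes: issued and buffered */
        double t2 = now_sec();

        printf("dependent reads:    %.3f s (idx=%zu)\n", t1 - t0, idx);
        printf("independent writes: %.3f s\n", t2 - t1);
        free(chain);
        free((void *)out);
        return 0;
    }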

8 For the same reasons discussed in the whitepaper QuickPath Interconnect: Considerations in Packet Processing, available at www.sandvine.com. In short, while QPI is fantastic for some applications, it becomes a massive bottleneck in stateful packet-processing solutions.

9 By orders of magnitude.


Partitioning Functions

When tasks are divided between multiple systems, there is a fundamental issue of determining how to

partition those tasks. For instance, in a packet processing application data traffic can be divided

between processors based on a wide range of factors (e.g., subscriber IP address, subscriber service

plan, application type, geographic location, etc.).

Partitioning also applies to the control plane; element statistics could be partitioned by type, with

prepaid usage statistics going to one node, and postpaid usage statistics going to another.

The challenges associated with partitioning can be very complex, particularly when one must

determine how to partition domain functionality across many smaller nodes.

Ensuring core affinity means that any per-subscriber policy control (e.g., measurements, billing and

charging, policy enforcement) use cases can be fulfilled while preserving maximum performance; in

other words, a single core can deal with all the policy control use cases that apply to a single

subscriber, without needing to involve another core (either for processing assistance or for memory

access).

But in the world of policy control, many use cases exist at an aggregate level. For instance, consider:

A policy that states that, during times of congestion, 50% of available network capacity shall be

dedicated to ‘high priority’ applications, 35% to ‘medium priority’ applications, and 15% to

‘low priority’ applications

A policy that must apply congestion management only at locations of the network where

congestion is manifesting (e.g., on a particular eNode B)

A policy to measure all YouTube traffic on the network

In each of these examples, applying the policy control requires coordinating between many separate

cores – cores that themselves are split across many sockets, and so on.

In the first example, each core must have an idea of the amount of traffic of each priority that the

other cores, collectively, are observing. Only with this knowledge can the cores as a set achieve the

policy management targets.

In the second example, each core must know which subscribers are currently in a location that is

congested, and must coordinate with other cores to collectively manage the congestion to a resolution.

In the third example, the statistics from all of the cores have to be aggregated together to create a

network-level measurement of YouTube.

In each example, a high-level task is split and shared between many processing elements. By

investigating some example use cases, we can discover the challenges that must be overcome to

effectively achieve them, and in doing so we can extract some specific solution requirements. The key is that the split is not done at the packet or flow level, but at some more manageable sharing rate.

Example: Aggregating Statistics

Combined, the many cores in the virtualization solution are performing lots of activities, and those activities generate statistics. In a simple example, the statistic itself might be the goal: for instance, a network operator might want to measure the total amount of YouTube traffic on the network. As


another example, the statistics might be a byproduct of other activities, and the operator wants to

track general performance metrics.

In either case, the general challenge is that the system must be able to aggregate statistics from many

cores, which themselves are distributed across physical sockets. In addition to questioning how these

statistics are accurately rolled-up, any network operator investigating virtualization solutions should

inquire about the performance impact and potential bottlenecks associated with the aggregation

process itself.
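One widely used pattern, sketched below under our own assumptions (any real product's internals will differ), is to have each core increment only its own cache-line-padded counters, with a separate aggregation path periodically summing them; the fast path then never takes a lock and never writes to a shared cache line.

    /* Sketch: per-core statistics with periodic aggregation.
     * Each worker increments only its own padded counter (no locks, no shared
     * cache lines on the fast path); a reader sums the counters on demand. */
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_CORES 64

    struct per_core_stat {
        _Atomic uint64_t youtube_bytes;
        char pad[64 - sizeof(_Atomic uint64_t)];   /* avoid false sharing */
    };

    static struct per_core_stat stats[MAX_CORES];

    /* Fast path, called by the owning core only: a relaxed atomic add. */
    static inline void count_youtube(unsigned core, uint64_t bytes)
    {
        atomic_fetch_add_explicit(&stats[core].youtube_bytes, bytes,
                                  memory_order_relaxed);
    }

    /* Slow path, called periodically by an aggregation/reporting thread. */
    static uint64_t total_youtube(unsigned num_cores)
    {
        uint64_t sum = 0;
        for (unsigned c = 0; c < num_cores; c++)
            sum += atomic_load_explicit(&stats[c].youtube_bytes,
                                        memory_order_relaxed);
        return sum;
    }

    int main(void)
    {
        /* Simulate a few cores reporting traffic, then aggregate. */
        count_youtube(0, 1500);
        count_youtube(1, 9000);
        count_youtube(2, 64000);
        printf("network-wide YouTube bytes: %llu\n",
               (unsigned long long)total_youtube(4));
        return 0;
    }

The trade-off in this pattern is a small amount of staleness in the aggregate in exchange for zero contention on the per-packet path, which is exactly why the aggregation step itself should be scrutinized for performance impact.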

Example: Network-Level Traffic Shaping

Consider this simple example: a communications service provider is running a network with 200 Gbps capacity and has a policy that peer-to-peer (P2P) traffic shall not exceed 100 Gbps. When P2P levels rise to this limit, shaping policies begin to act and enforce the 100 Gbps aggregate limit.

Assuming a per-core throughput of 10 Gbps, the 200 Gbps is split across 20 processing cores. In reality,

the P2P traffic is non-uniformly shared across all the processing cores – that is, each core will likely see

some of the P2P traffic.

In order to limit the aggregate amount of P2P traffic to 100 Gbps, some conditions must be met:

At any point in time, each core must be aware of the amount of P2P traffic on the network as a

whole

To ensure a fair distribution of P2P among the subscriber base, each core must act

proportionally

There is no perfect technical solution to this problem.10 Ensuring complete inter-core knowledge would impose inter-core communication overhead that is simply not achievable at the throughput rates at which packet processing must operate.

Nevertheless, this use case can be achieved approximately, with known (probabilistic) accuracy.

To hit a particular P2P shaping target, in our case 100 Gbps, at time t, each core must be made aware

of the amount of P2P traffic that was on the network at time t-1. Using this knowledge, each core can

adjust its own share of P2P so that the overall amount of P2P on the network at time t approximately

hits 100 Gbps.

At any instant, the exact amount of P2P on the network will vary around 100 Gbps, but over practical or meaningful time intervals, the amount of P2P achieves the target.

The precise algorithms used and accuracy achieved vary by vendor, so network operators should be

prepared to make detailed inquiries.
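The following sketch is illustrative only (real algorithms and their accuracy are vendor-specific, as noted above); it shows the basic feedback step in which each core learns the network-wide P2P rate measured over the previous interval and scales its own local limit proportionally, so that the aggregate converges toward the 100 Gbps target.

    /* Sketch of proportional feedback for an aggregate shaping target.
     * Units are Gbps; the target, traffic mix, and core count are illustrative. */
    #include <stdio.h>

    #define NUM_CORES 20
    #define TARGET_GBPS 100.0

    /* Each core computes its next local P2P limit from (a) the aggregate P2P
     * rate observed network-wide in the previous interval and (b) its own share
     * of that traffic, so enforcement stays proportional across cores. */
    static double next_local_limit(double my_p2p_prev, double total_p2p_prev)
    {
        if (total_p2p_prev <= TARGET_GBPS)
            return my_p2p_prev + 1.0;          /* below target: relax (arbitrary step) */
        double my_share = my_p2p_prev / total_p2p_prev;
        return my_share * TARGET_GBPS;         /* above target: scale down proportionally */
    }

    int main(void)
    {
        /* Previous-interval measurements: non-uniform P2P across 20 cores,
         * summing to 120 Gbps (over the 100 Gbps aggregate policy). */
        double p2p_prev[NUM_CORES], total = 0.0;
        for (int c = 0; c < NUM_CORES; c++) {
            p2p_prev[c] = (c % 2) ? 8.0 : 4.0;  /* 10 cores at 8 Gbps, 10 at 4 Gbps */
            total += p2p_prev[c];
        }

        double planned = 0.0;
        for (int c = 0; c < NUM_CORES; c++)
            planned += next_local_limit(p2p_prev[c], total);

        printf("previous total: %.1f Gbps, planned aggregate limit: %.1f Gbps\n",
               total, planned);   /* the planned aggregate comes out at ~100 Gbps */
        return 0;
    }

Note that the only information crossing cores is one aggregate number per interval, which is the low sharing rate the previous sections argued for.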

Example: Location-Specific Congestion Management

Finding an effective solution to network congestion is an important subject for network operators around the world.11

10 Even shaping at the interface hardware level has significant shortcomings, not least of which is that the subscribers who are impacted by the policy are ‘chosen’ arbitrarily, which could run afoul of network neutrality guidelines for reasonableness and proportionality.

11 The whitepaper Network Congestion Management: Considerations and Techniques, available at www.sandvine.com, explores this topic in detail.


For our example, suppose a mobile operator has detected that a particular eNode B is congested and

needs to resolve the congestion by managing the traffic of only those subscribers who are currently

using that eNode B.

In addition to subscriber awareness, and real-time knowledge of subscriber location, the solution

requires that the many cores in the virtualized solution coordinate their efforts to resolve the congestion while applying the minimum necessary management to subscribers.

In our example, the group is the set of subscribers on a particular eNode B, but this example can easily

be generalized to any ‘group’ of subscribers (e.g., all iPhone subscribers, all subscribers who signed up

in the last 6 months, all subscribers who subscribe to an on-deck video service, etc.) and any type of

policy enforcement, extending well beyond this simple congestion management example.
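As a final illustration, the sketch below (entirely hypothetical names, table sizes, and data structures) shows the lookup each core would perform on the fast path: the packet's subscriber is mapped to a location, and management is applied only if that location is currently in the shared set of congested cells, which is updated at a far lower rate than packets arrive.

    /* Sketch: apply congestion management only to subscribers attached to a
     * congested eNodeB. Names and sizes are hypothetical; the point is that the
     * per-packet check is a pair of local lookups, while the congested-location
     * set changes only at a slow, coordinated rate. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_SUBSCRIBERS 1024
    #define MAX_ENODEBS     64

    static int  subscriber_location[MAX_SUBSCRIBERS]; /* subscriber -> eNodeB id  */
    static bool enodeb_congested[MAX_ENODEBS];        /* updated every few seconds */

    /* Control plane (slow path): mark or clear a congested location. */
    static void set_congested(int enodeb_id, bool congested)
    {
        enodeb_congested[enodeb_id] = congested;
    }

    /* Data plane (fast path): should this subscriber's traffic be managed now? */
    static bool should_manage(int subscriber_id)
    {
        return enodeb_congested[subscriber_location[subscriber_id]];
    }

    int main(void)
    {
        subscriber_location[7]  = 3;   /* subscriber 7 is on eNodeB 3 */
        subscriber_location[42] = 5;   /* subscriber 42 is on eNodeB 5 */

        set_congested(3, true);        /* congestion detected on eNodeB 3 */

        printf("manage subscriber 7:  %s\n", should_manage(7)  ? "yes" : "no");
        printf("manage subscriber 42: %s\n", should_manage(42) ? "yes" : "no");
        return 0;
    }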


Conclusions

Transitioning from a purpose-built, proprietary hardware component – one in which a vendor likely

controls every aspect – to a virtualized COTS model in which performance is dependent on clock speed

and available cores, and in which drivers vary by hardware manufacturer is a formidable challenge.

Vendors who embark on this transition face a number of considerations and must overcome many

challenges in order to preserve network policy control functionality in a virtualized environment.

Getting the maximum performance out of each available core provides the building blocks out of which

a scalable and efficient complete system is constructed.

To maximize packet-processing performance in multiprocessor environments, it is necessary to

maintain core affinity by ensuring all packets associated with a flow, session, and subscriber are

processed by the same core. In this design, each core only needs to access its own dedicated memory

cache. Achieving this requirement demands an intelligent load balancer.

When dividing processing between multiple nodes, there are additional considerations. Broadly, designs

can be considered to be either ‘shared-nothing’ (i.e., literally nothing is shared) or ‘shared-something’

(e.g., subscriber state, 5-tuples, etc.). The less that is shared, and the less frequently there are

references across the shared context, the less locking/waiting will occur, and the greater the overall

system performance as instances are added.

Practically, though, it may not be possible to design a horizontally scalable system with no sharing, so

it is important to understand a subtlety of sharing: it is the frequency of sharing that degrades

performance, more so than the amount shared, because sharing means waiting. That is, a system that

must occasionally share something large will have higher performance than a system that frequently

shares small things.

To ensure a low frequency of sharing, the intelligent load balancer must be able to direct a packet to a core on another socket, and flow-state memory must be kept strictly local to a core (the most frequent activity performed in a network policy control system is flow-lookup).

When tasks are divided between multiple systems, there is a fundamental issue of determining how to

partition those tasks. The challenges associated with partitioning can be very complex, particularly

when one must determine how to partition domain functionality across many smaller nodes. The key is that

the split is not done at the packet or flow level, but at some more manageable sharing rate.

Ensuring core affinity means that any per-subscriber policy control use case (e.g., measurements, billing and charging, policy enforcement) can be fulfilled while preserving maximum performance. In the world of policy control, however, many use cases exist at an aggregate level; for these, applying policy control requires coordination between many separate cores – cores that are themselves split across many CPUs and sockets – to efficiently aggregate statistics and apply policy control and management that extends beyond the subscriber level.

Summary of Solution Requirements

The following table summarizes the minimum requirements to effectively and efficiently implement real-time network policy control as a virtual network function.


Table 1 - Summary of solution requirements

Objective: Maximize per-core performance and efficiency

Requirement: Must maintain core affinity at the flow, session, and subscriber levels
Explanation: Core affinity is required to avoid foreign memory access – the use of which leads to significant performance degradations as cores wait for information to be retrieved and the links themselves become congested. Consequently, the system requires an intelligent load balancer that directs each packet to the specific core that has flow, session, and subscriber state stored in the dedicated cache.

Requirement: Must use DPDK
Explanation: DPDK introduces tremendous performance advantages for memory access operations, and maximum per-core performance is not possible without these optimizations.

Objective: Maximize horizontal scale performance and efficiency

Requirement: Must make infrequent references across any shared context
Explanation: The less that is shared, and the less frequently there are references across the shared context, the less locking/waiting will occur, and the greater the overall system performance as instances are added.

Requirement: Must maintain core affinity across CPUs and sockets
Explanation: The requirement to maintain core affinity extends to the CPU and socket level for the same reasons as it is required within a multi-core processor: memory look-ups using interconnects must be eliminated. Consequently, the intelligent load balancer must be able to direct each packet to the appropriate core even if that core exists on a different CPU or socket.

Requirement: Must have strictly local flow-state memory
Explanation: The most frequent activity performed in a network policy control system is flow-lookup; consequently, to maximize performance it is imperative to have flow-state memory strictly local to a core.

Objective: Effectively partition tasks across multiple processing cores

Requirement: Must have an efficient means of policy coordination across cores
Explanation: Ensuring core affinity means that any per-subscriber policy control use cases can be fulfilled while preserving maximum performance. But in the world of policy control, many use cases exist at an aggregate level; for these use cases, applying the policy control requires coordinating between many separate cores – cores that themselves are split across many CPUs and sockets.

Requirement: Must not partition tasks at the packet or flow level
Explanation: To minimize the overhead associated with coordinating multiple processors in a subscriber-aware system, the lowest level at which tasks can be partitioned is the subscriber level.

Requirement: Must have an efficient means of stats aggregation across cores
Explanation: The many cores in the virtualization solution are performing lots of activities, and those activities generate statistics. The general challenge is that the system must be able to aggregate statistics from many cores, which themselves are distributed across physical sockets. Additionally, the aggregation process itself will consume processing capacity and is prone to bottlenecks.

Additional Resources

In addition to the resources linked and footnoted throughout this document, please consider reading The PTS Virtual Series: Maximizing Virtualization Performance (available at www.sandvine.com) to understand how Sandvine has implemented our network policy control as a highly scalable virtual network function.


Invitation to Provide Feedback

Thank you for taking the time to read this whitepaper. We hope that you found it useful, and that it

contributed to a greater understanding of some of the challenges that must be overcome to implement

policy control in a virtualized network.

If you have any feedback at all, then please get in touch with us at [email protected].


Copyright ©2015 Sandvine Incorporated ULC. Sandvine and the Sandvine logo are registered trademarks of Sandvine Incorporated ULC. All rights reserved.

European Offices

Sandvine Limited

Basingstoke, UK

Phone: +44 0 1256 698021

Email: [email protected]

Headquarters

Sandvine Incorporated ULC

Waterloo, Ontario Canada

Phone: +1 519 880 2600

Email: [email protected]