TRANSCRIPT
Copyright©2014 NTT corp. All Rights Reserved.
Accelerating SDN/NFV with transparent offloading architecture
Koji Yamazaki*, Takeshi Osaka†, Sadayuki Yasuda*, Shoko Ohteru*, and Akihiko Miyazaki*
*NTT Microsystem Integration Laboratories, Japan  †NTT Network Service Systems Laboratories, Japan
Open Networking Summit, Mar. 3-5, 2014, Santa Clara, CA, USA
Outline
• Challenge
• Our approach
• Experimental results
• Conclusion
Challenge
How can we enhance the performance of virtual network functions without increasing CAPEX or OPEX?
• Background
Lots of COTS accelerators and SDKs (FPGAs, NPUs, and GPUs)
Framework for saving energy in future networks (ITU-T Y.3021)
• Two objectives
To reduce programming effort
To enable high-performance, energy-efficient operations
Our approach
Goal: To accelerate required functions easily and efficiently
1. Transparent offloading architecture
   New programmable accelerator (ASIP*)
   Harmonization with x86 environments
2. Design of an application-specific instruction set
   Optimal instructions for coarse DPI
   Implementation of a simple data structure (Bloom filter)
*ASIP: Application-Specific Instruction-set Processor
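As background for the "simple data structure" above: a Bloom filter answers set-membership queries with no false negatives and a tunable false-positive rate, using a bit array and several independent hash functions. A minimal C sketch under those general properties (illustrative names, two hashes instead of the deck's three; not NTT's implementation):

```c
#include <stdint.h>

/* Minimal Bloom filter sketch: a bit array plus independent hash
 * functions. All sizes and names here are illustrative. */
#define BLOOM_BITS 1024          /* bit-array size (power of two) */

static uint8_t bloom[BLOOM_BITS / 8];

/* Bernstein's hash (djb2), multiplier 33. */
static uint32_t hash_a(const char *s) {
    uint32_t h = 5381;
    while (*s) h = h * 33 + (uint8_t)*s++;
    return h;
}

/* sdbm hash, a second independent function. */
static uint32_t hash_b(const char *s) {
    uint32_t h = 0;
    while (*s) h = (uint8_t)*s++ + (h << 6) + (h << 16) - h;
    return h;
}

static void bloom_set(uint32_t bit) { bloom[(bit % BLOOM_BITS) / 8] |= 1u << (bit % 8); }
static int  bloom_get(uint32_t bit) { return (bloom[(bit % BLOOM_BITS) / 8] >> (bit % 8)) & 1; }

/* Insert: set the bit chosen by every hash function. */
void bloom_add(const char *key) {
    bloom_set(hash_a(key));
    bloom_set(hash_b(key));
}

/* Query: "possibly present" only if every chosen bit is set. */
int bloom_query(const char *key) {
    return bloom_get(hash_a(key)) && bloom_get(hash_b(key));
}
```

Because only bit tests and simple hashes are involved, each query maps naturally onto a handful of custom instructions — which is what makes the structure attractive for coarse DPI on an ASIP.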
Overview of ASIP architecture
[Block diagram: five-stage pipeline (Fetch, Decode, Execute, Memory Access, Writeback) with general-purpose registers, ALUs, a state register, SRAM/GPIO load/store, SDRAM, and a highlighted extension point.]
*ISA: Instruction Set Architecture
• Embedded RISC CPU cores (e.g., MIPS, Cadence, Synopsys)
• Tunable architecture
• ISA* extension supported by a tailored compiler
• Customizable HW resources
ASIP for packet stream processing
[Block diagram: the same pipeline, extended with ingress- and egress-stream FIFOs on the load/store unit so that a fast U-plane can be configured.]
Transparent offloading
C-plane configuration
[Block diagram: an x86 host and the ASIP connected over PCIe. For C-plane configuration, the host issues DMA instructions and a DMAC moves data between host RAM and the ASIP core's I-RAM and D-RAM.]
Concept: Control ASIP functions from x86 environment
Transparent offloading
[Block diagram: the host walks its memory map over PCIe to search for the ASIP section, i.e., the region where the ASIP's I-RAM is exposed for DMA transfers.]
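On a Linux host, the "locate the ASIP section, then transfer" step might be approximated in user space by mapping the device's PCIe BAR through sysfs. Everything below — the sysfs path, the I-RAM offset, and the window size — is an assumption for illustration, not NTT's actual driver:

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hedged sketch: on Linux, a PCIe device's BAR can be mapped into user
 * space via its sysfs resource file, e.g.
 *   /sys/bus/pci/devices/0000:01:00.0/resource0
 * The offset and size of the I-RAM window inside the BAR are assumed. */
#define IRAM_OFFSET 0x0      /* where I-RAM starts inside BAR0 (assumed) */
#define IRAM_SIZE   0x10000  /* 64 KiB mapping window (assumed) */

/* Copy a compiled function image into the mapped I-RAM window. */
int load_asip_image(const char *bar_path, const uint32_t *image, size_t words)
{
    int fd = open(bar_path, O_RDWR);
    if (fd < 0) { perror("open"); return -1; }

    uint8_t *bar = mmap(NULL, IRAM_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping survives the close */
    if (bar == MAP_FAILED) { perror("mmap"); return -1; }

    volatile uint32_t *iram = (volatile uint32_t *)(bar + IRAM_OFFSET);
    for (size_t i = 0; i < words; i++)
        iram[i] = image[i];          /* PIO copy; the real path uses the DMAC */

    munmap(bar, IRAM_SIZE);
    return 0;
}
```

A production driver would program the DMAC instead of this word-by-word PIO loop, but the addressing model — host-visible window onto ASIP I-RAM — is the same.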
Transparent offloading
[Block diagram: the host DMA-transfers the forwarding functions' images from its RAM into the mapped ASIP I-RAM.]
Invoke coarse DPI function
[Block diagram: same x86/ASIP configuration; the host now launches the offloaded program on the ASIP.]
main():

    ...
    // Test data
    #define QUEUE_CHECK 50000
    ...
    int main() {
        ...
        // scan 50000 packets
        loop = 0;
        do {
            bloom_scan();
            loop++;
        } while (loop < QUEUE_CHECK);
        bloom_destroy();
        return EXIT_SUCCESS;
    }
Invoke coarse DPI function
[Block diagram: same configuration; bloom_scan() runs on the ASIP core.]
bloom_scan():

    void bloom_scan() {
        // Invoke instructions as intrinsics
        if (queue_vacancy_check()) {
            pop_queue();
            sax_hash_match();
            sdbm_hash_match();
            bernstein_hash_match();
            forward_data();
        }
    }
22 instructions and 14 registers were added for coarse DPI
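The three *_hash_match intrinsics correspond to the classic sax (shift-add-XOR), sdbm, and Bernstein (djb2) string hashes. Plain-C reference versions are shown below for comparison with the single-instruction hardware forms; the step that matches the hash against the Bloom filter's bit array is not shown:

```c
#include <stddef.h>
#include <stdint.h>

/* Software reference versions of the three hashes that the ASIP executes
 * as custom instructions. These are the well-known public-domain
 * algorithms; the hardware matching logic is NTT's and is not sketched. */

/* Shift-add-XOR (sax) hash. */
uint32_t sax_hash(const uint8_t *p, size_t n) {
    uint32_t h = 0;
    while (n--) h ^= (h << 5) + (h >> 2) + *p++;
    return h;
}

/* sdbm hash (from the sdbm database library). */
uint32_t sdbm_hash(const uint8_t *p, size_t n) {
    uint32_t h = 0;
    while (n--) h = *p++ + (h << 6) + (h << 16) - h;
    return h;
}

/* Bernstein's hash (djb2), multiplier 33. */
uint32_t bernstein_hash(const uint8_t *p, size_t n) {
    uint32_t h = 5381;
    while (n--) h = h * 33 + *p++;
    return h;
}
```

In the Bloom filter, each hash selects one bit of the filter; a packet field is reported as a (possible) match only when all three bits are set.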
Disassembly of bloom filter matching
Cycles   Profiled disassembly
  3      entry a1, 32
  1      queue_vacancy_check a2
  2      beqz.n a2, 60000465 <bloom_scan+0x19>
  1      pop_queue
  2      sdbm_hash_match
  1      bernstein_hash_match
  1      sax_hash_match
  1      forward_data
  1      retw.n
         <bloom_scan+0x19>:
  0      retw.n
13 cycles per bloom_scan() call
Experimental results
Evaluation item                          w/o acceleration   w/ our instructions
Run-time (mean # of cycles)*
  hash(sax)                              116                1
  hash(sdbm)                             115                2
  hash(bernstein)                        98                 1
  bloom_scan                             678                13   (down 98%)
Hardware size (logic gate count)
  core and SRAM                          75 KGates          79 KGates
Power dissipation
  core and SRAM                          < 100 mW           < 100 mW   (extremely low power)
Performance (64-byte packets)
  pps (packets/s)                        1 Mpps             57 Mpps   (50x faster)
  bps (bits/s)                           723 Mbps           38 Gbps

*50,000 packets, 64-bit fixed field, 45-nm simulation library.
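As a back-of-envelope check (my arithmetic, not from the deck): 57 Mpps of 64-byte frames reproduces the 38 Gbps figure once the 20 bytes of Ethernet preamble, SFD, and inter-frame gap per frame are counted on the wire, and 13 cycles per bloom_scan() at that packet rate implies a core clock of roughly 740 MHz:

```c
/* Sanity-check the results table. The 20-byte wire overhead
 * (7 B preamble + 1 B SFD + 12 B inter-frame gap) is the usual
 * convention for quoting Ethernet line rate; the implied clock
 * frequency is an inference, not a number from the deck. */

double line_rate_gbps(double pps, double frame_bytes) {
    const double wire_overhead = 20.0;      /* preamble + SFD + IFG */
    return pps * (frame_bytes + wire_overhead) * 8.0 / 1e9;
}

double implied_clock_mhz(double pps, double cycles_per_packet) {
    return pps * cycles_per_packet / 1e6;
}
```

line_rate_gbps(57e6, 64) gives about 38.3 Gbps, and implied_clock_mhz(57e6, 13) about 741 MHz — both consistent with the table.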
Conclusion
Designing an optimal instruction set that harmonizes with x86 environments
reduces the costs of acceleration
More challenging issues
Proprietary architecture → white-box architecture
Can an open ISA transform the ecosystem of accelerators?
My assumption: common, open ISA-based APIs will further reduce programming costs.
• Intel's AVX ISA extension
• Berkeley's RISC-V open ISA
• Emerging open-source SDKs (e.g., Centec's Lantern)
Accelerators have "the Force": a dark side (black-box SDKs of COTS ASSPs and NPUs) and a light side (open ISAs and open-source SDKs).
Thank you! Questions?