TRANSCRIPT
Copyright©2014 NTT corp. All Rights Reserved.
Accelerating SDN/NFV with transparent offloading architecture
Koji Yamazaki*, Takeshi Osaka†, Sadayuki Yasuda*, Shoko Ohteru*, and Akihiko Miyazaki*
*NTT Microsystem Integration Laboratories, Japan  †NTT Network Service Systems Laboratories, Japan
Open Networking Summit, Mar. 3-5, 2014, Santa Clara, CA, USA
Outline
• Challenge
• Our approach
• Experimental results
• Conclusion
Challenge
How can we enhance the performance of virtual network functions without increasing CAPEX or OPEX?
• Background
Lots of COTS accelerators and SDKs (FPGAs, NPUs, and GPUs)
Framework for saving energy in future networks (ITU-T Y.3021)
• Two objectives
To reduce programming effort
To enable high-performance, energy-efficient operations
Our approach
Goal: To accelerate required functions easily and efficiently
1. Transparent offloading architecture
   New programmable accelerator (ASIP*)
   Harmonization with x86 environments
2. Design of an application-specific instruction set
   Optimal instructions for coarse DPI
   Implementation of a simple data structure (Bloom filter)
*ASIP: Application-Specific Instruction-set Processor
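As background for the "simple data structure" above: a Bloom filter answers set-membership queries with no false negatives and a tunable false-positive rate, using a bit array and several independent hash functions. A minimal C sketch under those general properties (illustrative names, two hashes instead of the deck's three; not NTT's implementation):

```c
#include <stdint.h>

/* Minimal Bloom filter sketch: a bit array plus independent hash
 * functions. All sizes and names here are illustrative. */
#define BLOOM_BITS 1024          /* bit-array size (power of two) */

static uint8_t bloom[BLOOM_BITS / 8];

/* Bernstein's hash (djb2), multiplier 33. */
static uint32_t hash_a(const char *s) {
    uint32_t h = 5381;
    while (*s) h = h * 33 + (uint8_t)*s++;
    return h;
}

/* sdbm hash, a second independent function. */
static uint32_t hash_b(const char *s) {
    uint32_t h = 0;
    while (*s) h = (uint8_t)*s++ + (h << 6) + (h << 16) - h;
    return h;
}

static void bloom_set(uint32_t bit) { bloom[(bit % BLOOM_BITS) / 8] |= 1u << (bit % 8); }
static int  bloom_get(uint32_t bit) { return (bloom[(bit % BLOOM_BITS) / 8] >> (bit % 8)) & 1; }

/* Insert: set the bit chosen by every hash function. */
void bloom_add(const char *key) {
    bloom_set(hash_a(key));
    bloom_set(hash_b(key));
}

/* Query: "possibly present" only if every chosen bit is set. */
int bloom_query(const char *key) {
    return bloom_get(hash_a(key)) && bloom_get(hash_b(key));
}
```

Because only bit tests and simple hashes are involved, each query maps naturally onto a handful of custom instructions — which is what makes the structure attractive for coarse DPI on an ASIP.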
Overview of ASIP architecture
[Block diagram: five-stage pipeline (Fetch, Decode, Execute, Memory Access, Writeback) with general-purpose registers, ALUs, a state register, SRAM/GPIO load/store, SDRAM, and a highlighted extension point.]
*ISA: Instruction Set Architecture
• Embedded RISC CPU cores (e.g., MIPS, Cadence, Synopsys)
• Tunable architecture
• ISA* extension supported by a tailored compiler
• Customizable HW resources
ASIP for packet stream processing
[Block diagram: the same pipeline, extended with ingress- and egress-stream FIFOs on the load/store unit so that a fast U-plane can be configured.]
Transparent offloading
C-plane configuration
[Block diagram: an x86 host and the ASIP connected over PCIe. For C-plane configuration, the host issues DMA instructions and a DMAC moves data between host RAM and the ASIP core's I-RAM and D-RAM.]
Concept: Control ASIP functions from x86 environment
Transparent offloading
[Block diagram: the host walks its memory map over PCIe to search for the ASIP section, i.e., the region where the ASIP's I-RAM is exposed for DMA transfers.]
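On a Linux host, the "locate the ASIP section, then transfer" step might be approximated in user space by mapping the device's PCIe BAR through sysfs. Everything below — the sysfs path, the I-RAM offset, and the window size — is an assumption for illustration, not NTT's actual driver:

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hedged sketch: on Linux, a PCIe device's BAR can be mapped into user
 * space via its sysfs resource file, e.g.
 *   /sys/bus/pci/devices/0000:01:00.0/resource0
 * The offset and size of the I-RAM window inside the BAR are assumed. */
#define IRAM_OFFSET 0x0      /* where I-RAM starts inside BAR0 (assumed) */
#define IRAM_SIZE   0x10000  /* 64 KiB mapping window (assumed) */

/* Copy a compiled function image into the mapped I-RAM window. */
int load_asip_image(const char *bar_path, const uint32_t *image, size_t words)
{
    int fd = open(bar_path, O_RDWR);
    if (fd < 0) { perror("open"); return -1; }

    uint8_t *bar = mmap(NULL, IRAM_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping survives the close */
    if (bar == MAP_FAILED) { perror("mmap"); return -1; }

    volatile uint32_t *iram = (volatile uint32_t *)(bar + IRAM_OFFSET);
    for (size_t i = 0; i < words; i++)
        iram[i] = image[i];          /* PIO copy; the real path uses the DMAC */

    munmap(bar, IRAM_SIZE);
    return 0;
}
```

A production driver would program the DMAC instead of this word-by-word PIO loop, but the addressing model — host-visible window onto ASIP I-RAM — is the same.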
Transparent offloading
[Block diagram: the host DMA-transfers the forwarding functions' images from its RAM into the mapped ASIP I-RAM.]
Invoke coarse DPI function
[Block diagram: same x86/ASIP configuration; the host now launches the offloaded program on the ASIP.]
main():

    ...
    // Test data
    #define QUEUE_CHECK 50000
    ...
    int main() {
        ...
        // scan 50000 packets
        loop = 0;
        do {
            bloom_scan();
            loop++;
        } while (loop < QUEUE_CHECK);
        bloom_destroy();
        return EXIT_SUCCESS;
    }
Invoke coarse DPI function
[Block diagram: same configuration; bloom_scan() runs on the ASIP core.]
bloom_scan():

    void bloom_scan() {
        // Invoke instructions as intrinsics
        if (queue_vacancy_check()) {
            pop_queue();
            sax_hash_match();
            sdbm_hash_match();
            bernstein_hash_match();
            forward_data();
        }
    }
22 instructions and 14 registers were added for coarse DPI
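The three *_hash_match intrinsics correspond to the classic sax (shift-add-XOR), sdbm, and Bernstein (djb2) string hashes. Plain-C reference versions are shown below for comparison with the single-instruction hardware forms; the step that matches the hash against the Bloom filter's bit array is not shown:

```c
#include <stddef.h>
#include <stdint.h>

/* Software reference versions of the three hashes that the ASIP executes
 * as custom instructions. These are the well-known public-domain
 * algorithms; the hardware matching logic is NTT's and is not sketched. */

/* Shift-add-XOR (sax) hash. */
uint32_t sax_hash(const uint8_t *p, size_t n) {
    uint32_t h = 0;
    while (n--) h ^= (h << 5) + (h >> 2) + *p++;
    return h;
}

/* sdbm hash (from the sdbm database library). */
uint32_t sdbm_hash(const uint8_t *p, size_t n) {
    uint32_t h = 0;
    while (n--) h = *p++ + (h << 6) + (h << 16) - h;
    return h;
}

/* Bernstein's hash (djb2), multiplier 33. */
uint32_t bernstein_hash(const uint8_t *p, size_t n) {
    uint32_t h = 5381;
    while (n--) h = h * 33 + *p++;
    return h;
}
```

In the Bloom filter, each hash selects one bit of the filter; a packet field is reported as a (possible) match only when all three bits are set.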
Disassembly of bloom filter matching
Cycles   Profiled disassembly
  3      entry a1, 32
  1      queue_vacancy_check a2
  2      beqz.n a2, 60000465 <bloom_scan+0x19>
  1      pop_queue
  2      sdbm_hash_match
  1      bernstein_hash_match
  1      sax_hash_match
  1      forward_data
  1      retw.n
         <bloom_scan+0x19>:
  0      retw.n
13 cycles per bloom_scan() call
Experimental results
Evaluation item                          w/o acceleration   w/ our instructions
Run-time (mean # of cycles)*
  hash(sax)                              116                1
  hash(sdbm)                             115                2
  hash(bernstein)                        98                 1
  bloom_scan                             678                13   (down 98%)
Hardware size (logic gate count)
  core and SRAM                          75 KGates          79 KGates
Power dissipation
  core and SRAM                          < 100 mW           < 100 mW   (extremely low power)
Performance (64-byte packets)
  pps (packets/s)                        1 Mpps             57 Mpps   (50x faster)
  bps (bits/s)                           723 Mbps           38 Gbps

*50,000 packets, 64-bit fixed field, 45-nm simulation library.
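As a back-of-envelope check (my arithmetic, not from the deck): 57 Mpps of 64-byte frames reproduces the 38 Gbps figure once the 20 bytes of Ethernet preamble, SFD, and inter-frame gap per frame are counted on the wire, and 13 cycles per bloom_scan() at that packet rate implies a core clock of roughly 740 MHz:

```c
/* Sanity-check the results table. The 20-byte wire overhead
 * (7 B preamble + 1 B SFD + 12 B inter-frame gap) is the usual
 * convention for quoting Ethernet line rate; the implied clock
 * frequency is an inference, not a number from the deck. */

double line_rate_gbps(double pps, double frame_bytes) {
    const double wire_overhead = 20.0;      /* preamble + SFD + IFG */
    return pps * (frame_bytes + wire_overhead) * 8.0 / 1e9;
}

double implied_clock_mhz(double pps, double cycles_per_packet) {
    return pps * cycles_per_packet / 1e6;
}
```

line_rate_gbps(57e6, 64) gives about 38.3 Gbps, and implied_clock_mhz(57e6, 13) about 741 MHz — both consistent with the table.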
Conclusion
Designing an optimal instruction set that harmonizes with x86 environments
reduces the costs of acceleration
More challenging issues
Proprietary architecture → white-box architecture
Can an open ISA transform the ecosystem of accelerators?
My assumption: common, open ISA-based APIs will further reduce programming costs.
• Intel's AVX ISA extension
• Berkeley's RISC-V open ISA
• Emerging open-source SDKs (e.g., Centec's Lantern)
Accelerators have "the Force": a dark side (black-box SDKs of COTS ASSPs and NPUs) and a light side (open ISAs and open-source SDKs).
Thank you! Questions?