Augustus: a CCN router for programmable networks
ACM ICN 2016, Kyoto
Davide Kirchner1∗, Raihana Ferdous2∗, Renato Lo Cigno3,
Leonardo Maccari3, Massimo Gallo4, Diego Perino5∗, and Lorenzo Saino6
September 27, 2016
1Google Inc., Dublin, Ireland; 2Create-Net, Trento, Italy; 3DISI – University of Trento, Italy; 4Bell Labs – Nokia, Paris, France; 5Telefonica Research, Spain; 6Fastly, London, UK
∗This work was done while D. Kirchner and R. Ferdous were at the University of Trento, and D. Perino and L. Saino at Bell Labs.
Outline
1. Introduction
2. The Augustus CCN router
3. Performance evaluation
4. Conclusions and lessons learned
Introduction
Objectives
The main goal is to explore the possibilities offered by modern
general-purpose hardware in the context of information-centric
networking:
• Implement a CCN data plane forwarder fully in software
• Run on a commodity x86_64 machine
• Performance-oriented, open-source and extensible
• Analyze the performance in a worst-case scenario
Why a software router? Flexibility:
• Quicker development/deployment cycle and (re)configuration
• Hardware can be dynamically allocated to network functions
Tools
• Off-the-shelf high-performance hardware
• High-speed packet I/O libraries [Int, Riz12]
• Software routing frameworks built on top [BSM15, KJL+15]
Forwarding flow
• Focus on the Content Centric Networking approach [JST+09]
• Interests hold full content name
• Similar to CCNx (vs NDN)
• CS and PIT: exact match
• Longest-prefix match at FIB
Example: get /com/updates/sw/v4.2.5.tar.gz
Router R2:
• Forwarding Information Base (FIB): /com/updates → eth0
• Pending Interest Table (PIT): /com/updates/sw/v4.2.5.tar.gz → {eth1}
• Content Store (CS): /com/updates/sw/v4.2.5.tar.gz → (data…)
[Figure: topology with endpoints A, B, C and routers R1, R2, R3; R2 has interfaces eth0, eth1, eth2]
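The interest path on this slide (CS exact match, then PIT, then FIB longest-prefix match) can be sketched as follows. The tables here are toy linear scans, purely for illustration; Augustus's actual data structures are the optimized hash tables described later in the deck.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_ENTRIES 64

/* Toy in-memory tables standing in for Augustus's CS, PIT and FIB;
 * the real router uses optimized, cache-aligned hash tables. */
static const char *pit_names[MAX_ENTRIES];
static int pit_n;
static const char *cs_names[MAX_ENTRIES];
static int cs_n;
static struct { const char *prefix; int face; } fib[MAX_ENTRIES];
static int fib_n;

typedef enum { FWD_REPLY_FROM_CS, FWD_AGGREGATED, FWD_TO_FACE, FWD_DROP } fwd_action;

/* CS: exact match on the full content name. */
static bool cs_lookup(const char *name) {
    for (int i = 0; i < cs_n; i++)
        if (strcmp(cs_names[i], name) == 0)
            return true;
    return false;
}

/* PIT: exact match; returns true if an entry already existed
 * (interest aggregation), otherwise records the new pending name. */
static bool pit_add_or_aggregate(const char *name) {
    for (int i = 0; i < pit_n; i++)
        if (strcmp(pit_names[i], name) == 0)
            return true;
    pit_names[pit_n++] = name;
    return false;
}

/* FIB: longest-prefix match; for brevity this toy version compares raw
 * string prefixes and ignores '/'-component boundaries. */
static int fib_lpm(const char *name) {
    int best = -1;
    size_t best_len = 0;
    for (int i = 0; i < fib_n; i++) {
        size_t len = strlen(fib[i].prefix);
        if (len > best_len && strncmp(name, fib[i].prefix, len) == 0) {
            best = fib[i].face;
            best_len = len;
        }
    }
    return best;
}

/* Interest pipeline from the slide: CS -> PIT -> FIB. */
fwd_action forward_interest(const char *name, int *out_face) {
    if (cs_lookup(name))
        return FWD_REPLY_FROM_CS;   /* cache hit: answer locally */
    if (pit_add_or_aggregate(name))
        return FWD_AGGREGATED;      /* already pending: suppress */
    int face = fib_lpm(name);
    if (face < 0)
        return FWD_DROP;            /* no route */
    *out_face = face;
    return FWD_TO_FACE;
}
```

With a FIB entry /com/updates → face 0 as on the slide, a first interest for /com/updates/sw/v4.2.5.tar.gz is forwarded on that face, and a second interest for the same name is aggregated in the PIT.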
The Augustus CCN router
Design principles
• Exploit parallelism at all possible levels:
• Hardware multi-queue at NIC
• DRAM memory channels
• Multiple cores on chip
• Multiple NUMA sockets
• Data structures designed to match the x86 cache system
• Shared read-only FIB, duplicated in all NUMA sockets
• Sharded, thread-private CS and PIT
• Exploit NIC’s Receive Side Scaling capabilities to dispatch incoming
packets to threads
• Zero-copy packet processing
• Based on DPDK for fast packet I/O [Int]
• Explored two trade-offs: max performance or more flexibility
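The sharded, thread-private CS/PIT design boils down to mapping each name deterministically to one thread's private tables. A minimal software sketch of that mapping, using FNV-1a as an illustrative stand-in for the NIC's RSS hash (in the running router, the NIC hashes packet fields in hardware to steer packets to per-thread queues):

```c
#include <stdint.h>
#include <stddef.h>

/* FNV-1a hash: an illustrative choice, not necessarily what Augustus
 * or the NIC's RSS function actually computes. */
static uint32_t fnv1a(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Map a content name to one of the thread-private CS/PIT shards. */
static unsigned shard_for_name(const char *name, size_t len, unsigned n_shards) {
    return fnv1a(name, len) % n_shards;
}
```

Because the mapping is deterministic, all packets for a given name land on the same thread, so that thread's PIT and CS shards need no locking.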
Design - standalone
Low-level standalone C implementation:
• Based on low-level optimized APIs
• Pushes the platform to its limits
• Architecture based on Caesar [PVL+14]
Design - modular
• Based on (Fast)Click [KMC+00, BSM15]
• Easy to extend and experiment with
• Same optimized data structures
• Can be deployed alongside other routing components
[Figure: Click pipeline — FromDPDKDevice(n) → InputMux → CheckICNHeader → ICN_CS → ICN_PIT → ICN_FIB → OutputDemux → ToDPDKDevice(n), with separate output ports for Interest (I) and Data (D) packets on hit/miss, plus a Discard path]
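As a rough illustration, the elements in the diagram could be wired in a Click-style configuration such as the following. This is a sketch, not the shipped configuration: the chain is linearized, whereas the real elements emit packets on different ports depending on hit or miss:

```
// Hypothetical wiring, one chain per DPDK device n.
// Real configs split I(hit)/I(miss)/D(hit) across element ports.
FromDPDKDevice(0)
    -> InputMux
    -> CheckICNHeader
    -> ICN_CS
    -> ICN_PIT
    -> ICN_FIB
    -> OutputDemux
    -> ToDPDKDevice(0);
```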
Performance evaluation
Experimental setup
• Two twin machines, each with two 10Gbps Ethernet ports
• Measurements expressed in data packets per second
• Work in slight overload conditions
Worst-case assumptions:
• Every interest packet has a unique name: no CS hits, no PIT aggregation
• Minimal-sized packets, to stress the forwarding engine
[Figure: testbed — a traffic generator and sink (interest generator + echo server) connected to the Augustus router via eth0/eth1; interests flow toward the echo server and data packets flow back]
Threads and core mapping
Threads are pinned to processing cores
Test servers: 2 sockets × 8 cores × 2 (hyperthreading)
[Figure: cache topology — one shared L3 per socket; each physical core has private L2, L1-D and L1-I caches, shared by its two hyperthread siblings (e.g. CPUs 0/16, 2/18, …, 15/31)]
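Pinning a thread to one logical CPU, as done for the processing threads above, can be sketched on Linux as follows. The API use here is an illustrative glibc sketch, not the router's actual code:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one logical CPU; returns 0 on success.
 * On the test servers, siblings such as CPUs 0/16 share a physical core,
 * so the core id chosen decides whether two threads are hyperthreaded. */
int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```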
Standalone performance
[Figure: data throughput (Mpps) and L3 cache miss ratio vs. number of threads (1–32), single socket vs. dual socket vs. hyperthreading]
• 2 threads: large gap between hyperthreaded and physical cores
• Best performance: 4 threads (dual socket), 8 threads (single/dual)
Click module performance
[Figure: data throughput (Mpps) and L3 cache miss ratio vs. number of threads (1–32), single socket vs. dual socket vs. hyperthreading]
• 1 thread: same cache miss ratio, half the performance
• Best performance: 16 threads
FIB size scaling
[Figure: data throughput (Mpps) and cache miss ratio vs. number of FIB buckets (2^12–2^26), for Standalone (8, 4, 1 threads) and Click module (16, 1 threads)]
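The longest-prefix match whose cost this experiment probes is commonly implemented as repeated exact-match lookups, dropping one name component per iteration. A sketch under that assumption, where the toy fib_exact below is a linear scan standing in for a probe into the hash buckets varied on the x-axis:

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative FIB entries, not a real table. */
static struct { const char *prefix; int face; } fib_tab[] = {
    { "/com/updates", 0 },
    { "/com", 1 },
};

/* Exact-match probe on a name prefix of given length; a stand-in for
 * a hash-table bucket lookup in the real FIB. */
static bool fib_exact(const char *name, size_t len, int *face) {
    for (size_t i = 0; i < sizeof fib_tab / sizeof fib_tab[0]; i++) {
        if (strlen(fib_tab[i].prefix) == len &&
            strncmp(fib_tab[i].prefix, name, len) == 0) {
            *face = fib_tab[i].face;
            return true;
        }
    }
    return false;
}

/* Component-wise longest-prefix match: try the full name, then strip
 * one '/'-delimited component at a time and retry. Returns -1 if no
 * prefix matches. */
int fib_lpm(const char *name) {
    size_t len = strlen(name);
    int face;
    while (len > 0) {
        if (fib_exact(name, len, &face))
            return face;
        while (len > 0 && name[len - 1] != '/')
            len--;              /* strip the last component */
        if (len > 0)
            len--;              /* and its leading '/' */
    }
    return -1;
}
```

The number of probes is bounded by the number of name components, so lookup cost grows with name depth rather than with FIB size, while the bucket count governs collision rates and cache behavior.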
Conclusions and lessons learned
Conclusions and lessons learned
We presented Augustus, a CCN software router that:
• Forwards more than 10 million data packets per second, supports a FIB with up to 2^26 entries, and saturates a 10 Gbit/s link with Ethernet payloads as small as 87 bytes
• Was tested with a thorough, worst-case-oriented performance evaluation
• Runs either as a stand-alone system, achieving the best performance, or as a set of elements in the Click modular router framework
• Is open source and can be used in software-based networks for fast, incremental ICN deployment
Lessons learned:
• Manual configuration is needed for best performance
• Abstraction hides critical low-level properties
• Zero-copy is complex to achieve in a modular framework
Augustus: a CCN router for programmable networks
ACM ICN 2016, Kyoto
September 27, 2016
Thanks for your attention! [email protected]
Bibliography
References I
[BSM15] Tom Barbette, Cyril Soldani, and Laurent Mathy.
Fast userspace packet processing.
In Proceedings of the Eleventh ACM/IEEE Symposium on Architectures
for Networking and Communications Systems, ANCS ’15, pages 5–16,
Washington, DC, USA, 2015. IEEE Computer Society.
[Int] Intel®.
DPDK: Data Plane Development Kit.
http://dpdk.org.
[JST+09] Van Jacobson, Diana K. Smetters, James D. Thornton, Michael F. Plass,
Nicholas H. Briggs, and Rebecca L. Braynard.
Networking named content.
In Proceedings of the 5th International Conference on Emerging
Networking Experiments and Technologies, CoNEXT ’09, pages 1–12,
New York, NY, USA, 2009. ACM.
References II
[KJL+15] Joongi Kim, Keon Jang, Keunhong Lee, Sangwook Ma, Junhyun Shim,
and Sue Moon.
Nba (network balancing act): A high-performance packet processing
framework for heterogeneous processors.
In Proceedings of the Tenth European Conference on Computer Systems,
EuroSys ’15, pages 22:1–22:14, New York, NY, USA, 2015. ACM.
[KMC+00] Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans
Kaashoek.
The Click modular router.
ACM Trans. Comput. Syst., 18(3):263–297, August 2000.
[PVL+14] Diego Perino, Matteo Varvello, Leonardo Linguaglossa, Rafael Laufer, and
Roger Boislaigue.
Caesar: A Content Router for High-speed Forwarding on Content
Names.
In Proceedings of the Tenth ACM/IEEE Symposium on Architectures for
Networking and Communications Systems, ANCS ’14, pages 137–148,
New York, NY, USA, 2014. ACM.
References III
[Riz12] Luigi Rizzo.
netmap: A novel framework for fast packet I/O.
In 21st USENIX Security Symposium (USENIX Security 12), pages
101–112, Bellevue, WA, August 2012. USENIX Association.