IEEE HPSR 2014
Scaling Multi-Core Network Processors Without the Reordering Bottleneck
Alex Shpiner (Technion / Mellanox)
Isaac Keslassy (Technion)
Rami Cohen (IBM Research)
Network Processors (NPs)
NPs are used in routers for almost everything: forwarding, classification, deep packet inspection (DPI), firewalling, traffic engineering, VPN encryption, LZS decompression, advanced QoS, …
Increasingly heterogeneous processing demands.
Parallel Multi-Core NP Architecture
Each packet is assigned to a Processing Element (PE).
Any per-packet load balancing scheme.
E.g., Cavium CN68XX NP, EZChip NP-4
[Figure: parallel processing elements PE1, PE2, …, PEN]
Packet Ordering in NP
NPs are required to avoid out-of-order packet transmission within a flow.
Reordering affects TCP throughput, cross-packet DPI, statistics, etc.
The naïve solution is to avoid reordering altogether, but then heavy packets often delay light packets.
Can we reduce this reordering delay?
The Problem
Reducing reordering delay in parallel network processors
Multi-core Processing Alternatives
Static (hashed) mapping of flows to processing elements (PEs) [Cao et al., 2000], [Shi et al., 2005]
Potentially insufficient utilization of the PEs.
Feedback-based adaptation of static mapping [Kencl et al., 2002], [He et al., 2010], [We et al., 2011]
Causes packet reordering.
Pipeline without parallelism [Weng et al., 2004]
Not scalable, due to heterogeneous requirements and command granularity.
Single SN (Sequence Number) Approach
[Wu et al., 2005], [Govind et al., 2007]
SN (sequence number) generator.
Ordering unit: transmits only the oldest packet.
Large reordering delay.
[Figure: SN generator, processing elements PE1, PE2, …, PEN, and ordering unit]
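The single-SN scheme can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the class name `OrderingUnit`, the method `complete`, and the use of a min-heap are assumptions made for this example.

```python
import heapq
import itertools

# Hypothetical single SN generator: one global counter for all packets.
sn_generator = itertools.count(1)


class OrderingUnit:
    """Releases packets strictly in sequence-number order (illustrative sketch)."""

    def __init__(self):
        self.next_sn = 1      # SN of the oldest packet not yet transmitted
        self.waiting = []     # min-heap of (sn, packet) that finished processing

    def complete(self, sn, packet):
        """Called when a PE finishes a packet; returns the packets that may leave now."""
        heapq.heappush(self.waiting, (sn, packet))
        released = []
        # Only the oldest outstanding packet may be transmitted.
        while self.waiting and self.waiting[0][0] == self.next_sn:
            released.append(heapq.heappop(self.waiting)[1])
            self.next_sn += 1
        return released


ou = OrderingUnit()
heavy_sn = next(sn_generator)   # arrives first, slow to process
light_sn = next(sn_generator)   # arrives second, fast to process
print(ou.complete(light_sn, "light"))   # light finishes first but must wait
print(ou.complete(heavy_sn, "heavy"))   # heavy finishes, both are released
```

The light packet is held until the heavy one completes, even if the two belong to different flows, which is the large reordering delay noted above.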
Per-flow Sequencing (Ideal)
Actually, we need to preserve order only within a flow.
[Khotimsky et al., 2002], [Wu et al., 2005], [Shi et al., 2007], [Cheng et al., 2008]
SN (sequence number) generator for each flow.
Ideal approach: minimal reordering delay.
Not scalable to a large number of flows [Meitinger et al., 2008].
[Figure: one SN generator per flow (Flow 1, Flow 13, Flow 47, …, Flow 1000000), processing elements PE1, PE2, …, PEN, and ordering unit; packets are labeled flow:SN, e.g. 47:1, 13:1]
Hashed SN (Sequence Number) Approach
[M. Meitinger et al., 2008]
Multiple SN (sequence number) generators (ordering domains).
Hash flows (5-tuple) to an SN generator.
Yet, flows in the same hash bucket still incur reordering delay.
[Figure: hashing stage, SN generators 1, …, i, …, K, processing elements PE1, PE2, …, PEN, and ordering unit; packets are labeled generator:SN, e.g. 1:1, 7:1, 1:2]
Note: the flow is hashed to an SN generator, not to a PE.
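A minimal sketch of hashing flows to SN generators is shown here. The slides do not name a hash function, so CRC32 over a string form of the 5-tuple is an assumption, as are the class and method names; domains are indexed from 0 here rather than from 1 as in the figure.

```python
import itertools
import zlib


class HashedSNGenerators:
    """K independent SN counters; a flow's 5-tuple selects the counter (sketch)."""

    def __init__(self, k):
        self.counters = [itertools.count(1) for _ in range(k)]

    def assign(self, five_tuple):
        # Assumed hash: CRC32 of the 5-tuple rendered as text.
        key = "|".join(map(str, five_tuple)).encode()
        domain = zlib.crc32(key) % len(self.counters)
        # The packet is ordered only against packets in the same domain.
        return domain, next(self.counters[domain])


gens = HashedSNGenerators(k=10)
flow = ("10.0.0.1", "10.0.0.2", 1234, 80, "TCP")   # hypothetical 5-tuple
print(gens.assign(flow))   # (domain, SN) for the first packet of the flow
print(gens.assign(flow))   # same domain, next SN
```

Packets of one flow always land in the same ordering domain, so their order is preserved; unrelated flows that collide in a bucket still wait for each other.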
Our Proposal
Leverage estimation of packet processing delay.
Instead of arbitrary ordering domains created by a hash function, create ordering domains of packets with similar processing delay requirements.
A heavy-processing packet does not delay a light-processing packet in the ordering unit.
Assumption: all packets within a given flow have similar processing requirements.
Reminder: order must be preserved only within a flow.
Processing Phases
E.g.: IP forwarding = 1 phase, encryption = 10 phases.
[Figure: sample code listing divided into processing phases #1 to #5]
Disclaimer: this is not real packet-processing code.
RP3 (Reordering Per Processing Phase) Algorithm
[Figure: processing estimator, SN generators 1, …, i, …, K, processing elements PE1, PE2, …, PEN, and ordering unit; packets are labeled generator:SN, e.g. 1:1, 7:1, 7:2]
All the packets in the ordering domain have the same number of processing phases (up to K).
Lower similarity of processing delay affects the performance (reordering delay), but not the order!
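As a rough sketch of the case where the number of phases is estimated on arrival, the ordering domain can be chosen from the estimate instead of from a hash. The names below are hypothetical, and reading "up to K" as clamping larger estimates into domain K is an assumption of this example.

```python
import itertools


class PhaseBasedSNGenerators:
    """One SN counter per estimated number of processing phases, 1..K (sketch)."""

    def __init__(self, k):
        self.k = k
        self.counters = {d: itertools.count(1) for d in range(1, k + 1)}

    def assign(self, estimated_phases):
        # Assumption: estimates outside 1..K are clamped into the valid range.
        domain = min(max(estimated_phases, 1), self.k)
        return domain, next(self.counters[domain])


gens = PhaseBasedSNGenerators(k=10)
print(gens.assign(1))   # e.g. a forwarding packet, domain 1
print(gens.assign(7))   # a 7-phase packet, domain 7
print(gens.assign(7))   # another 7-phase packet, next SN in domain 7
```

A wrong estimate only places a packet among packets with different delays, which costs reordering delay; order within a flow still holds as long as all packets of the flow receive the same estimate.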
Knowledge Frameworks
The stage at which the packet processing requirements become known:
1. Known upon packet arrival.
2. Known only at the processing start.
3. Known only at the processing completion.
RP3 Algorithm for Framework 3
Assumption: the packet processing requirements are known only when the processing is completed.
Example: a packet that finishes all its processing after one processing phase is not delayed by another packet that is still being processed in its 2nd phase, because the two packets must belong to different flows.
Theorem: an ideal partition into phases would reduce the reordering delay to 0.
[Figure: timeline, order of arrival vs. time. Packet A (ϕ=2) goes through phases no. 1 and no. 2; packet B (ϕ=1) goes through phase no. 1 only; ϕ is the number of phases. Departures are marked A_out and B_out.]
RP3 Algorithm for Framework 3
But, in reality:
[Figure: timeline, order of arrival vs. time, for packets A (ϕ=2) and B (ϕ=1), showing phases no. 1 and no. 2 and the departures B_out and A_out]
RP3 Algorithm for Framework 3
Each packet needs to go through several SN generators.
After completing the φ-th processing phase, it asks for the next SN from the (φ+1)-th SN generator.
[Figure: timeline, order of arrival vs. time. A (ϕ=2) holds SN 1:1 and then SN 2:1 from the next SN generator; B (ϕ=1) holds SN 1:2. Times t_A,1 and t_C,1 and departures B_out and A_out are marked.]
RP3 Algorithm for Framework 3
When a packet requests a new SN, it cannot always get it immediately.
The φ-th SN generator grants a new SN to the oldest packet that has finished processing φ phases.
There is no processing preemption!
[Figure: timeline, order of arrival vs. time. A (ϕ=2) holds SN 1:1 and then SN 2:1; B (ϕ=1) holds SN 1:2; C (ϕ=2) holds SN 1:3 and then SN 2:2. Times t_A,1 and t_C,1, the moments when the next SN is requested and granted, and departures B_out, A_out and C_out are marked.]
RP3 – Framework 3
(1) A packet arrives and is assigned an SN by SN generator 1 (SN_1).
(2) At the end of processing phase φ, send a request for SN_(φ+1). When granted, increment SN.
(3) SN generator φ: grant the token when SN == oldestSN_φ, then increment oldestSN_φ and nextSN_φ.
(4) PE: when all processing phases are finished, send the packet to the ordering unit (OU).
(5) OU: complete the SN grants.
(6) OU: when all SNs are granted, transmit the packet to the output.
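Steps (1) to (6) can be approximated by a small timing model. This is a simplified sketch and not the authors' implementation: it assumes the time at which each packet finishes each phase is given as input (PE scheduling is not modeled), treats every ordering domain as a FIFO in SN order, and uses hypothetical function names. A single-SN model is included only for comparison.

```python
from collections import defaultdict


def rp3_departures(packets):
    """packets: list in arrival order of (name, [finish time of phase 1, 2, ...]).

    Returns the time at which each packet may be transmitted.
    """
    # For each ordering domain, the time its previous packet left the domain,
    # either by being granted the next SN or by being transmitted.
    last_resolved = defaultdict(float)
    departures = {}
    for name, finish_times in packets:
        t = 0.0
        for phase, done in enumerate(finish_times, start=1):
            # The packet leaves domain `phase` once it has finished that phase
            # and every older packet in the domain has already left it.
            t = max(t, done, last_resolved[phase])
            last_resolved[phase] = t
        departures[name] = t
    return departures


def single_sn_departures(packets):
    """Baseline: one global SN, so a packet waits for all older packets."""
    last = 0.0
    departures = {}
    for name, finish_times in packets:
        last = max(last, finish_times[-1])
        departures[name] = last
    return departures


# Hypothetical example with A and C needing 2 phases and B needing 1.
packets = [("A", [10, 20]), ("B", [12]), ("C", [11, 25])]
print(rp3_departures(packets))
print(single_sn_departures(packets))
```

In this example B leaves at time 12 under the phase-based model, as soon as A has moved on to the second ordering domain, while under the single-SN baseline it waits until time 20. C finishes its first phase at time 11 but is granted its next SN only at time 12, after B has left the first domain, and its processing continues in the meantime.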
Simulations: Reordering Delay vs. Processing Variability
Synthetic traffic
Poisson arrivals
Processing requirements uniformly distributed over [1, 10] phases.
• For a fair comparison, 10 hash buckets in the Hashed-SN algorithm.
Packets distributed among 300 flows according to a Zipf distribution.
Phase processing delay variability: Delay ~ U[min, max]. Variability = max/min. E[delay] = 100 time units.
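The bounds of the uniform distribution follow from the two stated conditions: (min + max)/2 = 100 and max = variability × min give min = 200/(1 + variability). A short sketch, with hypothetical function names, of how such delays could be sampled:

```python
import random


def phase_delay_bounds(variability, mean=100.0):
    """Return (low, high) with high / low == variability and (low + high) / 2 == mean."""
    low = 2.0 * mean / (1.0 + variability)
    return low, variability * low


def sample_phase_delay(variability, mean=100.0, rng=random):
    """Draw one phase processing delay from U[low, high]."""
    low, high = phase_delay_bounds(variability, mean)
    return rng.uniform(low, high)


print(phase_delay_bounds(1))   # no variability: every phase takes the mean
print(phase_delay_bounds(3))   # max is three times min
print(sample_phase_delay(3))
```

With variability 1 every phase takes exactly 100 time units, which corresponds to the ideal conditions in which phases partition the processing evenly.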
[Figure: mean reordering delay vs. phase processing delay variability. Annotations: ideal conditions, no reordering delay; improvement by orders of magnitude; improvement by an order of magnitude also with high phase processing delay variability.]
Simulations: Reordering Delay vs. Load
Real-life trace: CAIDA anonymized Internet traces
[Figure: mean reordering delay vs. % load. Annotation: improvement by orders of magnitude.]
Note: reordering delay occurs even under low load.
Summary
Novel reordering algorithms for parallel multi-core network processors reduce reordering delays
Rely on the fact that all packets of a given flow have similar required processing functions.
Three frameworks that define the stages at which the network processor knows about the packet processing requirements.
Analysis using simulations: reordering delays are negligible, both under synthetic traffic and real-life traces.
Analytical model (in the paper)
Thank you.