Sundar Iyer, Winter 2012, Lecture 8a: Packet Buffers with Latency (EE384 Packet Switch Architectures)
TRANSCRIPT
![Page 1: Sundar Iyer Winter 2012 Lecture 8a Packet Buffers with Latency EE384 Packet Switch Architectures](https://reader035.vdocument.in/reader035/viewer/2022062421/56649d835503460f94a68ea9/html5/thumbnails/1.jpg)
Sundar Iyer
Winter 2012, Lecture 8a
Packet Buffers with Latency
EE384 Packet Switch Architectures
[Page 2]
Reducing the size of the SRAM

Intuition:
• If we use a lookahead buffer to peek at the requests "in advance", we can replenish the SRAM cache only when needed.
• This increases the latency from when a request is made until the byte is available.
• But because it is a pipeline, the issue rate is the same.
[Page 3]
Algorithm

1. Lookahead: the next Q(b – 1) + 1 arbiter requests are known (held in the lookahead buffer).
2. Compute: determine which queue will run into "trouble" soonest (green, in the figure).
3. Replenish: fetch b bytes for the "troubled" queue.

Figure: Q queues, each with b – 1 bytes of SRAM cache, alongside the Q(b – 1) + 1 requests in the lookahead buffer.
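The "Compute" step can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation; the function name `earliest_critical` and the concrete request sequence are made up for the example.

```python
def earliest_critical(sram, lookahead):
    # Step 2 of the algorithm: replay the known future requests against
    # the current per-queue SRAM byte counts and report which queue goes
    # "into trouble" (runs out of cached bytes) soonest, and when.
    need = list(sram)
    for i, q in enumerate(lookahead):
        need[q] -= 1
        if need[q] < 0:
            return q, i          # (queue index, offset of first trouble)
    return None, None            # no queue goes critical in this window

# Q = 4 queues, b = 4: each queue caches b - 1 = 3 bytes, and the
# lookahead holds the next Q*(b-1) + 1 = 13 requests (0 = green, ...).
sram = [3, 3, 3, 3]
lookahead = [0, 0, 0, 0, 1, 1, 2, 0, 1, 3, 2, 1, 3]
print(earliest_critical(sram, lookahead))   # → (0, 3): green is critical
```

The "Replenish" step would then fetch b bytes from DRAM for queue 0 before its fourth request is served.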
[Page 4]
Example: Q = 4, b = 4

The figure steps through time slots t = 0 … 8, showing the four queues and the requests in the lookahead buffer at each step: at t = 0 green is critical, at t = 4 blue is critical, and at t = 8 red is critical; each critical queue is replenished with b = 4 bytes.
[Page 5]
Claim
Patient Arbiter: with an SRAM of size Q(b – 1), a requested byte is available no later than Q(b – 1) + 1 timeslots after it is requested.

Example: 160 Gb/s linecard, b = 2560, Q = 512: SRAM = 1.3 MBytes; the delay bound is 65 µs (equivalent to 13 miles of fiber).
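The numbers in the example follow from the claim directly, taking one timeslot to be the time to receive one byte at the line rate (an assumption consistent with the byte-granularity of the scheme):

```python
Q, b = 512, 2560                    # queues, bytes per DRAM access
line_rate = 160e9                   # bits per second
sram_bytes = Q * (b - 1)            # SRAM cache: Q(b - 1) bytes
delay_slots = Q * (b - 1) + 1       # worst-case delay in timeslots
byte_time = 8 / line_rate           # one timeslot = one byte time

print(round(sram_bytes / 2**20, 2))              # → 1.25 (MBytes, ~1.3)
print(round(delay_slots * byte_time * 1e6, 1))   # → 65.5 (microseconds)
```

At roughly 2×10⁸ m/s in fiber, 65 µs corresponds to on the order of ten miles of fiber, matching the slide's comparison.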
[Page 6]
Controlling the latency

What if the application can only tolerate a latency l < Q(b – 1) + 1 timeslots?

Algorithm: serve a queue once every b timeslots in the following order:
1. If there is an earliest critical queue, replenish it.
2. If not, then replenish the queue that will have the largest deficit l timeslots in the future.
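The two-rule selection can be sketched as below. This is a hedged illustration in Python, not the lecture's code; `choose_queue` is a hypothetical name, and `sram` holds per-queue cached byte counts while `lookahead` holds the next known requests.

```python
def choose_queue(sram, lookahead, l):
    # Replenishment choice under a latency budget l < Q(b-1) + 1:
    # only the next l requests are visible to the arbiter.
    need = list(sram)
    for q in lookahead[:l]:
        need[q] -= 1
        if need[q] < 0:
            return q                 # rule 1: earliest critical queue
    # rule 2: no queue goes critical within l slots, so pick the queue
    # with the largest deficit (smallest remaining count) at time +l.
    return min(range(len(sram)), key=need.__getitem__)

print(choose_queue([1, 0, 2], [0, 1, 1, 0], 3))   # → 1 (queue 1 critical)
print(choose_queue([2, 2, 2], [0, 0, 1], 3))      # → 0 (largest deficit)
```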
[Page 7]
Queue length vs. pipeline depth (Q = 1000, b = 10)

Figure: SRAM size as a function of the pipeline latency x, falling from the queue length required for zero latency down to the queue length required for maximum latency.
[Page 8]
Some Real Implementations

• Cache size can be further optimized based on:
  – using embedded DRAM;
  – dividing queues across ingress and egress;
  – grouping queues into aggregates (10 × 1G is a 10G stream);
  – knowledge of the scheduler pattern;
  – granularity of the read scheduler.
• Most implementations:
  – Q < 1024 (original buffer cache), Q < 16K (modified schemes);
  – 15+ unique designs;
  – 40G to 480G streams;
  – cache sizes < 32 Mb (~8 mm² in a 28nm implementation).
[Page 9]
Sundar Iyer
Winter 2012, Lecture 8b
Statistics Counters
EE384 Packet Switch Architectures
[Page 10]
What is a Statistics Counter?
• When a packet arrives, it is first classified to determine the type of the packet.
• Depending on the type(s), one or more counters corresponding to the matching criteria are updated/incremented.
[Page 11]
Examples of Statistics Counters

• Packet switches maintain counters for:
  – route prefixes, ACLs, TCP connections;
  – SNMP MIBs.
• Counters are required for:
  – traffic engineering (policing & shaping);
  – intrusion detection;
  – performance monitoring (RMON);
  – network tracing.
[Page 12]
Stanford University

Motivation: Build high speed counters

Determine and analyze techniques for building very high speed (>100 Gb/s) statistics counters and support an extremely large number of counters.

Example: OC192c = 10 Gb/s; counters/pkt (C) = 10; Reff = 100 Gb/s; total counters (N) = 1 million; counter size (M) = 64 bits; counter memory = N × M = 64 Mb; MOPS = 150M reads, 150M writes.
[Page 13]
Question

• Can we use the same packet-buffer hierarchy to build statistics counters?
• Answer: yes.
  – Modify the MDQF algorithm.
  – Number of counters = N.
  – Consider counters in base 1 (mimics cells of size 1 bit per second).
  – Create the Longest Counter First algorithm.
  – Maximum size of counter (base 1): ~ b(2 + ln N).
  – Maximum size of counter (base 2): ~ log2[b(2 + ln N)].
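A minimal sketch of the Longest Counter First idea in Python follows. It is an illustration under simplifying assumptions (one increment per cycle, uniformly random counters, DRAM speed-down b = 20), not the lecture's algorithm verbatim; the function name `lcf` is made up.

```python
import math
import random

def lcf(N=1000, b=20, T=100_000, seed=1):
    # Small SRAM counters absorb increments; once every b cycles the
    # single available DRAM write flushes the longest SRAM counter
    # (adds it to the full-width DRAM counter and zeroes it).
    rng = random.Random(seed)
    sram = [0] * N
    dram = [0] * N
    peak = 0
    for t in range(T):
        c = rng.randrange(N)
        sram[c] += 1                         # one increment per cycle
        peak = max(peak, sram[c])
        if t % b == b - 1:                   # one DRAM write per b cycles
            j = max(range(N), key=sram.__getitem__)
            dram[j] += sram[j]               # flush the longest counter
            sram[j] = 0
    return peak

# The largest SRAM counter should stay well under the b(2 + ln N) bound.
print(lcf(), "bound:", round(20 * (2 + math.log(1000))))
```

Under random (non-adversarial) arrivals the observed peak is far below the worst-case bound; the bound is what an adversary can force.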
[Page 14]
Optimality of LCF-CMA

Can we do better?

Theorem: LCF-CMA is optimal in the sense that it minimizes the size of the counter maintained in SRAM.
[Page 15]
Minimum Size of the SRAM Counter

Theorem: (speed-down factor b > 1)

LCF-CMA minimizes the size of the counter (in bits) in the SRAM to:

f(b, N) = log2( ln(bN) / ln[b/(b − 1)] ) = O(log log N)
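The formula is easy to evaluate; rounding up to whole bits and using the speed-down factor b = 20 from the implementation example later in the deck (both assumptions of this sketch), it reproduces the f(b, N) values in the table that follows:

```python
import math

def f(b, N):
    # Minimum SRAM counter width in bits, per the LCF-CMA theorem,
    # rounded up to a whole number of bits.
    return math.ceil(math.log2(math.log(b * N) / math.log(b / (b - 1))))

print(f(20, 1000), f(20, 1_000_000))   # → 8 9
```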
[Page 16]
Implementation Numbers

Example:
• OC192 line card: R = 10 Gb/s
• No. of counters per packet: C = 10
• Reff = R × C: 100 Gb/s
• Cell size: 64 bytes
• Teff = 5.12/2: 2.56 ns
• DRAM random access time: 51.2 ns
• Speed-down factor, b ≥ T/Teff: 20
• Total number of counters: 1000, 1 million
• Size of counters: 64 bits, 8 bits
[Page 17]
Implementation Examples: 10 Gb/s Line Card

| Values | f(b, N) | Brute Force | LCF-CMA |
|---|---|---|---|
| N = 1000, M = 8 | 8 | SRAM = 8Kb, DRAM = 0 | SRAM = 8Kb, DRAM = 8Kb |
| N = 1000, M = 64 | 8 | SRAM = 64Kb, DRAM = 0 | SRAM = 8Kb, DRAM = 64Kb |
| N = 1000000, M = 64 | 9 | SRAM = 64Mb, DRAM = 0 | SRAM = 9Mb, DRAM = 64Mb |
[Page 18]
Conclusion
• The SRAM size required by LCF-CMA is a very slow growing function of N
• There is a minimal tradeoff in SRAM memory size
• The LCF-CMA technique allows a designer to arbitrarily scale the speed of a statistics counter architecture using existing memory technology
[Page 19]
Sundar Iyer
Winter 2012, Lecture 8c
QoS and PIFO
EE384 Packet Switch Architectures
[Page 20]
Contents
1. Parallel routers: work-conserving (Recap)
2. Parallel routers: delay guarantees
[Page 21]
The Parallel Shared Memory Router

Figure: arriving packets from ports A, B, C (each at rate R) are written into one of K = 8 shared memories (Memory 1 … Memory 8) and later read out as departing packets (rate R). Packet labels (A1, A3, A4, A5, B1, B3, C1, C3) show example placements. At most one operation – a write or a read – per memory per time slot.
[Page 22]
Problem

• How can we design a parallel output-queued work-conserving router from slower parallel memories?

Theorem 1: (sufficiency)

A parallel output-queued router is work-conserving with 3N – 1 memories that can perform at most one memory operation per time slot.
[Page 23]
Re-stating the Problem

• There are K cages which can contain an infinite number of pigeons.
• Assume that time is slotted, and in any one time slot:
  – at most N pigeons can arrive and at most N can depart;
  – at most 1 pigeon can enter or leave a cage via a pigeon hole;
  – the time slot at which arriving pigeons will depart is known.
• For any router: what is the minimum K such that all N pigeons can be immediately placed in a cage when they arrive, and can depart at the right time?
[Page 24]
Intuition for Theorem 1

• Only one packet can enter a memory at time t.
• Only one packet can enter or leave a memory at any time.

Figure: at time t, cells with departure times DT = t and DT = t + X contend for memories; each memory can serve only one of them per time slot.
[Page 25]
Proof of Theorem 1

When a packet arrives in a time slot it must choose a memory not chosen by:

1. The N – 1 other packets that arrive at that timeslot.
2. The N other packets that depart at that timeslot.
3. The N – 1 other packets that can depart at the same time as this packet departs (in the future).

By the pigeonhole principle, 3N – 1 memories that can perform at most one memory operation per time slot are sufficient for the router to be work-conserving.
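The counting argument can be exercised directly: with K = 3N – 1 single-operation memories, a free memory always exists, because at most (N – 1) + N + (N – 1) = 3N – 2 memories are ever excluded. A small Python sketch (the function name `run` and the random departure-time model are assumptions of the example, not the lecture's):

```python
import random
from collections import defaultdict

def run(N=3, T=300, seed=2):
    # Each slot: N cells arrive, each with a known future departure slot.
    # A cell must avoid (1) this slot's other writes, (2) this slot's
    # reads, and (3) memories already booked for reads in its own
    # departure slot. With K = 3N - 1 memories a choice always exists.
    rng = random.Random(seed)
    K = 3 * N - 1
    booked = defaultdict(list)            # departure slot -> memories read
    for t in range(T):
        busy = set(booked.pop(t, []))     # at most N departures (reads) now
        for _ in range(N):                # N arriving cells this slot
            dt = t + 1 + rng.randrange(3 * N)
            while len(booked[dt]) >= N:   # at most N departures per slot
                dt = t + 1 + rng.randrange(3 * N)
            free = set(range(K)) - busy - set(booked[dt])
            assert free, "more than 3N - 1 memories needed"
            m = min(free)
            busy.add(m)                   # the write occupies m this slot
            booked[dt].append(m)          # a read occupies m at slot dt
    return K

print(run())   # → 8, i.e. 3N - 1 for N = 3
```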
[Page 26]
The Parallel Shared Memory Router

From Theorem 1, K = 7 memories don't suffice, but 8 memories do (N = 3 ports, so 3N – 1 = 8).

Figure: the same shared-memory diagram as before, with K = 8 memories and at most one operation – a write or a read – per memory per time slot.
[Page 27]
Summary – Routers which give 100% throughput

| | Fabric | # Mem. | Mem. BW per Mem.¹ | Total Mem. BW | Switch BW | Switch Algorithm |
|---|---|---|---|---|---|---|
| C. Shared Mem. | Bus | 1 | 2NR | 2NR | 2NR | None |
| Output-Queued | Bus | N | (N+1)R | N(N+1)R | NR | None |
| Input Queued | Crossbar | N | 2R | 2NR | NR | MWM |
| CIOQ (Cisco) | Crossbar | 2N | 3R | 6NR | 2NR | Maximal |
| CIOQ (Cisco) | Crossbar | 2N | 3R | 6NR | 3NR | Time Reserve* |
| P. Shared Mem. | Bus | k | 2NR/k | 2NR | 2NR | None |
| DSM (Juniper) | Bus | k | 3NR/k | 3NR | 2NR | C. Sets |
| DSM | Xbar | N | 3R | 3NR | 4NR | Edge Color |
| DSM | Clos | N | 3R | 3NR | 6NR | C. Sets |
| PPS – OQ | Clos | Nk | 2R(N+1)/k | 2N(N+1)R | 4NR | C. Sets |
| PPS – Shared Memory | – | N | 4R | 4NR | 4NR | C. Sets |
| PPS – Shared Memory | Clos | Nk | 4NR/k | 4NR | 4NR | C. Sets |

¹ Lower memory bandwidth per memory means a higher random access time is acceptable, which is better.
[Page 28]
Contents
1. Parallel routers: work-conserving
2. Parallel routers: delay guarantees
[Page 29]
Delay Guarantees

Figure: (left) one output with many logical FIFO queues (1 … m); Weighted Fair Queueing sorts packets from constrained traffic. (right) one output with a single Push In First Out (PIFO) queue, into which each arriving packet is pushed at its proper position. PIFO models Weighted Fair Queueing, Weighted Round Robin, strict priority, etc.
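The PIFO abstraction is compact enough to sketch directly: packets are pushed in at an arbitrary rank (assigned by WFQ, strict priority, or any other policy), and departures always take the lowest rank. A binary heap gives a minimal Python model (this class is an illustration, not a hardware design):

```python
import heapq

class PIFO:
    """Push In, First Out queue: push at any rank, pop the lowest rank."""
    def __init__(self):
        self._heap = []
        self._seq = 0                 # tie-breaker: FIFO among equal ranks

    def push(self, rank, pkt):
        heapq.heappush(self._heap, (rank, self._seq, pkt))
        self._seq += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

q = PIFO()
q.push(3, "C"); q.push(1, "A"); q.push(2, "B")
print(q.pop(), q.pop(), q.pop())      # → A B C
```

Setting all ranks equal degenerates to FIFO, which is why FIFO scheduling is the special case in the slides.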
[Page 30]
Delay Guarantees

• Problem: how can we design a parallel output-queued router from slower parallel memories and give delay guarantees?

This is difficult because:
– The counting technique depends on being able to predict the departure time and schedule it.
– The departure time of a cell is not fixed in policies such as strict priority, weighted fair queueing, etc.
[Page 31]
Theorem 2

Theorem 2: (sufficiency)

A parallel output-queued router can give delay guarantees with 4N – 2 memories that can perform at most one memory operation per time slot.
[Page 32]
Intuition for Theorem 2 (N = 3 port router)

• FIFO: there is one window of N – 1 memories that can't be used – those holding the N – 1 packets ahead of the cell at the time of insertion.
• PIFO: there are 2 windows of N – 1 memories that can't be used. A cell pushed into the departure order (e.g., at position 2.5 in … 3, 2, 1) must avoid the memories of the N – 1 packets that depart just before it and the N – 1 packets that depart just after it.

Figure: departure orders before and after pushing in cells 2.5 and 1.5.
[Page 33]
Proof of Theorem 2

A packet (cell C) cannot use the memories:

1. Used to write the N – 1 other cells arriving at time t.
2. Used to read the N cells departing at time t.
3. That will be used to read the N – 1 cells that depart before it.
4. That will be used to read the N – 1 cells that depart after it.

Counting these plus one memory for C itself: (N – 1) + N + (N – 1) + (N – 1) + 1 = 4N – 2 memories suffice.
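The count in the proof is simple enough to write down explicitly (the function name `memories_needed` is made up for this sketch):

```python
def memories_needed(N):
    # Theorem 2 count: memories a cell must avoid, plus one for itself.
    other_arrivals = N - 1   # written this slot
    departures_now = N       # read this slot
    depart_before  = N - 1   # PIFO neighbours departing just before it
    depart_after   = N - 1   # PIFO neighbours departing just after it
    return other_arrivals + departures_now + depart_before + depart_after + 1

print(memories_needed(3))    # → 10, i.e. 4N - 2 for N = 3
```

Compare Theorem 1: dropping the "depart after" window (fixed FIFO departure order) gives 3N – 1.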
[Page 34]
Summary – Routers which give delay guarantees

| | Fabric | # Mem. | Mem. BW per Mem. | Total Memory BW | Switch BW | Switch Algorithm |
|---|---|---|---|---|---|---|
| C. Shared Mem. | Bus | 1 | 2NR | 2NR | 2NR | None |
| Output-Queued | Bus | N | (N+1)R | N(N+1)R | NR | None |
| Input Queued | Crossbar | N | 2R | 2NR | NR | – |
| CIOQ (Cisco) | Crossbar | 2N | 3R | 6NR | 2NR | Marriage |
| CIOQ (Cisco) | Crossbar | 2N | 3R | 6NR | 3NR | Time Reserve |
| P. Shared Mem. | Bus | k | 2NR/k | 2NR | 2NR | – |
| DSM (Juniper) | Bus | k | 4NR/k | 4NR | 2NR | C. Sets |
| DSM | Xbar | N | 4R | 4NR | 5NR | Edge Color |
| DSM | Clos | N | 4R | 4NR | 8NR | C. Sets |
| PPS – OQ | Clos | Nk | 3R(N+1)/k | 3N(N+1)R | 6NR | C. Sets |
| PPS – Shared Memory | – | N | 6R | 6NR | 6NR | C. Sets |
| PPS – Shared Memory | Clos | Nk | 6NR/k | 6NR | 6NR | C. Sets |
[Page 35]
Backups
[Page 36]
Distributed Shared Memory Router

• The central memories are moved to distributed line cards and shared.
• Memory and line cards can be added incrementally.
• From Theorem 1, 3N – 1 memories which can perform one operation per time slot – i.e., a total memory bandwidth of 3NR – suffice for the router to be work-conserving.

Figure: a switch fabric connects Line Card 1 … Line Card N (each at rate R), each holding memories.
[Page 37]
Corollary 1

• Problem: what is the switch bandwidth for a work-conserving DSM router?
• Corollary 1 (sufficiency): a switch bandwidth of 4NR is sufficient for a distributed shared memory router to be work-conserving.
• Proof: there are a maximum of 3 memory accesses and 1 port access.
[Page 38]
Corollary 2

• Problem: what is the switching algorithm for a work-conserving DSM router?
• Bus: no algorithm needed, but impractical.
• Crossbar: an algorithm is needed because only permutations are allowed.
• Corollary 2 (existence): an edge-coloring algorithm can switch packets for a work-conserving distributed shared memory router.
• Proof: follows from König's theorem – any bipartite graph with maximum degree Δ has an edge coloring with Δ colors.
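König's theorem is constructive, and the construction is the switching algorithm: insert edges (cell transfers) one at a time, flipping a two-color alternating path when the endpoints disagree on a free color. A sketch in Python (the function `edge_color` and the demo graph are this example's own, not the lecture's code):

```python
def edge_color(edges, delta):
    # Properly color the edges of a bipartite graph with `delta` colors
    # (delta = maximum degree), per Koenig's theorem.
    at = {}                          # (vertex, color) -> edge index
    color = [None] * len(edges)

    def free(v):                     # lowest color unused at vertex v
        return next(c for c in range(delta) if (v, c) not in at)

    def assign(i, c):
        u, v = edges[i]
        at[(u, c)] = i; at[(v, c)] = i
        color[i] = c

    def unassign(i):
        u, v = edges[i]; c = color[i]
        del at[(u, c)], at[(v, c)]
        color[i] = None

    for i, (u, v) in enumerate(edges):
        a, b = free(u), free(v)
        if a == b:
            assign(i, a)
            continue
        # Free color a at v by flipping the a/b alternating path from v;
        # in a bipartite graph this path can never reach u.
        path, x, c = [], v, a
        while (x, c) in at:
            j = at[(x, c)]
            path.append(j)
            p, q = edges[j]
            x = q if p == x else p
            c = b if c == a else a
        for j in path:
            unassign(j)
        assign(i, a)
        want = b
        for j in path:               # re-insert path edges, colors swapped
            assign(j, want)
            want = a if want == b else b
    return color

# 2x2 input/output transfer pattern: degree 2, so 2 colors suffice.
edges = [("in0", "out0"), ("in0", "out1"), ("in1", "out0"), ("in1", "out1")]
print(edge_color(edges, 2))          # → [1, 0, 0, 1]
```

Each color class is a matching, i.e. one crossbar permutation, so a Δ-coloring schedules all transfers in Δ crossbar configurations.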