mitch gusat and wolfgang denzel with ibm team ibm zrl ieee

23
Case study: InfiniBand CM Parameter Tuning Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE 802 Plenary Session Dallas, Nov. 2006

Upload: others

Post on 16-Apr-2022

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

Case study: InfiniBand CM Parameter Tuning

Mitch Gusat and Wolfgang Denzelwith IBM team

IBM ZRL

IEEE 802 Plenary SessionDallas, Nov. 2006

Page 2: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 2

Outline

• Problem: Congestion in IBA Networks

• Solution (elements thereof): CCA

• Need: Tune the CCA parameters

• Method: ZRL Congestion Benchmarking

• Simulation results and conclusions

Page 3: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 3

Problem: Hotspot Congestion in Lossless ICTNs => Saturation Trees

shortcut routes within same switch elements (SE)

same SEs

port 128

port 4

port 1HCA

HCA

. .

. .

.

HCA

HCA

. .

. .

.

16 8x8 SEs

. .

..

. .

. .

.

32 4x4 half SEs

.

.

.

.

32 4x4 half SEs

.

.

.

.

32 4x4 half SEs

.

.

.

.

32 4x4 half SEs

.

.

.

.

SourceHost

ChannelAdapters

DestinationHost

ChannelAdapters

“Inn

ocen

t”tr

affic

+ 1

Hot

spot

hotspot

Page 4: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

Zurich Research Laboratory

Congestion Control © 2005 IBM Corporation

Effect: Global collapse of throughput caused by single hotspot

shortcut routes within same switch elements (SE)

same SEs

port 128

port 4

port 1HCA

HCA

. .

. .

.

HCA

HCA

. .

. .

.

16 8x8 SEs

. .

..

. .

. .

.

32 4x4 half SEs

.

.

.

.

32 4x4 half SEs

.

.

.

.

32 4x4 half SEs

.

.

.

.

32 4x4 half SEs

.

.

.

.

SourceHost

ChannelAdapters

DestinationHost

ChannelAdapters

Too

muc

h tr

affic

to h

otsp

ot

hotspot

0

0.2

0.4

0.6

0.8

1

0 2ms 4ms 6ms 8ms 10ms

Thro

ughp

ut (

for e

ach

outp

ut)

Time

congestion start congestion end

300% hotspot oversubscriptionduring congestion period

all time constant offered load = 0.9

throughput of hotspot output

saturates

throughput of other outputsbreaks down

Page 5: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

Zurich Research Laboratory

Congestion Control © 2005 IBM Corporation

Solution (elements thereof): IBA Congestion Control Annex (CCA)

Source HCASwitch

congestionthresholdIndex

FECNDestination

HCA

Congestion Control Tablecontaining inter-packet delays (IPD)

IPD−

+

Timer

BECNBECN

IPD

packet packet

Forward ExplicitCongestion Notification

Backward ExplicitCongestion NotificationParameters

Switch queue threshold (~Qeq): sw_th– congestion detection and FECN marking

IPD table size = 64-256 entries – dynamic range of per flow rate control

highest IPD entry: max_ipd – largest inter-packet gap = 1/Rmin

IPD index increment: ipd_idx_incr– step on each new BECN => rate control granularity

IPD recovery timer: rec_time– timeout of the autonomic rate increase timer

Page 6: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 6

Questions

1. Does it work?

2. Tuning: What parameter settings for which cases?=> exploration of a combinatorial sub-space

3. Where are the limits of its operation?

Page 7: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 7

Does it work?

• Qualified “yes” => needs tuningeasy for small fabrics w/ simple traffic, hard for others...

• Param tuning required per (1) fabric architecture and (2) traffic pattern

• Fragile stability (stiffness): CCA sensitivity to traffic and params...

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

Tuned parameters (almost..)Un-tuned congestion control

Page 8: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 8

Early Observations• With min. pkt size under high background load, ECN traffic can

lead to instability

• A hotspot relay destination receiving permanent short-packet traffic may be overloaded with F/BECN signaling

• BECN and ACK packets can be caught in reverse congestion

• Reaction delay improvable by direct BCN vs. the CCA mechanism

• CCA operation is sensitive to parameters Oscillations observed... seem to be controllable

• Parameter settings strongly depend on fabric and traffic!

Page 9: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 9

Simulation Model

• Switching network4x4..12x12 input-buffered CIOQ switchesCredit-based flow controlMultistage bidirectional fat-tree networkSource routing: Static or dynamic (shortest path)

• Edge adapter (HCA) operationQueue pairs, credit-controlled round-robin scheduling with rate ctrl. triggered by CMDestination will ACK (64B pkt) after successful reception of user packets (MTUs)Return of short BECN packets if FECN bit is set

• CM mechanism

Switch action: Output buffer threshold with hysteresis -> FECN bit set in packets leaving switch module and BECN packets returned from destination HCA

SRC HCA action: Rate control according to CCA.

Page 10: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

Zurich Research Laboratory

Congestion Control © 2005 IBM Corporation

Baseline Switch: Xbar-based CIOQ with Input Buffering

sche

dule

r

sche

dule

r

sche

dule

r

In[1]

In[2]

In[n]

Out[1]

Out[2]

Out[n]

. . . . .

. . . . .

. . . . .

. . .

. .

. . .

. .

. . .

. .

Logi

cal O

utpu

t Que

ue

Logi

cal O

utpu

t Que

ue

Logi

cal O

utpu

t Que

ue

Physical Input Buffer

Physical Input Buffer

Physical Input Buffer

Switch ArchitectureIN buffer = 72KB

Output service: RR

Switch element (SE) described in Omnet++ #include <omnetpp.h>#include <packetdefinition.h>

class Switch : public cSimpleModule{Module_Class_Members( Switch, cSimpleModule, 0 )< some parameter and variable definitions …….. >virtual void initialize();virtual void handleMessage( cMessage *msg );virtual void finish();

}

Define_Module( Switch );

void Switch::initialize(){

< some initializations, allocation of queue arrays, schedule first SCHEDULE_EVENT, …….. >

void Switch::handleMessage(cMessage *msg){switch( msg->kind() ){case PACKET: < en-queue received packet …….. >case SCHEDULE_EVENT: < de-queue and send packet, return credit,

update scheduler pointer, schedule nextSCHEDULE_EVENT, ….. >

case CREDIT: < return credit to credit pool …….. >}

}

void Switch::finish(){}

3. performed after simulation run

Typical 3-method structure

1. performed before simulation run

2. handling of received messages during simulation run

Page 11: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 11

Baseline Fabric: MIN Topology => Bidir Fat Tree

• 2-level / 3-stage bidir MIN

• Simulated: 8 – 32 nodes

• Time per run: 2-30 min.

shortcut routes within same modules

same modules

port 32

port 4

port 1

port 4

port 1

8 4x4 modules 8 4x4 modules4 8x8 modules

HCATx

HCATx

HCARx

HCARxport 32

shortcut routes within same modules

32 4x4 modules

same modules

port 128

port 4

port 1

port 4

port 1

32 4x4 modules 32 4x4 modules32 4x4 modules

16 8x8 modules

HCATx

HCATx

HCARx

HCARxport 128

• 3-level / 5-stage bidir MIN

• Simulated: 128 – 432 nodes

• Time per run: 20-500 min.

Page 12: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 12

Traffic: ZRL Congestion Benchmark• Source nodes generate one or more hotspots according to matrix

[λij_hot]: tp->q = αk_hot [λij_hot]:tp->q , [λij_hot] is specified per each case below

1. Congestion type: IN- or OUTput-generated

2. Hotspot severity: HSV = λaggr / μHS , λaggr = ∑ λi at hotspotted output, μHS = service rate of the HS

Mild HSV ~ 1 .. 2 Moderate HSV = 3 .. 7Severe HSV >> 10.

3. Hotspot degree: HSD is the fan-in of congestive tree at the measured hotspot

Small’ HSD <10% (of all sources inject hot traffic)Medium HSD 20..60%Large HSD >90%.

Page 13: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 13

Sensitivity Analysis: Combinatorial Tree

Via analysis and validated by simulation, rank CCA params by sensitivity:1. ipd_indx_incr2. rec_time3. ipd_max

Simulation subtree for a 128-node IB network

background load

0.5

0.9

IPD recovery timemin

max

10.5us

.

.

.

.

.

.

IPD index step

min

max

20

.

.

.

.

.

.

maximum IPD

min

max

84us

.

.

.

.

.

.

hotspot degree

(# hot flows)

3 32

3/16x16

5/8x8

# stages / SE size

.

.

.

.

.

.

.

.

.

• 1 sim. point: ~ 1hr• Exhaustive coverage: ~ 78 yrs.

output: ~ 23 PB of data...not feasible, therefore...

• Prune the tree by analytical pre-selection

Network: 32-node 3-stage/128-node 5-stage

Traffic:single HSV = moderate congestiontwo HSD = 3 and 32 flows ca. 800 cases

ca 800 cases were analyzed

Page 14: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 14

Sample Results: “Input-generated” CongestionSelection: 4 ‘corner’ cases

Offered load• pkt size = 2KB (all user traffic MTU)• negative-exponential interarrival times• background traffic destinations uniformly distributed, load:

a) 50%b) 90%

2 congestion patterns, both with HSV ~ 3• Small HSD:

During hotspot period, 3 sources send 100% of their load to DST #31• Large HSD:

During the hotspot all 32 sources send 10% of their load to DST #31

Page 15: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 15

Four corner cases: Un-tuned CM ~ No CM

large hotspot, 50% load

small hotspot, 50% load

large hotspot, 90% load

small hotspot, 90% load

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

Page 16: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

Zurich Research Laboratory

Congestion Control © 2005 IBM Corporation

With marginal settings of ipd_idx_incr and rec_time

0

.5

1

0 2ms 4ms 6ms 8ms 10ms

Thro

ughp

ut (

for o

utpu

ts 0

..31)

Time

start congestion end0

.5

1

0 2ms 4ms 6ms 8ms 10msTh

roug

hput

(fo

r out

puts

0..3

1)Time

start congestion end0

.5

1

0 2ms 4ms 6ms 8ms 10ms

Thro

ughp

ut (

for o

utpu

ts 0

..31)

Time

start congestion end0

.5

1

0 2ms 4ms 6ms 8ms 10ms

Thro

ughp

ut (

for o

utpu

ts 0

..31)

Time

start congestion end

Corner case: 4

Load: High (0.9)Hotspot degree: Large (32)

Out-of-range parametersIPD index step too low (=2) IPD index step too high (=40) IPD recovery timer too fast (=2.6us) IPD recovery timer too slow (=84us)

32x32 3-stage (2-level) fat-tree Common parameters: Packet size: 2 KB Hotspot severity: 300%

oscillations breakdown hot flow suffering

Throughput: with Congestion Control:

oscillations

Focus: 4th corner case

Page 17: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 17

Tuning for IG congestion in 32 to 128-node fabrics

32-node• sw_th = 90% of switch input buffer size• max_ipd = 1000 ( = 21µs )

from fairness under worst case => IPD sufficient for all other N-1 sources flows to access a share (1 pkt) of the HS bandwidth

• IPD table index increment ipd_idx_incr = 5 (sweetspot)• IPD recovery time rec_time = 500 ( = 10.5µs )

128-node• max_ipd and ipd_idx_incr ~ 4x larger than above

linear scalability confirmed by simulations

Page 18: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 18

Output-generated congestion with CM tuned for IG

ipd

50% load

ipd

90% load

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

0

200

400

600

800

1000

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

IPD[31] IPD[30] IPD[29] IPD[28] IPD[27] IPD[26] IPD[25] IPD[24] IPD[23] IPD[22] IPD[21] IPD[20] IPD[19] IPD[18] IPD[17] IPD[16] IPD[15] IPD[14] IPD[13] IPD[12] IPD[11] IPD[10] IPD[9] IPD[8] IPD[7] IPD[6] IPD[5] IPD[4] IPD[3] IPD[2] IPD[1] IPD[0]

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

0

200

400

600

800

1000

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

IPD[31] IPD[30] IPD[29] IPD[28] IPD[27] IPD[26] IPD[25] IPD[24] IPD[23] IPD[22] IPD[21] IPD[20] IPD[19] IPD[18] IPD[17] IPD[16] IPD[15] IPD[14] IPD[13] IPD[12] IPD[11] IPD[10] IPD[9] IPD[8] IPD[7] IPD[6] IPD[5] IPD[4] IPD[3] IPD[2] IPD[1] IPD[0]

Page 19: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 19

OG w/ “Big Bang” control50% load 90% load

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

Parameters for OG:1. ipd table filing: multiplicative (vs. linear in the IG)2. max_ipd = 10000 ( = 210us )3. recovery time: rec_time = 10000 ( = 210us )4. ipd increase: ipd_idx_incr[i] += 127• ipd decrease: ipd_idx_incr[i] -= 1

Page 20: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 20

OG w/ “Big Bang” control and enhancements

ipd

90% load, with false BECN filtering

ipd

90% load, 3x switch memory

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

0

0.2

0.4

0.6

0.8

1

1.2

0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000

simulation time

output[31]output[30]output[29]output[28]output[27]output[26]output[25]output[24]output[23]output[22]output[21]output[20]output[19]output[18]output[17]output[16]output[15]output[14]output[13]output[12]output[11]output[10]output[9]output[8]output[7]output[6]output[5]output[4]output[3]output[2]output[1]output[0]

hotspot

• With above enhancements, different (from IG congestion) control algorithm, and, different table values,

• a (marginally) stable solution to OG congestion is feasible.

Page 21: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 21

Tuning Guidelines for IG CongestionN = network size, i.e., number of portsHSD = maximum hot spot degree

Worst case = N, but can be less if system partitioning known.Absolute time (µsec) values are w/ reference to our model’s RTTof 2-20 µsec. (w/o congestion)

Item Value

IPD Table size 128

Switch threshold for congestion detection 90% of queue capacity, with hysteresis

IPD table index increment Min(1/6•N, 1/2•HSD)

Max IPD value 2/3 HSD µsec.

Recovery timer 10 µsec.

Page 22: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

IBM Zurich Research Lab 22

Limitations• Stability limits...?

Large hotspot degree (128 sources send 5% to hotspot) with 50% background load has no solution in the param space

• Input- and Output-generate congestions require two distinct control loops (table values and rate control algorithms)

Corollary: Any mixture of IG and OG must be controlled by the dominant case, although the solution might not be stable.

• Why? • Under-sampling

Generally not enough true BECNs for convergence Next problem: false BECNs

» That’s all.

Page 23: Mitch Gusat and Wolfgang Denzel with IBM team IBM ZRL IEEE

Contributors

N. Ni†, G. Pfister†, C. Minkenberg*, T. Engbersen*, D. Craddock§, W. Rooney§, R. Luijten* and T. Schlipf‡

* IBM Research GmbH, Rüschlikon, Switzerland; § IBM Systems and Technology Group, Poughkeepsie, NY USA;

† IBM Systems and Technology Group, Austin, TX USA; ‡ IBM Systems and Technology Group Boeblingen, Germany.

Contact: [email protected]