OpenVPX Backplane Fabric Choice Calls for Careful Analyses

Fabric Bandwidth Comparisons on VPX Backplanes

SYSTEM DEVELOPMENT

Peter Thompson, Director of Applications, Military and Aerospace, GE Intelligent Platforms

There's a host of factors to consider when evaluating the system bandwidth of various OpenVPX fabrics. A detailed comparison of 10 Gbit Ethernet, Serial RapidIO and InfiniBand sheds some light.

Reprinted from COTS Journal | June 2012

When evaluating choices between interconnect fabrics and topologies as part of a systems engineering exercise, there are many factors to be considered. Papers have been published that purport to illustrate the advantages of some schemes over others. However, some of these analyses adopt a simplistic model of the architectures that can be misleading when it comes to mapping a real-world problem onto such systems. A more rigorous approach to the analysis is needed in order to derive metrics that are more meaningful and which differ significantly from those derived from a simplistic analytical approach.

This article compares the bandwidth available to two common types of dataflow for systems based on the VITA 65 CEN16 central switched topology, using three different fabrics—Serial RapidIO (SRIO), 10 Gigabit Ethernet (10GbE) and Double Data Rate InfiniBand (DDR IB). The analysis will show that the difference in routing for the three fabrics on the CEN16 backplane is minimal, that for the use cases presented 10GbE is closer in performance to SRIO than is claimed elsewhere, and that DDR IB matches SRIO.

The System Architecture

For the purposes of this analysis, consider an OpenVPX CEN16 backplane that allows for 14 payload (processor) slots and two switch slots (Figure 1). This is a non-uniform Data Plane topology that routes one connection from each payload slot to each of the two switch slots, plus one connection to the adjacent slot on each side.

First consider a system built from boards that each have two processing nodes (which may be multicore), where each node is connected via Gen2 Serial RapidIO (SRIO) to an onboard SRIO switch, which in turn has four SRIO connections to the backplane. Each processor has two SRIO connections. That is then compared with a similar system based around dual processor boards with two 10GbE links per node and no onboard switch.

If, as in the case of the GE DSP280 multiprocessor board, a Mellanox network interface chip is used, the same board can be software reconfigured to support InfiniBand: in fact, the only change required to migrate from 10GbE to DDR InfiniBand is to change the system central switch from 10GbE (for example, GE's GBX460) to InfiniBand (such as GE's IBX400). The backplane can remain unchanged, as can the payload boards. The interfaces can be software selected as 10GbE or IB. By using appropriate middleware such as AXIS or MPI, the application code remains agnostic to the fabric.

Assumptions and Data Rate Arithmetic

It is assumed that the switches for all three fabrics are non-blocking for the number of ports that each switch chip supports. However, as will be seen, the number of chips and the hierarchy used to construct a 20-port central switch board can have a significant impact on the true network topology and therefore the bandwidth available to an application.

One factor that can be overlooked is that in addition to the primary data fabric connections, there can be an alternate path between nodes on the same board that can be seamlessly leveraged by the application. For example, GE's DSP280 multiprocessor board has eight lanes of PCIe Gen2 between the two processors—via a switch with non-transparent bridging capability. This adds a path with up to 32 Gbit/s available. It's important that the inter-processor communication software is able to leverage mixed data paths within a system. The AXIS tools from GE can do that and can be used to build a dataflow model that represents the algorithm's needs, and the user has complete control over which interconnect mechanism is used for each data link.

Gen2 SRIO (which is only just starting to emerge on products as of early 2012) runs at 5 GHz with the chipsets commonly in use. A 4-lane connection, with the overhead of 8b10b encoding, yields a raw rate of 4 x 5 Gbit/s x 0.8 = 16 Gbit/s. 10GbE clocks at 3.125 GHz on 4 lanes with the same 8b10b encoding, so has a raw rate of 4 x 3.125 Gbit/s x 0.8 = 10 Gbit/s. DDR InfiniBand clocks at 5 GHz, with a raw rate of 4 x 5 Gbit/s x 0.8 = 16 Gbit/s. Mellanox interface chips that support both 10GbE and IB have been available and have been deployed for some time now, and are considered a mature technology with widespread adoption in mainstream high-performance computing.
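
The raw-rate arithmetic can be captured in a few lines of Python. The sketch below simply restates the figures above: four lanes per link, the lane rates quoted for each fabric, and the 0.8 efficiency of 8b10b encoding.

    # Raw per-link rates for a x4 connection with 8b10b encoding (0.8 efficiency),
    # using the lane rates quoted in the text.
    LANES = 4
    ENCODING = 0.8  # 8b10b: 8 payload bits for every 10 bits on the wire

    lane_rate_gbaud = {
        "Gen2 SRIO": 5.0,
        "10GbE": 3.125,
        "DDR InfiniBand": 5.0,
    }

    for fabric, rate in lane_rate_gbaud.items():
        raw = LANES * rate * ENCODING
        print(f"{fabric}: {LANES} x {rate} Gbit/s x {ENCODING} = {raw:g} Gbit/s")
    # Gen2 SRIO: 16 Gbit/s, 10GbE: 10 Gbit/s, DDR InfiniBand: 16 Gbit/s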

Expanded System Architecture

Now consider a system built from fourteen such boards in an OpenVPX chassis with a backplane that conforms to the BKP6-CEN16-11.2.2.n profile. This supports fourteen payload boards and two central switch boards, and yields a nominal interconnect diagram as shown in Figure 2 for the SRIO case. For 10GbE or InfiniBand, the same backplane results in an interconnect mapping that is represented in Figure 3.

Those diagrams do not tell the whole story, however. They would be correct if the central switches shown were constructed from a single, non-blocking, 18- to 20-port switch device. However, this is not the case for all the fabrics. In the 10GbE case, a GBX460 switch card can be used, which employs a single 24-port switch chip. For an InfiniBand system, the IBX400 can be used, which has a single, 36-port switch chip where each port is x4 lanes wide.

In the case of Gen2 SRIO, the switch chip commonly selected is a device that supports 48 lanes—in other words 12 ports of x4 links. In order to construct a switch of higher order, it is necessary to use several chips in some kind of a tree structure. Here a tradeoff must be made between the number of chips used and the overall performance of the aggregated switch.

All-to-All Data Measurement

Figure 1: An OpenVPX CEN16 backplane that allows for 14 payload (processor) slots and two switch slots. This is a non-uniform Data Plane topology that routes one connection from each payload slot to each of the two switch slots, plus one connection to the adjacent slot on each side. (Slot numbers are logical; physical slot numbers may be different. The diagram shows the Expansion Plane (DFP), Data Plane (FP), Control Plane (UTP), Management Plane (IPMB) and Utility Plane connections, with the Data and Control Plane switches in slots VPX8 and VPX9.)


When evaluating network architectures, a common approach is to look at an all-to-all exchange of data. This is of interest as it represents a common problem encountered in embedded processing systems: a distributed corner turn of matrix data, a core operation in synthetic aperture radar processing, for instance. It is commonly seen when the processing algorithm calls for a two (or higher) dimensional array to be subjected to a two (or higher) dimensional Fast Fourier Transform. In order to meet system time constraints, the transform is often distributed across many processor nodes. Between the row FFTs and the column FFTs the data must be exchanged between nodes. This requires an all-to-all exchange of data that can tax the available bandwidth of a system.

A simple analysis of this topology might make the following assumptions: there are links between nodes on each board via the onboard switch, there are links to nodes on adjacent cards via links between the onboard switches, and there are 22 connections made via the central switches. In this approach, the overall performance for an all-to-all exchange might be assumed to be determined by the lowest aggregate bandwidth of these three connection types—in other words that of a single link divided by the number of connections. This equates to 4 lanes x 5 Gbit/s x 0.8 encoding / 22 nodes = 0.73 Gbit/s.

If we apply the same simplistic analysis to the 10GbE system, the available bandwidth for all-to-all transfers is 4 lanes x 3.125 Gbit/s x 0.8 encoding x 8 connections between switches / 368 paths = 0.217 Gbit/s, against the 0.73 Gbit/s per CPU calculated for SRIO. That suggests SRIO has an apparent speed advantage of 3.4 to 1. However, this is a flawed analysis and gives a misleading impression as to the relative performance that might be expected from the two systems when doing a corner turn. The two architectures are evaluated with different methods—one by dividing the worst-case bandwidth by the number of processors sharing it, and the other by dividing the worst-case bandwidth by the number of links that share it.
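
To see where the flaw creeps in, the two simplistic figures can be reproduced side by side. The short Python sketch below only restates the numbers already quoted (a single 16 Gbit/s SRIO link shared by 22 nodes, and eight 10 Gbit/s inter-switch links shared by 368 paths); it is not a model of the real topology.

    # Reproduce the simplistic (and flawed) all-to-all estimates quoted above.
    srio_link = 4 * 5.0 * 0.8        # 16 Gbit/s per x4 Gen2 SRIO link
    tengbe_link = 4 * 3.125 * 0.8    # 10 Gbit/s per x4 10GbE link

    srio_per_cpu = srio_link / 22             # one link shared by 22 nodes -> ~0.73 Gbit/s
    tengbe_per_path = tengbe_link * 8 / 368   # 8 inter-switch links, 368 paths -> ~0.217 Gbit/s

    print(f"SRIO  (simplistic): {srio_per_cpu:.2f} Gbit/s per CPU")
    print(f"10GbE (simplistic): {tengbe_per_path:.3f} Gbit/s per path")
    # The two figures are derived by different methods, yet dividing one by the other
    # gives the apparent advantage claimed for SRIO.
    print(f"Apparent advantage: {srio_per_cpu / tengbe_per_path:.2f} : 1")
    # about 3.4 : 1 once the intermediate values are rounded as in the text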

Figure 2: This interconnect diagram for the Serial RapidIO use case has an OpenVPX backplane that conforms to the BKP6-CEN16-11.2.2.n profile. This supports 14 payload boards and two central switch boards. (Each payload board carries two SRIO CPUs and a payload switch; CPUs 1 through 28 connect through their payload switches to the two central switches.)


Switch Architecture Matters

A second potential error is to ignore the internal architecture of each switch device, as this can have an effect in cases where the switch does not have balanced bandwidth. However, the biggest flaw is the suggestion that the performance of a non-uniform tree architecture can be modeled by deriving the lowest bandwidth connection in the system. In network theory, it is widely accepted that the best metric for the expected performance of such a system is represented by the bisectional bandwidth of the network. The bisectional bandwidth of a system is found by dividing the system into two equal halves along a dividing line, and enumerating the rate at which data can be communicated between the two halves.

Reconsidering the network diagram of the SRIO system, the bisection width is defined by the number of paths that the median line crosses, which adds up to 19. Similarly, the bisection width of the 10GbE or DDR IB system also adds up to 19. Given that the link bandwidth for the SRIO system is 16 Gbit/s and for 10GbE is 10 Gbit/s, the bisectional bandwidth of the SRIO system is 19 x 16 = 304 Gbit/s, and for the 10GbE system it is 19 x 10 = 190 Gbit/s. This represents an expected performance ratio for the total exchange scenario of 1.6 to 1 in favor of the SRIO system—not the 3.4 to 1 predicted in the simplistic model. If we now replace the 10GbE switch with an InfiniBand switch, which fits the same slot and backplane profiles, the bisectional bandwidth is 19 x 16 = 304 Gbit/s. Therefore the performance of DDR InfiniBand matches that of SRIO.
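
The bisectional-bandwidth comparison then reduces to a one-line multiplication. The sketch below takes the bisection width of 19 links from the analysis above, together with the per-link rates already derived.

    # Bisectional bandwidth = bisection width (links crossed by the median line) x per-link rate.
    BISECTION_WIDTH = 19  # links crossing the median line of the CEN16 system, per the analysis above

    link_rate = {"SRIO": 16, "10GbE": 10, "DDR IB": 16}  # Gbit/s per x4 link

    bisection = {fabric: BISECTION_WIDTH * rate for fabric, rate in link_rate.items()}
    print(bisection)  # {'SRIO': 304, '10GbE': 190, 'DDR IB': 304}

    print(f"SRIO : 10GbE = {bisection['SRIO'] / bisection['10GbE']:.1f} : 1")    # 1.6 : 1
    print(f"DDR IB : SRIO = {bisection['DDR IB'] / bisection['SRIO']:.1f} : 1")  # 1.0 : 1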

Bandwidth Calculations—Pipeline Case

Another dataflow model commonly considered is a pipeline, where data streams from node to node in a linear manner. When designing such a dataflow, it is normal to map the tasks and flow to the system in an optimal manner. This can include using different fabric connections for different parts of the flow. A good IPC library and infrastructure will allow the designer to do so without requiring any modifications to the application code. AXIS has this characteristic. Here, for simplicity, it is assumed that the input and output data sizes at each processing stage are the same (no data reduction or increase). In this instance the rate of the slowest link in the chain dictates the overall achievable performance.

Figure 3: This interconnect diagram has an OpenVPX backplane with an interconnect mapping for 10GbE or InfiniBand. (The 28 CPUs connect directly to the two central switches via their 10GbE/IB links; there is no onboard payload switch.)


If Task 1 is mapped to CPU1, Task 2 to CPU2 and Task 3 to CPU3, the available paths are shown in yellow in Figure 4 for the 10GbE system. The path from Task 1 to Task 2 is over x8 PCIe Gen2, with an available bandwidth of 32 Gbit/s. The path from Task 2 to Task 3 has access to two 10GbE links, an aggregate rate of 20 Gbit/s. Therefore the minimum path is 20 Gbit/s. In the DDR IB system, the path from Task 2 to Task 3 has access to two IB links, an aggregate rate of 32 Gbit/s. The PCIe link is unchanged, so the minimum leg here is 32 Gbit/s. Now, for the SRIO system, with paths between CPU1 and CPU2 and between CPU2 and CPU3, two separate links are available for each leg, so 32 Gbit/s is available for both legs.

The result of all this is that the limiting bandwidths for the pipeline use case are 20 Gbit/s for 10GbE, 32 Gbit/s for DDR IB and 32 Gbit/s for SRIO.
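
Because every stage is assumed to pass the same amount of data, the pipeline rate is simply the minimum leg bandwidth. The sketch below encodes the legs described above; the leg values are the aggregate bandwidths from the text, with the x8 PCIe Gen2 on-board path taken as 32 Gbit/s.

    # Pipeline throughput is limited by the slowest leg (equal data sizes at each stage).
    # Leg bandwidths in Gbit/s, taken from the discussion above.
    pipeline_legs = {
        "10GbE":  [32, 2 * 10],      # x8 PCIe Gen2 on-board, then two 10 Gbit/s Ethernet links
        "DDR IB": [32, 2 * 16],      # x8 PCIe Gen2 on-board, then two 16 Gbit/s IB links
        "SRIO":   [2 * 16, 2 * 16],  # two 16 Gbit/s SRIO links available for each leg
    }

    for fabric, legs in pipeline_legs.items():
        print(f"{fabric}: limiting bandwidth = {min(legs)} Gbit/s")
    # 10GbE: 20 Gbit/s, DDR IB: 32 Gbit/s, SRIO: 32 Gbit/s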

Other Factors to Consider

The push to support open software architectures—MOSA, FACE and so on—is leading the military embedded processing industry to support middleware packages such as the OpenFabrics Enterprise Distribution (OFED) and OpenMPI for data movement. Typically OpenMPI is layered over a network stack, and its performance is highly reliant on how efficiently the layers map to the underlying fabric. Some SRIO implementations rely on "rionet," a Linux network driver that presents a TCP/IP interface to SRIO. Contrast this with an OpenMPI implementation that maps through OFED to RDMA over 10GbE or InfiniBand, and it can be seen that the potential exists for a large gap in performance at the application level, with RDMA being much more efficient.
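
As an illustration of the kind of middleware-level exchange being discussed, the Python fragment below performs an all-to-all (corner-turn style) exchange using MPI through the standard mpi4py bindings. It is a generic sketch rather than an AXIS or rionet configuration: whether the traffic travels over RDMA via OFED or over a TCP/IP path such as rionet is decided by how the MPI stack is built and configured, not by the application code, which is the point being made above. The block size and launch command are illustrative only.

    # Minimal all-to-all exchange with MPI (mpi4py); launch with, for example,
    #   mpirun -np 28 python alltoall.py
    # The same application code runs over RDMA (OFED) or TCP/IP transports; the transport
    # is selected by the MPI/fabric configuration, not by anything in this script.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    nranks = comm.Get_size()
    block = 4096  # elements sent from each rank to every other rank (illustrative size)

    # Each rank contributes one equal-sized block per destination; Alltoall performs the
    # full exchange that a distributed corner turn requires between the row and column FFTs.
    send = np.full((nranks, block), comm.Get_rank(), dtype=np.float64)
    recv = np.empty_like(send)
    comm.Alltoall(send, recv)

    if comm.Get_rank() == 0:
        print(f"all-to-all complete across {nranks} ranks")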

Meanwhile, it is sometimes claimed that SRIO is more power efficient than the other fabrics. If we total up the power of the bridge and switch components for each 16-slot system, a truer picture emerges: the power efficiency of SRIO and DDR IB is on par, with 10GbE fairly close.

Differences Not Significant

Figure 5 summarizes the system bandwidth analyses for the SRIO, 10GbE and DDR InfiniBand systems. These show that for both use cases, the simplistic analysis presented elsewhere overestimates the performance advantage of SRIO over 10GbE by a factor of two, and that the advantage is completely attributable to the difference in clock rates. The CEN16 topology has little to no effect in reality. It also shows that the DDR InfiniBand system matches the performance of the SRIO system for both use cases.

GE Intelligent Platforms
Charlottesville, VA.
(800) 368-2738.
[www.ge-ip.com].

Figure 4: Shown here is a pipeline dataflow scheme mapped to a 10GbE system. (The figure labels the '1 to 2' leg between CPU1 and CPU2 and the two '2 to 3' legs from CPU2 to CPU3 through the central switches.)

Figure 5: The table summarizes the system bandwidth analyses for the SRIO, 10GbE and DDR InfiniBand systems. The DDR InfiniBand system matches the performance of the SRIO system for both use cases.

Backplane: CEN16, 14 payload, 2 switch | Use case: All-to-all | 10GbE: 190 Gbit/s bisectional bandwidth | SRIO: 304 Gbit/s bisectional bandwidth | DDR IB: 304 Gbit/s bisectional bandwidth | SRIO:10GbE = 1.6x | IB:SRIO = 1x

Backplane: CEN16, 14 payload, 2 switch | Use case: Pipeline | 10GbE: 20 Gbit/s min | SRIO: 32 Gbit/s min | DDR IB: 32 Gbit/s min | SRIO:10GbE = 1.6x | IB:SRIO = 1x
