
INFINIBAND

CS 708 Seminar

NEETHU RANJIT (Roll No. 05088)

B. Tech. Computer Science & Engineering

College of Engineering Kottarakkara

Kollam 691 531

Ph: +91.474.2453300

http://www.cek.ihrd.ac.in

[email protected]


Certificate

This is to certify that this report titled InfiniBand is a bonafide record of the CS 708 Seminar work done by Miss. NEETHU RANJIT, Reg. No. 10264042, Seventh Semester B. Tech. Computer Science & Engineering student, under our guidance and supervision, in partial fulfillment of the requirements for the award of the degree, B. Tech. Computer Science and Engineering, of Cochin University of Science & Technology.

October 16, 2008

Guide: Mr Renjith S.R, Lecturer, Dept. of Computer Science & Engg.

Coordinator & Dept. Head: Mr Ahammed Siraj K K, Asst. Professor, Dept. of Computer Science & Engg.


Acknowledgments

I express my wholehearted thanks to our respected Principal Dr Jacob Thomas and to Mr Ahammed Siraj sir, Head of the Department, for providing me with the guidance and facilities for the seminar. I wish to express my sincere thanks to Mr Renjith sir, Lecturer in the Computer Science Department and also my guide, for his timely advice during the course period of my seminar. I thank all faculty members of College of Engineering Kottarakkara for their cooperation in completing my seminar. My sincere thanks to all those well-wishers and friends who have helped me during the course of the seminar work and have made it a great success. Above all I thank the Almighty Lord, the foundation of all wisdom, for guiding me step by step throughout my seminar. Last but not least, I would like to thank my parents for their moral support.

NEETHU RANJIT


Abstract

InfiniBand is a powerful new architecture designed to support I/O connectivity for the Internet infrastructure. InfiniBand is supported by all major OEM server vendors as a means to expand and create the next-generation I/O interconnect standard in servers. For the first time, a high-volume, industry-standard I/O interconnect extends the role of traditional in-the-box buses. InfiniBand is unique in providing connectivity in a way previously reserved only for traditional networking, and this unification of I/O and system-area networking requires a new architecture domain. Underlying this major transition is InfiniBand's superior ability to support the Internet's requirement for RAS: Reliability, Availability, and Serviceability. The InfiniBand Architecture (IBA) is an industry-standard architecture for server I/O and interprocessor communication. IBA enables QoS (Quality of Service) through certain mechanisms, basically service levels, virtual lanes, and table-based arbitration of virtual lanes. InfiniBand has a formal model to manage the arbitration tables and provide QoS; according to this model, each application needs a sequence of entries in the IBA arbitration tables based on its requirements. These requirements are related to the mean bandwidth needed and the maximum latency tolerated by the application. The architecture provides a comprehensive silicon, software, and system solution, spanning a layered protocol and InfiniBand's management infrastructure; the specification ranges from industry-standard electrical interfaces and mechanical connectors to well-defined software and management services. InfiniBridge is a channel adapter and switch implementation of InfiniBand that supports its packet-switching features.


Contents

1 INTRODUCTION
2 INFINIBAND ARCHITECTURE
3 COMPONENTS OF INFINIBAND
3.1 HCA and TCA Channel adapters
3.2 Switches
3.3 Routers
4 INFINIBAND BASIC FABRIC TOPOLOGY
5 IBA Subnet
5.1 Links
5.2 Endnodes
6 FLOW CONTROL
7 INFINIBAND SUBNET MANAGEMENT AND QoS
8 REMOTE DIRECT MEMORY ACCESS (RDMA)
8.1 Comparing a Traditional Server I/O and RDMA-Enabled I/O
9 INFINIBAND PROTOCOL STACK
9.1 Physical Layer
9.2 Link Layer
9.3 Network Layer
9.4 Transport Layer
10 COMMUNICATION SERVICES
10.1 Communication Stack: InfiniBand Support for the Virtual Interface Architecture (VIA)
11 INFINIBAND FABRIC VERSUS SHARED BUS
12 INFINIBRIDGE
12.1 Hardware transport performance of InfiniBridge
13 INFINIBRIDGE CHANNEL ADAPTER ARCHITECTURE
14 VIRTUAL OUTPUT QUEUEING ARCHITECTURE
15 FORMAL MODEL TO MANAGE INFINIBAND ARBITRATION TABLES TO PROVIDE QUALITY OF SERVICE (QoS)
15.1 THREE MECHANISMS TO PROVIDE QoS
15.1.1 Service Level
15.1.2 Virtual Lanes
15.1.3 Virtual Arbitration Table
16 FORMAL MODEL FOR THE INFINIBAND ARBITRATION TABLE
16.0.4 Initial Hypothesis
17 FILLING IN THE VL ARBITRATION TABLE
17.1 Insertion and elimination in the table
17.1.1 Example 1
17.2 Defragmentation Algorithm
17.3 Reordering Algorithm
17.4 Global management of the table
18 CONCLUSION


1 INTRODUCTION

Bus architectures have a tremendous amount of inertia because they dictate the bus interface architecture of semiconductor devices. For this reason successful bus architectures typically enjoy a dominant position for ten years or more. The PCI bus was introduced to the standard PC architecture in the early 90s and has maintained its dominance with only one major upgrade during that period: from 32-bit/33 MHz to 64-bit/66 MHz. The PCI-X initiative takes this one step further to 133 MHz and seemingly should provide the PCI architecture with a few more years of life. But there is a divergence between what personal computers and servers require.

Throughout the past decade of fast-paced computer development, the traditional Peripheral Component Interconnect architecture has continued to be the dominant input/output standard for most internal back-plane and external peripheral connections. However, these days the PCI bus, with its shared-bus approach, is beginning to lag noticeably. Performance limitations, poor bandwidth, and reliability issues are surfacing within the higher market tiers, especially as the PCI bus is quickly becoming an outdated technology.

Computers are made up of a number of addressable elements (CPU, memory, screen, hard disks, LAN and SAN interfaces, etc.) that use a system bus for communications. As these elements have become faster, the system bus and the overhead associated with data movement between devices, commonly referred to as I/O, have become a gating factor in computer performance. To address the problem of server performance with respect to I/O in particular, InfiniBand was developed as a standards-based protocol to offload data movement from the CPU to dedicated hardware, thus allowing more CPU cycles to be dedicated to application processing. As a result, InfiniBand, by leveraging networking technologies and principles, provides scalable, high-bandwidth transport for efficient communications between InfiniBand-attached devices.

InfiniBand technology advances I/O connectivity for data center and enterprise infrastructure deployment, overcoming the I/O bottleneck in today's server architectures. Although primarily suited for next-generation server I/O, InfiniBand can also extend to the embedded computing, storage, and telecommunications industries. This high-volume, industry-standard I/O interconnect extends the role of traditional backplane and board buses beyond the physical connector.


Another major bottleneck is the scalability problem with parallel-bus architectures such as the Peripheral Component Interconnect (PCI). As these buses scale in speed, they cannot support the multiple network interfaces that system designers require. For example, the PCI-X bus at 133 MHz can only support one slot, and at higher speeds these buses begin to look like point-to-point connections. Mellanox Technologies' InfiniBand silicon product, InfiniBridge, lets system designers construct entire fabrics based on the device's switching and channel adapter functionality.

InfiniBridge implements an advanced set of packet switching, quality of service, and flow control mechanisms. These capabilities support multiprotocol environments with many I/O devices shared by multiple servers. InfiniBridge features include an integrated switch and PCI channel adapter, InfiniBand 1X and 4X link speeds (defined as 2.5 and 10 Gbps), eight virtual lanes, and a maximum transfer unit (MTU) size of up to 2 Kbytes. InfiniBridge also offers multicast support, an embedded subnet management agent, and InfiniPCI for transparent PCI-to-PCI bridging. InfiniBand is an architecture and specification for data flow between processors and I/O devices that promises greater bandwidth and almost unlimited expandability. InfiniBand is hence intended to replace the existing Peripheral Component Interconnect (PCI). Offering throughput of up to 2.5 gigabits per second per link and support for up to 64,000 addressable devices, the architecture also promises increased reliability, better sharing of data between clustered processors, and built-in security. The InfiniBand architecture specification was released by the InfiniBand Trade Association. InfiniBand is backed by top companies in the industry such as Compaq, Dell, Hewlett-Packard, IBM, Intel, Microsoft, and Sun. Underlying this major I/O transition, InfiniBand provides Quality of Service as a distinctive feature, and several mechanisms exist to provide it; one such mechanism is the formal model for managing the arbitration tables.


2 INFINIBAND ARCHITECTURE

InfiniBand is a switched, point-to-point interconnect for data centers based on link speeds from 2.5 Gbps up to 30 Gbps. The architecture defines a layered hardware protocol (physical, link, network, and transport layers) and a software layer to support fabric management and low-latency communication between devices.

InfiniBand provides transport services for the upper-layer protocols and supports flow control and Quality of Service to provide ordered, guaranteed packet delivery across the fabric. An InfiniBand fabric may comprise a number of InfiniBand subnets that are interconnected using InfiniBand routers, where each subnet may consist of one or more InfiniBand switches and InfiniBand-attached devices.

The InfiniBand standard defines Reliability, Availability, and Serviceability from the ground up, making the specification efficient to implement in silicon yet able to support a broad range of applications. InfiniBand's physical layer supports a wide range of media by using a differential serial interconnect with an embedded clock. This signaling supports printed circuit board, backplane, copper, and fiber links; it leaves room for further growth in speed and media types.

The physical layer implements 1X, 4X, and 12X links by byte striping over multiple links. An InfiniBand system area network has four basic system components that interconnect using InfiniBand links, as Fig. 1 shows. The host channel adapter (HCA) terminates a connection for a host node. It includes hardware features to support high-performance memory transfers into CPU memory.

The target channel adapter (TCA) terminates a connection for a peripheral node. It defines a subset of HCA functionality and can be optimized for embedded applications.

The switch handles link-layer packet forwarding. A switch does not consume or generate packets other than management packets.

The router sends packets between subnets using the network layer. InfiniBand routers divide InfiniBand networks into subnets and do not consume or generate packets other than management packets. A subnet manager runs on each subnet and handles device and connection management tasks. A subnet manager can run on a host or be embedded in switches and routers. All system components must include a subnet management agent that handles communication with the subnet manager.


Figure 1: INFINIBAND ARCHITECTURE


3 COMPONENTS OF INFINIBAND

The main components in the InfiniBand architecture are:

3.1 HCA and TCA Channel adapters

HCAs are present in servers or even desktop machines and provide an interface that is used to integrate InfiniBand with the operating system. TCAs are present on I/O devices such as a RAID subsystem or a JBOD subsystem. Host and target channel adapters present an interface to the layers above them that allows those layers to generate and consume packets. In the case of a server writing a file to a storage device, the host is generating the packets that are then consumed by the storage device. Each channel adapter has one or more ports; a channel adapter with more than one port may be connected to multiple switch ports.

3.2 Switches

Switches simply forward packets between two of their ports based on the established routing table and the addressing information stored in the packets. A collection of end nodes connected to one another through one or more switches forms a subnet. Each subnet must have at least one subnet manager that is responsible for the configuration and management of the subnet.
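As a rough illustration of this forwarding behaviour, the sketch below models a switch's forwarding database as a simple lookup from destination LID to output port. The class name, field names, and table contents are illustrative assumptions, not part of the InfiniBand specification.

# Minimal sketch of link-layer forwarding in an InfiniBand-style switch:
# the switch looks up the packet's destination LID (DLID) in its
# forwarding database and relays the packet on the matching output port.
# Names and values here are illustrative, not taken from the IBA spec.

class SwitchSketch:
    def __init__(self, forwarding_table):
        # forwarding_table: dict mapping DLID -> output port number
        self.forwarding_table = forwarding_table

    def forward(self, packet):
        dlid = packet["dlid"]
        port = self.forwarding_table.get(dlid)
        if port is None:
            # Unknown destination: a real switch would drop the packet or
            # trap to the subnet manager; here we just report it.
            return f"drop packet for unknown DLID {dlid}"
        return f"forward packet for DLID {dlid} on port {port}"


if __name__ == "__main__":
    switch = SwitchSketch({0x0001: 1, 0x0002: 2, 0x0003: 2})
    print(switch.forward({"dlid": 0x0002, "payload": b"data"}))
    print(switch.forward({"dlid": 0x00FF, "payload": b"data"}))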


Figure 2: InfiniBand Switch

3.3 Routers

Routers are like switches in the respect that they simply forward packets between their ports. The difference is that a router is used to interconnect two or more subnets to form a multidomain system area network. Within a subnet, each port is assigned a unique identifier by the subnet manager called the local identifier, or LID. In addition to the LID, each port is assigned a globally unique identifier called the GID. A main feature of the InfiniBand architecture that is not available in the current shared-bus I/O architecture is the ability to partition the ports within the fabric that can communicate with one another. This is useful for partitioning the available storage across one or more servers for management reasons.


Figure 3: System Network of Infiniband

4 INFINIBAND BASIC FABRIC TOPOLOGY

InfiniBand is a high-speed, serial, channel-based, switch-fabric, message-passing architecture that can have server, Fibre Channel, SCSI RAID, router, and other end nodes, each with its own dedicated fat pipe. Each node can talk to any other node in a many-to-many configuration. Redundant paths can be set up through an InfiniBand fabric for fault tolerance, and InfiniBand routers can connect multiple subnets. The figure below shows the simplest configuration of an InfiniBand installation, where two or more nodes are connected to one another through the fabric. A node represents either a host device such as a server or an I/O device such as a RAID subsystem. The fabric itself may consist of a single switch in the simplest case or a collection of interconnected switches and routers. Each connection between nodes, switches, and routers is a point-to-point serial connection.


Figure 4: InfiniBand Fabric Topology


Figure 5: IBA SUBNET

5 IBA Subnet

The smallest complete IBA unit is a subnet, illustrated in the figure. Multiple subnets can be joined by routers (not shown) to create large IBA networks. The elements of a subnet, as shown in the figure, are endnodes, switches, links, and a subnet manager. Endnodes, such as hosts and devices, send messages over links to other endnodes; the messages are routed by switches. Routing is defined, and subnet discovery performed, by the subnet manager. Channel adapters (CAs), not shown, connect endnodes to links.

5.1 Links

IBA links are bidirectional point-to-point communication channels, and may be either copper or optical fibre. The signalling rate on all links is 2.5 Gbaud in the 1.0 release; later releases will undoubtedly be faster. Automatic training sequences are defined in the architecture that will allow compatibility with later, faster speeds. The physical links may be used in parallel to achieve greater bandwidth. The different link widths are referred to as 1X, 4X, and 12X. The basic 1X copper link has four wires, comprising a differential signaling pair for each direction. Similarly, the 1X fibre link has two optical fibres, one for each direction. Wider widths increase the number of signal paths as implied. There is also a copper backplane connection allowing dense structures of modules to be constructed. The 1X size allows up to six ports on the faceplate of the standard (smallest) size IBA module. Short-reach (multimode) optical fibre links are provided in all three widths; while distances are not specified (as explained earlier), it is expected that they will reach 250 m for 1X and 125 m for 4X and 12X. Long-reach (single-mode) fibre is defined in the 1.0 IBA specification only for 1X widths, with an anticipated reach of up to 10 km.
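To make the link-width arithmetic concrete, the short sketch below computes the raw signalling rate and the usable data rate per direction for the 1X, 4X, and 12X widths, assuming the 2.5 Gbaud per-lane rate above and the 8B/10B encoding described later in this report.

# Sketch: per-direction bandwidth of IBA link widths at the 1.0 signalling
# rate.  Assumes 2.5 Gbaud per lane and 8b/10b encoding (8 data bits carried
# in every 10 line bits), as described elsewhere in this report.

LANE_RATE_GBAUD = 2.5          # raw signalling rate of one lane
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b line code

for width in (1, 4, 12):
    raw_gbps = width * LANE_RATE_GBAUD
    data_gbps = raw_gbps * ENCODING_EFFICIENCY
    data_mbytes = data_gbps * 1000 / 8        # MB/s per direction
    print(f"{width:2d}X: raw {raw_gbps:5.1f} Gbps, "
          f"data {data_gbps:5.1f} Gbps ~ {data_mbytes:6.0f} MB/s per direction")

# Expected output:
#  1X: raw   2.5 Gbps, data   2.0 Gbps ~    250 MB/s per direction
#  4X: raw  10.0 Gbps, data   8.0 Gbps ~   1000 MB/s per direction
# 12X: raw  30.0 Gbps, data  24.0 Gbps ~   3000 MB/s per direction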

5.2 Endnodes

IBA endnodes are the ultimate sources and sinks of communication in IBA. They may be host systems or devices (network adapters, storage subsystems, etc.). It is also possible that endnodes will be developed that are bridges to legacy I/O busses such as PCI, but whether and how that is done is vendor-specific; it is not part of the InfiniBand architecture. Note that as a communication service, IBA makes no distinction between these types; an endnode is simply an endnode. So all IBA facilities may be used equally to communicate between hosts and devices; or between hosts and other hosts like normal networking; or even directly between devices, e.g., direct disk-to-tape backup without any load imposed on a host. IBA defines several standard form factors for devices used as endnodes: standard, wide, tall, and tall-wide. The standard form factor is approximately 20 x 100 x 220 mm. Wide doubles the width; tall doubles the height.


Figure 6: Flow control in InfiniBand

6 FLOW CONTROL

InfiniBand defines two levels of credit-based flow control to manage congestion: link level and end-to-end. Link-level flow control applies back pressure to traffic on a link, while end-to-end flow control protects against buffer overflow at endpoint connections that might be multiple hops away. Each receiving end of a link or connection supplies credits to the sending device to specify the amount of data that the device can reliably receive. Sending devices do not transmit data unless the receiver advertises credits indicating available receive buffer space. The link and connection protocols have built-in credit passing between each device to guarantee reliable flow-control operation. InfiniBand handles link-level flow control on a per-quality-of-service-level (virtual lane) basis. InfiniBand has a unidirectional 2.5 Gbps wire-speed connection (250 MB/s, using the 10-bits-per-data-byte encoding called 8B/10B, similar to 3GIO), and uses either one differential signal pair per direction, called 1X, or 4 (4X) or 12 (12X) pairs, for bandwidth up to 30 Gbps per direction (12 x 2.5 Gbps). Bidirectional throughput with InfiniBand is often expressed in MB/s, yielding 500 MB/s for 1X, 2 GB/s for 4X, and 6 GB/s for 12X respectively.
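The sketch below illustrates the credit idea on a single virtual lane: the receiver advertises how many bytes of buffer it can accept, and the sender transmits only while it holds enough credits. The class names and the byte granularity of credits are simplifying assumptions for illustration; the actual IBA flow-control packet formats are not modelled.

# Sketch of credit-based, per-virtual-lane flow control.  The receiver
# advertises buffer credits; the sender may only transmit while it holds
# credits for that VL.  Credit units and class names are illustrative
# assumptions, not the IBA wire format.

class VirtualLaneReceiver:
    def __init__(self, buffer_bytes):
        self.free_bytes = buffer_bytes

    def advertise_credits(self):
        # Tell the sender how many bytes it may currently send.
        return self.free_bytes

    def deliver(self, packet_bytes):
        assert packet_bytes <= self.free_bytes, "receiver buffer overrun"
        self.free_bytes -= packet_bytes

    def consume(self, consumed_bytes):
        # Application drained the buffer; credits become available again.
        self.free_bytes += consumed_bytes


class VirtualLaneSender:
    def __init__(self, receiver):
        self.receiver = receiver

    def try_send(self, packet_bytes):
        credits = self.receiver.advertise_credits()
        if packet_bytes > credits:
            return False          # no credits: hold the packet, never drop it
        self.receiver.deliver(packet_bytes)
        return True


if __name__ == "__main__":
    rx = VirtualLaneReceiver(buffer_bytes=4096)
    tx = VirtualLaneSender(rx)
    print(tx.try_send(2048))   # True  - credits available
    print(tx.try_send(4096))   # False - would exceed advertised credits
    rx.consume(2048)           # receiver drains its buffer
    print(tx.try_send(4096))   # True  - credits restored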

Each bidirectional 1X connection consists of four wires, two for send and two for receive. Both fiber and copper are supported. Copper can be in the form of traces or cables, and fiber distances between nodes can be as far as 300 meters and more. Each InfiniBand subnet can host up to 64,000 nodes.


7 INFINIBAND SUBNET MANAGEMENT AND QoS

InfiniBand supports two levels of management packets: subnet management and the general services interface (GSI). High-priority subnet management packets (SMPs) are used to discover the topology of the network, attached nodes, and so on, and are transported within the high-priority VLane (which is not subject to flow control). The low-priority GSI management packets handle management functions such as chassis management and other functions not associated with subnet management. Because these services are not critical to subnet management, GSI management packets are not transported within the high-priority VLane and are subject to flow control.

InfiniBand supports quality of service at the link level through virtual lanes. An InfiniBand virtual lane is a separate logical communication link that shares, with other virtual lanes, a single physical link. Each virtual lane has its own buffer and flow-control mechanism implemented at each port in a switch. InfiniBand allows up to 15 general-purpose virtual lanes plus one additional lane dedicated to management traffic. Link-layer quality of service comes from isolating traffic congestion to individual virtual lanes. For example, the link layer will isolate isochronous real-time traffic from non-real-time data traffic; that is, isolate real-time voice or multimedia streams from Web or FTP data traffic. The system manager can assign a higher virtual-lane priority to voice traffic, in effect scheduling voice packets ahead of congested data packets in each link buffer encountered in the voice packet's end-to-end path. Thus, the voice traffic will still move through the fabric with minimal latency.

InfiniBand presents a number of transport services that provide different characteristics. To ensure reliable, sequenced packet delivery, InfiniBand uses flow control and service levels in conjunction with VLanes to achieve end-to-end QoS. InfiniBand VLanes are logical channels that share a common physical link, where VLane 15 has the highest priority and is used exclusively for management traffic, and VLane 0 the lowest. The concept of a VLane is similar to that of the hardware queues found in routers and switches.

For applications that require reliable delivery, InfiniBand supports reliable delivery of packets using flow control. Within an InfiniBand network, the receivers on a point-to-point link periodically transmit information to the upstream transmitter to specify the amount of data that can be transmitted without data loss, on a per-VLane basis. The transmitter can then transmit data up to the amount of credits that are advertised by the receiver. If no buffer credits exist, data cannot be transmitted. The use of credit-based flow control prevents packet loss that might result from congestion. Furthermore, it enhances application performance, because it avoids packet retransmission. For applications that do not require reliable delivery, InfiniBand also supports unreliable delivery of packets (i.e., they may be dropped with little or no consequence) that are not subject to flow control; some management traffic, for example, does not require reliable delivery. At the InfiniBand network layer, the GRH contains an 8-bit traffic class field. This value is mapped to a 4-bit service level field within the LRH to indicate the service level that the packet is requesting from the InfiniBand network. The HCA matches the packet's service level against a service-level-to-VLane table, which has been populated by the subnet manager, and then transmits the packet on the VLane associated with that service level. As the packet traverses the network, each switch matches the service level against the packet's egress port to identify the VLane within which the packet should be transported.
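A minimal sketch of that lookup is shown below: an 8-bit traffic class is reduced to a 4-bit service level, and a per-port SL-to-VL table, as configured by the subnet manager, selects the virtual lane. The table contents and the traffic-class-to-SL rule here are invented for illustration; only the field widths and the lookup structure follow the text above.

# Sketch of service-level-to-virtual-lane selection.  The 8-bit traffic
# class value, the SL-to-VL table contents, and the mapping rule are
# illustrative assumptions; only the field widths (8-bit class, 4-bit SL)
# and the lookup structure follow the description in the text.

def traffic_class_to_sl(traffic_class: int) -> int:
    """Reduce an 8-bit GRH traffic class to a 4-bit service level."""
    assert 0 <= traffic_class <= 0xFF
    return traffic_class >> 4            # keep the top 4 bits (assumption)

def select_vl(sl: int, sl_to_vl_table: list) -> int:
    """Look up the virtual lane for a service level at one egress port."""
    assert 0 <= sl <= 15
    return sl_to_vl_table[sl]

if __name__ == "__main__":
    # Example SL-to-VL table for one port, as a subnet manager might
    # configure it (purely hypothetical values); VL15 is reserved for
    # management traffic and never appears here.
    sl_to_vl = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7]

    tclass = 0xA3                         # 8-bit traffic class from the GRH
    sl = traffic_class_to_sl(tclass)      # -> 10
    vl = select_vl(sl, sl_to_vl)          # -> 5
    print(f"traffic class {tclass:#04x} -> SL {sl} -> VL {vl}")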


Figure 7: RDMA Hardware

8 REMOTE DIRECT MEMORY ACCESS (RDMA)

One of the key problems with server I/O is the CPU overhead associated with data movement between memory and I/O devices such as LAN and SAN interfaces. InfiniBand solves this problem by using RDMA to offload data movement from the server CPU to the InfiniBand host channel adapter (HCA). RDMA is an extension of hardware-based Direct Memory Access (DMA) capabilities that allows the CPU to delegate data movement within the computer to the DMA hardware. The CPU issues DMA instructions that specify the memory location where the data associated with a particular process resides and the memory location the data is to be moved to. Once the DMA instructions are sent, the CPU can process other threads while the DMA hardware moves the data. RDMA extends this by enabling data to be moved from one memory location to another, even if that memory resides on another device.

8.1 Comparing a Traditional Server I/O and RDMA-Enabled I/O

The process in traditional server I/O is extremely inefficient because it results in multiple copies of the same data traversing the memory system bus, and it also invokes multiple CPU interrupts and context switches.

Figure 8: Traditional Server I/O

By contrast, RDMA, an embedded hardware function of the InfiniBand HCA, handles all communication operations without interrupting the CPU. Using RDMA, the sending device either reads data from or writes data to the target device's user-space memory, thereby avoiding CPU interrupts and multiple data copies on the memory bus, which enables RDMA to significantly reduce CPU overhead.


Figure 9: RDMA-Enabled Server I/O


Figure 10: InfiniBand Protocol Stack

9 INFINIBAND PROTOCOL STACK

From a protocol perspective, the InfiniBand architecture consists of four layers: physical, link, network, and transport. These layers are analogous to Layers 1 through 4 of the OSI protocol stack. The InfiniBand architecture is divided into multiple layers, where each layer operates independently of the others.

9.1 Physical Layer

InfiniBand is a comprehensive architecture that defines both electrical and mechanical characteristics for the system. These include cables, receptacles, and copper media; backplane connectors; and hot-swap characteristics. InfiniBand defines three link speeds at the physical layer: 1X, 4X, and 12X. Each individual 1X link is a four-wire serial connection (two wires in each direction) that provides a full-duplex connection at 2.5 Gb/s. The physical layer thus specifies the hardware components.


9.2 Link Layer

The link layer (along with the transport layer) is the heart of the InfiniBand architecture. The link layer encompasses packet layout, point-to-point link operations, and switching within a subnet. At the packet communication level, two packet types are specified: data transfer and network management. The management packets provide operational control over device enumeration, subnet directing, and fault tolerance. Data packets transfer the actual information, with each packet carrying a maximum of four kilobytes of transaction information. Within each specific device subnet, packet direction and switching properties are directed via a subnet manager using 16-bit local identification addresses. The link layer also provides the Quality of Service characteristics of InfiniBand. The primary consideration is the use of the Virtual Lane (VL) architecture for interconnectivity. Even though a single IBA data path may be defined at the hardware level, the VL approach allows for 16 logical links. With 15 independent lanes (VL0-VL14) and one management path (VL15) available, device-specific prioritization can be configured. Since management requires the most priority, VL15 retains the maximum priority. The ability to assert a priority-driven architecture lends itself not only to Quality of Service but to performance as well. Credit-based flow control is also used to manage data flow between two point-to-point links. Flow control is handled on a per-VL basis, allowing separate virtual fabrics to maintain communication utilizing the same physical media.

9.3 Network Layer

The network layer handles routing of packets from one subnet to another (within a subnet, the network layer is not required). Packets that are sent between subnets contain a Global Route Header (GRH). The GRH contains the 128-bit IPv6 addresses for the source and destination of the packet. The packets are forwarded between subnets through routers based on each device's 64-bit globally unique ID (GUID). The router modifies the LRH with the proper local address within each subnet; the last router in the path therefore replaces the LID in the LRH with the LID of the destination port. InfiniBand packets do not require the network-layer information and header overhead when used within a single subnet (which is a likely scenario for InfiniBand system area networks).


9.4 Transport Layer

The transport layer is responsible for in-order packet delivery, partitioning, channel multiplexing, and transport services (reliable connection, reliable datagram, unreliable datagram). The transport layer also handles transaction data segmentation when sending and reassembly when receiving. Based on the Maximum Transfer Unit (MTU) of the path, the transport layer divides the data into packets of the proper size. The receiver reassembles the packets based on a Base Transport Header (BTH) that contains the destination queue pair and packet sequence number. The receiver acknowledges the packets, and the sender receives these acknowledgments and updates the completion queue with the status of the operation. A significant improvement that IBA offers at the transport layer is that all of these functions are implemented in hardware. InfiniBand specifies multiple transport services for data reliability.
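The following sketch shows the segmentation and reassembly idea in miniature: a message is split into MTU-sized packets carrying a packet sequence number, and the receiver stitches them back together in order. The function names and packet representation are assumptions for illustration; real BTH fields and acknowledgement handling are not modelled.

# Sketch of transport-layer segmentation and reassembly.  A message is cut
# into MTU-sized packets tagged with a packet sequence number (PSN); the
# receiver reorders by PSN and reassembles.  Packet layout and names are
# illustrative assumptions, not the real BTH format.

def segment(message: bytes, mtu: int, dest_qp: int):
    """Split a message into packets of at most `mtu` payload bytes."""
    packets = []
    for psn, offset in enumerate(range(0, len(message), mtu)):
        packets.append({
            "dest_qp": dest_qp,                  # destination queue pair
            "psn": psn,                          # packet sequence number
            "payload": message[offset:offset + mtu],
        })
    return packets

def reassemble(packets):
    """Rebuild the original message from (possibly reordered) packets."""
    ordered = sorted(packets, key=lambda p: p["psn"])
    return b"".join(p["payload"] for p in ordered)

if __name__ == "__main__":
    msg = b"x" * 10000                           # 10 KB message
    pkts = segment(msg, mtu=4096, dest_qp=7)     # -> 3 packets
    assert reassemble(reversed(pkts)) == msg     # order does not matter
    print([len(p["payload"]) for p in pkts])     # [4096, 4096, 1808]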


10 COMMUNICATION SERVICES

IBA provides several different types of communication services between endnodes:

Reliable Connection (RC): a connection is established between end nodes, and messages are reliably sent between them. This is optional for TCAs (devices), but mandatory for HCAs (hosts).

(Unreliable) Datagram (UD): a single-packet message can be sent to an end node without first establishing a connection; transmission is not guaranteed.

Unreliable Connection (UC): a connection is established between end nodes, and messages are sent, but transmission is not guaranteed. This is optional.

Reliable Datagram (RD): a single-packet message can be reliably sent to any end node without a one-to-one connection. This is optional.

Raw IPv6 Datagram and Raw EtherType Datagram (Raw, optional): single-packet unreliable datagram services with all but local transport header information stripped off; this allows packets using non-IBA transport layers to traverse an IBA network, e.g., for use by routers and network interfaces to transfer packets to other media with minimal modification.

In the above, reliably sent means the data is, barring catastrophic failure, guaranteed to arrive in order, checked for correctness, with its receipt acknowledged. Each packet, even those for unreliable datagrams, contains two separate CRCs, one covering data that cannot change (Constant CRC) and one that must be recomputed (V-CRC) since it covers data that can change; such change can occur only when a packet moves from one IBA subnet to another, however. These services deliberately resemble those of conventional networking protocols, since they provide essentially the same capabilities; however, they are designed for hardware implementation, as required by a high-performance I/O system. In addition, the host-side functions have been designed to allow all service types to be used completely in user mode, without necessarily using any operating system services, with RDMA moving data directly into or out of the memory of an endnode. This user-mode operation implies that virtual addressing must be supported by the channel adapters, since real addresses are unavailable in user mode. In addition to RDMA, the reliable communication classes also optionally support atomic operations directly against endnode memory. The atomic operations supported are Fetch-and-Add and Compare-and-Swap, both on 64-bit data. Atomics are effectively a variation on RDMA: a combined write and read RDMA, carrying the data.
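As a compact summary of these classes, the sketch below tabulates whether each service type is connection-oriented and whether delivery is acknowledged, following the descriptions above; the dictionary structure itself is only an illustrative convenience.

# Sketch: summary of IBA transport service classes as described above.
# Keys and flags mirror the text; the dictionary is only an illustrative
# way of organising that information.

SERVICE_CLASSES = {
    "RC":  {"name": "Reliable Connection",   "connected": True,  "reliable": True},
    "UC":  {"name": "Unreliable Connection", "connected": True,  "reliable": False},
    "RD":  {"name": "Reliable Datagram",     "connected": False, "reliable": True},
    "UD":  {"name": "Unreliable Datagram",   "connected": False, "reliable": False},
    "Raw": {"name": "Raw Datagram",          "connected": False, "reliable": False},
}

if __name__ == "__main__":
    for abbrev, info in SERVICE_CLASSES.items():
        kind = "connection" if info["connected"] else "datagram"
        delivery = "acknowledged, in-order" if info["reliable"] else "not guaranteed"
        print(f"{abbrev:>3}  {info['name']:<22} {kind:<10} delivery: {delivery}")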


10.1 Communication Stack: InfiniBand Support for the Virtual Interface Architecture (VIA)

The Virtual Interface Architecture is a distributed messaging technology that is both hardware-independent and compatible with current network interconnects. The architecture provides an API that can be utilized to provide high-speed and low-latency communications between peers in clustered applications. InfiniBand was developed with the VIA architecture in mind. InfiniBand offloads traffic control from the software client through the use of execution queues. These queues, called work queues, are initiated by the client and then left for InfiniBand to manage. For each communication channel between devices, a Work Queue Pair (WQP: a send and a receive queue) is assigned at each end. The client places a transaction into the work queue as a Work Queue Entry (WQE), which is then processed by the channel adapter and sent out to the remote device. When the remote device responds, the channel adapter returns status to the client through a completion queue or event. The client can post multiple WQEs, and the channel adapter's hardware will handle each of the communication requests. The channel adapter then generates a Completion Queue Entry (CQE) to provide status for each WQE in the proper prioritized order. This allows the client to continue with other activities while the transactions are being processed.
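A minimal sketch of this queueing model is given below: the client posts work queue entries to a send queue, a simulated channel adapter processes them independently of the client, and completions are reported through a completion queue. All class and method names are invented for illustration and do not correspond to any real verbs API.

# Sketch of the work-queue / completion-queue model described above.
# A client posts Work Queue Entries (WQEs); the channel adapter (simulated
# here by a plain method call) consumes them and posts Completion Queue
# Entries (CQEs).  All names are illustrative; this is not a real verbs API.

from collections import deque

class QueuePairSketch:
    def __init__(self):
        self.send_queue = deque()        # WQEs posted by the client
        self.completion_queue = deque()  # CQEs posted by the adapter

    def post_send(self, wqe_id, payload):
        """Client side: enqueue a work request and return immediately."""
        self.send_queue.append({"wqe_id": wqe_id, "payload": payload})

    def adapter_process_all(self):
        """Stand-in for the channel adapter hardware draining the queue."""
        while self.send_queue:
            wqe = self.send_queue.popleft()
            # ... the adapter would transmit wqe["payload"] here ...
            self.completion_queue.append({"wqe_id": wqe["wqe_id"],
                                          "status": "success"})

    def poll_completion(self):
        """Client side: retrieve the next completion, if any."""
        return self.completion_queue.popleft() if self.completion_queue else None


if __name__ == "__main__":
    qp = QueuePairSketch()
    qp.post_send(1, b"first message")    # client keeps working after posting
    qp.post_send(2, b"second message")
    qp.adapter_process_all()             # adapter works through the WQEs
    while (cqe := qp.poll_completion()) is not None:
        print(f"WQE {cqe['wqe_id']} completed with status {cqe['status']}")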


Figure 11: InfiniBand Protocol Stack

11 INFINIBAND FABRIC VERSUS SHARED BUS

The switched-fabric architecture of InfiniBand is designed around a completely different approach compared to the limited capabilities of the shared bus. IBA specifies a point-to-point (PTP) communication protocol for primary connectivity. Being based upon PTP, each link along the fabric terminates at one connection point (or device). The actual underlying transport addressing standard is derived from the IP method employed by advanced networks. Each InfiniBand device is assigned an IP address, thus the load management and signal termination characteristics are clearly defined and more efficient. To add more TCA connection points or endnodes, the simple addition of a dedicated IBA switch is required. Unlike the shared bus, each TCA and IBA switch can be interconnected via multiple data paths in order to sustain maximum aggregate device bandwidth and provide fault tolerance by way of multiple redundant connections.


Figure 12: InfiniBand Protocol Stack


12 INFINIBRIDGE

InfiniBridge is effective for implementation of HCAs, TCAs, or standalone switches with very few external components. The device's channel adapter side has a standard 64-bit-wide PCI interface operating at 66 MHz that enables operation with a variety of standard I/O controllers, motherboards, and backplanes. The device's InfiniBand side is an advanced switch architecture that is configurable as eight 1X ports, two 4X ports, or a mix of each. Industry-standard external serializer/deserializers interface the switch ports to InfiniBand-supported media (printed circuit board traces, copper cable connectors, or fiber transceiver modules). No external memory is required for switching or channel adapter functions. The embedded processor initializes the IC on reset and executes subnet management agent functions in firmware. An I2C EPROM holds the boot configuration.

InfiniBridge also effectively implements managed or unmanaged switch applications. The PCI or CPU interface can connect external controllers running InfiniBand management software, or an unmanaged switch design can eliminate the processor connection for applications with low area and part count. Appropriate configuration of the ports can implement a 4X-to-four-1X aggregation switch. The InfiniBridge switching architecture implements these advanced features of the InfiniBand architecture: standard InfiniBand packets up to an MTU size of 4 Kbytes, eight virtual lanes plus one management lane, 16K unicast local identifiers (LIDs), 1K multicast LIDs, VCRC and ICRC integrity checks, and 4X-to-1X link aggregation.

12.1 Hardware transport performance of InfiniBridge

Hardware transport is probably the most significant feature InfiniBand offers to next-generation data center and telecommunications equipment. Hardware transport performance is primarily a measurement of CPU utilization during a period of a device's maximum wire-speed throughput; the lowest CPU utilization is desired. The following test setup was used to evaluate InfiniBridge hardware transport: two 800-MHz PIII servers with InfiniBridge 64-bit/66-MHz PCI channel adapter cards running Red Hat Linux 7.1, a 1X InfiniBand link between the two server channel adapters, an InfiniBand protocol analyzer inserted in the link, and an embedded storage protocol running over the link. The achieved wire speed was 1.89 Gbps in both directions simultaneously, which is 94 percent of the maximum possible bandwidth of a 1X link (2.5 Gbps minus 8B/10B encoding overhead, or 2 Gbps). During this time, the driver used an average of 6.9 percent of the CPU. The bidirectional traffic also traverses the PCI bus, which has a unidirectional upper limit of 4.224 Gbps. Although the InfiniBridge DMA engine can efficiently send burst packet data across the PCI bus, we speculate that PCI is the limiting factor in this test case.


13 INFINIBRIDGE CHANNEL ADAPTER ARCHITECTURE

The InfiniBridge channel adapter architecture has two blocks, each having independent ports to the switch fabric, as the figure shows. One block uses a direct memory access (DMA) engine interface to the PCI bus, and the other uses PCI target and PCI master interfaces. This provides flexibility in the use of the PCI bus and enables implementation of the InfiniPCI feature. This unique feature lets the transport hardware automatically translate PCI transactions to InfiniBand packets, thus enabling transparent PCI-to-PCI bridging over the InfiniBand fabric. Both blocks include hardware transport engines that implement the InfiniBand features of reliable connection, unreliable datagram, raw datagram, RDMA reads/writes, message sizes up to 2 Kbytes, and eight virtual lanes. The PCI target includes address bar/limit hardware to claim PCI transactions in segments of the PCI address space. Each segment can be associated with a standard InfiniBand channel in the PCI-target transport engine. The association lets claimed transactions be translated into InfiniBand packets that will go out over the corresponding channel. In the reverse direction, the PCI master also has segment hardware that lets a channel automatically translate InfiniBand packet payload into PCI transactions generated onto the PCI bus. This flexible segment capability and channel association enables transparent PCI bridge construction over the InfiniBand fabric. The DMA interface can move data directly between local memory and InfiniBand channels. This process uses execution queues containing linked lists of descriptors that one of multiple DMA execution engines will execute. Each descriptor can contain a multi-entry scatter-gather list, and each engine can use this list to gather data from multiple locations in local memory and combine it into a single message to send into an InfiniBand channel. Similarly, the engines can scatter data received from an InfiniBand channel to local memory.
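The scatter-gather idea in that last step can be sketched very simply: a descriptor lists several (offset, length) regions of local memory, and the engine concatenates them into one outgoing message or splits an incoming message back across the regions. The descriptor layout below is an assumption for illustration, not InfiniBridge's actual descriptor format.

# Sketch of gather/scatter over a descriptor's scatter-gather list.
# The descriptor layout is an illustrative assumption, not InfiniBridge's
# real descriptor format.

def gather(local_memory: bytearray, sg_list):
    """Concatenate several local-memory regions into one outgoing message."""
    return b"".join(bytes(local_memory[off:off + length])
                    for off, length in sg_list)

def scatter(local_memory: bytearray, sg_list, message: bytes):
    """Spread an incoming message back across the listed regions."""
    cursor = 0
    for off, length in sg_list:
        local_memory[off:off + length] = message[cursor:cursor + length]
        cursor += length

if __name__ == "__main__":
    memory = bytearray(b"AAAABBBBCCCCDDDD")       # pretend local memory
    sg = [(0, 4), (8, 4)]                         # two regions: bytes 0-3, 8-11
    msg = gather(memory, sg)                      # b"AAAACCCC"
    print(msg)
    scatter(memory, sg, b"xxxxyyyy")              # write them back, modified
    print(memory)                                 # bytearray(b'xxxxBBBByyyyDDDD')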


Figure 13: InfiniBridge Channel Adapter Architecture

14 VIRTUAL OUTPUT QUEUEING ARCHITECTURE

Figure 14: Virtual output-queuing architecture

InfiniBridge uses an advanced virtual output queuing (VOQ) and cut-through switching architecture to implement these features with low latency and nonblocking performance. Each port has a VOQ buffer, transmit scheduling logic, and packet decoding logic. Incoming data goes to both the VOQ buffer and the packet-decoding logic. The decoder extracts the parameters needed for flow control, scheduling, and forwarding decisions. Processing of the flow-control inputs gives link flow-control credits to the local transmit port, limiting output packets based on available credits. InfiniBridge decodes the destination local identification from the packet and uses it to index the forwarding database and retrieve the destination port number. The switch fabric uses the destination port number to decide which port to send the scheduling information to. The service level identification field is also extracted from the input packet by the decoder and used to determine the virtual lane, which goes to the destination port's transmit scheduling logic. All parameter decoding takes place in real time and is given to the switch fabric to make scheduling requests as soon as the information is available. The packet data is stored only once in the VOQ. The transmit-scheduling logic of each port arbitrates the order of output packets and pulls them from the correct VOQ buffer. Each port logic module is actually part of a distributed scheduling architecture that maintains the status of all output ports and receives all scheduling requests. In cut-through mode, a port scheduler receives notification of an incoming packet as soon as the local identification for that packet's destination is decoded. Once the port scheduler receives the virtual lane and other scheduling information, it schedules the packet for output. This transmission could start immediately, based on the priority of waiting packets and flow-control credits for the packet's virtual lane. The switch fabric actually includes three on-chip ports in addition to the eight external ones, as the figure shows. One port is a management port that connects to the internal RISC processor, which handles management packets and exceptions. The other two ports interface with the channel adapter.
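The sketch below captures the queueing structure only: each input port keeps a separate queue per (output port, virtual lane), so a busy output does not block traffic headed elsewhere; a trivial scheduler then drains one output at a time. The forwarding table, scheduling policy, and class names are illustrative assumptions.

# Sketch of virtual output queueing: each input port keeps one queue per
# (output port, virtual lane), so congestion on one output does not block
# packets destined for another.  Forwarding table, scheduling policy, and
# names are illustrative assumptions.

from collections import defaultdict, deque

class VOQSwitchSketch:
    def __init__(self, forwarding_table, sl_to_vl):
        self.forwarding_table = forwarding_table      # DLID -> output port
        self.sl_to_vl = sl_to_vl                      # SL   -> VL
        # voq[input_port][(output_port, vl)] -> queue of packets
        self.voq = defaultdict(lambda: defaultdict(deque))

    def receive(self, input_port, packet):
        out_port = self.forwarding_table[packet["dlid"]]
        vl = self.sl_to_vl[packet["sl"]]
        self.voq[input_port][(out_port, vl)].append(packet)

    def schedule_output(self, out_port):
        """Naive scheduler: pull the first waiting packet for this output."""
        for input_port, queues in self.voq.items():
            for (port, vl), q in queues.items():
                if port == out_port and q:
                    return input_port, vl, q.popleft()
        return None

if __name__ == "__main__":
    sw = VOQSwitchSketch({10: 1, 20: 2}, sl_to_vl=[0] * 16)
    sw.receive(0, {"dlid": 10, "sl": 0, "payload": b"a"})
    sw.receive(0, {"dlid": 20, "sl": 0, "payload": b"b"})
    # The packet for DLID 20 is not blocked behind the packet for DLID 10:
    print(sw.schedule_output(2))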


15 FORMAL MODEL TO MANAGE INFINIBAND ARBITRATION TABLES TO PROVIDE QUALITY OF SERVICE (QoS)

The InfiniBand Architecture (IBA) has been proposed as an industry standard both for communication between processing nodes and I/O devices and for interprocessor communication. It replaces the traditional bus-based interconnect with a switch-based network for connecting processing nodes and I/O devices. It is being developed by the InfiniBand Trade Association (IBTA) with the aim of providing the levels of reliability, availability, performance, scalability, and quality of service (QoS) required by present and future server systems. For this purpose, IBA provides a series of mechanisms that are able to guarantee QoS to the applications. It is therefore important for InfiniBand to be able to satisfy both the applications that only need minimum latency and those applications that need other characteristics to satisfy their QoS requirements. InfiniBand provides a series of mechanisms that, properly used, are able to provide QoS for the applications. These mechanisms are mainly the segregation of traffic according to categories and the arbitration of the output ports according to an arbitration table that can be configured to give priority to the packets with higher QoS requirements.

15.1 THREE MECHANISMS TO PROVIDE QoS

Basically, IBA has three mechanisms to support QoS: service levels, virtual lanes, and virtual lane arbitration.

15.1.1 Service Level

InfiniBand packets are marked with a service level (SL) that identifies their traffic class. By mapping different kinds of traffic onto different service levels, the fabric can treat each class according to its QoS requirements, allowing a finer provision of quality of service at the various communication levels.


15.1.2 Virtual Lanes

IBA ports support virtual lanes (VLs), providing a mechanism for creating multiple virtual links within a single physical link. A VL is an independent set of receiving and transmitting buffers associated with a port.

Each VL must be an independent resource for flow control purposes. IBA ports have to support a minimum of two and a maximum of 16 virtual lanes (VL0 ... VL15). All ports support VL15, which is reserved exclusively for subnet management and must always have priority over data traffic in the other VLs. Since systems can be constructed with switches supporting different numbers of VLs, the number of VLs used by a port is configured by the subnet manager. Also, packets are marked with a service level (SL), and a relation between SL and VL is established at the input of each link with the SL-to-VL Mapping Table. When more than two VLs are implemented, an arbitration mechanism is used to allow an output port to select which virtual lane to transmit from. This arbitration is only for data VLs, because VL15, which transports control traffic, always has priority over any other VL. The priorities of the data lanes are defined by the VL Arbitration Table.

15.1.3 Virtual Arbitration Table

When more than two VLs are implemented, the VL Arbitration Table defines the priorities of the data lanes. Each VL Arbitration Table has two tables: one for delivering packets from high-priority VLs and another one for low-priority VLs. Up to 64 table entries are cycled through, each one specifying a VL and a weight. The weight is the number of units of 64 bytes to be sent from that VL. This weight must be in the range of 0 to 255 and is always rounded up in order to transmit a whole packet. In addition, the Limit of High Priority value specifies the maximum amount of high-priority traffic that can be sent before a low-priority packet is sent. More specifically, the VLs of the High Priority table can transmit Limit of High Priority x 4096 bytes before a packet from the Low Priority table may be transmitted. If no high-priority packets are ready for transmission at a given time, low-priority packets can also be transmitted.
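The sketch below models this arbitration loop in simplified form: entries of (VL, weight) are cycled through, each weight is spent in units of 64 bytes, and a running budget of Limit of High Priority x 4096 bytes decides when the low-priority table gets a turn. Packet selection, rounding to whole packets, and all names are simplifying assumptions.

# Simplified sketch of VL arbitration.  High- and low-priority tables hold
# (vl, weight) entries; weight is spent in units of 64 bytes, and at most
# limit_hp * 4096 bytes of high-priority traffic may be sent before the
# low-priority table is given a turn.  Packet sizes, queue contents, and
# names are illustrative assumptions.

from collections import deque

def arbitrate(high_table, low_table, queues, limit_hp, rounds=16):
    """Yield (vl, packet) in the order a port would transmit them."""
    hp_budget = limit_hp * 4096          # bytes of high-priority credit
    hi, lo = 0, 0                        # cursors into the two tables
    for _ in range(rounds):
        sent = False
        # Serve the high-priority table while budget and traffic remain.
        if high_table and hp_budget > 0:
            vl, weight = high_table[hi % len(high_table)]
            hi += 1
            budget = weight * 64         # bytes allowed for this entry
            while queues[vl] and budget > 0:
                pkt = queues[vl].popleft()
                budget -= len(pkt)
                hp_budget -= len(pkt)
                sent = True
                yield vl, pkt
        if sent:
            continue
        # Otherwise serve one low-priority entry and refresh the HP budget.
        vl, weight = low_table[lo % len(low_table)]
        lo += 1
        budget = weight * 64
        while queues[vl] and budget > 0:
            pkt = queues[vl].popleft()
            budget -= len(pkt)
            yield vl, pkt
        hp_budget = limit_hp * 4096

if __name__ == "__main__":
    queues = {0: deque([b"v" * 256] * 4), 1: deque([b"d" * 256] * 4)}
    order = [vl for vl, _ in arbitrate(high_table=[(0, 8)], low_table=[(1, 8)],
                                       queues=queues, limit_hp=1)]
    print(order)   # VL0 (voice-like) packets are scheduled ahead of VL1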


Figure 15: Virtual Lanes

16 FORMAL MODEL FOR THE INFINIBAND ARBITRATION TABLE

We present an algorithm to find a sequence of free entries able to accommodate a connection request in the table. This algorithm is part of a formal model to manage the IBA arbitration table. In the next sections, we present this formal model and several algorithms that adapt it for use in a dynamic scenario where new requests and releases are made.

The treatment of the problem basically consists of setting out an efficient algorithm able to select a sequence of free entries in the arbitration table. These entries must be selected with a maximum separation between any consecutive pair. To develop this algorithm, we first propose some hypotheses and definitions to establish the correct framework, and later present the algorithm and its associated theorems. We consider some specific characteristics of IBA: the number of table entries (64) and the range of the weights (0 ... 255). All we need to know is that the requests are originated by the connections so that certain requirements are guaranteed. Besides, the group of entries assigned to a request belongs to the arbitration table associated with the output ports and interfaces of the InfiniBand switches and hosts, respectively.

Figure 16: Virtual Arbitration Table

We formally define the following concepts:

Table: a circular list of 64 entries.
Entry: each one of the 64 parts composing the table.
Weight: the numerical value of an entry in the table; it can vary between 0 and 255.
Status of an entry: the situation of a table entry, which can be free (weight 0) or occupied (weight greater than 0).
Request: a demand for a certain number of entries.
Distance: the maximum separation between two consecutive entries in the table that are assigned to one request.
Type of request: each one of the different types into which the requests can be grouped, based on the requested distances and, therefore, on the requested number of entries.
Group or sequence of entries: a set of table entries with a fixed distance between any consecutive pair. To characterize a sequence of entries, it is enough to give the first entry and the distance between consecutive entries.

16.0.4 Initial Hypothesis

In what follows, and unless indicated to the contrary, the following hypotheses will be considered:

1. There are no request eliminations, so the table is filled in when new requests are received and these requests are never removed. In other words, the entries can change from a free status to an occupied status, but it is not possible for an occupied entry to change to free. This hypothesis permits a simpler and clearer initial study, but it will, logically, be discarded later on.

2. It may be necessary to devote more than one group of entries to a set of requests of the same type.

3. The total weight associated with one request is distributed among the entries of the selected sequence so that the weight of the first entry of the sequence is always larger than or equal to the weight of the other entries of the sequence.


Figure 17: Structure of a VL Arbitration Table

4. The distance d associated with a request will always be a power of 2, and it must be between 1 and 64. These values define the different types of requests that we are going to consider.


17 FILLING IN THE VL ARBITRATION TABLE

The classification of traffic into categories based on its QoS requirements is just a first step toward the objective of providing QoS. A suitable filling in of the arbitration table is critical. We propose a strategy to fill in the weights of the arbitration tables. In this section, we see how to fill in the table in order to provide the bandwidth requested by each application, and also how to provide latency guarantees.

Each arbitration table only has 64 entries; hence, if we devoted a different entry to each connection, this could limit the number of connections that can be accepted. Also, a connection requiring very high bandwidth could need slots in more than one entry of the table. For that reason, we propose grouping the connections with the same SL into a single entry of the table until completing the maximum weight for that entry, before moving to another free entry. In this way, the number of entries in the table is not a limitation for the acceptance of new connections; only the available bandwidth is.

For a new request of maximum distance d = 2^i, the candidate sets of entries are examined; each set contains the entries needed to meet a request of that distance. The first one of these sets having all of its entries free is selected. The order in which the sets are examined has as an objective to maximize the distance between two free consecutive entries that would remain in the table after carrying out the selection. This way, the table remains in the optimum condition to be able to later meet the most restrictive possible request.
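A minimal sketch of this selection step is shown below, under the simplifying assumptions that the table has 64 entries, that a request of distance d = 2^i needs 64/d evenly spaced entries, and that a candidate sequence is identified by its first entry and the distance. The left-to-right scan order used here is an illustration only, not the separation-maximizing order described in the text.

# Sketch of selecting a sequence of free entries in the 64-entry VL
# arbitration table for a request of distance d (a power of 2).  A sequence
# is characterised by its first entry and the distance; a request of
# distance d occupies 64 // d evenly spaced entries.  The left-to-right
# scan order is a simplification of the ordering described in the text.

TABLE_SIZE = 64

def sequence_entries(first, d):
    """Entries of the sequence starting at `first` with distance `d`."""
    return [(first + k * d) % TABLE_SIZE for k in range(TABLE_SIZE // d)]

def find_free_sequence(table, d):
    """Return the entries of the first fully free sequence of distance d."""
    for first in range(d):                       # only d distinct sequences exist
        entries = sequence_entries(first, d)
        if all(table[e] == 0 for e in entries):  # weight 0 means the entry is free
            return entries
    return None

def allocate(table, d, total_weight):
    """Occupy a free sequence, giving the first entry the largest share."""
    entries = find_free_sequence(table, d)
    if entries is None:
        return None
    share, remainder = divmod(total_weight, len(entries))
    table[entries[0]] = share + remainder        # first entry gets the extra
    for e in entries[1:]:
        table[e] = share
    return entries

if __name__ == "__main__":
    table = [0] * TABLE_SIZE                       # empty arbitration table
    print(allocate(table, d=16, total_weight=10))  # 4 entries: 0, 16, 32, 48
    print(allocate(table, d=16, total_weight=8))   # next free sequence: 1, 17, 33, 49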

17.1 Insertion and elimination in the table

The elimination of requests is now possible. As a consequence, the entries used by the eliminated requests will be released. After insertions and eliminations, the free entries may no longer have the separation assumed by the filling-in algorithm, so the table may need to be reorganized.

17.1.1 Example 1.

Suppose we have the table filled and two requests of type d = 8 are eliminated. These requests were made using the entries of the sets specified in the tree. This means that, now, the table has free entries and, therefore, new requests can be accepted.

17.2 Defragmentation Algorithm

The basic idea of this algorithm is to group all of the free entries of the table into several free sets that permit meeting any request needing a number of entries equal to or lower than the number of available table entries. Thus, the objective of the algorithm is to perform a grouping of the free entries: a process that consists of joining the entries of two free sets of the same size into a unique free set. This joining will be effective only if the two free sets do not already belong to the same greater free set; therefore, the algorithm is restricted to singular sets. The goal is to have a free set of the biggest possible size in order to be able to meet a request of that size when the table has enough free entries which, however, belong to two smaller free sets that cannot meet that request on their own.
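Under the sequence model used above (a free set identified by its first entry and its distance), two free sets of the same size can be joined when they interleave: the sets starting at s and s + d/2, both of distance d, merge into one set of distance d/2 starting at s. The sketch below implements only this joining step; it is an illustrative reading of the algorithm, not the authors' exact formulation.

# Sketch of the defragmentation (joining) step under the sequence model:
# a free set is identified by (first_entry, distance).  Two free sets of
# the same distance d that start at s and s + d // 2 interleave perfectly
# and can be joined into a single free set (s, d // 2), i.e. one set with
# twice as many entries.  This is an illustrative reading of the algorithm,
# not the authors' exact formulation.

def try_join(set_a, set_b):
    """Join two free sets of equal size if they interleave; else return None."""
    (first_a, d_a), (first_b, d_b) = sorted([set_a, set_b])
    if d_a != d_b or d_a < 2:
        return None                      # different sizes, or distance already 1
    if first_b - first_a != d_a // 2:
        return None                      # they do not interleave
    return (first_a, d_a // 2)           # joined set: same start, half the distance

def defragment(free_sets):
    """Repeatedly join free sets until no more joins are possible."""
    sets = list(free_sets)
    changed = True
    while changed:
        changed = False
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                joined = try_join(sets[i], sets[j])
                if joined is not None:
                    sets = [s for k, s in enumerate(sets) if k not in (i, j)]
                    sets.append(joined)
                    changed = True
                    break
            if changed:
                break
    return sets

if __name__ == "__main__":
    # Two free sets of distance 16 (4 entries each), starting at 0 and 8:
    print(defragment([(0, 16), (8, 16)]))   # -> [(0, 8)]  (one set of 8 entries)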

17.3 Reordering Algorithm

The reordering algorithm basically consists of an ordering algorithm, but applied at the level of sets. This algorithm has been designed to be applied to a table that is not ordered, with the purpose of leaving the table ordered, so that an ordered table will ensure the proper handling of requests.

17.4 Global management of the table

For the global management of the table, having both insertions and releases, we have shown that a combination of the filling-in and defragmentation algorithms (and even the reordering algorithm, if needed) must be used. Using this global management of the table, it can be shown that the table will always have a correct status, so that the propositions of the filling-in algorithm continue to hold. Hence, overall management of the arbitration table is achieved.


18 CONCLUSION

InfiniBand is a powerful new architecture designed to support I/O connectivity for the Internet infrastructure. InfiniBand is supported by all major OEM server vendors as a means to expand and create the next-generation I/O interconnect standard in servers. IBA enables QoS (Quality of Service) through certain mechanisms, basically service levels, virtual lanes, and table-based arbitration of virtual lanes. InfiniBand has a formal model to manage the arbitration tables and provide QoS; according to this model, each application needs a sequence of entries in the IBA arbitration tables based on its requirements. These requirements are related to the mean bandwidth needed and the maximum latency tolerated by the application. InfiniBand provides a comprehensive silicon, software, and system solution, built on a layered protocol and a management infrastructure. Mellanox and related companies are now positioned to release InfiniBand as a multifaceted architecture within several market segments. The most notable application area is enterprise-class network clusters and Internet data centers. These types of applications require extreme performance with the maximum in fault tolerance and reliability. Other computing system uses include Internet service providers, colocation hosting, and large corporate networks. At least for its introduction, InfiniBand is positioned as a complementary architecture. IBA will move through a transitional period where future PCI, IBA, and other interconnect standards can be offered within the same system or network. The understanding of PCI's limitations (even PCI-X's) should allow InfiniBand to be an aggressive market contender as higher-class systems move toward conversion to IBA devices.

Currently Mellanox is developing the IBA software interface standard using Linux as its internal OS choice. Another key concern is the cost of implementing InfiniBand at the consumer level. Industry sources are currently projecting IBA prices to fall somewhere between those of the currently available Gigabit Ethernet and Fibre Channel technologies. InfiniBand could be positioned as the dominant I/O connectivity architecture at all upper-tier levels, providing the top level of Quality of Service (QoS), which can be implemented with the various methods discussed. This is definitely a technology to watch, and one that can foster a competitive market.

