© 2005 Cisco Systems, Inc. All rights reserved. Cisco Public
InfiniBand: Today and Tomorrow
Jamie Riotto
Sr. Director of Engineering, Cisco Systems (formerly Topspin Communications)
Agenda
• InfiniBand Today
– State of the market
– Cisco and InfiniBand
– InfiniBand products available now
– Open source initiatives
• InfiniBand Tomorrow
– Scaling InfiniBand
– Future Issues
• Q&A
InfiniBand Maturity Milestones
• High adoption rates
– Currently shipping > 10,000 IB ports / Qtr
• Cisco acquisition will drive broader market adoption
• End-to-end price points of <$1000.
• New Cluster scalability proof-points
– 1000 to 4000 nodes
Cisco Adopts InfiniBand
• Cisco acquired Topspin on May 16, 2005
• Adds InfiniBand to Switching Portfolio
– Network Switches, Storage Switches, now Server Switches
– Creates independent Business Unit to promote InfiniBand & Server Virtualization
• New Product line of Server Fabric Switches (SFS)
– SFS 7000 Series InfiniBand Server Switches
– SFS 3000 Series Multifabric Server Switches
Cisco and InfiniBand: The Server Fabric Switch
• Network Switch: Clients, Network Resources (Internet, Printer, Server)
• Storage Switch: Server, Storage (SAN)
• Server Switch: Servers, Storage, Network
Cisco HPC Case Studies
Real Deployments Today: Wall Street Bank with 512 Node Grid
SAN LAN
2 96-port TS-270
23 24-port TS-120
512 Server Nodes
2 TS-360 w/ Ethernet and Fibre Channel Gateways
Core Fabric
Edge Fabric
GRID I/O
Existing Networks
Fibre Channel and GigE connectivity built seamlessly into the cluster
NCSA (National Center for Supercomputing Applications)
Tungsten 2: 520 Node Supercomputer
520 Dual CPU Nodes, 1,040 CPUs
Core Fabric: 6 72-port TS270
Edge Fabric: 29 24-port TS120
174 uplink cables
512 1m cables
18 Compute Nodes
Parallel MPI codes for commercial clients
Point to point 5.2us MPI latency
Deployed: November 2004
D.E. Shaw Bio-Informatics: 1,066 Node Supercomputer
Fault Tolerant
Core Fabric: 12 96-port TS-270
Edge Fabric: 89 24-port TS-120
1,068 5m/7m/10m/15m uplink cables
1,066 1m cables
12 Compute Nodes
1,066-Node Fully Non-Blocking Fault Tolerant IB Cluster
Large Government Lab: World's Largest Commodity Server Cluster – 4096 nodes
• Application:
High Performance Super Computing Cluster
• Environment:
4096 Dell Servers
50% Blocking Ratio
8 TS-740s
256 TS-120s
• Benefits:
Compelling Price/Performance
Largest Cluster Ever Built (by approx. 2X)
Expected to be 2nd Largest Supercomputer in the world by node count
Core Fabric: 8x SFS TS740, 288 ports each
Edge: 256x TS120, 24 ports each
18 Compute Nodes
8192 Processor 60TFlop SuperCluster
2048 uplinks (7m/10m/15m/20m)
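The switch, node, and uplink counts on the three case-study slides can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes nodes and uplinks are spread evenly over the 24-port edge switches, and the helper name `edge_layout` is made up for this example.

```python
import math
from fractions import Fraction

def edge_layout(nodes, edge_switches, uplinks, ports_per_edge=24):
    """Return (nodes per edge switch, uplinks per edge switch, oversubscription).

    Assumes an even spread of nodes and uplinks across edge switches.
    """
    down = math.ceil(nodes / edge_switches)  # host-facing ports, worst case
    up = uplinks // edge_switches            # uplinks toward the core fabric
    if down + up > ports_per_edge:
        raise ValueError("edge switch does not have enough ports")
    return down, up, Fraction(down, up)

# (nodes, edge switches, uplink cables) as quoted on the slides
deployments = {
    "NCSA Tungsten 2": (520, 29, 174),
    "D.E. Shaw": (1066, 89, 1068),
    "Government lab": (4096, 256, 2048),
}

for name, (nodes, edges, uplinks) in deployments.items():
    down, up, ratio = edge_layout(nodes, edges, uplinks)
    print(f"{name}: {down} nodes + {up} uplinks per edge switch, "
          f"{ratio.numerator}:{ratio.denominator} oversubscription")
```

Under these assumptions every edge switch uses all 24 ports. The D.E. Shaw fabric comes out at 1:1, matching "Fully Non-Blocking", and the government lab at 2:1, matching the "50% Blocking Ratio". An even spread there gives 16 nodes per edge switch, which differs from the 18 shown on the diagram label.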
InfiniBand Products Available Today
InfiniBand Switches and HCAs
• Fully non-blocking switch building blocks available in sizes from 24 up to 288 ports.
• Blade servers offer integrated switches and pass-through modules
• HCAs available in PCI-X and PCI-Express
• IP & Fibre-Channel Gateway Modules
Integrated InfiniBand for Blade Servers: Create “wire-once” fabric
• Integrated 10Gbps InfiniBand switches provide unified “wire-once” fabric
• Optimize density, cooling, space, and cable management.
• Option of integrated InfiniBand switch (ex: IBM BC) or pass-thru module (ex: Dell 1855)
• Virtual I/O provides shared Ethernet and Fibre Channel ports across blades and racks
[Diagram: Blade Chassis with InfiniBand Switches. Labels: HCA, two IB Switches, 10Gbps and 30Gbps links]
Ethernet and Fibre Channel Gateways: Unified “wire-once” fabric
[Diagram: Server Cluster on the Server Fabric, with gateways to SAN and LAN/WAN]
• Fibre Channel to InfiniBand gateway for storage access
• Ethernet to InfiniBand gateway for LAN access
• Single InfiniBand link for: Storage, Network
InfiniBand Price / Performance
Metric | InfiniBand PCI-Express | 10GigE | GigE | Myrinet D | Myrinet E
Data Bandwidth (Large Messages) | 950MB/s | 900MB/s | 100MB/s | 245MB/s | 495MB/s
MPI Latency (Small Messages) | 5us | 50us | 50us | 6.5us | 5.7us
HCA Cost (Street Price) | $550 | $2K-$5K | Free | $535 | $880
Switch Port | $250 | $2K-$6K | $100-$300 | $400 | $400
Cable Cost (3m Street Price) | $100 | $100 | $25 | $175 | $175
* Myrinet pricing data from Myricom Web Site (Dec 2004)
** InfiniBand pricing data based on Topspin avg. sales price (Dec 2004)
*** Myrinet, GigE, and IB performance data from June 2004 OSU study
• Note: latencies are MPI processor-to-processor; switch latency alone is lower
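The per-node price implied by the table (HCA + switch port + 3m cable) can be totalled directly. This is a rough sketch using only the street prices quoted above: "Free" is taken as $0, ranges are carried as (low, high) pairs, and the names `PRICES` and `per_node_cost` are illustrative.

```python
# (low, high) street prices in USD, copied from the table; "Free" is taken as 0.
PRICES = {
    "InfiniBand": {"hca": (550, 550), "port": (250, 250), "cable": (100, 100)},
    "10GigE": {"hca": (2000, 5000), "port": (2000, 6000), "cable": (100, 100)},
    "GigE": {"hca": (0, 0), "port": (100, 300), "cable": (25, 25)},
    "Myrinet D": {"hca": (535, 535), "port": (400, 400), "cable": (175, 175)},
    "Myrinet E": {"hca": (880, 880), "port": (400, 400), "cable": (175, 175)},
}

# Large-message bandwidth in MB/s, from the same table.
BANDWIDTH_MB_S = {
    "InfiniBand": 950, "10GigE": 900, "GigE": 100,
    "Myrinet D": 245, "Myrinet E": 495,
}

def per_node_cost(name):
    """Return (low, high) cost in USD of one HCA, one switch port, one cable."""
    parts = list(PRICES[name].values())
    return sum(p[0] for p in parts), sum(p[1] for p in parts)

for name in PRICES:
    low, high = per_node_cost(name)
    dollars_per_mb_s = low / BANDWIDTH_MB_S[name]  # best case for each fabric
    print(f"{name}: ${low}-${high} per node, "
          f"${dollars_per_mb_s:.2f} per MB/s at the low price")
```

With these figures InfiniBand totals $900 per node, consistent with the "<$1000" end-to-end price point quoted earlier in the deck.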
InfiniBand Cabling
• CX4 Copper (15m)
• Flexible 30-Gauge Copper (3m)
• Fiber Optics up to 150m
Host Drivers for Standard Protocols
• Open source strategy = reliability at low cost
• IPoIB: legacy TCP/IP applications
• SDP: reliable socket connections (optional RDMA)
• MPI: leading edge HPCC applications (RDMA)
• SRP: block storage access (RDMA)
• uDAPL: User level RDMA
OS Support
• Operating Systems Available:
– Linux (Red Hat, SuSE, Fedora, Debian, etc.)
– Windows 2000 and 2003
– HP-UX (Via HP)
– Solaris (Via Sun)
The InfiniBand Driver Architecture
[Diagram: layered driver stack, split into User and Kernel space. Labels by layer:]
• Application: BSD Sockets, FS API, uDAPL, DAT
• Network: TCP, SDP, IP, IPoIB
• Storage: File System, NFS-RDMA, SCSI, SRP, FCP, FC
• Drivers: VERBS, TS API
• Hardware: Ether, InfiniBand HCA
• Fabric: InfiniBand Switch (Server Fabric), Eth GW to Ether Switch (LAN/WAN), FC GW to FC Switch (SAN)
Open Software Initiatives
• OpenIB.org
– Topspin engineers are primary authors of major portions, including IPoIB, SDP, SRP and TS-API. Cisco will continue to invest.
– Current protocol development is nearing production-quality code. Expect release by end of year.
– Charter has been expanded to include Windows and iWARP
– MPI will be available in the near future (MVAPICH 0.96)
• OpenSM
• OpenMPI
InfiniBand Tomorrow
Looking into the future
• Cost
• Speed
• Distance Limitations
• Cable Management
• Scalability
• IB and Ethernet
Speed: InfiniBand DDR / QDR, 4X / 12X
• DDR Available end of 2005
– Doubles wire speeds to ? (ok, still working on this one)
– PCI-Express DDR
– Distances of 5-10m using copper
– Distances of 100m using fiber
• QDR Available WHEN?
• 12X (30 Gb/s) available for over one year!!
– Not interesting until 12X HCA
• Not interesting until > 16X PCIe
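The 12X and PCIe remarks can be put in numbers. The sketch below is not from the slides: it uses the per-lane signaling rates from the InfiniBand specification (2.5, 5, and 10 Gb/s for SDR, DDR, and QDR), 8b/10b encoding, and assumes a first-generation PCI Express host slot at 2.5 Gb/s per lane. Function names are illustrative.

```python
import math
from fractions import Fraction

# Per-lane signaling rates in Gb/s (InfiniBand specification values).
LANE_SIGNAL_GBPS = {"SDR": Fraction(5, 2), "DDR": Fraction(5), "QDR": Fraction(10)}
ENCODING = Fraction(8, 10)                 # 8b/10b: 8 data bits per 10 line bits
PCIE1_LANE_SIGNAL_GBPS = Fraction(5, 2)    # assumption: PCIe 1.x host slot

def ib_signal_gbps(speed, width):
    """Raw signaling rate of a link, e.g. ("SDR", 12) for a 12X SDR link."""
    return LANE_SIGNAL_GBPS[speed] * width

def ib_data_gbps(speed, width):
    """Usable data rate after 8b/10b encoding overhead."""
    return ib_signal_gbps(speed, width) * ENCODING

def min_pcie1_lanes(speed, width):
    """Fewest PCIe 1.x lanes whose data rate covers the IB link's data rate."""
    per_lane = PCIE1_LANE_SIGNAL_GBPS * ENCODING
    return math.ceil(ib_data_gbps(speed, width) / per_lane)

for speed in ("SDR", "DDR", "QDR"):
    for width in (4, 12):
        print(f"{width}X {speed}: "
              f"{float(ib_signal_gbps(speed, width)):g} Gb/s signaling, "
              f"{float(ib_data_gbps(speed, width)):g} Gb/s data, "
              f"needs >= {min_pcie1_lanes(speed, width)} PCIe 1.x lanes")
```

Under these assumptions a 12X SDR link (30 Gb/s signaling) needs 12 PCIe lanes, so in practice a x16 slot, and a 12X DDR link would need 24 lanes, more than a x16 slot provides.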
Future InfiniBand Cables
• InfiniBand over CAT5 / CAT6 / CAT7
– Shielded cable distances up to ???
– Leverage existing 10-GigE cabling
– 10-GigE too expensive?
IB Distance Scaling
• IB Short Haul
– New Copper drivers
– 25 - 50 Meters (KeyEye)
– 75 - 100 Meters (IEEE 10GigE)
• IB WAN
– Same Subnet over distance (300 km target)
– Buffer / Credit / Timeout issues
– Applications: Disaster Recovery, Data Mirroring
• IB Long Haul
– IB over IP (over SONET?)
– Utilizes existing public plant (WDM, Debugging, etc.)
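The buffer and credit issue at WAN distances is essentially a bandwidth-delay calculation: a sender can only keep the link full if the receiver advertises enough credits to cover one round trip. The sketch below is a back-of-the-envelope estimate, not a vendor figure. It assumes roughly 5 microseconds of propagation delay per km of fiber, an 8 Gb/s data rate (4X SDR after encoding), and 64-byte credit blocks; all three are assumptions of this example.

```python
import math

FIBER_NS_PER_KM = 5000  # assumption: about 5 microseconds per km in fiber

def bytes_in_flight(distance_km, data_gbps):
    """Bytes needed to fill the link for one round trip (bandwidth x RTT)."""
    rtt_ns = 2 * distance_km * FIBER_NS_PER_KM
    bits = data_gbps * rtt_ns          # Gb/s x ns = bits
    return bits // 8

def credits_needed(distance_km, data_gbps, credit_bytes=64):
    """Flow-control credits needed, assuming one credit per 64-byte block."""
    return math.ceil(bytes_in_flight(distance_km, data_gbps) / credit_bytes)

for km in (0.1, 10, 300):
    print(f"{km} km: {bytes_in_flight(km, 8):,.0f} bytes in flight, "
          f"{credits_needed(km, 8):,} credits")
```

At the 300 km target this works out to about 3 MB of receive buffering per link, far beyond what a switch port sized for machine-room cable lengths would need, which is why buffering, credits, and timeouts all need attention over distance.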
Scaling InfiniBand
• Subnet Management
• Host-side Drivers
– MPI
– IPoIB
– SRP
• Memory Utilization
IB Subnet Manager
• Subnets are getting bigger
– 4,000 -> 10,000 nodes
– Topology convergence times
• Topology disturbance times
• Topology disturbance minimization
Subnet Management Challenges
• Cluster Cold Start times
– Template Routing
– Persistent Routing
• Cluster Topology Change Management
– Intentional Change - Maintenance
– Unintentional Change - Dealing with Faults
• How to impact minimum number of connections
• Predetermine fault reaction strategy?
• Topology Diagnostic Tools
– Link/Route Verification
– Built-in BERT testing
• Partition Management
Multiple Routing Models
• Minimum Latency Routing:
– Load-Balanced Shortest-Path Routing
• Minimum Contention Routing:
– Lowest-Interference Divergent-Path Routing
• Template Driven Routing:
– Supports Pre-Determined Routing Topology
– For example: Clos Routing, Matrix Row/Column, etc
– Automatic Cabling Verification for Large Installations
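As an illustration of the first model, load-balanced shortest-path routing can be sketched as a breadth-first search per destination, with ties between equal-length paths broken in favor of the least-used link. This is a toy version for a two-spine, two-leaf fabric, not the subnet manager's actual algorithm; the topology and all names are hypothetical.

```python
from collections import defaultdict, deque

def hop_counts(graph, dest):
    """Breadth-first search: number of hops from every node to dest."""
    dist = {dest: 0}
    queue = deque([dest])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def build_tables(graph, switches, hosts):
    """Per-switch table: destination host -> next hop on a shortest path."""
    load = defaultdict(int)  # routes already assigned to each (switch, next hop)
    tables = {sw: {} for sw in switches}
    for dest in hosts:
        dist = hop_counts(graph, dest)
        for sw in switches:
            # Neighbors that are one hop closer to the destination.
            closer = [n for n in graph[sw] if dist[n] == dist[sw] - 1]
            # Among equal-length paths, pick the least-loaded link.
            best = min(closer, key=lambda n: (load[(sw, n)], n))
            tables[sw][dest] = best
            load[(sw, best)] += 1
    return tables

# Hypothetical fabric: two leaf switches, two spines, two hosts per leaf.
graph = {
    "leaf1": ["spine1", "spine2", "h1", "h2"],
    "leaf2": ["spine1", "spine2", "h3", "h4"],
    "spine1": ["leaf1", "leaf2"],
    "spine2": ["leaf1", "leaf2"],
    "h1": ["leaf1"], "h2": ["leaf1"], "h3": ["leaf2"], "h4": ["leaf2"],
}
tables = build_tables(graph, ["leaf1", "leaf2", "spine1", "spine2"],
                      ["h1", "h2", "h3", "h4"])
for sw, table in tables.items():
    print(sw, table)
```

In this small fabric each leaf ends up sending half of its remote destinations through each spine, which is the load-balancing effect the routing model is after.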
IB Routing Challenges
• Static / Dynamic Routing
– IB implements Static Routing through Linear Forwarding Tables at each chip
– Multi-LID Routing enables Dynamic Routing
• Credit Loops
• Cost-Based Routing
– Speed mismatches cause Store & Forward (vs. cut through)
– SDR <> DDR <> QDR
– 4X <> 12X
– Short Haul <> Long Haul
Multi-LID Source-Based Routing Support
• Applications can implement “Dynamic” Routing for Contention Avoidance, Failover, Parallel Data Transfer
[Diagram: Leaf Switches connected through Spine Switches; paths labeled 1, 2, 3, 4]
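A minimal sketch of the idea: with a LID Mask Control (LMC) value of 2, each end port answers to four consecutive LIDs, and the subnet manager can program the static forwarding tables so that each of those LIDs crosses a different spine. The source then "routes" simply by choosing which destination LID to use. The spine names, base LID, and helper functions below are hypothetical.

```python
LMC = 2  # LID Mask Control: each end port answers to 2**LMC = 4 LIDs
SPINES = ["spine1", "spine2", "spine3", "spine4"]  # hypothetical uplinks

def lids_for(base_lid, lmc=LMC):
    """All LIDs of a port; the base LID must be aligned to 2**lmc."""
    if base_lid % (2 ** lmc) != 0:
        raise ValueError("base LID must have its low LMC bits clear")
    return [base_lid + i for i in range(2 ** lmc)]

def program_leaf_lft(dest_base_lid, lmc=LMC):
    """Static forwarding-table fragment: destination LID -> spine uplink."""
    lids = lids_for(dest_base_lid, lmc)
    return {lid: SPINES[i % len(SPINES)] for i, lid in enumerate(lids)}

def pick_dlid(dest_base_lid, lft, avoid=frozenset()):
    """Source-side path choice: first destination LID that avoids bad spines."""
    for lid in lids_for(dest_base_lid):
        if lft[lid] not in avoid:
            return lid
    raise RuntimeError("no usable path to destination")

lft = program_leaf_lft(4)                          # destination owns LIDs 4..7
print(lft)
print(pick_dlid(4, lft))                           # default path
print(pick_dlid(4, lft, avoid={"spine1"}))         # fail over around spine1
```

The forwarding tables themselves never change here; contention avoidance, failover, and parallel transfers all come from the application or driver picking a different destination LID.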
New IB Peripherals
• CPUs?
• Storage
– SAN
– NFS-RDMA
• Memory (coherent / non-coherent)
• Purpose built Processors?
– Floating Point Processors
– Graphics Processors
– Pattern Matching Hardware
– XML Processor
THANK YOU!
• Questions & Answers