TRANSCRIPT
Arne Heitmann | Sr. System Engineer EMEA
ComConsult Netzwerk Forum | Koenigswinter | 19.04.2016
Architektur im Rechenzentrum - 25, 50 und 100G
Architectures for the Datacenter – 25, 50 and 100G
© 2016 Mellanox Technologies 2
Agenda
Introduction – Drivers for higher speeds
Bandwidth Factors
• Silicon / Buffers
• Transceivers / Cabling
Bandwidth and Resource Optimization
• Remote Direct Memory Access (RDMA)
• Remote Direct Memory Access over Converged Ethernet (RoCE)
Possible Scenarios
© 2016 Mellanox Technologies 3
Drivers for Higher Speeds
Introduction
© 2016 Mellanox Technologies 4
Entering The Era of 25GbE, 50GbE And 100GbE
• Cables: copper (passive, active), optical cables (VCSEL), silicon photonics
• 100GbE adapter (10 / 25 / 40 / 50 / 56 / 100GbE), including a multi-host solution
• Switch: 32x 100GbE ports or 64x 25/50GbE ports (10 / 25 / 40 / 50 / 56 / 100GbE), throughput of 6.4Tb/s
© 2016 Mellanox Technologies 5
Demand
More Virtual Machines Per Server
Interconnect Bandwidth Determines VM Density
• 10GbE adapter card: 20 VMs per server
• Mellanox 40GbE adapter card: 60 VMs per server
© 2016 Mellanox Technologies 6
Demand
The World of Bandwidth is Changing
[Figures: international bandwidth growth (projected 2012-2019); global IP traffic by type in petabytes/month. Sources: Ars Technica, 2012; TeleGeography/ITU]
© 2016 Mellanox Technologies 7
Demand
Media/Entertainment – Acceleration already happening
Data rates and storage requirements are exploding due to high pixel counts and frame rates. 10GbE is not going to provide the necessary bandwidth going forward.

| Description | Hres | Vres | Colour depth (bits) | Pixels RGB | FPS | RAW BW (MB/s) | RAW BW (Gbit/s) | 8Gb FC lanes | 16Gb FC lanes | 10Gb lanes | 40Gb lanes | 56Gb lanes | 100Gb lanes | Storage (GB/s) | 90-min movie (TB) |
| HD Video - Low FPS | 1920 | 1080 | 16 | 3 | 30 | 373.25 | 2.99 | 1 | 1 | 1 | 1 | 1 | 1 | 0.37 | 2.02 |
| HD Video (US) | 1920 | 1080 | 16 | 3 | 50 | 622.08 | 4.98 | 1 | 1 | 1 | 1 | 1 | 1 | 0.62 | 3.36 |
| HD Video (EMEA) | 1920 | 1080 | 16 | 3 | 60 | 746.50 | 5.97 | 1 | 1 | 1 | 1 | 1 | 1 | 0.75 | 4.03 |
| 2K Video (US) | 2048 | 1080 | 16 | 3 | 50 | 663.55 | 5.31 | 1 | 1 | 1 | 1 | 1 | 1 | 0.66 | 3.58 |
| 2K Video (EMEA) | 2048 | 1080 | 16 | 3 | 60 | 796.26 | 6.37 | 2 | 1 | 1 | 1 | 1 | 1 | 0.80 | 4.30 |
| 4K UHD (Std FPS) | 3840 | 2160 | 16 | 3 | 30 | 1492.99 | 11.94 | 2 | 1 | 2 | 1 | 1 | 1 | 1.49 | 8.06 |
| 4K UHD (3D FPS) | 3840 | 2160 | 16 | 3 | 60 | 2985.98 | 23.89 | 4 | 2 | 4 | 1 | 1 | 1 | 2.99 | 16.12 |
| 4K Cinema (Std FPS) | 4096 | 2160 | 16 | 3 | 30 | 1592.52 | 12.74 | 3 | 2 | 2 | 1 | 1 | 1 | 1.59 | 8.60 |
| 4K-Full Cinema (Std FPS) | 4096 | 3112 | 16 | 3 | 30 | 2294.42 | 18.36 | 4 | 2 | 3 | 1 | 1 | 1 | 2.29 | 12.39 |
| 4K Cinema (3D FPS) | 4096 | 2160 | 16 | 3 | 60 | 3185.05 | 25.48 | 5 | 3 | 4 | 1 | 1 | 1 | 3.19 | 17.20 |
| 5K Cinema (Std FPS) | 5120 | 2700 | 16 | 3 | 30 | 2488.32 | 19.91 | 4 | 2 | 3 | 1 | 1 | 1 | 2.49 | 13.44 |
| 5K Cinema (3D FPS) | 5120 | 2700 | 16 | 3 | 60 | 4976.64 | 39.81 | 7 | 4 | 6 | 2 | 1 | 1 | 4.98 | 26.87 |
| 8K UHD (Std FPS) | 7680 | 4320 | 16 | 3 | 30 | 5971.97 | 47.78 | 8 | 4 | 7 | 2 | 1 | 1 | 5.97 | 32.25 |
| 8K UHD (3D FPS) | 7680 | 4320 | 16 | 3 | 60 | 11943.94 | 95.55 | 16 | 8 | 14 | 3 | 2 | 2 | 11.94 | 64.50 |
| Super Hi-Vision | 7680 | 4320 | 16 | 3 | 120 | 23887.87 | 191.10 | 32 | 16 | 28 | 6 | 4 | 3 | 23.89 | 128.99 |
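The MB/s, Gbit/s and 90-minute-movie columns follow directly from resolution, colour depth and frame rate. A minimal Python sketch of that arithmetic (assuming "16" is the colour depth per channel, i.e. 2 bytes, and "3" is the number of RGB channels; the lane-count columns additionally include headroom and protocol overhead and are not reproduced here):

```python
def video_rates(hres, vres, bits_per_channel, channels, fps):
    """Raw uncompressed video bandwidth and 90-minute storage, decimal units
    (1 MB = 10^6 B, 1 TB = 10^12 B), matching the table's columns."""
    bytes_per_sec = hres * vres * (bits_per_channel // 8) * channels * fps
    mb_per_sec   = bytes_per_sec / 1e6
    gbit_per_sec = bytes_per_sec * 8 / 1e9
    movie_90_tb  = bytes_per_sec * 90 * 60 / 1e12
    return mb_per_sec, gbit_per_sec, movie_90_tb

# First and last table rows as a cross-check
for name, h, v, fps in [("HD Video - Low FPS", 1920, 1080, 30),
                        ("Super Hi-Vision",    7680, 4320, 120)]:
    mb, gbit, tb = video_rates(h, v, 16, 3, fps)
    print(f"{name}: {mb:.2f} MB/s, {gbit:.2f} Gbit/s, {tb:.2f} TB per 90-min movie")
# HD Video - Low FPS: 373.25 MB/s, 2.99 Gbit/s, 2.02 TB per 90-min movie
# Super Hi-Vision: 23887.87 MB/s, 191.10 Gbit/s, 128.99 TB per 90-min movie
```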
© 2016 Mellanox Technologies 8
Demand
New Storage Media Require Faster Networks
The transition to faster storage media requires faster networks.
Flash SSDs move the bottleneck from the storage to the network.
What does it take to saturate one 10Gb/s link?
• 24 x HDDs
• 2 x SATA SSDs
• 1 x SAS SSD
• NVMe…
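A rough back-of-the-envelope check of those saturation counts, with assumed per-device sustained throughputs (the ~50 / ~550 / ~1100 MB/s figures below are illustrative assumptions, not from the slide):

```python
LINK_GBPS = 10
LINK_MBPS = LINK_GBPS * 1000 / 8          # ~1250 MB/s of payload, overhead ignored

# Assumed sustained sequential throughput per device (MB/s) -- illustrative only.
devices = {"HDD": 50, "SATA SSD": 550, "SAS SSD": 1100}

for name, mbps in devices.items():
    print(f"{name:9s}: {LINK_MBPS / mbps:4.1f} devices to fill a {LINK_GBPS}Gb/s link")
# HDD      : 25.0 devices ...   (slide: 24, i.e. roughly 52 MB/s per HDD)
# SATA SSD :  2.3 devices ...   (slide: 2)
# SAS SSD  :  1.1 devices ...   (slide: 1)
```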
© 2016 Mellanox Technologies 9
Demand
Clouds: Private, Public, Hybrid
Scale up vs. Scale out
The SDDC requires more network interaction
Higher bandwidth required
© 2016 Mellanox Technologies 10
Moving to 25GbE, 50GbE And 100GbE
• Compute nodes: 10GbE → 25GbE (150% higher bandwidth)
• Storage nodes: 40GbE → 50GbE (25% higher bandwidth)
• Network: 40GbE → 100GbE (150% higher bandwidth)
• Same connectors
• Similar infrastructure
• Better cost / power
© 2016 Mellanox Technologies 11
Bandwidth Factor
Silicon
© 2016 Mellanox Technologies 12
Silicon - SerDes
The switch silicon connects to the board via SerDes lanes.
SerDes = Serializer / Deserializer
• May run at roughly 10Gb/s, 14Gb/s or 25Gb/s per lane
• Serves a certain number of ports
• Lanes can be bundled, e.g. 4x10G for a 40Gb/s link, 4x14G for a 56Gb/s link, 4x25G for a 100G link
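The bundling arithmetic is just lane rate times lane count; a small sketch (nominal lane classes only, ignoring exact line rates such as 10.3125 or 25.78125 GBd):

```python
# Nominal SerDes lane classes (Gb/s) and the port speeds built from 1x / 4x bundles.
LANE_CLASSES = {"~10G lane": 10, "~14G lane": 14, "~25G lane": 25}

for name, rate in LANE_CLASSES.items():
    print(f"{name}: 1 lane = {rate}G, 4-lane bundle = {4 * rate}G")
# ~10G lane: 1 lane = 10G, 4-lane bundle = 40G    (10GbE / 40GbE)
# ~14G lane: 1 lane = 14G, 4-lane bundle = 56G    (56GbE)
# ~25G lane: 1 lane = 25G, 4-lane bundle = 100G   (25GbE / 100GbE)
```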
© 2016 Mellanox Technologies 13
Silicon – Port Architecture Example
[Figure: ASIC port architecture. The basic port unit is a group of four 25Gig SerDes lanes; each unit can be configured as 1x 100 GigE, 2x 50 GigE, 1x 40 GigE, 4x 25Gig or 4x 10Gig ports, or mixed lower-speed combinations. The ASIC aggregates many such units into n x 100Gig of total port capacity.]
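One way to read the figure is to enumerate how a single 4-lane basic port unit can be carved up (an illustrative enumeration, not the authoritative list of Spectrum port profiles):

```python
# Configurations of one basic port unit (4 x 25G-capable SerDes lanes),
# as sketched on the slide: (number of ports, speed per port in GbE).
PORT_UNIT_OPTIONS = [
    (1, 100),   # 1 x 100GbE  (4 lanes at 25G)
    (2, 50),    # 2 x 50GbE   (2 lanes at 25G each)
    (1, 40),    # 1 x 40GbE   (4 lanes at 10G)
    (4, 25),    # 4 x 25GbE   (1 lane at 25G each)
    (4, 10),    # 4 x 10GbE   (1 lane at 10G each)
]

for ports, speed in PORT_UNIT_OPTIONS:
    print(f"{ports} x {speed}GbE  (aggregate {ports * speed}G)")
```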
© 2016 Mellanox Technologies 14
Silicon - Many Port Configuration Options for n*25GbE
• SFP+: 10GbE (1x10Gb/s)
• SFP28: 25GbE (1x25Gb/s)
• QSFP: 40GbE (4x10Gb/s)
• QSFP28: 50GbE (2x25Gb/s)
• QSFP28: 100GbE (4x25Gb/s)
135 watts
© 2016 Mellanox Technologies 15
Silicon and Buffering
How to Handle Different Port Speeds
Ports at the same speed:
• Cut-through – fast and efficient
• Store & forward – slow
Fast port to slow port:
• Cut-through – immediate transmission, risk of “oversubscription”
• Store & forward – safe, but buffer-intensive
Slow port to fast port:
• Cut-through – needs intelligent store & forward to bridge the buffer “gap” between speeds
• Store & forward – very buffer-intensive
Buffers should be…
… dynamically allocated
… flexibly reserved
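The latency difference between the two modes comes down to whether the switch waits for the full frame before transmitting. A minimal sketch of the serialization math (the numbers only illustrate the store & forward penalty; the 300ns cut-through figure on the next slide is the switch's quoted latency):

```python
def store_and_forward_delay_us(frame_bytes, link_gbps):
    """Extra delay from receiving the whole frame before forwarding it (microseconds)."""
    return frame_bytes * 8 / (link_gbps * 1000)

for frame in (64, 1500, 9000):                    # min, standard, jumbo frames
    for speed in (10, 25, 100):
        d = store_and_forward_delay_us(frame, speed)
        print(f"{frame:5d}B at {speed:3d}G: +{d:6.2f} us per store-and-forward hop")
# e.g. a 9000B jumbo frame adds ~7.2 us per hop at 10G and ~0.72 us at 100G,
# while cut-through forwarding starts transmitting once the header is parsed.
```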
© 2016 Mellanox Technologies 16
Enabling the Most Efficient Storage and Data Analytics Systems
Forwarding behaviour by scenario:
| Scenario | Spectrum | Competition |
| Ports at same speed | Cut-through | Store & forward |
| Fast port to slow port | Cut-through | Store & forward |
| Slow port to fast port | Intelligent S&F* | Store & forward |
* Buffers the minimum possible amount of packet data
Performance
• Zero packet loss
• 300ns cut-through latency
• Non-blocking 25/50/100GbE
Dynamic buffer
• Dynamically allocated
• Flexible buffer reservation
© 2016 Mellanox Technologies 17
Bandwidth Factor
Transceiver / Cabling
© 2016 Mellanox Technologies 18
Transceiver – Some Numbers
What comes next?
50G over a single lane
• IEEE 802.3 “50 Gb/s Ethernet Over a Single Lane and Next Generation 100 Gb/s and 200 Gb/s Ethernet” Study Group
IEEE P802.3by 25 Gb/s Ethernet Task Force
• Standard still to come
IEEE P802.3bs 400 GbE Task Force
• The adopted timeline targets a standard for 2018(?)
© 2016 Mellanox Technologies 19
Transceiver - The Evolution between 10Gb, 25Gb, 40Gb, 50Gb & 100Gb
IEEE 802.3bm
© 2016 Mellanox Technologies 20
Transceiver - Pluggable Module Standards
• CFP: The CFP MSA defines hot-pluggable optical transceiver form factors for 40 Gbit/s and 100 Gbit/s applications. CFP modules use the 10-lane CAUI-10 electrical interface.
• CFP2: CFP2 modules use the 10-lane CAUI-10 electrical interface or the 4-lane CAUI-4 electrical interface.
• CFP4: CFP4 modules use the 4-lane CAUI-4 electrical interface.
• QSFP28: QSFP28 modules use the 4-lane CAUI-4 electrical interface.
• CPAK: Cisco's CPAK optical module uses the 4-lane CEI-28G-VSR electrical interface.
• CXP: There are also CXP and HD module standards. CXP modules use the CAUI-10 electrical interface.
© 2016 Mellanox Technologies 21
Cables - Optical Connector Types for Parallel and Single-Fiber Infrastructures
• MPO (also called MTP or MPO/MTP): 12-fiber optical connector with 4 transmit and 4 receive fibers (4 unused fibers in the middle). Used for parallel links over multi-mode (SR4) or single-mode (PSM4) fiber.
• Duplex LC: 2-fiber optical connector, typically for single-mode (LR4).
© 2016 Mellanox Technologies 22
Cables - Solutions for Data Center Applications
[Figure: Data center fabrics – data rate per lane (Gb/s) versus link length (m) for direct attach copper, multi-mode fiber (OM3/OM4, VCSELs) and single-mode fiber (silicon photonics).]
Direct Attach Copper (DAC)
• Zero power
• Demonstrated 8m at 100G
• Best fit up to 3m
Active Optical Cables (AOC)
• VCSELs or SiP
• Reaches up to 200m
• Best fit for 5-20m
VCSEL Transceivers
• Reach up to 100m
• Best fit for MMF
SiP Transceivers
• Reach up to 2km
• Best fit for SMF
• Parallel or WDM
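The reach figures above suggest a simple selection rule of thumb; a hedged sketch (the cutoffs follow the slide's "best fit" ranges, not a formal specification):

```python
def suggest_interconnect(link_length_m):
    """Very rough cable/transceiver choice by reach, following the slide's
    'best fit' ranges. Real designs also weigh cost, power and fiber plant."""
    if link_length_m <= 3:
        return "DAC (passive copper, zero power)"
    if link_length_m <= 20:
        return "AOC (VCSEL or SiP active optical cable)"
    if link_length_m <= 100:
        return "VCSEL transceiver over multi-mode fiber (e.g. SR4)"
    if link_length_m <= 2000:
        return "SiP transceiver over single-mode fiber (parallel PSM4 or WDM)"
    return "beyond 2 km: long-reach single-mode optics"

for d in (2, 10, 80, 500):
    print(f"{d:4d} m -> {suggest_interconnect(d)}")
```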
© 2016 Mellanox Technologies 23
Cables/Transceivers - 100GbE Products
• 100G SR4 (Ethernet-only transceiver): for lowest-cost optical 100G switch-to-switch links and for breakouts to 25G/50G servers and storage using breakout fibers.
• 100G Copper DAC (InfiniBand & Ethernet): lowest-cost 100G-to-quad-25/50G breakout cables; for linking servers and storage to ToR switches and NICs.
• 100G AOC (InfiniBand & Ethernet): for low-cost 100G links up to 100m.
© 2016 Mellanox Technologies 24
Bandwidth and Resource Optimization
RDMA / RoCE
© 2016 Mellanox Technologies 25
Convergence: Eliminates the Dedicated Storage Network
Single interconnect for compute, networking and storage over converged fabrics (56Gb/s InfiniBand, 10/40Gb/s Ethernet), with traffic classes such as:
• Storage – prio 1
• Management – prio 2
• vMotion – prio 3
• Networking – prio 4
Web 2.0, public & private clouds are converging on fast RDMA interconnects.
RDMA: InfiniBand & Ethernet (RoCE*)
There is no Fibre Channel in the cloud!
* RoCE: RDMA over Converged Ethernet
© 2016 Mellanox Technologies 26
Solving the Storage (Synchronous) IOPs Bottleneck
Synchronous (back-to-back) I/O:
• Mechanical disks (~6 msec): 180 IOPs
• Software disk with SSDs (~0.5 msec): 3,000 IOPs
• With fast network (~0.2 msec): 4,300 IOPs
• With RDMA (~0.05 msec): 20,000 IOPs
• With full OS bypass & NV-DIMM/cache (~0.007 msec): >100,000 IOPs
[Figure: latency breakdown per scenario – disk/SSD access, software stack and network components, ranging from 6000 usec down to a few usec.]
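For a synchronous, back-to-back workload the IOPs figure is essentially the reciprocal of the end-to-end latency. A small sketch using the approximate latencies above (the slide's IOPs values differ slightly because they are built from per-component breakdowns):

```python
# Approximate end-to-end latencies from the slide (seconds) for one
# synchronous (back-to-back) I/O, and the resulting IOPs = 1 / latency.
scenarios = {
    "Mechanical disks":               6e-3,      # ~6 ms
    "Software disk with SSDs":        0.5e-3,    # ~0.5 ms
    "With fast network":              0.2e-3,    # ~0.2 ms
    "With RDMA":                      0.05e-3,   # ~0.05 ms
    "Full OS bypass + NV-DIMM/cache": 0.007e-3,  # ~0.007 ms
}

for name, latency in scenarios.items():
    print(f"{name:32s}: ~{1 / latency:>9,.0f} IOPs")
# ~167 / 2,000 / 5,000 / 20,000 / ~143,000 IOPs -- in the same ballpark as the
# slide's 180 / 3,000 / 4,300 / 20,000 / >100,000.
```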
© 2016 Mellanox Technologies 27
Remote Direct Memory Access (RDMA)
Remote Direct Memory Access over Converged Ethernet (RoCE)
What is RDMA?
• Direct memory access from the memory of one computer to that of another without involving either one's operating system. This permits high-throughput, low-latency networking, bypassing the OS and freeing the processor for other tasks.
IBTA-specified
• Zero-copy, CPU-bypass technology for data transfer
• Supported over standard interconnect protocols
• Allows applications to transfer data directly to the buffer of a remote application
• Provides extremely low-latency data transfers
Standard RDMA protocols
• InfiniBand – up to 100Gb/s (EDR)
• RDMA over Converged Ethernet (RoCE) – up to 100Gb/s
Supports diverse storage protocols
© 2016 Mellanox Technologies 28
I/O Offload Frees Up CPU for Application Processing
Without RDMA: ~53% CPU efficiency, ~47% CPU overhead/idle.
With RDMA and offload: ~88% CPU efficiency, ~12% CPU overhead/idle.
[Figure: CPU time split between user space and system space for both cases.]
© 2016 Mellanox Technologies 29
RDMA – How it Works
RDMA over InfiniBand or Ethernet
[Figure: data path between Application 1 in Rack 1 and Application 2 in Rack 2. With TCP/IP, data is copied through OS and NIC buffers in kernel space; with RDMA, the HCAs move the application buffers directly between user-space memories in hardware, bypassing the kernel.]
© 2016 Mellanox Technologies 30
Congestion Control – The Need
A source (or sources) is pushing more traffic than the network can handle
• Usually due to a bandwidth bottleneck on a congested link
• Can arise from other causes as well
• The situation lasts for a relatively long time
Buffers fill up, latency climbs
Lossless vs. lossy network
• Lossy
- Drops packets when the buffer is full
- Requires a drop-indication mechanism – timeouts, NACKs, etc.
- Poor goodput; latency ~ buffer size / bandwidth
• Lossless
- Stops the previous hop when the buffer is full; no packets are dropped
- Goodput = throughput on the congested link – no wasted effort
- But: congestion spreading and victim flows with long-lived congestion
© 2016 Mellanox Technologies 31
The Challenge – Lossless Traffic over a Lossy Network
RDMA over Converged Ethernet (RoCE) and Routable RoCE require a lossless medium
• The application assumes lossless media
To provide a lossless network, a few mechanisms can be used:
• Global Pause
- IEEE 802.3x standard
• Priority Flow Control (PFC)
- IEEE 802.1Qbb standard
• DSCP-based PFC
- Not a standard, but becoming more and more popular
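To make the PFC idea concrete, here is a toy model of a per-priority ingress buffer with XOFF/XON watermarks (purely illustrative; real PFC is implemented in switch and NIC hardware per IEEE 802.1Qbb, not in application code):

```python
class PriorityIngressBuffer:
    """Toy per-priority ingress buffer with PFC-style XOFF/XON watermarks."""

    def __init__(self, capacity, xoff, xon):
        assert xon < xoff <= capacity
        self.capacity, self.xoff, self.xon = capacity, xoff, xon
        self.fill = 0
        self.pause_sent = False      # True = previous hop has been asked to pause

    def on_receive(self, cells):
        """Lossless behaviour: accept what fits, pause upstream near the limit."""
        accepted = min(cells, self.capacity - self.fill)
        self.fill += accepted
        if not self.pause_sent and self.fill >= self.xoff:
            self.pause_sent = True   # emit a PFC pause frame for this priority
        return accepted              # excess waits upstream instead of being dropped

    def on_transmit(self, cells):
        """Drain the buffer; resume the previous hop below the XON mark."""
        self.fill = max(0, self.fill - cells)
        if self.pause_sent and self.fill <= self.xon:
            self.pause_sent = False  # emit a PFC resume (pause time = 0)
```

Broadly speaking, Global Pause behaves the same way but for the whole port rather than per priority, and DSCP-based PFC only changes how the priority is identified (from the IP DSCP field rather than the VLAN PCP bits).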
© 2016 Mellanox Technologies 32
Need for a Routable RoCE – RoCEv2
RoCEv1 operates within one L2 network
• Many cloud applications span multiple L2 domains
• Some customers use L3 across the datacenter and want RDMA across racks or IP subnets in an L3 datacenter
• L2 ToR switch for intra-rack communication
• L3 (IP) router for inter-rack communication
Need for an L3-routable RDMA protocol
• RoCEv2 meets this need
• IBTA collaboration defined Routable RoCE
• Small change – transparent to applications and networks
Approved by IBTA, announced Sept 16th, 2014
• Mellanox ConnectX-3 Pro supports RoCEv2 today
• ConnectX-4 supports RoCEv1/RoCEv2
• Drivers already released for Linux & Windows
© 2016 Mellanox Technologies 33
RoCE Is an Open Standard and Routable
IBTA collaboration on RoCE
• Steering committee: Cray, Emulex, HP, IBM, Intel, Mellanox, Microsoft, Oracle
• RoCE specification first released in 2010
• Most widely deployed Ethernet RDMA standard
• Routable since September 2014 (RoCEv2 specification)
Standardization paves the way for multi-vendor interoperable solutions
[Figure: InfiniBand, RoCEv1 and RoCEv2 protocol stacks]
© 2016 Mellanox Technologies 34
Possible Scenarios
© 2016 Mellanox Technologies 35
Where Interconnects Are Being Used in the Data Center
• DAC (“DAC in the rack”, up to 3m): server-to-ToR and ToR-to-ToR links; 25G SFP, quad 25G SFP breakouts and dual 50G breakouts.
• AOC (3-50m): ToR to leaf/spine links; 25G SFP, quad 25G SFP breakouts and dual 50G breakouts.
• Multi-mode optics (SR4, 3m-100m): for structured cabling, short reaches; 8-fiber MPO via optical patch panel.
• Single-mode optics (PSM4, WDM4, up to 2km): for structured cabling, long reaches; 2-fiber LC (WDM4) or MPO (PSM4).
© 2016 Mellanox Technologies 36
Small/Medium Cloud Deployment – 10GbE Endpoints
• Pure L2 network; full HA and no SPoF
• 2x 10/25G ToRs per rack in mLAG; (48 + 48)x 10/25GbE active HA host ports per rack
• 2x 100G spines; 2x 10/40GbE uplink to the WAN/access router
• Phase 1: start with as little as 1 rack and 2x ToRs
• Phase 2: add 2x spines (32x100G) and build up to 15x racks in a pure L2 domain – a full L2 solution with 1440x 10GbE ports
• ToR-to-spine uplinks use 50GbE to ensure a (2+2)x link bundle per ToR and mitigate cable failure – 400G uplink vs. 960G host-facing bandwidth per rack
• (4+4)x 100/40GbE or (8+8)x 50GbE ports available per rack for high-performance/“fat” storage nodes
• 48+48x 10GbE for compute/hyper-converged infrastructure
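The per-rack figures can be sanity-checked with a little arithmetic (a sketch under the slide's assumptions: two ToRs per rack, 48+48 host-facing 10GbE ports, and a (2+2)x 50GbE uplink bundle per ToR):

```python
# Per-rack bandwidth for the 10GbE-endpoint design (assumptions from the slide).
tors_per_rack      = 2
host_ports_per_tor = 48          # 10GbE each, 48 + 48 across the ToR pair
host_port_gbps     = 10
uplinks_per_tor    = 4           # (2+2)x 50GbE bundle per ToR
uplink_gbps        = 50

host_facing = tors_per_rack * host_ports_per_tor * host_port_gbps   # 960G
uplink      = tors_per_rack * uplinks_per_tor * uplink_gbps         # 400G

print(f"Host-facing per rack: {host_facing}G")               # matches the slide's 960G
print(f"Uplink per rack:      {uplink}G")                    # matches the slide's 400G
print(f"Oversubscription:     {host_facing / uplink:.1f}:1")  # ~2.4:1
```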
© 2016 Mellanox Technologies 37
Small/Medium Cloud Deployment – 25GbE Endpoints
• Pure L2 network; full HA and no SPoF – ideal for a small/medium private cloud
• 2x 25G ToRs per rack in mLAG; (48 + 48)x 10/25GbE active HA host ports per rack
• 2x 100G spines; 2x 10/40GbE uplink to the WAN/access router
• Phase 1: start with as little as 1 rack and 2x ToRs
• Phase 2: add 2x spines (32x100G) and build up to 7x racks in a pure L2 domain – a full L2 solution with 672x 25GbE ports
• mLAG on ToRs and spines for full active-active HA
• Per rack: 2400G host-facing, 800G uplink bandwidth
• (2+2)x 100GbE or (4+4)x 50GbE ports available per rack for high-performance/“fat” storage nodes
• 48+48x 25GbE for compute/hyper-converged infrastructure
© 2016 Mellanox Technologies 38
1024-port 1:1 100GbE, 2048-port 1:1 50GbE
• 32x spines, 64x leafs
• Each leaf: 16x 100GbE host-facing ports and 32x 50GbE uplinks
• All leaf-spine links are 50GbE
• Result: a 1024x 100GbE 1:1 non-blocking network
• The same concept can be used for a 2048-port 50GbE 1:1 network
• Can be used as the spine for 3-level networks
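The non-blocking claim follows from the leaf port budget. A sketch of the arithmetic (assuming 32-port 100GbE leaf and spine switches, with each leaf splitting half of its ports into 50GbE uplinks as described above):

```python
# Two-tier 1:1 fabric built from 32x100GbE switches (assumptions per the slide).
spines          = 32
leafs           = 64
leaf_down_ports = 16            # 16 x 100GbE host-facing ports per leaf
leaf_up_ports   = 32            # 32 x 50GbE uplinks (16 x 100GbE ports split in two)

down_bw_per_leaf = leaf_down_ports * 100        # 1600G toward hosts
up_bw_per_leaf   = leaf_up_ports * 50           # 1600G toward spines

print(f"Host ports total : {leafs * leaf_down_ports} x 100GbE")            # 1024
print(f"Per-leaf ratio   : {down_bw_per_leaf}:{up_bw_per_leaf} (1:1)")     # non-blocking
print(f"Spine ports used : {leafs * leaf_up_ports // spines} x 50GbE per spine")  # 64
```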
© 2016 Mellanox Technologies 39
How to Split-Connect 50GbE with Standard Optics
• Leaf-spine links are all 50GbE
• Constructed by splitting each 100GbE port into 2x 50GbE
• Use SR4 or PSM4 100GbE optics on each port
• Use standard MPO-to-4x LC-LC splitter cables (MM for SR4 and SM for PSM4)
• Lanes 1-2 form port 1, lanes 3-4 form port 2
• Cables are connected using standard LC-LC passive couplers
[Figure: Spine-1/Spine-2 ports 1,2 connected to Leaf-1/Leaf-2 ports 1,2 via MPO to 4x LC-LC splitter cables]
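The lane-to-port mapping described above can be written out explicitly (a sketch; the lane numbering follows the slide's convention and should be checked against the optics documentation before use):

```python
# Splitting one 100GbE SR4/PSM4 port (4 optical lanes) into 2 x 50GbE,
# following the slide: lanes 1-2 form logical port 1, lanes 3-4 form port 2.
def split_100g_port(port_name):
    return {
        f"{port_name}/1": ("lane1", "lane2"),   # 2 x 25G lanes -> 50GbE
        f"{port_name}/2": ("lane3", "lane4"),
    }

# Example: one spine port feeding two different leaves via the MPO-to-4xLC
# splitter cable and passive LC-LC couplers.
print(split_100g_port("Spine-1:1"))
# {'Spine-1:1/1': ('lane1', 'lane2'), 'Spine-1:1/2': ('lane3', 'lane4')}
```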
Thank You
© 2016 Mellanox Technologies 41
References
www.ieee802.org/3/ad_hoc/bwa/BWA_Report.pdf
http://www.ieee802.org/3/50G/public/adhoc/
http://www.ethernetalliance.org/wp-content/uploads/2013/04/Ethernet-Alliance-Technology-Roadmap-FINAL.pdf
https://en.wikipedia.org/wiki/Terabit_Ethernet
http://www.ieee802.org/3/bs/
http://www.open-ethernet.com/
http://25gethernet.org/
https://community.mellanox.com/docs/DOC-1451
http://www.mellanox.com/ethernet/