TRANSCRIPT
Modern & Scalable Network Designs for Hadoop Clusters
(files.meetup.com/12607092/nyc-hadoop-meetup-sm.pdf)
Benoît “tsuna” Sigoure
Member of the Yak Shaving Staff
[email protected] · @tsunanet
strange change by Robert Couse-Baker
A Brief History of Network Software
1970–1990: custom monolithic embedded OS
2000–2010: modified BSD, QNX, or Linux kernel base
“The switch is just a Linux box”
EOS = Linux
Custom ASIC
Inside a Modern Switch (Arista 7050Q-16)
2 PSUs, 4 fans
x86 CPU, DDR3 RAM, USB flash, SATA SSD
Network chip(s)
It’s All About The Software
Command API · Zero Touch Replacement · VmTracer · CloudVision · Email · Wireshark · Fast Failover · DANZ · Fabric Monitoring · OpenTSDB · #!/bin/bash · C++
$ sudo yum install <pkg>
Command API: or a CLI over JSON-RPC
Traditional approach: “screen scraping”

import pexpect, re

connector = pexpect.spawn("ssh [email protected]")
connector.expect(".ssword:*")
connector.sendline(password)
index = connector.expect([">", "#"])
if index == 0:
    connector.sendline("enable")
    index = connector.expect(["assword", "#"])
    if index == 0:
        connector.sendline(enable)
        connector.expect("#")
connector.sendline("show version")
connector.expect("#")
output = connector.before
switches = re.findall(r"^(\*?)\s+(\d)\s+\d+\s+([A-Z\d-]+)", output, re.MULTILINE)
for master, num, model in switches:
    ...
Compare to: a standard HTTP API

from jsonrpclib import Server

switch = Server("http://admin:" + passwrd + "@1.2.3.4/command-api")

response = switch.runCmds(1, ["show privilege"])
if response[0]["privilegeLevel"] != 15:
    switch.runCmds(1, [{"cmd": "enable", "input": enable}])

response = switch.runCmds(1, ["show version"])
for num, switch in enumerate(response[0]["stack"]):
    # use switch["model"] and switch["master"]
    ...
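Under the hood this is plain JSON-RPC over HTTP, so any language can drive the switch without the jsonrpclib helper. A minimal sketch using only the Python standard library; the envelope mirrors the runCmds call above, but treat the exact endpoint details as assumptions, not a definitive spec:

```python
import json
import urllib.request

def build_request(cmds, version=1):
    """Build the JSON-RPC envelope for a 'runCmds' call."""
    return {
        "jsonrpc": "2.0",
        "method": "runCmds",
        "params": {"version": version, "cmds": cmds, "format": "json"},
        "id": "1",
    }

def run_cmds(url, cmds):
    """POST the request to a command-api endpoint and return the results."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_request(cmds)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]

# Usage (hypothetical host and credentials):
# run_cmds("http://admin:pw@1.2.3.4/command-api", ["show version"])
```

Because the transport is ordinary HTTP with structured JSON responses, there is no prompt matching and no regex scraping: a change in CLI output formatting cannot silently break the client.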
Integration
[Diagram: Neutron → Arista Driver → EAPI → datacenter leaf layer (TOR1, TOR2, TOR3); Red and Blue VLANs spanning VM1–VM4 across Hosts 1–6]
Spine-Leaf Architecture
Red Venation by Swamibu
[Diagram: Rack-1 through Rack-5, each with Server 1 … Server N and 2x40Gbps uplinks]
5 to 7 racks, 7 to 9 switches
Scale: 240 to 336 nodes
Oversubscription ratio 1:3
[Diagram: Rack-1 through Rack-56, each with Server 1 … Server N and 8x10Gbps uplinks]
56 racks, 8 core switches
Scale: 3136 nodes
Oversubscription ratio 1:3
[Diagram: Rack-1 through Rack-1152, each with Server 1 … Server N and 16x10Gbps uplinks]
1152 racks, 16 core modular switches
Scale: 55296 nodes
Oversubscription ratio still 1:3
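The oversubscription ratio quoted on these slides is simply host-facing bandwidth divided by uplink bandwidth at the leaf switch. A tiny helper makes the arithmetic explicit (the function and its example inputs are illustrative, not from the slides):

```python
def oversubscription(hosts, host_gbps, uplinks, uplink_gbps):
    """Ratio of host-facing bandwidth to uplink bandwidth on a leaf switch."""
    return (hosts * host_gbps) / (uplinks * uplink_gbps)

# e.g. 12 hosts at 10G behind 2x40G uplinks: 120G down / 80G up = 1.5:1
ratio = oversubscription(12, 10, 2, 40)
```

Keeping this ratio constant as the fabric grows is the point of the spine-leaf design: adding racks adds uplinks in proportion, so the 1:3 figure survives from 5 racks to 1152.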
Build Layer 3 Networks
3 layers of pink Cake by Catherine Scott
Layer 2 vs Layer 3
• Layer 2: “data link layer”, an Ethernet “bridge”
• Layer 3: “network layer”, an IP “router”
[Diagram, Layer 2: hosts A–H; redundant uplinks are blocked by STP, unless unblocked “with vendor magic”]
[Diagram, Layer 3: hosts A–H; ECMP default route 0.0.0.0/0 via core1 and via core2, so traffic between A–D and E–H uses both cores]
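The reason ECMP can use both cores without reordering a TCP stream is flow hashing: the router hashes each packet's 5-tuple and picks a next hop from the equal-cost set, so every packet of one flow takes the same path while different flows spread across paths. A simplified sketch of the idea (real ASICs use their own hash functions; the names here are illustrative):

```python
import hashlib

# Equal-cost next hops for the default route, as in the diagram.
NEXT_HOPS = ["core1", "core2"]

def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port):
    """Pick a next hop by hashing the flow's 5-tuple (deterministic per flow)."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return NEXT_HOPS[digest[0] % len(NEXT_HOPS)]
```

Contrast this with Layer 2, where STP must block redundant uplinks entirely to avoid forwarding loops, leaving half the bandwidth idle.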
Pipes by the autowitch
Hoover Dam by Christina Hsu
1G vs 10G
• 1Gbps is today what 100Mbps was yesterday
• New generation servers come with 10G LAN on board
• At 1Gbps, the network is one of the first bottlenecks
• A single HBase client can very easily push over 4Gbps to HBase
[Chart: TeraGen duration in seconds, 1Gbps vs 10Gbps; 10Gbps is 3.5x faster]
Jumbo Frames
Elephant Framed! by Dhinakaran Gajavarathan
Jumbo Frames
[Charts comparing MTU=1500 vs MTU=9212 during a TeraGen run: 11% fewer context switches (millions), 15% less soft IRQ CPU time (thousand jiffies), and 3x fewer packets (millions)]
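Getting those savings requires raising the MTU on every hop in the path, hosts and switch ports alike. A sketch of the host side on Linux; the interface name and peer address are assumptions for illustration:

```shell
# Raise the MTU on the host NIC (switch ports in the path must match or exceed it)
ip link set dev eth0 mtu 9000

# Verify end-to-end: ping with fragmentation forbidden.
# 8972 = 9000 bytes MTU minus 20 (IP header) minus 8 (ICMP header).
ping -M do -s 8972 -c 1 10.4.5.254
```

If the ping fails with "message too long", some device in the path is still at a smaller MTU; mismatched MTUs cause silent blackholing, so verify before rolling out.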
«Hardware» buffers by Roger Ferrer Ibáñez
Deep Buffers
[Chart: packet buffer per 10G port (MB): industry average 1.5, Arista 7500 48, Arista 7500E 128]
[Chart: packets dropped per TeraGen (scale 0 to 1,000,000; ~1k TCP slow starts/sec): with a 1MB buffer at 4:1 and at 5.33:1 oversubscription, drops are massive; with a 48MB buffer at 5.33:1: zero!]
[Test setup: two groups of 16 hosts x 10G; 4x10G uplinks ⇒ 4:1 oversubscription, 3x10G ⇒ 5.33:1]
RAIL: Making the Network Work Better With HBase & Hadoop
Host Failures in Hadoop Clusters
[Diagram: Rack-1 through Rack-N, each with Server 1 … Server N]
• Hardware failures
• Kernel panics
• Operator errors
• NIC driver bugs
Host Failures for HBase & Hadoop
• RegionServer: wait for the ZooKeeper lease timeout. Typical: ~30s
• DataNode: wait for heartbeats to time out enough for the NameNode to declare it dead. Typical: ~10min (or ~30s with dfs.namenode.check.stale.datanode, see HDFS-3703)
How Does This Work?
• Redirect traffic to the switch
• Have the switch send TCP resets
• Immediately kills all active and new flows
[Diagram: hosts 10.4.5.1–10.4.5.3 behind gateway 10.4.5.254, hosts 10.4.6.1–10.4.6.3 behind 10.4.6.254; when 10.4.5.2 fails, traffic for 10.4.5.2 is redirected to the switch]
EOS = Linux
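The win here is the difference between an active reset and a silent black hole: a TCP RST makes the client fail instantly, while packets sent to a dead host simply vanish until an application-level timeout fires. A minimal illustration of the client-side effect (localhost and the port number are arbitrary assumptions, not part of RAIL):

```python
import socket

# Connecting to a closed port on a live host elicits a TCP RST:
# the client learns of the failure immediately.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.connect(("127.0.0.1", 1))  # port 1 assumed closed
except ConnectionRefusedError:
    print("RST received: failure detected instantly")
finally:
    s.close()

# Contrast: packets to a dead/unreachable host get no reply at all,
# so connect() blocks until its timeout (often tens of seconds) expires.
```

By answering on behalf of the failed host with RSTs, the switch turns the slow "wait for ZooKeeper/NameNode timeouts" path above into this fast-failure path.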
MapReduce Tracer
• Keep track of nodes in the Hadoop cluster
• Poll statistics about Hadoop activity:
  ‣ What jobs are running, when, where
  ‣ How “big” each job is, network-wise
  ‣ Historical data of MapReduce activity
  ‣ MapReduce vs HDFS traffic accounting
• CLI integration for real-time visibility
• Multi-cluster support
[Diagram: Rack-1 (Servers 1–16) and Rack-2 (Servers 17–32) with 40G uplinks; NameNode/JobTracker on one host, DataNode/TaskTracker (DN/TT) on the others; MapReduce Tracer runs on both TORs]
• 2 racks, 16 hosts each
• L3 network (/27 per rack, ECMP)
• Both TORs use the following additional config:
    monitor hadoop
      cluster batch
        jobtracker host 172.19.33.1
        jobtracker user tsuna
        tasktracker http-port 10060
        no shutdown
• No change required to Hadoop
• No plugin, config change, or additional daemon to run
Hadoop Versions Supported
• Apache Hadoop: 0.20.203 to 1.2.1
• Cloudera’s distribution: CDH3 and CDH4
• HortonWorks’ distribution: HDP 1.1, 1.2, 1.3
• Microsoft HDInsight 0.10
• Incompatible: Apache 0.20.x and earlier; 0.21.x, 0.22.x, 0.23.x; 2.0.x
Thank You
Benoît “tsuna” SigoureMember of the Yak Shaving Staff
[email protected]@tsunanet
We’re hiring in SF, Santa Clara, Vancouver, and Bangalore