Introduction to the Cluster Infrastructure and the Systems Provisioning Engineering teams
Cluster Infrastructure & System Provisioning Engineering
Angelo Failla
Production Engineer – ClusterInfra Dublin
supporting rapid infrastructure and user growth
What do we do?
Efficiently bring up new capacity and manage the health of core services required to operate our infra.
Cluster Infrastructure Team Responsibilities
• DNS infrastructure
• NTP infrastructure
• Provisioning infrastructure (DHCP, TFTP, Grub2, etc…)
• Cluster/DC level automation
System Provisioning Engineering Team Responsibilities
• Cyborg
• Built on top of provisioning infra
• Orchestrates server / TOR provisioning
• Image parameters tool
• Repair ticketing system
• Hardware checking systems
(some of the) challenges
The number of machines
PROVISIONING: IT’S HANDS FREE
The number of variables is too high
https://www.flickr.com/photos/curveto/2698598542/ - CC-BY-2.0
Let’s talk about TFTP…
TFTP: D.O.B. 1981
Angelo: D.O.B. 1981
POP TFTP: Asia -> Oregon
Latency: 150ms
RRQ: 150ms
ACK: 150ms
GET DATABLOCK 0: 150ms
DATABLOCK 0 PAYLOAD: 150ms
GET DATABLOCK N: 150ms
DATABLOCK N PAYLOAD: 150ms
File size   Block size       Latency   Time to download
80 MB       512 B            150ms     12.5 hours
80 MB       1400 B           150ms     4.5 hours
80 MB       512 B / 1400 B   1ms       <1 minute
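The table follows from TFTP's lock-step design: each data block has to be acknowledged before the next one is sent, so every block costs a full round trip. A minimal sketch of that arithmetic is shown here. The function name is hypothetical, and the model makes assumptions the slides do not state: "80 MB" is taken as 80 * 10^6 bytes, the 150ms figure is treated as one-way latency, and bandwidth is ignored. Under those assumptions the first two rows come out at roughly 13 hours and 4.8 hours, in the same range as the 12.5 and 4.5 hours above. The slides do not say how their figures were derived, so exact agreement should not be expected.

```python
import math


def lockstep_transfer_seconds(file_bytes: int, block_bytes: int,
                              one_way_latency_s: float) -> float:
    """Estimate a lock-step transfer: one round trip per data block.

    Assumption: bandwidth and the initial RRQ exchange are ignored,
    so only latency contributes to the total.
    """
    blocks = math.ceil(file_bytes / block_bytes)
    round_trip_s = 2 * one_way_latency_s
    return blocks * round_trip_s


if __name__ == "__main__":
    file_bytes = 80 * 10**6  # assumes "80 MB" means 80 * 10^6 bytes
    for block_bytes, latency_s in [(512, 0.150), (1400, 0.150), (1400, 0.001)]:
        seconds = lockstep_transfer_seconds(file_bytes, block_bytes, latency_s)
        print(f"block={block_bytes} B, latency={latency_s * 1000:.0f} ms: "
              f"{seconds / 3600:.2f} h")
```

The point of the sketch is that the block count multiplied by the round-trip time dominates everything else: a larger block size helps linearly, while moving the server closer (lower latency) helps far more.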
POP TFTP: Asia -> Oregon
Solutions
Solution 1: let’s use iPXE, as it talks TCP/HTTP!
 - It had a 10-minute watchdog (which we had to patch)
 - After the patch it was still taking > 10 minutes
Solution 2: put an fbtftp server in every POP
 - Our own homemade TFTP server
 - Have it stream files from HTTP
 - Cache files locally
 - A couple of minutes to download initrd/kernel
Solution 3 (currently investigating): use Grub2 and download initrd/kernel via HTTP
 - Configurable TCP window size, patch sent upstream
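Why the TCP window size in Solution 3 matters can be sketched with the usual bandwidth-delay bound: a sender can have at most one window of unacknowledged data in flight per round trip, so throughput is capped at roughly window / RTT. The example below is only an illustration. The function name, the 300 ms round trip (twice the 150ms shown earlier) and the 64 KiB and 1 MiB window sizes are assumptions, not values from the slides, and TCP slow start is ignored.

```python
def window_limited_seconds(file_bytes: int, window_bytes: int,
                           rtt_s: float) -> float:
    """Lower bound on download time when throughput is capped at window / RTT.

    Assumption: steady state only; connection setup and slow start are ignored.
    """
    max_throughput_bytes_per_s = window_bytes / rtt_s
    return file_bytes / max_throughput_bytes_per_s


if __name__ == "__main__":
    file_bytes = 80 * 10**6   # assumes "80 MB" means 80 * 10^6 bytes
    rtt_s = 0.300             # assumed round trip: 2 x 150 ms
    for window_bytes in (64 * 1024, 1024 * 1024):  # illustrative window sizes
        seconds = window_limited_seconds(file_bytes, window_bytes, rtt_s)
        print(f"window={window_bytes} B: at least {seconds:.0f} s")
```

With these assumed numbers a 64 KiB window needs about six minutes for the file, while a 1 MiB window needs well under a minute, which is why a fixed small window can keep an HTTP download slow on a high-latency link.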
Vendors tell you they are IPv6 compliant, but are they really?
Bring up/down clusters as fast as possible
Come talk to us at our poster sessions!