Inside the Atlassian OnDemand Private Cloud

Tuesday, July 10, 12

Post on 19-Oct-2014

Category:

Technology


DESCRIPTION

In order to launch Atlassian OnDemand, we needed to rethink the way we did infrastructure. Join Atlassian SaaS Platform Architect George Barnett as he discusses how we delivered a scalable platform that runs tens of thousands of JVMs, all while reducing the cost tenfold. This talk will cover design decisions, technology choices and the lessons learned during the build-out.

TRANSCRIPT

Page 1: Inside the Atlassian OnDemand Private Cloud

Tuesday, July 10, 12

Page 2: Inside the Atlassian OnDemand Private Cloud

George Barnett, SaaS Platform Architect

Inside the Atlassian OnDemand private cloud

Page 3: Inside the Atlassian OnDemand Private Cloud

In 2010 a team of engineers moved into our secret lair (above a pub) to re-imagine our hosted platform.

Page 4: Inside the Atlassian OnDemand Private Cloud

Launch - October 2011: 1000 VMs

6 months later: 13,500 VMs

Page 5: Inside the Atlassian OnDemand Private Cloud

We have a cloud. So what?

Page 6: Inside the Atlassian OnDemand Private Cloud

We also had a cloud... and...

• Poor performance
• Slow deployments
• VM sprawl
• Over-provisioning
• Low visibility into the full stack

Page 7: Inside the Atlassian OnDemand Private Cloud

Virtualisation often creates new challenges but does nothing about existing ones.

Page 12: Inside the Atlassian OnDemand Private Cloud

Focus

Page 13: Inside the Atlassian OnDemand Private Cloud

Be less flexible about what infrastructure you provide.

Page 14: Inside the Atlassian OnDemand Private Cloud

#summit12

“You can use any database you like, as long as it’s PostgreSQL 8.4.”

Page 15: Inside the Atlassian OnDemand Private Cloud

• Stop trying to be everything to everyone
  • (we have other clouds within Atlassian)
• Lower operational complexity
• Easier to provide a deeply integrated, well supported toolchain
• Small test surface matrix

Page 16: Inside the Atlassian OnDemand Private Cloud

Fail fast. Learn quickly.

Page 17: Inside the Atlassian OnDemand Private Cloud

Do as little as possible

deploy and use it

Page 18: Inside the Atlassian OnDemand Private Cloud

Block-1

A small-scale model of the initial proposed platform architecture: 4 desktop machines and a switch.

Purpose: Validate design, evaluate failure modes.

http://history.nasa.gov/Apollo204/blocks.html

Page 19: Inside the Atlassian OnDemand Private Cloud

Block-1

Creation of VMs over NFS is too resource- and time-intensive. (More on this later.)

Network boot assumptions validated.

Applications do not fall over.

Page 20: Inside the Atlassian OnDemand Private Cloud

Block-2

A large-scale model of the platform architecture.

Purpose: Validate hardware resource assumptions and compare CPU vendors.

http://history.nasa.gov/Apollo204/blocks.html

Page 21: Inside the Atlassian OnDemand Private Cloud

Block-2

Initial specs of compute hardware were too conservative. Decided to add 50% more RAM.

VM distribution and failover tools work.

Customers per GB of RAM metric validated.

Page 22: Inside the Atlassian OnDemand Private Cloud

Hardware

Page 23: Inside the Atlassian OnDemand Private Cloud

Challenge

Existing platform hardware was a poor fit for our workload.

Memory and IO were heavily constrained, but CPU was not.

Page 24: Inside the Atlassian OnDemand Private Cloud

Monitoring

We took 6 months’ worth of monitoring data from our existing platform. We used this data to determine the right mix of hardware.

Page 25: Inside the Atlassian OnDemand Private Cloud

• 10 x Compute nodes (144G RAM, 12 cores, NO disks)
• 3 x Storage nodes (24 disks)
• Each rack delivered fully assembled
  • Unwrap, provide power, networking
  • Connected to customers in ~2 hours
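To make the rack figures concrete, here is a small sketch that totals the compute capacity of one rack. The node counts, RAM and core figures come from the slide; the `Rack` class and its property names are only illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rack:
    """One rack as described on the slide (class and field names are illustrative)."""
    compute_nodes: int = 10      # from the slide
    ram_gb_per_node: int = 144   # from the slide
    cores_per_node: int = 12     # from the slide
    storage_nodes: int = 3       # from the slide

    @property
    def total_ram_gb(self) -> int:
        return self.compute_nodes * self.ram_gb_per_node

    @property
    def total_cores(self) -> int:
        return self.compute_nodes * self.cores_per_node

    @property
    def ram_gb_per_core(self) -> float:
        return self.ram_gb_per_node / self.cores_per_node


rack = Rack()
print(rack.total_ram_gb, rack.total_cores, rack.ram_gb_per_core)  # 1440 120 12.0
```

The high RAM-per-core ratio reflects the earlier observation that memory and IO were the constrained resources, not CPU.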

Page 26: Inside the Atlassian OnDemand Private Cloud

Advantage #1

Reliable.

Each machine goes through a 2-day burn-in before it goes into the rack.

Page 27: Inside the Atlassian OnDemand Private Cloud

Advantage #2

Neat.

Page 28: Inside the Atlassian OnDemand Private Cloud

Advantage #3

Consistent.

Page 29: Inside the Atlassian OnDemand Private Cloud

Advantage #4

Easy to deploy.

Page 30: Inside the Atlassian OnDemand Private Cloud

No disks.

Page 31: Inside the Atlassian OnDemand Private Cloud

Wait. What?

Page 32: Inside the Atlassian OnDemand Private Cloud

Challenge

Existing compute infrastructure used local disk for swap and hypervisor boot. Once we got the memory density right, it’s only boot.

Page 34: Inside the Atlassian OnDemand Private Cloud

• No disks in compute infrastructure
  • Avoid spinning 20 more disks per rack for a hypervisor OS
• Evaluated booting from:
  • USB drives (unreliable and slow!)
  • NFS (what if the network goes away?)
  • Custom binary initrd image + kernel

Page 35: Inside the Atlassian OnDemand Private Cloud

• Image is a ~170 MB gzipped filesystem
  • Download on boot, extract into RAM (~400 MB)
• No external dependencies after boot
• All compute nodes boot from the same image
  • Reboot to known state

Page 36: Inside the Atlassian OnDemand Private Cloud

Boot sequence (diagram): Compute Node and Netboot Server

• PXE: dhcp request, DHCP response
• TFTP: gpxe
• Etherboot: dhcp request, DHCP response
• HTTP: bootscript, kernel & boot image
• Boot
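The HTTP boot script step in the sequence above can be sketched as follows. This is a minimal illustration, not the script used in the talk: the server URL, file names and kernel arguments are hypothetical, and only the standard gPXE script commands (`kernel`, `initrd`, `boot`) are assumed.

```python
def gpxe_boot_script(base_url: str, kernel: str, image: str, kernel_args: str = "") -> str:
    """Render a gPXE script that fetches a kernel and a boot image over HTTP.

    All arguments are illustrative; the talk does not give real URLs or names.
    """
    lines = [
        "#!gpxe",                                             # marks the file as a gPXE script
        f"kernel {base_url}/{kernel} {kernel_args}".rstrip(),  # kernel plus optional arguments
        f"initrd {base_url}/{image}",                          # the gzipped filesystem image
        "boot",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical server and file names, for illustration only.
print(gpxe_boot_script("http://netboot.example/boot", "vmlinuz", "hypervisor.img.gz"))
```

Because every compute node receives the same script and image, a reboot always returns a node to the same known state.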

Page 37: Inside the Atlassian OnDemand Private Cloud

Sharp Edges.
• No swap == provision carefully
  • Not a problem if you automate provisioning
• Treat running hypervisor image like an appliance
  • Don’t change code - rebuild image and reboot
  • Doing this often? Too many services in the hypervisor
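With no swap, automated provisioning has to refuse placements that would overcommit RAM. The sketch below shows one possible admission check; the function name, the per-container limits and the reserve size are assumptions, and only the 144 GB node size comes from the slides.

```python
from typing import Sequence


def fits_without_swap(node_ram_mb: int, reserved_mb: int,
                      container_limits_mb: Sequence[int], new_container_mb: int) -> bool:
    """Return True if the new container fits in RAM alongside the existing ones.

    reserved_mb covers the in-RAM hypervisor image plus headroom (assumed value).
    """
    committed = sum(container_limits_mb) + new_container_mb
    return committed <= node_ram_mb - reserved_mb


NODE_RAM_MB = 144 * 1024   # 144 GB per compute node, from the slides
RESERVED_MB = 4096         # hypothetical reserve for the hypervisor image and headroom

print(fits_without_swap(NODE_RAM_MB, RESERVED_MB, [2048] * 60, 2048))  # True
print(fits_without_swap(NODE_RAM_MB, RESERVED_MB, [2048] * 70, 2048))  # False
```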

Page 38: Inside the Atlassian OnDemand Private Cloud

Software

Page 39: Inside the Atlassian OnDemand Private Cloud

Challenge

Virtualisation is often inefficient. There’s a memory and CPU penalty which is hard to avoid.

Page 40: Inside the Atlassian OnDemand Private Cloud

OpenVZ
• Linux containers
  • Basis for Parallels Virtuozzo Containers
  • LXC isn’t there yet
• No guest OS kernels
  • No performance hit
  • Better resource sharing

Page 41: Inside the Atlassian OnDemand Private Cloud

Performance

Page 42: Inside the Atlassian OnDemand Private Cloud

http://wiki.openvz.org/Performance/vConsolidate-SMP

Page 43: Inside the Atlassian OnDemand Private Cloud

http://wiki.openvz.org/Performance/LAMP

Page 44: Inside the Atlassian OnDemand Private Cloud

Resource de-duping

Page 45: Inside the Atlassian OnDemand Private Cloud

“Don’t load the same thing twice”

Page 46: Inside the Atlassian OnDemand Private Cloud

Challenge

Java VMs aren’t lightweight.

Page 47: Inside the Atlassian OnDemand Private Cloud

• Full virtualisation does a poor job at this
  • 50 VMs = 50 kernels + 50 caches + 50 shared libs!
• Memory de-dupe combats this, but burns CPU.
• Memory de-dupe works across all OSes
  • We don’t use Windows.
• By being less flexible, we can exploit Linux specific features.
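The "50 kernels + 50 caches + 50 shared libs" point is simple arithmetic, sketched here. The instance count of 50 is from the slide; the 200 MB of identical kernel, cache and library pages per instance is a made-up figure for illustration.

```python
def duplicated_mb(instances: int, common_mb: int) -> int:
    """Memory spent on identical pages when every VM loads its own copy."""
    return instances * common_mb


def saved_by_sharing_mb(instances: int, common_mb: int) -> int:
    """Memory saved when the common pages are loaded once and shared."""
    return (instances - 1) * common_mb


INSTANCES = 50     # from the slide
COMMON_MB = 200    # hypothetical size of kernel + cache + shared libs per instance

print(duplicated_mb(INSTANCES, COMMON_MB))        # 10000
print(saved_by_sharing_mb(INSTANCES, COMMON_MB))  # 9800
```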

Page 48: Inside the Atlassian OnDemand Private Cloud

OpenVZ containers all share the same kernel.

Page 49: Inside the Atlassian OnDemand Private Cloud

• Provide a single OS image to all - free benefits:
  • Shared libraries only load once.
  • OS is cached only once.
  • OS image is the same on every instance.

Page 50: Inside the Atlassian OnDemand Private Cloud

Challenge

If all containers share the same OS image, then managing state is a nightmare! One bad change in one container would break them all!

Page 51: Inside the Atlassian OnDemand Private Cloud

• But managing state on multiple machines is a solved problem!
  • What if you have >10,000 machines?
• Why are you modifying the OS anyway?

Page 52: Inside the Atlassian OnDemand Private Cloud

Does your iPhone upgrade iOS when you install an app?

Page 53: Inside the Atlassian OnDemand Private Cloud

#summit12

“Fix problems by removing them, not by adding systems to manage them.”

Page 54: Inside the Atlassian OnDemand Private Cloud

Read-only OS images

Page 55: Inside the Atlassian OnDemand Private Cloud

Data classes in a system
• OS and system daemon code
• Application code
• Application and user data

Page 71: Inside the Atlassian OnDemand Private Cloud

OpenVZ Kernel

Container

• / - Read Only: OS tools, system supplied code
• /sw - Read Only: Applications, JVMs, configs
• /data (R/W): Application and user data
  • /data/service/

Page 72: Inside the Atlassian OnDemand Private Cloud

How?
• Storage nodes export /e/ro/ & /e/rw
• Build an OS distro inside a chroot.
  • Use whatever tools you are comfortable with.
• Put this chroot tree in the RO location on storage nodes
• Make a “data” dir in the RW location for each container

Page 73: Inside the Atlassian OnDemand Private Cloud

How?
• On container start, bind mount: /net/storage-n/e/ro/os/linux-image-v1/ -> /vz/<ctid>/root
• Replace etc, var & tmp with a memfs
  • Linux expects to be able to write to these
• Mount the container’s data dir (RW) to /data
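The container start steps above can be sketched as a list of mount commands. This only builds the commands and does not run them. The read-only image path is the one on the slide; the per-container RW path, the use of tmpfs for the "memfs", and the function name are assumptions.

```python
from typing import List


def container_mount_plan(ctid: int, storage_host: str = "storage-n",
                         os_image: str = "linux-image-v1") -> List[List[str]]:
    """Return the mount commands for one container (not executed here)."""
    root = f"/vz/{ctid}/root"
    ro_image = f"/net/{storage_host}/e/ro/os/{os_image}/"   # path from the slide
    rw_data = f"/net/{storage_host}/e/rw/{ctid}/data"       # hypothetical RW layout

    plan = [
        ["mount", "--bind", ro_image, root],
        ["mount", "-o", "remount,ro,bind", root],           # make the bind mount read-only
    ]
    # Writable in-memory replacements; tmpfs is assumed for the slide's "memfs".
    # Populating them (for example copying a template /etc) is left out of this sketch.
    for name in ("etc", "var", "tmp"):
        plan.append(["mount", "-t", "tmpfs", "tmpfs", f"{root}/{name}"])
    plan.append(["mount", "--bind", rw_data, f"{root}/data"])
    return plan


for command in container_mount_plan(101):
    print(" ".join(command))
```

Switching OS versions then amounts to changing the `os_image` value and restarting the container.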

Page 74: Inside the Atlassian OnDemand Private Cloud

More benefits
• Distribute OS images as a simple directory.
• Prove that environments (Dev, Stg, Prd) are identical using MD5sum.
• Flip between OS versions by changing a variable
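One way to compare OS image directories across environments is a single digest over the whole tree. The sketch below is an assumption about how such a check could look, not the tooling from the talk; it hashes relative paths and file contents only, and ignores ownership, permissions and special files.

```python
import hashlib
import os


def tree_md5(root: str) -> str:
    """Return one MD5 digest covering relative paths and contents under root."""
    digest = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                      # deterministic traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest.update(os.path.relpath(path, root).encode("utf-8"))
            digest.update(b"\0")
            if os.path.islink(path):
                # Hash the link target rather than following the link.
                digest.update(os.readlink(path).encode("utf-8"))
            elif os.path.isfile(path):
                with open(path, "rb") as handle:
                    for chunk in iter(lambda: handle.read(1 << 20), b""):
                        digest.update(chunk)
            digest.update(b"\0")
    return digest.hexdigest()
```

Two environments hold the same image if `tree_md5` returns the same value for both image directories.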

Page 75: Inside the Atlassian OnDemand Private Cloud

The Swear Wall

Page 76: Inside the Atlassian OnDemand Private Cloud

The swear wall helps prevent death by a thousand cuts.

Your team has a gut feeling about what’s hurting them - this helps you quantify that feeling and act on the pain.

Page 78: Inside the Atlassian OnDemand Private Cloud

1. !@&*^# Solaris!
2. Solaris gets a mark
3. Repeat
4. Periodically throw out offensive technology
5. ...
6. PROFIT!! (swear less)
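The tallying in the steps above can be sketched in a few lines. The class and method names are invented for illustration; "Solaris" is the example from the slide and "NFS" is a hypothetical second entry.

```python
from collections import Counter
from typing import List, Tuple


class SwearWall:
    """Counts one mark per complaint so the worst offenders can be reviewed."""

    def __init__(self) -> None:
        self.marks: Counter = Counter()

    def swear(self, technology: str) -> None:
        self.marks[technology] += 1

    def worst(self, n: int = 3) -> List[Tuple[str, int]]:
        return self.marks.most_common(n)


wall = SwearWall()
for offender in ["Solaris", "NFS", "Solaris", "Solaris"]:
    wall.swear(offender)
print(wall.worst(1))  # [('Solaris', 3)]
```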

Page 79: Inside the Atlassian OnDemand Private Cloud

Optimise for the task at hand.

Don’t layer solutions onto problems. Get rid of them.

Page 80: Inside the Atlassian OnDemand Private Cloud

Thank you!
