AWS to Bare Metal: Motivation, Pitfalls, and Results

Post on 25-Jul-2015

TRANSCRIPT

AWS CLOUD TO BARE METAL

Wish saved 35% on MongoDB costs

Improved latency by 20%

And reduced latency variance

HI, I’M ADAM. (I’m a software engineer; I also run production…)

I WORK AT WISH. (we’re a mobile eCommerce platform)

I WORK AT WISH. (we also grow really fast…)

AWS TO BARE METAL

• The Why

• The Scope

• The Servers

• The Network

• The Operations

• The Results

THE THEME

The Why

In the beginning

there was spinning disk EBS

EBS LATENCY SPIKE

DB slows to a crawl

Replica set detects failure

Election kills the app for 30s

App slows down

Summer 2012

Provisioned IOPS EBS launches

But - super expensive!

Maybe time for bare metal?

So we modeled the costs…

The Scope

?

The Servers

Server Specs?

GOAL

Find lowest cost per query

for your workload

THROUGHPUT & LATENCY

• Typically: more throughput → more latency

• Application dictates max latency (p95?)

• For each hardware config…

• Find highest throughput under max latency
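The selection rule above can be sketched in a few lines. This is only an illustration, not the tool from the talk: the config names, monthly costs, and benchmark points are made-up placeholders, and `bestUnderLatency`, `costPerMillionQueries`, and `cheapestConfig` are hypothetical helper names. It also assumes a server runs at its best qualifying throughput for the whole month, which flatters every config equally.

```javascript
// Hypothetical benchmark results: for each hardware config, the p95 latency
// measured at several replay throughputs. All numbers are placeholders.
const configs = [
  {
    name: "config-a",
    monthlyCostUsd: 1000,
    runs: [
      { opsPerSec: 1000, p95Ms: 5 },
      { opsPerSec: 2000, p95Ms: 12 },
      { opsPerSec: 3000, p95Ms: 40 },
    ],
  },
  {
    name: "config-b",
    monthlyCostUsd: 1800,
    runs: [
      { opsPerSec: 2000, p95Ms: 4 },
      { opsPerSec: 4000, p95Ms: 9 },
      { opsPerSec: 6000, p95Ms: 18 },
    ],
  },
];

const SECONDS_PER_MONTH = 30 * 24 * 3600;

// Highest measured throughput whose p95 stays at or under the limit,
// or null if no run qualifies.
function bestUnderLatency(runs, maxP95Ms) {
  const ok = runs.filter((r) => r.p95Ms <= maxP95Ms);
  if (ok.length === 0) return null;
  return ok.reduce((best, r) => (r.opsPerSec > best.opsPerSec ? r : best));
}

// Monthly cost divided by the queries served in a month, per million queries.
// Returns Infinity when the config never meets the latency limit.
function costPerMillionQueries(config, maxP95Ms) {
  const best = bestUnderLatency(config.runs, maxP95Ms);
  if (best === null) return Infinity;
  const queriesPerMonth = best.opsPerSec * SECONDS_PER_MONTH;
  return (config.monthlyCostUsd / queriesPerMonth) * 1e6;
}

// The config with the lowest cost per query under the latency limit.
function cheapestConfig(allConfigs, maxP95Ms) {
  return allConfigs
    .map((c) => ({ name: c.name, cost: costPerMillionQueries(c, maxP95Ms) }))
    .reduce((a, b) => (b.cost < a.cost ? b : a));
}
```

For example, `cheapestConfig(configs, 20)` compares each config at the fastest run that keeps p95 at or under 20 ms. The more expensive box can still win if it sustains enough extra throughput.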

THE WORKLOAD

• db.setProfilingLevel(2)

• Snapshot the DB volume

• Dump system.profile after 1 hour

OUR TOOL

• Restore the snapshot

• Clear filesystem caches

• Replay ops at configured throughput

• Report on latency / MongoDB stats
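The pacing and reporting steps above might look roughly like the following. This is a sketch under assumptions, not the tool described in the talk: `buildSchedule` and `percentile` are hypothetical names, the input is assumed to be documents dumped from `system.profile` (which carry `ts`, `op`, and `ns` fields), and actually sending the ops to MongoDB is left out.

```javascript
// Order captured ops by their original timestamp, then assign each one a
// send time (milliseconds from replay start) at a fixed target rate.
// The original spacing between ops is deliberately ignored: the point is to
// drive the server at a chosen throughput, not to reproduce wall-clock time.
function buildSchedule(profileDocs, opsPerSec) {
  if (opsPerSec <= 0) throw new Error("opsPerSec must be positive");
  const intervalMs = 1000 / opsPerSec;
  return profileDocs
    .slice() // do not reorder the caller's array
    .sort((a, b) => a.ts - b.ts)
    .map((doc, i) => ({ sendAtMs: i * intervalMs, op: doc.op, ns: doc.ns }));
}

// Nearest-rank percentile of a list of latencies (any unit).
function percentile(values, p) {
  if (values.length === 0) throw new Error("no values");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}
```

A replay run would walk the schedule, issue each op at its `sendAtMs`, record how long it took, and feed the recorded latencies to `percentile(latencies, 95)` for the report.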

LATEST SPECS

• 2x Ivy Bridge 3.3 GHz (32 hyperthreads)

• 256 GB RAM

• 3.2 TB LSI WarpDrive PCI-e

YOUR MILEAGE MAY VARY!

The Network

NETWORKS ARE WEIRD

• Network engineering is weird for software people

• Need to master a few, big pieces

• We wasted a lot of time improvising…

PLAN TO FAIL

• Every component and connection fails

• Switch dies?

• NIC dies?

• Switch ⟷ switch connection dies?

• DirectConnect dies?
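One way to work through the "what if X dies?" questions above is to write the wiring down as a graph and remove each component in turn. The sketch below is an illustration only: the topology is invented (the talk does not describe the actual network), and `isConnected` and `singlePointsOfFailure` are hypothetical names.

```javascript
// Breadth-first search over an undirected list of [a, b] links.
function isConnected(links, from, to) {
  const adj = new Map();
  for (const [a, b] of links) {
    if (!adj.has(a)) adj.set(a, []);
    if (!adj.has(b)) adj.set(b, []);
    adj.get(a).push(b);
    adj.get(b).push(a);
  }
  const seen = new Set([from]);
  const queue = [from];
  while (queue.length > 0) {
    const node = queue.shift();
    if (node === to) return true;
    for (const next of adj.get(node) || []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return false;
}

// List every single component (node or link) whose failure cuts `from`
// off from `to`. An empty list means no single failure is fatal.
function singlePointsOfFailure(links, from, to) {
  const failures = [];
  for (const node of new Set(links.flat())) {
    if (node === from || node === to) continue;
    const remaining = links.filter(([a, b]) => a !== node && b !== node);
    if (!isConnected(remaining, from, to)) failures.push(`node ${node}`);
  }
  links.forEach(([a, b], i) => {
    const remaining = links.filter((_, j) => j !== i);
    if (!isConnected(remaining, from, to)) failures.push(`link ${a}<->${b}`);
  });
  return failures;
}

// Invented example: a dual-homed server, two switches, one DirectConnect.
const exampleLinks = [
  ["db1", "nic-a"],
  ["db1", "nic-b"],
  ["nic-a", "switch-1"],
  ["nic-b", "switch-2"],
  ["switch-1", "switch-2"],
  ["switch-1", "directconnect"],
  ["directconnect", "aws"],
];
```

In this invented layout, `singlePointsOfFailure(exampleLinks, "db1", "aws")` flags `switch-1` and the DirectConnect path, while losing either NIC, `switch-2`, or the switch-to-switch link is survivable.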

The Operations

THE OPERATIONS

• Migration / Rollback

• Backups

• Processes

• Documentation

MIGRATION (PREP)

• Add new nodes to replica set

• hidden: true, priority: 0

• Wait for them to sync

MIGRATION (READ-ONLY)

• Unhide nodes:

• hidden: false, priority: 0

MIGRATION (READ-WRITE)

• Force primary into colo:

• hidden: false, priority: 2

MIGRATION (DONE)

• Hide old AWS nodes:

• hidden: true, priority: 0

ROLLBACK

• No big deal

• Adjust hidden/priority to move traffic back
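The migration phases and the rollback all come down to editing `hidden` and `priority` on replica set members. A minimal sketch of that idea follows, assuming a config object shaped like the one `rs.conf()` returns in the mongo shell. The host names, the `colo-` / `aws-` prefixes, and the helpers `applyPhase`, `isColo`, and `isAws` are hypothetical, not taken from the talk.

```javascript
// Hypothetical naming convention used to tell the two sites apart.
const isColo = (host) => host.startsWith("colo-");
const isAws = (host) => host.startsWith("aws-");

// The four phases from the slides. MongoDB requires hidden members to have
// priority 0, which every hidden setting below respects.
const PHASES = {
  prep: { target: isColo, settings: { hidden: true, priority: 0 } },
  readOnly: { target: isColo, settings: { hidden: false, priority: 0 } },
  readWrite: { target: isColo, settings: { hidden: false, priority: 2 } },
  done: { target: isAws, settings: { hidden: true, priority: 0 } },
};

// Return a new config with the phase's settings applied to matching members.
// The input config is left untouched so it can be kept for rollback.
function applyPhase(config, phaseName) {
  const phase = PHASES[phaseName];
  if (!phase) throw new Error("unknown phase: " + phaseName);
  return {
    ...config,
    members: config.members.map((m) =>
      phase.target(m.host) ? { ...m, ...phase.settings } : { ...m }
    ),
  };
}

// In the mongo shell, on the primary, this would be used roughly as:
//   var cfg = rs.conf();
//   rs.reconfig(applyPhase(cfg, "readOnly"));
```

Rolling back is the same operation in reverse: reapply an earlier phase, or restore the saved config, to move traffic back to the AWS members.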

BACKUPS

• EBS snapshots rock!

• Hidden member in EC2 for backup

• Nice for DR too…

PROCESSES

• No RackServer() API

• Ensure consistency:

• Checklists

• Verification tools
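A verification tool can be as small as a comparison between what a server reports and what the checklist says it should be. The sketch below is hypothetical: `verifyServer` is an invented name, the expected values come from the "LATEST SPECS" slide, and gathering the actual values from a machine is assumed to happen elsewhere.

```javascript
// Expected values taken from the "LATEST SPECS" slide.
const EXPECTED = { hyperthreads: 32, ramGb: 256 };

// Compare reported facts against the expected spec and describe every
// mismatch. An empty result means the server passed.
function verifyServer(expected, actual) {
  const problems = [];
  for (const [key, want] of Object.entries(expected)) {
    const got = actual[key];
    if (got !== want) problems.push(`${key}: expected ${want}, got ${got}`);
  }
  return problems;
}
```

Running the same check against every newly racked server catches the inconsistencies that a cloud API would otherwise have prevented.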

DOCUMENTATION

• No DescribeInstances either…

• Consider life without AWS Management Console

• Worse: consider it being occasionally wrong

DOCUMENTATION

• Wiremaps

• Network maps (IPs, VLANs, etc)

• Equipment specs

• Serial numbers

The Results

Big project - took about 6 months

Savings made it worthwhile

Bonus: it got faster!

Budget a lot of time for learning

Benchmark & validate your assumptions

Obsess over the details

Thanks!

adam@wish.com
