TRANSCRIPT
Moving Day: Migrating your Big Data from A to B
Justin [email protected]
Corey [email protected]
Laura [email protected]
Overview
• What is Socorro?
• Rationale for migration
• Build out
• Automation
• Smoke testing
• Troubleshooting
• Moving Day
• Aftermath
What is Socorro?
“Socorro has a lot of moving parts”
...
“I prefer to think of them as dancing parts”
A different type of scaling:
• Typical webapp: scale to millions of users without degradation of response time
• Socorro: fewer than a hundred users, terabytes of data
Basic law of scale still applies:
The bigger you get, the more spectacularly you fail
Some numbers
• At peak we receive 3000 crashes per minute
• 3 million per day
• Median crash size: ~150 KB
• ~40TB stored in HBase and growing every day
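A quick sanity check on those figures, treating the median as a stand-in for the mean (so this understates a heavy-tailed reality):

# Back-of-envelope daily ingest from the numbers above.
# Caveat: median != mean, so this is an order-of-magnitude check only.
crashes_per_day = 3_000_000
median_crash_bytes = 150 * 1024      # ~150 KB per crash

daily_tb = crashes_per_day * median_crash_bytes / 1024**4
print("~%.2f TB of raw crash data per day" % daily_tb)   # ~0.42 TB/day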
What can we do?
• Does betaN have more (null signature) crashes than other betas?
• Analyze differences between Flash versions x and y crashes
• Detect duplicate crashes (see the sketch after this list)
• Detect explosive crashes
• Find “frankeninstalls”
• Email victims of a malware-related crash
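Socorro's real duplicate detection weighs more metadata than this, but as a purely illustrative sketch: hash a few stable fields into a grouping key and flag repeats. The field names here are hypothetical, not Socorro's actual schema:

import hashlib

def dedupe_key(crash):
    # Duplicate submissions of the same crash tend to share these
    # fields; hash them into a single grouping key.
    parts = (crash.get("signature", ""),
             crash.get("product", ""),
             crash.get("version", ""),
             crash.get("install_time", ""))
    return hashlib.sha1("|".join(map(str, parts)).encode("utf-8")).hexdigest()

seen = set()
def is_probable_duplicate(crash):
    key = dedupe_key(crash)
    if key in seen:
        return True
    seen.add(key)
    return False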
Rationale and planning
Why?
• Initial reason for move was stability: lots of downtime, partly due to approaching capacity limits
• Needed more infrastructure, which meant new data center
• Decided if we were going to do it over...we’d do it right
Fragility
• Ongoing stability problems with HBase, and when it went down, everything went with it
• Releases were nightmares, requiring manual upgrades of multiple boxes, editing of config files, and manual QA
• Troubleshooting was done remotely (awful)
Moving data
• Setting up new architecture and new machines is one thing
• Moving such a large amount of data...with no downtime...is harder
• The no-downtime requirement turned out to apply to collection, so we changed the code to a pluggable storage architecture: spool to disk when HBase is unavailable (see the sketch after this list)
• PostgreSQL data was easy to move
• For HBase we originally intended to use distcp, but ran into trouble and ended up using a dirty copy tool authored in-house
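In miniature, the pluggable-storage idea looks like this. A minimal sketch only, with illustrative names, not Socorro's actual crashstorage classes:

import json, os

class HBaseCrashStorage:
    def save(self, crash_id, crash):
        raise NotImplementedError  # real version writes to HBase

class SpoolingCrashStorage:
    """Wrap a primary store; if it fails, spool the crash to local disk
    so collection never stops. A separate job replays the spool later."""
    def __init__(self, primary, spool_dir="/var/spool/socorro"):
        self.primary = primary
        self.spool_dir = spool_dir
        os.makedirs(spool_dir, exist_ok=True)

    def save(self, crash_id, crash):
        try:
            self.primary.save(crash_id, crash)
        except Exception:
            # Primary (e.g. HBase) is down: write to disk and move on.
            path = os.path.join(self.spool_dir, crash_id + ".json")
            with open(path, "w") as f:
                json.dump(crash, f)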
Planning tools
• Bugzilla for tasks
• Pre-flight checklist and in-flight checklist to track tasks
• Read Atul Gawande’s The Checklist Manifesto
• Rollback plan
• Failure scenarios
• Rehearsals
Build out
What was wrong?
• legacy hardware
• improperly managed code
• each server was different
• no configuration management
• shared resources with other webapps
• vital daemons were started with “nohup ./startDaemon &”
• insufficient monitoring
• one sysadmin; the rest of the team and developers had no insight into production
• no automated testing
Automation
Configuration Management
• new rule: if it wasn’t checked in and managed by Puppet, it wasn’t going on the new servers
• no local configuration/installation of anything
• daemons got init scripts and proper Nagios plugins (see the plugin sketch after this list)
• application configuration is done centrally in one place
• staging application matches production
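Nagios plugins are just executables that report through their exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). A minimal sketch of a daemon check, assuming a conventional pidfile; the path is illustrative:

#!/usr/bin/env python
"""Minimal Nagios-style check: is the daemon's process still alive?"""
import os, sys

PIDFILE = "/var/run/socorro-processor.pid"   # illustrative path

try:
    with open(PIDFILE) as f:
        pid = int(f.read().strip())
except (OSError, ValueError):
    print("UNKNOWN: cannot read pidfile %s" % PIDFILE)
    sys.exit(3)

try:
    os.kill(pid, 0)   # signal 0: existence check only, sends nothing
except OSError:
    print("CRITICAL: pid %d from %s is not running" % (pid, PIDFILE))
    sys.exit(2)

print("OK: pid %d running" % pid)
sys.exit(0)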
Packages for production
• 3rd party libraries and packages are pulled in upstream
• IT doesn’t need to know/care how a developer develops. What goes into production is a tested, polished package.
• packages for production are built and tested by Jenkins the same way every time.
• local patches aren’t allowed. A patch to production means a patch to the source upstream, a patch to stage and a proper rollout to production
• Every package is fully tested in a staging environment
Smoke testing
Load Testing
• Used a small portion (40 nodes) of a 512-node Seamicro cluster
• Simulating real traffic by submitting crashes from a test farm (a submitter sketch follows this list)
• Testing the entire system as a whole, under real production load
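The submission side of the test farm can be as simple as replaying stored crashes over HTTP. A sketch assuming a Breakpad-style collector that accepts multipart POSTs; the URL and form fields here are assumptions, not the real endpoint:

import requests  # third-party HTTP client

COLLECTOR = "https://crash-reports.example.com/submit"  # hypothetical URL

def submit_crash(metadata, dump_path):
    """POST one crash roughly the way a real client would: metadata as
    form fields, the minidump as a file upload."""
    with open(dump_path, "rb") as dump:
        resp = requests.post(COLLECTOR,
                             data=metadata,
                             files={"upload_file_minidump": dump},
                             timeout=30)
    resp.raise_for_status()
    return resp.text  # the collector answers with a crash ID

Looping this over a directory of captured crashes from a farm of machines approximates production load against the whole stack.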
Troubleshooting
The Bleeding Edge... bleeds!
• Moving HBase data over: issues with distcp (a sketch of the fallback copy approach follows this list)
• New data center, new load balancers, new challenges
• New hardware = new challenges with HBase
• Discovering a misconfigured kickstart script, from which all servers were built.
• Using Puppet to correct the error
• Testing various configurations
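The in-house copy tool isn't reproduced here; as a sketch of its general shape, assuming Python-side HBase access via the happybase Thrift client:

import happybase  # third-party Thrift-based HBase client

def dirty_copy(table_name, src_host, dst_host, batch_size=1000):
    """Naive row-by-row copy between clusters. 'Dirty' because rows
    written to the source mid-copy can be missed, so a catch-up pass
    over new data is still needed afterwards."""
    src = happybase.Connection(src_host).table(table_name)
    dst = happybase.Connection(dst_host).table(table_name)
    with dst.batch(batch_size=batch_size) as batch:
        for row_key, data in src.scan():
            batch.put(row_key, data)

dirty_copy("crash_reports", "hbase-old.example.com", "hbase-new.example.com")

The "dirty" trade-off is why data arriving during the migration still had to be dealt with afterwards (see Aftermath).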
Moving Day
• Flew the team in
• Migration day checklist: http://tinyurl.com/migrationday
• Went remarkably smoothly due largely to good co-operation between teams
Aftermath
• Before moving day, every possible failure scenario was discussed and planned for
• Most failure scenarios actually happened. The new cluster failed gracefully with 0 data loss.
• New data that came in during the migration had to be dealt with.
• Problems with the new HBase cluster and various other tweaks
• Postmortem to learn what we did right and wrong
It’s all open
Get the code (moving to GitHub in 3-4 weeks):
http://code.google.com/p/socorro
Read/file/fix bugs:
https://bugzilla.mozilla.org/
Call in for the weekly meetings:
https://wiki.mozilla.org/Breakpad/Status_Meetings
Join us in IRC:
irc.mozilla.org #breakpad
Questions?
We’re Hiring!!
• Help us work to improve and protect the open web
• Offices worldwide, work from home possible for many jobs
• careers.mozilla.com
• System Administrators (Web Ops, VCS, LDAP, etc)
• Site Reliability Engineers, QA Engineers
• Developers, developers, developers, developers!!