TRANSCRIPT
Moving Day: Migrating your Big Data from A to B
Justin [email protected]
Corey [email protected]
Laura [email protected]
Overview
• What is Socorro?
• Rationale for migration
• Build out
• Automation
• Smoke testing
• Troubleshooting
• Moving Day
• Aftermath
What is Socorro?
“Socorro has a lot of moving parts”
...
“I prefer to think of them as dancing parts”
A different type of scaling:
• Typical webapp: scale to millions of users without degradation of response time
• Socorro: fewer than a hundred users, terabytes of data
Basic law of scale still applies:
The bigger you get, the more spectacularly you fail
Some numbers
• At peak we receive 3000 crashes per minute
• 3 million per day
• Median crash size: ~150 KB
• ~40TB stored in HBase and growing every day
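A quick sanity check on those figures, treating the median as a stand-in for the mean (so this understates a heavy-tailed reality):

# Back-of-envelope daily ingest from the numbers above.
# Caveat: median != mean, so this is an order-of-magnitude check only.
crashes_per_day = 3_000_000
median_crash_bytes = 150 * 1024      # ~150 KB per crash

daily_tb = crashes_per_day * median_crash_bytes / 1024**4
print("~%.2f TB of raw crash data per day" % daily_tb)   # ~0.42 TB/day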
What can we do?
• Does betaN have more (null signature) crashes than other betas?
• Analyze differences between Flash versions x and y crashes
• Detect duplicate crashes (see the sketch after this list)
• Detect explosive crashes
• Find “frankeninstalls”
• Email victims of a malware-related crash
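Socorro's real duplicate detection weighs more metadata than this, but as a purely illustrative sketch: hash a few stable fields into a grouping key and flag repeats. The field names here are hypothetical, not Socorro's actual schema:

import hashlib

def dedupe_key(crash):
    # Duplicate submissions of the same crash tend to share these
    # fields; hash them into a single grouping key.
    parts = (crash.get("signature", ""),
             crash.get("product", ""),
             crash.get("version", ""),
             crash.get("install_time", ""))
    return hashlib.sha1("|".join(map(str, parts)).encode("utf-8")).hexdigest()

seen = set()
def is_probable_duplicate(crash):
    key = dedupe_key(crash)
    if key in seen:
        return True
    seen.add(key)
    return False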
Rationale and planning
Why?
• Initial reason for move was stability: lots of downtime, partly due to approaching capacity limits
• Needed more infrastructure, which meant new data center
• Decided if we were going to do it over...we’d do it right
Fragility
• Ongoing stability problems with HBase, and when it went down, everything went with it
• Releases were nightmares, requiring manual upgrades of multiple boxes, editing of config files, and manual QA
• Troubleshooting was done remotely (awful)
Moving data
• Setting up new architecture and new machines is one thing
• Moving such a large amount of data...with no downtime...is harder
• The no-downtime requirement turned out to apply to collection, so we changed the code to a pluggable storage architecture: spool to disk when HBase is unavailable (see the sketch after this list)
• PostgreSQL data was easy to move
• For HBase we originally intended to use distcp, but ran into trouble and ended up using a dirty copy tool authored in-house
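In miniature, the pluggable-storage idea looks like this. A minimal sketch only, with illustrative names, not Socorro's actual crashstorage classes:

import json, os

class HBaseCrashStorage:
    def save(self, crash_id, crash):
        raise NotImplementedError  # real version writes to HBase

class SpoolingCrashStorage:
    """Wrap a primary store; if it fails, spool the crash to local disk
    so collection never stops. A separate job replays the spool later."""
    def __init__(self, primary, spool_dir="/var/spool/socorro"):
        self.primary = primary
        self.spool_dir = spool_dir
        os.makedirs(spool_dir, exist_ok=True)

    def save(self, crash_id, crash):
        try:
            self.primary.save(crash_id, crash)
        except Exception:
            # Primary (e.g. HBase) is down: write to disk and move on.
            path = os.path.join(self.spool_dir, crash_id + ".json")
            with open(path, "w") as f:
                json.dump(crash, f)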
Planning tools
• Bugzilla for tasks
• Pre-flight checklist and in-flight checklist to track tasks
• Read Atul Gawande’s The Checklist Manifesto
• Rollback plan
• Failure scenarios
• Rehearsals
Build out
What was wrong?
• legacy hardware
• improperly managed code
• each server was different
• no configuration management
• shared resources with other webapps
• vital daemons were started with “nohup ./startDaemon &”
• insufficient monitoring
• one sysadmin; the rest of the team and developers had no insight into production
• no automated testing
Automation
Configuration Management
• new rule: if it wasn’t checked in and managed by Puppet, it wasn’t going on the new servers
• no local configuration/installation of anything
• daemons got init scripts and proper Nagios plugins (see the plugin sketch after this list)
• application configuration is done centrally in one place
• staging application matches production
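Nagios plugins are just executables that report through their exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). A minimal sketch of a daemon check, assuming a conventional pidfile; the path is illustrative:

#!/usr/bin/env python
"""Minimal Nagios-style check: is the daemon's process still alive?"""
import os, sys

PIDFILE = "/var/run/socorro-processor.pid"   # illustrative path

try:
    with open(PIDFILE) as f:
        pid = int(f.read().strip())
except (OSError, ValueError):
    print("UNKNOWN: cannot read pidfile %s" % PIDFILE)
    sys.exit(3)

try:
    os.kill(pid, 0)   # signal 0: existence check only, sends nothing
except OSError:
    print("CRITICAL: pid %d from %s is not running" % (pid, PIDFILE))
    sys.exit(2)

print("OK: pid %d running" % pid)
sys.exit(0)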
Packages for production
• 3rd party libraries and packages are pulled in upstream
• IT doesn’t need to know/care how a developer develops. What goes into production is a tested, polished package.
• packages for production are built and tested by Jenkins the same way every time.
• local patches aren’t allowed. A patch to production means a patch to the source upstream, a patch to stage and a proper rollout to production
• Every package is fully tested in a staging environment
Smoke testing
Load Testing
• Used a small portion (40 nodes) of a 512-node Seamicro cluster
• Simulating real traffic by submitting crashes from a test farm (a submitter sketch follows this list)
• Testing the entire system as a whole, under real production load
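The submission side of the test farm can be as simple as replaying stored crashes over HTTP. A sketch assuming a Breakpad-style collector that accepts multipart POSTs; the URL and form fields here are assumptions, not the real endpoint:

import requests  # third-party HTTP client

COLLECTOR = "https://crash-reports.example.com/submit"  # hypothetical URL

def submit_crash(metadata, dump_path):
    """POST one crash roughly the way a real client would: metadata as
    form fields, the minidump as a file upload."""
    with open(dump_path, "rb") as dump:
        resp = requests.post(COLLECTOR,
                             data=metadata,
                             files={"upload_file_minidump": dump},
                             timeout=30)
    resp.raise_for_status()
    return resp.text  # the collector answers with a crash ID

Looping this over a directory of captured crashes from a farm of machines approximates production load against the whole stack.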
Troubleshooting
The Bleeding Edge... bleeds!
• Moving HBase data over: issues with distcp (a sketch of the fallback copy approach follows this list)
• New data center, new load balancers, new challenges
• New hardware = new challenges with HBase
• Discovering a misconfigured kickstart script, from which all servers were built.
• Using Puppet to correct the error
• Testing various configurations
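The in-house copy tool isn't reproduced here; as a sketch of its general shape, assuming Python-side HBase access via the happybase Thrift client:

import happybase  # third-party Thrift-based HBase client

def dirty_copy(table_name, src_host, dst_host, batch_size=1000):
    """Naive row-by-row copy between clusters. 'Dirty' because rows
    written to the source mid-copy can be missed, so a catch-up pass
    over new data is still needed afterwards."""
    src = happybase.Connection(src_host).table(table_name)
    dst = happybase.Connection(dst_host).table(table_name)
    with dst.batch(batch_size=batch_size) as batch:
        for row_key, data in src.scan():
            batch.put(row_key, data)

dirty_copy("crash_reports", "hbase-old.example.com", "hbase-new.example.com")

The "dirty" trade-off is why data arriving during the migration still had to be dealt with afterwards (see Aftermath).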
Moving Day
• Flew the team in
• Migration day checklist: http://tinyurl.com/migrationday
• Went remarkably smoothly due largely to good co-operation between teams
Aftermath
• Before moving day, every possible failure scenario was discussed and planned for
• Most failure scenarios actually happened. The new cluster failed gracefully with 0 data loss.
• New data that came in during the migration had to be dealt with.
• Problems with the new HBase cluster and various other tweaks
• Postmortem to learn what we did right and wrong
It’s all open
Get the code (moving to GitHub in 3-4 weeks):
http://code.google.com/p/socorro
Read/file/fix bugs:
https://bugzilla.mozilla.org/
Call in for the weekly meetings:
https://wiki.mozilla.org/Breakpad/Status_Meetings
Join us in IRC:
irc.mozilla.org #breakpad
Questions?
We’re Hiring!!
• Help us work to improve and protect the open web
• Offices worldwide, work from home possible for many jobs
• careers.mozilla.com
• System Administrators (Web Ops, VCS, LDAP, etc)
• Site Reliability Engineers, QA Engineers
• Developers, developers, developers, developers!!