Modifying the engine while the airplane is flying - 3-12-2018 (otto.normandale.edu/events/modifying...)
Replacing the Engine while the Airplane is Flying
Modifying and replacing software that cannot be taken offline
Automated, Rolling Rollouts
[Diagram, five slides: Load Balancer (ncc.com) in front of Application Server 1, Application Server 2, ... Application Server N. Slide 1 lists all servers. Slides 2-4 show Server 1, then Server 2, then Server N as "Updating…" while the other servers remain listed behind the load balancer. Slide 5 lists all servers again. An asterisk (*) marks the server that was updated most recently.]
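The rolling sequence in these slides can be sketched roughly as follows. This is a minimal illustration, not the tooling used in the talk: the `Server` class, the `rolling_rollout` function, and the `is_healthy` callback are hypothetical names, and assigning `server.version` stands in for a real deploy step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Server:
    name: str
    version: str
    in_rotation: bool = True  # True while the load balancer sends traffic here


def rolling_rollout(servers: List[Server], new_version: str,
                    is_healthy: Callable[[Server], bool]) -> bool:
    """Update servers one at a time; stop at the first failed health check."""
    for server in servers:
        server.in_rotation = False        # take this server out of the pool
        old_version = server.version
        server.version = new_version      # stand-in for the real deploy step
        if not is_healthy(server):
            server.version = old_version  # restore the previous version
            server.in_rotation = True
            return False                  # remaining servers stay untouched
        server.in_rotation = True         # back in the pool before moving on
    return True


servers = [Server("app1", "1.0"), Server("app2", "1.0"), Server("appN", "1.0")]
ok = rolling_rollout(servers, "1.1", is_healthy=lambda s: True)
print(ok, [s.version for s in servers])
```

Because only one server is out of the pool at any moment, the others keep serving traffic for the whole rollout.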
What were some things we did right?
• Software Versioning: Because we versioned our code, rolling back was possible.
• Basic Monitoring: At least we knew it was happening and could see the database server was where things were slowing down.
What could we have done better?
• Canary Deployment: Introduce new software on one server
• Cross-Training: Only the CTO was familiar with handling production issues
• Don’t Repeat Yourself: Centralize queries that do the same thing rather than spreading them all over the application.
• Performance Testing: Automated performance tests at a scale similar to production would likely have caught this issue.
• Better Metrics/Logs: This would have made it far easier to triage and identify which query was the problem.
• Deprecation: Should have made it possible to roll back without losing data
Canary Deployments
[Diagram: Load Balancer (ncc.com) in front of Application Server 1, Application Server 2, and one App Server Running New Version of Software]
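One way to picture the canary setup is a load balancer that sends only a small share of requests to the server running the new version. The sketch below assumes weighted random routing; the server names, the 5% weight, and the `pick_server` function are illustrative assumptions, not details from the talk.

```python
import random
from typing import Dict, Optional

# Hypothetical weights: the canary receives roughly 5% of requests.
WEIGHTS: Dict[str, float] = {"app1": 47.5, "app2": 47.5, "canary": 5.0}


def pick_server(weights: Dict[str, float],
                rng: Optional[random.Random] = None) -> str:
    """Choose a backend at random, in proportion to its weight."""
    rng = rng if rng is not None else random.Random()
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)  # seeded so the simulation is repeatable
counts = {name: 0 for name in WEIGHTS}
for _ in range(10_000):
    counts[pick_server(WEIGHTS, rng)] += 1
print(counts)
```

If the canary's error rate or latency looks worse than the other servers, its weight can be set to zero, which limits the damage to the small share of traffic it received.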
Automated Performance Testing
• Simulate User Load on the software
• Measure success/failure rate, response time, etc. to catch new major performance issues before they reach production
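A very small load test along these lines might look like the sketch below. It assumes the request is any Python callable; `fake_request`, `run_load_test`, and the user counts are hypothetical, and a real test would call the application over the network at a scale closer to production.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple


def timed_call(request: Callable[[], None]) -> Tuple[bool, float]:
    """Run one request; return (succeeded, elapsed seconds)."""
    start = time.perf_counter()
    try:
        request()
        ok = True
    except Exception:
        ok = False
    return ok, time.perf_counter() - start


def run_load_test(request: Callable[[], None], users: int,
                  requests_per_user: int) -> Dict[str, float]:
    """Simulate concurrent users and summarize success rate and latency."""
    total = users * requests_per_user
    with ThreadPoolExecutor(max_workers=users) as pool:
        results = list(pool.map(lambda _: timed_call(request), range(total)))
    latencies = sorted(elapsed for _, elapsed in results)
    successes = sum(1 for ok, _ in results if ok)
    # Nearest-rank 95th percentile, clamped to a valid index.
    p95_index = min(total - 1, max(0, math.ceil(0.95 * total) - 1))
    return {
        "requests": total,
        "success_rate": successes / total,
        "p95_seconds": latencies[p95_index],
    }


def fake_request() -> None:
    time.sleep(0.001)  # stand-in for a call to the application


report = run_load_test(fake_request, users=5, requests_per_user=4)
print(report)
```

An automated check could then fail the build when `success_rate` drops or `p95_seconds` rises past a threshold chosen by the team.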
Better Logs/Metrics
• Metrics that break down how long each query is taking
• Aggregating Logs across multiple servers to make it easy to search through errors
• Could use products like ELK (Elasticsearch + Logstash + Kibana) or Splunk
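As a rough illustration of per-query timing metrics, the sketch below records how long each named query takes and reports the slowest ones. The `timed_query` helper, the query names, and the `time.sleep` calls are stand-ins for illustration only; a production system would typically ship these numbers to a metrics or log aggregation product such as the ones named above.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator, List, Tuple

# Durations in seconds, grouped by a label the caller gives each query.
query_timings: Dict[str, List[float]] = defaultdict(list)


@contextmanager
def timed_query(name: str) -> Iterator[None]:
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        query_timings[name].append(time.perf_counter() - start)


def slowest_queries(limit: int = 3) -> List[Tuple[str, float]]:
    """Return (name, total seconds) pairs, slowest first."""
    totals = {name: sum(durations) for name, durations in query_timings.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]


with timed_query("load_enrollments"):  # hypothetical query name
    time.sleep(0.02)                   # stand-in for a slow database call
with timed_query("load_user"):         # hypothetical query name
    time.sleep(0.001)

slowest = slowest_queries()
print(slowest)
```

Centralizing queries, as suggested earlier in the deck, makes this easier because each query only has to be wrapped in one place.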
What about a Simple Rollback?
Rollout: Convert old data to new format
Rollback: Convert new data back to old data
But what do you do if the formats are incompatible or the conversion to/from the new data format takes hours?
Rollout: Old Format -> New Format
Rollback: New Format -> Old Format
Deprecation Patterns
1. Old Data Format
2. New Data Format, but with new features disabled
• Validate Software Works
• New Feature is not enabled, so there is still a path to roll back
• Software needs to be programmed to save/read in both the old and new data format to enable co-existence with old systems
3. Stop Using Old Data Format Entirely
• Roll Out New Version of Software Completely that no longer uses the old data format
4. Toggle On New Feature
• Toggle on the new feature (which is causing you to change the data format) knowing that backwards compatibility with the old data format is no longer required
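The stages above can be sketched with a record that is readable in both formats until the old one is retired. Everything here is a hypothetical example: the old format is assumed to be a single `name` field, the new format splits it into `first_name` and `last_name`, and the `Flags` class and function names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Flags:
    write_old_format: bool = True      # stage 2: keep old readers working
    new_feature_enabled: bool = False  # stage 4: flipped last


def read_name(record: Dict[str, str]) -> Dict[str, str]:
    """Accept either format and return the new shape."""
    if "first_name" in record:
        return {"first_name": record["first_name"],
                "last_name": record["last_name"]}
    first, _, last = record["name"].partition(" ")  # old single-field format
    return {"first_name": first, "last_name": last}


def write_name(first: str, last: str, flags: Flags) -> Dict[str, str]:
    """Always write the new fields; add the old field while it is still needed."""
    record = {"first_name": first, "last_name": last}
    if flags.write_old_format:
        record["name"] = f"{first} {last}"
    return record


def display_name(record: Dict[str, str], flags: Flags) -> str:
    """The new feature (sortable display) stays off until the flag is flipped."""
    name = read_name(record)
    if flags.new_feature_enabled:
        return f"{name['last_name']}, {name['first_name']}"
    return f"{name['first_name']} {name['last_name']}"


stage2 = write_name("Ada", "Lovelace", Flags())
stage3 = write_name("Ada", "Lovelace", Flags(write_old_format=False))
stage4 = display_name(stage3, Flags(write_old_format=False, new_feature_enabled=True))
print(stage2, stage3, stage4)
```

In stage 2 a rollback loses nothing, because every record still carries the old field. Only after stage 3 is fully rolled out does the old field disappear, and only then is the feature toggled on.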
What could we learn?
• Healthchecks are good, but sometimes extra automation can hurt you
• Circular Dependencies should be avoided
• Many architecture problems are actually people problems
• Sometimes issues only surface after running in production