Acquia Managed Cloud: Highly Available Architecture for Highly Unpredictable Traffic
Kieran Lal, Technical Director, Acquia
January 19th, 2012
Jess Iandiorio, Sr. Director, Cloud Product Marketing, Acquia
Your Drupal Application Life Stages
Set-up/Launch
Build • Load balancers • Fast page cache • App servers • Database • File systems • Web servers • App configuration • HA architecture
Deploy • Integrated Git/SVN • Drag and drop content management
Production
Application updates • Drupal app code
Infrastructure updates • OS • Debugging • Security
Operations • 24X7 monitoring & alerts • Backups • Load testing
Crisis
Diagnosis • Site failure • Infrastructure failure • Application errors
Resolution • Resize • Launch new virtual servers • Multi-region failover
Capacity Planning Options
[Chart: users hitting your site (0 to .010), by month, Jul to Dec]
Options
1. Over Plan: Over Pay
2. Under Plan: Expect Outages
3. Acquia Plan: No Failure
Unpredictable Traffic Victims
Events Businesses
Challenges • Plagued by prior event stats • Failure extends beyond web
Consequences of failure • Sales (tickets) • Brand damage • Missed donation opportunities
News/M&E Organizations
Challenges • You never know when you’ll be “Huff Po’d” • Time-to-market is critical
Consequences of failure • Loss of credibility • Readership • Contractual failures per advertising agreements • Impact to the ad sales cycle
High Growth Sites
Challenges • Lack of experience/skill set • No prior benchmarking data
Consequences of failure • Missed opportunities • Discouraged users • Loss of confidence
The Framework
1. Planned Successfully: Test early, often
Profile • Companies that are experienced with resizing exercises • Allocate 3+ weeks for resizing exercises combined with load testing • Don’t underestimate administrative challenges
2. Planned Unsuccessfully: Best Effort Not Enough
Profile • Companies that plan to handle it themselves but don’t have the “crisis” speed skill set • Web teams that have no prior experience manually scaling servers • Web teams who don’t have a triage plan in place for evaluating application v. infrastructure failures • Companies that are unlucky
3. Unplanned: “Crisis mode”
Profile • Companies with truly volatile businesses • Mission-critical sites where failure isn’t an option • Web teams that haven’t invested in HA architecture • Web teams that have separate application and infrastructure support
1. Planned Successfully: Test early, often
Profile • Advance notice • Work with our team to develop a plan and load test it
Acquia: • Plan development • Provision resources • Continuous monitoring on the day of the event
The King Center
1. Planned Successfully: Test early, often
The Players
Customer: The King Center
Partner: Palantir, Soasta
Acquia: Sales, Operations, Support
Triage to Resolution: 3 Weeks
2. Planned Unsuccessfully: Best Effort Not Enough
Profile • Advance notice • Tried to plan for the “worst case scenario” • Planning fell short of the worst case scenario
Acquia: • Immediate detection & resolution of infrastructure issues
The BRIT Awards
2. Planned Unsuccessfully: Best Effort Not Enough
The Players
Customer: The BRIT Awards
Acquia: Support, Operations, Cloud Engineering
Triage to Resolution: 20 minutes
3. Unplanned: “Crisis mode”
Profile • No advance notice • Resources not available • Site goes down • Panic
Acquia: • Triage the issue – Code, attack or capacity? • Resolve
Mother Jones
3. Unplanned: “Crisis mode”
The Players
Customer: Mother Jones
Partner: New Eon Media
Acquia: Operations, Cloud Engineering, Support, Sales
Triage to Resolution: 2 months (code base, Drupal upgrade
Foreign Policy
3. Unplanned: “Crisis mode”
The Players
Customer: Foreign Policy
Acquia: Operations, Cloud Engineering, Sales
Al Jazeera
3. Unplanned: “Crisis mode”
The Players
Customer: Al Jazeera
Acquia: Support, Operations, Sales
Triage to Resolution: 12 Hours
Al-Masry
3. Unplanned: “Crisis mode”
The Players
Customer: Al-Masry
Acquia: Support, Operations
Triage to Resolution: 1 Day
The Acquia Triage Checklist
Determine nature of the problem • Check monitoring • Check logs
Mitigate problem
Code • Roll back or remediate
Attack • DOS – Block offending IP • DDOS – Bring in DOSarrest
Resize
Automatic: • Server HA • Web/DB failover
Manual: • Clone site for internal testing (Nagios) • Increase size of DB • Faster load balancers • Larger Varnish Page Caching • File system updates (GlusterFS) • Increase web servers
10 to 30 minutes
30 minutes to 2+ hrs
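The checklist above can be sketched as a small decision routine. This is only an illustration of the "code, attack or capacity?" triage order, not Acquia's actual tooling: the `Signals` fields, the thresholds, and the function names are all hypothetical, and the mitigation strings are taken from the slide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Cause(Enum):
    CODE = "code"
    ATTACK = "attack"
    CAPACITY = "capacity"


@dataclass
class Signals:
    """Hypothetical inputs gathered from monitoring and logs."""
    recent_deploy: bool   # was code released shortly before the incident?
    error_rate: float     # fraction of requests ending in application errors
    top_ip_share: float   # fraction of requests from the single busiest IP
    hostile_ips: int      # number of IPs flagged as abusive in the logs


def diagnose(s: Signals) -> Cause:
    """Step 1: determine the nature of the problem (thresholds are illustrative)."""
    if s.hostile_ips > 0 or s.top_ip_share > 0.5:
        return Cause.ATTACK
    if s.recent_deploy and s.error_rate > 0.05:
        return Cause.CODE
    # Nothing points at code or an attack, so treat it as a capacity problem.
    return Cause.CAPACITY


def mitigate(cause: Cause, s: Signals) -> List[str]:
    """Step 2: map the diagnosis to the mitigation options listed on the slide."""
    if cause is Cause.CODE:
        return ["Roll back or remediate"]
    if cause is Cause.ATTACK:
        if s.hostile_ips > 1:
            return ["Bring in a DDoS mitigation service"]
        return ["Block offending IP"]
    return [
        "Increase size of DB",
        "Faster load balancers",
        "Larger Varnish page caching",
        "File system updates (GlusterFS)",
        "Increase web servers",
    ]


if __name__ == "__main__":
    signals = Signals(recent_deploy=False, error_rate=0.01,
                      top_ip_share=0.02, hostile_ips=0)
    cause = diagnose(signals)
    print(cause.value, mitigate(cause, signals))
```

Checking for an attack before checking for a code problem is a design choice of this sketch; a real triage would weigh the evidence from monitoring and logs rather than apply fixed thresholds.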
Low Cost, Flexible, Reliable Platform Features
Application Lifecycle Management: Customized environment, Analyze, Code management, Workflow, Cloud migration
Application Network Services: Search, Spam, Insight, Mobile, Functional testing, Marketing testing, Load testing, Runtime reporting
World Class Application Support: 24/7 break-fix, Advisory support, Technical account managers, Audits: Site, security, performance
Platform-as-a-Service Stack
Underlying Elastic Technology Stack
Page Caching Load Balancing
PHP
Web Servers
Caching
Drupal Modules
International Data Centers Amazon AWS
Caching Load Balancer
Drupal Application Servers
Data Services
Secure Infrastructure
Each layer is composed of multiple redundant servers. If one fails, there is little or no downtime.
Memcache Email
MySQL File Storage
Monitoring Backups
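The effect of redundancy within each layer can be illustrated with simple arithmetic. The sketch below assumes servers fail independently and that a layer stays up as long as at least one of its servers is up. Both are simplifying assumptions, and the 99% per-server figure is purely illustrative, not a number from this presentation.

```python
from typing import Iterable, Tuple


def layer_availability(server_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent servers is up."""
    return 1.0 - (1.0 - server_availability) ** replicas


def stack_availability(layers: Iterable[Tuple[float, int]]) -> float:
    """Probability that every layer is up; each layer is (availability, replicas)."""
    total = 1.0
    for server_availability, replicas in layers:
        total *= layer_availability(server_availability, replicas)
    return total


if __name__ == "__main__":
    # Four layers (for example cache, load balancer, app servers, database),
    # each built from servers that are individually up 99% of the time.
    single = stack_availability([(0.99, 1)] * 4)
    redundant = stack_availability([(0.99, 2)] * 4)
    print(f"one server per layer:  {single:.4f}")
    print(f"two servers per layer: {redundant:.4f}")
```

Under these assumptions a stack of single servers is only as reliable as the product of its parts, while doubling each layer pushes the combined figure much closer to 1.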
Multi-region replication & failover
For Back-ups across Borders
• Acquia can deploy instances in any of these Amazon EC2 regions: - US East - US West - Europe - Singapore - Japan
• Who is this for? - Organizations that see significant risk in hosting their sites out of one geographic location
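A minimal sketch of the failover idea: serve from the first healthy region in a preferred order. The region names come from the slide. The function name, the health map, and the ordering are hypothetical, and the sketch does not describe how Acquia actually replicates data or detects failures.

```python
from typing import Dict, List, Optional

# Regions listed on the slide; the preference order here is only an example.
REGIONS: List[str] = ["US East", "US West", "Europe", "Singapore", "Japan"]


def choose_active_region(preferred_order: List[str],
                         healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first region reported healthy, or None if none is available.

    A region missing from `healthy` is treated as unhealthy.
    """
    for region in preferred_order:
        if healthy.get(region, False):
            return region
    return None


if __name__ == "__main__":
    # Hypothetical health-check results: the primary region is down.
    health = {"US East": False, "US West": True, "Europe": True}
    print(choose_active_region(REGIONS, health))
```

Treating an unreported region as unhealthy is a deliberately conservative choice: traffic is never sent somewhere the health checks have not confirmed.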
Lessons Learned
How can I be successful?
You need elastic infrastructure
You need scaling automation
You need a team that can do diagnosis
You need 24X7 support
Engage Acquia early and often
1. Planned Successfully: Test early, often
2. Planned Unsuccessfully: Best Effort Not Enough
3. Unplanned: “Crisis mode”
Conclusion
Acquia won’t let you fail
We have the talent & infrastructure in place to ensure you’re successful
We’ll find the needle in a haystack, and ensure your best day will never be your worst
Predictable outcomes for unpredictable businesses
For more information about Managed Cloud
• Check out our website: http://www.acquia.com/products-services/acquia-managed-cloud
• Speak to a Sales rep
Questions • For more information visit:
http://www.acquia.com
• Contact us: [email protected] or 888.9.ACQUIA • Follow us: @acquia
• Comments welcome: • [email protected] • [email protected]
Today's webinar recording will be posted to: http://acquia.com/resources/recorded_webinars