Acquia Managed Cloud: Highly Available Architecture for Highly Unpredictable Traffic

Kieran Lal, Technical Director, Acquia
Jess Iandiorio, Sr. Director, Cloud Product Marketing, Acquia

January 19th, 2012



Your Drupal Application Life Stages

Set-up/Launch
•  Build: load balancers, fast page cache, app servers, database, file systems, web servers, app configuration, HA architecture
•  Deploy: integrated Git/SVN, drag and drop content management

Production
•  Application updates: Drupal app code
•  Infrastructure updates: OS, debugging, security
•  Operations: 24X7 monitoring & alerts, backups, load testing

Crisis
•  Diagnosis: site failure, infrastructure failure, application errors
•  Resolution: resize, launch new virtual servers, multi-region failover

Capacity Planning Options

[Chart: users hitting your site, by month, Jul to Dec]

Options
•  Option 1, Over Plan: Over Pay
•  Option 2, Under Plan: Expect Outages
•  Option 3, Acquia Plan: No Failure
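The trade-off between the first two options can be illustrated with a small sketch. The monthly peak figures, the capacity values, and the function name below are all hypothetical, chosen only for illustration; the chart on the slide does not supply usable numbers.

```python
# Hypothetical monthly peak load (requests per second); NOT taken from the slides.
MONTHLY_PEAK = {"Jul": 120, "Aug": 150, "Sept": 140, "Oct": 900, "Nov": 200, "Dec": 180}


def evaluate_fixed_capacity(capacity, monthly_peak):
    """Return (months_with_outage, total_unused_capacity) for a fixed capacity."""
    # A month counts as an outage when its peak exceeds the provisioned capacity.
    outages = [month for month, peak in monthly_peak.items() if peak > capacity]
    # Unused capacity is summed only over months that stayed within capacity.
    unused = sum(capacity - peak for peak in monthly_peak.values() if peak <= capacity)
    return outages, unused


over_plan = evaluate_fixed_capacity(1000, MONTHLY_PEAK)  # sized for the spike
under_plan = evaluate_fixed_capacity(250, MONTHLY_PEAK)  # sized for a typical month
print("Over plan:", over_plan)
print("Under plan:", under_plan)
```

With these made-up numbers, sizing for the spike avoids outages but leaves most capacity idle, while sizing for a typical month wastes little but fails in the spike month.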

Unpredictable Traffic Victims

Events Businesses
•  Challenges: plagued by prior event stats; failure extends beyond web
•  Consequences of failure: sales (tickets); brand damage; missed donation opportunities

News/M&E Organizations
•  Challenges: you never know when you’ll be “Huff Po’d”; time-to-market is critical
•  Consequences of failure: loss of credibility; readership; contractual failures per advertising agreements; impact to the ad sales cycle

High Growth Sites
•  Challenges: lack of experience/skill set; no prior benchmarking data
•  Consequences of failure: missed opportunities; discouraged users; loss of confidence

The Framework

1. Planned Successfully: Test early, often
Profile
•  Companies that are experienced with resizing exercises
•  Allocate 3+ weeks for resizing exercises combined with load testing
•  Don’t underestimate administrative challenges

2. Planned Unsuccessfully: Best Effort Not Enough
Profile
•  Companies that plan to handle it themselves but don’t have the “crisis” speed skill set
•  Web teams that have no prior experience manually scaling servers
•  Web teams who don’t have a triage plan in place for evaluating application v. infrastructure failures
•  Companies that are unlucky

3. Unplanned: “Crisis mode”
Profile
•  Companies with truly volatile businesses
•  Mission-critical sites where failure isn’t an option
•  Web teams that haven’t invested in HA architecture
•  Web teams that have separate application and infrastructure support

Planned Successfully

Profile
•  Advance notice
•  Work with our team to develop a plan and load test it

Acquia:
•  Plan development
•  Provision resources
•  Continuous monitoring on the day of the event

The King Center

The Players
•  Customer: The King Center
•  Partner: Palantir, Soasta
•  Acquia: Sales, Operations, Support

Triage to Resolution: 3 Weeks

Planned Unsuccessfully

Profile
•  Advance notice
•  Tried to plan for the “worst case scenario”
•  Planning fell short of the worst case scenario

Acquia:
•  Immediate detection & resolution of infrastructure issues

The BRIT Awards

The Players
•  Customer: The BRIT Awards
•  Acquia: Support, Operations, Cloud Engineering

Triage to Resolution: 20 minutes

Lilith Fair (RIP)


Unplanned

Profile
•  No advance notice
•  Resources not available
•  Site goes down
•  Panic

Acquia:
•  Triage the issue – Code, attack or capacity?
•  Resolve

Mother Jones

The Players
•  Customer: Mother Jones
•  Partner: New Eon Media
•  Acquia: Operations, Cloud Engineering, Support, Sales

Triage to Resolution: 2 months (code base, Drupal upgrade)

Foreign Policy

The Players
•  Customer: Foreign Policy
•  Acquia: Operations, Cloud Engineering, Sales

Al Jazeera

The Players
•  Customer: Al Jazeera
•  Acquia: Support, Operations, Sales

Triage to Resolution: 12 Hours

Al-Masry

The Players
•  Customer: Al-Masry
•  Acquia: Support, Operations

Triage to Resolution: 1 Day

When Failure is Not an Option

The Acquia Triage Checklist

Determine nature of the problem
•  Check monitoring
•  Check logs

Mitigate problem
•  Code: roll back or remediate
•  Attack
   -  DOS: block offending IP
   -  DDOS: bring in DOSarrest
•  Resize
   -  Automatic: server HA, web/DB failover
   -  Manual: clone site for internal testing (Nagios), increase size of DB, faster load balancers, larger Varnish page caching, file system updates (GlusterFS), increase web servers

Time frames shown on the slide: 10 to 30 minutes; 30 minutes to 2+ hrs
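The mitigation half of the checklist reads naturally as a lookup from the diagnosed problem type to a list of actions. The sketch below is only an illustration of that reading: the dictionary keys and the function name are hypothetical labels, and the steps are transcribed from the slide rather than taken from any Acquia tooling.

```python
# Mitigation steps keyed by problem type, transcribed from the checklist.
# The keys and the function name are hypothetical labels for this sketch.
MITIGATIONS = {
    "code": ["Roll back or remediate"],
    "dos": ["Block offending IP"],
    "ddos": ["Bring in DOSarrest"],
    "capacity": [
        "Automatic: server HA, web/DB failover",
        "Manual: clone site for internal testing",
        "Manual: increase size of DB",
        "Manual: faster load balancers",
        "Manual: larger Varnish page caching",
        "Manual: file system updates (GlusterFS)",
        "Manual: increase web servers",
    ],
}


def mitigation_steps(problem_type):
    """Return the checklist steps for a diagnosed problem type."""
    try:
        return MITIGATIONS[problem_type]
    except KeyError:
        # An unrecognized type means diagnosis is not finished yet.
        raise ValueError(f"unknown problem type: {problem_type!r}") from None


print(mitigation_steps("ddos"))
```

Keeping diagnosis (monitoring, logs) separate from the lookup mirrors the two phases of the checklist: the problem type has to be determined before any mitigation is chosen.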

Low Cost, Flexible, Reliable Platform Features

Application Lifecycle Management
•  Customized environment, Analyze, Code management, Workflow, Cloud migration

Application Network Services
•  Search, Spam, Insight, Mobile, Functional testing, Marketing testing, Load testing, Runtime reporting

World Class Application Support
•  24/7 break-fix, Advisory support, Technical account managers, Audits: Site, security, performance

Platform-as-a-Service Stack

Underlying Elastic Technology Stack

[Diagram: technology stack]
•  Layers: Caching Load Balancer, Drupal Application Servers, Data Services, Secure Infrastructure
•  Components labeled in the diagram: Page Caching, Load Balancing, PHP, Web Servers, Caching, Drupal Modules, Memcache, Email, MySQL, File Storage, Monitoring, Backups
•  Hosting: International Data Centers, Amazon AWS

Each layer is composed of multiple redundant servers. If one fails, there is little or no downtime.
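The redundancy claim on this slide (each layer has multiple servers, so one failure causes little or no downtime) can be sketched as a simple failover rule: route to the first server in the layer that passes a health check. The server names and the health-check callback below are hypothetical; the slides do not describe how Acquia actually performs failover.

```python
def pick_server(servers, is_healthy):
    """Return the first healthy server in a layer, or None if all have failed."""
    for server in servers:
        if is_healthy(server):
            return server
    return None


# Hypothetical web layer with two redundant servers, one of which has failed.
web_layer = ["web-1", "web-2"]
failed = {"web-1"}

print(pick_server(web_layer, lambda server: server not in failed))  # web-2
```

The layer only becomes unavailable when every server in it fails, which is why the function returns None solely in that case.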

Multi-region replication & failover

For Back-ups across Borders
•  Acquia can deploy instances in any Amazon EC2 region:
   -  US East
   -  US West
   -  Europe
   -  Singapore
   -  Japan
•  Who is this for?
   -  Organizations who see significant risk hosting their sites out of one geographic location

Lessons Learned


How can I be successful?

You need elastic infrastructure

You need scaling automation

You need a team that can do diagnosis

You need 24X7 support

Engage Acquia early and often


Conclusion

Acquia won’t let you fail

We have the talent & infrastructure in place to ensure you’re successful

We’ll find the needle in a haystack, and ensure your best day will never be your worst


Predictable outcomes for unpredictable businesses

For more information about Managed Cloud
•  Check out our website: http://www.acquia.com/products-services/acquia-managed-cloud
•  Speak to a Sales rep

Questions
•  For more information visit: http://www.acquia.com
•  Contact us: [email protected] or 888.9.ACQUIA
•  Follow us: @acquia
•  Comments welcome: [email protected], [email protected]

http://acquia.com/resources/recorded_webinars