Managing RightScale on RightScale
1 Managing RightScale on RightScale Rafael H. Saavedra VP of Engineering
TRANSCRIPT
- 1. Managing RightScale on RightScale
Rafael H. Saavedra
VP of Engineering
- 2. Topics
RightScale managed by RightScale
Meta, production, staging & development
An overview of the production system
Quis Custodiet Ipsos Custodes
Deploying RightScale best practices
What we love about using RightScale
Features that are difficult to use
- 3. RightScale: Cloud Management Platform
RightScale Production
Customer A
Customer D
Customer B
Customer C
- 4. RightScale: Cloud Management Platform
RightScale Meta Production
RightScale Production
RightScale Staging
RightScale Development
Customer A
Customer D
RightScale Development
- 5. A multitude of RightScale systems
Meta Production currently lives outside the cloud
Used only to manage the production system
Only RightScale ops accounts
Production: my.rightscale.com
Reaching 200 servers, a large fraction in EC2 us-east
Servers in every cloud to achieve high availability
Servers allocated in well-defined availability zones
A few staging systems used for integration and QA
Ad hoc systems for performance testing, demos, betas
Many development systems with simplified configurations
A development system at the click of a button
- 6. Significant increase in cloud usage
- 7. Some interesting RightScale numbers
1.65M servers launched by RightScale
RightScale continuously monitors more than 60k servers
Every day at RightScale:
2,000 array resize actions are executed
35,000 alert escalations are triggered
20,000 escalation emails are sent to users
9.0 TB of monitoring data is exchanged with our servers
1.6 TB of logging data is sent to our servers
- 8. RightScale production simplified
others
Main App
Front Ends
logging
API
dashboard
databases
daemons
DB Master
monitoring
DB Slave
mirrors
- 9. What is it that our users do?
Dashboard, API, monitoring graphs & event notifications
Most of the requests are monitoring updates 85% (70%)
Dashboard and API represent 7% of requests but 26% of traffic
- 10. We eat our own dog food
Production servers organized into independent deployments
Core servers: frontends, core/api servers, databases, daemons
- 11. We eat our own dog food
Extensive use of security groups to isolate servers
ServerTemplates are maintained for each major release
Ability to launch exact configurations of past versions
- 12. Monitoring, alerts & escalations
Monitor as much as possible of what is relevant, and display it in insightful ways
The need to quickly detect patterns and abnormalities
Proactively eliminate the conditions that raise critical alerts
No broken windows policy
APIs, Cores
- 13. Quis Custodiet Ipsos Custodes?*
The need to monitor the monitoring and alerting systems
Extensive use of alerts to monitor the responsiveness of all the RightScale servers
Instance and EBS failures give us headaches
Decoupling the meta & production monitoring and alerting systems
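One way to picture "monitoring the monitoring" is a small watchdog that runs on a separate system and flags monitoring servers that have gone quiet. The sketch below is only an illustration under stated assumptions: the class name `HeartbeatWatchdog`, the server names, and the 60-second threshold are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatWatchdog:
    """Tracks the last heartbeat of each monitoring server (hypothetical helper)."""

    max_silence_s: float  # assumed threshold, not a figure from the talk
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, server: str, now_s: float) -> None:
        """Note that `server` reported in at time `now_s` (in seconds)."""
        self.last_seen[server] = now_s

    def stale(self, now_s: float) -> List[str]:
        """Return the servers that have been silent longer than the threshold."""
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now_s - seen > self.max_silence_s
        )


watchdog = HeartbeatWatchdog(max_silence_s=60.0)
watchdog.record("monitoring-1", now_s=0.0)
watchdog.record("monitoring-2", now_s=50.0)
print(watchdog.stale(now_s=90.0))  # prints ['monitoring-1']
```

Running the watchdog outside the system it observes mirrors the decoupling of meta and production mentioned above: if the production alerting path fails, the check can still report it.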
* Who watches the watchmen?
- 14. How to monitor hundreds of servers?
Starting to use stacked graphs & heat maps
The need to quickly detect patterns and abnormalities
- 15. Our favorite RightScale features
RightImages: never again the need to build custom images
Input inheritance: makes it easy to keep the configurations of dozens of servers in sync
ServerTemplates: very easy to reproduce configurations in production, staging and development
The Library: there is always an example of something new that can be adapted to our needs
Monitoring: easy to write collectd plugins to monitor just about anything
- 16. Our not so favorite features
ServerTemplate inputs: powerful but too many of them make templates difficult to use
Revision management: still a way to go in making users aware of new revisions and versions, and of how to update
The Library: checking out new resources from the Library is not easy
Alerts: they work pretty well but are not easy to configure, particularly custom ones
- 17. Best practices: upgrading RightScale
Avoid upgrading existing servers; instead launch fresh ones with new software (fail forward)
Not possible on some components, e.g. monitoring servers, which are in the hundreds
The cost of duplicating servers is minimal
Old servers can take over in case something goes wrong
Launch additional slaves to capture recovery points
One slave continues to replicate in case of master failure
Another slave is frozen at the upgrade point, so we can roll back by failing over to it
Don't forget to take snapshots in case of major failure
- 18. Upgrading RightScale: Step by Step
Front Ends
Main App
Main App
servers with new code
take snapshot at cutoff
Databases
DB Master
DB Slave
DB Slave
stop replication
cut access to site
stop all access to DB
servers with old code
- 19. Upgrading RightScale: Step by Step
Front Ends
Main App
Main App
servers with new code
snapshot at cutoff
reconnect all servers
Databases
DB Master
DB Slave
DB Slave
stop replication
servers with old code
- 20. Upgrading RightScale: Step by Step
Front Ends
Main App
Main App
servers with new code
reconnect all servers
open access to site
Databases
DB Master
DB Slave
DB Slave
servers with old code
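The three step-by-step slides can be read as one ordered procedure with a rollback path. The sketch below is a rough illustration, not the actual RightScale tooling. The step names are taken from the slide labels, but their exact order is an assumption, and `run_upgrade` and its `execute` callback are hypothetical names introduced here.

```python
from typing import Callable, List, Sequence, Tuple

# Step names follow the slide labels; the ordering is an assumption.
UPGRADE_STEPS: Tuple[str, ...] = (
    "launch servers with new code",
    "cut access to site",
    "stop all access to DB",
    "stop replication on one slave",
    "take snapshot at cutoff",
    "reconnect all servers",
    "open access to site",
)


def run_upgrade(
    execute: Callable[[str], bool],
    steps: Sequence[str] = UPGRADE_STEPS,
) -> Tuple[bool, List[str]]:
    """Run each step in order; stop and record a rollback on the first failure.

    `execute` is a hypothetical callback that performs one step and returns
    True on success. The returned log lists every step that was attempted.
    """
    log: List[str] = []
    for step in steps:
        log.append(step)
        if not execute(step):
            # Simplified rollback: the old servers and the frozen slave take over.
            log.append("roll back: fail over to frozen slave and old-code servers")
            return False, log
    return True, log


ok, log = run_upgrade(lambda step: True)
print(ok, len(log))  # prints: True 7
```

Keeping the old-code servers and the frozen slave untouched until the site is reopened is what makes the rollback branch cheap: nothing has to be rebuilt, only traffic is pointed back.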