War Stories from Building a Public Cloud

QCon New York - 2015

Amila Maharachchi

Senior Tech Lead, WSO2 Inc.

[email protected]

Post on 28-Jul-2015


WSO2 Cloud

http://wso2.com/cloud

Motivation

o History
o StratosLive
o Stratos -> Donated to Apache
o Wanted to provide a better user experience
o WSO2 API Manager was becoming a hot product
o WSO2 AppFactory was in the making

Beginning

o Two clouds
  o App Cloud
    ■ Powered by WSO2 AppFactory
  o API Cloud
    ■ Powered by WSO2 API Manager
o In beta for nearly two years
o API Cloud is commercial now

War Stories

Why it is a war :)

o It is not a war, but
o More than 100 instances
o Can't let a bug live too long
o Need to upgrade frequently
o Customer issues/questions
o We depend on other WSO2 products, but...

Will be sharing the experience on..

o Customizations and new developments
o Planning the deployment
o Configuration management
o Monitoring and alerts
o Bug fixes, upgrades and migrations
o Security
o Backups and restoration
o Statistics
o Feedback and customer support
o Performance issues
o Processes

Customizations and new developments

o New user model
  o Requirement of plugging in the wso2.com userstore
  o Ease of registering and working in organizations
  o Wrote our own userstore implementation
o Management app
  o Two clouds (and more in the future) to be centrally managed

Customizations and new developments

[Figure: Old Model vs. New Model. Old Model: user1@tenant1, user2@tenant2, user3@tenant3. New Model: a single email identity linked to Tenant 1, Tenant 2 and Tenant 3.]
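
The change of user model can be sketched as a change of data structure. This is only an illustration in Python, not the WSO2 userstore implementation; the class and method names (OldUserStore, NewUserStore, add_membership, tenants_of) are hypothetical.

```python
from collections import defaultdict


class OldUserStore:
    """Old model: a login name is only meaningful inside one tenant."""

    def __init__(self):
        self._users = set()  # entries like "user1@tenant1"

    def add_user(self, username, tenant):
        self._users.add(f"{username}@{tenant}")

    def tenants_of(self, username):
        # The same person needs a separate account in every tenant.
        return sorted(u.split("@", 1)[1] for u in self._users
                      if u.split("@", 1)[0] == username)


class NewUserStore:
    """New model: one email identity can belong to many tenants."""

    def __init__(self):
        self._memberships = defaultdict(set)  # email -> {tenant, ...}

    def add_membership(self, email, tenant):
        self._memberships[email].add(tenant)

    def tenants_of(self, email):
        return sorted(self._memberships.get(email, ()))


store = NewUserStore()
for tenant in ("tenant1", "tenant2", "tenant3"):
    store.add_membership("someone@example.com", tenant)
print(store.tenants_of("someone@example.com"))
```

With the email as the key, registering once and then joining further organizations becomes a membership update rather than a new account.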

Planning the deployment

o AWS as the IaaS
  o Previous experience in running a cloud in our own infrastructure
  o We are not specialized in maintaining data centers
  o So, why waste our time
o EC2, VPC, RDS, Route 53
o High availability

Configuration Management

o We had previous experience with our own solution
o We were also playing with Puppet
o Some facts to consider
  o AWS instances are shut down for maintenance
  o Necessity of scaling
  o Setting up multiple environments
o Decided to go with Puppet
o We manage more than 100 nodes now
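
The reason configuration management helps with rebuilt instances, scaling and multiple environments is that one template plus per-environment data can regenerate any node's configuration. The sketch below shows that idea in plain Python; it is not Puppet code, and the template keys, host names and environment names are all made up for illustration.

```python
from string import Template

# Hypothetical template for one config file; the keys are illustrative only.
CONFIG_TEMPLATE = Template(
    "db.url=jdbc:mysql://$db_host:3306/cloud\n"
    "hostname=$hostname\n"
)

# Hypothetical per-environment values.
ENVIRONMENTS = {
    "staging": {"db_host": "db.staging.internal",
                "hostname": "staging.example.com"},
    "production": {"db_host": "db.prod.internal",
                   "hostname": "cloud.example.com"},
}


def render_config(environment):
    """Render the shared template with one environment's values."""
    # substitute() raises KeyError when a value is missing, so an
    # incomplete environment definition fails loudly, not silently.
    return CONFIG_TEMPLATE.substitute(ENVIRONMENTS[environment])


print(render_config("staging"))
```

Because the output depends only on the template and the environment data, a node that AWS shuts down for maintenance can be replaced by rendering the same configuration again.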

Monitoring & Alerting

o Three types of monitoring were needed
  o Health of the instances and JVM processes
    ■ SNMP, Nagios
    ■ Emails, phone alerts etc.
  o Functionality health
    ■ Our own heartbeat monitoring tool
    ■ Improved to track the uptime as well
    ■ We keep adding more tests
  o Logs
    ■ To smell trouble
    ■ Logstash and Kibana from https://www.elastic.co/
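
A heartbeat tool of the kind described for functionality health could look roughly like the following. This is a minimal sketch under assumptions, not the tool used in WSO2 Cloud: the HeartbeatMonitor class, its method names and the sample checks are hypothetical, and real checks would call the live services.

```python
import time


class HeartbeatMonitor:
    """Runs named functional checks and tracks the share that passed."""

    def __init__(self):
        self._checks = {}    # name -> zero-argument callable returning bool
        self._history = {}   # name -> list of (timestamp, passed)

    def add_check(self, name, check):
        self._checks[name] = check
        self._history[name] = []

    def run_once(self, now=None):
        """Run every check once; return the names of the failed checks."""
        now = time.time() if now is None else now
        failed = []
        for name, check in self._checks.items():
            try:
                passed = bool(check())
            except Exception:
                passed = False  # a crashing check counts as a failure
            self._history[name].append((now, passed))
            if not passed:
                failed.append(name)
        return failed

    def uptime_percent(self, name):
        """Percentage of recorded runs in which the check passed."""
        results = self._history[name]
        if not results:
            return None
        return 100.0 * sum(1 for _, ok in results if ok) / len(results)


monitor = HeartbeatMonitor()
monitor.add_check("login", lambda: True)        # stand-in for a real test
monitor.add_check("create_api", lambda: False)  # stand-in for a failing test
print(monitor.run_once())
print(monitor.uptime_percent("login"))
```

Adding a test is one add_check call, and the failed names returned by run_once are what an alerting step would send out.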

Bug fixes, Upgrades & Migration

o Bug fixing
  o We can't let a bug exist in the live system
  o We are a customer of WSO2 :-)
  o Get patches from WSO2 Support
o Upgrades & Migrations
  o Deploy AppFactory milestones every 2 or 3 weeks
  o Some end up needing migrations
  o Have our own ways
o Target
  o Continuous deployment

Bug fixes, Upgrades & Migration

[Figure: Load Balancer in front of two setups, labeled "Serving the traffic" and "Doing the migration".]
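
Only the labels of this diagram survive in the transcript, so the following is an interpretation rather than the documented procedure: one setup keeps serving traffic while the other is migrated, and the load balancer is switched once the migrated setup has been verified. All names (LoadBalancer, migrate_and_switch, the migrate and verify callables) are hypothetical.

```python
class LoadBalancer:
    """Routes all traffic to one deployment while another is migrated."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def route(self, request):
        return f"{self.active} handled {request}"

    def migrate_and_switch(self, migrate, verify):
        """Migrate the standby; switch traffic only if verification passes."""
        migrate(self.standby)
        if not verify(self.standby):
            return False  # keep serving from the untouched deployment
        self.active, self.standby = self.standby, self.active
        return True


lb = LoadBalancer("setup-A", "setup-B")
print(lb.route("GET /"))
lb.migrate_and_switch(migrate=lambda d: None, verify=lambda d: True)
print(lb.route("GET /"))
```

The point of the arrangement is that a failed migration never touches the setup that is serving users.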

Security

o Access to infrastructure
o Public/Private access for services
  o Which service/product should be exposed publicly/privately
o Securing users' data
  o Native multi-tenancy support
  o Java security manager enabled
    ■ Very strict at the moment

Backups & Restoration

o Any user artifact is in Git or SVN
o Snapshots taken
  o RDS
  o EBS
  o Automated via AWS facilities
o LDAP backups

Statistics

o We need to know what is happening
  o How many users are using this daily
  o How far they go
  o Understand our UX
o Publish stats to WSO2 BAM on various user activities
o Run analytic scripts
o Uptime tracking
o Also refer to Logstash stats
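
The two questions above (how many users are active daily, and how far they go) can be answered from a stream of user activity events. The sketch below is an assumed shape for such an analytic script, not the scripts run against WSO2 BAM; the event format, the FUNNEL steps and the sample events are made up for illustration.

```python
from collections import defaultdict

# Hypothetical funnel: the ordered steps a new user is expected to take.
FUNNEL = ["signup", "login", "create_app", "deploy_app"]


def daily_active_users(events):
    """events: iterable of (date, user, activity). Returns {date: count}."""
    users_by_day = defaultdict(set)
    for day, user, _activity in events:
        users_by_day[day].add(user)
    return {day: len(users) for day, users in users_by_day.items()}


def furthest_step(events):
    """Return {user: furthest FUNNEL step that user reached}."""
    furthest = {}
    for _day, user, activity in events:
        if activity not in FUNNEL:
            continue  # ignore activities outside the funnel
        idx = FUNNEL.index(activity)
        if idx > furthest.get(user, -1):
            furthest[user] = idx
    return {user: FUNNEL[idx] for user, idx in furthest.items()}


# Made-up sample events, for illustration only.
events = [
    ("2015-06-01", "alice", "signup"),
    ("2015-06-01", "alice", "create_app"),
    ("2015-06-01", "bob", "signup"),
    ("2015-06-02", "alice", "deploy_app"),
]
print(daily_active_users(events))
print(furthest_step(events))
```

Users who stall at an early step are the ones whose experience is worth examining first.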

Feedback and customer support

o There was no way for the users to contact us
o Provided a few methods
  o Contact us menu at the top
    ■ Via StackOverflow
    ■ Via email - which will automatically create a JIRA
o Improved our customer service
  o Monitor the dashboard
  o Keep the user informed regularly

Performance

o Identified and fixed several performance issues
o Changed some architectures as well
o Several issues
  o Registry related issues
  o File system related issues
  o Had problems when the number of tenants was growing
  o Now we have cleanup mechanisms in place
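
The slides do not say what the cleanup mechanisms are. As one assumed illustration of the general idea, the sketch below unloads tenants that have been idle for longer than a threshold, so that resource usage follows active tenants rather than total tenants. The TenantCache class, its methods and the idle threshold are hypothetical.

```python
import time


class TenantCache:
    """Keeps loaded tenants in memory and unloads the ones left idle."""

    def __init__(self, idle_seconds):
        self.idle_seconds = idle_seconds
        self._last_used = {}  # tenant -> timestamp of last access

    def touch(self, tenant, now=None):
        """Record that a tenant was just used (loading it if needed)."""
        self._last_used[tenant] = time.time() if now is None else now

    def cleanup(self, now=None):
        """Unload tenants idle for longer than idle_seconds; return them."""
        now = time.time() if now is None else now
        idle = [t for t, ts in self._last_used.items()
                if now - ts > self.idle_seconds]
        for tenant in idle:
            del self._last_used[tenant]
        return sorted(idle)

    def loaded(self):
        return sorted(self._last_used)


cache = TenantCache(idle_seconds=60)
cache.touch("tenant1", now=0)
cache.touch("tenant2", now=100)
print(cache.cleanup(now=120))
print(cache.loaded())
```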

Processes

o If the same mistake happens more than once, it's negligence.
  o But, people do make mistakes
o Processes are the best way to minimize them
  o Applying patches
  o Making a config change
  o Monitoring logs for errors
  o Supporting users
o Checklists
  o In upgrades

Future Wars...

Integration Cloud

Device Cloud

Keep in touch

o [email protected]

o Twitter
  o @maharachchi

Contact us !