continuous deployment into the unknown with artifactory, bintray, docker and mesos

Click here to load reader

Post on 15-Jan-2017




1 download

Embed Size (px)


Continuous Deployment into the Unknown with Artifactory, Bintray, Docker and!! Mesos

Continuous Deployment into the Unknown with Artifactory, Bintray, Docker and Mesos

Gilad GaronKiril Nesenko

2015 VMware Inc. All rights reserved.


AgendaWhat is the Common SaaS Platform (CSP)CI/CD processes for CSPUpgrading CSP Xenon - Distributed Control Plane (If we have the time)2

Who are we ?3

Kiril NesenkoDevOps [email protected] Gilad [email protected] , Twitter @giladgaron

Hi,My name is Gilad and along here with is Kiril and we are a part of Vmwares CPSBU or Cloud provider software business unit which a fancy way of saying the we build software for cloud providers.3

VMwares SaaS TransitionVMware is developing many SaaS offerings Many services have the same common requirements (Billing, Identity, etc.)Like other good engineers, we like to reuse code wherever possibleVMwares Common SaaS Platform (CSP) is platform that internal SaaS offerings are using to leverage existing internal components


Vmware is transitioning from a product based company to a services based company.More and more teams are developing services, and need to interact with internal backoffice system such as identity and billing.As development moved forward, weve noticed two things:No one like to write integrations with billing or identity developers prefer to write services! Not integrationsEvery service implements its integrations in its own way, and if different services wants to share this integration, most of the time its too domain specific

Like all good engineers we want to share code and not waste time on reinventing the wheel.So, our main goal with CSP is to create a platform that will enable acceleration of internal services development and standardize the way a service interacts withthe various intergations


Designing a SaaS platformDesign Principles


Cloud Agnostic

Highly Available


Great Public APIs


In Practice

Infrastructure needs to support containers

Dynamic, Stateful and Distributed cluster

Tunable consistency helps to achieve availability & scalability

No internal APIs

Capabilities as libraries, Coupling is done with APIs

Ease of operability / development

Single JAR, limited classpath dependencies set

How do you design such a platform? When designing CSP weve decided on a set of design principles:1. Run on any infrastructure2. High availability self explanatory3. Scalable support N nodes4. Public APIs dogfooding we believe that a good API experience is only achievable when you consume your own APIs5. Modular add capabilities to the platform easily and be able to not use certain capabilities6. Ease of operability / development try to limit the tech zoo, and be able to run the platform with a single click

How does it looks in practice?Our lowest common denominator is container support. If a provider can support containers, we can run on it.Our platform is distributed and Stateful. we use tunable consistency in which most of our data is eventually consistentIn order to be scalable, we use gossip or to me more precise, SWIM protocol to be highly availableNo internal APIs, if you dont have them, you need to consume the public onesOur capabilities or modules are just jars in the class path. Coupling between modules is done at the public API levelOur executable is a JAR, not a web / application server which is easy on development and operations. We limited our tech zoo to technologies that are aligned with our design principles.

Most of these principles are provided by Vmwares own Xenon framework, a distributed control plane. More on xenon in a few seconds. When we sticked to our guns with the design princples (and it wasnt easy) we had a big win:


Deployment Architecture. yep thats it.6

Xenon Host JarContainer

Xenon Host JarContainer

Xenon Host JarContainer

Xenon Host JarContainer

Some Cloud Provider Inc.

When deployed in production, CSP looks like this. (also in Dev) the number of nodes can scale. A lot.How did we achieve this? Vmwares xenon framework6

Infrastructure and Patch Life Cycle

CI/CD Overview8

Customer 1Customer NCustomer 2





CSP Mesos Infrastructure


CI/CD ToolsArtifacts: Artifactory, BintrayCI: JenkinsSource Control: gitCode review: gerritSlaves: dockersInfrastructure: mesos, dockersCode Analysis: SonarBuild: gradle, MakefilesLanguages: Java, JS, Python, GoCommunication: Slack10

CI Infrastructure~300 jenkins jobs20 git repositoriesOn the fly jenkins slavesJenkins and Slack integrationMesos cluster (Marathon, marathon-lb, mesos-dns, Calico, chronos)11

Jenkins Jobs Management

Jenkins Job Builder13

Jenkins job builder to the rescue!

Jenkins Job BuilderDeveloped by OpenStack folksConfiguration as code (yaml format)Easy to review changesConfiguration de-duplicationInclude shell/groovy/python scriptsTest before deployingEasier to organize (per directory, per file)Serves as backup (easy to replicate to another jenkins)




TemplatesFor nearly identical jobs better to use templates17


Jobs Update19


Jenkins Jobs TypesGating listens for patch-set-created eventsBuild for building purposes (gradle, docker etc)Listeners listens for change-merged events on gerrit (orchestrators for the pipelines)21

Gating JobsFor each patch we run a gating jobEach git project has its own gating jobBuild + test + post results to gerrit22

Gating Jobs23

Developer sends a patchRun build and tests(gating)Post results to gerritMerge ?Start build pipeline(listener)


Gerritweb-based code review tool built on top of the git24

Jenkins Failure25


Sonar Failure26

Gerrit FailureGerrit hooksExecuted on the server sideExecute per event typeVarious checks: commit message style, trailing white spaces, etc.Integrations with external systems: bugzilla, jira, etc.27


Dynamic Pipelines

Listener JobsExecuted on patch-merged eventOrchestrating the build and delivery pipeline dynamicallyOrchestration done via the BuildFlow plugin (groovy)All listeners run the same code baseOn failure, user is notified on slack channel30



32Dynamic FlowsCONFIDENTIAL32

Listener - 1Listener - 2Listener - n








Parallel Deployments33






Upgrading a Stateful platformGoals:Minimal service interruptionsSupport schema changes

Challenges:Symmetrical cluster: Cant refactor / add API pathsState & Business Logic in the same tier: cant separate schema upgrade from BL changes


So how do we upgrade our customer envs? Upgrading services to a new version is not a new concept, All of us are familiar with the popular strategiesRolling upgrade inside an existing clusterBlue/GreenEven hybrid solutions existsWe had two main goals when designing the upgrade mechanism, other than the oblivious one of actually upgrading the code base:We must support schema transformation (renaming of fields) since adding or subtracting fields is free in Xenon.The other goal is that the customer should not feel service interruptionsCSP has some challenges that needed to be addressed when we designed our upgrade mechanism:CSP is stateful and the state and the business logic reside together in the same tier. This causes a challenge when considering a rolling upgrade.You cant seprate the schema changes and the business logic changes since they both reside in the same jar.And you you cant modify API paths and or logic since our cluster is symmetrical.So what did we do?


Upgrading a Stateful platformDesign:Work in cycles, get meaningful metrics per cycleEach cycle migrates and transforms stateUse a Threshold to determine progress and cutoff pointSmartly queue external trafficReroute traffic to new cluster


Since rolling upgrades are not easily achievable for now, we went with a green / blue strategy.Our goal here is to migrate most of the data while the platform is live. Once the migration is almost done, we queue the incoming traffic,copy the remaining data, and then reroute the traffic to the new cluster.In order to achieve that, we run in cycles. When a cycle is finished, we examine its telemetry and pass it to a threshold mechanism.The thresholds mechanism purpose it to determine whether it is safe to queue the external traffic and migrate the remaining data.If the last cycle took too long, we start a new cycle picking up from where the last cycle finished in terms of state. (the platform is still live so data is modified in runtime and we need to address these changes)So, we migrate, check and repeat until weve crossed a certain threshold. Once the threshold is crossed we queue the traffic, perform a finalCycle and reroute the traffic.Lets see an example.38

Start Migration39NodeNodeNodeNodeNodeNode

Blue NodeGroup

Green NodeGroupCreate Green ClusterDiscover StateMigrate & TransformCycleCheck ThresholdPull State{ documents:15M, duration:25S}{ documents:15M, duration:25S}Migrate & TransformCyclePull State{ documents:6M, duration:5S}{ documents:6M, duration:5S}Check ThresholdMigrate & TransformCyclePu