
Welcome

Highly Available Cloud Foundry Deployment for Allstate

About Me

Henry Sinclair, Senior Platform Engineer for CompoZed Platform at Allstate

Our team engineers & operates the Cloud Foundry platform, hosting infrastructure, and the different services that make up the CompoZed Platform

@HenryMSinclair

Goals

Leave with a shared understanding of:

Foundational infrastructure that underlies the Allstate platform

High availability features designed into the platform

Operational principles that underlie the platform

Highly available deployments

Concepts

Availability Zone: Physical and logical segment of infrastructure with low-latency connectivity (<2 ms), representing an isolated failure domain.

Region: Typically a geographically segmented infrastructure with higher-latency connectivity (<10 ms).

Security Zone: Logical segmentation of infrastructure restricted by escalating security controls, typically mapped to data classification.

Limitations of Cloud Foundry

Cannot deploy a single Cloud Foundry deployment across multiple networks

On VMware vSphere, requires a single management plane (vCenter)

Requires shared storage across virtual machines

Architecture – Availability Zone

Availability Zones
- Represents an isolated failure domain across the infrastructure stack.
- A componentized unit for a highly available deployment.
- Each AZ has independent infrastructure: network switches and L3 networks, firewalls for DMZ and Internal Network, load balancers, storage, servers, and physical cabinets.
- Continuing to integrate "shared" services into the AZ (e.g. AD, DNS, etc.).

[Diagram: two regions, each hosting two availability zones (R1 AZ1 and R2 AZ2 in Region One; G1 AZ1 and G2 AZ2 in Region Two).]

Regions
- Latency within a region is 2 ms; latency across regions is greater than 2 ms.
- Typically represented by a datacenter, but may also be represented by an AWS or Azure region.
- Hosts multiple availability zones.

How do you build an AZ?

Anatomy of an AZ

Start with a physical server, then scale to 13x capacity grouped into clusters. [Diagram: four clusters.]

Compute Cabinet: 37x servers, 756 CPU cores, 14,208 GB memory
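(For scale, 14,208 GB of memory across 37 servers works out to 384 GB per server.)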

Anatomy of an AZ

Add in storage. Connect them with switches. Let's virtualize.

Anatomy of an AZ

Add in some load balancers: a pair for MPN and a pair for DMZ.

Add in some firewalls: a pair for MPN and a pair for DMZ.

Anatomy of an AZ

Build a DMZ/public security zone. Build an Internal Network/restricted and confidential security zone.

Anatomy of an AZ

[Diagram: the AZ's firewalls and load balancers connect outward to the Extranet Core (Internet) and the Internal Core (Allstate).]

[Diagram: the completed build-out, with redundant firewall and load balancer pairs sitting between the Extranet Core (Internet), the Internal Core (Allstate), and the AZ security zones.]

Putting Cloud Foundry on the AZs

[Diagram: each of the four availability zones hosts a Cloud Foundry MPN foundation and a Cloud Foundry DMZ foundation, each with its own storage.]

How does this compare with our datacenter?

Racked and rolled into our datacenters

Isolated failure domains

Portability

Consolidated vs fragmented

Operational Principles

Target Availability: 99.9%
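As a rough yardstick, a 99.9% target allows about 0.1% downtime: roughly 8.8 hours per year, or about 43 minutes per month.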

Platform Maintenance Releases

- Maintenances are executed one AZ at a time.

- An entire AZ (with all apps) is taken out for maintenance; it is not possible to take just one app in an AZ out of traffic while keeping that app in the three other AZs.

- Maintenances are normally routine events that require no developer participation.

Capacity planning is managed as a true cloud-provider experience.

- Capacity is maintained by the platform as an aggregated sum of apps in use and available infrastructure.

Disaster recovery is automatic, with availability derived from the platform.

Architecture – CF

• GSLB resolves to the IP addresses of all availability zones currently in traffic, and traffic is evenly load balanced across those availability zones (see the sketch after these bullets).

• Whole availability zones can be taken out of traffic by the Platform Team.

• Only HTTPS traffic is available in Production.
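As a rough illustration of the GSLB behaviour above (a sketch, not Allstate's actual configuration), the snippet below assumes a hypothetical GSLB-managed hostname: the DNS answer contains one VIP per availability zone in traffic, a client spreads requests across them, and taking a zone out of traffic simply removes its address from the answer.

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
)

// Hypothetical GSLB-managed hostname; each availability zone in traffic
// contributes one VIP (one A record) to the DNS answer.
const appHost = "apps.example.internal"

func main() {
	// Ask DNS (the GSLB) for the VIPs of every AZ currently in traffic.
	ips, err := net.LookupIP(appHost)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}

	// Clients pick one of the returned addresses, so requests spread evenly
	// across AZs; an AZ taken out of traffic simply drops out of the answer.
	vip := ips[rand.Intn(len(ips))]
	fmt.Printf("%d AZ VIP(s) in traffic; sending this request to %s\n", len(ips), vip)
}
```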

CF Application Architecture

Internal Applications
• All applications must be deployed with a minimum of 2 instances per availability zone (8 instances total minimum); a minimal push sketch follows this list.
• MPN Cloud Foundry has data services available, but these are not replicated across availability zones. Use external databases for persistence for now.
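A minimal sketch of that 8-instance minimum in practice, assuming hypothetical foundation API endpoints and the standard cf CLI (cf api, cf push -i); this is not the Allstate tooling, only the shape of the requirement: the same app pushed with 2 instances to each of the four MPN foundations.

```go
package main

import (
	"fmt"
	"os/exec"
)

// Hypothetical per-AZ MPN foundation API endpoints; the real URLs are not in this deck.
var mpnFoundations = []string{
	"https://api.az1.mpn.example.com",
	"https://api.az2.mpn.example.com",
	"https://api.az3.mpn.example.com",
	"https://api.az4.mpn.example.com",
}

const instancesPerAZ = 2 // 2 instances x 4 AZs = the 8-instance minimum

// run shells out to a command and echoes its combined output.
func run(name string, args ...string) error {
	out, err := exec.Command(name, args...).CombinedOutput()
	fmt.Printf("%s %v\n%s", name, args, out)
	return err
}

func main() {
	for _, api := range mpnFoundations {
		// Target the foundation (login/auth omitted for brevity)...
		if err := run("cf", "api", api); err != nil {
			fmt.Println("skipping foundation", api, ":", err)
			continue
		}
		// ...then push the same app with 2 instances into this AZ's foundation.
		if err := run("cf", "push", "my-app", "-i", fmt.Sprint(instancesPerAZ)); err != nil {
			fmt.Println("push failed on", api, ":", err)
		}
	}
}
```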

CF Application Architecture

Internet-facing or Authenticated Apps
• All applications must be deployed with a minimum of 2 instances in the DMZ and 2 instances in the Internal Network per availability zone (16 instances total minimum).
• A DMZ component is required for any authenticated application. This component must have "public" data only.
• An ISAM junction is created for each app deployed in the DMZ. If the app requires no authentication, it can be configured as a pass-through in ISAM.
• All applications requiring authentication must have a DMZ component, even if all the users are in the Internal Network.

Highly Available App Deployments

Apps must be deployed consistently to all Availability Zones

Change Management Automation

Zero Downtime Deployments across multiple foundations

Achieving Highly Available App Deployments

Developed Conveyor to deploy applications
• Delivers blue-green deployments across multiple foundations (a simplified sketch of the idea follows this list)
• Fully handles the creation and closure of change records
• Auto-rollback functionality
• 1st piece of open source software in Allstate
• https://github.com/compozed/deployadactyl
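The following is only a simplified sketch of the blue-green idea that Conveyor/deployadactyl automates, not its actual implementation: push the new version alongside the old one, move the production route, then retire the old version; if the push fails, the route never moves and the old version keeps serving traffic. The app and route names are hypothetical; the cf commands used (push, map-route, unmap-route, delete) are the standard CLI ones.

```go
package main

import (
	"fmt"
	"os/exec"
)

// cf runs a cf CLI command against the currently targeted foundation and echoes its output.
func cf(args ...string) error {
	out, err := exec.Command("cf", args...).CombinedOutput()
	fmt.Printf("cf %v\n%s", args, out)
	return err
}

// blueGreen deploys newApp next to oldApp, swaps the production route, and deletes oldApp.
func blueGreen(oldApp, newApp, domain, host string) error {
	// 1. Push the new version without claiming the production route yet.
	if err := cf("push", newApp, "--no-route"); err != nil {
		return fmt.Errorf("push failed, old version still serving traffic: %w", err)
	}
	// 2. Map the production route to the new version (both versions now receive traffic).
	if err := cf("map-route", newApp, domain, "--hostname", host); err != nil {
		return err
	}
	// 3. Remove the old version from the route, then retire it.
	if err := cf("unmap-route", oldApp, domain, "--hostname", host); err != nil {
		return err
	}
	return cf("delete", oldApp, "-f")
}

func main() {
	// Hypothetical names; a tool like Conveyor repeats this per foundation and per security zone.
	if err := blueGreen("my-app-blue", "my-app-green", "apps.example.com", "my-app"); err != nil {
		fmt.Println("deployment aborted:", err)
	}
}
```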

What it looks like

Early Successes
• Situation
– Needed to refit the PDUs on the rack-and-roll cabinets for future maintainability
– Datacenter layout in Data Center 2 required a full power down of equipment
• Action
– Data Center 1 was completed live, relying on intra-AZ resiliency to swap out PDUs with 0 downtime and 0 impact to apps
– Data Center 2 was completed with a full power-down activity rolling through each of the AZs; each AZ was down for about 6 hours but with 0 impact to apps
• Result
– Maintenance can be completed with 0 impact to apps

Further Information

• SpringOne: "Unwinding Platform Complexity with Concourse" — Matt Curry, Alan Moran; Allstate

• https://github.com/compozed/deployadactyl

• CF Summit: "Building a Brand Around a Technology and Cultural Transformation" - Matthew Curry, Allstate