Building and running applications at scale in Zalando Online fashion store Checkout case By Pamela Canchanya

Post on 20-May-2020

Page 1: Building and running applications at scale in Zalando · in Zalando Online fashion store Checkout case By Pamela Canchanya . About Zalando ~ 5.4 billion EUR revenue 2018 > 250 million

Building and running applications at scale in Zalando
Online fashion store Checkout case
By Pamela Canchanya

About Zalando

- ~ 5.4 billion EUR revenue in 2018
- > 250 million visits per month
- > 15,500 employees in Europe
- > 70% of visits via mobile devices
- > 26 million active customers
- > 300,000 product choices
- ~ 2,000 brands
- 17 countries

Black Friday at a glance

Zalando Tech

Reorganization

- From monolith to microservice architecture
- > 1,000 microservices

Tech organization

- > 1,100 developers
- > 200 development teams
- Platform

End to end responsibility

Checkout

Goal: “Allow customers to buy seamlessly and conveniently”

Checkout landscape

- Programming languages: Java, Scala, Node JS
- Communication: REST & messaging
- Data storage: Cassandra
- Configurations: ETCD
- Infrastructure: AWS & Kubernetes
- Client side: React
- Container: Docker
- Many more

Checkout architecture

[Architecture diagram. Components shown: Skipper, Tailor, Frontend fragments, Backend for frontend, Checkout service, Cassandra. "Dependencies" is labelled at three places in the diagram.]

Checkout is a critical component in the shopping journey
- Direct impact on business revenue
- Direct impact on customer experience

Checkout challenges in a microservice ecosystem
- Increased points of failure
- Multiple dependencies evolving independently

Lessons learnt building Checkout with
- Reliability patterns
- Scalability
- Monitoring

Building microservices with reliability patterns

Checkout confirmation page

[Diagram labels: Delivery, Destination, Payments Service, Cart, Delivery Service]

Checkout confirmation page

Delivery Service

Unwanted error

Doing retries

for (let attempt = 1; attempt <= numRetries; attempt++) {
  try {
    return getDeliveryOptionsForCheckout(cart);
  } catch (error) {
    // Rethrow once the last attempt has failed.
    if (attempt >= numRetries) {
      throw error;
    }
  }
}

Retry for transient errors like a network error or service overload

Retries for some errors

try {
  getDeliveryOptionsForCheckout(cart) match {
    case Success()        => // return result
    case TransientFailure => // retry operation
    case Error            => // throw error
  }
} catch {
  // A Scala catch block needs a case clause to match the exception.
  case e: Exception => println("Delivery options exception")
}

Retries with exponential backoff

[Figure: Attempt 1, Attempt 2 and Attempt 3, each labelled 100 ms, separated by exponential backoff time]
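
The retry-with-backoff idea on this slide could be sketched as follows. This is a minimal illustration, not the implementation used in the talk: the names retryWithBackoff, backoffDelay, sleep and isTransient, the doubling factor, and the 100 ms base delay (borrowed from the figure label) are all assumptions.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before the next attempt: baseDelayMs, then 2x, 4x, ...
function backoffDelay(attempt, baseDelayMs) {
  return baseDelayMs * 2 ** (attempt - 1);
}

// Retries `operation` up to `maxAttempts` times, waiting longer after each failure.
// `isTransient` decides whether an error is worth retrying (assumed classification).
async function retryWithBackoff(operation, {
  maxAttempts = 3,
  baseDelayMs = 100,
  isTransient = () => true,
  wait = sleep,
} = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      const lastAttempt = attempt === maxAttempts;
      if (lastAttempt || !isTransient(error)) {
        throw error; // retries exhausted, or a permanent error
      }
      await wait(backoffDelay(attempt, baseDelayMs));
    }
  }
}
```

With the defaults, a call that fails twice and then succeeds waits 100 ms and then 200 ms between attempts. The wait parameter is injectable only so the delays can be replaced in tests.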

Retries get exhausted and failures become permanent

Prevent execution of operations that are likely to fail

Circuit breaker pattern

Circuit breaker pattern - Martin Fowler blog post

Open circuit, operations fail immediately

Target

error rate > threshold 50%

getDeliveryOptionsForCheckout = failure
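
A circuit breaker along these lines might look like the sketch below. It is a simplified illustration under stated assumptions: the class name CircuitBreaker, the sliding window of recent calls, the reset timeout and the error message are all hypothetical, and the half-open state described in Martin Fowler's post is left out (the circuit simply closes again after the timeout). Only the 50% threshold comes from the slide.

```javascript
// Minimal circuit breaker: opens when the failure rate over the last
// `windowSize` calls exceeds `threshold`, then rejects calls immediately
// until `resetAfterMs` has passed.
class CircuitBreaker {
  constructor({ threshold = 0.5, windowSize = 10, resetAfterMs = 5000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.windowSize = windowSize;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
    this.results = [];    // true = success, false = failure
    this.openedAt = null; // timestamp when the circuit opened, or null if closed
  }

  call(operation) {
    if (this.openedAt !== null) {
      if (this.now() - this.openedAt < this.resetAfterMs) {
        throw new Error('Circuit open: failing fast');
      }
      // Reset period elapsed: close the circuit and start a fresh window.
      this.openedAt = null;
      this.results = [];
    }
    try {
      const value = operation();
      this.record(true);
      return value;
    } catch (error) {
      this.record(false);
      throw error;
    }
  }

  record(success) {
    this.results.push(success);
    if (this.results.length > this.windowSize) this.results.shift();
    const failures = this.results.filter((ok) => !ok).length;
    const windowFull = this.results.length === this.windowSize;
    if (windowFull && failures / this.results.length > this.threshold) {
      this.openedAt = this.now();
    }
  }
}
```

While the circuit is open, the wrapped operation is not executed at all, which is what spares the struggling downstream service from further load.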

Fallback as an alternative to failure

- Unwanted failure: no Checkout
- Fallback: only Standard delivery service with a default delivery promise
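
A fallback of this kind could be sketched as below. Everything except the function name getDeliveryOptionsForCheckout and the idea of "standard delivery with a default promise" is an assumption made for illustration: the object shape, the EXPRESS option, the deliveryServiceUp flag and the stand-in implementation of the delivery call are hypothetical.

```javascript
// Assumed shape of a delivery option; the field names are illustrative only.
const STANDARD_DELIVERY = { service: 'STANDARD', promise: 'default delivery promise' };

// Stand-in for the real call to the delivery service (hypothetical behaviour).
function getDeliveryOptionsForCheckout(cart) {
  if (!cart.deliveryServiceUp) throw new Error('Delivery service unavailable');
  return [STANDARD_DELIVERY, { service: 'EXPRESS', promise: 'illustrative faster promise' }];
}

// Fallback: only standard delivery with a default promise, so checkout can continue.
function getDeliveryOptionsForCheckoutFallback() {
  return [STANDARD_DELIVERY];
}

// Tries the real call first and degrades to the fallback on any error.
function deliveryOptionsOrFallback(cart) {
  try {
    return getDeliveryOptionsForCheckout(cart);
  } catch (error) {
    console.warn(`Delivery options failed, using fallback: ${error.message}`);
    return getDeliveryOptionsForCheckoutFallback();
  }
}
```

The customer sees fewer delivery choices when the fallback is used, but can still complete the checkout.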

Putting it all together

- Do retries of operations with exponential backoff
- Wrap operations with a circuit breaker
- Handle failures with fallbacks when possible
- Otherwise make sure to handle the exceptions

circuitCommand(
  getDeliveryOptionsForCheckout(cart)
    .retry(2)
)
  .onSuccess(/* do something with result */)
  .onError(getDeliveryOptionsForCheckoutFallback)

Scaling microservices

Traffic pattern

Microservice infrastructure

- Load balancer: incoming requests
- Instances: requests distributed by instance
- Container: use Zalando base image (Node env, JVM env)

Scaling horizontally

[Diagrams. Labels: Load balancer, Instance (x3), Container. The second diagram adds a fourth Instance.]

Scaling vertically

[Diagrams. Labels: Load balancer, Instance (x3), Container.]

Scaling consequences

Cassandra
- More service connections
- More saturation and risk of an unhealthy database

Microservices cannot be scalable if downstream microservices cannot scale
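
The arithmetic behind this point can be sketched with a generic connection-pool model. This is an assumption-laden illustration, not a description of how the Cassandra driver manages connections: the function names, the per-instance pool size and the database connection budget are all hypothetical.

```javascript
// Total connections the service opens against the database.
function totalConnections(instances, poolSizePerInstance) {
  return instances * poolSizePerInstance;
}

// Largest instance count that stays within the database connection budget.
function maxInstancesFor(dbMaxConnections, poolSizePerInstance) {
  return Math.floor(dbMaxConnections / poolSizePerInstance);
}
```

Under this model, scaling a service from 5 to 10 instances with 8 connections each doubles the load on the database from 40 to 80 connections, so the database budget caps how far the service can scale out.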

Low traffic rollouts

- Service v1, traffic 100%: instances 1 to 4
- Service v2, traffic 0%: instances 1 to 4

High traffic rollouts

- Service v1, traffic 100%
- Service v2, traffic 0%
- Instance groups shown: 1 to 4 and 1 to 6

Rollout with not enough capacity

Rollouts should allocate the same capacity as the version serving 100% of traffic
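
One way to express this rule as a pre-rollout check is sketched below. The function names and the requests-per-second capacity model are assumptions for illustration; real capacity planning would rest on measured load.

```javascript
// Instances needed to serve `requestsPerSecond` if one instance handles `perInstanceRps`.
function requiredInstances(requestsPerSecond, perInstanceRps) {
  return Math.ceil(requestsPerSecond / perInstanceRps);
}

// A new version is safe to receive 100% of traffic only if it has at least
// as many instances as the version currently serving that traffic.
function canSwitchTraffic(currentVersionInstances, newVersionInstances) {
  return newVersionInstances >= currentVersionInstances;
}
```

For instance, if the current version has scaled out to 6 instances under high traffic and the new version starts with 4, canSwitchTraffic(6, 4) returns false and the new version should be scaled up before the switch.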

Monitor microservices

Monitoring microservice ecosystem

Four layer model of microservice ecosystem
- Microservice
- Application platform
- Communication
- Hardware

Metrics
- Infrastructure metrics
- Microservice metrics

First example
- Hardware metrics
- Communication metrics
- Rate and responses of API endpoints
- Dependencies metrics
- Language specific metrics

Second example
- Infrastructure metrics
- Node JS metrics
- Frontend microservice metrics

Anti-pattern: Dashboard usage for outage detection

Alerting

“Something is broken, and somebody needs to fix it right now! Or, something might break soon, so somebody should look soon.”

Practical Alerting - Monitoring distributed systems, Google SRE Book

Alert: Unhealthy instances 1 of 5
Cause: No more memory, JVM is misconfigured

Alert: Checkout service is returning 4XX responses above the 25% threshold
Cause: Recent change broke the API contract for an unconsidered business rule

Alert: No orders in last 5 minutes
Cause: Downstream dependency is experiencing connectivity issues

Alert: Checkout database disk utilization is 80%
Cause: Saturation of data storage by an increase in traffic
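
The four alert examples can be sketched as rules evaluated against a metrics snapshot. This is an illustration under assumptions: the thresholds (25%, 5 minutes, 80%) come from the slides, while the metric field names, the rule structure and the "any unhealthy instance" condition are hypothetical.

```javascript
// Each rule maps a metrics snapshot to true (fire) or false.
// Thresholds follow the slides; the metric field names are assumptions.
const alertRules = [
  { name: 'Unhealthy instances',
    fires: (m) => m.unhealthyInstances > 0 },
  { name: '4XX responses above 25%',
    fires: (m) => m.requests > 0 && m.responses4xx / m.requests > 0.25 },
  { name: 'No orders in last 5 minutes',
    fires: (m) => m.ordersLast5Min === 0 },
  { name: 'Database disk utilization at 80%',
    fires: (m) => m.dbDiskUtilization >= 0.8 },
];

// Returns the names of all rules that fire for the given snapshot.
function firingAlerts(metrics) {
  return alertRules.filter((rule) => rule.fires(metrics)).map((rule) => rule.name);
}
```

Each rule describes a symptom that can be observed from outside; finding the cause, such as a misconfigured JVM or a broken API contract, is left to whoever responds to the alert.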

Alerts notify about symptoms

Alerts should be actionable

Incident response

Figure: Five stages of incident response. Microservices ready to production

Example of postmortem

Summary of incident: No orders in last 5 minutes, 13.05.2019 between 16:00 and 16:45

Impact on customers: 2K customers could not complete checkout

Impact on business: 50K euros lost in orders that could have been completed

Analysis of root cause: Why were there no orders?

Action items: ...

Every incident should have a postmortem

Preparing for Black Friday

- Business forecast
- Load testing of real customer journey
- Capacity planning

Checklist for every microservice involved in Black Friday

- Are the architecture and dependencies reviewed?
- Are the possible points of failure identified and mitigated?
- Are reliability patterns implemented?
- Are the configurations adjustable without the need for a deployment?
- Do we have a scaling strategy?
- Is monitoring in place?
- Are all alerts actionable?
- Is our team prepared for 24x7 incident management?

Situation room

Black Friday pattern of requests

> 4,200 orders per minute

My summary of learnings

- Think outside the happy path and mitigate failures with reliability patterns

- Services are only as scalable as their dependencies

- Monitor the microservice ecosystem

Obrigada / Thank you / Danke

Contact
Pamela Canchanya
[email protected]
@pamcdm
