Flannel Slack’s Secret to Scale Milena Talavera Senior Infrastructure Manager@Slack

Post on 25-Oct-2021


Page 1: Milena Talavera Senior Infrastructure Manager@Slack

Flannel Slack’s Secret to Scale

Milena Talavera Senior Infrastructure Manager@Slack

Page 2: Milena Talavera Senior Infrastructure Manager@Slack

Our Mission: To make people’s working lives simpler, more pleasant, and more productive.

Page 3: Milena Talavera Senior Infrastructure Manager@Slack
Page 4: Milena Talavera Senior Infrastructure Manager@Slack
Page 5: Milena Talavera Senior Infrastructure Manager@Slack
Page 6: Milena Talavera Senior Infrastructure Manager@Slack

Slack Scale

❖ 8M+ Daily Active Users
   3M+ paid users; 65% of Fortune 100 Companies
❖ 100+ countries
   50%+ of DAU outside of US

Page 7: Milena Talavera Senior Infrastructure Manager@Slack

From supporting small teams 3-4 years ago
To serving gigantic organizations of hundreds of thousands of users today

Page 8: Milena Talavera Senior Infrastructure Manager@Slack

Slack Scale

To support the rapid growth of yesterday and today, Slack’s infrastructure has to stay ahead of customer growth.

Page 11: Milena Talavera Senior Infrastructure Manager@Slack

Biggest Teams

2015 8,000 users

2016 26,000 users

2018 266,000 users

Page 12: Milena Talavera Senior Infrastructure Manager@Slack

Slack Architecture History Lesson

Page 13: Milena Talavera Senior Infrastructure Manager@Slack

Slack Architecture History Lesson

Timeline (2015, 2017, 2018):
○ Fat, greedy client
○ Fat, lazy client
○ Flannel Powered
○ Lazy + Flannel Powered
○ Resiliency
○ Scale: Pub/Sub

Page 15: Milena Talavera Senior Infrastructure Manager@Slack

Fat, Greedy Client

[Architecture diagram: WebApp (PHP/Hack), reached over HTTP; Messaging Server (Java), reached over WebSocket; MySql]

Page 16: Milena Talavera Senior Infrastructure Manager@Slack

User Connect Flow in 2015

Client and Server, in time order from Connect:
1. https://slack.com/api/rtm.start
2. HTTP response: a snapshot of the team
3. Long-lived WebSocket connection: real time events
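The three steps of the 2015 flow can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: `FakeServer`, `GreedyClient`, and the event dictionaries are hypothetical stand-ins, and nothing here talks to the real `rtm.start` endpoint named on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class FakeServer:
    """Hypothetical in-memory stand-in for the backend."""
    users: dict = field(default_factory=dict)
    channels: dict = field(default_factory=dict)

    def rtm_start(self) -> dict:
        # Steps 1-2: one HTTP call returns a full snapshot of the team.
        return {"users": dict(self.users), "channels": dict(self.channels)}


@dataclass
class GreedyClient:
    """Keeps every object of the team in local memory."""
    users: dict = field(default_factory=dict)
    channels: dict = field(default_factory=dict)

    def connect(self, server: FakeServer) -> None:
        snapshot = server.rtm_start()
        self.users = snapshot["users"]
        self.channels = snapshot["channels"]

    def on_event(self, event: dict) -> None:
        # Step 3: every real-time event updates the local copy.
        if event["type"] == "user_change":
            self.users[event["user"]["id"]] = event["user"]
        elif event["type"] == "channel_created":
            self.channels[event["channel"]["id"]] = event["channel"]


server = FakeServer(
    users={"U1": {"id": "U1", "name": "ana"}},
    channels={"C1": {"id": "C1", "name": "general"}},
)
client = GreedyClient()
client.connect(server)
client.on_event({"type": "user_change", "user": {"id": "U1", "name": "ana maria"}})
```

Because the client holds the whole team locally, reads are fast, but the cost of `connect` grows with the size of the snapshot.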

Page 17: Milena Talavera Senior Infrastructure Manager@Slack

User Connect Flow in 2015

Advantages
○ Every Slack Object available locally on the client
○ User experience was super speedy
○ Enabled us to move fast

Page 18: Milena Talavera Senior Infrastructure Manager@Slack

User Connect Flow in 2015

Limitations
○ Expensive connection/reconnection
○ Large client memory footprint (grows with team size)
○ Susceptible to thundering herd

Page 19: Milena Talavera Senior Infrastructure Manager@Slack

Team Snapshot Size

Number of users    Number of channels    Snapshot size (bytes)
30                 10                    200K
500                200                   2.5M
3,000              7,000                 20M
30,000             1,000                 60M

Max Team Sizes in 2015: ~8,000 users
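To see why snapshot size ties into the thundering herd concern, a rough calculation helps. The sketch below assumes K means 1,000 and M means 1,000,000 bytes, and assumes a worst case in which every user on a team reconnects at once with one client each; both assumptions are mine, not the slide's.

```python
# Snapshot sizes from the table, in bytes (K = 1e3, M = 1e6 assumed).
SNAPSHOT_BYTES = {
    30: 200_000,
    500: 2_500_000,
    3_000: 20_000_000,
    30_000: 60_000_000,
}


def reconnect_storm_bytes(team_size: int) -> int:
    """Total bytes served if every user on the team reconnects at once."""
    return team_size * SNAPSHOT_BYTES[team_size]


for size in SNAPSHOT_BYTES:
    gigabytes = reconnect_storm_bytes(size) / 1e9
    print(f"{size:>6} users: {gigabytes:,.3f} GB")
```

Under these assumptions a full reconnect of the 30,000 user team would transfer 1.8 TB of snapshots, because both the number of clients and the per-client payload grow with team size.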

Page 22: Milena Talavera Senior Infrastructure Manager@Slack

User Connect Flow in 2016

Client and Server, in time order from Connect:
1. https://slack.com/api/rtm.start
2. HTTP response: a partial snapshot of the objects
3. Long-lived WebSocket connection: pruned real time events
4. Asynchronous fetch of non-essential objects
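The lazy variant of the flow can be sketched as a cache that starts from a partial snapshot and fetches anything else on first use. `LazyStore`, `DIRECTORY`, and the `fetch` callback are hypothetical names for illustration; the fetch is synchronous here for brevity, whereas the slide describes it as asynchronous, and the `on_event` filter merely stands in for the pruned event stream.

```python
from typing import Callable, Dict, Optional


class LazyStore:
    """Client-side cache that fetches an object the first time it is needed."""

    def __init__(self, fetch: Callable[[str], dict],
                 initial: Optional[Dict[str, dict]] = None) -> None:
        self._fetch = fetch                  # hypothetical server lookup
        self._objects = dict(initial or {})  # partial snapshot from boot
        self.fetch_count = 0

    def get(self, object_id: str) -> dict:
        if object_id not in self._objects:
            self.fetch_count += 1
            self._objects[object_id] = self._fetch(object_id)
        return self._objects[object_id]

    def on_event(self, object_id: str, obj: dict) -> bool:
        """Apply an update only if the object is already held locally."""
        if object_id in self._objects:
            self._objects[object_id] = obj
            return True
        return False


# A made-up 1,000 user directory standing in for the server side.
DIRECTORY = {f"U{i}": {"id": f"U{i}", "name": f"user{i}"} for i in range(1, 1001)}

store = LazyStore(fetch=DIRECTORY.__getitem__, initial={"U1": DIRECTORY["U1"]})
store.get("U1")    # served from the partial snapshot, no fetch
store.get("U42")   # fetched on demand
store.get("U42")   # already cached, no second fetch
```

The client now pays only for the objects it actually touches, although its memory use still grows as more of the team is visited.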

Page 23: Milena Talavera Senior Infrastructure Manager@Slack

User Connect Flow in 2016

Incremental Improvements
○ Load less data at client boot time
○ Parallelized, lazy loading on demand
○ Simplified objects

On a 10,000 user team, these changes alone saved a few megabytes of data.

Page 24: Milena Talavera Senior Infrastructure Manager@Slack

User Connect Flow in 2016

Still Limitations
○ Still susceptible to thundering herd if clients dump their cache
○ Still grows with team size

Page 26: Milena Talavera Senior Infrastructure Manager@Slack

Flannel Powered Slack

Flannel: Slack’s edge cache service

○  A query engine backed by cache on edge locations
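What "a query engine backed by cache" might look like can be sketched with a tiny per-team user cache that answers prefix queries rather than only key lookups. This is a hypothetical illustration, not Flannel's actual API: `EdgeCache`, its method names, and the sample users are all invented here.

```python
from typing import Dict, List


class EdgeCache:
    """Hypothetical per-team user cache that answers queries, not just key lookups."""

    def __init__(self) -> None:
        self._users: Dict[str, Dict[str, dict]] = {}

    def load_snapshot(self, team_id: str, users: List[dict]) -> None:
        self._users[team_id] = {u["id"]: u for u in users}

    def apply_event(self, team_id: str, user: dict) -> None:
        # Keep the cache current as user-change events stream in.
        self._users.setdefault(team_id, {})[user["id"]] = user

    def query_users(self, team_id: str, prefix: str, limit: int = 5) -> List[dict]:
        """Return up to `limit` users whose name starts with `prefix`."""
        prefix = prefix.lower()
        matches = [
            u for u in self._users.get(team_id, {}).values()
            if u["name"].lower().startswith(prefix)
        ]
        return sorted(matches, key=lambda u: u["name"])[:limit]


cache = EdgeCache()
cache.load_snapshot("T1", [
    {"id": "U1", "name": "milena"},
    {"id": "U2", "name": "miguel"},
    {"id": "U3", "name": "sam"},
])
cache.apply_event("T1", {"id": "U4", "name": "mina"})
names = [u["name"] for u in cache.query_users("T1", "mi")]
```

A client that needs name suggestions can ask the cache for the few matching users instead of holding the whole user list locally.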

Page 27: Milena Talavera Senior Infrastructure Manager@Slack

Powered by Flannel

[Architecture diagram labels: Client; Edge PoPs; Edges; Cache; Non-edge locations; WebApp (PHP/Hack); Messaging Server (Java); MySql]

1. WebSocket connection
2. HTTP Post: download a snapshot of the team
3. WebSocket: stream JSON events to keep cache updated

Page 28: Milena Talavera Senior Infrastructure Manager@Slack

Flannel Deployment Architecture

[Deployment diagram: Client; GeoDNS; Edge Regions A, B, and C. Each edge region shows an HAProxy in front of several Flannel hosts, labeled "Team Affinity".]
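The slide only labels the routing as "Team Affinity" and does not say how it is achieved. One common way to get that property is hash-based routing, sketched below with rendezvous (highest random weight) hashing. The technique, the `route` function, and the host names are assumptions made for illustration.

```python
import hashlib
from typing import Sequence


def _score(team_id: str, host: str) -> int:
    """Stable pseudo-random weight for a (team, host) pair."""
    digest = hashlib.sha256(f"{team_id}:{host}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def route(team_id: str, hosts: Sequence[str]) -> str:
    """Pick the host with the highest score, so a team always lands on the same host."""
    return max(hosts, key=lambda host: _score(team_id, host))


hosts = ["flannel-1", "flannel-2", "flannel-3", "flannel-4"]  # hypothetical names
teams = [f"T{i}" for i in range(100)]
placement = {team: route(team, hosts) for team in teams}

# Removing one host only moves the teams that were on that host.
remaining = [h for h in hosts if h != "flannel-3"]
moved = [t for t in teams if route(t, remaining) != placement[t]]
```

Sending all users of a team to the same host means the team's data needs to be cached once per region rather than once per host, and with this scheme losing a host reshuffles only the teams it was serving.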

Page 29: Milena Talavera Senior Infrastructure Manager@Slack

Flannel Powered Slack 2017

Advantages
○ Clients have low latency access to key big objects through edge/pop regions
○ Minimal client changes were needed to implement
○ More query flexibility and filtering than typical cache solutions like memcache

Page 30: Milena Talavera Senior Infrastructure Manager@Slack

Features Powered by Flannel

Quick Switcher

Page 31: Milena Talavera Senior Infrastructure Manager@Slack

Features Powered by Flannel

Mention Suggestions

Page 32: Milena Talavera Senior Infrastructure Manager@Slack

Features Powered by Flannel

Channel Header

Page 33: Milena Talavera Senior Infrastructure Manager@Slack

Features Powered by Flannel

Channel Sidebar

Page 34: Milena Talavera Senior Infrastructure Manager@Slack

Flannel Powered Slack 2017

Limitations
○ Keeping Flannel cache updated is expensive (firehose feed of events)
○ Thundering herd phenomenon is still a possibility
○ Cache on the websocket is in the critical path

Page 36: Milena Talavera Senior Infrastructure Manager@Slack

Powered by Flannel V1.5

Impactful Improvements
○ Thrift Pub/Sub reducing number of events processed by 1000X

Page 37: Milena Talavera Senior Infrastructure Manager@Slack

Powered by Flannel V1.5

[Architecture diagram labels: Client; Edge PoPs; Edges; Cache; Non-edge locations; WebApp (PHP/Hack); Messaging Server (Java); MySql]

1. WebSocket connection
2. HTTP Post: download a snapshot of the team
3. Pub/Sub Thrift events to keep cache updated
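The reason a Pub/Sub feed processes fewer events than a firehose can be shown with a toy broker: a host receives events only for the teams it has subscribed to. This sketch does not use Thrift and is not Slack's implementation; `Broker`, `FlannelHost`, and the event payloads are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class FlannelHost:
    """Hypothetical cache host that records the events delivered to it."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.received: List[Tuple[str, dict]] = []

    def receive(self, team_id: str, event: dict) -> None:
        self.received.append((team_id, event))


class Broker:
    """Delivers each event only to hosts subscribed to that team."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, Set[FlannelHost]] = defaultdict(set)

    def subscribe(self, team_id: str, host: FlannelHost) -> None:
        self._subscribers[team_id].add(host)

    def unsubscribe(self, team_id: str, host: FlannelHost) -> None:
        self._subscribers[team_id].discard(host)

    def publish(self, team_id: str, event: dict) -> int:
        hosts = self._subscribers.get(team_id, set())
        for host in hosts:
            host.receive(team_id, event)
        return len(hosts)


broker = Broker()
host_a, host_b = FlannelHost("a"), FlannelHost("b")
broker.subscribe("T1", host_a)   # host_a serves clients of team T1 only

broker.publish("T1", {"type": "user_change"})
broker.publish("T2", {"type": "user_change"})   # nobody is subscribed to T2
```

With a firehose, both hosts would have processed both events; here `host_a` sees one event and `host_b` sees none.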

Page 38: Milena Talavera Senior Infrastructure Manager@Slack

[Chart: Before vs. After; events reduced by (value shown only in the image)]

Page 39: Milena Talavera Senior Infrastructure Manager@Slack

Powered by Flannel V1.5

Impactful Improvements
○ Client lazily loads primary objects (users, channels, channel membership), significantly reducing boot time

Max Team Sizes in 2018: ~266,000 users

Page 41: Milena Talavera Senior Infrastructure Manager@Slack

Resiliency

As scale increases, failures are more likely to happen. Our goal is to minimize blast radius and recovery time of failure modes.

Page 42: Milena Talavera Senior Infrastructure Manager@Slack

Resiliency

Our observation: when failures happen, they happen faster than one can blink an eye. The solution cannot rely on human intervention.

Page 43: Milena Talavera Senior Infrastructure Manager@Slack

Resiliency

Automated Admission Control

Page 44: Milena Talavera Senior Infrastructure Manager@Slack

Resiliency

Measures Taken
○ Automated Admission Control based on various metrics. Examples: memory pressure, concurrent requests, etc.
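A minimal sketch of admission control driven by the two example metrics on the slide, memory pressure and concurrent requests. The class name, thresholds, and the idea of passing memory usage in as a fraction are assumptions for illustration, not a description of Slack's implementation.

```python
from dataclasses import dataclass


@dataclass
class AdmissionController:
    """Rejects new work when the host is over its configured limits."""
    max_concurrent: int           # hypothetical limit on in-flight requests
    max_memory_fraction: float    # hypothetical limit, e.g. 0.9 = 90% of memory
    in_flight: int = 0

    def try_admit(self, memory_fraction: float) -> bool:
        """Admit a request unless a limit is reached; no human in the loop."""
        if self.in_flight >= self.max_concurrent:
            return False
        if memory_fraction >= self.max_memory_fraction:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Call when an admitted request finishes."""
        self.in_flight = max(0, self.in_flight - 1)


controller = AdmissionController(max_concurrent=2, max_memory_fraction=0.9)
first = controller.try_admit(memory_fraction=0.5)    # admitted
second = controller.try_admit(memory_fraction=0.5)   # admitted
third = controller.try_admit(memory_fraction=0.5)    # rejected: too many in flight
controller.release()
fourth = controller.try_admit(memory_fraction=0.95)  # rejected: memory pressure
```

Rejecting early keeps an overloaded host serving the requests it has already accepted instead of degrading for everyone.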

Page 45: Milena Talavera Senior Infrastructure Manager@Slack

Resiliency

Circuit Breakers

Page 46: Milena Talavera Senior Infrastructure Manager@Slack

Resiliency

Measures Taken
○ Built in Circuit Breakers to mitigate cascading failures and protect services from each other’s bad behaviours
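The circuit breaker pattern can be sketched in a few lines. This is a generic textbook version, not Slack's code; the threshold, the reset timeout, and the injectable `clock` parameter (used so the behaviour can be exercised without waiting) are assumptions.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is refused because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                # Open: fail fast instead of piling load on a sick dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()   # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each call to a downstream service in `breaker.call(...)`, so that once the service fails repeatedly, further calls are refused immediately until the timeout allows a trial request.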

Page 47: Milena Talavera Senior Infrastructure Manager@Slack

What Else

○ Regional Failover
○ Auto Scaling

Page 49: Milena Talavera Senior Infrastructure Manager@Slack

Sneak Peek into the Future

Expand Pub/Sub to Client Side
○ Reduce events clients have to handle
○ Track what is in the current view
○ Subscribe/Unsubscribe to events when view changes
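The client-side idea can be sketched as a set difference: when the view changes, subscribe to what became visible and unsubscribe from what disappeared. `ViewSubscriptions` and the event shape are hypothetical; the slide describes a future direction, not a finished design.

```python
from typing import Iterable, Set, Tuple


class ViewSubscriptions:
    """Tracks which channels are in view and which subscriptions must change."""

    def __init__(self) -> None:
        self.subscribed: Set[str] = set()

    def on_view_change(self, visible_channels: Iterable[str]) -> Tuple[Set[str], Set[str]]:
        """Return (channels to subscribe to, channels to unsubscribe from)."""
        visible = set(visible_channels)
        to_subscribe = visible - self.subscribed
        to_unsubscribe = self.subscribed - visible
        self.subscribed = visible
        return to_subscribe, to_unsubscribe

    def should_handle(self, event: dict) -> bool:
        # Events for channels outside the current view are ignored.
        return event["channel"] in self.subscribed


view = ViewSubscriptions()
first_change = view.on_view_change(["C1", "C2"])
second_change = view.on_view_change(["C2", "C3"])
```

After the second change the client would subscribe to `C3`, drop `C1`, and keep `C2` untouched, so it only handles events for what the user can currently see.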

Page 50: Milena Talavera Senior Infrastructure Manager@Slack

THANK YOU! Got Questions?

Milena