Managing 50K+ Redis databases over 4 public clouds ... with a tiny DevOps team
DESCRIPTION
A presentation by Redis Labs' CTO, Yiftach Shoolman, given at the July 2nd meetup hosted by I am OnDemand and IGT Cloud at the Microsoft ILDC Auditorium. See the video at: https://www.youtube.com/watch?v=eymqHZaUOH4 In this session Yiftach shares tips on how the company manages 50,000+ scalable and highly available Redis databases over the 4 largest public clouds, 8 leading Platforms-as-a-Service, and across 10 geographical regions. He explains the service's back-end architecture, the open-source projects it uses, and which tools the company builds in-house. Shoolman also shares what Redis Labs' small DevOps team does automatically, and what it still does manually. Finally, he offers advice on how to build a strong R&D team that lives and breathes DevOps. Since the company launched its Redis Cloud service, it has dealt with 150+ node failure events and a half-dozen complete data-center outages. In addition, its team has experienced many interesting scenarios, such as hard-to-believe scaling patterns: 0 to a few hundred gigabytes of in-memory data in just a few minutes, and 0 to 300K+ ops/sec in just a few seconds.
TRANSCRIPT
1
powering lightning fast apps
2
The newest NoSQL
The fastest data store available today (served entirely from RAM)
Among the top 3 databases chosen by developers
Much more than a simple key/value - Strings, Hashes, Lists, Sets, Sorted Sets, Lua, transactions, bit operations
Strong use cases, dynamic community, large ecosystem
Redis
3
Leading the commercial Redis market
Founded in 2011; GA in 02/2013
2,400+ paying customers; 52,000+ DBs; 100+ new DBs/day
2nd largest contributor to open source Redis
Raised $13M - Bain/Carmel/Strategic/Angels
Offices in Santa Clara and Tel-Aviv
Redis Labs
4
Redis Cloud, Memcached Cloud
Our offering
Fully-managed cloud services.
On-prem server license - soon.
5
Fast apps requirements
100msec = max E2E response time, under any load
50msec = average Internet latency
50msec = required app response time (includes processing & multi DB accesses)
1msec = required DB response time
The only database to meet requirement
=
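The budget on this slide can be checked with simple arithmetic. The sketch below is only an illustration: the function names and the number of database accesses per request are hypothetical, not figures from the talk.

```python
# Latency budget from the slide, in milliseconds.
E2E_BUDGET_MS = 100        # max end-to-end response time
INTERNET_LATENCY_MS = 50   # average Internet latency
DB_RESPONSE_MS = 1         # required DB response time


def app_budget_ms() -> int:
    """Time left for the app once network latency is subtracted."""
    return E2E_BUDGET_MS - INTERNET_LATENCY_MS


def processing_time_left_ms(db_accesses: int, db_ms: int = DB_RESPONSE_MS) -> int:
    """Processing time left after `db_accesses` sequential DB calls.

    `db_accesses` is a hypothetical per-request figure, not from the talk.
    """
    return app_budget_ms() - db_accesses * db_ms


if __name__ == "__main__":
    print(app_budget_ms())                       # 50
    print(processing_time_left_ms(20))           # 30
    print(processing_time_left_ms(3, db_ms=20))  # -10
```

Under these assumptions, 20 sequential accesses at 1msec still leave 30msec for processing, while three sequential accesses at 20msec already exceed the 50msec app budget.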
6
DB performance comparison
@<1msec
@<1msec
@<1msec
@<20msec
@<10-50msec
@<10-50msec
@<100msec
@<100msec
@>100msec
7
Why is Redis efficient?
Many data-structures
Many cool commands (atomicity maintained)
Complexity aware
8
Real world use case:
• 500+GB
• 400K writes/sec
• 1500 reads/sec
• 37.5KB average object size
Efficiency
No extra work at app level
1.5Gbps 120Gbps
Tons of work at app level
NoSQL
6 Nodes cluster
150+ Nodes cluster
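The 120Gbps figure matches what 400K writes/sec would cost if every write transferred a whole 37.5KB object. A quick check follows, assuming decimal units (1KB = 1,000 bytes) and one full object per write; both are assumptions, not statements from the talk, and the function names are hypothetical.

```python
WRITES_PER_SEC = 400_000
AVG_OBJECT_BYTES = 37_500  # 37.5KB, assuming 1KB = 1,000 bytes


def full_object_write_gbps(writes_per_sec: int, object_bytes: int) -> float:
    """Bandwidth in Gbps if every write sends the whole object."""
    return writes_per_sec * object_bytes * 8 / 1e9


def bytes_per_write(gbps: float, writes_per_sec: int) -> float:
    """Average payload per write implied by a given bandwidth."""
    return gbps * 1e9 / 8 / writes_per_sec


if __name__ == "__main__":
    print(full_object_write_gbps(WRITES_PER_SEC, AVG_OBJECT_BYTES))  # 120.0
    print(bytes_per_write(1.5, WRITES_PER_SEC))                      # 468.75
```

Under the same assumptions, 1.5Gbps corresponds to roughly 470 bytes per write, which is what one would expect if only the changed part of each object travels over the network.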
9
Timeline
Followers
Caching
Messaging
Geo search
Leaderboards
Job management
RT analytics
Verticals & main use cases
Online advertising
Social Gaming
Financial Services
10
• Multi-TB in memory
• ~ 300,000 reads/sec
• ~ 5,000*N writes/sec
N - # of followers
Every Timeline
(800 tweets per user)
is on Redis
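One common way to keep a bounded timeline in Redis is a List that is pushed to and then trimmed. The sketch below is an assumption about how such a fan-out could look, not the code used by the company on this slide. The function name and the `timeline:<user_id>` key format are hypothetical, and `client` is assumed to be a redis-py style client exposing `pipeline()`, `lpush` and `ltrim`.

```python
TIMELINE_LEN = 800  # entries kept per user, as stated on the slide


def fan_out_tweet(client, tweet_id, follower_ids, max_len=TIMELINE_LEN):
    """Push a tweet id onto each follower's timeline and cap its length.

    One tweet causes one write per follower, which is the N-writes
    pattern noted on the slide (N = number of followers).
    """
    pipe = client.pipeline()
    for follower_id in follower_ids:
        key = f"timeline:{follower_id}"   # hypothetical key naming
        pipe.lpush(key, tweet_id)         # newest entry goes to the head
        pipe.ltrim(key, 0, max_len - 1)   # keep only the newest max_len
    return pipe.execute()
```

Batching the commands in a pipeline keeps the number of network round trips low even when a user has many followers.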
11
• 20TB+ in memory
• ~ 6,000,000 reads/sec
• ~ 600,000 writes/sec
Weibo (Chinese Twitter)
• Counting
• Reverse cache
• Top 10 lists
• Last Index
• Relational list/Message Queue
• Fast transactions w/ Lua
12
Object graph:
• Per user (Sorted Set w/ timestamp as score)
store the users followed (explicit+implicit)
store the user’s followers (explicit+implicit)
• Per board
Redis Hash for storing explicit followers
Redis Set for storing explicit unfollowers
13
Stack Overflow
Three levels of cache:
• Local cache (no persistence)
sessions, and pending view count updates
• Site cache
hot question id lists, users acceptance rates..
• Global cache
Inboxes, API usage quotas, …
14
GitHub
• Redis is used for routing info
• Matching user repositories to server names
15
HipChat
• Which users are in which room
• Who is online
• XMPP server balancing
16
YouPorn
Most data is found in Hashes with ordered Sets used to know what data to show
(1) ZINTERSTORE on:
{videos:filters:release}{videos:filters:orientation:straight}
{videos:filters:categories(id)}{videos:ordering:rating}
(2) Perform a ZRANGE to get the pages we want and get the list of video_ids back
(3) Start pipelining to get all the videos from Hashes
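The three steps above can be sketched as follows. This is a minimal illustration, not the site's actual code. The function name, the `dest_key` default and the `video:<id>` hash naming are hypothetical, and `client` is assumed to be a redis-py style client created with `decode_responses=True` so that ids come back as strings.

```python
def fetch_video_page(client, filter_keys, page, page_size=20,
                     dest_key="videos:tmp:result"):
    """Return the video hashes for one page of a filtered, ordered listing."""
    # (1) Intersect the filter sets with the ordering sorted set.
    client.zinterstore(dest_key, filter_keys)
    # (2) Read one page of video ids from the stored intersection.
    start = page * page_size
    video_ids = client.zrange(dest_key, start, start + page_size - 1)
    # (3) Pipeline the hash lookups so they share one round trip.
    pipe = client.pipeline()
    for video_id in video_ids:
        pipe.hgetall(f"video:{video_id}")  # hypothetical hash key naming
    return pipe.execute()
```

A call such as `fetch_video_page(client, ["videos:filters:release", "videos:ordering:rating"], page=0)` would return the first page of matching video hashes, ordered by the combined score of the intersected keys.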
17
Snapchat
• 500+ instances
• 15-50TB
• Running on GCE
400M messages/day
18
Why Redis Labs ?
19
Infinite seamless scalability
True high-availability
Stable top performance
Zero management
Users choose us because..
Dynamic Clustering Technology
Zero-latency proxy
Cluster manager
In-Memory Node
Cross-shard processor
In-Memory Cluster
+
21
Challenge #1
How to serve users from the same data-center?
4 clouds / 10 regions
18 data-centers / 30 clusters
24
AWS zones mapping dilemma
Redis Labs User
us-east-1a us-east-1c
us-east-1b
us-east-1c us-east-1e
us-east-1d us-east-1a
us-east-1e us-east-1b
25
Eric Hammond’s post on: Matching EC2 Availability Zones Across AWS Accounts
How did we solve it
26
How did we solve it
Redis Labs
User
27
Challenge #2
Which instance type shall we use for our cluster?
28
Various instance types in the same cluster
• High load scenarios
• High memory usage scenarios
• New generation of instances
Dedicated instances
As cheap as possible
Cluster’s node requirements
29
Adrian Cockcroft's Blog - Understanding and using Amazon EBS - Elastic Block Store
• use large instances and get dedicated instances for free
The tip
30
What we use today
C3 & R3
A4/5/6/7
n1-standard, n1-highmem, n1-highcpu
BM+VM
31
Challenge #3
How to manage data-persistence with high volumes of ‘writes’ and slow cloud storage?
32
Ephemeral vs. Persistent storage
Ephemeral: direct attached, “fast”
Persistent (EBS/Cloud Drive/Persistent Disk/SAN): network attached, slow
33
Adrian’s blog: use the larger EBSes if you want speed
Google (GCP): “Larger volumes can achieve higher I/O levels than smaller volumes”
The tips
34
We use large volumes (1TB+)
We use both ephemeral and persistent storage
We improved/tuned/optimized the Redis persistent storage interface
If replication is enabled, slave writes to disk
We don’t use PIOPS
What we do
35
Why not PIOPS
36
Challenge #4
How to monitor 50K+ databases, 30+ clusters and hundreds of nodes?
37
Zabbix (not Nagios) - per node metrics
Limbic (home made) - databases’ metrics
• 50K (databases) x 100+ (metrics) x 10K+ (time resolutions)
• Based on Python, RRD, Redis
Redis adminUI – cluster configuration
Monitoring
38
Team/Method/Spirit
39
Team/Method/Spirit
Tiny devops team
Core dev. team knows ops (very well)
Baby steps, especially in production
The practical approach always wins
Review your plans every 3 months
40
We are hiring!
41
Thank You
42
Why is Redis efficient?
Many data-structures
Many cool commands (atomicity maintained)
Complexity aware
43
Think data-structure
• Strings
• Hashes
• Lists
• Sets
• Sorted Sets
• HyperLogLogs
44
Cool commands
• SET if it doesn’t exist – O(1)
• Blocking POP (with timeout) – O(1)
• (blocking) POP from one list, PUSH to another – O(1)
• Get/Set string ranges (and bit operation) – O(N)
• Union/Intersect/Ranges of SETs – O(N)+O(Mxlog(M))
• Pub/Sub – O(1)/O(M)/O(M+N)
• Lua / Transactions / Pipelining
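Two of the commands listed on this slide combine naturally into a small job-queue pattern. The sketch below is illustrative only and was not shown in the talk. The function name and the key names (`jobs:lock`, `jobs:pending`, `jobs:processing`) are hypothetical, and `client` is assumed to be a redis-py style client exposing `set`, `brpoplpush` and `delete`.

```python
def claim_next_job(client, worker_id, timeout=5, lock_ttl=30):
    """Take one job from the pending list while holding a simple lock.

    Returns the job, or None if another worker holds the lock or no job
    arrived within `timeout` seconds.
    """
    # "SET if it doesn't exist": only one worker obtains the lock. The
    # expiry (an added safeguard) frees the lock if the worker crashes.
    if not client.set("jobs:lock", worker_id, nx=True, ex=lock_ttl):
        return None
    try:
        # "(blocking) POP from one list, PUSH to another": the job moves
        # to the processing list in a single atomic step.
        return client.brpoplpush("jobs:pending", "jobs:processing",
                                 timeout=timeout)
    finally:
        client.delete("jobs:lock")
```

Moving the job to a second list instead of simply popping it means a job is not lost if the worker fails before finishing it.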