RedisConf17 - Using Redis at Scale @ Twitter
Nighthawk
Distributed caching with Redis @ Twitter
Rashmi Ramesh (@rashmi_ur)
Agenda
What is Nighthawk?
How does it work?
Scaling out
High availability
Current challenges
Nighthawk - cache-as-a-service
Runs Redis at its core
> 10M QPS
Largest cluster runs ~3K Redis nodes
> 10TB of data
Who uses Nighthawk?
Some of our biggest customers:
Analytics services - Ads, Video
Ad serving
Ad Exchange
Direct Messaging
Mobile app conversion tracking
Design Goals
Scalable: scale vertically and horizontally
Elastic: add / remove instances without violating SLA
High throughput and low latencies
High availability in the event of machine failures
Topology agnostic client
Nighthawk Architecture
[Diagram: Client → Proxy/Routing layer → cache backends (Backend 0 … Backend N), each running Redis nodes (Redis 0 … Redis N), with a Topology / Cluster manager alongside. Each cache backend is a Mesos container holding the Redis nodes plus a topology watcher and announcer.]
Topology
Proxy/Router
Replica 1 -> Redis1 (dc, host, port1, capacity)
Replica 2 -> Redis2 (dc, host, port2, capacity)
Replica 3 -> Redis3 (dc, host, port3, capacity)
Cluster manager
Manages topology membership and changes:
- (Re)balances replicas
- Reacts to topology changes, e.g. a dead node
- Replicated cache: ensures 2 replicas of the same partition are on separate failure domains
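As a concrete illustration of that placement rule, here is a hypothetical Python sketch (not Twitter's implementation): the RedisNode fields mirror the (dc, host, port, capacity) descriptors from the topology slide, with an assumed rack field added for the failure-domain check, and place_replicas picks the least-loaded nodes for a partition while keeping its two replicas on different hosts and racks.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class RedisNode:
    dc: str
    host: str
    rack: str       # assumed failure-domain attribute, not shown on the slide
    port: int
    capacity: int   # max partitions this node can hold

def place_replicas(partition: int,
                   nodes: List[RedisNode],
                   load: Dict[RedisNode, int],
                   rf: int = 2) -> List[RedisNode]:
    """Pick `rf` nodes for a partition, least-loaded first, never reusing
    a host or rack that already holds a replica of this partition."""
    placement: List[RedisNode] = []
    used_hosts, used_racks = set(), set()
    for node in sorted(nodes, key=lambda n: load.get(n, 0)):
        if load.get(node, 0) >= node.capacity:
            continue
        if node.host in used_hosts or node.rack in used_racks:
            continue  # keep replicas in separate failure domains
        placement.append(node)
        used_hosts.add(node.host)
        used_racks.add(node.rack)
        if len(placement) == rf:
            break
    if len(placement) < rf:
        raise RuntimeError(f"not enough failure domains for partition {partition}")
    for node in placement:
        load[node] = load.get(node, 0) + 1
    return placement
```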
Redis databases for partitions
Partition -> Redis DB
Granular key remapping
Logical data isolation
Enumerating a partition: SCAN on its Redis DB
Deleting a partition: FLUSHDB
Enables replica rehydration
[Diagram: keys K1–K4 hash into Partition X and Partition Y, each stored in its own Redis DB (1 and 2).]
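A rough sketch of the "one Redis DB per partition" idea, under stated assumptions: the redis-py client, a stable CRC32-based partitioning function, and a Redis instance configured with enough numbered databases. The actual partitioning function and client library are not shown in the talk.

```python
import zlib
import redis  # redis-py, assumed client library for this sketch

NUM_PARTITIONS = 100  # per the scaling slides; needs `databases 100` (or more) in redis.conf

def partition_for(key: str) -> int:
    # Hypothetical, stable partitioning function; not the one Nighthawk uses.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def db_for_partition(host: str, port: int, partition: int) -> redis.Redis:
    # Each partition gets its own numbered Redis DB on whichever node hosts it,
    # which gives logical data isolation between partitions.
    return redis.Redis(host=host, port=port, db=partition)

def enumerate_partition(conn: redis.Redis):
    # Enumerating a partition is a SCAN over its dedicated DB.
    yield from conn.scan_iter(count=1000)

def drop_partition(conn: redis.Redis) -> None:
    # Deleting a partition (e.g. after it has moved elsewhere) is a single FLUSHDB.
    conn.flushdb()
```

Granular key remapping and replica rehydration then become per-DB operations (scan one DB and rewrite its keys elsewhere) rather than whole-instance scans.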
Scaling
Scaling out with Client/Proxy-managed partitioning
Key count: 1.5M keys
[Diagram: the client hashes keys directly across 3 cache nodes, ~500K keys each.]
Scaling out with Client/Proxy-managed partitioning
Key count: 1.5M keys
Remapped keys: 600K
[Diagram: after 2 nodes are added, the client re-hashes keys across 5 nodes (~300K each); the ~600K remapped keys must be refilled from persistent storage.]
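The 600K figure is roughly what an even redistribution predicts: when 1.5M keys go from 3 owners to 5, about 2/5 of them change owner. A minimal consistent-hash-ring sketch (hypothetical; Nighthawk's actual client/proxy hashing is not specified in the talk) shows the effect on a scaled-down key set:

```python
import bisect
import hashlib

def _h(s: str) -> int:
    # Stable hash for both ring points and keys.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring, for illustration only."""
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted((_h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key: str) -> str:
        i = bisect.bisect(self.points, _h(key)) % len(self.ring)
        return self.ring[i][1]

keys = [f"key:{i}" for i in range(150_000)]            # scaled-down sample of 1.5M
before = HashRing([f"node{i}" for i in range(3)])      # 3 cache nodes
after = HashRing([f"node{i}" for i in range(5)])       # 2 nodes added
moved = sum(1 for k in keys if before.node_for(k) != after.node_for(k))
print(f"{moved / len(keys):.0%} of keys change owner") # roughly 40%, i.e. ~600K of 1.5M
```

Every remapped key is a cache miss that has to be refilled from persistent storage, which is what makes this scheme expensive to scale.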
Scaling out with Cluster manager
Key count: 1.5M keys
Partition count: 100
Keys/partition: 15K
[Diagram: Client → Proxy → 3 cache backends (~500K keys each), with the topology and cluster manager coordinating placement and persistent storage behind the cache.]
Scaling out with Cluster manager
Key count: 1.5M keys
Partition count: 100
Keys/partition: 15K
[Diagram: a new backend is added; the cluster manager moves one partition (15K keys) off an existing backend (now 485K) onto the new backend.]
Scaling out with Cluster manager
Key count: 1.5M keys
Partition count: 100
Keys/partition: 15K
[Diagram: another partition (15K keys) moves to a second new backend; the backends now hold 485K, 485K, 500K, 15K and 15K keys.]
Scaling out with Cluster manager - post balancing
Key count: 1.5M keys
Partition count: 100
[Diagram: after balancing, the cluster manager has spread partitions across the backends (250K, 250K, 250K, 250K and 500K keys), with the client still talking only to the proxy.]
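One way to picture the cluster-manager approach, as a hypothetical sketch rather than Nighthawk's actual balancer: keys hash to one of 100 fixed partitions, partitions (not keys) are assigned to backends, and adding a backend only moves whole 15K-key partitions while clients keep talking to the same proxy.

```python
from collections import defaultdict
from typing import Dict, List

def rebalance(assignment: Dict[int, str], backends: List[str]) -> Dict[int, str]:
    """Greedily move partitions from the most-loaded to the least-loaded
    backend until partition counts are as even as possible."""
    new_assignment = dict(assignment)
    loads = defaultdict(list)
    for partition, backend in new_assignment.items():
        loads[backend].append(partition)
    for backend in backends:
        loads.setdefault(backend, [])   # a freshly added backend starts empty
    while True:
        heaviest = max(loads, key=lambda b: len(loads[b]))
        lightest = min(loads, key=lambda b: len(loads[b]))
        if len(loads[heaviest]) - len(loads[lightest]) <= 1:
            return new_assignment
        partition = loads[heaviest].pop()     # move one whole partition (~15K keys)
        loads[lightest].append(partition)
        new_assignment[partition] = lightest

# 100 partitions initially spread over 3 backends; adding "b3" triggers a
# rebalance that moves whole partitions only, invisible to the thin client.
assignment = {p: f"b{p % 3}" for p in range(100)}
assignment = rebalance(assignment, ["b0", "b1", "b2", "b3"])
```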
Advantages over Client-managed partitioning
- Thin client - simple and oblivious to topology
- Clients, proxy layer and backends scale independently
- Pluggable custom load balancing logic through cluster manager
- No cluster downtime during scaling out/up/back
High Availability
High Availability with Replication
Synchronous, best effort
RF = 2, intra-DC
Supports idempotent operations only: get, put, remove, count, scan
Copies of a partition are never on the same host or rack
Passive warming for failed/restarted replicas
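A minimal sketch of the "synchronous, best effort" write path under RF = 2, using redis-py and assumed names (this is not the actual proxy code): the write is issued to both replicas of the partition inline with the request, and a failure on one replica is logged and tolerated. Restricting the API to idempotent operations is what makes this safe, since a missed write can later be repaired simply by re-applying it.

```python
import logging
from typing import List

import redis  # redis-py, assumed client library for this sketch

def replicated_put(replicas: List[redis.Redis], key: str, value: str) -> int:
    """Write to every replica of the key's partition; return the number of acks.
    Best effort: a failed replica write is logged rather than retried in order."""
    acks = 0
    for conn in replicas:
        try:
            conn.set(key, value)
            acks += 1
        except redis.RedisError:
            logging.warning("best-effort replica write failed for key %s", key)
    if acks == 0:
        raise redis.RedisError(f"write failed on all replicas for key {key}")
    return acks
```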
High Availability with Replication
[Diagram: partition 5 is replicated across two pools. In Pool A, Backend 0 (partitions 2, 5, 9) is SERVING; in Pool B, Backend N (partitions 12, 5, 10) has FAILED and its replacement Backend N* is WARMING. "Get key in partition 5" requests are routed to the SERVING replica, while "set key in partition 5" writes continue to flow, passively warming the replacement replica.]
Current challenges
Remember this?
The most retweeted Tweet of 2014!
Hot key symptom
Significantly high QPS to a single cache server
Hot Key Mitigation
Server-side diagnostics:
Sampling a small % of requests and logging
Post-processing the logs to identify high-frequency keys
Client-side solution:
Client-side hot key detection and caching
Better to have:
Redis tracks the hot keys
Protocol support to send feedback to the client when a key is hot
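The sampling diagnostic can be sketched in a few lines (hypothetical code, not the production tooling): observe a small percentage of requests, count key frequencies, and treat the most frequent sampled keys as hot-key candidates.

```python
import random
from collections import Counter

SAMPLE_RATE = 0.01  # assumed: log ~1% of requests

class HotKeySampler:
    """Sample a small fraction of requests and surface high-frequency keys."""

    def __init__(self, sample_rate: float = SAMPLE_RATE):
        self.sample_rate = sample_rate
        self.counts = Counter()

    def observe(self, key: str) -> None:
        # Called on the request path; cheap because most requests are skipped.
        if random.random() < self.sample_rate:
            self.counts[key] += 1

    def hot_keys(self, top_n: int = 10):
        # The "post-processing" step: the top sampled keys are hot-key suspects.
        return self.counts.most_common(top_n)
```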
Active warming of replicas
[Diagram: in Pool A, Backend A (partitions 2, 5, 9) is SERVING; in Pool B, Backend B* (partitions 12, 5, 10) is WARMING. In addition to the normal write path through the proxy/routing layer and the topology/cluster manager, a Bootstrapper issues writes into the warming replica to bring it up to date actively.]
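Because each partition lives in its own Redis DB, a Bootstrapper could in principle rehydrate a warming replica by scanning the partition's DB on the serving replica and copying it across while writes keep flowing to both pools. A hypothetical redis-py sketch of that copy step (TTL handling omitted):

```python
import redis  # redis-py, assumed client library for this sketch

def rehydrate_partition(serving: redis.Redis, warming: redis.Redis) -> None:
    """Copy one partition (one Redis DB) from the serving replica into the
    warming replica using DUMP/RESTORE."""
    for key in serving.scan_iter(count=1000):
        payload = serving.dump(key)          # serialized value, any Redis type
        if payload is not None:              # key may have expired mid-scan
            warming.restore(key, 0, payload, replace=True)
```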
Questions?