Monitoring and Scaling Redis at Datadog - Ilan Rabinovitch, Datadog
![Page 1: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/1.jpg)
Redis at Datadog
RedisConf 2016, San Francisco, CA
Ilan Rabinovitch, Director, Community, Datadog
![Page 2: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/2.jpg)
$ finger ilan@datadog
[datadoghq.com]
Name: Ilan Rabinovitch
Role: Director, Technical Community
Interests:
* Open Source
* Web Operations & Infra Automation
* Monitoring and Metrics
* FL/OSS Community Events
![Page 3: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/3.jpg)
$ cat ~/.plan
1. An Overview of Datadog
2. Monitoring 101
3. How Datadog uses Redis
4. Key Metrics and Examples
![Page 4: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/4.jpg)
Datadog Overview
• Infrastructure and app monitoring as a service
• Open source agent
• Time series data (metrics and events)
• Processing nearly a trillion data points per day
• Powered by Redis!
• We're hiring! (www.datadoghq.com/careers/)
![Page 5: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/5.jpg)
Monitor Everything
Operating Systems, Cloud Providers (AWS), Containers, Web Servers, Datastores, Caches, Queues and more...
![Page 6: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/6.jpg)
![Page 7: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/7.jpg)
![Page 8: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/8.jpg)
Redis Use Cases
• Queueing
• Caching
• Session State
![Page 9: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/9.jpg)
Monitoring 101
![Page 10: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/10.jpg)
![Page 11: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/11.jpg)
Follow @honest_update on Twitter
![Page 12: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/12.jpg)
![Page 13: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/13.jpg)
![Page 14: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/14.jpg)
Collecting data is cheap; not having it when you need it can be expensive
![Page 15: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/15.jpg)
Instrument all the things!
![Page 16: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/16.jpg)
Operational Complexity Increases with...
• Number of things to measure
• Velocity of change
![Page 17: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/17.jpg)
How much do we measure?
1 instance
• 10 metrics from CloudWatch
1 operating system (e.g., Linux)
• 100 metrics
1 Redis instance
• ~50 metrics
![Page 18: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/18.jpg)
![Page 19: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/19.jpg)
Operational Complexity
100 instances
400 containers
![Page 20: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/20.jpg)
Operational Complexity: Scale
160 metrics per host
640 metrics per host
Assuming 4 containers per host
![Page 21: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/21.jpg)
Operational Complexity: Scale
100 instances
64,000 metrics
Assuming 4 containers per host
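The arithmetic behind these scale slides can be reproduced in a few lines. This is only a sketch that follows the deck's own simplification (each of the 4 containers is counted as carrying the full 160-metric set); the constant and variable names are illustrative and do not come from the talk.

```python
# Per-source metric counts quoted on the slides.
CLOUDWATCH_METRICS = 10   # per instance
OS_METRICS = 100          # per operating system
REDIS_METRICS = 50        # roughly, per Redis instance

CONTAINERS_PER_HOST = 4   # the slides' stated assumption
INSTANCES = 100

# 10 + 100 + 50 metrics for one instance running one Redis.
metrics_per_unit = CLOUDWATCH_METRICS + OS_METRICS + REDIS_METRICS

# The deck multiplies that figure by the number of containers per host.
metrics_per_host = metrics_per_unit * CONTAINERS_PER_HOST

# And then by the number of instances.
total_metrics = metrics_per_host * INSTANCES

print(metrics_per_unit, metrics_per_host, total_metrics)
```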
![Page 22: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/22.jpg)
How much do we measure?
1 instance
• 10 metrics from CloudWatch
1 operating system (e.g., Linux)
• 100 metrics
N containers, ~50 metrics per application
• 150*N metrics
Metrics Overload!
![Page 23: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/23.jpg)
Operational Complexity Increases with...
• Number of things to measure
• Velocity of change
![Page 24: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/24.jpg)
Source: Datadog
![Page 26: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/26.jpg)
Operational Complexity Increases with...
• Number of things to measure
• Velocity of change
![Page 27: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/27.jpg)
![Page 28: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/28.jpg)
![Page 29: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/29.jpg)
More Details at: http://www.datadoghq.com/blog/monitoring-101-alerting/
![Page 30: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/30.jpg)
Finding Signal - Categorizing Your Metrics
![Page 31: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/31.jpg)
![Page 32: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/32.jpg)
![Page 33: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/33.jpg)
![Page 34: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/34.jpg)
![Page 35: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/35.jpg)
![Page 36: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/36.jpg)
Examples: Web Application
Work Metrics:
• Requests Per Second
• Dropped Connections
• Request Response Time
• Error Rates (4xx or 5xx)
• Success (2xx)
Resource Metrics:
• Disk I/O
• Network Bandwidth
• Memory
• CPU
• Queue Length
![Page 37: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/37.jpg)
Examples: Web Application - Events
Work Metrics:
• Configuration Change
• Code Deployment / Release
• Add / Remove Nodes
• Cache Purge
![Page 38: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/38.jpg)
When to let a sleeping engineer lie?
![Page 39: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/39.jpg)
When to alert?
![Page 40: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/40.jpg)
Recurse until you find root cause
![Page 41: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/41.jpg)
The Life of a Metric
![Page 42: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/42.jpg)
Monitor Everything
![Page 43: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/43.jpg)
![Page 44: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/44.jpg)
Metrics
• Billions of data points per day
• Time series data (time stamps and values)
• Millions of events per day (text)
• Metadata
• 1 second resolution
• Stored at full resolution for 13 months
![Page 45: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/45.jpg)
Primarily
ElasticSearch
![Page 46: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/46.jpg)
Examples: Events
![Page 47: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/47.jpg)
A metric is born…
• Open Source Python agent
• SDKs, Libraries, RESTful APIs
![Page 48: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/48.jpg)
A metric is born…
• Open Source Python agent
• SDKs, Libraries, RESTful APIs
• SaaS Integrations
![Page 49: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/49.jpg)
{
  "series": [
    {
      "metric": "<metric-name>",
      "points": [[<timestamp>, <value>]],
      "type": "gauge",
      "host": "<host name>",
      "tags": ["<tags>"]
    }
  ]
}
![Page 50: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/50.jpg)
![Page 51: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/51.jpg)
Tags All the Way Down
![Page 52: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/52.jpg)
![Page 53: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/53.jpg)
Asking Better Questions
“Monitor all containers running image web in region us-west-2 across all availability zones that use more than 1.5x the average memory on c3.xlarge”
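A question like this one comes down to filtering on tags and then comparing each member of the group with the group's average. The sketch below shows that idea on made-up data; the sample readings, the tag spellings, and the `select` helper are all assumptions for illustration, not Datadog's query language.

```python
from statistics import mean

# Hypothetical samples: one memory reading per container, with "key:value" tags.
samples = [
    {"container": "web-1", "mem_mb": 400,
     "tags": {"image:web", "region:us-west-2", "availability-zone:us-west-2a", "instance-type:c3.xlarge"}},
    {"container": "web-2", "mem_mb": 420,
     "tags": {"image:web", "region:us-west-2", "availability-zone:us-west-2b", "instance-type:c3.xlarge"}},
    {"container": "web-3", "mem_mb": 380,
     "tags": {"image:web", "region:us-west-2", "availability-zone:us-west-2c", "instance-type:c3.xlarge"}},
    {"container": "web-4", "mem_mb": 1200,
     "tags": {"image:web", "region:us-west-2", "availability-zone:us-west-2a", "instance-type:c3.xlarge"}},
    {"container": "worker-1", "mem_mb": 2000,
     "tags": {"image:worker", "region:us-west-2", "availability-zone:us-west-2a", "instance-type:c3.xlarge"}},
    {"container": "web-5", "mem_mb": 1500,
     "tags": {"image:web", "region:us-east-1", "availability-zone:us-east-1a", "instance-type:c3.xlarge"}},
]


def select(samples, required_tags):
    """Keep only the samples that carry every required tag."""
    return [s for s in samples if required_tags <= s["tags"]]


# No availability-zone tag is required, so every zone in the region is included.
scope = select(samples, {"image:web", "region:us-west-2", "instance-type:c3.xlarge"})
average = mean(s["mem_mb"] for s in scope)
outliers = [s["container"] for s in scope if s["mem_mb"] > 1.5 * average]

print(average, outliers)
```

Because the scope is defined by tags rather than host names, containers that start or stop later join or leave the group without the query changing.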
![Page 54: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/54.jpg)
Asking Better Questions
“90% of all web requests are taking more than 0.5s to process and respond.”
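A statement of this kind is a fraction computed over individual request timings rather than a simple average. A minimal sketch, assuming a hypothetical list of request durations in seconds and an illustrative 90% alert condition:

```python
# Hypothetical request durations, in seconds.
durations = [0.62, 0.71, 0.55, 0.90, 0.48, 0.66, 0.81, 0.59, 1.20, 0.73]

THRESHOLD_S = 0.5      # the "more than 0.5s" part of the question
ALERT_FRACTION = 0.9   # the "90% of all web requests" part

# Count the requests that took longer than the threshold.
slow = sum(1 for d in durations if d > THRESHOLD_S)
slow_fraction = slow / len(durations)

should_alert = slow_fraction >= ALERT_FRACTION
print(slow, slow_fraction, should_alert)
```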
![Page 55: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/55.jpg)
{
  "series": [
    {
      "metric": "system.cpu.system",
      "points": [[419707187, 5]],
      "type": "gauge",
      "host": "test.example.com",
      "tags": ["environment:prod"]
    }
  ]
}
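For illustration, a payload with this shape can be assembled and serialized with the Python standard library. The helper name `build_series` is hypothetical, and the sketch stops at producing the JSON body; it does not show how the body is submitted.

```python
import json
import time


def build_series(metric, value, host, tags, timestamp=None, metric_type="gauge"):
    """Build a one-point payload in the shape shown on the slide."""
    if timestamp is None:
        timestamp = int(time.time())  # seconds since the epoch
    return {
        "series": [
            {
                "metric": metric,
                "points": [[timestamp, value]],
                "type": metric_type,
                "host": host,
                "tags": list(tags),
            }
        ]
    }


payload = build_series(
    "system.cpu.system", 5, "test.example.com",
    ["environment:prod"], timestamp=419707187,
)
body = json.dumps(payload, indent=2)
print(body)
```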
![Page 56: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/56.jpg)
Intake APIs
• Constant stream of data
• Low latency requirements (30-60s)
• Data needed by multiple consumers/systems.
![Page 57: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/57.jpg)
![Page 58: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/58.jpg)
Redis Use Case #1: Queuing
![Page 59: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/59.jpg)
Why Redis?
• Easy to scale vertically
• Simple push/pop interaction for queues
• Simple clustering via twemproxy
![Page 60: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/60.jpg)
Work Metrics - Queuing
• Queue Depth
• Message Latency
• Cmd Latency
• Read vs Write Calls per sec
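Two of these work metrics, queue depth and message latency, fall out of the push/pop pattern directly. The sketch below is an assumption about how such a queue could be instrumented, not a description of Datadog's pipeline: producers stamp each message at enqueue time, consumers subtract that stamp on dequeue, and depth is the list length. `InMemoryListClient` is a stand-in written for this example so the code runs without a server; it mimics only the LPUSH, RPOP and LLEN commands.

```python
import json
import time
from collections import deque


class InMemoryListClient:
    """Stand-in that mimics the three Redis list commands used below."""

    def __init__(self):
        self._lists = {}

    def lpush(self, name, *values):
        items = self._lists.setdefault(name, deque())
        for value in values:
            items.appendleft(value)  # push onto the head of the list
        return len(items)

    def rpop(self, name):
        items = self._lists.get(name)
        return items.pop() if items else None  # pop from the tail

    def llen(self, name):
        return len(self._lists.get(name, ()))


def enqueue(client, queue, body, now=None):
    """Push a message that carries its own enqueue timestamp."""
    stamp = time.time() if now is None else now
    client.lpush(queue, json.dumps({"enqueued_at": stamp, "body": body}))


def dequeue(client, queue, now=None):
    """Pop the oldest message; return (body, seconds spent in the queue)."""
    raw = client.rpop(queue)
    if raw is None:
        return None, None
    message = json.loads(raw)
    now = time.time() if now is None else now
    return message["body"], now - message["enqueued_at"]


client = InMemoryListClient()
enqueue(client, "intake", "point-1", now=100.0)
enqueue(client, "intake", "point-2", now=101.0)

depth = client.llen("intake")                       # queue depth
body, latency = dequeue(client, "intake", now=103.5)  # message latency
print(depth, body, latency)
```

With redis-py installed, a `redis.Redis()` client offers `lpush`, `rpop` and `llen` methods with the same call shapes, so it could be passed in place of the stand-in.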
![Page 61: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/61.jpg)
Work Metrics - Queuing
![Page 62: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/62.jpg)
Work Metrics - Queuing
![Page 63: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/63.jpg)
Resource Metrics - Queuing
• Network Utilization
• used_memory
• Disk IO
• Disk Space
• connected_clients
• keyspace
• rejected_connections
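Several of these names are fields in the output of the Redis `INFO` command, which is plain text made of `# Section` headers and `key:value` lines. The sketch below parses a hand-written sample in that format; the sample values are invented, and `parse_info` and `keyspace_keys` are hypothetical helper names.

```python
# Hand-written sample in the INFO text format; the values are made up.
SAMPLE_INFO = """\
# Clients
connected_clients:42
# Memory
used_memory:1048576
# Stats
rejected_connections:0
evicted_keys:7
# Keyspace
db0:keys=1200,expires=300,avg_ttl=0
"""


def parse_info(text):
    """Turn INFO text into a flat dict of field name -> string value."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers
        key, _, value = line.partition(":")
        info[key] = value
    return info


def keyspace_keys(info):
    """Sum the key counts across all dbN entries in the Keyspace section."""
    total = 0
    for key, value in info.items():
        if key.startswith("db") and key[2:].isdigit():
            fields = dict(item.split("=", 1) for item in value.split(","))
            total += int(fields["keys"])
    return total


info = parse_info(SAMPLE_INFO)
print(info["used_memory"], info["connected_clients"], keyspace_keys(info))
```

Client libraries usually do this parsing already; redis-py's `info()` method, for example, returns a dictionary.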
![Page 64: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/64.jpg)
![Page 65: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/65.jpg)
Use Case #2 Caching
![Page 66: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/66.jpg)
Initial Architecture
• Cache Aside
• Single Redis per Worker
• Local Cache
• LRU Caching (allkeys-lru)
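In a cache-aside arrangement the application checks the cache first, falls back to the data store on a miss, and then writes the result into the cache itself. The sketch below illustrates that flow with a dictionary-backed stand-in; `DictCache`, `CacheAside` and the loader are hypothetical names, and eviction is left to the cache server (in Redis, `allkeys-lru` is a value of the `maxmemory-policy` setting).

```python
class DictCache:
    """Stand-in cache exposing get/set, the two calls used below."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class CacheAside:
    """Read through the cache; on a miss, load from the store and populate."""

    def __init__(self, cache, load_from_store):
        self.cache = cache
        self.load_from_store = load_from_store
        self.hits = 0
        self.misses = 0

    def get(self, key):
        value = self.cache.get(key)
        if value is not None:
            self.hits += 1
            return value
        self.misses += 1
        value = self.load_from_store(key)  # the slow path
        self.cache.set(key, value)         # populate for the next reader
        return value


store_calls = []


def load_from_store(key):
    """Pretend data store that records every lookup it serves."""
    store_calls.append(key)
    return f"value-for-{key}"


reads = CacheAside(DictCache(), load_from_store)
first = reads.get("host:42")
second = reads.get("host:42")
print(first, second, reads.hits, reads.misses, store_calls)
```

The hit and miss counters kept here are the raw inputs for a hit-to-miss ratio.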
![Page 67: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/67.jpg)
Dogpound
Source: http://bit.ly/1NoW6aj
• Uniform API for caching backends
• Pluggable Architecture
  • memory
  • redis
  • sharded redis
  • tiered caching
• Service Discovery via Consul
![Page 68: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/68.jpg)
Sharded Redis
HashRing
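A hash ring maps each key to a shard so that adding or removing a shard moves only a fraction of the keys. The class below is a minimal consistent-hashing sketch written for this illustration; it is not the `hash_ring` package's implementation, and the node names, the replica count, and the choice of SHA-256 are assumptions.

```python
import bisect
import hashlib


class SimpleHashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring at `replicas` pseudo-random points.
        points = []
        for node in nodes:
            for i in range(replicas):
                points.append((self._hash(f"{node}#{i}"), node))
        points.sort()
        self._points = points
        self._hashes = [h for h, _ in points]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def get_node(self, key):
        """Return the node owning the first ring point after the key's hash."""
        if not self._points:
            raise ValueError("ring has no nodes")
        index = bisect.bisect_right(self._hashes, self._hash(key))
        return self._points[index % len(self._points)][1]  # wrap around


ring = SimpleHashRing(["redis-a", "redis-b", "redis-c"])
keys = [f"key:{i}" for i in range(500)]
before = {k: ring.get_node(k) for k in keys}

# Add a fourth shard and see which keys change owner.
bigger = SimpleHashRing(["redis-a", "redis-b", "redis-c", "redis-d"])
after = {k: bigger.get_node(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys moved")
```

Every key that moves lands on the new shard; keys never shuffle between the existing shards, which is what keeps most of a sharded cache warm during a resize.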
![Page 69: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/69.jpg)
Tiered Cache Redis
• Local Redis
• Shared Cache
• Tiered
• HA Pair
• Fallback to primary data store
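One way to read this slide is as an ordered lookup: try the local cache, then the shared cache, and only then the primary data store, filling the faster tiers on the way back. The sketch below shows that reading using plain dictionaries as tiers; `tiered_get` and the loader are hypothetical names, and the backfill policy is an assumption rather than something the slides state.

```python
def tiered_get(key, tiers, load_from_primary):
    """Look a key up tier by tier; return (value, index of the tier that hit).

    The index is None when the value had to come from the primary store.
    """
    for index, tier in enumerate(tiers):
        value = tier.get(key)
        if value is not None:
            # Backfill the faster tiers that missed.
            for faster in tiers[:index]:
                faster[key] = value
            return value, index
    value = load_from_primary(key)
    for tier in tiers:
        tier[key] = value
    return value, None


local_cache = {}                       # stands in for the local Redis
shared_cache = {"user:1": "alice"}     # stands in for the shared cache
primary_reads = []


def load_from_primary(key):
    """Pretend primary data store that records every read."""
    primary_reads.append(key)
    return f"loaded-{key}"


tiers = [local_cache, shared_cache]
print(tiered_get("user:1", tiers, load_from_primary))  # hit in the shared tier
print(tiered_get("user:1", tiers, load_from_primary))  # now a local hit
print(tiered_get("user:2", tiers, load_from_primary))  # falls back to primary
```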
![Page 70: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/70.jpg)
Work Metrics - Caching
• Cache Hit to Miss Ratio
• cmds / second
• Keys stored by host
• request latency
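The first two of these can be derived from counters that Redis reports in `INFO` (`keyspace_hits`, `keyspace_misses` and `total_commands_processed`). Since those are cumulative counters, a per-second rate needs two readings taken some interval apart. A small sketch, with hypothetical helper names and made-up readings:

```python
def hit_rate(keyspace_hits, keyspace_misses):
    """Fraction of lookups served from the cache (0.0 when there were none)."""
    total = keyspace_hits + keyspace_misses
    return keyspace_hits / total if total else 0.0


def rate_per_second(previous_count, current_count, interval_s):
    """Turn two readings of a cumulative counter into a per-second rate."""
    return (current_count - previous_count) / interval_s


# Made-up readings for illustration.
print(hit_rate(900, 100))                  # hits vs. misses
print(rate_per_second(10000, 16000, 60))   # commands per second over 60 s
```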
![Page 71: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/71.jpg)
Resource Metrics - Caching
• evicted_keys
• used_memory
• network utilization
• connected_clients
• keyspace
• utilization on backend data store
![Page 72: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/72.jpg)
![Page 73: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/73.jpg)
![Page 74: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/74.jpg)
Events
• Adding / Removing Nodes
• Incidents
• Code Deploys
• Config Changes
![Page 76: Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog](https://reader031.vdocument.in/reader031/viewer/2022021422/587c04c11a28ab7c668b74eb/html5/thumbnails/76.jpg)
Resources
Monitoring 101: Alerting https://www.datadoghq.com/blog/monitoring-101-alerting/
Monitoring 101: Collecting the Right Data https://www.datadoghq.com/blog/monitoring-101-collecting-data/
Monitoring 101: Investigating performance issues https://www.datadoghq.com/blog/monitoring-101-investigation/
The Power of Tagged Metrics https://www.datadoghq.com/blog/the-power-of-tagged-metrics/
Monitoring Redis: Collecting Performance Metrics https://www.datadoghq.com/blog/how-to-monitor-redis-performance-metrics/
HashRing https://pypi.python.org/pypi/hash_ring