Michael Kehoe, Brian Cory Sherwin
Couchbase at LinkedIn
2015
Overview
• The LinkedIn Story
• Development & Operations
• Operational Tooling
• LinkedIn’s Couchbase as a Database
• Questions
Michael Kehoe
• Site Reliability Engineer (SRE) at LinkedIn
• SRE for Profile & Higher-Education
• Member of CBVT
• B.E. (Electrical Engineering) from the University of Queensland, Australia
The LinkedIn Story
• Founded in 2002, LinkedIn has grown into the world’s largest professional social media network
• Offices in 24 countries, available in 23 languages
• Over 360M members
• Revenue of $638M in Q1 2015
In-Memory storage needs
• At our scale, it becomes challenging to scale data systems
• Read-scaling becomes important
• Applicable use-cases:
  • Simple cache store
    • Pre-warmed
    • Read through
  • Temporary data storage for de-duping
  • Potential for Source of Truth (SoT) store
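The pre-warmed and read-through cache use-cases can be sketched as follows. This is a minimal illustration, not LinkedIn's code: the class name `ReadThroughCache`, the `loader` callback, and the in-process dict standing in for the cache tier are all assumptions.

```python
from typing import Callable, Dict, Optional


class ReadThroughCache:
    """Read-through cache: on a miss, load from the backing store and keep a copy."""

    def __init__(self, loader: Callable[[str], Optional[str]]) -> None:
        self._loader = loader            # hypothetical call into the source-of-truth store
        self._data: Dict[str, str] = {}  # in-process dict standing in for the cache tier
        self.misses = 0                  # counts lookups that had to go to the loader

    def warm(self, entries: Dict[str, str]) -> None:
        """Pre-warm: bulk-load entries before serving traffic."""
        self._data.update(entries)

    def get(self, key: str) -> Optional[str]:
        if key in self._data:
            return self._data[key]
        self.misses += 1
        value = self._loader(key)
        if value is not None:
            self._data[key] = value      # later reads of this key are served from cache
        return value
```

A pre-warmed key never reaches the loader; any other key reaches it once and is served from the cache afterwards.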
Enter Couchbase
• Until 2012, we were only using Memcached as a non-SoT in-memory store
• However, it had some drawbacks:
  • Long cache warmup times
  • No partitioning/sharding; we had to write our own
  • Cold-cache restarts
  • Difficult to move data across hosts/clusters/datacentres
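To illustrate why hand-written partitioning makes data hard to move, here is a sketch of simple hash-modulo sharding. It is an assumed scheme for illustration only; the slides do not say how LinkedIn's own partitioning worked, and the function names are hypothetical.

```python
import hashlib
from typing import List


def shard_for(key: str, hosts: List[str]) -> str:
    """Pick a host by hashing the key and taking it modulo the host count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return hosts[int.from_bytes(digest[:8], "big") % len(hosts)]


def moved_fraction(keys: List[str], before: List[str], after: List[str]) -> float:
    """Fraction of keys whose owning host changes when the host list changes."""
    moved = sum(1 for k in keys if shard_for(k, before) != shard_for(k, after))
    return moved / len(keys)
```

With this scheme, growing a pool from four hosts to five remaps most keys, so most of the cache goes cold on every resize.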
• Evaluated systems to replace Memcached: Mongo, Redis, and others
• Couchbase had advantages
  • Drop-in replacement for Memcached
  • Built-in replication and cluster expansion
  • Memory latency for operations
  • Asynchronous writes to disk
  • Utilize some of the development infrastructure we’ve built
Coding
Development & Operations
• Memcached is configured with Spring and implements a Java caching interface
• Implemented with Couchbase Native Client
• Developer just replaces the Spring
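The idea of coding against a caching interface and swapping the backend through configuration can be sketched as below. The original is Java with Spring; this Python version is a stand-in, and every name in it (`Cache`, `MemcachedCache`, `CouchbaseCache`, `build_cache`) is hypothetical. Both backends are dict-backed placeholders, not real clients.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Type


class Cache(ABC):
    """The caching interface that application code depends on."""

    @abstractmethod
    def get(self, key: str) -> Optional[str]:
        ...

    @abstractmethod
    def put(self, key: str, value: str) -> None:
        ...


class DictBackedCache(Cache):
    """Dict-backed stand-in; a real backend would call its client library."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value


class MemcachedCache(DictBackedCache):
    """Hypothetical placeholder for the Memcached-backed implementation."""


class CouchbaseCache(DictBackedCache):
    """Hypothetical placeholder for the Couchbase-backed implementation."""


# Stand-in for the wiring configuration: the only place a backend is named.
BACKENDS: Dict[str, Type[Cache]] = {
    "memcached": MemcachedCache,
    "couchbase": CouchbaseCache,
}


def build_cache(backend: str) -> Cache:
    return BACKENDS[backend]()


def remember_headline(cache: Cache, member_id: int, headline: str) -> Optional[str]:
    """Application code: unaware of which backend it was given."""
    key = f"headline:{member_id}"
    cache.put(key, headline)
    return cache.get(key)
```

Switching from `build_cache("memcached")` to `build_cache("couchbase")` changes no application code, which is the property the slide describes.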
Operations
• Hadoop jobs build warm cache data
• Tools to partition the data and load into Couchbase offline
• Apply deltas when brought on-line
• Clean, warm caches ready when needed
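The offline flow above (partition a snapshot, load it, then apply deltas when the cache comes on-line) might look roughly like this. It is a sketch under assumptions: the CRC32-based partitioning, the `Delta` shape, and the plain dict standing in for a cache are illustrative choices, not the actual tooling.

```python
import zlib
from typing import Dict, Iterable, List, Optional, Tuple

# (key, new value); a value of None means the key was deleted after the snapshot.
Delta = Tuple[str, Optional[str]]


def partition(snapshot: Dict[str, str], num_partitions: int) -> List[Dict[str, str]]:
    """Split a snapshot into partitions that can be loaded independently."""
    parts: List[Dict[str, str]] = [{} for _ in range(num_partitions)]
    for key, value in snapshot.items():
        parts[zlib.crc32(key.encode("utf-8")) % num_partitions][key] = value
    return parts


def apply_deltas(cache: Dict[str, str], deltas: Iterable[Delta]) -> None:
    """Replay changes made since the snapshot, in order, onto the loaded cache."""
    for key, value in deltas:
        if value is None:
            cache.pop(key, None)
        else:
            cache[key] = value
```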
Operational Tooling
• To use Couchbase efficiently as SREs, we need the following:
  • Provisioning
  • Installation
  • Monitoring & Alerting
  • Infrastructure Visibility
Provisioning
• Provisioning flow
  • Seek estimated usage statistics on cluster
    • Size of data to be stored
    • QPS
    • Redundancy needs
  • Calculate cluster sizing
    • Currently done via a spreadsheet with a template
    • Moving into an in-house application
  • Request hardware for cluster(s)
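A sizing calculation driven by the three inputs above (data size, QPS, redundancy) could be sketched like this. The formula and every default (`ram_per_node_gb`, `usable_ram_fraction`, `qps_per_node`) are assumptions for illustration; the slides do not show the spreadsheet's actual model.

```python
import math


def estimate_nodes(
    data_gb: float,
    qps: float,
    replicas: int,
    ram_per_node_gb: float = 64.0,     # assumed node size
    usable_ram_fraction: float = 0.7,  # assumed headroom for metadata and overhead
    qps_per_node: float = 20000.0,     # assumed per-node throughput
) -> int:
    """Return a node count that satisfies memory, throughput, and redundancy."""
    # Each replica is another full copy of the data held in memory.
    ram_needed_gb = data_gb * (1 + replicas) / usable_ram_fraction
    nodes_for_ram = math.ceil(ram_needed_gb / ram_per_node_gb)
    nodes_for_qps = math.ceil(qps / qps_per_node)
    # A copy and its replicas must live on different nodes.
    return max(nodes_for_ram, nodes_for_qps, replicas + 1)
```

For example, 100 GB of data with one replica and 50,000 QPS is memory-bound under these assumptions and comes out at five nodes.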
Installation
• Current system
  • Enter cluster metadata into our management system (Yahoo range)
  • Use a SALT module to install & configure the cluster
• Future system
  • Use the same metadata system
  • Use SALT States to install and configure the cluster
• Benefits of the new system
  • ‘State enforcement’ becomes possible
  • Use SALT Pillars to encrypt cluster/bucket passwords
CLUSTER:
  - ela4.couchbase.30
  - prod-lva1.couchbase.30
  - prod-ltx1.couchbase.30
NAME: follow-blue
PORT: 11211
INSTANCE: 30
ALERT_ADDRESSES:
  - q([email protected])
SRE_GROUPS:
  - sre-team-name
CLIENT_CONTAINERS:
  - following-services
EMAIL_ALERTS:
  - HIGHWATER_PERCENT_FULL
  - MEMORY_PERCENT_FULL
  - NOT_MY_VBUCKET
  - PERCENT_IN_MEMORY
  - KEY_USAGE
  - AUTOFAILOVER
Monitoring & Alerting
• We run a daemon on each Couchbase Server that collects metrics every minute via a Couchbase Library API
• Use cluster metadata from range to build dashboard definition file via Jinja template & Python
$ ./couchbase.py -I 30
[INFO] Generating dashboard file: common-templates/couchbase.follow-blue
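Generating a dashboard definition from cluster metadata can be sketched as below. The talk describes a Jinja template; this version swaps in the standard-library `string.Template` so it runs without third-party packages. The template layout and the placeholder names (`name`, `instance`, `metric`, `title`) are assumptions, as is the function name `render_dashboard`.

```python
from string import Template
from typing import Dict

# Assumed layout, modelled on the shape of the dashboard definition in the slides.
DASHBOARD_TEMPLATE = Template(
    "- title: couchbase.${name} ${title}\n"
    "  defs:\n"
    '    - range: "%{FABRIC}.couchbase.${instance}"\n'
    '      label: "${metric}"\n'
    "      rrd: couchbase.${name}/${metric}.rrd\n"
)


def render_dashboard(cluster: Dict[str, object], metric: str, title: str) -> str:
    """Fill the template from one cluster metadata record (NAME, INSTANCE)."""
    return DASHBOARD_TEMPLATE.substitute(
        name=cluster["NAME"],
        instance=cluster["INSTANCE"],
        metric=metric,
        title=title,
    )
```

Calling `render_dashboard({"NAME": "follow-blue", "INSTANCE": 30}, "autofailover_enabled", "AutoFailover Enabled")` yields one graph stanza per metric; a real generator would loop over every metric and alert type in the metadata.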
- title: couchbase.follow-blue AutoFailover Enabled
  defs:
    - range: "%{FABRIC}.couchbase.30"
      label: "autofailover_enabled"
      rrd: couchbase.follow-blue/autofailover_enabled.rrd
  params:
    vlabel: 'enabled_boolean'
  autoalerts:
    zones: ['COUCHBASE-SLA2']
    enabled-fabrics: ['ela4', 'prod-lva1', 'prod-ltx1']
    processor: 'ingraphs'
    filter-type: 'ingraphs_filter'
    contacts: ['[email protected]']
    state-check: threshold
    state-check-args:
      min: 1.0
      consecutive-events: 10
    alert-plugin: emailer
    alert-plugin-args:
      recipients: ['[email protected]']
      interval: 3600
      include-definition: True
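One plausible reading of the `state-check: threshold` settings (`min: 1.0`, `consecutive-events: 10`) is "alert once the metric has been below the minimum for ten samples in a row". The sketch below implements that reading; the semantics are an assumption, and `should_alert` is a hypothetical name, not part of the alerting system.

```python
from typing import Iterable


def should_alert(samples: Iterable[float], minimum: float, consecutive_events: int) -> bool:
    """True once `consecutive_events` samples in a row fall below `minimum`."""
    streak = 0
    for value in samples:
        # Any healthy sample resets the streak, so brief dips do not alert.
        streak = streak + 1 if value < minimum else 0
        if streak >= consecutive_events:
            return True
    return False
```

With metrics collected every minute, ten consecutive events means auto-failover would have to be reported as disabled for about ten minutes before an email goes out.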
Management
• We want to see a world-view of all the clusters that we run
• Having bucket-, cluster- and server-level statistics is useful
• Having a view of who owns each cluster/bucket is useful
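An ownership view can be derived from the same cluster metadata used for installation. The sketch below groups bucket names by SRE group; it assumes records shaped like the metadata shown earlier (`NAME`, `SRE_GROUPS`), and the function name `buckets_by_owner` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def buckets_by_owner(records: Iterable[Dict[str, object]]) -> Dict[str, List[str]]:
    """Map each SRE group to the sorted list of buckets it owns."""
    owners: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        for group in record["SRE_GROUPS"]:
            owners[group].append(record["NAME"])
    return {group: sorted(names) for group, names in owners.items()}
```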
Conclusions
• Couchbase fits into our existing infrastructure
• We have good management and monitoring of the clusters
• Rich set of tooling we extended for our environment
• Starting to expand our use from a cache to a store for internal tooling
Brian Cory Sherwin, Site Reliability Engineer
LinkedIn’s Couchbase as a Database
The Agenda
• Our use case and requirements
• Why we chose Couchbase vs MySQL
• Pitfalls encountered
Couchbase @ LinkedIn
Memcached replacement
• Data resiliency
• Maintenance friendly
Using Couchbase as a Workflow Backend
AutoRemediation!
A job execution platform to remediate operational issues
• Database backend for state tracking of a workflow engine
Our Requirements
• Easy JSON documents
• Rapid iteration
• Horizontally scalable
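To make the "easy JSON documents" and "rapid iteration" requirements concrete, here is a sketch of workflow state held as a JSON document. The field names (`type`, `id`, `state`, `steps`) and both helper functions are hypothetical; the talk does not show the real document layout.

```python
import copy
import json
from typing import Any, Dict, List


def new_workflow(workflow_id: str, steps: List[str]) -> Dict[str, Any]:
    """Build the initial state document for a workflow."""
    return {
        "type": "workflow",
        "id": workflow_id,
        "state": "pending",
        "steps": [{"name": s, "state": "pending"} for s in steps],
    }


def mark_step(doc: Dict[str, Any], name: str, state: str) -> Dict[str, Any]:
    """Return a copy of the document with one step updated and the overall state recomputed."""
    updated = copy.deepcopy(doc)
    for step in updated["steps"]:
        if step["name"] == name:
            step["state"] = state
    states = {step["state"] for step in updated["steps"]}
    if states == {"done"}:
        updated["state"] = "done"
    elif "failed" in states:
        updated["state"] = "failed"
    else:
        updated["state"] = "running"
    return updated
```

Adding a new field later (say, a retry counter) is just another key in the dict; the document still round-trips through `json.dumps` and `json.loads` with no schema migration.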
Why Couchbase?
Couchbase as a database
• Document store
• Views for indexing
• Data resiliency
• Replication
• Simplicity
Why not MySQL?
• Upfront cost in creating the schema
• Rapidly changing documents
  • Number of columns
• Consistent incremental updates
Pitfalls using Couchbase
• ACID implications
  • Durability and Consistency
• Concurrency
• Different and new tech
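The concurrency pitfall is commonly handled with optimistic locking: Couchbase documents carry a CAS (compare-and-swap) value, and a write that supplies a stale CAS is rejected. The sketch below shows the pattern against an in-memory stand-in. `InMemoryStore`, `CasMismatch`, and `update_with_retry` are hypothetical names and this is not the Couchbase SDK; the talk does not say how the workflow engine dealt with concurrent writers.

```python
from typing import Callable, Dict, Tuple


class CasMismatch(Exception):
    """Raised when the document changed since it was read."""


class InMemoryStore:
    """Dict-backed stand-in for a document store that tracks a CAS value per key."""

    def __init__(self) -> None:
        self._docs: Dict[str, Tuple[dict, int]] = {}

    def insert(self, key: str, doc: dict) -> int:
        self._docs[key] = (dict(doc), 1)
        return 1

    def get(self, key: str) -> Tuple[dict, int]:
        doc, cas = self._docs[key]
        return dict(doc), cas

    def replace(self, key: str, doc: dict, cas: int) -> int:
        _, current = self._docs[key]
        if cas != current:
            raise CasMismatch(key)  # someone else wrote first
        self._docs[key] = (dict(doc), current + 1)
        return current + 1


def update_with_retry(
    store: InMemoryStore,
    key: str,
    mutate: Callable[[dict], dict],
    attempts: int = 5,
) -> dict:
    """Read, modify, and write back; re-read and retry if another writer won."""
    for _ in range(attempts):
        doc, cas = store.get(key)
        updated = mutate(doc)
        try:
            store.replace(key, updated, cas)
            return updated
        except CasMismatch:
            continue
    raise CasMismatch(key)
```

The retry loop re-reads the latest document each time, so a lost race costs one extra round trip rather than silently overwriting another worker's update.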
Questions?
If you want to learn more about AutoRemediation:
http://www.meetup.com/Auto-Remediation-and-Event-Driven-Automation/