lisa 2013 -- sysops-api -- leveraging in-memory key value stores for large scale operations with...
DESCRIPTION
https://github.com/linkedin/sysops-api
sysops-api is a framework designed to provide visibility into tens of thousands of machines in seconds. Instead of trying to SSH to remote machines to collect data (execute commands, grep through files), LinkedIn uses this framework to answer arbitrary questions about its infrastructure.
TRANSCRIPT
![Page 1: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/1.jpg)
Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine
Mike Svoboda
Staff Systems and Automation Engineer
www.linkedin.com/in/mikesvoboda
[email protected]
https://github.com/linkedin/sysops-api
![Page 2: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/2.jpg)
My Background with LinkedIn / CFEngine
Hired at LinkedIn into System Operations in 2010
When I started, our server count was 300 machines
Implemented CFEngine automation in 2010
Since then, we have grown to 100 times that size
Created our Redis API in 2012 to provide visibility
![Page 3: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/3.jpg)
What is Redis?
Redis is an in-memory key value store, similar to Memcached with additional features
Offers on-disk persistence (snapshots to disk) - You can use this as a real database instead of just a volatile cache
Offers simple data structures out of the box and commands to work with them natively: dictionaries, lists, sets, sorted sets, etc.
Highly scalable data store - A single Redis server can satisfy hundreds of thousands of requests per second
Supports transactions - Group commands together so they are executed as a single transaction
![Page 4: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/4.jpg)
What is CFEngine?
CFEngine:
Is an IT infrastructure automation framework that helps manage infrastructure throughout its lifecycle
Builds, deploys, and manages systems
Provides auditing
Maintains infrastructure by enforcing intended system state for compliance
Runs on the smallest embedded devices, servers, desktops, mainframes, and big iron
Easily supports tens of thousands of hosts
Provides horizontal scalability
![Page 5: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/5.jpg)
How CFEngine works
![Page 6: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/6.jpg)
CFEngine reduces operational costs
Using CFEngine automation is more effective than hiring additional headcount
Stop fighting fires every day
Allow operations to focus on tomorrow’s problems
Stay ahead of the curve
Keeping the lights on is automated
Respond to outages rapidly
![Page 7: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/7.jpg)
Why LinkedIn chose CFEngine
Very mature codebase
Not dependent on an underlying interpreter or runtime such as Ruby, Python, or Perl
Flexible architecture
Easily scale upwards to support thousands of machines
Just as simple to support smaller environments
Zero reported security vulnerabilities
Lightweight footprint
![Page 8: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/8.jpg)
What CFEngine has done for LinkedIn
Since implementing CFEngine:
Operations has become extremely agile
Quickly respond to and resolve outages
System administration workload has decreased, even with 100x the number of servers
Have built new datacenters in minutes with little effort
Real-time visibility after creating our Redis infrastructure, driven by CFEngine execution
Can answer any question imaginable about all of our servers in seconds
Know every action that happens on our machines
![Page 9: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/9.jpg)
How LinkedIn uses CFEngine
Functions we have automated:
Hardware failure detection
Account administration
Privilege escalation
Software deployment
O/S configuration management
Process / service management
System monitoring
You never need to log into a machine to manage it
![Page 10: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/10.jpg)
Two problems still existed for LinkedIn that automation didn’t address
The company wanted to be able to answer any question imaginable about production.
We didn’t want to break production by pushing new automation changes.
To solve both problems, we needed visibility.
![Page 11: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/11.jpg)
Problem #1: The company wants questions answered. STAT!
Management and engineers want questions answered immediately, and they ask several times a day, interrupting your work.
![Page 12: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/12.jpg)
LinkedIn was hunting for data
![Page 13: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/13.jpg)
What LinkedIn sysadmins were doing
Thousands of network connections were made to remote machines from a single host to fetch data.
Did I get results from everything?
Parse results after collection
• Questions about infrastructure were answered by sysadmins SSHing to machines to hunt for data.
• As our scale increased, we used a remote execution tool to parallelize some variant of SSH / DSH
![Page 14: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/14.jpg)
Forcing command execution on remote machines doesn’t scale
Machines were missed, data wasn’t collected
Firewalls mangled packets
SSHD offline or didn’t spawn on the remote host
Depended on system accounts being valid
Network connections failed to the remote machine
Data collection shouldn’t be complicated
Unsure if we were able to collect all of the necessary data.
![Page 15: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/15.jpg)
Problem #2: We didn’t want to break production by pushing new automation changes.
Ops was hesitant to use automation because they didn’t know where things would break
When automation was expanded, we didn’t know where systems needed alternative behavior to work correctly (or where they had been modified by developers with root access)
Ops had to be agile. We had to work fast. The business needed us to modify production multiple times a day, but we had to make changes without breaking it
![Page 16: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/16.jpg)
Automation changes were happening in the blind
Sysadmins were under pressure from:
large ticket queues
numerous change requests
business needs to scale
Automation changes were being performed without fully understanding the impact before that change was executed
We realized that this could lead to mistakes, disasters, outages, and pink slips. To keep this from happening, I built our Redis API to provide visibility.
![Page 17: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/17.jpg)
To provide visibility, we had to scale data collection
We had to build a reliable system that was extremely fast, which could give us results of remote command execution from tens of thousands of systems in seconds
Querying this data could not put load on production systems
The cache needed to be publicly available to the company via an API so they could answer their own questions
We needed to quickly add new data into the cache before pushing automation changes to view production impact.
![Page 18: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/18.jpg)
We built a cache and populated it with data to answer arbitrary questions
Instead of executing commands remotely, we have CFEngine populate the cache with commonly queried data
CFEngine executes expensive commands like lshw or dmidecode once and makes the output available for everybody to use
Data collection becomes a scheduled event that happens once a day - This data collection becomes a cost of doing business
With the same data being gathered on all machines, it becomes trivial to compare two or more pieces of hardware
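The "run it once, cache the output" idea can be sketched roughly as follows. This is an illustrative sketch, not the sysops-api code: the `collect` function, the `hostname#label` key layout, and the host name are assumptions. A plain dict stands in for a Redis server, and a harmless Python one-liner stands in for an expensive command such as dmidecode.

```python
import subprocess
import sys


def collect(cache, hostname, label, command):
    """Run a command once and store its output under a 'hostname#label' key."""
    output = subprocess.run(
        command, capture_output=True, text=True, check=True
    ).stdout
    key = f"{hostname}#{label}"
    cache[key] = output  # a real client would issue a Redis SET here
    return key


# A dict stands in for one Redis server (assumption for this sketch).
cache = {}

# Hypothetical stand-in for an expensive command such as dmidecode.
command = [sys.executable, "-c", "print('Vendor: ExampleCorp')"]
key = collect(cache, "host01.example.com", "dmidecode", command)
```

Once the output is in the cache, anyone who needs it reads the stored value instead of running the command on the host again.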
![Page 19: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/19.jpg)
Architecture of the Cache
Step 1: Rely on CFEngine execution to drive data insertion
Step 2: Shard your data
Step 3: Use software load balancing!
![Page 20: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/20.jpg)
Step 1: CFEngine drives data insertion
Leverage automation to change what you insert or remove from the cache
![Page 21: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/21.jpg)
The cache is a simple dictionary, sharded over multiple Redis servers.
![Page 22: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/22.jpg)
Step 2: Extract Sharded Data
Determine scope. How much data do I need to answer my question?
For each CFEngine policy server running Redis, search Redis for matching keys in the dictionary
For each key we find from a search, perform the relevant data extraction:
Contents
Md5sum
os.stat()
wordcount
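The search-then-extract loop above might look something like the sketch below. It assumes each shard is a dict standing in for one Redis server, that keys use a `hostname#path` layout, and that the search term is treated as a substring-style glob. The function names and sample data are hypothetical, and the os.stat() mode is left out for brevity.

```python
import fnmatch
import hashlib


def search_keys(shard, pattern):
    """Return the keys in one shard that match a glob-style search term."""
    return sorted(k for k in shard if fnmatch.fnmatchcase(k, f"*{pattern}*"))


def extract(shards, pattern, mode="contents"):
    """Search every shard, then apply the requested extraction to each hit."""
    results = {}
    for shard in shards:
        for key in search_keys(shard, pattern):
            data = shard[key]
            if mode == "md5sum":
                results[key] = hashlib.md5(data).hexdigest()
            elif mode == "wordcount":
                results[key] = len(data.split())
            else:  # "contents"
                results[key] = data.decode()
    return results


# Hypothetical shards; each dict stands in for one Redis server.
shards = [
    {"web01.example.com#/etc/cm.conf": b"role = web\n"},
    {
        "mps01.example.com#/etc/cm.conf": b"role = mps\n",
        "mps01.example.com#/etc/passwd": b"root:x:0:0\n",
    },
]
sums = extract(shards, "mps*cm.conf", mode="md5sum")
```

Against a real Redis server the key search would be done server-side with a glob pattern rather than by iterating over a dict.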
![Page 23: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/23.jpg)
Step 3: Use Software Load Balancing!
Have clients populate multiple Redis servers on insertion
Pick a Redis server at random on extraction (load balancing)
If we don’t get a response from our first choice, pick another Redis server at random (failover)
Find randomized CFEngine policy servers with Redis from each level in the scope
If the CFEngine policy server responds, push it into a list of machines we need to query for data
If the CFEngine policy server doesn’t respond, pick another one at random (failover)
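A minimal sketch of the random-pick-with-failover behaviour described above. The server names, the `ServerDown` exception, and the `fake_query` helper are hypothetical; a real client would catch its Redis library's connection errors instead.

```python
import random


class ServerDown(Exception):
    """Raised by the stand-in query function when a server does not respond."""


def query_with_failover(servers, query, rng=random):
    """Try servers in random order and return the first successful response."""
    candidates = list(servers)
    rng.shuffle(candidates)  # random first choice spreads load across servers
    for server in candidates:
        try:
            return server, query(server)
        except ServerDown:
            continue  # failover: move on to the next randomly ordered server
    raise ServerDown("no server in this scope responded")


def fake_query(server):
    """Hypothetical query function; 'redis-a' is simulated as offline."""
    if server == "redis-a":
        raise ServerDown(server)
    return f"data from {server}"


server, answer = query_with_failover(["redis-a", "redis-b", "redis-c"], fake_query)
```

Because every client writes to multiple servers on insertion, any server that responds can answer the query, so a random choice is enough to balance load.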
![Page 24: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/24.jpg)
Local Scope
![Page 25: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/25.jpg)
Example: Local cache extraction
$ time extract_sysops_cache.py \
--search /etc/passwd \
--contents | grep msvoboda | wc -l
487
real 0m1.813s
user 0m1.484s
sys 0m0.087s
![Page 26: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/26.jpg)
Site (datacenter) Scope
![Page 27: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/27.jpg)
Example: Site cache extraction
$ time extract_sysops_cache.py \
--site lva1 \
--search /etc/passwd \
--contents | grep msvoboda | wc -l
8687
real 0m19.169s
user 0m30.286s
sys 0m1.271s
![Page 28: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/28.jpg)
Global Scope
![Page 29: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/29.jpg)
Example: Global cache extraction
$ time extract_sysops_cache.py \
--scope global \
--search /etc/passwd \
--contents | grep msvoboda | wc -l
27344
real 0m44.827s
user 1m39.532s
sys 0m4.288s
![Page 30: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/30.jpg)
Make it fast! Become Multithreaded
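One way to picture the multithreaded approach: query every shard on its own worker thread and merge the answers. Waiting on the network is I/O-bound, so threads can overlap those waits. This is only a sketch; `fetch_shard`, the shard data, and the worker count are assumptions, and dicts stand in for Redis servers.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_shard(shard):
    """Stand-in for one network round trip; returns a copy of the shard's data."""
    return dict(shard)


def extract_parallel(shards, max_workers=8):
    """Query all shards concurrently and merge the results into one dict."""
    merged = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch_shard on worker threads, yielding results in input order.
        for partial in pool.map(fetch_shard, shards):
            merged.update(partial)
    return merged


# Hypothetical shards, each a dict standing in for one Redis server.
shards = [
    {"host01.example.com#/etc/passwd": "root:x:0:0"},
    {"host02.example.com#/etc/passwd": "root:x:0:0"},
    {"host03.example.com#/etc/passwd": "root:x:0:0"},
]
merged = extract_parallel(shards)
```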
![Page 31: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/31.jpg)
Make it faster! Build a Redis pipeline
![Page 32: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/32.jpg)
Cache extraction with a pipeline
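Pipelining queues many commands locally and sends them in one batch, so a lookup of N keys costs one round trip instead of N. The sketch below uses small in-memory stand-ins (`FakeRedis`, `FakePipeline`, both hypothetical) that mimic the `pipeline()` / `get()` / `execute()` shape exposed by the redis-py client. It illustrates the pattern and is not the sysops-api implementation.

```python
class FakePipeline:
    """In-memory stand-in mimicking the queue-then-execute shape of a pipeline."""

    def __init__(self, client):
        self._client = client
        self._queued = []

    def get(self, key):
        self._queued.append(key)  # queued locally, nothing is sent yet
        return self

    def execute(self):
        self._client.round_trips += 1  # the whole batch costs one round trip
        results = [self._client.store.get(k) for k in self._queued]
        self._queued = []
        return results  # one reply per queued command, in order


class FakeRedis:
    """Hypothetical stand-in for a Redis client, backed by a dict."""

    def __init__(self, store):
        self.store = store
        self.round_trips = 0

    def pipeline(self):
        return FakePipeline(self)


def fetch_many(client, keys):
    """Fetch all keys through a single pipeline and pair them with their values."""
    pipe = client.pipeline()
    for key in keys:
        pipe.get(key)
    return dict(zip(keys, pipe.execute()))


client = FakeRedis({"hostA#/etc/cm.conf": b"a", "hostB#/etc/cm.conf": b"b"})
values = fetch_many(client, ["hostA#/etc/cm.conf", "hostB#/etc/cm.conf", "missing"])
```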
![Page 33: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/33.jpg)
Extracting the Cache for Fun and Profit
[msvoboda@esv4-infra01 ~]$ extract_sysops_cache.py \
--scope local \
--search mps*cm.conf \
--md5sum \
--prefix-hostnames
esv4-2360-mps01.corp.linkedin.com#/etc/cm.conf 12721673715de3ee6b9dec487529355e
esv4-2360-mps02.corp.linkedin.com#/etc/cm.conf 56b03a16c69e5b246a565dbcda44ba28
esv4-2360-mps03.corp.linkedin.com#/etc/cm.conf 11e20e28ec60ac6c71cbb71b0a6c9b35
esv4-2360-mps04.corp.linkedin.com#/etc/cm.conf 55402eda02e7f5c17dc7535455adc097
![Page 34: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/34.jpg)
Make it fastest! Compression is significant!
Less network overhead on cache insertion
Less network overhead on cache extraction
More stuff we can put into the Cache
Less network I/O = faster results delivered
Less CPU usage on extraction
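To get a feel for why compression matters here, the sketch below compresses repetitive, text-heavy data of the kind found in system files and compares sizes. zlib and the sample line are assumptions made for illustration; the slides do not say which compression library is used.

```python
import zlib

# Repetitive, text-heavy data (similar in shape to /etc/passwd) tends to compress well.
line = b"svc_account:x:1001:1001:Service Account:/home/svc_account:/bin/bash\n"
original = line * 500

compressed = zlib.compress(original, 6)
ratio = len(compressed) / len(original)
print(f"{len(original)} bytes -> {len(compressed)} bytes ({ratio:.1%} of original)")

# The round trip is lossless: decompressing yields the original bytes.
restored = zlib.decompress(compressed)
```

Smaller values mean fewer bytes on the wire for both insertion and extraction, and more data fits in the same amount of memory.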
![Page 35: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/35.jpg)
Seconds for cache insertion
![Page 36: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/36.jpg)
CPU cycles for cache insertion
![Page 37: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/37.jpg)
Data size in megabytes of the cache for an entire datacenter
![Page 38: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/38.jpg)
Time for cross country complete datacenter cache extraction
![Page 39: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/39.jpg)
Drink from the firehose
![Page 40: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/40.jpg)
With the Redis API, you can now be confident in pushing automation changes
You know what systems will be affected before a change
You aren’t hit with surprises in production
You have added visibility
You don’t have to log into machines to modify or update them
![Page 41: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/41.jpg)
Summary

| | Before implementation of CFEngine & Redis API at LinkedIn | After implementation of CFEngine & Redis API at LinkedIn |
| --- | --- | --- |
| Headcount | 6 people supporting a few hundred machines | 6 people supporting tens of thousands of machines |
| Time spent | Hours to build a single machine | Build complete datacenters in minutes |
| Productivity | Hours spent collecting data before change, change itself causing outages | Can focus on building infrastructure, team became proactive to fix future problems, not reactive / firefighting |
| Ease of scaling server deployment | Incredibly difficult to respond to change, low visibility into production | Superior administration, rapid response to changing needs, complete system visibility |
![Page 42: LISA 2013 -- sysops-api -- Leveraging In-Memory Key Value Stores for Large Scale Operations with Redis and CFEngine](https://reader036.vdocument.in/reader036/viewer/2022062614/546e9df5af795962298b57fc/html5/thumbnails/42.jpg)
Open Source
Questions?
www.linkedin.com/in/mikesvoboda
You can download the code from this presentation here:
https://github.com/linkedin/sysops-api