Transcript
Page 1: SaltConf14 - Craig Sebenik, LinkedIn - SaltStack at Web Scale

©2013 LinkedIn Corporation. All Rights Reserved. ©2014 LinkedIn Corporation. All Rights Reserved.

Salt at Web Scale

Craig Sebenik SRE

29 January 2014 SaltConf

Page 2:

Who Am I?

•Programming for 30-ish years

•Scientific computing

• Java and Perl Developer (web apps)

•HATE doing the same thing more than once

•Been at LinkedIn over 3 years

•From the very beginning of us using salt

•Manage/architect the entire salt infrastructure at LinkedIn

Page 3:

What is LinkedIn?

•Social media company connecting the world’s professionals

• 5000+ employees

•Offices throughout the world

• Based in Mountain View, CA

Page 4:

How Big Is linkedin.com?

•Several data centers

•Customer facing apps (aka “production”)

•Staging for production apps

• Internal only apps

• Several Hundred Apps

• 30+K Hosts

•90+% Linux

•Solaris

•Mac and Linux Desktops

Page 5:

LinkedIn Operations

•Several operations groups

•Systems (e.g. OS install/config, “rack and stack”)

•Database Admins

•Network

•Application (i.e. SRE)

•Different groups have different needs for automation

Page 6:

What Is An SRE?

•Help application developers deploy their apps

•Advise on rollout plans

•Coordinate rollouts

•Generally, the group in-between all of operations and all of the developers

•Lots of troubleshooting

• SREs write code (automation)

Page 7:

SREs Use Salt

•Using salt since 0.8.9

• Installation of new apps

•Config management

•Some troubleshooting

Page 8:

Salt Architecture

•Each physical data center

•multiple “fabrics” (logical grouping of hosts)

• single salt master (largest set of minions = 8+k)

•warm backup (same private key)

•minions configured with CNAME to master

• Files stored in subversion

•states, grains, modules

• runners

• reactor

Page 9:

Building Salt

• Internal fork from github

•Add another number. E.g. 2014.01.0.0

•Allows for internal only patches

•Create specific package for testing

•same git repo, with same tags

•LNKD-salt-dev-2014.01.0.0-12345.noarch.rpm

•Allows for emergency changes elsewhere

• salt-dev is deployed on a set of virtual machines

•custom test suite is run
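
The versioning scheme on this slide (an upstream version with one extra internal number appended) can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and reading the trailing 12345 in the package name as a build number is an assumption, since the slide does not say what it represents.

```python
def internal_version(upstream: str, patch: int) -> str:
    """Append an internal-only patch number to an upstream version string."""
    return f"{upstream}.{patch}"


def package_name(upstream: str, patch: int, build: int, flavor: str = "dev") -> str:
    """Build an RPM file name shaped like the example on the slide.

    `build` is assumed to be a build number; `flavor` is a hypothetical
    parameter for the "dev" part of the name.
    """
    version = internal_version(upstream, patch)
    return f"LNKD-salt-{flavor}-{version}-{build}.noarch.rpm"


print(package_name("2014.01.0", 0, 12345))
# prints: LNKD-salt-dev-2014.01.0.0-12345.noarch.rpm
```

Keeping the upstream version intact and only appending a number makes it easy to see which upstream release an internal build is based on.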

Page 10:

Installing Salt

•OS is managed by cfengine

• cfengine will push new salt releases and restart minions

•cfengine also manages minion configs

•master is a set of RPMs

• includes config

• Solaris install is handled by systems team

•Roll out to one data center at a time

•Entire process can take over a week

Page 11:

Salt Master

• salt master is wrapped in a “runit” script

• runit is a process supervisor

• restarts the master if it dies/stops

• salt API

• use the reactor system to send metrics

•metrics gathering is all home grown

• trying to open source it

• file updates (every 5 mins)

•modules, states, grains

Page 12:

Master Access

• Logins to the host are managed via cfengine

•Have to be in a whitelisted group to log on

• Access to salt command controlled via sudo

•sudo logs provide audit trail

• Disable cmd.* from salt cli

• If you want to automate, write a state and/or module

• salt API access via a whitelist of IPs

•Auth using LDAP

•Only a handful of commands
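
The idea of blocking cmd.* can be illustrated with a small pattern check. This is a generic sketch, not the mechanism Salt or LinkedIn actually used (the slide does not name one); `BLOCKED_PATTERNS` and `is_allowed` are hypothetical names.

```python
from fnmatch import fnmatchcase

# Patterns for execution functions that should be refused.
# "cmd.*" comes from the slide; the list itself is a hypothetical structure.
BLOCKED_PATTERNS = ["cmd.*"]


def is_allowed(function_name: str, blocked=BLOCKED_PATTERNS) -> bool:
    """Return True when the function name matches none of the blocked patterns."""
    return not any(fnmatchcase(function_name, pattern) for pattern in blocked)


print(is_allowed("cmd.run"))    # False
print(is_allowed("state.sls"))  # True
```

Refusing arbitrary shell commands pushes users toward states and modules, which can be reviewed and versioned.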

Page 13:

Minions

• basic salt RPM

• includes “salt” command (unfortunately)

•module sync

•every hr

• small python script using client API

•minion metrics

• “age” of modules (via a tracker file)

•uptime of minion
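
The module "age" metric could be derived from the tracker file's modification time, roughly as below. This is a sketch under assumptions: the slide does not describe the tracker file's location or format, so the code assumes the sync script simply touches a file after each successful sync, and the two-hour staleness threshold is made up for illustration.

```python
import os
import time


def tracker_age_seconds(path, now=None):
    """Seconds since the tracker file was last modified (never negative)."""
    now = time.time() if now is None else now
    return max(0.0, now - os.path.getmtime(path))


def modules_are_stale(path, max_age_seconds=2 * 3600, now=None):
    """True when the last sync is older than the (assumed) threshold."""
    return tracker_age_seconds(path, now) > max_age_seconds


if __name__ == "__main__":
    import tempfile

    # Demo with a throwaway file standing in for the real tracker file.
    with tempfile.NamedTemporaryFile() as tracker:
        print(tracker_age_seconds(tracker.name))
        print(modules_are_stale(tracker.name))
```

With an hourly sync, a threshold somewhat above one hour would flag minions that missed a sync without alerting on normal jitter.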

Page 14:

Deployment With Salt

• LinkedIn.com apps are deployed via a custom app

•App is showing its age and needs to be replaced

• Team outside of operations is writing new deployment app

•Uses salt api

•Has a lot of custom code

•Not in salt

• Needs to deploy locally (for testing)

•This includes Mac desktop/laptops

Page 15:

Custom Modules and States

• couchbase management (via runner)

• runit

• Apache Traffic Server

•metrics system

•alerts

•data collection

•data display

Page 16:

Module Promotion

•Small oversight last year caused massive issues

•Developed process to “promote” modules

• Salt environments:

•dev -> vm -> test -> stage -> prod

•different dirs in svn

•sparse directories

•minions are configured to look at certain environments

•Changes are managed with “review board”
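
The promotion order on this slide can be modelled as an ordered list, so that a module only ever moves one environment forward. This is an illustrative sketch; the function names are hypothetical and the slide does not describe any tooling beyond svn directories and Review Board.

```python
# Environment order as listed on the slide.
ENVIRONMENTS = ["dev", "vm", "test", "stage", "prod"]


def next_environment(current):
    """Return the environment after `current`, or None when already at prod.

    Raises ValueError for an unknown environment name.
    """
    index = ENVIRONMENTS.index(current)
    if index + 1 == len(ENVIRONMENTS):
        return None
    return ENVIRONMENTS[index + 1]


def promotion_path(start, end):
    """List every environment a module passes through from `start` to `end`."""
    first, last = ENVIRONMENTS.index(start), ENVIRONMENTS.index(end)
    if first > last:
        raise ValueError(f"cannot promote backwards from {start} to {end}")
    return ENVIRONMENTS[first:last + 1]


print(next_environment("test"))         # stage
print(promotion_path("dev", "stage"))   # ['dev', 'vm', 'test', 'stage']
```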

Page 17:

Problems

•Education!

•Most salt customizations in 2 groups (out of 10)

•Few power users

•Corrupted keys

• Syncing only every hour

•No syncing on Solaris

•No highstate enforcement

Page 18:

More Problems

• Lots of CPU issues on master

• Key management

•Reinstall of OS with same host name

Page 19:

Future

•Multi master

•shared job cache via file system isn’t what we want

• investigating using a returner to share job info

•More training

•Whitelist of states

•Non-ops users

•E.g. devs that want to deploy just their code

• Increase amount of data in grains

Page 20:

More Future

•Pillar data

•Metrics

• Better visibility when things go wrong

•Tools to see job cache

•Logs on master are too chatty

•Ability to watch all traffic from specific minion(s)

• Key management

• reactor system, possibly

Page 21:

Questions?

http://www.linkedin.com/in/craigsebenik
