Inside the Chef Push Jobs Service - ChefConf 2015
TRANSCRIPT
Chef Push in 2015
Mark Anderson (Engineer, Chef), 2015-04-01
The basics of Chef Push
If you want to run a command on a set of nodes, `knife ssh` can be problematic:
• Key distribution/revocation
• Access control/user accounts
• Difficult to audit
• Extra work required if the node is behind a firewall
• Doesn't really scale very far past tens of nodes
• None of the alternative systems suited our needs
Why Chef Push?
• We wanted a remote execution system that:
  • Is robust under network and client failure
  • Gates execution on a quorum being available
  • Provides presence information
  • Scales to hundreds if not thousands of nodes
  • Is integrated with the Chef authentication and authorization system
  • Works behind firewalls and NAT
Why Chef Push?
• `knife job start --quorum 90% 'chef-client' --search 'role:webapp'`
  • Finds all nodes with role webapp
  • Submits a job to the push server
  • Checks quorum; 90% of the listed nodes must be available
  • Starts the chef-client job on the available nodes
  • Gathers successes and failures
  • And will do this for ten nodes... or a thousand
Push jobs in a command line
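The quorum gate in that flow boils down to an availability-ratio check. A minimal Ruby sketch (illustrative names, not the actual Push implementation):

```ruby
# Illustrative quorum gate: run the job only if enough of the targeted
# nodes are currently available. Integer math avoids float rounding.
def quorum_met?(available, targets, quorum_pct)
  return false if targets.empty?
  available.size * 100 >= targets.size * quorum_pct
end

targets   = (1..10).map { |i| "web#{i}" }
available = targets - ["web7"]                   # one node is down

quorum_met?(available, targets, 90)              # => true  (9/10 = 90%)
quorum_met?(available - ["web3"], targets, 90)   # => false (8/10 = 80%)
```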
The lifecycle of a job (server ↔ client sequence):
• Job accepted by the server
• Server sends the command to the clients
• Clients ACK
• Server waits for quorum, then signals start of execution
• Clients execute
• Server collects results
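The lifecycle above can be viewed as a small state machine; a sketch in Ruby (state names are inferred from the slide, not the server's actual states):

```ruby
# Job states inferred from the lifecycle slide; transitions not listed
# here are treated as illegal.
TRANSITIONS = {
  new:      [:accepted],
  accepted: [:voting],                  # command sent, waiting for ACKs
  voting:   [:running, :quorum_failed], # quorum reached or not
  running:  [:complete],
}.freeze

def advance(state, to)
  allowed = TRANSITIONS.fetch(state, [])
  raise ArgumentError, "illegal transition #{state} -> #{to}" unless allowed.include?(to)
  to
end

state = :new
%i[accepted voting running complete].each { |nxt| state = advance(state, nxt) }
state  # => :complete
```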
• Erlang service
• Extends the Chef REST API
  • Job creation and tracking
  • Push client configuration
• Controls the clients via ZeroMQ
  • Heartbeats to track node availability
  • Command execution
  • All ZeroMQ packets are signed
Chef Push Server
• Simple Ruby client
  • Receives heartbeats from the server
  • Sends heartbeats back to the server
  • Executes commands
• Configuration requirements are minimal
  • The client initiates all connections to the server
  • Most configuration comes from a Chef API call to the config endpoint
  • Using that info, the client opens ZeroMQ connections to the server
Chef Push Client
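Presence from heartbeats amounts to tracking when each node was last heard from; a hypothetical sketch (the interval and miss count are illustrative, not the real defaults):

```ruby
# Illustrative presence tracking: a node counts as available if a
# heartbeat arrived within `allowed_misses` heartbeat intervals.
class PresenceTracker
  def initialize(interval: 10, allowed_misses: 3)
    @deadline  = interval * allowed_misses
    @last_seen = {}
  end

  def heartbeat(node, now = Time.now.to_i)
    @last_seen[node] = now
  end

  def available?(node, now = Time.now.to_i)
    seen = @last_seen[node]
    !seen.nil? && (now - seen) <= @deadline
  end
end

tracker = PresenceTracker.new(interval: 10, allowed_misses: 3)
tracker.heartbeat("web1", 1_000)
tracker.available?("web1", 1_010)  # => true  (10s since last beat)
tracker.available?("web1", 1_031)  # => false (31s > 3 * 10s)
```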
Chef Push Networking
[Diagram: the client reaches the server's REST API over HTTPS, subscribes to the heartbeat generator over a ZeroMQ PUB/SUB socket, and exchanges messages with the message switch over a DEALER/ROUTER socket pair.]
• All control for Push is via extensions to the Chef API
  • Node status
  • Job control
    • start
    • stop
    • status
  • Job listing
Chef Push knife extension
• Access rights controlled by groups
  • The 'push_job_writers' group controls job creation and deletion
  • The 'push_job_readers' group controls read access to job status and results
• Whitelist for commands
  • The client rejects commands that aren't on the whitelist
• We'd like to do finer-grained access control in the future
Access control
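A whitelist check of this sort is essentially a name-to-command lookup that refuses anything unlisted; a sketch (the entries here are made-up examples, not a real configuration):

```ruby
# Illustrative whitelist: job names map to the commands the client is
# allowed to run; anything else is rejected.
WHITELIST = {
  "chef-client" => "chef-client",
  "ntp-sync"    => "ntpdate -u pool.ntp.org",
}.freeze

def resolve_command(job_name)
  WHITELIST.fetch(job_name) do
    raise ArgumentError, "command not whitelisted: #{job_name}"
  end
end

resolve_command("chef-client")  # => "chef-client"
# resolve_command("rm -rf /")   # raises ArgumentError
```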
• Version 1.0 scales to 2k nodes
  • Works with Chef 12
  • Open source since fall 2014
• We've been working on new features since last spring
  • But Chef 12 had to ship first
  • The new features required functionality from Enterprise Chef
  • Open sourcing Chef Push would have been pretty meaningless without an open source server
Status:
New Features in Chef Push 2.0
• Breaking change to the protocol
• End-to-end encryption of every packet
  • Required for us to implement the parameter-passing and output-return features
• Built on the ZeroMQ 4 implementation of CurveCP
  • CurveCP provides a framework that is fast, crypto-hardened against modern attacks, and offers forward secrecy
• We still bootstrap the authentication using the Chef client key
End to End Encryption
Enhanced control over the job execution environment:
• A config file of up to 100k
• Effective user
• Working directory
• Environment variables
  • User-defined variables
  • Special variables for
    • job id
    • job file location
Command environment and config files
• New flag for a job
  • capture_output: boolean
• Capture is all or nothing
  • All nodes in the job
  • Both stdout and stderr
• Stored on the server with the job description
• No streaming output... yet
Command output capture
Two event feeds
• Per-org feed
  • Job start
  • Job completion summary
  • Runs forever
• Per-job feed with fine-grained execution data
  • Job voting start
  • Quorum votes by node
  • Job start
  • Completion state by node
  • Job completion
Server Sent Event Feeds
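These feeds are delivered as Server-Sent Events; a minimal Ruby sketch of splitting such a stream into events (the event names shown are illustrative, not Push's actual schema):

```ruby
# Minimal SSE parsing sketch: events are separated by blank lines and
# carry `event:` and `data:` fields (per the SSE wire format).
def parse_sse(stream)
  stream.split("\n\n").reject(&:empty?).map do |chunk|
    fields = Hash.new { |h, k| h[k] = [] }
    chunk.each_line do |line|
      name, _, value = line.chomp.partition(": ")
      fields[name] << value
    end
    { event: fields["event"].first, data: fields["data"].join("\n") }
  end
end

feed = "event: start\ndata: job J1\n\nevent: quorum_vote\ndata: web1 yes\n\n"
parse_sse(feed)
# => [{:event=>"start", :data=>"job J1"},
#     {:event=>"quorum_vote", :data=>"web1 yes"}]
```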
• Previously we've been advertising around 2k as the limit
• 10k connected nodes demonstrated
  • 10-second heartbeats
  • c3.2xlarge Chef server in standalone mode
  • The push server consumes 2 cores and about 2 GB
• Up to 1k nodes in a single job
  • Around 1.5-2k nodes we start seeing some stampede problems
• Not done scaling; there are a few tweaks left to do
Stable at 10k connected nodes
Demo: some improvements
• That test was done with real push clients
  • 20 m3.2xlarge nodes
  • Each running 500 Docker containers
• But we also do a lot of testing using a simulator
  • Understanding the limits of our current system
  • SystemTap is amazing for this kind of work
Current work: Scalability and Stability drive
Axes of scaling tested:
• Number of active clients
• Heartbeat rate per client
• Number of clients in a single job
Below 10k clients there is a pretty linear trade between heartbeat rate and the number of connected clients; heartbeats/sec was a useful metric.
Care must be taken to avoid stampedes in job execution.
Scaling and Tuning
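The heartbeats/sec metric above is just the client count divided by the heartbeat interval, which is why the trade-off is roughly linear:

```ruby
# Aggregate heartbeat load on the server: clients / interval.
def heartbeats_per_sec(clients, interval_sec)
  clients.to_f / interval_sec
end

heartbeats_per_sec(10_000, 10)  # => 1000.0 (the 10k-node test above)
heartbeats_per_sec(2_000, 2)    # => 1000.0 (same load, fewer clients)
```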
• A port in ZeroMQ is bound to a single thread
• All communications go through a single 'command switch'
  • Client heartbeats and all command messages go through the switch
  • The switch ended up being a bottleneck at around 2k messages/sec
• Experiment: multiple command switches
  • Exercises some weaknesses in the ZeroMQ-Erlang interface
  • Not as big a win as hoped, and ended up being more complex than we'd like
Lessons from scaling
Nearly feature complete, but:
• Remaining work for new features
  • knife push extensions for everything
  • Documentation
• Windows testing and stability
  • Committed to making Windows a first-class citizen
• CentOS 7
• Polish around installation and cookbooks
• Upgrade tooling for 1.0 -> 2.0
• Bug fixes
  • Please file bugs
Remaining work for 2.0
Roadmap for 2.1 and beyond
• Currently we support
  • Ubuntu 10.04, 12.04, and 14.04 LTS
  • CentOS 5 and 6, with 7 coming soon
  • Windows (client only)
• Investigating client support for
  • AIX
  • Solaris
Platform Support
• Key rotation support
  • Multiple keys break some assumptions about how we authenticate in Push
  • Needs fixes on the Chef server as well as Push
• Better access control
  • Controlling access on a node-by-node basis
  • Examining persistent jobs as a first-class object with their own ACLs - look for the RFC
Features for 2.x releases
• Integration into the Chef client package
  • We delayed joining the two because of the breaking protocol changes in 2.0
  • Future server versions will be backward compatible
Features for 2.x releases
Scaling
• Rate-limited job execution
  • Prevents the stampede effect
  • Protects both Push and the Chef server
  • Starting 1k chef-client runs at once is a bad idea anyway
  • Per-job and server-global limits
• Multiple-socket command switch
  • Biggest scaling bottleneck
• Infrastructure for a distributed server
Features for 2.x releases
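Rate-limiting job starts can be as simple as batching: only so many node runs begin per tick, and the rest queue. A sketch (the batch size is illustrative, not a real Push setting):

```ruby
# Illustrative stampede avoidance: split the job's nodes into batches
# so at most `per_tick` chef-client runs start in any one tick.
def start_batches(nodes, per_tick)
  nodes.each_slice(per_tick).to_a
end

nodes = (1..10).map { |i| "node#{i}" }
start_batches(nodes, 4).map(&:size)  # => [4, 4, 2]
```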
• Move push connections to the front ends in tiered Chef
  • Push will run on all of the front-end nodes
  • Expected to improve scaling
• Better HA support
  • Move to a true active-active model on the back end
• Scaling
  • Our goal is to scale with the Chef server
Future major releases - 3.x and beyond
Protocol changes required
• Complex networks are difficult; proxies are hard
• ZeroMQ was helpful at first, but we're hitting its limitations
  • Stability problems at scale
  • Erlang doesn't need a lot of what ZeroMQ brings
• Backward compatibility will be a priority
Future major releases - 3.x and beyond
• Office hours
  • Currently Monday and Wednesday, 12:00 PST
• chef-push is the master repository
  • github.com/chef/chef-push
  • File issues here
  • Specific issues and PRs are fine to file against the individual repos
• Pull requests always welcome
• RFCs for major new features