Everything You Thought You Already Knew About Orchestration

Laura Frank, Director of Engineering, Codeship

Posted on 21-Jan-2018

TRANSCRIPT

Page 1: Everything You Thought You Already Knew About Orchestration

Laura Frank, Director of Engineering, Codeship

Page 2: Everything You Thought You Already Knew About Orchestration
Page 3: Everything You Thought You Already Knew About Orchestration

Agenda

• Managing Distributed State with Raft: Quorum 101, Leader Election, Log Replication
• Service Scheduling
• Failure Recovery
• …plus bonus debugging tips!

Page 4: Everything You Thought You Already Knew About Orchestration

The Big Problem(s): What are tools like Swarm and Kubernetes trying to do?

They’re trying to get a collection of nodes to behave like a single node.
• How does the system maintain state?
• How does work get scheduled?

Page 5: Everything You Thought You Already Knew About Orchestration

Manager leader

Worker Worker

Manager follower

Manager follower

Worker Worker Worker

raft consensus group

Page 6: Everything You Thought You Already Knew About Orchestration

So You Think You Have Quorum

Page 7: Everything You Thought You Already Knew About Orchestration

Quorum

The minimum number of votes that a consensus group needs in order to be allowed to perform an operation.

Without quorum, your system can’t do work.

Page 8: Everything You Thought You Already Knew About Orchestration

Math!

Managers | Quorum | Fault Tolerance
       1 |      1 |               0
       2 |      2 |               0
       3 |      2 |               1
       4 |      3 |               1
       5 |      3 |               2
       6 |      4 |               2
       7 |      4 |               3

Quorum = (N/2) + 1

In simpler terms, it means a majority.
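The (N/2) + 1 formula uses integer division, and the table above falls out of it directly. A quick sketch to check the math (the helper function names here are illustrative, not part of any tool):

```shell
# Quorum for N managers is (N/2) + 1 with integer division;
# fault tolerance is whatever is left over: N - quorum.
quorum() { echo $(( $1 / 2 + 1 )); }
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5 6 7; do
  echo "managers=$n quorum=$(quorum $n) tolerates=$(fault_tolerance $n)"
done
```

Note that an even-sized group never buys you anything: 4 managers tolerate the same single failure as 3, while requiring one more vote for quorum.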

Page 9: Everything You Thought You Already Knew About Orchestration

Math! Managers Quorum Fault Tolerance

1 1 0

2 2 0

3 2 1

4 3 1

5 3 2

6 4 2

7 4 3

(N/2) + 1

In simpler terms, it means a majority

Page 10: Everything You Thought You Already Knew About Orchestration

Having two managers instead of one actually doubles your chances of losing quorum.
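One way to see this: with two managers, quorum is 2, so losing *either* manager loses quorum. If each manager fails independently with probability p over some window (a simplifying assumption, and p = 0.01 is just an illustrative number), the chance of losing quorum is roughly double the single-manager case:

```shell
# With 1 manager, quorum is lost when that one manager fails: probability p.
# With 2 managers, quorum is 2, so losing EITHER manager loses quorum:
# probability 1 - (1-p)^2, which is about 2p for small p.
awk 'BEGIN {
  p = 0.01
  printf "one manager:  %.4f\n", p
  printf "two managers: %.4f\n", 1 - (1 - p) * (1 - p)
}'
```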

Page 11: Everything You Thought You Already Knew About Orchestration

Quorum With Multiple Regions

Pay attention to datacenter topology when placing managers.

Manager Nodes | Distribution across 3 Regions
            3 | 1-1-1
            5 | 1-2-2
            7 | 3-2-2
            9 | 3-3-3

…magically works with Docker for AWS

Page 12: Everything You Thought You Already Knew About Orchestration

Let’s talk about Raft!

Page 13: Everything You Thought You Already Knew About Orchestration

I think I’ll just write my own distributed consensus algorithm.

-no sensible person

Page 14: Everything You Thought You Already Knew About Orchestration

Raft is responsible for…

• Leader election
• Log replication
• Safety (won’t talk about this much today)
• Being easier to understand

Page 15: Everything You Thought You Already Knew About Orchestration

Orchestration systems typically use a key/value store backed by a consensus algorithm

In a lot of cases, that algorithm is Raft!

Raft is used everywhere… …that etcd is used

Page 16: Everything You Thought You Already Knew About Orchestration

SwarmKit implements the Raft algorithm directly.

Page 17: Everything You Thought You Already Knew About Orchestration

In most cases, you don’t want to run work on your manager nodes. Participating in a Raft consensus group is work, too. Make your manager nodes unavailable for tasks:

docker node update --availability drain <NODE>

*I will run work on managers for educational purposes

Page 18: Everything You Thought You Already Knew About Orchestration

Leader Election & Log Replication

Page 19: Everything You Thought You Already Knew About Orchestration

Manager leader

Manager candidate

Manager follower

Manager offline
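The transitions between those states can be sketched as a single election round. A follower that stops hearing from the leader becomes a candidate, votes for itself, and asks its peers for votes; it wins once it holds a majority. The vote counts below are hypothetical:

```shell
# One Raft election round in a 5-manager group, with made-up peer replies.
MANAGERS=5
QUORUM=$(( MANAGERS / 2 + 1 ))   # 3 votes needed in a 5-manager group
votes=1                          # the candidate votes for itself
for peer_vote in 1 1 0 0; do     # hypothetical replies from the 4 peers
  votes=$(( votes + peer_vote ))
done
if [ "$votes" -ge "$QUORUM" ]; then
  echo "candidate becomes leader with $votes/$MANAGERS votes"
else
  echo "split vote: candidates retry after a randomized timeout"
fi
```

The randomized retry timeout is what makes split votes rare in practice: two candidates that tie are unlikely to time out at the same moment again.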

Page 20: Everything You Thought You Already Knew About Orchestration

demo.consensus.group

Page 21: Everything You Thought You Already Knew About Orchestration

The log is the source of truth for your application.

Page 22: Everything You Thought You Already Knew About Orchestration

In the context of distributed computing (and this talk), a log is an append-only, time-based record of data.

[ 2 | 10 | 30 | 25 | 5 | 12 ] ← first entry on the left; new entries are appended at the end

This log is for computers, not humans.

Page 23: Everything You Thought You Already Knew About Orchestration

In simple systems, the log is pretty straightforward: a client sends a value (say, 12) to the server, and the server appends it to its log.

Client —12→ Server: [ 2 | 10 | 30 | 25 | 5 | 12 ]

Page 24: Everything You Thought You Already Knew About Orchestration

In a manager group, a log entry can only “become truth” once it is confirmed by a majority of followers (quorum!).

Client —12→ Manager leader → replicated to Manager follower, Manager follower
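That commit rule can be sketched the same way as the quorum math above; the ack pattern here is hypothetical:

```shell
# The leader appends the entry to its own log (one ack), then counts
# follower acks. The entry commits once acks reach quorum.
MANAGERS=3
QUORUM=$(( MANAGERS / 2 + 1 ))   # 2 of 3
acks=1                           # the leader's own log write
for follower_ack in 1 0; do      # one follower acks, one is unreachable
  acks=$(( acks + follower_ack ))
done
if [ "$acks" -ge "$QUORUM" ]; then
  echo "entry 12 committed ($acks/$MANAGERS acks)"
else
  echo "entry 12 pending"
fi
```

This is why the cluster keeps working with one of three managers down, and why it stops accepting changes with two down: the acks can no longer reach quorum.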

Page 25: Everything You Thought You Already Knew About Orchestration

demo.consensus.group

Page 26: Everything You Thought You Already Knew About Orchestration

In distributed computing, it’s essential that you understand log replication.

bit.ly/logging-post

Page 27: Everything You Thought You Already Knew About Orchestration

Debugging Tip

Watch the Raft logs.

Monitor via inotifywait OR just read them directly!

Page 28: Everything You Thought You Already Knew About Orchestration
Page 29: Everything You Thought You Already Knew About Orchestration

Scheduling

Page 30: Everything You Thought You Already Knew About Orchestration

HA application problems

scheduling problems

orchestrator problems

Page 31: Everything You Thought You Already Knew About Orchestration

Scheduling constraints

Restrict services to specific nodes, such as specific architectures, security levels, or types

docker service create \
  --constraint 'node.labels.type==web' \
  my-app

Page 32: Everything You Thought You Already Knew About Orchestration

New in 17.04.0-ce: Topology-aware scheduling!!1!

Implements a spread strategy over nodes that belong to a certain category.

Unlike --constraint, this is a “soft” preference:

--placement-pref 'spread=node.labels.dc'

Page 33: Everything You Thought You Already Knew About Orchestration
Page 34: Everything You Thought You Already Knew About Orchestration

Swarm will not rebalance healthy tasks when a new node comes online

Page 35: Everything You Thought You Already Knew About Orchestration

Debugging Tip

Add a manager to your Swarm running with --availability drain and in Engine debug mode

Page 36: Everything You Thought You Already Knew About Orchestration

Failure Recovery

Page 37: Everything You Thought You Already Knew About Orchestration

Losing quorum

Page 38: Everything You Thought You Already Knew About Orchestration

Regain quorum

• Bring the downed nodes back online (derp)
• On a healthy manager, run docker swarm init --force-new-cluster
• This will create a new cluster with one healthy manager
• You need to promote new managers

Page 39: Everything You Thought You Already Knew About Orchestration

The datacenter is on fire

Page 40: Everything You Thought You Already Knew About Orchestration

Restore from a backup in 5 easy steps!

• Bring up a new manager and stop Docker
• sudo rm -rf /var/lib/docker/swarm
• Copy backup to /var/lib/docker/swarm
• Start Docker
• docker swarm init (--force-new-cluster)

Page 41: Everything You Thought You Already Knew About Orchestration

But wait, there’s a bug… or a feature

• In general, users shouldn’t be allowed to modify IP addresses of nodes
• Restoring from a backup == old IP address for node1
• Workaround is to use elastic IPs with the ability to reassign

Page 42: Everything You Thought You Already Knew About Orchestration

Thank You!

@docker

#dockercon