building reliability with spokespiki.org/spokes.pdf · how people build so!ware no missed...

23
How people build soware ! " Building Reliability with Spokes Patrick Reynolds (@piki) and the Git Infrastructure team

Upload: others

Post on 21-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software

!

"

Building Reliability with SpokesPatrick Reynolds (@piki)

and the Git Infrastructure team

Page 2: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Git Infrastructure

2

##

Git file servers

Proxies and RPC • HTTPS, SSH, SVN • Internal API

$

Page 3: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Our job

3

• Run a Git service • Make it scale • Make it reliable at scale

Page 4: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Goals

4

• One server down = no impact • Two servers down = minimal impact • Balance disk space and CPU load • Automatically resync and repair • Horizontal scaling • Maintenance operations = no impact • GitHub Enterprise, too

Page 5: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Spokes

5

• Git-level replication • Three copies of every repository, wiki, and gist

• One copy = read-only • Two copies = fully functional

# #

push fetch

Page 6: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

• Goal: one server down = no impact

• Send every request to the best server

!Surviving failures

6

Page 7: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Picking the best server

7

Get replica list

Filter it

Sort it

Use it

Page 8: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Failure detector

8

• Watch real traffic • Three failures in a row = offline • One success = back online

• Heartbeats • Quick failover

Page 9: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Failure detector

9

fe1 fe2 fe3 worker api1

dfs1 dfs2 dfs3 dfs4 dfs5

• Per-client list of offline servers

Page 10: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Failure detector

10

fe1 fe2 fe3 worker api1

dfs1 dfs2 dfs3 dfs4 dfs5

• Per-client list of offline servers

Page 11: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!!#!

11

Page 12: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!No missed operations

12

• Goal: one server down = no impact • Reads pick a healthy server, fail over • Writes go to all healthy servers

• Goal: two servers down = minimal impact • No impact if in the same rack • ≤ 0.01% of repositories are read-only

Page 13: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Load balancing

13

• Goal: balance CPU load • Treat busy servers as offline

Page 14: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Recovery

14

• What happens after downtime? • What happens after failure?

Page 15: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Goals and repairs

15

• Goals • Properties that are supposed to be true • Expressed as SQL queries

• Repairs • How to make each property true, if it isn’t • Expressed as Resque jobs

Page 16: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Goals and repairs

16

Contents match? Exactly 3 copies? Disks balanced?

Repair repo Create/destroy replica Migrate replica

No No No

Page 17: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Goal-based programming

17

• Decoupled in code • Decoupled in time • Reusable

Page 18: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Repairs

18

• Goal: balance disk space • Set balanced disks as a repair goal • Move repository replicas as needed

• Goal: resync and repair • Set unanimous checksums as a repair goal • git fetch + rsync as needed

Page 19: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Maintenance operations

19

• Goal: maintenance operations = no impact • Reboots work like any other server crash • Removing a server works like a failure • Can we be more graceful?

• Quiescing and evacuating

Page 20: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Horizontal scaling

20

• Goal: horizontal scaling • Need more CPU or disk space? • Add a server • There is no step 2

Page 21: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!GitHub Enterprise

21

• Goal: GitHub Enterprise, too • Spokes = file servers for Enterprise Cluster

Page 22: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!Lessons

22

• Replicate everything • Deemphasize individual servers • Application traffic > heartbeats • Goals > failure modes • Learn more: githubengineering.com

Page 23: Building Reliability with Spokespiki.org/spokes.pdf · How people build so!ware No missed operations! 12 •Goal: one server down = no impact •Reads pick a healthy server, fail

How people build software!

!

23

</talk>