Building Reliability with Spokes · piki.org/spokes.pdf
TRANSCRIPT
How people build software
Building Reliability with Spokes
Patrick Reynolds (@piki)
and the Git Infrastructure team

Git Infrastructure
Git file servers
Proxies and RPC
• HTTPS, SSH, SVN
• Internal API

Our job
• Run a Git service
• Make it scale
• Make it reliable at scale

Goals
• One server down = no impact
• Two servers down = minimal impact
• Balance disk space and CPU load
• Automatically resync and repair
• Horizontal scaling
• Maintenance operations = no impact
• GitHub Enterprise, too

Spokes
• Git-level replication
• Three copies of every repository, wiki, and gist
• One copy = read-only
• Two copies = fully functional
[Diagram labels: push, fetch]

Surviving failures
• Goal: one server down = no impact
• Send every request to the best server

Picking the best server
Get replica list
Filter it
Sort it
Use it
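
The four steps above can be sketched as a small function. This is only an illustration: the `Replica` type, the `load` field, and the least-loaded-first ordering are assumptions, since the slides do not say what the list is filtered or sorted on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    host: str     # file server holding this copy
    load: float   # hypothetical load figure; lower is better


def pick_servers(replicas, offline):
    """Return replica hosts in the order a client should try them."""
    # Filter it: drop replicas on servers this client believes are offline.
    healthy = [r for r in replicas if r.host not in offline]
    # Sort it: least loaded first (assumed criterion), host name as tie-break.
    healthy.sort(key=lambda r: (r.load, r.host))
    # Use it: the caller tries the hosts in this order.
    return [r.host for r in healthy]


# Get replica list: hard-coded here; a real system would look it up.
replicas = [Replica("dfs1", 0.7), Replica("dfs3", 0.2), Replica("dfs4", 0.4)]
order = pick_servers(replicas, offline={"dfs3"})
print(order)
```

With `dfs3` marked offline, the client would try `dfs4` first and fall back to `dfs1`.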

Failure detector
• Watch real traffic
  • Three failures in a row = offline
  • One success = back online
• Heartbeats
  • Quick failover
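
A minimal sketch of the counting rule on this slide (three consecutive failures mark a server offline, one success brings it back). The class and method names are hypothetical, and heartbeats are not modeled.

```python
class FailureDetector:
    """One client's view of which servers are offline."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = {}  # host -> consecutive failure count

    def record(self, host, ok):
        """Record the outcome of one real request to `host`."""
        if ok:
            self.failures[host] = 0  # one success = back online
        else:
            self.failures[host] = self.failures.get(host, 0) + 1

    def is_offline(self, host):
        return self.failures.get(host, 0) >= self.threshold


fd = FailureDetector()
for ok in (False, False, True, False, False):
    fd.record("dfs2", ok)
still_online = not fd.is_offline("dfs2")  # the success reset the streak
fd.record("dfs2", False)                  # third failure in a row
now_offline = fd.is_offline("dfs2")
fd.record("dfs2", True)
back_online = not fd.is_offline("dfs2")
```

Because the counter resets on any success, an intermittent error never accumulates into an offline verdict; only an unbroken streak does.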

Failure detector
[Diagram: clients fe1, fe2, fe3, worker, api1; file servers dfs1, dfs2, dfs3, dfs4, dfs5]
• Per-client list of offline servers

No missed operations
• Goal: one server down = no impact
  • Reads pick a healthy server, fail over
  • Writes go to all healthy servers
• Goal: two servers down = minimal impact
  • No impact if in the same rack
  • ≤ 0.01% of repositories are read-only
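
The read and write rules above, together with the "one copy = read-only, two copies = fully functional" rule from the Spokes slide, might look roughly like this. All function names are hypothetical, the "unavailable" state is an assumption, and the coordination needed to commit a write on several replicas is not modeled.

```python
def repo_state(replicas, offline):
    """Classify a repository by how many of its replicas are healthy."""
    healthy = [h for h in replicas if h not in offline]
    if len(healthy) >= 2:
        return "fully functional"
    if len(healthy) == 1:
        return "read-only"
    return "unavailable"  # assumption: the slides do not name this case


def read_with_failover(replicas, offline, fetch):
    """Reads pick a healthy server and fail over to the next on error."""
    for host in replicas:
        if host in offline:
            continue
        try:
            return fetch(host)
        except ConnectionError:
            continue  # fail over to the next replica
    raise RuntimeError("no healthy replica answered")


def write_targets(replicas, offline):
    """Writes go to all healthy servers."""
    return [h for h in replicas if h not in offline]


replicas = ["dfs1", "dfs2", "dfs3"]


def flaky_fetch(host):
    # Stand-in for a real fetch: dfs1 fails, the others answer.
    if host == "dfs1":
        raise ConnectionError("dfs1 is down")
    return f"objects from {host}"


result = read_with_failover(replicas, set(), flaky_fetch)
```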

Load balancing
• Goal: balance CPU load
  • Treat busy servers as offline
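
One way to read "treat busy servers as offline" is to apply a load ceiling in the same filter that drops offline servers. The `max_load` threshold and the load figures below are invented for illustration; the slides do not say how "busy" is measured.

```python
def usable_servers(servers, offline, load, max_load=0.9):
    """Drop servers that are offline or at/above the hypothetical load ceiling."""
    return [
        s for s in servers
        if s not in offline and load.get(s, 0.0) < max_load
    ]


servers = ["dfs1", "dfs2", "dfs3"]
load = {"dfs1": 0.35, "dfs2": 0.97, "dfs3": 0.60}
candidates = usable_servers(servers, offline=set(), load=load)
```

Here `dfs2` is healthy but busy, so requests are routed to `dfs1` and `dfs3` until its load drops.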

Recovery
• What happens after downtime?
• What happens after failure?

Goals and repairs
• Goals
  • Properties that are supposed to be true
  • Expressed as SQL queries
• Repairs
  • How to make each property true, if it isn’t
  • Expressed as Resque jobs

Goals and repairs
Goal                 Repair if the answer is "No"
Contents match?      Repair repo
Exactly 3 copies?    Create/destroy replica
Disks balanced?      Migrate replica
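
To make the goal/repair split concrete, here is a sketch of the "exactly 3 copies" goal as a SQL query whose rows are the violations, with one repair job planned per row. The `replicas` table, its columns, and the job names are invented for illustration, and a plain Python list stands in for a Resque queue.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE replicas (repo_id INTEGER, host TEXT)")
db.executemany(
    "INSERT INTO replicas VALUES (?, ?)",
    [
        (1, "dfs1"), (1, "dfs2"), (1, "dfs3"),               # goal met
        (2, "dfs1"), (2, "dfs4"),                            # one copy short
        (3, "dfs1"), (3, "dfs2"), (3, "dfs3"), (3, "dfs5"),  # one copy too many
    ],
)

# Goal: every repository has exactly 3 copies.
# The query returns only the repositories that violate the goal.
GOAL_EXACTLY_THREE = """
    SELECT repo_id, COUNT(*) AS copies
    FROM replicas
    GROUP BY repo_id
    HAVING COUNT(*) != 3
    ORDER BY repo_id
"""


def plan_repairs(conn):
    """Turn each goal violation into a (job_name, repo_id) repair job."""
    jobs = []
    for repo_id, copies in conn.execute(GOAL_EXACTLY_THREE):
        action = "create_replica" if copies < 3 else "destroy_replica"
        jobs.append((action, repo_id))
    return jobs


jobs = plan_repairs(db)
print(jobs)
```

The query knows nothing about why a repository has the wrong number of copies; it only reports that the property does not hold, and the repair is chosen from that.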

Goal-based programming
• Decoupled in code
• Decoupled in time
• Reusable

Repairs
• Goal: balance disk space
  • Set balanced disks as a repair goal
  • Move repository replicas as needed
• Goal: resync and repair
  • Set unanimous checksums as a repair goal
  • git fetch + rsync as needed
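
A sketch of the "unanimous checksums" goal: if every replica reports the same checksum, nothing needs repair; otherwise the disagreeing replicas are flagged. Treating the majority checksum as the correct one is an assumption made here for illustration; the slides do not say how the good copy is chosen.

```python
from collections import Counter


def replicas_to_repair(checksums):
    """checksums maps host -> checksum; return hosts that need a resync."""
    if len(set(checksums.values())) <= 1:
        return []  # unanimous: the goal already holds
    majority, votes = Counter(checksums.values()).most_common(1)[0]
    if votes * 2 <= len(checksums):
        # Assumed policy: without a strict majority, do not guess.
        raise ValueError("no majority checksum; cannot pick a source of truth")
    return sorted(h for h, c in checksums.items() if c != majority)


stale = replicas_to_repair({"dfs1": "abc", "dfs2": "abc", "dfs3": "def"})
clean = replicas_to_repair({"dfs1": "abc", "dfs2": "abc", "dfs3": "abc"})
```

Each flagged host would then be brought back in line with the `git fetch` + `rsync` step named on the slide.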

Maintenance operations
• Goal: maintenance operations = no impact
  • Reboots work like any other server crash
  • Removing a server works like a failure
  • Can we be more graceful?
• Quiescing and evacuating

Horizontal scaling
• Goal: horizontal scaling
  • Need more CPU or disk space?
  • Add a server
  • There is no step 2

GitHub Enterprise
• Goal: GitHub Enterprise, too
  • Spokes = file servers for Enterprise Cluster

Lessons
• Replicate everything
• Deemphasize individual servers
• Application traffic > heartbeats
• Goals > failure modes
• Learn more: githubengineering.com
</talk>