The Time-less Datacenter

Slides: ee380.stanford.edu/Abstracts/161116-slides.pdf
The Time-less Datacenter Paul Borrill and Alan H. Karp Earth Computing The Datacenter Resilience Company Stanford EE Computer Systems Colloquium Wednesday, November 16, 2016 http://ee380.stanford.edu


Page 1: The Time-less Datacenteree380.stanford.edu/Abstracts/161116-slides.pdf · •More Resilient to all perturbations •Easier to Secure •Key Ideas •RAFE: Reliable Address-Free Ethernet

The Time-less Datacenter

Paul Borrill and Alan H. Karp Earth Computing

The Datacenter Resilience Company

Stanford EE Computer Systems Colloquium

Wednesday, November 16, 2016

http://ee380.stanford.edu

Page 2

The Three Taxes:
1. Complexity
2. Fragility
3. Vulnerability

Cloud Computing

Page 3

Earth Computing | The Datacenter Resilience Company

Twitter Today

Systems can fail in catastrophic ways leading to death or tremendous financial loss. Although there are many potential causes including physical failure, human error, and environmental factors, design errors are increasingly becoming the most serious culprit*

*NASA Formal Methods Program: https://shemesh.larc.nasa.gov/fm/fm-why-new.html

Page 4

Key Computer Science Problems

Reliable Consensus
• Generals Problem (no fixed-length protocol exists to guarantee a reliable solution in an environment where messages can get lost)
• Slow Node vs. Link Failure Indistinguishability, i.e. what can one side of a failed link assume about a partner or cohort on the other side?

FLP Result
• Impossibility of Distributed Consensus with One Faulty Process

Key Idea:
• Don’t depend on processes to provide liveness; use a new kind of link

Page 5

Problem: Event Ordering is Hard

• In a distributed system over a general network we can’t tell if an event at process R happened before an event at process Q, unless the event at R caused the event at Q in some way
• Causal Trees provide this guarantee when they are stable
• Dynamic Causal Trees provide guarantees through failure & healing, iff you have AIT on each link
• Needs Atomic Information Transfer (AIT) in the Link
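The happened-before test that these bullets appeal to can be sketched with vector clocks, the same per-process P/Q/R counters that label the events in this slide's diagram. This is a minimal illustration of standard vector clocks, not Earth Computing's causal-tree mechanism; the `VectorClock` class and the helper names are assumptions made for the example, and entries shown as "--" on the slide are represented here as 0.

```python
class VectorClock:
    """Per-process vector clock (hypothetical helper for illustration)."""

    def __init__(self, owner, processes):
        self.owner = owner
        self.clock = {p: 0 for p in processes}

    def tick(self):
        # A local event (or a send) advances only the owner's entry.
        self.clock[self.owner] += 1
        return dict(self.clock)

    def receive(self, stamp):
        # Merge the sender's knowledge, then count the receive event itself.
        for process, count in stamp.items():
            self.clock[process] = max(self.clock[process], count)
        return self.tick()


def happened_before(a, b):
    """True if the event stamped `a` causally precedes the event stamped `b`."""
    return all(a[p] <= b[p] for p in a) and any(a[p] < b[p] for p in a)


def concurrent(a, b):
    """True if neither event could have caused the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


processes = ["P", "Q", "R"]
p, q, r = (VectorClock(name, processes) for name in processes)

r1 = r.tick()        # R sends a message: {'P': 0, 'Q': 0, 'R': 1}
q1 = q.receive(r1)   # Q receives it:     {'P': 0, 'Q': 1, 'R': 1}
p1 = p.tick()        # unrelated local event at P

print(happened_before(r1, q1))  # the send caused the receive
print(concurrent(p1, q1))       # no causal path in either direction
```

The order of `r1` and `q1` is decidable only because a message carried R's clock to Q; `p1` and `q1` share no such path, so no ordering between them can be inferred.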

[Figure: space-time diagram of events at processes P, Q, and R (events numbered 11-14, 21-25, and 31-35), each labeled with its vector clock (P:n, Q:n, R:n, with "--" where no value is known yet). Axes: t and Process. The regions marked Causal History and Future Effect are bounded by lines with slope ≤ c.]

Page 6

Problem: Consensus is Hard

• Failure detectors have failed to solve the problem
• 2PC (Fail-Stop)
  • Vulnerable to coordinator failure (no safety proof)
  • 3PC vulnerable to network partitions (no liveness proof)
• Paxos (Fail-Recover)
  • Robust algorithm but hard to understand & get right
• Causal Trees make roles robust, easier to understand & verify
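The coordinator-failure weakness of 2PC can be made concrete with a small simulation. The sketch below is a textbook-style model, not code from the talk; the `Participant` class, the `State` enum, and the `coordinator_crashes` flag are all hypothetical names chosen for the example. It shows that a participant that has voted yes cannot safely commit or abort on its own, so it stays blocked if the coordinator's decision never arrives.

```python
from enum import Enum


class State(Enum):
    INIT = "init"
    PREPARED = "prepared"    # voted yes, waiting for the decision
    COMMITTED = "committed"
    ABORTED = "aborted"


class Participant:
    """Hypothetical 2PC participant used only for this illustration."""

    def __init__(self, name, vote_yes=True):
        self.name = name
        self.vote_yes = vote_yes
        self.state = State.INIT

    def prepare(self):
        # Phase 1: vote. A "no" voter may abort unilaterally.
        self.state = State.PREPARED if self.vote_yes else State.ABORTED
        return self.vote_yes

    def decide(self, commit):
        # Phase 2: only prepared participants still need the decision.
        if self.state is State.PREPARED:
            self.state = State.COMMITTED if commit else State.ABORTED


def two_phase_commit(participants, coordinator_crashes=False):
    votes = [p.prepare() for p in participants]
    if coordinator_crashes:
        # The decision is never delivered: prepared participants are stuck.
        return None
    commit = all(votes)
    for p in participants:
        p.decide(commit)
    return commit


healthy = [Participant("a"), Participant("b")]
two_phase_commit(healthy)
print([p.state.value for p in healthy])

blocked = [Participant("a"), Participant("b")]
two_phase_commit(blocked, coordinator_crashes=True)
print([p.state.value for p in blocked])
```

In the second run both participants remain in `PREPARED`: committing might contradict an abort the coordinator had decided, and aborting might contradict a commit, so neither choice is safe without more information.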

Page 7

Why? Because The Network is Flaky!

• App developers believe the network is the problem
  • Networks drop, delay, duplicate & reorder packets
• Networking people believe the apps are the problem
  • The network end-to-end principle: apps should retry to distinguish between delays & drops … but … retries* ruin TCP’s ordering guarantees
• Both are incorrect. The solution requires a simple but fresh perspective

Peter Bailis, Kyle Kingsbury. The network is reliable

* Application retries (i.e. opening a new socket)

Page 8

Datacenter Failures Cascade

Switches are DReDDful: they Drop, Reorder, Delay and Duplicate packets

• Interdependent failures
• Reconstruction storms
• Timeout storms
• Gossip storms
• Cascade failures

Page 9

It’s Time to Simplify

Delta, Amazon, Google, Apple, Netflix, Paypal …

Page 10

The Big Idea

World Wide Web (Tim Berners-Lee)
Key Idea: 2 Simple Sets of Rules
• Document Language (html)
• Connection Protocol (http)

Cloud Computing
Mere mortals can now get their computers to talk to each other
Mere mortals can now manage their infrastructures

ONE WAY LINKS

TWO WAY LINKS

Earth Computing
Key Idea: 2 Simple Sets of Rules
• Graph Language (gvml)
• Connection Protocol (eclp)

Page 11

Distributed Systems Primitives

CAS
• Atomic Instruction
• Key Idea: Lock-Free data structures
• New Concurrency Libraries
• Atomic RMW

AIT
• Atomic Information
• Key Idea: Recoverable Atomic Tokens
• Deterministic, In-Order
• Reversible Atomic Message

Concurrent Safety

Non-Blocking

Deterministic Recoverability

Durable Indivisible Property

Shared Memory

Reversible Token

{ While (CAS(oldvalue, newvalue, ) != newvalue) }

{ Transfer (AIT(tokenID, Notify=NO, ) != Continue) }
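The CAS retry loop in the pseudocode above can be sketched as follows. Python exposes no user-level compare-and-swap instruction, so the `AtomicCell` class below is an assumption that emulates one with a lock; only the shape of the retry loop in `atomic_increment` is the point, and the emulation itself is not lock-free. All names here are hypothetical.

```python
import threading


class AtomicCell:
    """Emulated atomic word. The lock stands in for hardware atomicity."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        # Atomically: write `new` only if the cell still holds `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def atomic_increment(cell):
    # Classic CAS loop: read, compute, try to publish, retry on interference.
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return old + 1


def run_demo(workers=4, per_worker=1000):
    cell = AtomicCell(0)

    def work():
        for _ in range(per_worker):
            atomic_increment(cell)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load()


print(run_demo())  # every increment is applied exactly once
```

A failed `compare_and_swap` means another thread changed the cell between the read and the write, so the loop rereads and tries again instead of blocking.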

Page 12

Today’s Networking: Servers & Switches
Servers, Any to Any (IP) addressing

[Figure: Typical Clos-Based DC Topology, Spine and Leaf Architecture, showing DCI/WAN, DC Gateways (GW), Spine Nodes (SN), Leaf Nodes (LN), ToR switches, and VMs.]

EARTH Computing: Cells & Links
Simpler Wiring: N2N, Switchless

[Figure: C2C Lattice of Cells & Links.]

Page 13

Today: Internal Segregation Firewalls

EARTH: Dynamic Confinement Domains

The Datacenter Simplified

Fundamentally Simpler

The Datacenter Today

Page 14

Split infrastructure into:
• Cloud: datacenter accessed by untrusted legacy protocols
• Earth: dynamic, resilient, programmable topologies
• Core: where data is immutable, secure, protected, & resilient to perturbations (failures, disasters, attacks)

[Figure labels: Cloudplane, OutsideWorld, EarthCore, Earth Computing Network Fabric, Data Center]

Page 15

Confidential | Earth Computing Inc.

The Big Idea

[Figure labels: Cloudplane, OutsideWorld, EarthCore]

Page 16

Logical Foundation for Resilience

[Figure: a Fabric of CellAgents, each attached to six NICs.]

Page 17

EARTH Computing Link Protocol (ECLP)

• Events: Replaces Heartbeats, Timeouts

• Addresses the Common Knowledge* Problem

[Figure: CellAgents with their NICs, joined by Cables.]

New Distributed Systems Foundation

*Knowledge and Common Knowledge in a Distributed Environment – Joseph Y. Halpern & Yoram Moses ’90 (initial version 1984).

Page 18

Composable Presence Management

[Figure: Routers and CellAgents with their NICs, joined by Cables.]

Page 19

Composable Presence Management

[Figure: Routers and CellAgents with their NICs, joined by Cables.]

Page 20

Demo

Page 21

Two Generals Problem

Page 22

Example Use Cases

Two Phase Commit

Paxos

Link Reversal

Page 23

Example Use Cases

• Two-phase commit: The prepare phase asks whether the receiving agent is ready to accept the token. This serves two purposes: communication liveness and agent readiness. Links provide the communication liveness test, and we can avoid blocking on agent readiness by having the link store the token on the receiving half of the link. If there is a failure, both sides know, and both sides know what to do next.

• Paxos: “Agents may fail by stopping, and may restart. Since all agents may fail after a value is chosen and then restart, a solution is impossible unless some information can be remembered by an agent that has failed and restarted”. The assumption is that when a node has failed and restarted, it can’t remember the state it needs to recover. With AIT, the other half of the link can tell it the state to recover from.

• Reliable tree generation: Binary link reversal algorithms work by reversing the directions of some edges, transforming an arbitrary directed acyclic input graph into an output graph with at least one route from each node to a special destination node. The resulting graph can thus be used to route messages in a loop-free manner. Links store the direction of the arrow (head and tail); AIT facilitates the atomic swap of the arrow’s tail and head to maintain loop-free routes during failure and recovery.
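The link-reversal idea in the last use case can be sketched with the classic full-reversal rule: any node other than the destination that has no outgoing edge reverses all of its incident edges, and this repeats until every node has a route to the destination. This is a generic textbook-style sketch, not Earth Computing's implementation and not the binary variant specifically; the function names, the `max_rounds` guard, and the example graph are assumptions for illustration, and the atomic swap that the slide assigns to AIT is modeled here as a plain set update.

```python
def full_reversal(nodes, edges, dest, max_rounds=1000):
    """Reorient a DAG so that every node has a directed route to `dest`.

    `edges` is a set of (tail, head) pairs. Assumes the underlying
    undirected graph is connected; `max_rounds` guards against misuse.
    """
    edges = set(edges)
    for _ in range(max_rounds):
        # A stuck node has incoming edges but no outgoing edge.
        stuck = [
            n for n in nodes
            if n != dest
            and not any(tail == n for tail, _ in edges)
            and any(head == n for _, head in edges)
        ]
        if not stuck:
            return edges
        for n in stuck:
            incoming = {(t, h) for (t, h) in edges if h == n}
            edges -= incoming
            edges |= {(h, t) for (t, h) in incoming}  # swap tail and head
    raise RuntimeError("did not converge; is the graph connected to dest?")


def reaches(edges, src, dest):
    """Depth-first search along edge directions."""
    seen, stack = set(), [src]
    while stack:
        node = stack.pop()
        if node == dest:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(head for tail, head in edges if tail == node)
    return False


nodes = ["A", "B", "C", "D"]
# Every edge initially points away from the destination D.
start = {("D", "C"), ("C", "B"), ("B", "A")}
routed = full_reversal(nodes, start, "D")
print(sorted(routed))
print(all(reaches(routed, n, "D") for n in nodes))
```

Two stuck nodes can never be neighbors, because the edge between them would give one of them an outgoing edge, so reversing every stuck node within a round does not conflict.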

Page 24

Common Knowledge

Courtesy: Adrian Colyer, The Morning Paper. https://blog.acolyer.org/2015/02/16/knowledge-and-common-knowledge-in-a-distributed-environment/

Page 25

Confidential | Earth Computing Inc. | Paul Borrill

Implementation on Smart NICs

[Figure: a cell with one Agent, six NICs, and six PHYs. Labels: Cell Hardware (containing Processor, Memory, Storage and (e.g. 6) physical network ports); Cell Software Primary Agent; Network Interface Controller; Network Connector; Entanglement Synchronization Domain.]

Page 26

TRAPHs (Tree-gRAPHs)
• Simple Provisioning, Confinement, Elasticity, Migration, Failover

New Distributed Systems Foundation

Bare Metal

NAL: Network Asset Layer

Page 27

Demo Simulator

Page 28

• Don’t Make Datacenter Look Like the Internet
  • Simpler to Configure/Reconfigure
  • More Resilient to all perturbations
  • Easier to Secure

• Key Ideas
  • RAFE: Reliable Address-Free Ethernet
  • Replace switches with cell-to-cell links
  • Don’t rely on blueprints, discover wiring
  • Event driven => No network timeouts
  • Keep state in links for recovery
  • No VLANs, no network-layer encryption
  • Scalable design: local-only view
  • NO IP; service addressing
  • Self-recovering from link & server failures
  • RAFE is a discovery process rather than a configuration process

Questions?
