distributed computing for new bloods
DESCRIPTION
This is the keynote talk i delivered at GeekCamp.SG 2014 The main purpose of the talk is to create an awareness, if not existent, in the community when it comes to choosing and wanting to building a distributed system. This presentation is not meant to be a survey of distributed computing through the ages but hopefully it serves as a good starting point in which the journeyman can start from. I want to thank Jonas, CTO of Typesafe, as his work in Akka strongly influenced my own and i hope it would help you in the way his work helped me.TRANSCRIPT
![Page 1: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/1.jpg)
DISTRIBUTED COMPUTING FOR NEW BLOODS
RAYMOND TAY I WORK IN HEWLETT-PACKARD LABS SINGAPORE
@RAYMONDTAYBL
![Page 2: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/2.jpg)
What is a Distributed System?
![Page 3: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/3.jpg)
What is a Distributed System?
It’s been there for a long time
![Page 4: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/4.jpg)
What is a Distributed System?
It’s been there for a long time
You did not realize it
![Page 5: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/5.jpg)
DISTRIBUTED SYSTEMS ARE THE NEW FRONTIER
whether YOU like it or not
![Page 6: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/6.jpg)
DISTRIBUTED SYSTEMS ARE THE NEW FRONTIER
whether YOU like it or not
LAN Games
Mobile
Databases
ATMs
Social Media
![Page 7: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/7.jpg)
WHAT IS THE ESSENCE OF DISTRIBUTED SYSTEMS?
![Page 8: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/8.jpg)
WHAT IS THE ESSENCE OF DISTRIBUTED SYSTEMS?
IT IS ABOUT THE ATTEMPT TO OVERCOME
INFORMATION TRAVELWHEN INDEPENDENT PROCESSES FAIL
![Page 9: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/9.jpg)
Information flows at speed of light !
X YMESSAGE
![Page 10: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/10.jpg)
When independent things DONT fail.
X Y
I’ve sent it. I’ve received it.
I’ve sent it.I’ve received it.
![Page 11: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/11.jpg)
When independent things fail, independently.
X Y
NETWORK FAILURE
![Page 12: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/12.jpg)
When independent things fail, independently.
X Y
NODE FAILURE
![Page 13: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/13.jpg)
There is no difference between a slow node and a dead node
![Page 14: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/14.jpg)
Why do we NEED it?Scalability
when the needs of the system outgrows what a single node can provide
![Page 15: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/15.jpg)
Why do we NEED it?Scalability
when the needs of the system outgrows what a single node can provide
Availability Enabling resilience when a node fails
![Page 16: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/16.jpg)
IS IT REALLY THAT HARD?
![Page 17: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/17.jpg)
IS IT REALLY THAT HARD?
The answer lies in knowing what is possible and what is not possible…
![Page 18: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/18.jpg)
IS IT REALLY THAT HARD?
Even widely accepted facts are challenged…
![Page 19: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/19.jpg)
THE FALLACIES OF DISTRIBUTED COMPUTING
The network is reliable Peter Deutsch and other fellows at Sun Microsystems
![Page 20: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/20.jpg)
THE FALLACIES OF DISTRIBUTED COMPUTING
The network is reliable
![Page 21: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/21.jpg)
THE FALLACIES OF DISTRIBUTED COMPUTING
The network is reliable
The network is secure
Peter Deutsch and other fellows at Sun Microsystems
![Page 22: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/22.jpg)
THE FALLACIES OF DISTRIBUTED COMPUTING
The network is reliable
The network is secure2013 cost of cyber crime study: United States
![Page 23: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/23.jpg)
THE FALLACIES OF DISTRIBUTED COMPUTING
The network is reliable
The network is secure
The network is homogeneous
Peter Deutsch and other fellows at Sun Microsystems
![Page 24: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/24.jpg)
THE FALLACIES OF DISTRIBUTED COMPUTING
The network is reliable
The network is secure
The network is homogeneous
Latency is zero
Peter Deutsch and other fellows at Sun Microsystems
Ingo Rammer on latency vs bandwidth
![Page 25: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/25.jpg)
THE FALLACIES OF DISTRIBUTED COMPUTING
The network is reliable
The network is secure
The network is homogeneous
Latency is zero
Bandwidth is infinite
Peter Deutsch and other fellows at Sun Microsystems
![Page 26: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/26.jpg)
THE FALLACIES OF DISTRIBUTED COMPUTING
The network is reliable
The network is secure
The network is homogeneous
Latency is zero
Bandwidth is infinite
Topology doesn’t change
Peter Deutsch and other fellows at Sun Microsystems
![Page 27: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/27.jpg)
THE FALLACIES OF DISTRIBUTED COMPUTING
The network is reliable
The network is secure
The network is homogeneous
Latency is zero
Bandwidth is infinite
Topology doesn’t change
There is one administrator
Peter Deutsch and other fellows at Sun Microsystems
![Page 28: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/28.jpg)
THE FALLACIES OF DISTRIBUTED COMPUTING
The network is reliable
The network is secure
The network is homogeneous
Latency is zero
Bandwidth is infinite
Topology doesn’t change
There is one administrator
Transport cost is zero
Peter Deutsch and other fellows at Sun Microsystems
![Page 29: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/29.jpg)
THE FALLACIES OF DISTRIBUTED COMPUTING
The network is reliable
The network is secure
The network is homogeneous
Latency is zero
Bandwidth is infinite
Topology doesn’t change
There is one administrator
Transport cost is zero
Peter Deutsch and other fellows at Sun Microsystems
![Page 30: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/30.jpg)
WHAT WAS TRIED IN THE PAST?
DISTRIBUTED OBJECTS REMOTE PROCEDURE CALL DISTRIBUTED SHARED MUTABLE STATE
![Page 31: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/31.jpg)
WHAT WAS ATTEMPTED?DISTRIBUTED OBJECTS
REMOTE PROCEDURE CALL
DISTRIBUTED SHARED MUTABLE STATE• Typically, programming constructs to “abstract” the fact
that there are local and distributed objects • Ignores latencies • Handles failures with this attitude → “i dunno what to
do…YOU (i.e. invoker) handle it”.
![Page 32: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/32.jpg)
WHAT WAS ATTEMPTED?DISTRIBUTED OBJECTS
REMOTE PROCEDURE CALL
DISTRIBUTED SHARED MUTABLE STATE
• Assumes the synchronous processing model • Asynchronous RPC tries to model after
synchronous RPC…
![Page 33: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/33.jpg)
WHAT WAS ATTEMPTED?DISTRIBUTED OBJECTS
REMOTE PROCEDURE CALL
DISTRIBUTED SHARED MUTABLE STATE
• Found in Distributed Shared Memory Systems i.e. “1” address space partitioned into “x” address spaces for “x” nodes
• e.g. JavaSpaces ⇒ Danger of 2 independent processes to commit successfully.
![Page 34: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/34.jpg)
YOU CAN APPRECIATE THAT IT IS VERY HARD TO SOLVE IN ITS ENTIRETY
![Page 35: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/35.jpg)
The question really is “How can we do better?”
![Page 36: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/36.jpg)
The question really is “How can we do better?”
To start,we need to understand 2 results
![Page 37: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/37.jpg)
FLP IMPOSSIBILITY RESULTThe FLP result shows that in an asynchronous model, where one processor might crash, there is no distributed algorithm to solve the consensus problem.
![Page 38: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/38.jpg)
Paxos (1989), Raft (2013)
Consensus is a fundamental problem in fault-tolerant distributed systems. Consensus involves multiple servers agreeing on values. Once they reach a decision on a value, that decision is final.
Typical consensus algorithms make progress when any majority of their servers are available; for example, a cluster of 5 servers can continue to operate even if 2 servers fail. If more servers fail, they stop making progress (but will never return an incorrect result).
![Page 39: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/39.jpg)
CAP THEOREMCAP CONJECTURE (2000) established as a theorem in 2002
Consistency
Partition Tolerance
Availability
![Page 40: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/40.jpg)
CAP THEOREMCAP CONJECTURE (2000) established as a theorem in 2002
Consistency
Partition Tolerance
Availability
• Consistency ⇒ all nodes should see the same data, eventually
![Page 41: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/41.jpg)
CAP THEOREMCAP CONJECTURE (2000) established as a theorem in 2002
Consistency
Partition Tolerance
Availability
• Consistency ⇒ all nodes should see the same data, eventually
• Availability ⇒ System is still working when node(s) fails
![Page 42: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/42.jpg)
CAP THEOREMCAP CONJECTURE (2000) established as a theorem in 2002
Consistency
Partition Tolerance
Availability
• Consistency ⇒ all nodes should see the same data, eventually
• Availability ⇒ System is still working when node(s) fails
![Page 43: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/43.jpg)
CAP THEOREMCAP CONJECTURE (2000) established as a theorem in 2002
Consistency
Partition Tolerance
Availability
• Consistency ⇒ all nodes should see the same data, eventually
• Availability ⇒ System is still working when node(s) fails
• Partition-Tolerance ⇒ System is still working on arbitrary message loss
![Page 44: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/44.jpg)
How does knowing FLP and CAP help me?
![Page 45: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/45.jpg)
How does knowing FLP and CAP help me?
It’s really about asking which would you sacrifice when a system fails? Consistency or Availability?
![Page 46: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/46.jpg)
You can choose C over A
It needs to preserve reads and writes i.e. linearizabilityi.e. A read will return the last completed write (on ANY replica) e.g. Two-Phase-Commit
![Page 47: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/47.jpg)
You can choose A over C
It might return stale reads …
![Page 48: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/48.jpg)
You can choose A over C
It might return stale reads …DynamoDB uses vector clocks; Cassandra uses a clever form of last-write-wins
![Page 49: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/49.jpg)
You cannot NOT choose P
Partition Tolerance is mandatory in distributed systems
![Page 50: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/50.jpg)
To scale, partition
To resilient, replicate
![Page 51: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/51.jpg)
A B C
A A BB
CC
ReplicationPartitioning
[101..200][0..100] [0..100] [101..200]
Node 1 Node 2
![Page 52: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/52.jpg)
Replication is strongly related to fault-tolerance
Fault-tolerance relies on reaching consensus
Paxos (1989), ZAB, Raft (2013)
![Page 53: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/53.jpg)
Consistent Hashing is used often PARTITIONING and REPLICATION
C-H commonly used in load-balancing web objects. In DynamoDB Cassandra, Riak and Memcached
![Page 54: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/54.jpg)
• If your problem can fit into memory of a single machine, you don’t need a distributed system
• If you really need to design / build a distributed system, the following helps: • Design for failure • Use the FLP and CAP to critique
systems • Algorithms that work for single-node
may not work in distributed mode • Avoid coordination of nodes
• Learn to estimate your capacity
![Page 55: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/55.jpg)
Jeff Dean - Google
![Page 56: Distributed computing for new bloods](https://reader037.vdocument.in/reader037/viewer/2022110114/5476e0cdb4af9f39378b4785/html5/thumbnails/56.jpg)
REFERENCES• http://queue.acm.org/detail.cfm?id=2655736 “The network is
reliable” • http://cacm.acm.org/blogs/blog-cacm/83396-errors-in-database-
systems-eventual-consistency-and-the-cap-theorem/fulltext • http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf • “Towards Robust Distributed Systems,” http://
www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf • “Why Do Computers Stop and What Can be Done About It,”
Tandem Computers Technical Report 85.7, Cupertino, Ca., 1985. http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf
• http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
• http://codahale.com/you-cant-sacrifice-partition-tolerance/ • http://dl.acm.org/citation.cfm?id=258660 “Consistent hashing
and random trees: distributed caching protocols for relieving hot spots on the World Wide Web”