intro to distributed systems

Distributed Systems and why you should care!

Ahmed Soliman (أحمد سليمان)

Why the hassle?

Wrestling With Capacity Constraints

Cannot add more (CPU Cores, RAM, Disk, Network Bandwidth, etc.) to a single box

Wrestling With Availability Constraints

Dealing with failures of (network, power, hard drives, faulty RAM, etc.)

Wrestling With Latency/Performance Constraints

Not the entire world is wired with optical fibers.

Sometimes even the speed of light is not fast enough*

* It takes about 39 ms for light to travel from Cairo to Dallas.
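The speed-of-light figure can be sanity-checked with a little arithmetic. The sketch below is only an estimate: the city coordinates are approximate values assumed for illustration, and it measures the great-circle distance in a vacuum, whereas real links follow longer cable routes and light in fiber travels at roughly two thirds of that speed.

```python
import math

SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
EARTH_RADIUS_KM = 6371.0           # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def one_way_light_ms(distance_km, speed_km_s=SPEED_OF_LIGHT_KM_S):
    """One-way travel time in milliseconds at the given speed."""
    return distance_km / speed_km_s * 1000.0


# Approximate city-centre coordinates (assumed values, not from the slides).
CAIRO = (30.04, 31.24)
DALLAS = (32.78, -96.80)

distance = great_circle_km(*CAIRO, *DALLAS)
latency_ms = one_way_light_ms(distance)
print(f"{distance:.0f} km, {latency_ms:.1f} ms one way in a vacuum")
```

Under these assumptions the result lands in the high 30s of milliseconds, the same ballpark as the figure quoted above, and that is before any routing, queuing, or processing delay is added.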

Welcome to Distributed Systems

The world is a mesh of very large and very small computers with an utterly sophisticated networking medium in between

It’s everywhere!

The web is currently the world’s largest distributed system, mobile is getting even larger, and wearables will make web scale look like an ant next to an elephant!

Computing was never that sophisticated

and it’s getting even more sophisticated every single day.

Some Challenges

• Heterogeneity and Abstractions

• Transparency and Abstractions

• Concurrency and Coordination

• Scalability

• Resilience to Failures

• Security

Heterogeneity

• Different systems will be written in different languages, running on different operating systems, with different network characteristics, computer architectures, etc.

• Clear boundaries and a set of abstractions must be defined. Think of Micro-service/SOA architecture, RESTful APIs, Thrift/Protobuf/Avro/msgpack/etc. for efficient binary message serialization across systems

Heterogeneity and Abstractions

• Abstractions through interfaces exposed through the network (think of sockets as low-level abstraction)

• Higher-level abstractions of data formats (JSON/Thrift/Protobuf/Avro/etc.) and protocols like HTTP, XML-RPC, Thrift-RPC, etc.

• Effective separation offers great flexibility for large-scale development teams and allows the use of the right tool for the right job

• This generates a new set of challenges: think of protocols and versioning, cascading failures that are hard to trace, and congestion/malfunction tracing that is an order of magnitude harder
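The versioning challenge mentioned above can be made concrete with a small sketch. It assumes a hypothetical "user" message encoded as JSON with an explicit `version` field; the field names and the version numbers are invented for illustration. The reader tolerates older messages and unknown extra fields, and refuses versions newer than it understands.

```python
import json

CURRENT_VERSION = 2  # hypothetical schema version understood by this reader


def encode_user(name, email):
    """Serialize a user message in the current (v2) format."""
    return json.dumps({"version": CURRENT_VERSION, "name": name, "email": email})


def decode_user(raw):
    """Decode a user message, accepting v1 (no email) and v2 payloads."""
    msg = json.loads(raw)
    version = msg.get("version", 1)  # messages without a version are treated as v1
    if version > CURRENT_VERSION:
        raise ValueError(f"unsupported message version: {version}")
    # 'email' was added in v2, so older senders simply omit it.
    # Unknown extra fields are ignored rather than rejected.
    return {"name": msg["name"], "email": msg.get("email")}


print(decode_user(encode_user("amira", "amira@example.com")))
print(decode_user('{"name": "legacy-client"}'))
```

Schema-based formats such as Protobuf, Thrift, and Avro provide their own rules for this kind of evolution; the sketch only shows the idea with plain JSON.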

Transparency

• Systems need to know how to reach other systems

• Central Registry, discovery protocols, gossip protocols

• Components living in the same process vs. inter-process communication (mobility)

• shared-memory or IPC/Unix Sockets/TCP Sockets/etc?

• Higher-level abstraction means that there is no conceptual difference between scaling vertically on multicore or horizontally on the cluster

• Think of failures, restarts, and relocation of components, and their effect on consumers.
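To illustrate the central-registry idea from the list above, here is a minimal in-memory sketch. The class name, the TTL-based heartbeat scheme, and the addresses are all assumptions made for illustration; a real registry would be a separate, replicated service reached over the network.

```python
import time


class ServiceRegistry:
    """Toy registry: instances stay listed only while they keep heartbeating."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the behaviour can be tested
        self._entries = {}       # service name -> {address: last heartbeat time}

    def heartbeat(self, service, address):
        """Register an instance, or refresh it if already known."""
        self._entries.setdefault(service, {})[address] = self._clock()

    def lookup(self, service):
        """Return the addresses whose last heartbeat is within the TTL."""
        now = self._clock()
        live = {addr: seen
                for addr, seen in self._entries.get(service, {}).items()
                if now - seen <= self._ttl}
        self._entries[service] = live  # forget expired instances
        return sorted(live)


registry = ServiceRegistry(ttl_seconds=10.0)
registry.heartbeat("users", "10.0.0.1:8080")
print(registry.lookup("users"))
```

Because instances that stop heartbeating silently drop out of `lookup`, consumers see restarts and relocations as a changed address list rather than as a hard-coded endpoint that stopped answering.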

Concurrency

• A multi-user system means that users will be competing for system resources.

• Don’t confuse concurrency and parallelism.

• A tricky business: if you think that threading is the best way to handle concurrency, you will pay the cost for that… a lot!

• heisenbugs™ everywhere

• Please welcome: race conditions, deadlocks, locking, barriers, compiler reordering, compiler optimizations, etc.
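A classic source of the race conditions listed above is an unsynchronized read-modify-write on shared state: `value += 1` reads, adds, and writes back, and two threads can interleave between those steps and lose an update. The sketch below, with an illustrative `Counter` class, shows the usual lock-based fix rather than the bug itself, since the bug is timing-dependent and hard to reproduce on demand.

```python
import threading


class Counter:
    """Shared counter whose increment is protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, the read-modify-write below could interleave
        # with another thread and silently drop an increment.
        with self._lock:
            self.value += 1


def run(n_threads=8, n_increments=10_000):
    """Hammer one counter from several threads and return the final value."""
    counter = Counter()

    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


print(run())
```

The lock makes the result deterministic, at the price of serializing every increment, which is part of the cost of threading that the slide warns about.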

Concurrency

• Mutable shared state is the root of most evil™

• Low-level abstractions (threading)

• Higher-level abstractions in programming languages (co-routines, async in C#, etc.)

• Actor Model (Erlang OTP, Scala/Java Akka, etc.)

• Concurrent but not parallel using async IO (Node.js)
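The "concurrent but not parallel" point can be sketched with Python's `asyncio`, which uses a single-threaded event loop in the same spirit as the async IO model mentioned above. The `fake_request` coroutine is a stand-in that sleeps instead of doing real network IO.

```python
import asyncio
import time


async def fake_request(name, delay):
    """Pretend to wait on the network; the event loop runs other tasks meanwhile."""
    await asyncio.sleep(delay)
    return name


async def main():
    start = time.monotonic()
    # All three "requests" are in flight at once on one thread.
    results = await asyncio.gather(
        fake_request("a", 0.2),
        fake_request("b", 0.2),
        fake_request("c", 0.2),
    )
    elapsed = time.monotonic() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

The total time is close to the longest single wait rather than the sum of all waits, even though no two lines of Python ever execute at the same instant. CPU-bound work would gain nothing from this arrangement.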

Coordination

• Low-level coordination primitives from the operating system like mutexes, read-write locks, reentrant locks, flock, etc.

• Cross-system coordination is sometimes needed

• Central registry with atomic semantics

• Consensus protocols (Paxos, Raft, etc.)
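"Atomic semantics" in a central registry usually boils down to an operation like compare-and-set. The sketch below is an in-process stand-in, with threads playing the role of nodes and a lock playing the role of the registry's atomicity guarantee; the class and key names are hypothetical. It shows how such an operation lets exactly one node claim a leader role.

```python
import threading


class AtomicRegistry:
    """In-process stand-in for a registry offering atomic compare-and-set."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def compare_and_set(self, key, expected, new):
        """Set key to new only if its current value equals expected."""
        with self._lock:
            if self._data.get(key) == expected:
                self._data[key] = new
                return True
            return False

    def get(self, key):
        with self._lock:
            return self._data.get(key)


def try_become_leader(registry, node_id):
    """A node wins only if nobody holds the 'leader' key yet."""
    return registry.compare_and_set("leader", None, node_id)


registry = AtomicRegistry()
print(try_become_leader(registry, "node-1"))  # first claim succeeds
print(try_become_leader(registry, "node-2"))  # second claim is rejected
```

A real deployment also has to handle the leader crashing (leases or timeouts) and the registry itself failing, which is where consensus protocols such as Paxos and Raft come in.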

Coordination

• Classical master-writer, slave-reader clustering is an easy fix but is proven to scale poorly in write-intensive environments (e.g., social networks)

• Sharding/partitioning may help in avoiding coordination altogether, but rebalancing and healing are extra work.

• CRDTs (Conflict-free Replicated Data Types) may be one potential solution too.
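As one concrete CRDT, here is a minimal sketch of a grow-only counter (commonly called a G-Counter). Each replica increments only its own slot, and merging takes the per-node maximum, so replicas converge without coordinating. The class is illustrative and not tied to any particular library.

```python
class GCounter:
    """Grow-only counter: one slot per node, merged by element-wise max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node id -> increments observed from that node

    def increment(self, amount=1):
        # A replica only ever touches its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Max is commutative, associative, and idempotent, so merges can
        # arrive in any order, any number of times.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
print(a.value(), b.value())
```

The trade-off is that the data type has to be designed around a merge function with these properties; a counter that can also decrease, for example, needs a different construction.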

Scalability

• Horizontal Scalability vs. Vertical Scalability

• Optimization of per-node capacity (req/s) is a third dimension of your scalability design plan

• Single Points of Failure

• Performance Bottlenecks

• Commutative Operations

• Consistency Levels (Strong, Weak, Eventual)

Resilience

• Ability to sustain high loads without crashing

• Ability to recover from partial crashes

• Failures should be properly reported to the operations team and gracefully delivered to the client

• Avoidance of cascading failures as much as possible
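One widely used way to limit cascading failures is a circuit breaker: after repeated failures, callers stop hitting the struggling dependency for a while and fail fast instead. The slides do not name this pattern, so treat the sketch below as one possible approach; the thresholds and class names are illustrative assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self._max_failures = max_failures
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


breaker = CircuitBreaker(max_failures=3, reset_after=30.0)
print(breaker.call(lambda: "ok"))
```

Failing fast keeps threads and connections from piling up behind a dead dependency, and the `CircuitOpenError` gives the caller a clear signal to degrade gracefully and report the problem.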

Security

• Authentication and authorization across different systems

• Central authorization vs. state-less authorization

• data confidentiality across different systems

• JWT (JSON Web Tokens) as one solution to decentralized authenticity verification
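The idea behind stateless verification can be sketched with the standard library alone: any service holding the shared secret can check a token's signature and expiry without calling a central server. The code below builds an HS256-style token in the JWT layout (header, payload, signature), but it is a teaching sketch under simplifying assumptions, not a replacement for a vetted JWT library; the claim names and the secret are made up.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def sign_token(claims: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [_b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
             for part in (header, claims)]
    signing_input = ".".join(parts)
    return signing_input + "." + _b64url(_sign(signing_input, secret))


def verify_token(token: str, secret: bytes, now=None) -> dict:
    """Return the claims if signature and expiry check out, else raise ValueError."""
    header_b64, payload_b64, sig_b64 = token.split(".")  # ValueError if malformed
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    claims = json.loads(_b64url_decode(payload_b64))
    exp = claims.get("exp")
    if exp is not None and (time.time() if now is None else now) >= exp:
        raise ValueError("token expired")
    return claims


secret = b"demo-secret"  # illustrative only; never hard-code real secrets
token = sign_token({"sub": "alice", "exp": time.time() + 60}, secret)
print(verify_token(token, secret)["sub"])
```

With a shared secret, every verifier can also mint tokens, so asymmetric signatures are the usual choice when verifiers should not be able to issue tokens themselves.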

In Short…

• Distributed Systems are much more fun than you think

• It’s a mixture of science and art

• It’s about putting the right tradeoffs in place

• It has an immediate business impact, so be careful.

Thank You!
