intro to distributed systems
Distributed Systems and why you should care!
Ahmed Soliman (أحمد سليمان)
Why the hassle?
Wrestling With Capacity Constraints
Cannot add more (CPU Cores, RAM, Disk, Network Bandwidth, etc.) to a single box
Wrestling With Availability Constraints
Dealing with failures of (Network, Power, Hard-drives, Faulty Ram, etc.)
Wrestling With Latency/Performance Constraints
Not the entire world is wired with optical fiber.
Sometimes even the speed of light is not fast enough*
* It takes about 39 ms for light to travel one way from Cairo to Dallas.
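The Cairo to Dallas figure can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: the city coordinates are approximate values assumed for illustration, the path is a great circle, and light is assumed to travel in vacuum (it is slower in fiber). The result lands in the same ballpark as the number above; the exact value depends on those assumptions.

```python
import math

SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
EARTH_RADIUS_KM = 6371.0           # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def one_way_light_ms(distance_km):
    """Time in milliseconds for light to cover distance_km in vacuum."""
    return distance_km / SPEED_OF_LIGHT_KM_S * 1000


# Approximate city-centre coordinates (assumed for illustration).
CAIRO = (30.04, 31.24)
DALLAS = (32.78, -96.80)

distance = great_circle_km(*CAIRO, *DALLAS)
latency = one_way_light_ms(distance)
print(f"{distance:.0f} km, {latency:.1f} ms one way in vacuum")
```

A request/response round trip doubles this, before any routing, queuing, or processing time is added.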
Welcome to Distributed Systems
The world is a mesh of very large and very small computers with an utterly sophisticated networking medium in between
It’s everywhere!
The web is currently the world’s largest distributed system, mobile is getting even larger, and wearables will make web scale look like an ant next to an elephant!
Computing was never that sophisticated
and it’s getting even more sophisticated every single day.
Some Challenges
• Heterogeneity and Abstractions
• Transparency and Abstractions
• Concurrency and Coordination
• Scalability
• Resilience to Failures
• Security
Heterogeneity
• Different systems will be written in different languages and will run on different operating systems, networks, computer architectures, etc.
• Clear boundaries and a set of abstractions must be defined. Think of Micro-service/SOA architecture, RESTful APIs, Thrift/Protobuf/Avro/msgpack/etc. for efficient binary message serialization across systems
Heterogeneity and Abstractions
• Abstractions through interfaces exposed through the network (think of sockets as low-level abstraction)
• Higher-level abstractions: data formats (JSON/Thrift/Protobuf/Avro/etc.) and protocols like HTTP, XML-RPC, Thrift-RPC, etc.
• Effective separation offers great flexibility for large-scale development teams and allows the use of the right tool for the right job
• Generates a new set of challenges: protocol versioning, cascading failures that are hard to trace, and congestion or malfunctions that are an order of magnitude harder to track down
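One way to picture the protocol versioning challenge is a reader that tolerates both old and new message shapes. The sketch below is only illustrative: the message layout, the "version" field, and the function names are all hypothetical, and JSON is used purely because it needs no extra libraries.

```python
import json


def encode_user_v2(name, email):
    # Version 2 of a hypothetical "user" message adds an "email" field.
    return json.dumps({"version": 2, "name": name, "email": email})


def decode_user(raw):
    """Tolerant reader: accepts v1 and v2 messages and ignores unknown fields."""
    msg = json.loads(raw)
    version = msg.get("version", 1)  # v1 producers never sent a version
    if version > 2:
        raise ValueError(f"unsupported message version {version}")
    return {"name": msg["name"], "email": msg.get("email")}


old = '{"name": "amira"}'                         # produced by a v1 service
new = encode_user_v2("omar", "omar@example.com")  # produced by a v2 service

print(decode_user(old))
print(decode_user(new))
```

The consumer keeps working while producers are upgraded one at a time, which is usually what a heterogeneous fleet needs.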
Transparency
• Systems need to know how to reach other systems
• Central Registry, discovery protocols, gossip protocols
• Components living in the same process vs. inter-process communication (mobility)
• shared-memory or IPC/Unix Sockets/TCP Sockets/etc?
• Higher-level abstraction means that there is no conceptual difference between scaling vertically on multicore or horizontally on the cluster
• Think of failures, restarts, and relocation of components, and their effect on consumers.
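A central registry can be pictured as a lookup table that consumers consult on every call instead of hard-coding addresses. The class below is a toy, in-memory stand-in with hypothetical names; a real registry would be a networked, replicated service with health checks.

```python
import random


class Registry:
    """Toy in-memory service registry (hypothetical, single process only)."""

    def __init__(self):
        self._instances = {}  # service name -> set of addresses

    def register(self, name, address):
        self._instances.setdefault(name, set()).add(address)

    def deregister(self, name, address):
        self._instances.get(name, set()).discard(address)

    def lookup(self, name):
        """Return the address of one live instance, chosen at random."""
        instances = self._instances.get(name)
        if not instances:
            raise LookupError(f"no live instance of {name!r}")
        return random.choice(sorted(instances))


registry = Registry()
registry.register("billing", "10.0.0.5:8080")
registry.register("billing", "10.0.0.6:8080")
registry.deregister("billing", "10.0.0.5:8080")  # instance crashed or moved
print(registry.lookup("billing"))
```

Because consumers resolve the name each time, a restart or relocation of a component only has to update the registry, not every caller.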
Concurrency
• A multi-user system means that users will be competing for system resources.
• Don’t confuse concurrency and parallelism.
• A tricky business: if you think that threading is the best way to handle concurrency, you will pay the cost for that… a lot!
• heisenbugs™ everywhere
• Please welcome: race conditions, deadlocks, locking, barriers, compiler reordering, compiler optimizations, etc.
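To make the race-condition point concrete, here is a minimal sketch of a counter shared between threads. The increment is a read-modify-write, so without the lock two threads can interleave and lose updates; the lock serializes them. Class and function names are made up for the example.

```python
import threading


class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # "self.value += 1" is a read, an add, and a write; the lock makes
        # the three steps behave as one atomic operation.
        with self._lock:
            self.value += 1


def run(n_threads=4, n_increments=10_000):
    counter = Counter()

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


print(run())
```

Removing the lock may still appear to work on many runs, which is exactly what makes this class of bug so hard to reproduce.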
Concurrency
• Mutable shared state is the root of most evil™
• Low-level abstractions (threading)
• Higher-level abstractions in programming languages (co-routines, async in C#, etc.)
• Actor Model (Erlang OTP, Scala/Java Akka, etc.)
• Concurrent but not Parallel using Async IO (NodeJS)
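The "concurrent but not parallel" idea can be sketched with Python's asyncio, which follows the same single-threaded event-loop model as NodeJS. The fetch coroutine below is a hypothetical stand-in for a network call; three of them overlap their waiting time on one thread.

```python
import asyncio
import time


async def fetch(name, delay):
    await asyncio.sleep(delay)  # stand-in for waiting on a network call
    return name


async def main():
    start = time.perf_counter()
    # All three coroutines wait at the same time on a single thread.
    results = await asyncio.gather(
        fetch("a", 0.2), fetch("b", 0.2), fetch("c", 0.2)
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

The total time is close to one delay rather than three, yet no two pieces of Python code ever ran at the same instant. This helps with I/O-bound work, not CPU-bound work.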
Coordination
• Low-level coordination primitives from the operating system like mutexes, read-write locks, reentrant locks, flock, etc.
• Cross-system coordination is sometimes needed
• central registry with atomic semantics
• consensus protocols (paxos, raft, etc.)
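"Atomic semantics" in a central registry usually boils down to an operation like compare-and-set: write a value only if the current value is what the caller expects. The sketch below is a single-process stand-in with hypothetical names; it is not a consensus protocol, it only shows the contract such a registry would offer to competing nodes.

```python
import threading


class AtomicRegistry:
    """In-memory key-value store with compare-and-set (illustrative only)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def compare_and_set(self, key, expected, new):
        """Set key to new only if its current value equals expected."""
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True

    def get(self, key):
        with self._lock:
            return self._data.get(key)


registry = AtomicRegistry()
# Two nodes race to become leader; only the first claim can succeed.
won_a = registry.compare_and_set("leader", None, "node-a")
won_b = registry.compare_and_set("leader", None, "node-b")
print(won_a, won_b, registry.get("leader"))
```

Making this registry itself survive failures is where consensus protocols such as Paxos or Raft come in.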
Coordination
• Classical master-writer, slave-reader clustering is an easy fix but proven to scale poorly in write-intensive environments (e.g., social networks)
• Sharding/Partitioning may help in avoiding coordination altogether; still, rebalancing and healing are extra work.
• CRDTs (Conflict-free Replicated Data Types) may be one potential solution too.
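As a small illustration of the CRDT idea, here is a sketch of a grow-only counter (commonly called a G-Counter), one of the simplest such types. Each replica increments only its own slot, and merging takes the per-node maximum, so replicas converge without coordination regardless of the order or number of merges. The class is written for this example and is not taken from any library.

```python
class GCounter:
    """Grow-only counter: one slot per node, merged with element-wise max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node id -> increments observed from that node

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # max() is commutative, associative, and idempotent, which is what
        # lets replicas exchange state in any order, any number of times.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


a, b = GCounter("a"), GCounter("b")
a.increment(3)   # updates applied independently on each replica
b.increment(2)
a.merge(b)
b.merge(a)
a.merge(b)       # a repeated merge changes nothing
print(a.value(), b.value())
```

The price is a restricted data model: this counter can only grow, and richer types need more elaborate merge rules.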
Scalability
• Horizontal Scalability vs Vertical Scalability
• Optimization of per-node capacity (req/s) is a third dimension of your scalability design plan
• Single Points of Failure
• Performance Bottlenecks
• Commutative Operations
• Consistency Levels (Strong, Weak, Eventual)
Resilience
• Ability to sustain high loads without crashing
• Ability to recover from partial crashes
• Failures should be properly reported to the operations team and gracefully delivered to the client
• Avoidance of cascading failures as far as possible
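One common pattern for limiting cascading failures is a circuit breaker: after a few consecutive failures the caller stops hitting the unhealthy dependency and fails fast for a while. The sketch below is a simplified version under assumed defaults (failure threshold, reset timeout); the class and exception names are invented for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            self.opened_at = None       # half-open: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0               # any success closes the circuit fully
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and turn `CircuitOpenError` into a graceful error for the client instead of queuing more work behind a dead service.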
Security
• Authentication and authorization across different systems
• Central authorization vs. state-less authorization
• Data confidentiality across different systems
• JWT (JSON Web Tokens) as one solution for decentralized authenticity verification
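To show why a signed token allows verification without calling a central service, here is a minimal sketch of the JWT compact form with an HMAC-SHA256 signature (HS256), built only on the standard library. It is meant for illustration: the function names are made up, only the "exp" claim is checked, and production systems would normally rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(claims: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    sig = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(sig)}"


def verify(token: str, secret: bytes, now=None) -> dict:
    """Return the claims if the signature and expiry check out, else raise."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if "exp" in claims and current >= claims["exp"]:
        raise ValueError("token expired")
    return claims


token = sign({"sub": "user-1", "exp": 2_000}, b"shared-secret")
print(verify(token, b"shared-secret", now=1_000))
```

With HS256 every verifying service must hold the shared secret, so any of them could also mint tokens. An asymmetric algorithm such as RS256 lets services verify with a public key only.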
In Short…
• Distributed Systems are much more fun than you think
• It’s a mixture of science and art
• It’s about making the right tradeoffs
• It has an immediate business impact, so be careful.
Thank You!