
Designing Large-Scale Distributed Systems

Ashwani Priyedarshi

“The network is the computer.”

John Gage, Sun Microsystems

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.”

Leslie Lamport

“Of three properties of distributed data systems – consistency, availability, partition-tolerance – choose two.”

Eric Brewer, CAP Theorem, PODC 2000

Agenda

● Consistency Models

● Transactions

● Why distribute?

● Decentralized Architecture

● Design Techniques & Tradeoffs

● A Few Real-World Examples

● Conclusions

Consistency Models

• Restricts the possible values that a read operation on an item can return
– Some models are very restrictive, others less so
– The less restrictive ones are easier to implement

• The most natural semantics for a storage system is “a read should return the last written value”
– With concurrent accesses and multiple replicas, it is not easy to identify what the “last write” means

Strict Consistency

● Assumes the existence of absolute global time

● It is impossible to implement on a large distributed system

● No two operations (even by different clients) are allowed at the same time

● Example: Sequence (a) satisfies strict consistency, but sequence (b) does not

Sequential Consistency

● The result of any execution is the same as if:
– the read and write operations by all processes on the data store were executed in some sequential order, and
– the operations of each individual process appear in this sequence in the order specified by its program

● All processes see the same interleaving of operations

● Many interleavings are valid

● Different runs of a program might act differently

● Example: Sequence (a) satisfies sequential consistency, but sequence (b) does not
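To make the definition concrete, here is a minimal brute-force sketch in Python (the function and data-format names are illustrative, not from the talk): it checks whether a set of per-process histories can be explained by a single interleaving that respects each process's program order.

    from itertools import permutations

    # A history is a list of ops in program order; an op is
    # ("w", var, value) or ("r", var, value).

    def respects_program_order(perm):
        last = {}
        for proc, idx, _ in perm:
            if last.get(proc, -1) > idx:
                return False
            last[proc] = idx
        return True

    def legal_replay(perm):
        mem = {}  # a read must return the most recent write (None if no write yet)
        for _, _, (kind, var, val) in perm:
            if kind == "w":
                mem[var] = val
            elif mem.get(var) != val:
                return False
        return True

    def is_sequentially_consistent(histories):
        ops = [(p, i, op) for p, h in enumerate(histories)
               for i, op in enumerate(h)]
        # Brute force over all interleavings: exponential, illustration only.
        return any(respects_program_order(perm) and legal_replay(perm)
                   for perm in permutations(ops))

    # P0 writes x=1; P1 reads x twice and sees 1 both times: consistent.
    print(is_sequentially_consistent(
        [[("w", "x", 1)], [("r", "x", 1), ("r", "x", 1)]]))  # True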

Consistency vs Availability

• In large shared-data distributed systems, network partitions are a given

• You must choose between consistency and availability

• Either choice requires the client developer to be aware of what the system is offering

Eventual Consistency

• An eventually consistent storage system guarantees that if no new updates are made to an object, eventually all accesses will return the last updated value

• If no failures occur, the maximum size of the inconsistency window can be determined from factors such as:
– load on the system
– communication delays
– number of replicas

• The most popular system that implements eventual consistency is DNS

Quorum-based Technique

• Used to enforce consistent operation in a distributed system

• Consider the following parameters:
– N = total number of replicas
– W = number of replicas that must acknowledge a write
– R = number of replicas accessed during a read

• If W + R > N:
– the read set and the write set always overlap, and one can guarantee strong consistency

• If W + R <= N:
– the read and write sets might not overlap, and consistency cannot be guaranteed
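A minimal sketch of the overlap argument (names invented here; a single process stands in for the cluster): writes update W replicas with a version number, and reads poll R replicas and take the highest-versioned value.

    import random

    class Replica:
        def __init__(self):
            self.version, self.value = 0, None

    class QuorumStore:
        def __init__(self, n, w, r):
            assert w + r > n, "need W + R > N for strong consistency"
            self.replicas = [Replica() for _ in range(n)]
            self.w, self.r = w, r
            self.clock = 0  # stand-in for a distributed version number

        def write(self, value):
            self.clock += 1
            # Wait for acks from W replicas (here: just update W of them).
            for rep in random.sample(self.replicas, self.w):
                rep.version, rep.value = self.clock, value

        def read(self):
            # Poll R replicas; the read quorum must intersect the write quorum.
            polled = random.sample(self.replicas, self.r)
            return max(polled, key=lambda rep: rep.version).value

    store = QuorumStore(n=5, w=3, r=3)  # 3 + 3 > 5
    store.write("hello")
    print(store.read())                 # always "hello"

With N=5, W=3, R=3 every read quorum shares at least one replica with every write quorum, which is exactly the W + R > N condition above.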

Agenda

● Consistency Models

● Transactions

● Why distribute?

● Decentralized Architecture

● Design Techniques & Tradeoffs

● A Few Real-World Examples

● Conclusions

Transactions

● Extended form of consistency across multiple operations

● Example: transfer money from A to B
– Subtract from A
– Add to B

● What if something happens in between?
– Another transaction on A or B
– The machine crashes
– ...
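As a concrete illustration of the transfer example, a sketch using Python's built-in sqlite3 module (table and account names invented here): the two updates either both commit or both roll back.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 0)])
    conn.commit()

    def transfer(conn, src, dst, amount):
        try:
            with conn:  # opens a transaction; commits on success, rolls back on error
                conn.execute(
                    "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                    (amount, src))
                conn.execute(
                    "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                    (amount, dst))
        except sqlite3.Error:
            pass  # both updates were rolled back; balances are unchanged

    transfer(conn, "A", "B", 30)
    print(dict(conn.execute("SELECT name, balance FROM accounts")))
    # {'A': 70, 'B': 30}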

Why Transactions?

● Correctness

● Consistency

● Enforce Invariants

● ACID

Agenda

● Consistency Models

● Transactions

● Why distribute?

● Decentralized Architecture

● Design Techniques & Tradeoffs

● A Few Real-World Examples

● Conclusions

Why distribute?

● Catastrophic Failures

● Expected Failures

● Routine Maintenance

● Geolocality
– CDNs, edge caching

Why NOT distribute?

● Within a Datacenter
– High bandwidth: 1-100 Gbps interconnects
– Low latency: < 1 ms within a rack, < 5 ms across racks
– Little to no cost

● Between Datacenters
– Low bandwidth: 10 Mbps-1 Gbps
– High latency: expect 100s of ms
– High cost for fiber

Agenda

● Consistency Models

● Transactions

● Why distribute?

● Decentralized Architecture

● Design Techniques & Tradeoffs

● A Few Real-World Examples

● Conclusions

Decentralized Architecture

● Operating from multiple datacenters simultaneously

● Hard problem

● Maintaining consistency? Harder

● Transactions? Hardest

Option 1: Don't

● Most common

● Make sure the datacenter never goes down

● Bad at catastrophic failure
– Large-scale data loss

● Not great for serving
– No geolocation

Option 2: Primary with hot failover(s)

● Better, but not ideal

● Mediocre at catastrophic failure
– Window of lost data
– Failover data may be inconsistent

● Geolocated for reads, but not for writes

Option 3: Truly Distributed

● Simultaneous writes in different datacenters, while maintaining consistency
– Two-way: hard
– N-way: harder

● Handles catastrophic failure and geolocality

● But: high latency

Agenda

● Consistency Models

● Transactions

● Why distribute?

● Decentralized Architecture

● Design Techniques & Tradeoffs

● A Few Real-World Examples

● Conclusions

Tradeoffs

              Backups   M/S        MM          2PC         Paxos
Consistency
Transactions
Latency
Throughput
Data Loss
Failover

Backups

● Make a copy

● Weak consistency

● Usually no transactions

Tradeoffs – Backups

              Backups   M/S        MM          2PC         Paxos
Consistency   Weak
Transactions  No
Latency       Low
Throughput    High
Data Loss     High
Failover      Down

Master/slave replication

● Usually asynchronous

● Good for throughput, latency

● Weak/eventual consistency

● Supports transactions

Tradeoffs – Master/Slave

              Backups   M/S        MM          2PC         Paxos
Consistency   Weak      Eventual
Transactions  No        Full
Latency       Low       Low
Throughput    High      High
Data Loss     High      Some
Failover      Down      Read Only

Multi-master replication

● Asynchronous, eventual consistency

● Concurrent writes need a serialization protocol
– e.g. monotonically increasing timestamps
– either via master election or a distributed consensus protocol

● No strong consistency

● No global transactions
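One common serialization scheme is last-writer-wins over (timestamp, node id) tags. A minimal sketch (names invented here) showing that two replicas merging in either order converge on the same value:

    # Each record is tagged (timestamp, node_id); ties break on node_id,
    # giving a total order that every replica resolves identically.

    def merge(local, remote):
        # local/remote: {key: (timestamp, node_id, value)}
        merged = dict(local)
        for key, rec in remote.items():
            if key not in merged or rec[:2] > merged[key][:2]:
                merged[key] = rec
        return merged

    a = {"x": (10, "dc1", "foo")}
    b = {"x": (10, "dc2", "bar")}  # concurrent write with the same timestamp
    print(merge(a, b))             # both sides converge on dc2's write
    print(merge(b, a))

Note the tradeoff this implies: one of the concurrent writes is silently discarded, which is why multi-master gives eventual rather than strong consistency.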

Tradeoffs – Multi-master

              Backups   M/S        MM          2PC         Paxos
Consistency   Weak      Eventual   Eventual
Transactions  No        Full       Local
Latency       Low       Low        Low
Throughput    High      High       High
Data Loss     High      Some       Some
Failover      Down      Read Only  Read/write

Two-Phase Commit

● Semi-distributed consensus protocol
– deterministic coordinator

● Phase 1: request; Phase 2: commit/abort

● Heavyweight, synchronous, high latency

● 3PC: asynchronous (one extra round trip)

● Poor throughput
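A toy single-process sketch of the two phases (hypothetical names; timeouts and coordinator failure, the hard parts in practice, are not modeled):

    # Phase 1: the coordinator asks every participant to prepare (vote).
    # Phase 2: commit only if all voted yes; otherwise abort everywhere.

    class Participant:
        def __init__(self, name, will_vote_yes=True):
            self.name, self.will_vote_yes = name, will_vote_yes
            self.state = "init"

        def prepare(self):  # phase 1: vote
            self.state = "prepared" if self.will_vote_yes else "aborted"
            return self.will_vote_yes

        def commit(self):   # phase 2a
            self.state = "committed"

        def abort(self):    # phase 2b
            self.state = "aborted"

    def two_phase_commit(participants):
        if all(p.prepare() for p in participants):  # phase 1: request
            for p in participants:                  # phase 2: commit
                p.commit()
            return "committed"
        for p in participants:                      # phase 2: abort
            p.abort()
        return "aborted"

    ps = [Participant("db1"), Participant("db2", will_vote_yes=False)]
    print(two_phase_commit(ps))  # "aborted": one participant voted no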

Tradeoffs – 2PC

              Backups   M/S        MM          2PC         Paxos
Consistency   Weak      Eventual   Eventual    Strong
Transactions  No        Full       Local       Full
Latency       Low       Low        Low         High
Throughput    High      High       High        Low
Data Loss     High      Some       Some        None
Failover      Down      Read Only  Read/write  Read/write

Paxos

● Decentralized, distributed consensus protocol

● Protocol similar to 2PC/3PC

● Lighter, but still high latency

● Three classes of agents: proposers, acceptors, learners

● Phase 1: (a) prepare, (b) promise; Phase 2: (a) accept, (b) accepted

● Survives minority failure
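A compressed single-decree sketch of the message flow above (one process, no failures modeled; names invented here). Note how a second proposer is forced to re-propose a value that was already accepted:

    class Acceptor:
        def __init__(self):
            self.promised = -1
            self.accepted = None  # (number, value) or None

        def prepare(self, n):              # 1a -> 1b
            if n > self.promised:
                self.promised = n
                return True, self.accepted  # promise, plus any prior accept
            return False, None

        def accept(self, n, value):        # 2a -> 2b
            if n >= self.promised:
                self.promised = n
                self.accepted = (n, value)
                return True
            return False

    def propose(acceptors, n, value):
        # Phase 1: gather promises from a majority.
        promises = [a.prepare(n) for a in acceptors]
        granted = [acc for ok, acc in promises if ok]
        if len(granted) <= len(acceptors) // 2:
            return None
        # If any acceptor already accepted a value, we must propose that one.
        prior = [acc for acc in granted if acc is not None]
        if prior:
            value = max(prior)[1]
        # Phase 2: ask the majority to accept.
        acks = sum(a.accept(n, value) for a in acceptors)
        return value if acks > len(acceptors) // 2 else None

    acceptors = [Acceptor() for _ in range(5)]
    print(propose(acceptors, n=1, value="red"))   # chosen: "red"
    print(propose(acceptors, n=2, value="blue"))  # must re-propose "red"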

Tradeoffs

              Backups   M/S        MM          2PC         Paxos
Consistency   Weak      Eventual   Eventual    Strong      Strong
Transactions  No        Full       Local       Full        Full
Latency       Low       Low        Low         High        High
Throughput    High      High       High        Low         Medium
Data Loss     High      Some       Some        None        None
Failover      Down      Read Only  Read/write  Read/write  Read/write

Agenda

● Consistency Models

● Transactions

● Why distribute?

● Decentralized Architecture

● Design Techniques & Tradeoffs

● A Few Real-World Examples

● Conclusions

Examples

● Megastore
– Google's Scalable, Highly Available Datastore
– Strong Consistency, Paxos
– Optimized for reads

● Dynamo
– Amazon's Highly Available Key-value Store
– Eventual Consistency, Consistent Hashing, Vector Clocks
– Optimized for writes

● PNUTS
– Yahoo's Massively Parallel & Distributed Database System
– Timeline Consistency
– Optimized for reads

Conclusions

● No silver bullet

● There are no simple solutions

● Design systems based on application needs

The End

Backup Slides

Vector Clocks

• Used to capture causality between different versions of the same object

• A vector clock is a list of (node, counter) pairs

• Every version of every object is associated with one vector clock

• If the counters on the first object's clock are less than or equal to all of the counters in the second clock, then the first is an ancestor of the second and can be forgotten

Vector Clock Example
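A minimal sketch of the ancestor test described above (the node names sx/sy/sz and the helper name are illustrative):

    # A clock is a dict {node: counter}; a version with clock b descends
    # from a version with clock a if every counter in a is <= its
    # counterpart in b (missing entries count as 0).

    def descends(a, b):
        return all(a[node] <= b.get(node, 0) for node in a)

    v1 = {"sx": 1}               # written at node sx
    v2 = {"sx": 2, "sy": 1}      # a later write that saw v1
    v3 = {"sx": 1, "sz": 1}      # a concurrent write that also saw v1

    print(descends(v1, v2))  # True: v1 is an ancestor of v2, can be forgotten
    print(descends(v2, v3), descends(v3, v2))  # False False: concurrent versions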

Partitioning Algorithm

• Consistent hashing:
– The output range of a hash function is treated as a fixed circular space or “ring”

• Virtual nodes:
– Each node can be responsible for more than one virtual node
– When a new node is added, it is assigned multiple positions on the ring
– Various advantages
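A small illustrative sketch of a ring with virtual nodes (class and node names invented here; md5 is used only as a convenient hash for the demo):

    import bisect, hashlib

    def ring_hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, vnodes=8):
            # Each physical node is hashed onto the ring at several positions.
            self.ring = sorted(
                (ring_hash(f"{node}#{i}"), node)
                for node in nodes for i in range(vnodes))

        def lookup(self, key):
            # A key is served by the first vnode clockwise from its hash.
            h = ring_hash(key)
            i = bisect.bisect(self.ring, (h,)) % len(self.ring)  # wrap around
            return self.ring[i][1]

    ring = Ring(["node-a", "node-b", "node-c"])
    print(ring.lookup("user:42"))
    # Adding a node only moves the keys that now land on its vnode positions;
    # everything else stays put, which is the point of the technique.
    ring2 = Ring(["node-a", "node-b", "node-c", "node-d"])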
