15-440 Distributed Systems
Lecture 1 – Introduction to Distributed Systems

TRANSCRIPT

Page 1: Lecture 1 – Introduction to Distributed Systems

15-440 Distributed Systems

Page 2: What Is A Distributed System?

"A collection of independent computers that appears to its users as a single coherent system."

• Features:
  • No shared memory – message-based communication
  • Each runs its own local OS
  • Heterogeneity
  • Expandability
• Ideal: to present a single-system image: the distributed system "looks like" a single computer rather than a collection of separate computers.

Page 3: Definition of a Distributed System

Figure 1-1. A distributed system organized as middleware. The middleware layer runs on all machines and offers a uniform interface to the system.

Page 4: Distributed Systems: Goals

• Resource Availability: remote access to resources
• Distribution Transparency: single system image
  • Access, Location, Migration, Replication, Failure, …
• Openness: services according to standards (RPC)
• Scalability: size, geographic, admin domains, …
• Examples of distributed systems:
  • Web search on Google
  • DNS: decentralized, scalable, robust to failures, …
  • …

Page 5: Lecture 2 & 3 – 15-441 in 2 Days

15-440 Distributed Systems

Page 6: Packet Switching – Statistical Multiplexing

• Switches arbitrate between inputs
• Can send from any input that's ready
• Links are never idle when there is traffic to send (efficiency!)

Page 7: Model of a communication channel

• Latency - how long does it take for the first bit to reach destination

• Capacity - how many bits/sec can we push through? (often termed “bandwidth”)

• Jitter - how much variation in latency?

• Loss / Reliability - can the channel drop packets?

• Reordering
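
A back-of-the-envelope example of how these parameters combine (numbers invented, not from the slides): the time to deliver a message of size S over a channel with one-way latency L and capacity C is roughly L + S/C. For a 1 MB message with L = 50 ms and C = 10 Mbit/s: 0.05 s + (8 x 10^6 bits / 10^7 bits/s) = 0.05 + 0.8 = 0.85 s. Latency dominates for small messages; capacity dominates for bulk transfers.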


Page 8: Packet Switching

• Source sends information as self-contained packets that have an address.
  • Source may have to break up a single message into multiple packets.
• Each packet travels independently to the destination host.
  • Switches use the address in the packet to determine how to forward the packets.
  • Store and forward.
• Analogy: a letter in surface mail.

Page 9: Internet

• An inter-net: a network of networks.
  • Networks are connected using routers that support communication in a hierarchical fashion.
  • Often need other special devices at the boundaries for security, accounting, …
• The Internet: the interconnected set of networks of the Internet Service Providers (ISPs).
  • About 17,000 different networks make up the Internet.

Page 10: Network Service Model

• What is the service model for an inter-network?
  • Defines what promises the network gives for any transmission.
  • Defines what type of failures to expect.
• Ethernet/Internet: best-effort – packets can get lost, etc.

Page 11: Possible Failure Models

• Fail-stop: when something goes wrong, the process stops / crashes / etc.
• Fail-slow or fail-stutter: performance may vary on failures as well.
• Byzantine: anything that can go wrong, will.
  • Including malicious entities taking over your computers and making them do whatever they want.
• These models are useful for proving things; the real world typically has a bit of everything.
• Deciding which model to use is important!

Page 12: What is Layering?

• Modular approach to network functionality
• Example stack, bottom to top:
  • Link hardware
  • Host-to-host connectivity
  • Application-to-application channels
  • Application

Page 13: IP Layering

• Relatively simple

[Figure: the five-layer stack – Application, Transport, Network, Link, Physical – shown across a Host, a Bridge/Switch (Link layer and below), a Router/Gateway (Network layer and below), and another Host.]

Page 14: Protocol Demultiplexing

• Multiple choices at each layer (sketched in code below)

[Figure: demultiplexing keys at each layer – the link-layer Type Field selects the network protocol (IP, IPX, …); the IP Protocol Field selects the transport (TCP, UDP); the Port Number selects the application (FTP, HTTP, TFTP, NV); beneath IP sit multiple networks NET1, NET2, … NETn.]
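
As a concrete (hypothetical) illustration of demultiplexing, the Go sketch below walks a received frame down the same chain of keys the figure shows – type field, then protocol field, then port. It is a toy dispatcher with invented names, not real network-stack code.

    package main

    import "fmt"

    // Toy frame carrying the three demux keys from the slide.
    type Frame struct {
        EtherType uint16 // link-layer type field (e.g., 0x0800 = IPv4)
        Protocol  uint8  // IP protocol field (e.g., 6 = TCP, 17 = UDP)
        DstPort   uint16 // transport port number (e.g., 80 = HTTP)
    }

    // demux dispatches one layer at a time, exactly like the layered figure.
    func demux(f Frame) string {
        if f.EtherType != 0x0800 { // type field selects network protocol
            return "non-IP (IPX, ...)"
        }
        switch f.Protocol { // protocol field selects transport
        case 6:
            switch f.DstPort { // port number selects application
            case 21:
                return "TCP -> FTP"
            case 80:
                return "TCP -> HTTP"
            default:
                return "TCP -> unknown app"
            }
        case 17:
            return "UDP (e.g., TFTP, NV)"
        default:
            return "unknown transport"
        }
    }

    func main() {
        fmt.Println(demux(Frame{EtherType: 0x0800, Protocol: 6, DstPort: 80})) // TCP -> HTTP
    }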

Page 15: Goals [Clark88]

0. Connect existing networks: initially ARPANET and the ARPA packet radio network.
1. Survivability: ensure communication service even in the presence of network and router failures.
2. Support multiple types of services.
3. Must accommodate a variety of networks.
4. Allow distributed management.
5. Allow host attachment with a low level of effort.
6. Be cost effective.
7. Allow resource accountability.

Page 16: Goal 1: Survivability

• If the network is disrupted and reconfigured…
  • Communicating entities should not care!
  • No higher-level state reconfiguration.
• How to achieve such reliability? Where can communication state be stored?

                        Network           Host
    Failure handling    Replication       "Fate sharing"
    Net engineering     Tough             Simple
    Switches            Maintain state    Stateless
    Host trust          Less              More

Page 17: CIDR IP Address Allocation

• The provider is given 201.10.0.0/21 and sub-allocates it as (arithmetic check below):
  • 201.10.0.0/22
  • 201.10.4.0/24
  • 201.10.5.0/24
  • 201.10.6.0/23
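
A quick sanity check of this allocation (arithmetic added here, not on the slide): a /21 covers 2^(32-21) = 2048 addresses. The four sub-blocks cover 2^10 + 2^8 + 2^8 + 2^9 = 1024 + 256 + 256 + 512 = 2048 addresses, so they exactly partition the provider's /21.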

Page 18: Ethernet Frame Structure (cont.)

• Addresses: 6 bytes
  • Each adapter is given a globally unique address at manufacturing time.
  • Address space is allocated to manufacturers:
    • 24 bits identify the manufacturer
    • E.g., 0:0:15:* → 3Com adapter
  • Frame is received by all adapters on a LAN and dropped if the address does not match.
• Special addresses:
  • Broadcast – FF:FF:FF:FF:FF:FF is "everybody"
  • A range of addresses is allocated to multicast:
    • Adapter maintains a list of multicast groups the node is interested in.

Page 19: End-to-End Argument

• Deals with where to place functionality:
  • Inside the network (in switching elements)
  • At the edges
• Argument: if you have to implement a function end-to-end anyway (e.g., because it requires the knowledge and help of the end-point host or application), don't implement it inside the communication system – unless there's a compelling performance enhancement.
• Key motivation for the split of functionality between TCP, UDP, and IP.

Further Reading: "End-to-End Arguments in System Design." Saltzer, Reed, and Clark.

Page 20: User Datagram Protocol (UDP): An Analogy

UDP:
• Single socket to receive messages
• No guarantee of delivery
• Not necessarily in-order delivery
• Datagram – independent packets
• Must address each packet

Postal Mail:
• Single mailbox to receive letters
• Unreliable
• Not necessarily in-order delivery
• Letters sent independently
• Must address each letter

Example UDP applications: multimedia, voice over IP. (A minimal Go sender is sketched below.)
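
To make the "single socket, independent datagrams" point concrete, here is a minimal Go sketch of a UDP sender (Go is the course language; the peer address and messages are made up). Each WriteTo is a self-contained, unreliable datagram – nothing guarantees delivery or ordering.

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        // One unconnected socket can send to any address (like one mailbox).
        conn, err := net.ListenPacket("udp", ":0")
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        dst, _ := net.ResolveUDPAddr("udp", "127.0.0.1:9999") // hypothetical peer
        for i := 0; i < 3; i++ {
            // Each datagram is addressed individually and may be lost or reordered.
            msg := fmt.Sprintf("datagram %d", i)
            if _, err := conn.WriteTo([]byte(msg), dst); err != nil {
                fmt.Println("send failed:", err)
            }
        }
    }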

Page 21: Transmission Control Protocol (TCP): An Analogy

TCP:
• Reliable – guaranteed delivery
• Byte stream – in-order delivery
• Connection-oriented – single socket per connection
• Setup connection followed by data transfer

Telephone Call:
• Guaranteed delivery
• In-order delivery
• Connection-oriented
• Setup connection followed by conversation

Example TCP applications: Web, Email, Telnet.

Page 22: Lecture 5 – Classical Synchronization

15-440 Distributed Systems

Page 23: Classic Synchronization Primitives

• Basics of concurrency:
  • Correctness (achieves mutual exclusion, no deadlock, no livelock)
  • Efficiency: no spinlocks or wasted resources
  • Fairness
• Synchronization mechanisms (see the Go sketch below):
  • Semaphores (P() and V() operations)
  • Mutex (binary semaphore)
  • Condition variables (allow a thread to sleep)
    • Must be accompanied by a mutex
    • Wait and Signal operations
• Work through examples again + Go primitives
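
A minimal sketch of these primitives in Go (the slide's "GO primitives"), using the standard library's sync.Mutex and sync.Cond. The one-slot buffer framing is my own illustrative choice, not from the slides.

    package main

    import (
        "fmt"
        "sync"
    )

    // A tiny one-slot buffer guarded by a mutex and a condition variable.
    type Slot struct {
        mu    sync.Mutex
        cond  *sync.Cond
        value int
        full  bool
    }

    func NewSlot() *Slot {
        s := &Slot{}
        s.cond = sync.NewCond(&s.mu) // condition variable tied to the mutex
        return s
    }

    func (s *Slot) Put(v int) {
        s.mu.Lock()
        for s.full { // recheck the condition after every wakeup
            s.cond.Wait() // atomically releases mu and sleeps
        }
        s.value, s.full = v, true
        s.cond.Broadcast() // wake any waiting Get
        s.mu.Unlock()
    }

    func (s *Slot) Get() int {
        s.mu.Lock()
        for !s.full {
            s.cond.Wait()
        }
        s.full = false
        v := s.value
        s.cond.Broadcast() // wake any waiting Put
        s.mu.Unlock()
        return v
    }

    func main() {
        s := NewSlot()
        go s.Put(42)
        fmt.Println(s.Get()) // 42
    }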

Page 24: Lecture 6 – RPC

15-440 Distributed Systems

Page 25: RPC Goals

• Ease of programming
• Hide complexity
• Automates the task of implementing distributed computation
• Familiar model for programmers (just make a function call)

Historical note: it seems obvious in retrospect, but RPC was only invented in the '80s. See Birrell & Nelson, "Implementing Remote Procedure Call", or Bruce Nelson's Ph.D. thesis, Carnegie Mellon University: Remote Procedure Call, 1981. :)

Page 26: Passing Value Parameters (1)

• The steps involved in doing a remote computation through RPC.

Page 27: Stubs: Obtaining Transparency

• The compiler generates stubs for a procedure from the API, one on the client and one on the server (a minimal Go example follows below).
• Client stub:
  • Marshals arguments into a machine-independent format
  • Sends request to server
  • Waits for response
  • Unmarshals result and returns to caller
• Server stub:
  • Unmarshals arguments and builds a stack frame
  • Calls the procedure
  • Marshals results and sends the reply
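
As a concrete illustration of stub-style transparency, Go's standard net/rpc package plays both stub roles: the server registers a method, and the client's Call marshals arguments, sends the request, waits, and unmarshals the reply. The Arith/Multiply service here is invented for the example.

    package main

    import (
        "fmt"
        "net"
        "net/rpc"
    )

    type Args struct{ A, B int }

    // Arith is the server-side object; net/rpc acts as its server stub.
    type Arith struct{}

    func (Arith) Multiply(args Args, reply *int) error {
        *reply = args.A * args.B
        return nil
    }

    func main() {
        rpc.Register(Arith{})
        ln, err := net.Listen("tcp", "127.0.0.1:0")
        if err != nil {
            panic(err)
        }
        go rpc.Accept(ln) // serve requests in the background

        // The client stub: Call marshals args, sends, waits, unmarshals.
        client, err := rpc.Dial("tcp", ln.Addr().String())
        if err != nil {
            panic(err)
        }
        var product int
        if err := client.Call("Arith.Multiply", Args{7, 6}, &product); err != nil {
            panic(err)
        }
        fmt.Println(product) // 42
    }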

Page 28: Real Solution: Break Transparency

• Possible semantics for RPC:
  • Exactly-once: impossible in practice
  • At-least-once: only for idempotent operations
  • At-most-once: zero, don't know, or once
  • Zero-or-once: transactional semantics

Page 29: Asynchronous RPC (3)

• A client and server interacting through two asynchronous RPCs.

Page 30: Important Lessons

• Procedure calls:
  • Simple way to pass control and data
  • Elegant, transparent way to distribute an application
  • Not the only way…
• Hard to provide true transparency:
  • Failures
  • Performance
  • Memory access
  • Etc.
• How to deal with a hard problem: give up and let the programmer deal with it.
  • "Worse is better"

Page 31: Lectures 7 & 8 – Distributed File Systems

15-440 Distributed Systems

Page 32: Why DFSs?

• Why distributed file systems?
  • Data sharing among multiple users
  • User mobility
  • Location transparency
  • Backups and centralized management
• Examples: NFS (v1–v4), AFS, CODA, LBFS
• Idea: provide file-system interfaces to remote file systems
  • Challenges: heterogeneity, scale, security, concurrency, …
  • Non-challenges: AFS was meant for a campus community
• Virtual File Systems: pluggable file systems
• Use RPCs

Page 33: NFS vs. AFS Goals

• AFS: global distributed file system
  • "One AFS", like "one Internet"
  • LARGE numbers of clients and servers (1000s cache files)
  • Global namespace (organized as cells) => location transparency
  • Clients with disks => cache; write sharing rare; callbacks
  • Open-to-close consistency (session semantics)
• NFS: very popular network file system
  • NFSv4 meant for the wide area
  • Naming: per-client view (/home/yuvraj/…)
  • Cache data in memory, not on disk; write-through cache
  • Consistency model: buffer data (eventual, ~30 seconds)
  • Requires significant resources as users scale

Page 34: DFS Important Bits (1)

• Distributed file systems almost always involve a tradeoff: consistency, performance, scalability.
• We've learned a lot since NFS and AFS (and can implement faster, etc.), but the general lesson holds. Especially in the wide area.
• We'll see a related tradeoff, also involving consistency, in a while: the CAP tradeoff. Consistency, Availability, Partition-resilience.

Page 35: DFS Important Bits (2)

• Client-side caching is a fundamental technique to improve scalability and performance.
  • But it raises important questions of cache consistency.
• Timeouts and callbacks are common methods for providing (some forms of) consistency.
• AFS picked close-to-open consistency as a good balance of usability (the model seems intuitive to users), performance, etc.
  • The AFS authors argued that apps with highly concurrent, shared access, like databases, needed a different model.

Page 36: Coda Summary

• Distributed file system built for mobility
  • Disconnected operation is the key idea
• Puts scalability and availability before data consistency
  • Unlike NFS
  • Assumes that inconsistent updates are very infrequent
• Introduced disconnected operation mode, file hoarding, and the idea of "reintegration"

Page 37: Low Bandwidth File System: Key Ideas

• A network file system for slow or wide-area networks
• Exploits similarities between files (see the chunking sketch below)
  • Avoids sending data that can be found in the server's file system or the client's cache
  • Uses Rabin fingerprints on file content (file chunks)
  • Can deal with byte offsets when parts of a file change
• Also uses conventional compression and caching
• Requires 90% less bandwidth than traditional network file systems
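
A toy Go illustration of why content-defined chunking survives byte insertions where fixed-size blocks do not. Real LBFS uses Rabin fingerprints over a sliding 48-byte window; the rolling sum below is a deliberately simplified stand-in, so the boundaries here are illustrative only.

    package main

    import "fmt"

    // chunk splits data at positions where a rolling sum over the last
    // `window` bytes is divisible by `mod` - boundaries depend only on
    // local content, so an insertion early in the file does not shift
    // every later boundary (unlike fixed-size blocks).
    func chunk(data []byte, window, mod int) [][]byte {
        var chunks [][]byte
        sum, start := 0, 0
        for i, b := range data {
            sum += int(b)
            if i >= window {
                sum -= int(data[i-window]) // slide the window
            }
            if sum%mod == 0 && i-start >= window {
                chunks = append(chunks, data[start:i+1])
                start = i + 1
            }
        }
        if start < len(data) {
            chunks = append(chunks, data[start:])
        }
        return chunks
    }

    func main() {
        data := []byte("the quick brown fox jumps over the lazy dog, twice over")
        for _, c := range chunk(data, 8, 17) {
            fmt.Printf("%q\n", c)
        }
    }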

Page 38: Lecture 9 – Time Synchronization

15-440 Distributed Systems

Page 39: Impact of Clock Synchronization

• When each machine has its own clock, an event that occurred after another event may nevertheless be assigned an earlier time.

Page 40: Clocks in a Distributed System

• Computer clocks are not generally in perfect agreement.
  • Skew: the difference between the times on two clocks (at any instant)
• Computer clocks are subject to clock drift (they count time at different rates).
  • Clock drift rate: the difference per unit of time from some ideal reference clock
  • Ordinary quartz clocks drift by about 1 sec in 11–12 days (10^-6 secs/sec).
  • High-precision quartz clocks have drift rates of about 10^-7 or 10^-8 secs/sec.

Page 41: Perfect Networks

• Messages always arrive, with propagation delay exactly d.
• Sender sends time T in a message; receiver sets its clock to T + d.
• Synchronization is exact.

Page 42: Cristian's Time Sync

[Figure: process p sends request m_r to time server S, which receives signals from a UTC source and replies with m_t containing time t.]

• A time server S receives signals from a UTC source.
• Process p requests the time in m_r and receives t in m_t from S.
• p sets its clock to t + RTT/2 (see the sketch below).
• Accuracy: ± (RTT/2 − min), where RTT (Tround) is the round-trip time recorded by p and min is an estimated minimum one-way transmission time:
  • the earliest time S could have placed t in m_t is min after p sent m_r;
  • the latest time was min before m_t arrived at p;
  • so the time by S's clock when m_t arrives is in the range [t + min, t + RTT − min].
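
A small Go sketch of the client side of Cristian's algorithm under these formulas (the getServerTime call stands in for the m_r/m_t exchange, and all numbers are invented; the point is the RTT arithmetic):

    package main

    import (
        "fmt"
        "time"
    )

    // getServerTime stands in for the round trip to time server S.
    func getServerTime() time.Time { return time.Now().Add(37 * time.Millisecond) }

    func main() {
        const min = 2 * time.Millisecond // estimated minimum one-way delay

        before := time.Now()
        t := getServerTime() // t as stamped by S
        rtt := time.Since(before)

        estimate := t.Add(rtt / 2) // set clock to t + RTT/2
        accuracy := rtt/2 - min    // error bound +/- (RTT/2 - min)
        fmt.Println("estimate:", estimate.Format(time.RFC3339Nano))
        fmt.Printf("accuracy: +/- %v\n", accuracy)
    }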

Page 43: Berkeley Algorithm

• Cristian's algorithm:
  • a single time server might fail, so they suggest the use of a group of synchronized servers
  • it does not deal with faulty servers
• Berkeley algorithm (also 1989):
  • An algorithm for internal synchronization of a group of computers.
  • A master polls to collect clock values from the others (slaves).
  • The master uses round-trip times to estimate the slaves' clock values.
  • It takes an average (eliminating any with above-average round-trip times or faulty clocks).
  • It sends the required adjustment to the slaves (better than sending the time, which depends on the round-trip time).
  • Measurements: 15 computers, clock synchronization 20–25 ms, drift rate < 2×10^-5.
• If the master fails, a new master can be elected to take over (but not in bounded time).

Page 44: NTP Protocol

• All modes use UDP.
• Each message bears timestamps of recent events:
  • Local times of Send and Receive of the previous message
  • Local time of Send of the current message
• The recipient notes the time of receipt T3 (so we have T0, T1, T2, T3; the standard offset/delay estimates follow below).

[Figure: message m goes Client→Server and m' returns Server→Client; T0 = client send, T1 = server receive, T2 = server send, T3 = client receive.]
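
From these four timestamps NTP computes its standard estimates (formulas from the NTP literature; the slide only names the timestamps): clock offset = ((T1 − T0) + (T2 − T3)) / 2 and round-trip delay = (T3 − T0) − (T2 − T1). A worked example with invented numbers, all in ms: T0 = 100, T1 = 160, T2 = 165, T3 = 115 gives offset = (60 + 50) / 2 = 55 ms and delay = 15 − 5 = 10 ms.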

Page 45: Logical Time and Logical Clocks (Lamport 1978)

• Instead of synchronizing clocks, event ordering can be used (a Go sketch follows):
  1. If two events occurred at the same process p_i (i = 1, 2, … N), then they occurred in the order observed by p_i; this is the definition of the process-local order "→_i".
  2. When a message m is sent between two processes, send(m) happens before receive(m).
  3. The happened-before relation is transitive.
• The happened-before relation is the relation of causal ordering.
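
A minimal Lamport-clock sketch in Go consistent with these rules (the API is my own framing, and it is single-threaded for clarity – a real implementation would guard the counter with a mutex):

    package main

    import "fmt"

    // LamportClock implements the two update rules: tick on every local
    // event, and on receive jump above the sender's timestamp.
    type LamportClock struct{ t uint64 }

    // Tick is called for each local event (including sends); returns the stamp.
    func (c *LamportClock) Tick() uint64 {
        c.t++
        return c.t
    }

    // Recv merges a timestamp carried on an incoming message.
    func (c *LamportClock) Recv(msgT uint64) uint64 {
        if msgT > c.t {
            c.t = msgT
        }
        c.t++
        return c.t
    }

    func main() {
        var a, b LamportClock
        s := a.Tick()          // A: send event, stamp 1
        fmt.Println(b.Recv(s)) // B: receive => max(0, 1) + 1 = 2
        fmt.Println(a.Tick())  // A: next local event, stamp 2
    }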

Page 46: Total-Order Lamport Clocks

• Many systems require a total ordering of events, not a partial ordering.
• Use Lamport's algorithm, but break ties using the process ID (worked example below):
  • L(e) = M * Li(e) + i
  • M = maximum number of processes
  • i = process ID
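
A quick worked example (numbers invented): with M = 10 processes, an event with local Lamport time Li = 5 at process i = 3 gets L = 10·5 + 3 = 53, while the same local time at process 7 gets 57. Equal Lamport times are now ordered by process ID, and events with different Lamport times keep their original order.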

Page 47: Vector Clocks

• Note that e → e' implies V(e) < V(e'). The converse is also true (comparison sketch below).
• Can you see a pair of parallel events?
  • c || e (parallel), because neither V(c) <= V(e) nor V(e) <= V(c).
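
A sketch of the comparison rule in Go (the vector contents and the Leq helper are my own framing): V(a) <= V(b) iff every component is <=, and two events are concurrent when neither vector dominates the other.

    package main

    import "fmt"

    type VC []int // one counter per process

    // Leq reports whether v <= w componentwise.
    func Leq(v, w VC) bool {
        for i := range v {
            if v[i] > w[i] {
                return false
            }
        }
        return true
    }

    // Concurrent: neither event happened-before the other.
    func Concurrent(v, w VC) bool { return !Leq(v, w) && !Leq(w, v) }

    func main() {
        c := VC{2, 0, 0} // e.g., event c at process 1
        e := VC{1, 0, 1} // e.g., event e at process 3
        fmt.Println(Concurrent(c, e)) // true: c || e
    }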

Page 48: Lecture 10 – Mutual Exclusion

15-440 Distributed Systems

Page 49: Example: Totally-Ordered Multicasting

• Two replicas of an account start at $1,000. A San Francisco customer adds $100 while the NY bank adds 1% interest.
  • If the updates apply in different orders, San Francisco computes (1000 + 100) × 1.01 = $1,111 while New York computes 1000 × 1.01 + 100 = $1,110.
• Updating a replicated database this way leaves it in an inconsistent state.
• Can use Lamport clocks to totally order the updates.

Page 50: Mutual Exclusion: A Centralized Algorithm (1)

@ Client, Acquire:
    Send (Request, i) to coordinator
    Wait for reply

@ Server:
    while true:
        m = Receive()
        if m == (Request, i):
            if Available():
                Send (Grant) to i

Page 51: Distributed Algorithm Strawman

• Assume that there are n coordinators.
• Access requires a majority vote from m > n/2 coordinators.
• A coordinator always responds immediately to a request with GRANT or DENY.
• Node failures are still a problem.
• Large numbers of nodes requesting access can affect availability.

Page 52: A Distributed Algorithm 2 (Lamport Mutual Exclusion)

• Every process maintains a queue of pending requests for entering the critical section, in order. The queues are ordered by virtual timestamps derived from Lamport timestamps:
  • For any events e, e' such that e → e' (causal ordering), T(e) < T(e').
  • For any distinct events e, e', T(e) != T(e').
• When node i wants to enter the C.S., it sends a time-stamped request to all other nodes (including itself).
  • Wait for replies from all other nodes.
  • If its own request is at the head of its queue and all replies have been received, enter the C.S.
  • Upon exiting the C.S., remove its request from the queue and send a release message to every process.

Page 53: A Distributed Algorithm 3 (Ricart & Agrawala)

• Also relies on Lamport totally ordered clocks.
• When node i wants to enter the C.S., it sends a time-stamped request to all other nodes. These other nodes reply (eventually). When i receives n−1 replies, it can enter the C.S.
• Trick: a node j with an earlier request doesn't reply to i until after it has completed its own C.S.

Page 54: A Token Ring Algorithm

• Organize the processes involved into a logical ring.
• One token at any time, passed from node to node along the ring.

Page 55: A Token Ring Algorithm

• Correctness:
  • Clearly safe: only one process can hold the token.
• Fairness:
  • The token will pass around the ring at most once before a requester gets access.
• Performance:
  • Each cycle requires between 1 and ∞ messages.
  • Latency of the protocol is between 0 and n−1.
• Issues:
  • Lost token

Page 56: Summary

• Lamport's algorithm demonstrates how distributed processes can maintain consistent replicas of a data structure (the priority queue).
• Ricart & Agrawala's algorithm demonstrates the utility of logical clocks.
• Centralized and ring-based algorithms have much lower message counts.
• None of these algorithms can tolerate failed processes or dropped messages.

Page 57: Lecture 11 – Concurrency, Transactions

15-440 Distributed Systems

Page 58: Distributed Concurrency Management

• Multiple objects, multiple servers; ignore failure.
• Single server: transactions (reads/writes to global state).
  • ACID: Atomicity, Consistency, Isolation, Durability – learn what these mean in the context of transactions.
  • E.g., a banking app => ACID is violated if not careful.
  • Solutions: 2-phase locking (general, strict, strong strict).
  • Dealing with deadlocks => build a "waits-for" graph.
• Transactions: 2 phases (prepare, commit/abort).
  • Preparation: generate lock set "L", updates "U".
  • COMMIT (update global state) or ABORT (leave state as is).
• Example using a banking app.

Page 59: Distributed Transactions – 2PC

• Similar idea as before, but state is spread across servers (maybe even a WAN).
• Want to enable single transactions to read and update global state while maintaining ACID properties.
• Overall idea:
  • A client initiates the transaction and makes use of a "co-ordinator".
  • All other relevant servers operate as "participants".
  • The co-ordinator assigns a unique transaction ID (TID).
• 2-Phase Commit (message trace below):
  • Prepare & Vote phase (participants determine their state, talk to the co-ordinator).
  • Commit/Abort phase (co-ordinator broadcasts the outcome to the participants).
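
A minimal message trace of the two phases (my own schematic; the TID and participant names are invented):

    Coordinator -> A, B: PREPARE(tid=7)
    A -> Coordinator:    VOTE-COMMIT(7)   (A has durably logged its updates)
    B -> Coordinator:    VOTE-COMMIT(7)
    Coordinator -> A, B: GLOBAL-COMMIT(7)
    A, B -> Coordinator: ACK(7)

If any participant votes abort (or times out), the co-ordinator broadcasts GLOBAL-ABORT instead, and every participant rolls back.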

Page 60: Lecture 12 – Logging and Crash Recovery

15-440 Distributed Systems

Page 61: Summary – Fault Tolerance

• Real systems are often unreliable.
• Introduced basic concepts for fault-tolerant systems, including redundancy, process resilience, RPC.
• Fault tolerance – backward recovery using checkpointing, both independent and coordinated.
• Fault tolerance – recovery using write-ahead logging, which balances the overhead of checkpointing against the ability to recover to a consistent state.

Page 62: Dependability Concepts

• Availability – the system is ready to be used immediately.
• Reliability – the system runs continuously without failure.
• Safety – if a system fails, nothing catastrophic will happen (e.g., process control systems).
• Maintainability – when a system fails, it can be repaired easily and quickly (sometimes without its users noticing the failure); also called recovery.
• What's a failure? A system that cannot meet its goals => faults.
• Faults can be: transient, intermittent, permanent.

Page 63: Masking Failures by Redundancy

• Strategy: hide the occurrence of failure from other processes using redundancy.

1. Information redundancy – add extra bits to allow for error detection/recovery (e.g., Hamming codes and the like).
2. Time redundancy – perform an operation and, if need be, perform it again. Think about how transactions work (BEGIN/END/COMMIT/ABORT).
3. Physical redundancy – add extra (duplicate) hardware and/or software to the system.

Page 64: Recovery Strategies

• When a failure occurs, we need to bring the system into an error-free state (recovery). This is fundamental to fault tolerance.

1. Backward recovery: return the system to some previous correct state (using checkpoints), then continue executing. Can be expensive, but still used.
2. Forward recovery: bring the system into a correct new state from which it can continue to execute. Need to know the potential errors up front!

Page 65: Independent Checkpointing

• The domino effect – cascaded rollback.
• P2 crashes and rolls back, but the two checkpoints are inconsistent (P2's shows message m received, but P1's does not show m sent).
• Solution? Coordinated checkpointing.

Page 66: Shadow Paging vs. WAL

• Shadow pages:
  • Provide atomicity and durability; "page" = unit of storage.
  • Idea: when writing a page, make a "shadow" copy.
    • No references from other pages – edit easily!
  • ABORT: discard the shadow page.
  • COMMIT: make the shadow page "real". Update pointers to data on this page from other pages (recursively). Can be done atomically.

Page 67: Shadow Paging vs. WAL

• Write-ahead logging (sketch below):
  • Provides atomicity and durability.
  • Idea: create a log recording every update to the database.
  • Updates are considered reliable when stored on disk.
  • Updated versions are kept in memory (page cache).
  • Logs typically store both REDO and UNDO operations.
  • After a crash, recover by replaying log entries to reconstruct the correct state.
  • 3 passes: analysis pass, recovery (redo) pass, undo pass.
  • WAL is more common: fewer disk operations, and transactions are considered committed once the log is written.
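
A minimal sketch of the WAL discipline in Go (the file name and record format are invented): the invariant is simply that the log record reaches stable storage before the in-place update happens.

    package main

    import (
        "fmt"
        "os"
    )

    // logUpdate appends a record carrying both UNDO and REDO values and
    // forces it to disk *before* the caller mutates the real data - the
    // write-ahead invariant.
    func logUpdate(log *os.File, key string, oldV, newV int) error {
        rec := fmt.Sprintf("UPDATE %s UNDO=%d REDO=%d\n", key, oldV, newV)
        if _, err := log.WriteString(rec); err != nil {
            return err
        }
        return log.Sync() // record is durable; the update may now proceed
    }

    func main() {
        log, err := os.OpenFile("wal.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
        if err != nil {
            panic(err)
        }
        defer log.Close()

        db := map[string]int{"balance": 100}
        if err := logUpdate(log, "balance", db["balance"], 150); err != nil {
            panic(err)
        }
        db["balance"] = 150 // applied only after the log record is durable
        fmt.Println(db["balance"])
    }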

Page 68: Lecture 13 – Errors and Failures

15-440 Distributed Systems

Page 69: Measuring Availability

• Mean time to failure (MTTF)
• Mean time to repair (MTTR)
• MTBF = MTTF + MTTR
• Availability = MTTF / (MTTF + MTTR)
• Suppose the OS crashes once per month and takes 10 minutes to reboot:
  • MTTF = 720 hours = 43,200 minutes; MTTR = 10 minutes
  • Availability = 43,200 / 43,210 ≈ 0.9998 (~"3 nines")

Page 70: Disk Failure Conditional Probability Distribution – Bathtub Curve

[Figure: bathtub curve of failure rate over time – high "infant mortality" early, a flat expected-operating-lifetime region at roughly 1 / (reported MTTF), then a rising "burn out" region at end of life.]

Page 71: Parity Checking

• Single-bit parity: detects single-bit errors (Go sketch below).
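
A tiny sketch of single-bit parity in Go (even parity chosen arbitrarily for the example): the sender computes the XOR of all data bits; the receiver recomputes it and flags a mismatch. One flipped bit is detected; two flips cancel out, which is why this only detects single-bit errors.

    package main

    import "fmt"

    // parity returns the XOR of all bits in data (0 = even number of 1s).
    func parity(data []byte) byte {
        var p byte
        for _, b := range data {
            p ^= b
        }
        // fold the accumulated byte down to a single bit
        p ^= p >> 4
        p ^= p >> 2
        p ^= p >> 1
        return p & 1
    }

    func main() {
        msg := []byte("hello")
        sent := parity(msg)

        msg[0] ^= 0x01 // simulate a single-bit error in transit
        fmt.Println("error detected:", parity(msg) != sent) // true
    }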

Page 72: Block Error Detection

• EDC = Error Detection and Correction bits (redundancy).
• D = data protected by error checking; may include header fields.
• Error detection is not 100% reliable!
  • The protocol may miss some errors, but rarely.
  • A larger EDC field yields better detection and correction.

Page 73: Error Detection – Cyclic Redundancy Check (CRC)

• Polynomial code (example below):
  • Treat packet bits as coefficients of an n-bit polynomial.
  • Choose an (r+1)-bit generator polynomial (well known – chosen in advance).
  • Add r bits to the packet such that the message is divisible by the generator polynomial.
• Better loss detection properties than checksums.
• Cyclic codes have favorable properties: they are well suited for detecting burst errors.
  • Therefore, used on networks and hard drives.
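
Go's standard library ships a table-driven CRC implementation, so a sketch of the "recompute and compare" check is short (the payload is invented; crc32.ChecksumIEEE uses the standard IEEE 802.3 generator polynomial, i.e. r = 32 redundant bits):

    package main

    import (
        "fmt"
        "hash/crc32"
    )

    func main() {
        packet := []byte("some packet payload")
        crc := crc32.ChecksumIEEE(packet) // sender appends these 32 bits

        // Receiver recomputes the CRC and compares.
        fmt.Println("ok:", crc32.ChecksumIEEE(packet) == crc) // true

        packet[3] ^= 0xFF // a burst of flipped bits in transit
        fmt.Println("ok:", crc32.ChecksumIEEE(packet) == crc) // false
    }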

Page 74: Error Recovery

• Two forms of error recovery:
  • Redundancy:
    • Error Correcting Codes (ECC)
    • Replication/voting
  • Retry
• ECC:
  • Keep encoded redundant data to help repair losses.
  • Forward Error Correction (FEC) – send the redundant bits in advance.
    • Reduces latency of recovery at the cost of bandwidth.

Page 75: Summary

• Definitions of MTTF/MTBF/MTTR: understanding availability in systems.
• Failure detection and fault-masking techniques.
• Engineering tradeoff: cost of failures vs. cost of failure masking.
  • At what level of the system should failures be masked?
• Leading into replication as a general strategy for fault tolerance.
• Thought to leave you with: what if you have to survive the failure of entire computers? Of a rack? Of a datacenter?

Page 76: Lecture 14 – RAID

Thanks to Greg Ganger and Remzi Arpaci-Dusseau for slides.

15-440 Distributed Systems

Page 77: Just a Bunch Of Disks (JBOD)

• Yes, it's a goofy name.
• Industry really does sell "JBOD enclosures".

Page 78: Disk Striping

• Interleave data across multiple disks.
  • Large file streaming can enjoy parallel transfers.
  • High-throughput request loads can enjoy thorough load balancing.
    • If blocks of hot files are equally likely on all disks (really?)

Page 79: Redundancy via Replicas

• Two (or more) copies:
  • mirroring, shadowing, duplexing, etc.
• Write both, read either.

Page 80: Simplest Approach: Parity Disk

• Capacity: one extra disk needed per stripe.

Page 81: Updating and Using the Parity
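
The slide's figure is not reproduced here, but the underlying arithmetic is the standard XOR identity (stated for completeness, not taken from the slide text): the parity block is P = D1 XOR D2 XOR … XOR Dn, so a small write can update parity without reading the whole stripe, P_new = P_old XOR D_old XOR D_new, and a lost block is rebuilt as D1 = P XOR D2 XOR … XOR Dn.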

Page 82: Solution: Striping the Parity

• Removes the parity-disk bottleneck.

Page 83: RAID Taxonomy

• Redundant Array of Inexpensive (now "Independent") Disks
  • Coined by UC Berkeley researchers in the late '80s (Garth Gibson et al.)
• RAID 0 – coarse-grained striping with no redundancy
• RAID 1 – mirroring of independent disks
• RAID 2 – fine-grained data striping plus Hamming-code disks
  • Uses Hamming codes to detect and correct multiple errors
  • Originally implemented when drives didn't always detect errors
  • Not used in real systems
• RAID 3 – fine-grained data striping plus parity disk
• RAID 4 – coarse-grained data striping plus parity disk
• RAID 5 – coarse-grained data striping plus striped parity

Page 84: How Often Are Failures?

• MTBF (Mean Time Between Failures)
  • MTBF_disk ~ 1,200,000 hours (~136 years, <1% per year)
• MTBF of a multi-disk system = mean time to first disk failure
  • which is MTBF_disk / (number of disks)
  • For a striped array of 200 drives:
    • MTBF_array = 136 years / 200 drives ≈ 0.68 years

Page 85: Rebuild: Restoring Redundancy After Failure

• After a drive failure:
  • data is still available for access
  • but a second failure is BAD
• So, we should reconstruct the data onto a new drive:
  • on-line spares are common features of high-end disk arrays
    • reduce the time to start the rebuild
  • must balance rebuild rate against foreground performance impact
    • a performance vs. reliability trade-off
• How data is reconstructed:
  • Mirroring: just read the good copy
  • Parity: read all remaining drives (including parity) and compute

Page 86: Conclusions

• RAID turns multiple disks into a larger, faster, more reliable disk.
• RAID-0: Striping
  • Good when performance and capacity really matter, but reliability doesn't.
• RAID-1: Mirroring
  • Good when reliability and write performance matter, but capacity (cost) doesn't.
• RAID-5: Rotating parity
  • Good when capacity and cost matter or the workload is read-mostly.
  • Good compromise choice.