distributed transaction

1

DISTRIBUTED TRANSACTION

FASILKOM

UNIVERSITAS INDONESIA

2

What is a Transaction?

An atomic unit of database access, which is either completely executed or not executed at all.

It consists of an application-specified sequence of operations, beginning with a begin_transaction primitive and ending with either commit or abort.

3

E.g.

Transfer $200 from account A in London to account B in Depok:

begin_transaction
  amntA = lookup amount in account A
  amntB = lookup amount in account B
  if (amntA < $200) abort
  set account A = amntA - $200
  set account B = amntB + $200
commit
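The transfer can be sketched on a single local database. This is a minimal illustration, assuming Python's standard sqlite3 module, an in-memory database, and made-up starting balances (500 and 100). It shows only the begin/commit/abort structure, not distribution across London and Depok.

```python
import sqlite3


class InsufficientFunds(Exception):
    """Raised by the application to abort the transaction."""


def make_bank():
    # Hypothetical setup: one table holding both accounts.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 500), ("B", 100)])
    conn.commit()
    return conn


def lookup(conn, name):
    row = conn.execute("SELECT balance FROM account WHERE name = ?", (name,)).fetchone()
    return row[0]


def transfer(conn, src, dst, amount):
    """Return True if the transfer committed, False if it aborted."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            amnt_a = lookup(conn, src)
            amnt_b = lookup(conn, dst)
            if amnt_a < amount:
                raise InsufficientFunds(src)  # abort
            conn.execute("UPDATE account SET balance = ? WHERE name = ?", (amnt_a - amount, src))
            conn.execute("UPDATE account SET balance = ? WHERE name = ?", (amnt_b + amount, dst))
        return True
    except InsufficientFunds:
        return False
```

Calling transfer(conn, "A", "B", 200) on the sample data moves the money; a request larger than the balance of A leaves both accounts untouched.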

4

Transaction Properties

Four main properties, the ACID properties:

– Atomicity: A transaction must be all or nothing.
– Consistency: A transaction takes the system from one consistent state to another consistent state.
– Isolation: The results of an incomplete transaction are not allowed to be revealed to other transactions.
– Durability: The results of a committed transaction will never be lost, independent of subsequent failures.

Atomicity & durability -> failure tolerance

5

Failure Tolerance

Atomicity & durability -> failure tolerance

Types of failures:

• Transaction-local failures detected by the application (e.g. insufficient funds)
• Transaction-local failures not detected by the application (e.g. divide by zero)
• System failures affecting volatile storage (e.g. CPU failure)
• Media failures (e.g. HD crash)

What is volatile storage? What is stable storage?

6

Recovery

Based on redundancy. For example:

1. Periodically archive the database.
2. Every time a change is made, record the old and new values to a log.
3. If a failure occurs:
   • If there is no damage to the physical database, undo all ‘unreliable’ changes.
   • If the database is physically damaged, restore from the archive and redo changes.

7

Logging (1)

Database vs transaction log.

For each change (including begin transaction, commit, and abort), write a log record with:

• Transaction ID (TID)
• Record ID
• Type of action
• Old value of record
• New value of record
• Other info, e.g. pointer to previous log record of this transaction.

8

Logging (2)

After a failure we need to undo or redo changes.

Undo and redo must be idempotent as there may be a failure whilst they are executing.
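A minimal sketch of why logging old and new values makes undo and redo idempotent. The LogRecord fields follow the log record described on the Logging (1) slide; the dictionary standing in for the database and the sample values are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    tid: str           # transaction ID
    record_id: str     # which record was changed
    action: str        # type of action, e.g. "update"
    old_value: object  # value before the change (used by undo)
    new_value: object  # value after the change (used by redo)


def undo(db, rec):
    # Restores the old value; repeating this has no further effect.
    db[rec.record_id] = rec.old_value


def redo(db, rec):
    # Installs the new value; repeating this has no further effect.
    db[rec.record_id] = rec.new_value
```

Because each operation assigns a logged value instead of applying a delta such as "subtract 200", recovery can crash halfway and simply run again.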

9

Log Write-ahead Protocol (1)

Before performing any update, at least the undo portion of the log record must be written to stable storage.

Before committing a transaction, all log records must have been fully recorded on stable storage. The commit record is written after these.

10

Log Write-ahead Protocol (2)

Reason for first rule:

– If we change log before database:
  • log -- change -- crash
  • log -- crash
– If we change log after database:
  • change -- log -- crash
  • change -- crash: can’t undo

11

Checkpointing (1)

How does the recovery manager know which transactions to undo and which to redo after a failure?

Naive approach:

– Examine the entire log from the start. Look for begin transaction records:
  • if a corresponding commit record exists, redo;
  • if there’s an abort, do nothing; and
  • if neither, undo.

12

Checkpointing (2)

Alternative:

– Every so often:
  1) Force all log buffers to disk.
  2) Write a checkpoint record to disk containing:
     a) A list of all active transactions
     b) The most recent log record for each transaction in a)
  3) Force all database buffers to disk - disk is now totally up-to-date.
  4) Write the address of the checkpoint record to a fixed ‘restart location’ (this write had better be atomic).

13

Checkpointing (3)

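There are 5 categories of transaction:

[Figure: timeline of five transactions, T1 to T5, drawn against the time of the checkpoint and the time of the crash. Recovery action labels shown in the figure: Leave, Redo, Undo, Undo, Redo.]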

14

Recovery (1)

Look for the most recent checkpoint record. For all transactions active at the checkpoint we must:

– undo all those still active at failure
– redo all others

15

Recovery (2)

Have 2 lists: undo and redo.

Initially, undo contains all TIDs in the checkpoint record & redo is empty.

3 passes through the log:

– Forwards from checkpoint to end:
  • If we find ‘begin_transaction’, add the TID to the undo list.
  • If we find ‘commit’, transfer the TID from the undo list to the redo list.
  • If we find ‘abort’, remove the TID from the undo list.
– Backwards from end to checkpoint: undo.
– Forwards from checkpoint to end: redo.
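The first pass can be sketched as follows. This is an illustration only: the log is modelled as a list of hypothetical (kind, TID) pairs written after the checkpoint, and the backward undo pass and forward redo pass are left out.

```python
def classify(checkpoint_active, log_after_checkpoint):
    """Forward pass from the checkpoint: build the undo and redo lists."""
    undo = list(checkpoint_active)  # initially all TIDs in the checkpoint record
    redo = []
    for kind, tid in log_after_checkpoint:
        if kind == "begin_transaction":
            undo.append(tid)
        elif kind == "commit":
            undo.remove(tid)
            redo.append(tid)
        elif kind == "abort":
            undo.remove(tid)
    return undo, redo


# Hypothetical example: T2 and T3 were active at the checkpoint.
log = [
    ("commit", "T2"),
    ("begin_transaction", "T4"),
    ("begin_transaction", "T5"),
    ("commit", "T4"),
    # crash happens here
]
undo_list, redo_list = classify(["T2", "T3"], log)
# undo_list is ["T3", "T5"], redo_list is ["T2", "T4"]
```

A transaction that finished before the checkpoint never appears in either list, so it is left alone.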

16

Commit Protocols

Assume a set of cooperating managers which deal with parts of a transaction.

For atomicity we must ensure that:

– At each site, either all actions or none are performed.
– All sites take the same decision on whether to commit or abort.

17

Two Phase Commit (2PC) Protocol - 1

One node, the coordinator, has a special role; the others are participants.

The coordinator initiates the 2PC protocol.

If any participant cannot commit, then all sites must abort.

18

2PC – 2

Phase I:
– Reach a common decision on whether to abort or commit.

Phase II:
– Implement the decision at all sites.

19

2PC - 3

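[Figure: 2PC state diagrams for the coordinator and for a participant. Transition labels read "input / output".]

Coordinator:
– I -> U: -/PM
– U -> A: tm/ACM or AAM/ACM
– U -> C: RM/CCM

Participant:
– I -> R: PM/RM
– I -> A: PM/AAM or ua/-
– R -> C: CCM/-
– R -> A: ACM/-

States:
I = Initial state
U = Undecided
R = Ready to Commit
A = Abort
C = Commit

Messages:
PM = Prepare Message
RM = Ready Message
AAM = Abort Answer Message
ACM = Abort Command Message
CCM = Commit Command Message

Other Transitions:
ua = Unilateral Abort
tm = timeout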

20

2PC – Phase 1

Coordinator:
– Write prepare record to log
– Multicast prepare message and set timeout

Participant:
– Wait for prepare message
– If we are willing to commit then
  • force log records to stable storage
  • write ready record in log
  • send ready message to coordinator
– else
  • write ABORT in log
  • send abort answer message to coordinator

21

2PC – Phase 2 (1)

Coordinator:
– Wait for reply messages (ready or abort) or timeout
– If the timeout expires or any message is abort
  • write global abort record in the log
  • send abort command message to all participants
– else (all answers were ready)
  • write global commit record to log
  • send commit command message to all participants

22

2PC – Phase 2 (2)

Participants:
– Wait for command message (abort or commit)
– Write abort or commit in the log
– Send ack message to coordinator
– Execute command (may be null)

Coordinator:
– Wait for ack messages from all participants
– Write complete in the log
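The two phases can be sketched as a single-process simulation. This is a simplified illustration under stated assumptions: messages are plain method calls, logs are in-memory lists, and timeouts and failures are not modelled. The class and function names are hypothetical.

```python
class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "I"
        self.log = []

    def on_prepare(self):
        # Phase 1: answer the prepare message (PM).
        if self.will_commit:
            self.log.append("ready")
            self.state = "R"
            return "RM"
        self.log.append("abort")
        self.state = "A"
        return "AAM"

    def on_command(self, command):
        # Phase 2: apply the command message (CCM or ACM), then acknowledge.
        if self.state == "R":  # a participant that already aborted has nothing to do
            outcome = "commit" if command == "CCM" else "abort"
            self.log.append(outcome)
            self.state = "C" if outcome == "commit" else "A"
        return "ack"


def two_phase_commit(participants, coordinator_log):
    coordinator_log.append("prepare")
    replies = [p.on_prepare() for p in participants]
    if all(reply == "RM" for reply in replies):
        coordinator_log.append("global_commit")
        command = "CCM"
    else:
        coordinator_log.append("global_abort")
        command = "ACM"
    acks = [p.on_command(command) for p in participants]
    if all(ack == "ack" for ack in acks):
        coordinator_log.append("complete")
    return command
```

With Participant("London") and Participant("Depok", will_commit=False), the coordinator logs a global abort and London, which had answered ready, ends in state A.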

23

2PC – Site Failures

Resilient to all failures in which no log information is lost.

Site failures:
– A participant fails before having written ready to the log:
  • timeout expires ---> ABORT
– A participant fails after having written ready to the log:
  • Msg sent -- others take the decision. This node gets the outcome from the coordinator or other participants after restart.
  • Msg unsent -- timeout expires ---> ABORT

24

2PC – Coordinator Failures

Coordinator fails after writing prepare but before global commit/global abort (global X).
– All participants must wait for recovery of the coordinator -> BLOCKING
– Recovery of the coordinator involves restarting the protocol from the identities in the prepare log record
– Participants must identify duplicate prepare messages

Coordinator fails after having written global X but before writing complete.
– On restart, the coordinator must resend the decision, to ensure blocked processes get it. Others must discard the duplicate.

Coordinator fails after having written complete.
– No action needed

25

2PC – Lost Messages

A reply message (ready or abort) from a participant is lost.
– Timeout expires -- coordinator ABORTs

A prepare message is lost.
– Timeout expires -- coordinator ABORTs

A commit/abort command message is lost.
– Timeout in participant -- request repetition of command from the coordinator.

An ack message is lost.
– Timeout in coordinator -- coordinator resends command

26

2PC - Partitions

Everything aborts as coordinator can’t contact all participants. Those participants in partition without coordinator may remain blocked & the resources are still retained until the blocked participants are unblocked.

27

2PC - Comments

Blocking is a problem if the coordinator or network fails, which reduces availability -> use 3PC.

Unilateral abort.
– Any node can abort until it sends ready (site autonomy before the ready state).

Efficiency can be increased:
– Elimination of prepare messages. The participants that can commit will automatically send RM.
– Presumed commit/abort, if there’s no information found in the log. See [CER84] 13.5.1, 2, & 3.

28

Impossible Termination in 2PC

Termination is impossible when both of the following hold:

– No operational participant has received the command: the operational participants are in the R state, but they haven’t received the ACM or CCM, AND
– At least one participant failed, and unfortunately the failed participant acted as the coordinator.

29

Impossible Termination in 2PC

The failed participant might have already performed an action that cannot be undone (commit or abort), i.e. it may be in the C or A state.

The operational participants can’t know what the failed participant had done, and can’t take an independent decision.

The problem is solved by the 3PC.

30

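3PC (1)

[Figure: 3PC state diagrams for the coordinator and for a participant. The figure also marks two possible restart transitions, labelled "Restart 1" and "Restart 2".]

Coordinator states: I, U, BC, C, A
Coordinator transition labels: -/PM, tm/ACM, AAM/ACM, RM/PCM, OK/CCM

Participant states: I, R, PC, C, A
Participant transition labels: ua/-, PM/AAM, PM/RM, PCM/OK, ACM/-, CCM/-

The label tm/ACM appears twice in the figure.

New States:
PC = Prepared to Commit
BC = Before Commit

New Messages:
PCM = Prepare to Commit
OK = Entered PC state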

31

3PC (2)

Case study:– See slide no 3.

– London: Coordinator & Participant1

– Depok: Participant2

32

3PC (3)

3PC avoids problems with 2PC:

1. If any operational participant has received an abort then all can abort. The failed participant will abort at restart if it hasn’t already. [As 2PC] E.g. Depok fails, London is operational and has received an ACM.

2. If any participant has received the PCM, then all can commit. The failed participant cannot have aborted unilaterally, because it had answered READY (RM). The failed participant will commit at restart (see “restart 1”). E.g. London fails, Depok is operational and has received the PCM.

33

3PC (4)

3. If none of the operational participants has received the PCM, i.e. all of the operational participants are in the R state, then 2PC would block. With 3PC we can abort safely since the failed participant cannot have committed. At most it has received the PCM -> it can abort at restart (see “restart 2”). E.g. London fails, Depok is operational and has NOT received the PCM (in the R state).
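The three cases can be condensed into one decision function over the states of the operational participants. This is a sketch of the rules as stated on these slides, not a full termination protocol. Treating a participant already in state C like one in PC is an added assumption, and states the slides do not discuss raise an error.

```python
def terminate_3pc(operational_states):
    """Decide the outcome from the states of the operational participants."""
    states = set(operational_states)
    if not states or not states <= {"R", "PC", "C", "A"}:
        raise ValueError("case not covered by the three rules")
    if "A" in states:
        return "abort"   # case 1: someone has received an abort
    if "PC" in states or "C" in states:
        return "commit"  # case 2: someone has received the PCM (C is an assumption)
    return "abort"       # case 3: everyone is in R; the failed site cannot have committed


# London fails, Depok is operational:
terminate_3pc(["PC"])  # Depok has received the PCM      -> "commit"
terminate_3pc(["R"])   # Depok has NOT received the PCM  -> "abort"
```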

34

3PC (5)

3PC guarantees that no failure during the 2nd phase can cause blocking.

Failures during the 3rd phase -> blocking???
– If the coordinator fails in the 3rd phase, then elect another and continue the commit process (since all must be in the PC state).

35

Consistency & Isolation

Consistency & isolation -> concurrency control.

The Lost Update Problem:

Time   Transaction 1   Transaction 2
t1     Read X
t2                     Read X
t3     Update X
t4                     Update X      <- lost update: Transaction 1's update is overwritten

36

The Uncommitted Dependency (Temporary Update) Problem

Time   Transaction 1   Transaction 2
t1                     Update X
t2     Read X                        <- temporary incorrect value of X, because Transaction 2 is aborted
t3                     ABORT

37

The Inconsistent Analysis Problem

Time   Transaction 1      Transaction 2
t1     sum := 0
t2     Read A                              <- before the update by Transaction 2
t3     sum := sum + A
t4                        Read A
t5                        Read B
t6                        Update A
t7                        Update B
t8                        COMMIT
t9     Read B                              <- after the update by Transaction 2
t10    sum := sum + B

38

Concurrent Transactions

If we have concurrent transactions, we must prevent interference.

c.f. lost update problem:
– Prevent T2’s read (because T1 has seen it and may update it) [Locking]
– Prevent T1’s update (because T2 has seen it) [Locking]
– Prevent T2’s update (because T1 has already updated it and so this is based on obsolete values) [Timestamping]
– Have them work independently and resolve difficulties on commit. [Optimistic concurrency control]

39

Serializability

What we need is some notion of correctness.

Serializability is the notion usually used with respect to transactions.

40

Serial Transactions

Two transactions execute serially if all operations of one precede all operations of the other. E.g.:

S1: Ri(x) Wi(x) Ri(y) Rj(x) Wj(y) Rk(y) Wk(x), or

S1: TiTjTk, S2: TkTjTi, ...

S1 = Schedule 1, S2 = Schedule 2

All serial schedules are correct, but restrictive of concurrency.

41

Transaction Conflict

Two operations are in conflict if:
– At least one is a write
– They both act on the same data
– They are issued by different transactions

Which of the following are in conflict?

Ri(x) Rj(x) Wi(y) Rk(y) Wj(x)

42

Computationally Equivalent

Two schedules (S1 & S2) are computationally equivalent if:
– The same operations are involved (possibly reordered)
– For every pair of operations in conflict (Oi & Oj) such that Oi precedes Oj in S1, Oi also precedes Oj in S2.

43

Serializable Schedule

A schedule is serializable if it is computationally equivalent to a serial schedule. E.g.:

Ri(x) Rj(x) Wj(y) Wi(x) (which is not a serial schedule)

is computationally equivalent to:

Rj(x) Wj(y) Ri(x) Wi(x) (which is a serial schedule: TjTi)

The following is NOT a serial schedule. But is it serializable?

Ri(x) Rj(x) Wi(y) Rk(y) Wj(x)

The above schedule is computationally equivalent to the serial schedules TiTjTk and TiTkTj.
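The conflict rule and the equivalence test can be checked mechanically. The sketch below assumes schedules are written in the slide notation (for example "Ri(x) Wj(y)") and, for simplicity, tries every serial order, which is only practical for a handful of transactions. All function names are hypothetical.

```python
import re
from itertools import combinations, permutations


def parse(schedule):
    # "Ri(x) Wj(y)" -> [("R", "i", "x"), ("W", "j", "y")]
    return re.findall(r"([RW])(\w+)\((\w+)\)", schedule)


def conflicts(schedule):
    """Pairs of operations in conflict, each pair in schedule order."""
    return [
        (a, b)
        for a, b in combinations(parse(schedule), 2)
        if a[2] == b[2] and a[1] != b[1] and "W" in (a[0], b[0])
    ]


def equivalent_serial_orders(schedule):
    """All serial orders that keep every conflicting pair in the same order."""
    txns = sorted({t for _, t, _ in parse(schedule)})
    edges = {(a[1], b[1]) for a, b in conflicts(schedule)}
    return [
        order
        for order in permutations(txns)
        if all(order.index(first) < order.index(second) for first, second in edges)
    ]


s = "Ri(x) Rj(x) Wi(y) Rk(y) Wj(x)"
conflicts(s)
# [(("R","i","x"), ("W","j","x")), (("W","i","y"), ("R","k","y"))]
equivalent_serial_orders(s)
# [("i","j","k"), ("i","k","j")]
```

An empty result means the schedule is not serializable; the lost update interleaving "Ri(x) Rj(x) Wi(x) Wj(x)" is one such case.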

44

Serializability in Distributed Systems (1)

A local concurrency control mechanism isn’t sufficient. E.g.:
– Site 1: Ri(x) Wi(x) Rj(y) Wj(x), i.e. Ti < Tj
– Site 2: Rj(y) Wj(y) Ri(y) Wi(y), i.e. Tj < Ti

45

Serializability in Distributed Systems (2)

Let T1…Tn be a set of transactions and E be an execution of these modeled by schedules S1…Sm on machines 1…m.

Suppose each local schedule (S1…Sm) is serializable. Then E is serializable (in distributed systems) if, for all i and j, all conflicting operations from Ti and Tj appear in the same order in each of the schedules, i.e. there is a global total ordering that holds at all sites.
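A small sketch of the global condition, assuming one schedule string per site in the slide notation. It collects the ordering constraints from every site and looks for one total order that satisfies all of them; the brute-force search and the function names are choices made for this illustration.

```python
import re
from itertools import combinations, permutations


def parse(schedule):
    # "Ri(x) Wj(y)" -> [("R", "i", "x"), ("W", "j", "y")]
    return re.findall(r"([RW])(\w+)\((\w+)\)", schedule)


def precedence_edges(schedule):
    """(Ta, Tb) for every conflicting pair where Ta's operation comes first."""
    return {
        (a[1], b[1])
        for a, b in combinations(parse(schedule), 2)
        if a[2] == b[2] and a[1] != b[1] and "W" in (a[0], b[0])
    }


def globally_serializable(site_schedules):
    """True if one total order of transactions fits the conflicts at all sites."""
    edges = set().union(*(precedence_edges(s) for s in site_schedules))
    txns = sorted({t for s in site_schedules for _, t, _ in parse(s)})
    return any(
        all(order.index(a) < order.index(b) for a, b in edges)
        for order in permutations(txns)
    )


site1 = "Ri(x) Wi(x) Rj(y) Wj(x)"  # locally Ti < Tj
site2 = "Rj(y) Wj(y) Ri(y) Wi(y)"  # locally Tj < Ti
globally_serializable([site1])         # True
globally_serializable([site2])         # True
globally_serializable([site1, site2])  # False: no single global order exists
```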

46

Locking (1)

How do we implement serializability? Use locking.

Shared/eXclusive (Read/Write) locks:
1. A transaction T must have SLockx or XLockx before any Read X.
2. A transaction T must have XLockx before any Write X.
3. A transaction T must issue unLockx after Read x or Write x is completed.

47

Locking (2)

4. A transaction T can upgrade the lock, i.e. issue an XLockx after having SLockx, as long as T is the only transaction holding SLockx. Otherwise T must wait.

5. A transaction T can downgrade the lock, i.e. issue an SLockx after having XLockx.
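The lock rules, including upgrade and downgrade, can be sketched as a small lock table. This is a non-blocking illustration: a request returns False where a real lock manager would make the transaction wait. The class and method names are hypothetical.

```python
class LockTable:
    def __init__(self):
        self.shared = {}     # item -> set of transactions holding SLock
        self.exclusive = {}  # item -> the transaction holding XLock

    def slock(self, t, x):
        holder = self.exclusive.get(x)
        if holder is not None and holder != t:
            return False              # another transaction holds XLock: wait
        if holder == t:
            del self.exclusive[x]     # rule 5: downgrade XLock to SLock
        self.shared.setdefault(x, set()).add(t)
        return True

    def xlock(self, t, x):
        holder = self.exclusive.get(x)
        if holder is not None:
            return holder == t        # already ours, or held by someone else
        if self.shared.get(x, set()) - {t}:
            return False              # rule 4: other readers exist, so wait
        self.shared.pop(x, None)      # upgrade: our SLock becomes an XLock
        self.exclusive[x] = t
        return True

    def unlock(self, t, x):
        if self.exclusive.get(x) == t:
            del self.exclusive[x]
        self.shared.get(x, set()).discard(t)
```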

48

Locking (3)

E.g.
T1: X = X + Y
T2: Y = X + Y

If initially X=20, Y=30 then either:
– S1: T1 < T2: X=50, Y=80
– S2: T2 < T1: X=70, Y=50

Both are serial schedules, thus both are correct.

49

Locking (4)

However using Shared/eXclusive (Read/Write) locks does NOT guarantee serializability.

If any transaction releases a lock and then acquires another, it may produce incorrect results.

50

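Locking (5)

T1 (X = X + Y), operations in order:
  SLock y
  temp1 = y            (30)
  unLock y             (shown in italics on the slide)
  XLock x
  temp4 = x            (20)
  x = temp4 + temp1    (50)
  unLock x
  COMMIT

T2 (Y = X + Y), operations in order:
  SLock x
  temp2 = x            (20)
  unLock x             (shown in italics on the slide)
  XLock y
  temp3 = y
  y = temp2 + temp3    (50)
  unLock y
  COMMIT

The two transactions are interleaved in time: T2 reads x (20) before T1 writes it, and T1 reads y (30) before T2 writes it.

The schedule is NOT serializable!!! So it is NOT correct.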
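The outcome can be checked by replaying the reads and writes. The sketch below ignores the lock calls themselves and assumes one interleaving consistent with the values on the slide (each transaction reads before the other writes).

```python
def t1_then_t2(x, y):
    x = x + y  # T1
    y = x + y  # T2
    return x, y


def t2_then_t1(x, y):
    y = x + y  # T2
    x = x + y  # T1
    return x, y


def interleaved(x, y):
    temp1 = y          # T1 reads y, then unlocks it too early
    temp2 = x          # T2 reads x, then unlocks it too early
    temp4 = x          # T1 reads x under its exclusive lock
    x = temp4 + temp1  # T1 writes x
    temp3 = y          # T2 reads y under its exclusive lock
    y = temp2 + temp3  # T2 writes y
    return x, y


serial_results = {t1_then_t2(20, 30), t2_then_t1(20, 30)}  # {(50, 80), (70, 50)}
result = interleaved(20, 30)                               # (50, 50)
```

The interleaved result matches neither serial result, which is why the schedule is not correct.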

time

51

Locking (6)

What is the problem?
– It was too early to unlock Y in T1 and to unlock X in T2. See the italics unLock Y and unLock X.

What is the solution?
– 2 Phase Locking (2PL).

52

2PL - 1

Two phase locking (2PL)
– Before operating on any object the transaction must obtain a lock for it.
– After releasing a lock the transaction never acquires more locks.
– 2 phases:
  1. Expanding (growing) phase: acquiring new locks, but NEVER releasing any locks.
  2. Shrinking phase: releasing existing locks, but NEVER acquiring new locks.
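Whether a single transaction's lock sequence is two-phase can be checked in one pass. This is a minimal sketch that assumes the sequence is given as (action, item) pairs using the slide's SLock, XLock and unLock names; lock upgrades and downgrades are not treated specially.

```python
def is_two_phase(lock_ops):
    """True if no lock is acquired after the first unLock."""
    shrinking = False
    for action, _item in lock_ops:
        if action in ("SLock", "XLock"):
            if shrinking:
                return False  # acquiring in the shrinking phase violates 2PL
        elif action == "unLock":
            shrinking = True  # first release starts the shrinking phase
    return True


# A sequence that releases y before acquiring x:
early_unlock = [("SLock", "y"), ("unLock", "y"), ("XLock", "x"), ("unLock", "x")]
# The same locks acquired first and released afterwards:
two_phase = [("SLock", "y"), ("XLock", "x"), ("unLock", "y"), ("unLock", "x")]

is_two_phase(early_unlock)  # False
is_two_phase(two_phase)     # True
```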

53

2PL - 2

Exercise: modify the schedule on slide 50 so that it follows 2PL.

2PL may cause deadlocks. See [ELM00].

If a schedule obeys 2PL it is serializable. What about the converse? Do all serializable schedules follow 2PL?

54

2PL - 3

Account x at site1 & account y at site2.
Ti: Ri(x) Wi(x) Ri(y) Wi(y)
Tj: Rj(x) Wj(x) Rj(y) Wj(y)

[Figure: two timelines over Site1 and Site2, titled "Serializable but not 2PL" (left) and "Equivalent 2PL" (right). In both, Site1 executes Ri(x) Wi(x) Rj(x) Wj(x) and Site2 executes Ri(y) Wi(y) Rj(y) Wj(y); the two panels differ in how the operations are placed in time.]

New problem: 2PL may limit the amount of concurrency. See the schedule on the right.

55

Optimistic Concurrency Control

Locking is pessimistic. Assume instead that contention is rare:
– All updates are made to a private copy.
– On commit, see if there are conflicts with other transactions started afterwards.
– If not, install changes atomically
– else ABORT

Deadlock free & maximum parallelism, but may get livelock.
– What is livelock?
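A minimal single-process sketch of the optimistic scheme. The conflict check used here is an assumption chosen for illustration: every item carries a version number, and a transaction may commit only if nothing it read has been overwritten since. The slides do not prescribe a particular validation method.

```python
class Store:
    def __init__(self, data):
        self.data = dict(data)
        self.version = {key: 0 for key in data}  # bumped on every installed write


class OptimisticTxn:
    def __init__(self, store):
        self.store = store
        self.read_versions = {}  # key -> version seen at first read
        self.writes = {}         # private copy of all updates

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        self.read_versions.setdefault(key, self.store.version[key])
        return self.store.data[key]

    def write(self, key, value):
        self.writes[key] = value  # nothing is installed before commit

    def commit(self):
        # Validate: abort if anything we read has changed in the meantime.
        for key, seen in self.read_versions.items():
            if self.store.version[key] != seen:
                return False  # ABORT
        # Install: in this single-threaded sketch nothing can interleave here.
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.version[key] = self.store.version.get(key, 0) + 1
        return True
```

Replaying the lost update scenario with two such transactions lets the first commit succeed and makes the second abort instead of overwriting it.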

56

Timestamping (1)

Again, no deadlock.

Rules:

– Each transaction receives a globally unique timestamp, TSi, when started.

– Updates are not physically installed until commit.

– Every object in the database carries the timestamp of the last transaction to read it (RTM(x)) and the last to write it (WTM(x)).

57

Timestamping (2)

– If a transaction, Ti, requests an operation that conflicts with a younger transaction Tj, then Ti is restarted with a new timestamp.

– An operation from Ti is in conflict with an operation from Tj if:
  - It is a read and the object has already been updated by Tj, i.e. TSi < WTM(x). The read operation is rejected & Ti is restarted with a new timestamp. If the read is OK, set RTM(x) = max(TSi, RTM(x)).
  - It is an update and the object has already been read or updated by Tj, i.e. TSi < RTM(x) or TSi < WTM(x). The update operation is rejected & Ti is restarted with a new timestamp. If the update is OK, set WTM(x) = TSi.
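The two rules translate almost directly into code. The sketch below covers only the accept/reject decision and the RTM/WTM bookkeeping. Deferred installation of updates and the actual restart of Ti are left out, and the names are hypothetical.

```python
class Item:
    def __init__(self):
        self.rtm = 0  # RTM(x): timestamp of the youngest reader so far
        self.wtm = 0  # WTM(x): timestamp of the last writer


def read(item, ts):
    """Return True if the read by a transaction with timestamp ts is accepted."""
    if ts < item.wtm:
        return False  # a younger transaction already updated x: restart Ti
    item.rtm = max(ts, item.rtm)
    return True


def write(item, ts):
    """Return True if the update by a transaction with timestamp ts is accepted."""
    if ts < item.rtm or ts < item.wtm:
        return False  # a younger transaction already read or updated x: restart Ti
    item.wtm = ts
    return True


x = Item()
read(x, 5)   # True,  RTM(x) becomes 5
write(x, 3)  # False, 3 < RTM(x)
write(x, 7)  # True,  WTM(x) becomes 7
read(x, 6)   # False, 6 < WTM(x)
```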

58

References

[CER84] Ceri, S., G. Pelagatti. Distributed Databases: Principles and Systems. New York: McGraw-Hill, 1984.

[ELM00] Elmasri, R., S.B. Navathe. Fundamentals of Database Systems, 3rd ed. Reading: Addison-Wesley, 2000.
