
Page 1: Distributed Transaction Management

http://www.cs.uta.fi/
University of Tampere, CS Department

Distributed Transaction Management

Jyrki Nummenmaa

[email protected]

Page 2: Distributed Transaction Management


How are distributed txns born?

• Distributed database management systems: a client contacts a local server, which takes care of transparent distribution – the client does not need to know about the other sites.

• Alternative: a client contacts all local resource managers directly.

• Alternative: separate client programs are started on all respective sites.

• Typically the client executing on the first site acts as a coordinator communicating with the other clients.

Page 3: Distributed Transaction Management


Vector timestamps (continuing from last lecture)

Page 4: Distributed Transaction Management


Assigning vector timestamps

• Initially, assign [0,...,0] to myVT.

• The id of the local site is referred to as ”self”.

• If event e is the receipt of message m, then:
  – For i = 1,...,M, assign max(m.VT[i], myVT[i]) to myVT[i].
  – Add 1 to myVT[self].
  – Assign myVT to e.VT.

• If event e is the sending of m, then:
  – Add 1 to myVT[self].
  – Assign myVT to both e.VT and m.VT.
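A minimal sketch of these rules in Java (the class and method names are illustrative, not taken from the course software):

  // Minimal sketch of the vector timestamp rules above; names are
  // illustrative, not from the course software.
  public class VectorClock {
      private final int[] myVT;  // myVT[i] = latest known clock value of site i
      private final int self;    // index of the local site

      public VectorClock(int numSites, int self) {
          this.myVT = new int[numSites];  // initially [0,...,0]
          this.self = self;
      }

      // Sending m: add 1 to myVT[self]; the result is both e.VT and m.VT.
      public synchronized int[] onSend() {
          myVT[self]++;
          return myVT.clone();
      }

      // Receiving m: max-merge m.VT into myVT, then add 1 to myVT[self];
      // the result is e.VT.
      public synchronized int[] onReceive(int[] mVT) {
          for (int i = 0; i < myVT.length; i++)
              myVT[i] = Math.max(mVT[i], myVT[i]);
          myVT[self]++;
          return myVT.clone();
      }
  }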

Page 5: Distributed Transaction Management


Vector timestamps

[Figure: four sites T1–T4 exchanging messages m1–m9; each event is labelled with the vector timestamp assigned by the rules above, starting from [1,0,0,0], [0,1,0,0] and [0,0,1,0] at the first local events and ending with [3,3,3,9].]

Page 6: Distributed Transaction Management


Vector timestamp order

• e.VT ≤V e'.VT, if and only if e.VT[i] ≤ e'.VT[i] for 1 ≤ i ≤ M (M = number of sites).

• e.VT <V e'.VT, if and only if e.VT ≤V e'.VT and e.VT ≠ e'.VT.

• [0,1,2,3] ≤V [0,1,2,3]

• [0,1,2,2] <V [0,1,2,3]

• The order of [1,1,2,3] and [0,1,2,4] is not defined; they are concurrent.
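These comparisons are straightforward to code; a sketch (static helpers to place in any utility class):

  // a ≤V b: a[i] <= b[i] for every i.
  static boolean lessOrEqual(int[] a, int[] b) {
      for (int i = 0; i < a.length; i++)
          if (a[i] > b[i]) return false;
      return true;
  }

  // a <V b: a ≤V b and a ≠ b.
  static boolean lessThan(int[] a, int[] b) {
      return lessOrEqual(a, b) && !java.util.Arrays.equals(a, b);
  }

  // a and b are concurrent: neither is ≤V the other,
  // e.g. [1,1,2,3] and [0,1,2,4].
  static boolean concurrent(int[] a, int[] b) {
      return !lessOrEqual(a, b) && !lessOrEqual(b, a);
  }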

Page 7: Distributed Transaction Management


Vector timestamps - properties

• Vector timestamps guarantee that if e <H e', then e.VT <V e'.VT. This follows from the definition of the happens-before relation by observing the path of events from e to e'.

• Vector timestamps also guarantee that if e.VT <V e'.VT, then e <H e' (why?).

Page 8: Distributed Transaction Management


Example Lock Manager Implementation

Page 9: Distributed Transaction Management


Example lock manager

• Software to manage locks (there are no actual data items). The software is not written for any particular application; it is just a demo of the concepts.

• (Non-existing) items are identified by numbers.

• Users may start and stop transactions and take and release locks using an interactive client.

• From the user’s point of view, the interactive client passes the commands to the lock managers.

• The number of managers to consult for a shared lock is given at startup; the number to consult for an exclusive lock is computed from it (see the sketch after this list).

• Lock managers queue lock requests and grant locks once they become available.
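The slide does not spell out how the exclusive count is computed. Under the standard quorum overlap assumptions – a shared and an exclusive quorum must intersect (s + x > n), and two exclusive quorums must intersect (2x > n) – one plausible sketch:

  // Smallest exclusive quorum x for n lock managers when shared
  // locks consult s of them. The overlap rules s + x > n and
  // 2x > n are assumptions, not stated on the slide.
  static int exclusiveQuorum(int n, int s) {
      return Math.max(n - s + 1, n / 2 + 1);
  }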

Page 10: Distributed Transaction Management


Lock manager classes

• See javadocs.

• LockManager manages locks.

• A LMMultiServerThread is created for each client.

• LockManagerProtocol talks with a client and calls LockManager methods.

• StartLockManager just starts a LockManager.

• Lock – a single lock object.

• Transaction – a single transaction object.

• Waitsfor – a single lock waiting for another.

Page 11: Distributed Transaction Management


Lock client classes

• StartLockClient – starts a lock client.

• LockClientController – the textual user interface. It is kept separate from the LockClient because reading user input blocks. It just reads standard input and sends the commands to a LockClient.

• LockClient – a client talking to a LockManager. The client opens a socket to talk to a lock manager, and reads input from both the Lock Manager and the Client Controller.

Page 12: Distributed Transaction Management


LockClient implementation

• In addition to the communication data, the LockClient keeps track of its txn id and its lock requests (pending and granted).

• Output from LockManagers is just printed out, except that if it contains a new txn-id, the id is stored.

• It keeps track of lock requests from the client controller:
  – When a new lock request comes in, it sends the request to the appropriate LockManagers.
  – When an unlock request comes in, it checks that there is a matching lock request and sends the unlock to the LockManagers.

• NewTxn requests are only sent to one LockManager.

Page 13: Distributed Transaction Management


More on input/output

• Each LMMultiServerThread talks with only one client.

• Each client talks to several servers (LMMultiServerThreads).

• An interactive client talks and listens to a client controller, but it does not read standard input itself, as this would block; and if you just poll input with .ready(), user input is not echoed on the screen :(

• A client controller only reads standard input and sends the messages to an interactive client.

Page 14: Distributed Transaction Management


LockManager data structures

• A Vector for currently locked objects.

• A Vector for queued lock requests.

• A Vector for locks given from the queue and waiting to be notified to the clients.

• A Vector for transactions.

Page 15: Distributed Transaction Management


LockManager locking

• The lock queue is checked. If there are xlocks preventing the new lock, the request is put into the queue; otherwise the lock is granted.

• When locks are released, the queue is checked and the first grantable queued lock is granted. If further queued locks can also be granted, they are granted as well. Those locks are added to the notify queue.

• As an LMMultiServerThread communicates with its client, it also regularly calls the LockManager's checkNotify() method to see if there are locks granted from the queue that still need to be notified.
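A condensed, self-contained sketch of this grant/queue/notify cycle (the real demo classes keep more state and use Vectors; the names here are illustrative):

  import java.util.*;

  public class MiniLockManager {
      record Lock(long txn, int item, boolean exclusive) {}

      private final List<Lock> granted  = new ArrayList<>();
      private final List<Lock> queue    = new ArrayList<>();
      private final List<Lock> toNotify = new ArrayList<>();

      // A request conflicts if the item holds an xlock, or holds any
      // lock and the request is exclusive.
      private boolean conflicts(int item, boolean exclusive) {
          for (Lock l : granted)
              if (l.item() == item && (exclusive || l.exclusive()))
                  return true;
          return false;
      }

      public synchronized boolean lock(long txn, int item, boolean exclusive) {
          if (conflicts(item, exclusive)) {
              queue.add(new Lock(txn, item, exclusive));  // caller waits
              return false;
          }
          granted.add(new Lock(txn, item, exclusive));
          return true;
      }

      public synchronized void unlock(long txn, int item) {
          granted.removeIf(l -> l.txn() == txn && l.item() == item);
          // Grant every queued lock that has become compatible.
          for (Iterator<Lock> it = queue.iterator(); it.hasNext(); ) {
              Lock l = it.next();
              if (!conflicts(l.item(), l.exclusive())) {
                  it.remove();
                  granted.add(l);
                  toNotify.add(l);  // picked up via checkNotify()
              }
          }
      }

      // Called regularly by the per-client server thread.
      public synchronized List<Lock> checkNotify() {
          List<Lock> out = new ArrayList<>(toNotify);
          toNotify.clear();
          return out;
      }
  }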

Page 16: Distributed Transaction Management


LockManagerProtocol

• A LMMultiServerThread uses a LockManagerProtocol to manage communication with a client.

• The LockManagerProtocol parses the input from the client (and should do some more error processing than it does at present).

• The LockManagerProtocol calls the LockManager’s services for the parsed commands.

Page 17: Distributed Transaction Management


Example lock management implementation - concurrency

• A lock manager uses threads to serve incoming connections.

• As threads access the lock manager's services concurrently, those methods are declared synchronized. This may actually be overkill, as it seems to be enough to synchronize the public methods (xlock/slock/unlock).

• An interactive client uses a single thread, but this is just to access the sleep method. There is no concurrent data access and consequently no synchronized methods.

Page 18: Distributed Transaction Management


Server efficiency

• It would be more efficient to pick threads from a pool of idle threads rather than always creating a new one.

• The data structures are not optimized. As there is typically a large number of data items and only some unpredictable subset of them is locked at any one moment, a hashing data structure might be suitable for the lock table (like Hashtable; a hash function must exist for the key type, e.g. Integer is ok).

• It would also be reasonable to attach a lock request queue (e.g. a linked list) to each locked item in the lock table by a direct reference, as sketched below.
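A sketch of that structure (assuming Integer item ids; the names are hypothetical):

  import java.util.*;

  // Lock table sketch: item id -> the item's own lock request
  // queue, referenced directly from the table entry.
  Map<Integer, Deque<Long>> lockTable = new HashMap<>();

  void enqueue(int item, long txn) {
      // The queue is created lazily on the item's first request.
      lockTable.computeIfAbsent(item, k -> new ArrayDeque<>()).add(txn);
  }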

Page 19: Distributed Transaction Management


Example lock management implementation – txn ids

• Currently long integers are used.

• Giving out global txn ids sequentially (over all sites) is either complicated or needs a centralised server -> it is better to construct txn ids from local (site-id, local-txn-id) pairs. If the sites are known in advance, they can be numbered.

• We only care about txns that access some resource manager's services, and there may be a fixed set of those -> ask the first server the client talks to for an id (server-id, local-txn-id).
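For illustration, such a pair can be packed into the long integers the software already uses (a hypothetical helper, assuming both components fit in 32 bits):

  // Pack a (server-id, local-txn-id) pair into one long.
  static long globalTxnId(int serverId, int localTxnId) {
      return ((long) serverId << 32) | (localTxnId & 0xFFFFFFFFL);
  }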

Page 20: Distributed Transaction Management


Final notes on the software

• This is demo software.

• It is not bug-free. If you find bugs, I am glad to hear about them. Even better if you fix them.

• It is not robust (error management is far from perfect).

• However, it may be able to run for some time without errors.

• If you for any reason (and I cannot think of one) need to use the software, I can give you an official license to use it.

Page 21: Distributed Transaction Management


Distributed Deadlock Management

Page 22: Distributed Transaction Management


Deadlock – Introduction

• Centralised example:
  – T1 locks X at time t1.
  – T2 locks Y at time t2 > t1.
  – T1 attempts to lock Y at time t3 > t2 and gets blocked.
  – T2 attempts to lock X at time t4 > t3 and gets blocked.

Page 23: Distributed Transaction Management


Deadlock (continued)

• Deadlock can occur in centralised systems. For example:
  – At the operating system level, there can be resource contention between processes.
  – At the transaction processing level, there can be data contention between transactions.
  – In a poorly designed multithreaded program, there can be deadlock between threads in the same process.

Page 24: Distributed Transaction Management


Distributed deadlock ”management” approaches

• The approach taken by distributed systems designers to the problem of deadlock depends on the frequency with which it occurs.

• Possible strategies:
  – Ignore it.
  – Detection (and recovery).
  – Prevention and avoidance (by statically making deadlock structurally impossible and by allocating resources carefully).
  – Detect local deadlock and ignore global deadlock.

Page 25: Distributed Transaction Management


Ignore deadlocks?

• If the system ignores deadlocks, then the application programmers have to write their applications so that a timeout forces the transaction to abort and possibly restart.

• The same approach is sometimes used in the centralised world.

Page 26: Distributed Transaction Management


Distributed Deadlock Prevention and Avoidance

• Some proposed techniques are not feasible in practice, like making a process request all of its resources at the start of execution.

• For transaction processing systems with timestamps, the following scheme can be implemented (as in the centralised world):
  – When a process blocks, its timestamp is compared to the timestamp of the blocking process.
  – The blocked process is only allowed to wait if it has a higher timestamp.
  – This avoids any cyclic dependency.

Page 27: Distributed Transaction Management


Wound-wait and wait-die

• In the wound-wait approach, if an older process requests a lock on an item held by a younger process, it wounds the younger process and effectively kills it.

• In the wait-die approach, if a younger process requests a lock on an item held by an older process, the younger process commits suicide.

• Both of these approaches kill transactions blindly – there does not even need to be a deadlock.
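A minimal sketch of the two rules, assuming a smaller timestamp means an older process (the names are illustrative):

  // Action taken when 'requester' is blocked by 'holder'.
  enum Action { WAIT, ABORT_HOLDER, ABORT_REQUESTER }

  // Wound-wait: an older requester wounds (aborts) the younger
  // holder; a younger requester waits.
  static Action woundWait(long requesterTs, long holderTs) {
      return requesterTs < holderTs ? Action.ABORT_HOLDER : Action.WAIT;
  }

  // Wait-die: an older requester waits; a younger requester dies.
  static Action waitDie(long requesterTs, long holderTs) {
      return requesterTs < holderTs ? Action.WAIT : Action.ABORT_REQUESTER;
  }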

Page 28: Distributed Transaction Management


Further considerations for wound-wait and wait-die

• To reduce the number of unnecessarily aborted transactions, it is possible to use the cautious waiting rule:
  – ”You are always allowed to wait for a process which is not waiting for another process.”

• The aborted processes are re-started with their original timestamps, to guarantee liveness.

• Otherwise, a transaction may not make progress if it gets aborted over and over again.

Page 29: Distributed Transaction Management


Local waits-for graphs

• Each resource manager can maintain its local waits-for graph.

• A coordinator maintains a global waits-for graph. It can be computed when someone requests it, or automatically from time to time.

• When the coordinator detects a deadlock, it selects a victim process and kills it (thereby causing its resource locks to be released), breaking the deadlock.

Page 30: Distributed Transaction Management


Waits-for graph

• Centralised: a cycle in the waits-for graph means a deadlock.

[Figure: a waits-for graph with a cycle through transactions T1, T2 and T3.]
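Detecting such a cycle is a standard graph search; a sketch (hypothetical names, not part of the demo software):

  import java.util.*;

  // Deadlock test: is there a cycle in the waits-for graph?
  // waitsFor maps each txn to the txns it waits for.
  static boolean hasCycle(Map<String, List<String>> waitsFor) {
      Set<String> done = new HashSet<>(), onPath = new HashSet<>();
      for (String t : waitsFor.keySet())
          if (dfs(t, waitsFor, done, onPath)) return true;
      return false;
  }

  static boolean dfs(String t, Map<String, List<String>> g,
                     Set<String> done, Set<String> onPath) {
      if (onPath.contains(t)) return true;   // back edge: a cycle
      if (!done.add(t)) return false;        // already fully explored
      onPath.add(t);
      for (String u : g.getOrDefault(t, List.of()))
          if (dfs(u, g, done, onPath)) return true;
      onPath.remove(t);
      return false;
  }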

Page 31: Distributed Transaction Management


Local vs Global

• Distributed: collect the local graphs and create a global waits-for graph.

[Figure: local waits-for graphs at Site 1 and Site 2, each involving T1, T2 and T3, combined into a global waits-for graph that contains a cycle.]

Page 32: Distributed Transaction Management


Local waits-for graphs – problems

• It is impossible to collect all the information at the same instant, so the collected information does not reflect a true ”global state” of locking.

• Since the information is collected from ”different times”, it may be that locks are released in between, and the global waits-for graph is wrong.

• Therefore, the information about changes must be transmitted to the coordinator.

• The coordinator knows nothing about change information that is in transit.

• In practice, many of the deadlocks that it thinks it has detected will be what they call in the trade ”phantom deadlocks”.

Page 33: Distributed Transaction Management


Global states and snapshot computation

Page 34: Distributed Transaction Management


Global state

• We assume that between each pair (Si, Sj) of sites there is a reliable order-preserving communication channel C(i,j), whose contents is a list of messages L(i,j) = m1(i,j), m2(i,j), ..., mk(i,j).

• Let L = { L(i,j) } be the collection of all message lists and let K be the collection of all local states. We say that the pair G = (K, L) is the global state of the system.
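As plain data, this definition could be captured as follows (a sketch; the representation of local states and channel keys is an assumption):

  import java.util.*;

  // Global state G = (K, L):
  // K: site id -> recorded local state (opaque here);
  // L: channel "i,j" -> messages in transit on C(i,j).
  record GlobalState(Map<Integer, Object> localStates,
                     Map<String, List<String>> channelContents) {}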

Page 35: Distributed Transaction Management


Consistent cut

• We say that a global state G=(K,L) is a consistent cut, if for each event e in G, G contains all events e’ such that e’ <H e.

• That is, if e is in G, then no event that happened before e is missing from G.

Page 36: Distributed Transaction Management


Which lines cut a consistent cut?

[Figure: an event diagram over sites T, T’, T’’ and T’’’ with messages m1–m9, crossed by candidate cut lines.]

Page 37: Distributed Transaction Management


Distributed snapshots

• A state of a site at one time is called a snapshot.

• There is no way we can take the snapshots simultaneously. If we could, that would solve the deadlock detection problem.

• Therefore, we want to create a snapshot that reflects a consistent cut.

Page 38: Distributed Transaction Management


Computing a distributed snapshot – assumptions

• If we form a graph with sites as nodes and communication channels as edges, we assume that the graph is connected.

• Neither channels nor processes fail.

• Any site may initiate snapshot computation.

• There may be several simultaneous snapshot computations.

Page 39: Distributed Transaction Management


Computing a distributed snapshot / 1

• As a site Sj initiates snapshot collection, it records its state and sends a snapshot token to all sites it communicates with.

Page 40: Distributed Transaction Management


Computing a distributed snapshot / 2

• If a site Si, i ≠ j, receives a snapshot token for the first time, and it receives it from Sk, it does the following:
  1. Stops processing messages.
  2. Makes L(k,i) the empty list.
  3. Records its state.
  4. Sends a snapshot token to all sites it communicates with.
  5. Continues to process messages.

Page 41: Distributed Transaction Management


Computing a distributed snapshot / 3

• If a site Si receives a snapshot token from Sk, and it has received a snapshot token earlier on or i = j, then the list L(k,i) is the list of messages Si has received from Sk after recording its state.
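Putting the three rules together, a compact sketch (message transport and the actual state recording are stubbed out; the names are illustrative):

  import java.util.*;

  class SnapshotSite {
      final int self;
      final List<Integer> neighbours;  // sites this site communicates with
      final Map<Integer, List<String>> recorded = new HashMap<>();  // L(k, self)
      final Set<Integer> tokenSeenFrom = new HashSet<>();
      boolean stateRecorded = false;

      SnapshotSite(int self, List<Integer> neighbours) {
          this.self = self;
          this.neighbours = neighbours;
      }

      // Rule 1: the initiator records its state and sends tokens.
      void initiate() {
          recordLocalState();
          stateRecorded = true;
          neighbours.forEach(this::sendToken);
      }

      // Rules 2 and 3: handle a snapshot token arriving from 'from'.
      void onToken(int from) {
          tokenSeenFrom.add(from);  // the token closes channel (from, self)
          if (!stateRecorded) {
              recorded.put(from, new ArrayList<>());  // L(from, self) = empty
              recordLocalState();
              stateRecorded = true;
              neighbours.forEach(this::sendToken);
          }
          // Otherwise L(from, self) is whatever onMessage collected
          // between recording the state and this token's arrival.
      }

      // An ordinary message from site 'from'.
      void onMessage(int from, String m) {
          if (stateRecorded && !tokenSeenFrom.contains(from))
              recorded.computeIfAbsent(from, k -> new ArrayList<>()).add(m);
          // ...plus the normal processing of m.
      }

      void recordLocalState() { /* e.g. save the local waits-for graph */ }
      void sendToken(int to)  { /* send a snapshot token on channel (self, to) */ }
  }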

Page 42: Distributed Transaction Management


Distributed snapshot example

[Figure: sites S1–S4 exchanging messages m1–m5 while snapshot tokens (”snap”) propagate between them.]

Assume the connection set { (S1,S2), (S2,S3), (S3,S4) }.

L(1,2) = { m1 }, L(3,2) = {}, L(4,3) = { m3, m4 }. The effects of m2 are included in the state of S4.

Message m5 takes place entirely after the snapshot computation.

Page 43: Distributed Transaction Management


Termination of the snapshot algorithm

• Termination is, in fact, straightforward to see.

• Since the graph of sites is connected, and the local processing and message delivery between any two sites takes only a finite time, all sites will be reached and processed within a finite time.

Page 44: Distributed Transaction Management


The snapshot algorithm selects a consistent cut

• Proof. Let ei and ej be events that occurred on sites Si and Sj, with ej <H ei. Assume that ei is in the cut produced by the snapshot algorithm. We want to show that ej is also in the cut. If Si = Sj, this is clear. Assume that Si ≠ Sj. Assume now, for contradiction, that ej is not in the cut. Consider the sequence of messages <m1, m2, ..., mk> by which ej <H ei. By the way snapshot tokens are sent and received, a token has travelled ahead of each of m1, m2, ..., mk, and Si has therefore recorded its state before ei. Therefore ei is not in the cut. This is a contradiction, and we are done.

Page 45: Distributed Transaction Management


A picture relating to the theorem

[Figure: sites Si and Sj with the message chain m1, m2, ..., mk establishing ej <H ei. Assuming for contradiction that ej is not in the cut, Si must have recorded its state before ei, so ei cannot be in the cut – a contradiction.]

Page 46: Distributed Transaction Management


Several simultaneous snapshot computations

• The snapshot markers have to be distinguishable, and sites simply manage each snapshot computation separately.

Page 47: Distributed Transaction Management


Snapshots and deadlock detection

• In this case, the state to be recorded is the waits-for graph, and the messages to be recorded are the lock request / lock release messages.

• After the snapshot computation, the waits-for graphs are collected at one site, say the initiator of the snapshot computation.

• For this, the snap messages should include the identity of the initiator of the snapshot computation.