
    CS-510

Transactional Memory: Architectural Support for Lock-Free Data Structures

By Maurice Herlihy and J. Eliot B. Moss, 1993

Presented by Steve Coward, PSU Fall 2011 CS-510

    11/02/2011

    Slide content heavily borrowed from Ashish Jha

PSU SP 2010 CS-510, 05/04/2010


    Agenda

    Lock-based synchronization

Non-blocking synchronization

TM (Transactional Memory) concept

    A HW-based TM implementation

    Core additions

    ISA additions

    Transactional cache

    Cache coherence protocol changes

    Test Methodology and results

    Blue Gene/Q

    Summary


Lock-based synchronization

Pessimistic synchronization approach

Uses mutual exclusion - blocking; only ONE process/thread can execute at a time

Generally easy to use, except not composable

Generally easy to reason about

Does not scale well due to lock arbitration/communication

[Diagram: three classic locking problems]

Priority inversion: a lo-priority process holds lock X; a hi-priority process pre-empts it but can't proceed without the lock, while a med-priority process can keep the lo-priority holder from running

Convoying: the process holding lock X is de-scheduled (e.g., quantum expiration, page fault, other interrupts); other processes wanting the lock can't proceed, causing high context-switch overhead

Deadlock: Proc A holds lock X and tries to get lock Y, while Proc B holds lock Y and tries to get lock X; neither can proceed


    Lock-Free Synchronization

    Non-Blocking - optimistic, does not use mutual exclusion

    Uses RMW operations such as CAS, LL&SC

Limited to operations on single words or double words

Avoids common problems seen with conventional techniques, such as priority inversion, convoying, and deadlock

Difficult programming logic

In the absence of the above problems, and as implemented in SW, lock-free doesn't perform as well as a lock-based approach
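To make the RMW style concrete, here is a minimal lock-free counter increment sketched with C11 atomics; the compare-and-swap loop stands in for the CAS or LL/SC instructions mentioned above (names and structure are my own illustration, not code from the paper).

#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned long counter = 0;

/* Optimistic, non-blocking increment: read the current value, then try
 * to install value+1.  If another thread got there first, the CAS fails,
 * cur is refreshed with the value just observed, and we simply retry. */
static void lockfree_increment(void)
{
    unsigned long cur = atomic_load(&counter);
    while (!atomic_compare_exchange_weak(&counter, &cur, cur + 1))
        ;   /* retry with the updated cur */
}

int main(void)
{
    for (int i = 0; i < 10; i++)
        lockfree_increment();
    printf("counter = %lu\n", atomic_load(&counter));
    return 0;
}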


    Non-Blocking Wish List

    Simple programming usage model

Avoids priority inversion, convoying and deadlock

    Equivalent or better performance than lock-based approach

    Less data copying

    No restrictions on data set size or contiguity

    Composable

    Wait-free

    Enter Transactional Memory (TM)


What is a transaction (tx)?

A tx is a finite sequence of revocable operations, executed by a process, that satisfies two properties:

Serializable: steps of one tx are not seen to be interleaved with the steps of another tx

    Txs appear to all processes to execute in the same order

    Atomic

Each tx either aborts or commits

Abort causes all tentative changes of the tx to be discarded

Commit causes all tentative changes of the tx to become, effectively instantaneously, globally visible

    This paper assumes that a process executes at most one tx at a time

Txs do not nest (though that seems like a nice feature); hence, these txs are not composable

    Txs do not overlap (seems less useful)


    What is Transactional Memory?

Transactional Memory (TM) is a lock-free, non-blocking concurrency control mechanism based on txs that allows a programmer to define customized read-modify-write operations that apply to multiple, independently chosen memory locations

Non-blocking: multiple txs optimistically execute their critical sections in parallel (on different CPUs)

If a conflict occurs, only one can succeed; others can retry
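As a sketch of what "multiple, independently chosen memory locations" buys you, the fragment below atomically moves a value from one shared cell into another using the paper's primitives; LTX, ST and COMMIT are declared as assumed extern stand-ins for the proposed instructions, so this is shape only and needs a TM implementation behind it.

/* Assumed C wrappers for the proposed tx instructions (not a real API). */
extern unsigned LTX(unsigned *addr);             /* tx read, intent to write */
extern void     ST(unsigned *addr, unsigned v);  /* tentative tx write       */
extern int      COMMIT(void);                    /* 1 = committed, 0 = abort */

static unsigned cellA, cellB;                    /* imagine shared memory    */

/* Atomically add cellA into cellB and clear cellA: either all four
 * tx operations become visible at once, or none of them do. */
static void atomic_move(void)
{
    while (1) {
        unsigned a = LTX(&cellA);
        unsigned b = LTX(&cellB);
        ST(&cellB, b + a);
        ST(&cellA, 0);
        if (COMMIT())
            return;
        /* conflict: nothing became visible; retry (ideally after backoff) */
    }
}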


    Basic Transaction Concept

Proc A

beginTx:
  [A] = 1;
  [B] = 2;
  [C] = 3;
  x = _VALIDATE();
  IF (x) _COMMIT();    // _COMMIT - instantaneously make all above changes visible to all Procs
  ELSE { _ABORT();     // _ABORT - discard all above changes, may try again
         GOTO beginTx; }

Proc B

beginTx:
  z = [A];
  y = [B];
  [C] = y;
  x = _VALIDATE();
  IF (x) _COMMIT();
  ELSE { _ABORT();
         GOTO beginTx; }

Atomicity: ALL or NOTHING

    True concurrency

    How is validity determined?

    Optimistic execution

    Serialization ensured if only one tx commits and others abort

    Linearization ensured by (combined) atomicity of validate, commit, and abort operations

    How to COMMIT?

    How to ABORT?

    Changes must be revocable!


    TM vs. SW Non-Blocking

Critical section expressed as a tx:

LT or LTX    // READ set of locations
VALIDATE     // CHECK consistency of READ values   (fail -> retry)
ST           // MODIFY set of locations             (the critical section)
COMMIT       // HW makes changes PERMANENT          (pass -> succeed, fail -> retry)

// TM
while (1) {
    curCnt = LTX(&cnt);
    if (Validate()) {
        int c = curCnt + 1;
        ST(&cnt, c);
        if (Commit())
            return;
    }
}

// Non-blocking
while (1) {
    curCnt = LL(&cnt);
    // Copy object
    if (Validate(&cnt)) {
        int c = curCnt + 1;
        if (SC(&cnt, c))
            return;
    }
    // Do work
}


HW or SW TM?

TM may be implemented in HW or SW

    Many SW TM library packages exist

C++, C#, Java, Haskell, Python, etc.

2-3 orders of magnitude slower than other synchronization approaches

    This paper focuses on HW implementation

HW offers significantly greater performance than SW for reasonable cost

Minor tweaks to CPUs*: core, ISA, caches, bus and cache coherency protocols

Leverages the cache-coherency protocol to maintain tx memory consistency

But a HW implementation without SW cache overflow handling is problematic

    * See Blue Gene/Q


    Core: TM Updates

Each CPU maintains two additional Status Register bits

TACTIVE

Flag indicating whether a transaction is in progress on this CPU

Implicitly set upon entering a transaction

TSTATUS

Flag indicating whether the active transaction has conflicted with another transaction

TACTIVE   TSTATUS      MEANING
False     don't care   No tx active
True      False        Orphan tx - executing, conflict detected, will abort
True      True         Active tx - executing, no conflict yet detected
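A tiny C model of how those two bits decode into the three states in the table (illustrative only; in the proposal they are processor status-register bits, not software).

#include <stdbool.h>
#include <stdio.h>

struct tx_status {
    bool tactive;   /* a tx is in progress on this CPU                 */
    bool tstatus;   /* the tx has not (yet) conflicted with another tx */
};

static const char *describe(struct tx_status s)
{
    if (!s.tactive)  return "no tx active";
    if (s.tstatus)   return "active tx: executing, no conflict yet detected";
    return "orphan tx: still executing, conflict detected, will abort";
}

int main(void)
{
    struct tx_status s = { .tactive = true, .tstatus = false };
    printf("%s\n", describe(s));
    return 0;
}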


ISA: TM Memory Operations

LT: load from shared memory to register

ST: tentative write of register to shared memory, which becomes visible upon successful COMMIT

LTX: LT + intent to write to the same location later

A performance optimization for early conflict detection

ISA: TM Verification Operations

VALIDATE: validate consistency of the read set

Avoids a misbehaved orphan

    ABORT: unconditionally discard all updates

    Used to handle extraordinary circumstances

    COMMIT: attempt to make tentative updates permanent
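Putting the six operations together, here is a hedged sketch of how a programmer might use them to claim a bounded resource; LT/LTX/ST/VALIDATE/ABORT/COMMIT are assumed extern wrappers for the proposed instructions, and limit and slots are hypothetical shared counters, so treat this as shape, not a working program.

/* Assumed wrappers for the six tx instructions above (not a real API). */
extern unsigned LT(unsigned *addr);     /* tx read                           */
extern unsigned LTX(unsigned *addr);    /* tx read with intent to write      */
extern void     ST(unsigned *addr, unsigned v);
extern int      VALIDATE(void);         /* is the read set still consistent? */
extern void     ABORT(void);            /* unconditionally discard updates   */
extern int      COMMIT(void);           /* 1 = success, 0 = must retry       */

static unsigned limit, slots;           /* hypothetical shared counters      */

/* Claim one slot unless the limit is reached.  VALIDATE keeps an orphan tx
 * from acting on a stale snapshot; ABORT abandons the attempt cleanly. */
static int claim_slot(void)
{
    while (1) {
        unsigned max  = LT(&limit);
        unsigned used = LTX(&slots);
        if (!VALIDATE())
            continue;                   /* reads were inconsistent: retry    */
        if (used >= max) {
            ABORT();                    /* nothing to publish                */
            return 0;
        }
        ST(&slots, used + 1);
        if (COMMIT())
            return 1;
        /* commit failed because of a conflict: retry */
    }
}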


    TM Conflict Detection

LOAD and STORE are supported but do not affect a tx's READ or WRITE set

    Why would a STORE be performed within a tx?

Left to the implementation

Interaction between tx and non-tx operations on the same address is generally a programming error

Consider LOAD/STORE as committed txs with conflict potential; otherwise the outcome is non-linearizable

LT   reg, [MEM]    // pure Tx READ
LTX  reg, [MEM]    // Tx READ with intent to WRITE later
ST   [MEM], reg    // Tx WRITE to local cache; value globally visible only after COMMIT

Tx dependencies definition: DATA SET = READ-SET + WRITE-SET

Tx abort condition diagram*:

DATA-SET updated by another tx?      Y -> ABORT   // discard changes to WRITE-SET
        N
WRITE-SET read by any other tx?      Y -> ABORT
        N
COMMIT                               // WRITE-SET visible to other processes

LOAD   reg, [MEM]    // non-Tx READ
STORE  [MEM], reg    // non-Tx WRITE - careful!

    *Subject to arbitration

    (as in Reader/Writer paradigm)
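Read as code, the abort-condition diagram boils down to two questions; the toy function below restates it (the flags are stand-ins for facts the hardware actually derives from coherence traffic).

#include <stdbool.h>

/* May this tx commit?  Only if no other tx has updated anything in its
 * DATA SET (READ-SET + WRITE-SET) and no other tx has read its WRITE-SET;
 * otherwise it must abort and discard the WRITE-SET. */
static bool may_commit(bool dataset_updated_by_other, bool writeset_read_by_other)
{
    return !dataset_updated_by_other && !writeset_read_by_other;
}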


TM Cache Architecture

Two primary, mutually exclusive caches

In the absence of TM, non-tx ops use the same caches, control logic, and coherency protocols as a non-Tx architecture

    To avoid impairing design/performance of regular cache

    To prevent regular cache set size from limiting max tx size

    Accessed sequentially based on instruction type!

    Tx Cache

    Fully-associative

    Otherwise how would tx address collisions be handled?

    Single-cycle COMMIT and ABORT

    Small - size is implementation dependent

Intel Core i5 has a first-level TLB with 32 entries!

Holds all tentative writes w/o propagating them to other caches or memory

May be extended to act as a victim cache

    Upon ABORT

    Modified lines set to INVALID state

    Upon COMMIT

    Lines can be snooped by other processors

    Lines WB to memory upon replacement

[Figure: cache hierarchy - the core has two 1st-level data caches accessed in 1 clk: L1D (direct-mapped, exclusive, 2048 lines x 8B) and the Tx cache (fully-associative, exclusive, 64 lines x 8B); the 2nd/3rd-level caches (L2D, L3D) and main memory sit behind them, with a 4-clk memory latency]


HW TM Leverages the Cache Protocol

M - cache line only in current cache and is modified

E - cache line only in current cache and is not modified

S - cache line in current cache and possibly other caches and is not modified

I - invalid cache line

Tx commit logic will need to detect the following events (akin to R/W locking)

    Local read, remote write

    S -> I

    E -> I

    Local write, remote write

    M -> I

    Local write, remote read

    M, snoop read, write back -> S

Works with bus-based (snoopy cache) or network-based (directory) architectures

    http://thanglongedu.org/showthread.php?2226-Nehalem/page2
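A compact C rendering of the snoop-side transitions listed above, using a generic MESI model (the paper's bus version actually builds on Goodman's protocol, so this is an approximation for intuition only).

enum mesi { MESI_M, MESI_E, MESI_S, MESI_I };

/* Next state of a locally cached line when a remote request is snooped.
 * remote_write != 0 models a remote write/RFO, 0 models a remote read. */
static enum mesi snoop_transition(enum mesi cur, int remote_write)
{
    if (cur == MESI_I)
        return MESI_I;
    if (remote_write)
        return MESI_I;       /* S -> I, E -> I, M -> I (after supplying data) */
    if (cur == MESI_M)
        return MESI_S;       /* write back the dirty line, then share it      */
    if (cur == MESI_E)
        return MESI_S;       /* silently downgrade to shared                  */
    return cur;              /* S stays S on a remote read                    */
}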


    Protocol: Cache States

    Every Tx Op allocates 2 CL entries (2 cycle allocation)

    Single CL entry also possible

    Effectively makes tx cache twice as big

    Two entry scheme has optimizations

    Allows roll back (and roll forward) w/o bus traffic

    LT allocates one entry (if LTX used properly)

    Second LTX cycle can be hidden on cache hit

Authors' decision appears to be somewhat arbitrary

    A dirty value originally read must either be

    WB to memory, or

    Allocated to XCOMMIT entry as its old entry

    avoids WBs to memory and improves performance

Tx Cache Line Entry, States and Replacement

Tx op allocation: each tx op allocates 2 lines - an XCOMMIT entry holding the old value and an XABORT entry holding the new value

On COMMIT: the XABORT (new value) entry becomes NORMAL and the XCOMMIT (old value) entry becomes EMPTY

On ABORT: the XCOMMIT (old value) entry becomes NORMAL and the XABORT (new value) entry becomes EMPTY

CL replacement: look for an EMPTY entry (no replacement needed); else replace a NORMAL entry; else replace an XCOMMIT entry (WB if DIRTY, then replace); if none is available, ABORT the tx / trap to SW

Cache line states:

Name      Access  Shared?  Modified?
INVALID   none    -        -
VALID     R       Yes      No
DIRTY     R, W    No       Yes
RESERVED  R, W    No       No

Tx tags:

Name     Meaning
EMPTY    contains no data
NORMAL   contains committed data
XCOMMIT  discard on commit
XABORT   discard on abort
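The tag machinery above fits in a few lines of C; the model below holds one XCOMMIT/XABORT pair for a single transactional write and shows that COMMIT and ABORT are just a pass over the tags (sizes and names mirror the slide; everything else is my own illustration).

#include <stdio.h>

#define TX_LINES 64                       /* 64 fully-associative lines, as above */

enum tx_tag { TAG_EMPTY, TAG_NORMAL, TAG_XCOMMIT, TAG_XABORT };

struct tx_line {
    enum tx_tag tag;
    unsigned    addr;
    unsigned    data;
};

static struct tx_line txcache[TX_LINES];

/* COMMIT: the new values (XABORT) become committed data, the saved old
 * values (XCOMMIT) are dropped. */
static void tx_commit(void)
{
    for (int i = 0; i < TX_LINES; i++) {
        if (txcache[i].tag == TAG_XABORT)       txcache[i].tag = TAG_NORMAL;
        else if (txcache[i].tag == TAG_XCOMMIT) txcache[i].tag = TAG_EMPTY;
    }
}

/* ABORT: the saved old values (XCOMMIT) become the committed data again,
 * the tentative new values (XABORT) are dropped. */
static void tx_abort(void)
{
    for (int i = 0; i < TX_LINES; i++) {
        if (txcache[i].tag == TAG_XCOMMIT)      txcache[i].tag = TAG_NORMAL;
        else if (txcache[i].tag == TAG_XABORT)  txcache[i].tag = TAG_EMPTY;
    }
}

int main(void)
{
    /* One tx write to address 0x40: old value 1 saved, new value 2 pending. */
    txcache[0] = (struct tx_line){ TAG_XCOMMIT, 0x40, 1 };
    txcache[1] = (struct tx_line){ TAG_XABORT,  0x40, 2 };
    tx_commit();                              /* or tx_abort() */
    printf("line0 tag=%d line1 tag=%d\n", txcache[0].tag, txcache[1].tag);
    return 0;
}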


    ISA: TM Verification Operations

Orphan = TACTIVE==TRUE && TSTATUS==FALSE

Tx continues to execute, but will fail at commit

Commit does not force write back to memory

Memory is written only when the CL is evicted or invalidated

Conditions for calling ABORT:

Interrupts

Tx cache overflow

VALIDATE [Mem]:
  if TSTATUS == TRUE:   return TRUE
  if TSTATUS == FALSE:  abort (for ALL entries: drop XABORT, set XCOMMIT to NORMAL);
                        TSTATUS = TRUE; TACTIVE = FALSE; return FALSE

ABORT:
  for ALL entries: drop XABORT, set XCOMMIT to NORMAL
  TSTATUS = TRUE; TACTIVE = FALSE

COMMIT:
  if TSTATUS == TRUE:   for ALL entries: drop XCOMMIT, set XABORT to NORMAL;
                        TSTATUS = TRUE; TACTIVE = FALSE; return TRUE
  if TSTATUS == FALSE:  ABORT; return FALSE


    TM Memory Access

LT reg, [Mem]    // search Tx cache

Is there an XABORT entry?   Y -> return its DATA
Is there a NORMAL entry?    Y -> retag it XABORT and allocate an XCOMMIT entry with the same DATA; return DATA
Tx cache miss -> issue a T_READ cycle
  OK   -> allocate XABORT DATA and XCOMMIT DATA entries; return DATA (CL state as in Goodman's protocol for LOAD: Valid)
  BUSY -> ABORT the tx (for ALL entries: drop XABORT, set XCOMMIT to NORMAL; TSTATUS=FALSE); return arbitrary DATA

LTX reg, [Mem]    // search Tx cache

Same flow as LT, except the miss issues a T_RFO cycle and the CL state follows Goodman's protocol for LOAD with intent to write (Reserved)

ST [Mem], reg    // ST writes to the Tx cache only!!!

Is there an XABORT entry?   Y -> write the NEW DATA into it
Is there a NORMAL entry?    Y -> retag it XCOMMIT (old DATA) and allocate an XABORT entry with the NEW DATA
Tx cache miss -> issue a T_RFO cycle
  OK   -> allocate XCOMMIT DATA (old value) and XABORT NEW DATA entries (CL state as in Goodman's protocol for STORE: Dirty)
  BUSY -> ABORT the tx (for ALL entries: drop XABORT, set XCOMMIT to NORMAL; TSTATUS=FALSE); subsequent reads may return arbitrary DATA

For reference - Tx op allocation: the XCOMMIT entry holds the old value, the XABORT entry holds the new value

Bus cycles:

Name    Kind     Meaning        New access
READ    regular  read value     shared
RFO     regular  read value     exclusive
WRITE   both     write back     exclusive
T_READ  Tx       read value     shared
T_RFO   Tx       read value     exclusive
BUSY    Tx       refuse access  unchanged

    Tx requests REFUSED by BUSY response

Tx aborts and retries (after exponential backoff?)

    Prevents deadlock or continual mutual aborts

Exponential backoff not implemented in HW

Performance is parameter sensitive

    Benchmarks appear not to be optimized
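The LT path in the first flowchart, rewritten as C-flavoured pseudocode; the helper functions are assumed cache-controller actions, not a real API, and LTX/ST differ only in the bus cycle (T_RFO) and which entry receives the new data.

enum tx_tag { TAG_EMPTY, TAG_NORMAL, TAG_XCOMMIT, TAG_XABORT };
struct tx_line { enum tx_tag tag; unsigned addr, data; };

/* Assumed helpers standing in for cache-controller behaviour. */
extern struct tx_line *tx_find(unsigned addr, enum tx_tag tag);
extern struct tx_line *tx_alloc(enum tx_tag tag, unsigned addr, unsigned data);
extern int  bus_t_read(unsigned addr, unsigned *data);   /* 1 = OK, 0 = BUSY */
extern void tx_abort_all(void);   /* drop XABORT, XCOMMIT -> NORMAL, TSTATUS = FALSE */

static unsigned lt(unsigned addr)
{
    struct tx_line *l;
    unsigned data;

    if ((l = tx_find(addr, TAG_XABORT)) != 0)    /* working copy already here   */
        return l->data;

    if ((l = tx_find(addr, TAG_NORMAL)) != 0) {  /* committed copy present      */
        l->tag = TAG_XABORT;                     /* it becomes the working copy */
        tx_alloc(TAG_XCOMMIT, addr, l->data);    /* keep the old value too      */
        return l->data;
    }

    if (bus_t_read(addr, &data)) {               /* miss: T_READ cycle, OK      */
        tx_alloc(TAG_XABORT,  addr, data);
        tx_alloc(TAG_XCOMMIT, addr, data);
        return data;
    }

    tx_abort_all();                              /* BUSY: this tx is now an orphan */
    return 0;                                    /* "arbitrary data"            */
}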


    TM Snoopy Cache Actions

    Both Regular and Tx Cache snoop on the bus

    Main memory responds to all L1 read misses

Main memory responds to cache line replacement WRITEs

    If TSTATUS==FALSE, Tx cache acts as Regular cache (for NORMAL entries)

Line Replacement Action - Regular or Tx $:

Bus request   Action
WRITE         write data on bus, written to memory

Response to SNOOP Actions - Regular $:

Bus request        Current state        Next state / Action
READ  (regular)    VALID/DIRTY/RESV.    VALID / return data
RFO   (regular)    VALID/DIRTY/RESV.    INVALID / return data
T_READ  (Tx)       VALID                VALID / return data
T_READ  (Tx)       DIRTY/RESV.          VALID / return data
T_RFO   (Tx)       VALID/DIRTY/RESV.    INVALID / return data

Response to SNOOP Actions - Tx $:

Bus request        Tx tag                   Current state        Next state / Action
READ  (regular)    NORMAL                   VALID/DIRTY/RESV.    VALID / return data
RFO   (regular)    NORMAL                   VALID/DIRTY/RESV.    INVALID / return data
T_READ  (Tx)       NORMAL/XCOMMIT/XABORT    VALID                VALID / return data
T_READ  (Tx)       NORMAL/XCOMMIT/XABORT    DIRTY/RESV.          - / BUSY signal
T_RFO   (Tx)       NORMAL/XCOMMIT/XABORT    DIRTY/RESV.          - / BUSY signal
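The BUSY rows in the last table are the actual conflict-detection mechanism; a minimal decision function for the Tx cache's snoop response might look like this (a model of the table only, with cases the table does not cover left out).

#include <stdbool.h>

enum cl_state { CL_INVALID, CL_VALID, CL_DIRTY, CL_RESERVED };
enum snoop_reply { REPLY_NONE, REPLY_DATA, REPLY_BUSY };

/* How the Tx cache answers a snooped transactional request (T_READ or
 * T_RFO) for a line it holds, per the table above. */
static enum snoop_reply tx_snoop(enum cl_state s, bool is_t_rfo)
{
    if (s == CL_DIRTY || s == CL_RESERVED)
        return REPLY_BUSY;       /* line belongs to a live tx: refuse, so the
                                    requesting tx aborts and retries          */
    if (s == CL_VALID && !is_t_rfo)
        return REPLY_DATA;       /* unmodified copy can safely be shared      */
    return REPLY_NONE;           /* not held, or a case the table omits       */
}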


    Test Methodology

TM implemented in Proteus - an execution-driven simulator from MIT

Two versions of the TM implementation:

Goodman's snoopy protocol for bus-based arch

Chaiken directory protocol for the (simulated) Alewife machine

32 processors

Memory latency of 4 clock cycles

1st-level cache latency of 1 clock cycle

2048x8B direct-mapped regular cache

64x8B fully-associative Tx cache

    Strong Memory Consistency Model

Compare TM to 4 different implementation techniques:

SW: TTS (test-and-test-and-set) spinlock with exponential backoff [TTS Lock] (sketched after this slide)

SW queuing [MCS Lock]

A process unable to get the lock puts itself in the queue, eliminating poll time

HW: LL/SC (LOAD_LINKED/STORE_COND) with exponential backoff [LL/SC Direct/Lock]

HW queuing [QOSB]

Queue maintenance incorporated into the cache-coherency protocol

Goodman's QOSB protocol - queue head in memory, elements in unused CLs

    Benchmarks

Counting - LL/SC directly used on the single-word counter variable

    Producer & Consumer

    Doubly-Linked List

    All benchmarks do fixed amount of work
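For reference, here is what the [TTS Lock] baseline looks like in C11 atomics - a test-and-test-and-set spinlock with randomized exponential backoff; this is my own sketch of the well-known technique, not the simulator code the authors ran.

#include <stdatomic.h>
#include <stdlib.h>

#define BACKOFF_MIN 4
#define BACKOFF_MAX 4096

static atomic_int tts_lock = 0;               /* 0 = free, 1 = held */

static void tts_acquire(void)
{
    unsigned backoff = BACKOFF_MIN;
    for (;;) {
        /* "test": spin on plain reads so contended waiting stays in-cache */
        while (atomic_load_explicit(&tts_lock, memory_order_relaxed))
            ;
        /* "test-and-set": one winner gets the lock */
        if (!atomic_exchange_explicit(&tts_lock, 1, memory_order_acquire))
            return;
        /* lost the race: wait a random, exponentially growing delay */
        unsigned wait = (unsigned)rand() % backoff;
        while (wait--)
            ;
        if (backoff < BACKOFF_MAX)
            backoff *= 2;
    }
}

static void tts_release(void)
{
    atomic_store_explicit(&tts_lock, 0, memory_order_release);
}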


    Counting Benchmark

    N processes increment shared counter 2^16/n times, n=1 to 32

    Short CS with 2 shared-mem accesses, high contention

In the absence of contention, TTS makes 5 references to memory for each increment:

RD + test-and-set to acquire the lock + RD and WR in the CS + release of the lock

TM requires only 3 memory accesses:

RD and WR to the counter, and then COMMIT (no bus traffic)

void process(int work)
{
    int success = 0, backoff = BACKOFF_MIN;
    unsigned wait;

    while (success < work) {
        ST(&counter, LTX(&counter) + 1);
        if (COMMIT()) {
            success++;
            backoff = BACKOFF_MIN;
        } else {
            wait = random() % (01 << backoff);   /* exponential backoff */
            while (wait--);
            if (backoff < BACKOFF_MAX) backoff++;
        }
    }
}


    Counting Results

LL/SC outperforms TM

LL/SC is applied directly to the counter variable; no explicit commit is required

For the other benchmarks this advantage is lost, as the shared object spans multiple words; then the only way to use LL/SC is as a spin lock

TM has higher throughput than all other mechanisms at most levels of concurrency

TM uses no explicit locks and so makes fewer accesses to memory (LL/SC - 2, TM - 3, TTS - 5)

[Figure (copied from paper): total cycles needed to complete the benchmark vs. number of concurrent processes, for the BUS and NW architectures; curves for TTS Lock, MCS Lock (SW queue), QOSB (HW queue), TM, and LL/SC Direct]


    Prod/Cons Benchmark

N processes share a bounded buffer, initially empty

Half produce items, half consume items

    Benchmark finishes when 2^16 operations have completed

typedef struct {
    Word deqs;
    Word enqs;
    Word items[QUEUE_SIZE];
} queue;

unsigned queue_deq(queue *q)
{
    unsigned head, tail, result, wait;
    unsigned backoff = BACKOFF_MIN;

    while (1) {
        result = QUEUE_EMPTY;
        tail = LTX(&q->enqs);
        head = LTX(&q->deqs);
        if (head != tail) {                          /* queue not empty? */
            result = LT(&q->items[head % QUEUE_SIZE]);
            ST(&q->deqs, head + 1);                  /* advance counter */
        }
        if (COMMIT()) break;
        wait = random() % (01 << backoff);           /* exponential backoff */
        while (wait--);
        if (backoff < BACKOFF_MAX) backoff++;
    }
    return result;
}

SOURCE: from paper



    Prod/Cons Results

In the bus arch, throughput is almost flat for all techniques; TM yields higher throughput, but not as dramatically as in the counting benchmark

In the NW arch, all throughputs suffer as contention increases; TM suffers the least and wins

[Figure (copied from paper): cycles needed to complete the benchmark vs. number of concurrent processes, for the BUS and NW architectures; curves for TTS Lock, MCS Lock (SW queue), QOSB (HW queue), TM, and LL/SC Direct]


    Doubly-Linked List Benchmark

N processes share a doubly-linked list anchored by Head and Tail pointers

A process dequeues an item by removing the item pointed to by Tail, and then enqueues it by threading it onto the list at Head

    Process that removes last item sets both Head & Tail to NULL

    Process that inserts item into an empty list sets both Head & Tail to point to the new item

    Benchmark finishes when 2^16 operations have completed

SOURCE: from paper

typedef struct list_elem {
    struct list_elem *next;   /* next to dequeue */
    struct list_elem *prev;   /* previously enqueued */
    int value;
} entry;

shared entry *Head, *Tail;

void list_enq(entry *new)
{
    entry *old_tail;
    unsigned backoff = BACKOFF_MIN;
    unsigned wait;

    new->next = new->prev = NULL;
    while (TRUE) {
        old_tail = (entry *)LTX(&Tail);
        if (VALIDATE()) {
            ST(&new->prev, old_tail);
            if (old_tail == NULL) ST(&Head, new);
            else ST(&old_tail->next, new);
            ST(&Tail, new);
            if (COMMIT()) return;
        }
        wait = random() % (01 << backoff);   /* exponential backoff */
        while (wait--);
        if (backoff < BACKOFF_MAX) backoff++;
    }
}


    Doubly-Linked List Results

Concurrency is difficult to exploit by conventional means

State-dependent concurrency is not simple to recognize using locks

An enqueuer doesn't know whether it must lock the tail pointer until after it has locked the head pointer, and vice versa for dequeuers

Queue non-empty: each tx modifies Head or Tail but not both, so enqueuers can (in principle) execute without interference from dequeuers, and vice versa

Queue empty: a tx must modify both pointers, so enqueuers and dequeuers conflict

The locking techniques use only a single lock

Lower throughput, as a single lock prohibits overlapping of enqueues and dequeues

    TM naturally permits this kind of parallelism

[Figure (copied from paper): cycles needed to complete the benchmark vs. number of concurrent processes, for the BUS and NW architectures; curves for TTS Lock, MCS Lock (SW queue), QOSB (HW queue), TM, and LL/SC Direct]


Blue Gene/Q Processor

First Tx memory HW implementation

Used in the Sequoia supercomputer built by IBM for Lawrence Livermore Labs

Due to be completed in 2012

Sequoia is a 20-petaflop machine

Blue Gene/Q will have 18 cores

One dedicated to OS tasks

    One held in reserve (fault tolerance?)

    4 way hyper-threaded, 64-bit PowerPC A2 based

    Sequoia may use up to 100k Blue Genes

    1.6 GHz, 205 Gflops, 55W, 1.47 B transistors, 19 mm sq


TM on Blue Gene/Q

Transactional memory only works intra-chip

Inter-chip conflicts are not detected

Uses a tag scheme on the L2 cache memory

Tags detect load/store data conflicts in a tx

    Cache data has version tag

    Cache can store multiple versions of same data

SW commences the tx, does its work, then tells HW to attempt commit

    If unsuccessful, SW must re-try

Appears to be similar to the approach on Sun's Rock processor

Ruud Haring on TM: "a lot of neat trickery, sheer genius"

Full implementation much more complex than the paper suggests?

Blue Gene/Q is the exception and does not mark wide-scale acceptance of HW TM


Pros & Cons Summary

Pros

    TM matches or outperforms atomic update locking techniques for simple benchmarks

    Uses no locks and thus has fewer memory accesses

    Avoids priority inversion, convoying and deadlock

Easy programming semantics

Complex NB scenarios such as the doubly-linked list are more realizable through TM

Allows true concurrency and hence is highly scalable (for smaller tx sizes)

    Cons

TM cannot perform operations that cannot be undone, including most I/O

Single-cycle commit and abort restrict the size of the 1st-level tx cache and hence the tx size

    Is it good for anything other than data containers?

    Portability is restricted by transactional cache size

    Still SW dependent

    Algorithm tuning benefits from SW based adaptive backoff

    Tx cache overflow handling

    Longer Tx increases the likelihood of being aborted by an interrupt or scheduling conflict

    Tx should be able to complete within one scheduling time slot

    Weaker consistency models require explicit barriers at start and end, impacting performance

    Other complications make it more difficult to implement in HW

    Multi-level caches

Nested transactions (required for composability)

Cache coherency complexity on many-core SMP and NUMA archs

    Theoretically subject to starvation

An adaptive backoff strategy is the suggested fix - the authors used exponential backoff

Otherwise a queuing mechanism is needed

    Poor debugger support



Summary

TM is a novel multi-processor architecture which allows easy lock-free multi-word synchronization in HW

    Leveraging concept of Database Transactions

Overcoming the single/double-word limitation

Exploiting cache-coherency mechanisms

Wish                                                Granted?  Comment
Simple programming model                            Yes       Very elegant
Avoids priority inversion, convoying and deadlock   Yes       Inherent in the NB approach
Equivalent or better performance than lock-based    Yes       For very small and short txs
No restrictions on data set size or contiguity      No        Limited by practical considerations
Composable                                          No        Possible but would add significant HW complexity
Wait-free                                           No        Possible but would add significant HW complexity


    References

M. P. Herlihy and J. E. B. Moss. Transactional Memory: Architectural Support for Lock-Free Data Structures. Technical Report 92/07, Digital Cambridge Research Lab, One Kendall Square, Cambridge MA 02139, December 1992.

    Appendix


Appendix - Linearizability: Herlihy's Correctness Condition

    Invocation of an object/function followed by a response

    History - sequence of invocations and responses made by a set of threads

    Sequential history - a history where each invocation is followed immediately by its response

Serializable - the history can be reordered to form a sequential history that is consistent with the sequential definition of the object

Linearizable - a serializable history in which each response that preceded an invocation in the original history also precedes it in the sequential reordering

    Object is linearizable if all of its usage histories may be linearized.

    A invokes lock | B invokes lock | A fails | B succeeds

    An example history

    Reordering 1 - A sequential history but not serializable reordering

    Reordering 2 - A linearizable reordering

    A invokes lock | A fails | B invokes lock | B succeeds

    B invokes lock | B succeeds | A invokes lock | A fails



Appendix - Goodman's Write-Once Protocol

D - line present only in one cache and differs from main memory. Cache must snoop for read requests.

R - line present only in one cache and matches main memory.

V - line may be present in other caches and matches main memory.

    I - Invalid.

Writes to V, I are write-through

Writes to D, R are write-back



Appendix - Tx Memory and dB Txs

    Differences with dB Txs

    Disk vs. memory - HW better for Txs than dB systems

    Txs do not need to be durable (post termination)

    dB is a closed system; Tx interacts with non-Tx operations

    Tx must be backward compatible with existing programming environment

Success in the database transaction field does not translate directly to transactional memory