CS-510
Transactional Memory: Architectural Support for Lock-Free Data Structures
By Maurice Herlihy and J. Eliot B. Moss, 1993
Presented by Steve Coward, PSU Fall 2011, CS-510
11/02/2011
Slide content heavily borrowed from Ashish Jha, PSU SP 2010, CS-510
05/04/2010
2CS-510
Agenda
- Lock-based synchronization
- Non-blocking synchronization
- TM (Transactional Memory) concept
- A HW-based TM implementation
  - Core additions
  - ISA additions
  - Transactional cache
  - Cache coherence protocol changes
- Test methodology and results
- Blue Gene/Q
- Summary
3CS-510
Lock-based synchronization
- Generally easy to use, except not composable
- Generally easy to reason about
- Does not scale well due to lock arbitration/communication
- Pessimistic synchronization approach
- Uses mutual exclusion: blocking, only ONE process/thread can execute the critical section at a time
Three classic lock pathologies (Proc A / Proc B / Proc C diagram):
- Priority inversion: a low-priority process holds Lock X; a high-priority process preempts it but cannot proceed, while a medium-priority process runs ahead of the lock holder.
- Convoying: the holder of Lock X is de-scheduled (quantum expiration, page fault, other interrupts); every other process trying to get Lock X cannot proceed and piles up behind it, with high context-switch overhead.
- Deadlock: Proc A holds Lock X and requests Lock Y, while Proc B holds Lock Y and requests Lock X; neither can proceed.
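The benchmarks later in the deck use a test-and-test-and-set (TTS) spinlock as the lock-based baseline. As a minimal sketch (in C11 atomics, names illustrative), a TTS lock spins on a plain load of the locally cached value and only issues the expensive atomic exchange when the lock actually looks free:

```c
#include <stdatomic.h>

/* Test-and-test-and-set (TTS) spinlock sketch. The inner load spins on
 * the locally cached copy, so waiters generate bus traffic only when
 * the lock appears free; the exchange is the actual acquire attempt. */
typedef atomic_int tts_lock_t;   /* 0 = free, 1 = held */

void tts_lock(tts_lock_t *l) {
    for (;;) {
        while (atomic_load_explicit(l, memory_order_relaxed))
            ;                                 /* "test": spin on cached value */
        if (!atomic_exchange(l, 1))           /* "test-and-set": try to acquire */
            return;                           /* exchange saw 0: lock acquired */
    }
}

void tts_unlock(tts_lock_t *l) {
    atomic_store(l, 0);
}
```

Even with this optimization, a holder that is preempted while inside the critical section still exhibits all three pathologies above.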
4CS-510
Lock-Free Synchronization
- Non-blocking: optimistic, does not use mutual exclusion
- Uses RMW operations such as CAS and LL/SC, limited to operations on single words or double words
- Avoids common problems seen with conventional techniques, such as priority inversion, convoying, and deadlock
- Difficult programming logic
- Even in the absence of the above problems, lock-free synchronization as implemented in SW does not perform as well as a lock-based approach
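A minimal sketch of the lock-free RMW style described above, using C11 CAS (real LL/SC differs in that SC fails on any intervening write to the reservation granule, not just a value change):

```c
#include <stdatomic.h>

/* Lock-free counter increment via compare-and-swap: read the current
 * value, compute the new one, and retry if another thread changed the
 * counter in between. No lock is held, so a preempted thread cannot
 * block the others (no priority inversion, convoying, or deadlock). */
void cas_increment(atomic_long *counter) {
    long old = atomic_load(counter);
    /* On failure, compare_exchange reloads `old` with the value it
     * actually observed, so the loop retries from fresh state. */
    while (!atomic_compare_exchange_weak(counter, &old, old + 1))
        ;   /* another thread updated the counter first; retry */
}
```

Note the single-word restriction: this works only because the entire shared state fits in one atomically updatable location.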
5CS-510
Non-Blocking Wish List
- Simple programming usage model
- Avoids priority inversion, convoying, and deadlock
- Equivalent or better performance than a lock-based approach, with less data copying
- No restrictions on data set size or contiguity
- Composable
- Wait-free
Enter Transactional Memory (TM) …
6CS-510
What is a transaction (tx)?
A tx is a finite sequence of revocable operations, executed by a process, that satisfies two properties:
- Serializable
  - Steps of one tx are not seen to be interleaved with the steps of another tx
  - Tx's appear to all processes to execute in the same order
- Atomic
  - Each tx either aborts or commits
  - Abort causes all tentative changes of the tx to be discarded
  - Commit causes all tentative changes of the tx to become, effectively instantaneously, globally visible

This paper assumes that a process executes at most one tx at a time:
- Tx's do not nest (though nesting seems like a nice feature); hence these tx's are not composable
- Tx's do not overlap (overlap seems less useful)
7CS-510
What is Transactional Memory?
Transactional Memory (TM) is a lock-free, non-blocking concurrency control mechanism, based on tx's, that allows a programmer to define customized read-modify-write operations that apply to multiple, independently chosen memory locations.
Non-blocking:
- Multiple tx's optimistically execute their critical sections in parallel (on different CPUs)
- If a conflict occurs, only one can succeed; the others can retry
8CS-510
Basic Transaction Concept

Proc A:                               Proc B:
  beginTx:                              beginTx:
    [A]=1; [B]=2; [C]=3;                  z=[A]; y=[B]; [C]=y;
    x=_VALIDATE();                        x=_VALIDATE();
    IF (x) _COMMIT();                     IF (x) _COMMIT();
    ELSE { _ABORT(); GOTO beginTx; }      ELSE { _ABORT(); GOTO beginTx; }

// _COMMIT: instantaneously make all the above changes visible to all Procs
// _ABORT: discard all the above changes; may try again

- Atomicity: ALL or NOTHING
- True concurrency, optimistic execution
- How is validity determined? How to COMMIT? How to ABORT? Changes must be revocable!
- Serialization is ensured if only one tx commits and the others abort
- Linearization is ensured by the (combined) atomicity of the validate, commit, and abort operations
9CS-510
TM vs. SW Non-Blocking

Generic TM flow (HW makes the changes permanent at commit):
LT or LTX   // READ set of locations
VALIDATE    // CHECK consistency of READ values (fail: retry)
// Do work (critical section)
ST          // MODIFY set of locations
COMMIT      // pass: done; fail: retry

// TM
while (1) {
    curCnt = LTX(&cnt);
    if (Validate()) {
        int c = curCnt + 1;
        ST(&cnt, c);
        if (Commit()) return;
    }
}

// SW non-blocking
while (1) {
    curCnt = LL(&cnt);
    // Copy object
    if (Validate(&cnt)) {
        int c = curCnt + 1;
        if (SC(&cnt, c)) return;
    }
}
10CS-510
HW or SW TM? TM may be implemented in HW or SW
- Many SW TM library packages exist: C++, C#, Java, Haskell, Python, etc.
  - 2-3 orders of magnitude slower than other synchronization approaches
- This paper focuses on a HW implementation
  - HW offers significantly greater performance than SW for reasonable cost
  - Minor tweaks to CPUs*: core, ISA, caches, bus and cache-coherency protocols
  - Leverages the cache-coherency protocol to maintain tx memory consistency
  - But a HW implementation without SW cache-overflow handling is problematic

* See Blue Gene/Q
11CS-510
Core: TM Updates
Each CPU maintains two additional status register bits:
- TACTIVE: flag indicating whether a transaction is in progress on this CPU; implicitly set upon entering a transaction
- TSTATUS: flag indicating whether the active transaction has conflicted with another transaction

TACTIVE  TSTATUS  MEANING
False    DC       No tx active
True     False    Orphan tx: executing, conflict detected, will abort
True     True     Active tx: executing, no conflict yet detected
12CS-510
ISA: TM Memory Operations
- LT: load from shared memory to register
- ST: tentative write of register to shared memory; becomes visible upon successful COMMIT
- LTX: LT plus intent to write to the same location later; a performance optimization for early conflict detection

ISA: TM Verification Operations
- VALIDATE: validate consistency of the read set; avoids a misbehaving orphan
- ABORT: unconditionally discard all updates; used to handle extraordinary circumstances
- COMMIT: attempt to make tentative updates permanent
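To make the programming model concrete, here is a single-threaded mock of these primitives. Everything in it is an illustrative assumption (the names `tx_begin`, `tx_lt`, `tx_st`, and the `tx_conflict` flag are invented for this sketch); in the real design, conflicts are detected by the transactional cache and the coherence protocol, not by a flag:

```c
#include <stdbool.h>
#include <stddef.h>

/* Single-threaded mock of the paper's primitives, illustrating control
 * flow only: tentative STs are buffered in a write set and become
 * visible atomically at COMMIT, or are discarded on conflict. */
#define TX_MAX_WRITES 8

static struct { long *addr; long val; } tx_write_set[TX_MAX_WRITES];
static size_t tx_n_writes;
static bool   tx_conflict;     /* test hook standing in for a remote conflict */

void tx_begin(void)            { tx_n_writes = 0; tx_conflict = false; }
long tx_lt(const long *addr)   { return *addr; }          /* LT: tx read */
void tx_st(long *addr, long v) {                          /* ST: tentative write */
    tx_write_set[tx_n_writes].addr = addr;
    tx_write_set[tx_n_writes].val  = v;
    tx_n_writes++;
}
bool tx_validate(void)         { return !tx_conflict; }   /* VALIDATE */
bool tx_commit(void) {                                    /* COMMIT (or abort) */
    if (tx_conflict) { tx_n_writes = 0; return false; }   /* discard write set */
    for (size_t i = 0; i < tx_n_writes; i++)              /* publish write set */
        *tx_write_set[i].addr = tx_write_set[i].val;
    tx_n_writes = 0;
    return true;
}
```

Typical usage is the retry loop from the previous slides: `tx_begin(); tx_st(&cnt, tx_lt(&cnt) + 1); if (!tx_commit()) retry;`.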
13CS-510
TM Conflict Detection

LT   reg, [MEM]    // pure Tx READ
LTX  reg, [MEM]    // Tx READ with intent to WRITE later
ST   [MEM], reg    // Tx WRITE to local cache; value globally visible only after COMMIT
LOAD  reg, [MEM]   // non-Tx READ
STORE [MEM], reg   // non-Tx WRITE - careful!

- LOAD and STORE are supported but do not affect the tx's READ or WRITE set
  - Why would a STORE be performed within a tx? Left to the implementation
- Interaction between tx and non-tx operations to the same address is generally a programming error
  - Consider LOAD/STORE as committed tx's with conflict potential; otherwise the outcome is non-linearizable

Tx dependencies define the abort conditions (subject to arbitration, as in the reader/writer paradigm):
- READ-SET: locations read; WRITE-SET: locations written; DATA-SET: their union
- Has the DATA-SET been updated by another tx? Has the WRITE-SET been read by any other tx?
  - If either: ABORT, discarding changes to the WRITE-SET
  - If neither: COMMIT, making the WRITE-SET visible to other processes
14CS-510
TM Cache Architecture

Two primary, mutually exclusive first-level caches:
- In the absence of TM, non-Tx ops use the same caches, control logic, and coherency protocols as a non-Tx architecture
  - Avoids impairing the design/performance of the regular cache
  - Prevents the regular cache set size from limiting the max tx size
- Accessed sequentially based on instruction type!

Tx cache:
- Fully associative; otherwise how would tx address collisions be handled?
- Single-cycle COMMIT and ABORT
- Small; size is implementation dependent (for scale, the Intel Core i5 has a first-level TLB with 32 entries!)
- Holds all tentative writes without propagating them to other caches or memory
- May be extended to act as a victim cache
- Upon ABORT: modified lines are set to the INVALID state
- Upon COMMIT: lines can be snooped by other processors, and are written back to memory upon replacement

Cache hierarchy (per core, from the diagram):
- 1st level, regular L1D: direct-mapped, exclusive, 2048 lines x 8B, 1-clk latency
- 1st level, Tx cache: fully associative, exclusive, 64 lines x 8B
- 2nd/3rd level (L2D, L3D) and main memory behind them (4-clk latency)
15CS-510
HW TM Leverages Cache Protocol
- M: cache line only in current cache, and is modified
- E: cache line only in current cache, and is not modified
- S: cache line in current cache and possibly other caches, and is not modified
- I: invalid cache line

Tx commit logic must detect the following events (akin to R/W locking):
- Local read, remote write: S -> I, E -> I
- Local write, remote write: M -> I
- Local write, remote read: M, on snoop read, is written back -> S

Works with bus-based (snoopy cache) or network-based (directory) architectures.
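The transitions listed above can be written as a small table-driven function. This is a sketch of just those transitions; as a simplifying assumption, a local read miss is shown loading to E (with other sharers present it would load to S instead):

```c
/* MESI next-state function covering the events the slide enumerates.
 * Simplification: LOCAL_READ on an invalid line assumes no sharers (E). */
typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_state;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } mesi_event;

mesi_state mesi_next(mesi_state s, mesi_event e) {
    switch (e) {
    case LOCAL_READ:
        return s == MESI_I ? MESI_E : s;  /* assumed: no other sharers */
    case LOCAL_WRITE:
        return MESI_M;                    /* line becomes exclusive + modified */
    case REMOTE_READ:
        /* M (or E) line is snooped: write back if dirty, then share */
        return (s == MESI_M || s == MESI_E) ? MESI_S : s;
    case REMOTE_WRITE:
        return MESI_I;                    /* any local copy is invalidated */
    }
    return s;
}
```

The tx commit logic piggybacks on exactly these invalidation/snoop events to detect read-write and write-write conflicts.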
16CS-510
Protocol: Cache States

Every Tx op allocates 2 cache-line entries (2-cycle allocation): an XCOMMIT entry holding the old value and an XABORT entry holding the new value.
- A single-entry scheme is also possible; it would effectively make the tx cache twice as big
- The two-entry scheme has optimizations:
  - Allows roll back (and roll forward) without bus traffic
  - LT allocates one entry (if LTX is used properly)
  - The second allocation cycle can be hidden on a cache hit
- The authors' decision appears to be somewhat arbitrary
- A dirty value "originally read" must either be written back to memory, or allocated to an XCOMMIT entry as its "old" entry, which avoids write-backs to memory and improves performance

On COMMIT: XABORT entries become NORMAL, XCOMMIT entries become EMPTY.
On ABORT: XCOMMIT entries become NORMAL, XABORT entries become EMPTY.

Cache line states:
Name      Access  Shared?  Modified?
INVALID   none    -        -
VALID     R       Yes      No
DIRTY     R, W    No       Yes
RESERVED  R, W    No       No

Tx tags:
Name      Meaning
EMPTY     contains no data
NORMAL    contains committed data
XCOMMIT   discard on commit
XABORT    discard on abort

Cache line replacement: search for an EMPTY entry; if none, replace a NORMAL entry; if none, replace an XCOMMIT entry (writing it back first if DIRTY); if none remain, ABORT the tx / trap to SW.
17CS-510
ISA: TM Verification Operations

- Orphan = TACTIVE==TRUE && TSTATUS==FALSE
  - The tx continues to execute, but will fail at commit
- Commit does not force a write back to memory
  - Memory is written only when a cache line is evicted or invalidated
- Conditions for calling ABORT: interrupts, tx cache overflow

VALIDATE [Mem]: if TSTATUS is TRUE, return TRUE; otherwise (orphan) set TSTATUS=TRUE and TACTIVE=FALSE, and return FALSE.

ABORT: for ALL entries, drop XABORT and set XCOMMIT to NORMAL; set TSTATUS=TRUE, TACTIVE=FALSE.

COMMIT: if TSTATUS is FALSE, ABORT and return FALSE; if TRUE, for ALL entries drop XCOMMIT and set XABORT to NORMAL, set TSTATUS=TRUE, TACTIVE=FALSE, and return TRUE.
18CS-510
TM Memory Access

LT reg, [Mem]  // search Tx cache
- XABORT entry present? Return its data.
- Else NORMAL entry present? Retag it XABORT, allocate an XCOMMIT copy of the data, and return the data.
- Else (Tx cache miss): issue a T_READ bus cycle.
  - OK: allocate XABORT and XCOMMIT entries for the returned data; CL state as in Goodman's protocol for LOAD (VALID); return the data.
  - BUSY: abort the tx (for ALL entries, drop XABORT and set XCOMMIT to NORMAL; TSTATUS=FALSE); return arbitrary data.

LTX reg, [Mem]  // search Tx cache
- Same as LT, except a miss issues a T_RFO (read-for-ownership) cycle and the CL state follows Goodman's protocol for LOAD (RESERVED).

ST [Mem], reg  // ST writes to the Tx cache only!!!
- Same lookup, but the XABORT entry receives the NEW data while the XCOMMIT entry keeps the old data; a miss issues a T_RFO cycle; CL state as in Goodman's protocol for STORE (DIRTY); BUSY aborts the tx as above.

Bus cycles (for reference):
Name    Kind     Meaning        New Access
READ    regular  read value     shared
RFO     regular  read value     exclusive
WRITE   both     write back     exclusive
T_READ  Tx       read value     shared
T_RFO   Tx       read value     exclusive
BUSY    Tx       refuse access  unchanged

Tx op allocation (for reference): XCOMMIT entry holds the old value, XABORT entry the new value.

Tx requests REFUSED by a BUSY response:
- The tx aborts and retries (after exponential backoff?)
- Prevents deadlock or continual mutual aborts
- Exponential backoff is not implemented in HW
  - Performance is parameter sensitive
  - The benchmarks appear not to be optimized
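The SW backoff used by the paper's benchmarks can be sketched as follows (the helper name `backoff_once` and the particular BACKOFF_MIN/BACKOFF_MAX values are assumptions for illustration; the paper notes performance is sensitive to these parameters):

```c
#include <stdlib.h>

/* Bounded exponential backoff, applied after an aborted transaction:
 * spin for a random number of iterations below 2^backoff, then widen
 * the window for the next retry, up to a cap. Spreading retries apart
 * reduces the chance of repeated mutual aborts. */
#define BACKOFF_MIN 2      /* assumed tuning values */
#define BACKOFF_MAX 16

unsigned backoff_once(unsigned backoff) {
    unsigned wait = (unsigned) rand() % (1u << backoff);
    while (wait--)
        ;                  /* busy-wait; no memory traffic while spinning */
    return backoff < BACKOFF_MAX ? backoff + 1 : backoff;
}
```

A caller resets the window to BACKOFF_MIN after a successful commit, exactly as the benchmark listings below do.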
19CS-510
TM: Snoopy Cache Actions

- Both the regular and Tx caches snoop on the bus
- Main memory responds to all L1 read misses
- Main memory responds to cache-line-replacement WRITEs
- If TSTATUS==FALSE, the Tx cache acts as a regular cache (for NORMAL entries)

Line replacement action (regular or Tx cache):
Bus Request  Action
WRITE        write data on bus; written to memory

Response to snoop actions, regular cache:
Bus Request  Current State       Next State/Action
Read         VALID/DIRTY/RESV.   VALID / return data
RFO          VALID/DIRTY/RESV.   INVALID / return data
T_Read       VALID               VALID / return data
T_Read       DIRTY/RESV.         VALID / return data
T_RFO        VALID/DIRTY/RESV.   INVALID / return data

Response to snoop actions, Tx cache:
Bus Request  Tx Tag                  Current State       Next State/Action
Read         NORMAL                  VALID/DIRTY/RESV.   VALID / return data
RFO          NORMAL                  VALID/DIRTY/RESV.   INVALID / return data
T_Read       NORMAL/XCOMMIT/XABORT   VALID               VALID / return data
T_Read       NORMAL/XCOMMIT/XABORT   DIRTY/RESV.         unchanged / BUSY signal
T_RFO        NORMAL/XCOMMIT/XABORT   DIRTY/RESV.         unchanged / BUSY signal
20CS-510
Test Methodology
TM implemented in Proteus, an execution-driven simulator from MIT
- Two versions of the TM implementation:
  - Goodman's snoopy protocol for a bus-based architecture
  - Chaiken directory protocol for a (simulated) Alewife machine
- 32 processors; memory latency of 4 clock cycles; 1st-level cache latency of 1 clock cycle
- 2048x8B direct-mapped regular cache; 64x8B fully-associative Tx cache
- Strong memory consistency model

Compare TM to 4 other implementation techniques:
- SW
  - TTS (test-and-test-and-set) spinlock with exponential backoff [TTS Lock]
  - SW queuing [MCS Lock]: a process unable to take the lock puts itself in the queue, eliminating poll time
- HW
  - LL/SC (LOAD_LINKED/STORE_CONDITIONAL) with exponential backoff [LL/SC Direct/Lock]
  - HW queuing [QOSB]: queue maintenance incorporated into the cache-coherency protocol (Goodman's QOSB protocol: queue head in memory, elements in unused cache lines)

Benchmarks:
- Counting (LL/SC used directly on the single-word counter variable)
- Producer & Consumer
- Doubly-Linked List
- All benchmarks do a fixed amount of work
21CS-510
Counting Benchmark
- N processes increment a shared counter 2^16/n times, n = 1 to 32
- Short critical section with 2 shared-memory accesses; high contention
- In the absence of contention, TTS makes 5 references to memory for each increment: RD + test-and-set to acquire the lock + RD and WR in the CS + release the lock
- TM requires only 3 memory accesses: RD & WR to the counter and then COMMIT (no bus traffic)
void process(int work)
{
    int success = 0, backoff = BACKOFF_MIN;
    unsigned wait;
    while (success < work) {
        ST(&counter, LTX(&counter) + 1);
        if (COMMIT()) {
            success++;
            backoff = BACKOFF_MIN;
        } else {
            wait = random() % (01 << backoff);
            while (wait--);
            if (backoff < BACKOFF_MAX)
                backoff++;
        }
    }
}
SOURCE: from paper
22CS-510
Counting Results

- LL/SC outperforms TM
  - LL/SC is applied directly to the counter variable; no explicit commit required
  - For the other benchmarks this advantage is lost: the shared object spans multiple words, so the only way to use LL/SC is as a spin lock
- TM has higher throughput than all the other mechanisms at most levels of concurrency
  - TM uses no explicit locks and so makes fewer accesses to memory (LL/SC: 2, TM: 3, TTS: 5)
[Figure copied from paper: total cycles needed to complete the benchmark vs. number of concurrent processes, for BUS and NW architectures; curves: TTS Lock, MCS Lock (SW queue), QOSB (HW queue), TM, LL/SC Direct]
23CS-510
Prod/Cons Benchmark
- N processes share a bounded buffer, initially empty: half produce items, half consume items
- Benchmark finishes when 2^16 operations have completed
typedef struct {
    Word deqs;
    Word enqs;
    Word items[QUEUE_SIZE];
} queue;

unsigned queue_deq(queue *q)
{
    unsigned head, tail, result, wait;
    unsigned backoff = BACKOFF_MIN;
    while (1) {
        result = QUEUE_EMPTY;
        tail = LTX(&q->enqs);
        head = LTX(&q->deqs);
        if (head != tail) {                  /* queue not empty? */
            result = LT(&q->items[head % QUEUE_SIZE]);
            ST(&q->deqs, head + 1);          /* advance counter */
        }
        if (COMMIT()) break;
        wait = random() % (01 << backoff);   /* abort => backoff */
        while (wait--);
        if (backoff < BACKOFF_MAX) backoff++;
    }
    return result;
}
SOURCE: from paper
24CS-510
Prod/Cons Results
- On the bus architecture, throughput is almost flat for all techniques; TM yields higher throughput, though not as dramatically as in the counting benchmark
- On the NW architecture, all throughputs suffer as contention increases; TM suffers the least and wins
[Figure copied from paper: cycles needed to complete the benchmark vs. number of concurrent processes, for BUS and NW architectures; curves: TTS Lock, MCS Lock (SW queue), QOSB (HW queue), TM, LL/SC Direct]
25CS-510
Doubly-Linked List Benchmark
- N processes share a doubly-linked list anchored by Head & Tail pointers
  - A process dequeues an item by removing the item pointed to by Tail, then enqueues it by threading it onto the list at Head
  - A process that removes the last item sets both Head & Tail to NULL
  - A process that inserts an item into an empty list sets both Head & Tail to point to the new item
- Benchmark finishes when 2^16 operations have completed
SOURCE: from paper
typedef struct list_elem {
    struct list_elem *next;   /* next to dequeue */
    struct list_elem *prev;   /* previously enqueued */
    int value;
} entry;

shared entry *Head, *Tail;

void list_enq(entry *new)
{
    entry *old_tail;
    unsigned backoff = BACKOFF_MIN;
    unsigned wait;
    new->next = new->prev = NULL;
    while (TRUE) {
        old_tail = (entry *) LTX(&Tail);
        if (VALIDATE()) {
            ST(&new->prev, old_tail);
            if (old_tail == NULL) ST(&Head, new);
            else ST(&old_tail->next, new);
            ST(&Tail, new);
            if (COMMIT()) return;
        }
        wait = random() % (01 << backoff);
        while (wait--);
        if (backoff < BACKOFF_MAX) backoff++;
    }
}
26CS-510
Doubly-Linked List Results
- This concurrency is difficult to exploit by conventional means
  - State-dependent concurrency is not simple to recognize using locks: an enqueuer doesn't know whether it must lock the tail pointer until after it has locked the head pointer, and vice-versa for dequeuers
  - Queue non-empty: each tx modifies Head or Tail but not both, so enqueuers can (in principle) execute without interference from dequeuers, and vice-versa
  - Queue empty: a tx must modify both pointers, and enqueuers and dequeuers conflict
- The locking techniques use only a single lock
  - Lower throughput, as a single lock prohibits overlapping of enqueues and dequeues
- TM naturally permits this kind of parallelism
[Figure copied from paper: cycles needed to complete the benchmark vs. number of concurrent processes, for BUS and NW architectures; curves: TTS Lock, MCS Lock (SW queue), QOSB (HW queue), TM, LL/SC Direct]
27CS-510
Blue Gene/Q Processor: first Tx memory HW implementation
- Used in the Sequoia supercomputer built by IBM for Lawrence Livermore Labs; due to be completed in 2012
- Sequoia is a 20-petaflop machine and may use up to 100k Blue Gene/Q chips
- Blue Gene/Q has 18 cores: one dedicated to OS tasks, one held in reserve (fault tolerance?)
- 4-way hyper-threaded, 64-bit PowerPC A2 based
- 1.6 GHz, 205 Gflops, 55 W, 1.47 B transistors, 19 mm sq
28CS-510
TM on Blue Gene/Q
- Transactional memory only works intra-chip; inter-chip conflicts are not detected
- Uses a tag scheme on the L2 cache memory
  - Tags detect load/store data conflicts in a tx
  - Cache data has a 'version' tag; the cache can store multiple versions of the same data
- SW commences the tx, does its work, then tells HW to attempt the commit; if unsuccessful, SW must retry
- Appears to be similar to the approach in Sun's Rock processor
- Ruud Haring on TM: "a lot of neat trickery", "sheer genius"
  - Full implementation much more complex than the paper suggests?
- Blue Gene/Q is the exception and does not mark wide-scale acceptance of HW TM
29CS-510
Pros & Cons Summary
Pros:
- TM matches or outperforms atomic-update locking techniques for simple benchmarks
  - Uses no locks and thus makes fewer memory accesses
- Avoids priority inversion, convoying, and deadlock
- Easy programming semantics
- Complex non-blocking scenarios such as the doubly-linked list are more realizable through TM
- Allows true concurrency and hence is highly scalable (for smaller tx sizes)

Cons:
- TM cannot perform operations that cannot be undone, including most I/O
- Single-cycle commit and abort restrict the size of the 1st-level cache and hence the tx size
  - Is it good for anything other than data containers?
  - Portability is restricted by the transactional cache size
- Still SW dependent
  - Algorithm tuning benefits from SW-based adaptive backoff
  - Tx cache overflow handling
- A longer tx increases the likelihood of being aborted by an interrupt or scheduling conflict
  - A tx should be able to complete within one scheduling time slot
- Weaker consistency models require explicit barriers at tx start and end, impacting performance
- Other complications make TM more difficult to implement in HW
  - Multi-level caches
  - Nested transactions (required for composability)
  - Cache-coherency complexity on many-core SMP and NUMA architectures
- Theoretically subject to starvation
  - Suggested fix is an adaptive backoff strategy (the authors used exponential backoff); otherwise a queuing mechanism is needed
- Poor debugger support
30CS-510
Summary
TM is a novel multi-processor architecture which allows easy lock-free multi-word synchronization in HW:
- Leveraging the concept of database transactions
- Overcoming the single/double-word limitation
- Exploiting cache-coherency mechanisms

Wish                                              Granted?  Comment
Simple programming model                          Yes       Very elegant
Avoids priority inversion, convoying, deadlock    Yes       Inherent in NB approach
Equivalent or better performance than lock-based  Yes       For very small and short tx's
No restrictions on data set size or contiguity    No        Limited by practical considerations
Composable                                        No        Possible, but would add significant HW complexity
Wait-free                                         No        Possible, but would add significant HW complexity
31CS-510
References M.P. Herlihy and J.E.B. Moss. Transactional Memory: Architectural support for lock-free data structures. Technical Report 92/07, Digital Cambridge Research Lab, One Kendall Square, Cambridge MA 02139, December 1992.
32CS-510
Appendix: Linearizability (Herlihy's Correctness Condition)
- An invocation of an object/function is followed by a response
- History: a sequence of invocations and responses made by a set of threads
- Sequential history: a history where each invocation is followed immediately by its response
- Serializable: the history can be reordered to form a sequential history consistent with the sequential definition of the object
- Linearizable: a serializable history in which each response that preceded an invocation in the history also precedes it in the sequential reordering
- An object is linearizable if all of its usage histories may be linearized

An example history:
A invokes lock | B invokes lock | A fails | B succeeds

Reordering 1, a sequential history but not a serializable reordering:
A invokes lock | A fails | B invokes lock | B succeeds

Reordering 2, a linearizable reordering:
B invokes lock | B succeeds | A invokes lock | A fails
33CS-510
Appendix: Goodman's Write-Once Protocol
- D (dirty): line present only in one cache and differs from main memory; the cache must snoop for read requests
- R (reserved): line present only in one cache and matches main memory
- V (valid): line may be present in other caches and matches main memory
- I: invalid
- Writes to V, I are write-through; writes to D, R are write-back
34CS-510
Appendix: Tx Memory and dB Tx's
Differences from database tx's:
- Disk vs. memory: HW is better suited to tx's than dB systems are
- Tx's do not need to be durable (post termination)
- A dB is a closed system; Tx memory interacts with non-Tx operations
- Tx's must be backward compatible with the existing programming environment
- Success in the database transaction field does not translate to transactional memory