Concurrent Programming Without Locks
Keir Fraser & Tim Harris
Adapted from an earlier presentation by Phil Howard
Motivation
• Locking precludes parallelism
• Recall “A Lock-Free Multiprocessor OS Kernel” by Massalin et al.
  – Extensive use of CAS2 (aka DCAS, DCADS)
  – The instruction does not exist on today’s CPUs
• Need a practical and general non-blocking solution
Solutions?
• Only use data structures that can be implemented with CAS?
  – Limiting
• RCU
  – Still uses locks for writers
  – Still limited to CAS data structures
• Software MCAS
• Transactional Memory
Goals
• Concreteness
• Linearizability
• Non-blocking progress guarantee
• Disjoint access parallelism
• Read parallelism
• Dynamicity
• Practicable space costs
• Composability
Caveats
• “It remains possible for a thread to see a mutually inconsistent view of shared memory if it performs a series of [read] calls.”
Definitions
• Obstruction freedom – a thread will make progress as long as it does not contend with other threads for access to any location
• Lock-freedom – The system as a whole will make progress
• Wait-freedom – Every thread makes progress
Focus is on lock-free design
• Whole transactions are lock-free, not just the sub-components
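To make the lock-freedom definition concrete, here is a minimal C11 sketch of the classic CAS retry loop (the helper name `lockfree_add` is hypothetical). A particular thread can retry indefinitely if other threads keep changing the counter, so it is not wait-free; but every failed CAS means some other thread's CAS succeeded, so the system as a whole makes progress.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical helper: atomically add `delta` to *counter with a CAS retry
 * loop and return the new value. Lock-free but not wait-free: a strong CAS
 * fails only when another thread changed *counter, i.e. made progress. */
static long lockfree_add(_Atomic long *counter, long delta)
{
    long old = atomic_load(counter);
    /* On failure, `old` is reloaded with the current value; just retry. */
    while (!atomic_compare_exchange_strong(counter, &old, old + delta))
        ;
    return old + delta;
}
```

No locks are held at any point, so a thread that is descheduled mid-loop cannot block the others.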
Design considerations
• Need to update multiple locations atomically – using only “real” instructions
• The secret?
  – Indirection!
  – Use descriptors to access values
[Figure: memory locations 100 to 107 and a descriptor with a Status field and one (Address, Old Value, New Value) entry per location: 102: 100 -> 200; 105: 123 -> 123; 106: 456 -> 789]
Implications of Descriptors
• Commit operation atomically updates status field
• All accesses are indirect
  – Need to distinguish between a descriptor and a value
  – Need to choose the “actual”, “old”, or “new” value
• Once a descriptor is made visible, only the status field changes
• Once an outcome is decided, the status value doesn’t change
  – Retries use a new descriptor
• Descriptors are managed via garbage collection
Other requirements
• Descriptors must be able to own locations
• Uncontended commits must work
  – Prepare phase
  – Decision point
  – Update status value
  – Clean up
  – Status values: UNDECIDED, READ-CHECK, SUCCESSFUL, FAILED
Other Requirements
• Contended commits must make progress
  – Decided, but not complete
    • Help the other thread complete
  – Undecided, not read-check
    • Abort contending transactions
      – Without contention management, this can lead to live-lock
    • Help contending transactions
      – Sort memory addresses to prevent looping
  – Read-check
    • Abort at least one contender
    • Prevent live-locks by totally ordering transactions
Algorithms
MCAS Multiple Compare And Swap
WSTM Word Software Transactional Memory
OSTM Object Software Transactional Memory
MCAS
CAS(word *address,          // actual value
    word expected_value,
    word new_value);
(logically)
MCAS(int count,
     word *address[],       // actual values
     word expected_value[],
     word new_value[]);
(but an extra level of indirection is added: pointers must indirect through the descriptor!)
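The logical behaviour of MCAS can be pinned down with a sequential reference version: if every location holds its expected value, all of them are updated; otherwise none is. This is only a specification, assuming no other thread runs at the same time (the real operation must appear to do all of it in one atomic step). `mcas_spec`, and `word` as `long`, are hypothetical choices for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

typedef long word;   /* assumption: a "word" is a long */

/* Sequential specification of MCAS (NOT thread-safe). */
static bool mcas_spec(int count, word *address[],
                      const word expected_value[], const word new_value[])
{
    for (int i = 0; i < count; i++)
        if (*address[i] != expected_value[i])
            return false;              /* any mismatch: nothing changes */
    for (int i = 0; i < count; i++)
        *address[i] = new_value[i];    /* all matched: every location changes */
    return true;
}
```

For example, with a = 1, b = 2, c = 3, the call corresponding to MCAS(3, {a,b,c}, {1,2,3}, {4,5,6}) leaves 4, 5, 6; repeating it fails and leaves them untouched.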
MCAS
• Operates only on aligned pointers
• Lower 2 bits used to distinguish value/descriptor
• Descriptors contain
  – status
  – N
  – address[]
  – expected[]
  – new_value[]
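A minimal sketch of the low-bit tagging, assuming 4-byte-aligned words and an encoding chosen here for illustration (00 = plain value, 01 = MCAS descriptor, 10 = CCAS descriptor). `IsMCASDesc`/`IsCCASDesc` mirror the names in the code on these slides; `make_mcas_ref`, `make_ccas_ref` and `untag` are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding of the low 2 bits of an aligned word. */
#define TAG_MASK ((uintptr_t)3)
#define MCAS_TAG ((uintptr_t)1)
#define CCAS_TAG ((uintptr_t)2)

static uintptr_t make_mcas_ref(void *desc)
{
    assert(((uintptr_t)desc & TAG_MASK) == 0);   /* needs 4-byte alignment */
    return (uintptr_t)desc | MCAS_TAG;
}

static uintptr_t make_ccas_ref(void *desc)
{
    assert(((uintptr_t)desc & TAG_MASK) == 0);
    return (uintptr_t)desc | CCAS_TAG;
}

static bool IsMCASDesc(uintptr_t w) { return (w & TAG_MASK) == MCAS_TAG; }
static bool IsCCASDesc(uintptr_t w) { return (w & TAG_MASK) == CCAS_TAG; }

/* Clear the tag bits to recover the descriptor pointer. */
static void *untag(uintptr_t w) { return (void *)(w & ~TAG_MASK); }
```

Because real values must keep these two bits clear, MCAS can only operate on aligned pointers (or values shifted to look like them).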
Data Access
[Figure: data access. A memory word holds either a plain value (e.g. 300) or a descriptor reference. One descriptor, Status: SUCCESS, has entry 102: 100 -> 200; another, Status: UNDECIDED, has entry 105: 100 -> 200.]
CCAS
Conditional CAS, built from CAS
• Takes effect only if *condition == UNDECIDED
• Used to insert descriptor references
CCAS(word *address,
     word expected_value,
     word new_value,
     word *condition);
// returns the original value of *address
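As with MCAS, the meaning of CCAS is easiest to see as a sequential specification: in one atomic step, install the new value only if the location holds the expected value and the condition is still UNDECIDED, and always return what was there. This sketch is not thread-safe; `ccas_spec` and the numeric status encoding are assumptions.

```c
#include <assert.h>

typedef long word;                      /* assumption: a "word" is a long */
enum { UNDECIDED, SUCCESS, FAILED };    /* hypothetical status encoding */

/* Sequential specification of CCAS (NOT thread-safe). */
static word ccas_spec(word *address, word expected_value, word new_value,
                      const int *condition)
{
    word original = *address;
    if (original == expected_value && *condition == UNDECIDED)
        *address = new_value;
    return original;                    /* caller compares with expected */
}
```

In MCAS the condition is the MCAS descriptor's own status, so once the operation is decided no further locations can be claimed on its behalf.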
word *MCASRead(word **addr)
{
    word *v;
    mcas_descriptor *d;
retry_read:
    v = CCASRead(addr);
    if (!IsMCASDesc(v)) return v;          // plain value
    d = (mcas_descriptor *)v;
    for (int i = 0; i < d->N; i++) {
        if (d->a[i] == addr) {
            if (d->status == SUCCESS)
                if (CCASRead(addr) == v) return d->n[i];
                else goto retry_read;
            else // FAILED or UNDECIDED
                if (CCASRead(addr) == v) return d->e[i];
                else goto retry_read;
        }
    }
    return v;
}
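[Figure: worked example of MCAS(3, {a,b,c}, {1,2,3}, {4,5,6}) with a = 1, b = 2, c = 3. The descriptor holds the entries a: 1 -> 4, b: 2 -> 5, c: 3 -> 6, with Status: UNDECIDED while the locations are acquired; once Status is SUCCESS, a, b and c hold 4, 5 and 6. An intermediate frame shows the same call with its arguments listed in a different order, MCAS(3, {a,c,b}, {1,3,2}, {4,6,5}).]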
bool MCAS(int N, word **a[], word *e[], word *n[])
{
    mcas_descriptor *d = new mcas_descriptor();
    d->N = N;
    d->status = UNDECIDED;
    for (int i = 0; i < N; i++) {
        d->a[i] = a[i]; d->e[i] = e[i]; d->n[i] = n[i];
    }
    address_sort(d);   // all threads acquire locations in the same order
    return mcas_help(d);
}
bool mcas_help(mcas_descriptor *d)
{
    word *v, desired = FAILED;
    bool success;
    // Phase 1: acquire
    for (int i = 0; i < d->N; i++) {
        while (TRUE) {
            v = CCAS(d->a[i], d->e[i], d, &d->status);
            if (v == d->e[i] || v == d) break;   // acquired (by us or a helper)
            if (IsMCASDesc(v))
                mcas_help((mcas_descriptor *)v); // help the owner, then retry
            else
                goto decision_point;             // unexpected value: fail
        }
    }
    desired = SUCCESS;
    // Phase 2: read – not used by MCAS
decision_point:
    CAS(&d->status, UNDECIDED, desired);
    // Phase 3: clean up
    success = (d->status == SUCCESS);
    for (int i = 0; i < d->N; i++) {
        CAS(d->a[i], d, success ? d->n[i] : d->e[i]);
    }
    return success;
}
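The slide code relies on garbage collection and leaves the word encoding implicit. As a hedged illustration only (not the authors' implementation), here is a runnable C11 sketch of the same CCAS-based MCAS: acquire in address order, decide with one CAS on the status, then clean up. Assumptions not taken from the slides: locations are `_Atomic uintptr_t` whose plain values keep the low 2 bits clear, tag 1 marks an MCAS descriptor and tag 2 a CCAS descriptor, a call names at most `MAX_N` distinct locations, and descriptors are leaked rather than reclaimed. All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef _Atomic uintptr_t loc_t;        /* one shared word */
enum { UNDECIDED, FAILED, SUCCESS };
#define TAG_MASK ((uintptr_t)3)
#define MCAS_TAG ((uintptr_t)1)
#define CCAS_TAG ((uintptr_t)2)
#define MAX_N 8

typedef struct {
    _Atomic int status;
    int n;
    loc_t *a[MAX_N];
    uintptr_t e[MAX_N], nv[MAX_N];
} mcas_desc;

typedef struct { loc_t *a; uintptr_t e, n; _Atomic int *cond; } ccas_desc;

static void ccas_help(ccas_desc *d)
{
    uintptr_t self = (uintptr_t)d | CCAS_TAG;
    bool ok = atomic_load(d->cond) == UNDECIDED;
    /* Replace our CCAS descriptor with n if the condition held, else e. */
    atomic_compare_exchange_strong(d->a, &self, ok ? d->n : d->e);
}

/* Returns the value found at *a; returns e when the CCAS was attempted. */
static uintptr_t ccas(loc_t *a, uintptr_t e, uintptr_t n, _Atomic int *cond)
{
    ccas_desc *d = malloc(sizeof *d);
    assert(d != NULL);
    d->a = a; d->e = e; d->n = n; d->cond = cond;
    for (;;) {
        uintptr_t v = e;
        if (atomic_compare_exchange_strong(a, &v, (uintptr_t)d | CCAS_TAG)) {
            ccas_help(d);
            return e;
        }
        if ((v & TAG_MASK) != CCAS_TAG)
            return v;                            /* value or MCAS descriptor */
        ccas_help((ccas_desc *)(v & ~TAG_MASK)); /* finish the other CCAS */
    }
}

static bool mcas_help(mcas_desc *d)
{
    uintptr_t self = (uintptr_t)d | MCAS_TAG;
    int desired = FAILED;
    /* Phase 1: acquire each location, in address order, with CCAS. */
    for (int i = 0; i < d->n; i++) {
        for (;;) {
            uintptr_t v = ccas(d->a[i], d->e[i], self, &d->status);
            if (v == d->e[i] || v == self)
                break;
            if ((v & TAG_MASK) != MCAS_TAG)
                goto decision_point;                 /* unexpected value */
            mcas_help((mcas_desc *)(v & ~TAG_MASK)); /* help owner, retry */
        }
    }
    desired = SUCCESS;
decision_point:;
    /* Decision point: only the first CAS on status takes effect. */
    int undecided = UNDECIDED;
    atomic_compare_exchange_strong(&d->status, &undecided, desired);
    /* Phase 3: replace our descriptor with the new or the old values. */
    bool success = atomic_load(&d->status) == SUCCESS;
    for (int i = 0; i < d->n; i++) {
        uintptr_t expected = self;
        atomic_compare_exchange_strong(d->a[i], &expected,
                                       success ? d->nv[i] : d->e[i]);
    }
    return success;
}

static bool mcas(int n, loc_t *a[], const uintptr_t e[], const uintptr_t nv[])
{
    assert(n > 0 && n <= MAX_N);
    mcas_desc *d = malloc(sizeof *d);
    assert(d != NULL);
    atomic_init(&d->status, UNDECIDED);
    d->n = n;
    /* Insertion sort by address: every thread acquires in one global order. */
    for (int i = 0; i < n; i++) {
        int j = i;
        for (; j > 0 && (uintptr_t)d->a[j - 1] > (uintptr_t)a[i]; j--) {
            d->a[j] = d->a[j - 1]; d->e[j] = d->e[j - 1]; d->nv[j] = d->nv[j - 1];
        }
        d->a[j] = a[i]; d->e[j] = e[i]; d->nv[j] = nv[i];
    }
    return mcas_help(d);
}
```

For example, with three locations holding 4, 8 and 12, calling `mcas` with expected values {4, 8, 12} and new values {16, 20, 24} succeeds and leaves 16, 20 and 24; repeating the same call fails and changes nothing. Concurrent readers would additionally need an MCASRead-style function that resolves descriptors.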
Claiming Ownership
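[Figure: an MCAS descriptor (Status: UNDECIDED; entries 102: 100 -> 200, 104: 456 -> 789, 108: 999 -> 777) claiming location 108, which holds 999, through a CCAS descriptor (address 108, expected 999, new value &MCAS_Descr, condition &mcas->status)]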
word *CCAS(word **a, word *e, word *n, word *cond) {
    ccas_descriptor *d = new ccas_descriptor();
    word *v;
    d->a = a; d->e = e; d->n = n; d->cond = cond;
    while ((v = CAS(d->a, d->e, d)) != d->e) {
        if (!IsCCASDesc(v)) return v;    // not e: fail, return what we saw
        CCASHelp((ccas_descriptor *)v);  // finish the other CCAS, then retry
    }
    CCASHelp(d);
    return v;
}
void CCASHelp(ccas_descriptor *d) {
    bool success = (*d->cond == UNDECIDED);
    CAS(d->a, d, success ? d->n : d->e);  // install n only if still UNDECIDED
}
word *CCASRead(word **a) {
word *v = *a;
while ( IsCCASDesc(v) ) {
CCASHelp( (ccas_descriptor *)v);
v = *a;
}
return v;
}
Conflicts
[Figure: conflicting MCAS operations on locations 102, 104 and 108. Descriptor 1 (Status: UNDECIDED): 102: 100 -> 200, 104: 456 -> 789, 108: 999 -> 777. Descriptor 2 (Status: UNDECIDED): 108: 999 -> 200.]
[Figure: a later conflict. Descriptor 1 (Status: UNDECIDED): 102: 100 -> 200, 104: 456 -> 456, 108: 999 -> 999. Descriptor 2 (Status: UNDECIDED): 104: 456 -> 123, 108: 999 -> 200.]
CCAS “failure modes”
• Someone helped us with the CCAS
  – Call CCASHelp with our own descriptor
  – Next time around, return the MCAS descriptor
  – MCAS continues
• Someone else beat us to the CCAS
  – Help them with their CCAS
  – Next time around, return their MCAS descriptor
  – Help with their MCAS
  – Our MCAS likely aborts
• Source value changed
  – Return the new value
  – MCAS aborts
CCASHelp “failure modes”
• MCAS aborted, so status isn’t UNDECIDED
  – Old value is put back in place
• MCAS aborted, but CCASHelp doesn’t restore the value
  – MCAS cleanup will put the old value back in place
• Race: status switches to SUCCESS between the check and the CAS
  – The CAS will fail because the CCAS descriptor has already been removed
  – The CCAS return will not cause MCAS failure
• Race: status switches to FAILED between the check and the CAS
  – The CAS will always fail, because for the MCAS to fail, someone must have read beyond us
Cost
• 3N + 1 CAS instructions (plus all the other code)
• “it is worth noting that the three batches of N updates all act on the same locations”
• “[improvements] may be useful if there are systems in which CAS operates substantially more slowly than an ordinary write.”
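The 3N + 1 count can be read off the uncontended path through mcas_help and CCAS: each of the N CCAS calls performs one CAS to install its CCAS descriptor and one (in CCASHelp) to replace it with the MCAS descriptor, one CAS decides the status, and clean-up performs one CAS per location.

```latex
\underbrace{2N}_{\text{acquire: install + CCASHelp}}
\;+\; \underbrace{1}_{\text{decide status}}
\;+\; \underbrace{N}_{\text{clean up}}
\;=\; 3N + 1,
\qquad\text{e.g. } N = 3 \;\Rightarrow\; 10 \text{ CAS instructions.}
```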
Deep Breath
WSTM
• Removes the requirement to reserve space (bits) in the values being updated
• WSTM, rather than the caller, keeps track of locations
• Provides read parallelism
• Obstruction-free, but neither lock-free nor wait-free
Data Structures
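[Figure: memory words 100, 200, 300 and 400 with a table of orecs (one at version 52), and a WSTM descriptor with Status: UNDECIDED and entries written (value, version): a1: (100,15) -> (200,16), a2: (200,52) -> (100,53)]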
Logical contents
• Orec contains a version number
  – Value comes direct from memory
• Orec contains a descriptor reference
  – Descriptor contains the address: value comes from the descriptor, based on status
  – Descriptor does not contain the address: value comes direct from memory
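The three cases above can be sketched as a single lookup. The structures here (`orec`, `wstm_descriptor`, `wstm_entry`) and the fixed entry limit are hypothetical simplifications, and the sketch treats any non-SUCCESSFUL status as "old value"; a real reader would first force an undecided owner to a decision (force_decision() in the spare slides) rather than trust a non-final status.

```c
#include <assert.h>
#include <stddef.h>

enum tx_status { UNDECIDED, READ_CHECK, SUCCESSFUL, FAILED };

struct wstm_entry { long *addr; long old_value, new_value; };
struct wstm_descriptor { enum tx_status status; int n; struct wstm_entry e[4]; };

/* An orec holds either a version number (owner == NULL) or a descriptor. */
struct orec { struct wstm_descriptor *owner; unsigned version; };

static long wstm_logical_value(const struct orec *o, long *addr)
{
    if (o->owner == NULL)
        return *addr;                  /* version number: memory is current */
    const struct wstm_descriptor *d = o->owner;
    for (int i = 0; i < d->n; i++)
        if (d->e[i].addr == addr)      /* descriptor covers this address */
            return d->status == SUCCESSFUL ? d->e[i].new_value
                                           : d->e[i].old_value;
    return *addr;                      /* orec owned via another (aliased) address */
}
```

The last case arises because several addresses hash to the same orec, so a descriptor may own an orec without mentioning the address being read.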
Transaction Process
• Call WSTMRead/WSTMWrite to gather/change data
  – Builds the transaction data structure, but it is NOT visible to other threads
• WSTMCommitTransaction
  – Get ownership: update orecs
  – Read-check: check version numbers
  – Decide
  – Clean up
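A much-simplified, single-threaded model of the private transaction log may help. It assumes each cell carries its own version number (no shared orec table, no aliasing) and that commit runs with no concurrency, so it shows only the bookkeeping (log entries, read-check by version, write-back) and none of the ownership or helping logic. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct cell { long value; unsigned version; };   /* value + its own version */

struct tx_entry { struct cell *c; long old_value, new_value;
                  unsigned old_version, new_version; };

struct tx { int n; struct tx_entry log[8]; };    /* private, not shared */

/* Find or create the log entry for c, snapshotting value and version. */
static struct tx_entry *tx_entry_for(struct tx *t, struct cell *c)
{
    for (int i = 0; i < t->n; i++)
        if (t->log[i].c == c) return &t->log[i];
    assert(t->n < 8);
    struct tx_entry *e = &t->log[t->n++];
    *e = (struct tx_entry){ c, c->value, c->value, c->version, c->version };
    return e;
}

static long tx_read(struct tx *t, struct cell *c)
{
    return tx_entry_for(t, c)->new_value;
}

static void tx_write(struct tx *t, struct cell *c, long v)
{
    struct tx_entry *e = tx_entry_for(t, c);
    e->new_value = v;
    e->new_version = e->old_version + 1;
}

/* Read-check every entry's version, then publish all updates. */
static bool tx_commit(struct tx *t)
{
    for (int i = 0; i < t->n; i++)
        if (t->log[i].c->version != t->log[i].old_version)
            return false;                        /* someone else committed */
    for (int i = 0; i < t->n; i++) {
        t->log[i].c->value = t->log[i].new_value;
        t->log[i].c->version = t->log[i].new_version;
    }
    return true;
}
```

Swapping two cells holding (100, version 15) and (200, version 52) produces the entries a1: (100,15) -> (200,16) and a2: (200,52) -> (100,53) used in the slides' example.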
Data Structures
[Figure: the same transaction at commit. Orec versions 15 and 52 become 16 and 53; the descriptor entries are a1: (100,15) -> (200,16) and a2: (200,52) -> (100,53), with a2 first shown as (200,52) -> (200,52); once Status: SUCCESS, a1 holds 200 and a2 holds 100.]
Complications
• Fixed number of Orecs
• Hash collisions lead to false sharing
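A minimal sketch of why a fixed orec table causes false sharing, assuming a hypothetical table of 1024 orecs indexed by the address of an 8-byte word (the size and the hash are illustrative assumptions, not taken from the slides).

```c
#include <assert.h>
#include <stdint.h>

#define NUM_ORECS 1024u   /* assumed table size */

/* Hypothetical hash: drop the 3 alignment bits of an 8-byte word, then
 * reduce modulo the table size. */
static unsigned orec_index(const void *addr)
{
    return (unsigned)(((uintptr_t)addr >> 3) % NUM_ORECS);
}
```

With these parameters, two words exactly 1024 words apart always map to the same orec, so transactions that touch only one of them each still conflict even though they share no data.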
Issues
• Orec ownership acts like a lock, so the simple scheme is not even obstruction-free
• Can’t help with “cleanup”, because helping might overwrite newer data
• Can’t determine the value during READ-CHECK, so we’re forced to shoot down (abort) the owning transaction
• force_decision() might be circular, causing live-lock
• Helping requires <complicated> stealing of transactions
• Uncontended cost is N+2
OSTM
• Objects are represented as opaque handles
  – Can’t use pointers directly
  – Must rewrite data structures
• Get accessible pointers via OSTMOpenForReading/OSTMOpenForWriting
• Eliminates need for Orecs/aliasing
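To build intuition for the handle indirection, here is a much-simplified sketch of the single-object case only: readers dereference the handle's current version, a writer edits a private copy, and commit publishes it with one CAS on the handle. The names `open_for_reading`, `open_for_writing` and `commit` are hypothetical stand-ins for OSTMOpenForReading/OSTMOpenForWriting; the multi-object commit protocol is omitted, and old versions are leaked instead of garbage-collected.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct node { int key; int value; };
struct handle { _Atomic(struct node *) current; };   /* opaque handle */

static const struct node *open_for_reading(struct handle *h)
{
    return atomic_load(&h->current);
}

/* Returns a private copy; *base records the version it was copied from. */
static struct node *open_for_writing(struct handle *h, struct node **base)
{
    *base = atomic_load(&h->current);
    struct node *copy = malloc(sizeof *copy);
    assert(copy != NULL);
    *copy = **base;
    return copy;
}

/* Publish the copy only if no newer version was committed meanwhile. */
static bool commit(struct handle *h, struct node *base, struct node *copy)
{
    return atomic_compare_exchange_strong(&h->current, &base, copy);
}
```

Because every access goes through the handle, there is no per-word metadata to hash, which is why orecs and aliasing disappear.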
Evaluation
• “We use … reference-counting garbage collection”
• Evaluated with one thread/CPU
• “Since we know the number of threads participating in our experiments…”
Uncontended Performance
Contended Locks
Data Contention
Data/Lock Contention
Spare Slides
word WSTMRead(wstm_transaction *tx, word *addr)
{
    if (entry_exists) return entry->new_value;
    if (orec->type != descriptor)
        create entry [current value, orec version]
    else {
        force_decision(descriptor);   // can’t be ours: not in commit
        if (descriptor contains our address)
            if (status == SUCCESS) create entry [descr.new_val, descr.new_ver]
            else create entry [descr.old_val, descr.old_ver]
        else
            create entry [current value, descr.aliased.new_ver]
    }
    if (aliased) {
        if (entry->old_version != aliased->old_version)
            status = FAILED;
        entry->old_version = aliased->old_version;
        entry->new_version = aliased->new_version;
    }
    return entry->new_value;
}
void WSTMWrite(wstm_transaction *tx, word *addr, word new_value)
{
    get entry using WSTMRead logic
    entry->new_value = new_value;
    for each aliased entry {
        entry->new_version++;
    }
}
bool WSTMCommit(wstm_transaction *tx)
{
    if (tx->status == FAILED) return false;
    sort descriptor entries
    desired_status = FAILED;
    for each update
        if (!acquire_orec) goto decision_point;
    CAS(&tx->status, UNDECIDED, READ_CHECK);
    for each read
        if (!read_check) goto decision_point;
    desired_status = SUCCESS;
decision_point:
    status = tx->status;
    while (status != FAILED && status != SUCCESS) {
        CAS(&tx->status, status, desired_status);
        status = tx->status;
    }
    if (tx->status == SUCCESS)
        for each update
            *addr = entry->new_value;
    for each update
        release_orec
    return (tx->status == SUCCESS);
}
bool read_check(wstm_transaction *tx, wstm_entry *entry)
{
    if (orec is WSTM_descriptor) {
        force_decision();
        if (SUCCESS) version = new_version;
        else version = old_version;
    } else {
        version = orec_version;
    }
    return (version == entry->old_version);
}
Data Structures
[Figure: memory words 100, 200, 300 and 400, orecs for a1, a2 and a3 (one at version 52), and a WSTM descriptor with Status: UNDECIDED and entries a1: (100,15) -> (200,16), a2: (200,52) -> (100,53), a3: (300,15) -> (300,16)]