mama: mostly automatic management of...
TRANSCRIPT
MAMA: Mostly Automatic Management of Atomicity
Christian DeLozier, Joseph Devietti, Milo M. K. Martin University of Pennsylvania
March 2nd, 2014
2
Start with a serial problem
Christian DeLozier - 2014
3
Find and express the parallelism
Christian DeLozier - 2014
4
Coordinate the parallel execution (synchronization)
Christian DeLozier - 2014
5
Don’t mess up!
Christian DeLozier - 2014
6
Is there another way to do this? • Programmer currently has to:
1. Express the parallelism (Hard) 2. Coordinate the parallelism (Hard)
• Alternative: 1. Programmer expresses the parallelism 2. Machine handles coordination
Christian DeLozier - 2014
7
Coordinating Parallel Execution • Atomicity vs. Ordering
• Types of concurrency bugs [Lu et al., ASPLOS 2008] • Atomicity: Locks, transactions • Ordering: Barriers, fork/join, blocking on a queue, etc.
• Atomicity constraints are more common than ordering constraints
• Difficult to infer ordering constraints
Christian DeLozier - 2014
8
Mostly Automatic Management of Atomicity • Toward automatically providing atomicity for
parallel programs
• Program either executes atomically – or deadlocks
• Protect every shared variable with its own lock
• Restore progress and performance when necessary (with help from the programmer)
Christian DeLozier - 2014
9
Related Work • Automatic Parallelization
• [Bernstein, IEEE Transactions 1966] • …
• Data Centric Synchronization • [Vaziri et. al, POPL 2006] • [Ceze et. al, HPCA 2007]
• Transactional Memory • [Herlihy and Moss, ISCA 1993] • …
Christian DeLozier - 2014
10
Lock-Based Atomic Sections • What lock do we acquire?
• When do we acquire the lock?
• When should we release the lock?
Christian DeLozier - 2014
11
What lock do we acquire? • Associate a lock with each variable
• Trade-off between parallelism and overhead
• Coarse-grained vs. Fine-grained • Coarse-grained: 1 lock per object, 1 lock per array • Fine-grained: 1 lock per field, 1 lock per array element
• Mutex vs. Reader-writer lock
Christian DeLozier - 2014
12
MAMA Prototype • Uses fine-grained locking
• More parallelism • Especially for arrays
• Optimization: Divide arrays into N chunks, 1 lock per chunk
• Uses reader-writer locks • More parallelism
• Read sharing is common
Christian DeLozier - 2014
13
Lock-Based Atomic Sections • What lock do we acquire?
• One reader-writer lock per variable (fine-grained)
• When do we acquire the lock? • Acquire before the first dynamic access
• When should we release the lock?
Christian DeLozier - 2014
14
T1
When should we release the lock? • Simple case: After the owning thread has exited
Christian DeLozier - 2014
T1 T2
Write A
Exit
Write A T2
Exited
Write A
15
When should we release the lock? • When the owning thread is waiting for another
thread to make progress (e.g. join, barrier)
Christian DeLozier - 2014
T1
T1 T2
Write A
Join T2
Write A T2
Joined
Write A
Exit
Spawn T2
16
When should we release the lock? • Other deadlocks cannot be safely broken
• Need help from the programmer • Trusted annotations to sanction breaking a deadlock
• MAMA_release(object) • Also used to improve performance when threads are
over-serialized Christian DeLozier - 2014
T1
T1 T2
Write A
Write B
Write B T2
Write B
Write A
Write A
17
Lock-Based Atomic Sections • What lock do we acquire?
• One reader-writer lock per variable (fine-grained)
• When do we acquire the lock? • Acquire before the first dynamic access
• When should we release the lock? • At thread exit • When waiting for another thread to make progress • Or, at programmer sanctioned program points
Christian DeLozier - 2014
18
What can deadlocks tell us? • When a thread cannot acquire a lock:
• Perform distributed deadlock detection [Bracha and Toueg, Distributed Computing 1987]
Christian DeLozier - 2014
void f(){ A = 1; B = 2; }
void g(){ B = 1; A = 2; }
T1 T2
Write B
Write A T2
T1
19
MAMA Prototype • Implemented as a RoadRunner tool [Flanagan and
Freund, PASTE 2010] • Dynamic instrumentation for Java byte-code
• Evaluated on the Java Grande benchmarks and selected DaCapo benchmarks
• Running on one socket (8 cores) of a 4 socket Nehalem system with 128 GB RAM
• Removed all synchronized blocks and java.util.concurrent constructs from benchmarks
• Ensure that MAMA is providing all of the atomicity
Christian DeLozier - 2014
20
Evaluating MAMA • Can we execute parallel programs correctly?
• How many annotations need to be added for progress and performance?
• How is the performance of the program affected? • Does MAMA permit thread to execute in parallel?
Christian DeLozier - 2014
21
Benchmark Lines of Code Progress Annotations
Performance Annotations
crypt 314 0 0 lufact 461 1 4
lusearch 124105 0 4 matmult 187 0 0 moldyn 487 3 0
montecarlo 1165 0 28 pmd 60062 0 4
series 180 0 0 sor 186 1 0
sunflow 21970 1 3 xalan 172300 0 0
Annotation Burden
Christian DeLozier - 2014
22
Performance
Christian DeLozier - 2014
0x
2x
4x
6x
8x
10x
Nor
mal
ized
Run
time
RR-ser MAMA-par MAMA-ser
• MAMA incurs overhead due to locking and serial execution • But, MAMA still allows some parallel execution as
compared to serialization
23x
23
Performance Breakdown
Christian DeLozier - 2014
0%
20%
40%
60%
80%
100%
Perc
ent R
untim
e
Running Blocked Deadlock Acquire Check
• Many benchmarks have significant portions that run in parallel • Checking whether or not a lock is already owned incurs
significant overhead on some benchmarks
24
Memory Usage
Christian DeLozier - 2014
• Fine-grained locking incurs significant memory overheads • Could be optimized to save space via chunking arrays or
decreasing the size of the lock
0x
10x
20x
30x
40x
Nor
mal
ized
Mem
ory
Usa
ge MAMA-par
25
Future Directions • Does this approach apply to other languages?
• How do we test programs running with MAMA? • Find uncommon deadlocks • Gain more confidence in trusted annotations
• How can we reduce the performance overheads?
• How can we infer ordering constraints?
Christian DeLozier - 2014
26
MAMA • Provides atomicity for parallel programs
• Some help via annotations from programmer
• A step toward programming without worrying about atomicity
• Programmer expresses parallelism • Machine provides atomicity automatically
Christian DeLozier - 2014
27
Thank you for listening!
Christian DeLozier - 2014