The Galois Project
Keshav Pingali, University of Texas at Austin
Joint work with Milind Kulkarni, Martin Burtscher, Patrick Carribault, Donald Nguyen, Dimitrios Prountzos, Zifei Zhong
Overview of Galois Project
• Focus of the Galois project:
  – parallel execution of irregular programs
    • pointer-based data structures like graphs and trees
  – raise the abstraction level for "Joe programmers"
    • explicit parallelism is too difficult for most programmers
    • the performance penalty for abstraction should be small
• Research approach:
  a) study algorithms to find common patterns of parallelism and locality
     – Amorphous Data-Parallelism (ADP)
  b) design programming constructs for expressing these patterns
  c) implement these constructs efficiently
     – Abstraction-Based Speculation (ABS)
• For more information:
  – papers in PLDI 2007, ASPLOS 2008, SPAA 2008
  – website: http://iss.ices.utexas.edu
Organization
• Case study of amorphous data-parallelism
  – Delaunay mesh refinement
• Galois system (PLDI 2007)
  – Programming model
  – Baseline implementation
• Galois system optimizations
  – Scheduling (SPAA 2008)
  – Data and computation partitioning (ASPLOS 2008)
• Experimental results
• Ongoing work
Delaunay Mesh Refinement
• Iterative refinement to remove badly shaped triangles:
    while there are bad triangles do {
        Pick a bad triangle;
        Find its cavity;
        Retriangulate cavity;   // may create new bad triangles
    }
• Order in which bad triangles are refined:
  – the final mesh depends on the order in which bad triangles are processed
  – but all bad triangles are ultimately eliminated regardless of order
    Mesh m = /* read in mesh */;
    WorkList wl;
    wl.add(mesh.badTriangles());
    while (true) {
        if (wl.empty()) break;
        Triangle e = wl.get();
        if (e no longer in mesh) continue;   // an earlier retriangulation may have removed e
        Cavity c = new Cavity(e);
        c.expand();          // determine cavity
        c.retriangulate();   // re-triangulate cavity
        m.update(c);         // update mesh
        wl.add(c.badTriangles());            // may add new bad triangles
    }
Delaunay Mesh Refinement

• Parallelism:
  – triangles with non-overlapping cavities can be processed in parallel
  – if the cavities of two triangles overlap, they must be processed serially
  – in practice, there is lots of parallelism
• Exploiting this parallelism:
  – compile-time parallelization techniques like points-to and shape analysis cannot expose this parallelism (it is a property of the algorithm, not the program)
  – runtime dependence checking is needed
• Galois approach: optimistic parallelization
Take-away lessons
• Amorphous data-parallelism:
  – data structures: graphs, trees, etc.
  – iterative algorithm over an unordered or ordered work-list
  – elements can be added to the work-list during the computation
  – complex patterns of dependences between computations on different work-list elements (possibly input-sensitive)
  – but many of these computations can be done in parallel
• Contrast: crystalline (regular) data-parallelism:
  – data structures: dense matrices
  – iterative algorithm over a fixed integer interval
  – simple dependence patterns: affine subscripts in array accesses (mostly input-insensitive)

    for i = 1, N
      for j = 1, N
        for k = 1, N
          C[i,j] = C[i,j] + A[i,k]*B[k,j]
Take-away lessons (contd.)
• Amorphous data-parallelism is ubiquitous:
  – Delaunay mesh generation: points to be inserted into the mesh
  – Delaunay mesh refinement: list of bad triangles
  – agglomerative clustering: priority queue of points from the data-set
  – Boykov-Kolmogorov algorithm for image segmentation
  – reduction-based interpreters for the λ-calculus: list of redexes
  – iterative dataflow analysis algorithms in compilers
  – approximate SAT solvers: survey propagation, WalkSAT
  – …
Take-away lessons (contd.)
• Amorphous data-parallelism is obscured within while loops, exit conditions, etc. in conventional languages
  – we need a transparent syntax, analogous to FOR loops for crystalline data-parallelism
• Optimistic parallelization is necessary in general:
  – compile-time approaches using points-to analysis or shape analysis may be adequate for some cases
  – in general, runtime dependence checking is needed
  – this is a property of algorithms, not programs
• Handling of dependence conflicts depends on the application:
  – Delaunay mesh generation: roll back any conflicting computation
  – agglomerative clustering: must respect the priority-queue order
Organization
• Case study of amorphous data-parallelism
  – Delaunay mesh refinement
• Galois system (PLDI 2007)
  – Programming model
  – Baseline implementation
• Galois system optimizations
  – Scheduling (SPAA 2008)
  – Data and computation partitioning (ASPLOS 2008)
• Experimental results
• Ongoing work
Galois Design Philosophy

• Do not worry about "dusty decks" (for now):
  – restructuring existing code to expose amorphous data-parallelism is not our focus (cf. Google map/reduce)
• Evolution, not revolution:
  – modification of existing programming paradigms: OK
  – radical solutions like functional programming: not OK
• No reliance on parallelizing compiler technology:
  – it will not work for many of our applications anyway
  – parallelizing compilers are very complex software artifacts
• Support two classes of programmers:
  – domain experts (Joe):
    • should be shielded from the complexities of parallel programming
    • most programmers will be Joes
  – parallel programming experts (Steve):
    • a small number of highly trained people
  – analogs:
    • the industry model even for sequential programming
    • the norm in domains like numerical linear algebra:
      – Steves implement BLAS libraries
      – Joes express their algorithms in terms of BLAS routines
Galois system
• Application program:
  – has well-defined sequential semantics
    • current implementation: sequential Java
  – uses optimistic iterators to highlight, for the runtime system, opportunities for exploiting parallelism
• Class libraries:
  – like the Java collections library, but with additional information for concurrency control
• Runtime system:
  – manages optimistic parallelism
Optimistic set iterators
• for each e in Set S do B(e)
  – evaluate block B(e) for each element in set S
  – sequential semantics:
    • set elements are unordered, so there is no a priori order on iterations
    • there may be dependences between iterations
  – set S may get new elements during execution
• for each e in OrderedSet S do B(e)
  – evaluate block B(e) for each element in set S
  – sequential semantics:
    • perform iterations in the order specified by the OrderedSet
    • there may be dependences between iterations
  – set S may get new elements during execution
Galois version of mesh refinement
    Mesh m = /* read in mesh */;
    Set wl;
    wl.add(mesh.badTriangles());   // initialize the Set wl

    for each e in Set wl do {      // unordered Set iterator
        if (e no longer in mesh) continue;
        Cavity c = new Cavity(e);
        c.expand();
        c.retriangulate();
        m.update(c);
        wl.add(c.badTriangles());  // add new bad triangles to Set
    }

• Scheduling policy for the iterator:
  – controlled by the implementation of the Set class
  – a good choice for temporal locality: stack (see the sketch below)
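Since the schedule is determined entirely by the work-set's data structure, it can be changed without touching the loop body. The following is a minimal sketch of a stack-backed work-set in Java (illustrative only; the names are assumptions, not the actual Galois library API):

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical work-set whose removal order defines the iterator's
    // schedule. A LIFO discipline favors temporal locality: newly created
    // bad triangles touch mesh regions that are likely still in cache.
    class StackWorkSet<T> {
        private final Deque<T> items = new ArrayDeque<>();

        public void add(T item)  { items.push(item); }
        public T get()           { return items.pop(); }  // LIFO order
        public boolean empty()   { return items.isEmpty(); }
    }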
Parallel execution model

[Figure: master thread executing main() with its iterators; program threads operate on objects in shared memory]

• Object-based shared-memory model
• Master thread and some number of worker threads:
  – the master thread begins execution of the program and executes the code between iterators
  – when it encounters an iterator, worker threads help by executing iterations concurrently with the master
  – threads synchronize by barrier synchronization at the end of the iterator
• Threads invoke methods to access the internal state of objects
  – how do we ensure that the sequential semantics of the program are respected?
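To make the model concrete, here is a heavily simplified sketch of the master/worker structure in Java (an illustration under assumed names, not the Galois runtime; it omits the locking, conflict detection, and rollback that the following slides add):

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.function.Consumer;

    // When the master thread reaches an iterator, worker threads drain the
    // work-set concurrently and synchronize at a barrier (awaitTermination)
    // at the end of the iterator. Caveat: items added after a worker sees an
    // empty set may be missed in this sketch; the real runtime handles
    // dynamically created work.
    class IteratorModel {
        static <T> void forEach(Queue<T> workSet, Consumer<T> body, int nThreads)
                throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(nThreads);
            for (int t = 0; t < nThreads; t++) {
                pool.submit(() -> {
                    T item;
                    while ((item = workSet.poll()) != null) {
                        body.accept(item);   // one iteration of the loop body
                    }
                });
            }
            pool.shutdown();                          // no further tasks
            pool.awaitTermination(1, TimeUnit.DAYS);  // barrier at iterator end
        }
        // Usage, e.g.:
        // IteratorModel.forEach(new ConcurrentLinkedQueue<>(badTriangles),
        //                       t -> refine(t), 4);
    }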
Baseline solution: PLDI 2007
• An iteration must lock an object in order to invoke a method on it
• Two types of objects:
  – catch-and-keep policy:
    • the lock is held even after the method invocation completes
    • all locks are released at the end of the iteration
    • poor performance for programs with collections and accumulators
  – catch-and-release policy:
    • like the Java locking policy
    • permits method invocations from different concurrent iterations to be interleaved
    • how do we make sure this is safe?
Catch and keep: iteration rollback

• What does iteration j do if an object is already locked by some other iteration i?
  – one possibility: wait and try to acquire the lock again
    • but this might lead to deadlock
  – our implementation: the runtime system rolls back one of the iterations by undoing its updates to shared objects
• Undoing updates: any "copy-on-write" solution works
  – make a copy of the entire object when you acquire the lock (wasteful for large objects)
  – the runtime system maintains an "undo log" that holds the information needed to undo side-effects to objects as they are made (cf. software transactional memory; a sketch follows)
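A minimal sketch of the undo-log idea in Java (names and structure are assumptions for illustration, not the Galois implementation): before each destructive update, the iteration logs a closure that reverses the update; on abort, the log is replayed in reverse order.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Illustrative undo log: an iteration records an inverse action before
    // every side-effect; abort() replays the inverses in LIFO order.
    class UndoLog {
        private final Deque<Runnable> inverses = new ArrayDeque<>();

        // Record how to undo a side-effect just before performing it,
        // e.g. before mesh.remove(t): log(() -> mesh.add(t));
        public void log(Runnable inverse) { inverses.push(inverse); }

        // Roll back the iteration: undo its side-effects in reverse order.
        public void abort() {
            while (!inverses.isEmpty()) {
                inverses.pop().run();
            }
        }

        // Iteration committed: discard the log.
        public void commit() { inverses.clear(); }
    }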
Problem with catch and keep
• Poor performance for programs that deal with mutable collections and accumulators:
  – work-sets are mutable collections
  – accumulators are ubiquitous
• Example: Delaunay refinement
  – the work-set is a (mutable) collection of bad triangles
  – some thread grabs the lock on the work-set object, gets a bad triangle, and removes it from the work-set
  – that thread must retain the lock on the work-set until its iteration completes, which shuts out all other threads (the same problem arises with transactional memory)
• Lesson:
  – for some objects, we need to interleave method invocations from different iterations
  – but we must not lose serializability
Galois solution: selective catch and release

• Example: accumulator
  – two methods:
    • add(int)
    • read()  // return value
  – adds commute with other adds, and reads commute with other reads
• Interleaving of commuting method invocations from different iterations is OK
• Interleaving of non-commuting method invocations from different iterations triggers an abort
• Rolling back side-effects: the programmer must provide "inverse" methods for the forward methods
  – the inverse method for add(n) is subtract(n)
  – a semantic inverse, not a representational inverse
• This solution works for sets as well (a sketch of such an accumulator class follows the figure)
[Figure: two concurrent iterations interleaving a.add(5), a.add(8), a.add(3), a.add(-4), and a.read() invocations on a shared accumulator in shared memory]
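A sketch of what such a library class might look like in Java (hypothetical names; the actual Galois interface may differ): the library writer supplies a semantic inverse for each forward method, plus a commutativity relation that the runtime consults before allowing invocations from different iterations to interleave.

    // Hypothetical accumulator carrying the metadata that abstraction-based
    // speculation needs: semantic inverses and a commutativity relation.
    class GaloisAccumulator {
        private int value = 0;

        public void add(int n) { value += n; }
        public int read()      { return value; }

        // Semantic inverse of add(n): it restores the abstract state
        // (the running sum), not necessarily the bit-level representation.
        public void subtract(int n) { value -= n; }

        // Commutativity relation: adds commute with adds, and reads commute
        // with reads; an add does not commute with a read, so interleaving
        // them across iterations triggers an abort.
        static boolean commutes(String m1, String m2) {
            return m1.equals(m2);   // "add"/"add" or "read"/"read"
        }
    }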
Abstraction-based Speculation
• Library writer:
  – specifies commutativity and inverse information for some classes
• Runtime system:
  – uses catch-and-release locking for these classes
  – keeps track of forward method invocations
  – checks commutativity of forward method invocations
  – invokes the appropriate inverse methods on abort
• More details: PLDI 2007
• Related work:
  – logical locks in database systems
  – Herlihy et al.: PPoPP 2008
Organization
• Case study of amorphous data-parallelism
  – Delaunay mesh refinement
• Galois system (PLDI 2007)
  – Programming model
  – Baseline implementation
• Galois system optimizations
  – Scheduling (SPAA 2008)
  – Data and computation partitioning (ASPLOS 2008)
• Experimental results
• Ongoing work
Scheduling iterators
• Control scheduling by changing the implementation of the work-set class
  – stack/queue/etc.
• Scheduling can have a profound effect on performance
• Example: Delaunay mesh refinement
  – 10,156 triangles, of which 4,837 were bad
  – sequential code, work-set implemented as a stack:
    • 21,918 iterations completed + 0 aborted
  – 4-processor Itanium 2, work-set implementations:
    • stack: 21,736 iterations completed + 28,290 aborted
    • array + random choice: 21,908 iterations completed + 49 aborted
Scheduling iterators (SPAA 2008)
• Crystalline data-parallelism: DO-ALL loops
  – the main scheduling concerns are locality and load balancing
  – OpenMP: static, dynamic, guided, etc.
• Amorphous data-parallelism: many more issues
  – conflicts
  – dynamically created work
  – algorithmic issues: efficiency of data structures
• SPAA 2008 paper:
  – a scheduling framework for exploiting amorphous data-parallelism
  – generalizes the OpenMP DO-ALL loop scheduling constructs (see the sketch below)
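For contrast with the stack-backed work-set sketched earlier, the "array + random choice" policy from the previous slide might look like the following in Java (a sketch under assumed names, not the framework's code; it is unsynchronized, whereas a real work-set must tolerate concurrent access):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Removing a random element scatters concurrent iterations across the
    // mesh, which is consistent with the abort counts on the previous slide
    // (49 aborts versus the stack's 28,290). Callers must check empty() first.
    class RandomChoiceWorkSet<T> {
        private final List<T> items = new ArrayList<>();
        private final Random rng = new Random();

        public void add(T item) { items.add(item); }

        public T get() {
            int last = items.size() - 1;
            int i = rng.nextInt(items.size());
            T chosen = items.get(i);
            items.set(i, items.get(last));  // move last element into the hole
            items.remove(last);             // O(1) removal from the tail
            return chosen;
        }

        public boolean empty() { return items.isEmpty(); }
    }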
Data Partitioning (ASPLOS 2008)
• Partition the graph between cores
• Data-centric assignment of work:
  – a core gets bad triangles from its own partitions
  – improves locality
  – can dramatically reduce conflicts
• Lock coarsening:
  – associate locks with partitions (sketched below)
• Over-decomposition:
  – improves core utilization
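A minimal sketch of partition-based lock coarsening in Java (illustrative; the partition assignment and names are assumptions, not the ASPLOS 2008 implementation): instead of locking individual mesh elements, an iteration locks the partition that owns the element, so one acquisition covers a whole region; over-decomposition provides several partitions per core so that cores stay busy even while some partitions are locked.

    import java.util.concurrent.locks.ReentrantLock;

    // One lock per partition rather than per element.
    class PartitionedLocks {
        private final ReentrantLock[] partitionLocks;

        PartitionedLocks(int numPartitions) {
            partitionLocks = new ReentrantLock[numPartitions];
            for (int i = 0; i < numPartitions; i++) {
                partitionLocks[i] = new ReentrantLock();
            }
        }

        // Assume each element carries the id of the partition that owns it
        // (e.g., assigned by the graph partitioner).
        ReentrantLock lockFor(int partitionId) {
            return partitionLocks[partitionId];
        }
    }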
Organization
• Case study of amorphous data-parallelism
  – Delaunay mesh refinement
• Galois system (PLDI 2007)
  – Programming model
  – Baseline implementation
• Galois system optimizations
  – Scheduling (SPAA 2008)
  – Data and computation partitioning (ASPLOS 2008)
• Experimental results
• Ongoing work
Small-scale multiprocessor results

• 4-processor Itanium 2
  – 16 KB L1, 256 KB L2, 3 MB L3 cache
• Versions:
  – GAL: using a stack as the work-list
  – PAR: partitioned mesh + data-centric work assignment
  – LCO: locks on partitions
  – OVD: over-decomposed version (factor of 4)
Large-scale multiprocessor results
• Maverick@TACC:
  – 128-core Sun Fire E25K, 1 GHz
  – 64 dual-core processors
  – Sun Solaris
• First "out-of-the-box" results
• Speed-up of 20 on 32 cores for refinement
• Mesh partitioning is still sequential:
  – time for mesh partitioning starts to dominate after 8 processors (32 partitions)
• Need parallel mesh partitioning:
  – ParMetis (Karypis et al.)
Galois version of mesh refinement (revisited)

[This slide repeats the mesh-refinement code shown earlier; the accompanying figure shows the unordered Set iterator now running over a partitioned work-set and a partitioned graph, both managed by the Galois runtime system.]
Results for BK Image Segmentation
• Versions:
  – GAL: standard Galois version
  – PAR: partitioned graph
  – LCO: locks on partitions
  – OVD: over-decomposed version
Related work
• Transactions:
  – the programming model is explicitly parallel
  – assumes someone else is responsible for parallelism, locality, load balancing, and scheduling, and focuses only on synchronization
  – Galois: the main concerns are parallelism, locality, load balancing, and scheduling
  – "catch and keep" classes can use TM for rollback, but this is probably overkill
• Thread-level speculation:
  – not clear where to speculate in C programs
    • wastes power in useless speculation
  – many schemes require extensive hardware support
  – no notion of abstraction-based speculation
  – no analogs of data partitioning or scheduling
  – overall results are disappointing
Ongoing work
• Case studies of irregular programs:
  – understand the "parallelism and locality patterns" of irregular programs
• Lonestar benchmark suite for irregular programs:
  – joint work with Calin Cascaval's group at IBM Yorktown Heights
• Optimizing the Galois runtime system:
  – improve performance for iterators in which the work per iteration is relatively low
• Compiler analysis to reduce the overheads of optimistic parallel execution
• Scalability studies:
  – larger numbers of cores
• Distributed-memory implementation:
  – billion-element meshes?
• Program analysis to verify assertions about class methods:
  – needs a semantic specification of the class
Summary

• Irregular applications have amorphous data-parallelism:
  – work-list-based iterative algorithms over unordered and ordered sets
• Amorphous data-parallelism may be inherently data-dependent:
  – pointer/shape analysis cannot work for these applications
• Optimistic parallelization is essential for such applications:
  – analysis might still be useful to optimize parallel program execution
• Exploiting abstractions and high-level semantics is critical:
  – Galois knows about sets, ordered sets, accumulators, …
• The Galois approach provides a unified view of data-parallelism in regular and irregular programs:
  – the baseline is optimistic parallelism
  – compiler analysis makes decisions at compile time whenever possible