The Galois Project
Keshav Pingali, University of Texas at Austin
Joint work with Milind Kulkarni, Martin Burtscher, Patrick Carribault, Donald Nguyen, Dimitrios Prountzos, Zifei Zhong
Overview of Galois Project
• Focus of the Galois project:
  – parallel execution of irregular programs
    • pointer-based data structures like graphs and trees
  – raise the abstraction level for "Joe programmers"
    • explicit parallelism is too difficult for most programmers
    • the performance penalty for abstraction should be small
• Research approach:
  a) study algorithms to find common patterns of parallelism and locality
     – Amorphous Data-Parallelism (ADP)
  b) design programming constructs for expressing these patterns
  c) implement these constructs efficiently
     – Abstraction-Based Speculation (ABS)
• For more information:
  – papers in PLDI 2007, ASPLOS 2008, SPAA 2008
  – website: http://iss.ices.utexas.edu
Organization
• Case study of amorphous data-parallelism
  – Delaunay mesh refinement
• Galois system (PLDI 2007)
  – Programming model
  – Baseline implementation
• Galois system optimizations
  – Scheduling (SPAA 2008)
  – Data and computation partitioning (ASPLOS 2008)
• Experimental results
• Ongoing work
Delaunay Mesh Refinement
• Iterative refinement to remove badly shaped triangles:
    while there are bad triangles do {
        Pick a bad triangle;
        Find its cavity;
        Retriangulate cavity;   // may create new bad triangles
    }
• Order in which bad triangles are refined:
  – the final mesh depends on the order in which bad triangles are processed
  – but all bad triangles are ultimately eliminated regardless of order
    Mesh m = /* read in mesh */;
    WorkList wl;
    wl.add(mesh.badTriangles());
    while (true) {
        if (wl.empty()) break;
        Triangle e = wl.get();
        if (e no longer in mesh) continue;   // an earlier retriangulation may have removed e
        Cavity c = new Cavity(e);
        c.expand();          // determine cavity
        c.retriangulate();   // re-triangulate cavity
        m.update(c);         // update mesh
        wl.add(c.badTriangles());            // may add new bad triangles
    }
Delaunay Mesh Refinement

• Parallelism:
  – triangles with non-overlapping cavities can be processed in parallel
  – if the cavities of two triangles overlap, they must be processed serially
  – in practice, there is lots of parallelism
• Exploiting this parallelism:
  – compile-time parallelization techniques like points-to and shape analysis cannot expose this parallelism (it is a property of the algorithm, not the program)
  – runtime dependence checking is needed
• Galois approach: optimistic parallelization
Take-away lessons
• Amorphous data-parallelism:
  – data structures: graphs, trees, etc.
  – iterative algorithm over an unordered or ordered work-list
  – elements can be added to the work-list during the computation
  – complex patterns of dependences between computations on different work-list elements (possibly input-sensitive)
  – but many of these computations can be done in parallel
• Contrast: crystalline (regular) data-parallelism:
  – data structures: dense matrices
  – iterative algorithm over a fixed integer interval
  – simple dependence patterns: affine subscripts in array accesses (mostly input-insensitive)

    for i = 1, N
      for j = 1, N
        for k = 1, N
          C[i,j] = C[i,j] + A[i,k]*B[k,j]
Take-away lessons (contd.)
• Amorphous data-parallelism is ubiquitous:
  – Delaunay mesh generation: points to be inserted into the mesh
  – Delaunay mesh refinement: list of bad triangles
  – agglomerative clustering: priority queue of points from the data-set
  – Boykov-Kolmogorov algorithm for image segmentation
  – reduction-based interpreters for the λ-calculus: list of redexes
  – iterative dataflow analysis algorithms in compilers
  – approximate SAT solvers: survey propagation, WalkSAT
  – …
Take-away lessons (contd.)
• Amorphous data-parallelism is obscured within while loops, exit conditions, etc. in conventional languages
  – we need a transparent syntax, analogous to FOR loops for crystalline data-parallelism
• Optimistic parallelization is necessary in general:
  – compile-time approaches using points-to analysis or shape analysis may be adequate for some cases
  – in general, runtime dependence checking is needed
  – this is a property of algorithms, not programs
• Handling of dependence conflicts depends on the application:
  – Delaunay mesh generation: roll back any conflicting computation
  – agglomerative clustering: must respect the priority-queue order
Organization
• Case study of amorphous data-parallelism
  – Delaunay mesh refinement
• Galois system (PLDI 2007)
  – Programming model
  – Baseline implementation
• Galois system optimizations
  – Scheduling (SPAA 2008)
  – Data and computation partitioning (ASPLOS 2008)
• Experimental results
• Ongoing work
Galois Design Philosophy

• Do not worry about "dusty decks" (for now):
  – restructuring existing code to expose amorphous data-parallelism is not our focus (cf. Google map/reduce)
• Evolution, not revolution:
  – modification of existing programming paradigms: OK
  – radical solutions like functional programming: not OK
• No reliance on parallelizing compiler technology:
  – it will not work for many of our applications anyway
  – parallelizing compilers are very complex software artifacts
• Support two classes of programmers:
  – domain experts (Joe):
    • should be shielded from the complexities of parallel programming
    • most programmers will be Joes
  – parallel programming experts (Steve):
    • a small number of highly trained people
  – analogs:
    • the industry model even for sequential programming
    • the norm in domains like numerical linear algebra:
      – Steves implement BLAS libraries
      – Joes express their algorithms in terms of BLAS routines
Galois system
• Application program:
  – has well-defined sequential semantics
    • current implementation: sequential Java
  – uses optimistic iterators to highlight, for the runtime system, opportunities for exploiting parallelism
• Class libraries:
  – like the Java collections library, but with additional information for concurrency control
• Runtime system:
  – manages optimistic parallelism
Optimistic set iterators
• for each e in Set S do B(e)
  – evaluate block B(e) for each element in set S
  – sequential semantics:
    • set elements are unordered, so there is no a priori order on iterations
    • there may be dependences between iterations
  – set S may get new elements during execution
• for each e in OrderedSet S do B(e)
  – evaluate block B(e) for each element in set S
  – sequential semantics:
    • perform iterations in the order specified by the OrderedSet
    • there may be dependences between iterations
  – set S may get new elements during execution
Galois version of mesh refinement
    Mesh m = /* read in mesh */;
    Set wl;
    wl.add(mesh.badTriangles());   // initialize the Set wl

    for each e in Set wl do {      // unordered Set iterator
        if (e no longer in mesh) continue;
        Cavity c = new Cavity(e);
        c.expand();
        c.retriangulate();
        m.update(c);
        wl.add(c.badTriangles());  // add new bad triangles to Set
    }

• Scheduling policy for the iterator:
  – controlled by the implementation of the Set class
  – a good choice for temporal locality: stack (see the sketch below)
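Since the schedule is determined entirely by the work-set's data structure, it can be changed without touching the loop body. The following is a minimal sketch of a stack-backed work-set in Java (illustrative only; the names are assumptions, not the actual Galois library API):

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical work-set whose removal order defines the iterator's
    // schedule. A LIFO discipline favors temporal locality: newly created
    // bad triangles touch mesh regions that are likely still in cache.
    class StackWorkSet<T> {
        private final Deque<T> items = new ArrayDeque<>();

        public void add(T item)  { items.push(item); }
        public T get()           { return items.pop(); }  // LIFO order
        public boolean empty()   { return items.isEmpty(); }
    }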
Parallel execution model

[Figure: master thread executing main() with its iterators; program threads operate on objects in shared memory]

• Object-based shared-memory model
• Master thread and some number of worker threads:
  – the master thread begins execution of the program and executes the code between iterators
  – when it encounters an iterator, worker threads help by executing iterations concurrently with the master
  – threads synchronize by barrier synchronization at the end of the iterator
• Threads invoke methods to access the internal state of objects
  – how do we ensure that the sequential semantics of the program are respected?
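To make the model concrete, here is a heavily simplified sketch of the master/worker structure in Java (an illustration under assumed names, not the Galois runtime; it omits the locking, conflict detection, and rollback that the following slides add):

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.function.Consumer;

    // When the master thread reaches an iterator, worker threads drain the
    // work-set concurrently and synchronize at a barrier (awaitTermination)
    // at the end of the iterator. Caveat: items added after a worker sees an
    // empty set may be missed in this sketch; the real runtime handles
    // dynamically created work.
    class IteratorModel {
        static <T> void forEach(Queue<T> workSet, Consumer<T> body, int nThreads)
                throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(nThreads);
            for (int t = 0; t < nThreads; t++) {
                pool.submit(() -> {
                    T item;
                    while ((item = workSet.poll()) != null) {
                        body.accept(item);   // one iteration of the loop body
                    }
                });
            }
            pool.shutdown();                          // no further tasks
            pool.awaitTermination(1, TimeUnit.DAYS);  // barrier at iterator end
        }
        // Usage, e.g.:
        // IteratorModel.forEach(new ConcurrentLinkedQueue<>(badTriangles),
        //                       t -> refine(t), 4);
    }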
Baseline solution: PLDI 2007
• An iteration must lock an object in order to invoke a method on it
• Two types of objects:
  – catch-and-keep policy:
    • the lock is held even after the method invocation completes
    • all locks are released at the end of the iteration
    • poor performance for programs with collections and accumulators
  – catch-and-release policy:
    • like the Java locking policy
    • permits method invocations from different concurrent iterations to be interleaved
    • how do we make sure this is safe?
Catch and keep: iteration rollback

• What does iteration j do if an object is already locked by some other iteration i?
  – one possibility: wait and try to acquire the lock again
    • but this might lead to deadlock
  – our implementation: the runtime system rolls back one of the iterations by undoing its updates to shared objects
• Undoing updates: any "copy-on-write" solution works
  – make a copy of the entire object when you acquire the lock (wasteful for large objects)
  – the runtime system maintains an "undo log" that holds the information needed to undo side-effects to objects as they are made (cf. software transactional memory; a sketch follows)
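A minimal sketch of the undo-log idea in Java (names and structure are assumptions for illustration, not the Galois implementation): before each destructive update, the iteration logs a closure that reverses the update; on abort, the log is replayed in reverse order.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Illustrative undo log: an iteration records an inverse action before
    // every side-effect; abort() replays the inverses in LIFO order.
    class UndoLog {
        private final Deque<Runnable> inverses = new ArrayDeque<>();

        // Record how to undo a side-effect just before performing it,
        // e.g. before mesh.remove(t): log(() -> mesh.add(t));
        public void log(Runnable inverse) { inverses.push(inverse); }

        // Roll back the iteration: undo its side-effects in reverse order.
        public void abort() {
            while (!inverses.isEmpty()) {
                inverses.pop().run();
            }
        }

        // Iteration committed: discard the log.
        public void commit() { inverses.clear(); }
    }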
Problem with catch and keep
• Poor performance for programs that deal with mutable collections and accumulators:
  – work-sets are mutable collections
  – accumulators are ubiquitous
• Example: Delaunay refinement
  – the work-set is a (mutable) collection of bad triangles
  – some thread grabs the lock on the work-set object, gets a bad triangle, and removes it from the work-set
  – that thread must retain the lock on the work-set until its iteration completes, which shuts out all other threads (the same problem arises with transactional memory)
• Lesson:
  – for some objects, we need to interleave method invocations from different iterations
  – but we must not lose serializability
Galois solution: selective catch and release

• Example: accumulator
  – two methods:
    • add(int)
    • read()  // return value
  – adds commute with other adds, and reads commute with other reads
• Interleaving of commuting method invocations from different iterations is OK
• Interleaving of non-commuting method invocations from different iterations triggers an abort
• Rolling back side-effects: the programmer must provide "inverse" methods for the forward methods
  – the inverse method for add(n) is subtract(n)
  – a semantic inverse, not a representational inverse
• This solution works for sets as well (a sketch of such an accumulator class follows the figure)
[Figure: two concurrent iterations interleaving a.add(5), a.add(8), a.add(3), a.add(-4), and a.read() invocations on a shared accumulator in shared memory]
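A sketch of what such a library class might look like in Java (hypothetical names; the actual Galois interface may differ): the library writer supplies a semantic inverse for each forward method, plus a commutativity relation that the runtime consults before allowing invocations from different iterations to interleave.

    // Hypothetical accumulator carrying the metadata that abstraction-based
    // speculation needs: semantic inverses and a commutativity relation.
    class GaloisAccumulator {
        private int value = 0;

        public void add(int n) { value += n; }
        public int read()      { return value; }

        // Semantic inverse of add(n): it restores the abstract state
        // (the running sum), not necessarily the bit-level representation.
        public void subtract(int n) { value -= n; }

        // Commutativity relation: adds commute with adds, and reads commute
        // with reads; an add does not commute with a read, so interleaving
        // them across iterations triggers an abort.
        static boolean commutes(String m1, String m2) {
            return m1.equals(m2);   // "add"/"add" or "read"/"read"
        }
    }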
Abstraction-based Speculation
• Library writer:
  – specifies commutativity and inverse information for some classes
• Runtime system:
  – uses catch-and-release locking for these classes
  – keeps track of forward method invocations
  – checks commutativity of forward method invocations
  – invokes the appropriate inverse methods on abort
• More details: PLDI 2007
• Related work:
  – logical locks in database systems
  – Herlihy et al.: PPoPP 2008
Organization
• Case study of amorphous data-parallelism
  – Delaunay mesh refinement
• Galois system (PLDI 2007)
  – Programming model
  – Baseline implementation
• Galois system optimizations
  – Scheduling (SPAA 2008)
  – Data and computation partitioning (ASPLOS 2008)
• Experimental results
• Ongoing work
Scheduling iterators
• Control scheduling by changing the implementation of the work-set class
  – stack/queue/etc.
• Scheduling can have a profound effect on performance
• Example: Delaunay mesh refinement
  – 10,156 triangles, of which 4,837 were bad
  – sequential code, work-set implemented as a stack:
    • 21,918 iterations completed + 0 aborted
  – 4-processor Itanium 2, work-set implementations:
    • stack: 21,736 iterations completed + 28,290 aborted
    • array + random choice: 21,908 iterations completed + 49 aborted
Scheduling iterators (SPAA 2008)
• Crystalline data-parallelism: DO-ALL loops
  – the main scheduling concerns are locality and load balancing
  – OpenMP: static, dynamic, guided, etc.
• Amorphous data-parallelism: many more issues
  – conflicts
  – dynamically created work
  – algorithmic issues: efficiency of data structures
• SPAA 2008 paper:
  – a scheduling framework for exploiting amorphous data-parallelism
  – generalizes the OpenMP DO-ALL loop scheduling constructs (see the sketch below)
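For contrast with the stack-backed work-set sketched earlier, the "array + random choice" policy from the previous slide might look like the following in Java (a sketch under assumed names, not the framework's code; it is unsynchronized, whereas a real work-set must tolerate concurrent access):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Removing a random element scatters concurrent iterations across the
    // mesh, which is consistent with the abort counts on the previous slide
    // (49 aborts versus the stack's 28,290). Callers must check empty() first.
    class RandomChoiceWorkSet<T> {
        private final List<T> items = new ArrayList<>();
        private final Random rng = new Random();

        public void add(T item) { items.add(item); }

        public T get() {
            int last = items.size() - 1;
            int i = rng.nextInt(items.size());
            T chosen = items.get(i);
            items.set(i, items.get(last));  // move last element into the hole
            items.remove(last);             // O(1) removal from the tail
            return chosen;
        }

        public boolean empty() { return items.isEmpty(); }
    }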
Data Partitioning (ASPLOS 2008)
• Partition the graph between cores
• Data-centric assignment of work:
  – a core gets bad triangles from its own partitions
  – improves locality
  – can dramatically reduce conflicts
• Lock coarsening:
  – associate locks with partitions (sketched below)
• Over-decomposition:
  – improves core utilization
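A minimal sketch of partition-based lock coarsening in Java (illustrative; the partition assignment and names are assumptions, not the ASPLOS 2008 implementation): instead of locking individual mesh elements, an iteration locks the partition that owns the element, so one acquisition covers a whole region; over-decomposition provides several partitions per core so that cores stay busy even while some partitions are locked.

    import java.util.concurrent.locks.ReentrantLock;

    // One lock per partition rather than per element.
    class PartitionedLocks {
        private final ReentrantLock[] partitionLocks;

        PartitionedLocks(int numPartitions) {
            partitionLocks = new ReentrantLock[numPartitions];
            for (int i = 0; i < numPartitions; i++) {
                partitionLocks[i] = new ReentrantLock();
            }
        }

        // Assume each element carries the id of the partition that owns it
        // (e.g., assigned by the graph partitioner).
        ReentrantLock lockFor(int partitionId) {
            return partitionLocks[partitionId];
        }
    }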
Organization
• Case study of amorphous data-parallelism
  – Delaunay mesh refinement
• Galois system (PLDI 2007)
  – Programming model
  – Baseline implementation
• Galois system optimizations
  – Scheduling (SPAA 2008)
  – Data and computation partitioning (ASPLOS 2008)
• Experimental results
• Ongoing work
Small-scale multiprocessor results

• 4-processor Itanium 2
  – 16 KB L1, 256 KB L2, 3 MB L3 cache
• Versions:
  – GAL: using a stack as the work-list
  – PAR: partitioned mesh + data-centric work assignment
  – LCO: locks on partitions
  – OVD: over-decomposed version (factor of 4)
Large-scale multiprocessor results
• Maverick@TACC:
  – 128-core Sun Fire E25K, 1 GHz
  – 64 dual-core processors
  – Sun Solaris
• First "out-of-the-box" results
• Speed-up of 20 on 32 cores for refinement
• Mesh partitioning is still sequential:
  – time for mesh partitioning starts to dominate after 8 processors (32 partitions)
• Need parallel mesh partitioning:
  – ParMetis (Karypis et al.)
Galois version of mesh refinement (revisited)

[This slide repeats the mesh-refinement code shown earlier; the accompanying figure shows the unordered Set iterator now running over a partitioned work-set and a partitioned graph, both managed by the Galois runtime system.]
Results for BK Image Segmentation
• Versions:
  – GAL: standard Galois version
  – PAR: partitioned graph
  – LCO: locks on partitions
  – OVD: over-decomposed version
Related work
• Transactions:
  – the programming model is explicitly parallel
  – assumes someone else is responsible for parallelism, locality, load balancing, and scheduling, and focuses only on synchronization
  – Galois: the main concerns are parallelism, locality, load balancing, and scheduling
  – "catch and keep" classes can use TM for rollback, but this is probably overkill
• Thread-level speculation:
  – not clear where to speculate in C programs
    • wastes power in useless speculation
  – many schemes require extensive hardware support
  – no notion of abstraction-based speculation
  – no analogs of data partitioning or scheduling
  – overall results are disappointing
Ongoing work
• Case studies of irregular programs:
  – understand the "parallelism and locality patterns" of irregular programs
• Lonestar benchmark suite for irregular programs:
  – joint work with Calin Cascaval's group at IBM Yorktown Heights
• Optimizing the Galois runtime system:
  – improve performance for iterators in which the work per iteration is relatively low
• Compiler analysis to reduce the overheads of optimistic parallel execution
• Scalability studies:
  – larger numbers of cores
• Distributed-memory implementation:
  – billion-element meshes?
• Program analysis to verify assertions about class methods:
  – needs a semantic specification of the class
Summary

• Irregular applications have amorphous data-parallelism:
  – work-list-based iterative algorithms over unordered and ordered sets
• Amorphous data-parallelism may be inherently data-dependent:
  – pointer/shape analysis cannot work for these applications
• Optimistic parallelization is essential for such applications:
  – analysis might still be useful to optimize parallel program execution
• Exploiting abstractions and high-level semantics is critical:
  – Galois knows about sets, ordered sets, accumulators, …
• The Galois approach provides a unified view of data-parallelism in regular and irregular programs:
  – the baseline is optimistic parallelism
  – compiler analysis makes decisions at compile time whenever possible