
Page 1:

The Galois Project

Keshav Pingali, University of Texas at Austin

Joint work with Milind Kulkarni, Martin Burtscher, Patrick Carribault, Donald Nguyen, Dimitrios Prountzos, and Zifei Zhong

Page 2:

Overview of Galois Project

• Focus of the Galois project:
  – parallel execution of irregular programs
    • pointer-based data structures like graphs and trees
  – raise the abstraction level for “Joe programmers”
    • explicit parallelism is too difficult for most programmers
    • the performance penalty for abstraction should be small

• Research approach:
  a) study algorithms to find common patterns of parallelism and locality
     – Amorphous Data-Parallelism (ADP)
  b) design programming constructs for expressing these patterns
  c) implement these constructs efficiently
     – Abstraction-Based Speculation (ABS)

• For more information:
  – papers in PLDI 2007, ASPLOS 2008, SPAA 2008
  – website: http://iss.ices.utexas.edu

Page 3:

Organization

• Case study of amorphous data-parallelism
  – Delaunay mesh refinement
• Galois system (PLDI 2007)
  – Programming model
  – Baseline implementation
• Galois system optimizations
  – Scheduling (SPAA 2008)
  – Data and computation partitioning (ASPLOS 2008)
• Experimental results
• Ongoing work

Page 4:

Delaunay Mesh Refinement

• Iterative refinement to remove badly shaped triangles:

while there are bad triangles do {
  pick a bad triangle;
  find its cavity;
  retriangulate the cavity; // may create new bad triangles
}

• Order in which bad triangles should be refined:
  – the final mesh depends on the order in which bad triangles are processed
  – but all bad triangles will ultimately be eliminated, regardless of order

Mesh m = /* read in mesh */;
WorkList wl;
wl.add(m.badTriangles());
while (!wl.empty()) {
  Triangle e = wl.get();
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e);
  c.expand();               // determine cavity
  c.retriangulate();        // re-triangulate cavity
  m.update(c);              // update mesh
  wl.add(c.badTriangles()); // add new bad triangles
}

Page 5:

Delaunay Mesh Refinement

• Parallelism:
  – triangles with non-overlapping cavities can be processed in parallel
  – if the cavities of two triangles overlap, they must be processed serially
  – in practice, there is plenty of parallelism
• Exploiting this parallelism:
  – compile-time parallelization techniques like points-to and shape analysis cannot expose this parallelism (it is a property of the algorithm, not the program)
  – runtime dependence checking is needed

• Galois approach: optimistic parallelization

Page 6:

Take-away lessons

• Amorphous data-parallelism
  – data structures: graphs, trees, etc.
  – iterative algorithm over an unordered or ordered work-list
  – elements can be added to the work-list during the computation
  – complex patterns of dependences between computations on different work-list elements (possibly input-sensitive)
  – but many of these computations can be done in parallel

• Contrast: crystalline (regular) data-parallelism
  – data structures: dense matrices
  – iterative algorithm over a fixed integer interval
  – simple dependence patterns: affine subscripts in array accesses (mostly input-insensitive)

for i = 1, N
  for j = 1, N
    for k = 1, N
      C[i,j] = C[i,j] + A[i,k]*B[k,j]

Page 7:

Take-away lessons (contd.)

• Amorphous data-parallelism is ubiquitous
  – Delaunay mesh generation: points to be inserted into the mesh
  – Delaunay mesh refinement: list of bad triangles
  – Agglomerative clustering: priority queue of points from the data-set
  – Boykov-Kolmogorov algorithm for image segmentation
  – Reduction-based interpreters for the λ-calculus: list of redexes
  – Iterative dataflow analysis algorithms in compilers
  – Approximate SAT solvers: survey propagation, WalkSAT
  – …

Page 8:

Take-away lessons (contd.)

• Amorphous data-parallelism is obscured within while loops, exit conditions, etc. in conventional languages
  – need a transparent syntax, analogous to FOR loops for crystalline data-parallelism
• Optimistic parallelization is necessary in general
  – compile-time approaches using points-to analysis or shape analysis may be adequate for some cases
  – in general, runtime dependence checking is needed
  – this is a property of algorithms, not programs
• Handling of dependence conflicts depends on the application
  – Delaunay mesh generation: roll back any conflicting computation
  – Agglomerative clustering: must respect priority queue order

Page 9:

Organization

• Case study of amorphous data-parallelism
  – Delaunay mesh refinement
• Galois system (PLDI 2007)
  – Programming model
  – Baseline implementation
• Galois system optimizations
  – Scheduling (SPAA 2008)
  – Data and computation partitioning (ASPLOS 2008)
• Experimental results
• Ongoing work

Page 10:

Galois Design Philosophy

• Do not worry about “dusty decks” (for now)
  – restructuring existing code to expose amorphous data-parallelism: not our focus (cf. Google map/reduce)
• Evolution, not revolution
  – modification of existing programming paradigms: OK
  – radical solutions like functional programming: not OK
• No reliance on parallelizing compiler technology
  – will not work for many of our applications anyway
  – parallelizing compilers are very complex software artifacts
• Support two classes of programmers:
  – domain experts (Joe):
    • should be shielded from the complexities of parallel programming
    • most programmers will be Joes
  – parallel programming experts (Steve):
    • a small number of highly trained people
  – analogs:
    • the industry model even for sequential programming
    • the norm in domains like numerical linear algebra
      – Steves implement the BLAS libraries
      – Joes express their algorithms in terms of BLAS routines

Page 11:

Galois system

• Application program
  – has well-defined sequential semantics
    • current implementation: sequential Java
  – uses optimistic iterators to highlight, for the runtime system, opportunities for exploiting parallelism
• Class libraries
  – like the Java collections library, but with additional information for concurrency control
• Runtime system
  – manages optimistic parallelism

Page 12:

Optimistic set iterators

• for each e in Set S do B(e)
  – evaluate block B(e) for each element in set S
  – sequential semantics
    • set elements are unordered, so there is no a priori order on iterations
    • there may be dependences between iterations
  – set S may get new elements during execution
• for each e in OrderedSet S do B(e)
  – evaluate block B(e) for each element in set S
  – sequential semantics
    • perform iterations in the order specified by the OrderedSet
    • there may be dependences between iterations
  – set S may get new elements during execution

Page 13:

Galois version of mesh refinement

Mesh m = /* read in mesh */;
Set wl;
wl.add(m.badTriangles()); // initialize the Set wl
for each e in Set wl do {  // unordered Set iterator
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e);
  c.expand();
  c.retriangulate();
  m.update(c);
  wl.add(c.badTriangles()); // add new bad triangles to Set
}

• Scheduling policy for the iterator:
  – controlled by the implementation of the Set class
  – a good choice for temporal locality: stack

Page 14:

Parallel execution model

[Figure: the master thread executes main(); at each “for each” iterator, worker threads execute iterations concurrently; all threads operate on shared-memory objects]

• Object-based shared-memory model
• Master thread and some number of worker threads (a minimal sketch of this model follows below)
  – the master thread begins execution of the program and executes the code between iterators
  – when it encounters an iterator, worker threads help by executing iterations concurrently with the master
  – threads synchronize by barrier synchronization at the end of the iterator
• Threads invoke methods to access the internal state of objects
  – how do we ensure that the sequential semantics of the program are respected?
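A minimal sketch of this master/worker model, written with plain java.util.concurrent primitives; the class and method names are illustrative (they are not the actual Galois API), and conflict detection, rollback, and work added mid-drain are omitted.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

class IteratorSketch {
    // Run body.accept(e) for every element drained from the work-list.
    // The master thread calls this, helps drain the list itself, and
    // returns only when all workers are done: the barrier at the end
    // of the iterator.
    static <E> void forEach(Queue<E> workList, Consumer<E> body,
                            int nWorkers) throws InterruptedException {
        Runnable drain = () -> {
            E e;
            while ((e = workList.poll()) != null)
                body.accept(e); // conflict detection/rollback omitted
        };
        Thread[] workers = new Thread[nWorkers];
        for (int i = 0; i < nWorkers; i++) {
            workers[i] = new Thread(drain);
            workers[i].start();
        }
        drain.run();              // the master thread helps too
        for (Thread w : workers)  // barrier: wait for every worker
            w.join();
    }

    public static void main(String[] args) throws InterruptedException {
        Queue<Integer> wl = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 20; i++) wl.add(i);
        forEach(wl, e -> System.out.println("processed " + e), 3);
    }
}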

Page 15:

Baseline solution: PLDI 2007

• An iteration must lock an object in order to invoke a method on it
• Two types of objects:
  – catch-and-keep policy
    • the lock is held even after the method invocation completes
    • all locks are released at the end of the iteration
    • poor performance for programs with collections and accumulators
  – catch-and-release policy
    • like the Java locking policy
    • permits method invocations from different concurrent iterations to be interleaved
    • how do we make sure this is safe?

[Figure: timeline of iterations i and j acquiring locks on shared objects 1, 2, 3]

Page 16:

Catch and keep: iteration rollback

• What does iteration j do if an object is already locked by some other iteration i?
  – one possibility: wait and try to acquire the lock again
    • but this might lead to deadlock
  – our implementation: the runtime system rolls back one of the iterations by undoing its updates to shared objects
  – undoing updates: any “copy-on-write” solution works
    • make a copy of the entire object when you acquire the lock (wasteful for large objects)
    • the runtime system maintains an “undo log” that holds the information needed to undo side-effects to objects as they are made (cf. software transactional memory); a sketch follows below

[Figure: iteration j encounters an object locked by iteration i; the runtime rolls one of them back]
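A minimal sketch of the undo-log idea, assuming every side-effect can be paired with a closure that reverses it; illustrative, not the actual Galois runtime data structure.

import java.util.ArrayDeque;
import java.util.Deque;

class UndoLog {
    private final Deque<Runnable> inverses = new ArrayDeque<>();

    // Record how to undo a side-effect at the moment it is made, e.g.:
    //   int old = node.val; node.val = x; log.record(() -> node.val = old);
    // (node and val are hypothetical names, for illustration only)
    void record(Runnable inverseAction) {
        inverses.push(inverseAction);
    }

    // Abort: undo this iteration's side-effects in LIFO order.
    void abort() {
        while (!inverses.isEmpty())
            inverses.pop().run();
    }

    // Commit: the iteration finished without conflict; discard the log.
    void commit() {
        inverses.clear();
    }
}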

Page 17:

Problem with catch and keep

• Poor performance for programs that deal with mutable collections and accumulators
  – work-sets are mutable collections
  – accumulators are ubiquitous
• Example: Delaunay refinement
  – the work-set is a (mutable) collection of bad triangles
  – some thread grabs the lock on the work-set object, gets a bad triangle, and removes it from the work-set
  – that thread must retain the lock on the work-set until its iteration completes, which shuts out all other threads (the same problem arises with transactional memory)
• Lesson:
  – for some objects, we need to interleave method invocations from different iterations
  – but we must not lose serializability

[Figure: iterations i and j serialized on the shared work-set object]

Page 18:

Galois solution: selective catch and release

• Example: accumulator
  – two methods:
    • add(int)
    • read() // returns the value
  – adds commute with other adds, and reads commute with other reads
• Interleaving of commuting method invocations from different iterations is OK
• Interleaving of non-commuting method invocations from different iterations triggers an abort
• Rolling back side-effects: the programmer must provide “inverse” methods for the forward methods (a sketch of such a class follows below)
  – the inverse method for add(n) is subtract(n)
  – a semantic inverse, not a representational inverse
• This solution works for sets as well.

[Figure: invocations a.add(5), a.add(8), a.add(3), a.add(-4) from two iterations interleaved on a shared accumulator; the adds commute with each other, while a.read() invocations do not commute with them]
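A minimal sketch of such an accumulator class, assuming the commutativity conditions stated on this slide; the static commutes() helper illustrates the kind of specification a library writer supplies and is not the actual Galois interface.

class Accumulator {
    private int value = 0;

    void add(int n)      { value += n; }  // forward method
    void subtract(int n) { value -= n; }  // semantic inverse of add(n)
    int  read()          { return value; }

    // Adds commute with adds (whatever the arguments) and reads commute
    // with reads; add and read do not commute. Equality of method names
    // is a deliberately conservative encoding of that rule.
    static boolean commutes(String m1, String m2) {
        return m1.equals(m2);
    }
}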

Page 19:

Abstraction-based Speculation

• Library writer:
  – specifies commutativity and inverse information for some classes
• Runtime system:
  – uses catch-and-release locking for these classes
  – keeps track of forward method invocations
  – checks commutativity of forward method invocations (a sketch of this check follows below)
  – invokes the appropriate inverse methods on abort
• More details: PLDI 2007
• Related work:
  – logical locks in database systems
  – Herlihy et al.: PPoPP 2008
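A minimal sketch of the bookkeeping this implies, assuming each shared object tracks the outstanding forward invocations of uncommitted iterations and consults a commutativity predicate supplied by the library writer; the names and structure are illustrative, not the actual Galois runtime. For the accumulator above, the predicate could be (a, b) -> a.equals(b).

import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

class CommutativityLock {
    record Invocation(int iterationId, String method) {}

    private final BiPredicate<String, String> commutes; // library-writer spec
    private final List<Invocation> outstanding = new ArrayList<>();

    CommutativityLock(BiPredicate<String, String> commutes) {
        this.commutes = commutes;
    }

    // Returns false on a conflict; the runtime must then abort one of
    // the iterations and replay its inverse methods from the undo log.
    synchronized boolean tryInvoke(int iterationId, String method) {
        for (Invocation inv : outstanding)
            if (inv.iterationId() != iterationId
                    && !commutes.test(inv.method(), method))
                return false;
        outstanding.add(new Invocation(iterationId, method));
        return true;
    }

    // Called when an iteration commits or has been rolled back.
    synchronized void release(int iterationId) {
        outstanding.removeIf(inv -> inv.iterationId() == iterationId);
    }
}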

Page 20:

Organization

• Case study of amorphous data-parallelism
  – Delaunay mesh refinement
• Galois system (PLDI 2007)
  – Programming model
  – Baseline implementation
• Galois system optimizations
  – Scheduling (SPAA 2008)
  – Data and computation partitioning (ASPLOS 2008)
• Experimental results
• Ongoing work

Page 21:

Scheduling iterators

• Control scheduling by changing the implementation of the work-set class
  – stack/queue/etc.
• Scheduling can have a profound effect on performance
• Example: Delaunay mesh refinement
  – 10,156 triangles, of which 4,837 were bad
  – sequential code, work-set is a stack:
    • 21,918 completed iterations + 0 aborted
  – 4-processor Itanium 2, work-set implementations:
    • stack: 21,736 iterations completed + 28,290 aborted
    • array + random choice: 21,908 iterations completed + 49 aborted

Page 22:

Scheduling iterators (SPAA 2008)

• Crystalline data-parallelism: DO-ALL loops
  – the main scheduling concerns are locality and load-balancing
  – OpenMP: static, dynamic, guided, etc.
• Amorphous data-parallelism: many more issues
  – conflicts
  – dynamically created work
  – algorithmic issues: efficiency of data structures
• SPAA 2008 paper:
  – a scheduling framework for exploiting amorphous data-parallelism
  – generalizes the OpenMP DO-ALL loop scheduling constructs (a sketch of policy-swapping follows below)
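A minimal sketch of the point that the schedule is whatever the work-set implementation makes it, assuming a single-threaded view (synchronization omitted); the interface and class names are illustrative, not the actual Galois library.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Random;

interface WorkSet<E> {
    void add(E e);
    E poll(); // null when empty
}

// LIFO policy: good temporal locality, but neighboring bad triangles
// are processed together, so concurrent iterations conflict often.
class StackWorkSet<E> implements WorkSet<E> {
    private final ArrayDeque<E> stack = new ArrayDeque<>();
    public void add(E e) { stack.push(e); }
    public E poll() { return stack.poll(); }
}

// Random policy: spreads concurrent iterations across the mesh; this
// corresponds to the "array + random choice" variant in the abort
// numbers on the previous slide.
class RandomWorkSet<E> implements WorkSet<E> {
    private final ArrayList<E> items = new ArrayList<>();
    private final Random rng = new Random();
    public void add(E e) { items.add(e); }
    public E poll() {
        if (items.isEmpty()) return null;
        int i = rng.nextInt(items.size());
        E e = items.get(i);
        items.set(i, items.get(items.size() - 1)); // swap with last,
        items.remove(items.size() - 1);            // then remove: O(1)
        return e;
    }
}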

Page 23:

Data Partitioning (ASPLOS 2008)

• Partition the graph between cores
• Data-centric assignment of work:
  – a core gets bad triangles from its own partitions
  – improves locality
  – can dramatically reduce conflicts
• Lock coarsening:
  – associate locks with partitions (sketched below)
• Over-decomposition:
  – improves core utilization

[Figure: partitioned mesh with partitions mapped to cores]
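A minimal sketch of partition-based lock coarsening, assuming elements carry integer ids and a precomputed element-to-partition map; illustrative, not the actual Galois implementation.

import java.util.concurrent.locks.ReentrantLock;

class PartitionLocks {
    private final ReentrantLock[] locks;
    private final int[] partitionOf; // element id -> partition id

    PartitionLocks(int[] partitionOf, int nPartitions) {
        this.partitionOf = partitionOf;
        locks = new ReentrantLock[nPartitions];
        for (int p = 0; p < nPartitions; p++)
            locks[p] = new ReentrantLock();
    }

    // Non-blocking acquire, so the runtime can abort/retry on conflict;
    // a reentrant lock makes repeated acquires for elements of the same
    // partition within one iteration cheap.
    boolean tryLock(int element) {
        return locks[partitionOf[element]].tryLock();
    }

    void unlock(int element) {
        locks[partitionOf[element]].unlock();
    }
}

Over-decomposition then amounts to choosing nPartitions several times larger than the number of cores.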

Page 24:

Organization

• Case study of amorphous data-parallelism
  – Delaunay mesh refinement
• Galois system (PLDI 2007)
  – Programming model
  – Baseline implementation
• Galois system optimizations
  – Scheduling (SPAA 2008)
  – Data and computation partitioning (ASPLOS 2008)
• Experimental results
• Ongoing work

Page 25:

Small-scale multiprocessor results

• 4-processor Itanium 2
  – 16 KB L1, 256 KB L2, 3 MB L3 cache
• Versions:
  – GAL: using a stack as the worklist
  – PAR: partitioned mesh + data-centric work assignment
  – LCO: locks on partitions
  – OVD: over-decomposed version (factor of 4)

Page 26:

Large-scale multiprocessor results

• Maverick@TACC
  – 128-core Sun Fire E25K at 1 GHz
  – 64 dual-core processors
  – Sun Solaris
• First “out-of-the-box” results
• Speed-up of 20 on 32 cores for refinement
• Mesh partitioning is still sequential
  – time for mesh partitioning starts to dominate after 8 processors (32 partitions)
• Need parallel mesh partitioning
  – ParMETIS (Karypis et al.)

Page 27:

Galois version of mesh refinement

Mesh m = /* read in mesh */;
Set wl;
wl.add(m.badTriangles()); // initialize the Set wl
for each e in Set wl do {  // unordered Set iterator
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e);
  c.expand();
  c.retriangulate();
  m.update(c);
  wl.add(c.badTriangles()); // add new bad triangles to Set
}

[Figure: the same program layered over a partitioned work-set and a partitioned graph, managed by the Galois runtime system]

Page 28:

Results for BK Image Segmentation

• Versions:
  – GAL: standard Galois version
  – PAR: partitioned graph
  – LCO: locks on partitions
  – OVD: over-decomposed version

Page 29:

Related work

• Transactions
  – the programming model is explicitly parallel
  – assumes someone else is responsible for parallelism, locality, load-balancing, and scheduling, and focuses only on synchronization
  – Galois: the main concerns are parallelism, locality, load-balancing, and scheduling
    • “catch and keep” classes can use TM for roll-back, but this is probably overkill
• Thread-level speculation
  – not clear where to speculate in C programs
    • wastes power in useless speculation
  – many schemes require extensive hardware support
  – no notion of abstraction-based speculation
  – no analogs of data partitioning or scheduling
  – overall results are disappointing

Page 30:

Ongoing work

• Case studies of irregular programs
  – understand “parallelism and locality patterns” in irregular programs
• Lonestar benchmark suite for irregular programs
  – joint work with Calin Cascaval’s group at IBM Yorktown Heights
• Optimizing the Galois runtime system
  – improve performance for iterators in which the work per iteration is relatively low
• Compiler analysis to reduce the overheads of optimistic parallel execution
• Scalability studies
  – larger numbers of cores
• Distributed-memory implementation
  – billion-element meshes?
• Program analysis to verify assertions about class methods
  – need a semantic specification of the class

Page 31:

Summary

• Irregular applications have amorphous data-parallelism
  – work-list based iterative algorithms over unordered and ordered sets
• Amorphous data-parallelism may be inherently data-dependent
  – pointer/shape analysis cannot work for these applications
• Optimistic parallelization is essential for such applications
  – analysis might be useful to optimize parallel program execution
• Exploiting abstractions and high-level semantics is critical
  – Galois knows about sets, ordered sets, accumulators, …
• The Galois approach provides a unified view of data-parallelism in regular and irregular programs
  – the baseline is optimistic parallelism
  – use compiler analysis to make decisions at compile time whenever possible