1
Parallelizing Irregular Applications through the Exploitation of Amorphous Data-parallelism

Keshav Pingali (UT Austin)
Mario Mendez-Lojo (UT Austin)
Donald Nguyen (UT Austin)
2
Objectives
• What are irregular applications?
  – how are they different from regular applications?
  – why are they so hard to parallelize?
• Is there parallelism in irregular applications?
  – operator formulation of algorithms
  – amorphous data-parallelism (ADP)
  – Galois programming model and baseline execution model
  – ParaMeter tool for estimating ADP in Galois programs
• Is there structure in irregular algorithms?
  – structural analysis of irregular algorithms
  – exploiting structure for efficient parallel implementation
  – Galois system
• What are the open research problems?
3
Schedule
• 1:30 PM-3:00 PM
  – Irregular applications, amorphous data parallelism, structure (Keshav Pingali)
• 3:00 PM-3:30 PM
  – Break
• 3:30 PM-4:15 PM
  – ParaMeter demo (Donald Nguyen)
• 4:15 PM-5:00 PM
  – Galois demo (Mario Mendez-Lojo)
4
Irregular applications
5
Definitions
• Regular applications
  – key data structures are
    • vectors
    • dense matrices
  – simple access patterns
    • (eg) array indices are affine functions of for-loop indices
  – dependences are mostly independent of runtime values
  – examples: MMM, Cholesky and LU factorizations, stencil codes, FFT, …
• Irregular applications
  – key data structures are
    • lists, priority queues
    • trees, DAGs, graphs
    • usually implemented using pointers or references
  – complex access patterns
  – dependences are mostly functions of runtime values
  – examples: see next slide
6
Examples

Application/domain        Algorithm
AI                        Bayesian inference, survey propagation
Compilers                 Iterative and elimination-based dataflow algorithms
Functional interpreters   Graph reduction, static and dynamic dataflow
Maxflow                   Preflow-push, augmenting paths
Computational biology     Protein homology networks, phylogenetic reconstruction
Event-driven simulation   Chandy-Misra-Bryant, Jefferson Timewarp
Meshing                   Generation/refinement/partitioning
Social networks           Connecting the dots
Data-mining               Clustering
7
Problem

• Handwritten parallel implementations exist for some irregular problems
• No systematic way of thinking about parallelism in irregular applications
  – no abstractions for thinking about parallelism
  – many mechanisms exist, but it is not clear when these mechanisms are useful
    • static parallelization: dependence analysis and restructuring
    • inspector-executor parallelization
    • interference graphs
    • optimistic parallelization
    • …
  – each new irregular application is a "new phenomenon", seemingly unrelated to other applications
• Need a science of parallel programming
  – analysis: framework for thinking about parallelism and locality in applications
  – synthesis: guide for producing efficient parallel implementations
8
Science of electromagnetism

Seemingly unrelated phenomena
Unifying abstractions
Specialized models that exploit structure
9
Organization of talk
• Seemingly unrelated parallel algorithms and data structures
  – Stencil codes
  – Delaunay mesh refinement
  – Event-driven simulation
  – Graph reduction of functional languages
  – …
• Unifying abstractions
  – Operator formulation of algorithms
  – Amorphous data-parallelism
  – Galois programming model
  – Baseline parallel implementation
• Specialized implementations that exploit structure
  – Structure of algorithms
  – Optimized compiler and runtime system support for different kinds of structure
10
Seemingly unrelated algorithms
11
Stencil computation: Jacobi iteration

• Finite-difference method for solving PDEs
  – discrete representation of domain: grid
• Values at interior points are updated using values at neighbors
  – values at boundary points are fixed
• Data structure:
  – dense arrays
• Parallelism:
  – values at the next time step can be computed simultaneously
  – parallelism is not dependent on runtime values
• Compiler can find the parallelism
  – spatial loops are DO-ALL loops
//Jacobi iteration with 5-point stencil
//initialize array A
for time = 1, nsteps
  for <i,j> in [2,n-1]x[2,n-1]
    temp(i,j) = 0.25*(A(i-1,j)+A(i+1,j)+A(i,j-1)+A(i,j+1))
  for <i,j> in [2,n-1]x[2,n-1]
    A(i,j) = temp(i,j)
[Figure: Jacobi iteration, 5-point stencil; values in A at time tn produce temp at time tn+1]
12
Delaunay Mesh Refinement

• Iterative refinement to remove badly shaped triangles:
    while there are bad triangles do {
      pick a bad triangle;
      find its cavity;
      retriangulate cavity; // may create new bad triangles
    }
• Don't-care non-determinism:
  – final mesh depends on order in which bad triangles are processed
  – applications do not care which mesh is produced
• Data structure:
  – graph in which nodes represent triangles and edges represent triangle adjacencies
• Parallelism:
  – bad triangles with cavities that do not overlap can be processed in parallel
  – parallelism is dependent on runtime values
    • compilers cannot find this parallelism
    • (Gary Miller et al.) at runtime, repeatedly build an interference graph and find maximal independent sets for parallel execution
Mesh m = /* read in mesh */
WorkList wl;
wl.add(m.badTriangles());
while (true) {
  if (wl.empty()) break;
  Element e = wl.get();
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e);  // determine new cavity
  c.expand();
  c.retriangulate();         // re-triangulate region
  m.update(c);               // update mesh
  wl.add(c.badTriangles());
}
13
Event-driven simulation
• Stations communicate by sending messages with time-stamps on FIFO channels
• Stations have internal state that is updated when a message is processed
• Messages must be processed in time-order at each station
• Data structure:
  – messages in an event-queue, sorted in time-order
• Parallelism:
  – activities created in the future may interfere with current activities, so the interference-graph approach will not work
  – Jefferson Timewarp
    • a station can fire when it has an incoming message on any edge
    • requires roll-back if a speculative conflict is detected
  – Chandy-Misra-Bryant
    • conservative event-driven simulation
    • requires null messages to avoid deadlock
[Figure: network of stations A, B, C exchanging time-stamped messages (eg, 2, 5, 3, 4, 6)]
14
Remarks
• Diverse algorithms and data structures
  – relatively few algorithms use dense arrays
  – more common: graphs, trees, lists, priority queues, …
• Exploiting parallelism is very complex
  – DMR: (Miller et al.) interference graph + maximal independent sets
  – Event-driven simulation: (Jefferson) Timewarp algorithm
• Complexity arises from several sources:
  – parallelism can be dependent on runtime values
    • DMR, event-driven simulation, graph reduction, …
  – don't-care non-determinism
    • nothing to do with concurrency
    • DMR, graph reduction
  – activities created in the future may interfere with current activities
    • event-driven simulation, …
15
Organization of talk
• Seemingly unrelated parallel algorithms and data structures
  – Stencil codes
  – Delaunay mesh refinement
  – Event-driven simulation
  – Graph reduction of functional languages
  – …
• Unifying abstractions
  – Operator formulation of algorithms
  – Amorphous data-parallelism
  – Galois programming model
  – Baseline parallel implementation
• Specialized implementations that exploit structure
  – Structure of algorithms
  – Optimized compiler and runtime system support for different kinds of structure
• Research topics
16
Unifying abstractions
• Should provide a model of parallelism in irregular algorithms
• Ideally, unified treatment of parallelism in regular and irregular algorithms
  – parallelism in regular algorithms should emerge as a special case of the general model
  – (cf.) correspondence principles in physics
• Abstractions should be effective
  – should be possible to write an interpreter to execute algorithms in parallel
17
Traditional abstraction

• Dependence graph
  – nodes are computations
  – edges are dependences
• Parallelism
  – width of the computation graph
  – critical path: longest path from START to END
• Effective parallel computation graph model
  – dataflow model of Dennis, Arvind et al.
• Inadequate for irregular applications
  – dependences between computations are functions of runtime values
  – don't-care non-determinism
  – conflicting work may be created dynamically
  – …
• Data structures play almost no role in this abstraction
  – in most programs, parallelism comes from data-parallelism (concurrent operations on data structure elements)
• New abstraction
  – data-centric: data structures play a central role
  – we will use the graph ADT to illustrate concepts
18
Graph ADT

• Parallel algorithms are usually introduced using matrices and matrix algorithms
• More useful abstraction for us: graph
  – abstract data type (ADT):
    • some number of nodes and edges
    • edges may be directed
    • nodes/edges may be labeled with values
  – concrete representation
    • may be different for different applications, platforms and graphs
  – application program written in terms of the ADT (see the sketch below)
    • (cf.) relational databases
• Graphs are usually sparse in most applications
  – even the Jacobi example: the 2D grid is a sparse graph in which each internal node has four neighbors!
• Special graphs
  – cliques: isomorphic to square dense matrices
  – labeled graphs w/o edges: isomorphic to sets/multi-sets
[Figure: one graph ADT, many possible concrete representations]
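To make the ADT/representation split concrete, here is a minimal Java sketch of a graph ADT. The interface and its method names are hypothetical illustrations, not the actual Galois library API; the point is that the application is written against the interface, while the concrete representation behind it can vary.

// Minimal sketch of a graph ADT (hypothetical; not the Galois API).
interface GraphADT<N, D> {
    N createNode(D data);           // add a node labeled with 'data'
    void addEdge(N src, N dst);     // add a (directed) edge
    Iterable<N> neighbors(N node);  // nodes adjacent to 'node'
    D getNodeData(N node);          // read a node's label
    void removeNode(N node);        // the graph may change shape (morphs)
}

The same interface could be backed by adjacency lists for a changing mesh or by compact arrays for a fixed grid, without changing the application code.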
19
[Figure: graph with five activities i1-i5; dots mark active nodes, shaded regions their neighborhoods]
Operator formulation of algorithms

• Algorithm = repeated application of operator to graph (see the sketch below)
  – active element:
    • node or edge where computation is needed
      – DMR: nodes representing bad triangles
      – event-driven simulation: station with incoming message
    • activity: applying the operator to an active element
  – neighborhood:
    • set of nodes and edges read/written to perform the computation
      – DMR: cavity of the bad triangle
      – event-driven simulation: the station
    • usually distinct from the neighbors in the graph
  – ordering:
    • order in which active elements must be executed in a sequential implementation
      – any order (Jacobi, DMR, graph reduction): unordered algorithms
      – some problem-dependent order (event-driven simulation): ordered algorithms
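To fix this vocabulary in code, a minimal hedged Java sketch of the operator formulation follows; the interface is hypothetical and illustrative, not the Galois API.

import java.util.Collection;
import java.util.Set;

// Sketch: an activity = applying the operator at one active element,
// touching only that activity's neighborhood.
interface OperatorSketch<N> {
    Set<N> neighborhood(N active);  // elements read/written by the activity
    Collection<N> apply(N active);  // do the work; returns newly created
                                    // active elements (eg, new bad triangles)
}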
20
Parallelism
• Amorphous data-parallelism (ADP)
  – active nodes can be processed in parallel, subject to
    • neighborhood constraints
    • ordering constraints
• Unordered algorithms
  – activities are independent if their neighborhoods do not overlap
  – independent activities can be processed in parallel
  – more generally, it suffices that:
    • elements in the intersection of neighborhoods are not written, or
    • the activities satisfy commutativity relations
  – how do we find independent activities? (see the sketch below)
• Ordered algorithms
  – independence is not enough
  – how do we determine what is safe to execute w/o violating ordering?
[Figures: activities i1-i5 on a graph, and event-driven simulation stations A, B, C with time-stamped messages]
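As a concrete reading of the neighborhood-overlap condition for unordered algorithms, a small hedged Java sketch (a real runtime would detect conflicts incrementally with logical locks rather than by comparing completed neighborhoods):

import java.util.Collections;
import java.util.Set;

class Independence {
    // Two activities are independent, and may run in parallel, if the graph
    // elements they read/write (their neighborhoods) do not overlap.
    static <E> boolean independent(Set<E> nbhdA, Set<E> nbhdB) {
        return Collections.disjoint(nbhdA, nbhdB);
    }
}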
21
Galois programming model (PLDI 2007)
• Program written in terms of abstractions in the model
• Two-level programming model
  – Joe programmers
    • sequential, OO model
    • Galois set iterators for iterating over unordered and ordered sets of active elements:
      – for each e in Set S do B(e)
        • evaluate B(e) for each element in set S
        • no a priori order on iterations
        • set S may get new elements during execution
      – for each e in OrderedSet S do B(e) (see the ordered sketch after the code below)
        • evaluate B(e) for each element in set S
        • perform iterations in the order specified by the OrderedSet
        • set S may get new elements during execution
  – Stephanie programmers
    • Galois concurrent data structure library
• (Wirth) Algorithms + Data structures = Programs
Mesh m = /* read in mesh */
Set ws;
ws.add(m.badTriangles());            // initialize ws
for each tr in Set ws do {           // unordered Set iterator
  if (tr no longer in mesh) continue;
  Cavity c = new Cavity(tr);
  c.expand();
  c.retriangulate();
  m.update(c);
  ws.add(c.badTriangles());          // add new bad triangles
}
DMR using Galois iterators
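For the ordered iterator, here is a sketch in the same style using event-driven simulation; Event, Station, receiver and simulate are hypothetical names, not the actual Galois API.

// Event-driven simulation using the ordered Galois set iterator (sketch)
OrderedSet<Event> events = /* initial messages, ordered by time-stamp */;
for each Event e in OrderedSet events do { // iterations commit in time order
  Station s = e.receiver();                // station with incoming message
  events.add(s.simulate(e));               // processing a message may create
                                           // new, later events
}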
22
Galois parallel execution model

[Figure: master thread executing a Joe program (main, for each, ...); worker threads execute iterations on a concurrent data structure]
• Parallel execution model:
  – shared-memory
  – optimistic execution of Galois iterators
• Implementation:
  – master thread begins execution of program
  – when it encounters an iterator, worker threads help by executing iterations concurrently
  – several work-distribution strategies implemented in Galois (cf. OpenMP)
  – barrier synchronization at end of iterator
• Independence of neighborhoods:
  – software-TM variety
  – logical locks on nodes and edges
• Ordering constraints for ordered set iterator:
  – execute iterations out of order but commit in order
  – cf. out-of-order CPUs
23
ParaMeter tool (PPoPP 2009)

• Idealized execution model:
  – unbounded number of processors
  – applying the operator at an active node takes one time step
  – execute a maximal set of active nodes, subject to neighborhood and ordering constraints
• Measures amorphous data-parallelism in an irregular program execution
• Useful as an analysis tool (a sketch of the profile computation follows)
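A hedged Java sketch of how such a profile could be computed for an unordered algorithm; all types and methods here are hypothetical, and this is not the actual ParaMeter implementation.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

interface PElement {}
interface PActivity { Set<PElement> neighborhood(); }
interface PWorklist {
    boolean isEmpty();
    Iterable<PActivity> active();        // currently active elements
    void execute(List<PActivity> step);  // apply operator; may add activities
}

class ProfileTool {
    // One entry per idealized time step: the size of a greedily chosen
    // maximal set of activities with pairwise-disjoint neighborhoods.
    static List<Integer> measure(PWorklist work) {
        List<Integer> profile = new ArrayList<>();
        while (!work.isEmpty()) {
            Set<PElement> claimed = new HashSet<>();
            List<PActivity> step = new ArrayList<>();
            for (PActivity a : work.active()) {
                if (Collections.disjoint(a.neighborhood(), claimed)) {
                    claimed.addAll(a.neighborhood()); // reserve its elements
                    step.add(a);                      // runs in this step
                }
            }
            work.execute(step);        // one idealized time step
            profile.add(step.size());  // available parallelism at this step
        }
        return profile;
    }
}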
24
Example: DMR

• Input mesh:
  – produced by Triangle (Shewchuk)
  – 550K triangles
  – roughly half are badly shaped
• Available parallelism:
  – how many non-conflicting triangles can be expanded at each time step?
• Parallelism intensity:
  – what fraction of the total number of bad triangles can be expanded at each step?
25
Summary of ADP abstraction
• Old abstraction: dependence graphs
  – inadequate for most irregular algorithms
• New abstraction: operator formulation
  – active elements
  – neighborhoods
  – ordering of active elements
• Baseline execution model
  – Galois programming model
    • sequential, OO
    • uses abstractions in the model
  – optimistic parallel execution
  – conflict detection scheme: many choices
    • exclusive locks
    • read/write locks
    • commutativity relations
• Amorphous data-parallelism
  – generalizes conventional data-parallelism
26
Organization of talk
• Seemingly unrelated parallel algorithms and data structures
  – Stencil codes
  – Delaunay mesh refinement
  – Event-driven simulation
  – Graph reduction of functional languages
  – …
• Unifying abstractions
  – Operator formulation of algorithms
  – Amorphous data-parallelism
  – Galois programming model
  – Baseline parallel implementation
• Specialized implementations that exploit structure
  – Structure of algorithms
  – Optimized compiler and runtime system support for different kinds of structure
• Ongoing work
27
Structure in irregular algorithms
• Baseline implementation is general but usually inefficient
  – (eg) dynamic scheduling of iterations is not needed for stencil codes, since the grid structure is known at compile-time
  – (eg) hand-written parallel implementations of DMR and stencil codes do not buffer updates to the neighborhood until a commit point
• Efficient execution requires exploiting structure in algorithms and data structures
• How do we talk about structure in algorithms?
  – previous approaches: like descriptive biology
    • Mattson et al. book
    • Parallel programming patterns (PPP): Snir et al.
    • Berkeley motifs: Patterson, Yelick, et al.
    • …
  – our approach: like molecular biology
    • structural analysis of algorithms
    • based on the amorphous data-parallelism framework
28
Structural analysis of irregular algorithms
irregular algorithms, classified along several axes:
  – topology:
    • general graph
    • grid
    • tree
  – operator:
    • morph: refinement, coarsening, general
    • local computation
    • reader
  – ordering:
    • unordered
    • ordered
  – active elements:
    • topology-driven
    • data-driven

Examples:
  Jacobi: topology: grid, operator: local computation, ordering: unordered
  DMR, graph reduction: topology: graph, operator: morph, ordering: unordered
  Event-driven simulation: topology: graph, operator: local computation, ordering: ordered
29
Exploiting structure to reduce overheads
30
Cautious operators (PPoPP 2010)
• Cautious operator implementation:
  – reads all the elements in its neighborhood before modifying any of them
  – (eg) Delaunay mesh refinement
• Algorithm structure:
  – cautious operator + unordered active elements
• Optimization: optimistic execution w/o buffering (sketched below)
  – grab locks on elements during the read phase
    • conflict: someone else has the lock, so release your locks
  – once the update phase begins, no new locks will be acquired
    • update in place w/o making copies
    • zero-buffering
  – note: this is not two-phase locking
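A hedged Java sketch of this read-then-write discipline; the lock-manager and activity interfaces are hypothetical, for illustration only.

// Sketch: optimistic execution of a cautious operator without buffering.
// All locks are taken during the read phase, so the write phase can never
// conflict and may update the graph in place (zero-buffering).
interface CElem {}
interface CautiousActivity {
    Iterable<CElem> readNeighborhood(); // read phase visits the whole neighborhood
    void applyUpdates();                // write phase: in-place updates
}
interface LockManager {
    boolean tryLock(CElem e);           // false if another activity holds e
    void releaseAll();                  // drop all locks held by this activity
}

class CautiousExecutor {
    // Returns false on conflict; the runtime can retry the activity later.
    static boolean execute(CautiousActivity a, LockManager locks) {
        for (CElem e : a.readNeighborhood()) { // lock elements as they are read
            if (!locks.tryLock(e)) {
                locks.releaseAll();            // conflict: back off completely
                return false;
            }
        }
        a.applyUpdates();    // no new locks from here on; update in place
        locks.releaseAll();  // commit is just unlocking: no undo log, no copies
        return true;
    }
}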
31
Eliminating speculation
• Coordinated execution of activities is possible if we can build the dependence graph
  – run-time scheduling:
    • cautious operator + unordered active elements
    • execute all activities partially to determine their neighborhoods
    • create an interference graph and find an independent set of activities
    • execute the independent set of activities in parallel w/o synchronization
  – just-in-time scheduling:
    • local computation + topology-driven, (eg) tree walks, sparse MVM
    • inspector-executor approach
  – compile-time scheduling:
    • previous case + graph is known at compile-time, (eg) Jacobi, FFT, MMM, Cholesky
    • make all scheduling decisions at compile-time
DMR Results

• 0.5M triangles, 0.25M bad triangles
• Best serial: locallifo.flagopt
• Serial time: 17002 ms
• Best parallel time: 3745 ms
• Best speedup: 4.5X
• 2 x Xeon X5570 (4 cores, 2.93 GHz)
• Java 1.6.0_0-b11
• Linux 2.6.24-27 x86_64
• 20 GB heap size

Mario and Donald will talk about the Galois programming model and optimizations.
33

New C++ implementation

[Figure: results on a 24-core machine (6-core Intel Xeon x 4, 128 GB RAM)]
• Points-to analysis
  – infer the variables pointed to by pointers in a program
• Results
  – input: Linux kernel
  – seq. implementation in C++
  – "chunked FIFO" scheduling
  – speedup: 3.75x (8-core Xeon)
  – seq. phases limit speedup
34
[Figure: runtime (sec) vs. number of threads (1-8) for the Galois and serial versions]
Andersen-style points-to analysis (OOPSLA 2010)
FAQs
• How is this different from TM?
  – Programming model:
    • TM: explicitly parallel (threads)
      – transactions: synchronization mechanism for threads
      – mostly memory-level conflict detection + boosting
    • Galois: Joe programs are sequential OO programs
      – ADT-level conflict detection
  – Where do threads come from?
    • TM: someone else's problem
    • Galois project: focus on
      – sources of parallelism in algorithms
      – structure of parallelism in algorithms
  – (Wirth) Algorithms + Data structures = Programs
    • Galois: focus on algorithms and algorithmic structure
    • TM: focus on concurrent data structures
  – Galois: mechanisms customized to algorithm structure
    • few algorithms require full-blown speculation/optimistic parallelization
35
FAQs (contd.)
• How is this different from TLS?
  – Programming model:
    • Galois: separation between an ADT and its implementation is critical
      – permits separation of Joe and Stephanie layers (cf. relational databases)
      – permits more aggressive conflict detection schemes, like commutativity relations
    • TLS: FORTRAN/C, so no separation between ADT and implementation
  – Programming model:
    • Galois: don't-care non-determinism plays a central role
    • TLS: FORTRAN/C, so only ordered algorithms
  – Exploiting algorithmic structure:
    • Galois: exploits algorithmic structure to
      – reduce speculative overheads
      – focus speculation only where needed
36
37
Summary
• Seemingly unrelated parallel algorithms and data structures
  – Stencil codes
  – Delaunay mesh refinement
  – Event-driven simulation
  – …
• Unifying abstractions
  – Operator formulation of algorithms
  – Amorphous data-parallelism
  – Galois programming model
  – Baseline parallel implementation
• Specialized implementations that exploit structure
  – Structure of algorithms
  – Optimized compiler and runtime system support for different kinds of structure
38
Schedule
• 1:30 PM-3:00 PM
  – Irregular algorithms, amorphous data parallelism, structure (Keshav Pingali)
• 3:00 PM-3:30 PM
  – Break
• 3:30 PM-4:15 PM
  – ParaMeter (Donald Nguyen)
• 4:15 PM-4:50 PM
  – Galois (Mario Mendez-Lojo)
• 4:50 PM-5:00 PM
  – Research problems (Keshav Pingali)
39
Research ideas
• Application studies:
  – Lonestar benchmark suite for irregular programs: http://iss.ices.utexas.edu/lonestar/
• Development of the ADP model
  – locality
  – intra-operator parallelism
• Identifying algorithmic structure useful in parallel implementations
  – (eg) refinement operators (see Mario's paper in OOPSLA 2010)
• Language/compiler analysis
  – how do you find/communicate structure to the implementation?
  – (eg) use program analysis to find cautious operators (Dimitris, POPL 2011)
• Specializing data structure implementations to particular algorithms
  – can this be done semi-automatically?
  – programmer writes a sequential implementation of the ADT, and a tool generates a thread-safe version for use in Galois programs?
40
Research ideas (contd.)
• Auto-tuning
  – concrete implementations for graphs
  – scheduling strategies
  – contention-management policies
  – …
• Load-balancing
  – "data-centric work-stealing"
  – see Milind Kulkarni's paper in SPAA 2010
• Other execution models
  – GPUs
  – distributed-memory machines
• Architecture
  – existing core designs use lots of real estate for caches
  – may not be well-suited for irregular applications:
    • little spatial and temporal locality
    • relatively small amount of computation per memory access
  – one idea:
    • use multi-threading to hide latency
    • NVIDIA GPU-like architecture for irregular algorithms?
• Hardware-software codesign
  – can we generate custom cores for particular applications to exploit ADP?
41
Science of Parallel Programming
Seemingly unrelated algorithms
Unifying abstractions
Specialized models that exploit structure