concurrent collections (cnc)
1Software and Services Group 1
Concurrent Collections (CnC)
Kath Knobe
Technology Pathfinding and Innovation (TPI)
Software and Services Group (SSG)
Intel
Cholesky Performance
Intel 2-socket x 4-core Nehalem @ 2.8 GHz + Intel MKL 10.2
IPDPS’10 (best paper), Aparna Chandramowlishwaran
Eigensolver Performance
Intel 2-socket x 4-core Nehalem @ 2.8 GHz + Intel® Math Kernel Library 10.2
Chart legend: Multithreaded Intel® MKL, Baseline, Concurrent Collections (higher is better)
The Problem
• Most serial languages over-constrain orderings
  - Require arbitrary serialization
  - Allow for overwriting of data
  - The decisions of if and when to execute are bound together
• Most parallel programming languages are embedded within serial languages
  - Inherit problems of serial languages
  - They are too specific with respect to the type of parallelism in the application and the type of target architecture
  - For the tuning expert, they don’t provide quite enough control
The Solution
explicitly parallel languages (over-constrained)
Concurrent Collections (only semantically required constraints)
explicitly serial languages (over-constrained)
Raise the level of the programming model just enough to avoid over-constraints
Agenda
• Introduction
• The Big Idea
• Simple C++ Example
• Execution
So What’s the Big Idea?
• Don’t specify what operations run in parallel
  − Difficult and depends on target
• Specify the semantic ordering constraints only
  − Easier and depends only on application
Semantic ordering constraints: the meaning of the program requires some computations to execute before others.
Exactly Two Sources of Semantic Ordering Requirements
• Producer / Consumer (Data Dependence): Producer must execute before consumer
• Controller / Controllee (Control Dependence): Controller must execute before controllee
Separation of Concerns Between Domain Expert and Tuning Expert
Goal: separation of concerns
- The domain expert does not need to know about parallelism
- The tuning expert does not need to know about the domain
- CnC maximizes flexibility for good performance
The application problem
The work of the domain expert
• Semantic correctness
• Constraints required by the application
Concurrent Collections Spec
The work of the tuning expert
• Architecture
• Actual parallelism
• Locality
• Overhead
• Load balancing
• Distribution among processors
• Scheduling within a processor
Mapping to target platform
Notation
                  Graphical  Textual
Computation Step  (foo)      (foo)
Data Item         [x]        [x]
Control Tag       <T>        <T>
Exactly Two Sources of Ordering Requirements
• Producer / Consumer (Data Dependence): Producer must execute before consumer
  (step1) -> [item] -> (step2)
• Controller / Controllee (Control Dependence): Controller must execute before controllee
  (step1) -> <t2> :: (step2)
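The two ordering requirements can be treated uniformly as "must execute before" edges between step instances. The following is a minimal sketch, not part of any CnC runtime: `Edge` and `legal_order` are hypothetical names, and Kahn's topological sort stands in for whatever scheduler a real implementation uses. It returns one legal order; any other order that respects the edges would be equally valid.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// One "must execute before" edge: producer -> consumer or controller -> controllee.
using Edge = std::pair<std::string, std::string>;

// Return one legal execution order (Kahn's algorithm), or an empty vector
// if the constraints are cyclic. Assumes every edge names a listed step.
inline std::vector<std::string> legal_order(const std::vector<std::string>& steps,
                                            const std::vector<Edge>& before) {
    std::map<std::string, int> pending;  // number of unmet predecessors per step
    std::map<std::string, std::vector<std::string>> successors;
    for (const auto& s : steps) pending[s] = 0;
    for (const auto& e : before) {
        assert(pending.count(e.first) && pending.count(e.second));
        successors[e.first].push_back(e.second);
        ++pending[e.second];
    }
    std::queue<std::string> ready;
    for (const auto& s : steps)
        if (pending[s] == 0) ready.push(s);

    std::vector<std::string> order;
    while (!ready.empty()) {
        std::string s = ready.front();
        ready.pop();
        order.push_back(s);
        for (const auto& next : successors[s])
            if (--pending[next] == 0) ready.push(next);
    }
    if (order.size() != steps.size()) order.clear();  // cycle: no legal order
    return order;
}
```

Calling `legal_order({"step2", "step1"}, {{"step1", "step2"}})` places step1 before step2 regardless of the order in which the steps were listed.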
Tag collections: new concept
(step1) -> <t2> :: (step2)
loop k = …
  loop j = …
    loop i = …
      Z(i, j, k) = …
      … = A(k, j)
    end
  end
end
• Loop iteration space is a sequence <1,1,1>, <1,1,2>, …
• Restricted to integer indices. Body accesses values.
• IF = WHEN
Tag collections: new concept
(step1) -> <t2> :: (step2)
(Step2): components of tag <t2> are k, j and i
loop k = …
  loop j = …
    loop i = …
      Z(i, j, k) = …
      … = A(k, j)
    end
  end
end
A tag collection is a generalization of an iteration space
• An unordered set, not an ordered sequence
• Not just integer indices. Also graph nodes, tree nodes, set elements, … Body accesses values.
• IF ≠ WHEN
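To make "IF ≠ WHEN" concrete, here is a small sketch in plain C++, not using the CnC library. The tag set decides if a step instance runs, while a separate schedule decides when. `step_body`, `run_steps` and the string tags (standing in for tree nodes) are illustrative assumptions. Because the step body is a pure function of its tag, any schedule yields the same items.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical step body: a pure function of its tag value.
inline std::size_t step_body(const std::string& tag) { return tag.size(); }

// Run one step instance per scheduled tag that is in the tag collection.
// `tags` decides IF an instance exists; `schedule` decides WHEN it runs.
inline std::map<std::string, std::size_t>
run_steps(const std::set<std::string>& tags,
          const std::vector<std::string>& schedule) {
    std::map<std::string, std::size_t> items;  // results keyed by tag
    for (const auto& tag : schedule) {
        if (tags.count(tag)) items.emplace(tag, step_body(tag));
    }
    assert(items.size() <= tags.size());  // at most one result per tag
    return items;
}
```

Running the same tag set with a forward and a reversed schedule produces identical result maps, which is the property that lets a runtime pick the order freely.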
Tag collections: new concept
(step1) -> <t2> :: (step2)
(Step2): components of tag <t2> are k, j and i
[z]
loop k = …
  loop j = …
    loop i = …
      Z(i, j, k) = Z(i-1, j, k) + …
      … = A(k, j)
    end
  end
end
Tag collections: new concept
(step1) -> <t2> :: (step2)
(Step2): single component of tag <t2> is k
loop k = …
  loop j = …
    loop i = …
      Z(i, j, k) = …
      … = A(k, j)
    end
  end
end
Cholesky factorization
Figure (built up across slides): the matrix with the steps Cholesky, Trisolve and Update highlighted in turn.
Example: The White Board Drawing
• What are the high level operations?
• What are the chunks of data?
• What are the producer/consumer relationships?
• What are the inputs and outputs?
(Cholesky)
[array]
(Trisolve)
(Update)
Make it precise enough to execute
• Distinguish among the instances?
• What are the distinct control tag collections?
• What steps produce them?
(Cholesky: iter)
(Trisolve: row, iter)
(Update: col, row, iter)
[array: iter, row, col]
<CholeskyTag: iter>
<TrisolveTag: row, iter>
<UpdateTag: col, row, iter>
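As an illustration of what the three tag collections might contain, the sketch below enumerates tag instances for a matrix split into `tiles` x `tiles` blocks. The index ranges are an assumption taken from the usual right-looking tiled Cholesky (Trisolve on rows below the diagonal block, Update on the remaining lower-triangular blocks); the slides do not state them. `CholeskyTags` and `enumerate_tags` are hypothetical names.

```cpp
#include <cassert>
#include <tuple>
#include <utility>
#include <vector>

struct CholeskyTags {
    std::vector<int> cholesky;                        // <CholeskyTag: iter>
    std::vector<std::pair<int, int>> trisolve;        // <TrisolveTag: row, iter>
    std::vector<std::tuple<int, int, int>> update;    // <UpdateTag: col, row, iter>
};

// Enumerate tag instances under the assumed tiled-Cholesky index ranges.
inline CholeskyTags enumerate_tags(int tiles) {
    assert(tiles >= 0);
    CholeskyTags t;
    for (int iter = 0; iter < tiles; ++iter) {
        t.cholesky.push_back(iter);                   // one diagonal block per iteration
        for (int row = iter + 1; row < tiles; ++row) {
            t.trisolve.emplace_back(row, iter);       // blocks below the diagonal block
            for (int col = iter + 1; col <= row; ++col)
                t.update.emplace_back(col, row, iter);  // trailing lower triangle
        }
    }
    return t;
}
```

The loops here only build the tag sets. Nothing in the result fixes an execution order, which is left to the data dependences on `[array]`.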
Essentially making it ready for parallel execution
But without explicit thinking about parallelism
Focus on domain/application knowledge
Result is:
• Parallel
• Deterministic (wrt results)
• Race-free
Cell tracker
Tag collections:
<K>
<Cell tags Initial: K, cell>
<Cell tags Final: K, cell>
Step collections:
(Cell detector: K)
(Cell tracker: K)
(Arbitrator initial: K)
(Correction filter: K, cell)
(Arbitrator final: K)
(Prediction filter: K, cell)
Item collections:
[Input Image: K]
[Histograms: K]
[Motion corrections: K, cell]
[Labeled cells initial: K]
[Predicted states: K, cell]
[Cell candidates: K]
[State measurements: K]
[Labeled cells final: K]
[Final states: K]
Another Application: Face Tracker
• Look for face in all possible windows of an image
  Example: in a 3 x 3 image there are 14 windows
  - Nine 1 x 1 windows
  - Four 2 x 2 windows
  - One 3 x 3 window
• Sequence of classifiers. Any can determine: not face
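The window count in the example follows from placing every square k x k window at every position where it fits: a k x k window has (n - k + 1) positions along each axis of an n x n image. A short sketch of that arithmetic, where `count_windows` is a name chosen here for illustration:

```cpp
#include <cassert>

// Number of square k x k windows (k = 1..n) that fit in an n x n image.
inline long count_windows(int n) {
    assert(n >= 0);
    long total = 0;
    for (int k = 1; k <= n; ++k) {
        long positions = n - k + 1;      // placements along one axis
        total += positions * positions;  // placements in two dimensions
    }
    return total;
}
```

For n = 3 this sums 9 + 4 + 1 = 14, matching the slide. Each window would correspond to one tag instance for the first classifier.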
Example: The White Board Drawing
• What are the high level operations?
• What are the chunks of data?
• What are the producer/consumer relationships?
• What are the inputs and outputs?
[image]
(classifier1)
(classifier2)
(classifier3)
Make it precise enough to execute
• Distinguish among the instances?
• What are the distinct control tag collections?
• What steps produce them?
[image]
<T1>
(classifier1)
<T2>
(classifier2)
<T3>
(classifier3)
<face>
Objects – summary
Step Collections (foo)
- Are tagged. A step has access to its tag value
- Performs gets and puts
- Functional. Only side-effects are put objects
Item Collections [x]
- Means of communication among step instances (data dependence)
- Dynamic single assignment: each instance is associated with exactly one contents
- Are tagged
Tag Collections <T>
- Means of communication among step instances (control dependence)
- A tag collection may control multiple step collections
- Determines what step instances will execute
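Dynamic single assignment can be sketched with an ordinary map that rejects a second put for the same tag. This is an illustrative stand-in, not the Intel CnC class library: `ItemCollection`, `put`, `get` and `get_ready` are names assumed here, and a real runtime would also handle concurrency and suspend a step whose input is not yet available.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>

template <typename Tag, typename Value>
class ItemCollection {
public:
    // Put an item. A second put for the same tag violates single assignment.
    void put(const Tag& tag, const Value& value) {
        if (!items_.emplace(tag, value).second)
            throw std::logic_error("item already assigned for this tag");
    }
    // Return a pointer to the item, or nullptr if it is not yet available.
    const Value* get(const Tag& tag) const {
        auto it = items_.find(tag);
        return it == items_.end() ? nullptr : &it->second;
    }
    // Get for callers that already know the item is available.
    const Value& get_ready(const Tag& tag) const {
        const Value* item = get(tag);
        assert(item != nullptr);
        return *item;
    }
private:
    std::map<Tag, Value> items_;  // each tag maps to exactly one contents
};
```

Because an item never changes once put, a consumer sees the same value whenever it runs, which is what makes results deterministic under any schedule.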
Relationships – summary
Prescription
- Every step collection is prescribed.
- The relationship is always the identity function.
- A tag collection may prescribe multiple step collections.
  <t> :: (step1)
  <t> :: (step2)
Consumer
- Corresponds to gets in steps
- A step may consume multiple distinct item collections: [x], [y] -> (foo)
- A step may consume multiple instances of items from a given collection: [x: neighbors(i)] -> (foo: i)
  [item] -> (step2)
Producer
- Corresponds to puts in steps
- A step may produce to multiple distinct collections: (foo) -> [x], [y]
- A step may produce multiple instances to a given collection: (foo: i) -> [x: neighbors(i)]
  (step1) -> [item]
  (step1) -> <t2>
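The prescription relationship can be sketched as a tag collection that hands each new tag, unchanged, to every step collection it prescribes. The class below is an assumption for illustration only (`TagCollection`, `prescribe` and `put` are not taken from any CnC API). It runs steps immediately for simplicity, whereas a real runtime would schedule them once their inputs are available, and it expects prescriptions to be declared before tags are put.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <utility>
#include <vector>

class TagCollection {
public:
    using Step = std::function<void(int)>;

    // Declare that this tag collection prescribes a step collection.
    void prescribe(Step step) {
        assert(step);  // the step must be callable
        steps_.push_back(std::move(step));
    }
    // Put a tag: every prescribed step collection gets an instance
    // with the same tag value (the identity function).
    void put(int tag) {
        if (!tags_.insert(tag).second) return;  // tag already in the set
        for (const auto& step : steps_) step(tag);
    }
private:
    std::set<int> tags_;       // a set: duplicates add nothing
    std::vector<Step> steps_;  // prescribed step collections
};
```

Prescribing two steps and putting tags 3 and 5 creates four step instances, two per step collection, each receiving its tag value unmodified.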
.cnc file
// declarations
[BlockedMatrix<double>* array: int, int, int];
// Control relations
<TrisolveTag: row, iter> :: (Trisolve: row, iter);
// Producer and consumer relations
[array: row, iter, iter], [array: iter, iter, iter+1] -> (Trisolve: row, iter);
(Trisolve: row, iter) -> [array: row, iter, iter+1];

[array: iter, row, col]
(Trisolve)
<TrisolveTag: row, iter>
Coding Hints: Trisolve
StepReturnValue_t Trisolve(cholesky_graph_t& graph, const Tag_t& TS_tag)
{
    // For each input item for this step, retrieve the item using the proper tag
    // User code to create item tag here
    BlockedMatrix<double>* ... = graph.array.Get(Tag_t(...));

    // Step implementation logic goes here
    ...

    // For each output item for this step, put the new item using the proper tag
    // User code to create item tag here
    graph.array.Put(Tag_t(...), ...);

    return CNC_Success;
}

[array: iter, row, col]
(Trisolve)
<TrisolveTag: row, iter>
A model not a language: 1
Variety of ways to access the model
  > Textual representation
  > Class library
  > Graphical user interface
A model not a language: 2
There are variants within the model
− Distinct trade-offs
  > Efficiency
  > Ease-of-use
  > Generality
  > Guarantees
  > …
− Possible different functionality
  > Continuous (or not)
  > Tag functions are analyzable (or not)
  > Real-time (or not)
  > …
Program execution: Attributes are monotonically acquired
Item instance: available, dead
Tag instance: available, dead
Step instance: prescribed, inputs-available, enabled, executed
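The step-instance attributes can be sketched as flags that only ever change from false to true. The slides list the attributes without spelling out the rule that links them, so the sketch assumes that a step instance is enabled once it is both prescribed and inputs-available, and that only an enabled instance may execute. `StepInstance` and its member names are hypothetical.

```cpp
#include <cassert>

struct StepInstance {
    bool prescribed = false;        // its tag instance is available
    bool inputs_available = false;  // every item it gets is available
    bool executed = false;

    // Attributes are acquired, never lost: there are no "unmark" operations.
    void mark_prescribed() { prescribed = true; }
    void mark_inputs_available() { inputs_available = true; }

    // Assumed rule: enabled means prescribed and inputs-available.
    bool enabled() const { return prescribed && inputs_available; }

    // Execute at most once, and only when enabled.
    // Returns whether the step ran on this call.
    bool try_execute() {
        assert(!executed || enabled());  // executed implies enabled
        if (!enabled() || executed) return false;
        executed = true;
        return true;
    }
};
```

The two "mark" calls may arrive in either order. The instance becomes enabled only after both, which mirrors the two sources of ordering: control (prescribed) and data (inputs-available).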
Agenda
• Introduction and history
• The Big Idea
• Simple C++ Examples
• Execution
Plays well with a variety of runtime approaches
grain distribution schedule
HP TStreams
Intel CnC
Georgia Tech CnC
HP TStreams
Rice CnC
static static static
static static dynamic
static dynamic dynamic1
static dynamic dynamic
dynamic2 dynamic dynamic
1 We want to allow you to plug in your own scheduler
2 Mandviwala et al., LCPC ’07
Plays well with a variety of tuning experts
• The domain expert / later time
• Different person / different expertise
• Static analysis
• Dynamic runtime (This is the option currently available on our website.)
• …
Plays well with various types of parallelism
Possible parallelism
• Data / loop parallelism
• Task parallelism
• Pipeline parallelism
• Tree based parallelism
• Fork/join
Rescheduling serial executions
• Memory hierarchy optimization
• Power
Could have different runtime schedulers for distinct subgraphs of the application
CnC Plays Nicely With Serial Languages
• Intel: C++ / TBB
• Rice: Java / Habanero
• Intel: Haskell (preliminary)
• Rice: .NET (preliminary)
• …
CnC Plays Nicely With Target Architectures
• Shared memory / distributed memory
• Homogeneous / heterogeneous
• Flat / hierarchical
• …
The Intel community
• TPI: Kath Knobe, Geoff Lowney, Mark Hampton, Ryan Newton, Frank Schlimbach
• ICL: Chih-Ping Chen, Melanie Blower, Shin Lee, Steve Rose, Leo Treggiari, Ganesh Rao, Nikolay Kurtov, Jeff Arnold (soon), Mario Deilmann
The academic community
• Rice University: Vivek Sarkar, Zoran Budimlic, Sagnak Tasirlar, David Peixotto, Mike Burke (+ IBM)
• Georgia Tech: Rich Vuduc, Aparna Chandramowlishwaran
• Novosibirsk: Nikolay Kurtov
• UCLA: Jens Palsberg
The HP community
• Carl Offner, Alex Nelson
CnC’10 Workshop
Co-located with Languages and Compilers for Parallel Computing (LCPC)
at Rice University in October 2010
Intel (C++): http://whatif.intel.com
Rice (Java): http://habanero.rice.edu/cnc.html