X10 Workshop, San Jose - June 4, 2011
Parallel Programming: Design of an Overview Class
Christoph von Praun
University of Applied Sciences, Nuremberg
[email protected]
This work was supported by an IBM Innovation Award grant.
Summary
• Design of a 3rd year introductory parallel programming class in the Bachelor curriculum: ‘Orientation’ class
• Key characteristics of the class
  – Organization of topics follows the tiers of parallelism
  – Uses the programming language X10
  – Strong focus on lab sessions
• Teaching materials are available online
Computer Science Master + Bachelor Curriculum
[Figure: curriculum overview by semester, with semester axes 1-7 and 1-3 and a thesis on each. Classes shown: Parallel Programming, The Art of Multiprocessor Programming, Graphics Programming with CUDA, Scientific Computing. Annotations: ‘Orientation’ class, Elective classes.]
Influences on parallel programming classes
[Figure: fields that influence parallel programming: scientific computing, OS, high-performance computing, software architecture, computer architecture, programming languages.]
Outline
• Tiers of parallelism
• Course structure and contents
• Role of X10
• Student feedback and experience
Tiers of parallelism
• Original idea due to Michael L. Scott:
  – “Don’t start with Dekker’s algorithm ...” [1]
  – “Making the simple case simple” [2]
• Development of parallel software (parallel programming) can be based on techniques at different abstraction layers
  – progressively less complexity at higher abstraction layers
Parallelization techniques, ordered from high-level (simpler) to low-level (more complex):
(1) automatic or implicit: parallelizing compiler
(2) deterministic: fully independent computations or serialization
(3) explicitly synchronized (data race free): critical sections, transactions
(4) low-level (with race conditions): implementation of threads, synchronization mechanisms, non-blocking data structures
Goal of the class:
• Students should be aware of ‘their’ tier when developing a parallel program.
• Encourage students to move programming activity to higher tiers in the abstraction hierarchy.
Tier-1: automatic or implicit parallelism
• Auto-parallelization through compilers
• Parallel kernels: parallelism encapsulated in libraries
  – LAPACK, etc.
• Parallel frameworks: framework organizes parallelism, synchronization and communication, programmer supplies sequential kernels
  – Map-reduce
  – Web application frameworks, e.g. WebSphere
  – etc.
Sequential semantics.
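As a rough illustration of the framework idea, the following Java sketch lets the standard library's parallel streams play the role of the framework, while the programmer supplies only sequential kernels. The class `WordCountSketch` and the method `countWords` are hypothetical names chosen for this example; the Map-Reduce framework used in the course is not shown here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical example: word count written in map-reduce style.
class WordCountSketch {
    // The programmer writes sequential kernels only; the stream library
    // decides how to partition the input and combine partial results.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.parallelStream()
                // map step: a sequential kernel that splits one line into words
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .filter(word -> !word.isEmpty())
                // reduce step: count occurrences per word; TreeMap keeps keys sorted
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("a b a", "b c")));
    }
}
```

The result is the same as that of a sequential loop over the lines, which is the sense in which this tier keeps sequential semantics.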
Tier-2: deterministic parallelism
• Independent computations:
  – parallel array languages (FORALL loops)
  – parallel containers (e.g., STAPL, Intel Concurrent Collections, Hierarchically Tiled Arrays)
• Concurrent computations with dependencies that follow deterministic idioms:
  – reduction, scan
Semantics through serialization + sequential reasoning.
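The two idioms named above, reduction and scan, can be sketched in Java with standard library calls. This is a minimal sketch, not course material; the class and method names are hypothetical. Integer addition is associative, so the parallel result equals the sequential one.

```java
import java.util.Arrays;
import java.util.stream.LongStream;

// Hypothetical example of two deterministic idioms.
class DeterministicIdioms {
    // Reduction: sum of 1..n. The library may group the additions in any way;
    // because long addition is associative the result does not change.
    static long parallelSum(long n) {
        return LongStream.rangeClosed(1, n).parallel().sum();
    }

    // Scan: inclusive prefix sum, computed in parallel on a copy of the input.
    static long[] prefixSum(long[] input) {
        long[] result = input.clone();
        Arrays.parallelPrefix(result, Long::sum);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(100));
        System.out.println(Arrays.toString(prefixSum(new long[] {1, 2, 3, 4})));
    }
}
```

Both methods can be reasoned about as if they were the obvious sequential loops.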
Tier-3: explicitly synchronized, data-race-free
Three principal programming models
• Event-based
• Thread-parallel with shared memory
  – critical sections
  – condition variables
• Message-based
  – send/receive
  – collective communication
Semantics through interleaving of program blocks
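For the thread-parallel model, a critical section might look as follows in Java. This is a minimal sketch with hypothetical names; a `ReentrantLock` is one of several ways to delimit the block. Each increment runs as one indivisible program block, so the threads interleave only at block boundaries and the program is data-race free.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical example: a shared counter protected by a critical section.
class CriticalSectionSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private int counter = 0;

    void increment() {
        lock.lock();
        try {
            counter++; // critical section: executed by one thread at a time
        } finally {
            lock.unlock();
        }
    }

    int get() {
        lock.lock();
        try {
            return counter;
        } finally {
            lock.unlock();
        }
    }

    // Starts the given number of threads and returns the final counter value.
    static int run(int threads, int incrementsPerThread) {
        CriticalSectionSketch shared = new CriticalSectionSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    shared.increment();
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return shared.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10000));
    }
}
```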
Tier-4: low-level, with race conditions
• Programming with shared memory
  – atomic load and store
  – atomic compare and swap
• Platform-specific (Java, X86, ...)
• Sequential consistency is often a simplifying assumption
  – e.g. teaching Dekker’s algorithm
Semantics through interleaving of statements, possibly not
sequentially consistent
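A small Java sketch of programming with atomic compare and swap, assuming `java.util.concurrent.atomic.AtomicInteger` as the platform primitive. The class name and the retry loop are illustrative choices, not taken from the course.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example: lock-free increment built on compare-and-swap.
class CasCounterSketch {
    private final AtomicInteger value = new AtomicInteger(0);

    // Read the current value, then publish value + 1 only if no other
    // thread has changed it in the meantime; otherwise retry.
    int increment() {
        while (true) {
            int observed = value.get();
            if (value.compareAndSet(observed, observed + 1)) {
                return observed + 1;
            }
            // CAS failed: another thread updated the value first.
        }
    }

    // Starts the given number of threads and returns the final value.
    static int run(int threads, int incrementsPerThread) {
        CasCounterSketch counter = new CasCounterSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.value.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10000));
    }
}
```

Unlike a critical section, the threads here race on the shared variable; correctness rests on the atomicity of the single compare-and-swap step.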
Roadmap of the class (15 weeks)
1. Motivation (1 week)
2. Principles (1 week)
3. Tier-4 (1 week)
4. Tier-1 (1 week)
5. Tier-2 (7 weeks)
6. Tier-3 (2 weeks)
7. Tier-4 (2 weeks)
Topics not addressed in this course
Motivation (1 week)
• Hardware trend:
  – Moore’s Law continues
  – frequency scaling limited by power density: multicores
• Performance: software needs to be parallel
  – challenges (Amdahl’s Law)
  – opportunities (Gustafson’s Law)
• Energy: throughput-oriented computing can save energy
Lab session
• Pencil and paper
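The two laws can be written down in a few lines. The sketch below uses their usual textbook forms, with `p` as the parallelizable fraction of the work and `n` as the number of processors; the class and method names are hypothetical.

```java
// Hypothetical helper for pencil-and-paper style speedup estimates.
class SpeedupLaws {
    // Amdahl's Law: fixed problem size. The serial fraction (1 - p)
    // bounds the speedup by 1 / (1 - p), however large n becomes.
    static double amdahl(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    // Gustafson's Law: the problem size grows with n, so the parallel
    // part scales and the speedup grows linearly in n.
    static double gustafson(double p, int n) {
        return (1.0 - p) + p * n;
    }

    public static void main(String[] args) {
        System.out.println(amdahl(0.9, 10));
        System.out.println(gustafson(0.9, 10));
    }
}
```

With p = 0.9 and n = 10, Amdahl's form gives a speedup of 1 / 0.19, about 5.26, while Gustafson's form gives 9.1, which illustrates the "challenge" and the "opportunity" on the slide.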
Principles (1 week)
• Simple model for concurrent computations
  – partial orders of operations
  – synchronization vs ordinary operations
  – happens-before relation
• Explain semantics of X10 language constructs
  – async
  – finish, for-async
  – atomic
Lab session
• Parallel prime number testing
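The X10 lab code is not reproduced here. As a rough Java analogue (an assumption made for illustration), starting a thread corresponds loosely to `async` and joining all started threads to `finish`. The `join` calls also establish the happens-before edges that make the partial counts visible to the main thread. All names are hypothetical.

```java
// Hypothetical example: counting primes in parallel with plain threads.
class PrimeCountSketch {
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    // Counts the primes in [0, limit] using the given number of threads.
    static int countPrimes(int limit, int threads) {
        int[] partial = new int[threads]; // one private slot per thread, no sharing
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            // loosely comparable to spawning an activity with 'async'
            workers[t] = new Thread(() -> {
                int count = 0;
                for (int n = id; n <= limit; n += threads) {
                    if (isPrime(n)) {
                        count++;
                    }
                }
                partial[id] = count;
            });
            workers[t].start();
        }
        int total = 0;
        for (int t = 0; t < threads; t++) {
            try {
                workers[t].join(); // loosely comparable to the end of a 'finish' block
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            total += partial[t]; // safe: join() happens-before this read
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countPrimes(100, 4));
    }
}
```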
Tier-4 (1 week) [low-level with race conditions]
• Race conditions
• Non-determinacy
  – associative non-determinism (floating point)
  – atomicity violation: lost-update problem
• “Interleaving” semantics
Lab sessions
• Numeric integration
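One way to avoid both problems in a numeric integration is sketched below in Java; it is an illustration, not the lab solution. Each thread accumulates into a private partial sum, so no update can be lost, and the partial sums are combined in a fixed order, so floating-point rounding is the same on every run for a given thread count. The integrand 4 / (1 + x^2) on [0, 1], whose integral is pi, and all names are assumptions of this example.

```java
// Hypothetical example: midpoint-rule integration with per-thread partial sums.
class IntegrationSketch {
    static double integratePi(int steps, int threads) {
        double h = 1.0 / steps;
        double[] partial = new double[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                double sum = 0.0; // thread-private accumulator: no lost updates
                for (int i = id; i < steps; i += threads) {
                    double x = (i + 0.5) * h;
                    sum += 4.0 / (1.0 + x * x);
                }
                partial[id] = sum;
            });
            workers[t].start();
        }
        double total = 0.0;
        for (int t = 0; t < threads; t++) {
            try {
                workers[t].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            total += partial[t]; // fixed combination order gives a reproducible result
        }
        return total * h;
    }

    public static void main(String[] args) {
        System.out.println(integratePi(1000000, 4));
    }
}
```

Adding every term directly to one shared variable without synchronization would instead exhibit the lost-update problem, and combining partial sums in arrival order could change the last digits from run to run.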
Tier-1 (1 week) [automatic or implicit parallelism]
• Challenges of loop parallelization
  – intro to data dependencies
  – difficulties and limitations of dependence analysis on some loop scenarios
• Parallel frameworks
  – Map-Reduce, Web applications
Lab sessions
• Development of Map-Reduce applications (framework provided)
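The difference between a loop that can be parallelized and one that cannot, as written, can be shown with two small Java loops. This is a sketch with hypothetical names, not an example from the course.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical example: loops with and without loop-carried dependences.
class LoopDependenceSketch {
    // No loop-carried dependence: iteration i touches only index i,
    // so the iterations may run in any order or in parallel.
    static int[] addVectors(int[] a, int[] b) {
        int[] c = new int[a.length];
        IntStream.range(0, a.length).parallel().forEach(i -> c[i] = a[i] + b[i]);
        return c;
    }

    // Loop-carried dependence: iteration i reads s[i - 1], which was written
    // by iteration i - 1. As written, the iterations must run in order.
    static int[] runningSum(int[] a) {
        int[] s = a.clone();
        for (int i = 1; i < s.length; i++) {
            s[i] = s[i - 1] + s[i];
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(addVectors(new int[] {1, 2, 3}, new int[] {10, 20, 30})));
        System.out.println(Arrays.toString(runningSum(new int[] {1, 2, 3, 4})));
    }
}
```

A parallelizing compiler has to prove the absence of such dependences before it may transform a loop, which is hard when array indices are computed indirectly.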
Tier-2 (7 weeks) [deterministic parallelism]
Patterns for algorithmic problem decomposition, according to T. Mattson, B. Sanders, B. Massingill, “Patterns for parallel programming”, AW 2005.
• Data parallelism
  – geometric decomposition, recursive data
  – data locality issues
• Task parallelism
  – task parallel, divide and conquer
  – task scheduling / load balancing issues
Lab sessions
• Data parallel
  – heat transfer
  – matrix multiply
  – algorithms for reduction and prefix sum
• Task parallel
  – map-reduce framework implementation
  – merge sort
  – traveling salesman
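The divide-and-conquer pattern behind the merge sort lab can be sketched with Java's fork/join framework. This is an illustrative sketch, not the lab solution; the class names and the cutoff `THRESHOLD` are hypothetical choices.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Hypothetical example: divide-and-conquer merge sort with fork/join tasks.
class ParallelMergeSortSketch {
    private static final int THRESHOLD = 1024; // below this size, sort sequentially

    static class SortTask extends RecursiveAction {
        private final int[] data;
        private final int[] scratch;
        private final int lo;
        private final int hi; // sorts the half-open range data[lo, hi)

        SortTask(int[] data, int[] scratch, int lo, int hi) {
            this.data = data;
            this.scratch = scratch;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected void compute() {
            if (hi - lo <= THRESHOLD) {
                Arrays.sort(data, lo, hi);
                return;
            }
            int mid = (lo + hi) >>> 1;
            // The two halves are independent, so they can be sorted in parallel.
            invokeAll(new SortTask(data, scratch, lo, mid), new SortTask(data, scratch, mid, hi));
            merge(mid);
        }

        // Merges the sorted halves data[lo, mid) and data[mid, hi).
        private void merge(int mid) {
            System.arraycopy(data, lo, scratch, lo, hi - lo);
            int i = lo;
            int j = mid;
            int k = lo;
            while (i < mid && j < hi) {
                data[k++] = (scratch[i] <= scratch[j]) ? scratch[i++] : scratch[j++];
            }
            while (i < mid) {
                data[k++] = scratch[i++];
            }
            while (j < hi) {
                data[k++] = scratch[j++];
            }
        }
    }

    // Returns a sorted copy of the input.
    static int[] sort(int[] input) {
        int[] data = input.clone();
        ForkJoinPool.commonPool().invoke(new SortTask(data, new int[data.length], 0, data.length));
        return data;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sort(new int[] {5, 3, 9, 1, 4})));
    }
}
```

Each task works on its own index range, so the result is deterministic regardless of how the tasks are scheduled.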
Tier-3 (2 weeks) [explicitly synchronized]
• Pattern: Pipeline parallelism
• Producer-consumer communication through concurrent queues
  – critical sections
  – conditional synchronization
Lab sessions
• Array-based concurrent queue with explicit synchronization (atomic blocks)
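A Java sketch of an array-based concurrent queue follows. The lab used X10 atomic blocks; here, as an assumption for illustration, `synchronized` methods serve as the critical sections and `wait`/`notifyAll` provide the conditional synchronization. All names are hypothetical.

```java
// Hypothetical example: bounded array-based queue for producer-consumer communication.
class BoundedQueueSketch {
    private final int[] items;
    private int head = 0;  // index of the oldest element
    private int count = 0; // number of stored elements

    BoundedQueueSketch(int capacity) {
        items = new int[capacity];
    }

    synchronized void put(int value) throws InterruptedException {
        while (count == items.length) {
            wait(); // conditional synchronization: block while the queue is full
        }
        items[(head + count) % items.length] = value;
        count++;
        notifyAll();
    }

    synchronized int take() throws InterruptedException {
        while (count == 0) {
            wait(); // conditional synchronization: block while the queue is empty
        }
        int value = items[head];
        head = (head + 1) % items.length;
        count--;
        notifyAll();
        return value;
    }

    // A producer sends 0..n-1 through the queue; the caller consumes and sums them.
    static long transfer(int n, int capacity) {
        BoundedQueueSketch queue = new BoundedQueueSketch(capacity);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) {
                    queue.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        long sum = 0;
        try {
            for (int i = 0; i < n; i++) {
                sum += queue.take();
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(transfer(1000, 4));
    }
}
```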
Tier-4 (2 weeks) [low-level with race conditions]
• Programming with race conditions
• Memory models (SC, TSO)
Lab sessions
• Lamport’s concurrent non-blocking queue (1 consumer / 1 producer non-blocking queue)
• Observe non-SC behavior of Java
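A single-producer / single-consumer ring buffer in the spirit of Lamport's queue might look as follows in Java. This is a sketch, not the lab solution. The original algorithm assumes sequential consistency; as an assumption of this example, the two indices are `AtomicInteger` fields, whose volatile read/write semantics provide the ordering the algorithm needs under the Java memory model. All names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example: non-blocking queue for exactly one producer and one consumer.
class SpscQueueSketch {
    private final int[] buffer;
    private final AtomicInteger head = new AtomicInteger(0); // next slot to read; written only by the consumer
    private final AtomicInteger tail = new AtomicInteger(0); // next slot to write; written only by the producer

    SpscQueueSketch(int capacity) {
        buffer = new int[capacity + 1]; // one slot stays empty to tell "full" from "empty"
    }

    // Producer side. Returns false instead of blocking when the queue is full.
    boolean offer(int value) {
        int t = tail.get();
        int next = (t + 1) % buffer.length;
        if (next == head.get()) {
            return false;
        }
        buffer[t] = value;
        tail.set(next); // volatile write publishes the element to the consumer
        return true;
    }

    // Consumer side. Returns null instead of blocking when the queue is empty.
    Integer poll() {
        int h = head.get();
        if (h == tail.get()) {
            return null;
        }
        int value = buffer[h];
        head.set((h + 1) % buffer.length); // volatile write frees the slot for the producer
        return value;
    }

    // Sends 0..n-1 from a producer thread and checks that they arrive in order.
    static boolean transferInOrder(int n, int capacity) {
        SpscQueueSketch queue = new SpscQueueSketch(capacity);
        Thread producer = new Thread(() -> {
            for (int i = 0; i < n; i++) {
                while (!queue.offer(i)) {
                    Thread.yield();
                }
            }
        });
        producer.start();
        boolean inOrder = true;
        for (int expected = 0; expected < n; expected++) {
            Integer value;
            while ((value = queue.poll()) == null) {
                Thread.yield();
            }
            if (value != expected) {
                inOrder = false;
            }
        }
        try {
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return inOrder;
    }

    public static void main(String[] args) {
        System.out.println(transferInOrder(10000, 8));
    }
}
```

If the indices were plain `int` fields, the Java memory model would no longer guarantee that the consumer sees an element before it sees the updated index, which is the kind of non-SC behavior the second lab item refers to.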
Topics not addressed in the course
• Patterns for ...
  – ... locality / reducing data access latency
  – ... load balancing / distribution of work
  – ... enhancing parallelism
  – ... distribution of data
• Performance debugging
X10
• Pragmatic choice:
  – Syntax familiar to students: “extension of sequential Java”
  – Simple things can be expressed with succinct syntax
  – X10 can express programs at tiers (1)-(3); memory model not specified -> use Java at tier (4)
• Class was not X10 ‘only’
  – students could choose their own language for projects
  – X10 language tutorial provided separately
Student feedback on X10
“The language should not be used in future classes, since parallel programming is simplified significantly, and for that reason, one does not run into issues and problems that occur when conventional programming languages are used for parallel programming.”
“Takes a while to be familiar with the type system / type inference.”
“Usability of X10 IDE needs to be improved” [March-June 2010]
Student feedback (1/3)
Feedback collected from 16/21 participants.
Has the course been well-structured and did the structure support your learning?
[Chart; answer options: “yes, very clear structure”, “fair structure”, “no, poor structure”]
Excerpt from the course evaluation report for “Parallel Programming” (Christoph von Praun). The report states that 32 ratings were submitted. Sections: preparation and structuring of the material; presentation of the material. Items are rated on a scale from 1 to 5; listed are the mean, the standard deviation s, and the response counts as extracted.
• Volume of material (1 = too little, 5 = too much): mean 3.19, s = 0.63; counts 2, 9, 5
• Common thread (1 = not recognizable, 5 = clearly visible): mean 3.88, s = 0.93; counts 2, 2, 8, 4
• Written materials (1 = superfluous, 5 = helpful): mean 4.19, s = 0.95; counts 1, 3, 4, 8
• Level of the course (1 = too low, 5 = too high): mean 3.38, s = 0.48; counts 10, 6
• Blackboard work / slides (1 = confusing, 5 = clear): mean 4.06, s = 0.75; counts 4, 7, 5
• Presentation style (1 = boring, 5 = motivating): mean 4.56, s = 0.5; counts 7, 9
• Willingness to discuss (1 = poor, 5 = pronounced): mean 4.62, s = 0.7; counts 2, 2, 12
• Lecturer’s subject competence (1 = not competent, 5 = very competent): mean 4.81, s = 0.53; counts 1, 1, 14
Student feedback (2/3)
The number of topics and the volume of material presented in class was ...
[Chart; answer options: “too few”, “perfect”, “too much”]
Student feedback (3/3)
Did the lab sessions help you to learn and understand the materials presented in class?
[Chart; answer options: “always”, “sometimes”, “never”]
Excerpt from the course evaluation report for “Parallel Programming” (Christoph von Praun), page 2. Sections: general; overall impression of the course. Listed are the mean, the standard deviation s, and the response counts as extracted.
• Practical relevance of the material (1 = absent, 5 = pronounced): mean 4.38, s = 0.78; counts 3, 4, 9
• Exercises, quantity (1 = too little, 5 = too much): mean 3.75, s = 0.66; counts 6, 8, 2
• Exercises, content (1 = unsuitable, 5 = deepens understanding): mean 4.19, s = 0.88; counts 1, 2, 6, 7
• Overall impression, school grade from 1 (very good) to 6 (insufficient): mean 1.56, s = 0.5; counts 7, 9
Criticism
• Focus of discussion on correctness, not performance
• Focus on ‘higher layers’ in the abstraction hierarchy
  – Less complex than lower tiers
  – Assumption: People educated in our school are more likely to do parallel programming at higher rather than lower tiers
• Language X10 not widely used in practice
Conclusions
• “Tiers of parallelism” is a fruitful concept
  – course structure
  – orientation for students
• Focus on lab sessions important
  – provided skeletons and solutions for every exercise
  – few students chose their own language (typically much more complex than X10)
• X10 turned out to be a very good choice
  – succinct expression of programs at different tiers
  – steep learning curve
Sources
[1] Michael L. Scott: “Don’t start with Dekker’s algorithm - top-down introduction to concurrency”, Multicore Programming Education Workshop, 2009.
[2] Michael L. Scott: “Making the simple case simple”, Position paper, Workshop on Curricula for Concurrency, in conjunction with OOPSLA, 2009.
Thank you for your attention.
Teaching materials are available at http://www.in.ohm-hochschule.de/professors/praun/pp