
Page 1: Cs498DP 1 Introduction

7/30/2019 Cs498DP 1 Introduction

http://slidepdf.com/reader/full/cs498dp-1-introduction 1/52

CS 498DP3/4/O INTRODUCTION TO PARALLEL PROGRAMMING

SPRING 2013 

Department of Computer Science

University of Illinois at Urbana-Champaign

Page 2: Cs498DP 1 Introduction

Topics covered

• Parallel algorithms

• Parallel programming languages

• Parallel programming techniques, focusing on tuning programs for performance.

• The course will build on your knowledge of algorithms, data structures, and programming. This is an advanced course in Computer Science for CS students.

• This course is a more advanced version of CS420

Page 3: Cs498DP 1 Introduction

Why parallel programming ?

• For science and engineering

• Science and engineering computations are often lengthy.

• Parallel machines have more computational power than their sequential counterparts.

• Faster computing → faster science/design

• If fixed resources: better science/engineering

• For everyday computing

• Scalable software will get faster with increased parallelism.

• Better power consumption.

• Yesterday: top of the line machines were parallel

• Today: parallelism is the norm for all classes of machines, from mobile devices to the fastest machines.

Page 4: Cs498DP 1 Introduction

CS498DP3/4/O

• A parallel programming course for Computer Science students.

• Assumes students are proficient programmers with knowledge of algorithms and data structures.

Page 5: Cs498DP 1 Introduction

Course organization

Course website: http://courses.engr.illinois.edu/cs498dp3/

Instructor: David Padua

4227 SC

[email protected]

3-4223

Office Hours: TBA

TA: G. Carl Evans

[email protected]

Grading: 7-10 Machine Problems (MPs) 40%

Homeworks Not graded

Midterm (Wednesday, Feb 27) 30%

Final (Comprehensive,

8:00-11:00 AM, Wednesday, May 8) 30%

Graduate students registered for 4 credits must complete

additional work (assigned as part of some of the MPs).

Page 6: Cs498DP 1 Introduction

MPs

• Several programming models

• Sequential (locality)

• Vector 

• Shared memory

• Distributed memory

• Common language will be C with extensions.

• Target machines will be

• Engineering workstations for development

• I2PC5 in single user mode for measurements

Page 7: Cs498DP 1 Introduction

Textbook

• Introduction to Parallel Computing by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison-Wesley, 2nd edition (January 26, 2003).

Page 8: Cs498DP 1 Introduction

Specific topics covered

• Material from the textbook, plus papers on specific topics, including:

• Locality

• Vector computing

• Compiler technology

Page 9: Cs498DP 1 Introduction

PARALLEL COMPUTING

Page 10: Cs498DP 1 Introduction

An active subdiscipline

• The history of computing is intertwined with parallelism.

• Parallelism has become an extremely active discipline

within Computer Science.

Page 11: Cs498DP 1 Introduction

What makes parallelism so important ?

• One reason is its impact on performance

• For a long time, the technology of high-end machines

• An important strategy to improve performance for all classes of machines

Page 12: Cs498DP 1 Introduction

Parallelism in hardware

• Parallelism is pervasive. It appears at all levels:

• Within a processor

• Basic operations

• Multiple functional units

• Pipelining

• SIMD

• Multiprocessors

• Multiplicative effect on performance

Page 13: Cs498DP 1 Introduction

Parallelism in hardware (Adders)

• Adders could be serial

• Parallel

• Or highly parallel

Page 14: Cs498DP 1 Introduction

Carry lookahead logic

Page 15: Cs498DP 1 Introduction


Parallelism in hardware

(Scalar vs SIMD array operations)

for (i=0; i<n; i++)
  c[i] = a[i] + b[i];

[Figure: register file feeding a 32-bit adder — operands X1 and Y1 produce result Z1]

Scalar code (executed n times):

ld r1, addr1
ld r2, addr2
add r3, r1, r2
st r3, addr3

SIMD code (executed n/4 times):

ldv vr1, addr1
ldv vr2, addr2
addv vr3, vr1, vr2
stv vr3, addr3

Page 16: Cs498DP 1 Introduction

Parallelism in hardware (Multiprocessors)

• Multiprocessing is the characteristic that is most evident in

clients and high-end machines.

Page 17: Cs498DP 1 Introduction

Power (1/3)

• With recent increases in frequency, there was also an increase in energy consumption.

• Power ∝ V² × frequency, and since voltage and frequency depend on each other:

Power ∝ frequency^2.5

Page 18: Cs498DP 1 Introduction


Power (2/3)

D. Yen, “Chip multithreading processors enable reliable high throughput computing,”

Keynote speech at International Symposium on Reliability Physics (IRPS), April 2005.

From Pradip Bose. Power Wall. Encyclopedia of Parallel Computing Springer Verlag.

Page 19: Cs498DP 1 Introduction


Challenges in Power 

• Energy consumption imposes limits at the high end. (“You would need a good-size nuclear power plant next door [for an exascale machine]” P. Kogge)

• It also imposes limits on mobile and other personal devices because of batteries. More processors imply more power (albeit only linear increases ?)

• This is a tremendous challenge at both ends of the computing spectrum.

• New architectures

• Heterogeneous systems

• No caches

• Ability to switch off parts of processors

• New hardware technology

Page 20: Cs498DP 1 Introduction


Power (3/3)

• At the same time, Moore’s Law is still going strong.

• Therefore increased parallelism is possible.

From Wikipedia

Page 21: Cs498DP 1 Introduction


Parallelism is the norm

• Despite all limitations, there is much parallelism today and more is coming.

• The most effective path towards performance gains

Page 22: Cs498DP 1 Introduction

Clients: Intel microprocessor performance

(Graph from Markus Püschel, ETH)

Knights Ferry MIC co-processor

Page 23: Cs498DP 1 Introduction

High-end machines: Top 500 number 1

[Chart: Gflop/s on a log scale (0.1 to 100,000,000), June 1999 to June 2011, for the Top 500 number-1 machine: theoretical peak performance, theoretical peak performance per core, and number of cores]

Page 24: Cs498DP 1 Introduction

• How can it be accessed ? In increasing degrees of complexity:

• Applications

• Programming

• Libraries

• Implicitly parallel

• Explicitly parallel.

Page 25: Cs498DP 1 Introduction

1. ISSUES IN APPLICATIONS

Page 26: Cs498DP 1 Introduction

Applications at the high-end

• Numerous applications have been developed in a wide

range of areas.

• Science

• Engineering

• Search engines

• Experimental AI

• Tuning for performance requires expertise.

• Although additional computing power is expected to help

advances in science and engineering, it is not that simple:

Page 27: Cs498DP 1 Introduction

More computational power is only part of the story

• “increase in computing power will need to be accompanied by changes in code architecture to improve the scalability, … and by the recalibration of model physics and overall forecast performance in response to increased spatial resolution” *

• “…there will be an increased need to work toward balanced systems with components that are relatively similar in their parallelizability and scalability”.*

• Parallelism is an enabling technology but much more is needed.

*National Research Council: The potential impact of high-end capability computing on four illustrative fields of science and engineering. 2008

Page 28: Cs498DP 1 Introduction

Applications for clients / mobile devices

• A few cores can be justified to support execution of multiple applications.

• But beyond that, … what app will drive the need for increased parallelism ?

• New machines will improve performance by adding cores. Therefore, in the new business model, software scalability is needed to make new machines desirable.

• Need an app that must be executed locally and requires increasing amounts of computation.

• Today, many applications ship computations to servers (e.g. Apple’s Siri). Is that the future ? Will bandwidth limitations force local computations ?

Page 29: Cs498DP 1 Introduction

2A. ISSUES IN PARALLEL

PROGRAMMING:

LIBRARIES

Page 30: Cs498DP 1 Introduction

Library routines

• Easy access to parallelism. Already available in some libraries (e.g. Intel’s MKL).

• Same conventional programming style. Parallel programs would look identical to today’s programs, with parallelism encapsulated in library routines.

• But, …

• Libraries are not always easy to use (data structures). Hence they are not always used.

• Locality across invocations is an issue.

• In fact, composability for performance is not effective today.

Page 31: Cs498DP 1 Introduction

2B. ISSUES IN PARALLEL

PROGRAMMING:

IMPLICIT PARALLELISM

Page 32: Cs498DP 1 Introduction

Objective:

Compiling conventional code

• Since the Illiac IV times:

• “The ILLIAC IV Fortran compiler's Parallelism Analyzer and Synthesizer (mnemonicized as the Paralyzer) detects computations in Fortran DO loops which can be performed in parallel.” (*)

(*) David L. Presberg. 1975. The Paralyzer: Ivtran's Parallelism Analyzer and Synthesizer. In Proceedings of the Conference on Programming Languages and Compilers for Parallel and Vector Machines. ACM, New York, NY, USA, 9-16.

Page 33: Cs498DP 1 Introduction

Benefits 

• Same conventional programming style. Parallel programs

would look identical to today’s programs with parallelism

extracted by the compiler.

• Machine independence.

• Compiler optimizes program.

• Additional benefit: legacy codes

• Much work in this area in the past 40 years, mainly at

Universities.

• Pioneered at Illinois in the 1970s

Page 34: Cs498DP 1 Introduction

The technology

• Dependence analysis is the foundation.

• It computes relations between statement instances

• These relations are used to transform programs:

• for locality (tiling),

• parallelism (vectorization, parallelization),

• communication (message aggregation),

• reliability (automatic checkpoints),

• power … 

Page 35: Cs498DP 1 Introduction

The technology

Example of use of dependence

• Consider the loop

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}

Page 36: Cs498DP 1 Introduction

The technology
Example of use of dependence

• Compute dependences (part 1)

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}

i=1:
j=1: a[1][1] = a[1][0] + a[0][1]
j=2: a[1][2] = a[1][1] + a[0][2]
j=3: a[1][3] = a[1][2] + a[0][3]
j=4: a[1][4] = a[1][3] + a[0][4]

i=2:
j=1: a[2][1] = a[2][0] + a[1][1]
j=2: a[2][2] = a[2][1] + a[1][2]
j=3: a[2][3] = a[2][2] + a[1][3]
j=4: a[2][4] = a[2][3] + a[1][4]

Page 37: Cs498DP 1 Introduction

The technology

Example of use of dependence

• Compute dependences (part 2)

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}

i=1:
j=1: a[1][1] = a[1][0] + a[0][1]
j=2: a[1][2] = a[1][1] + a[0][2]
j=3: a[1][3] = a[1][2] + a[0][3]
j=4: a[1][4] = a[1][3] + a[0][4]

i=2:
j=1: a[2][1] = a[2][0] + a[1][1]
j=2: a[2][2] = a[2][1] + a[1][2]
j=3: a[2][3] = a[2][2] + a[1][3]
j=4: a[2][4] = a[2][3] + a[1][4]

Page 38: Cs498DP 1 Introduction

The technology

Example of use of dependence

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}

[Figure: the (i,j) iteration space, i, j = 1..4 …, with dependence arcs between iterations]

Page 39: Cs498DP 1 Introduction

The technology

Example of use of dependence

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}

• Find parallelism

Page 40: Cs498DP 1 Introduction

The technology

Example of use of dependence

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}

• Transform the code

for (k=4; k<2*n; k++)
  forall (i = max(2,k-n) : min(n,k-2))
    a[i][k-i] = ...

Page 41: Cs498DP 1 Introduction

How well does it work ?

• Depends on three factors:

1. The accuracy of the dependence analysis

2. The set of transformations available to the compiler 

3. The sequence of transformations

Page 42: Cs498DP 1 Introduction

How well does it work ?

Our focus here is on vectorization

• Vectorization is important:

• Vector extensions are of great importance. Easy parallelism. Will continue to evolve

• SSE

• AltiVec

• Longest experience

• Most widely used. All compilers have a vectorization pass (parallelization less popular)

• Easier than parallelization/localization

• Best way to access vector extensions in a portable manner

• Alternatives: assembly language or machine-specific macros

Page 43: Cs498DP 1 Introduction

How well does it work ?

Vectorizers - 2005

[Bar chart: speedups (0 to 4) of manual vectorization vs. ICC 8.0 automatic vectorization on multimedia kernels, including Calculation_of_the_LTP…, Short_Term_Analysis_Filter, Short_Term_Synthesis_Filter, calc_noise2, synth_1to1, jpeg_idct_islow, dist1, fdct, form_component_prediction, idct, IWPixmap::init, persp_textured_triangle, gl_depth_test_span_generic, and mix_mystery_signal]

G. Ren, P. Wu, and D. Padua: An Empirical Study on the Vectorization of Multimedia Applications

for Multimedia Extensions. IPDPS 2005

Page 44: Cs498DP 1 Introduction

 

How well does it work ?
Vectorizers - 2010

S. Maleki, Y. Gao, T. Wong, M. Garzarán, and D. Padua. An Evaluation of Vectorizing Compilers. International Conference on Parallel Architecture and Compilation Techniques (PACT), 2011.

Page 45: Cs498DP 1 Introduction

Going forward

• It is a great success story. Practically all compilers today have a vectorization pass (and a parallelization pass).

• But… research in this area stopped a few years back, although all compilers do vectorization and it is a very desirable property.

• Some researchers thought that the problem was impossible to solve.

• However, work has not been as extensive nor as long as work done in AI for chess or question answering.

• No doubt that significant advances are possible.

Page 46: Cs498DP 1 Introduction

What next ?

3-10-2011

Inventor, futurist predicts dawn of total artificial intelligence

Brooklyn, New York (VBS.TV) -- ...Computers will be able to improve their own source codes ... in ways we puny humans could never conceive.

Page 47: Cs498DP 1 Introduction

2C. ISSUES IN PARALLEL

PROGRAMMING:

EXPLICIT PARALLELISM

Page 48: Cs498DP 1 Introduction

Accomplishments of the last decades in programming notation

• Much has been accomplished

• Widely used parallel programming notations:

• Distributed memory (SPMD/MPI) and

• Shared memory (pthreads/OpenMP/TBB/Cilk/ArBB).

Page 49: Cs498DP 1 Introduction

Languages

• OpenMP constitutes an important advance, but its most important contribution was to unify the syntax of the 1980s (Cray, Sequent, Alliant, Convex, IBM, …).

• MPI has been extraordinarily effective.

• Both have mainly been used for numerical computing. Both are widely considered as “low level”.

Page 50: Cs498DP 1 Introduction

The future

• Higher level notations

• Libraries are a higher level solution, but perhaps too high-level.

• Want something at a lower level that can be used to

program in parallel.

• The solution is to use abstractions.

Page 51: Cs498DP 1 Introduction

Array operations in MATLAB

• An example of abstractions are array operations.

• They are not only appropriate for parallelism, but also to better represent computations.

• In fact, the first uses of array operations do not seem to be related to parallelism. E.g. Iverson’s APL (ca. 1960).

• Array operations are also powerful higher-level abstractions for sequential computing.

• Today, MATLAB is a good example of language extensions for vector operations.

Page 52: Cs498DP 1 Introduction

Array operations in MATLAB

Matrix addition in scalar mode:

for i=1:m,
  for j=1:l,
    c(i,j) = a(i,j) + b(i,j);
  end
end

Matrix addition in array notation:

c = a + b;