PRINCIPLES OF PARALLEL PROGRAMMING

Calvin Lin, Department of Computer Sciences, The University of Texas at Austin

Lawrence Snyder, Department of Computer Science and Engineering, University of Washington, Seattle

Pearson Addison Wesley

Boston, San Francisco, New York, London, Toronto, Sydney, Tokyo, Singapore, Madrid, Mexico City, Munich, Paris, Cape Town, Hong Kong, Montreal

Contents

PART 1 Foundations

Chapter 1 Introduction 15
The Power and Potential of Parallelism 16
Parallelism, a Familiar Concept 16
Parallelism in Computer Programs 17
Multi-Core Computers, an Opportunity 18
Even More Opportunities to Use Parallel Hardware 19
Parallel Computing versus Distributed Computing 20
System Level Parallelism 20
Convenience of Parallel Abstractions 22
Examining Sequential and Parallel Programs 22
Parallelizing Compilers 22
A Paradigm Shift 23
Parallel Prefix Sum 27
Parallelism Using Multiple Instruction Streams 29
The Concept of a Thread 29
A Multithreaded Solution to Counting 3s 29
The Goals: Scalability and Performance Portability 39
Scalability 39
Performance Portability 40
Principles First 41
Chapter Summary 41
Historical Perspective 42
Exercises 42

Chapter 2 Understanding Parallel Computers 44
Balancing Machine Specifics with Portability 44
A Look at Six Parallel Computers 45
Chip Multiprocessors 45
Symmetric Multiprocessor Architectures 48
Heterogeneous Chip Designs 50
Clusters 53
Supercomputers 54
Observations from Our Six Parallel Computers 57
An Abstraction of a Sequential Computer 58
Applying the RAM Model 58
Evaluating the RAM Model 59
The PRAM: A Parallel Computer Model 60

The CTA: A Practical Parallel Computer Model 61
The CTA Model 61
Communication Latency 63
Properties of the CTA 66
Memory Reference Mechanisms 67
Shared Memory 67
One-Sided Communication 68
Message Passing 68
Memory Consistency Models 69
Programming Models 70
A Closer Look at Communication 71
Applying the CTA Model 72
Chapter Summary 73
Historical Perspective 73
Exercises 73

Chapter 3 Reasoning about Performance 75
Motivation and Basic Concepts 75
Parallelism versus Performance 75
Threads and Processes 76
Latency and Throughput 76
Sources of Performance Loss 78
Overhead 78
Non-Parallelizable Code 79
Contention 81
Idle Time 81
Parallel Structure 82
Dependences 82
Dependences Limit Parallelism 84
Granularity 86
Locality 87
Performance Trade-Offs 87
Communication versus Computation 88
Memory versus Parallelism 89
Overhead versus Parallelism 89
Measuring Performance 91
Execution Time 91
Speedup 92
Superlinear Speedup 92
Efficiency 93
Concerns with Speedup 93
Scaled Speedup versus Fixed-Size Speedup 95
Scalable Performance 95
Scalable Performance Is Difficult to Achieve 95
Implications for Hardware 96
Implications for Software 97
Scaling the Problem Size 97
Chapter Summary 98
Historical Perspective 98
Exercises 99

PART 2 Parallel Abstractions 101

Chapter 4 First Steps Toward Parallel Programming 102

Data and Task Parallelism 102
Definitions 102
Illustrating Data and Task Parallelism 103
The Peril-L Notation 103
Extending C 104
Parallel Threads 104
Synchronization and Coordination 105
Memory Model 106
Synchronized Memory 108
Reduce and Scan 109
The Reduce Abstraction 110
Count 3s Example 111
Formulating Parallelism 111
Fixed Parallelism 111
Unlimited Parallelism 112
Scalable Parallelism 113
Alphabetizing Example 114
Unlimited Parallelism 115
Fixed Parallelism 116
Scalable Parallelism 118
Comparing the Three Solutions 123
Chapter Summary 124
Historical Perspective 124
Exercises 124


Chapter 5 Scalable Algorithmic Techniques 126
Blocks of Independent Computation 127
Schwartz' Algorithm 129
The Reduce and Scan Abstractions 130
Example of Generalized Reduces and Scans 132
The Basic Structure 133
Structure for Generalized Reduce 136
Example of Components of a Generalized Scan 138
Applying the Generalized Scan 139
Generalized Vector Operations 139
Assigning Work to Processes Statically 140
Block Allocations 142
Overlap Regions 143
Cyclic and Block Cyclic Allocations 146
Irregular Allocations 148
Assigning Work to Processes Dynamically 148
Work Queues 151
Variations of Work Queues 151
Case Study: Concurrent Memory Allocation 153
Trees 153
Allocation by Sub-Tree 154
Dynamic Allocations 155
Chapter Summary 156
Historical Perspective 156
Exercises 156

PART 3 Parallel Programming Languages 157

Chapter 6 Programming with Threads 159
POSIX Threads 159
Thread Creation and Destruction 160

Mutual Exclusion 164
Synchronization 167
Safety Issues 177
Performance Issues 181
Case Study: Successive Over-Relaxation 188
Case Study: Overlapping Synchronization with Computation 193
Case Study: Streaming Computations on a Multi-Core Chip 201
Java Threads 201
Synchronized Methods 203
Synchronized Statements 203
The Count 3s Example 204
Volatile Memory 206
Atomic Objects 206
Lock Objects 207
Executors 207
Concurrent Collections 207
OpenMP 207
The Count 3s Example 208
Semantic Limitations on parallel for 209
Reduction 210
Thread Behavior and Interaction 211
Sections 213
Summary of OpenMP 213
Chapter Summary 214
Historical Perspective 214
Exercises 214

Chapter 7 MPI and Other Local View Languages 216
MPI: The Message Passing Interface 216
The Count 3s Example 217
Groups and Communicators 225
Point-to-Point Communication 226
Collective Communication 228
Example: Successive Over-Relaxation 233
Performance Issues 236
Safety Issues 242
Partitioned Global Address Space Languages 243
Co-Array Fortran 244
Unified Parallel C 245
Titanium 246
Chapter Summary 247
Historical Perspective 248
Exercises 248

Chapter 8 ZPL and Other Global View Languages 250
The ZPL Programming Language 250
Basic Concepts of ZPL 251
Regions 251
Array Computation 254
Life, an Example 256
The Problem 256
The Solution 256
How It Works 257
The Philosophy of Life 259
Distinguishing Features of ZPL 259
Regions 259
Statement-Level Indexing 259
Restrictions Imposed by Regions 260
Performance Model 260
Addition by Subtraction 261
Manipulating Arrays of Different Ranks 261
Partial Reduce 262
Flooding 263
The Flooding Principle 264
Data Manipulation, an Example 265
Flood Regions 266
Matrix Multiplication 267
Reordering Data with Remap 269
Index Arrays 269
Remap 270
Ordering Example 272
Parallel Execution of ZPL Programs 274
Role of the Compiler 274
Specifying the Number of Processes 275
Assigning Regions to Processes 275
Array Allocation 276
Scalar Allocation 277
Work Assignment 277
Performance Model 278
Applying the Performance Model: Life 279
Applying the Performance Model: SUMMA 280
Summary of the Performance Model 280
NESL Parallel Language 281
Language Concepts 281
Matrix Product Using Nested Parallelism 282
NESL Complexity Model 283
Chapter Summary 283
Historical Perspective 283
Exercises 284

Chapter 9 Assessing the State of the Art 285
Four Important Properties of Parallel Languages 285
Correctness 285
Performance 287
Scalability 288
Portability 288
Evaluating Existing Approaches 289
POSIX Threads 289
Java Threads 290
OpenMP 290
MPI 290
PGAS Languages 291
ZPL 292
NESL 292
Lessons for the Future 293
Hidden Parallelism 293
Transparent Performance 294
Locality 294
Constrained Parallelism 294
Implicit versus Explicit Parallelism 295
Chapter Summary 296
Historical Perspective 296
Exercises 296


PART 4 Looking Forward 297

Chapter 10 Future Directions in Parallel Programming 298
Attached Processors 298
Graphics Processing Units 299
Cell Processors 302
Attached Processors Summary 302
Grid Computing 304
Transactional Memory 305
Comparison with Locks 306
Implementation Issues 307
Open Research Issues 309
MapReduce 310
Problem Space Promotion 312
Emerging Languages 313
Chapel 314
Fortress 314
X10 316
Chapter Summary 318
Historical Perspective 318
Exercises 318

Chapter 11 Writing Parallel Programs 319
Getting Started 319
Access and Software 319
Hello, World 320
Parallel Programming Recommendations 321
Incremental Development 321
Focus on the Parallel Structure 321
Testing the Parallel Structure 322
Sequential Programming 323
Be Willing to Write Extra Code 323
Controlling Parameters during Testing 324
Functional Debugging 324
Capstone Project Ideas 325
Implementing Existing Parallel Algorithms 325
Competing with Standard Benchmarks 326
Developing New Parallel Computations 327
Performance Measurement 328
Comparing against a Sequential Solution 329
Maintaining a Fair Experimental Setting 329
Understanding Parallel Performance 330
Performance Analysis 331
Experimental Methodology 332
Portability and Tuning 333
Chapter Summary 333
Historical Perspective 333
Exercises 334

Glossary 335
References 339
Index 342