TRANSCRIPT
Many-Core Software
Burton Smith, Microsoft
Computing is at a Crossroads
Continual performance improvement is our field's lifeblood:
It encourages people to buy new hardware
It opens up new software possibilities
Single-thread performance is nearing the end of the line
But Moore's Law will continue for some time to come
What can we do with all those transistors?
Computation needs to become as parallel as possible
Henceforth, serial means slow
Systems must support general purpose parallel computing
The alternative is commoditization
New many-core chips will need new software
Our programming models will have to change
The von Neumann premise is broken
The von Neumann Premise
Simply put, "instruction instances are totally ordered"
This notion has created artifacts:
Variables
Interrupts
Demand paging
And caused major problems:
The ILP wall
The power wall
The memory wall
What software changes will we need for many-core?
New languages?
New approaches for compilers, runtimes, tools?
New (or perhaps old) operating system ideas?
Do We Really Need New Languages?
Mainstream languages schedule values into variables:
To orchestrate the flow of values in the program
To incrementally but consistently update state
Introducing parallelism exposes weaknesses in:
Passing values between unordered instructions
Updating state consistently
Our "adhesive bandage" attempts have proven insufficient:
Not general enough
Not productive enough
So my answer is "Absolutely!"
Parallel Programming Languages
There are (at least) two promising approaches:
Functional programming
Atomic memory transactions
Neither is completely satisfactory by itself:
Functional programs don't allow mutable state
Transactional programs implement data flows awkwardly
Database applications show the synergy of these two ideas (sketched after this slide):
SQL is a "mostly functional" language
Transactions provide Consistency via Atomicity and Isolation
Many people think functional languages must be inefficient
Sisal and NESL are excellent counterexamples
Both competed strongly with Fortran on Cray systems
Others think memory transactions must be inefficient also
This remains to be seen; we have only just begun to optimize
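To make that synergy concrete, here is a minimal sketch in Haskell with GHC's stm package (the language and library are my choice for illustration; the slides name neither): the pure, functional part computes new values, and a transaction installs them atomically and in isolation.

```haskell
import Control.Concurrent.STM

-- Pure, functional core: no mutable state here.
interest :: Double -> Double -> Double
interest rate balance = balance * (1 + rate)

-- Transactional shell: the only mutation happens inside `atomically`,
-- so the update is atomic and isolated (the A and I behind consistency).
accrue :: TVar Double -> Double -> IO ()
accrue account rate = atomically $ do
  b <- readTVar account
  writeTVar account (interest rate b)

main :: IO ()
main = do
  acct <- newTVarIO 100.0
  accrue acct 0.05
  readTVarIO acct >>= print  -- 105.0
```

The division of labor mirrors the SQL analogy on the slide: a "mostly functional" computation, with state changes confined to transactions.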
Transactions and Invariants
Invariants are a program's conservation laws:
Relationships among values in iteration and recursion
Rules of data structure (state) integrity
If statements p and q preserve the invariant I and they do not "interfere", their parallel composition { p || q } also preserves I †
If p and q are performed atomically, i.e. as transactions, then they will not interfere ‡
Although operations seldom commute with respect to state, transactions give us commutativity with respect to the invariant (sketched below)
It would help if the invariants were available to the compiler
Can we ask programmers to supply them?
† Susan Owicki and David Gries. Verifying properties of parallel programs: An axiomatic approach. CACM 19(5):279-285, May 1976.
‡ Leslie Lamport and Fred Schneider. The "Hoare Logic" of CSP, And All That. ACM TOPLAS 6(2):281-296, Apr. 1984.
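A small Haskell sketch of this argument (Haskell STM assumed, as before): the invariant I is that the two balances always sum to 200. Each transfer preserves I, and because the transfers run as transactions they cannot interfere, so their parallel composition { p || q } preserves I as well.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (replicateM_)

-- Debit and credit happen atomically, so I holds before and after
-- every transfer, never observably in between.
transfer :: TVar Int -> TVar Int -> Int -> IO ()
transfer from to n = atomically $ do
  modifyTVar' from (subtract n)
  modifyTVar' to   (+ n)

main :: IO ()
main = do
  a <- newTVarIO 100
  b <- newTVarIO 100
  _ <- forkIO $ replicateM_ 1000 (transfer a b 1)  -- p
  replicateM_ 1000 (transfer b a 1)                -- q, in parallel
  total <- atomically $ (+) <$> readTVar a <*> readTVar b
  print total  -- always 200, regardless of interleaving
```

Note the slide's point about commutativity: the individual transfers do not commute with respect to the state (a, b), but every interleaving of them preserves the invariant.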
Styles of Parallelism
We probably need to support multiple programming styles (two are sketched after this list):
Both functional and transactional
Both data parallel and task parallel
Both message passing and shared memory
Both declarative and imperative
Both implicit and explicit
We may need several languages to accomplish this
After all, we do use multiple languages today
Language interoperability (e.g. .NET) will help greatly
It is essential that parallelism be exposed to the compiler
So that the compiler can adapt it to the target system
It is also essential that locality be exposed to the compiler
For the same reason
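As a hedged illustration of just the data-parallel and task-parallel rows (assuming Haskell with its parallel package; any of the pairs above could serve equally well): data parallelism applies one function across a collection, while task parallelism runs distinct computations side by side.

```haskell
import Control.Parallel.Strategies (parMap, rdeepseq)
import Control.Parallel (par, pseq)

-- Data parallel: the same work on every element, schedulable anywhere.
squares :: [Int] -> [Int]
squares = parMap rdeepseq (\x -> x * x)

-- Task parallel: two unrelated computations sparked concurrently.
both :: (Int, Int)
both = s `par` (p `pseq` (s, p))
  where s = sum     [1 .. 1000000 :: Int]
        p = product [1 .. 20      :: Int]

main :: IO ()
main = do
  print (take 5 (squares [1 .. 100]))  -- [1,4,9,16,25]
  print both
```

In both cases the parallelism is exposed declaratively, which is exactly what lets a compiler and runtime adapt it to the target system.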
Compiler Optimization for Parallelism
Some say automatic parallelization is a demonstrated failure
Vectorizing and parallelizing compilers (especially for the right architecture) have been a tremendous success:
They have enabled machine-independent languages
What they do can be termed parallelism packaging
Even manifestly parallel programs need it
What failed is parallelism discovery, especially in-the-large
Dependence analysis is chiefly a local success
Locality discovery in-the-large has also been a non-starter
Locality analysis is another word for dependence analysis
The jury is still out on in-the-large locality packaging
Local locality packaging works pretty well
Fine-grain Parallelism
Exploitable parallelism grows as task granularity shrinks
But dependences among tasks become more numerous
Inter-task dependence enforcement demands scheduling:
A task needing a value from elsewhere must wait for it
User-level work scheduling is needed:
No privilege change to stop or restart a task
Locality (e.g. cache content) can be better preserved
Today's OSes and hardware don't encourage waiting:
OS thread preemption makes blocking dangerous
Instruction sets encourage non-blocking approaches
Busy-waiting wastes instruction issue opportunities
We need better support for blocking synchronization (a user-level sketch follows)
In both instruction set and operating system
Resource Management Consequences
Since the user runtime is scheduling work on processors, the OS should not attempt to do the same
An asynchronous OS API is a necessary corollary
The user-exposed API should be synchronous
Scheduling memory via demand paging is also problematic
Instead, the application and OS should negotiate (a hypothetical sketch follows):
The application tells the OS its resource needs & desires
The OS makes decisions based on the big picture:
Requirements for quality of service
Availability of resources
Appropriateness of power level
The OS can preempt resources to reclaim them
But with notification, so the application can rearrange work
Resources should be time- and space-shared in chunks
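A purely hypothetical sketch of such a negotiation, in Haskell: every name and type here is an illustrative invention, not any real OS API. The application declares its needs, a toy "OS" grants what a fixed machine can spare, and preemption arrives as a notification carrying the new grant so the user-level runtime can repack its work first.

```haskell
-- Hypothetical types: not a real OS interface.
data Need  = Need  { wantCores :: Int, wantMB :: Int } deriving Show
data Grant = Grant { gotCores  :: Int, gotMB  :: Int } deriving Show

machineCores, machineMB :: Int
machineCores = 64
machineMB    = 32768

-- The OS weighs the big picture (QoS, availability, power);
-- this stub considers only availability.
negotiate :: Need -> Grant
negotiate (Need c m) = Grant (min c machineCores) (min m machineMB)

-- Preemption with notification: the application is handed the
-- smaller grant so it can rearrange work before resources vanish.
preempt :: Grant -> (Grant -> IO ()) -> IO ()
preempt g rearrange = rearrange g { gotCores = gotCores g `div` 2 }

main :: IO ()
main = do
  let g = negotiate (Need 96 16384)
  print g
  preempt g (\g' -> putStrLn ("rearranging work onto " ++ show g'))
```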
Bin Packing
The more resources allocated, the more swapping overhead
It would be nice to amortize it:
The more resources you get, the longer you may keep them
Roughly, this means scheduling = packing squarish blocks
QoS applications might need long rectangles instead
When the blocks don't fit, the OS can morph them a little
Or cut corners when absolutely necessary
[Figure: blocks of allocated resource packed against axes of quantity of resource (vertical) and time (horizontal)]
Parallel Debugging and Tuning
Today, debugging relies on single-stepping and printf()
Single-stepping a parallel program is a bit less effective
Conditional program and data breakpoints are helpful
To stop when an invariant fails to be true
Support for ad-hoc data perusal is also very important:
Debugging is data mining
Serial program tuning tries to discover where the program counter spends its time
The answer is usually found by sampling the PC
In contrast, parallel program tuning tries to discover where there is insufficient parallelism
A good way is to log perf counters and a timestamp at events (sketched below)
Visualization is a big deal for both debugging and tuning
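A small sketch of that logging idea, assuming Haskell with the stm and time packages (the event names and the runnable-task counter are illustrative inventions): record a timestamp and a parallelism counter at each interesting event, producing a trace a visualizer can consume.

```haskell
import Control.Concurrent.STM
import Data.Time.Clock (UTCTime, getCurrentTime)

data Event = Event
  { stamp    :: UTCTime  -- when the event happened
  , runnable :: Int      -- how much parallelism existed at that moment
  , tag      :: String   -- what kind of event it was
  } deriving Show

-- Append one timestamped sample of the parallelism counter.
logEvent :: TVar [Event] -> TVar Int -> String -> IO ()
logEvent logV counterV t = do
  now <- getCurrentTime
  n   <- readTVarIO counterV
  atomically $ modifyTVar' logV (Event now n t :)

main :: IO ()
main = do
  logV    <- newTVarIO []
  counter <- newTVarIO 8               -- pretend 8 tasks are runnable
  logEvent logV counter "task-spawn"
  atomically $ modifyTVar' counter (subtract 7)
  logEvent logV counter "barrier-wait" -- parallelism has collapsed to 1
  readTVarIO logV >>= mapM_ print . reverse
```

Plotting the runnable count against the timestamps is exactly the kind of visualization the slide calls for: valleys in the curve show where parallelism is insufficient.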
Conclusions
It is time to rethink some of the basics
There is lots of work for everyone to do
I've left out lots of things, e.g. applications
We need basic research as well as industrial development
Research in computer systems is deprecated these days
In the USA, NSF and DOD need to take the initiative