TRANSCRIPT
Many-Core Software
Burton Smith, Microsoft
Computing is at a Crossroads
Continual performance improvement is our field's lifeblood:
It encourages people to buy new hardware
It opens up new software possibilities
Single-thread performance is nearing the end of the line
But Moore's Law will continue for some time to come
What can we do with all those transistors?
Computation needs to become as parallel as possible
Henceforth, serial means slow
Systems must support general purpose parallel computing
The alternative is commoditization
New many-core chips will need new software
Our programming models will have to change
The von Neumann premise is broken
The von Neumann Premise
Simply put, "instruction instances are totally ordered"
This notion has created artifacts:
Variables
Interrupts
Demand paging
And caused major problems:
The ILP wall
The power wall
The memory wall
What software changes will we need for many-core?
New languages?
New approaches for compilers, runtimes, tools?
New (or perhaps old) operating system ideas?
Do We Really Need New Languages?
Mainstream languages schedule values into variables:
To orchestrate the flow of values in the program
To incrementally but consistently update state
Introducing parallelism exposes weaknesses in:
Passing values between unordered instructions
Updating state consistently
Our "adhesive bandage" attempts have proven insufficient:
Not general enough
Not productive enough
So my answer is "Absolutely!"
Parallel Programming Languages
There are (at least) two promising approaches:
Functional programming
Atomic memory transactions
Neither is completely satisfactory by itself:
Functional programs don't allow mutable state
Transactional programs implement data flows awkwardly
Database applications show the synergy of these two ideas (sketched after this slide):
SQL is a "mostly functional" language
Transactions provide Consistency via Atomicity and Isolation
Many people think functional languages must be inefficient
Sisal and NESL are excellent counterexamples
Both competed strongly with Fortran on Cray systems
Others think memory transactions must be inefficient also
This remains to be seen; we have only just begun to optimize
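To make that synergy concrete, here is a minimal sketch in Haskell with GHC's stm package (the language and library are my choice for illustration; the slides name neither): the pure, functional part computes new values, and a transaction installs them atomically and in isolation.

```haskell
import Control.Concurrent.STM

-- Pure, functional core: no mutable state here.
interest :: Double -> Double -> Double
interest rate balance = balance * (1 + rate)

-- Transactional shell: the only mutation happens inside `atomically`,
-- so the update is atomic and isolated (the A and I behind consistency).
accrue :: TVar Double -> Double -> IO ()
accrue account rate = atomically $ do
  b <- readTVar account
  writeTVar account (interest rate b)

main :: IO ()
main = do
  acct <- newTVarIO 100.0
  accrue acct 0.05
  readTVarIO acct >>= print  -- 105.0
```

The division of labor mirrors the SQL analogy on the slide: a "mostly functional" computation, with state changes confined to transactions.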
Transactions and Invariants
Invariants are a program's conservation laws:
Relationships among values in iteration and recursion
Rules of data structure (state) integrity
If statements p and q preserve the invariant I and they do not "interfere", their parallel composition { p || q } also preserves I †
If p and q are performed atomically, i.e. as transactions, then they will not interfere ‡
Although operations seldom commute with respect to state, transactions give us commutativity with respect to the invariant (sketched below)
It would help if the invariants were available to the compiler
Can we ask programmers to supply them?
† Susan Owicki and David Gries. Verifying properties of parallel programs: An axiomatic approach. CACM 19(5):279-285, May 1976.
‡ Leslie Lamport and Fred Schneider. The "Hoare Logic" of CSP, And All That. ACM TOPLAS 6(2):281-296, Apr. 1984.
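A small Haskell sketch of this argument (Haskell STM assumed, as before): the invariant I is that the two balances always sum to 200. Each transfer preserves I, and because the transfers run as transactions they cannot interfere, so their parallel composition { p || q } preserves I as well.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (replicateM_)

-- Debit and credit happen atomically, so I holds before and after
-- every transfer, never observably in between.
transfer :: TVar Int -> TVar Int -> Int -> IO ()
transfer from to n = atomically $ do
  modifyTVar' from (subtract n)
  modifyTVar' to   (+ n)

main :: IO ()
main = do
  a <- newTVarIO 100
  b <- newTVarIO 100
  _ <- forkIO $ replicateM_ 1000 (transfer a b 1)  -- p
  replicateM_ 1000 (transfer b a 1)                -- q, in parallel
  total <- atomically $ (+) <$> readTVar a <*> readTVar b
  print total  -- always 200, regardless of interleaving
```

Note the slide's point about commutativity: the individual transfers do not commute with respect to the state (a, b), but every interleaving of them preserves the invariant.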
Styles of Parallelism
We probably need to support multiple programming styles (two are sketched after this list):
Both functional and transactional
Both data parallel and task parallel
Both message passing and shared memory
Both declarative and imperative
Both implicit and explicit
We may need several languages to accomplish this
After all, we do use multiple languages today
Language interoperability (e.g. .NET) will help greatly
It is essential that parallelism be exposed to the compiler
So that the compiler can adapt it to the target system
It is also essential that locality be exposed to the compiler
For the same reason
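As a hedged illustration of just the data-parallel and task-parallel rows (assuming Haskell with its parallel package; any of the pairs above could serve equally well): data parallelism applies one function across a collection, while task parallelism runs distinct computations side by side.

```haskell
import Control.Parallel.Strategies (parMap, rdeepseq)
import Control.Parallel (par, pseq)

-- Data parallel: the same work on every element, schedulable anywhere.
squares :: [Int] -> [Int]
squares = parMap rdeepseq (\x -> x * x)

-- Task parallel: two unrelated computations sparked concurrently.
both :: (Int, Int)
both = s `par` (p `pseq` (s, p))
  where s = sum     [1 .. 1000000 :: Int]
        p = product [1 .. 20      :: Int]

main :: IO ()
main = do
  print (take 5 (squares [1 .. 100]))  -- [1,4,9,16,25]
  print both
```

In both cases the parallelism is exposed declaratively, which is exactly what lets a compiler and runtime adapt it to the target system.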
Compiler Optimization for Parallelism
Some say automatic parallelization is a demonstrated failure
Vectorizing and parallelizing compilers (especially for the right architecture) have been a tremendous success:
They have enabled machine-independent languages
What they do can be termed parallelism packaging
Even manifestly parallel programs need it
What failed is parallelism discovery, especially in-the-large
Dependence analysis is chiefly a local success
Locality discovery in-the-large has also been a non-starter
Locality analysis is another word for dependence analysis
The jury is still out on in-the-large locality packaging
Local locality packaging works pretty well
Fine-grain Parallelism
Exploitable parallelism grows as task granularity shrinks
But dependences among tasks become more numerous
Inter-task dependence enforcement demands scheduling:
A task needing a value from elsewhere must wait for it
User-level work scheduling is needed:
No privilege change to stop or restart a task
Locality (e.g. cache content) can be better preserved
Today's OSes and hardware don't encourage waiting:
OS thread preemption makes blocking dangerous
Instruction sets encourage non-blocking approaches
Busy-waiting wastes instruction issue opportunities
We need better support for blocking synchronization (a user-level sketch follows)
In both instruction set and operating system
Resource Management Consequences
Since the user runtime is scheduling work on processors, the OS should not attempt to do the same
An asynchronous OS API is a necessary corollary
The user-exposed API should be synchronous
Scheduling memory via demand paging is also problematic
Instead, the application and OS should negotiate (a hypothetical sketch follows):
The application tells the OS its resource needs & desires
The OS makes decisions based on the big picture:
Requirements for quality of service
Availability of resources
Appropriateness of power level
The OS can preempt resources to reclaim them
But with notification, so the application can rearrange work
Resources should be time- and space-shared in chunks
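A purely hypothetical sketch of such a negotiation, in Haskell: every name and type here is an illustrative invention, not any real OS API. The application declares its needs, a toy "OS" grants what a fixed machine can spare, and preemption arrives as a notification carrying the new grant so the user-level runtime can repack its work first.

```haskell
-- Hypothetical types: not a real OS interface.
data Need  = Need  { wantCores :: Int, wantMB :: Int } deriving Show
data Grant = Grant { gotCores  :: Int, gotMB  :: Int } deriving Show

machineCores, machineMB :: Int
machineCores = 64
machineMB    = 32768

-- The OS weighs the big picture (QoS, availability, power);
-- this stub considers only availability.
negotiate :: Need -> Grant
negotiate (Need c m) = Grant (min c machineCores) (min m machineMB)

-- Preemption with notification: the application is handed the
-- smaller grant so it can rearrange work before resources vanish.
preempt :: Grant -> (Grant -> IO ()) -> IO ()
preempt g rearrange = rearrange g { gotCores = gotCores g `div` 2 }

main :: IO ()
main = do
  let g = negotiate (Need 96 16384)
  print g
  preempt g (\g' -> putStrLn ("rearranging work onto " ++ show g'))
```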
Bin Packing
The more resources allocated, the more swapping overhead
It would be nice to amortize it:
The more resources you get, the longer you may keep them
Roughly, this means scheduling = packing squarish blocks
QoS applications might need long rectangles instead
When the blocks don't fit, the OS can morph them a little
Or cut corners when absolutely necessary
[Figure: blocks of allocated resource packed against axes of quantity of resource (vertical) and time (horizontal)]
Parallel Debugging and Tuning
Today, debugging relies on single-stepping and printf()
Single-stepping a parallel program is a bit less effective
Conditional program and data breakpoints are helpful
To stop when an invariant fails to be true
Support for ad-hoc data perusal is also very important:
Debugging is data mining
Serial program tuning tries to discover where the program counter spends its time
The answer is usually found by sampling the PC
In contrast, parallel program tuning tries to discover where there is insufficient parallelism
A good way is to log perf counters and a timestamp at events (sketched below)
Visualization is a big deal for both debugging and tuning
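A small sketch of that logging idea, assuming Haskell with the stm and time packages (the event names and the runnable-task counter are illustrative inventions): record a timestamp and a parallelism counter at each interesting event, producing a trace a visualizer can consume.

```haskell
import Control.Concurrent.STM
import Data.Time.Clock (UTCTime, getCurrentTime)

data Event = Event
  { stamp    :: UTCTime  -- when the event happened
  , runnable :: Int      -- how much parallelism existed at that moment
  , tag      :: String   -- what kind of event it was
  } deriving Show

-- Append one timestamped sample of the parallelism counter.
logEvent :: TVar [Event] -> TVar Int -> String -> IO ()
logEvent logV counterV t = do
  now <- getCurrentTime
  n   <- readTVarIO counterV
  atomically $ modifyTVar' logV (Event now n t :)

main :: IO ()
main = do
  logV    <- newTVarIO []
  counter <- newTVarIO 8               -- pretend 8 tasks are runnable
  logEvent logV counter "task-spawn"
  atomically $ modifyTVar' counter (subtract 7)
  logEvent logV counter "barrier-wait" -- parallelism has collapsed to 1
  readTVarIO logV >>= mapM_ print . reverse
```

Plotting the runnable count against the timestamps is exactly the kind of visualization the slide calls for: valleys in the curve show where parallelism is insufficient.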
Conclusions
It is time to rethink some of the basics
There is lots of work for everyone to do
I've left out lots of things, e.g. applications
We need basic research as well as industrial development
Research in computer systems is deprecated these days
In the USA, NSF and DOD need to take the initiative