
Legacy Code in a Multicore Environment

30.4.2009

Jari Karppinen

Seminar on Multicore Programming

Contents

1. Introduction
2. Motivation
3. Available tools
4. Parallelisation
5. Case study
6. Concluding Remarks
7. Resources

Introduction

➢ Legacy code
  ➢ Millions of lines, maintained for more than 10 years
  ➢ Mostly written in C/C++
  ➢ No possibility to rewrite with current resources
  ➢ Often undocumented
  ➢ The original developers have left the company
  ➢ Might be running on a proprietary OS (no SMP support)

Introduction

➢ Legacy code parallelisation strategies
  ➢ Auto parallelisation
  ➢ Using multiple threads/processes
  ➢ “Partitioning”
  ➢ “Virtualisation”

Motivation

➔ Scaling by % serial code
➔ Amdahl's Law
➔ Karp-Flatt metric
➔ Parallelisation strategies

Scaling by % serial code

➢ Current legacy code scaling in a multicore environment:
  ➢ Only a few per cent of the code scales well
  ➢ 50 % does not scale at all

[Figure: speedup vs. number of cores for different shares of parallel code, with curves for perfect scaling and for 99 %, 95 %, 90 %, 80 %, 70 %, 60 % and 50 % parallel code]

Motivation

Amdahl's Law

It does not take overhead or load balancing into account

[Figure: block diagram of Amdahl's law, comparing 100 units of serial work with the same work split into two 50-unit parallel parts]

Motivation

Amdahl's law

Execution time = time in serial region + (time in parallel region / number of threads)
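Equivalently, writing s for the serial fraction of the work and p for the number of threads, Amdahl's law bounds the achievable speedup:

Speedup(p) = 1 / (s + (1 - s) / p)

For example, with s = 0.5 the speedup can never exceed 2, however many cores are added.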

[Figure: execution time vs. number of threads, split into a constant serial region and a shrinking parallel region]

Motivation

Karp-Flatt metric

➢ Uses the experimentally determined serial fraction to take load balancing and overhead into account.

➢ Given a parallel computation exhibiting speedup ψ on p processors, where p > 1, the experimentally determined serial fraction e is defined to be the Karp-Flatt metric.
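Concretely, as in the cited Wikipedia entry, the metric is computed as:

e = (1/ψ - 1/p) / (1 - 1/p)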

➔ The smaller the value of e, the better the parallelisation.

Motivation

Karp-Flatt metric

Execution time = time in serial region + (time in parallel region / number of threads) + synchronisation cost

[Figure: execution time vs. number of threads, comparing the Amdahl's law prediction with the Karp-Flatt model that adds synchronisation cost]

Motivation

Parallelisation strategies

➢ Separate tasks
➢ Multiple copies of the same task
➢ Task split over multiple threads
➢ Pipeline of tasks
➢ Client-server
➢ Producer-consumer model


Available tools

➔ Tools in general
➔ Execution analyser
➔ Thread analyser
➔ Performance analyser

Tools in general

➢ Arbitrary parallelisation may not give the desired result.
➢ Estimating the parallelisation of legacy code is difficult.
➢ Some tools analyse code statically or dynamically to find the parts that would benefit from parallelisation.
➢ There are also tools that ease problem finding after parallelisation.
➢ Many tools are vendor specific.

Available tools

Dynamic analysis tools

➢ With dynamic analysis tools you can locate e.g. producer-consumer relations in code.
➢ This is done by tracking the memory locations that one block writes and another reads.
➢ This information can be used when creating pipeline parallelism.
➢ Pipeline parallelism is suitable for both new and existing programs.

Available tools

Performance analyzer

➢ The purpose of profiling the execution of an application is to find the hotspots of that application.
➢ The hotspots indicate where attention needs to be spent in order to optimise the code.
➢ They are good candidates for threading, since the hotspots are the most computationally intensive portions of the serial code.

Available tools

Coarse-Grained Pipeline Parallelism in C Programs

➢ William Thies and colleagues from the MIT Computer Science and Artificial Intelligence Laboratory have made a tool to locate producer-consumer relations in stream programs.
➢ The tool analyses the application dynamically and recommends which parts can be converted to pipeline form, providing macros for the conversion.

Available tools

Coarse-Grained Pipeline Parallelism in C Programs

➢ They used several different kinds of stream programs to evaluate the performance gain:
  ➢ GMTI, MPEG-2, MP3, 197.parser, 256.bzip2 and 456.hmmer
➢ Some parts of the source code had to be changed, since the model did not support loops with break or continue inside.
➢ The speedup with 4 cores was approximately 2.78x.

Available tools

Performance analyzer

➢ A common Linux profiling tool is gprof. It is actually a display tool for data collected during the execution of an application compiled and instrumented for profiling.
➢ The -pg option, used in the cc command, will instrument C code.
➢ The instrumented binary generates a profile data file, “gmon.out”, when run.
➢ gprof prints the amount of time spent in each function and in child function calls.
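A minimal usage sketch; the program name app is illustrative:

$ cc -pg -o app app.c
$ ./app
$ gprof app gmon.out

The first command compiles with instrumentation, the run writes gmon.out into the working directory, and the last command prints the flat profile and call graph.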

Available tools

Thread analyzer

➢ Data races are the most common cause of errors in multi-threaded applications.
➢ They are also hard to isolate because of the non-deterministic scheduling of thread execution by the OS.
➢ Intel Thread Checker is a tool designed to identify data races, potential deadlocks, thread stalls and other threading errors.
➢ It performs dynamic analysis while the application executes.

Available tools

Thread analyzer

➢ The Valgrind tool suite provides a number of debugging and profiling tools.
➢ Helgrind is a Valgrind debugging tool for detecting synchronisation errors in C, C++ and Fortran programs that use the POSIX pthreads threading primitives.
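A typical invocation; the binary name is illustrative:

$ valgrind --tool=helgrind ./app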


Parallelisation

➔ Auto parallelisation
➔ Using multiple threads/processes
➔ Virtualisation
➔ Partitioning

Auto parallelisation

➢ Having the compiler analyse the code and determine whether it can be executed concurrently has been a research topic for many decades.
➢ Commutativity analysis for software parallelisation has been researched by Farhana Aleen and Nathan Clark at the Georgia Institute of Technology.
➢ Unfortunately, there have not been many breakthroughs in this field.

Parallelisation

Auto parallelisation

➢ The compiler does the work - no source changes
➢ Easy for the developer to use
➢ Loop-based parallelisation
➢ Effective only for certain kinds of applications
➢ The nature of C/C++ code is not well suited to auto parallelisation, e.g. pointer behaviour is difficult for the compiler to predict.

Parallelisation

Auto parallelisation

➢ The compiler prints a report of which loops were considered for parallelisation, whether the attempt succeeded and, in the case of failure, why.
➢ The programmer can then analyse the given reason.
➢ In the case of valid dependencies the programmer can rewrite the loop to make them disappear.
➢ The report can also be used as advice on applying OpenMP.

Parallelisation

Auto parallelisation

➢ Dependence analysis is a static method to determine what dependencies exist between variables referenced within the loop body across iterations of the loop.
➢ If no cross-iteration data races can be shown within the loop, the iterations can be executed concurrently and the loop can be parallelised.
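A short C sketch of the two cases the analysis distinguishes; the functions and arrays are only illustrative:

void independent(int n, int *a, const int *b, const int *c)
{
    /* Each iteration touches only index i: no cross-iteration
       dependence, so the iterations may run concurrently. */
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

void dependent(int n, int *a, const int *b)
{
    /* Iteration i reads a[i-1], which iteration i-1 writes: a
       cross-iteration dependence, so this loop cannot be
       parallelised as written. */
    for (int i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];
}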

Parallelisation

Auto parallelisation example

for (int i = 0; i < 100000; i++) {
    a[i] = b[i] + c[i];
}

$ cc -xautopar -xloopinfo -xvpara -o loop loop.c
"loop.c", line 3: PARALLELIZED, and serial version generated

Parallelisation

Using multiple threads/processes

➔ SMP support
➔ Timing
➔ Workload imbalance
➔ Spin locks & mutexes in legacy code
➔ Deadlocks - avoidance
➔ MT-safe vs MT-hot
➔ Hardware thrashing
➔ Memory ordering

Parallelisation

SMP support

➢ The operating system needs to support SMP
➢ The software needs to be divided into tasks
➢ Communication between threads/processes is needed (see the sketch after this list)
  ➢ messages
  ➢ shared memory
  ➢ barriers
  ➢ condition variables
➢ Locking is needed
➢ Atomic operations
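A minimal sketch of two of the listed mechanisms, a mutex paired with a condition variable, using POSIX threads; the payload value is illustrative:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int data = 0, available = 0;

static void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    data = 42;                    /* produce under the lock          */
    available = 1;
    pthread_cond_signal(&ready);  /* wake the waiting consumer       */
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!available)            /* guard against spurious wakeups  */
        pthread_cond_wait(&ready, &lock);
    printf("got %d\n", data);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}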

Parallelisation

Timing

➢ Unless timing is enforced, threads will progress at different rates
➢ You cannot rely on the access patterns of the serial code
➢ You cannot assume the time or order in which threads will run

Parallelisation

Spin locks & mutexes in legacy code

➢ Mutex
  ➢ A thread is rescheduled when the lock is busy and woken up when it becomes free
  ➢ Consumes no processor resources while waiting
  ➢ More lock and unlock overhead

➢ Spin locks
  ➢ Threads spin while the lock is busy
  ➢ Consume processor resources while waiting for the lock
  ➢ Good for locks that are held for short times
  ➢ In many cases multicore performance is lost when processes spin while waiting for a resource
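A sketch contrasting the two POSIX primitives; the counter and function names are illustrative:

#include <pthread.h>

static pthread_mutex_t    m;
static pthread_spinlock_t s;
static long counter;

void init_locks(void)
{
    pthread_mutex_init(&m, NULL);
    pthread_spin_init(&s, PTHREAD_PROCESS_PRIVATE);
}

void with_mutex(void)
{
    /* Busy lock: the thread is rescheduled and sleeps, consuming no
       CPU, but pays higher lock/unlock overhead. */
    pthread_mutex_lock(&m);
    counter++;
    pthread_mutex_unlock(&m);
}

void with_spinlock(void)
{
    /* Busy lock: the thread spins, burning CPU until the lock frees;
       cheap only when the hold time is very short. */
    pthread_spin_lock(&s);
    counter++;
    pthread_spin_unlock(&s);
}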

Parallelisation

Deadlocks - avoidance

➢ Ideal multicore software would be lock free.
➢ This cannot be achieved, or is very challenging to program.
➢ A thread analyser needs to be used.
➢ Avoid deadlocks by always acquiring resources in the same order.
➢ When legacy code runs in a multicore environment the operating system schedules processes in a non-deterministic way, and therefore previously hidden deadlocks pop up.
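A sketch of the acquire-in-the-same-order rule; the account structure is hypothetical:

#include <pthread.h>

struct account { int id; long balance; pthread_mutex_t lock; };

/* Always lock the account with the smaller id first: every thread
   acquires the two locks in the same global order, so no wait cycle
   (and hence no deadlock) can form between concurrent transfers. */
void transfer(struct account *from, struct account *to, long amount)
{
    struct account *first  = (from->id < to->id) ? from : to;
    struct account *second = (from->id < to->id) ? to : from;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
    from->balance -= amount;
    to->balance   += amount;
    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}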

Parallelisation

MT-Safe vs MT-hot

➢ MT-safe
  ➢ Multiple threads can call it and it doesn't crash
  ➢ May serialise

➢ MT-hot
  ➢ Multiple threads can call it with good performance
  ➢ Parallel algorithm

➢ Example
  ➢ Default malloc --> MT-safe
  ➢ mtmalloc --> MT-hot

➢ Many operating system calls may serialise legacy application execution.

Parallelisation

Workload imbalance

➢ Threads do not always perform the same amount of work at the same time; the faster threads then have to wait at the synchronisation point.

Parallelisation

Hardware thrashing

➢ Multiple cache lines mapping to the same cache entry
➢ This can be detected with a performance analyser
➢ It can then be located and fixed

Parallelisation

POSIX in legacy code adaptation

➢ Advantages
  ➢ The user has primitive-level control of the parallelisation

➢ Disadvantages
  ➢ Increases the complexity of the code significantly
  ➢ Finding competent people becomes difficult

Parallelisation

OpenMP in legacy code adaptation

➢ Pros
  ➢ The compiler does the work
  ➢ Minimal source changes
  ➢ Directive based
  ➢ Can also be compiled for a single thread to ease debugging
  ➢ You can incrementally parallelise the region of interest (see the sketch after this list)

➢ Cons
  ➢ Suitable only for certain types of applications
  ➢ e.g. control plane applications are hard to parallelise
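A minimal OpenMP sketch of the directive-based approach; the function and arrays are illustrative:

/* Compiled with OpenMP support (e.g. cc -xopenmp or gcc -fopenmp) the
   iterations are divided among threads; compiled without it the pragma
   is ignored and the loop runs serially, which is the single-thread
   debugging mode mentioned above. */
void add(int n, double *a, const double *b, const double *c)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];   /* each iteration is independent */
}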

Parallelisation

Virtualisation

➢ Can be used to run different applications on the same processor in independent environments.
➢ Used mainly on the server side
➢ Different technologies
  ➢ Sun Hypervisor
  ➢ Xen
  ➢ VMware

Parallelisation

Partitioning

➢ Asymmetric configuration of a multicore processor
➢ Different cores do different tasks
  ➢ e.g. in an 8-core processor, 4 cores can be allocated for SMP while the others run code in simple environments with no OS or a lightweight OS.
➢ More flexibility for the software architecture


Case study

➔ Comments from people
➔ What has happened
➔ Case shared variable
➔ Case Ericsson

Comments from people

➢ Another processor change
➢ Calculating MIPS is enough to estimate performance
➢ SMT will do the job for you
➢ The compiler will do the job for you

Case study

What has happened

➢ Many new faults pop up
➢ Performance decreases rather than increases
➢ Debugging gets difficult
➢ The whole software architecture needs to be changed

Case study

Case shared variable

➢ A simple two-thread application worked without locking on a multitasking operating system with a single-core processor. The threads shared one variable for reading and writing.

➢ This was possible because the application happened to execute in sequence.

➢ When the same code was run in a multicore environment, the application failed every time.
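A hedged reconstruction of the failure mode; the variable and loop counts are hypothetical. On a single core the threads happened never to overlap, but on two cores the unsynchronised read-modify-write races, and a mutex restores correctness:

#include <pthread.h>

static long shared;                 /* the variable both threads touch */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        /* An unsynchronised 'shared++' here is a data race on
           multicore; the lock makes the increment atomic. */
        pthread_mutex_lock(&lock);
        shared++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return shared == 2000000 ? 0 : 1;   /* holds only with the lock */
}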

Case study

Case Ericsson

➢ Björn Lisper has studied the parallelisation of legacy telecom software in a joint research project with Ericsson.

➢ Most languages like C/C++ have a memory concept, and thus statements must be executed in order.

➢ Only statements that are certainly free of dependences can be run in parallel.

➢ Pointers make the situation even worse.

Case study

Case Ericsson

➢ He found out that automatic parallelisation won't work in general.

➢ The software under inspection was server-type telecom software: AXE, Ericsson's classic telephone exchange.

➢ The software is event-driven; different job trees are typically concurrent and can be executed in parallel if no conflicts exist.

Case study

Case Ericsson

➢ Automatic parallelisation of legacy code is a pipe dream in general.

➢ It may work for special applications which have enough inherent concurrency.

➢ A simple static conflict analysis was done. The results are promising, but more research is needed.


Concluding Remarks

➢ Performance is the main driver for parallelising an application. If it is not the limiting factor, nothing needs to be done.
➢ There is no point in trying to parallelise everything.
➢ You need to identify the part of the code where most of the processing time is spent.
➢ You just need to continue living with legacy code.
➢ There are plenty of questions but few answers.
➢ Good luck to everybody with legacy code!

Resources

1. Wikipedia. Amdahl's law. http://en.wikipedia.org/wiki/Amdahl%27s_law. Referred on 29.04.2009.
2. Wikipedia. Karp-Flatt metric. http://en.wikipedia.org/wiki/Karp-Flatt_Metric. Referred on 29.04.2009.
3. Valgrind. Manual. http://valgrind.org/docs/manual/hg-manual.html. Referred on 30.04.2009.
4. Thies W. et al. A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs.
5. Knafla B. and Leopold C. Parallelizing a Real-Time Steering Simulation for Computer Games with OpenMP. John von Neumann Institute for Computing, Jülich, NIC Series Vol. 38.
6. Aleen F. and Clark N. Commutativity Analysis for Software Parallelization: Letting Program Transformations See the Big Picture. ASPLOS '09, March 7-11, 2009.
7. Lisper B. Parallelisation of Legacy Telecom Software. Multicore Days, 12.09.2008.

Resources

8. Developer portal. http://developer.sun.com
9. Sun Studio. http://developers.sun.com/studio
10. Hill M. & Marty M. (2008) Amdahl's Law in the Multicore Era. IEEE Computer 41(7), pp. 33-38.
11. Hughes C. & Hughes T. (2008) Professional Multicore Programming: Design and Implementation for C++ Developers. Wiley, Indianapolis, United States of America. 621 p.
12. Domeika M. (2008) Software Development for Embedded Multi-core Systems. Elsevier Inc., United States of America.
