introduction to multicore .ppt
TRANSCRIPT
Agenda
• Background
• Drivers for Multicore
• How is SW prepared?
• Challenges of multicore programming
• Functional programming
• Refactoring
• Hybrid approaches
• State of industry
• Summary
• Q&A
Background
• Moore’s Law (transistor density doubles every 18 months)
• But the gap between transistor count and performance is increasing
• Between 1993 and 1999, CPU speeds increased 10 times
• The first 1 GHz CPU came in 2000. We should have had a 10 GHz CPU by now. It is not there. Where is it?
• Intel’s 3.4 GHz CPU was introduced in 2004. Where is the 4 GHz processor?
The answer is that it is unlikely to ever come!
CPU Clock speed increase over years
Source: http://www.cs.utexas.edu/users/cart/publications/isca00.pdf
Gap is increasing!
Source: http://www.embedded.com/columns/technicalinsights/198701652?_requestid=1042869
Slowing signs
• Over the last 30 years, CPU designers achieved performance in three ways:
– Clock speed (new processes, materials etc.)
– Execution optimization (doing more per cycle: pipelining, branch prediction, multiple instructions in the same cycle etc.)
– Cache (putting memory closer to the CPU: 2 MB+ caches are now common)
• These techniques are running out of steam:
– Clock speed: heat, physical issues, leakage currents
– Less and less return from execution optimization, though cache size still has potential to go up
Source: embedded.com
Heat is on!
What is the semiconductor industry trying to do?
• Create simpler cores and put more of them in a single package
• It is easier for semiconductor vendors to do this than to increase clock speed
• Instead of a 10 GHz processor, have ten 1 GHz cores!
• All cores have L1 caches but access shared memory outside
• The first such processors were from Intel, and now there are many multicore processors
• Initially, multicore processors were for the server market
• Now they are in desktops and even in embedded products
Some multicore processors..
• Dual-core processors
– One general-purpose core and another specialized core
– Been in the market for some time
– Network processors (Intel IXP)
– Signal processors (TI OMAP processors)
• Intel Quad Core, Atom dual core
– Intel: Nehalem – Core i7 with 8 cores
– AMD: Montreal – 8 cores
• Cavium: 16 cores
• Tilera: 64 cores
• Predictions of hundreds and even thousands of cores in coming years
• Intel Larrabee rumoured to be 32 cores
Typical multicore processor
Courtesy: http://zone.ni.com/cms/images/devzone/tut/rwlesvfu47926.jpg
How is Software Community prepared?
• “There ain’t no such thing as a free lunch” – R. A. Heinlein
• Software community has enjoyed regular performance gains for decades without doing anything special
• Now the party is coming to an end and it is a rude shock
• There is expectation that SW should take over from HW in driving next generation performance improvements
• So what is to be done?
What is to be done?
Need to move to parallel programming model
Challenges of parallel programming
• Concurrency
– Multiple threads of execution access the same pieces of data
– Need to synchronize the access
– Possibility of race conditions and deadlocks
Challenges of parallel programming – Race conditions
Trivial C statement:
b = b + 1;
Assume that there are two threads running this line of code, where b is a variable shared by the two threads, and b started with the value 5.
Possible order of execution
(thread1) load b into some register in thread 1.
(thread2) load b into some register in thread 2.
(thread1) add 1 to thread 1's register, computing 6.
(thread2) add 1 to thread 2's register, computing 6.
(thread1) store the register value (6) to b.
(thread2) store the register value (6) to b.
• We started with 5, then two threads each added one, but the final result is 6 -- not the expected 7. The problem is that the two threads interfered with each other, causing a wrong final answer.
• Threads do not execute atomically, performing single operations all at once. Another thread may interrupt it between essentially any two instructions and manipulate some shared resource.
Challenges of parallel programming - Deadlocks

void *function1()
{
    pthread_mutex_lock(&lock1);   /* Execution step 1 */
    pthread_mutex_lock(&lock2);   /* Execution step 3 - DEADLOCK!!! */
    ...
    pthread_mutex_unlock(&lock2);
    pthread_mutex_unlock(&lock1);
    ...
}

void *function2()
{
    pthread_mutex_lock(&lock2);   /* Execution step 2 */
    pthread_mutex_lock(&lock1);
    ...
    pthread_mutex_unlock(&lock1);
    pthread_mutex_unlock(&lock2);
}

main()
{
    pthread_create(&thread1, NULL, function1, NULL);
    pthread_create(&thread2, NULL, function2, NULL);
}
Challenges of parallel programming
• Challenge of visualization
– Difficult to visualize beyond 4-5 threads, each doing different activities
– The human mind is not used to thinking/processing data in parallel
– Complexity of the interplay of threads increases exponentially beyond a few threads
Challenges of parallel programming
• How can one be confident that there are no bugs?
– Race conditions depend on timing
– Changes in the execution times of one thread or another can change the access order to variables, triggering a previously unnoticed race
– It is entirely possible for code to pass all rounds of testing and work fine for years before a set of conditions like memory speed or process load brings a new bug to life
– Lastly, it may not be possible to reproduce the bug at all times (it may occur on one server but not on another under the same environment)
• This leaves an uneasy feeling about the code
• Visualization tools are coming to market, but they are inadequate
What are the options?
Courtesy: http://bigyellowtaxi.files.wordpress.com/2008/06/crossroads.jpg
Option #1: Functional Programming
Option #1: Functional Programming
• The main issue of concurrent programming is parallel access to shared data
• All languages suffer from the same issue
• So can we avoid the whole thing?
• Functional programming languages take a radically different approach to the problem
• Roots in theoretical computer science: the lambda calculus, developed by Alonzo Church at Princeton University in the 1930s, alongside Turing, Gödel and von Neumann
Functional Programming - background
• Based on a mathematical system called a “formal system”
• A set of “axioms” and a “set of rules on how to operate on them”
• Stack them up into complex rules
• Church was interested in “computability” issues and not programming issues (computers did not exist then)
• LISP was the first implementation of Alonzo Church’s lambda calculus on computers
So what is functional programming?
• Functions are used for everything, even the simplest of computations
• Variables are just aliases – they cannot even change values (immutable)
• State is only on the stack
• No concept of global variables
• No “side effects”
• One cannot modify state inside a function or outside of it
• If you call the same function with the same parameters, it returns identical results every time
• Since there are no global variables – indeed no mutable variables at all – there are no race conditions and no deadlocks!
Referential Transparency
Source: Functions + Messages + Concurrency = Erlang – Presentation by Joe Armstrong
Advantages of FP
• Unit testing
– Check all possible inputs to a function
– Check by passing parameters that represent edge cases
• Debugging
– In imperative programming, there is no guarantee that you can reproduce the bug in the same sequence
– The main reason is that program behaviour depends on object states, global variables and external state
– In FP, a bug is a bug and it surfaces every time
• Concurrency
– No global variables, no race conditions, no locks
– Easy to refactor code:
String s1 = operation1();
String s2 = operation2();
String s3 = concat(s1, s2);
– The first and second lines can be run on 2 different cores
– Tools can easily refactor code
Popular environments
• Erlang
– Invented by Ericsson
– Concurrent programming environment
– Message passing
– Used in highly scalable, mission-critical switch products with hot upgrade support
• Haskell
– Haskell is a pure functional language
• LISP
– The original parent
Example code fragment
• Erlang
Courtesy: TBD
So, why are programmers not jumping?
• Complete change of mindset needed!
• Learning curve is steep
• “University and research” image
• Needs lots of resources (recursion, stack-based state)
• CIOs are not comfortable
• Lack of trained staff
• Only strong believers are adopting
Option #2: Refactoring
Refactoring
• There is just too much existing software
• No one develops software from scratch; it is built on existing code bases/products
• Let us look at tools that help refactor existing code for performance gains rather than asking people to code in new languages
• May not be able to utilize all cores to the maximum, but it is an evolutionary path
Refactoring tools for non-functional programmers
• Many tools are coming to market to help refactoring code to multiple core processors
• Key ones:
– RapidMind
– OpenMP
– CUDA
– Ct
– Cilk
– Pervasive Software
– And many others
RapidMind
• Startup based in Waterloo, Canada
• Helps parallelize code across multiple cores
• Supports the C++ language
• Can be targeted to different processor families: AMD, x86, Cell, nVidia etc.
• Extensive changes needed in the code
• Plus points:
– Multiple platforms, no lock-in to an architecture
– 1000’s of threads
– Works with existing compilers
• Minus points:
– Lock-in to the RapidMind platform!
• Costs:
– $1500 for the dev platform
RapidMind
OpenMP
• Standard for multiprocessor shared-memory programming
• Defined for C/C++/Fortran
• OpenMP = compiler directives + runtime library + environment variables
• Code is instrumented with directives
• Threads are created and destroyed automatically
OpenMP example code
int main(int argc, char **argv)
{ const int N = 100000;
int i, a[N];
#pragma omp parallel for
for (i = 0; i < N; i++)
a[i] = 2 * i;
return 0;
}
OpenMP
• Plus points:
– Easy to adopt
– Incremental parallelization
– Hiding of thread semantics
– Scalable to many threads
– Coarse/fine-grained parallelization
• Minus points:
– Popular only in scientific communities
– Needs a separate tool chain
• Products:
– Sun Studio 12
Source: Sun Studio presentation
CUDA
• From nVidia
• Defines extensions to the code to parallelize it
• Concept of a “thread block”
• Each thread block has one shared memory to work on
• Scalable to 1000’s of threads
• Defined for C/C++ languages
• Plus points:
– Scales very well for scientific and visualization communities (financial markets, computational mechanics, computational biology etc.)
– Well integrated with nVidia processors – no middleware
• Minus points:
– Vendor lock-in to nVidia/CUDA (processor/tool chain)
Is there a middle way?
Option #3: Hybrid environments
• How to experiment with FP without throwing away all the investments made so far?
• Scala:
– Hybrid language
– Combines Java-style OO with the FP paradigm
– Compiles to bytecode
– Can use Java libraries in Scala code; can inherit from a Java class
– Uses Eclipse tool chains and plugins
– Really married to Java
• F#:
– Experiment with FP safely from the known .NET environment
– Use FP where needed, use .NET elsewhere
Status of the industry
OS readiness
• Windows XP can support 4 logical cores, 2 physical cores; supports SMP model
• Windows Vista, Windows 7 better tuned for multicore
• Finetuning of global locks, mutexes, better graphics performance
• Linux 2.6+ has SMP support and hence multicore support
“Embarrassingly” Parallel Programs
• An embarrassingly parallel workload (or embarrassingly parallel problem) is one for which little or no effort is required to separate the problem into a number of parallel tasks.
• Examples:
– The Mandelbrot set and other fractal calculations, where each point can be calculated independently.
– Rendering of graphics; each pixel may be rendered independently. In computer animation, each frame may be rendered independently
– Large-scale face recognition that involves comparing thousands of input faces with a similarly large number of faces.
– Computer simulations comparing many independent scenarios, such as climate models.
– Genetic algorithms and other evolutionary computation
– Weather prediction models
– Event simulation and reconstruction in particle physics.
• Compilation and build systems: each compilation can be run in parallel for a performance gain (example: make -j, or dmake in Sun Studio)
• CUDA and OpenMP are very popular and widely used in the scientific and visualization markets
Source: Wikipedia
Server programming
• Servers run OSes like Linux 2.6 that support pthreads and multithreading
• The server daemon spawns a pthread for each request
• The OS schedules them on different cores
• Decent scalability is seen for server-side programs on multicore systems
• Examples: web services
• Challenges are IO performance, cache size, performance tuning of the OS, web server etc.
Source: http://www.edn.com/index.asp?layout=articlePrint&articleID=CA6646279
Server Virtualization
• Virtualization allows different OSes to run on different cores
• One core can be running Linux and another Windows
• This is quite popular and allows efficient use of cores
Packet processing applications
• Packet processing lends itself to parallel processing
• Each IP packet is independent
• IP forwarding is largely stateless processing
• Multiple cores can take packets from an incoming queue, do the processing and put them on an output queue
• Similar applications in security
• Solutions from Tilera, Cavium
Packet processing applications
Source: http://www.cisco.com/en/US/prod/collateral/routers/
Highly reliable concurrent systems
• Functional programs are getting used here
• Major users:
– Ericsson (GPRS system)
– Jabber
– Twitter
– T-Mobile SMS system
– Nortel VPN gateway product
• Some people are trying out hybrids in their systems
– Scala: Twitter
Desktop applications
• Common programs like word processing, spreadsheets are not amenable to parallel processing or threading
• Unlikely that desktop applications will be rewritten with multiple threads
• So don’t expect much performance gains by running this PPT on a quadcore processor
• But simple things are possible:
– While you download a document, a virus scan can run on another core
– Provide processor affinity in Windows
Embedded Applications
• Embedded applications without an OS are typically hand-crafted
• Limited scope for using multiple cores
• Scope exists in running specialized code on one core and the rest of the code on another
Courtesy: Wind River Inc
Challenges of performance improvement
• Lots of existing code is not written in a multithreaded way, nor is it amenable to it
• There are other factors, like IO and memory issues, beyond just the CPU. The best results are seen for CPU-intensive jobs, but may not be for IO-bound jobs
• There could be bottlenecks in inter-core communication
• Typical applications like CRUD may find it difficult to scale
Sweet-spots of multicore software
• Virtualization
• Server programming
• Packet processing
• Scientific computing, visualization
Summary
• Multicore is here to stay
• The onus is on the SW community to show performance gains
• Unfortunately, lots of existing code is not written to use the benefits of multicore
• Many programs are also not amenable to parallel processing
• Options like FP, refactoring and hybrids are being tried out
• There is a long way to go before we can tap the performance gains
Q&A