introduction to multicore .ppt
TRANSCRIPT
Agenda
• Background
• Drivers for Multicore
• How is SW prepared?
• Challenges of multicore programming
• Functional programming
• Refactoring
• Hybrid approaches
• State of industry
• Summary
• Q&A
Background
• Moore’s Law (transistor density doubles every 18 months)
• But the gap between transistor count and performance is increasing
• Between 1993 and 1999, CPU speeds increased 10 times
• The first 1 GHz CPU came in 2000. We should have had a 10 GHz CPU by now. It is not there. Where is it?
• Intel’s 3.4 GHz CPU was introduced in 2004. Where is the 4 GHz processor?
The answer is that it is unlikely to ever come!
CPU Clock speed increase over years
Source: http://www.cs.utexas.edu/users/cart/publications/isca00.pdf
Gap is increasing!
Source: http://www.embedded.com/columns/technicalinsights/198701652?_requestid=1042869
Slowing signs
• Over the last 30 years, CPU designers achieved performance in three ways:
– Clock speed (new processes, materials etc.)
– Execution optimization (doing more per cycle: pipelining, branch prediction, multiple instructions in the same cycle etc.)
– Cache (putting memory closer to the CPU: 2 MB+ caches are now common)
• These techniques are running out of steam:
– Clock speed: heat, physical issues, leakage currents
– Less and less return from execution optimization, though cache size still has potential to go up
Source: embedded.com
Heat is on!
What is the semiconductor industry trying to do?
• Create simpler cores and put more of them in a single package
• It is easier for semiconductor vendors to do this than to increase clock speed
• Instead of a 10 GHz processor, have ten 1 GHz cores!
• All cores have L1 caches but access shared memory outside
• The first such processors were from Intel, and now there are many multicore processors
• Initially, multicore processors were for the server market
• Now they are in desktops and even in embedded products
Some multicore processors..
• Dual-core processors
– One general-purpose core and another specialized core
– Been in the market for some time
– Network processors (Intel IXP)
– Signal processors (TI OMAP processors)
• Intel Quad Core, Atom dual core
– Intel: Nehalem – Core i7 with 8 cores
– AMD: Montreal – 8 cores
• Cavium: 16 cores
• Tilera: 64 cores
• Predictions of hundreds and even thousands of cores in coming years
• Intel Larrabee rumoured to be 32 cores
Typical multicore processor
Courtesy: http://zone.ni.com/cms/images/devzone/tut/rwlesvfu47926.jpg
How is Software Community prepared?
• “There ain’t no such thing as a free lunch” – R. A. Heinlein
• Software community has enjoyed regular performance gains for decades without doing anything special
• Now the party is coming to an end and it is a rude shock
• There is expectation that SW should take over from HW in driving next generation performance improvements
• So what is to be done?
What is to be done?
Need to move to parallel programming model
Challenges of parallel programming
• Concurrency
– Multiple threads of execution access the same pieces of data
– Need to synchronize the access
– Possibility of race conditions and deadlocks
Challenges of parallel programming – Race conditions
Trivial C statement:
b = b + 1;
Assume that there are two threads running this line of code, where b is a variable shared by the two threads, and b started with the value 5.
Possible order of execution
(thread1) load b into some register in thread 1.
(thread2) load b into some register in thread 2.
(thread1) add 1 to thread 1's register, computing 6.
(thread2) add 1 to thread 2's register, computing 6.
(thread1) store the register value (6) to b.
(thread2) store the register value (6) to b.
• We started with 5, then two threads each added one, but the final result is 6 -- not the expected 7. The problem is that the two threads interfered with each other, causing a wrong final answer.
• Threads do not execute atomically, performing single operations all at once. Another thread may interrupt it between essentially any two instructions and manipulate some shared resource.
Challenges of parallel programming - Deadlocks

void *function1()
{
    pthread_mutex_lock(&lock1);   /* Execution step 1 */
    pthread_mutex_lock(&lock2);   /* Execution step 3 - DEADLOCK!!! */
    ...
    pthread_mutex_unlock(&lock2);
    pthread_mutex_unlock(&lock1);
    ...
}

void *function2()
{
    pthread_mutex_lock(&lock2);   /* Execution step 2 */
    pthread_mutex_lock(&lock1);
    ...
    pthread_mutex_unlock(&lock1);
    pthread_mutex_unlock(&lock2);
}

main()
{
    pthread_create(&thread1, NULL, function1, NULL);
    pthread_create(&thread2, NULL, function2, NULL);
}
Challenges of parallel programming
• Challenge of visualization
– Difficult to visualize beyond 4-5 threads, each doing different activities
– The human mind is not used to thinking/processing data in parallel
– Complexity of the interplay of threads increases exponentially beyond a few threads
Challenges of parallel programming
• How can one be confident that there are no bugs?
– Race conditions depend on timing
– Changes in the execution times of one thread or another can change the access order to variables, triggering a previously unnoticed race
– It is entirely possible for code to pass all rounds of testing and work fine for years before a set of conditions like memory speed or process load brings a new bug to life
– Lastly, it may not be possible to reproduce the bug at all times (it may occur on one server but not on another under the same environment)
• This leaves an uneasy feeling about the code
• Visualization tools are coming to market, but they are inadequate
What are the options?
Courtesy: http://bigyellowtaxi.files.wordpress.com/2008/06/crossroads.jpg
Option #1: Functional Programming
Option #1: Functional Programming
• The main issue of concurrent programming is parallel access to shared data
• All languages suffer from the same issue
• So can we avoid the whole thing?
• Functional programming languages take a radically different approach to the problem
• Roots in theoretical computer science: the lambda calculus, developed by Alonzo Church at Princeton University in the 1930s, alongside Turing, Gödel and von Neumann
Functional Programming - background
• Based on a mathematical system called a “formal system”
• A set of “axioms” and a “set of rules on how to operate on them”
• Stack them up into complex rules
• Church was interested in “computability” issues and not programming issues (computers did not exist then)
• LISP was the first implementation of Alonzo Church’s lambda calculus on computers
So what is functional programming?
• Functions are used for everything, even the simplest of computations
• Variables are just aliases – they cannot even change values (immutable)
• State is only on the stack
• No concept of global variables
• No “side effects”
• One cannot modify state inside a function or outside of it
• If you call the same function with the same parameters, it returns identical results every time
• Since there are no global variables – indeed no mutable variables at all – there are no race conditions and no deadlocks!
Referential Transparency
Source: Functions + Messages + Concurrency = Erlang – Presentation by Joe Armstrong
Advantages of FP
• Unit testing
– Check all possible inputs to a function
– Check by passing parameters that represent edge cases
• Debugging
– In imperative programming, there is no guarantee that you can reproduce the bug in the same sequence
– The main reason is that program behaviour depends on object states, global variables and external state
– In FP, a bug is a bug and it surfaces every time
• Concurrency
– No global variables, no race conditions, no locks
– Easy to refactor code:
String s1 = operation1();
String s2 = operation2();
String s3 = concat(s1, s2);
– The first and second lines can be run on 2 different cores
– Tools can easily refactor code
Popular environments
• Erlang
– Invented by Ericsson
– Concurrent programming environment
– Message passing
– Used in highly scalable, mission-critical switch products with hot upgrade support
• Haskell
– Haskell is a pure functional language
• LISP
– The original parent
Example code fragment
• Erlang
Courtesy: TBD
So, why are programmers not jumping?
• Complete change of mindset needed!
• Learning curve is steep
• “University and research” image
• Needs lots of resources (recursion, stack-based state)
• CIOs are not comfortable
• Lack of trained staff
• Only strong believers are adopting
Option #2: Refactoring
Refactoring
• There is just too much existing software
• No one develops software from scratch; it is built on existing code bases/products
• Let us look at tools that help refactor existing code for performance gains rather than asking people to code in new languages
• May not be able to utilize all cores to the maximum, but it is an evolutionary path
Refactoring tools for non-functional programmers
• Many tools are coming to market to help refactoring code to multiple core processors
• Key ones:
– RapidMind
– OpenMP
– CUDA
– Ct
– Cilk
– Pervasive Software
– And many others
RapidMind
• Startup based in Waterloo, Canada
• Helps parallelize code across multiple cores
• Supports the C++ language
• Can be targeted to different processor families: AMD, x86, Cell, nVidia etc.
• Extensive changes needed in the code
• Plus points:
– Multiple platforms, no lock-in to an architecture
– 1000’s of threads
– Works with existing compilers
• Minus points:
– Lock-in to the RapidMind platform!
• Costs:
– $1500 for the dev platform
RapidMind
OpenMP
• Standard for multiprocessor shared-memory programming
• Defined for C/C++/Fortran
• OpenMP = compiler directives + runtime library + environment variables
• Code is instrumented with directives
• Threads are created and destroyed automatically
OpenMP example code
int main(int argc, char **argv)
{ const int N = 100000;
int i, a[N];
#pragma omp parallel for
for (i = 0; i < N; i++)
a[i] = 2 * i;
return 0;
}
OpenMP
• Plus points:
– Easy to adopt
– Incremental parallelization
– Hiding of thread semantics
– Scalable to many threads
– Coarse/fine-grained parallelization
• Minus points:
– Popular only in scientific communities
– Needs a separate tool chain
• Products:
– Sun Studio 12
Source: Sun Studio presentation
CUDA
• From nVidia
• Defines extensions to the code to parallelize it
• Concept of a “thread block”
• Each thread block has one shared memory to work on
• Scalable to 1000’s of threads
• Defined for C/C++ languages
• Plus points:
– Scales very well for scientific and visualization communities (financial markets, computational mechanics, computational biology etc.)
– Well integrated with nVidia processors – no middleware
• Minus points:
– Vendor lock-in to nVidia/CUDA (processor/tool chain)
Is there a middle way?
Option #3: Hybrid environments
• How to experiment with FP without throwing away all the investments made so far?
• Scala:
– Hybrid language
– Combines Java-style OO with the FP paradigm
– Compiles to bytecode
– Can use Java libraries in Scala code; can inherit from a Java class
– Uses Eclipse tool chains and plugins
– Really married to Java
• F#:
– Experiment with FP safely from the known .NET environment
– Use FP where needed, use .NET elsewhere
Status of the industry
OS readiness
• Windows XP can support 4 logical cores, 2 physical cores; supports SMP model
• Windows Vista, Windows 7 better tuned for multicore
• Finetuning of global locks, mutexes, better graphics performance
• Linux 2.6+ has SMP support and hence multicore support
“Embarrassingly” Parallel Programs
• An embarrassingly parallel workload (or embarrassingly parallel problem) is one for which little or no effort is required to separate the problem into a number of parallel tasks.
• Examples:
– The Mandelbrot set and other fractal calculations, where each point can be calculated independently.
– Rendering of graphics; each pixel may be rendered independently. In computer animation, each frame may be rendered independently
– Large-scale face recognition that involves comparing thousands of input faces with a similarly large number of faces.
– Computer simulations comparing many independent scenarios, such as climate models.
– Genetic algorithms and other evolutionary computation
– Weather prediction models
– Event simulation and reconstruction in particle physics.
• Compilation and build systems: each compilation can be run in parallel for a performance gain (example: make -j, or dmake in Sun Studio)
• CUDA and OpenMP are very popular and widely used in the scientific and visualization markets
Source: Wikipedia
Server programming
• Servers run OSes like Linux 2.6 that support pthreads and multithreading
• The server daemon spawns a pthread for each request
• The OS schedules them on different cores
• Decent scalability is seen for server-side programs on multicore systems
• Examples: web services
• Challenges are IO performance, cache size, performance tuning of the OS, web server etc.
Source: http://www.edn.com/index.asp?layout=articlePrint&articleID=CA6646279
Server Virtualization
• Virtualization allows different OSes to run on different cores
• One core can be running Linux and another Windows
• This is quite popular and allows efficient use of cores
Packet processing applications
• Packet processing lends itself to parallel processing
• Each IP packet is independent
• IP forwarding is largely stateless processing
• Multiple cores can take packets from an incoming queue, do the processing and put them on an output queue
• Similar applications in security
• Solutions from Tilera, Cavium
Packet processing applications
Source: http://www.cisco.com/en/US/prod/collateral/routers/
Highly reliable concurrent systems
• Functional programs are getting used here
• Major users:
– Ericsson (GPRS system)
– Jabber
– Twitter
– T-Mobile SMS system
– Nortel VPN gateway product
• Some people are trying out hybrids in their systems
– Scala: Twitter
Desktop applications
• Common programs like word processing, spreadsheets are not amenable to parallel processing or threading
• Unlikely that desktop applications will be rewritten with multiple threads
• So don’t expect much performance gains by running this PPT on a quadcore processor
• But simple things are possible:
– While you download a document, a virus scan can run on another core
– Provide processor affinity in Windows
Embedded Applications
• Embedded applications without an OS are typically hand-crafted
• Limited scope for using multiple cores
• Scope exists in running specialized code on one core and the rest of the code on another
Courtesy: Wind River Inc
Challenges of performance improvement
• Lots of existing code is not written in a multithreaded way, nor is it amenable to it
• There are other factors, like IO and memory issues, beyond just the CPU. The best results are seen for CPU-intensive jobs, but may not be for IO-bound jobs
• There could be bottlenecks in inter-core communication
• Typical applications like CRUD may find it difficult to scale
Sweet-spots of multicore software
• Virtualization
• Server programming
• Packet processing
• Scientific computing, visualization
Summary
• Multicore is here to stay
• The onus is on the SW community to show performance gains
• Unfortunately, lots of existing code is not written to use the benefits of multicore
• Many programs are also not amenable to parallel processing
• Options like FP, refactoring and hybrids are being tried out
• There is a long way to go before we can tap the performance gains
Q&A