![Page 1: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/1.jpg)
Multi-core Programming
Threading Concepts
![Page 2: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/2.jpg)
Basics of VTune™ Performance Analyzer
2
Topics
• A Generic Development Cycle• Case Study: Prime Number Generation• Common Performance Issues
![Page 3: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/3.jpg)
Threaded Programming Methodology 3
What is Parallelism?
• Two or more processes or threads execute at the same time
• Parallelism for threading architectures– Multiple processes
• Communication through Inter-Process Communication (IPC)
– Single process, multiple threads• Communication through shared memory
![Page 4: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/4.jpg)
Threaded Programming Methodology 4
n = number of processors
Tparallel = {(1-P) + P/n} Tserial
Speedup = Tserial / Tparallel
Amdahl’s Law
Describes the upper bound of parallel execution speedup
Serial code limits speedup
(1-P
)P
Tse
rial
(1-P
)
P/2
0.5 + 0.25
1.0/0.75 = 1.33
n = 2n = ∞
P/∞
…
0.5 + 0.0
1.0/0.5 = 2.0
P - proportion of a computation where parallelization may occur..
![Page 5: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/5.jpg)
5Threaded Programming Methodology
Processes and Threads
• Modern operating systems load programs as processes
• Resource holder• Execution
• A process starts executing at its entry point as a thread
• Threads can create other threads within the process– Each thread gets its own stack
• All threads within a process share code & data segments
Code segment
Data segment
thread
main()
…thread thread
Stack Stack
Stack
![Page 6: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/6.jpg)
Threaded Programming Methodology 6
Threads – Benefits & Risks
• Benefits– Increased performance and better resource
utilization• Even on single processor systems - for hiding latency
and increasing throughput
– IPC through shared memory is more efficient• Risks
– Increases complexity of the application– Difficult to debug (data races, deadlocks, etc.)
![Page 7: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/7.jpg)
Threaded Programming Methodology 7
Commonly Encountered Questions with Threading Applications
• Where to thread?• How long would it take to thread?• How much re-design/effort is required?• Is it worth threading a selected region?• What should the expected speedup be?• Will the performance meet expectations?• Will it scale as more threads/data are added?• Which threading model to use?
![Page 8: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/8.jpg)
Threaded Programming Methodology 8
Prime Number Generationbool TestForPrime(int val){ // let’s start checking from 3 int limit, factor = 3; limit = (long)(sqrtf((float)val)+0.5f); while( (factor <= limit) && (val % factor) ) factor ++;
return (factor > limit);}
void FindPrimes(int start, int end){ int range = end - start + 1; for( int i = start; i <= end; i += 2 ) { if( TestForPrime(i) ) globalPrimes[gPrimesFound++] = i; ShowProgress(i, range); }}
i factor
61 3 5 7 63 365 3 567 3 5 7 69 3 71 3 5 7 73 3 5 7 9 75 3 577 3 5 7 79 3 5 7 9
![Page 9: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/9.jpg)
Threaded Programming Methodology 9
Development Methodology
• Analysis– Find computationally intense code
• Design (Introduce Threads)– Determine how to implement threading solution
• Debug for correctness– Detect any problems resulting from using threads
• Tune for performance– Achieve best parallel performance
![Page 10: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/10.jpg)
Threaded Programming Methodology 10
Development Cycle Analysis
– VTune™ Performance Analyzer
Design (Introduce Threads)– Intel® Performance libraries: IPP and MKL– OpenMP* (Intel® Compiler)– Explicit threading (Win32*, Pthreads*)
Debug for correctness– Intel® Thread Checker– Intel Debugger
Tune for performance– Intel® Thread Profiler– VTune™ Performance Analyzer
![Page 11: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/11.jpg)
Threaded Programming Methodology 11
Let’s use the project PrimeSingle for analysis • PrimeSingle <start> <end>
Usage: ./PrimeSingle 1 1000000
Analysis - Sampling
• Use VTune Sampling to find hotspots in application
bool TestForPrime(int val){ // let’s start checking from 3 int limit, factor = 3; limit = (long)(sqrtf((float)val)+0.5f); while( (factor <= limit) && (val % factor)) factor ++;
return (factor > limit);}
void FindPrimes(int start, int end){ // start is always odd int range = end - start + 1; for( int i = start; i <= end; i+= 2 ){ if( TestForPrime(i) ) globalPrimes[gPrimesFound++] = i; ShowProgress(i, range); }}
Identifies the time consuming regions
![Page 12: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/12.jpg)
Threaded Programming Methodology 12
Analysis - Call Graph
This is the level in the call tree where we need to thread
Used to find proper level in the call-tree to thread
![Page 13: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/13.jpg)
Threaded Programming Methodology 13
Analysis
• Where to thread?– FindPrimes()
• Is it worth threading a selected region?
– Appears to have minimal dependencies– Appears to be data-parallel– Consumes over 95% of the run time
Baseline measurement
![Page 14: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/14.jpg)
Threaded Programming Methodology 14
Foster’s Design Methodology
• From Designing and Building Parallel Programs by Ian Foster
• Four Steps:– Partitioning
• Dividing computation and data– Communication
• Sharing data between computations– Agglomeration
• Grouping tasks to improve performance– Mapping
• Assigning tasks to processors/threads
![Page 15: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/15.jpg)
15Threaded Programming Methodology
Designing Threaded Programs
• Partition
– Divide problem into tasks
• Communicate
– Determine amount and pattern of communication
• Agglomerate
– Combine tasks• Map
– Assign agglomerated tasks to created threads
TheProblem
Initial tasks
Communication
Combined Tasks
Final Program
![Page 16: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/16.jpg)
Threaded Programming Methodology 16
Parallel Programming Models
• Functional Decomposition– Task parallelism– Divide the computation, then associate the data– Independent tasks of the same problem
• Data Decomposition– Same operation performed on different data– Divide data into pieces, then associate
computation
![Page 17: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/17.jpg)
Threaded Programming Methodology 17
Decomposition Methods• Functional Decomposition
– Focusing on computations can reveal structure in a problem
Grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC
Domain Decomposition
• Focus on largest or most frequently accessed data structure
• Data Parallelism• Same operation applied to all data
Atmosphere Model
Ocean Model
Land Surface Model
Hydrology Model
![Page 18: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/18.jpg)
Threaded Programming Methodology 18
Pipelined Decomposition
• Computation done in independent stages• Functional decomposition
– Threads are assigned stage to compute– Automobile assembly line
• Data decomposition– Thread processes all stages of single instance– One worker builds an entire car
![Page 19: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/19.jpg)
Threaded Programming Methodology 19
LAME Pipeline Strategy
Frame NFrame N + 1
Time
Other N
Prelude N
Acoustics N
Encoding N
T 2
T 1
Acoustics N+1
Prelude N+1
Other N+1
Encoding N+1
Acoustics N+2
Prelude N+2
T 3
T 4
Prelude N+3
Hierarchical Barrier
OtherPrelude Acoustics Encoding
FrameFetch next frameFrame characterizationSet encode parameters
Psycho AnalysisFFT long/shortFilter assemblage
Apply filteringNoise ShapingQuantize & Count bits
Add frame headerCheck correctnessWrite to disk
![Page 20: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/20.jpg)
Threaded Programming Methodology 20
Design
• What is the expected benefit?
• How do you achieve this with the least effort?
• How long would it take to thread?• How much re-design/effort is required?
Rapid prototyping with OpenMPRapid prototyping with OpenMP
Speedup(2P) = 100/(96/2+4) = ~1.92X
![Page 21: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/21.jpg)
Threaded Programming Methodology 21
OpenMP
• Fork-join parallelism: – Master thread spawns a team of threads as
needed– Parallelism is added incrementally
• Sequential program evolves into a parallel program
Parallel Regions
Master Thread
![Page 22: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/22.jpg)
Threaded Programming Methodology 22
Design#pragma omp parallel for for( int i = start; i <= end; i+= 2 ){ if( TestForPrime(i) ) globalPrimes[gPrimesFound++] = i; ShowProgress(i, range); }
OpenMP
Create threads here for this parallel region
Divide iterations of the for loop
![Page 23: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/23.jpg)
Threaded Programming Methodology 23
Design
• What is the expected benefit?• How do you achieve this with the least effort?
• How long would it take to thread?• How much re-design/effort is required?• Is this the best speedup possible?
Speedup of 1.40X (less than 1.92X)Speedup of 1.40X (less than 1.92X)
![Page 24: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/24.jpg)
Threaded Programming Methodology 24
Debugging for Correctness
• Is this threaded implementation right?• No! The answers are different each time …
![Page 25: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/25.jpg)
Threaded Programming Methodology 25
Debugging for Correctness
Intel® Thread Checker pinpoints notorious threading bugs like data races, stalls and deadlocks
Intel® Thread Checker
VTune™ Performance Analyzer
+DLLs (Instrumented)
Binary Instrumentation
Primes.exePrimes.exe
(Instrumented)
Runtime Data
Collector
threadchecker.thr(result file)
![Page 26: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/26.jpg)
Threaded Programming Methodology 26
Thread Checker
![Page 27: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/27.jpg)
Threaded Programming Methodology 27
Debugging for Correctness
• How much re-design/effort is required?
• How long would it take to thread?
Thread Checker reported only 2 dependencies, so effort required
should be low
![Page 28: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/28.jpg)
Threaded Programming Methodology 28
Debugging for Correctness#pragma omp parallel for
for( int i = start; i <= end; i+= 2 ){
if( TestForPrime(i) )
#pragma omp critical
globalPrimes[gPrimesFound++] = i;
ShowProgress(i, range);
}
#pragma omp critical
{
gProgress++;
percentDone = (int)(gProgress/range *200.0f+0.5f)
}
Will create a critical section for this reference
Will create a critical section for both these references
![Page 29: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/29.jpg)
Threaded Programming Methodology 29
Correctness
• Correct answer, but performance has slipped to ~1.33X
• Is this the best we can expect from this algorithm?
No! From Amdahl’s Law, we expect speedup close to 1.9X
![Page 30: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/30.jpg)
Threaded Programming Methodology 30
Common Performance Issues
• Parallel Overhead– Due to thread creation, scheduling …
• Synchronization– Excessive use of global data, contention for the
same synchronization object• Load Imbalance
– Improper distribution of parallel work• Granularity
– No sufficient parallel work
![Page 31: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/31.jpg)
Threaded Programming Methodology 31
Tuning for Performance
Thread Profiler pinpoints performance bottlenecks in threaded applications
Thread Profiler
VTune™ Performance Analyzer
+DLL’s (Instrumented)
Binary Instrumentation
Primes.c
Primes.exe (Instrumented)
Runtime Data
Collector
Bistro.tp/guide.gvs(result file)
CompilerSource
Instrumentation
Primes.exe
/Qopenmp_profile
![Page 32: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/32.jpg)
32Threaded Programming
Methodology
Thread Profiler for OpenMP
![Page 33: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/33.jpg)
Threaded Programming Methodology 33
Thread Profiler for OpenMPSpeedup Graph
Estimates threading speedup and potential speedup
– Based on Amdahl’s Law computation
Gives upper and lower bound estimates
![Page 34: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/34.jpg)
Threaded Programming Methodology 34
Thread Profiler for OpenMPserial
serial
parallel
![Page 35: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/35.jpg)
Threaded Programming Methodology 35
Thread Profiler for OpenMPThread 0
Thread 1
Thread 2
Thread 3
![Page 36: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/36.jpg)
36 Threaded Programming Methodology
Thread Profiler (for Explicit Threads)
![Page 37: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/37.jpg)
37 Threaded Programming Methodology
Thread Profiler (for Explicit Threads)
Why so many transitions?
![Page 38: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/38.jpg)
Threaded Programming Methodology 38
Performance
• This implementation has implicit synchronization calls• This limits scaling performance due to the resulting context
switches
Back to the design stage
![Page 39: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/39.jpg)
Threaded Programming Methodology 39
Is that much contention expected?
The algorithm has many more updates than the 10 needed for showing progress
Performance
void ShowProgress( int val, int range ){ int percentDone; gProgress++; percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
if( percentDone % 10 == 0 ) printf("\b\b\b\b%3d%%", percentDone);}
void ShowProgress( int val, int range ){ int percentDone; static int lastPercentDone = 0;#pragma omp critical { gProgress++; percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f); } if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){ printf("\b\b\b\b%3d%%", percentDone); lastPercentDone++; }}
This change should fix the contention issue
![Page 40: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/40.jpg)
Threaded Programming Methodology 40
Design
• Goals– Eliminate the contention due to implicit
synchronization
Speedup is 2.32X ! Is that right?Speedup is 2.32X ! Is that right?
![Page 41: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/41.jpg)
Threaded Programming Methodology 41
Performance
• Our original baseline measurement had the “flawed” progress update algorithm
• Is this the best we can expect from this algorithm?
Speedup is actually 1.40X (<<1.9X)! Speedup is actually 1.40X (<<1.9X)!
![Page 42: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/42.jpg)
42 Threaded Programming Methodology
Performance Re-visited
Still have 62% of execution time in locks
and synchronization
![Page 43: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/43.jpg)
Threaded Programming Methodology 43
Performance Re-visited
Let’s look at the OpenMP locks…void FindPrimes(int start, int end){ // start is always odd int range = end - start + 1;
#pragma omp parallel for for( int i = start; i <= end; i += 2 ) { if( TestForPrime(i) )#pragma omp critical globalPrimes[gPrimesFound++] = i; ShowProgress(i, range); }}
Lock is in a loop
void FindPrimes(int start, int end){ // start is always odd int range = end - start + 1;
#pragma omp parallel for for( int i = start; i <= end; i += 2 ) { if( TestForPrime(i) ) globalPrimes[InterlockedIncrement(&gPrimesFound)] = i; ShowProgress(i, range); }}
![Page 44: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/44.jpg)
Threaded Programming Methodology 44
Performance Re-visitedLet’s look at the second lock
void ShowProgress( int val, int range ){ int percentDone; static int lastPercentDone = 0;
#pragma omp critical { gProgress++; percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f); } if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){ printf("\b\b\b\b%3d%%", percentDone); lastPercentDone++; }}
This lock is also being called within a loop
void ShowProgress( int val, int range ){ long percentDone, localProgress; static int lastPercentDone = 0;
localProgress = InterlockedIncrement(&gProgress); percentDone = (int)((float)localProgress/(float)range*200.0f+0.5f); if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){ printf("\b\b\b\b%3d%%", percentDone); lastPercentDone++; }}
![Page 45: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/45.jpg)
45
Thread Profiler for OpenMP
Thread 0
342 factors to test 116747
Thread 1
612 factors to test 373553
Thread 2
789 factors to test 623759
Thread 3
934 factors to test 873913
500000
250000
750000
1000000
![Page 46: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/46.jpg)
Threaded Programming Methodology 46
Fixing the Load ImbalanceDistribute the work more evenly
void FindPrimes(int start, int end){ // start is always odd int range = end - start + 1;
#pragma omp parallel for schedule(static, 8) for( int i = start; i <= end; i += 2 ) { if( TestForPrime(i) ) globalPrimes[InterlockedIncrement(&gPrimesFound)] = i; ShowProgress(i, range); }}
Speedup achieved is 1.68X
![Page 47: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/47.jpg)
47 Threaded Programming Methodology
Final Thread Profiler Run
Speedup achieved is 1.80XSpeedup achieved is 1.80X
![Page 48: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/48.jpg)
48 Threaded Programming Methodology
Comparative Analysis
Threading applications require multiple iterations of going through
the software development cycle
![Page 49: Multi-core Programming Threading Concepts. 2 Basics of VTune™ Performance Analyzer Topics A Generic Development Cycle Case Study: Prime Number Generation](https://reader034.vdocument.in/reader034/viewer/2022051416/56649e445503460f94b38293/html5/thumbnails/49.jpg)
Threaded Programming Methodology 49
Threading MethodologyWhat’s Been Covered
• Four step development cycle for writing threaded code from serial and the Intel® tools that support each step– Analysis– Design (Introduce Threads)– Debug for correctness– Tune for performance
• Threading applications require multiple iterations of designing, debugging and performance tuning steps
• Use tools to improve productivity