Scalability of Threaded Applications
Intel Software College
Copyright © 2006, Intel Corporation. All rights reserved.
Scalability of Multithreaded Applications
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Objectives
After completion of this module you will understand
• The need for designing multithreaded applications for scalability to take advantage of an increasing number of available cores
• What tools are available to measure and predict scalability
• How several different factors can inhibit scaling of applications on increased number of cores
Agenda
Why focus on scalability?
• Measuring and estimating scalability
• Where would you start?
Tools for scalability analysis
Factors inhibiting scalability
• Serially Dominant Workloads
• Granularity and Parallel Overhead
• Load Imbalance
• Synchronization Issue
• Memory Related Issues
• I/O
What is scalability?
Handle growing amounts of work in a graceful manner
What resources might be increased?
• Cores and threads
• Memory capacity
• Data, problem size
  • Not a resource, but likely to see increases as computation power increases
“What is it that we really mean by scalability? A service is said to be scalable if when we increase the resources in a system, it results in increased performance in a manner proportional to resources added.”
-- Werner Vogels CTO - Amazon.com
Evolution
Dual core
• Symmetric multithreading
Multi-core array
• CMP with ~10 cores
Many-core array
• CMP with 10s-100s low power cores
• Scalar cores
• Capable of TFLOPS+
• Full System-on-Chip
• Servers, workstations, embedded…
Large, scalar cores for high single-thread performance
Scalar plus many core for highly threaded workloads
Evolutionary configurable architecture
Amdahl’s Law
Speedup is limited by the amount of serial code
$$\Psi(p) \le \frac{1}{s + (1 - s)/p}$$
where 0 ≤ s ≤ 1 is the fraction of serial operations
[Figure: Maximum Theoretical Speedup from Amdahl's Law. Speedup (0 to 8) vs. number of cores (0 to 8), with one curve each for %serial = 0, 10, 20, 30, 40, 50]
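The bound above is easy to evaluate directly. The following is a minimal C sketch, not part of the original slides; the function names `amdahl_speedup` and `amdahl_limit` are chosen here for illustration.

```c
#include <assert.h>

/* Upper bound on speedup from Amdahl's Law.
 * s: fraction of serial operations, 0 <= s <= 1
 * p: number of cores, p >= 1 */
double amdahl_speedup(double s, int p)
{
    assert(s >= 0.0 && s <= 1.0);
    assert(p >= 1);
    return 1.0 / (s + (1.0 - s) / (double)p);
}

/* Limit of the bound as the number of cores grows without bound.
 * The (1 - s) / p term vanishes, leaving 1 / s (requires s > 0). */
double amdahl_limit(double s)
{
    assert(s > 0.0 && s <= 1.0);
    return 1.0 / s;
}
```

Calling `amdahl_speedup(0.5, 8)` gives roughly 1.78, which matches the lowest curve of the chart flattening out well below 2 even on 8 cores.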
Question 1
If an application is only 25% serial, what is the maximum speedup you can ever achieve, assuming an infinite number of processors? (Ignore parallel overhead.)
A: 1.25
B: 2.0
C: 4.0
D: No speedup
Scaled Speedup (Gustafson-Barsis’s Law)
Amdahl’s Law does not take into account
• overhead costs
• increases in problem size able to be computed with more cores
Increasing the number of cores enables…
• Increasing the problem size ―> Decreasing the sequential fraction of computation ―> Increasing Speedup
Given p cores and a parallel code solving a problem of size n, let s be the fraction of serial execution in the code.
$$\Psi \le p + (1 - p)\,s$$
Scaled Speedup estimates how much faster parallel execution is over the same computation on a single core
Assumes problem size increases linearly with number of cores
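As a sketch of how the scaled speedup bound can be evaluated, the C function below computes p + (1 − p)·s, where s is the serial fraction measured on the parallel run. The name `scaled_speedup` is an illustrative choice, not something defined in the course material.

```c
#include <assert.h>

/* Scaled speedup (Gustafson-Barsis's Law).
 * s: fraction of the parallel run's time spent in serial execution
 * p: number of cores used for that run */
double scaled_speedup(double s, int p)
{
    assert(s >= 0.0 && s <= 1.0);
    assert(p >= 1);
    return (double)p + (1.0 - (double)p) * s;
}
```

With s = 0 the result is p (perfect scaling), and with s = 1 it is 1 (no benefit from extra cores), which are the two sanity checks worth keeping in mind when using the formula.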
Using Scaled Speedup
If an application runs on 64 cores in 220 seconds with 11 seconds devoted to serial execution, what is the scaled speedup?
Ψ = 64 + (1 – 64) (11/220)
= 64 – 63 * 0.05
= 60.85
Compare Amdahl's Law with the same serial fraction:
5% serial on 64 cores => 15.42
Assuming fixed serial time, what is the single core execution time?
(220 – 11) * 64 + 11 = 13387 seconds
• Amdahl's Law then yields a speedup of 60.85 on 64 cores with 0.0822% serial time (11 / 13387)
Would serial time be fixed? Would the problem fit on one core?
Question 2
What is the maximum fraction of serial execution time for a parallel application to achieve a scaled speedup of 7.5 on an eight-core system?
7.5 = 8 + (1 – 8) s
s = 0.5 / 7
= 0.071 => 7.1%
Using Amdahl's Law, serial percentage must be ≤ 0.952%
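The Amdahl's Law figure quoted for this question can be checked by solving the Amdahl bound for s with a speedup of 7.5 on p = 8 cores. The derivation below is a worked check added for clarity, using only the formula from the Amdahl's Law slide.

```latex
\begin{aligned}
7.5 &= \frac{1}{s + (1 - s)/8}
  \;\Longrightarrow\; s + \frac{1 - s}{8} = \frac{1}{7.5} \\
\frac{7s + 1}{8} &= \frac{1}{7.5}
  \;\Longrightarrow\; 7s = \frac{8}{7.5} - 1 \approx 0.0667 \\
s &\approx \frac{0.0667}{7} \approx 0.00952 \;\;(0.952\%)
\end{aligned}
```

The two laws give very different answers (7.1% vs. 0.952%) because the scaled speedup assumes the problem size grows with the number of cores, while Amdahl's Law holds the problem size fixed.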
Estimating potential scalability of serial applications
• Need to estimate serial vs. parallelizable execution times
  • Speedup estimate based on Amdahl’s law
• VTune sampling
  • Identify potential areas for parallelization
    • Example: loops
  • Use clock ticks to estimate parallel time
  • Serial time = Total run time – parallelizable run time
  • Compute scalability estimate
• VTune call graph
  • See potential call trees for parallelization
  • Use “Total time” (self + descendants) for parallelizable run time
Estimating scalability upper bound for parallel applications
Need to estimate serial vs. parallel execution times
• Speedup estimate based on Amdahl’s law
• Serial percentage for Gustafson-Barsis’s Law
Thread Profiler
• Use critical path information in Profile View
• Use information in Concurrency Level view
• Experimental technique based on CPU utilization of all processors/cores
Finding Serial and Parallel Time: Thread Profiler
Thread Profiler – for parallel applications
• Use Concurrency Level View
• Total Serial (CL:0 and CL:1) and Parallel (CL:2 and up) times
  • Under Utilized times counted as parallel time
Finding Serial and Parallel Time: CPU utilization
Experimental approaches for parallel applications
• Monitor utilization of all CPUs over time
• Parallel region is where all CPUs are active
• Perfmon* (Windows) or mpstat (Linux)
• Example: 76% serial, 24% parallel on DP
[Figure: Perfmon, Dual Xeon™ processor/2.8 GHz/Windows* XP. % CPU utilization (0 to 120) of CPU 0 and CPU 1 vs. workload run time (1 to 43 sec), with regions marked Serial, Serial, Parallel]
Perfmon* or mpstat does not capture sub-second behavior
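The experimental approach can be sketched as a small classifier over logged utilization samples. The C function below is an illustration under stated assumptions: samples are stored row by row (one row per sampling interval, one column per CPU), and `busy_threshold` is a cutoff you pick yourself. Neither the name `parallel_fraction` nor the threshold comes from Perfmon or mpstat.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of sampling intervals in which every CPU was busy.
 * util:           n_samples * n_cpus utilization values (percent),
 *                 laid out as util[sample * n_cpus + cpu]
 * busy_threshold: utilization (percent) at or above which a CPU
 *                 counts as active; a hypothetical tuning knob */
double parallel_fraction(const double *util, size_t n_samples,
                         size_t n_cpus, double busy_threshold)
{
    size_t parallel = 0;

    assert(util != NULL && n_samples > 0 && n_cpus > 0);

    for (size_t t = 0; t < n_samples; ++t) {
        int all_busy = 1;
        for (size_t c = 0; c < n_cpus; ++c) {
            if (util[t * n_cpus + c] < busy_threshold) {
                all_busy = 0;   /* at least one CPU idle: not parallel */
                break;
            }
        }
        if (all_busy)
            ++parallel;
    }
    return (double)parallel / (double)n_samples;
}
```

The serial estimate is one minus the returned value. Because the samples are typically one second apart, short parallel bursts inside an interval are invisible to this kind of estimate.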
How can speedup estimate help identify scalability issues?
• Different workloads can exercise different parts of application
• Estimates can point to workloads that need scalability analysis and improvement
• Compare measured vs. estimate
• Choose largest delta workloads for analysis
[Figure: 2P and 4P Speedup (IBM* X440 .NET RC1). Measured 2P Speedup and Measured 4P Speedup (0.0 to 4.0) for Workload 01 through Workload 13]
Workloads 5 & 12 show significant difference between estimate and actual; focus tuning here
Workload 11 is predicted to have low scaling
Quick Review: Measuring and Estimating Speedup
Estimate serial vs. parallel times in workloads
• Allows prediction of speedup upper bounds
Serial applications
• Estimate based on VTune Sampling or Callgraph runs
Parallel applications
• Use Thread Profiler
• Experimental techniques
  • Measuring CPU utilization over time for all processors
Approaching a serial application
1. Pick a workload
2. Establish a scalability target
   • Example: Must have at least 2.5x improvement from 1 core → 4 cores
3. Estimate amount of parallelization required
   • Dictated by Amdahl’s law
   • Example: 2.5x improvement from 1 core → 4 cores would require 80% of run time to be parallelized
   • Identify areas to parallelize
   • Cannot find areas to meet required amount of parallelization?
     • Reset scalability target and continue parallelization
4. Parallelize and measure speedup
5. Did you meet the scalability target?
   • If not, root cause and improve
Repeat for other workloads
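Step 3 amounts to inverting Amdahl's Law: given a target speedup S on p cores, solve S = 1 / ((1 − f) + f / p) for the parallel fraction f. The C sketch below does that; the function name `required_parallel_fraction` is made up for this example.

```c
#include <assert.h>

/* Parallel fraction of run time needed to reach target_speedup on
 * p cores, from Amdahl's Law: f = (1 - 1/S) / (1 - 1/p).
 * The target must lie between 1 and p; anything above p is
 * unreachable under this model. */
double required_parallel_fraction(double target_speedup, int p)
{
    assert(p > 1);
    assert(target_speedup >= 1.0 && target_speedup <= (double)p);
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / (double)p);
}
```

For the example in step 2, `required_parallel_fraction(2.5, 4)` evaluates to about 0.8, that is, 80% of the run time must be parallelized.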
Approaching a parallel application
1. Pick a workload
2. Estimate expected speedup
   • Amdahl, Gustafson-Barsis
3. Measure speedup
4. Did you meet the expected scaling?
   • If not, root cause and improve
Repeat for other workloads
Question 3a: What is the best design for scalability?
Audio processing application
• Left channel computation
• Right channel computation
Question 3b: What is the best design for scalability?
Video stream encoding
• Thread intra-frame?
• Thread groups of pictures?
Question 3c: What is the best design for scalability?
Room Assignment Problem (Simulated Annealing)
Goal: Find most compatible roommate assignments
Method:
• Roomers take interest survey
• Roommates initially chosen at random
• Two people are swapped at random
• Does new assignment increase common interests in roommates (reduce conflict)?
  • If yes, keep new assignment
  • If no, undo swap, except for a shrinking random chance of keeping the bad match
• Continue until solution stabilizes
Windows*: Perfmon*
Recommended first set of counters
• “Processor” performance object: %processor time, %privileged time (for each CPU)
• “System” performance object: Context Switches/sec, System Calls/sec
• “PhysicalDisk” performance object: Disk Read bytes/sec, Disk Write bytes/sec (for each disk)
• “Memory” performance object: Pages/sec
• “Network Interface” performance object: Bytes Total/sec (for each network card)
Windows command line tools available
• Logman
• Relog
• Typeperf
Windows*: Fixing Process to Core
Eliminate “noise” from context switches that abandon cache
Windows Task Manager
• “Process” tab → right-click on process to set affinity
Windows APIs
• SetProcessAffinityMask
• SetThreadAffinityMask
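A minimal sketch of pinning the calling thread with `SetThreadAffinityMask` follows. The helper names `single_core_mask` and `pin_current_thread` are hypothetical; the sketch assumes a 64-bit build and a system with at most 64 logical processors (processor groups are not handled). The Windows-specific part is guarded so the mask helper compiles elsewhere.

```c
#include <assert.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Affinity masks use one bit per logical processor:
 * bit n set means "may run on processor n". */
unsigned long long single_core_mask(unsigned core)
{
    assert(core < 64);
    return 1ULL << core;
}

#ifdef _WIN32
/* Pin the calling thread to one logical processor.
 * SetThreadAffinityMask returns the previous mask, or 0 on failure,
 * so a nonzero result means the call succeeded. */
int pin_current_thread(unsigned core)
{
    DWORD_PTR previous = SetThreadAffinityMask(
        GetCurrentThread(), (DWORD_PTR)single_core_mask(core));
    return previous != 0;
}
#endif
```

`SetProcessAffinityMask` takes a process handle and a mask in the same format when the whole process, rather than one thread, should be restricted.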
VTune* call graph
Helps isolate call trees for potential threading
VTune Counter Monitor
Tracks operating system counters over time
Some relevant counters:
• Processor time
• Available memory
• Context switches
Intel Thread Profiler
Identifies
• Serial vs. parallel run times
• Lock contention areas
• Parallel overhead
• Load imbalance
Loop graph viewer
Being considered for VTune 9.0
View program as loop hierarchies
• Loop self times and total times (in terms of instructions retired)
  • Similar to call graph self and total times
• Loop counts
Helps identify loops for coarse grain threading
• Loop hierarchies can span functions and files
PIN tool based prototype
• Currently Linux-only
Quick Review: Tools
No single tool may give you all the answers for scalability issues
• Thread Profiler comes close
Simple tools can provide insight into scalability issues
• Perfmon
  • Monitoring of CPU utilization of processors and application threads
Effects of serial domination
Serially dominated workloads do not scale well
• Amdahl’s Law
How to estimate serial time?
• VTune sampling, VTune Call graph
  • Serial applications
• Thread Profiler, experimental approaches
  • Parallel applications
Case Study 1
• 76% serial time on DP
  • 53% non-concurrent time on UP
• 4P theoretical scaling 1.6X
• 4P measured scaling 1.1X
[Figure: 2P and 4P Speedup (IBM* X440 .NET RC1). Measured 2P Speedup and Measured 4P Speedup (0.0 to 4.0) for Workload 01 through Workload 13]
[Figure: Perfmon, Dual Xeon™ processor/2.8 GHz/Windows* XP. % CPU utilization (0 to 120) of CPU 0 and CPU 1 vs. workload run time (1 to 43 sec), with regions marked Serial, Serial, Parallel]
Parallelize serial sections for better scalability
Question 4
Profile shows 80% of runtime spent calculating multidimensional FFT
Assume calling sequence of fft2d → fft1d → fftcc
Where would you thread for better scaling?
gprof profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self               self     total
 time   seconds   seconds     calls  ms/call  ms/call  name
64.71     93.43     93.43  23952910     0.00     0.00  fftcc_
11.47    110.00     16.57  23952910     0.00     0.00  fft1d_
11.41    126.47     16.47       100   164.70  1402.89  ssf_3dcs_
 4.94    133.59      7.13    151600     0.05     0.77  fft2d_
 2.94    137.84      4.25     37900     0.11     0.11  phaseshift3d_
 2.24    141.07      3.23       100    32.30    32.30  imaging_
A: fftcc
B: fft1d
C: fft2d
D: All of the above
Top-Down Design
The iterative “hotspot” tuning process:
Find and fix hot spot…
Find and fix hot spot…
Find and fix hot spot…
The top-down parallelization process:
Find the highest level of natural parallelism…
Top-down approach considers the parallelism of the whole application rather than individual hotspots.
The result is usually a more scalable, parallel application.
Granularity
Loosely defined as the ratio of computation to synchronization
Be sure there is enough work to merit parallel computation
Example: Working on the railroad. How many more workers can be added?
Activity 1: Workload-dependent Scaling
Lab shows a chosen number of spheres bouncing within an enclosed box
• Obey laws of physics for bouncing off walls and colliding with other spheres
User is able to control
• Number of spheres
• Amount of physics computation before rendering
• Whether to run with single thread or multithreaded
  • Load balance between threads will be explored in later lab
GUI frames per second displayed is performance metric
Parallel Overhead
Parallel overhead impacts scalability
Thread creation/destruction
• Amount of work vs. overhead
• Thread Pool (Windows*) may be a good solution
Synchronization
• Call overhead
• Transition in and out of kernel space
Possible indicators
• High kernel time
• Thread Profiler
  • Critical Path view showing large overhead times
Example: Threading Quicksort
Algorithm:
• Pick pivot value from elements
• Partition data around pivot
  • Less-than or equal to pivot
  • Greater than pivot
• Quicksort the two partitions
[Figure: array from index p to index r, partitioned into Less-than or equal elements, the Pivot at index q, and Greater-than elements]

void QuickSort(int p, int r) // Assume global array of data
{
    if (p < r) {
        int q = Partition(p, r);   // pivot ends up at index q
        QuickSort(p, q-1);         // sort less-than
        QuickSort(q+1, r);         // sort greater-than
    }
}
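The slide leaves `Partition` undefined. One common choice is the Lomuto scheme with the last element as pivot; the self-contained C sketch below uses it. This is an assumption for illustration, not necessarily the partition routine used in the course labs, and it passes the array explicitly instead of relying on a global.

```c
#include <assert.h>
#include <stddef.h>

static void swap_ints(int *x, int *y)
{
    int t = *x;
    *x = *y;
    *y = t;
}

/* Lomuto partition of a[p..r] around pivot a[r].
 * Returns q such that a[p..q-1] <= a[q] < a[q+1..r]. */
int partition(int *a, int p, int r)
{
    assert(a != NULL && p <= r);
    int pivot = a[r];
    int i = p - 1;                      /* end of the "<= pivot" region */
    for (int j = p; j < r; ++j) {
        if (a[j] <= pivot) {
            ++i;
            swap_ints(&a[i], &a[j]);
        }
    }
    swap_ints(&a[i + 1], &a[r]);        /* move pivot to final position */
    return i + 1;
}

/* Serial quicksort of a[p..r], same structure as the slide. */
void quicksort(int *a, int p, int r)
{
    if (p < r) {
        int q = partition(a, p, r);
        quicksort(a, p, q - 1);         /* sort less-than or equal */
        quicksort(a, q + 1, r);         /* sort greater-than */
    }
}
```

Sorting an array `a` of `n` elements is then `quicksort(a, 0, n - 1)`. The two recursive calls touch disjoint index ranges, which is what makes them candidates for running in separate threads.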
Example: Threading Quicksort
How about creating threads at each recursive call?
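#include <windows.h>

// For single parameter: CreateThread passes one pointer, so both indices are bundled in a struct
typedef struct {
    int s, t;
} qParams;

DWORD WINAPI QuickSort(LPVOID pr)
{
    int p = ((qParams *)pr)->s;
    int r = ((qParams *)pr)->t;
    qParams lo, hi;
    HANDLE hLOHI[2];

    if (p < r) {
        int q = Partition(p, r);
        lo.s = p;   lo.t = q-1;   // pivot at q is already in place
        hi.s = q+1; hi.t = r;
        hLOHI[0] = CreateThread(NULL, 0, QuickSort, (LPVOID) &lo, 0, NULL);
        hLOHI[1] = CreateThread(NULL, 0, QuickSort, (LPVOID) &hi, 0, NULL);
        WaitForMultipleObjects(2, hLOHI, TRUE, INFINITE);
        CloseHandle(hLOHI[0]);    // release thread handles once both halves finish
        CloseHandle(hLOHI[1]);
    }
    return 0;
}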
Quicksort Performance Results
Is There a More Scalable Quicksort Implementation?
Thread pool to control number of threads
Producer/Consumer relationship with index pair queue
• Dequeue pair struct from queue and partition (Consumer)
• Recursive calls become enqueue of index struct (Producer)

DWORD WINAPI QuickSort(LPVOID pArg)
{
    int p, r, q;
    while (1) {
        WaitForSingleObject(hSem, INFINITE);   // wait until a pair is available
        dequeue(&p, &r);
        if (p < r) {
            q = Partition(p, r);
            enqueue(p, q-1);                   // pivot at q is already in place
            enqueue(q+1, r);
            ReleaseSemaphore(hSem, 2, NULL);   // two new pairs are in the queue
        }
    }
    return 0;
}

Semaphore counts number of pairs in queue
Encapsulation of index pairs done in queue routines
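To show how the recursive calls turn into enqueue and dequeue operations, here is a single-threaded C sketch of the same loop driven by a queue of index pairs. Everything in it (`IndexPair`, `WorkQueue`, `queue_quicksort`, the Lomuto `partition`) is hypothetical scaffolding for illustration. A real thread pool would additionally need a thread-safe queue, the semaphore shown on the slide, and a way to tell the worker threads when sorting is complete.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int p, r; } IndexPair;

/* Fixed-capacity FIFO ring buffer of index pairs (not thread-safe). */
typedef struct {
    IndexPair *items;
    int head, count, capacity;
} WorkQueue;

static void enqueue(WorkQueue *q, int p, int r)
{
    assert(q->count < q->capacity);
    int tail = (q->head + q->count) % q->capacity;
    q->items[tail].p = p;
    q->items[tail].r = r;
    q->count++;
}

static IndexPair dequeue(WorkQueue *q)
{
    assert(q->count > 0);
    IndexPair item = q->items[q->head];
    q->head = (q->head + 1) % q->capacity;
    q->count--;
    return item;
}

/* Lomuto partition of a[p..r]; returns the pivot's final index. */
int partition(int *a, int p, int r)
{
    int pivot = a[r], i = p - 1, t;
    for (int j = p; j < r; ++j) {
        if (a[j] <= pivot) {
            ++i;
            t = a[i]; a[i] = a[j]; a[j] = t;
        }
    }
    t = a[i + 1]; a[i + 1] = a[r]; a[r] = t;
    return i + 1;
}

/* Sorts a[0..n-1]. Returns 0 on success, -1 if allocation fails.
 * At most n + 1 pairs are ever pending, so that capacity suffices. */
int queue_quicksort(int *a, int n)
{
    WorkQueue q;

    if (n < 2)
        return 0;
    q.capacity = n + 1;
    q.head = 0;
    q.count = 0;
    q.items = malloc((size_t)q.capacity * sizeof *q.items);
    if (q.items == NULL)
        return -1;

    enqueue(&q, 0, n - 1);               /* initial work item */
    while (q.count > 0) {
        IndexPair w = dequeue(&q);       /* consumer side */
        if (w.p < w.r) {
            int m = partition(a, w.p, w.r);
            enqueue(&q, w.p, m - 1);     /* producer side: was a recursive call */
            enqueue(&q, m + 1, w.r);
        }
    }
    free(q.items);
    return 0;
}
```

Because the number of threads is fixed by the pool rather than by the recursion depth, thread creation cost no longer grows with the size of the input.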
Quicksort Thread Pool Performance
Looking at Load Imbalances
Load imbalance reduces scalability
Why?
• Idle CPU
• Easier to spot on 4P or above
  • On 2P, idle times might be mistaken as “serial” sections
How do you detect this?
• Windows* Perfmon
• Linux* mpstat
• Thread Profiler
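One simple way to put a number on load imbalance is the ratio of the busiest thread's active time to the average active time. The C sketch below computes it from per-thread busy times gathered with any of the tools above. The metric and the name `load_imbalance` are assumptions chosen for illustration, not something Perfmon, mpstat, or Thread Profiler reports directly.

```c
#include <assert.h>
#include <stddef.h>

/* Ratio of the largest per-thread busy time to the mean busy time.
 * 1.0 means perfectly balanced; larger values mean other threads
 * sit idle while the busiest one finishes. */
double load_imbalance(const double *busy, size_t n_threads)
{
    double sum = 0.0, max = 0.0;

    assert(busy != NULL && n_threads > 0);

    for (size_t i = 0; i < n_threads; ++i) {
        assert(busy[i] >= 0.0);
        sum += busy[i];
        if (busy[i] > max)
            max = busy[i];
    }
    assert(sum > 0.0);                  /* at least one thread did work */
    return max / (sum / (double)n_threads);
}
```

A value of 2.0 on four threads, for example, means the parallel region takes twice as long as it would if the same total work were spread evenly.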
Case Study 2
Linux* mpstat CPU data
2P data not suggestive of load imbalance
4P data shows CPUs drop off
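[Figure: Linux* mpstat CPU utilization plots for the 2P and 4P runs, with regions labeled by number of active cores (2 cores, 1 core, 1 core; 4 cores, 2 cores, 1 core)]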
Case Study 2: Improved
Linux* mpstat CPU data
Second figure shows improvement with load balancing
• 1P-4P scaling improves from 2.1x to 2.7x
[Figure: Linux* mpstat CPU utilization plot after load balancing, with region labeled 4 cores]
Spotting Load Imbalance in Thread Profiler
Differences in Active Thread state
First problem noticed is create/destroy threads for each iteration…
…but there is a difference in Active Thread state within pairs.
Activity 2: Effects of Load Balance in Multi-Threaded Implementation
Use Load Balance control within Basic Physics GUI to control number of spheres assigned to threads
How does this affect FPS measure?
Synchronization
Lost time waiting for locks
Most likely scenario for high contention
• Work inside AND outside protected region is very small
• “Threads pile up” on the lock
• Symptoms: high context switches/sec, high kernel times
Spotting highly contended synchronization objects
• Thread Profiler
[Figure: timeline for Threads 0 to 3 showing Busy, Idle, and In Critical time]
Lock Contention Indicators in Thread Profiler
Large percentage of Locks time
Large amount of Impact time associated with object
Synchronization Primitives: Windows*
Choice of synchronization primitives
• Atomic increments/decrements
  • InterlockedIncrement
• Critical Section, Critical Section with spin count
  • EnterCriticalSection, LeaveCriticalSection, SetCriticalSectionSpinCount
  • Works within a single process
• Events
  • Signal condition has been changed/satisfied
• Mutex
  • Works across processes as well
• Semaphore
  • Works across processes as well
Activity 3: Measuring Synchronization Object Overhead
Determine overhead for using different synchronization objects
• InterlockedIncrement
• CRITICAL_SECTION
• CRITICAL_SECTION with spin count
• Mutex
• Semaphore
CRITICAL_SECTION is used as baseline
• InterlockedIncrement is specialized functionality; others more general
Synchronization Primitives Costs: Un-contended
[Figure: bar chart "Lock times relative to InterlockedIncrement", 1P/1C/1T (data in L1 cache), Windows XP 32-bit (MP kernel). X-axis (lock type): InterlockedIncrement; Enter+Leave CriticalSection; Enter+Leave CriticalSection with spin count 4000; Acquire+Release Mutex; Acquire+Release Semaphore. Y-axis: relative cost, 0.0 to 6.0 (higher is more expensive). Series: Mrm 2.6 1C/1T. One annotation reads ">50x".]
Use the least expensive synchronization method possible
Lock Contention
Lock contention reduces scalability
The following factors combine to produce contention and reduce scalability
• Amount of work inside vs. outside protected region
• Synchronization primitive costs
• OS context switches during lock contention
Possible indicators (without Thread Profiler)
• High context switches/sec
  • >10,000/s should be investigated
• And high kernel time
  • >20% should be investigated
Watch for high context switches/sec and kernel time
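The two rule-of-thumb thresholds above can be folded into a quick check on counters you have already collected (for example from Perfmon* or vmstat). This is only a minimal sketch: the function name and the idea of requiring both indicators together are illustrative assumptions, and the limits are the slide's starting points for investigation, not hard rules.

```c
#include <assert.h>
#include <stdbool.h>

/* Rule-of-thumb limits quoted on the slide. */
#define CTX_SWITCHES_PER_SEC_LIMIT 10000.0
#define KERNEL_TIME_PCT_LIMIT      20.0

/* Hypothetical helper: true when BOTH indicators exceed their limits,
 * which suggests lock contention is worth investigating. */
bool lock_contention_suspected(double ctx_switches_per_sec,
                               double kernel_time_pct)
{
    assert(ctx_switches_per_sec >= 0.0 && kernel_time_pct >= 0.0);
    return ctx_switches_per_sec > CTX_SWITCHES_PER_SEC_LIMIT &&
           kernel_time_pct > KERNEL_TIME_PCT_LIMIT;
}
```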
Reducing Lock Contention
Lock contention reduces scalability – fix?
• Ideally, work inside << work outside
  • Redesign
• Explore use of “spin count” (Windows*)
  • InitializeCriticalSectionAndSpinCount, SetCriticalSectionSpinCount
  • #define _WIN32_WINNT 0x0403 // or higher
  • Spin count = 4000 recommended by Microsoft*
  • Not very portable
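To make the spin-count idea concrete, here is a portable C11 sketch of the "spin first" half of such a lock. It is not the Windows* critical section implementation; the type and function names are hypothetical. The point is only that a thread retries a cheap user-mode acquire a bounded number of times before it would fall back to a costly kernel wait.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration only. */
typedef struct { atomic_flag held; } spin_lock_t;

void spin_lock_init(spin_lock_t *l)
{
    atomic_flag_clear(&l->held);
}

/* Try to take the lock up to spin_count times without leaving user mode.
 * Returns true if acquired. On false, a real lock would now block in the
 * kernel (the expensive context switch the spin is meant to avoid). */
bool spin_lock_try_acquire(spin_lock_t *l, unsigned spin_count)
{
    for (unsigned i = 0; i < spin_count; i++) {
        if (!atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
            return true;
    }
    return false;
}

void spin_lock_release(spin_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Spinning pays off only when the protected region is short enough that the holder is likely to release the lock within the spin window.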
Case Study 3
• 17% serial time on 2P
• 9% serial time on UP
• 4P theoretical scaling 3.1X
• Measured 4P scaling is 0.4X
Why should we think this is a Synchronization issue?
Clues: 50% kernel time
>40K context switches/sec on 4P
[Figure: bar chart "2P and 4P Speedup (IBM* X440 .NET RC1)". X-axis: Workload 01 through Workload 13. Y-axis: speedup, 0.0 to 4.0. Series: Measured 2P Speedup; Measured 4P Speedup.]
[Figure: Perfmon* trace, Dual Xeon™ processor 2.8GHz/Windows* XP. X-axis: workload run time (sec), 1 to 66. Y-axis: % CPU utilization, 0 to 120. Series: CPU0; CPU1. Regions marked Serial and Parallel.]
Pretty good load balance
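The serial-time and theoretical-scaling figures on this slide are consistent with Amdahl's law applied to the 9% serial fraction measured on UP. Assuming that is how they were derived, the sketch below reproduces the arithmetic; the function names are illustrative.

```c
#include <assert.h>

/* Amdahl's law: predicted speedup on n processors when a fraction
 * serial_fraction of the 1-processor run time cannot be parallelized. */
double amdahl_speedup(double serial_fraction, int n)
{
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0 && n >= 1);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n);
}

/* Share of the n-processor run time that is spent in the serial part. */
double serial_share_on_n(double serial_fraction, int n)
{
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0 && n >= 1);
    return serial_fraction / (serial_fraction + (1.0 - serial_fraction) / n);
}
```

With a serial fraction of 0.09, `amdahl_speedup(0.09, 4)` comes out a little above 3.1, and `serial_share_on_n(0.09, 2)` is roughly 17%, matching the numbers quoted above. The measured 0.4X is far below that bound, which is why something other than serial time must be at fault.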
Case Study 3: Speedup
We have a negative scaling problem…
Scaling factor, 1/2/4 Xeon™ processors / Windows* XP (2.8GHz/512K L2/2MB iL3)
             1P     1P/2P   1P/4P
Orig. Code   1.00   1.26    0.68
If adding more threads results in worse performance, there must be some increased contention on a shared resource
Case Study 3: First Approach
Root cause: A class defined a critical section as a static member variable
Solution: Have each instance of class use separate lock by removing static declaration
Before: 4 threads randomly accessing 8 lights with 1 global lock
After: 4 threads randomly accessing 8 lights with 8 private locks
Before
After
Case Study 3: Performance
Observations
• 4P scaling has improved from 0.7x to 1.3x
• There still is much work to do:
  • Now 80,000 context switches/second
  • Utilization of each CPU near 75%
Scaling factor, 1/2/4 Xeon™ processors / Windows* XP (2.8GHz/512K L2/2MB iL3)
             1P     1P/2P   1P/4P
Orig. Code   1.00   1.26    0.68
New Code     1.00   1.54    1.32
Case Study 3: Second Approach
Perfmon* observations (4P)
• Almost no serial execution, utilization of each CPU near 50%
• Almost 200,000 Context Switches/sec!
Case Study 3: Diagnosis
Root cause: poor choice of synchronization primitive
• Computation is incrementing a single variable
• Threads contending on single Critical Section object
Solution: Use of “InterlockedIncrement”
• Critical section with spin count is another possibility
Scaling factor, 1/2/4 Xeon™ processors / Windows* XP (2GHz/512K L2/2MB iL3)
             1P     1P/2P   1P/4P
Orig. Code   1.00   1.40    0.41
New Code     1.00   1.81    3.21
Use the least expensive synchronization method possible
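When the whole protected region is a single increment, an atomic read-modify-write replaces the lock entirely. The sketch below uses C11 `atomic_fetch_add` as a portable stand-in for the Windows* `InterlockedIncrement` call named above; the counter and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared counter; static storage starts at zero. */
static atomic_long shared_counter;

/* Atomically add 1 and return the NEW value, which is also what
 * InterlockedIncrement returns. atomic_fetch_add returns the OLD
 * value, hence the "+ 1". No lock is taken, so threads never block. */
long counter_increment(void)
{
    return atomic_fetch_add(&shared_counter, 1) + 1;
}

long counter_read(void)
{
    return atomic_load(&shared_counter);
}
```

The hardware still serializes updates to the one cache line, so this removes the context switches rather than all contention.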
How Much of Data Structure to be Locked?
Example: Array of counts/buckets/pointers (random access)
• Enumeration sort, radix sort, bucket sort
• Hash table
Lock whole structure?
• Easy to implement
• Severely restricts access
Lock individual elements?
• Individual access by different threads
• Extra space in structure
Modulo Locks
Assuming little contention for individual elements
Create array of locks to protect every Kth element
• Fixed number of locks, say 2
• Lock index used to determine which elements are protected
  • To access element Data[Q], thread must hold LOCK[Q % 2]
• Works for 2-D and 3-D arrays
  • For example, with eight locks, accessing A[i,j] would use LOCK[(i+j) % 8]
Set number of locks equal to number of threads
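A minimal C11 sketch of modulo locks follows. It assumes eight locks and a 64-element array, and it uses simple `atomic_flag` spin locks so the example stays self-contained; every name and size here is an illustrative assumption, and a real program would more likely use its platform's mutex or critical section type.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NUM_LOCKS 8    /* fixed lock count (assumed; e.g. one per thread) */
#define NUM_ELEMS 64   /* hypothetical array size */

static atomic_flag locks[NUM_LOCKS];
static long data[NUM_ELEMS];

/* Lock that protects Data[q]: every NUM_LOCKS-th element shares a lock. */
size_t lock_index_1d(size_t q) { return q % NUM_LOCKS; }

/* Lock that protects A[i][j] in a 2-D array. */
size_t lock_index_2d(size_t i, size_t j) { return (i + j) % NUM_LOCKS; }

/* Put every lock into the released state before first use. */
void modulo_locks_init(void)
{
    for (size_t k = 0; k < NUM_LOCKS; k++)
        atomic_flag_clear(&locks[k]);
}

/* Increment one element while holding only the lock for its index. */
void increment_element(size_t q)
{
    assert(q < NUM_ELEMS);
    atomic_flag *lock = &locks[lock_index_1d(q)];
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;  /* spin until the lock is free */
    data[q]++;
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Unsynchronized read, intended for use after all writers have finished. */
long element_value(size_t q)
{
    assert(q < NUM_ELEMS);
    return data[q];
}
```

Two threads collide only when their element indices map to the same lock, so the memory cost stays fixed while most accesses proceed in parallel.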
Frontside Bus (FSB) Bandwidth
Cores share bus in current Intel® multi-core architectures
• Saturating the bus limits scalability
• Newer independent bus designs improve scalability
• Applies to current SMP platforms too
Good metric to monitor, if
• CPU utilization is close to 100%
• Poor scaling to 4P or 8P
• Low context switches/sec
How do you measure this?
• VTune™ Performance Analyzer
• Compare 1-thread vs. multi-thread VTune runs
  • Look for areas where clock ticks show significant jumps
  • Code inspection
Case Study 4
IPF Madison 1.5GHz/9M/400MHz
• 1P to 2P scaling: 1.28
• 1P to 4P scaling: 1.27
2P close to FSB saturation
• ~5GB/s
• Madison 400MHz bus peak bandwidth is 6.4GB/s
Solution?
• Change algorithm / data structures to keep data in cache more often
• Easier said than done
Frontside Bus Lab
Intent of this lab
• Observe impact of FSB saturation on scalability using Stream benchmark
• Learn use of appropriate VTune performance event to monitor bus utilization
[Figure: lab platform diagram. Chipset MCH connected over Bus 0 and Bus 1 to Woodcrest Socket 0 (Cores 0 to 3) and Woodcrest Socket 1 (Cores 4 to 7); each core is labeled "2M L2".]
Computing FSB Data Bandwidth: Core 2™ Processor
Bus bandwidth (MB/s) per core = (BDC.ta / CCU.b) * TB
• BDC.ta is the BUS_DRDY_CLOCKS.THIS_AGENT event count
  • Counts the number of bus cycles when data is sent on the bus (the DRDY [Data Ready] signal is asserted on the bus)
• CCU.b is the CPU_CLK_UNHALTED.BUS event count
  • Counts the number of bus cycles that occurred during measurement (bus cycles when the core is not halted)
• TB is the theoretical bandwidth of the bus in MB/s = 8 bytes/clock * bus frequency in MHz
  • Example: 2.6GHz processor with 1333MHz FSB
  • Theoretical bandwidth = 8 bytes/clock * 1333 MHz = 10664 MB/s
Total bus bandwidth = ∑ “per core” bandwidths
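The formula is easy to get wrong by a unit, so a small sketch may help. It takes the two event counts as plain numbers that you have already read out of VTune; the function names are hypothetical and nothing here talks to the profiler.

```c
#include <assert.h>
#include <stddef.h>

/* TB: theoretical bus bandwidth in MB/s = 8 bytes/clock * bus MHz. */
double theoretical_bw_mbs(double bus_mhz)
{
    return 8.0 * bus_mhz;
}

/* Per-core bandwidth in MB/s = (BDC.ta / CCU.b) * TB, where
 * drdy_clocks     is the BUS_DRDY_CLOCKS.THIS_AGENT count and
 * unhalted_clocks is the CPU_CLK_UNHALTED.BUS count for that core. */
double core_bus_bw_mbs(double drdy_clocks, double unhalted_clocks,
                       double tb_mbs)
{
    assert(unhalted_clocks > 0.0);
    return drdy_clocks / unhalted_clocks * tb_mbs;
}

/* Total bus bandwidth: the sum of the per-core bandwidths. */
double total_bus_bw_mbs(const double *drdy_clocks,
                        const double *unhalted_clocks,
                        size_t num_cores, double tb_mbs)
{
    double total = 0.0;
    for (size_t c = 0; c < num_cores; c++)
        total += core_bus_bw_mbs(drdy_clocks[c], unhalted_clocks[c], tb_mbs);
    return total;
}
```

For the 1333MHz bus in the example, `theoretical_bw_mbs(1333.0)` gives 10664 MB/s, which is the TB value to pass to the other two functions.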
Frontside Bus Lab: VTune Notes
• Where is “BUS_DRDY_CLOCKS.THIS_AGENT” event?
  • Configure Sampling Events Tab -> Event Groups: “External Bus Events”
• View results as Table in VTune
  • Easier to compute bandwidth
• View results per CPU (Show/Hide CPU Info.)
Frontside Bus Lab: Example
Bus bandwidth (MB/s) per core = (BDC.ta / CCU.b) * TB
• TB = 8 bytes/clock * 1333 MHz = 10664 MB/s
Processor 0 BW = (159,204,930 / 1,318,288,023) * 10664 MB/s
               = (0.159/1.318) * 10.7 GB/s
               ~ 1.29 GB/s
Activity 4: Measuring Frontside Bus Saturation
Intent of this activity
• Observe impact of FSB saturation on scalability using Stream benchmark
• Learn use of appropriate VTune performance event to monitor bus utilization
What you may see
• Scalabilitybus0.bat: ~5 sec / ~1.2 GB/s
• Scalabilitybus0123.bat: ~18 sec / ~4 GB/s
Frontside Bus Lab: 4 Core Examples
Processor 0 BW = (0.308/3.216)*10.7 GB/s ~ 1.02 GB/s
Processor 1 BW = (0.312/3.311)*10.7 GB/s ~ 1.00 GB/s
Processor 2 BW = (0.308/3.224)*10.7 GB/s ~ 1.02 GB/s
Processor 3 BW = (0.311/3.305)*10.7 GB/s ~ 1.00 GB/s
Total BW ~ 1.0*4 = 4 GB/s
Processor 0 BW = (0.309/4.922)*10.7 GB/s ~ 0.67 GB/s
Processor 2 BW = (0.313/4.989)*10.7 GB/s ~ 0.67 GB/s
Processor 4 BW = (0.315/5.027)*10.7 GB/s ~ 0.67 GB/s
Processor 6 BW = (0.315/5.023)*10.7 GB/s ~ 0.67 GB/s
Total BW ~ 0.67*4 = 2.68 GB/s
Frontside Bus Lab: 8 Core Example
Total BUS_DRDY_CLOCKS.THIS_AGENT = 2,524,773,727
Average Total CPU_CLK_UNHALTED.BUS = 52,382,932,480 / 8
= 6,547,866,560
Total Avg. BW = 2,524,773,727 / 6,547,866,560 * 10.664 GB/s
~ 4.11 GB/s
Frontside Bus Lab: Discussion
Almost 3x slower run time from 1 to 4 cores
• Same amount of data transferred by each thread
• Contention for shared bus makes everything run slower
How do clockticks compare in first 2 runs?
• Notice the clock ticks go up significantly in the 4 stream case in the source view as well
Why is there a difference in MB/s reported by Stream vs. what you calculated using VTune?
Frontside Bus Lab: Caution
Measuring BW on Underutilized System
Process or thread migration can break the formula
Example: Single thread Stream allowed to migrate in the lab
Is bandwidth used equal to (3.9 * 4 =) 15.6 GB/s?
Dempsey 3.2/2S/2C/1066MHz - 1 Stream process allowed to migrate
                 Processor0      Processor1      Processor2      Processor3
Bus Data Ready   89,419,935      59,181,430      193,423,450     158,135,505
Clockticks       2,358,400,000   1,545,600,000   5,033,600,000   4,128,000,000
FSB BW MB/s      3927            3890            3928            3942
Best to tie threads to cores for bandwidth analysis
What is False Sharing?
Multiple threads repeatedly write to the same cache line shared by cores
• Usually different data
• Cache lines get invalidated
  • Forces additional reads from memory
• Severe performance impact in tight loops, in general
  • Threads read/write to the same cache line very rapidly
• Good metric to monitor if
  • CPU utilization of all processors very high
  • Poor scaling to 4P or 8P
  • Low context switches/sec
Detecting False Sharing with VTune Analyzer
Core 2® processor-based events:
• MACHINE_NUKES.MEM_ORDER event
• Significant last level cache read misses
  • 2nd Level or 3rd Level Cache Read Misses
  • MEM_LOAD_RETIRED.L2_MISS
• Significant FSB activity
  • BUS_DRDY_CLOCKS.THIS_AGENT
Compare 1-thread vs. multi-thread VTune runs
• Look for areas where clock ticks show significant jumps
• Code inspection
False Sharing Example 1
#define N_THREADS 16
double sum = 0.0, sum_local[N_THREADS];

#pragma omp parallel
{
    int me = omp_get_thread_num();
    sum_local[me] = 0.0;
    #pragma omp for
    for (i = 0; i < N; i++)
        sum_local[me] += x[i] * y[i];
    #pragma omp atomic
    sum += sum_local[me];
}

No overlap of memory access; no sync needed
Each thread can invalidate cache line for others
To fix, declare and use true local sum variable for each thread
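One way to apply that fix is sketched below. The accumulator is an automatic variable declared inside the parallel region, so each thread keeps its partial sum in a register or on its own stack rather than in adjacent elements of a shared array. The function name and signature are illustrative; the code also compiles and runs serially when OpenMP is not enabled, because the pragmas are then ignored.

```c
#include <assert.h>

/* Dot product with a true per-thread local accumulator. */
double dot_product(const double *x, const double *y, int n)
{
    double sum = 0.0;

    #pragma omp parallel
    {
        double local = 0.0;          /* private to this thread */

        #pragma omp for
        for (int i = 0; i < n; i++)
            local += x[i] * y[i];

        /* One synchronized update per thread, outside the hot loop. */
        #pragma omp atomic
        sum += local;
    }
    return sum;
}
```

An OpenMP `reduction(+:sum)` clause on the loop would express the same idea more compactly.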
False Sharing Example 2
Normalization of an array of spatial vectors (double precision)
• 10,000 vectors (<256K size; fits in L2)
• 2⅔ vectors per cache line (assuming three doubles, 24 bytes, per vector and a 64-byte line)
False sharing case
• Round-Robin distribution
• Each thread works on “start index + i*Num_Threads”
No false sharing case
• Each thread works on a block of data
• Block per thread = Array size / Num_Threads
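[Figure: round-robin distribution. Consecutive vectors V0, V1, V2, … V9 are dealt out in turn to Thread 0, Thread 1, Thread 2, Thread 3.]
[Figure: block distribution. Contiguous ranges (V0 … V2499, V2500 … V4999, V5000 …) are assigned to Thread 0, Thread 1, Thread 2, Thread 3.]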
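The two distributions differ only in how a thread computes the indices it owns. The sketch below shows both index calculations plus a helper that maps an element to its cache line; the 24-byte element and 64-byte line passed in the usage note are assumptions, as is an array that starts on a line boundary. All names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Round-robin: the k-th element handled by a thread is
 * "start index + k * num_threads", with start index = thread id. */
size_t round_robin_index(size_t thread, size_t k, size_t num_threads)
{
    return thread + k * num_threads;
}

/* Block: each thread owns the contiguous range [start, end). */
size_t block_start(size_t thread, size_t n, size_t num_threads)
{
    return thread * (n / num_threads);
}

size_t block_end(size_t thread, size_t n, size_t num_threads)
{
    /* The last thread also takes any remainder. */
    return (thread + 1 == num_threads) ? n
                                       : (thread + 1) * (n / num_threads);
}

/* Cache line holding the first byte of element `index`. */
size_t cache_line_of(size_t index, size_t elem_bytes, size_t line_bytes)
{
    assert(line_bytes > 0);
    return index * elem_bytes / line_bytes;
}
```

With round-robin, elements 0 and 1 belong to different threads yet `cache_line_of(0, 24, 64)` and `cache_line_of(1, 24, 64)` are the same line, so every write invalidates a neighbor's copy. With blocks, threads can share a line only where two blocks meet.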
False Sharing Example 2 – Effects on Speedup (2S/2C Dempsey; HT off)
[Figure: line chart "Effects of False Sharing", Dempsey 3.2 2S/2C, Windows 2003 Server 64-bit, ScalabilityLab-FS.exe 32-bit. X-axis: number of cores/threads, 0 to 4. Y-axis: speedup, 0.0 to 4.0. Series: No False Sharing; With False Sharing.]
Activity 5: Identifying False Sharing
Intent of this activity
• Observe impact of false sharing on scalability
• Learn use of appropriate VTune performance events
  • Compare and contrast false sharing vs. no false sharing
Activity 5: Discussion
Typical Results
Sum of events on all cores
Event                        1T           4T-FS          4T-NOFS
CPU_CLK_UNHALTED.CORE        38.6 E+09    173.2 E+09     37.1 E+09
INST_RETIRED.ANY             30.5 E+09    30.5 E+09      28.5 E+09
MACHINE_NUKES.MEM_ORDER      1.92 E+06    105.48 E+06    0.04 E+06
MEM_LOAD_RETIRED.L2_MISS     0.020 E+06   75.557 E+06    0.185 E+06
BUS_DRDY_CLOCKS.THIS_AGENT   8.4 E+06     2454.5 E+06    7.3 E+06
No false sharing in single thread execution
MACHINE_NUKES.MEM_ORDER counts events most likely due to false sharing
• Cache misses can be indication of problems
Effects of On-die Shared Cache
Last level cache (LLC) size, shared vs. not shared
• Dempsey: 2MB L2 not shared
• Merom/Woodcrest: 4MB L2 shared
• Clovertown: 8MB L2 (4MB per die shared)
Cache-sensitive applications will run better with threads on cores that do not share a cache
[Figure: chipset and L2 cache layout for the Dempsey, Woodcrest, and Clovertown platforms]
Detecting Effects of On-die Shared Cache
VTune sampling
• LLC cache misses increase significantly when run on same socket vs. different sockets
Experiments with single socket vs. multi-socket show differences in scaling
May require thread affinity to correct performance problems
Paying Attention to NUMA Issues
NUMA may affect scalability
• Non-Uniform Memory Access
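• Adds extra memory layer to locate data
  • Registers
  • Cache
  • Memory
  • “Far” Memory
[Figure: three platform memory topologies. A NUMA system with several MEM banks joined by a cache-coherent interconnect; a Dual Independent Bus chipset with one MEM; a single-FSB chipset with one MEM.]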
Paying attention to NUMA issues
NUMA related scalability issues depend on
• Platform design
• NUMA aware OS used or not
  • NUMA-aware OSs: Windows* Server 2003 and Linux 2.6 kernel
• Application being NUMA aware or not
Check for NUMA issues if
• Scaling falls off when going from SMP to NUMA
• Low context switches/sec
• Application is memory latency sensitive
How do you detect this?
• Knowledge of platform architecture
• Through experimentation
  • Tie threads to different cores to measure performance
• Measure memory latency ratio between “near” and “far” memory
OS and Application NUMA Support
Definition of node
• Own processors and memory
• Connected to the larger system through a cache-coherent interconnect
Role of NUMA-aware OS
• Schedule threads on processors in the same node as memory being used
• Satisfy memory-allocation requests from within the node
  • But will allocate memory from other nodes if necessary
Role of NUMA-aware applications
• Use of NUMA APIs
  • Topology of nodes
  • Memory per node
• Use of Affinity Mask APIs
  • SetThreadAffinityMask, SetProcessAffinityMask
  • Keep threads sharing memory on the same node
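The affinity APIs take a bitmask in which bit i stands for logical processor i. As a small portable sketch, the helper below builds such a mask for a contiguous range of processors; the result is the kind of value you would pass as the mask argument of SetThreadAffinityMask. The helper name is hypothetical, and the assumption that a node's processors are numbered contiguously must be checked against the real topology reported by the NUMA APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Build a mask with bits [first_cpu, first_cpu + count) set.
 * Assumes at most 64 logical processors. */
uint64_t affinity_mask_for_range(unsigned first_cpu, unsigned count)
{
    assert(count >= 1 && first_cpu + count <= 64);
    /* Shifting a 64-bit value by 64 is undefined, so handle it apart. */
    uint64_t bits = (count == 64) ? UINT64_MAX
                                  : ((UINT64_C(1) << count) - 1);
    return bits << first_cpu;
}
```

For a hypothetical two-node machine with four processors per node, `affinity_mask_for_range(0, 4)` would cover node 0 and `affinity_mask_for_range(4, 4)` node 1.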
Watching I/O
I/O impacts scalability
• CPU likely to be idle
• Check for I/O to disk and network
How do you detect this?
• Windows* Perfmon
• Linux* vmstat, sar, iostat
Case Study 5
Linux* mpstat CPU data, sar I/O data
Correlation between disk write peaks and CPU utilization troughs
When I/O is reduced using application configuration options
• 1P-4P scaling improves from 1.9x to 2.9x
Striped or RAID disk configurations could have helped
An overlapped I/O implementation in the application could also have helped
Factors Inhibiting Scalability Summary
Serially dominated workload
Choice of synchronization primitives and lock contention
Granularity
Parallel overhead
I/O (Disk and Network)
Load Imbalance
High front side bus utilization
Memory related
• NUMA
• False sharing
• Shared cache effects
[Figure: side labels group the factors above into "Application Domain" and "Platform/CPU"]
Backup
MESI protocol
Every cache line is marked with one of the four following states (coded in two additional bits):
• M - Modified: The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state.
• E - Exclusive: The cache line is present only in the current cache, but is clean; it matches main memory.
• S - Shared: Indicates that this cache line may be stored in other caches of the machine.
• I - Invalid: Indicates that this cache line is invalid.
A cache may satisfy a read from any state except Invalid. An Invalid line must be fetched (to the Shared or Exclusive states) to satisfy a read.
A write may only be performed if the cache line is in the Modified or Exclusive state. If it is in the Shared state, all other cached copies must be invalidated first. This is typically done by a broadcast operation.
A cache may discard a non-Modified line at any time, changing to the Invalid state. A Modified line must be written back first.
MESI protocol (contd.)
A cache that holds a line in the Modified state must snoop (intercept) all attempted reads (from all of the other CPUs in the system) of the corresponding main memory location and insert the data that it holds. This is typically done by forcing the read to back off (i.e. to abort the memory bus transaction), then writing the data to main memory and changing the cache line to the Shared state.
A cache that holds a line in the Shared state must also snoop all invalidate broadcasts from other CPUs, and discard the line (by moving it into Invalid state) on a match.
A cache that holds a line in the Exclusive state must also snoop all read transactions from all other CPUs, and move the line to Shared state on a match.
The Modified and Exclusive states are always precise: i.e. they match the true cache line ownership situation in the system. The Shared state may be imprecise: if another CPU discards a Shared line, and this CPU becomes the sole owner of that cache line, the line will not be promoted to Exclusive state, because broadcasting all cache line replacements from all CPUs is not practical over a broadcast snoop bus.
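The state transitions described on the two MESI slides can be sketched as a small simulation. This is a minimal model under stated assumptions, not a description of any particular CPU: the names `State`, `CacheLine`, `read`, `write`, `evict` and `snapshot` are hypothetical, only one cache line is tracked, and write-backs and bus traffic are implied rather than modelled.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class CacheLine:
    """State of one cache line in each of n caches (illustrative model only)."""

    def __init__(self, n_caches):
        self.states = [State.I] * n_caches

    def read(self, cpu):
        # Any state except Invalid satisfies a read locally.
        if self.states[cpu] is not State.I:
            return
        others_hold = False
        for other, state in enumerate(self.states):
            if other != cpu and state is not State.I:
                # A Modified holder writes back first; M and E holders drop to Shared.
                self.states[other] = State.S
                others_hold = True
        # Fetch to Shared if another cache holds the line, else to Exclusive.
        self.states[cpu] = State.S if others_hold else State.E

    def write(self, cpu):
        # All other cached copies are invalidated (broadcast) before the write.
        for other in range(len(self.states)):
            if other != cpu:
                self.states[other] = State.I
        self.states[cpu] = State.M

    def evict(self, cpu):
        # A Modified line would be written back before being discarded.
        self.states[cpu] = State.I

    def snapshot(self):
        # One letter per cache, e.g. "SI" for a two-cache system.
        return "".join(state.name for state in self.states)


line = CacheLine(2)
line.read(0)    # "EI": sole reader gets Exclusive
line.read(1)    # "SS": second reader demotes the first to Shared
line.write(0)   # "MI": writer invalidates the other copy
line.read(1)    # "SS": Modified holder supplies data, both end up Shared
line.evict(1)   # "SI": the remaining Shared copy is not promoted to Exclusive
print(line.snapshot())
```

The last step illustrates the imprecision of the Shared state: cache 0 is now the only holder, yet its copy stays Shared.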
Linux*: vmstat
Quick way to watch for
• Disk I/O
• Overall CPU utilization
• Swap
• Context switches
vmstat -n 1
• Print header only once
• Output data every 1 sec
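When the same numbers are needed inside a test harness, the counters can be sampled directly. The sketch below assumes a Linux system where `/proc/stat` exposes an aggregate `cpu` line and a `ctxt` (context switch) counter; the function names `parse_proc_stat`, `sample` and `rates` are hypothetical, and it is a rough stand-in for the vmstat columns rather than a reimplementation of vmstat.

```python
def parse_proc_stat(text):
    """Extract cumulative counters from /proc/stat-style text."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ctxt":
            counters["ctxt"] = int(fields[1])
        elif fields[0] == "cpu":
            # Aggregate line: user nice system idle iowait irq softirq steal ...
            # Only the first eight fields are summed, because guest time
            # is already included in user time.
            ticks = [int(v) for v in fields[1:9]]
            counters["cpu_total"] = sum(ticks)
            counters["cpu_idle"] = ticks[3]
    return counters


def sample(path="/proc/stat"):
    """Read one snapshot of the counters (Linux only)."""
    with open(path) as f:
        return parse_proc_stat(f.read())


def rates(before, after, interval_s):
    """Return (busy fraction, context switches per second) between two snapshots."""
    total = after["cpu_total"] - before["cpu_total"]
    idle = after["cpu_idle"] - before["cpu_idle"]
    busy = (total - idle) / total if total else 0.0
    ctxt_per_s = (after["ctxt"] - before["ctxt"]) / interval_s
    return busy, ctxt_per_s
```

A typical use would be to call `sample()`, sleep for one second, call `sample()` again and pass both snapshots to `rates(before, after, 1.0)`.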
Linux*: sar
May not be installed by default
• sysstat package (CD 3 RPMS directory – RHEL3 U2)
Monitors
• Disk I/O, all CPUs, swap, network traffic, interrupts
sar -U ALL -bWw -o <binfile> 1 0
• Report statistics, 1 sec interval, forever
• -U ALL: report on all CPUs
• -b: aggregated disk I/O (for more details use iostat)
• -W: swap statistics
• -w: context switches/s
• -o <binfile>: write to binary file
• -f <binfile>: read from binary file
Linux*: sar (contd.)
Using sar in application launch scripts
export PATH=$PATH:/sbin
sar -U ALL -bWw -o app.sar 1 0 &
<launch your app>
kill -9 `pidof sar` (could use: killall -9 sar)
kill -9 `pidof sadc` (could use: killall -9 sadc)
sar seems more expensive
• Time gaps in reports even if 1 sec output is requested
• Needs more investigation
Linux*: mpstat
mpstat -P ALL 1
• -P ALL: all processors
Using mpstat in application launch scripts
export PATH=$PATH:/sbin
mpstat -P ALL 1 >mpstat.out &
<launch your app>
kill -9 `pidof mpstat` (could use: killall -9 mpstat)
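The launch-script pattern used for sar and mpstat (start the monitor in the background, run the application, stop the monitor) can also be wrapped in a small helper. The sketch below is one possible shape, not part of the original scripts: `run_with_monitor` is a hypothetical name, and it sends SIGTERM first so the monitor gets a chance to exit cleanly, falling back to SIGKILL only if it does not stop within five seconds.

```python
import subprocess


def run_with_monitor(monitor_cmd, app_cmd, log_path):
    """Run app_cmd while monitor_cmd logs to log_path; return the app's exit code."""
    with open(log_path, "w") as log:
        # Start the monitor in the background, output redirected to the log file.
        monitor = subprocess.Popen(
            monitor_cmd, stdout=log, stderr=subprocess.STDOUT
        )
        try:
            # Run the application in the foreground and wait for it to finish.
            return subprocess.call(app_cmd)
        finally:
            # Ask the monitor to stop (SIGTERM), even if the app failed to start.
            monitor.terminate()
            try:
                monitor.wait(timeout=5)
            except subprocess.TimeoutExpired:
                # Last resort, equivalent to kill -9.
                monitor.kill()
                monitor.wait()
```

With mpstat on the PATH, a call might look like `run_with_monitor(["mpstat", "-P", "ALL", "1"], ["./app"], "mpstat.out")`, where `./app` stands for the application being measured.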
Linux*: top
Useful interactive mode options
• Press the appropriate keys
s  changes delay between updates
u  selects only specified user’s processes
H  show threads & utilization (toggle)
   (shows CPU on which thread is scheduled)
i  idle processes or threads (toggle)
-b  batch mode (command-line option)
• Does not report threads
Linux*: using processor affinity
schedutils package
• “taskset” command
Affinity system call APIs
• sched_setaffinity
• In 2.6 kernels
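As an illustration of the affinity API, Python exposes the same Linux call as `os.sched_setaffinity`. The sketch below assumes a Linux system (the functions are not available on other platforms); `pin_to_cpu` is a hypothetical helper name, not part of any library.

```python
import os


def pin_to_cpu(cpu, pid=0):
    """Restrict pid (0 = the caller) to a single CPU; return the previous CPU set."""
    previous = os.sched_getaffinity(pid)
    if cpu not in previous:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(previous)}")
    os.sched_setaffinity(pid, {cpu})
    return previous


if hasattr(os, "sched_setaffinity"):
    allowed = os.sched_getaffinity(0)
    saved = pin_to_cpu(min(allowed))      # pin to the lowest allowed CPU
    print("now restricted to", os.sched_getaffinity(0))
    os.sched_setaffinity(0, saved)        # restore the original CPU set
```

From the shell, `taskset -c 0 <command>` should have a similar effect for a whole command. Pinning threads this way is useful when measuring scaling, because it keeps the scheduler from moving work between cores from one run to the next.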