Scalability of Threaded Applications

Intel Software College

Copyright © 2006, Intel Corporation. All rights reserved.

Scalability of Multithreaded Applications

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Objectives

After completion of this module you will understand

• The need for designing multithreaded applications for scalability to take advantage of an increasing number of available cores

• What tools are available to measure and predict scalability

• How several different factors can inhibit the scaling of applications on an increased number of cores

Agenda

Why focus on scalability?
• Measuring and estimating scalability
• Where would you start?

Tools for scalability analysis

Factors inhibiting scalability
• Serially Dominant Workloads
• Granularity and Parallel Overhead
• Load Imbalance
• Synchronization Issue
• Memory Related Issues
• I/O

What is scalability?

The ability to handle growing amounts of work in a graceful manner

What resources might be increased?
• Cores and threads
• Memory capacity
• Data, problem size
  • Not a resource, but likely to increase as computational power increases

“What is it that we really mean by scalability? A service is said to be scalable if when we increase the resources in a system, it results in increased performance in a manner proportional to resources added.”

-- Werner Vogels, CTO, Amazon.com

Evolution

• Dual core
  • Symmetric multithreading
• Multi-core array
  • CMP with ~10 cores
• Many-core array
  • CMP with 10s-100s low power cores
  • Scalar cores
  • Capable of TFLOPS+
  • Full System-on-Chip
  • Servers, workstations, embedded…

Large, scalar cores for high single-thread performance

Scalar plus many core for highly threaded workloads

Evolutionary configurable architecture

Amdahl’s Law

Speedup is limited by the amount of serial code

[Figure: Maximum Theoretical Speedup from Amdahl’s Law. Speedup (0 to 8) versus number of cores (0 to 8), with one curve for each serial percentage: 0, 10, 20, 30, 40, 50.]

Ψ(p) ≤ 1 / (s + (1 – s) / p)

where 0 ≤ s ≤ 1 is the fraction of serial operations
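
As a minimal sketch, Amdahl’s bound can be evaluated directly in C. The function names `amdahl_speedup` and `amdahl_limit` are illustrative and not part of the course material; `s` is the serial fraction and `p` the number of cores, as defined on the slide.

```c
#include <assert.h>

/* Upper bound on speedup from Amdahl's Law.
 * s: serial fraction of the single-core run time, 0 <= s <= 1
 * p: number of cores, p >= 1 */
double amdahl_speedup(double s, int p)
{
    assert(s >= 0.0 && s <= 1.0);
    assert(p >= 1);
    return 1.0 / (s + (1.0 - s) / (double)p);
}

/* Limit of the bound as p grows without bound: 1/s (requires s > 0). */
double amdahl_limit(double s)
{
    assert(s > 0.0 && s <= 1.0);
    return 1.0 / s;
}
```

For example, `amdahl_speedup(0.5, 8)` is about 1.78: with half of the run serial, eight cores give less than a 2x gain.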

Question 1

If an application is only 25% serial, what is the maximum speedup you can ever achieve, assuming an infinite number of processors? (Ignore parallel overhead.)

A: 1.25
B: 2.0
C: 4.0
D: No speedup

Scaled Speedup (Gustafson-Barsis’s Law)

Amdahl’s Law does not take into account

• overhead costs

• increases in problem size able to be computed with more cores

Increasing the number of cores enables…

• Increasing the problem size → decreasing the sequential fraction of computation → increasing speedup

Given p cores and a parallel code solving a problem of size n, let s be the fraction of the parallel run’s execution time spent in serial code.

Ψ ≤ p + (1 – p) s

Scaled speedup estimates how much faster the parallel execution is than the same computation on a single core

Assumes problem size increases linearly with the number of cores
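
A minimal sketch of the scaled speedup bound in C, using the form Ψ ≤ p + (1 – p)·s, which is the form the worked example in this module applies. The function name `scaled_speedup` is illustrative; `s` is measured on the parallel run, not on a single-core run.

```c
#include <assert.h>

/* Scaled speedup bound (Gustafson-Barsis's Law).
 * s: fraction of the parallel run's time spent in serial code, 0 <= s <= 1
 * p: number of cores, p >= 1 */
double scaled_speedup(double s, int p)
{
    assert(s >= 0.0 && s <= 1.0);
    assert(p >= 1);
    return (double)p + (1.0 - (double)p) * s;
}
```

At the extremes the bound behaves as expected: `s = 0` gives a speedup of `p`, and `s = 1` gives a speedup of 1.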

Using Scaled Speedup

If an application runs on 64 cores in 220 seconds, with 11 seconds devoted to serial execution, what is the scaled speedup?

Ψ = 64 + (1 – 64) (11/220)
  = 64 – 63 × 0.05
  = 60.85

Assuming fixed serial time, what is the single-core execution time?

(220 – 11) × 64 + 11 = 13387 seconds

• Amdahl’s Law then yields a speedup of 60.85 on 64 cores with 0.0822% serial time

Would serial time be fixed? Would the problem fit on one core?

For comparison, Amdahl’s Law with 5% serial on 64 cores => 15.42
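
The single-core estimate in this example can be sketched as follows. The function names are hypothetical, and the calculation carries the slide’s assumptions: serial time stays fixed, and the parallel portion of the p-core run would take p times as long on one core.

```c
#include <assert.h>

/* Estimated single-core run time, assuming the serial time stays fixed and
 * the parallel portion of the p-core run takes p times as long on one core. */
double single_core_time(double t_total, double t_serial, int p)
{
    assert(p >= 1);
    assert(t_serial >= 0.0 && t_serial <= t_total);
    return (t_total - t_serial) * (double)p + t_serial;
}

/* Speedup of the p-core run relative to that single-core estimate. */
double speedup_vs_single_core(double t_total, double t_serial, int p)
{
    assert(t_total > 0.0);
    return single_core_time(t_total, t_serial, p) / t_total;
}
```

With the slide’s numbers, `single_core_time(220.0, 11.0, 64)` gives 13387 seconds, and dividing by the 220-second parallel run reproduces the scaled speedup of 60.85.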

Question 2

What is the maximum amount of serial execution time for a parallel application to achieve a scaled speedup of 7.5 on an eight-core system?

7.5 = 8 + (1 – 8) s
  s = 0.5 / 7
    = 0.071 => 7.1%

Using Amdahl’s Law, the serial percentage must be ≤ 0.952%
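
Solving the scaled speedup formula for s generalizes this question to any target and core count. This is a sketch only; the function name `max_serial_fraction_scaled` is illustrative.

```c
#include <assert.h>

/* Largest serial fraction s that still allows the target scaled speedup
 * on p cores, obtained by solving target = p + (1 - p) * s for s. */
double max_serial_fraction_scaled(double target, int p)
{
    assert(p > 1);
    assert(target >= 1.0 && target <= (double)p);
    return ((double)p - target) / ((double)p - 1.0);
}
```

Calling `max_serial_fraction_scaled(7.5, 8)` returns 0.5 / 7, or roughly 7.1%.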

Estimating potential scalability of serial applications

• Need to estimate serial vs. parallelizable execution times
  • Speedup estimate based on Amdahl’s law
• VTune sampling
  • Identify potential areas for parallelization
  • Example: loops
  • Use clock ticks to estimate parallel time
  • Serial time = Total run time – parallelizable run time
  • Compute scalability estimate
• VTune call graph
  • See potential call trees for parallelization
  • Use “Total time” (self + descendants) for parallelizable run time

Estimating scalability upper bound for parallel applications

Need to estimate serial vs. parallel execution times
• Speedup estimate based on Amdahl’s law
• Serial percentage for Gustafson-Barsis’s Law

Thread Profiler
• Use critical path information in Profile View
• Use information in Concurrency Level view

• Experimental technique based on CPU utilization of all processors/cores

Finding Serial and Parallel Time: Thread Profiler

Thread Profiler – for parallel applications

• Use Concurrency Level View
• Total the serial (CL:0 and CL:1) and parallel (CL:2 and up) times
  • Under Utilized times are counted as parallel time

Finding Serial and Parallel Time: CPU utilization

Experimental approaches – for parallel applications
• Monitor utilization of all CPUs over time
• Parallel region is where all CPUs are active
• Perfmon* (Windows) or mpstat (Linux)
• Example: 76% serial, 24% parallel on DP

[Figure: Perfmon trace on a dual Xeon™ processor system, 2.8 GHz, Windows* XP. % CPU utilization (0 to 120) for CPU 0 and CPU 1 versus workload run time (1 to 43 sec), with regions labeled Serial, Serial, Parallel.]

Perfmon* or mpstat does not capture sub-second behavior
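
One way to turn logged utilization samples into a serial/parallel split is sketched below. It assumes two CPUs and one sample per interval from a tool such as Perfmon or mpstat. The function name `parallel_fraction` and the idea of a fixed "busy" threshold are assumptions for illustration, not part of either tool.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of sampled intervals in which both CPUs were busy.
 * cpu0, cpu1: per-interval utilization in percent (0..100)
 * n:          number of samples
 * busy_pct:   utilization at or above which a CPU counts as busy
 *             (a tuning choice, not a value given in the slides) */
double parallel_fraction(const double *cpu0, const double *cpu1,
                         size_t n, double busy_pct)
{
    size_t parallel = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        if (cpu0[i] >= busy_pct && cpu1[i] >= busy_pct)
            parallel++;
    }
    return (double)parallel / (double)n;
}
```

The remaining intervals are treated as serial. Because the samples are typically one second apart, short parallel bursts inside an interval are not visible to this estimate.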

How can speedup estimate help identify scalability issues?

• Different workloads can exercise different parts of application
• Estimates can point to workloads that need scalability analysis and improvement
• Compare measured vs. estimate
• Choose largest delta workloads for analysis

Workloads 5 & 12 show significant difference between estimate and actual; focus tuning here

Workload 11 is predicted to have low scaling

[Figure: 2P and 4P Speedup (IBM* X440 .NET RC1). Measured 2P speedup and measured 4P speedup (0.0 to 4.0) for Workloads 01 through 13.]

Quick Review: Measuring and Estimating Speedup

Estimate serial vs. parallel times in workloads

• Allows prediction of speedup upper bounds

Serial applications

• Estimate based on VTune Sampling or Callgraph runs

Parallel applications

• Use Thread Profiler

• Experimental techniques
  • Measuring CPU utilization over time for all processors

Approaching a serial application

1. Pick a workload

2. Establish a scalability target
   • Example: Must have at least 2.5x improvement going from 1 core to 4 cores

3. Estimate amount of parallelization required
   • Dictated by Amdahl’s law
   • Example: 2.5x improvement from 1 core to 4 cores would require 80% of run time to be parallelized
   • Identify areas to parallelize
   • Cannot find areas to meet required amount of parallelization?
     • Reset scalability target and continue parallelization

4. Parallelize and measure speedup

5. Did you meet the scalability target?
   • If not, find the root cause and improve

Repeat for other workloads
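
Step 3 amounts to solving Amdahl’s Law for the parallel fraction. The sketch below does that. The function name `required_parallel_fraction` is hypothetical, and the result ignores parallel overhead, so it is a lower bound on the amount of code that must be threaded.

```c
#include <assert.h>

/* Fraction of single-core run time that must be parallelized to reach
 * `target` speedup on p cores under Amdahl's Law (overhead ignored).
 * A result greater than 1.0 means the target is unreachable on p cores. */
double required_parallel_fraction(double target, int p)
{
    assert(p > 1);
    assert(target >= 1.0);
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / (double)p);
}
```

For the example target, `required_parallel_fraction(2.5, 4)` evaluates to 0.8, the 80% quoted in step 3. A result above 1.0 is the signal to reset the scalability target.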

Approaching a parallel application

1. Pick a workload

2. Estimate expected speedup
   • Amdahl, Gustafson-Barsis

3. Measure speedup

4. Did you meet the expected scaling?
   • If not, find the root cause and improve

Repeat for other workloads

Question 3a: What is the best design for scalability?

Audio processing application

• Left channel computation

• Right channel computation

Question 3b: What is the best design for scalability?

Video stream encoding

• Thread intra-frame?

• Thread groups of pictures?

Question 3c: What is the best design for scalability?

Room Assignment Problem (Simulated Annealing)

Goal: Find most compatible roommate assignments

Method:

• Roomers take interest survey

• Roommates initially chosen at random

• Two people are swapped at random

• Does new assignment increase common interests in roommates (reduce conflict)?
  • If yes, keep new assignment
  • If no, undo the swap, except for a random chance (shrinking over time) of keeping the worse match

• Continue until solution stabilizes

Windows*: Perfmon*

Recommended first set of counters

• “Processor” performance object: % Processor Time, % Privileged Time (for each CPU)

• “System” performance object: Context Switches/sec, System Calls/sec

• “PhysicalDisk” performance object: Disk Read Bytes/sec, Disk Write Bytes/sec (for each disk)

• “Memory” performance object: Pages/sec

• “Network Interface” performance object: Bytes Total/sec (for each network card)

Windows command line tools available

• Logman

• Relog

• Typeperf

Windows*: Fixing Process to Core

Eliminate “noise” from context switches that move a thread to another core and abandon its cached data

Windows Task Manager

• On the “Processes” tab, right-click a process to set its affinity

Windows APIs

• SetProcessAffinityMask

• SetThreadAffinityMask

VTune* call graph

Helps isolate call trees for potential threading

VTune Counter Monitor

Tracks operating system counters over time

Some relevant counters:

• Processor time

• Available memory

• Context switches

Intel Thread Profiler

Identifies

• Serial vs. parallel run times

• Lock contention areas

• Parallel overhead

• Load imbalance

Loop graph viewer

Being considered for VTune 9.0

View program as loop hierarchies
• Loop self times and total times (in terms of instructions retired)
  • Similar to call graph self and total times
• Loop counts

Helps identify loops for coarse grain threading
• Loop hierarchies can span functions and files

PIN tool based prototype
• Currently Linux-only

Quick Review: Tools

No single tool may give you all the answers for scalability issues

• Thread Profiler comes close

Simple tools can provide insight into scalability issues

• Perfmon
  • Monitoring of CPU utilization of processors and application threads

Effects of serial domination

Serially dominated workloads do not scale well

• Amdahl’s Law

How to estimate serial time?

• VTune sampling, VTune Call graph
  • Serial applications

• Thread Profiler, experimental approaches
  • Parallel applications

Case Study 1

76% serial time on DP
• 53% non-concurrent time on UP
• 4P theoretical scaling 1.6X
• 4P measured scaling 1.1X

[Figure: 2P and 4P Speedup (IBM* X440 .NET RC1). Measured 2P speedup and measured 4P speedup (0.0 to 4.0) for Workloads 01 through 13.]

[Figure: Perfmon trace on a dual Xeon™ processor system, 2.8 GHz, Windows* XP. % CPU utilization (0 to 120) for CPU 0 and CPU 1 versus workload run time (1 to 43 sec), with regions labeled Serial, Serial, Parallel.]

Parallelize serial sections for better scalability

Question 4

Profile shows 80% of runtime spent calculating multidimensional FFT

Assume a calling sequence of fft2d → fft1d → fftcc

Where would you thread for better scaling?

gprof profile:

Each sample counts as 0.01 seconds.
  %   cumulative    self                self     total
 time    seconds   seconds     calls  ms/call  ms/call  name
64.71      93.43     93.43  23952910     0.00     0.00  fftcc_
11.47     110.00     16.57  23952910     0.00     0.00  fft1d_
11.41     126.47     16.47       100   164.70  1402.89  ssf_3dcs_
 4.94     133.59      7.13    151600     0.05     0.77  fft2d_
 2.94     137.84      4.25     37900     0.11     0.11  phaseshift3d_
 2.24     141.07      3.23       100    32.30    32.30  imaging_

A: fftcc

B: fft1d

C: fft2d

D: All of the above

Top-Down Design

The iterative “hotspot” tuning process:

Find and fix hot spot…

Find and fix hot spot…

Find and fix hot spot…

The top-down parallelization process:

Find the highest level of natural parallelism…

Top-down approach considers the parallelism of the whole application rather than individual hotspots.

The result is usually a more scalable, parallel application.

Granularity

Loosely defined as the ratio of computation to synchronization

Be sure there is enough work to merit parallel computation

Example: Working on the railroad. How many more workers can be added?

Activity 1: Workload-dependent Scaling

Lab shows a chosen number of spheres bouncing within an enclosed box

• Obey laws of physics for bouncing off walls and colliding with other spheres

User is able to control

• Number of spheres

• Amount of physics computation before rendering

• Whether to run with a single thread or multithreaded
  • Load balance between threads will be explored in a later lab

GUI frames per second displayed is performance metric

Parallel Overhead

Parallel overhead impacts scalability

Thread creation/destruction

• Amount of work vs. overhead

• Thread Pool (Windows*) may be a good solution

Synchronization

• Call overhead

• Transition in and out of kernel space

Possible indicators

• High kernel time

• Thread Profiler
  • Critical Path view showing large overhead times

Example: Threading Quicksort

Algorithm:

• Pick pivot value from elements

• Partition data around pivot
  • Less than or equal to pivot
  • Greater than pivot

• Quicksort the two partitions

    void QuickSort(int p, int r)   // Assume global array of data
    {
        if (p < r) {
            int q = Partition(p, r);
            QuickSort(p, q-1);     // sort less-than
            QuickSort(q+1, r);     // sort greater-than
        }
    }

[Figure: array section from index p to index r with the pivot at index q; less-than-or-equal elements lie to the left of the pivot and greater-than elements to the right.]
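
The slide leaves `Partition` undefined. The sketch below fills it in with a Lomuto-style partition, which is an assumption; the course may have used a different scheme. Unlike the slide, it passes the array explicitly rather than using a global, and the lowercase names are illustrative. The property the recursive calls rely on is that the returned index `q` is the pivot’s final position.

```c
#include <assert.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Lomuto partition of data[p..r] around the last element.
 * Returns the pivot's final index q, with
 * data[p..q-1] <= pivot and data[q+1..r] > pivot. */
int partition(int *data, int p, int r)
{
    int pivot, i, j;
    assert(p <= r);
    pivot = data[r];
    i = p;
    for (j = p; j < r; j++) {
        if (data[j] <= pivot) {
            swap_int(&data[i], &data[j]);
            i++;
        }
    }
    swap_int(&data[i], &data[r]);  /* move pivot into its final slot */
    return i;
}

void quicksort(int *data, int p, int r)
{
    if (p < r) {
        int q = partition(data, p, r);
        quicksort(data, p, q - 1);  /* sort less-than-or-equal side */
        quicksort(data, q + 1, r);  /* sort greater-than side */
    }
}
```

Because the pivot is already in place after partitioning, both recursive calls exclude index `q`, so every call works on a strictly smaller range.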

Example: Threading Quicksort

How about creating threads at each recursive call?

    typedef struct {    // bundles both indices, because a thread
        int s, t;       // function accepts only a single parameter
    } qParams;

    DWORD WINAPI QuickSort(LPVOID pr)
    {
        int p = ((qParams *)pr)->s;
        int r = ((qParams *)pr)->t;
        qParams lo, hi;
        HANDLE hLOHI[2];

        if (p < r) {
            int q = Partition(p, r);
            lo.s = p;   lo.t = q-1;   // same ranges as the serial version
            hi.s = q+1; hi.t = r;
            hLOHI[0] = CreateThread(NULL, 0, QuickSort, (LPVOID) &lo, 0, NULL);
            hLOHI[1] = CreateThread(NULL, 0, QuickSort, (LPVOID) &hi, 0, NULL);
            WaitForMultipleObjects(2, hLOHI, TRUE, INFINITE);
            CloseHandle(hLOHI[0]);    // release thread handles once both finish
            CloseHandle(hLOHI[1]);
        }
        return 0;
    }

Quicksort Performance Results

Is There a More Scalable Quicksort Implementation?

Thread pool to control number of threads

Producer/Consumer relationship with index pair queue
• Dequeue pair struct from queue and partition (Consumer)
• Recursive calls become enqueue of index struct (Producer)

    DWORD WINAPI QuickSort(LPVOID pArg)
    {
        int p, r, q;
        while (1) {
            WaitForSingleObject(hSem, INFINITE);  // wait for a queued pair
            dequeue(&p, &r);
            if (p < r) {
                q = Partition(p, r);
                enqueue(p, q-1);                  // same ranges as the serial version
                enqueue(q+1, r);
                ReleaseSemaphore(hSem, 2, NULL);  // two new pairs are available
            }
        }
        return 0;
    }

Semaphore counts number of pairs in queue

Encapsulation of index pairs done in queue routines

Page 44: Scalability of Threaded Applications Intel Software College


Quicksort Thread Pool Performance


Page 46: Scalability of Threaded Applications Intel Software College


Looking at Load Imbalances

Load imbalance reduces scalability

Why?

• Idle CPU

• Easier to spot on 4P or above
  • On 2P, idle times might be mistaken as “serial” sections

How do you detect this?

• Windows* Perfmon

• Linux* mpstat

• Thread Profiler
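The cost of an idle CPU can be put into numbers. The sketch below is an illustration only and is not part of the course material: it assumes the time each thread spends computing is already known (from a profiler, for instance), and the function name `imbalance_speedup` is hypothetical. Because a parallel region finishes only when its slowest thread does, the best possible speedup is the total work divided by the largest per-thread share.

```c
#include <stddef.h>

/* Upper bound on speedup for a parallel region: the region ends only when
 * the most heavily loaded thread finishes, so
 *     speedup <= (sum of all work) / (largest per-thread work).
 * work[i] is the time thread i spends computing (hypothetical input). */
double imbalance_speedup(const double *work, size_t nthreads)
{
    double total = 0.0, longest = 0.0;

    for (size_t i = 0; i < nthreads; i++) {
        total += work[i];
        if (work[i] > longest)
            longest = work[i];
    }
    return longest > 0.0 ? total / longest : 0.0;
}
```

Four threads with equal shares reach the full 4x bound, while a split in which one thread carries twice the work of each of the others is capped at 2.5x no matter how fast the other three finish.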

Page 47: Scalability of Threaded Applications Intel Software College


Case Study 2

Linux* mpstat CPU data

2P data not suggestive of load imbalance

4P data shows CPUs drop off

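[Figure: mpstat per-CPU utilization over time for the 2P and 4P runs. Region annotations, in order: 2 cores, 1 core, 1 core, 4 cores, 2 cores, 1 core.]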

Page 48: Scalability of Threaded Applications Intel Software College


Case Study 2: Improved

Linux* mpstat CPU data

Second figure shows improvement with load balancing

• 1P-4P scaling improves from 2.1x to 2.7x

[Figure: mpstat per-CPU utilization over time after load balancing. Region annotation: 4 cores.]

Page 49: Scalability of Threaded Applications Intel Software College


Spotting Load Imbalance in Thread Profiler

Differences in Active Thread state


First problem noticed is create/destroy threads for each iteration…

…but there is a difference in Active Thread state within pairs.

Page 50: Scalability of Threaded Applications Intel Software College


Activity 2: Effects of Load Balance in Multi-Threaded Implementation

Use Load Balance control within Basic Physics GUI to control number of spheres assigned to threads

How does this affect FPS measure?


Page 52: Scalability of Threaded Applications Intel Software College


Synchronization

Lost time waiting for locks

Most likely scenario for high contention

• Work inside AND outside protected region is very small

• “Threads pile up” on the lock
  • Symptoms: High context switches/sec, high kernel times

Spotting highly contended synchronization objects

• Thread Profiler

[Figure: timeline for Thread 0 through Thread 3 along a Time axis. Legend: Busy, Idle, In Critical.]

Page 53: Scalability of Threaded Applications Intel Software College


Lock Contention Indicators in Thread Profiler

Large percentage of Locks time

Large amount of Impact time associated with object

Page 54: Scalability of Threaded Applications Intel Software College


Synchronization Primitives
Windows*

Choice of synchronization primitives

• Atomic increments/decrements
  • InterlockedIncrement

• Critical Section, Critical Section with spin count
  • EnterCriticalSection, LeaveCriticalSection, SetCriticalSectionSpinCount
  • Works within a single process

• Events
  • Signal condition has been changed/satisfied

• Mutex
  • Works across processes as well

• Semaphore
  • Works across processes as well

Page 55: Scalability of Threaded Applications Intel Software College


Activity 3: Measuring Synchronization Object Overhead

Determine overhead for using different synchronization objects
• InterlockedIncrement
• CRITICAL_SECTION
• CRITICAL_SECTION with spin count
• Mutex
• Semaphore

CRITICAL_SECTION is used as baseline

• InterlockedIncrement is specialized functionality; others more general
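A portable analogue of this measurement can be sketched with C11 atomics and a POSIX mutex. These are stand-ins chosen here for illustration, not the Windows objects the activity measures, and the function names and the iteration count are assumptions. Each function performs the same number of un-contended increments from a single thread and reports the CPU time it used.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

#define ITERS 1000000L  /* arbitrary iteration count for this sketch */

/* Increment with an atomic read-modify-write (the role InterlockedIncrement
 * plays on Windows). Returns the final count; *seconds receives the CPU time. */
long count_with_atomic(double *seconds)
{
    atomic_long counter = 0;
    clock_t start = clock();

    for (long i = 0; i < ITERS; i++)
        atomic_fetch_add(&counter, 1);

    *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    return atomic_load(&counter);
}

/* Same increments, each wrapped in an un-contended lock/unlock pair. */
long count_with_mutex(double *seconds)
{
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    long counter = 0;
    clock_t start = clock();

    for (long i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }

    *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    pthread_mutex_destroy(&lock);
    return counter;
}
```

Dividing the two reported times gives a relative cost in the same spirit as the activity. The ratio depends on the platform and compiler, so it should be measured rather than assumed.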

Page 56: Scalability of Threaded Applications Intel Software College


Synchronization Primitives Costs: Un-contended

[Figure: bar chart "Lock times relative to InterlockedIncrement", 1P/1C/1T (data in L1 cache), Windows XP 32-bit (MP kernel). X axis (lock type): InterlockedIncrement; Enter+Leave CriticalSection; Enter+Leave CriticalSection with spin count 4000; Acquire+Release Mutex; Acquire+Release Semaphore. Y axis: relative cost, 0.0 to 6.0 (higher is more expensive). Series label: Mrm 2.6 1C/1T. One annotation reads ">50x".]

Use the least expensive synchronization method possible

Page 57: Scalability of Threaded Applications Intel Software College


Lock Contention

Lock contention reduces scalability

Following factors combine to produce contention and reduce scalability

• Amount of work inside vs. outside protected region

• Synchronization primitive costs

• OS context switches during lock contention

Possible indicators (without Thread Profiler)

• High context switches/sec
  • >10,000/s should be investigated

• And high kernel time
  • >20% should be investigated

Watch for high context switches/sec and kernel time
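The two thresholds can be combined into a simple screening check. This is only a sketch: the function name is hypothetical, the inputs are assumed to come from a tool such as Perfmon or mpstat, and the limits are the rule-of-thumb values quoted above rather than hard boundaries.

```c
/* Rule-of-thumb screen: suspect lock contention when BOTH the context
 * switch rate exceeds 10,000 per second AND kernel time exceeds 20%. */
int contention_suspected(double ctx_switches_per_sec, double kernel_time_pct)
{
    return ctx_switches_per_sec > 10000.0 && kernel_time_pct > 20.0;
}
```

A workload showing 40,000 context switches per second and 50% kernel time trips the check; one with 5,000 switches per second does not, whatever its kernel time.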

Page 58: Scalability of Threaded Applications Intel Software College


Reducing Lock Contention

Lock contention reduces scalability – fix?

• Ideally, work inside << work outside
  • Redesign

• Explore use of “spin count” (Windows*)
  • InitializeCriticalSectionAndSpinCount, SetCriticalSectionSpinCount
  • #define _WIN32_WINNT 0x0403 // or higher
  • Spin count = 4000 recommended by Microsoft*
  • Not very portable

Page 59: Scalability of Threaded Applications Intel Software College


Case Study 3

17% serial time on 2P

• 9% serial time on UP

• 4P theoretical scaling 3.1X

• Measured 4P scaling is 0.4X
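The 3.1X figure is consistent with Amdahl's law applied to the 9% serial fraction measured on one processor, and the 17% figure follows from the same model on two processors. The helpers below sketch that arithmetic; the function names are hypothetical, and the model ignores synchronization and memory effects, which is why a measured result can fall so far short of it.

```c
/* Amdahl's law: speedup on n processors when a fraction `serial` (0..1)
 * of the single-processor run time cannot be parallelized. */
double amdahl_speedup(double serial, int n)
{
    return 1.0 / (serial + (1.0 - serial) / n);
}

/* Share of the n-processor run time that is spent in the serial part:
 * the serial time is unchanged while the parallel time shrinks by n. */
double serial_share(double serial, int n)
{
    return serial / (serial + (1.0 - serial) / n);
}
```

With `serial = 0.09`, `amdahl_speedup(0.09, 4)` is about 3.15 and `serial_share(0.09, 2)` is about 0.165, in line with the 3.1X and 17% quoted above.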

Why should we think this is a Synchronization issue?

Clues: 50% kernel time

>40K context switches/sec on 4P

[Figure: "2P and 4P Speedup (IBM* X440 .NET RC1)". Bar chart of measured 2P speedup and measured 4P speedup for Workload 01 through Workload 13. Y axis: speedup, 0.0 to 4.0.]

[Figure: Perfmon*, dual Xeon™ processor 2.8 GHz / Windows* XP. % CPU utilization (0 to 120) of CPU0 and CPU1 against workload run time (sec), with the serial and parallel phases marked. Annotation: pretty good load balance.]

Page 60: Scalability of Threaded Applications Intel Software College


Case Study 3: Speedup

We have a negative scaling problem…

[Figure: bar chart of scaling factor (0.00 to 4.00), 1/2/4 Xeon™ processors / Windows* XP (2.8 GHz / 512K L2 / 2MB iL3)]

              1P      1P/2P   1P/4P
Orig. Code    1.00    1.26    0.68

If adding more threads results in worse performance, there must be some increased contention on a shared resource

Page 61: Scalability of Threaded Applications Intel Software College


Case Study 3: First Approach

Root cause: A class defined a critical section as a static member variable

Solution: Have each instance of class use separate lock by removing static declaration

Before: 4 threads randomly accessing 8 lights with 1 global lock

After: 4 threads randomly accessing 8 lights with 8 private locks

Before

After

Page 62: Scalability of Threaded Applications Intel Software College


Case Study 3: Performance

Observations

• 4P scaling has improved from 0.7x to 1.3x

• There still is much work to do:
  • Now 80,000 context switches/second

• Utilization of each CPU near 75%

[Figure: bar chart of scaling factor (0.00 to 4.00), 1/2/4 Xeon™ processors / Windows* XP (2.8 GHz / 512K L2 / 2MB iL3)]

              1P      1P/2P   1P/4P
Orig. Code    1.00    1.26    0.68
New Code      1.00    1.54    1.32

Page 63: Scalability of Threaded Applications Intel Software College


Case Study 3: Second Approach

Perfmon* observations (4P)

• Almost no serial execution, utilization of each CPU near 50%

• Almost 200,000 Context Switches/sec!

Page 64: Scalability of Threaded Applications Intel Software College


Case Study 3: Diagnosis

Root cause: poor choice of synchronization primitive
• Computation is incrementing a single variable
• Threads contending on single Critical Section object

Solution: Use of “InterlockedIncrement”
• Critical section with spin count is another possibility

[Figure: bar chart of scaling factor (0.00 to 4.00), 1/2/4 Scaling Xeon™ processors / Windows* XP (2 GHz / 512K L2 / 2MB iL3)]

              1P      1P/2P   1P/4P
Orig. Code    1.00    1.40    0.41
New Code      1.00    1.81    3.21

Use the least expensive synchronization method possible

Page 65: Scalability of Threaded Applications Intel Software College


How Much of Data Structure to be Locked?

Example: Array of counts/buckets/pointers (random access)

• Enumeration sort, radix sort, bucket sort

• Hash table

Lock whole structure?
• Easy to implement
• Severely restricts access

Lock individual elements?
• Individual access by different threads
• Extra space in structure

Page 66: Scalability of Threaded Applications Intel Software College


Modulo Locks

Assuming little contention for individual elements

Create array of locks to protect every Kth element
• Fixed number of locks, say 2

• Lock index used to determine which elements are protected
  • To access element Data[Q], thread must hold LOCK[Q % 2]

• Works for 2-D and 3-D arrays
  • For example, with eight locks, accessing A[i,j] would use LOCK[(i+j) % 8]

Set number of locks equal to number of threads
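A minimal sketch of modulo locks follows, using POSIX mutexes as a stand-in for whichever lock type the application already uses. The array size, the lock count of eight, and all function names are assumptions made for illustration.

```c
#include <pthread.h>
#include <stddef.h>

#define N_LOCKS   8     /* fixed number of locks (assumed; tune to thread count) */
#define N_BUCKETS 1024  /* size of the shared array (assumed) */

static long buckets[N_BUCKETS];
static pthread_mutex_t locks[N_LOCKS];

/* Call once, before any thread touches the buckets. */
void modulo_locks_init(void)
{
    for (size_t i = 0; i < N_LOCKS; i++)
        pthread_mutex_init(&locks[i], NULL);
}

/* Element q is protected by lock q % N_LOCKS, so threads updating
 * elements with different remainders never wait on each other. */
void bucket_add(size_t q, long amount)
{
    pthread_mutex_t *lock = &locks[q % N_LOCKS];

    pthread_mutex_lock(lock);
    buckets[q] += amount;
    pthread_mutex_unlock(lock);
}

/* Read an element under the same lock that guards its updates. */
long bucket_get(size_t q)
{
    pthread_mutex_t *lock = &locks[q % N_LOCKS];
    long value;

    pthread_mutex_lock(lock);
    value = buckets[q];
    pthread_mutex_unlock(lock);
    return value;
}
```

The design trades memory for concurrency: eight locks cost far less space than one lock per element, yet two threads collide only when their indices share a remainder.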


Page 68: Scalability of Threaded Applications Intel Software College


Frontside Bus (FSB) Bandwidth

Cores share bus in current Intel® multi-core architectures
• Saturating the bus limits scalability
• Newer independent bus designs improve scalability
• Applies to current SMP platforms too

Good metric to monitor, if
• CPU utilization is close to 100%
• Poor scaling to 4P or 8P
• Low context switches/sec

How do you measure this?
• VTune™ Performance Analyzer
• Compare 1-thread vs. multi-thread VTune runs
  • Look for areas where clock ticks show significant jumps
  • Code inspection

Page 69: Scalability of Threaded Applications Intel Software College


Case Study 4

IPF Madison 1.5 GHz / 9M / 400 MHz

• 1P to 2P scaling: 1.28

• 1P to 4P scaling: 1.27

2P close to FSB saturation

• ~5 GB/s

• Madison 400 MHz bus peak bandwidth is 6.4 GB/s

Solution?

• Change algorithm / data structures to keep data in cache more often

• Easier said than done

Page 70: Scalability of Threaded Applications Intel Software College


Frontside Bus Lab

Intent of this lab

• Observe impact of FSB saturation on scalability using Stream benchmark

• Learn use of appropriate VTune performance event to monitor bus utilization

[Figure: lab platform block diagram. Chipset (MCH) with two front-side buses (Bus 0, Bus 1) leading to two sockets (Woodcrest Socket 0, Woodcrest Socket 1); Core 0 through Core 7, each labeled "2M L2".]

Page 71: Scalability of Threaded Applications Intel Software College


Computing FSB Data Bandwidth
Core 2™ Processor

Bus bandwidth (MB/s) per core = (BDC.ta / CCU.b) * TB

• BDC.ta is the BUS_DRDY_CLOCKS.THIS_AGENT event count
  • Counts the number of bus cycles when data is sent on the bus (the DRDY [Data Ready] signal is asserted on the bus)

• CCU.b is the CPU_CLK_UNHALTED.BUS event count
  • Counts the number of bus cycles that occurred during measurement (bus cycles when core is not halted)

• TB is the theoretical bandwidth of the bus in MB/s = 8 bytes/clock * bus frequency in MHz
  • Example: 2.6 GHz processor with 1333 MHz FSB
  • Theoretical bandwidth = 8 bytes/clock * 1333 MHz = 10664 MB/s

Total bus bandwidth = ∑ “per core” bandwidths

Page 72: Scalability of Threaded Applications Intel Software College


Frontside Bus Lab: VTune Notes

• Where is “BUS_DRDY_CLOCKS.THIS_AGENT” event?
  • Configure Sampling Events Tab -> Event Groups: “External Bus Events”

• View results as Table in VTune
  • Easier to compute bandwidth

• View results per CPU (Show/Hide CPU Info.)

Page 73: Scalability of Threaded Applications Intel Software College


Frontside Bus Lab: Example

Bus bandwidth (MB/s) per core = (BDC.ta / CCU.b) * TB
• TB = 8 bytes/clock * 1333 MHz = 10664 MB/s

Processor 0 BW = (159,204,930 / 1,318,288,023) * 10664 MB/s
               = (0.159 / 1.318) * 10.7 GB/s
               ≈ 1.29 GB/s
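The same arithmetic can be wrapped in a small helper so it is easy to repeat for every core in a VTune table. This is a sketch only: the function name is hypothetical, and the two event counts are assumed to be copied by hand from the sampling results.

```c
/* Per-core FSB data bandwidth in MB/s:
 *   (BUS_DRDY_CLOCKS.THIS_AGENT / CPU_CLK_UNHALTED.BUS) * theoretical bandwidth,
 * with theoretical bandwidth = 8 bytes/clock * bus frequency in MHz. */
double fsb_bandwidth_mb_s(double bus_drdy_clocks, double bus_clk_unhalted,
                          double fsb_mhz)
{
    double theoretical_mb_s = 8.0 * fsb_mhz;
    return bus_drdy_clocks / bus_clk_unhalted * theoretical_mb_s;
}
```

Calling `fsb_bandwidth_mb_s(159204930.0, 1318288023.0, 1333.0)` reproduces the Processor 0 figure of roughly 1.29 GB/s. Summing the result over all cores gives the total bus bandwidth.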

Page 74: Scalability of Threaded Applications Intel Software College


Activity 4: Measuring Frontside Bus Saturation

Intent of this activity

• Observe impact of FSB saturation on scalability using Stream benchmark

• Learn use of appropriate VTune performance event to monitor bus utilization

What you may see
• Scalabilitybus0.bat: ~5 sec / ~1.2 GB/s
• Scalabilitybus0123.bat: ~18 sec / ~4 GB/s

Page 75: Scalability of Threaded Applications Intel Software College


Frontside Bus Lab: 4 Core Examples

Processor 0 BW = (0.308/3.216)*10.7 GB/s ~ 1.02 GB/s

Processor 1 BW = (0.312/3.311)*10.7 GB/s ~ 1.00 GB/s

Processor 2 BW = (0.308/3.224)*10.7 GB/s ~ 1.02 GB/s

Processor 3 BW = (0.311/3.305)*10.7 GB/s ~ 1.00 GB/s

Total BW ~ 1.0*4 = 4 GB/s

Processor 0 BW = (0.309/4.922)*10.7 GB/s ~ 0.67 GB/s

Processor 2 BW = (0.313/4.989)*10.7 GB/s ~ 0.67 GB/s

Processor 4 BW = (0.315/5.027)*10.7 GB/s ~ 0.67 GB/s

Processor 6 BW = (0.315/5.023)*10.7 GB/s ~ 0.67 GB/s

Total BW ~ 0.67*4 = 2.68 GB/s

Page 76: Scalability of Threaded Applications Intel Software College


Frontside Bus Lab: 8 Core Example

Total BUS_DRDY_CLOCKS.THIS_AGENT = 2,524,773,727

Average Total CPU_CLK_UNHALTED.BUS = 52,382,932,480 / 8

= 6,547,866,560

Total Avg. BW = 2,524,773,727 / 6,547,866,560 * 10.7 GB/s

~ 4.11 GB/s

Page 77: Scalability of Threaded Applications Intel Software College


Frontside Bus Lab: Discussion

Almost 3x slower run time from 1 to 4 cores

• Same amount of data transferred by each thread

• Contention for shared bus makes everything run slower

How do clockticks compare in first 2 runs?

• Notice the clock ticks go up significantly in the 4 stream case in the source view as well

Why is there a difference in MB/s reported by Stream vs. what you calculated using VTune?

Page 78: Scalability of Threaded Applications Intel Software College


Frontside Bus Lab: Caution
Measuring BW on Underutilized System

Process or thread migration can break the formula

Example: Single thread Stream allowed to migrate in the lab

Is bandwidth used equal to (3.9 * 4 =) 15.6 GB/s? No: the single thread runs on only one processor at a time, so each column reports the same ~3.9 GB/s stream during that processor's share of the run, and the columns must not be summed.

Dempsey 3.2/2S/2C/1066 MHz - 1 Stream process allowed to migrate

                  Processor0       Processor1       Processor2       Processor3
Bus Data Ready    89,419,935       59,181,430       193,423,450      158,135,505
Clockticks        2,358,400,000    1,545,600,000    5,033,600,000    4,128,000,000
FSB BW MB/s       3927             3890             3928             3942

Best to tie threads to cores for bandwidth analysis

Page 79: Scalability of Threaded Applications Intel Software College


What is False Sharing?

Multiple threads repeatedly write to the same cache line shared by cores
• Usually different data

• Cache lines get invalidated
  • Forces additional reads from memory

• Severe performance impact in tight loops, in general
  • Threads read/write to the same cache line very rapidly

• Good metric to monitor if
  • CPU utilization of all processors very high
  • Poor scaling to 4P or 8P
  • Low context switches/sec

Page 80: Scalability of Threaded Applications Intel Software College


Detecting False Sharing with VTune Analyzer

Core 2® processor-based events:

• MACHINE_NUKES.MEM_ORDER event

• Significant last level cache read misses
  • 2nd Level or 3rd Level Cache Read Misses
  • MEM_LOAD_RETIRED.L2_MISS

• Significant FSB activity
  • BUS_DRDY_CLOCKS.THIS_AGENT

Compare 1-thread vs. multi-thread VTune runs

• Look for areas where clock ticks show significant jumps

• Code inspection

Page 81: Scalability of Threaded Applications Intel Software College


False Sharing Example 1

#define N_THREADS 16
double sum = 0.0, sum_local[N_THREADS];

#pragma omp parallel
{
    int me = omp_get_thread_num();
    sum_local[me] = 0.0;

    #pragma omp for
    for (i = 0; i < N; i++)
        sum_local[me] += x[i] * y[i];

    #pragma omp atomic
    sum += sum_local[me];
}

No overlap of memory access; no sync needed

Each thread can invalidate cache line for others

To fix, declare and use true local sum variable for each thread
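One way to apply that fix is sketched below. The function wrapper and its name are assumptions added so the fragment compiles on its own. The partial sum is declared inside the parallel region, which makes it private to each thread, so no two threads write to neighbouring elements of one shared array.

```c
/* Dot product with a per-thread partial sum held in a block-scoped local.
 * `local` lives in a register or on the thread's own stack, so the
 * accumulation loop never touches a cache line another thread writes. */
double dot_product(const double *x, const double *y, long n)
{
    double sum = 0.0;

    #pragma omp parallel
    {
        double local = 0.0;          /* private: declared inside the region */

        #pragma omp for
        for (long i = 0; i < n; i++)
            local += x[i] * y[i];

        #pragma omp atomic
        sum += local;                /* one synchronized update per thread */
    }
    return sum;
}
```

Build with the compiler's OpenMP option (for example `-fopenmp` with GCC); without it the pragmas are ignored and the function runs serially with the same result. OpenMP's `reduction(+:sum)` clause is another way to express the same pattern.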

Page 82: Scalability of Threaded Applications Intel Software College


False Sharing Example 2

Normalization of an array of spatial vectors (double precision)

• 10,000 vectors (<256K size; fits in L2)

• 5⅓ vectors per cache line

False sharing case

• Round-Robin distribution

• Each thread works on “start index + i*Num_Threads”

No false sharing case

• Each thread works on a block of data

• Block per thread = Array size / Num_Threads

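[Figure: false sharing case. Vectors V0, V1, V2, ... V9 are dealt out to Thread 0, Thread 1, Thread 2, Thread 3 in round-robin order.]

[Figure: no false sharing case. The array V0, V1, V2, … V2499, V2500, … V4999, V5000, … is split into contiguous blocks, one each for Thread 0, Thread 1, Thread 2, Thread 3.]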
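The difference between the two distributions can be checked without running any threads by counting how many cache lines hold elements that belong to more than one thread. The sketch below does that. All names are hypothetical, the cache-line size is a parameter rather than a value taken from the slides, and the array is assumed to start on a cache-line boundary.

```c
#include <stddef.h>

/* Thread that owns element i when elements are dealt out round-robin. */
static size_t owner_round_robin(size_t i, size_t nthreads)
{
    return i % nthreads;
}

/* Thread that owns element i when each thread gets one contiguous block
 * of ceil(n / nthreads) elements. */
static size_t owner_block(size_t i, size_t n, size_t nthreads)
{
    size_t block = (n + nthreads - 1) / nthreads;
    return i / block;
}

/* Count cache lines containing elements owned by more than one thread.
 * Assumes the array starts on a cache-line boundary. */
size_t shared_lines(size_t n, size_t elem_size, size_t line_size,
                    size_t nthreads, int use_blocks)
{
    size_t bytes = n * elem_size;
    size_t shared = 0;

    for (size_t start = 0; start < bytes; start += line_size) {
        size_t end = start + line_size < bytes ? start + line_size : bytes;
        size_t first = start / elem_size;      /* first element touching the line */
        size_t last = (end - 1) / elem_size;   /* last element touching the line */
        size_t first_owner = use_blocks ? owner_block(first, n, nthreads)
                                        : owner_round_robin(first, nthreads);

        for (size_t i = first + 1; i <= last; i++) {
            size_t owner = use_blocks ? owner_block(i, n, nthreads)
                                      : owner_round_robin(i, nthreads);
            if (owner != first_owner) {
                shared++;
                break;
            }
        }
    }
    return shared;
}
```

For 10,000 vectors of 24 bytes, a 64-byte line, and four threads, the round-robin layout leaves every line shared, while the block layout shares only the few lines that straddle a block boundary.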

Page 83: Scalability of Threaded Applications Intel Software College


False Sharing Example 2 – Effects on Speedup (2S/2C Dempsey; HT off)

[Figure: "Effects of False Sharing", Dempsey 3.2 2S/2C, Windows 2003 Server 64-bit, ScalabilityLab-FS.exe 32-bit. Speedup (0.0 to 4.0) against number of cores/threads (0 to 4) for two series: No False Sharing and With False Sharing.]

Page 84: Scalability of Threaded Applications Intel Software College


Activity 5: Identifying False Sharing

Intent of this activity

• Observe impact of false sharing on scalability

• Learn use of appropriate VTune performance events
  • Compare and contrast false sharing vs. no false sharing

Page 85: Scalability of Threaded Applications Intel Software College


Activity 5: Discussion
Typical Results

Sum of events on all cores

Event                         1T            4T-FS          4T-NOFS
CPU_CLK_UNHALTED.CORE         38.6 E+09     173.2 E+09     37.1 E+09
INST_RETIRED.ANY              30.5 E+09     30.5 E+09      28.5 E+09
MACHINE_NUKES.MEM_ORDER       1.92 E+06     105.48 E+06    0.04 E+06
MEM_LOAD_RETIRED.L2_MISS      0.020 E+06    75.557 E+06    0.185 E+06
BUS_DRDY_CLOCKS.THIS_AGENT    8.4 E+06      2454.5 E+06    7.3 E+06

No false sharing in single thread execution

MACHINE_NUKES.MEM_ORDER counts events most likely due to false sharing

• Cache misses can be indication of problems

Page 86: Scalability of Threaded Applications Intel Software College


Effects of On-die Shared Cache

Last level cache (LLC) size, shared vs. not shared

• Dempsey: 2MB L2 not shared

• Merom/Woodcrest: 4MB L2 shared

• Clovertown: 8MB L2 (4MB per die shared)

Cache-sensitive applications run better when their threads are placed on cores that do not share a cache

Figure: chipset and L2 cache arrangement for the Dempsey, Woodcrest, and Clovertown platforms.

Detecting Effects of On-die Shared Cache

VTune sampling

• LLC misses increase significantly when threads run on the same socket vs. on different sockets

Experiments with single-socket vs. multi-socket runs show differences in scaling

Thread affinity may be required to correct the performance problem

Paying Attention to NUMA Issues

NUMA may affect scalability

• Non-Uniform Memory Access

• Adds extra memory layer to locate data
  • Registers
  • Cache
  • Memory
  • “Far” Memory

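Figure: platform memory topologies. One shows nodes with local memory (MEM) joined by a cache-coherent interconnect; the others show a chipset with a Dual Independent Bus and a chipset with a single FSB, each with memory attached to the chipset.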

Paying Attention to NUMA Issues (contd.)

NUMA-related scalability issues depend on

• Platform design

• Whether the OS is NUMA-aware
  • NUMA-aware OSes: Windows* Server 2003 and Linux 2.6 kernel

• Whether the application is NUMA-aware

Check for NUMA issues if

• Scaling falls off when going from SMP to NUMA

• Context switches/sec are low

• Application is memory latency sensitive

How do you detect this?

• Knowledge of platform architecture

• Through experimentation
  • Tie threads to different cores to measure performance

• Measure memory latency ratio between “near” and “far” memory

OS and Application NUMA Support

Definition of node

• Own processors and memory

• Connected to the larger system through a cache-coherent interconnect

Role of NUMA-aware OS

• Schedule threads on processors in the same node as memory being used

• Satisfy memory-allocation requests from within the node
  • But will allocate memory from other nodes if necessary

Role of NUMA-aware applications

• Use of NUMA APIs
  • Topology of nodes
  • Memory per node

• Use of Affinity Mask APIs
  • SetThreadAffinityMask, SetProcessAffinityMask
  • Keep threads sharing memory on the same node

Watching I/O

I/O impacts scalability

• CPU likely to be idle

• Check for I/O to disk and network

How do you detect this?

• Windows* Perfmon

• Linux* vmstat, sar, iostat

Case Study 5

Linux* mpstat CPU data, sar I/O data

Correlation between disk write peaks and CPU utilization troughs

When I/O is reduced using application configuration options

• 1P-4P scaling improves from 1.9x to 2.9x

Striped or RAID disk configurations could have helped

Overlapped I/O implementation in application
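
To put the two scaling figures in perspective, the sketch below converts them to parallel efficiency and to the experimentally determined serial fraction (the Karp-Flatt metric). Neither quantity appears in the case study; both are added here as an illustration, and the function names are assumptions.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / processors


def karp_flatt_serial_fraction(speedup, processors):
    """Karp-Flatt metric: the serial fraction implied by a measured speedup.

    e = (1/S - 1/p) / (1 - 1/p); defined for p > 1.
    """
    return (1.0 / speedup - 1.0 / processors) / (1.0 - 1.0 / processors)


PROCESSORS = 4  # the case study reports 1P-4P scaling

for label, speedup in (("before reducing I/O", 1.9),
                       ("after reducing I/O", 2.9)):
    eff = parallel_efficiency(speedup, PROCESSORS)
    serial = karp_flatt_serial_fraction(speedup, PROCESSORS)
    print(f"{label}: speedup {speedup}x, efficiency {eff:.1%}, "
          f"implied serial fraction {serial:.1%}")
```

In this reading, time spent waiting on I/O behaves like serial work: reducing it lowers the implied serial fraction and raises efficiency, even though the compute code is unchanged.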

Factors Inhibiting Scalability Summary

Serially dominated workload

Choice of synchronization primitives and lock contention

Granularity

Parallel overhead

I/O (Disk and Network)

Load Imbalance

High front side bus utilization

Memory related

• NUMA

• False sharing

• Shared cache effects

Figure labels: Application Domain; Platform/CPU.

Backup

MESI protocol

Every cache line is marked with one of the following four states (coded in two additional bits):

• M - Modified: The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state.

• E - Exclusive: The cache line is present only in the current cache, but is clean; it matches main memory.

• S - Shared: Indicates that this cache line may be stored in other caches of the machine.

• I - Invalid: Indicates that this cache line is invalid.

A cache may satisfy a read from any state except Invalid. An Invalid line must be fetched (to the Shared or Exclusive states) to satisfy a read.

A write may only be performed if the cache line is in the Modified or Exclusive state. If it is in the Shared state, all other cached copies must be invalidated first. This is typically done by a broadcast operation.

A cache may discard a non-Modified line at any time, changing to the Invalid state. A Modified line must be written back first.

MESI protocol (contd.)

A cache that holds a line in the Modified state must snoop (intercept) all attempted reads (from all of the other CPUs in the system) of the corresponding main memory location and insert the data that it holds. This is typically done by forcing the read to back off (i.e. to abort the memory bus transaction), then writing the data to main memory and changing the cache line to the Shared state.

A cache that holds a line in the Shared state must also snoop all invalidate broadcasts from other CPUs, and discard the line (by moving it into Invalid state) on a match.

A cache that holds a line in the Exclusive state must also snoop all read transactions from all other CPUs, and move the line to Shared state on a match.

The Modified and Exclusive states are always precise, i.e. they match the true cacheline ownership situation in the system. The Shared state may be imprecise: if another CPU discards a Shared line, and this CPU becomes the sole owner of that cacheline, the line will not be promoted to Exclusive state, because broadcasting all cacheline replacements from all CPUs is not practical over a broadcast snoop bus.
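
The transitions described on the two MESI slides can be sketched as a small simulation of a single cache line held by several caches. This is a simplified model, not a description of any particular processor: the class name `MesiLine` and the write-back counter are hypothetical, and a write from the Invalid state is assumed to behave like a fetch that invalidates all other copies, which the slides do not spell out.

```python
class MesiLine:
    """Tracks the MESI state of one cache line in each of several caches."""

    def __init__(self, num_caches):
        self.states = ["I"] * num_caches
        self.writebacks = 0  # Modified lines written back to main memory

    def read(self, cpu):
        if self.states[cpu] != "I":
            return  # M, E and S can all satisfy a read locally
        others_hold_copy = False
        for other, state in enumerate(self.states):
            if other == cpu or state == "I":
                continue
            others_hold_copy = True
            if state == "M":
                self.writebacks += 1  # snooped read forces a write-back
            self.states[other] = "S"  # M and E holders drop to Shared
        self.states[cpu] = "S" if others_hold_copy else "E"

    def write(self, cpu):
        if self.states[cpu] in ("S", "I"):
            # All other cached copies must be invalidated first.
            for other, state in enumerate(self.states):
                if other == cpu:
                    continue
                if state == "M":
                    self.writebacks += 1  # assumed: write back before invalidating
                self.states[other] = "I"
        self.states[cpu] = "M"


line = MesiLine(num_caches=2)
for op, cpu in (("read", 0), ("read", 1), ("write", 0), ("read", 1), ("write", 1)):
    getattr(line, op)(cpu)
    print(f"{op:5s} by CPU {cpu}: states = {line.states}")
```

Two threads that alternately write to the same line, as in false sharing, keep forcing the M-to-I and I-to-M transitions in the last steps of this trace, which is where the extra bus traffic comes from.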

Linux*: vmstat

Quick way to watch for

• Disk I/O

• Overall CPU utilization

• Swap

• Context switches

vmstat -n 1

• -n: print header only once

• 1: output data every 1 sec

Linux*: sar

May not be installed by default

• sysstat package (CD 3 RPMS directory, RHEL3 U2)

Monitors

• Disk I/O, all CPUs, swap, network traffic, interrupts

sar -U ALL -bWw -o <binfile> 1 0

Report statistics, 1 sec interval, forever

-U ALL         report on all CPUs
-b             aggregated disk I/O (for more details use iostat)
-W             swap statistics
-w             context switches/s
-o <binfile>   write to binary file
-f <binfile>   read from binary file

Linux*: sar (contd.)

Using sar in application launch scripts

export PATH=$PATH:/sbin

sar -U ALL -bWw -o app.sar 1 0 &

<launch your app>

kill -9 `pidof sar` (could use: killall -9 sar)

kill -9 `pidof sadc` (could use: killall -9 sadc)

sar seems more expensive

• Time gaps in reports even if 1 sec output is requested

• Needs more investigation

Linux*: mpstat

mpstat -P ALL 1

-P ALL   all processors

Using mpstat in application launch scripts

export PATH=$PATH:/sbin

mpstat -P ALL 1 >mpstat.out &

<launch your app>

kill -9 `pidof mpstat` (could use: killall -9 mpstat)

Linux*: top

Useful interactive mode options

• Press the appropriate keys

s    changes delay between updates

u    selects only specified user’s processes

H    show threads & utilization (toggle)
     (shows CPU on which thread is scheduled)

i    idle processes or threads (toggle)

-b   batch mode (a command-line option rather than an interactive key)

• Does not report threads

Linux*: using processor affinity

schedutils package

• “taskset” command

Affinity system call APIs

• sched_setaffinity
  • In 2.6 kernels
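
As a minimal sketch of the affinity call in use, the example below goes through Python's `os.sched_getaffinity` and `os.sched_setaffinity` wrappers, which are available on Linux only. The helper name `pin_current_process` is hypothetical, and choosing the lowest-numbered allowed CPU is an arbitrary choice for the demonstration.

```python
import os


def pin_current_process(cpu):
    """Pin the caller to a single CPU and return the previous CPU set."""
    previous = os.sched_getaffinity(0)  # pid 0 refers to the caller
    if cpu not in previous:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(previous)}")
    os.sched_setaffinity(0, {cpu})
    return previous


if hasattr(os, "sched_setaffinity"):  # the wrappers exist on Linux only
    target = min(os.sched_getaffinity(0))
    saved = pin_current_process(target)
    print("now restricted to CPUs:", sorted(os.sched_getaffinity(0)))
    os.sched_setaffinity(0, saved)  # restore the original mask
    print("restored to CPUs:", sorted(os.sched_getaffinity(0)))
else:
    print("sched_setaffinity is not available on this platform")
```

For experiments like those suggested in the NUMA and shared-cache slides, the same call can place threads on cores that do or do not share a cache or a node, so that the two placements can be timed against each other.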