ISPASS 2011

Characterizing Multi-threaded Applications based on Shared-Resource Contention

Tanima Dey, Wei Wang, Jack W. Davidson, Mary L. Soffa

Department of Computer Science

University of Virginia

Motivation

The number of cores doubles every 18 months
Expected: performance scales with the number of cores
One of the bottlenecks is shared-resource contention
For multi-threaded workloads, contention is unavoidable
To reduce contention, it is necessary to understand where and how the contention is created

Shared Resource Contention in Chip-Multiprocessors

[Diagram: Intel Quad Core Q9550. Cores C0-C3, four L1 caches, two L2 caches, and memory reached over the Front-Side Bus.]

Scenario 1: Multi-threaded applications with co-runner

[Diagram: cores C0-C3, four L1 caches, two L2 caches, memory. Legend: Application 1 thread, Application 2 thread.]

Scenario 2: Multi-threaded applications without co-runner

[Diagram: cores C0-C3, four L1 caches, two L2 caches, memory. Legend: Application thread.]

Shared-Resource Contention

Intra-application contention: contention among threads from the same application (no co-runners)

Inter-application contention: contention among threads from co-running applications

Contributions

A general methodology to evaluate a multi-threaded application's performance
  Intra-application contention
  Inter-application contention
  Contention in the memory-hierarchy shared resources
Characterizing applications facilitates better understanding of the application's resource sensitivity
Thorough performance analyses and characterization of multi-threaded PARSEC benchmarks

Outline

Motivation
Contributions
Methodology
Measuring intra-application contention
Measuring inter-application contention
Related Work
Summary

Methodology

Designed to measure both intra- and inter-application contention for a targeted shared resource
  L1-cache, L2-cache
  Front Side Bus (FSB)
Each application is run in two configurations
  Baseline: threads do not share the targeted resource
  Contention: threads share the targeted resource
Multiple instances of the targeted resource
Determine contention by comparing performance (gathering hardware performance counters' values)
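The two configurations can be illustrated with a small core-selection sketch. It is only an illustration: the slides do not say how threads were bound to cores, and the core-to-L2 mapping `L2_OF_CORE` and both helper functions are hypothetical names introduced here. The idea is that a baseline run picks cores that each use a different instance of the targeted resource, while a contention run picks cores that all use the same instance.

```python
from collections import defaultdict

# Hypothetical topology (an assumption, not taken from the slides):
# core id -> id of the L2 cache that core uses.
L2_OF_CORE = {0: 0, 1: 0, 2: 1, 3: 1}


def group_by_resource(resource_of_core):
    """Group core ids by the resource instance they use."""
    groups = defaultdict(list)
    for core, resource in sorted(resource_of_core.items()):
        groups[resource].append(core)
    return dict(groups)


def baseline_cores(resource_of_core, n_threads):
    """Pick n_threads cores that each use a different resource instance."""
    groups = group_by_resource(resource_of_core)
    if n_threads > len(groups):
        raise ValueError("not enough resource instances for a baseline run")
    return [cores[0] for cores in list(groups.values())[:n_threads]]


def contention_cores(resource_of_core, n_threads):
    """Pick n_threads cores that all use the same resource instance."""
    for cores in group_by_resource(resource_of_core).values():
        if len(cores) >= n_threads:
            return cores[:n_threads]
    raise ValueError("no resource instance is shared by enough cores")


if __name__ == "__main__":
    print("baseline:  ", baseline_cores(L2_OF_CORE, 2))
    print("contention:", contention_cores(L2_OF_CORE, 2))
```

On Linux, the selected core lists could then be applied with a mechanism such as `os.sched_setaffinity` or `taskset`. That is one possible way to pin threads, not necessarily the one used in this work.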

Outline

Motivation
Contributions
Methodology
Measuring intra-application contention (See paper)
Measuring inter-application contention
Related Work
Summary

Measuring inter-application contention: L1-cache

[Two diagrams, Baseline Configuration and Contention Configuration, each showing cores C0-C3, four L1 caches, two L2 caches, and memory. Legend: Application 1 thread, Application 2 thread.]

Measuring inter-application contention: L2-cache

[Two diagrams, Baseline Configuration and Contention Configuration, each showing cores C0-C3, four L1 caches, two L2 caches, and memory. Legend: Application 1 thread, Application 2 thread.]

Measuring inter-application contention: FSB

[Two diagrams, Baseline Configuration and Contention Configuration, each showing memory and eight cores in two groups (C0, C2, C4, C6 and C1, C3, C5, C7), with four L1 caches and two L2 caches per group. Legend: Application 1 thread, Application 2 thread.]

PARSEC Benchmarks

Application Domain    Benchmark(s)
Financial Analysis    Blackscholes (BS), Swaptions (SW)
Computer Vision       Bodytrack (BT)
Engineering           Canneal (CN)
Enterprise Storage    Dedup (DD)
Animation             Facesim (FA), Fluidanimate (FL)
Similarity Search     Ferret (FE)
Rendering             Raytrace (RT)
Data Mining           Streamcluster (SC)
Media Processing      Vips (VP), X264 (X2)

Experimental platform

Platform 1: Yorkfield
  Intel Quad core Q9550
  32 KB L1-D and L1-I cache
  6 MB L2-cache
  2 GB memory
  Common FSB

[Diagram: cores C0-C3, each with an L1 cache and an L1 HW-PF; two L2 caches, each with an L2 HW-PF and an FSB interface; FSB; Memory Controller Hub (Northbridge); MB; memory.]

Experimental platform

Platform 2: Harpertown

[Diagram: eight cores in two groups (C0, C2, C4, C6 and C1, C3, C5, C7). Each core has an L1 cache and an L1 HW-PF; each group has two L2 caches, each with an L2 HW-PF and an FSB interface; two FSB links; Memory Controller Hub (Northbridge); MB; memory.]

Performance Analysis: Inter-application contention

For the i-th co-runner:

\[
\mathrm{PercentPerformanceDifference}_i = \frac{\left(\mathrm{PerformanceBase}_i - \mathrm{PerformanceContend}_i\right) \times 100}{\mathrm{PerformanceBase}_i}
\]

Absolute performance difference sum:

\[
\mathrm{APDS} = \sum_i \left|\mathrm{PercentPerformanceDifference}_i\right|
\]
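As a worked illustration of the two formulas above, the sketch below computes the per-co-runner percent difference and the APDS. The function names and the measurement values are made up for the example; the performance metric itself (for instance a value derived from hardware performance counters) is treated as an opaque number.

```python
def percent_performance_difference(perf_base, perf_contend):
    """(PerformanceBase_i - PerformanceContend_i) * 100 / PerformanceBase_i."""
    if perf_base == 0:
        raise ValueError("baseline performance must be non-zero")
    return (perf_base - perf_contend) * 100.0 / perf_base


def apds(measurements):
    """Sum of absolute percent differences over all co-runners.

    measurements: iterable of (perf_base, perf_contend) pairs,
    one pair per co-runner.
    """
    return sum(
        abs(percent_performance_difference(base, contend))
        for base, contend in measurements
    )


if __name__ == "__main__":
    # Made-up (baseline, contention) values for three co-runners.
    runs = [(200.0, 190.0), (100.0, 104.0), (50.0, 50.0)]
    for base, contend in runs:
        print(percent_performance_difference(base, contend))
    print("APDS:", apds(runs))
```

Taking the absolute value before summing means that differences of opposite sign across co-runners do not cancel each other out.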

Inter-application contention: L1-cache, for Streamcluster

[Bar chart: Inter-application L1-cache Contention. X-axis: co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Swaptions, Vips, X264). Y-axis: Performance Difference (%), from -8 to 8.]

Inter-application L1-cache contention: Streamcluster

[Bar chart: Inter-application L1-cache Contention. X-axis: co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Streamcluster, Swaptions, Vips, X264). Y-axis: Performance Difference (%), from -8 to 8.]

Inter-application contention: L1-cache

Inter-application contention: L2-cache

Inter-application contention: FSB

Characterization

Benchmark       L1-cache   L2-cache       FSB
Blackscholes    none       none           none
Bodytrack       inter      inter          intra
Canneal         intra      inter          intra
Dedup           inter      intra, inter   intra, inter
Facesim         inter      inter          intra
Ferret          intra      intra, inter   intra
Fluidanimate    inter      inter          intra
Raytrace        none       none           intra
Streamcluster   inter      inter          intra
Swaptions       none       none           none
Vips            intra      inter          inter
X264            inter      intra, inter   intra
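One way to make use of this characterization is to encode the table as data and query it, for example when looking for benchmarks that are sensitive to co-runners on a given resource. The sketch below is only an illustration: the cell values are copied from the table, while the names `ROWS`, `TABLE`, `benchmarks_with`, and `contention_free` are hypothetical helpers introduced here.

```python
RESOURCES = ("L1-cache", "L2-cache", "FSB")

# Rows copied from the characterization table: name, L1-cache, L2-cache, FSB.
ROWS = [
    ("Blackscholes", "none", "none", "none"),
    ("Bodytrack", "inter", "inter", "intra"),
    ("Canneal", "intra", "inter", "intra"),
    ("Dedup", "inter", "intra, inter", "intra, inter"),
    ("Facesim", "inter", "inter", "intra"),
    ("Ferret", "intra", "intra, inter", "intra"),
    ("Fluidanimate", "inter", "inter", "intra"),
    ("Raytrace", "none", "none", "intra"),
    ("Streamcluster", "inter", "inter", "intra"),
    ("Swaptions", "none", "none", "none"),
    ("Vips", "intra", "inter", "inter"),
    ("X264", "inter", "intra, inter", "intra"),
]


def parse_cell(cell):
    """Turn a table cell such as 'intra, inter' into a set of kinds."""
    return set() if cell == "none" else {kind.strip() for kind in cell.split(",")}


TABLE = {
    name: dict(zip(RESOURCES, map(parse_cell, cells)))
    for name, *cells in ROWS
}


def benchmarks_with(resource, kind):
    """Benchmarks listed with the given contention kind for a resource."""
    return [name for name, row in TABLE.items() if kind in row[resource]]


def contention_free():
    """Benchmarks listed as 'none' for every resource."""
    return [name for name, row in TABLE.items() if not any(row.values())]


if __name__ == "__main__":
    print(benchmarks_with("FSB", "inter"))
    print(contention_free())
```

A contention-aware scheduler could, for instance, use such a lookup to avoid co-locating two benchmarks that both show inter-application contention on the same resource. That policy is an assumption for illustration and is not described in the slides.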

Summary

The methodology generalizes contention analysis of multi-threaded applications
New approach to characterize applications
Useful for performance analysis of existing and future architectures or benchmarks
Helpful for creating new workloads of diverse properties
Provides insights for designing improved contention-aware scheduling methods

Related Work

Cache contention
  Knauerhase et al., IEEE Micro 2008
  Zhuravlev et al., ASPLOS 2010
  Xie et al., CMP-MSI 2008
  Mars et al., HiPEAC 2011
Characterizing parallel workload
  Jin et al., NASA Technical Report 2009
PARSEC benchmark suite
  Bienia et al., PACT 2008
  Bhadauria et al., IISWC 2009

Thank you!
