Tanima ISPASS2011 presentation (ispass.org/ispass2011/slides/2_4.pdf)

Tanima Dey, Wei Wang, Jack W. Davidson, Mary L. Soffa. Department of Computer Science, University of Virginia. ISPASS 2011


Page 1: Tanima ISPASS2011 presentation (ispass.org/ispass2011/slides/2_4.pdf)

Tanima Dey, Wei Wang, Jack W. Davidson, Mary L. Soffa

Department of Computer Science, University of Virginia

ISPASS 2011

Page 2:

Motivation

The number of cores doubles every 18 months

Expected: performance scales with the number of cores

One of the bottlenecks is shared resource contention

For multi-threaded workloads, contention is unavoidable

To reduce contention, it is necessary to understand where and how the contention is created

Page 3:

Shared Resource Contention in Chip-Multiprocessors

[Figure: Intel Quad Core Q9550. Cores C0, C1, C2, C3; four L1 caches; two L2 caches; Front-Side Bus; Memory. Legend: Application 1 thread, Application 2 thread.]

Page 4:

Scenario 1

Multi-threaded applications with co-runner

[Figure: cores C0, C1, C2, C3; four L1 caches; two L2 caches; Memory. Legend: Application 1 thread, Application 2 thread.]

Page 5:

Scenario 2

Multi-threaded applications without co-runner

[Figure: cores C0, C1, C2, C3; four L1 caches; two L2 caches; Memory. Legend: Application thread.]

Page 6:

Shared-Resource Contention

Intra-application contention
  Contention among threads from the same application (no co-runners)

Inter-application contention
  Contention among threads from the co-running application

Page 7:

Contributions

A general methodology to evaluate a multi-threaded application's performance
  Intra-application contention
  Inter-application contention
  Contention in the memory-hierarchy shared resources

Characterizing applications facilitates better understanding of the application's resource sensitivity

Thorough performance analyses and characterization of multi-threaded PARSEC benchmarks

Page 8:

Outline

Motivation
Contributions
Methodology
Measuring intra-application contention
Measuring inter-application contention
Related Work
Summary

Page 9:

Methodology

Designed to measure both intra- and inter-application contention for a targeted shared resource
  L1-cache, L2-cache
  Front Side Bus (FSB)

Each application is run in two configurations
  Baseline: threads do not share the targeted resource
  Contention: threads share the targeted resource

Multiple instances of the targeted resource

Determine contention by comparing performance (gathering hardware performance counters' values)
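The baseline and contention configurations amount to a choice of which cores the threads run on. The following Python sketch illustrates that choice for an L2 cache as the targeted resource. It is only an illustration: the topology map `L2_OF_CORE` (which cores share which L2 cache) and the helper name `placements` are assumptions made for this example, not details taken from the slides.

```python
from itertools import combinations

# Hypothetical topology: core id -> id of the L2 cache that core uses.
# Assumes a quad-core part whose cores pair up on two L2 caches.
L2_OF_CORE = {0: 0, 1: 0, 2: 1, 3: 1}


def placements(l2_of_core):
    """Split all two-core placements into baseline and contention sets.

    Baseline: the two threads use different L2 caches (no sharing).
    Contention: the two threads use the same L2 cache (sharing).
    """
    baseline, contention = [], []
    for a, b in combinations(sorted(l2_of_core), 2):
        if l2_of_core[a] == l2_of_core[b]:
            contention.append((a, b))
        else:
            baseline.append((a, b))
    return baseline, contention


baseline, contention = placements(L2_OF_CORE)
print("baseline placements:", baseline)
print("contention placements:", contention)
```

On Linux, a chosen pair of cores could then be applied to the running threads with a CPU-affinity mechanism such as `taskset` or Python's `os.sched_setaffinity`; the slides do not say which mechanism the authors used.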

Page 10:

Outline

Motivation
Contributions
Methodology
Measuring intra-application contention (See paper)
Measuring inter-application contention
Related Work
Summary

Page 11:

Measuring inter-application contention: L1-cache

[Figure: two diagrams, Baseline Configuration and Contention Configuration. Each shows cores C0, C1, C2, C3; four L1 caches; two L2 caches; Memory. Legend: Application 1 thread, Application 2 thread.]

Page 12:

Measuring inter-application contention: L2-cache

[Figure: two diagrams, Baseline Configuration and Contention Configuration. Each shows cores C0, C1, C2, C3; four L1 caches; two L2 caches; Memory. Legend: Application 1 thread, Application 2 thread.]

Page 13:

Measuring inter-application contention: FSB

[Figure: Baseline Configuration. Cores C0, C2, C4, C6 and C1, C3, C5, C7, each group with four L1 caches; four L2 caches; Memory. Legend: Application 1 thread, Application 2 thread.]

Page 14:

Measuring intra-application contention: FSB

[Figure: Contention Configuration. Cores C0, C2, C4, C6 and C1, C3, C5, C7, each group with four L1 caches; four L2 caches; Memory. Legend: Application 1 thread, Application 2 thread.]

Page 15:

PARSEC Benchmarks

Application Domain    Benchmark(s)
Financial Analysis    Blackscholes (BS), Swaptions (SW)
Computer Vision       Bodytrack (BT)
Engineering           Canneal (CN)
Enterprise Storage    Dedup (DD)
Animation             Facesim (FA), Fluidanimate (FL)
Similarity Search     Ferret (FE)
Rendering             Raytrace (RT)
Data Mining           Streamcluster (SC)
Media Processing      Vips (VP), X264 (X2)

Page 16:

Experimental platform

Platform 1: Yorkfield
  Intel Quad core Q9550
  32 KB L1-D and L1-I cache
  6MB L2-cache
  2GB Memory
  Common FSB

[Figure: cores C0, C1, C2, C3, each with an L1 cache and L1 HW-PF; two L2 caches, each with an L2 HW-PF and an FSB interface; FSB; Memory Controller Hub (Northbridge); Memory.]

Page 17:

Experimental platform

Platform 2: Harpertown

[Figure: cores C0, C2, C4, C6 and C1, C3, C5, C7, each with an L1 cache and L1 HW-PF; L2 caches, each with an L2 HW-PF and an FSB interface; FSB; Memory Controller Hub (Northbridge); Memory.]

Page 18:

Performance Analysis

Inter-application contention

For the i-th co-runner:

$$\text{PercentPerformanceDifference}_i = \frac{(\text{PerformanceBase}_i - \text{PerformanceContend}_i) \times 100}{\text{PerformanceBase}_i}$$

Absolute performance difference sum:

$$\text{APDS} = \sum_i \left| \text{PercentPerformanceDifference}_i \right|$$
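The two formulas on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the function names are made up for this example, and the sample measurements are invented numbers, not results from the presentation.

```python
def percent_performance_difference(base, contend):
    """Return (base - contend) * 100 / base for one co-runner."""
    if base == 0:
        raise ValueError("baseline performance must be non-zero")
    return (base - contend) * 100.0 / base


def apds(base_values, contend_values):
    """Absolute performance difference sum over all co-runners."""
    if len(base_values) != len(contend_values):
        raise ValueError("need one baseline and one contention value per co-runner")
    return sum(
        abs(percent_performance_difference(b, c))
        for b, c in zip(base_values, contend_values)
    )


# Invented measurements for three co-runners (illustrative only).
base = [100.0, 200.0, 50.0]
contend = [110.0, 190.0, 50.0]

diffs = [percent_performance_difference(b, c) for b, c in zip(base, contend)]
print(diffs)                # [-10.0, 5.0, 0.0]
print(apds(base, contend))  # 15.0
```

Taking the absolute value before summing means that co-runners which shift the metric in opposite directions do not cancel each other out in the total.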

Page 19:

Inter-application contention: L1-cache, for Streamcluster

[Figure: "Inter-application L1-cache Contention". Y-axis: Performance Difference (%), ticks from -8 to 8. X-axis: co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Swaptions, Vips, X264).]

Page 20:

Inter-application L1-cache contention: Streamcluster

[Figure: "Inter-application L1-cache Contention". Y-axis: Performance Difference (%), ticks from -8 to 8. X-axis: co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Streamcluster, Swaptions, Vips, X264).]

Page 21:

Inter-application contention: L1-cache

Page 22:

Inter-application contention: L2-cache

Page 23:

Inter-application contention: FSB

Page 24:

Characterization

Benchmarks       L1-cache    L2-cache        FSB
Blackscholes     none        none            none
Bodytrack        inter       inter           intra
Canneal          intra       inter           intra
Dedup            inter       intra, inter    intra, inter
Facesim          inter       inter           intra
Ferret           intra       intra, inter    intra
Fluidanimate     inter       inter           intra
Raytrace         none        none            intra
Streamcluster    inter       inter           intra
Swaptions        none        none            none
Vips             intra       inter           inter
X264             inter       intra, inter    intra
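One way such a characterization could be consumed, for example by a contention-aware scheduler, is as a lookup table. The sketch below encodes the rows of the table above and answers "which benchmarks show this kind of contention on this resource?". The data structure and the `sensitive_to` helper are hypothetical names chosen for this example; the slides do not describe any such implementation.

```python
# Column order of each row below.
RESOURCES = ("L1", "L2", "FSB")

# Benchmark -> contention type per resource, copied from the slide's table.
# An empty string stands for "none".
CHARACTERIZATION = {
    "Blackscholes": ("", "", ""),
    "Bodytrack": ("inter", "inter", "intra"),
    "Canneal": ("intra", "inter", "intra"),
    "Dedup": ("inter", "intra,inter", "intra,inter"),
    "Facesim": ("inter", "inter", "intra"),
    "Ferret": ("intra", "intra,inter", "intra"),
    "Fluidanimate": ("inter", "inter", "intra"),
    "Raytrace": ("", "", "intra"),
    "Streamcluster": ("inter", "inter", "intra"),
    "Swaptions": ("", "", ""),
    "Vips": ("intra", "inter", "inter"),
    "X264": ("inter", "intra,inter", "intra"),
}


def sensitive_to(resource, kind):
    """List benchmarks showing `kind` ('intra' or 'inter') contention on `resource`."""
    col = RESOURCES.index(resource)
    return sorted(
        name
        for name, row in CHARACTERIZATION.items()
        if kind in row[col].split(",")
    )


print(sensitive_to("FSB", "inter"))  # ['Dedup', 'Vips']
print(sensitive_to("L2", "intra"))   # ['Dedup', 'Ferret', 'X264']
```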

Page 25:

Summary

The methodology generalizes contention analysis of multi-threaded applications
  New approach to characterize applications
  Useful for performance analysis of existing and future architecture or benchmarks
  Helpful for creating new workloads of diverse properties
  Provides insights for designing improved contention-aware scheduling methods

Page 26:

Related Work

Cache contention
  Knauerhase et al., IEEE Micro 2008
  Zhuravlev et al., ASPLOS 2010
  Xie et al., CMP-MSI 2008
  Mars et al., HiPEAC 2011

Characterizing parallel workload
  Jin et al., NASA Technical Report 2009

PARSEC benchmark suite
  Bienia et al., PACT 2008
  Bhadauria et al., IISWC 2009

Page 27:

Thank you!