Tanima Dey ISPASS 2011 presentation (ispass.org/ispass2011/slides/2_4.pdf)
TRANSCRIPT
![Page 1: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/1.jpg)
Tanima Dey, Wei Wang, Jack W. Davidson, Mary L. Soffa
Department of Computer Science, University of Virginia
ISPASS 2011
![Page 2: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/2.jpg)
Motivation
- The number of cores doubles every 18 months
- Expected: performance ∝ number of cores
- One of the bottlenecks is shared-resource contention
- For multi-threaded workloads, contention is unavoidable
- To reduce contention, it is necessary to understand where and how the contention is created
![Page 3: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/3.jpg)
Shared Resource Contention in Chip-Multiprocessors
[Diagram: Intel Quad Core Q9550 with four cores (C0-C3), one L1 cache per core, two L2 caches, a front-side bus, and memory. Legend: Application 1 thread, Application 2 thread.]
![Page 4: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/4.jpg)
Scenario 1: Multi-threaded applications with co-runner
[Diagram: cores C0-C3 with L1 caches, two L2 caches, and memory; threads of Application 1 and Application 2 run on the cores.]
![Page 5: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/5.jpg)
Scenario 2: Multi-threaded applications without co-runner
[Diagram: cores C0-C3 with L1 caches, two L2 caches, and memory; only the threads of a single application run on the cores.]
![Page 6: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/6.jpg)
Shared-Resource Contention
- Intra-application contention
  - Contention among threads from the same application (no co-runners)
- Inter-application contention
  - Contention among threads from the co-running application
![Page 7: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/7.jpg)
Contributions
- A general methodology to evaluate a multi-threaded application's performance
  - Intra-application contention
  - Inter-application contention
  - Contention in the memory-hierarchy shared resources
- Characterizing applications facilitates better understanding of the application's resource sensitivity
- Thorough performance analyses and characterization of multi-threaded PARSEC benchmarks
![Page 8: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/8.jpg)
Outline
- Motivation
- Contributions
- Methodology
- Measuring intra-application contention
- Measuring inter-application contention
- Related Work
- Summary
![Page 9: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/9.jpg)
Methodology
- Designed to measure both intra- and inter-application contention for a targeted shared resource
  - L1-cache, L2-cache
  - Front Side Bus (FSB)
- Each application is run in two configurations
  - Baseline: threads do not share the targeted resource
  - Contention: threads share the targeted resource
- Multiple instances of the targeted resource
- Determine contention by comparing performance (gathering hardware performance counters' values)
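The two configurations can be illustrated with a small placement sketch. This is only an illustration under stated assumptions, not the authors' tooling: the `L2_OF_CORE` map (which cores sit behind which L2 cache) and the `pick_cores` helper are hypothetical, and the core numbering of a real machine may differ.

```python
from itertools import combinations

# Hypothetical topology: core id -> id of the L2 cache that core uses.
# Assumed pairing for a quad-core with two L2 caches; real numbering may differ.
L2_OF_CORE = {0: 0, 1: 0, 2: 1, 3: 1}


def pick_cores(topology, share):
    """Return the first pair of cores that do (share=True) or do not
    (share=False) use the same instance of the targeted resource."""
    for a, b in combinations(sorted(topology), 2):
        if (topology[a] == topology[b]) == share:
            return (a, b)
    raise ValueError("topology cannot satisfy the requested configuration")


# Baseline: the two threads sit behind different L2 caches.
baseline_cores = pick_cores(L2_OF_CORE, share=False)
# Contention: the two threads sit behind the same L2 cache.
contention_cores = pick_cores(L2_OF_CORE, share=True)
```

The baseline lookup fails when the topology has only one instance of the resource, which matches the requirement for multiple instances. On Linux, the chosen cores could then be applied with a call such as `os.sched_setaffinity`; how the threads were actually pinned in the study is not stated in the slides.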
![Page 10: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/10.jpg)
Outline
- Motivation
- Contributions
- Methodology
- Measuring intra-application contention (See paper)
- Measuring inter-application contention
- Related Work
- Summary
![Page 11: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/11.jpg)
Measuring inter-application contention: L1-cache
[Diagram: two four-core systems (C0-C3, L1 caches, two L2 caches, memory) shown side by side, labeled Baseline Configuration and Contention Configuration, with Application 1 and Application 2 threads placed on the cores.]
![Page 12: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/12.jpg)
Measuring inter-application contention: L2-cache
[Diagram: two four-core systems (C0-C3, L1 caches, two L2 caches, memory) shown side by side, labeled Baseline Configuration and Contention Configuration, with Application 1 and Application 2 threads placed on the cores.]
![Page 13: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/13.jpg)
Measuring inter-application contention: FSB
[Diagram: Baseline Configuration on an eight-core system (C0-C7, one L1 cache per core, four L2 caches, memory), with Application 1 and Application 2 threads placed on the cores.]
![Page 14: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/14.jpg)
Measuring intra-application contention: FSB
[Diagram: Contention Configuration on an eight-core system (C0-C7, one L1 cache per core, four L2 caches, memory), with Application 1 and Application 2 threads placed on the cores.]
![Page 15: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/15.jpg)
PARSEC Benchmarks

| Application Domain | Benchmark(s) |
|---|---|
| Financial Analysis | Blackscholes (BS), Swaptions (SW) |
| Computer Vision | Bodytrack (BT) |
| Engineering | Canneal (CN) |
| Enterprise Storage | Dedup (DD) |
| Animation | Facesim (FA), Fluidanimate (FL) |
| Similarity Search | Ferret (FE) |
| Rendering | Raytrace (RT) |
| Data Mining | Streamcluster (SC) |
| Media Processing | Vips (VP), X264 (X2) |
![Page 16: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/16.jpg)
Experimental platform
Platform 1: Yorkfield
- Intel Quad core Q9550
- 32 KB L1-D and L1-I cache
- 6MB L2-cache
- 2GB Memory
- Common FSB
[Diagram: cores C0-C3, each with an L1 cache and an L1 hardware prefetcher (HW-PF); two L2 caches, each with an L2 HW-PF and an FSB interface; the FSB connects to the Memory Controller Hub (Northbridge), which connects to memory (link labeled MB).]
![Page 17: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/17.jpg)
Experimental platform
Platform 2: Harpertown
[Diagram: cores C0-C7, each with an L1 cache and an L1 hardware prefetcher (HW-PF); four L2 caches, each with an L2 HW-PF and an FSB interface; two FSBs connect to the Memory Controller Hub (Northbridge), which connects to memory (link labeled MB).]
![Page 18: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/18.jpg)
Performance Analysis: Inter-application contention

For the i-th co-runner:

$$\text{PercentPerformanceDifference}_i = \frac{(\text{PerformanceBase}_i - \text{PerformanceContend}_i) \times 100}{\text{PerformanceBase}_i}$$

Absolute performance difference sum:

$$\text{APDS} = \sum_i \left| \text{PercentPerformanceDifference}_i \right|$$
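The two formulas above translate directly into a few lines of Python. This is a minimal sketch: the function names are hypothetical, the input numbers are made up for illustration and are not results from the study, and the performance metric itself (a hardware counter value of some kind) is left abstract because the slides do not name it.

```python
def percent_performance_difference(base, contend):
    """Percent difference between the baseline and contention measurements
    for one co-runner, relative to the baseline."""
    return (base - contend) * 100.0 / base


def apds(measurements):
    """Absolute performance difference sum over (base, contend) pairs,
    one pair per co-runner."""
    return sum(abs(percent_performance_difference(b, c)) for b, c in measurements)


# Made-up values for illustration only, one (base, contend) pair per co-runner.
example = [(100.0, 90.0), (200.0, 210.0)]
total = apds(example)  # 10% and -5% contribute 10 + 5
```

Taking the absolute value before summing means that a co-runner that improves the measured value and one that degrades it both count toward APDS instead of cancelling out.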
![Page 19: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/19.jpg)
Inter-application contention: L1-cache, for Streamcluster
[Figure: "Inter-application L1-cache Contention". Y-axis: Performance Difference (%), ticks from -8 to 8. X-axis: Co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Swaptions, Vips, X264).]
![Page 20: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/20.jpg)
Inter-application L1-cache contention: Streamcluster
[Figure: "Inter-application L1-cache Contention". Y-axis: Performance Difference (%), ticks from -8 to 8. X-axis: Co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Streamcluster, Swaptions, Vips, X264).]
![Page 21: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/21.jpg)
Inter-application contention: L1-cache
![Page 22: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/22.jpg)
Inter-application contention: L2-cache
![Page 23: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/23.jpg)
Inter-application contention: FSB
![Page 24: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/24.jpg)
Characterization

| Benchmarks | L1-cache | L2-cache | FSB |
|---|---|---|---|
| Blackscholes | none | none | none |
| Bodytrack | inter | inter | intra |
| Canneal | intra | inter | intra |
| Dedup | inter | intra, inter | intra, inter |
| Facesim | inter | inter | intra |
| Ferret | intra | intra, inter | intra |
| Fluidanimate | inter | inter | intra |
| Raytrace | none | none | intra |
| Streamcluster | inter | inter | intra |
| Swaptions | none | none | none |
| Vips | intra | inter | inter |
| X264 | inter | intra, inter | intra |
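The characterization above can be treated as a lookup table, for example to list the benchmarks that show a given kind of contention on a given resource. The sketch below copies the table entries as they appear on the slide; the names `TABLE`, `RESOURCES`, and `benchmarks_with` are hypothetical and only serve the illustration.

```python
RESOURCES = ("L1-cache", "L2-cache", "FSB")

# Benchmark -> contention type per resource, in the column order of RESOURCES.
TABLE = {
    "Blackscholes": ("none", "none", "none"),
    "Bodytrack": ("inter", "inter", "intra"),
    "Canneal": ("intra", "inter", "intra"),
    "Dedup": ("inter", "intra, inter", "intra, inter"),
    "Facesim": ("inter", "inter", "intra"),
    "Ferret": ("intra", "intra, inter", "intra"),
    "Fluidanimate": ("inter", "inter", "intra"),
    "Raytrace": ("none", "none", "intra"),
    "Streamcluster": ("inter", "inter", "intra"),
    "Swaptions": ("none", "none", "none"),
    "Vips": ("intra", "inter", "inter"),
    "X264": ("inter", "intra, inter", "intra"),
}


def benchmarks_with(resource, kind):
    """Return the sorted benchmark names whose entry for `resource`
    includes `kind` ("intra", "inter", or "none")."""
    col = RESOURCES.index(resource)
    return sorted(
        name for name, row in TABLE.items() if kind in row[col].split(", ")
    )


fsb_inter = benchmarks_with("FSB", "inter")
l1_intra = benchmarks_with("L1-cache", "intra")
```

A query like this is one way a contention-aware scheduler could consult the characterization; the slides do not describe such a scheduler, so this usage is only a suggestion.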
![Page 25: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/25.jpg)
Summary
- The methodology generalizes contention analysis of multi-threaded applications
  - New approach to characterize applications
- Useful for performance analysis of existing and future architecture or benchmarks
- Helpful for creating new workloads of diverse properties
- Provides insights for designing improved contention-aware scheduling methods
![Page 26: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/26.jpg)
Related Work
- Cache contention
  - Knauerhase et al., IEEE Micro 2008
  - Zhuravlev et al., ASPLOS 2010
  - Xie et al., CMP-MSI 2008
  - Mars et al., HiPEAC 2011
- Characterizing parallel workload
  - Jin et al., NASA Technical Report 2009
- PARSEC benchmark suite
  - Bienia et al., PACT 2008
  - Bhadauria et al., IISWC 2009
![Page 27: Tanima ISPASS2011 presentation 59ispass.org/ispass2011/slides/2_4.pdfTanima Dey Wee a g, Jac a dso , a y So ai Wang, Jack W. Davidson, Mary L. Soffa Department of Computer Science](https://reader034.vdocument.in/reader034/viewer/2022042414/5f2e8f319627b539cd42ec52/html5/thumbnails/27.jpg)
Thank you!