Evaluating the Impact of Simultaneous Multithreading on Network Servers
Using Real Hardware
Yaoping Ruan, Princeton University
Vivek Pai, Princeton University
Erich Nahum, IBM T.J. Watson
John Tracey, IBM T.J. Watson
SIGMETRICS’05 http://www.cs.princeton.edu/~yruan 2

Motivation
• Network servers
  • Throughput matters
  • Hardware intensive
• Simultaneous Multithreading (SMT)
  • Processor support for high throughput
  • Simulated since mid-90s
  • Now available in Intel Xeon/Pentium 4 (Hyper-Threading) and IBM POWER5

How Does SMT Work?
• Simultaneous execution of multiple jobs
• Higher utilization of functional units
[Figure: functional-unit usage over cycles (direction of data flow) for Job 1 on Processor 1, Job 2 on Processor 2, and Jobs 1 & 2 on an SMT processor. Colored blocks are functional units currently in use.]
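
The utilization argument on this slide can be made concrete with a toy model. The sketch below is only illustrative: the four-unit machine, the two job traces, and the greedy co-issue rule are all assumptions made for the example, not the issue policy of any real SMT processor.

```python
from typing import List, Set

UNITS = 4  # hypothetical number of functional units


def utilization(trace: List[Set[int]]) -> float:
    """Fraction of unit-slots in use over the whole trace."""
    used = sum(len(cycle) for cycle in trace)
    return used / (UNITS * len(trace))


def smt_merge(job1: List[Set[int]], job2: List[Set[int]]) -> List[Set[int]]:
    """Greedy co-issue: every cycle issue job1's next step (if any), and also
    job2's next step when it needs none of the units already taken."""
    merged = []
    i = j = 0
    while i < len(job1) or j < len(job2):
        cycle: Set[int] = set()
        if i < len(job1):
            cycle |= job1[i]
            i += 1
        if j < len(job2) and not (cycle & job2[j]):
            cycle |= job2[j]
            j += 1
        merged.append(cycle)
    return merged


# Hypothetical traces: each set lists the units a job uses in one cycle.
job1 = [{0}, {0, 1}, {2}, {0}]
job2 = [{1}, {1, 3}, {3}, {0}]
merged = smt_merge(job1, job2)

print(utilization(job1), utilization(job2), utilization(merged))
print(len(job1) + len(job2), "cycles back to back vs.", len(merged), "with SMT")
```

With these made-up traces each job alone keeps fewer than a third of the unit-slots busy, while the merged schedule finishes both jobs in 5 cycles instead of 8 and uses half of the slots.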

SMT Architecture
• Appears as multiple processors to the OS and applications
• Duplicated resources
  • Architectural State Registers #1
  • Architectural State Registers #2
• Shared resources
  • Pipeline Execution Units
  • Cache Hierarchy
  • System Bus
  • Main Memory
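
Because each hardware thread appears to the OS as a separate processor, software can only tell SMT from a true multiprocessor by inspecting topology information. A minimal sketch, assuming an x86 Linux system whose /proc/cpuinfo exposes the `siblings` and `cpu cores` fields; the sample text below is hypothetical and the helper name `smt_enabled` is made up for this example.

```python
def smt_enabled(cpuinfo_text: str) -> bool:
    """Return True if some package reports more siblings (logical CPUs)
    than cpu cores (physical cores), which indicates SMT is active."""
    siblings = cores = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line between processor entries
        key = key.strip()
        if key == "siblings":
            siblings = int(value)
        elif key == "cpu cores":
            cores = int(value)
        if siblings is not None and cores is not None and siblings > cores:
            return True
    return False


# Hypothetical excerpt for one logical CPU of a Hyper-Threading processor.
SAMPLE_SMT_ON = """processor : 0
physical id : 0
siblings : 2
cpu cores : 1
"""

SAMPLE_SMT_OFF = """processor : 0
physical id : 0
siblings : 1
cpu cores : 1
"""

print(smt_enabled(SAMPLE_SMT_ON), smt_enabled(SAMPLE_SMT_OFF))
```

On a live system the same function could be fed the contents of /proc/cpuinfo, provided those two fields are present.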

Contributions
• Detailed analysis of multiple real hardware platforms and server packages
  • Includes previously ignored OS overheads
• Micro-architectural performance analysis
  • Demonstrates dominance of memory hierarchy
• Comparison with simulation studies
  • Explains why SMT provides relatively small benefits on real hardware
  • Overly-aggressive memory simulation yielded higher expected benefits

Outline
• Background
• Measurement methodology
• Throughput & improvement
• Micro-architectural performance
• Discussion

Measurements Overview
• Metrics
  • Server throughput
  • Throughput improvements (relative speedups)
  • Architectural features (CPI, miss ratio, etc.)
• Multiple configurations
  • Hardware platforms (clock speed, cache, etc.)
  • Server software (Apache, Flash, TUX, etc.)
  • Kernel configuration (uniprocessor and multiprocessor)
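
The "throughput improvement (relative speedup)" metric is presumably the usual percentage change against a baseline configuration. A minimal sketch under that assumption; the function name and the throughput values are hypothetical, not measurements from the talk.

```python
def throughput_improvement(baseline_mbps: float, smt_mbps: float) -> float:
    """Relative speedup of the SMT run over the baseline, in percent."""
    if baseline_mbps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (smt_mbps - baseline_mbps) / baseline_mbps * 100.0


# Hypothetical numbers: a 2T run compared against a 1P-MP baseline.
print(throughput_improvement(800.0, 880.0))
# A slowdown shows up as a negative improvement.
print(throughput_improvement(800.0, 760.0))
```

Negative values correspond to the bars that drop below zero in the improvement charts.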

Hardware Platforms
• Three models of Xeon processors
  • Clock rate: 2.0GHz, 3.06GHz, 3.06GHz L3
  • L3 cache: none, except 1MB on the 3.06GHz L3 model
  • Mem latency (cycles): 220 ~ 350
• L1/L2 cache sizes, main memory, buses and # threads/processor are the same
• Models differ in clock rate and cache
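
Since the main memory is the same across the three models, the difference in latency measured in cycles should mostly reflect clock rate. The arithmetic below checks this, assuming (this pairing is an assumption, not stated explicitly on the slide) that the 220-cycle figure belongs to the 2.0GHz model and the 350-cycle figure to the 3.06GHz models.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in CPU cycles to nanoseconds.
    One cycle at f GHz lasts 1/f ns, so latency_ns = cycles / f."""
    return cycles / clock_ghz


# Assumed pairing of slide values to processor models.
slow = cycles_to_ns(220, 2.0)
fast = cycles_to_ns(350, 3.06)
print(f"2.0GHz:  {slow:.1f} ns")
print(f"3.06GHz: {fast:.1f} ns")
```

Under that pairing both models see a memory latency of roughly 110 to 115 ns, so the faster clock simply wastes more cycles per memory access.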

Web Servers
• 5 Web server packages
  • Apache-MP: multi-process
  • Apache-MT: multi-thread
  • Flash: event-driven
  • TUX: in-kernel
  • Haboob: Java server, staged multi-thread model
• Benchmarks
  • SPECweb96 and SPECweb99

System Configuration
• 5 configuration labels: # CPUs, SMT on/off, kernel type

Label   # CPUs   SMT   Kernel
1P-UP   1        off   Uniprocessor
1P-MP   1        off   Multiprocessor
2T      1        on    Multiprocessor
2P      2        off   Multiprocessor
4T      2        on    Multiprocessor

(T – # threads, P – # processors)
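
The five labels can be written down as data, which makes the relationship between label, processor count, and hardware threads explicit. A small sketch; the `Config` class and field names are hypothetical, and the two-threads-per-processor factor follows the "2T/processor" note used for the 4T configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    label: str
    cpus: int      # physical processors
    smt: bool      # SMT (Hyper-Threading) enabled
    kernel: str    # "UP" = uniprocessor kernel, "MP" = multiprocessor kernel

    @property
    def hardware_threads(self) -> int:
        # Two hardware threads per processor when SMT is on, otherwise one.
        return self.cpus * (2 if self.smt else 1)


CONFIGS = [
    Config("1P-UP", 1, False, "UP"),
    Config("1P-MP", 1, False, "MP"),
    Config("2T", 1, True, "MP"),
    Config("2P", 2, False, "MP"),
    Config("4T", 2, True, "MP"),
]

for c in CONFIGS:
    print(f"{c.label:6s} cpus={c.cpus} smt={c.smt} kernel={c.kernel} "
          f"threads={c.hardware_threads}")
```

Comparing 2T against 1P-MP isolates the effect of SMT, while comparing 2T against 1P-UP also includes the cost of switching to the multiprocessor kernel.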

Outline
• Background
• Measurement methodology
• Throughput & improvement
  • Single processor
  • Dual-processor
• Micro-architectural performance
• Discussion

Throughput Evaluation
[Figure: Apache-MP, 3.06GHz. Throughput (Mb/s), 0 to 1200, for 1P-UP, 1P-MP, and 2T (w/ SMT) on a single processor, and 2P and 4T (w/ SMT) on a dual-processor.]
• Comparisons: 2T vs. 1P-MP, 2T vs. 1P-UP, 4T vs. 2P

Improvement on Single Processor
• 2T: 2 threads, multiprocessor kernel
• 1P-MP: 1 thread, multiprocessor kernel
[Figure: 2T vs. 1P-MP. Throughput improvement (%), -10 to 40, for Apache-MP, Apache-MT, Flash, TUX, and Haboob on 2.0GHz, 3.06GHz, and 3.06GHz L3.]

Improvement on Single Processor
• 2T: 2 threads, multiprocessor kernel
• 1P-UP: 1 thread, uniprocessor kernel
[Figure: 2T vs. 1P-UP. Throughput improvement (%), -10 to 40, for Apache-MP, Apache-MT, Flash, TUX, and Haboob on 2.0GHz, 3.06GHz, and 3.06GHz L3. Callout: kernel overhead.]

Improvement on Dual-processor
• 4T: 4 threads (2 processors, 2T/processor)
• 2P: 2 physical processors (SMT disabled)
[Figure: 4T vs. 2P. Throughput improvement (%), -20 to 40, for Apache-MP, Apache-MT, Flash, TUX, and Haboob on 2.0GHz, 3.06GHz, and 3.06GHz L3.]
• 2.0GHz & 3.06GHz with L3 are better
• Memory is still the bottleneck

Micro-architectural Analysis
• Use OProfile
  • In-house patch to measure extra events
• About 25 performance events
  • Cache miss/hit
  • TLB miss/hit
  • Branches
  • Pipeline stall, clear, etc.
  • Bus utilization

L1 Instruction Cache Miss Rate
[Figure: L1 instruction cache miss rate, 0% to 20%, for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T (SMT), 2P, and 4T (SMT).]

L2 Cache Miss Rate
• Instruction & data unified
• Lower rate in SMT due to higher L1 misses
[Figure: L2 cache miss rate, 0% to 10%, for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T (SMT), 2P, and 4T (SMT).]
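
The note that the L2 miss rate falls because L1 misses rise follows from how a miss rate is defined: more L1 misses mean more L2 accesses, which enlarges the denominator. A sketch with made-up event counts, assuming every L1 miss becomes exactly one L2 access (prefetching and other traffic are ignored).

```python
def miss_rate(misses: int, accesses: int) -> float:
    """Misses as a fraction of accesses to a cache level."""
    if accesses <= 0:
        raise ValueError("accesses must be positive")
    return misses / accesses


# Hypothetical counts, not measurements from the talk.
# Each tuple: (L1 misses, which are taken as L2 accesses, L2 misses)
no_smt = (1000, 80)
with_smt = (2000, 100)

rate_no_smt = miss_rate(no_smt[1], no_smt[0])
rate_with_smt = miss_rate(with_smt[1], with_smt[0])
print(f"without SMT: {rate_no_smt:.1%}, with SMT: {rate_with_smt:.1%}")
```

In this example the absolute number of L2 misses goes up with SMT while the L2 miss rate goes down, so a lower rate on its own does not mean less memory traffic.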

Putting Events Together
[Figure: Apache-MP. Cycles per instruction (CPI), 0 to 16, for 1P-UP, 1P-MP, 2T, 2P, and 4T, stacked by component: work, L1 miss, L2 miss, ITLB, DTLB, branch, clear, buffer. Callouts: work, L1 miss, L2 miss, others.]

Non-overlapped CPI
• L1/L2 miss penalty dominates
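
A non-overlapped CPI breakdown can be sketched as each event count multiplied by a per-event penalty, divided by the instruction count, with no overlap between stalls assumed (so the sum is an upper bound). All counts and penalties below are hypothetical placeholders except the 350-cycle memory latency, which comes from the hardware slide; the function name is also made up for this example.

```python
from typing import Dict


def cpi_breakdown(instructions: int, work_cpi: float,
                  events: Dict[str, int],
                  penalties: Dict[str, float]) -> Dict[str, float]:
    """Per-component CPI, assuming stall cycles never overlap."""
    parts = {"work": work_cpi}
    for name, count in events.items():
        parts[name] = count * penalties[name] / instructions
    return parts


# Hypothetical event counts for one run of 1,000,000 instructions.
events = {"L1 miss": 50_000, "L2 miss": 5_000, "branch": 2_000}
# Hypothetical penalties in cycles; 350 is the memory latency from the slides.
penalties = {"L1 miss": 18.0, "L2 miss": 350.0, "branch": 20.0}

parts = cpi_breakdown(1_000_000, 1.0, events, penalties)
for name, value in parts.items():
    print(f"{name:8s} {value:.2f}")
print(f"total    {sum(parts.values()):.2f}")
```

Even with relatively few L2 misses, the long memory latency makes that component the largest in this made-up example, which is the pattern the slide describes.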

Measuring Bus Utilization
• Event: FSB_DATA_ACTIVITY
  • CPU cycles when the bus is busy
• Normalized to CPU speed
  • Comparable across all CPU clock rates
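
One plausible reading of this normalization is to divide the busy-cycle count by the total CPU cycles elapsed during the run. The sketch below follows that reading; it is an assumption about the method, and the counter value, clock rate, and duration are hypothetical.

```python
def bus_utilization(bus_busy_cpu_cycles: float, clock_hz: float,
                    elapsed_seconds: float) -> float:
    """Percentage of CPU cycles during which the bus was busy."""
    total_cycles = clock_hz * elapsed_seconds
    if total_cycles <= 0:
        raise ValueError("run must cover a positive number of cycles")
    return bus_busy_cpu_cycles / total_cycles * 100.0


# Hypothetical run: 10 seconds on a 3.06GHz processor,
# with 4.59e9 cycles counted as bus-busy.
util = bus_utilization(4.59e9, 3.06e9, 10.0)
print(f"{util:.1f}%")
```

Expressing the result as a fraction of elapsed cycles is what lets values from the 2.0GHz and 3.06GHz machines sit on the same axis.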

Bus Utilization Results
• 2.0GHz & 3.06GHz L3 have fewer data transfer cycles
  • Lower memory latency in 2.0GHz & 3.06GHz with L3
• Coefficient of correlation between bus utilization & speedups: 0.62 ~ 0.95
[Figure: Apache-MP. Bus utilization (%), 0 to 20, for 1P-UP, 1P-MP, 2T, 2P, and 4T on 2.0GHz, 3.06GHz, and 3.06GHz L3.]
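
The slide does not say which correlation coefficient was used; the sketch below assumes the common Pearson coefficient. The sample data are hypothetical and only show how such a value would be computed from paired observations.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired observations, one pair per server/platform run.
x_values = [4.0, 6.5, 9.0, 12.0, 15.5]
y_values = [14.0, 11.0, 9.5, 4.0, -2.0]
print(round(pearson(x_values, y_values), 3))
```

The sign and magnitude depend entirely on the data fed in; the values above are placeholders, not the measurements behind the 0.62 ~ 0.95 range.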

Outline
• Background
• Measurement parameters
• Throughput speedup
• Micro-architectural performance
• Discussion
  • Compare to simulation
  • Other Web workloads

SMT Performance on Web Servers
[Figure: Throughput improvement, -10% to 100%, comparing simulation with measurements on the multiprocessor kernel, the uniprocessor kernel, and the dual-processor.]

Compare to Simulation

              Simulation             Measurement
              Size      Miss rate    Size      Miss rate
L1-I          128 KB    2.0%         12 KB     17%
L1-D          128 KB    3.6%         8 KB      5.7%
L2            16 MB     1.4%         512 KB    3.9%
Mem latency   90 cycles              220 ~ 350 cycles
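
A back-of-the-envelope estimate shows how much these parameters differ in effect. The sketch below multiplies L1-D miss rate, L2 miss rate, and memory latency to get memory stall cycles per L1-D access. It assumes the L2 miss rate is local (misses per L2 access), ignores L2 hit time and instruction fetches, and uses the 350-cycle upper end of the measured latency range; none of these choices is stated in the talk.

```python
def mem_stall_per_l1_access(l1_miss_rate: float, l2_miss_rate: float,
                            mem_latency_cycles: float) -> float:
    """Expected main-memory stall cycles per L1 access, counting only
    accesses that miss in both L1 and L2."""
    return l1_miss_rate * l2_miss_rate * mem_latency_cycles


# Values taken from the table above (L1-D and L2 rows).
simulated = mem_stall_per_l1_access(0.036, 0.014, 90)
measured = mem_stall_per_l1_access(0.057, 0.039, 350)

print(f"simulation:  {simulated:.3f} cycles per L1-D access")
print(f"measurement: {measured:.3f} cycles per L1-D access")
print(f"ratio:       {measured / simulated:.1f}x")
```

Under these assumptions the measured system spends more than ten times as many cycles per data access waiting on main memory as the simulated one.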

Processor Development Trend

                      1996         2000         2003
Simulated models:
  Mem latency         62 cycles    90 cycles    90 cycles
  L1                  32 KB        128 KB       64 KB
  L2                  256 KB       16384 KB     16384 KB
Actual processors:
  Mem latency         74 cycles    94 cycles    350 cycles
  L1                  16 KB        16 KB        8-12 KB
  L2                  256 KB       512 KB       512 KB

SMT on SPECweb99
• SPECweb99 results in paper
  • Dynamic + static
  • Multiple programs
    • CGI requests, user profile logging, etc.
• Speedup very close to static-only workloads
  • No more negative speedups in Flash
  • May be due to better sharing of resources among different programs

Summary
• More realistic speedup evaluation of SMT
  • 3 processors, 5 servers, 2 kernels
  • Exposed factors not previously examined
  • 5~15% speedup in our best cases
• Detailed analysis of memory hierarchy impact on SMT performance
  • All other architecture overheads secondary
  • Reasons why simulation results were overly optimistic
Thank you
http://www.cs.princeton.edu/~yruan

Future Work
• Ways of improving Simultaneous Multithreading performance
  • Server performance on POWER5
  • Using execution-driven simulation for deeper understanding
• Study Chip Multiprocessor (CMP)
  • Intel, AMD, and IBM

Pipeline Clears (per Byte)
• Conditions when the whole pipeline needs to be flushed
[Figure: Pipeline clears per byte, 0.00 to 0.30, for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1T-UP, 1T-MP, 2T, 2P, and 4T.]