an empirical model for predicting cross-core performance...
TRANSCRIPT
![Page 1: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/1.jpg)
An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors
Jiacheng Zhao
Institute of Computing Technology, CAS
In Conjunction with Prof. Jingling Xue,
UNSW, Australia
Sep 11, 2013
![Page 2: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/2.jpg)
Problem – Resource Utilization in Datacenters
How?
2013/9/11
ASPLOS’09 by David Meisner+
![Page 3: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/3.jpg)
Problem – Resource Utilization in Datacenters
Applications
Core Core
L1 L1
Co-Runners
Core Core
L1 L1
Shared Cache
Memory Controller
2013/9/11
Co-located applications
Contention for shared cache, shared IMC, etc.
Negative and unpredictable interference
Two types of applications
Batch – No QoS guarantees
Latency Sensitive - Attain high QoS
Co-location is disabled
Low server utilization
Lacking the knowledge of interference
![Page 4: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/4.jpg)
Problem – Resource Utilization in Datacenters
Co-located applications
Contention for shared cache, shared IMC, etc.
Negative and unpredictable interference
Two types of applications
Batch – No QoS guarantees
Latency Sensitive - Attain high QoS
Co-location is disabled
Low server utilization
Lacking the knowledge of interference
2013/9/11
![Page 5: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/5.jpg)
Problem – Resource Utilization in Datacenters
Co-located applications
Contention for shared cache, shared IMC, etc.
Negative and unpredictable interference
Two types of applications
Batch – No QoS guarantees
Latency Sensitive - Attain high QoS
Co-location is disabled
Low server utilization
Lacking the knowledge of interference
2013/9/11
Figure: Task placement in datacenters
[Micro’11 by Jason Mars+]
![Page 6: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/6.jpg)
Our Goals: Predicting the interference
Quantitatively predict the cross-core performance interference
Applicable for arbitrarily co-locations
Identify any “safe” co-locations
Deployable for datacenters
2013/9/11
![Page 7: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/7.jpg)
Our Intuition – Mining a model from large training data
2013/9/11
Using machine learning approaches
Training Set
![Page 8: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/8.jpg)
Motivation example
2013/9/11
𝑃𝐷𝑚𝑐𝑓 =
0.485𝑃𝑏𝑤 + 0.183𝑃𝑐𝑎𝑐ℎ𝑒 − 0.138, 𝑖𝑓 𝑃𝑏𝑤 < 3.20.706𝑃𝑏𝑤 + 1.725𝑃𝑐𝑎𝑐ℎ𝑒 − 0.220, 𝑖𝑓 3.2 ≤ 𝑃𝑏𝑤 ≤ 9.60.907𝑃𝑏𝑤 + 3.087𝑃𝑐𝑎𝑐ℎ𝑒 − 0.561, 𝑖𝑓 𝑃𝑏𝑤 > 9.6
![Page 9: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/9.jpg)
Outline
Introduction
Our Key Observations
Our Approach – Two-Phase Approach
Experimental Results
Conclusion
2013/9/11
![Page 10: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/10.jpg)
Our Key Observations
Observation 1: The function depends only on the pressure on shared
resources, regardless of individual pressures from one co-runner.
For an application A, PDA = f(Pcache, Pbw)
(Pcache, Pbw) = g(A1,A2,…,Am)
2013/9/11
![Page 11: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/11.jpg)
Our Key Observations
Observation 2:
The function f is piecewise.
2013/9/11
![Page 12: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/12.jpg)
Our Key Observations
Naively, we can create A’s prediction model using brute-force approach
BUT, we can NOT apply brute force approach for each application!
Thousands of applications in one datacenter
Frequent software updates
Different generations of processors
Even steps for one application is expensive
Observation 3:
The function form is platform-dependent and application independent
Only the coefficients are application-dependent
2013/9/11
![Page 13: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/13.jpg)
Outline
Introduction
Our Key Observations
Our Approach - Two-Phase Approach
Experimental Results
Conclusion
2013/9/11
![Page 14: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/14.jpg)
Our Approach - Two-Phase Approach
2013/9/11
Phase 1: Get the abstract model
Find a function form best suitable for all applications on a given platform
Phase 2: Instantiate the abstract model
Determine the application-specific coefficients (a11, etc.)
Training Applications
Co-running Trainer
Heavy – many training workloads
Run once for one platform
𝑃D =
𝑎11𝑃𝑏𝑤 + 𝑎12𝑃𝑐𝑎𝑐ℎ𝑒 + 𝑎13, 𝑠𝑢𝑏𝑑𝑜𝑚𝑎𝑖𝑛1𝑎21𝑃𝑏𝑤 + 𝑎22𝑃𝑐𝑎𝑐ℎ𝑒 + 𝑎23, 𝑠𝑢𝑏𝑑𝑜𝑚𝑎𝑖𝑛2𝑎31𝑃𝑏𝑤 + 𝑎32𝑃𝑐𝑎𝑐ℎ𝑒 + 𝑎33, 𝑠𝑢𝑏𝑑𝑜𝑚𝑎𝑖𝑛3
OneApplication
Co-running Trainer
Light-weighted, with a small number of trainings
Run once for one application
𝑃𝐷𝑚𝑐𝑓 =
0.49𝑃𝑏𝑤 + 0.18𝑃𝑐𝑎𝑐ℎ𝑒 − 0.13, 𝑃𝑏𝑤 < 3.20.71𝑃𝑏𝑤 + 1.73𝑃𝑐𝑎𝑐ℎ𝑒 − 0.22, 𝑜𝑡ℎ𝑒𝑟𝑠
0.91𝑃𝑏𝑤 + 3.09𝑃𝑐𝑎𝑐ℎ𝑒 − 0.56, 𝑃𝑏𝑤 > 9.6
![Page 15: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/15.jpg)
Our Approach - Two-Phase Approach
2013/9/11
![Page 16: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/16.jpg)
Our Approach - Two-Phase Approach
2013/9/11
Q1: What are selected as
application features
Q2: How?
Q3: What’s the cost of the training?
![Page 17: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/17.jpg)
Our Approach – Some Key Points
2013/9/11
Q1: What are selected as application features?
Runtime profiles
Shared cache consumption
Bandwidth consumption
![Page 18: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/18.jpg)
Our Approach – Some Key Points
2013/9/11
Q2: How to create the abstract model?
Regression analysis
Configurable
Each configuration
binding to a function form
Searching for the best function form for all applications in the training set
![Page 19: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/19.jpg)
Our Approach – Some Key Points
2013/9/11
Q3: What’s the cost of the training when instantiation
Cover all sub-domains of the piecewise function, say S
Constant points for each sub-domain, say C
The constant depends on the form of abstraction model
C*S training runs in total
Usually C and S are small, our experience: C=4, S=3
![Page 20: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/20.jpg)
Outline
Introduction
Our Key Observations
Our Approach - Two-Phase Approach
Experimental Results
Conclusion
2013/9/11
![Page 21: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/21.jpg)
Experimental Results
2013/9/11
Accuracy of our two-phase regression approach
Prediction precision
Error analysis
Deployment in a datacenter
Utilization gained
QoS enforced and violated
![Page 22: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/22.jpg)
Experimental Results
2013/9/11
Benchmarks:
SPEC2006
Nine real-world datacenter applications
Nlp-mt, openssl, openclas, MR-iindex, etc.
Platforms:
Intel quad-core Xeon E5506 (main)
Datacenter:
300 quad-core Xeon E5506
![Page 23: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/23.jpg)
Some Predictor Function
2013/9/11
![Page 24: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/24.jpg)
Prediction precision for SPEC Benchmarks
2013/9/11
Prediction Error: Average 0.2%, from 0.0% to 8.6%.
![Page 25: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/25.jpg)
Prediction precision for datacenter applications
15 workloads for each datacenter applications
2013/9/11
Prediction Error: Average 0.3%, from 0.0% to 5%.
![Page 26: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/26.jpg)
Error Distribution
2013/9/11
-4.00%
-3.00%
-2.00%
-1.00%
0.00%
1.00%
2.00%
3.00%
4.00%
Error Distribution
![Page 27: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/27.jpg)
Prediction Efficiency
Precision
Two-Phase:
0.0~11.7%, Average: 0.40%
Brute-Force
0.0~10.1%, Average: 0.23%
Efficiency
co-running: ~200 12
2013/9/11
0%
10%
20%
30%
40%
50%
60%
70%
80%
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20P
erf
orm
ance
De
grad
atio
nWorkload ID
Real Two-Phase Brute-Force
![Page 28: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/28.jpg)
Benefits of piecewise predictor functions
2013/9/11
![Page 29: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/29.jpg)
Benefits of piecewise predictor functions
2013/9/11
![Page 30: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/30.jpg)
Deployment in a datacenter
2013/9/11
300 quad-core Xeon 1200 tasks when fully occupied
Applications Latency sensitive: Nlp-mt
machine translation
600 dedicated cores, 2/chip
Batch job
600 tasks, kmeans, MR
Our Purpose QoS policy
Issue batch jobs to idle cores
![Page 31: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/31.jpg)
Cross-platform applicability
Six-core Intel Xeon
2013/9/11
0%
20%
40%
60%
80%
1 6 11 16 21 26 AVG
Per
form
ance
Deg
rad
atio
n
Workload ID
Real Predicted
Prediction Error: Average 0.1%, range from 0.0% to 10.2%
![Page 32: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/32.jpg)
Cross-platform applicability
Quad-core AMD
2013/9/11
Prediction Error: Average 0.3%, range from 0.0% to 5.1%
0%
10%
20%
30%
40%
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 AVG
Per
form
ance
Deg
rad
atio
n
Workload ID
Real Predicted
![Page 33: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/33.jpg)
Outline
Introduction
Our Key Observations
Our Approach - Two-Phase Approach
Experimental Results
Conclusion
2013/9/11
![Page 34: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/34.jpg)
Conclusion
An empirical model, based on our key observations
Using aggregated resource consumptions to create the predictor function, thus
working for arbitrarily co-locations
Piecewise is reasonable and effective
Breaking the model creation into two phases, for efficiency
2013/9/11
![Page 35: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/35.jpg)
2013/9/11
![Page 36: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/36.jpg)
Backup slides
2013/9/11
How to make the training set representative?
Partition the space into grids
Sample for each grid
![Page 37: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/37.jpg)
Backup slides
2013/9/11
How to do domain partitioning?
Specified in configuration file
Syntax: (shared resourcei, conditioni), e.g. (Pbw, equal(4))
Empirical knowledge to perform this task
#Aggregation#Pre-Processing: none, exp(2), log(2), pow(2)#mode: add, mul
#Domain Partitioning: {((Pbw), equal(4)), ((Pcache), equal(4)), ((Pcache, Pbw), equal(4, 4))},#Function: linear, polynomial(2)
![Page 38: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications](https://reader034.vdocument.in/reader034/viewer/2022050120/5f500a657f1d8a71f15383bc/html5/thumbnails/38.jpg)
Backup slides
Two sources of error:
Estimation for shared resources
consumption
L2 LinesIn
Phase behavior of applications
2013/9/11