Ramazan Bitirgen, Engin Ipek and Jose F. Martinez, MICRO’08
Presented by Pak, Eunji
Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach
Resource sharing problem in CMPs
  Increasing levels of pressure on shared system resources
  Efficient sharing is necessary for high utilization and performance
Multiple interacting resources
  Cache space, DRAM bandwidth and power budget
  Allocation of one resource affects the demand for the other resources
Proposed resource allocation framework
  At runtime, monitors the execution of each application and learns a predictive model of performance as a function of resource allocation decisions
  Periodically allocates resources to each core using the model
Introduction
Per-application HW performance model
  Uses Artificial Neural Networks (ANNs)
  Predicts each app’s performance as a function of the resources allocated to it
Global resource manager
  At every interval, searches the space of possible resource allocations by querying the application performance models
Resource Allocation Framework
Use ANNs
  Input units, hidden units and an output unit connected via a set of weighted edges
  Each hidden unit calculates a weighted sum of its inputs, and the output unit calculates a weighted sum of the hidden values, based on the edge weights
  Edge weights are trained with training examples (data sets)
How to Predict Performance? (Artificial Neural Networks)
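The weighted-sum structure on this slide can be sketched as a forward pass through a network with one hidden layer. This is only an illustration: the sizes (12 inputs, 4 hidden units) are inferred from the multiplier count on the implementation slide, the weights are random placeholders, bias terms are omitted, and applying the sigmoid at the output unit is an assumption.

```python
import math
import random

def sigmoid(x):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict(inputs, hidden_weights, output_weights):
    """Forward pass: inputs -> hidden units -> single output unit.

    hidden_weights holds one weight list per hidden unit;
    output_weights holds one weight per hidden unit.
    """
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
n_inputs, n_hidden = 12, 4
hidden_weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
output_weights = [rng.uniform(-1, 1) for _ in range(n_hidden)]
inputs = [rng.random() for _ in range(n_inputs)]

print(predict(inputs, hidden_weights, output_weights))
```

Training would adjust `hidden_weights` and `output_weights` so that the output tracks measured performance; that step is not shown here.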
Input units
  L2 cache space, off-chip bandwidth, power budget
  Number of read hits, read misses, write hits, and write misses over the last 20K instructions and over the last 1.5M instructions
  Fraction of cache ways that are dirty (indicates the amount of write-back traffic)
Activation function
  Use sigmoid (maps a weighted sum to a value in [0, 1])
Model performance as a function of its allocated resources and recent behavior
Training during the first 1.2 billion cycles with randomly allocated resources
Always keep a training set consisting of 300 points
Retrained every 2,500,000 cycles
How to Predict Performance? (Adaptation to per-APP Performance Model)
Optimization
  Prevent memorizing outliers in the sample data
Cross validation
  Data set is divided into N equal-sized folds (N-1 folds for training and 1 for testing)
  Ensemble consists of N ANN models
  Performance is predicted by averaging the predictions of all ANNs in the ensemble
  Prediction error is estimated as a function of the coefficient of variation (CoV) of the predictions made by the ANNs in the ensemble (used later for resource allocation)
How to Predict Performance? (Adaptation to per-APP Performance Model)
[Figure: data set folds labeled Training and Test]
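The cross-validation ensemble can be sketched as follows. This is a minimal illustration under stated assumptions: N = 4 is hypothetical (the slides do not give N), the "models" are stand-in functions returning fixed values rather than trained ANNs, and the CoV is used directly as the error estimate, whereas the slide only says the error is a function of the CoV.

```python
import statistics

def split_folds(points, n):
    """Yield (training, test) pairs; fold k is held out for model k."""
    folds = [points[k::n] for k in range(n)]
    for k in range(n):
        train = [p for j, fold in enumerate(folds) if j != k for p in fold]
        yield train, folds[k]

def ensemble_predict(models, query):
    """Average the ensemble's predictions and report their CoV."""
    preds = [m(query) for m in models]
    mean = statistics.fmean(preds)
    # CoV = standard deviation / mean; used here as the error estimate.
    cov = statistics.pstdev(preds) / abs(mean) if mean != 0 else float("inf")
    return mean, cov

# Stand-in models (hypothetical): each returns a fixed IPC prediction.
models = [lambda q: 1.00, lambda q: 1.10, lambda q: 0.90, lambda q: 1.00]

mean, cov = ensemble_predict(models, query=None)
print(f"predicted IPC = {mean:.3f}, CoV = {cov:.3f}")
print("trusted" if cov <= 0.09 else "excluded from global allocation")
```

With 300 training points and 4 folds, `split_folds` would give each model 225 training points and 75 test points.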
Make a resource allocation decision every 500,000 cycles using the trained per-application performance models
Discard queries involving an app with a high error estimate
  Fairly distribute resources to the running applications
  Predict the performance and compute the prediction error
  If the prediction is estimated to be inaccurate (error > 9%), the app is excluded from global resource allocation
Search the space with stochastic hill climbing
  Starts with a random solution and iteratively makes small changes to it, each time improving it a little
  Terminates when it can no longer find an improvement
  2,000 trials produce the best tradeoff between search performance and overhead
Resource Allocation
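The search described above can be sketched as a simple stochastic hill climber. This is an illustrative sketch, not the paper's implementation: `toy_score` is a made-up stand-in for querying the per-application ANN models, the bandwidth unit count is hypothetical, and a "small change" is assumed to mean moving one unit of one resource from one app to another.

```python
import random

def random_start(units, n_apps, rng):
    """Hand out `units` indivisible units to apps at random."""
    alloc = [0] * n_apps
    for _ in range(units):
        alloc[rng.randrange(n_apps)] += 1
    return alloc

def hill_climb(start, score, trials=2000, seed=0):
    """Try random one-unit moves and keep only those that improve the score."""
    rng = random.Random(seed)
    best = {r: list(v) for r, v in start.items()}
    best_score = score(best)
    for _ in range(trials):
        cand = {r: list(v) for r, v in best.items()}
        res = rng.choice(sorted(cand))               # pick a resource
        src, dst = rng.sample(range(len(cand[res])), 2)
        if cand[res][src] == 0:
            continue                                 # nothing to take away
        cand[res][src] -= 1
        cand[res][dst] += 1
        s = score(cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score

def toy_score(alloc):
    """Hypothetical objective; a real system would query the ANN models."""
    gain = [1.0, 2.0, 0.5, 1.5]
    return sum(g * (w ** 0.5 + b ** 0.5)
               for g, w, b in zip(gain, alloc["ways"], alloc["bw"]))

rng = random.Random(1)
start = {"ways": random_start(12, 4, rng), "bw": random_start(8, 4, rng)}
best, best_score = hill_climb(start, toy_score)
print("start:", start, round(toy_score(start), 3))
print("best: ", best, round(best_score, 3))
```

Each move conserves the total of every resource, so any allocation the search returns is feasible by construction.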
HW implementation
  Single HW ANN; edge weights are multiplexed on the fly to achieve 16 ‘virtual’ ANNs
  12 * 4 + 4 = 52 multipliers, as many as there are weighted edges
  50-entry table-based quantized sigmoid function
  Calculation is pipelined
  Prediction (search) takes 16 cycles for 16 virtual ANNs
Area, power, and delay
  3% of the chip’s area
  3W power consumption
  Possible to make 2,000 queries within 5% of an interval
OS interface
  Embed the training set and the ANN weights in the process state
  OS communicates the desired objective function through a control register (CR)
Implementation & Overhead
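A table-based quantized sigmoid can be sketched in software as a lookup over evenly spaced input buckets. Only the entry count (50) comes from the slide; the input range of [-6, 6], the midpoint sampling, and the clamping behavior are assumptions made for this illustration.

```python
import math

TABLE_SIZE = 50            # entry count from the slide
X_MIN, X_MAX = -6.0, 6.0   # assumed input range; not given in the source
STEP = (X_MAX - X_MIN) / TABLE_SIZE

# Each entry stores the sigmoid evaluated at the midpoint of its bucket.
TABLE = [1.0 / (1.0 + math.exp(-(X_MIN + (i + 0.5) * STEP)))
         for i in range(TABLE_SIZE)]

def quantized_sigmoid(x):
    """Look up the sigmoid in the table, clamping out-of-range inputs."""
    i = int((x - X_MIN) / STEP)
    i = max(0, min(TABLE_SIZE - 1, i))
    return TABLE[i]

def exact_sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Worst-case lookup error over a fine grid inside the assumed range.
grid = [X_MIN + k * 0.01 for k in range(int((X_MAX - X_MIN) / 0.01) + 1)]
print(max(abs(quantized_sigmoid(x) - exact_sigmoid(x)) for x in grid))
```

A lookup replaces the exponential and division with a single indexed read, which is the kind of saving a hardware table is after; the accuracy cost depends on the range and table size chosen.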
Tools & architecture
  Heavily modified version of SESC
  With Wattch (power) and HotSpot (temperature)
Baseline: Intel’s Core2Quad, DDR2-800
  4-core CMP, frequency = 0.9GHz-4.0GHz (in 0.1GHz steps)
  4MB, 16-way shared L2 cache
Distribute a 60W power budget among 4 apps via per-core DVFS
  Ours (the proposed scheme) is limited to 57W
  Statically allocate 5W
Partition L2 cache space at the granularity of cache ways
  Allocate one way to each app
  Distribute the remaining 12 ways
Each app is statically allocated 800MB/s of off-chip DRAM bandwidth, and the remaining 3.2GB/s is distributed
Experimental Setup
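To get a feel for the size of the search space, the cache partitioning alone can be enumerated. The sketch below counts the ways to distribute the 12 remaining cache ways among 4 apps after each has received its one guaranteed way; the function names are hypothetical, and power and bandwidth are left out because the slides do not state their allocation granularity.

```python
from math import comb

def count_splits(units, n_apps):
    """Number of ways to hand out identical units to apps (stars and bars)."""
    return comb(units + n_apps - 1, n_apps - 1)

def splits(units, n_apps):
    """Yield every split of `units` among `n_apps` apps."""
    if n_apps == 1:
        yield (units,)
        return
    for first in range(units + 1):
        for rest in splits(units - first, n_apps - 1):
            yield (first,) + rest

# One guaranteed way per app plus a share of the remaining 12 ways.
ways = [tuple(1 + extra for extra in s) for s in splits(12, 4)]

print(count_splits(12, 4))   # 455 cache partitions
print(ways[0], ways[-1])
```

Since every additional resource multiplies this count, exhaustive search over all three resources quickly becomes impractical within one allocation interval, which is consistent with the deck's use of a heuristic search.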
Metrics
  Weighted speedup
  Sum of IPCs
  Harmonic mean of normalized IPCs
  Weighted sum of IPCs
Workload
  9 quad-core multi-programmed workloads from the SPEC2000 and NAS suites
  Classified into 3 categories: CPU-bound, memory-bound, cache-sensitive
Experimental Setup
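The four metrics can be written down as short functions. The slides only name them, so the formulas below follow the commonly used definitions and should be read as an assumption: each app's IPC when sharing the chip is normalized to its IPC when running alone. The sample IPC values and weights are made up.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of each app's IPC normalized to its stand-alone IPC."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def sum_of_ipcs(ipc_shared):
    """Raw throughput: total instructions per cycle across all apps."""
    return sum(ipc_shared)

def harmonic_mean_normalized(ipc_shared, ipc_alone):
    """Harmonic mean of normalized IPCs; penalizes starving any one app."""
    return len(ipc_shared) / sum(a / s for s, a in zip(ipc_shared, ipc_alone))

def weighted_sum_of_ipcs(ipc_shared, weights):
    """Sum of IPCs with a per-app weight (for example, a priority)."""
    return sum(w * s for w, s in zip(weights, ipc_shared))

# Made-up numbers for a 4-app workload.
shared = [1.0, 0.5, 2.0, 1.0]
alone = [2.0, 1.0, 2.0, 2.0]
weights = [1.0, 2.0, 1.0, 1.0]

print(weighted_speedup(shared, alone))
print(sum_of_ipcs(shared))
print(harmonic_mean_normalized(shared, alone))
print(weighted_sum_of_ipcs(shared, weights))
```

Any of these could serve as the objective function that the global resource manager maximizes during its search.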
Configurations
  Unmanaged
  Isolated Cache Management (Cache)
    "Utility-based cache partitioning", MICRO’06
    Distributes L2 cache ways to minimize miss rate
  Isolated Power Management (Power)
    "An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget", MICRO’06
  Isolated Bandwidth Management (BW)
    "Fair Queuing Memory System", MICRO’06
  Uncoordinated
    Cache + Power, Cache + BW, Power + BW, Cache + Power + BW
  Continuous Stochastic Hill-Climbing (Coordinated-HC)
    "Learning based SMT processor resource distribution" (issue queue, ROB, and register file), ISCA’06
  Fair-share
  Proposed scheme (Coordinated-ANN)
    ANN-based models of the applications’ IPC response to resource allocation are used to guide a stochastic hill-climbing search
Experimental Setup
Performance
  Results are normalized to Fair-Share
  14% average speedup over Fair-Share
  Similar for other metrics
Evaluation Results
[Figure: results per workload mix: P,C,P,M; M,C,P,M; C,C,C,C; P,C,M,C; C,M,C,C; C,P,C,M; C,M,M,C; P,C,P,M; P,C,P,P]
Sensitivity to confidence threshold
  Results are normalized to Fair-Share
Evaluation Results
Confidence estimation mechanism
  Fraction of the total execution time during which the ANN could make predictions for resource allocation optimization, for each application
Evaluation Results
Proposed a resource allocation framework that
  Manages multiple shared CMP resources in a coordinated fashion through ANNs and a periodic resource allocation scheme
Coordinated approach to multiple resource management is key to delivering high performance in multi-programmed workloads
Conclusions