IMPACT
Arizona State University
Sensor-Based Fast Thermal Evaluation Model For Energy Efficient High-Performance Datacenters
Q. Tang, T. Mukherjee, Sandeep K. S. Gupta
Department of Computer Sc. & Engg.
Arizona State University
&
Phil Cayton, Intel Corp.
Heating problem in Data Center
Power densities are increasing exponentially, in line with Moore's Law
Current cooling solutions at various levels:
  Chip / component level
  Server / board level
  Rack level
  Data center level
Two steps of reducing heating effects
Design and deployment stage (Civil & Mechanical Engineering approach)
  Increasing air conditioner capacity
  Designing an optimized layout to facilitate air circulation
Operation stage (Computer Science approach)
  Example: dynamically assigning tasks to avoid overheated servers and to achieve thermal balancing
  Assigning tasks to servers that consume less energy
Thermal Management of Datacenter
Motivation and significance
  Compute-intensive applications (online gaming, computer movie animation, data mining) require increased utilization of the data center
  Maximizing computing capacity is a demanding requirement
  New blade servers can be packed more densely
  Energy cost is rising dramatically
Goal
  Improving thermal performance
  Lowering hardware failure rate
  Reducing energy cost
Typical layout of a datacenter
Rack outlet temperature Tout
Rack inlet temperature Tin
Air conditioner supply temperature Ts
Schematic View of Thermal Management
[Figure: block diagram of the thermal management loop. Blocks: Datacenter, Transducer, Sensor Data Database (History Sensor Data, Current Sensor Data), CFD simulation software, Abstract Heat Model, Policy Controller (Cost Analysis, Scheduling Policy, Control Policy), Moab Scheduler, Process Migration, Other Impact Factors. Flows: Onsite survey; Collecting environmental data and load information from sensors; Correlation of load & power (map load to power consumption); Incoming task; Control; Feedback; Target.]
Thermal Scheduling: Problem Statement
We present results of thermal-aware scheduling to improve the energy efficiency of (blade-server-based) datacenters
Given a total task C, how should it be divided among N server nodes so that the computing task finishes with minimal total energy cost?
Energy Conservation
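Outlet airflow: $Q_{out}^{i} = \rho f_i C_p T_{out}^{i}$
Inlet airflow, a mixture of supplied cold air and recirculated hot air: $Q_{in}^{i} = \rho f_i C_p T_{in}^{i}$
($\rho$: air density, $f_i$: airflow rate through node i, $C_p$: specific heat of air)
Server power consumption $P_i$, depending on the amount of computing task
Energy conservation: $Q_{out}^{i} = Q_{in}^{i} + P_i$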
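The energy balance above can be turned into a one-line temperature calculation. The following is a minimal Python sketch, assuming steady state and a volumetric flow rate; the function name and the air property constants are illustrative assumptions, not values from the slides.

```python
# Illustrative air properties (assumed, roughly room conditions).
AIR_DENSITY = 1.19        # kg/m^3
AIR_SPECIFIC_HEAT = 1005  # J/(kg*K)


def outlet_temperature(t_in, power, flow_rate):
    """Outlet temperature (deg C) of one node from Q_out = Q_in + P.

    t_in: inlet temperature in deg C
    power: node power consumption in W
    flow_rate: volumetric airflow through the node in m^3/s
    """
    # Heat carried by the airflow per kelvin: rho * f * C_p, in W/K.
    k = AIR_DENSITY * flow_rate * AIR_SPECIFIC_HEAT
    # rho*f*Cp*T_out = rho*f*Cp*T_in + P  =>  T_out = T_in + P / k
    return t_in + power / k
```

With these assumed constants, a node drawing 1000 W at 0.25 m^3/s of airflow heats the air by roughly 3.3 K between inlet and outlet.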
Thermal Management
[Figure: Task Assignment -> Power Vector -> Temperature Distribution -> Cooling Cost; Computing Cost + Cooling Cost = Total Cost]
Different task assignments result in different power consumption distributions
Different power consumption distributions result in different temperature distributions
Different temperature distributions result in different total energy costs
Example
[Figure: inlet temperature distributions under Scheduling 1 and Scheduling 2, shown without cooling and after cooling has lowered the inlet temperatures below the 25C redline threshold. Different scheduling results in a different inlet temperature distribution, and therefore a different demand for cooling load/energy.]
Total Energy Cost of Datacenter
Total energy cost = computing energy cost + cooling energy cost
Constraint: keep the maximal inlet temperature below the redline temperature of the devices (25C)
Coefficient Of Performance (COP):
$\mathrm{COP} = \dfrac{\text{the amount of heat removed}}{\text{the energy consumed by the cooling device}}$
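As a worked illustration of how COP links the two cost terms, here is a minimal Python sketch. It assumes that at steady state all computing power becomes heat that the cooling system must remove, so cooling power equals computing power divided by COP. The function name and the example numbers are hypothetical.

```python
def total_energy_cost(computing_power, cop):
    """Total power draw (W): computing power plus the cooling power it implies."""
    if cop <= 0:
        raise ValueError("COP must be positive")
    # From COP = heat removed / energy consumed by the cooling device,
    # with heat removed assumed equal to the computing power.
    cooling_power = computing_power / cop
    return computing_power + cooling_power
```

Under this assumption a 10 kW computing load costs 15 kW in total at COP 2.0 and 12.5 kW at COP 4.0, so a higher COP directly lowers the total.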
Observation
Even with the same computing power dissipation, different temperature distributions may demand different cooling loads, resulting in different total energy costs
We can manipulate task scheduling to achieve the best temperature distribution and consequently minimize total energy cost
Uniform Outlet Profile
Why naive
  Based on observation and intuition
  No mathematical formalization
Uniform Outlet Profile (UOP)
  Assigning tasks so as to achieve a uniform outlet temperature distribution $T_c$
  Assigning more tasks to nodes with low inlet temperature (water-filling process)
[Figure: water-filling illustration. For each node, the inlet temperature plus the temperature rise due to power consumption reaches the common outlet level $T_c$.]
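The water-filling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: inlet temperatures are treated as fixed and known (recirculation effects of the assignment itself are ignored), every node has the same heat capacity rate `k` in W/K, and nodes have no power cap. The function and parameter names are hypothetical.

```python
def uniform_outlet_profile(inlet_temps, total_power, k):
    """Water-filling: return (common outlet temp Tc, per-node power list)."""
    ordered = sorted(inlet_temps)
    n = len(ordered)
    t_c = ordered[0]
    for active in range(1, n + 1):
        # With the `active` coolest nodes loaded, Tc solves
        #   sum_i k * (Tc - T_in_i) = total_power
        t_c = (total_power / k + sum(ordered[:active])) / active
        # Stop once the next-warmest node already sits at or above Tc.
        if active == n or t_c <= ordered[active]:
            break
    # Nodes whose inlet is already at or above Tc receive no load.
    powers = [max(0.0, k * (t_c - t)) for t in inlet_temps]
    return t_c, powers
```

For inlet temperatures of 20, 22 and 30 degrees, `k = 100` W/K and 600 W to place, the sketch loads only the two coolest nodes (400 W and 200 W) and levels their outlets at 24 degrees.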
Uniform Task
Uniform Task (UT)
  Assigning all chassis the same amount of tasks (power consumption)
  All nodes experience the same power consumption and temperature rise
Minimum Computing Energy
Minimum computing energy (cooling inlet)
  Assigning tasks in a way that keeps the number of active (powered-on) chassis as small as possible
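A minimal Python sketch of this packing policy follows. It assumes identical chassis with a fixed task capacity and fills them in list order; how the slides order the chassis is not stated, so that ordering, like the function name, is an assumption.

```python
def minimum_computing_energy(total_tasks, capacity, num_chassis):
    """Pack tasks into as few chassis as possible.

    Returns the number of tasks per chassis; chassis left at zero
    tasks can stay powered off.
    """
    if total_tasks > capacity * num_chassis:
        raise ValueError("not enough capacity for the requested tasks")
    assignment = []
    remaining = total_tasks
    for _ in range(num_chassis):
        # Fill the current chassis completely before touching the next one.
        load = min(capacity, remaining)
        assignment.append(load)
        remaining -= load
    return assignment
```

For 25 tasks on five chassis of capacity 10, this gives `[10, 10, 5, 0, 0]`, leaving two chassis off.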
Abstract Heat Flow Model
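[Figure: abstract heat flow among nodes N1, N2, N3 and the air conditioner (AC), showing the supply temperature T_sup, node inlet temperature T_in, node outlet temperature T_out, AC inlet temperature T_ACin, and the recirculation paths between node pairs.]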
Observation
  Airflow patterns are stable (confirmed through CFD simulation)
Hypothesis
  The amount of recirculated heat is stable and can be characterized
  Define $a_{ij}$ as the percentage of recirculated heat from node i to node j
Cross Interference among Server Nodes
Cross Interference Coefficients (CIC)
  Define $a_{ij}$ as the percentage of recirculated heat from node i to node j
Cross Interference Matrix
  Correlations among power consumption (utilization rate), temperature, and cross interference
$$K = \begin{bmatrix} \rho f_1 C_p & 0 & \cdots & 0 \\ 0 & \rho f_2 C_p & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \rho f_n C_p \end{bmatrix}$$
Fast Thermal Evaluation
Use profiling process to calculate cross interference coefficients
[Figure: Temperature Prediction. A configuration of the distributed system can be evaluated either by numerical simulation (hours) or by fast thermal evaluation (real time); both lead to the thermal performance evaluation.]
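Once the cross interference coefficients are known, predicting temperatures reduces to solving a small linear system. The sketch below is one way to do that in plain Python, not necessarily the authors' formulation. It assumes that the heat entering node i is the recirculated share `a[j][i]` of every node j's outlet heat, with the remainder of its airflow arriving at the supply temperature, and that each node obeys Q_out = Q_in + P. All names (`solve_linear`, `predict_temperatures`, `k` for the per-node rho * f * C_p values) are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def predict_temperatures(a, k, power, t_sup):
    """Return (inlet temps, outlet temps) for a given power vector.

    a[i][j]: fraction of node i's outlet heat recirculated to node j
    k[i]:    rho * f_i * C_p for node i, in W/K
    power:   per-node power consumption in W
    t_sup:   air conditioner supply temperature in deg C
    """
    n = len(power)
    # Under the stated mixing assumption: (K - A^T K) (T_out - T_sup) = P
    m = [[(k[i] if i == j else 0.0) - a[j][i] * k[j] for j in range(n)]
         for i in range(n)]
    rise = solve_linear(m, power)
    t_out = [t_sup + r for r in rise]
    # Energy conservation per node: T_in = T_out - P_i / K_i
    t_in = [t_out[i] - power[i] / k[i] for i in range(n)]
    return t_in, t_out
```

Because only a linear solve is involved, evaluating a new power vector takes a negligible fraction of the time of a CFD run, which is what makes this kind of model usable online.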
Formalizing optimization problem
To minimize cooling energy cost, we only need to minimize the maximal inlet temperature
The optimization problem, formalized on the abstract heat flow model, can be converted into an LP, ILP, or other linear or nonlinear problem, depending on the models and policies used
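To make the objective concrete, the sketch below searches exhaustively over integer task assignments and keeps the one with the lowest peak inlet temperature. It is a brute-force stand-in for the LP/ILP formulations mentioned above, practical only for tiny instances. It assumes inlet temperatures respond linearly to node power through a matrix `d` (so T_in = T_sup + d * p) and that power grows linearly with the number of tasks; `d`, the function names, and all numbers are hypothetical.

```python
from itertools import product


def peak_inlet(d, power, t_sup):
    """Maximal inlet temperature for a power vector, with T_in = T_sup + d * p."""
    n = len(power)
    return max(t_sup + sum(d[i][j] * power[j] for j in range(n))
               for i in range(n))


def best_assignment(d, t_sup, total_tasks, capacity, power_per_task):
    """Return (peak inlet temp, tasks per node) minimizing the peak inlet temp."""
    n = len(d)
    best = None
    for tasks in product(range(capacity + 1), repeat=n):
        if sum(tasks) != total_tasks:
            continue  # every task must be placed exactly once
        power = [power_per_task * c for c in tasks]
        peak = peak_inlet(d, power, t_sup)
        if best is None or peak < best[0]:
            best = (peak, tasks)
    return best
```

With `d = [[0.001, 0.002], [0.003, 0.001]]`, four tasks of 100 W each and a 15 degree supply, the search picks the split (1, 3): it beats both the uniform split (2, 2) and packing everything onto one node, because node 0's heat strongly affects node 1's inlet.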
Simulation Environment
Two-row datacenter
Ten standard 42U racks
Each rack has five Dell 1855 blade servers
CFD simulation is used to evaluate temperature distribution (Flovent from Flomerics)
DataCenter model
[Figure: datacenter model with node positions labeled (Node 1, Node 2, Node 5, Node 25, Node 30, Node 50).]
Cross Interference Coefficients
Consistent with what is observed in real datacenters
Strong interference to neighboring nodes
Fast Thermal Evaluation Results
Provides fast and accurate temperature prediction
Practical for online real-time thermal management
Simulation Results: Analysis & Summary
XInt consistently outperforms all other scheduling algorithms
Compared with MinHR, XInt is more practicable
  Task-oriented scheduling vs. power-oriented scheduling
  Online, real-time
XInt is mathematically formalized
Future Works
Integrating with cluster management software platforms (Moab, Torque, etc.)
Considering task priorities and time constraints
Related Works
ConSil vs. Fast Thermal Evaluation
  Deduction vs. prediction
  Current vs. future: which is more important for proactive and preventive thermal management?
MinHR vs. XInt
  Both characterize recirculation at similar granularities
  Aggregated effects vs. point to point
  Offline vs. online
  Power oriented vs. task oriented
Supply Heat Index (SHI)
Only roughly characterizes recirculation
Cannot differentiate between temperature distributions that have the same SHI