Cost Aware Fault Recovery in Clouds (IM 2013)
Post on 08-Aug-2018
-
8/22/2019 Cost Aware Fault Recovery in Clouds (IM 2013)
COST AWARE FAULT RECOVERY IN CLOUDS
Assaf Israel, Danny Raz
Technion - Israel Institute of Technology
-
FAULTS IN DATACENTERS
We've come a long way in terms of server resilience
Enterprise gra
Component A
Compute (CPU, RAM, Fans, Net) ~
Storage ~
-
FAULTS IN DATACENTERS
Typical first year of a new 1,800-server cluster @ Google:
~thousands of hard drive failures
~1000 individual machine failures
~3 router failures (have to immediately pull traffic for an hour)
~5 racks go wonky (40-80 machines see 50% packet loss)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get ba
~1 network rewiring (~5% of machines down over 2-day span)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to com
~0.5 overheating (power down most machines in
-
FAULTS IN DATACENTERS
Other factors also contribute to lack of resilience
Distribution of service disruption even
The Datacenter as a Computer (200
-
RECOVERY
Most of the time we would like to recover as quickly as possible
Single host recovery may take advantage of vacant re
-
RECOVERY
Larger failures (Racks, Network segments, Power regions
May require powering more machines
-
RECOVERY COST
Recovery Cost = Service Degradation + Backup Infrastructure
Can be formally expressed in terms of:
Service deg. cost of a task when recovered at a given host
Infrastructure cost of a backup host
0/1 Decision vectors
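A minimal sketch of how such a cost could be evaluated for one recovery decision. The symbols on the slide did not survive extraction, so the names below (`service_cost`, `infra_cost`, `assign`, `active`) and the sample numbers are assumptions made for illustration, not the authors' notation or data.

```python
from typing import Dict, Set


def recovery_cost(
    service_cost: Dict[str, Dict[str, float]],  # service_cost[task][host]
    infra_cost: Dict[str, float],               # infra_cost[host]
    assign: Dict[str, str],                     # 0/1 assignment, stored as task -> host
    active: Set[str],                           # 0/1 activation, stored as a set of hosts
) -> float:
    """Total recovery cost = infrastructure cost + service degradation cost."""
    # A task may only be recovered at a backup host that is switched on.
    for task, host in assign.items():
        if host not in active:
            raise ValueError(f"task {task!r} assigned to inactive host {host!r}")
    infra = sum(infra_cost[h] for h in active)
    service = sum(service_cost[t][h] for t, h in assign.items())
    return infra + service


# Hypothetical instance: two failed tasks, two backup hosts.
service_cost = {"web": {"b1": 5.0, "b2": 1.0}, "batch": {"b1": 2.0, "b2": 2.0}}
infra_cost = {"b1": 3.0, "b2": 4.0}
total = recovery_cost(service_cost, infra_cost, {"web": "b2", "batch": "b2"}, {"b2"})
print(total)
```

Storing the decision vectors as a mapping and a set is only a convenience; a 0/1 matrix and vector would work equally well.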
-
RECOVERY COST
Service degradation depends on:
Task setup/initialization
Host setup/initialization
Network configuration (if recovered to a different network segment)
Storage mapping
Storage migration (if recovered to a different SAN)
Software patches
Integrity checks
Manual host configuration
Recovery target location (latency/bandwidth)
-
RECOVERY COST
Pre-planning can help reduce recovery cost
Activating additional backup infrastructure:
Can help lower some of the service degradation costs
At the expense of additional maintenance costs
-
OBSERVATION
Not all tasks are equal (from tight SLA to relaxed SLA):
Interactive & vital monitoring
High-priority non-interactive
Non-interactive user-facing
Batch
Housekeeping tasks
Some are more susceptible to long downtimes than others
Web-sc W. Cirne
-
GOAL
We would like to recover expensive tasks first and faster
Balance service degradation and infrastructure costs
-
GOAL
Formal: Minimize the total recovery cost
(infrastructure costs + service degradation costs)
under some packing constraints
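One way this goal could be written as an integer program is sketched below. The slide's own notation was lost, so every symbol here is an assumption: $T$ is the set of failed tasks, $B$ the set of backup hosts, $S_{i,j}$ the service degradation cost of task $i$ when recovered at host $j$, $I_j$ the infrastructure cost of host $j$, $r_{i,j}$ the resource demand of task $i$ on host $j$, $c_j$ the capacity of host $j$, and $x_{i,j}, y_j$ the 0/1 decision variables. The single capacity constraint stands in for "some packing constraints"; the talk may use a different set.

```latex
\begin{align*}
\min \quad & \sum_{j \in B} I_j \, y_j \;+\; \sum_{i \in T} \sum_{j \in B} S_{i,j} \, x_{i,j} \\
\text{s.t.} \quad & \sum_{j \in B} x_{i,j} = 1 && \forall i \in T \\
& \sum_{i \in T} r_{i,j} \, x_{i,j} \le c_j \, y_j && \forall j \in B \\
& x_{i,j} \in \{0,1\}, \quad y_j \in \{0,1\} && \forall i \in T,\ j \in B
\end{align*}
```

The first term is the infrastructure cost, the second the service degradation cost; multiplying the capacity by $y_j$ forces a host to be activated before any task is placed on it.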
-
APPROXIMATION - OVERVIEW
Integer Program
LP Relaxation
Linear Transformations || Light Graphs
Cycle Breaking
Activation
Rounding
Approximation bounds: Cost 1 Load 6
-
IF WE HAD MORE INFO
If we knew which of the backup hosts are active, we could approximate the service degradation costs
-
MINIMUM GENERAL ASSIGNMENT PROBLEM
Bins, Items
Each item has a size, which depends on the target bin
Each item has a cost, which depends on the target bin
Goal: Pack all items into bins at minimum cost, under packing constraints
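To make the problem statement concrete, here is a brute-force sketch that enumerates every item-to-bin mapping on a tiny instance. It is exponential and only meant to illustrate the definition; it is not one of the approximation algorithms discussed in the talk. The function name, the single capacity per bin, and the sample numbers are all assumptions.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple


def min_gap_brute_force(
    items: List[str],
    bins: List[str],
    size: Dict[Tuple[str, str], float],  # size[(item, bin)] depends on the target bin
    cost: Dict[Tuple[str, str], float],  # cost[(item, bin)] depends on the target bin
    capacity: Dict[str, float],          # capacity[bin], the packing constraint
) -> Tuple[Optional[Dict[str, str]], float]:
    """Try every item -> bin mapping and keep the cheapest one that fits."""
    best_plan, best_cost = None, float("inf")
    for choice in product(bins, repeat=len(items)):
        load = {b: 0.0 for b in bins}
        total = 0.0
        for item, b in zip(items, choice):
            load[b] += size[(item, b)]
            total += cost[(item, b)]
        fits = all(load[b] <= capacity[b] for b in bins)
        if fits and total < best_cost:
            best_plan, best_cost = dict(zip(items, choice)), total
    return best_plan, best_cost


# Hypothetical instance: three items (tasks), two bins (hosts).
items = ["t1", "t2", "t3"]
bins = ["h1", "h2"]
size = {("t1", "h1"): 2, ("t1", "h2"): 3,
        ("t2", "h1"): 2, ("t2", "h2"): 1,
        ("t3", "h1"): 2, ("t3", "h2"): 2}
cost = {("t1", "h1"): 1, ("t1", "h2"): 4,
        ("t2", "h1"): 1, ("t2", "h2"): 3,
        ("t3", "h1"): 1, ("t3", "h2"): 2}
capacity = {"h1": 4, "h2": 3}
plan, total = min_gap_brute_force(items, bins, size, cost, capacity)
print(plan, total)
```

In this instance the cheap bin `h1` only has room for two items, so the cheapest feasible packing sends the item that is least expensive to move (`t3`) to `h2`.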
-
MIN-GAP
Has been studied extensively
Known results:
LP-Based 2-Approx. (Shmoys and Tardos, 1993)
LP-Based -Approx. (Fleischer, Goemans, Mirrokni and Svir
Local Ratio-Based 2 -Approx. (Cohen, Katzir and Raz, 2006)
-
LOCAL SEARCH
Iteratively find the next backup machine to activate
Stop when there's no improvement in recovery costs
Base cost - All backups are inactive
Next activation - Find the backup host to activate for which the recovery cost is minimal
Stop condition - If the cost does not improve, return the last RP; else activate the host and repeat
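The loop described on this slide can be sketched as follows. The slide's procedure for pricing a candidate activation is truncated, so this sketch swaps in a simple greedy placement as the inner solver. That choice, the per-host `slots` capacity, the per-task `penalty` for staying down, and all names and numbers are assumptions for illustration only.

```python
from typing import Dict, List, Set, Tuple


def plan_cost(tasks, active, service_cost, infra_cost, slots, penalty) -> float:
    """Price a set of active backups with a greedy placement (assumed inner solver).

    Tasks are handled most-penalised first; each goes to the cheapest active
    host with a free slot, or pays its penalty if no slot is left.
    """
    free = {h: slots[h] for h in active}
    total = sum(infra_cost[h] for h in active)
    for t in sorted(tasks, key=lambda t: -penalty[t]):
        options = [h for h in sorted(active) if free[h] > 0]
        if options:
            host = min(options, key=lambda h: service_cost[t][h])
            free[host] -= 1
            total += service_cost[t][host]
        else:
            total += penalty[t]
    return total


def local_search(tasks, backups, service_cost, infra_cost, slots, penalty) -> Tuple[Set[str], float]:
    active: Set[str] = set()
    # Base cost: all backups are inactive.
    best = plan_cost(tasks, active, service_cost, infra_cost, slots, penalty)
    while len(active) < len(backups):
        # Next activation: the inactive backup that yields the lowest cost.
        scored = [
            (plan_cost(tasks, active | {h}, service_cost, infra_cost, slots, penalty), h)
            for h in backups if h not in active
        ]
        cost, host = min(scored)
        if cost >= best:  # stop condition: no improvement
            break
        active.add(host)
        best = cost
    return active, best


# Hypothetical instance: three failed tasks, two backup hosts with two slots each.
tasks: List[str] = ["web", "db", "batch"]
penalty: Dict[str, float] = {"web": 10, "db": 8, "batch": 1}
backups = ["b1", "b2"]
slots = {"b1": 2, "b2": 2}
infra_cost = {"b1": 3, "b2": 3}
service_cost = {"web": {"b1": 1, "b2": 2},
                "db": {"b1": 1, "b2": 1},
                "batch": {"b1": 1, "b2": 1}}
active, best = local_search(tasks, backups, service_cost, infra_cost, slots, penalty)
print(active, best)
```

In this instance the search powers on one backup for the two expensive tasks and stops, because activating a second host just to recover the cheap batch task would cost more than leaving it down.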
-
SIMULATIONS
Based on data from IBM Research Compute Cloud (RC
Several hundred hosts, with a few thousand VMs
4 host configurations, 3 VM configurations
EC2-like SLA policies (higher availability guarantees, at higher rates)
-
RECOVERY COST BY RACK SIZE
[Figure: Normalized Recovery Cost by Rack size. x-axis: Rack size (#hosts/rack), values 1, 2, 3, 6, 10, 17, 34; y-axis: Cost [%], 0 to 1]
-
RECOVERY COST BY VM SLA DISTRIBUTION
[Figure: 2 host racks - Total & Service costs. x-axis: SLA Distribution (0 to 0.5); y-axis: Cost (0 to 250000). Legend entries (truncated): Active, Inactiv, Local, each paired with Servic]
20% - Cheap to recover
80% - Expensive to recover
-
CONCLUSION
Large scale infrastructure mandates fault tolerance techniques
Pre-planning can help reduce recovery cost
Classifying tasks by SLAs can improve overall recovery cost
LP-Based Load/Cost Approximation with guaranteed performance
Local Search heuristic with good practical performance
-
THANK YOU !
Questions ?