aws ec2 spot instances for mission critical services

AWS EC2 Spot Instances for Mission CriticalServices

Jerry Danysz, Vıctor del Rosal, Horacio Gonzalez–VelezCloud Competency Centre, National College of Ireland

E: [email protected], {Victor.DelRosal,horacio}@ncirl.ie

KEYWORDS

AWS; spot instance; utility computing; elasticprovisioning; SLA; random forest regression;cloud computing

ABSTRACT

For over a decade now, Amazon Web Services(AWS) has offered its spare capacity at a discoun-ted price in the form of EC2 spot instances. Thisdiscount comes at the price of variable pricingand sudden instance termination. In this paper,we present a machine-learning solution to one ofthe challenges when using AWS Spot Instances,namely the termination of the instance on shortnotice. Our system, Spot Instance ManagementSystem (SimS), can effectively manage spot in-stances and keep up the availability at the de-sired level using 100-tree Random Forest Regres-sion model. By using a risk assessment mechan-ism and proactive actions, SimS assures a three-nines SLA using AWS spot instances with lowerrunning costs on workloads for a major Europeanfinancial institution.

INTRODUCTION

AWS EC2 spot instances have offered asignificant discount–up to 90% according toAWS–compared to their dedicated and on-demand/reserved models for over a decade. Oneof the base assumptions behind spot instances isthat price is dynamic and can change anytimebased on available capacity and current demandfor the type of instance. Consequently, the at-tractive pricing comes with the trade-offs relatedto the availability of the spare capacity on a givenregion at a given period in time.

The use of spot instances requires customersto set a maximum instance price (per hour) theywish to pay and also to understand the risk, e.g.if the price of a given instance changes abovethe maximum price, the instance could be sud-denly terminated and all data lost if not savedelsewhere. Until 2015, the instance terminationwas executed without prior notification to thecustomer. Since then, a different type of ter-mination behaviours is offered, ranging from de-fault termination through stopping the instance

to hibernation of the instance. Nonetheless, theusage of spot instances, even if cost-effective, isnormally circumscribed to applications that areeither non-critical or designed for interruptions.

This paper presents the Spot Instance Man-agement System (SimS) whose main goal is tocounteract the sudden/unexpected termination ofcloud services running on top of EC2 spot in-stances. SimS employs a parallelised RandomForest Regression [5], [6] model for continuousresponse with 100 trees to predict price fluctu-ations. In this case, rather the regression modelallows systems to swiftly detect when price wouldchange and calculate the risk of the instance ter-mination, and then consider the appropriate ac-tions to maintain a 99.9% (a.k.a three nines) Ser-vice Level Agreement (SLA).

The actions could be live migration to anotheravailability zone or redeployment of spot instancewith a higher maximum bid. It is worth to noticethat all actions are done proactively, while therisky instance is still running, therefore minim-ising the potential downtime of any service usingEC2 spot instances.

RELATED WORK

Spot Price Prediction (SPP) has been a topicof research since the introduction of AWS spotinstances in 2009. Yehuda et al. [1] predict thespot price by deconstruction and reverse engin-eering of a hypothetical spot instance algorithmthat, despite common assumptions on spot prices,does not necessarily lead to supply- and demand-driven fluctuations but alternatively where pricesare randomly generated from dynamic hidden re-serve pricing. To confirm their hypothesis, theyanalyse the spot market and divided it into threepricing epochs with each epoch change at the sig-nificant change of SPP pricing models. They ac-knowledge that there is a market element in theprice, but prices are still driven from the hiddenreserved price.

In contrast, Singh and Dutta [13] seem notto fully concur with Yehuda et al.’s conclusiongiven that their model for dynamic price pre-diction is accounting for global market trendsand local seasonality. The dynamic price predic-tion model is presenting two types of predictions:

Communications of the ECMS, Volume 34, Issue 1, Proceedings, ©ECMS Mike Steglich, Christian Mueller, Gaby Neumann, Mathias Walther (Editors) ISBN: 978-3-937436-68-5/978-3-937436-69-2(CD) ISSN 2522-2414

short-term (hourly based) and long-term (a weekahead) based on analysis of 9 months of histor-ical data for the top ten most used spot instances.They present the prediction result with an aver-age 9.4% prediction error in short-term predictionand 20% in long-term (five days and more) priceprediction.

Accurate price prediction for cloud instancestypically relies on assertive workload quantifica-tion, which is related to the application type, e.g.HPC [11], [12] or to extrinsic factors such as sea-sonality or social-demographic factors [14].

Consequently, AWS spot price prediction hasbeen previously modelled using a moving simula-tion model to create an artificial neural network-based algorithm for price prediction [16]. It em-ploys historical data available for only mediumsize instances in a period of 7 months to train theMLP model, resulting in 4% prediction error inshort-term prediction (hourly prediction) on av-erage for medium size spot instances. This leadsto the conclusion that neural network models arewell suited for the prediction of price changes ofspot instances. Zhao et al. [17] follow a differ-ent prediction approach by using a time-basedseries forecasting method, ARIMA Model (Auto-Regressive Integrated Moving Average) that islighter compared to machine learning techniqueslike neural networking. Other approaches [13]have added a seasonal component to their AR-IMA model effectively changing it to the SAR-IMA model. SARIMA has been used to analysefive months of historical data and create a predic-tion that is close to the average price for a periodof 48 hours.

Random forest regression using historic AWStraces has been recently reported in the liter-ature [8]. Similar approaches have previouslyused Support Vector Poly Kernel Regression(SMOReg), Gaussian Process and Linear Re-gression [3], and multiple discrete-time model-ling [7]. All these models have been trained forthe month with 12 months of historical data forthe three most used types of instances. Their pre-dictions have been typically generated for short(next hour), medium (half day) and Long (nextday) periods. Per the conclusion, the neuralnetwork-based algorithms are performing betterthan others for medium (half day predictions)where SMOReg is better suited for predictionswith highly variable months, and random forestregressions seem to deal better with workloadvariability.

From the above analysis, we can see that thereare different heuristic approaches to prediction ofspot instance pricing, ranging from reverse en-gineering and understanding what principles arebehind the price level, through the classical stat-istical approach to the most modern use of artifi-

cial intelligence and machine learning techniques.For this research, we want to prove that MLModels, specifically Random Forest Regression,in combination with business-driven automationcan achieve a 99.9% SLA whilst accurately pre-dicting price and potential price variation. Ourapproach has been successfully evaluated to sup-port mission-critical cloud workloads from a ma-jor European financial institution.

*

AWS Spot bidding strategies overview

Previous research has attempted to find thebest strategy to find a golden bid to assure spotinstance availability. Andrzejak et al. [2], ques-tion how bidding may be conducted with stricttarget dates or SLAs, focusing their research onbidding strategies with that goal. Li et al. [10],classify common bidding approaches into threetypes:• White box approach where bidding strategiesare taking into account interactions between dif-ferent market participants and effectively biddingcan influence a spot price.• Grey box approach has more individual bidingstrategies wherein, contrary to the white-box ap-proach, market interactions are not taken into ac-count, but strategies are focusing mostly on work-load, cost and availability of the resources.• Black box approach, consisting of the mostcommon strategies which derives bidding fromhistorical spot pricing data and do not focus somuch on workload, cost and availability, nor oninteractions between market participants.

The five most classic strategies in the black boxapproach have been discussed by Li et al. [10] andby Voorsluys and Buyya [15]:1. The minimum price, where the bid is based onhistorical minimum spot price2. Mean, the bid price is set as the mean of allvalues of the historical spot price3. High, the bid price on the maximum price ob-served in historical data4. Current, the bid price is set as the value ofcurrent spot price5. On-demand is the bid price equal to the on-demand price of the instance.

The above five strategies have also been incor-porated to the solution for reliable provisioningof spot instances by Voorsluys and Buyya [15],combining all five strategies with fault tolerancetechniques like migration to assure the most reli-able solution for the limitations of spot instances.A survey on spot pricing by Kumar et al. [9]presents four bidding strategies: bidding on nearto reserve price; bidding on above the averageprice calculated from the historical data; biddingclose to the on-demand price; and, bidding overthe on-demand price. Each of the aforementioned

LOW MEDIUM HIGH

Availability DIntegrity D

Confidentiality DTABLE I: Business Application Critically Mat-rix

techniques has its own benefits particularly forthe final consumer, but also its costs in terms ofthe actual deployment and the business interestof the cloud provider.

In contrast to other attempts to use machinelearning for AWS spot price prediction [4], the au-thors researched the possibility of maintaining theagreed service level of 99.9% that is common fora business-critical application while savings costsby running those applications on EC2 Instances.The aim was to test how EC2 spot instances canbe used to reliably host mission-critical applica-tions. Underpinned by a parallel Random ForestRegression model, we have employed a real applic-ation scenario borrowed from a major EuropeanFinancial Institution to validate our findings.

METHODOLOGY

• Availability: impact if information availabilityis affected.• Integrity: impact if information integrity is af-fected.• Confidentiality: information confidentialitylevel.

The assumption is a scenario where the ITService Provider is responsible for providingbusiness-critical batch processing and ERP ap-plication for a European Financial Institutionwith agreed Availability of the service on thelevel of 99.9%. The IT Service Provider andthe European Financial Institution have agreedupon a project to run the batch processing ap-plication using only AWS Spot Instances for thecost-effectiveness. The criticality matrix shownin Table I determines the application criticalitylevel of Availability, Integrity, and Confidential-ity.

As seen above, the application is highly criticalto the core business of the European FinancialInstitution. If the availability of the system and,therefore, the application is compromised, the in-stitution’s ability to operate properly is degraded,potentially leading to significant profit loss andreputational damage. Compromise of informa-tion integrity may lead to substantial profit lossesas well as posing a risk to confidentiality.

Since the system holds sizeable amounts ofconfidential information subject to General Data

Protection Regulation (GDPR), in case of a con-fidentiality breach, the Institution could face legalactions based on GDPR provisions as well aswidespread loss of customer trust, negatively im-pacting the ability to conduct business.

Based on the criticality matrix, the followingservice level requirements are presented:

• Application Criticality: High• AWS Region: eu-central-1• Cumulative Downtime including maintenance3.65 days per year• Mean Time to Respond: 2 minutes• Mean Time to Resolve: 15 minutes

Based on the above, in the Service Level Agree-ment (SLA), the service level has been agreed atthree nines (99.9%). The service Level has beenmeasured via the monitoring system site24x7 1,which calculates system availability using HTTPresponse codes and general host response.

We have deployed a Python data analysis mod-ule based on random forest regression model us-ing 100 trees, parallelised using 4 instances ata time. To save computing time and costs as-sociated with it, we have decided to limit dataanalysis to only the EU-CENTRAL-1 region andthree selected types of instances c5.xlarge,

t3.micro and t3.medium.The Spot Instance Management System

(SiMS) has been developed composed of fourmain modules:

• The Data Collection Module: responsible fordownloading and aggregating historical data forEC2 Spot prices.• The Data Pump Module: responsible for mov-ing collected data from S3 Landing zone Bucketto S3 Staging Zone bucket after data collection,where later this data is used by Data AnalysisModule for training the machine learning al-gorithm.• The Data Analysis Module: responsible foranalysing historical data gathered by the datacollection module and then is responsible for ap-plying the machine learning model based on ran-dom forest regression. Execution is started viaAmazon Lambda function that is responsible forstarting the AWS Fargate Task(Figure 1).• The Risk Assessment and Automation module:responsible for risk analysis of the instance inter-ruption in each of availability zones in the con-figured region and next for taking the appropriateactions concerning the level of the risk(Figure 2).

The prime idea behind the system is to haveautomated mechanism that with use of machinelearning, can predict the price of the spot instancein the next hour and act accordingly by migratingthe affected spot instance to the next availabilityzone. In case when all availability zones would

1https://www.site24x7.com

https://www.site24x7.com

Fig. 1: The Data Analysis Module

Fig. 2: The Risk Automation Module

Fig. 3: Instance Migration Process

be labelled as High Risk, the system will redeploythe spot instance with a new bid price increasedby 20%.

In the worst-case scenario, where there wouldbe no spare capacity to spin the machine, thesystem would run the on-demand machine onlywhen capacity is unavailable.

The risk engine is the part of the risk and auto-mation module responsible for the calculation ofthe risk of instance interruption in each of theavailability zones. The risk is calculated based onthe maximum bid price, current price, predictedprice and the threshold for each of the risks levelLow Lr, Medium Mr, and High Hr.

The risk is calculated for each of the availabilityzones in a given region with the following formu-lae using Current price (Cp), Maximum bid price(Mb), and Prediction price(Pp) :

Lr = Cp <

(55

100x Mb

)AND (Pp < Mb)

Mr = Cp >

(55

100x Mb

)AND Cp <

(8

10x Mb

)AND (Pp < Mb)

Hr = Cp >

(8

10x Mb

)AND (Pp < Mb)

In case prediction price (Pp) is larger than themaximum bid price, risk is calculated as High.

If current risk is Medium and there is at leastone availability zone with low-risk assessmentthan migration operation is starting to that zone.If there is no low zone, there is no action taken. Ifcurrent risk is high and low or medium zones areavailable, then the migration process will movethe instance to the lowest risk zone. If the risk ishigh in all availability zones, the machine is re-deployed by the automation engine to the sameavailability zone but with a 20% higher maximumbid price. The migration process is shown in Fig-ure 3.

EVALUATION

For simulation purposes, a Python script hasbeen developed to execute the following steps:• Change the predicted price for the next fullhour to one of the values selected randomlyon every execution (on-demand price, predictedprice, maximum bid price, current price, mean ofall above).• Execute Risk Recalculation.

After risk is recalculated, we would relay onthe automation module to evaluate and executenecessary actions against the SimS Managed in-stance. For the instance that is not managed (the

Fig. 4: Root Cause Analysis Report from Mon-itoring System

Fig. 5: Outage reports of control system

control instance), the script will execute the fol-lowing steps: evaluate if the maximum bid priceis lower than the predicted price and If the pre-dicted price is higher, terminate the instance after2 minutes. The authors have also included a sim-ulation of an engineer acting on a given incidentcreated by the monitoring system. A manual in-tervention required to bring the system back on-line has been calculated to take approximatelyfour minutes and forty five seconds.

The data collection bucket contains landingzone, staging store and triggers, where triggerfiles at the end of the execution of data collec-tion and data pump are stored.

To gather availability data, it was necessary tofind a monitoring system that would be independ-ent of the solution and would provide SLA-typereports and SLA configuration. The authors op-ted for the Site24x7 SaaS offering and configuredit for website monitoring, checking connectivityto HTTP ports of defined targets, response times,DNS response time and general availability byping, in one minute intervals.

During the migration execution, the followingoutputs are generated:

• New Temporary Amazon Image (AMI) createdbased on the current instance• New Instance deployed to target availabilityzone based on the created image

Fig. 6: Cloud Watch Log representing No Ac-tions

Fig. 7: Redeployment of Instance with new BidPrice

• Old Instance Data removal from DynamoDBInstances Table.

As part of the interruption of Control System,we have the following:

• Scenario: Classic Termination of EC2 Spot In-stance• Actors: Simulator• Desired Outcome: System Terminated and re-stored

The simulation script to terminate EC2 in-stances operates with 2 minute delays, simulat-ing the Amazon Notification grace period. Italso later simulates the system engineer’s inputby restoring the service.

The control system during the experiment hasbeen terminated a number of times causing re-ported outage (Figure 5) and unavailability of theapplication.

SimS Automated Actions low Risk

• Scenario: risk Level low low low• Actors: SimS System• Desired Outcome: no migration initiated

Expected behaviour, if all availability zones arelow risk, then no action is taken (Figure6). Aspresented above, the automation module evalu-ated the risk and did not perform any actions.

Fig. 8: Response time monitoring of the System

Fig. 9: No outage reported by the system. Mon-itoring system did not report system outage, theonly indication of migration is a short spike inresponse time.

SimS Automated Actions High Risk

• Scenario: Risk Level high high high• Actors: SimS System• Desired Outcome: instance redeployed withhigher bid price, no system downtime reportedby monitoring system

During the execution, the Automation Modulewill evaluate risk in all availability zone and basedon the result will move the instance to anotheravailability zone or redeploy to current with thehigher bid price.

In this scenario, The Sims System detectedthat all availability zones in the region are highrisk. In this situation, the system is designed toredeploy the instance with a higher bid price tomitigate the high risk of interruption and there-fore to maintain desired availability level by set-ting up (migrating new instance) and reassigningthe elastic IP from Risky instance to newly de-ployed one, therefore allowing traffic to reach thesystem without any issue.

5-Day SLA Monitoring

• Scenario: 5-Day SLA Monitoring• Desired Outcome: SLA Level of 99.9%

A simulation script has been scheduled andrunning in four hour intervals affecting the classicEC2 Spot instance as well as risk analysis data forthe SimS System. Due to random price selection,the exact time and date of the next interruptionwere not known.

Results show that usage of a proactive systemmanaging spot instances can prove to be valuableif our main goals are low costs and availability ofthe system using the spot instances.

We can see that if automation is designed to actwhile a risky instance is still running, automatedswitch-over is almost seamless but at the price

Fig. 10: 5-Day SLA Report for Control System

Fig. 11: 5-Day availability report for ControlSystem

Fig. 12: 5-Day SLA Report for SiMS ManagedSystem

of a drop in performance during the migration ofthe instance.

For applications that are very response time-sensitive this could be still the issue, as well asfor the application where data is written continu-ously as during the migration the checkpoint (im-age) of the machine is created and any data writ-ten in the time between image creation and switchover of the traffic to the newly deployed machinewould be lost.

The question in here would be the trade-off, incase of classic spot instance the customer can face

Fig. 13: 5-Day availability report for SimS Man-aged System

unexpected termination and if they do not haveany solution that would periodically save the datafrom that instance while running they could facecomplete data loss in comparing to few secondsof loss in case of SimS Managed System, there-fore automated system as one presented in thisresearch can significantly expand possibilities ofthe application of spot instances as well as canhelp with reduction of operational costs.

The Spot Instance Management system for itsrisk analysis is using Random Forest regression-based machine learning model, while this modelduring our research proved sufficient, the modelitself was not a part of in-depth testing and there-fore could not be as accurate as it could behoped. Necessary comparison testing showed usthat price predicted are the same as current pricesor difference in price is minimal.

Our prediction machine learning model re-quires a lot of computing power, to calculate pricepredictions for 150 types of EC2 spot instancesfor every availability zone in EU region would re-quire over 26 hours running in 4CPU docker con-tainers provided by AWS. Even with those lim-

itations, it has been proved that smart, proact-ive system that can move around spot instancesbased on the predicted price and therefore riskcalculated from it, can be effective in achievingnot only 99.9% availability but even 100%

The authors’ research question poses how canwe assure SLA Level of 99.9% for service runningon EC2 Spot instance, as mentioned above andshown in the evidence, the SimS System is ableto effectively manage spot instances and keep upthe availability on the desired level and achieverequired three nines SLA thanks to its Risk as-sessment mechanism and proactive actions beforeany terminate can happen.

With automatic bid rise in case of high riskin all availability zone at the current stage, thiscould lead to higher than desired bid price andtherefore not always attain cost-effectiveness.

CONCLUSIONS

The research question poses how a service levelof 99.9% can be assured. In the paper, the au-thors show that, in principle, Spot instance Man-agement system (SimS) can effectively managespot instances to keep an availability level of99.9%

For a system like SimS to be effective and ac-curate, there is a requirement for a reliable ma-chine learning module to predict spot price in thenext hour. In hourly five-day tests, we have sim-ulated a number of interruptions of classic EC2spot instances, and we combined this with thechange of the risk level in availability zones toforce the SimS system to act and migrate affectedmachine if are located in a high risk availabilityzone. Migration is using image creation (check-pointing).

Currently, due to the limitation in computingpower and the aim to run the system as serverlessas possible, the support for more than just a sub-set of instances is limited by the computationalpower of AWS Fargate docker containers.

Future researchers could take this solution astep further by selecting more sophisticated ma-chine learning models that would take into ac-count not only the historical data but also sea-sonality, allowing support for a more significantnumber of instances and availability zones.

In our research, we have presented the theoret-ical application of the SiMS system in the finan-cial sector to run important CRM application. Asan overall conclusion, the authors posit that withthe minor adjustments mentioned in this chapter,the solution could be of commercial value in avariety of sectors where service availability, aswell as low cost, play a pivotal role.

References

[1] O. Agmon Ben-Yehuda, M. Ben-Yehuda, A. Schuster,and D. Tsafrir. Deconstructing Amazon EC2 spotinstance pricing. ACM Transactions on Economicsand Computation, 1(3):16, 2013.

[2] A. Andrzejak, D. Kondo, and S. Yi. Decision modelfor cloud computing under SLA constraints. In MAS-COTS ’10, pages 257–266, Miami, Aug. 2010.

[3] S. Arevalos, F. Lopez-Pires, and B. Baran. A com-parative evaluation of algorithms for auction-basedcloud pricing prediction. In IC2E, pages 99–108, Ber-lin, Apr. 2016. IEEE.

[4] M. Baughman, C. Haas, R. Wolski, I. Foster, andK. Chard. Predicting Amazon Spot Prices withLSTM Networks. In ScienceCloud’18, Tempe, June2018. ACM.

[5] L. Breiman. Random forests. Machine Learning,45:5–32, 2001.

[6] A. Cutler, D. R. Cutler, and J. R. Stevens. Randomforests. In C. Zhang and Y. Ma, editors, EnsembleMachine Learning, pages 157–175. Springer, Boston,2012.

[7] R. Keller, L. Hafner, T. Sachs, and G. Fridgen.Scheduling flexible demand in cloud computing spotmarkets: A real options approach. Business & In-formation Systems Engineering, 62:25–39, 2020.

[8] V. Khandelwal, A. K. Chaturvedi, and C. P. Gupta.Amazon EC2 spot price prediction using regressionrandom forests. IEEE Transactions on Cloud Com-puting, 8(1):59–72, 2020.

[9] D. Kumar, G. Baranwal, Z. Raza, and D. P. Vidy-arthi. A survey on spot pricing in cloud comput-ing. Journal of Network and Systems Management,26:809–856, 2018.

[10] Z. Li, M. Kihl, and A. Robertsson. On a feedbackcontrol-based mechanism of bidding for cloud spotservice. In CloudCom ’15, pages 290–297, Vancouver,Nov 2015. IEEE.

[11] A. Marathe et al. Exploiting redundancy and applic-ation scalability for cost-effective, time-constrainedexecution of HPC applications on Amazon EC2.IEEE Transactions on Parallel & Distributed Sys-tems, 27(9):2574–2588, 2016.

[12] D. Petcu et al. Next generation HPC clouds: Aview for large-scale scientific and data-intensive ap-plications. In Euro-Par 2014, volume 8806 of Lec-ture Notes in Computer Science, pages 26–37, Porto,Aug. 2014. Springer.

[13] V. K. Singh and K. Dutta. Dynamic price pre-diction for amazon spot instances. In 2015 48thHawaii International Conference on System Sciences(HICSS), pages 1513–1520. IEEE, 2015.

[14] P. Smith, H. Gonzalez-Velez, and S. Caton. Socialauto-scaling. In PDP 2018, pages 186–195, Cam-bridge, Mar. 2018. IEEE Computer Society.

[15] W. Voorsluys and R. Buyya. Reliable provisioningof spot instances for compute-intensive applications.In AINA 2012, pages 542–549, Fukuoka, Mar. 2012.IEEE.

[16] R. M. Wallace et al. Applications of neural-basedspot market prediction for cloud computing. In ID-AACS ’13, volume 2, pages 710–716, Berlin, Sept.2013. IEEE.

[17] H. Zhao, M. Pan, X. Liu, X. Li, and Y. Fang. Ex-ploring fine-grained resource rental planning in cloudcomputing. IEEE Transactions on Cloud Comput-ing, 3(3):304–317, 2015.

aws ec2 spot instances for mission critical services

Documents