A Cost-Effective Strategy for Intermediate Data Storage in Scientific Cloud Workflow Systems
Dong Yuan, Yun Yang, Xiao Liu, Jinjun Chen
Swinburne University of Technology Melbourne, Australia
Outline
> Part 1: Introduction to our Work
> Part 2: A Cost-Effective Intermediate Data Storage Strategy for Scientific Cloud Workflow Systems
Part 1: Introduction to our Work
> SwinDeW Workflow Series
> SwinCloud System
SwinDeW Workflow Series
SwinDeW – Swinburne Decentralised Workflow: the foundation prototype, based on p2p technology
– SwinDeW – past
– SwinDeW-S (for Services) – past
– SwinDeW-B (for BPEL4WS) – past
– SwinDeW-G (for Grid) – past
– SwinDeW-A (for Agents) – current
– SwinDeW-V (for Verification) – current
– SwinDeW-C (for Cloud) – current
SwinCloud
[Architecture diagram: Swinburne computing facilities, including the Astrophysics Supercomputer and a VMware-based cloud simulation environment with Hadoop data centres; nodes include Swinburne CS3 (GT4, SuSE Linux) and Swinburne ESR (GT4, CentOS Linux)]
Part 2: A Cost-Effective Intermediate Data Storage Strategy for Scientific Cloud Workflow Systems
> A Motivating Example and Problem Analysis
> Important Concepts and Cost Model of Datasets Storage in the Cloud
> A Cost-Effective Datasets Storage Strategy for Scientific Cloud Workflow Systems
> Evaluation and Conclusion
Part 2: A Cost-Effective Data Storage Strategy
> A Motivating Example and Problem Analysis
A Motivating Example
> Parkes radio telescope and pulsar survey
> Pulsar searching workflow
[Workflow diagram: Record Raw Data → Extract Beam → Compress Beam → De-disperse (Trial Measure 1 … Trial Measure 1200) → Accelerate → seeking (FFT Seek, FFA Seek, Pulse Seek) → Get Candidates → Eliminate candidates → Fold to XML → Make decision]
A Motivating Example
> Current storage strategy
– Delete all the intermediate data, due to storage limitations
> Some intermediate data should be stored; some need not be.
A Motivating Example
> Scientific cloud workflow systems
– A scientific workflow system running in the cloud.
– Storage is no longer the bottleneck.
• Large data centres
• Virtually unlimited storage resources with a pay-for-use model
– Data products can be shared easily.
• All the data are managed in the data centres
• Internet-based access and SOA
Problem Analysis
> Which datasets should be stored?
– The data challenge: scientific data double every year, over the next decade and beyond [Szalay et al., Nature, 2006]
– Whether a dataset should be stored is a trade-off between computation cost and storage cost.
– Scientific workflows are very complex and there are dependencies among datasets.
– Furthermore, no single scientist can decide the storage status of a dataset anymore.
> A cost-effective datasets storage strategy is needed.
Part 2: A Cost-Effective Data Storage Strategy
> Important Concepts and Cost Model of Datasets Storage in the Cloud
Intermediate data Dependency Graph (IDG)
> A classification of the application data
– Input data (original) and intermediate data (generated data)
> Data provenance
– A kind of metadata that records how data are generated.
> IDG
[IDG example: seven datasets d1 … d7 connected by generation-dependency edges]
Datasets Storage Cost Model
> Cost = C + S
– Cost: total cost of managing intermediate datasets
– C: total cost of computation resources
– S: total cost of storage resources
> We use CostC (in USD per time unit) and CostS (in USD per unit of data size per time unit) to denote the prices of computation resources and storage resources
IDG with Cost Model
> A dataset di in the IDG has the attributes <size, flag, tp, t, pSet, fSet, CostR>:
– size: size of di
– flag: storage status of di
– tp: time to produce di from its direct predecessors
– t: usage rate of di in the system
– pSet: set of deleted predecessor datasets linked to di
– fSet: set of deleted successor datasets linked from di
– CostR: di's cost rate
IDG with Cost Model
> Generation cost of di:

    genCost(di) = ( di.tp + Σ dj.tp for dj ∈ di.pSet ) × CostC

> If di's storage status changes, the generation cost of all the datasets in di.fSet will be affected by genCost(di)

[Figure: a segment of the IDG showing di with its deleted predecessors (pSet) and deleted successors (fSet)]
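The IDG node and its generation cost can be sketched in Python. This is a minimal illustration, not the authors' implementation; the `Dataset` class, the per-minute price constant, and all field names are assumptions layered on the slide's <size, flag, tp, t, pSet, fSet, CostR> attributes.

```python
from dataclasses import dataclass, field

# Assumed computation price: $0.10 per CPU hour, expressed per minute.
COST_C = 0.1 / 60

@dataclass
class Dataset:
    """One IDG node, mirroring the slide's <size, flag, tp, t, pSet, fSet, CostR>."""
    name: str
    size: float                 # size in GB
    tp: float                   # minutes to produce di from its direct predecessors
    t: float                    # usage rate (accesses per minute)
    stored: bool = False        # the 'flag' attribute
    pSet: list = field(default_factory=list)  # deleted predecessor datasets
    fSet: list = field(default_factory=list)  # deleted successor datasets

def gen_cost(d: Dataset, cost_c: float = COST_C) -> float:
    """genCost(di) = (di.tp + sum of dj.tp over dj in di.pSet) * CostC:
    to regenerate di, every deleted predecessor must be regenerated first."""
    return (d.tp + sum(dj.tp for dj in d.pSet)) * cost_c
```

Because pSet contains the deleted predecessors, deleting an upstream dataset makes every downstream regeneration pay for its tp as well, which is exactly why storage decisions cannot be made per dataset in isolation.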
IDG with Cost Model
> CostR: di's cost rate, the average cost per time unit of the dataset di in the system
– If di is a stored dataset: di.CostR = di.size × CostS
– If di is a deleted dataset: di.CostR = genCost(di) × di.t
> The total cost rate of the system is: Σ di.CostR for di ∈ IDG
> Given a time duration [T0, Tn], the total system cost is:

    Total_Cost = ∫ from T0 to Tn of ( Σ di.CostR for di ∈ IDG ) dt
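A small sketch of the cost model, assuming constant rates so the integral collapses to a multiplication; the dict keys, price constants, and function names are mine, not the authors':

```python
# Assumed storage price: $0.15 per GB per month, converted to $ per GB-minute.
COST_S = 0.15 / (30 * 24 * 60)

def cost_rate(d, cost_s=COST_S):
    """di.CostR: di.size * CostS if di is stored, genCost(di) * di.t if deleted.
    d is a dict with keys: size (GB), stored (bool), gen_cost ($), t (uses/min)."""
    return d["size"] * cost_s if d["stored"] else d["gen_cost"] * d["t"]

def total_cost(idg, minutes):
    """Total system cost over [T0, Tn]: the integral of the summed cost rates,
    which for constant rates reduces to sum(CostR) * (Tn - T0)."""
    return sum(cost_rate(d) for d in idg) * minutes
```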
Part 2: A Cost-Effective Data Storage Strategy
> A cost-effective strategy for intermediate data storage in scientific cloud workflow systems
Intermediate data storage strategy
> Algorithm 1: deciding newly generated intermediate datasets’ storage status
> Algorithm 2: managing stored intermediate datasets
> Algorithm 3: deciding the regenerated intermediate datasets’ storage status
Algorithm 1
> Suppose d0 is a newly generated intermediate dataset
> First, we add its information to the IDG
> Next, we check whether d0 needs to be stored by comparing:

    genCost(d0) × d0.t  vs.  d0.size × CostS

– d0 is stored if regenerating it on demand would cost more per time unit than storing it.
Algorithm 2
> Suppose d0 is a stored dataset
> We set a threshold time tθ for d0 as the frequency at which to re-check d0's storage status, where

    genCost(d0) = d0.tθ × d0.size × CostS

> To check whether d0 still needs to be stored, we compare:

    genCost(d0) × d0.t + Σ genCost(d0) × di.t for di ∈ d0.fSet  vs.  d0.size × CostS
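A sketch of both checks; the formulas were garbled in the transcript, so this follows my reading of them (deleting d0 adds genCost(d0) to the regeneration cost of each deleted successor in d0.fSet), and all names are assumptions:

```python
def threshold_time(gen_cost, size, cost_s):
    """t_theta for d0, from genCost(d0) = t_theta * d0.size * CostS: the time
    after which the accumulated storage cost equals one regeneration; the
    strategy re-checks d0's storage status at this frequency."""
    return gen_cost / (size * cost_s)

def keep_stored(gen_cost_d0, t_d0, fset_usage_rates, size, cost_s):
    """Algorithm 2 sketch: keep d0 stored while the regeneration cost it saves,
    for itself and for every deleted successor in d0.fSet, exceeds its
    storage rate d0.size * CostS."""
    return gen_cost_d0 * (t_d0 + sum(fset_usage_rates)) > size * cost_s
```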
Lemma and Theorem
> Lemma: The deletion of a stored intermediate dataset di in the IDG does not affect the stored datasets adjacent to di.
> Theorem: If a regenerated intermediate dataset di is stored, only the stored datasets adjacent to di in the IDG may need to be deleted to reduce the system cost.
Algorithm 3
> Suppose d0 is a regenerated dataset.
> We assume it is stored, and calculate the potential cost benefit.
> Then we check whether the stored predecessor and successor datasets of d0 still need to be stored, and accumulate the cost benefit.
> We calculate the final cost benefit to decide d0's storage status; d0 is stored if the accumulated benefit is positive:

    [ genCost(d0) × d0.t + Σ genCost(d0) × di.t for di ∈ d0.fSet − d0.size × CostS ]
    + Σ over checked stored predecessors dj: [ dj.size × CostS − ( genCost(dj) × dj.t + Σ genCost(dj) × dm.t for dm ∈ dj.fSet ) ]
    + Σ over checked stored successors dk: [ dk.size × CostS − ( genCost(dk) × dk.t + Σ genCost(dk) × dn.t for dn ∈ dk.fSet ) ]
    > 0
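The accumulation can be sketched as follows. This is an illustrative reduction of Algorithm 3, not the paper's code: each term of the sum above is a "storage benefit" (regeneration cost saved minus storage rate), and the theorem limits the re-check to stored datasets adjacent to d0.

```python
def storage_benefit(gen_cost, t_self, fset_usage_rates, size, cost_s):
    """Net cost-rate benefit of keeping a dataset stored: the regeneration
    cost it saves for itself and its deleted successors, minus its storage
    rate size * CostS."""
    return gen_cost * (t_self + sum(fset_usage_rates)) - size * cost_s

def decide_regenerated(d0_benefit, adjacent_stored_benefits):
    """Algorithm 3 sketch: tentatively store the regenerated d0, add back the
    cost recovered by deleting adjacent stored datasets whose own benefit
    has turned negative, and store d0 if the total is positive."""
    recovered = sum(-b for b in adjacent_stored_benefits if b < 0)
    return d0_benefit + recovered > 0
```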
Part 2: A Cost-Effective Data Storage Strategy
> Evaluation and Conclusion
Evaluation
> IDG of the pulsar searching workflow
> Adopt Amazon’s cost model (EC2+S3):
– $0.15 per GB per month for storage resources.
– $0.10 per CPU hour for computation resources.
Intermediate datasets of the pulsar searching workflow (size, generation time):

    Raw beam data                      20 GB    (recorded, not generated)
    Extracted & compressed beam        90 GB    27 mins
    De-dispersion files                90 GB    790 mins
    Accelerated de-dispersion files    —        300 mins
    Seek results files                 16 MB    80 mins
    Candidate list                     1 KB     1 min
    XML files                          25 KB    245 mins
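Plugging the slide's own numbers into the cost model shows why the trade-off is non-trivial; the break-even figure below is my arithmetic, not a result from the evaluation:

```python
# Worked example: the 90 GB de-dispersion files take 790 minutes to generate.
storage_per_month = 90 * 0.15        # $13.50 per month to keep them in storage
regeneration = (790 / 60) * 0.1      # about $1.32 of computation per regeneration

# Storing only pays off if the files would otherwise be regenerated more than
# storage_per_month / regeneration times per month (roughly 10 times).
break_even_uses_per_month = storage_per_month / regeneration
```

By contrast, the 25 KB XML files take 245 minutes to produce, so their regeneration cost dwarfs their storage cost and they should clearly be stored.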
Evaluation
> Simulation strategies: 1) store all the datasets; 2) delete all the datasets; 3) store high generation cost datasets; 4) store often used datasets; 5) the dependency-based strategy (ours).
Total cost of 50 days
[Figure: total system cost accumulated over 50 days under the five strategies — store all, store none, store high generation cost datasets, store often used datasets, and the dependency-based strategy]
Conclusion and Future Work
> Conclusion
– Our strategy is cost-effective
– It is based on datasets' cost rates
– It considers the dependencies among datasets
> Future work
– Data placement
– Minimum cost benchmark
End
> Questions?