A Cost-Effective Strategy for Intermediate Data Storage in Scientific Cloud Workflow Systems
Dong Yuan, Yun Yang, Xiao Liu, Jinjun Chen
Swinburne University of Technology Melbourne, Australia
Outline
> Part 1: Introduction to our Work
> Part 2: A Cost-Effective Intermediate Data Storage Strategy for Scientific Cloud Workflow Systems
Part 1: Introduction to our Work
> SwinDeW Workflow Series
> SwinCloud System
SwinDeW Workflow Series
SwinDeW – Swinburne Decentralised Workflow: the foundation prototype, based on p2p technology
– SwinDeW – past
– SwinDeW-S (for Services) – past
– SwinDeW-B (for BPEL4WS) – past
– SwinDeW-G (for Grid) – past
– SwinDeW-A (for Agents) – current
– SwinDeW-V (for Verification) – current
– SwinDeW-C (for Cloud) – current
SwinCloud
[Architecture diagram: Swinburne computing facilities, including the Astrophysics Supercomputer and a VMware-based cloud simulation environment with Hadoop data centres; nodes include Swinburne CS3 (GT4, SuSE Linux) and Swinburne ESR (GT4, CentOS Linux)]
Part 2: A Cost-Effective Intermediate Data Storage Strategy for Scientific Cloud Workflow Systems
> A Motivating Example and Problem Analysis
> Important Concepts and Cost Model of Datasets Storage in the Cloud
> A Cost-Effective Datasets Storage Strategy for Scientific Cloud Workflow Systems
> Evaluation and Conclusion
Part 2: A Cost-Effective Data Storage Strategy
> A Motivating Example and Problem Analysis
A Motivating Example
> Parkes radio telescope and pulsar survey
> Pulsar searching workflow
[Workflow diagram: Record Raw Data → Extract Beam → Compress Beam → De-disperse (Trial Measure 1 … Trial Measure 1200) → Accelerate → seeking (FFT Seek, FFA Seek, Pulse Seek) → Get Candidates → Eliminate candidates → Fold to XML → Make decision]
A Motivating Example
> Current storage strategy
– Delete all the intermediate data, due to storage limitations
> Some intermediate data should be stored; some need not be.
A Motivating Example
> Scientific cloud workflow systems
– A scientific workflow system running in the cloud.
– Storage is no longer the bottleneck.
• Large data centres
• Virtually unlimited storage resources with a pay-for-use model
– Data products can be shared easily.
• All the data are managed in the data centres
• Internet-based access and SOA
Problem Analysis
> Which datasets should be stored?
– The data challenge: scientific data double every year, over the next decade and beyond [Szalay et al., Nature, 2006]
– Whether a dataset should be stored is a trade-off between computation cost and storage cost.
– Scientific workflows are very complex and there are dependencies among datasets.
– Furthermore, no single scientist can decide the storage status of a dataset anymore.
> A cost-effective datasets storage strategy is needed.
Part 2: A Cost-Effective Data Storage Strategy
> Important Concepts and Cost Model of Datasets Storage in the Cloud
Intermediate data Dependency Graph (IDG)
> A classification of the application data
– Input data (original) and intermediate data (generated data)
> Data provenance
– A kind of metadata that records how data are generated.
> IDG
[IDG example: seven datasets d1 … d7 connected by generation-dependency edges]
Datasets Storage Cost Model
> Cost = C + S
– Cost: total cost of managing intermediate datasets
– C: total cost of computation resources
– S: total cost of storage resources
> We use CostC (in USD per time unit) and CostS (in USD per unit of data size per time unit) to denote the prices of computation resources and storage resources
IDG with Cost Model
> A dataset di in the IDG has the attributes <size, flag, tp, t, pSet, fSet, CostR>:
– size: size of di
– flag: storage status of di
– tp: time to produce di from its direct predecessors
– t: usage rate of di in the system
– pSet: set of deleted predecessor datasets linked to di
– fSet: set of deleted successor datasets linked from di
– CostR: di's cost rate
IDG with Cost Model
> Generation cost of di:

    genCost(di) = ( di.tp + Σ dj.tp for dj ∈ di.pSet ) × CostC

> If di's storage status changes, the generation cost of all the datasets in di.fSet will be affected by genCost(di)

[Figure: a segment of the IDG showing di with its deleted predecessors (pSet) and deleted successors (fSet)]
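The IDG node and its generation cost can be sketched in Python. This is a minimal illustration, not the authors' implementation; the `Dataset` class, the per-minute price constant, and all field names are assumptions layered on the slide's <size, flag, tp, t, pSet, fSet, CostR> attributes.

```python
from dataclasses import dataclass, field

# Assumed computation price: $0.10 per CPU hour, expressed per minute.
COST_C = 0.1 / 60

@dataclass
class Dataset:
    """One IDG node, mirroring the slide's <size, flag, tp, t, pSet, fSet, CostR>."""
    name: str
    size: float                 # size in GB
    tp: float                   # minutes to produce di from its direct predecessors
    t: float                    # usage rate (accesses per minute)
    stored: bool = False        # the 'flag' attribute
    pSet: list = field(default_factory=list)  # deleted predecessor datasets
    fSet: list = field(default_factory=list)  # deleted successor datasets

def gen_cost(d: Dataset, cost_c: float = COST_C) -> float:
    """genCost(di) = (di.tp + sum of dj.tp over dj in di.pSet) * CostC:
    to regenerate di, every deleted predecessor must be regenerated first."""
    return (d.tp + sum(dj.tp for dj in d.pSet)) * cost_c
```

Because pSet contains the deleted predecessors, deleting an upstream dataset makes every downstream regeneration pay for its tp as well, which is exactly why storage decisions cannot be made per dataset in isolation.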
IDG with Cost Model
> CostR: di's cost rate, the average cost per time unit of the dataset di in the system
– If di is a stored dataset: di.CostR = di.size × CostS
– If di is a deleted dataset: di.CostR = genCost(di) × di.t
> The total cost rate of the system is: Σ di.CostR for di ∈ IDG
> Given a time duration [T0, Tn], the total system cost is:

    Total_Cost = ∫ from T0 to Tn of ( Σ di.CostR for di ∈ IDG ) dt
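A small sketch of the cost model, assuming constant rates so the integral collapses to a multiplication; the dict keys, price constants, and function names are mine, not the authors':

```python
# Assumed storage price: $0.15 per GB per month, converted to $ per GB-minute.
COST_S = 0.15 / (30 * 24 * 60)

def cost_rate(d, cost_s=COST_S):
    """di.CostR: di.size * CostS if di is stored, genCost(di) * di.t if deleted.
    d is a dict with keys: size (GB), stored (bool), gen_cost ($), t (uses/min)."""
    return d["size"] * cost_s if d["stored"] else d["gen_cost"] * d["t"]

def total_cost(idg, minutes):
    """Total system cost over [T0, Tn]: the integral of the summed cost rates,
    which for constant rates reduces to sum(CostR) * (Tn - T0)."""
    return sum(cost_rate(d) for d in idg) * minutes
```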
Part 2: A Cost-Effective Data Storage Strategy
> A cost-effective strategy for intermediate data storage in scientific cloud workflow systems
Intermediate data storage strategy
> Algorithm 1: deciding newly generated intermediate datasets’ storage status
> Algorithm 2: managing stored intermediate datasets
> Algorithm 3: deciding the regenerated intermediate datasets’ storage status
Algorithm 1
> Suppose d0 is a newly generated intermediate dataset
> First, we add its information to the IDG
> Next, we check whether d0 needs to be stored by comparing:

    genCost(d0) × d0.t  vs.  d0.size × CostS

– d0 is stored if regenerating it on demand would cost more per time unit than storing it.
Algorithm 2
> Suppose d0 is a stored dataset
> We set a threshold time tθ for d0 as the frequency at which to re-check d0's storage status, where

    genCost(d0) = d0.tθ × d0.size × CostS

> To check whether d0 still needs to be stored, we compare:

    genCost(d0) × d0.t + Σ genCost(d0) × di.t for di ∈ d0.fSet  vs.  d0.size × CostS
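A sketch of both checks; the formulas were garbled in the transcript, so this follows my reading of them (deleting d0 adds genCost(d0) to the regeneration cost of each deleted successor in d0.fSet), and all names are assumptions:

```python
def threshold_time(gen_cost, size, cost_s):
    """t_theta for d0, from genCost(d0) = t_theta * d0.size * CostS: the time
    after which the accumulated storage cost equals one regeneration; the
    strategy re-checks d0's storage status at this frequency."""
    return gen_cost / (size * cost_s)

def keep_stored(gen_cost_d0, t_d0, fset_usage_rates, size, cost_s):
    """Algorithm 2 sketch: keep d0 stored while the regeneration cost it saves,
    for itself and for every deleted successor in d0.fSet, exceeds its
    storage rate d0.size * CostS."""
    return gen_cost_d0 * (t_d0 + sum(fset_usage_rates)) > size * cost_s
```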
Lemma and Theorem
> Lemma: The deletion of a stored intermediate dataset di in the IDG does not affect the stored datasets adjacent to di.
> Theorem: If a regenerated intermediate dataset di is stored, only the stored datasets adjacent to di in the IDG may need to be deleted to reduce the system cost.
Algorithm 3
> Suppose d0 is a regenerated dataset.
> We assume it is stored, and calculate the potential cost benefit.
> Then we check whether the stored predecessor and successor datasets of d0 still need to be stored, and accumulate the cost benefit.
> We calculate the final cost benefit to decide d0's storage status; d0 is stored if the accumulated benefit is positive:

    [ genCost(d0) × d0.t + Σ genCost(d0) × di.t for di ∈ d0.fSet − d0.size × CostS ]
    + Σ over checked stored predecessors dj: [ dj.size × CostS − ( genCost(dj) × dj.t + Σ genCost(dj) × dm.t for dm ∈ dj.fSet ) ]
    + Σ over checked stored successors dk: [ dk.size × CostS − ( genCost(dk) × dk.t + Σ genCost(dk) × dn.t for dn ∈ dk.fSet ) ]
    > 0
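The accumulation can be sketched as follows. This is an illustrative reduction of Algorithm 3, not the paper's code: each term of the sum above is a "storage benefit" (regeneration cost saved minus storage rate), and the theorem limits the re-check to stored datasets adjacent to d0.

```python
def storage_benefit(gen_cost, t_self, fset_usage_rates, size, cost_s):
    """Net cost-rate benefit of keeping a dataset stored: the regeneration
    cost it saves for itself and its deleted successors, minus its storage
    rate size * CostS."""
    return gen_cost * (t_self + sum(fset_usage_rates)) - size * cost_s

def decide_regenerated(d0_benefit, adjacent_stored_benefits):
    """Algorithm 3 sketch: tentatively store the regenerated d0, add back the
    cost recovered by deleting adjacent stored datasets whose own benefit
    has turned negative, and store d0 if the total is positive."""
    recovered = sum(-b for b in adjacent_stored_benefits if b < 0)
    return d0_benefit + recovered > 0
```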
Part 2: A Cost-Effective Data Storage Strategy
> Evaluation and Conclusion
Evaluation
> IDG of the pulsar searching workflow
> Adopt Amazon’s cost model (EC2+S3):
– $0.15 per GB per month for storage resources.
– $0.10 per CPU hour for computation resources.
Intermediate datasets of the pulsar searching workflow (size, generation time):

    Raw beam data                      20 GB    (recorded, not generated)
    Extracted & compressed beam        90 GB    27 mins
    De-dispersion files                90 GB    790 mins
    Accelerated de-dispersion files    —        300 mins
    Seek results files                 16 MB    80 mins
    Candidate list                     1 KB     1 min
    XML files                          25 KB    245 mins
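Plugging the slide's own numbers into the cost model shows why the trade-off is non-trivial; the break-even figure below is my arithmetic, not a result from the evaluation:

```python
# Worked example: the 90 GB de-dispersion files take 790 minutes to generate.
storage_per_month = 90 * 0.15        # $13.50 per month to keep them in storage
regeneration = (790 / 60) * 0.1      # about $1.32 of computation per regeneration

# Storing only pays off if the files would otherwise be regenerated more than
# storage_per_month / regeneration times per month (roughly 10 times).
break_even_uses_per_month = storage_per_month / regeneration
```

By contrast, the 25 KB XML files take 245 minutes to produce, so their regeneration cost dwarfs their storage cost and they should clearly be stored.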
Evaluation
> Simulation strategies: 1) store all the datasets; 2) delete all the datasets; 3) store high generation cost datasets; 4) store often used datasets; 5) the dependency-based strategy (ours).
Total cost of 50 days
[Figure: total system cost accumulated over 50 days under the five strategies — store all, store none, store high generation cost datasets, store often used datasets, and the dependency-based strategy]
Conclusion and Future Work
> Conclusion
– Our strategy is cost-effective
– It is based on datasets' cost rates
– It considers the dependencies among datasets
> Future work
– Data placement
– Minimum cost benchmark
End
> Questions?