A Cost-Effective Strategy for Intermediate Data Storage in Scientific Cloud Workflow Systems Dong Yuan, Yun Yang, Xiao Liu, Jinjun Chen Swinburne University of

Post on 19-Dec-2015

A Cost-Effective Strategy for Intermediate Data Storage in Scientific Cloud Workflow Systems

Dong Yuan, Yun Yang, Xiao Liu, Jinjun Chen

Swinburne University of Technology Melbourne, Australia

Outline

> Part 1: Introduction to our Work

> Part 2: A Cost-Effective Intermediate Data Storage Strategy for Scientific Cloud Workflow Systems

Part 1: Introduction to our Work

> SwinDeW Workflow Series

> SwinCloud System

SwinDeW Workflow Series

SwinDeW – Swinburne Decentralised Workflow: foundation prototype based on P2P

– SwinDeW – past

– SwinDeW-S (for Services) – past

– SwinDeW-B (for BPEL4WS) – past

– SwinDeW-G (for Grid) – past

– SwinDeW-A (for Agents) – current

– SwinDeW-V (for Verification) – current

– SwinDeW-C (for Cloud) – current

SwinCloud

[Figure: SwinCloud architecture. Labels: Swinburne Computing Facilities (Astrophysics Supercomputer, Swinburne CS3, Swinburne ESR; nodes run GT4 on SuSE Linux or CentOS Linux); VMware; Cloud Simulation Environment; Data Centres with Hadoop.]

Part 2: A Cost-Effective Intermediate Data Storage Strategy for Scientific Cloud Workflow Systems

> A Motivating Example and Problem Analysis

> Important Concepts and Cost Model of Datasets Storage in the Cloud

> A Cost-Effective Datasets Storage Strategy for Scientific Cloud Workflow Systems

> Evaluation and Conclusion

Part 2: A Cost-Effective Data Storage Strategy

> A Motivating Example and Problem Analysis

A Motivating Example

> Parkes radio telescope and pulsar survey

> Pulsar searching workflow

[Figure: pulsar searching workflow. Steps shown: Record Raw Data; Extract Beam (one per beam); Compress Beam; De-disperse (Trial Measure 1, Trial Measure 2, ..., Trial Measure 1200); Accelerate; Pulse Seek, FFT Seek, FFA Seek; Get Candidates; Eliminate Candidates; Fold to XML; Make Decision.]

A Motivating Example

> Current storage strategy

– Delete all the intermediate data, due to storage limitation

> Some intermediate data should be stored.

> Some need not.

[Figure: the pulsar searching workflow, repeated from the previous slide.]

A Motivating Example

> Scientific cloud workflow systems

– A scientific workflow system in the Cloud.

– Storage is no longer a bottleneck.

• Large data centres

• Unlimited storage resource with pay-for-use model

– Data products can be shared easily.

• All the data are managed in the data centres

• Internet based access and SOA

Problem Analysis

> Which datasets should be stored?

– Data challenge: data volumes double every year over the next decade and beyond [Szalay et al., Nature, 2006]

– Datasets should be stored based on the trade-off of computation cost and storage cost.

– Scientific workflows are very complex and there are dependencies among datasets.

– Furthermore, a single scientist can no longer decide the storage status of a dataset.

> A cost-effective datasets storage strategy is needed.

Part 2: A Cost-Effective Data Storage Strategy

> Important Concepts and Cost Model of Datasets Storage in the Cloud

Intermediate data Dependency Graph (IDG)

> A classification of the application data

– Input data (original) and intermediate data (generated data)

> Data provenance

– A kind of meta-data that records how data are generated.

> IDG

[Figure: example IDG with datasets d1 to d7.]

Datasets Storage Cost Model

> Cost = C + S

– Cost: total cost of managing intermediate datasets

– C: total cost of computation resources

– S: total cost of storage resources

> We use CostC (USD per time unit) and CostS (USD per time unit per unit of data size) to denote the prices of computation resources and storage resources

IDG with Cost Model

> A dataset di in the IDG has the attributes: <size, flag, tp, t, pSet, fSet, CostR>

– size : size of di

– flag : denotes storage status of di

– tp : time to produce di from its direct predecessors

– t : usage rate of di in the system, i.e. the time between successive uses of di

– pSet : set of deleted datasets linked to di

– fSet : set of deleted datasets linked by di

– CostR : di ’s cost rate
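The attribute tuple above maps naturally onto a small record type. The following is a minimal sketch only: the field names, the `name` identifier, and the reading of `t` as the time between uses are assumptions made for illustration, not part of the slides.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Dataset:
    """One IDG node, mirroring <size, flag, tp, t, pSet, fSet, CostR>."""
    name: str                  # hypothetical identifier, not in the slides
    size: float                # size of the dataset, e.g. in GB
    t_p: float                 # time to produce it from its direct predecessors
    t: float                   # usage attribute; assumed to be time between uses
    stored: bool = False       # the "flag" attribute (storage status)
    p_set: Set[str] = field(default_factory=set)  # deleted datasets linked to it
    f_set: Set[str] = field(default_factory=set)  # deleted datasets linked by it
    cost_r: float = 0.0        # cost rate, filled in by the cost model


# Two deleted datasets in a chain d1 -> d2 (illustrative values only).
d1 = Dataset("d1", size=20.0, t_p=0.45, t=10.0)
d2 = Dataset("d2", size=90.0, t_p=13.0, t=10.0, p_set={"d1"})
d1.f_set.add("d2")
```

Keeping pSet and fSet as sets of names rather than object references keeps the record easy to serialise alongside provenance metadata; that is a design choice of this sketch.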

IDG with Cost Model

> Generation cost of di :

$$genCost(d_i) = \left( d_i.t_p + \sum_{d_j \in d_i.pSet} d_j.t_p \right) \times CostC$$

> If di ’s storage status changes, the generation cost of all the datasets in di .fSet will be affected by genCost(di)

[Figure: di in the IDG, with its pSet on one side and its fSet on the other.]
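As a rough illustration of the generation cost, the sketch below sums the production time of a dataset and of every dataset in its pSet, then multiplies by the computation price. The `Dataset` record, the dictionary-based IDG, and the price of 0.1 USD per hour are assumptions for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

COST_C = 0.1  # assumed computation price, USD per hour


@dataclass
class Dataset:
    t_p: float                                    # hours to produce from direct predecessors
    p_set: Set[str] = field(default_factory=set)  # deleted predecessors needed for regeneration


def gen_cost(name: str, idg: Dict[str, Dataset], cost_c: float = COST_C) -> float:
    """Cost of regenerating `name`: its own t_p plus the t_p of everything in its pSet."""
    d = idg[name]
    total_time = d.t_p + sum(idg[p].t_p for p in d.p_set)
    return total_time * cost_c


# Hypothetical chain d1 -> d2 -> d3 in which all three datasets are deleted.
idg = {
    "d1": Dataset(t_p=2.0),
    "d2": Dataset(t_p=3.0, p_set={"d1"}),
    "d3": Dataset(t_p=5.0, p_set={"d1", "d2"}),
}
```

Here `gen_cost("d3", idg)` charges for 10 hours of computation, because regenerating d3 first requires regenerating d1 and d2.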

IDG with Cost Model

> CostR : di ’s cost rate, which means the average cost per time unit of the dataset di in the system

– If di is a stored dataset : $d_i.CostR = d_i.size \times CostS$

– If di is a deleted dataset : $d_i.CostR = genCost(d_i) / d_i.t$

> The total cost rate of the system is : $\sum_{d_i \in IDG} d_i.CostR$

> Given a time duration, denoted as [T0, Tn], the total system cost is :

$$Total\_Cost = \int_{t=T_0}^{T_n} \left( \sum_{d_i \in IDG} d_i.CostR \right) dt$$
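The cost rate and total cost can be sketched in a few lines. This example assumes that `t` is the time between uses of a dataset (so a deleted dataset costs its generation cost once per `t`), and that no storage status changes during the interval, in which case the integral reduces to rate times duration. All numbers are illustrative.

```python
from typing import List, Tuple

# (stored, size_gb, gen_cost_usd, t_days); a hypothetical record layout
Record = Tuple[bool, float, float, float]


def cost_rate(stored: bool, size: float, gen_cost: float, t: float, cost_s: float) -> float:
    """Cost per time unit: storage fee if stored, amortised regeneration cost if deleted."""
    return size * cost_s if stored else gen_cost / t


def total_cost(datasets: List[Record], cost_s: float, duration: float) -> float:
    """Total cost over `duration`, assuming the storage statuses stay fixed."""
    rate = sum(cost_rate(s, size, g, t, cost_s) for s, size, g, t in datasets)
    return rate * duration


COST_S = 0.005  # assumed storage price, USD per GB per day
datasets = [
    (True, 90.0, 1.3, 4.0),   # stored: pays 90 * 0.005 per day
    (False, 20.0, 1.0, 4.0),  # deleted: pays 1.0 / 4 per day
]
```

If statuses do change over time, the same rate function can be summed piecewise over the intervals between changes.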

Part 2: A Cost-Effective Data Storage Strategy

> A cost-effective strategy for intermediate data storage in scientific cloud workflow systems

Intermediate data storage strategy

> Algorithm 1: deciding newly generated intermediate datasets’ storage status

> Algorithm 2: managing stored intermediate datasets

> Algorithm 3: deciding the regenerated intermediate datasets’ storage status

Algorithm 1

> Suppose d0 is a newly generated intermediate dataset

> First, we add its information to the IDG

> Next, we check if d0 needs to be stored or not by comparing:

$$\frac{genCost(d_0)}{d_0.t} \quad \text{and} \quad d_0.size \times CostS$$
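The comparison in Algorithm 1 can be sketched as a single predicate: store the new dataset when regenerating it on demand would cost more per time unit than keeping it. The function name and the sample numbers are hypothetical, and `t` is again assumed to be the time between uses.

```python
def should_store_new(gen_cost: float, t: float, size: float, cost_s: float) -> bool:
    """True when the amortised regeneration cost exceeds the storage cost rate."""
    regeneration_rate = gen_cost / t   # USD per time unit if the dataset is deleted
    storage_rate = size * cost_s       # USD per time unit if the dataset is stored
    return regeneration_rate > storage_rate


COST_S = 0.005  # assumed storage price, USD per GB per day

# A 90 GB dataset costing 1 USD to regenerate:
frequently_used = should_store_new(gen_cost=1.0, t=2.0, size=90.0, cost_s=COST_S)
rarely_used = should_store_new(gen_cost=1.0, t=10.0, size=90.0, cost_s=COST_S)
```

With these numbers the dataset used every 2 days is worth storing, while the one used every 10 days is not.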

Algorithm 2

> Suppose d0 is a stored dataset

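> We set a threshold time for d0 as the frequency at which to check d0 ’s storage status, where

$$\text{threshold time of } d_0 = \frac{genCost(d_0)}{d_0.size \times CostS}$$

> To check if d0 still needs to be stored, we have to compare:

$$\frac{genCost(d_0)}{d_0.t} + \sum_{d_i \in d_0.fSet} \frac{genCost(d_0)}{d_i.t} \quad \text{and} \quad d_0.size \times CostS$$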
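A minimal sketch of the Algorithm 2 check follows. It assumes `t` is the time between uses, and it represents d0's fSet simply as a list of the `t` values of those datasets; both are simplifications for illustration.

```python
from typing import List


def threshold_time(gen_cost: float, size: float, cost_s: float) -> float:
    """How long the dataset can be stored for the price of regenerating it once."""
    return gen_cost / (size * cost_s)


def still_worth_storing(gen_cost: float, t0: float, successor_ts: List[float],
                        size: float, cost_s: float) -> bool:
    """Compare the cost rate of deleting d0 with the cost rate of keeping it.

    Deleting d0 adds genCost(d0) to every regeneration of d0 itself and of
    each deleted dataset in its fSet (whose usage intervals are successor_ts).
    """
    deletion_rate = gen_cost / t0 + sum(gen_cost / t for t in successor_ts)
    return deletion_rate > size * cost_s


COST_S = 0.005  # assumed storage price, USD per GB per day
check_after = threshold_time(gen_cost=0.9, size=90.0, cost_s=COST_S)
alone = still_worth_storing(0.9, 4.0, [], 90.0, COST_S)
with_successor = still_worth_storing(0.9, 4.0, [3.0], 90.0, COST_S)
```

In this example the dataset is not worth storing on its own usage, but becomes worth storing once a dependent deleted dataset is taken into account, which is the point of the dependency term.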

Lemma and Theorem

> Lemma: The deletion of stored intermediate dataset di in the IDG does not affect the stored datasets adjacent to di

> Theorem: If regenerated intermediate dataset di is stored, only the stored datasets adjacent to di in the IDG may need to be deleted to reduce the system cost.

[Figure: di in the IDG with its pSet and fSet.]

Algorithm 3

> Suppose d0 is a regenerated dataset.

> We assume it should be stored, and calculate the potential cost benefit.

> Then we check if the stored predecessor and successor datasets of d0 still need to be stored, and accumulate the cost benefit.

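> We calculate the final cost benefit to decide d0 ’s storage status. The cost benefit of storing d0 is:

$$\frac{genCost(d_0)}{d_0.t} + \sum_{d_i \in d_0.fSet} \frac{genCost(d_0)}{d_i.t} - d_0.size \times CostS$$

The cost benefit of deleting an adjacent stored dataset dj is:

$$d_j.size \times CostS - \frac{genCost(d_j)}{d_j.t} - \sum_{d_m \in d_j.fSet} \frac{genCost(d_j)}{d_m.t}$$

and likewise for an adjacent stored dataset dk:

$$d_k.size \times CostS - \frac{genCost(d_k)}{d_k.t} - \sum_{d_n \in d_k.fSet} \frac{genCost(d_k)}{d_n.t}$$

d0 is stored if the accumulated cost benefit is greater than 0.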
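The accumulation described for Algorithm 3 might be sketched as below. This is a simplified reading, not the authors' implementation: the `Node` record is hypothetical, `t` is assumed to be the time between uses, and each neighbour's generation cost and fSet are assumed to have been recomputed already under the assumption that d0 is stored.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    gen_cost: float   # genCost(d), in USD
    t: float          # assumed: time between uses of d
    size: float       # dataset size, e.g. in GB
    successor_ts: List[float] = field(default_factory=list)  # t of each dataset in d.fSet


def deletion_rate(d: Node) -> float:
    """Cost per time unit incurred if d is deleted."""
    return d.gen_cost / d.t + sum(d.gen_cost / t for t in d.successor_ts)


def decide_regenerated(d0: Node, stored_neighbours: List[Node], cost_s: float) -> bool:
    """Store d0 if its own benefit plus the savings from deletable neighbours is positive."""
    benefit = deletion_rate(d0) - d0.size * cost_s
    for n in stored_neighbours:
        saving = n.size * cost_s - deletion_rate(n)
        if saving > 0:  # this neighbour no longer needs to be stored
            benefit += saving
    return benefit > 0


COST_S = 0.005  # assumed storage price, USD per GB per day
d0 = Node(gen_cost=1.0, t=5.0, size=100.0, successor_ts=[5.0])
neighbour = Node(gen_cost=0.2, t=10.0, size=80.0)
```

In this example d0 is not worth storing by itself, but storing it makes the adjacent stored dataset cheap to regenerate, and the saving from deleting that neighbour tips the decision.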

Part 2: A Cost-Effective Data Storage Strategy

> Evaluation and Conclusion

Evaluation

> IDG of the pulsar searching workflow

> Adopt Amazon’s cost model (EC2+S3):

– $0.15 per Gigabyte per month for the storage resources.

– $0.1 per CPU hour for the computation resources.

> Size and generation time of each dataset in the IDG:

– Raw beam data (input)

– Extracted & compressed beam: 20 GB, 27 mins

– De-dispersion files: 90 GB, 790 mins

– Accelerated de-dispersion files: 90 GB, 300 mins

– Seek results files: 16 MB, 80 mins

– Candidate list: 1 KB, 1 min

– XML files: 25 KB, 245 mins
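To give a feel for the trade-off under these prices, the sketch below computes a break-even point for a 90 GB dataset that takes 790 minutes to generate, a size and time that appear in the evaluation data above. It assumes a 30-day month, a single CPU, and that only the dataset's own generation time is charged; these are simplifications for illustration.

```python
COST_C_PER_HOUR = 0.10      # USD per CPU hour (EC2 price quoted on the slide)
COST_S_PER_GB_MONTH = 0.15  # USD per GB per month (S3 price quoted on the slide)
DAYS_PER_MONTH = 30         # assumption used to convert the monthly price


def generation_cost_usd(minutes: float) -> float:
    """Computation cost of one regeneration on a single CPU."""
    return minutes / 60.0 * COST_C_PER_HOUR


def storage_cost_usd_per_day(size_gb: float) -> float:
    """Daily storage fee for a dataset of the given size."""
    return size_gb * COST_S_PER_GB_MONTH / DAYS_PER_MONTH


def break_even_days(size_gb: float, minutes: float) -> float:
    """Days of storage that cost as much as one regeneration."""
    return generation_cost_usd(minutes) / storage_cost_usd_per_day(size_gb)


days = break_even_days(size_gb=90.0, minutes=790.0)
```

Under these assumptions the break-even point is a little under 3 days: if such a dataset is reused less often than that and nothing else depends on it, regenerating it is cheaper than storing it.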

Evaluation

> Simulation strategies: 1) Store all the datasets; 2) Delete all the datasets; 3) Store high generation cost datasets; 4) Store often used datasets; 5) Dependency based strategy.

[Figure: Total cost of 50 days. x-axis: Days (1 to 49); y-axis: Cost ($), 0 to 50. Series: Store all; Store none; Store high generation cost datasets; Store often used datasets; Dependency based strategy.]

Conclusion and Future Work

> Conclusion

– Our strategy is cost-effective!

– Based on datasets’ cost rates

– Considered the dependencies among datasets

> Future work

– Data placement

– Minimum cost benchmark

End

> Questions?