1 a distributed task scheduler optimizing data transfer time taura lab. kei takahashi (56428) taura...

1

A distributed Task SchedulerOptimizing Data Transfer Time

Taura lab. Kei Takahashi (56428)

2

Task SchedulersTask Schedulers A system which distributes many serial

tasks onto the grid environment►Task assignments►File transfers

Many serial tasks can be executed in parallel

Some constraints need to be considered►Machine availability►Data location

A system which distributes many serial tasks onto the grid environment►Task assignments►File transfers

Many serial tasks can be executed in parallel

Some constraints need to be considered►Machine availability►Data location

3

Data Intensive ApplicationsData Intensive Applications

A computation using large data►Some gigabytes to petabytes►Natural language processing, data mining, etc.►A simple algorithm can extract useful general

knowledge A scheduler need additionally consider:

►Reduction in data transfers►Effective placement of data replicas

A computation using large data►Some gigabytes to petabytes►Natural language processing, data mining, etc.►A simple algorithm can extract useful general

knowledge A scheduler need additionally consider:

►Reduction in data transfers►Effective placement of data replicas

4

An Example of Scheduling An Example of Scheduling The scheduler maps tasks on each

machine

The scheduler maps tasks on each machine

AA BB

SchedulerScheduler Task t0Requires : f0Task t0Requires : f0

Task t1Requires : f1Task t1Requires : f1

Task t2Requires : f1Task t2Requires : f1

File f1File f1 File f0File f0

t1t1

A

B

f0A

B

Shorter processing time

t2t2f1 t1t1

t2t2t0t0

t0t0

5

Related WorkRelated Work Schedulers for data intensive applications

►GrADS: predicts transfer time, but only uses static bandwidth value between two hosts

Efficient multicast

Topology-aware bandwidth simulation►PBS: scheduling parallel jobs to nodes with

consodering network topology

Schedulers for data intensive applications►GrADS: predicts transfer time, but only uses st

atic bandwidth value between two hosts Efficient multicast

Topology-aware bandwidth simulation►PBS: scheduling parallel jobs to nodes with

consodering network topology

6

Topology-aware TransferTopology-aware Transfer If a bandwidth topology map is given, file

transfers can be more precisely estimated►Detecting congestions

By the estimation, more precise optimization becomes possible for the task scheduling

If a bandwidth topology map is given, file transfers can be more precisely estimated►Detecting congestions

By the estimation, more precise optimization becomes possible for the task schedulingCongestionoccurs

Switch

Destination NodesSource Nodes

7

Research Purpose Research Purpose Design and implement a distributed task

scheduler for data intensive applications►Predict data transfer time by using network

topology map with bandwidth►Map tasks onto machine►Plan efficient data transfers

Design and implement a distributed task scheduler for data intensive applications►Predict data transfer time by using network

topology map with bandwidth►Map tasks onto machine►Plan efficient data transfers

8

Input and OutputInput and Output Given information:

►File Locations and size►Network Topology

Output:►Task Scheduling:

• Assignments of nodes to tasks

►Transfer Scheduling:• From/to which host data are transferred• Limit bandwidth during the transfer if needed

The amount of data transfer changes depending on a task schedule

Given information:►File Locations and size►Network Topology

Output:►Task Scheduling:

• Assignments of nodes to tasks

►Transfer Scheduling:• From/to which host data are transferred• Limit bandwidth during the transfer if needed

The amount of data transfer changes depending on a task schedule

9

AgendaAgenda Background Purpose Our Approach Related Work Conclusion

Background Purpose Our Approach Related Work Conclusion

10

Our ApproachOur Approach Task execution time is hard to predict in general

►Schedule one task for each host at a time ►Assign new tasks when a certain number of

hosts have completed, or certain time passed When the data size and bandwidth is known,

file transfer time is predictable►Optimize data transfer time by using network t

opology and bandwidth information• Multicast• Transfer priorities

Task execution time is hard to predict in general►Schedule one task for each host at a time ►Assign new tasks when a certain number of

hosts have completed, or certain time passed When the data size and bandwidth is known,

file transfer time is predictable►Optimize data transfer time by using network t

opology and bandwidth information• Multicast• Transfer priorities

11

Problem FormulationProblem Formulation Final goal: minimizing the makespan Immediate goal: minimizing the sum of time

that takes before each task can start

More immediate goal: minimizing the sum of arrival time of every file on every node

Final goal: minimizing the makespan Immediate goal: minimizing the sum of time

that takes before each task can start

More immediate goal: minimizing the sum of arrival time of every file on every node

))(_(_

,_

taskeveryi

jifileeveryj

filetimetransfer

))(_(( ,_ _max ji

taskeveryi fileeveryj

filetimetransfer

(filei, j : the j th file required by taski )


12

(An image will come here) (An image will come here)

13

AlgorithmAlgorithm When some nodes are unschduled,

►Create an initial candidate task schedule►For each candidate task schedule:

• Decide priorities on each file transfer (including ongoing transfers)

• Plan efficient file transfer schedule• Estimate the usage of each link, and find the most

crowded link

►Search for a better schedule whose most crowded link is less crowded(by using heuristics like GA, SA)

When some nodes are unschduled,►Create an initial candidate task schedule►For each candidate task schedule:

• Decide priorities on each file transfer (including ongoing transfers)

• Plan efficient file transfer schedule• Estimate the usage of each link, and find the most

crowded link

►Search for a better schedule whose most crowded link is less crowded(by using heuristics like GA, SA)

14

F0F0

F1F1

Transfer Priorities (1)Transfer Priorities (1) If several transfers share a link, they divide the bandwidth

of that link Since a task cannot be started before the “entire” file has

arrived, it is more efficient to put priorities on each transfer► If two transfers are going on, the it is more efficient to

transfer the smaller file first

If several transfers share a link, they divide the bandwidth of that link

Since a task cannot be started before the “entire” file has arrived, it is more efficient to put priorities on each transfer► If two transfers are going on, the it is more efficient to

transfer the smaller file first

F0F0

F1F1

50MBps

50MBps

100MBps

100MBps

Both F0 and F1 arrive20 second after

100MBps 100MBps 100MBps

F1F1

Transfer of F0 completes 10 seconds after, and F1 arrive 20 seconds after

1GB 1GBF0F0

1GB 1GB

15

Transfer Priorities (2)Transfer Priorities (2) Objective: minimizing

Priorities are determined by these criteria:►A file needed by more tasks►A smaller file►A file with less replicas (sources)

Objective: minimizing

Priorities are determined by these criteria:►A file needed by more tasks►A smaller file►A file with less replicas (sources)

))(_(_

,_

taskeveryi

jifileeveryj

filetimetransfer


16

Transfer PlanningTransfer Planning Transfers are planned from a file with

higher priority (Greedy method)►The highest priority transfer can use as much

bandwidth as possible►A transfer with less priority can use the rest of

the bandwidth The bandwidth for ongoing transfers are

also reassigned

Transfers are planned from a file with higher priority (Greedy method)►The highest priority transfer can use as much

bandwidth as possible►A transfer with less priority can use the rest of

the bandwidth The bandwidth for ongoing transfers are

also reassigned

17

Transfer Priorities (example)Transfer Priorities (example)

Task0, Task1, Task2 are scheduled►File0: 3GB, needed by task0►File1: 2GB, needed by task0, task1, task2►File2: 1GB, needed by task1

Transferring priority:►File1 > File2 > File0

File1 is transferred using multicast pipeline

Task0, Task1, Task2 are scheduled►File0: 3GB, needed by task0►File1: 2GB, needed by task0, task1, task2►File2: 1GB, needed by task1

Transferring priority:►File1 > File2 > File0

File1 is transferred using multicast pipeline

Total BandwidthTask0 Task1 Task2

File1Task0 Task1 Task2

File2Task0 Task1 Task2 Task0 Task1 Task2

File0

18

Pipeline Multicast (1)Pipeline Multicast (1) For a given schedule, it is known which

nodes require which files When multiple nodes need a common file,

a pipeline multicast shortens transfer time(in the case of large files)

The speed of a pipeline broadcast is limited by the narrowest link in the tree

A broadcast can be sped up by efficiently using multiple sources

For a given schedule, it is known which nodes require which files

When multiple nodes need a common file, a pipeline multicast shortens transfer time(in the case of large files)

The speed of a pipeline broadcast is limited by the narrowest link in the tree

A broadcast can be sped up by efficiently using multiple sources

19

Pipeline Multicast (2)Pipeline Multicast (2) The tree is constructed depth-first manner Every related link is only used twice

(upward/downward) Since disk access is as slow as network, the disk

access bandwidth should be also counted

The tree is constructed depth-first manner Every related link is only used twice

(upward/downward) Since disk access is as slow as network, the disk

access bandwidth should be also counted

Source Destination

20

Multi-source MulticastMulti-source Multicast M nodes have the same source data; N nodes need it For each link in the order of bandwidth:

► If the link connects two nodes/switches which are already connected to the source node:

→ Discard the link►Otherwise: → Adopt the link(Kruskal's Algorithm: it maximizes the narrowest link in the pipeline

s)

M nodes have the same source data; N nodes need it For each link in the order of bandwidth:

► If the link connects two nodes/switches which are already connected to the source node:

→ Discard the link►Otherwise: → Adopt the link(Kruskal's Algorithm: it maximizes the narrowest link in the pipeline

s)

Discard this link

Pipeline 1

Pipeline 2Source Destination

21

Find Crowded LinkFind Crowded Link On a certain schedule, for each link,

►list every transfer using that link►sum up transfer size►calculate ”rough transfer time” as follows:

(total transfer size) / (bandwidth) Find the longest “rough transfer time”

On a certain schedule, for each link,►list every transfer using that link►sum up transfer size►calculate ”rough transfer time” as follows:

(total transfer size) / (bandwidth) Find the longest “rough transfer time”

Link 0 Link 1 Link 2 Link 3

Bandwidth bw0 bw1 bw2 Bw3

File 0 size0 0 0 size0

File 1 0 size1 size1 size1

File 2 0 0 0 size2

Rough transfer time

size0 / bw0 size1 / bw1 size1 / bw1 (size0 + size1 + size2 ) / bw3

22

Improve the ScheduleImprove the Schedule After the most crowded link is found, the

scheduler tries to reduce the transfer size by altering task assignments

We are thinking of using GA or Simulated Annealing. Since the most crowded link is known, we can try to reduce the transfer of this link in the mutation phase.

After the most crowded link is found, the scheduler tries to reduce the transfer size by altering task assignments

We are thinking of using GA or Simulated Annealing. Since the most crowded link is known, we can try to reduce the transfer of this link in the mutation phase.

23

Actual TransfersActual Transfers After the transfer schedule has determined,

the plan is performed as simulated In a pipeline transfer, the bandwidth is the s

ame through links The bandwidth of the narrowest link is limite

d to the calculated value When detecting a significant change in ban

dwidth, the schedule is reconstructed►The bandwidth is measured by using existing

methods (eg. nettimer, bprobe)

After the transfer schedule has determined, the plan is performed as simulated

In a pipeline transfer, the bandwidth is the same through links

The bandwidth of the narrowest link is limited to the calculated value

When detecting a significant change in bandwidth, the schedule is reconstructed►The bandwidth is measured by using existing

methods (eg. nettimer, bprobe)

24

Re-scheduling TransfersRe-scheduling Transfers When a file transfer has finished, transfer

schedule is recalculated►Calculating the priority on each task►Assign bandwidth for each task

A new bandwidth value is assigned for each transfer, but the pipeline is not changed

When a file transfer has finished, transfer schedule is recalculated►Calculating the priority on each task►Assign bandwidth for each task

A new bandwidth value is assigned for each transfer, but the pipeline is not changed

25

Task DescriptionTask Description A user need specify required files, output fil

es and dependencies for each task In order to enable flexible task description,

we provide scheduling API for script languages►Files are identified by URIs

(ex. “abc.com:/home/kay/some/location”)►The scheduler analyses dependencies from fil

epath

A user need specify required files, output files and dependencies for each task

In order to enable flexible task description, we provide scheduling API for script languages►Files are identified by URIs

(ex. “abc.com:/home/kay/some/location”)►The scheduler analyses dependencies from fil

epath

26

Task SubmissionTask Submission A task is submitted by calling submit()

with required filepath

A task is submitted by calling submit() with required filepath

fs = sched.list_files("abc.com:/home/kay/some/location/**”)parsed_files = []

for fn in fs: new_fn = fn + ".parsed“ command = "parser " + fn + " " + new_fn sched.submit(command, [new_fn], []) parsed_files.append(new_fn)

sched.gather(parsed_files, “abc.com:/home/kay/parsed/")

27

ConclusionConclusion Introduced new scheduling algorithm

►Predict transfer time by using network topology, and search for a better task schedule

►Efficiently transfer files by limiting bandwidth►Dynamically re-scheduling transfers

Current status:►The implementation is ongoing

Introduced new scheduling algorithm►Predict transfer time by using network

topology, and search for a better task schedule►Efficiently transfer files by limiting bandwidth►Dynamically re-scheduling transfers

Current status:►The implementation is ongoing

28

PublicationsPublications 高橋慧 , 田浦健次朗 , 近山隆 . マイグレーションを支援する分散集合オブジ

ェクト．並列／分散／協調処理に関するサマーワークショップ（ SWoPP2005 ），武雄， 2005 年 8 月．

高橋慧 , 田浦健次朗 , 近山隆 . マイグレーションを支援する分散集合オブジェクト . 先進的計算基盤シンポジウム（ SACSIS 2005 ），筑波， 2005 年 5月．

高橋慧 , 田浦健次朗 , 近山隆 . マイグレーションを支援する分散集合オブジェクト．並列／分散／協調処理に関するサマーワークショップ（ SWoPP2005 ），武雄， 2005 年 8 月．

高橋慧 , 田浦健次朗 , 近山隆 . マイグレーションを支援する分散集合オブジェクト . 先進的計算基盤シンポジウム（ SACSIS 2005 ），筑波， 2005 年 5月．

1 a distributed task scheduler optimizing data transfer time taura lab. kei takahashi (56428) taura...

Documents

file transfer time

data transfer changes

data mining

data size

sum of time

file locations

j th file

entire file