1 a distributed task scheduler optimizing data transfer time taura lab. kei takahashi (56428) taura...
TRANSCRIPT
1
A distributed Task SchedulerOptimizing Data Transfer Time
Taura lab. Kei Takahashi (56428)
2
Task SchedulersTask Schedulers A system which distributes many serial
tasks onto the grid environment►Task assignments►File transfers
Many serial tasks can be executed in parallel
Some constraints need to be considered►Machine availability►Data location
A system which distributes many serial tasks onto the grid environment►Task assignments►File transfers
Many serial tasks can be executed in parallel
Some constraints need to be considered►Machine availability►Data location
3
Data Intensive ApplicationsData Intensive Applications
A computation using large data►Some gigabytes to petabytes►Natural language processing, data mining, etc.►A simple algorithm can extract useful general
knowledge A scheduler need additionally consider:
►Reduction in data transfers►Effective placement of data replicas
A computation using large data►Some gigabytes to petabytes►Natural language processing, data mining, etc.►A simple algorithm can extract useful general
knowledge A scheduler need additionally consider:
►Reduction in data transfers►Effective placement of data replicas
4
An Example of Scheduling An Example of Scheduling The scheduler maps tasks on each
machine
The scheduler maps tasks on each machine
AA BB
SchedulerScheduler Task t0Requires : f0Task t0Requires : f0
Task t1Requires : f1Task t1Requires : f1
Task t2Requires : f1Task t2Requires : f1
File f1File f1 File f0File f0
t1t1
A
B
f0A
B
Shorter processing time
t2t2f1 t1t1
t2t2t0t0
t0t0
5
Related WorkRelated Work Schedulers for data intensive applications
►GrADS: predicts transfer time, but only uses static bandwidth value between two hosts
Efficient multicast
Topology-aware bandwidth simulation►PBS: scheduling parallel jobs to nodes with
consodering network topology
Schedulers for data intensive applications►GrADS: predicts transfer time, but only uses st
atic bandwidth value between two hosts Efficient multicast
Topology-aware bandwidth simulation►PBS: scheduling parallel jobs to nodes with
consodering network topology
6
Topology-aware TransferTopology-aware Transfer If a bandwidth topology map is given, file
transfers can be more precisely estimated►Detecting congestions
By the estimation, more precise optimization becomes possible for the task scheduling
If a bandwidth topology map is given, file transfers can be more precisely estimated►Detecting congestions
By the estimation, more precise optimization becomes possible for the task schedulingCongestionoccurs
Switch
Destination NodesSource Nodes
7
Research Purpose Research Purpose Design and implement a distributed task
scheduler for data intensive applications►Predict data transfer time by using network
topology map with bandwidth►Map tasks onto machine►Plan efficient data transfers
Design and implement a distributed task scheduler for data intensive applications►Predict data transfer time by using network
topology map with bandwidth►Map tasks onto machine►Plan efficient data transfers
8
Input and OutputInput and Output Given information:
►File Locations and size►Network Topology
Output:►Task Scheduling:
• Assignments of nodes to tasks
►Transfer Scheduling:• From/to which host data are transferred• Limit bandwidth during the transfer if needed
The amount of data transfer changes depending on a task schedule
Given information:►File Locations and size►Network Topology
Output:►Task Scheduling:
• Assignments of nodes to tasks
►Transfer Scheduling:• From/to which host data are transferred• Limit bandwidth during the transfer if needed
The amount of data transfer changes depending on a task schedule
9
AgendaAgenda Background Purpose Our Approach Related Work Conclusion
Background Purpose Our Approach Related Work Conclusion
10
Our ApproachOur Approach Task execution time is hard to predict in general
►Schedule one task for each host at a time ►Assign new tasks when a certain number of
hosts have completed, or certain time passed When the data size and bandwidth is known,
file transfer time is predictable►Optimize data transfer time by using network t
opology and bandwidth information• Multicast• Transfer priorities
Task execution time is hard to predict in general►Schedule one task for each host at a time ►Assign new tasks when a certain number of
hosts have completed, or certain time passed When the data size and bandwidth is known,
file transfer time is predictable►Optimize data transfer time by using network t
opology and bandwidth information• Multicast• Transfer priorities
11
Problem FormulationProblem Formulation Final goal: minimizing the makespan Immediate goal: minimizing the sum of time
that takes before each task can start
More immediate goal: minimizing the sum of arrival time of every file on every node
Final goal: minimizing the makespan Immediate goal: minimizing the sum of time
that takes before each task can start
More immediate goal: minimizing the sum of arrival time of every file on every node
))(_(_
,_
taskeveryi
jifileeveryj
filetimetransfer
))(_(( ,_ _max ji
taskeveryi fileeveryj
filetimetransfer
(filei, j : the j th file required by taski )
(filei, j : the j th file required by taski )
12
(An image will come here) (An image will come here)
13
AlgorithmAlgorithm When some nodes are unschduled,
►Create an initial candidate task schedule►For each candidate task schedule:
• Decide priorities on each file transfer (including ongoing transfers)
• Plan efficient file transfer schedule• Estimate the usage of each link, and find the most
crowded link
►Search for a better schedule whose most crowded link is less crowded(by using heuristics like GA, SA)
When some nodes are unschduled,►Create an initial candidate task schedule►For each candidate task schedule:
• Decide priorities on each file transfer (including ongoing transfers)
• Plan efficient file transfer schedule• Estimate the usage of each link, and find the most
crowded link
►Search for a better schedule whose most crowded link is less crowded(by using heuristics like GA, SA)
14
F0F0
F1F1
Transfer Priorities (1)Transfer Priorities (1) If several transfers share a link, they divide the bandwidth
of that link Since a task cannot be started before the “entire” file has
arrived, it is more efficient to put priorities on each transfer► If two transfers are going on, the it is more efficient to
transfer the smaller file first
If several transfers share a link, they divide the bandwidth of that link
Since a task cannot be started before the “entire” file has arrived, it is more efficient to put priorities on each transfer► If two transfers are going on, the it is more efficient to
transfer the smaller file first
F0F0
F1F1
50MBps
50MBps
100MBps
100MBps
Both F0 and F1 arrive20 second after
100MBps 100MBps 100MBps
F1F1
Transfer of F0 completes 10 seconds after, and F1 arrive 20 seconds after
1GB 1GBF0F0
1GB 1GB
15
Transfer Priorities (2)Transfer Priorities (2) Objective: minimizing
Priorities are determined by these criteria:►A file needed by more tasks►A smaller file►A file with less replicas (sources)
Objective: minimizing
Priorities are determined by these criteria:►A file needed by more tasks►A smaller file►A file with less replicas (sources)
))(_(_
,_
taskeveryi
jifileeveryj
filetimetransfer
(filei, j : the j th file required by taski )
16
Transfer PlanningTransfer Planning Transfers are planned from a file with
higher priority (Greedy method)►The highest priority transfer can use as much
bandwidth as possible►A transfer with less priority can use the rest of
the bandwidth The bandwidth for ongoing transfers are
also reassigned
Transfers are planned from a file with higher priority (Greedy method)►The highest priority transfer can use as much
bandwidth as possible►A transfer with less priority can use the rest of
the bandwidth The bandwidth for ongoing transfers are
also reassigned
17
Transfer Priorities (example)Transfer Priorities (example)
Task0, Task1, Task2 are scheduled►File0: 3GB, needed by task0►File1: 2GB, needed by task0, task1, task2►File2: 1GB, needed by task1
Transferring priority:►File1 > File2 > File0
File1 is transferred using multicast pipeline
Task0, Task1, Task2 are scheduled►File0: 3GB, needed by task0►File1: 2GB, needed by task0, task1, task2►File2: 1GB, needed by task1
Transferring priority:►File1 > File2 > File0
File1 is transferred using multicast pipeline
Total BandwidthTask0 Task1 Task2
File1Task0 Task1 Task2
File2Task0 Task1 Task2 Task0 Task1 Task2
File0
18
Pipeline Multicast (1)Pipeline Multicast (1) For a given schedule, it is known which
nodes require which files When multiple nodes need a common file,
a pipeline multicast shortens transfer time(in the case of large files)
The speed of a pipeline broadcast is limited by the narrowest link in the tree
A broadcast can be sped up by efficiently using multiple sources
For a given schedule, it is known which nodes require which files
When multiple nodes need a common file, a pipeline multicast shortens transfer time(in the case of large files)
The speed of a pipeline broadcast is limited by the narrowest link in the tree
A broadcast can be sped up by efficiently using multiple sources
19
Pipeline Multicast (2)Pipeline Multicast (2) The tree is constructed depth-first manner Every related link is only used twice
(upward/downward) Since disk access is as slow as network, the disk
access bandwidth should be also counted
The tree is constructed depth-first manner Every related link is only used twice
(upward/downward) Since disk access is as slow as network, the disk
access bandwidth should be also counted
Source Destination
20
Multi-source MulticastMulti-source Multicast M nodes have the same source data; N nodes need it For each link in the order of bandwidth:
► If the link connects two nodes/switches which are already connected to the source node:
→ Discard the link►Otherwise: → Adopt the link(Kruskal's Algorithm: it maximizes the narrowest link in the pipeline
s)
M nodes have the same source data; N nodes need it For each link in the order of bandwidth:
► If the link connects two nodes/switches which are already connected to the source node:
→ Discard the link►Otherwise: → Adopt the link(Kruskal's Algorithm: it maximizes the narrowest link in the pipeline
s)
Discard this link
Pipeline 1
Pipeline 2Source Destination
21
Find Crowded LinkFind Crowded Link On a certain schedule, for each link,
►list every transfer using that link►sum up transfer size►calculate ”rough transfer time” as follows:
(total transfer size) / (bandwidth) Find the longest “rough transfer time”
On a certain schedule, for each link,►list every transfer using that link►sum up transfer size►calculate ”rough transfer time” as follows:
(total transfer size) / (bandwidth) Find the longest “rough transfer time”
Link 0 Link 1 Link 2 Link 3
Bandwidth bw0 bw1 bw2 Bw3
File 0 size0 0 0 size0
File 1 0 size1 size1 size1
File 2 0 0 0 size2
Rough transfer time
size0 / bw0 size1 / bw1 size1 / bw1 (size0 + size1 + size2 ) / bw3
22
Improve the ScheduleImprove the Schedule After the most crowded link is found, the
scheduler tries to reduce the transfer size by altering task assignments
We are thinking of using GA or Simulated Annealing. Since the most crowded link is known, we can try to reduce the transfer of this link in the mutation phase.
After the most crowded link is found, the scheduler tries to reduce the transfer size by altering task assignments
We are thinking of using GA or Simulated Annealing. Since the most crowded link is known, we can try to reduce the transfer of this link in the mutation phase.
23
Actual TransfersActual Transfers After the transfer schedule has determined,
the plan is performed as simulated In a pipeline transfer, the bandwidth is the s
ame through links The bandwidth of the narrowest link is limite
d to the calculated value When detecting a significant change in ban
dwidth, the schedule is reconstructed►The bandwidth is measured by using existing
methods (eg. nettimer, bprobe)
After the transfer schedule has determined, the plan is performed as simulated
In a pipeline transfer, the bandwidth is the same through links
The bandwidth of the narrowest link is limited to the calculated value
When detecting a significant change in bandwidth, the schedule is reconstructed►The bandwidth is measured by using existing
methods (eg. nettimer, bprobe)
24
Re-scheduling TransfersRe-scheduling Transfers When a file transfer has finished, transfer
schedule is recalculated►Calculating the priority on each task►Assign bandwidth for each task
A new bandwidth value is assigned for each transfer, but the pipeline is not changed
When a file transfer has finished, transfer schedule is recalculated►Calculating the priority on each task►Assign bandwidth for each task
A new bandwidth value is assigned for each transfer, but the pipeline is not changed
25
Task DescriptionTask Description A user need specify required files, output fil
es and dependencies for each task In order to enable flexible task description,
we provide scheduling API for script languages►Files are identified by URIs
(ex. “abc.com:/home/kay/some/location”)►The scheduler analyses dependencies from fil
epath
A user need specify required files, output files and dependencies for each task
In order to enable flexible task description, we provide scheduling API for script languages►Files are identified by URIs
(ex. “abc.com:/home/kay/some/location”)►The scheduler analyses dependencies from fil
epath
26
Task SubmissionTask Submission A task is submitted by calling submit()
with required filepath
A task is submitted by calling submit() with required filepath
fs = sched.list_files("abc.com:/home/kay/some/location/**”)parsed_files = []
for fn in fs: new_fn = fn + ".parsed“ command = "parser " + fn + " " + new_fn sched.submit(command, [new_fn], []) parsed_files.append(new_fn)
sched.gather(parsed_files, “abc.com:/home/kay/parsed/")
27
ConclusionConclusion Introduced new scheduling algorithm
►Predict transfer time by using network topology, and search for a better task schedule
►Efficiently transfer files by limiting bandwidth►Dynamically re-scheduling transfers
Current status:►The implementation is ongoing
Introduced new scheduling algorithm►Predict transfer time by using network
topology, and search for a better task schedule►Efficiently transfer files by limiting bandwidth►Dynamically re-scheduling transfers
Current status:►The implementation is ongoing
28
PublicationsPublications 高橋慧 , 田浦健次朗 , 近山隆 . マイグレーションを支援する分散集合オブジ
ェクト.並列/分散/協調処理に関するサマーワークショップ( SWoPP2005 ),武雄, 2005 年 8 月.
高橋慧 , 田浦健次朗 , 近山隆 . マイグレーションを支援する分散集合オブジェクト . 先進的計算基盤シンポジウム( SACSIS 2005 ),筑波, 2005 年 5月.
高橋慧 , 田浦健次朗 , 近山隆 . マイグレーションを支援する分散集合オブジェクト.並列/分散/協調処理に関するサマーワークショップ( SWoPP2005 ),武雄, 2005 年 8 月.
高橋慧 , 田浦健次朗 , 近山隆 . マイグレーションを支援する分散集合オブジェクト . 先進的計算基盤シンポジウム( SACSIS 2005 ),筑波, 2005 年 5月.