AME: An Any-scale Many-task Computing Engine
DESCRIPTION
AME: An Any-scale many-task computing Engine. Zhao Zhang, University of Chicago; Daniel S. Katz, CI University of Chicago & ANL; Matei Ripeanu, ECE University of British Columbia; Michael Wilde, CI University of Chicago & ANL; Ian Foster, CI University of Chicago & ANL. MTC application review.
TRANSCRIPT
www.ci.anl.gov www.ci.uchicago.edu
AME: An Any-scale many-task computing Engine
Zhao Zhang, University of Chicago
Daniel S. Katz, CI University of Chicago & ANL
Matei Ripeanu, ECE University of British Columbia
Michael Wilde, CI University of Chicago & ANL
Ian Foster, CI University of Chicago & ANL
MTC application review
• Sequenced execution of other programs
• Involves several different programs
• High degree of inter-task parallelism
• Large number of invocations, up to millions
• Parallelism is enabled by file dependency
• Programs exchange data via POSIX files
[Figure: workflow diagram. Stages: mProject (three tasks), mDiff (two tasks), mFit (two tasks), mConFit (one task). File labels: 1, 2, 3, 1&2, 2&3.]
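How file dependencies alone can expose parallelism is sketched below. This is an illustrative sketch, not AME code: the stage names come from the slide, while the file names (`raw1`, `p1`, `d12`, and so on) and the `waves` helper are assumptions made for the example. Tasks whose input files all exist form a wave and could run concurrently.

```python
# Hypothetical task list: (task name, input files, output files).
# Only the stage names come from the slide; file names are made up.
TASKS = [
    ("mProject-1", ["raw1"], ["p1"]),
    ("mProject-2", ["raw2"], ["p2"]),
    ("mProject-3", ["raw3"], ["p3"]),
    ("mDiff-1&2", ["p1", "p2"], ["d12"]),
    ("mDiff-2&3", ["p2", "p3"], ["d23"]),
    ("mFit-1&2", ["d12"], ["f12"]),
    ("mFit-2&3", ["d23"], ["f23"]),
    ("mConFit", ["f12", "f23"], ["fits"]),
]


def waves(tasks, initial_files):
    """Group tasks into waves; all tasks in one wave can run in parallel."""
    available = set(initial_files)
    pending = list(tasks)
    result = []
    while pending:
        # A task is ready when every one of its input files exists.
        ready = [t for t in pending if all(f in available for f in t[1])]
        if not ready:
            raise ValueError("unsatisfiable file dependency")
        result.append([name for name, _, _ in ready])
        for _, _, outputs in ready:
            available.update(outputs)
        pending = [t for t in pending if t not in ready]
    return result


schedule = waves(TASKS, ["raw1", "raw2", "raw3"])
for index, wave in enumerate(schedule):
    print(index, wave)
```

No explicit ordering is given to `waves`; the four waves fall out of which files each task reads and writes.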
Supercomputer review
• Compute nodes with multiple cores
• No local disk, limited RAM disk
• Full Linux kernel
• Large number of compute nodes
[Figure: supercomputer architecture. Labels: Interconnect, IO, Exclusive Data Collection Networks, Optional Data Collection Network, LN, Storage Network, Control Network.]
Gaps
• Resource Provisioning
• Task Management
– Task Dispatching
– Dependency Resolution
– Load Balancing
• Data Management
• Resiliency
AME Overview
Task Management
• Task Dispatching
– All tasks are sent to the workers and queued there.
– Each worker screens all of its tasks.
– Each worker finds out the state and location of the input data for all of its tasks.
– Workers subscribe to the FLS (File Location Lookup Service) for the files their tasks need.
– Tasks that can run immediately are pushed into a ready queue; the others are kept in a hash table.
– Tasks in the hash table are moved to the ready queue once their input files are ready.
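The worker-side bookkeeping described above can be sketched as follows. This is a minimal single-process sketch under stated assumptions, not the AME implementation: the class names, method names, and the rule that a query for a missing file also subscribes the worker are all hypothetical choices made for illustration.

```python
from collections import deque


class FileLocationService:
    """Toy stand-in for the FLS: tracks file locations and subscribers."""

    def __init__(self):
        self.locations = {}    # file name -> node holding the file
        self.subscribers = {}  # file name -> workers waiting for it

    def query(self, worker, filename):
        """Return the location if known; otherwise subscribe the worker."""
        if filename in self.locations:
            return self.locations[filename]
        self.subscribers.setdefault(filename, []).append(worker)
        return None

    def update(self, filename, node):
        """Record a newly produced file and notify waiting workers."""
        self.locations[filename] = node
        for worker in self.subscribers.pop(filename, []):
            worker.on_file_ready(filename)


class Worker:
    def __init__(self, name, fls):
        self.name = name
        self.fls = fls
        self.ready = deque()  # tasks whose inputs are all available
        self.waiting = {}     # hash table: task -> set of missing inputs

    def receive(self, task, inputs):
        """Screen one dispatched task: ready queue or hash table."""
        missing = {f for f in inputs if self.fls.query(self, f) is None}
        if missing:
            self.waiting[task] = missing
        else:
            self.ready.append(task)

    def on_file_ready(self, filename):
        """Move tasks to the ready queue once their last input arrives."""
        for task in list(self.waiting):
            self.waiting[task].discard(filename)
            if not self.waiting[task]:
                del self.waiting[task]
                self.ready.append(task)


fls = FileLocationService()
fls.update("a.fits", "node0")          # a.fits already exists
w = Worker("node1", fls)
w.receive("producer", ["a.fits"])      # runnable immediately
w.receive("consumer", ["b.fits"])      # b.fits not produced yet
ready_before = list(w.ready)
fls.update("b.fits", "node1")          # producer finished writing b.fits
ready_after = list(w.ready)
print(ready_before, ready_after)
```

A real deployment would exchange these queries and notifications over the network; the sketch only shows the queue and hash-table transitions.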
Task Management
• Task Dispatching
– Test setup
o Parameter sweep over scale and task length
o Scale = {256, 512, 1024, 2048, 4096, 8192, 16384} cores
o Task length = {0, 1, 4, 16, 64, 256} seconds
o 16 tasks per core
o Dispatch rate:
$$\text{Dispatch Rate} = \frac{\text{Tasks\_per\_core} \times \text{Num\_cores}}{\text{Time\_to\_solution}}$$
[Figure: Decentralized]
Task Management
• Task Dispatching
– Test setup
o Parameter sweep over scale and task length
o Scale = {256, 512, 1024, 2048, 4096, 8192, 16384} cores
o Task length = {1, 4, 16, 64, 256} seconds
o 16 tasks per core
o Efficiency:
$$\text{Efficiency} = \frac{\text{Task\_length} \times \text{Num\_task\_core}}{\text{Time\_to\_solution}}$$
[Figure: Decentralized]
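The two metrics used in the dispatching tests, as reconstructed from the slide text, can be written as small helpers. Dispatch rate is total tasks divided by time to solution; efficiency is the ideal per-core run time (task length times tasks per core) divided by time to solution. The function names and the sample timings below are assumptions for illustration, not measurements from the talk.

```python
def dispatch_rate(tasks_per_core, num_cores, time_to_solution):
    """Tasks dispatched per second over the whole run."""
    return tasks_per_core * num_cores / time_to_solution


def efficiency(task_length, tasks_per_core, time_to_solution):
    """Ideal per-core run time divided by the measured time to solution."""
    return task_length * tasks_per_core / time_to_solution


# Hypothetical timings, chosen only to exercise the formulas.
rate = dispatch_rate(tasks_per_core=16, num_cores=256, time_to_solution=2.0)
eff = efficiency(task_length=4, tasks_per_core=16, time_to_solution=80.0)
print(rate, eff)
```

Efficiency is undefined for zero-length tasks, which would explain why the efficiency sweep starts at 1 second while the dispatch-rate sweep includes 0.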
Task Management
• Dependency Resolution
• States of Intermediate Files
– Invalid: The file has not been produced yet.
– Remote: The file has been produced and is stored at some peer node.
– Local: The file has been moved to local storage.
– Shared: The file has been moved to the global shared file system.
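The four file states map naturally onto an enumeration. The sketch below is an assumption-laden illustration, not AME code: the helper names are hypothetical, and it assumes that only `REMOTE` inputs need a peer-to-peer transfer while `SHARED` inputs are read directly from the global file system.

```python
from enum import Enum


class FileState(Enum):
    INVALID = "invalid"  # not produced yet
    REMOTE = "remote"    # produced, stored at some peer node
    LOCAL = "local"      # already in local storage
    SHARED = "shared"    # in the global shared file system


def can_run(input_states):
    """A task can run once none of its inputs is still INVALID."""
    return all(s is not FileState.INVALID for s in input_states.values())


def needs_transfer(input_states):
    """Inputs that must be fetched from a peer before the task starts."""
    return [name for name, s in input_states.items() if s is FileState.REMOTE]


states = {"a": FileState.LOCAL, "b": FileState.REMOTE, "c": FileState.SHARED}
print(can_run(states), needs_transfer(states))
```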
Task Management
• Dependency Resolution
Query a produced file
Query an invalid file
Task Management
• Dependency Resolution
– Test Setup:
o Parameter sweep over scale and running time, with file size fixed at 10 bytes
o Scale = {256, 512, 1024, 2048, 4096, 8192, 16384} cores
o Running time = {0, 1, 4, 16} seconds
o Each core runs 16 tasks
o The 16 tasks are divided into 8 pairs, with a producer/consumer relation in each pair
o Tests are run under the worst case
[Figure: Overhead Summary]
Task Management
• Dependency Resolution – File size impact
– Test Setup
o Parameter sweep over scale and data size, with a fixed running time of 16 seconds
o Scale = {256, 512, 1024, 2048, 4096, 8192} cores
o File size = {1KB, 1MB, 10MB}
o Each core runs 8 tasks
o The 8 tasks are divided into 4 pairs, with a producer/consumer relation in each pair
o Tests are run under the worst case
[Figure: Performance]
Task Management
• Overhead Analysis
– Query/Update/Transfer traffic congested in transit over the network
– Saturated CPU
– Query/Update traffic congested at the server side
o Congested in the queue
o Congested by the synchronization of the server
• Test Setup
– Scale: 256 cores
– Running time: 16 seconds
– File size: 10 bytes
– Number of jobs: 16 tasks per core
– The 16 tasks are divided into 8 pairs, with a producer/consumer relation in each pair
Performance
                          Query-Queuing   Query     Update-Queuing   Update
Average Processing Time   144.31 ms       0.30 ms   2.45 ms          0.36 ms
Standard Deviation        14.24 ms        7.15 ms   0.085 ms         0.14 ms
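One way to read the averages in the table is to compare time spent queuing with time spent in the operation itself. The sketch below assumes the two phases are sequential and additive per request, which the slide does not state; the `queuing_share` helper is hypothetical.

```python
# Average processing times in milliseconds, copied from the table.
AVG_MS = {
    "query_queuing": 144.31,
    "query": 0.30,
    "update_queuing": 2.45,
    "update": 0.36,
}


def queuing_share(queuing_ms, service_ms):
    """Fraction of (queuing + service) time that is spent in the queue."""
    return queuing_ms / (queuing_ms + service_ms)


query_share = queuing_share(AVG_MS["query_queuing"], AVG_MS["query"])
update_share = queuing_share(AVG_MS["update_queuing"], AVG_MS["update"])
print(f"query: {query_share:.1%}, update: {update_share:.1%}")
```

Under that assumption, queuing rather than the lookup itself accounts for nearly all of the average query latency, which is consistent with the "congested in the queue" item in the overhead analysis.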
Data Management
• Intermediate File Storage
– Isolated file storage & processing vs. collocated

                           File-based          Chunk-based
Single File Size           Limited to CN RAM   Limited to Aggregated Space

                           Collocated          Isolated
Scalability                High                Up to Implementation
Storage Space              Spread among CN     Configurable
Data Movements             1                   2
Transfer Traffic Pattern   Fully-distributed   Partially-distributed
Saturated CN               yes                 no
Data Management
• Intermediate File Storage
– Isolated file storage & processing vs. collocated
• Test Setup
– Parameter sweep over scale, with a fixed running time of 16 seconds
– Scale = {256, 1024, 4096, 16384} cores
– Each core runs 16 tasks
– The 16 tasks are divided into 8 pairs, with a producer/consumer relation in each pair
– Tests are run under the worst case
[Figure: Performance]
Application
• Montage is an astronomy application that composes small telescope images into one large image. It has run successfully on supercomputers and grids, with MPI and Pegasus respectively.
Application
• Test Setup
– 6 degree x 6 degree mosaic centered at galaxy M101
– Input: 1319 files, each around 2 MB
– Output: 1 file, 3.7 GB
– Parallel stages: mProjectPP, mDiffFit, mBackground
– 512 cores, data management, no load balancing

Stage         Number of Tasks   TTS 1 core (s)   TTS 512 cores (s)   Speedup   TTS 256 cores on GPFS (s)
mProject      1319              21220.32         56.53               375.38    1675.11
mDiffFit      3883              35960.12         95.32               377.27    732.25
mBackground   1297              9815.92          64.44               152.33    287.84
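The speedup column can be recomputed from the two timing columns as the 1-core time divided by the 512-core time. The sketch below does that and also derives parallel efficiency (speedup divided by core count), a standard metric that is an addition here and does not appear on the slide.

```python
# (TTS on 1 core, TTS on 512 cores) in seconds, copied from the table.
TTS = {
    "mProject": (21220.32, 56.53),
    "mDiffFit": (35960.12, 95.32),
    "mBackground": (9815.92, 64.44),
}
CORES = 512

speedup = {stage: t1 / t512 for stage, (t1, t512) in TTS.items()}
# Parallel efficiency is not on the slide; it is derived here.
parallel_efficiency = {stage: s / CORES for stage, s in speedup.items()}

for stage in TTS:
    print(f"{stage}: speedup {speedup[stage]:.2f}, "
          f"efficiency {parallel_efficiency[stage]:.2f}")
```

The recomputed speedups agree with the slide to within rounding.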
Application

                     GPFS (MB)   AME (MB)   Saving (%)
mProject-input       2800        2800       0%
mProject-output      5500        0.36       100%
mDiffFit-input       31000       0          100%
mDiffFit-output      3900        0.81       100%
mBackground-input    5200        0          100%
mBackground-output   5200        5200       0%
mAdd-input           5200        5200       0%
mAdd-output          3700        3700       0%
Total                62500       16901      73%
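The totals and the overall saving can be checked directly from the per-stage rows. The sketch below treats the saving as one minus the ratio of AME traffic to GPFS traffic, which is an assumed reading of the "Saving" column that matches the reported 73%.

```python
# (row label, GPFS traffic in MB, AME traffic in MB), copied from the table.
TRAFFIC_MB = [
    ("mProject-input", 2800, 2800),
    ("mProject-output", 5500, 0.36),
    ("mDiffFit-input", 31000, 0),
    ("mDiffFit-output", 3900, 0.81),
    ("mBackground-input", 5200, 0),
    ("mBackground-output", 5200, 5200),
    ("mAdd-input", 5200, 5200),
    ("mAdd-output", 3700, 3700),
]

gpfs_total = sum(gpfs for _, gpfs, _ in TRAFFIC_MB)
ame_total = sum(ame for _, _, ame in TRAFFIC_MB)
saving = 1 - ame_total / gpfs_total

print(f"GPFS {gpfs_total} MB, AME {ame_total:.0f} MB, saving {saving:.0%}")
```

Almost all of the saving comes from the intermediate files between mProject, mDiffFit, and mBackground, which in the AME column no longer go through GPFS.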
Summary
• We identify and classify the gaps between MTC applications and supercomputers into six categories: resource provisioning, task dispatching, task dependency resolution, load balancing, data management, and resiliency.
• We design and implement AME to bridge these gaps. (in future)
• The results show that AME scales well up to 16,384 cores.
• AME accelerates MTC applications, such as Montage, on supercomputers.