A RAM-disk provisioning service for high-performance data analysis
Allan Espinosa† (aespinosa@cs.uchicago.edu)
Mentors: M. Woitaszek‡ and J. Dennis‡
†University of Chicago, ‡National Center for Atmospheric Research
July 29, 2011
Outline
1 Motivation: data analysis
2 Approach and challenges
3 Implementation
4 Target applications
5 Conclusions
Motivation: data-intensive post-processing
[Diagram components: simulation results; computing center; transfer nodes; analysis cluster running Analysis 1, Analysis 2, ..., Analysis n; spinning disk-based parallel file system; tape archive]
Multiple trips to disk are slow
Approach: Run analysis on RAM
Fast I/O access
tmpfs or formatted /dev/ram
Problem: Restricted parallelism (diagram: the CPUs of one analysis node share a RAM-based disk)
NFS-exported RAM
Problem: Restricted data size (diagram: additional CPUs access a single RAM-based disk)
Split data over multiple nodes
Problem: Requires thorough I/O management (diagram: each group of CPUs has its own RAM-based disk)
Lustre parallel RAM file system
(diagram: all CPUs access one Lustre parallel RAM file system)
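The node-local options named on this slide (tmpfs, or a formatted /dev/ram device) can be set up with standard Linux tools. The sketch below is a dry run that only prints the commands rather than executing them. The function name, mount point, size, and the choice of /dev/ram0 with XFS are assumptions for illustration, and the printed commands would need root privileges if actually run.

```shell
# Dry run: print the commands for two node-local RAM disk variants.
# Hypothetical helper, not part of the original service.
# Usage: ramdisk_cmds KIND MOUNT_POINT [SIZE]
ramdisk_cmds() {
    kind=$1
    mount_point=$2
    size=${3:-1g}
    case "$kind" in
        tmpfs)
            # tmpfs keeps file data in memory; size= caps how large it can grow.
            echo "mount -t tmpfs -o size=$size tmpfs $mount_point"
            ;;
        ram)
            # /dev/ram0 is a RAM-backed block device, so it needs a file system first.
            echo "mkfs.xfs /dev/ram0"
            echo "mount /dev/ram0 $mount_point"
            ;;
        *)
            echo "unknown kind: $kind" >&2
            return 1
            ;;
    esac
}

ramdisk_cmds tmpfs /mnt/ramdisk 25g
ramdisk_cmds ram /mnt/ramdisk
```

Either variant is visible only on the node that mounts it, which is the restriction the slide lists for this option.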
Solution: Automatically-provisioned parallel file system
[Architecture diagram. Polynya analysis cluster: control node, scheduler, transfer node, analysis nodes, archive node, parallel RAM file system, tape archive. User client: submits jobs. Across the WAN, Kraken: transfer node, file system.]
Remote triggering the workflow
Kraken: Simulation finishes, then triggers the workflow on Polynya
Workflow (on Polynya):
Request space
Transfer datasets
Archive datasets
Run analysis
Trigger cleanup
Requesting RAM-based disk space
Implementation: PBS Torque+Maui scheduler generic resource
Parameters:
amount of space
duration of allocation
1 Route to control node
2 Prepare space
3 Sleep until allocation expiration
4 Email notice before expiration
5 Clean up space
#PBS -W x="GRES:ramdisk@25"
#PBS -l walltime="48:00:00"
#PBS -q ramdisk_service
#PBS -l prologue=allocate.sh
#PBS -l epilogue=cleanup.sh
sleep 45h
mail user@cluster ...
sleep 3h
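The two sleep durations in the job script above add up to the requested walltime: 45 hours before the notice plus 3 hours after it gives 48 hours. A minimal sketch of that arithmetic follows, assuming whole hours. The function name `notice_schedule` and its arguments are illustrative and are not part of the original service.

```shell
# Hypothetical helper (not from the original service): split an allocation's
# walltime into the sleep before the expiration notice and the sleep after it.
# Usage: notice_schedule WALLTIME_HOURS NOTICE_LEAD_HOURS
notice_schedule() {
    walltime_h=$1
    lead_h=$2
    # The notice cannot be scheduled before the allocation starts.
    if [ "$lead_h" -gt "$walltime_h" ]; then
        echo "notice lead exceeds walltime" >&2
        return 1
    fi
    echo "$((walltime_h - lead_h)) $lead_h"
}

notice_schedule 48 3   # prints "45 3", matching "sleep 45h" and "sleep 3h"
```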
Transferring datasets
Implementation: Route request to transfer nodes
Striped GridFTP data nodes
Co-located as RAM-based disk space provider
Other administrative components:
GridFTP control channel server
Key-authenticated SSH∗
X.509-authenticated GRAM5∗
∗Remote trigger mechanism
Example application: AMWG diagnostics
Compares CESM simulation data, observational data, and reanalysis data
Parallel implementation in Swift∗
Parameters:
dataset name
number of time segments (years)
Dataset volume: 2.8 GB per year (1° data)
∗Parallel scripting engine http://www.ci.uchicago.edu/swift
Data movement benchmarks∗
File system    IOR-8 write†    GridFTP‡ to Polynya
                               from Frost    from Kraken
/dev/null      3,190           139           28
Lustre disk    111             113           35
tmpfs RAM      2,983           117           34
XFS RAM        2,296           125           35
Lustre RAM     2,881           134           36
GridFTP from Kraken to Frost: 216 MB/s
∗units in MB/s
†from D. Duplyakin’s experiments
‡32 MB TCP buffer, 16 MB block size, 4 or 16 streams (the slide overlays differ)
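The GridFTP tuning in the benchmark footnote (TCP buffer, block size, number of streams) corresponds to the `-tcp-bs`, `-bs`, and `-p` options of `globus-url-copy`. The sketch below only prints such a command and does not run a transfer. The function name and both URLs are placeholders, and reading "MB" as 2^20 bytes is an assumption.

```shell
# Sketch only: print (do not run) a globus-url-copy command with the tuning
# from the benchmark footnote. The URLs passed in are placeholders.
# Usage: gridftp_cmd STREAMS SOURCE_URL DEST_URL
gridftp_cmd() {
    streams=$1
    src=$2
    dst=$3
    tcp_buffer=$((32 * 1024 * 1024))   # 32 MB TCP buffer in bytes (assumes MB = 2^20 bytes)
    block_size=$((16 * 1024 * 1024))   # 16 MB block size in bytes (same assumption)
    echo "globus-url-copy -vb -p $streams -tcp-bs $tcp_buffer -bs $block_size $src $dst"
}

gridftp_cmd 16 gsiftp://source.example.org/data/run.nc gsiftp://dest.example.org/ramdisk/run.nc
```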
Application performance
Ran on 64-CPU node, 2-year time segment (8.2 GB total)
File system    Runtime (s)
Lustre disk    213
tmpfs RAM      29
XFS RAM        29
Lustre RAM     70
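To put the runtimes above in relative terms, the following sketch divides the Lustre disk runtime by each RAM-based runtime. The helper name `speedup` is illustrative, and the numbers are the ones reported on this slide.

```shell
# Ratio of two runtimes, rounded to one decimal place.
# Usage: speedup BASELINE_SECONDS CANDIDATE_SECONDS
speedup() {
    awk -v base="$1" -v cand="$2" 'BEGIN { printf "%.1f\n", base / cand }'
}

# Runtimes from the slide, relative to Lustre on spinning disk (213 s).
echo "tmpfs RAM:  $(speedup 213 29)x"
echo "XFS RAM:    $(speedup 213 29)x"
echo "Lustre RAM: $(speedup 213 70)x"
```

With these inputs the node-local RAM file systems come out at roughly 7.3 times faster than Lustre on disk, and Lustre on RAM at roughly 3.0 times faster.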
Application performance
[Bar chart, from Frost: time in seconds (axis 0 to 250) showing data transfer and AMWG analysis for Lustre RAM, XFS RAM, tmpfs RAM, and Lustre disk]
End-to-end workflow
[Timeline chart: time (s) on the horizontal axis, with rows for Request space, Transfer, Archive, Analysis 1, Analysis 2, ..., Analysis n, and Cleanup]
Other use case: Interactive jobs
Automated workflow split component-wise
Each step is run by the user manually
Steps:
1 Request space
2 Transfer data to allocated space (globus-url-copy or Globus Online)
3 Run analysis on allocated space
4 Email notice before expiration
5 Clean up by deleting the request job
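The manual steps above map onto a handful of commands a user would type. The following dry-run sketch prints them in order without executing anything. The job script name, URLs, analysis command, and the `JOB_ID` placeholder are all hypothetical. Step 4 (the email notice) has no command here because the request job itself sends it.

```shell
# Dry run: print the commands for the manual steps, one per line.
# All arguments are placeholders chosen by the user.
# Usage: interactive_steps JOB_SCRIPT SOURCE_URL DEST_URL ANALYSIS_COMMAND
interactive_steps() {
    job_script=$1
    src_url=$2
    dst_url=$3
    analysis_cmd=$4
    echo "qsub $job_script"                    # step 1: request space (qsub prints a job ID)
    echo "globus-url-copy $src_url $dst_url"   # step 2: transfer data to the allocated space
    echo "$analysis_cmd"                       # step 3: run analysis on the allocated space
    echo "qdel JOB_ID"                         # step 5: clean up by deleting the request job
}

interactive_steps request.pbs gsiftp://source.example.org/data/ file:///ramdisk/data/ ./analysis.sh
```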
Conclusions
End-to-end analysis platform without touching spinning disk
Access through the familiar PBS interface
Workflow automation to drive analysis
Network bandwidth critical to performance
Future work:
Tune network for high-performance data movement
Application-perspective file system scalability
Explore framework on other resources: disk, bandwidth, etc.
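A back-of-the-envelope estimate illustrates why network bandwidth is critical. The sketch assumes the full 8.2 GB analysis dataset is moved at the GridFTP rates reported for Lustre RAM in the benchmarks (36 MB/s from Kraken, 134 MB/s from Frost) and that 1 GB = 1000 MB. The helper name is illustrative and the results are estimates, not measurements.

```shell
# Estimated seconds to move SIZE_GB at RATE_MB_PER_S, assuming 1 GB = 1000 MB.
# Usage: transfer_seconds SIZE_GB RATE_MB_PER_S
transfer_seconds() {
    awk -v gb="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", gb * 1000 / rate }'
}

echo "from Kraken at 36 MB/s: $(transfer_seconds 8.2 36) s"
echo "from Frost at 134 MB/s: $(transfer_seconds 8.2 134) s"
```

Under these assumptions the transfer from Kraken takes about 228 s and the transfer from Frost about 61 s, compared with the 70 s reported for the analysis itself on Lustre RAM.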
Questions?