A RAM-disk provisioning service for high-performance data analysis

Allan Espinosa† (aespinosa@cs.uchicago.edu)

Mentors: M. Woitaszek‡ and J. Dennis‡

†University of Chicago, ‡National Center for Atmospheric Research

July 29, 2011

Outline

1 Motivation: data analysis

2 Approach and challenges

3 Implementation

4 Target applications

5 Conclusions

Motivation: data-intensive post-processing

[Figure: data-flow diagram with the labels: simulation results, computing center, transfer nodes, spinning-disk-based parallel file system, tape archive, and an analysis cluster running Analysis 1, Analysis 2, ..., Analysis n.]

Multiple trips to disk are slow

Approach: Run analysis on RAM

Fast I/O access

tmpfs or formatted /dev/ram (RAM-based disk inside one analysis node). Problem: Restricted parallelism

NFS-exported RAM (CPUs on other nodes share one node's RAM-based disk). Problem: Restricted data size

Split data over multiple nodes (each node has its own RAM-based disk). Problem: Requires thorough I/O management

Lustre parallel RAM file system (one file system shared by the CPUs of all nodes)
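The two single-node options (tmpfs and a formatted /dev/ram device) can be sketched as the commands below. This is a minimal sketch under stated assumptions, not the service's own provisioning script: the mount point /mnt/ramdisk, the 25g size, and the device /dev/ram0 are illustrative, and the functions only print the commands, since running them needs root.

```shell
#!/bin/sh
# Print (not run) the commands that create a RAM-backed file system.
# Mount point, size, and device are illustrative assumptions.

# Option 1: tmpfs, a file system kept in virtual memory.
tmpfs_commands() {
    mount_point=$1
    size=$2
    echo "mkdir -p $mount_point"
    echo "mount -t tmpfs -o size=$size tmpfs $mount_point"
}

# Option 2: format a RAM block device (here with XFS) and mount it.
ramdev_commands() {
    device=$1
    mount_point=$2
    echo "mkfs.xfs -f $device"
    echo "mkdir -p $mount_point"
    echo "mount $device $mount_point"
}

tmpfs_commands /mnt/ramdisk 25g
ramdev_commands /dev/ram0 /mnt/ramdisk
```

Both variants live in the memory of a single node, which is why the slide lists restricted parallelism as their problem.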

Solution: Automatically provisioned parallel file system

[Figure: architecture diagram. Polynya analysis cluster: scheduler, control node, transfer node, analysis nodes, archive node, parallel RAM file system, tape archive. User client: submits jobs. Kraken, reached over the WAN: transfer node, file system.]

Remotely triggering the workflow

Kraken: the simulation finishes, then triggers the workflow on Polynya

Workflow on Polynya:

Request space

Transfer datasets

Archive datasets

Run analysis

Trigger cleanup
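A later slide names key-authenticated SSH as one remote trigger mechanism. The sketch below shows what such a trigger could look like as the last step of the simulation job. The login user@polynya.example.org, the key path, and the script name workflow.pbs are hypothetical, and the function prints the command instead of running it.

```shell
#!/bin/sh
# Build the command a simulation job could run as its last step to
# start the analysis workflow on the remote cluster.
# Login, key path, and job script name are hypothetical placeholders.
trigger_command() {
    key=$1
    login=$2
    job_script=$3
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # which suits an unattended batch job.
    echo "ssh -i $key -o BatchMode=yes $login qsub $job_script"
}

trigger_command '~/.ssh/trigger_key' user@polynya.example.org workflow.pbs
```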

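Requesting RAM-based disk space

Implementation: PBS Torque + Maui scheduler generic resource

Parameters:

amount of space

duration of allocation

Steps performed by the request job:

1 Route to control node

2 Prepare space

3 Sleep until allocation expiration

4 Email notice before expiration

5 Clean up space

Request job script:

    #!/bin/sh
    # Amount of space: 25 units of the "ramdisk" generic resource
    #PBS -W x="GRES:ramdisk@25"
    # Duration of allocation
    #PBS -l walltime="48:00:00"
    # Route to the control node
    #PBS -q ramdisk_service
    # Prepare space at job start, clean up space at job end
    #PBS -l prologue=allocate.sh
    #PBS -l epilogue=cleanup.sh

    sleep 45h               # hold the allocation
    mail user@cluster ...   # email notice 3 hours before expiration
    sleep 3h                # remaining time until the walltime expires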
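The 45-hour and 3-hour sleeps add up to the 48-hour walltime. As a minimal sketch, the split could be derived from the walltime and a notice lead time as follows; the function names are illustrative and not part of the service.

```shell
#!/bin/sh
# Convert an HH:MM:SS walltime into seconds.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print "<seconds before the notice> <seconds after the notice>"
# for a walltime (HH:MM:SS) and a notice lead time in hours.
split_allocation() {
    total=$(walltime_to_seconds "$1")
    notice=$(( $2 * 3600 ))
    echo "$(( total - notice )) $notice"
}

split_allocation 48:00:00 3
```

For the slide's values the two numbers correspond to 45 hours and 3 hours.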

Transferring datasets

Implementation: Route request to transfer nodes

Striped GridFTP data nodes

Co-located on the nodes that provide the RAM-based disk space

Other administrative components:

GridFTP control channel server

Key-authenticated SSH∗

X.509-authenticated GRAM5∗

∗Remote trigger mechanism
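A striped transfer with explicit tuning could be issued with globus-url-copy roughly as sketched below. The values 32 MB TCP buffer, 16 MB block size, and 16 streams are the ones quoted on the benchmark slide; the endpoint URLs are hypothetical, 1 MB is assumed to mean 1024 * 1024 bytes, and the function prints the command rather than running it.

```shell
#!/bin/sh
# Build a striped GridFTP transfer command with explicit tuning.
# Endpoint URLs are hypothetical; sizes are given in MB and converted
# to bytes (assuming 1 MB = 1024 * 1024 bytes).
gridftp_command() {
    src=$1
    dst=$2
    tcp_buffer_mb=$3
    block_mb=$4
    streams=$5
    tcp_bytes=$(( tcp_buffer_mb * 1024 * 1024 ))
    block_bytes=$(( block_mb * 1024 * 1024 ))
    # -p: parallel streams, -tcp-bs: TCP buffer size, -bs: block size
    echo "globus-url-copy -vb -stripe -p $streams -tcp-bs $tcp_bytes -bs $block_bytes $src $dst"
}

gridftp_command gsiftp://source.example.org/data/run1.nc \
    gsiftp://polynya.example.org/ramdisk/run1.nc 32 16 16
```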

Example application: AMWG diagnostics

Compares CESM simulation data, observational data, reanalysis data

Parallel implementation in Swift∗

Parameters:

dataset name

number of time segments (years)

Dataset volume: 2.8 GB per year (1° data)

∗Parallel scripting engine, http://www.ci.uchicago.edu/swift

Data movement benchmarks∗

File system    IOR-8 write†    GridFTP‡ to Polynya
                               from Frost    from Kraken
/dev/null      3,190           139           28
Lustre disk    111             113           35
tmpfs RAM      2,983           117           34
XFS RAM        2,296           125           35
Lustre RAM     2,881           134           36

GridFTP from Kraken to Frost: 216 MB/s

∗Units in MB/s

†From D. Duplyakin's experiments

‡32 MB TCP buffer, 16 MB block size, 16 streams
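To put the GridFTP rates in perspective, the sketch below estimates how long one year of 1° data (2.8 GB, from the AMWG slide) would take to reach the Lustre RAM file system at the measured rates. It assumes 1 GB = 1000 MB and a sustained rate with no startup overhead, so the results are rough estimates rather than measurements.

```shell
#!/bin/sh
# Estimated seconds to move a dataset at a sustained rate.
# Arguments: dataset size in GB, rate in MB/s.
# Assumes 1 GB = 1000 MB and ignores startup overhead.
transfer_seconds() {
    awk -v gb="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", gb * 1000 / rate }'
}

transfer_seconds 2.8 134   # Lustre RAM, from Frost
transfer_seconds 2.8 36    # Lustre RAM, from Kraken
```

The gap between the two estimates reflects the table's gap between the Frost and Kraken columns.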

Application performance

Ran on 64-CPU node, 2-year time segment (8.2 GB total)

File system    Runtime (s)
Lustre disk    213
tmpfs RAM      29
XFS RAM        29
Lustre RAM     70
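The runtimes can be read as speedups over the Lustre disk baseline. The helper below is an illustrative calculation, not part of the original talk.

```shell
#!/bin/sh
# Speedup of a run relative to a baseline runtime.
# Arguments: baseline runtime in seconds, compared runtime in seconds.
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.1f\n", base / t }'
}

speedup 213 29   # tmpfs RAM and XFS RAM versus Lustre disk
speedup 213 70   # Lustre RAM versus Lustre disk
```

That works out to roughly a sevenfold speedup for tmpfs and XFS on RAM and roughly threefold for Lustre on RAM.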

[Figure: Application performance with data from Frost. Bars for Lustre RAM, XFS RAM, tmpfs RAM, and Lustre disk, split into data transfer and AMWG analysis; x-axis: Time (s), 0 to 250.]

End-to-end workflow

[Figure: timeline of the workflow stages against Time (s): Request space, Transfer, Archive, Analysis 1, Analysis 2, ..., Analysis n, Cleanup.]

Other use case: Interactive jobs

Automated workflow split component-wise

Each step is run manually by the user

Steps:

1 Request space

2 Transfer data to allocated space (globus-url-copy or Globus Online)

3 Run analysis on allocated space

4 Email notice before expiration

5 Clean up by deleting request job
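The manual steps above map onto a short command sequence. The sketch below prints that sequence for one session. The script names, URLs, and job ID are hypothetical placeholders, and step 4 has no user command because the notice is sent by the request job itself.

```shell
#!/bin/sh
# Print the commands for one manual session, in order.
# Script names, URLs, and the job ID are hypothetical placeholders.
interactive_session() {
    request_script=$1
    source_url=$2
    ramdisk_url=$3
    analysis=$4
    job_id=$5
    echo "qsub $request_script"                      # step 1: request space
    echo "globus-url-copy $source_url $ramdisk_url"  # step 2: transfer data
    echo "$analysis"                                 # step 3: run analysis
    # step 4 (email notice) comes from the request job, not the user
    echo "qdel $job_id"                              # step 5: delete request job
}

interactive_session request.pbs \
    gsiftp://source.example.org/data/run1.nc \
    gsiftp://polynya.example.org/ramdisk/run1.nc \
    ./run_analysis.sh 12345
```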

Conclusions

End-to-end analysis platform without touching spinning disk

Access through the familiar PBS interface

Workflow automation to drive analysis

Network bandwidth critical to performance

Future work:

Tune network for high-performance data movement

Application-perspective file system scalability

Explore framework on other resources: disk, bandwidth, etc.

Questions?
