
Parallel I/O
International HPC Summer School

July 11, 2018
Elsa Gonsiorowski, HPC I/O Specialist, LLNL

LLNL-PRES-751922. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.

Outline

- Motivation
- I/O in Parallel
  - Step 1: Recognize a need
  - Step 2: Existing I/O Libraries and Tools
  - Step 3: I/O Patterns
  - Step 4: Understand the File System
  - Step 6: Profit
- Technical Details: MPI I/O
- Pro-Tips!

Motivation


Types of I/O

- Input
  - Launching an executable & its linked libraries
  - Reading configuration files
  - Loading data files
- Output
  - Checkpoints
  - Results
- Science
  - Moving files from one machine to another
  - Cleaning up after experiments

Everyone interacts with a file system, therefore everyone does I/O!

Why should I care?

Data movement is expensive and must be optimized.

Total execution time = Computation time + Communication time + I/O time

HPC Storage Stack

- GPU Memory (HBM2): 900 GB/s per GPU
- CPU Memory (DDR4): 120 GB/s per socket
- Node-local storage or /tmp (SSD): 1.1 GB/s per node
- PFS (HDD + SSD + Magic): 40 GB/s, shared by a system
  - burst buffer, "project" storage, "campaign store"
- HPSS (Tape + Robots): 0.2 GB/s, shared by a center

File Systems

- Laptop: 1 user, 1.1 GB/s
- Network File System (NFS): m servers, n clients; home directories; 2 GB/s throughput; 280K IOPS
- Parallel File System (PFS): used by HPC jobs; system specific; scratch or project storage; 40 GB/s throughput; millions of IOPS

Parallel File System

[diagram]

I/O in Parallel


Steps for Dealing with I/O

1. Recognize the need
   - Get some data out of the application
   - Get some data out of the application faster
   - Deal with files efficiently
2. Investigate I/O libraries and tools; one may be common in your field.
3. Implement an I/O pattern
4. Understand the file system you are working on
5. ???
6. Profit!

Step 1: Recognize a need


Profiling

- Darshan
- Tau

Attend tomorrow's performance analysis session!

Step 2: Existing Libraries + Tools


Parallel I/O Libraries and Tools

- Reading & Writing Files:
  - HDF5
  - PnetCDF
  - Others: ADIOS, TyphonIO, SILO
  - MPI-IO
- Managing Files:
  - Spindle
  - mpiFileUtils
  - SCR

Library: HDF5

- Hierarchical Data Format
- File-system in a file
  - Datasets: multidimensional arrays of a homogeneous type
  - Groups: container structures which can hold datasets and other groups
- Official support for C, C++, Fortran 77, Fortran 90, Java
- Implementations in R, Perl, Python, Ruby, Haskell, Mathematica, MATLAB, etc.
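To make the model concrete, here is a minimal serial HDF5 sketch; the file name, dataset name, and data are made up for illustration, and HDF5's parallel mode (built on MPI-IO) is not shown:

    #include "hdf5.h"

    /* Create a file, write one 2-D integer dataset into it, clean up. */
    int main(void) {
        int data[4][6];
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 6; j++)
                data[i][j] = i * 6 + j;

        hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC,
                               H5P_DEFAULT, H5P_DEFAULT);
        hsize_t dims[2] = {4, 6};
        hid_t space = H5Screate_simple(2, dims, NULL);      /* dataspace */
        hid_t dset  = H5Dcreate(file, "/dset", H5T_NATIVE_INT, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                 H5P_DEFAULT, data);                        /* whole array */
        H5Dclose(dset);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }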

Library: PnetCDF

- Built on netCDF and MPI-IO
- netCDF:
  - self-describing, machine-independent format
  - designed for arrays of scientific data
- netCDF is implemented in C, C++, Fortran 77, Fortran 90, Java, R, Perl, Python, Ruby, Haskell, Mathematica, MATLAB, etc.

Library: MPI-IO

- API for interacting with files using MPI concepts
  - blocking vs. non-blocking
  - collective vs. non-collective
- Lower level than other libraries
- Fine-grain control of files and offsets
- C and Fortran interfaces
- Separate effort from regular MPI

Tool: Spindle

- Scalable dynamic library and Python loading
- Caches linked libraries
- Life saver for NFS issues

https://github.com/hpc/spindle

Tool: mpiFileUtils

- Uses parallel processes to perform file operations
- Executed within a job allocation
  - dbcast: broadcast a file from PFS to node-local storage
  - dcp: copy multiple files in parallel
  - drm: delete files in parallel
  - many more

https://github.com/hpc/mpifileutils

Library: SCR

- Scalable Checkpoint / Restart
- Enables checkpointing applications to take advantage of system storage hierarchies
- Efficient file movement between storage layers
- Data redundancy operations
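A rough sketch of the checkpoint flow, assuming the SCR v1-style API (SCR_Start_checkpoint / SCR_Route_file); treat the call names as illustrative and check them against your installed SCR version:

    /* Each rank asks SCR where to put its checkpoint file; SCR routes the
       write to a fast storage tier and drains it to the PFS later. */
    SCR_Init();                              /* after MPI_Init */
    SCR_Start_checkpoint();
    char path[SCR_MAX_FILENAME];
    SCR_Route_file("rank_0.ckpt", path);     /* file name is a placeholder */
    FILE *f = fopen(path, "w");
    /* ... write this rank's state ... */
    fclose(f);
    SCR_Complete_checkpoint(1);              /* 1 = this rank's file is valid */
    SCR_Finalize();                          /* before MPI_Finalize */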

Step 3: I/O Patterns


Parallel I/O Patterns

- Single file, accessed by 1 task
- Single shared file, accessed by all tasks
- Many shared files, accessed by groups of tasks
  - Baton-passing (see the sketch after this list)
  - Coordinated "View"
- Many independent files, accessed by a subset of tasks
  - One file per process
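The slide names baton-passing without showing it; below is a minimal sketch of the idea, with a made-up file name. Each rank appends in rank order after receiving a token, which avoids overlapping writes at the cost of fully serializing the I/O:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs, token = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank > 0)   /* wait for the baton from the previous rank */
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        FILE *f = fopen("shared.out", "a");
        fprintf(f, "rank %d reporting\n", rank);
        fclose(f);

        if (rank < nprocs - 1)   /* pass the baton onward */
            MPI_Send(&token, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }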

Step 4: Understand the PFS


Parallel File System Policies

- Allocation: how much space you have
- Backups: whether backups or snapshots are created
- Purges: when data is deleted
- Configuration: which I/O pattern the system is configured for

Parallel File Systems

- Black Magic: IBM's GPFS (General Parallel File System)
  - Closed source
  - aka Elastic Storage™ or Spectrum Scale™
  - HPC users do not have knobs to tune
- White Magic: Lustre
  - Open source
  - Users can deviate from default behavior

Lustre Striping

- HDDs are logically grouped into OSTs (Object Storage Targets)
- Users can stripe a file across multiple OSTs
  - Explicitly take advantage of multiple OSTs
  - Depends on the total amount of I/O you are doing
  - There is a system default
- Use the correct striping for your use case

Lustre Striping Commands

    $ lfs setstripe -c 4 -s 4M testfile2
    $ lfs getstripe ./testfile2
    ./testfile2
    lmm_stripe_count:   4
    lmm_stripe_size:    4194304
    lmm_stripe_offset:  21
        obdidx    objid      objid     group
            21    8891547    0x87ac9b      0
            13    8946053    0x888185      0
            57    8906813    0x87e83d      0
            44    8945736    0x888048      0

    $ lfs getstripe ./testfile
    ./testfile
    lmm_stripe_count:   2
    lmm_stripe_size:    1048576
    lmm_stripe_offset:  50
        obdidx    objid      objid     group
            50    8916056    0x880c58      0
            38    8952827    0x889bfb      0

Step 6: Profit


Steps for Dealing with I/O

1. Recognize the need
   - Get some data out of the application
   - Get some data out of the application faster
   - Deal with files efficiently
2. Investigate I/O libraries and tools; one may be common in your field.
3. Implement an I/O pattern
4. Understand the file system you are working on
5. ???
6. Profit!

Technical Details: MPI I/O


Locking and Atomicity

    $ export BGLOCKLESSMPIO_F_TYPE=1

    int MPI_File_set_atomicity(MPI_File mpi_fh, int flag);
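For context, a hedged fragment: atomic mode is off by default, and this call toggles it on an already-open file handle (fh is assumed to come from a prior MPI_File_open). The environment variable above is a ROMIO setting used on some systems, such as Blue Gene, to avoid file locking:

    int flag = 1;                       /* 1 = enable, 0 = disable */
    MPI_File_set_atomicity(fh, flag);   /* fh: an open MPI_File handle */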

Opening Files

    int MPI_File_open(MPI_Comm comm, const char *filename,
                      int amode, MPI_Info info, MPI_File *fh);

AMode                        Description
MPI_MODE_RDONLY              read only
MPI_MODE_RDWR                reading and writing
MPI_MODE_WRONLY              write only
MPI_MODE_CREATE              create the file
MPI_MODE_EXCL                error if file already exists
MPI_MODE_DELETE_ON_CLOSE     delete file on close
MPI_MODE_UNIQUE_OPEN         file will not be concurrently opened
MPI_MODE_SEQUENTIAL          file will only be accessed sequentially
MPI_MODE_APPEND              set position of all file pointers to end of file
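A hedged fragment combining common amode flags; the file name is a placeholder and error handling is only sketched:

    MPI_File fh;
    int rc = MPI_File_open(MPI_COMM_WORLD, "ckpt.dat",   /* collective call */
                           MPI_MODE_CREATE | MPI_MODE_WRONLY,
                           MPI_INFO_NULL, &fh);
    if (rc != MPI_SUCCESS) {
        /* translate rc with MPI_Error_string() and abort */
    }
    /* ... reads/writes ... */
    MPI_File_close(&fh);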

Organizing Data

- Use MPI_Datatype to define the structure of your data (see the sketch below)
  - Corresponds to a C struct
- Read and write instances of this data
- Use MPI_File_set_view for working with non-contiguous data in a shared file
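A sketch under an assumed struct layout: build an MPI datatype mirroring a C struct so whole records move in one call (needs <stddef.h> for offsetof):

    typedef struct { double x, y, z; int id; } Particle;   /* assumed layout */

    MPI_Datatype ptype, tmp;
    int          blocklens[2] = { 3, 1 };
    MPI_Aint     disps[2]     = { offsetof(Particle, x),
                                  offsetof(Particle, id) };
    MPI_Datatype types[2]     = { MPI_DOUBLE, MPI_INT };

    MPI_Type_create_struct(2, blocklens, disps, types, &tmp);
    /* Resize so consecutive records stride by sizeof(Particle),
       compiler padding included. */
    MPI_Type_create_resized(tmp, 0, sizeof(Particle), &ptype);
    MPI_Type_commit(&ptype);
    MPI_Type_free(&tmp);

    /* Ranks can now read/write 'count' records of ptype in one call, and
       MPI_File_set_view(fh, disp, ptype, filetype, "native", info)
       maps non-contiguous regions of a shared file. */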

Useful MPI Function

    offset = (long long) 0;
    MPI_Exscan(&contribute, &offset, 1, MPI_LONG_LONG,
               MPI_SUM, file_comm);

Rank         0   1   2   3    4
contribute   3   4   2   7    3
offset       0   3   7   9   16
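A minimal sketch combining MPI_Exscan with a collective write, so every rank lands at its own byte offset in one shared file; the file name and message are made up:

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char buf[64];
        snprintf(buf, sizeof buf, "rank %d\n", rank);
        long long contribute = (long long)strlen(buf);
        long long offset = 0;   /* rank 0 keeps the initial 0 */
        MPI_Exscan(&contribute, &offset, 1, MPI_LONG_LONG, MPI_SUM,
                   MPI_COMM_WORLD);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "out.txt",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_write_at_all(fh, (MPI_Offset)offset, buf,
                              (int)contribute, MPI_CHAR, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }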

Accessing Files with MPI

- Level 0: independent file ops, explicit offset, sequential data
- Level 1: collective file ops, explicit offset, sequential data
- Level 2: independent file ops, derived or non-contiguous data
- Level 3: collective file ops, derived or non-contiguous data
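A hedged fragment contrasting the two extremes; fh, buf, n, off, and filetype are assumed to be set up already:

    /* Level 0: independent op at an explicit offset */
    MPI_File_write_at(fh, off, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    /* Level 3: collective op against a file view that describes
       non-contiguous data */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

Collective calls (levels 1 and 3) give the MPI library room to aggregate many small requests into large, well-aligned ones, which is usually the winning strategy on a PFS.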

MPI I/O & Lustre

MPI implementations can be built by HPC resource providers with Lustre integration, exposing striping hints:

    MPI_Info_set(myinfo, "striping_factor", stripe_count); /* values are strings, e.g. "8" */
    MPI_Info_set(myinfo, "striping_unit",   stripe_size);
    MPI_Info_set(myinfo, "cb_nodes",        num_writers);
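A sketch of the full flow; the hint values are illustrative strings (MPI_Info values must be strings), and striping hints only take effect when the file is created:

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "8");       /* stripe count      */
    MPI_Info_set(info, "striping_unit",   "4194304"); /* stripe size, B    */
    MPI_Info_set(info, "cb_nodes",        "4");       /* collective writers */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    /* ... writes ... */
    MPI_File_close(&fh);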

Pro-Tips!


Pro-Tip!

Step One: Profile your code. Fix up the I/O until it doesn't suck.

Pro-Tip!

Be Smart: Don't re-invent I/O; use an existing library or tool.

Pro-Tip!

Working with File Systems: Use the PFS for parallel I/O; do NOT use NFS.

Pro-Tip!

I/O Pattern: Create 1 file per node, and make this a tunable parameter.

Pro-Tip!

Ask an Expert: Find the "I/O person" at your HPC center and ask for guidance.

This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.
