Scalable I/O - Blue Waters
I/O performance overview
• Main factors in performance
  • Know your I/O
  • Striping
  • Data layout
  • Collective I/O
I/O performance
• Length of each basic operation
  • High throughput, high latency
  • Avoid using small reads / writes (see the sketch after this list)
• Locality matters
  • Avoid jumping around in the file
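To make the first point concrete, here is a small POSIX sketch that replaces one tiny write per record with a handful of large writes; the file name, record size, and 4 MiB buffer are arbitrary choices, not values from the talk.

```c
/* Sketch: aggregate many small writes into a few large ones (plain POSIX I/O).
 * The output file name, record size, and buffer size are arbitrary. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define N_RECORDS (1 << 20)   /* 1M records of 8 bytes each */
#define BUF_SIZE  (4 << 20)   /* aggregate into 4 MiB writes */

int main(void) {
    int fd = open("out.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    static char buf[BUF_SIZE];
    size_t used = 0;

    for (uint64_t i = 0; i < N_RECORDS; i++) {
        /* The slow version is one syscall and one tiny I/O request per record:
         *     write(fd, &i, sizeof i);
         * Instead, buffer the record and flush in large, contiguous chunks. */
        memcpy(buf + used, &i, sizeof i);
        used += sizeof i;
        if (used == BUF_SIZE) {
            write(fd, buf, used);   /* one large write instead of ~500K small ones */
            used = 0;
        }
    }
    if (used > 0)
        write(fd, buf, used);       /* flush the tail */
    close(fd);
    return 0;
}
```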
Single process I/O size
[Figure: contiguous I/O operations; read and write throughput (MiB/s) vs. I/O operation size, 32 KiB to 16 MiB]
Single process I/O size
[Figure: contiguous I/O operations; read and write throughput (MiB/s) vs. I/O operation size, 4 bytes to 16 MiB, levelling off around 500 MiB/s]
Single process random access
[Figure: I/O operations at random offsets; read and write throughput (MiB/s) vs. I/O operation size, 4 bytes to 16 MiB]
Know your I/O
• How fast (or slow) is it?
• How is the data organized on disk?
  • N-dimensional array?
  • Text or binary?
• Which subset of the data for each process?
Darshan I/O analysis
• Darshan collects logs of all I/O
• On by default
  • Disable with "module unload darshan"
• Logs are in /projects/monitoring_data/darshan/YYYY-MM/
• Only writes logs on normal exit (needs a call to MPI_Finalize; see the sketch below)
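Because the log is only flushed from MPI_Finalize, a run that aborts or exits early leaves nothing behind. A minimal program shape that guarantees the log gets written might look like this (the file name and the I/O in the middle are placeholders):

```c
/* Sketch: Darshan writes its log during MPI_Finalize, so make sure every
 * rank reaches it on a normal exit. The I/O here is just a placeholder. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... application I/O happens here and is recorded by Darshan ... */
    if (rank == 0) {
        FILE *f = fopen("output.txt", "w");   /* placeholder file name */
        if (f) { fprintf(f, "hello\n"); fclose(f); }
    }

    MPI_Finalize();   /* normal exit: the Darshan log is written here */
    return 0;
}
```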
Darshan I/O analysis
Darshan commands
• darshan-job-summary.pl
  • Generate a PDF report of the whole job
• darshan-summary-per-file.sh
  • Generate per-file reports
• darshan-parser
  • Extract lots of raw data
Darshan 3 not backward-compatible with 2
• Darshan 2: *.darshan.gz
• Darshan 3: *.darshan

module swap darshan/3.1.3 darshan/2.3.0.1
module unload gnuplot/5.0.5
Parallel I/O striping
• Blue Waters uses the Lustre file system
  • 360 OSTs
  • Essentially 360 independent file servers
• Striping parameters for each file
  • Size: length of each stripe
  • Count: number of file servers to use
[Figure: a striped file laid out round-robin across OST0 through OST3]
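To make the two parameters concrete: a file striped with size S over C servers places byte offset x on stripe (x / S) mod C. The tiny helper below illustrates that round-robin mapping; the function name and example numbers are mine, not a Lustre API.

```c
/* Sketch: which stripe (and hence which OST slot) a byte offset lands on
 * in a round-robin striped file. stripe_size and stripe_count mirror the
 * lfs setstripe parameters; this is an illustration, not a Lustre API. */
#include <stdio.h>
#include <stdint.h>

static int stripe_index(uint64_t offset, uint64_t stripe_size, int stripe_count) {
    return (int)((offset / stripe_size) % (uint64_t)stripe_count);
}

int main(void) {
    uint64_t stripe_size = 1 << 20;   /* 1 MiB stripes */
    int stripe_count = 4;             /* spread over 4 OSTs */

    /* Offsets 0..1M-1 go to stripe 0, 1M..2M-1 to stripe 1, and so on;
     * offset 4 MiB wraps back around to stripe 0. */
    printf("offset 0       -> stripe %d\n", stripe_index(0, stripe_size, stripe_count));
    printf("offset 3.5 MiB -> stripe %d\n",
           stripe_index(3 * (1 << 20) + (1 << 19), stripe_size, stripe_count));
    printf("offset 4 MiB   -> stripe %d\n",
           stripe_index(4ULL << 20, stripe_size, stripe_count));
    return 0;
}
```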
Striping with Lustre
• lfs getstripe <file or dir>
  • Print striping parameters for a file

$ lfs getstripe foo4
foo4
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  287
        obdidx      objid      objid  group
           287   59540013  0x38c822d      0
            11   65961401  0x3ee7db9      0
           151   65962343  0x3ee8167      0
            71   65963432  0x3ee85a8      0
Striping with Lustre
• lfs setstripe -s <size> -c <count> <file or dir>
• Directories
  • New files will inherit striping parameters
• Files
  • Set at creation time
  • lfs setstripe on a nonexistent file creates it

lfs setstripe -s 1M -c 16 foo16
cp foo1 foo16
Set striping in MPI IO
MPI_Info info;
MPI_Info_create(&info);                          /* create the info object first */
MPI_Info_set(info, "striping_factor", "64");     /* stripe count */
MPI_Info_set(info, "striping_unit", "1048576");  /* stripe size in bytes */
MPI_File_open(..., info, ...);
Striping on Blue Waters
• Maximum stripe count
  • home / projects: 36
  • scratch: 360
• Can’t set > 36 on scratch directly (bug?)
• Solution: set on directory, inherit

lfs setstripe -c 360 mydata/
touch mydata/foo
Striped file throughput
[Figure: single striped file; throughput (GiB/s) vs. stripe count, 1 to 128, reaching about 43 GiB/s. Note: old data]
Stripe length – bigger is better
[Figure: time in seconds to write MILC data vs. file size in GiB, using 4 MiB, 16 MiB, and 64 MiB stripes; up to a factor of 7.4x difference]
Stripe length – doesn’t matter?
[Figure: time in seconds / throughput in GiB/s vs. stripe length in MiB, 1 to 32]
Striping with many files
• Leave stripe count = 1
• Full file on one file server
• Each one assigned to a random file server
One file per process
[Figure: file-per-process throughput (GiB/s) vs. process count, for 1 to 256 nodes]
One file per process
[Figure: file-per-process throughput (GiB/s) vs. process count on 128 and 256 nodes; peak 153 GiB/s]
Data layout
• Common pattern: N-dimensional array
Data layout
• Tile on each process
[Figure: 2-D array split into a 4x4 grid of tiles numbered 0 to 15, one per process]
I/O for a tile
• Many small accesses (eek!)
[Figure: the rows of each process's tile (P0, P1, P2) are interleaved in the file, so each process touches many small, scattered regions]
Tile solution: collective I/O
[Figure: disk access is one contiguous sweep over P0 P1 P2 P3, while the memory layout is a 2x2 grid of tiles (P0 P1 / P2 P3)]
• Small number of large I/O ops
• Redistribute data in memory
Collective I/O
• Built into MPI-IO (see the sketch below)
• Describe the shape of the data with MPI datatypes
• Use the Mesh IO library to make it easier
• Parallel HDF5
• Parallel NetCDF
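Here is a minimal collective-write sketch using plain MPI-IO, assuming one tile per process on a square process grid over a global 1024x1024 array of doubles; the array size, grid shape, and file name are made up for illustration. Each process describes its tile with a subarray datatype, sets the file view, and calls MPI_File_write_all, letting the MPI library merge everyone's tiles into a few large, contiguous disk operations.

```c
/* Sketch: collective write of one tile per process with MPI-IO.
 * Assumptions: square process grid, global 1024x1024 array of doubles,
 * file name "mesh.dat"; these values are illustrative only. */
#include <mpi.h>
#include <math.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int pgrid = (int)sqrt((double)nprocs);     /* assume nprocs is a perfect square */
    int gsizes[2] = {1024, 1024};              /* global array size */
    int lsizes[2] = {gsizes[0] / pgrid, gsizes[1] / pgrid};   /* tile size */
    int starts[2] = {(rank / pgrid) * lsizes[0],              /* tile origin */
                     (rank % pgrid) * lsizes[1]};

    /* Fill this process's tile. */
    double *tile = malloc((size_t)lsizes[0] * lsizes[1] * sizeof *tile);
    for (int i = 0; i < lsizes[0] * lsizes[1]; i++)
        tile[i] = rank;

    /* Describe where the tile lives inside the global array in the file. */
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "mesh.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* Collective: MPI can aggregate all tiles into a few large writes. */
    MPI_File_write_all(fh, tile, lsizes[0] * lsizes[1],
                       MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(tile);
    MPI_Finalize();
    return 0;
}
```

The striping hints from the earlier MPI_Info slide can be passed to MPI_File_open here in place of MPI_INFO_NULL.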
Mesh I/O
int Mesh_IO_read(MPI_File fh,
                 MPI_Offset offset,
                 MPI_Datatype etype,
                 int file_endian,
                 void *buf,
                 int ndims,
                 const int *mesh_sizes,
                 const int *file_mesh_sizes,
                 const int *file_mesh_starts,
                 int file_array_order,
                 const int *memory_mesh_sizes,
                 const int *memory_mesh_starts,
                 int memory_array_order);

• For N-dimensional meshes
• Describe the size of the full mesh and the submesh you want
• Collective operation
• Matching write function
• github.com/oshkosher/meshio
XPACC - tiles without collective I/O
• Constant data size (1 GB)
• Runtime grows with process count
• 4500 seconds on 4096 processes
XPACC - tiles with collective I/O
• Consistently under 1 second
Compression
• Highly compressible data: minimize I/O
  • Bioinformatics (AGTCTGTCTTGC…)
• Example file: 40 GiB
  • Compresses to 275 MiB (151x)
  • Serial scan in 9.6 s (4.2 GiB/s)
• Tools: github.com/oshkosher/bioio
  • zchunk: read in chunks, by offset + length (see the sketch below)
  • zlines: read lines of text, by line number
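The offset + length style of access works because the data is compressed in independent chunks, with an index recording where each chunk starts, so a read only decompresses the chunk it touches. The sketch below illustrates that idea with plain zlib; it is a generic illustration with an arbitrary chunk size and synthetic data, not the bioio implementation.

```c
/* Sketch: chunked compression with an index, so a read at an arbitrary
 * uncompressed offset decompresses only one chunk (the idea behind
 * zchunk-style access). Generic zlib code, not the bioio tools. */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define CHUNK (256 * 1024)           /* uncompressed bytes per chunk (arbitrary) */

typedef struct {
    unsigned char *data;             /* compressed bytes for this chunk */
    uLong clen;                      /* compressed length */
    uLong ulen;                      /* uncompressed length */
} Chunk;

int main(void) {
    /* Make some highly compressible "genomic" data. */
    size_t n = 1024 * 1024;
    unsigned char *input = malloc(n);
    for (size_t i = 0; i < n; i++)
        input[i] = "ACGT"[i % 4];

    /* Compress each chunk independently; the Chunk array is the index. */
    size_t nchunks = (n + CHUNK - 1) / CHUNK;
    Chunk *index = malloc(nchunks * sizeof *index);
    for (size_t c = 0; c < nchunks; c++) {
        uLong ulen = (c + 1) * CHUNK <= n ? CHUNK : n - c * CHUNK;
        uLong bound = compressBound(ulen);
        index[c].data = malloc(bound);
        index[c].clen = bound;
        index[c].ulen = ulen;
        compress(index[c].data, &index[c].clen, input + c * CHUNK, ulen);
    }

    /* Random access: fetch the byte at uncompressed offset 700000 by
     * decompressing only the chunk that contains it. */
    size_t offset = 700000;
    size_t c = offset / CHUNK;
    unsigned char *tmp = malloc(index[c].ulen);
    uLong ulen = index[c].ulen;
    uncompress(tmp, &ulen, index[c].data, index[c].clen);
    printf("byte at %zu: %c\n", offset, tmp[offset - c * CHUNK]);

    free(tmp);
    for (size_t i = 0; i < nchunks; i++) free(index[i].data);
    free(index);
    free(input);
    return 0;
}
```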
Summary
• All reads/writes at least 64k
• Best performance: many files
• With a single file, enable striping
• Use collective I/O to avoid small accesses
Questions / corrections / [email protected]