

I/O Bottleneck Investigation in Deep Learning Systems

Sarunya Pumma,(1,2) Min Si,(2) Wu-chun Feng,(1) and Pavan Balaji(2)
(1) Virginia Tech, (2) Argonne National Laboratory

[Chart: total execution time (s, log scale) vs. number of processes (1 to 9216) for LMDB and the LMDBIO variants: LMDBIO-LMM, LMDBIO-LMM-DIO, LMDBIO-LMM-DM, LMDBIO-LMM-DIO-PROV, LMDBIO-LMM-DIO-PROV-COAL, LMDBIO-LMM-DIO-PROV-COAL-STAG.]

Motivation

Deep Learning & Challenges

Robotics: Asimo (Honda)

Offline & Online Data Analytics: Real-Time News Feed (Facebook)

Facial Recognition: Deep Dense Face Detector (Yahoo Labs)

[Diagram: training regimes by network size (width and depth) vs. batch size (# samples): compute bound, communication bound, and I/O bound.]

• High-dimensional input data: image classification, e.g., the Data Science Bowl's tumor detection from CT scans

• Networks with a large number of parameters: unsupervised image feature extraction, e.g., LLNL's network with 15 billion parameters

• High-volume data: sentiment analysis (Twitter analysis, Yelp's review fraud detection) and image classification (ImageNet)

[Images: image feature extraction; tumor detection from CT scans]

In the past decade …

• 10-20x improvement in processor speed

• 10-20x improvement in network speed

• Only 1.5x improvement in I/O performance

I/O will eventually become a bottleneck for most computations.

Deep Learning Scaling

Overall Training Time (CIFAR10-Large-AlexNet, 512 iterations)

Training Time Breakdown (CIFAR10-Large-AlexNet, 512 iterations)

LMDB Inefficiencies (cont.)

Caffe’s I/O Subsystem: LMDB

Problem 1: Mmap’s Interprocess Contention

The underlying I/O in mmap relies on the CFS scheduler to wake up processes after I/O has completed:

• Processes are put to sleep while waiting for I/O to complete

• The I/O completion interrupt is a bottom-half interrupt

• The handler has no knowledge of which specific process triggered the I/O operation

• Every process that is waiting for I/O is marked as runnable

• Every reader is woken up each time an I/O interrupt comes in

• This causes a large number of unnecessary context switches

[Charts: context switches (millions) and read time breakdown (user time, kernel time, sleep time) vs. number of processes (1-512).]

Problem 2: Sequential Data Access Restriction

• LMDB data access is sequential in nature due to the B+-tree structure

• There is no way to randomly access a data record: all branch nodes associated with the previous records must be read before accessing a particular record (see the cursor-walk sketch below)

• When multiple processes read the data, they read extra data

• Different processes do different amounts of work, causing skew
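To make the access pattern concrete, here is a minimal sketch of an LMDB read loop (error handling omitted; the database path and flags are illustrative). Records are only reachable by walking a cursor forward, which is exactly why a reader assigned record N must traverse everything before it:

    #include <lmdb.h>

    /* Minimal sketch: LMDB exposes cursor-based traversal, so a reader
     * that wants record n must walk past records 0..n-1. */
    int read_first_n(const char *db_dir, int n) {
        MDB_env *env; MDB_txn *txn; MDB_dbi dbi; MDB_cursor *cur;
        MDB_val key, val;
        mdb_env_create(&env);
        mdb_env_open(env, db_dir, MDB_RDONLY, 0664);  /* db_dir: LMDB directory */
        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);
        mdb_cursor_open(txn, dbi, &cur);
        for (int i = 0; i < n &&
             mdb_cursor_get(cur, &key, &val, MDB_NEXT) == 0; i++) {
            /* val.mv_data points directly into the mmap'ed file */
        }
        mdb_cursor_close(cur);
        mdb_txn_abort(txn);
        mdb_env_close(env);
        return 0;
    }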

[Diagram: LMDB redundant data movement. With records D0-D3 assigned to P0-P3, P0 reads D0 directly, but P1 must first seek past D0, P2 past D0-D1, and P3 past D0-D2 before reading their own records.]

Our Solution: LMDBIO (cont.)

LMDBIO-LMM-DM (cont.)

[Diagram: Part II, parallel I/O and in-memory sequential seek. All processes read their speculatively fetched pages from the database concurrently; P0 then seeks in memory and sends its cursor to P1, P1 to P2, and P2 to P3, after which each process accesses its own records.]

Part II: Parallel I/O and in-memory sequential seek

Our Solution: LMDBIO

Optimization: take into account the data access pattern of deep learning and Linux's I/O scheduling to reduce mmap's contention

[Diagram: localized mmap. The database on the shared file system (shared between nodes) is read through the page cache and mapped into Process 0's mmap buffer, copied into shared memory, and accessed by Processes 0, 1, and 2 on the node.]

• Localized mmap: only one process performs mmap on each node

• MPI shared memory (MPI-3) is used to share the data (a minimal sketch follows)

• Even though LMDBIO incurs an extra copy (from the mmap buffer to shared memory), Caffe still gains benefit from LMDBIO
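A minimal sketch of the MPI-3 shared-memory setup, assuming one reader per node and an illustrative buffer size; function and variable names are ours, not LMDBIO's:

    #include <mpi.h>

    /* Sketch: rank 0 on each node exposes a shared window; the other
     * ranks on the node attach to it.  The window is left open for the
     * lifetime of training (MPI_Win_free omitted). */
    void *setup_node_shared_buffer(MPI_Comm comm, MPI_Aint bytes,
                                   MPI_Comm *node_comm_out) {
        MPI_Comm node_comm;
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        int node_rank;
        MPI_Comm_rank(node_comm, &node_rank);

        MPI_Win win;
        void *base;
        /* Rank 0 on the node allocates the full buffer; others allocate 0
         * bytes and query rank 0's base address. */
        MPI_Win_allocate_shared(node_rank == 0 ? bytes : 0, 1,
                                MPI_INFO_NULL, node_comm, &base, &win);
        if (node_rank != 0) {
            MPI_Aint sz; int disp;
            MPI_Win_shared_query(win, 0, &sz, &disp, &base);
        }
        *node_comm_out = node_comm;
        return base;  /* rank 0 memcpy's from its mmap buffer into here */
    }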

LMDBIO-LMM

LMDB Inefficiencies

• Caffe uses the Lightning Memory-Mapped Database (LMDB) for accessing the dataset

• B+-tree representation of the data

• The database is mapped to memory using mmap and accessed through direct buffer arithmetic

• Virtual memory is allocated for the size of the full file; specific physical pages are dynamically loaded by the OS on demand

Pros: makes it easy to manipulate complex data structures (e.g., B+ trees), since LMDB can treat them as fully in-memory. Cons: the OS has very little knowledge of the access model and parallelism, making it hard to optimize.

Part II: Speculative Parallel I/O

• Each process estimates the pages that it will need and speculatively fetches them into memory in parallel

• Each process then sequentially seeks the location for the next process and sends the cursor to the next higher-rank process

• The expectation is that the seek can be done entirely in memory

• Once the sequential seek is done, each reader can perform the actual data access

• This adds a small amount of extra data reading, but allows parallel I/O

• We use a history-based approach for our estimation: we correct our estimate in each iteration depending on the actual data read in all of the previous iterations

• The general idea of our correction is that we attempt to expand the speculative boundaries to reduce the number of missed pages

• Initial iterations might be slightly inaccurate, but we converge fairly quickly (1-2 iterations)

• The estimation of the number of pages to fetch is based on the first record's data size; e.g., a CIFAR10-Large record is 3 KB, which is ~1 page, so reading n records requires fetching n pages

• The estimation of the read offset is performed in the same fashion

• Estimating the "approximate" start and end location for each process is important: if the estimate is completely wrong, we end up reading up to 2x the dataset size (still better than LMDB); a sketch of the arithmetic follows
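A sketch of the estimation arithmetic under the assumptions above (field and function names are illustrative, not LMDBIO's actual code):

    #include <stddef.h>

    /* Sketch of the page estimation.  Each process guesses its byte range
     * from the first record's size and widens the guess based on pages
     * that previous iterations missed. */
    typedef struct {
        size_t record_size;  /* size of the first record, e.g. ~3 KB */
        size_t page_size;    /* OS page size, typically 4096          */
        size_t slack_pages;  /* grown when earlier estimates missed   */
    } estimate_t;

    /* Estimated page range covering records [first, first + count). */
    void estimate_pages(const estimate_t *e, size_t first, size_t count,
                        size_t *start_page, size_t *num_pages) {
        size_t start_byte = first * e->record_size;
        size_t end_byte   = (first + count) * e->record_size;
        size_t lo = start_byte / e->page_size;
        size_t hi = (end_byte + e->page_size - 1) / e->page_size; /* round up */
        /* widen the speculative boundaries on both ends */
        lo = (lo > e->slack_pages) ? lo - e->slack_pages : 0;
        *start_page = lo;
        *num_pages  = (hi - lo) + e->slack_pages;
    }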

Estimation of Speculative I/O

Estimation of Speculative I/O (cont.)

Results

Sarunya Pumma, Min Si, Wu-chun Feng, and Pavan Balaji. Towards Scalable Deep Learning via I/O Analysis and Optimization. IEEE International Conference on High Performance Computing and Communications (HPCC), Dec. 18-20, 2017, Bangkok, Thailand.

[Charts: overall training time (s, log scale) for Caffe/LMDB vs. ideal scaling over 1-9216 processes, with Caffe/LMDB 660x worse than ideal at the largest scale; and execution time breakdown (read time, transform time, total forward time, total backward time, wait time before param sync, param sync time, param calculation time, param update time) over the same process counts.]

Platform: Argonne's Blues
• InfiniBand QLogic QDR
• 110 TB GPFS storage
• Each node: 2x 2.6 GHz Intel Xeon (Sandy Bridge, 16 cores, hyperthreading disabled), 64 GB memory, 15 GB RAM disk
Dataset: CIFAR10-Large
Network: AlexNet
MPI: MVAPICH-2.2

Problem 3: Mmap’s Workflow Overheads

• Since mmap performs implicit I/O, the user has no control over when an I/O operation is issued

• To showcase this overhead, we developed a microbenchmark that reads a 256 GB file using a single reader on a single machine (the two read paths are sketched below)

• The mmap benchmark uses memcpy on an mmap buffer; the POSIX I/O benchmark uses pread

• mmap's read bandwidth is approximately 2.5x lower than that of POSIX I/O
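The two read paths of the microbenchmark, sketched under the stated setup (error handling and timing omitted; buffer management is simplified):

    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Sketch: mmap path.  Touching src pages triggers implicit I/O via
     * page faults; the OS decides when and how much to read. */
    void read_via_mmap(int fd, size_t file_size, char *dst, size_t blk) {
        char *src = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
        for (size_t off = 0; off + blk <= file_size; off += blk)
            memcpy(dst, src + off, blk);
        munmap(src, file_size);
    }

    /* Sketch: POSIX I/O path.  Each pread is an explicit I/O request of
     * a size the application controls. */
    void read_via_posix(int fd, size_t file_size, char *dst, size_t blk) {
        for (size_t off = 0; off + blk <= file_size; off += blk)
            pread(fd, dst, blk, (off_t)off);
    }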

Problem 4: I/O Block Size Management

[Chart: read bandwidth (GB/s) vs. I/O request size (4 KB to 1024 MB) for mmap and POSIX I/O; POSIX I/O bandwidth grows with request size, while mmap's stays roughly flat.]

• As the number of processes increases, the sub-batch gets smaller

• POSIX I/O benefits from a larger block size, while mmap does not

• Migrating LMDB to direct I/O with a larger block size can give a significant performance improvement (see the direct-I/O sketch below)
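A minimal sketch of reading a large aligned block with direct I/O, the direction this optimization takes; alignment size and names are illustrative:

    #define _GNU_SOURCE   /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Sketch: O_DIRECT bypasses the page cache, but requires the buffer,
     * offset, and length to be aligned (512 B or 4 KB depending on the
     * file system); we assume 4 KB here. */
    char *read_block_direct(const char *path, off_t off, size_t blk) {
        int fd = open(path, O_RDONLY | O_DIRECT);
        void *buf = NULL;
        posix_memalign(&buf, 4096, blk);  /* 4 KB-aligned buffer */
        pread(fd, buf, blk, off);         /* one large explicit request */
        close(fd);
        return buf;                       /* caller frees */
    }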

Problem 5: I/O Randomization

• I/O requests are typically out of order in parallel I/O

• A large number of processes divide a large file into smaller pieces, and each process accesses its own part

• Each process issues an I/O request at the same time

• I/O requests do not arrive at the I/O server processes in any specific order, as each process is independent

• This causes the server processes to access the file in a nondeterministic fashion

[Diagram: eight clients issue requests at the same time; the requests arrive out of order in the two servers' queues (e.g., 5, 3, 7, 1 at server 1 and 2, 6, 8, 4 at server 2), so each server accesses its file nondeterministically.]

Summary of LMDBIO Optimizations

Library | Optimization | Reducing interprocess contention | Explicit I/O | Eliminating sequential seek | Managing I/O size | Reducing I/O randomization
LMDB | - | | | | |
LMDBIO | LMM | ✔ | | | |
LMDBIO | LMM-DM | ✔ | | ✔ (partial) | |
LMDBIO | LMM-DIO | ✔ | ✔ | | |
LMDBIO | LMM-DIO-PROV | ✔ | ✔ | ✔ | |
LMDBIO | LMM-DIO-PROV-COAL | ✔ | ✔ | ✔ | ✔ |
LMDBIO | LMM-DIO-PROV-COAL-STAG | ✔ | ✔ | ✔ | ✔ | ✔

LMDBIO-LMM-DM
Optimization: coordinate between reader processes to improve parallelism

Portable Cursor Representation

• LMDB calls the position indicator for a record within the B+ tree a "cursor"

• It is not a simple offset from the start of the file: it contains the complete path of the record's parent branch nodes (multiple pointers), a pointer to the page header, and access flags

• It is not trivial to port these pointers across processes, as virtual address spaces differ

Part I: Serializing I/O

• Serialize data reading and coordinate between processes

• Each process reads its data and sends the next higher-rank process the location at which to start fetching its data

• This incurs NO extra data reading: the number of bytes read is EXACT... but I/O is done sequentially (a handoff sketch follows the diagram below)

[Diagram: Part I, sequential I/O and cursor handoff. P0 reads D0 and sends its cursor to P1; P1 reads D1 and sends its cursor to P2; P2 reads D2 and sends its cursor to P3; reads proceed strictly one after another.]
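A sketch of the handoff under the assumption that the portable cursor is reduced to a plain byte position; read_my_records() is a hypothetical helper standing in for the LMDB traversal:

    #include <mpi.h>
    #include <stdint.h>

    /* Hypothetical helper: reads this rank's records starting at `start`
     * and returns the position just past the last record it read. */
    uint64_t read_my_records(uint64_t start);

    /* Part I: each rank waits for the previous reader's cursor, reads
     * its own records, and hands the cursor to the next rank. */
    void serialized_read(MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        uint64_t cursor = 0;
        if (rank > 0)  /* block until the previous reader finishes */
            MPI_Recv(&cursor, 1, MPI_UINT64_T, rank - 1, 0, comm,
                     MPI_STATUS_IGNORE);
        cursor = read_my_records(cursor);
        if (rank < size - 1)  /* hand the cursor to the next reader */
            MPI_Send(&cursor, 1, MPI_UINT64_T, rank + 1, 0, comm);
    }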

Portable Cursor Representation (cont.)

• Our solution: a symmetric address space

• Every process memory-maps the database file to the same memory location

• This allows the pointers within the B+ tree to be portable across processes (a minimal sketch follows)
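A minimal sketch of the symmetric mapping; the fixed virtual address is an assumption for illustration and must in practice be a range known to be free in every process:

    #include <fcntl.h>
    #include <sys/mman.h>

    /* Sketch: every process maps the database at the same agreed-upon
     * virtual address, so pointers stored inside the file are valid in
     * all processes.  MAP_FIXED replaces any existing mapping at that
     * address, so the range must be known to be unused. */
    void *map_symmetric(const char *path, size_t len) {
        void *fixed = (void *)0x600000000000ULL;  /* assumed-free address */
        int fd = open(path, O_RDONLY);
        void *p = mmap(fixed, len, PROT_READ, MAP_SHARED | MAP_FIXED,
                       fd, 0);
        return p;  /* same value in every process on success */
    }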

LMDBIO-LMM-DIO

[Timeline: P0 seeks via mmap and scatters the offsets while P1 and P2 wait; all processes then read their data into the shared buffer with POSIX I/O.]

Optimization: replace mmap with POSIX I/O

• To use direct I/O, we need to know the position of each data record

• The root process obtains the offsets of all data samples by seeking the database using mmap

• This sequential seek is unavoidable because the offsets are not deterministic

• The other reader processes receive their offsets from the root and read their data using POSIX I/O

• Readers share data through the MPI shared buffer, the same as in LMM (a scatter-and-read sketch follows)
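A sketch of the LMM-DIO read phase, with seek_offsets_via_mmap() as a hypothetical stand-in for the root's B+-tree walk and a simplified one-offset-per-rank layout:

    #include <mpi.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical helper: root walks the mmap'ed B+ tree and records
     * the starting byte offset of each rank's first record. */
    void seek_offsets_via_mmap(long *all_offsets, int nprocs);

    void dio_read(MPI_Comm comm, const char *db, char *dst, size_t bytes) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        long all[4096];  /* illustrative bound on the process count */
        if (rank == 0)
            seek_offsets_via_mmap(all, size);

        long my_off;     /* every rank learns where its data starts */
        MPI_Scatter(all, 1, MPI_LONG, &my_off, 1, MPI_LONG, 0, comm);

        int fd = open(db, O_RDONLY);
        pread(fd, dst, bytes, (off_t)my_off);  /* explicit parallel read */
        close(fd);
    }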

LMDBIO-LMM-DIO-PROV
Optimization: utilize provenance information to entirely replace mmap with POSIX I/O

• We make a case for storing data provenance information for deep learning (how the data was created)

• LMDB's database layout can be deterministic only if the information about how it was created is provided

• With it, we can compute exactly where the data pages are located

• The sequential seek can be completely eliminated

• All I/O operations can be done via direct I/O (mmap is completely removed)

Important Notes

• Provenance information is not stored in the original LMDB format; this is an extension that we are proposing

• We use a separate auxiliary file to store this information

• This file can be created while the database is being generated, or later using a one-time read of the database

• It is much smaller than the dataset itself (a few hundred bytes)

A sketch of how record locations become deterministic follows.
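Given (assumed) layout parameters recorded in the auxiliary file, a record's offset becomes pure arithmetic and no seek is needed. The formula below is illustrative, not LMDB's exact on-disk layout:

    #include <stddef.h>
    #include <sys/types.h>

    /* Sketch: hypothetical provenance fields that make the layout
     * deterministic for fixed-size records. */
    typedef struct {
        size_t page_size;         /* e.g. 4096                        */
        size_t records_per_page;  /* fixed-size records per data page */
        size_t first_data_page;   /* pages preceding the first record */
    } provenance_t;

    /* Byte offset of record `record_idx`, computed with no traversal. */
    off_t record_offset(const provenance_t *p, size_t record_idx,
                        size_t record_size) {
        size_t page = p->first_data_page +
                      record_idx / p->records_per_page;
        size_t slot = record_idx % p->records_per_page;
        return (off_t)(page * p->page_size + slot * record_size);
    }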

LMDBIO-LMM-DIO-PROV-COAL

Sarunya Pumma, Min Si, Wu-chun Feng, and Pavan Balaji. Parallel I/O Optimizations for Scalable Deep Learning. IEEE International Conference on Parallel and Distributed Systems (ICPADS), Dec. 15-17, 2017, Shenzhen, China.

Optimization: coalesce multiple batches of data to be read at once so that direct I/O benefits from a large I/O size

• We read a larger chunk of data to enlarge the I/O time and eliminate the skew in I/O

• A constant amount of memory is set aside for data reading

• We read multiple batches of data at once (a minimal sketch follows)
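A minimal sketch of the coalescing idea: one large request covering several batches, bounded by a fixed buffer (names are illustrative):

    #include <unistd.h>

    /* Sketch: instead of one pread per batch, issue a single large read
     * covering batches_per_read batches.  buf has a fixed size, set by
     * the memory budget kept aside for data reading. */
    void coalesced_read(int fd, off_t start, size_t batch_bytes,
                        int batches_per_read, char *buf) {
        size_t chunk = batch_bytes * (size_t)batches_per_read;
        pread(fd, buf, chunk, start);  /* one large direct-I/O request */
        /* batch k then lives at buf + k * batch_bytes */
    }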

LMDBIO-LMM-DIO-PROV-COAL-STAG
Optimization: adopt I/O staggering to reduce I/O randomization

• The I/O staggering technique orders the requests

• Readers are divided into multiple groups with the same number of members

• Only one group can perform data reading at a time

• MPI_Send and MPI_Recv are used in the implementation (a token-passing sketch follows)
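A sketch of the staggering with MPI_Send/MPI_Recv token passing, assuming readers are grouped by consecutive ranks; do_read() is a hypothetical helper:

    #include <mpi.h>

    /* Hypothetical helper: performs this rank's data reading. */
    void do_read(void);

    /* Sketch: readers in group g wait for a token from the matching rank
     * in group g-1 before reading, so only one group hits the I/O
     * servers at a time. */
    void staggered_read(MPI_Comm comm, int group_size) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int token = 0;
        if (rank >= group_size)  /* not in the first group: wait */
            MPI_Recv(&token, 1, MPI_INT, rank - group_size, 0, comm,
                     MPI_STATUS_IGNORE);
        do_read();               /* this group's turn to read */
        if (rank + group_size < size)
            MPI_Send(&token, 1, MPI_INT, rank + group_size, 0, comm);
    }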

• Caffe/LMDB is 660x worse than ideal for 9,216 processes

• Read time takes up 90% of the total training time for 9,216 processes

• The I/O bottleneck is caused by five major problems:
  1. Interprocess contention -- results in an excessive number of context switches
  2. Implicit I/O inefficiency -- the OS fully controls I/O
  3. Sequential data access restriction -- arbitrary database access is not allowed in LMDB
  4. Inefficient I/O block size -- the I/O request size is too small to be efficient
  5. I/O randomization -- many readers participate in I/O at the same time

• We proposed six optimizations that address these five problems in the state-of-the-art I/O subsystem for deep learning

Experiment Information
• Dataset: CIFAR10-Large
• Network: AlexNet
• Batch size: 18,432
• Training iterations: 512
• Framework: Caffe
• Testbed: LCRC Bebop (each node: 36-core Intel Broadwell, 128 GB memory)

[Chart: factor of improvement over LMDB in total execution time vs. number of processes for the LMDBIO variants (LMM, LMM-DIO, LMM-DM, LMM-DIO-PROV, LMM-DIO-PROV-COAL, LMM-DIO-PROV-COAL-STAG); the labeled series reads 1.0, 1.0, 1.1, 1.2, 1.7, 4.6, 3.5, 2.7, 3.6, 4.6, 7.7, 15.5, 36.9, and 64.4, i.e., up to a 64.4x improvement at the largest scale.]