
BEST PRACTICES FOR NGC

DG-08869-001 _v001 | May 2018

Best Practices


TABLE OF CONTENTS

Chapter 1. NVIDIA NGC Cloud Services Best Practices For AWS
  1.1. Users And Authentication
    1.1.1. User Credentials In Other Regions
  1.2. Data Transfer Best Practices
    1.2.1. Upload Directly To EC2 Instance
    1.2.2. Upload Data To S3
      1.2.2.1. S3 Data Upload Examples
    1.2.3. S3 Object Keys
      1.2.3.1. S3 Object Key Example
  1.3. Storage
    1.3.1. Network Storage
      1.3.1.1. Elastic File System (EFS)
      1.3.1.2. EBS Volumes In RAID-0


Chapter 1. NVIDIA NGC CLOUD SERVICES BEST PRACTICES FOR AWS

The NVIDIA® GPU Cloud™ (NGC) runs on associated cloud providers such as Amazon Web Services (AWS). This section provides some tips and best practices for using NVIDIA NGC Cloud Services.

The following tips and best practices are from NVIDIA and should not be taken as best practices from AWS. It's best to consult with AWS before implementing any of these best practices. For specific AWS documentation, see the Amazon Web Services web page.

1.1. Users And Authentication

The first step in using NGC is to follow the instructions provided in the NGC Getting Started Guide. Your AWS credentials are tied to a specific region; therefore, if you are going to change regions, be sure you use the correct key for that region. A good practice is to name the key file with the region in the actual name.

Next, spend some time getting to know AWS IAM (Identity and Access Management). At a high level, IAM allows you to securely create, manage, and control user (individual) and group access to your AWS account. It is very flexible and provides a rich set of tools and policies for managing your account.

AWS provides some best practices around IAM that you should read immediately after creating your AWS account. There are some very important points in regard to IAM. The first thing you should be aware of is that when you create your account on AWS, you are essentially creating a root account. If someone gains access to your root credentials, they can do anything they want to your account, including locking you out and running up a large bill. Therefore, you should immediately lock away your root account access keys.

After you've secured your root credentials, create an individual IAM user. This is very similar to creating a user on a *nix system. It allows you to create a unique set of security credentials which can be applied to a group of users or to individual users.


You should also assign a user to a group. Groups can have pre-assigned permissions to resources, much like giving permissions to *nix users. This allows you to control access to resources. AWS has some pre-defined groups that you can use. For more information about pre-defined groups on AWS, see Creating Your First IAM Admin User and Group. For IAM best practices, see the AWS Identity And Access Management User Guide.
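If you prefer the command line, the following is a minimal sketch of creating a user and group with the AWS CLI. The group name, user name, and attached policy are placeholders chosen for illustration, not recommendations.

$ aws iam create-group --group-name ngc-users
$ aws iam attach-group-policy --group-name ngc-users \
    --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess
$ aws iam create-user --user-name alice
$ aws iam add-user-to-group --group-name ngc-users --user-name alice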

1.1.1. User Credentials In Other Regions

The credentials that you created are only good for the region where you created them. If you created them in us-east-1 (Virginia), then you can't use them for the region in Japan. If you want to only use the region where you created your credentials, then no action is needed. However, if you want the option to run in different regions, then you have two choices:

‣ Option 1: create credentials in every region where you plan to run, or
‣ Option 2: copy your credentials from your initial region to all other regions.

Option 1 isn't difficult, but it can be tedious depending upon how many regions you might use. To keep track of the different keys, you should include the region name in the key name.

Option 2 isn’t too difficult thanks to a quick and simple bash script:

#!/bin/bash

myKEYNAME='bb-key'
myKEYFILE=~/.ssh/id_rsa.pub

if [ ! -f "${myKEYFILE}" ]; then
  echo "I can't find that file: ${myKEYFILE}"
  exit 2
fi

myKEY=`cat ${myKEYFILE}`

for region in $( aws --output text ec2 describe-regions | cut -s -f3 | sort ); do
  echo "importing ${myKEYNAME} into region ${region}"
  aws --region ${region} ec2 import-key-pair --key-name ${myKEYNAME} --public-key-material "${myKEY}"
done

In this script, the key name for your first region is bb-key and is assigned to myKEYNAME. The file that contains the public key is located at ~/.ssh/id_rsa.pub. After defining those two variables, you can run the script and it will import that key into all other AWS regions.

Before running GPU-enabled instances, it is a good idea to check with AWS on which regions have GPU-enabled instances (not all regions currently have them).

1.2. Data Transfer Best Practices

One of the fundamental questions users have around best practices for AWS is uploading and downloading data from AWS. This can be a very complicated question, and it's best to engage with AWS to discuss the various options. For more information about uploading, downloading, and managing objects, see the Amazon Simple Storage Service Console User Guide.

In the meantime, to help you get started, the following sections offer ideas for how to upload data to AWS.

1.2.1. Upload Directly To EC2 Instance

When you first begin to use AWS, you may have some data on your laptop, workstation, or company system and want to upload it to an EC2 instance that is running. This means that you want to directly upload data to the compute instance you started. A quick and easy way to do this is to use scp to copy the data from your local system to the running instance. You'll need the IP address or name of the instance as well as your AWS key. An example command using scp is the following:

$ cd data
$ scp -i my-key-pair.pem -r * ubuntu@public-dns-name:/home/ubuntu

In this example, the training data is located in a subdirectory called data on your system. You cd into that directory and then recursively upload all the data in that directory to the EC2 instance that has been started with the NVIDIA Volta Deep Learning AMI. You will need to use your AWS keys to upload to the instance. The -r option means recursive, so everything in the data directory, including subdirectories, is copied to the AWS instance.

Finally, you need to specify the user on the instance (ubuntu), the machine name (public-dns-name), and the full path where the data is to be uploaded (/home/ubuntu, which is the default home directory for the ubuntu user).

There are a few key points in using scp. The first is that you need to have the SSH port (port 22) open on your AWS instance and your local system. This is done via security groups.

There are other ways to open and block ports in AWS; however, they are not covered in this guide.

The second thing to note is that scp is single-threaded. That is, a single thread on your system is doing the data transfer. This may not be enough to saturate the NIC (Network Interface Card) on your system. In that case, you might want to break up the data into chunks and upload them to the instance. You can upload them serially (one after the other), or you can upload them in groups (in essence, in parallel).

There are a couple of options you can use for uploading the data that might help. The first one is using tar to create a tar file of a directory and all subdirectories. You can then upload that tar file to the running AWS EC2 instance.

Another option is to compress the tar file using one of many compression utilities (for example, gzip, bzip2, xz, lzma, or 7zip). There are also parallel versions of compression tools such as pigz (parallel gzip), lzip (uses lzlib), pbzip2 (parallel bzip2), pxz (parallel xz), or lrzip (parallel lzma utility).

You can use your favorite compression tool in combination with tar via the following option:

$ tar --use-compress-program=<path-to-compression-tool> -cf file.tar <directory>

The --use-compress-program option allows you to specify the path to the compression utility you want tar to use.

After tarring and compressing the data, upload the file to the instance using scp. Then, ssh into the instance and uncompress and untar the file before running your framework.
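For example, the following sketch assumes pigz is installed on both your local system and the instance; the directory name, archive name, key file, and host name are placeholders.

# On your local system: tar and compress the data directory in one step
$ tar --use-compress-program=pigz -cf dataset.tar.gz data/
# Upload the single compressed file to the running instance
$ scp -i my-key-pair.pem dataset.tar.gz ubuntu@public-dns-name:/home/ubuntu
# On the instance: uncompress and untar before running your framework
$ ssh -i my-key-pair.pem ubuntu@public-dns-name
$ tar --use-compress-program=pigz -xf dataset.tar.gz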

Note that compressing or creating a tar file does not encrypt the data. Encryption is not covered in this guide; however, scp will encrypt the file during the transfer unless you have specifically told it not to encrypt.

Another utility that might increase the upload speed is bbcp. It is a point-to-point network file copy application that can use multiple threads to increase the upload speed.

As explained, there are many options for uploading data directly to an AWS EC2 instance. There are also some things working against you to reduce the upload speed. One big impediment to improving upload speeds is your connection to the Internet and the network between you and the AWS instance.

If you have a 100 Mbps connection to the Internet or are connecting from home using a cable or phone modem, then your upload speeds might be limited compared to a 1 Gbps connection (or faster). The best advice is to test data transfer speeds using a variety of file sizes and numbers of files. You don't have to do an exhaustive search, but running some tests should help you get a feel for data upload speeds.
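As a simple illustration, you could time the transfer of a single test file with scp; the file name, key file, and host name below are placeholders.

$ time scp -i my-key-pair.pem test_1GB.dat ubuntu@public-dns-name:/home/ubuntu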

Another aspect you have to consider is the packet size on your network. The network inside your company or inside your home may be using jumbo frames, which set the frame size to 9,000 bytes (an MTU of 9,000). This is great for closed networks because the frame size can be controlled so that you get jumbo frames from one system to the next. However, as soon as those data packets hit the Internet, they drop to the normal frame size of 1,500. This means you have to send many more packets to upload the data, which causes more CPU usage on both sides of the data transfer.

Jumbo frames also reduce the percentage of the packet that is devoted to overhead (not data). Jumbo frames are therefore more efficient when sending data from system to system. But as soon as the data hits the Internet, the percentage devoted to overhead increases and you end up having to send more packets to transfer the data.
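To see what frame size your own system is using, you can check the MTU of your network interface; the interface name eth0 is an assumption and may differ on your system.

$ ip link show eth0 | grep mtu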

1.2.2. Upload Data To S3

Another option is to upload the data to an AWS S3 bucket. S3 is an object store that basically has unlimited capacity. It is a very resilient and durable storage system, so it's not necessary to store your data in multiple locations. However, S3 is not POSIX compliant, so you can't use applications that read and write directly to S3 without rewriting the IO portions of your code.

S3 is a solution for storing your input and output data for your applications because it's so reliable and durable. To use the data, you copy it from S3 to the instances you are using and copy data from the instance to S3 for longer-term storage. This allows you to shut down your instances and only pay for the data stored in S3.

Fundamentally, S3 is an object store (not POSIX compliant) that can scale to extremely large sizes and is very durable and resilient. S3 does not understand the concept of directories or folders, meaning the storage is flat. However, you can still use folders and directories to create a hierarchy. These directories just become part of the name of the object in S3. Applications that understand how to read the object names can present you a view of the objects that includes directories or folders.

There are multiple ways to copy data into S3 before you start up your instances. AWS makes a set of CLI (Command Line Interface) tools available that can do the data transfer for you. The basic command is simple. Here is an example:

$ aws s3 cp <local-file> s3://mybucket/<location>

This command copies a local file on your laptop or server to your S3 bucket. In the command, s3://mybucket/<location> is the location in your S3 bucket. This command doesn't use any directories or folders; instead, it puts everything into the root of your S3 bucket.

A slightly more complex command might look like the following:

$ aws s3 cp /work/TEST s3://data-compression/example2 --recursive --include "*"

This copies an entire directory on your host system (such as your laptop) to a directory on S3 with the name data-compression/example2. It copies the entire contents of the local directory because of the --recursive flag and the --include "*" option. The command will create subdirectories on S3 as needed. Remember, subdirectories don't really exist; they are simply part of the object name on S3.

S3 has the concept of a multi-part upload. This was designed for uploading large files to S3 so that if a network error is encountered, you don't have to start the upload all over again. It breaks the object into parts and uploads these parts, sometimes in parallel, to S3, and re-assembles them into the object once all of the uploads are done.

Each part is a contiguous portion of the object's data. If you want to do a multi-part upload manually, then you can control how the object parts are uploaded to S3. They can be uploaded in any order and can even be uploaded in parallel. After all of the parts are uploaded, you then have to assemble them into the final object. The general rule of thumb is that when the object is greater than 100MB, using multi-part upload is a good idea. For objects larger than 5GB, multi-part upload is mandatory.
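A rough sketch of what a manual multi-part upload looks like with the low-level aws s3api commands follows; the bucket name, object key, part files, and parts.json (a file listing each part's PartNumber and ETag) are placeholders.

# Start a multi-part upload and note the UploadId in the response
$ aws s3api create-multipart-upload --bucket mybucket --key big-file.tar
# Upload each part (these commands can run in parallel)
$ aws s3api upload-part --bucket mybucket --key big-file.tar \
    --part-number 1 --body part.001 --upload-id <UploadId>
$ aws s3api upload-part --bucket mybucket --key big-file.tar \
    --part-number 2 --body part.002 --upload-id <UploadId>
# Assemble the uploaded parts into the final object
$ aws s3api complete-multipart-upload --bucket mybucket --key big-file.tar \
    --upload-id <UploadId> --multipart-upload file://parts.json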

While multi-part upload was designed for uploading large files, it also helps improve your throughput since you can upload the parts in parallel. One of the nice features of the AWS CLI tools is that all aws s3 cp commands use multi-part upload automatically for large files. This includes aws s3 mv and aws s3 sync. You don't have to do anything manually. Consequently, any uploads using this tool can be very fast.


Another option is to use open-source tools for uploading data to S3 in parallel. The concept is not to use multi-part upload but to upload objects in parallel to improve performance. One tool that is worth examining is s3-parallel-put.

You can also use tar and a compression tool to collect many objects and compress them before uploading to S3. This can result in fast performance because the number of files has been reduced and the amount of data to be transferred is reduced. However, S3 isn't a POSIX compliant file system, so you cannot uncompress or untar the data within S3 itself. You would need to copy the data to a POSIX file system first and then perform those actions. Alternatively, you could use AWS Lambda to perform these operations, but that is outside the scope of this document.

For a video tutorial about S3, see AWS S3 Tutorial For Beginners - Amazon Simple Storage Service.

1.2.2.1. S3 Data Upload Examples

To understand how the various upload options impact performance, let's look at three examples. All three examples test uploading data from an EC2 instance to an S3 bucket. A d2.8xlarge instance is used because it has a large amount of memory (244GB). The instance has a 10GbE connection along with 36 vCPUs (18 physical cores with Hyper-Threading).

All data is created using /dev/urandom. Each example has a varying number of files and file sizes.
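For reference, files of a given size can be generated from /dev/urandom with dd; the file names below are arbitrary.

$ dd if=/dev/urandom of=file_500MB.dat bs=1M count=500
$ dd if=/dev/urandom of=file_5GB.dat bs=1M count=5120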

1.2.2.1.1. Example 1: Testing s3-parallel-put For Uploading

This example is fairly simple. It follows an astronomy pattern for the sake of discussion. It has two file sizes, 500MB and 5GB. For every three 500MB files, there are two 5GB files. All of the files were created in a single directory, with a total of 50 files consuming 115GB. In total there are 20x 5GB files and 30x 500MB files.

This test used the s3-parallel-put tool to upload all of the files. The wall time was recorded when the process started and when it ended, giving an elapsed time for the upload. The number of simultaneous uploads was varied from 1 to 32, which indicated how many files were being uploaded at the same time. The data was then normalized by the run time for uploading one file at a time.

The results are presented in the chart below along with the theoretical speedup (perfect scaling).


Figure 1 Using the s3-parallel-put tool to upload

Notice that the scaling is fairly good up to about 8 processes. After that, the results from using the tool fall below the theoretical scaling, and from 24 to 32 processes there is basically no improvement in upload time.

1.2.2.1.2. Example 2: Testing s3-parallel-put, AWS CLI, And Tar For Uploading

This example uploads a large number of smaller files from the instance to an S3 bucket. For this test, the following file distribution was used:

‣ 500,000 2KB files
‣ 9,500,000 100KB files

All of the files were evenly split across 10 directories.

The tests uploaded the files individually, but creating a compressed tar file and uploading it to S3 was also tested. The specific tests were:

1. Upload using s3-parallel-put
2. Upload using the AWS CLI tools
3. Tar all of the files first, then use the AWS CLI tools to upload the tar.gz file (no data compression)

The tests were run with the wall clock time recorded at the start and at the end. The results are shown below.

The y-axis has been normalized to an arbitrary value, but the larger the value, the longer it takes to upload the data.


Figure 2 Comparing the s3-parallel-put tool, AWS CLI, and tar to upload

From the chart you can see that the AWS CLI tool is about 3x faster than s3-parallel-put. However, the fastest upload time is when all of the files were first tarred together and then uploaded. That is about 33% faster than not tarring the files.

The actual upload of the tar file alone takes only about one quarter of the time needed to upload all of the files individually.

Remember that instead of having individual files in S3 (individual objects), you have one large object, which is a tar file.

1.2.2.1.3. Example 3: Testing The AWS CLI For Uploading

This example goes back to the first example, increases the number of files in the same proportion, and adds a very small file (less than 1.9KB). There are 40x 5GB files and 60x 500MB files, for a total of 100 files. Two files were then added to the data set to force the uploads to contend with one very small file (1.9KB) and one large file (50GB). This is a grand total of 102 files.

The AWS CLI tools were tested. While using the CLI tool, a few combinations of the tool along with tar and various compression tools were also tested:

1. Tar the files into a single .tar file, upload with the CLI
2. Tar the files into a single .tar file with compression, upload with the CLI
3. Tar the files into a single .tar file, then compress it, upload with the CLI
4. Tar the files into a single .tar file with parallel compression (pigz), upload with the CLI
5. Tar the files into a single .tar file, then parallel compress it, upload with the CLI

NVIDIA NGC Cloud Services Best Practices For AWS

www.nvidia.comBest Practices for NGC DG-08869-001 _v001 | 9

The time to complete the tar and the time to complete the data compression are included in the overall time.

The y-axis has been normalized to an arbitrary value, but the larger the value, the longer it takes to upload the data.

Figure 3 Using the AWS CLI tool to upload

From the testing results, the following observations were made:

‣ The CLI tool alone is the fastest.
‣ Using serial compression tools such as bzip2 with tar greatly increases the total upload time (fourth bar from the right).
‣ Tarring all of the files together while using pigz (parallel gzip) results in the second fastest upload time (second bar from the right). Just remember that the files are now in one large, compressed file on S3.
‣ Using separate tasks for tar and then compression slows down the overall upload time.
‣ pigz appears to be about 6 times faster than gzip on this EC2 instance.

1.2.3. S3 Object Keys

Since S3 does not understand the concept of directories or folders, the full path becomes part of the object name. In essence, S3 acts like a key-value store so that each key points to its associated object (S3 is more sophisticated than a key-value store, but at a fundamental level it acts simply like a key-value store). An object key might be something like the following:

assets/js/jquery/plugins/jtables.js

The key has several directory-like components (folders) before the actual file name, which is jtables.js.

Keys are unique within a bucket. They do not contain meta-data beyond the key name; meta-data is kept separately and is also associated with the object. While patterns in keys will not necessarily improve upload performance, they can improve subsequent read/write performance.

An S3 bucket begins with a single index to all objects. Depending upon what your data looks like, this can become a performance bottleneck. The reason is that all queries go through the partition associated with the index regardless of the key name. Ideally, it would be good to spread your objects across multiple partitions (multiple indices) to improve performance. This means that more storage servers are used, which provides more effective CPU resources, memory, and network performance.

The partitions are based on the object key (plus the bucket key and a version number that might be associated with the object). If the first few characters of the object key are all the same, then S3 will assign the objects to the same partition, resulting in only one server servicing any data requests.

S3 will try to spread the object keys across multiple partitions as best it can to satisfy a number of constraints while also trying to increase the number of partitions. However, as you are uploading data, S3 will not be able to create partitions on the fly, so you are likely to be using one partition. You can contact AWS prior to your data upload to discuss "pre-warming" partitions for a bucket. Over time, as you use the data, more partitions are added as needed. The exact rules that determine how and when partitions are created (or not) are a proprietary implementation detail of AWS. But it is guaranteed that if the object keys are fairly similar, particularly in the first few characters, they will all be on a single partition. The best way around this is to add some randomness to the first few characters of an object key.

1.2.3.1. S3 Object Key Example

To better understand how you might introduce some randomness into key names, let's look at a simple example. Below is a table of objects in a bucket. It includes the bucket key and the object key for each object. When storing data, a common pattern is to use the date as the first part of the name. This carries over to the object keys as shown below.

Figure 4 Objects in a bucket

Notice that the object key is the same for each file for the first 5 characters, since the "year" was used first. There is not much variation in the year, especially if you are working with recent data from within the last year or two.

One option to improve randomness for the first few characters is to reverse the date in the object keys.


Figure 5 Objects in a bucket - reverse date

This results in more randomness in the object keys. There are now 31 possible values for the first two characters (01 to 31). This gives S3 more leeway in creating partitions for the data.
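As a minimal sketch of building the reversed-date key at upload time, assuming a hypothetical YYYY-MM-DD file naming pattern and a placeholder bucket name:

#!/bin/bash
# Hypothetical sketch: turn a YYYY-MM-DD prefix into DD-MM-YYYY
# before uploading the file to S3.
f="2017-10-07-results.dat"                        # placeholder file name
rev=$(echo "${f}" | cut -c1-10 | awk -F- '{print $3"-"$2"-"$1}')
aws s3 cp "${f}" "s3://mybucket/${rev}${f:10}"    # key: 07-10-2017-results.dat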

One problem with introducing more randomness into the object key is that you might have to change your applications to read the date backwards and then convert it. But if you are doing a great deal of data processing, it might be worth the code change to get improved S3 performance.

Another way to introduce randomness in the object key is to make the first four characters of each object key the first four characters of the object's md5sum, as shown below.

Figure 6 Objects in a bucket - md5sum first four characters

This introduces a great deal of randomness in the first four characters while still allowing you to keep the classic format for the date. But again, you may have to modify your application to drop the first 5 characters from the file name.
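A minimal sketch of this approach, assuming the files are uploaded with the AWS CLI (the bucket name and file pattern are placeholders):

#!/bin/bash
# Hypothetical sketch: prefix each object key with the first four
# characters of the file's md5sum before uploading to S3.
for f in *.dat; do
  prefix=$(md5sum "${f}" | cut -c1-4)
  aws s3 cp "${f}" "s3://mybucket/${prefix}-${f}"
done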

1.3. Storage


AWS has many storage options: network file storage, object storage, network block storage, and even local storage in the instance, which is referred to as ephemeral storage. To plan your use of storage within AWS, it's best to discuss the options with your AWS contact. The following sections discuss your storage options for artificial intelligence, deep learning, and machine learning.

When you use the NVIDIA Volta Deep Learning AMI, by default you have a single AWS EBS (Elastic Block Storage) volume that is formatted with a file system and mounted as / on the instance. The EBS volume is a general purpose (gp2) type. While not entirely accurate, you can think of an EBS volume as an iSCSI volume that you might mount on a system from a SAN. However, EBS volumes have some features that you might not get from a SAN, such as very high durability and availability, encryption (at rest and in transit), snapshots, and elastic scalability.

Note about your options:

‣ Using encryption at rest and in transit will impact throughput. It's up to you to make the decision between performance and security.
‣ You can resize EBS volumes on the fly. However, this doesn't resize the file system that is using the volume. Therefore, you will need to know how to grow the file system on the EBS volume (a sketch follows this list):
  ‣ The current size limit on EBS volumes is 16TB. For anything greater than that, you either need to use EFS or use a second volume with a RAID level.
‣ Snapshots are a great way to save current data for the future.
‣ There are performance limits on an EBS volume, including throughput and IOPS. Remember that, in general, deep learning IO is usually fairly heavy on IOPS (read IOPS).
‣ Some instances are EBS optimized, meaning they have a much better connection to EBS volumes for improved performance.
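As referenced in the list above, here is a minimal sketch of growing the file system after enlarging an EBS volume. The device and partition names (/dev/xvda, partition 1) and the ext4 file system are assumptions; growpart comes from the cloud-utils package, and you should check your own device names with lsblk.

$ lsblk                        # confirm the device and partition names
$ sudo growpart /dev/xvda 1    # grow partition 1 to fill the resized volume
$ sudo resize2fs /dev/xvda1    # grow an ext4 file system to fill the partition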

Before training your model, you may have to upload your data from a local host to the running instance. Before uploading your data, ensure you have enough EBS capacity to store everything. You might estimate the size on your local host first before starting the instance; then you can increase the EBS volume size so that it's larger than the data set. Another option is to upload your data to Amazon's S3 object storage. Your applications won't be able to read or write directly to S3, but you can copy the data from S3 to your NVIDIA Volta Deep Learning instance, train the model, and upload any results back to S3. This keeps all of the data within AWS, and if the instance and your S3 bucket are in the same region, that can help data transfer throughput. You can upload your data file by file to S3, or you can create a tar file or compressed tar file and upload everything to S3. Your S3 data can also be encrypted.
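A sketch of that workflow with hypothetical bucket and path names:

$ du -sh /data/training-set                                          # estimate the data set size on the local host
$ aws s3 sync s3://mybucket/training-set /home/ubuntu/training-set   # copy the data from S3 to the instance
$ aws s3 cp /home/ubuntu/results s3://mybucket/results --recursive   # copy results back to S3 after training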

1.3.1. Network Storage

In the previous section, the simple option of using a single EBS volume with the NVIDIA Volta Deep Learning AMI was discussed. In this section, other options are presented, such as using EFS or using multiple EBS volumes in a RAID group.

1.3.1.1. Elastic File System (EFS)


The AWS Elastic File System (EFS) can be thought of as "NFS as a service". It allows you to create an NFS service with very high durability and availability that you can mount on instances in a specific region across multiple AZs (in other words, EFS is a regionally based service). The amount of storage space in EFS is elastic, so it can grow into the petabyte range. It uses NFSv4.1 (NFSv3 is not supported) to improve security. As you add data to EFS, its performance increases. It also allows you to encrypt the data in the file system for more security.

Perhaps the best feature of EFS is that it is fully managed. You don't have to create an instance to act as an NFS server and allocate storage to attach to that server. Instead, you create an EFS file system and a mount point for the clients. As you add data, EFS will automatically increase the storage as needed; in other words, you don't have to add storage and extend the file system.
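A minimal sketch of mounting an EFS file system from a client instance over NFSv4.1 follows. The file system DNS name (fs-12345678.efs.us-east-1.amazonaws.com) and the mount point are placeholders, and the mount options shown are the commonly recommended ones.

$ sudo mkdir -p /mnt/efs
$ sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
    fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs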

For a brand new EFS file system, the performance is likely to be fairly low. The AWS documentation indicates that for every TB of data, you get about 50 MB/s of guaranteed throughput. For NGC, EFS is a great AWS product for easily creating a very durable NFS storage system, but the performance may be low until the file system contains a large amount of data (multiple TBs).

One important thing to remember is that your throughput performance will be governed by the network of your instance type. If your instance type has a 10 Gbps network, then that will govern your NFS performance.

1.3.1.2. EBS Volumes In RAID-0

As mentioned previously, the NVIDIA Volta Deep Learning AMI comes with a single EBS volume. Currently, EBS volumes are limited to 16TB. To get more than 16TB, you will have to take two or more EBS volumes and combine them with Linux Software RAID (mdadm).

Linux Software RAID (mdadm) allows you to create all kinds of RAID levels. EBS volumes are already durable and available, which means that RAID levels that provide resiliency in the event of a block device failure, such as RAID-5, are not necessary. Therefore, it's recommended to use RAID-0.

You can combine a fairly large number of EBS volumes into a RAID group. This allows you to create a 160TB RAID group for an instance. However, this should be done for capacity reasons only. Adding EBS volumes doesn't improve the IO performance of single-threaded applications, and single-threaded IO is very common in deep learning applications.
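A minimal sketch of creating a RAID-0 group from two attached EBS volumes with mdadm follows. The device names (/dev/xvdb, /dev/xvdc), the ext4 file system, and the mount point are assumptions; check your own device names with lsblk.

$ sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
$ sudo mkfs.ext4 /dev/md0
$ sudo mkdir -p /data
$ sudo mount /dev/md0 /data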

Notice

THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION REFERENCED IN THIS GUIDE IS PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA's aggregate and cumulative liability towards customer for the product described in this guide shall be limited in accordance with the NVIDIA terms and conditions of sale for the product.

THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE, AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE (INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.

NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer's sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer's product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.

Trademarks

NVIDIA, the NVIDIA logo, and cuBLAS, CUDA, cuDNN, cuFFT, cuSPARSE, DIGITS, DGX, DGX-1, DGX Station, GRID, Jetson, Kepler, NVIDIA GPU Cloud, Maxwell, NCCL, NVLink, Pascal, Tegra, TensorRT, Tesla and Volta are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright

© 2018 NVIDIA Corporation. All rights reserved.

www.nvidia.com