TRANSCRIPT
Linux Clusters Institute: High Performance Storage
University of Oklahoma, 05/19/2015
Mehmet Belgin, Georgia Tech [email protected]
(in collaboration with Wesley Emeneker)
The Fundamental Question
• How do we meet *all* user needs for storage?
• Is it even possible?
• Confounding factors:
  • User expectations (in their own words)
  • Budget constraints
  • Application needs and use cases
  • Expertise on the team
  • Existing infrastructure
Examples of Common Storage Systems
• Network File System (NFS) – a distributed file system protocol for accessing files over a network.
• Lustre – a parallel, distributed file system
  • OSS – object storage server. Stores and manages pieces of files (aka objects).
  • OST – object storage target. A disk managed by the OSS that stores file data.
  • MDS – metadata server. Stores and serves file metadata.
  • MDT – metadata target. A disk managed by the MDS that stores file metadata.
• General Parallel File System (GPFS) – a parallel, distributed file system.
  • Metadata is not owned by any particular server or set of servers.
  • All clients participate in filesystem management.
  • NSD – network shared disk.
• Panasas/PanFS – a parallel, distributed file system
  • Metadata is owned by director blades.
  • File data is owned by storage blades.
Nomenclature
• Object store – a place where chunks of data (aka objects) are stored. Objects are not files, though they can store individual files or different pieces of files.
• Raw space – what the disk label shows. Typically given in base 10.
e.g. 10 TB (terabytes) == 10*10^12 bytes
• Usable space – what “df” shows once the storage is mounted. Typically given in base 2.
e.g. 10 TiB (tebibytes) == 10*2^40 bytes
• Usable space is often about 30% smaller (sometimes more, sometimes less) than raw space.
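A quick worked example (illustrative arithmetic, not a vendor quote): a drive labeled 10 TB holds 10*10^12 bytes, which is 10*10^12 / 2^40 ≈ 9.09 TiB even before RAID parity and filesystem overhead are subtracted. That is where much of the “missing” space goes.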
Which one is right for me?
Lustre
The End.
Thanks for participating!
Before we start…
What is a File System?
What is a filesystem?
• A system for files (Duh!)
• A source of constant frustration
• A filesystem is used to control how data is stored and retrieved – Wikipedia
• It’s a container (that contains files)
• It’s the set of disks, servers (computational components), networking, and software
• All of the above
Disclaimer
• There are no right answers
• There are wrong answers (no, seriously)
• It comes down to balancing tradeoffs of preferences, expertise, costs, and case-by-case analysis
Know Your Stakeholders
… and keep all of them happy! (at the same time)
1. Users
2. Managers and University Leadership
3. University support staff
4. System administrators
5. Vendor
What do you need to support?
Common Storage Requirements (which most users can’t articulate)
• Temporary storage for intermediate results from jobs (a.k.a. scratch)
• Long-term storage for runtime use
• Backups
• Archive
• Exporting said filesystem to other machines (like a user's Windows XP laptop)
• Virtual machine hosting
• Database hosting
• Map/Reduce (a.k.a. Hadoop)
• Data ingest and outgest (DMZ?)
• System Administrator storage
Tradeoffs
First, try to define the intended use and operational lifetime…
• Speed (… is a relative term!)
• Space
• Cost
• Scalability
• Administrative burden
• Monitoring
• Reliability/Redundancy
• Features
• Support from vendor
Parallel/Distributed vs. Serial Filesystems*
Serial
• Doesn’t scale beyond a single server
• Often isn’t easy to make reliable or redundant beyond a single server
• A single server controls everything
Parallel
• Speed increases as more components are added
• Built for distributed redundancy and reliability
• Multiple servers contribute to the management of the filesystem
*None of these things are 100% true
The Most Common Solutions for HPC
Want to access your data from everywhere? You need “Network Attached Storage (NAS)”!
• NFS (serial-ish)
• GPFS (Parallel)
• Lustre (Parallel)
• Panasas (Parallel)
• What about others like OrangeFS, Gluster, Ceph, XtreemFS, CIFS, HDFS, Swift, etc.?
Prepare for a Challenge
• NFS
• Panasas
• GPFS
• Lustre
• Your mileage may vary!
(Chart: administrative burden & needed expertise, anecdotal, on a scale from low to high.)
Network File System (NFS)
• Can be built from commodity parts or purchased as an appliance
• A single server typically controls everything
• Where does it fall for our tradeoffs?
  • No software cost
  • Compatible (though not 100% POSIX)
  • Underlying filesystem does not matter much (ZFS, ext3, …)
  • True redundancy is harder (single point of failure)
  • Mostly for low-volume, low-throughput workloads
  • Strong client-side caching; works well for small files
  • Requires minimal expertise and is (relatively) easy to manage
  (a minimal export/mount sketch follows below)
(Tradeoffs checklist: speed, space, cost, scalability, administrative burden, monitoring, reliability/redundancy, features, vendor support)
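A minimal sketch of an NFS export and client mount (the server name, network, and paths are hypothetical):
  /etc/exports on the server:   /export/home  10.10.0.0/16(rw,sync,no_root_squash)
  then on the server:           exportfs -ra
  and on each client:           mount -t nfs nfs-server:/export/home /nfs/home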
General Parallel File System (GPFS)
• Can be built from commodity parts or purchased as an appliance
• All nodes in the GPFS cluster participate in filesystem management
• Metadata is managed by every node in the cluster (a few inspection commands are sketched below)
• Where does it fall in our tradeoffs?
(Diagram: GPFS clients and NSD servers connected over a network.)
(Tradeoffs checklist: speed, space, cost, scalability, administrative burden, monitoring, reliability/redundancy, features, vendor support)
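A few commands an administrator might use to inspect a GPFS cluster (a sketch; the filesystem name gpfs0 is hypothetical):
  mmlscluster     # cluster membership and manager nodes
  mmlsnsd         # NSDs and the servers that serve them
  mmlsfs gpfs0    # filesystem attributes (block size, replication, …)
  mmdf gpfs0      # capacity and free space per NSD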
Lustre
• Can be built from commodity parts, or purchased as an appliance
• Separate servers for data and metadata
• Where does it fall in our tradeoffs?
* Image credit: nor-tech.com
(Tradeoffs checklist: speed, space, cost, scalability, administrative burden, monitoring, reliability/redundancy, features, vendor support)
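A minimal sketch of mounting Lustre on a client (the MGS address and filesystem name are hypothetical):
  mount -t lustre 10.0.0.1@tcp0:/lustre01 /mnt/lustre
  lfs df -h   # per-OST and per-MDT usage as seen from the client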
Panasas
• Is an appliance
• Separate servers for metadata and data
• Where does it fall in our tradeoffs?
* Image credit: panasas.com
(Tradeoffs checklist: speed, space, cost, scalability, administrative burden, monitoring, reliability/redundancy, features, vendor support)
Appliances
• Appliances generally come with vendor tools for monitoring and management
• Do these tools increase or decrease management complexity?
• How important is vendor support for your team?
(Screenshot of the Panasas management tool)
Good idea? Bad idea? Let’s discuss!
• NFS for everything
• Panasas for everything
• Lustre for everything
• GPFS for everything
How about…
• Lustre for work (files stored here are temporary)
• NFS for home
• Tape for backup and archival
• Lustre available everywhere
• Tape available on data movers
• NFS only available on login machines
Designing your storage solution
• Who are the stakeholders?
• How quickly should we be able to read any one file?
• How will people want to use it?
• How much training will you need?
• How much training will your users need to effectively use your storage?
  • Do you have the knowledge necessary to do the training?
  • How often do they need the training?
• Do you need different tiers or types of storage?
  • Long-term
  • Temporary
  • Archive
• From what science/usage domains are the users?
  • aka what applications will they be using?
• What features are necessary?
Application Driven Tradeoffs
• Domain science: chemistry, aerospace, bio* (biology, bioinformatics, biomedical), physics, business, economics, etc.
• Data and application restrictions: HIPAA and PHI, ITAR, PCI DSS, and many more (SOX, GLBA, CJIS, FERPA, SOC, …)
What you need to know
• What is the distribution of files (sizes and counts)? (a file-size histogram sketch follows below)
• What is the expected workload?
  • How many bytes are written for every byte read?
  • How many bytes are read for each file opened?
  • How many bytes are written for each file opened?
• Are there any system-based restrictions?
  • POSIX conformance: do you need a POSIX filesystem?
  • Limitations on the number of files, or files per directory
  • Network compatibility (InfiniBand vs. Ethernet)
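One way to get a rough file-size distribution (a sketch; assumes GNU find and a hypothetical /project tree):
  find /project -type f -printf '%s\n' \
    | awk '{ n[int(log($1+1)/log(2))]++ } END { for (b in n) print b, n[b] }' \
    | sort -n
  # first column: log2 size bucket (files roughly between 2^b and 2^(b+1) bytes); second column: file count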
Use Case: Data Movement
• Scenario: User needs to import a lot of data
• Where is the data coming from? • Campus LAN?
• Campus WAN?
• WAN?
• How often will the data be ingested?
• Does it need to be outgested?
• What kind of data is it?
• Is it a one-time ingest or a regular one?
Designing your storage solution
• What technologies do you need to satisfy the requirements that you now have?
• Can you put a number on the following?
  • Minimum disk throughput from a single compute node
  • Minimum aggregate throughput for the entire filesystem for a benchmark (like iozone or IOR)
  • I/O load for representative workloads from your site
    • How much data and metadata is read/written per job?
  • Temporary space requirements
  • Archive and backup space requirements
    • How much churn is there in the data that needs to be backed up?
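A rough way to estimate daily churn for backup sizing (a sketch; assumes GNU find and a hypothetical /home tree):
  find /home -type f -mtime -1 -printf '%s\n' \
    | awk '{ s += $1 } END { printf "%.1f GiB modified in the last 24 hours\n", s/2^30 }'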
Storage Devices
• Solid state: RAM, PCIe SSD, SATA/SAS SSD
• Spinning disk: SAS, NL-SAS, SATA
• Tape
(Roughly ordered from high speed & cost / low capacity at the top to low speed & cost / high capacity at the bottom.)
o Serial ATA (SATA): $/byte, large capacity, less reliable, slower (7.2k RPM)
o Serial Attached SCSI (SAS): $$/byte, small capacity, reliable, fast (15k RPM)
o Nearline-SAS (SATA drives with a SAS interface): more reliable than SATA, cheaper than SAS, ~SATA speeds but with lower overhead
o Solid State Disk (SSD): no spinning disks, $$$/byte, blazing fast, reliable
What is an IOP?
• IOP == Input/Output Operation
• IOPS == Input/Output Operations per Second
• We care about two performance numbers:
  • The number we tell people: “Our Veridian Dynamics Frobulator 2021 gets 300 PiB/s bandwidth!”
  • The number that affects users: “Our Veridian Dynamics Frobulator 2021 only gets 5 KiB/s for <insert your application’s name>”
• Why the difference?
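One way to see why (illustrative numbers, not measurements): delivered bandwidth ≈ IOPS × I/O size, and the two numbers above come from very different I/O sizes and concurrency.
  100,000 IOPS × 1 MiB per I/O ≈ 97 GiB/s   (large, sequential, many parallel streams)
  100,000 IOPS × 4 KiB per I/O ≈ 390 MiB/s  (small, random I/O)
  1,000 IOPS × 4 KiB per I/O   ≈ 4 MiB/s    (small, random, latency-bound single stream)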
More tradeoffs …
Space vs. Speed
• Do you need 10 GiB/s and 10 TiB of space?
• Do you need 1 PiB of usable storage and 1 GiB/s?
• How do you meet your requirements?
Large vs. Small Files
• What is a small file?
  • No hard rule; it depends on how you define it. At GT, small is < 1 MiB.
• Why do you care?
  • Metadata operations are deadly. A metadata lookup on a 1 TiB file takes the same time as a lookup on a 1 KiB file.
Example Storage Solution (Georgia Tech)
Experienced catastrophic failure(s) with all of them at least once
• Panasas appliance for scratch (shared by all)
• GPFS appliance on SATA/NL-SAS/SAS for long-term (and some home)
• NFS on SATA for long-term (many servers)
• NFS on SATA for home (a few servers)
• NFS for administrative storage
• NFS for daily backups
• Coraid system (NFS) for application repository and VM images
• Building a homebrew GPFS from commodity components for scratch!
(For informational purposes only, not a recommendation.)
Storage Policies (Georgia Tech)
• 5 GB home space: backed up daily, provided by GT, NFS
• ∞ project space: backed up daily; faculty-purchased, but GT buys the backup space; mix of NFS and GPFS (transitioning to GPFS)
• 5 TB/7 TB scratch/temporary: not backed up, purchased by GT, PanFS (soon to be something else)
Storage Policies (Georgia Tech)
• Scratch
  • Files older than 60 days are marked for removal (a sketch of such a sweep follows below)
  • Users are given one week to save their data (or make a plea for more time)
  • Marked files are removed after one week
• Not backed up
• Quotas
  • Quota increases must be requested by the owner or a designated manager
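A sketch of how such a sweep might identify candidate files (the scratch path is hypothetical; a real purge policy would also handle exemptions and user notification):
  find /scratch -type f -mtime +60 -print > /tmp/scratch_purge_candidates.txt
  # notify owners, wait one week, then remove the files that are still marked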
Best Practices
• Benchmark the system whenever you can
  • Especially when you first get it (this is the baseline)
  • Then, every time you take the system down (so that you can tell if something has changed)
  • Run the EXACT SAME test!
• Test the redundancy and reliability
  • Does it survive a drive or server failure? Power something off or rip it out while you are putting a load on it
• Don’t solely rely on generic benchmarks
  • Run the applications your stakeholders care about
• Regularly get data about your data
  • Monitor the status of your filesystem; proactively fix problems
• Constantly ask users (and other stakeholders) how they feel about performance
  • It doesn’t matter if benchmarks are good if they feel it is bad
How About Cloud and Big Data?
Design/Standards:
• POSIX: Portable Operating System Interface (NFS, GPFS, Panasas, Lustre)
• REST: Representational State Transfer, designed for scalable web services
Case-specific solutions:
• Software-defined, hardware-independent storage (e.g. Swift)
• Proprietary object storage (e.g. S3 for AWS, which is RESTful)
• Geo-replication: DDN WOS, Azure, Amazon S3
• Open-source object storage: Ceph vs. Gluster vs. …
• Big data (map/reduce): Hadoop Distributed File System (HDFS), QFS, …
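To make the POSIX vs. REST distinction concrete, a sketch (the bucket name and paths are hypothetical; the S3 line assumes the AWS CLI is installed and configured):
  cp results.dat /gpfs/project/results.dat            # POSIX: an ordinary filesystem call through the kernel
  aws s3 cp results.dat s3://my-bucket/results.dat    # object store: an HTTP PUT against a REST API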
Future
• Hybridization of storage
  • Connecting different storage systems
  • Seamless migration between storage solutions (object store <-> object store, POSIX <-> object)
• Ethernet-connected drives
  • Seagate’s Kinetic interface
  • HGST’s Open Ethernet drive
• YAC (Yet Another Cache)
  • Intel Cache Acceleration Software
  • DDN Infinite Memory Engine
  • IBM FlashCache
BONUS material: a little bit of Benchmarking
• Use real user applications when possible!
• “dd” … quick & easy.
• “iozone” – great for single/multi-node read/write performance
• “Bonnie++” – simple to run, but a comprehensive suite of tests
• “zcav” – a good test for spinning hard disks, where speed is a function of distance from the first sector.
dd
• Never run as root (destructive if used incorrectly)!
• Reads from an input file (“if”) and writes to an output file (“of”)
• You don’t need to use real files…
  • can read from devices, e.g. /dev/zero, /dev/random, etc.
  • can write to /dev/null
• Caching can be misleading… prefer direct I/O (oflag=direct)
Example:
$ dd if=/dev/zero of=./test.dd bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 37.8403 s, 28.4 MB/s
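A matching read test (a sketch; it re-reads the file written above, again bypassing the page cache):
$ dd if=./test.dd of=/dev/null bs=1G iflag=direct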
iozone
• Common test utility for read/write performance
• Great for both single-node and multi-node testing (i.e. aggregate performance)
• Sensitive to caching; use “-I” for direct I/O
• Can run multithreaded (-t)
• Simple ‘auto’ mode:
  iozone -a
• Or pick tests using ‘-i’:
  -i 0: read/re-read
  -i 1: write/rewrite
# iozone -i 0 -i 1 -+n -r 1M -s 1G -t 16 -I
…
Throughput test with 16 processes
Each process writes a 1048576 kByte file in 1024 kByte records

Children see throughput for 16 initial writers = 1582953.29 kB/sec
Parent sees throughput for 16 initial writers  = 1542978.62 kB/sec
Min throughput per process                     =   97130.07 kB/sec
Max throughput per process                     =  100058.16 kB/sec
Avg throughput per process                     =   98934.58 kB/sec
Min xfer                                       = 1019904.00 kB

Children see throughput for 16 readers         = 1393510.91 kB/sec
Parent sees throughput for 16 readers          = 1392664.11 kB/sec
Min throughput per process                     =   84657.09 kB/sec
Max throughput per process                     =   88483.99 kB/sec
Avg throughput per process                     =   87094.43 kB/sec
Min xfer                                       = 1003520.00 kB
iozone multi-node testing
• Great for testing HPC storage “peak” aggregate performance
• The network becomes a significant contributor
• Requires a “hostfile” with: host, test_dir, iozone_path. E.g.:
  iw-h34-17 /gpfs/pace1/ddn /usr/bin/iozone
  iw-h34-18 /gpfs/pace1/ddn /usr/bin/iozone
  iw-h34-19 /gpfs/pace1/ddn /usr/bin/iozone
• Fire away!
  iozone -i 0 -i 1 -+n -e -r 128k -s <file_size> -t <num_threads> -+m <hostfile>
  -i  : tests (0: read/re-read, 1: write/re-write)
  -+n : no retests selected
  -e  : include flush/fflush in timing calculations
  -r  : record (block) size in KB
Bonnie++
• Comprehensive set of tests:
  • Create files in sequential order
  • Stat files in sequential order
  • Delete files in sequential order
  • Create files in random order
  • Stat files in random order
  • Delete files in random order (Wikipedia)
• Just ‘cd’ to a directory on the filesystem, then run ‘bonnie++’
• Uses 2x client memory (by default) to avoid caching effects
• Reports performance (K/sec, higher is better) and the CPU used to perform the operations (%CP, lower is better)
• Highly configurable, check its man page!
Bonnie++
$ bonnie++
Writing with putc()...done
…
Delete files in random order...done.
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
atlas-6.pace. 7648M 39329  74 235328  34  2599   2 37794  67 37943   4  46.9   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   740   6   930   2    89   0   282   2   340   1   212   1
atlas-6.pace.gatech.edu,7648M,39329,74,235328,34,2599,2,37794,67,37943,4,46.9,0,16,740,6,930,2,89,0,282,2,340,1,212,1
zcav
• Part of the Bonnie++ suite
• “Constant Angular Velocity (CAV)” tests for spinning media
• I/O performance differs with the distance of the heads from the center of the spinning platter (the first sector)
• Not meaningful for network-attached storage
• SSD runs can be interesting (you expect to see a flat line, but…)
(Charts: SATA disk example from http://www.coker.com.au/bonnie++/zcav/results.html; SSD example from a GT machine.)
zcav
Example:
$ zcav -f /dev/sda
#loops: 1, version: 1.03e
#block offset (GiB), MiB/s, time
0.00 115.75 2.212
0.25 95.93 2.669
0.50 114.63 2.233
0.75 119.14 2.149
…

What’s going on here??
$ zcav -f /dev/sda
#loops: 1, version: 1.03e
#block offset (GiB), MiB/s, time
#0.00 ++++ 0.092
#0.25 ++++ 0.094
#0.50 ++++ 0.091
…
When you run the same example twice, you see super fast “cached” results! Here’s how you flush I/O cache:
sync && echo 3 > /proc/sys/vm/drop_caches
The End. (for real this time)
Thanks for participating!