www.panasas.com
Parallel NFS (pNFS)
Garth Gibson
CTO & Co-founder, Panasas; Assoc. Professor, Carnegie Mellon University
www.panasas.com
Panasas Parallel Storage:
A Little Background
Slide 3 March 21, 2007 Panasas, Inc.
Panasas Storage Cluster Architecture
• Object Based
  • OSD/iSCSI at core of DF
• DirectFLOW client S/W
  • Red Hat, SUSE, Fedora, all patch-free
  • POSIX extensions for weak consistency & layout controls
• DirectorBlades
  • DF metadata mgmt & HA
  • Quorum consensus state
  • Legacy clustered NAS: NFS and CIFS
• StorageBlades
  • Wide striping & smart prefetching & caching
  • RAID across blades, inside each file independently
Slide 4 March 21, 2007 Panasas, Inc.
Bladeserver of Simple PC Components
Slide 5 March 21, 2007 Panasas, Inc.
Scalable Performance
Scalable data throughput AND random I/O
Slide 6 March 21, 2007 Panasas, Inc.
Petascale Bound, Traveling with LANL
Roadrunner (1.4 PFLOPS) RFP specifies Panasas Storage Cluster
RH Linux, IBM System x3755 & BladeCenter H with Cell B.E.
Completion and acceptance anticipated in 2008
Slide 7 March 21, 2007 Panasas, Inc.
Shared, Scalable-BW Storage
[Diagram: site backbone linking an archive, past- and current-generation ASCI platforms, disk-rich supercomputers, disk-poor and diskless clients, an object archive, file system gateways, a job queue, scalable OBFS clusters, and an enterprise global FS]
[Chart: application performance vs. year ('96-2003): computing speed (FLOP/s, 10^11 to 10^14), parallel I/O (0.8 to 800 GB/s), network speed (Gbit/s), memory (0.05 to 50 TB), disk (2.5 to 2,500 TB), archival storage, and metadata inserts/sec (50 to 1,000)]
Balanced System Approach
LINEAR: ~1 GB/s per TFLOPS
> 1 PB Panasas in many clusters
GM: 5600 nodes, 11,000+ procs (Lightning, Bolt, Pink, TLC, Flash, Gordon)
IB: 1856 nodes, 3700+ procs (Blue Steel, Coyote, & soon Roadrunner)
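The balanced-system rule of thumb above (~1 GB/s of parallel I/O per TFLOPS) is easy to apply; the helper below is a hypothetical sizing sketch, not a Panasas tool:

```python
def balanced_io_gbps(tflops, gbps_per_tflops=1.0):
    """Estimate parallel I/O bandwidth (GB/s) for a machine of the
    given compute speed, per the ~1 GB/s-per-TFLOPS rule of thumb."""
    return tflops * gbps_per_tflops

print(balanced_io_gbps(70))    # 70.0 GB/s for a 70 TF machine
print(balanced_io_gbps(1400))  # 1400.0 GB/s for 1.4 PFLOPS
```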
Slide 8 March 21, 2007 Panasas, Inc.
HPC Innovators choose PanFS
Oil & Gas, Life Sciences, Government, Fluid Dynamics, Finance
& dozens more
www.panasas.com
Parallel NFS (pNFS)
Slide 10 March 21, 2007 Panasas, Inc.
Today’s Ubiquitous NFS
ADVANTAGES
Familiar, stable & reliable
Widely supported by vendors
Competitive market
[Diagram: clients on a host net, file servers on a storage net with disk arrays, each exported sub-file system behind one server]
LIMITATION
Client moves all data and metadata for a sub-file system through one network endpoint (server)
Slide 11 March 21, 2007 Panasas, Inc.
Today’s Ubiquitous NFS Doesn’t Scale
ADVANTAGES
Familiar, stable & reliable
Widely supported by vendors
Competitive market
DISADVANTAGES
Capacity doesn’t scale
Bandwidth doesn’t scale
“Cluster” by customer-exposed namespace partitioning
[Diagram: clients on a host net, file servers on a storage net with disk arrays, each exported sub-file system behind one server]
Slide 12 March 21, 2007 Panasas, Inc.
Scale Out File Service w/ Out-of-Band
Client sees many storage addresses, accesses in parallel
Zero file servers in the data path allows high bandwidth through scalable networking
A.K.A. SAN file systems and parallel file systems
NOT NFS
Clients
File Servers
Storage
Slide 13 March 21, 2007 Panasas, Inc.
Out-of-Band Interoperability Issues
[Diagram: clients running Vendor X software accessing file servers and storage out-of-band]
ADVANTAGES
Capacity scaling
Faster bandwidth scaling
DISADVANTAGES
Requires client kernel addition
Many non-interoperable solutions
Not necessarily able to replace NFS
Slide 14 March 21, 2007 Panasas, Inc.
Scale Out: Cluster NFS Servers (1)
Bind many file servers into single system image with forwarding
Mount point binding less relevant, allows DNS-style balancing, more manageable
[Diagram: clients on a host net, a file server cluster forwarding requests internally, disk arrays on a storage net]
Slide 15 March 21, 2007 Panasas, Inc.
Scale Out: Cluster NFS Servers (2)
Single server does all data transfer in single system image
Servers share access to all storage and “hand off” role of accessing storage
Control and data traverse mount point path (in band) passing through one server
Typically built on top of a SAN file system or parallel file system
[Diagram: clients on a host net, a file server cluster sharing disk arrays on a storage net]
Slide 16 March 21, 2007 Panasas, Inc.
Out-of-Band Standards?
[Diagram: clients accessing file servers and storage out-of-band]
ADVANTAGES
Capacity scaling
Faster bandwidth scaling
Work to be done
Get widespread agreement on semantics
Build multiple reference implementations
Test interoperability constantly
Compete on SW, server implementations
Standard platforms embed client code into all releases
Slide 17 March 21, 2007 Panasas, Inc.
Parallel NFS: Scalability for Mainstream
IETF NFSv4.1
draft-ietf-nfsv4-minorversion1-10.txt
Includes pNFS, sessions/RDMA,
directory delegations
U.Mich/CITI implementing Linux client/server
Three (or more) flavors of
out-of-band metadata attributes:
FILES: NFS/ONCRPC/TCP/IP/GE
for files built on subfiles
NetApp, Sun, IBM, U.Mich/CITI
BLOCKS: SBC/FCP/FC or SBC/iSCSI
for files built on blocks
EMC, IBM (-pnfs-blocks-03.txt)
OBJECTS: OSD/iSCSI/TCP/IP/GE
for files built on objects
Panasas (-pnfs-obj-03.txt)
[Diagram: Client Apps, pNFS IFS with a layout driver (1. SBC/blocks, 2. OSD/objects, 3. NFS/files), pNFS server over a local filesystem; NFSv4 extended with orthogonal layout metadata attributes; layout metadata grant & revoke]
Slide 18 March 21, 2007 Panasas, Inc.
Highlights of the History of pNFS
Conversations with Gary Grider, LANL, & Lee Ward, Sandia, 2003
How to make HPC investment in High Performance File Systems persistent
Workshop on NFS Extensions for Parallel Storage, Dec 2003, Ann Arbor
Chaired by Peter Honeyman, CITI/U.Mich., & Garth Gibson, CMU
Initial problem statement, operations proposal to IETF July & Nov 2004
Garth Gibson, Panasas, Peter Corbett, NetApp, Brent Welch, Panasas
Standards development team in action
Andy Adamson, CITI/U.Mich, David Black, EMC, Garth Goodson, NetApp,
Tom Pisek, Sun, Benny Halevy, Panasas, Dave Noveck, NetApp, Spencer
Shepler, Sun, Brian Pawlowski, NetApp, Marc Eshel, IBM, & many others
Dean Hildebrand, CITI/U.Mich, with Lee Ward, did first prototype & paper
IETF working group folded it into NFSv4.1 minorversion draft in 2006
www.ietf.org/html.charters/nfsv4-charter.html
Slide 19 March 21, 2007 Panasas, Inc.
SC06 High-Performance NFS Panelists
Garth Gibson, CTO, Panasas Inc, & Prof., Carnegie Mellon Univ.
Committed to pNFS products; prototype bandwidth near DirectFlow speeds
Mike Kazar, VP & Chief Architect, Network Appliance
GX is “NAS switch”; pNFS eliminates switch overhead, faster single clients
Paul Rutherford, Sr. Director, SW Engineering, Isilon
High performance NFS = clustered NFSv3 + NFSv4 & pNFS
Uday Gupta, CTO, NAS, EMC
MPFSi will be standardized as pNFS
Roger Haskin, Sr. Manager, File Systems, IBM
Prototyping Linux pNFS on top of GPFS (GPFS client is pNFS data server)
Weakness of pNFS: file level delegation, NFS consistency semantics
Slide 20 March 21, 2007 Panasas, Inc.
pNFS Protocol Operations
LAYOUTGET
(filehandle, type, byte range) -> type-specific layout
LAYOUTRETURN
(filehandle, range) -> releases state for client
LAYOUTCOMMIT
(filehandle, byte range, updated attributes, layout-specific info) -> data visible to other clients
Timestamps, end-of-file attributes updated
CB_LAYOUTRECALL
Server tells the client to stop using a layout
CB_RECALLABLE_OBJ_AVAIL
Delegation available for a file that was not previously available
GETDEVICEINFO, GETDEVICELIST
Map deviceID in layout to type-specific addressing information
[Diagram: Client Apps, pNFS IFS with a layout driver (1. SBC/blocks, 2. OSD/objects, 3. NFS/files), pNFS server over a local filesystem; NFSv4 extended with orthogonal layout metadata attributes; layout metadata grant & revoke]
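One way to picture how the three layout flavors coexist: the layout body returned by LAYOUTGET is opaque to generic client code and is dispatched to a type-specific layout driver. The sketch below uses hypothetical function names, not the CITI code; the layout type numbers follow the values later fixed in NFSv4.1 standardization (files = 1, objects = 2, blocks = 3).

```python
# Layout type numbers as later standardized for NFSv4.1.
LAYOUT4_NFSV4_1_FILES = 1  # files: I/O is NFS to data servers
LAYOUT4_OSD2_OBJECTS  = 2  # objects: OSD commands over iSCSI
LAYOUT4_BLOCK_VOLUME  = 3  # blocks: SBC over FC or iSCSI

drivers = {}  # layout type -> driver callable

def register_driver(layout_type, driver):
    drivers[layout_type] = driver

def handle_layoutget_reply(layout_type, opaque_body):
    # Generic code never parses the body; only the matching
    # layout driver understands its type-specific encoding.
    return drivers[layout_type](opaque_body)

register_driver(LAYOUT4_OSD2_OBJECTS, lambda body: ("osd-io", len(body)))
print(handle_layoutget_reply(LAYOUT4_OSD2_OBJECTS, b"\x00\x01"))  # ('osd-io', 2)
```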
Slide 21 March 21, 2007 Panasas, Inc.
pNFS READ
LOOKUP+OPEN client to NFS server,
returns file handle and state ids
LAYOUTGET client to NFS server,
returns layout
READ client to storage devices, many
reads in parallel
LAYOUTRETURN client to NFS server
Clients can cache layouts for use with
multiple READ and multiple
LOOKUP+OPEN instances
Server uses CB_LAYOUTRECALL
when the layout is no longer valid
[Diagram: Linux compute cluster issuing parallel READs to storage devices over data paths, with control paths to the metadata manager]
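The read sequence above can be sketched from the client's perspective; all names here (`pnfs_read`, the server and device objects) are hypothetical stand-ins, not the Linux client API. The point is that metadata operations go to one NFS server while READs fan out to the storage devices in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def pnfs_read(path, stripes, nfs_server, devices):
    fh, stateid = nfs_server.lookup_open(path)     # LOOKUP + OPEN
    layout = nfs_server.layoutget(fh, "read")      # LAYOUTGET; the layout
    with ThreadPoolExecutor() as pool:             # maps stripe -> device
        parts = list(pool.map(                     # READs run in parallel
            lambda s: devices[layout[s]].read(fh, s), range(stripes)))
    nfs_server.layoutreturn(fh)                    # LAYOUTRETURN
    return b"".join(parts)
```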
Slide 22 March 21, 2007 Panasas, Inc.
pNFS WRITE
LOOKUP+OPEN client to NFS server,
returns file handle and state ids
LAYOUTGET client to NFS server,
returns layout
WRITE client to storage devices
LAYOUTCOMMIT client to NFS server
“publishes” write
LAYOUTRETURN client to NFS server
Server may restrict byte range of write
layout to reduce allocation overheads,
avoid quota limits, etc.
[Diagram: Linux compute cluster issuing parallel WRITEs to storage devices over data paths, with control paths to the metadata manager]
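A matching client-side sketch of the write sequence (hypothetical names and stubs, not the Linux client API): the extra step versus reading is LAYOUTCOMMIT, which publishes the new end-of-file and timestamps so other clients see the data.

```python
def pnfs_write(path, offset, data, nfs_server, devices):
    fh, stateid = nfs_server.lookup_open(path)      # LOOKUP + OPEN
    layout = nfs_server.layoutget(fh, "write")      # server may grant a
    dev = devices[layout["device"]]                 # narrower byte range
    dev.write(fh, offset, data)                     # WRITE to storage
    nfs_server.layoutcommit(fh, offset, len(data))  # publish EOF, mtime
    nfs_server.layoutreturn(fh)                     # LAYOUTRETURN
    return len(data)
```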
Slide 23 March 21, 2007 Panasas, Inc.
IETF NFSv4 pushing pNFS
NFSv4.1 closing; obsoletes v3 & v4
pNFS is principal; objects & blocks included w/ files
Proposed standard is product threshold
Friday was first-round final comments deadline
Good momentum behind this
Slide 24 March 21, 2007 Panasas, Inc.
Current Status
pNFS is part of the IETF NFSv4 minor version 1 standard draft
Bi-weekly conference calls to drive the 4.1 draft to closure for Sept '07 completion
Reference open source client done at CITI
CITI owns NFSv4 Linux client and server
Weekly pNFS implementers' conference call
Callbacks not yet running; failure cases not sufficiently explored
Participants:
CITI (Files over PVFS2, Files over NFSv4)
Netapp, Sun (Files over NFSv4)
IBM (Files, based on GPFS)
EMC (Blocks, based on HighRoad/MPFS)
Panasas (Objects, based on Panasas ActiveScale Storage Cluster OSDs)
Slide 25 March 21, 2007 Panasas, Inc.
References
“NFS Version 4 Minor Version 1”
draft-ietf-nfsv4-minorversion1-10.txt, draft-ietf-nfsv4-pnfs-obj-03.txt
http://www.ietf.org/html.charters/nfsv4-charter.html
“pNFS Problem Statement”
Garth Gibson (Panasas), Peter Corbett (Netapp), Internet-draft, July 2004
http://www.pdl.cmu.edu/pNFS/archive/gibson-pnfs-problem-statement.html
“NFSv4 pNFS Extensions”
G. Goodson (Netapp), B. Welch, B. Halevy (Panasas), D. Black (EMC), A. Adamson (CITI), Internet-draft, October 2005
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-pnfs-00.txt
“Linux pNFS Kernel Development”
CITI, University of Michigan
http://www.citi.umich.edu/projects/asci/pnfs/linux/
Slide 26 March 21, 2007 Panasas, Inc.
Panasas pNFS strategy
Slide 27 March 21, 2007 Panasas, Inc.
Multiprotocol DF & Clustered NAS
Slide 28 March 21, 2007 Panasas, Inc.
Multiprotocol DF, pNFS & Clustered NAS
CIFS & NFSv3 clustered, + NFSv4.1 (pNFS metadata service)
DirectFlow servers, + NFSv4.1 (pNFS object data servers)
Futureproof DF investment:
- pNFS is a software upgrade
- concurrent use of DF & pNFS
Slide 29 March 21, 2007 Panasas, Inc.
Panasas Committed to pNFS
Promising preliminary results
Built on U.Mich/CITI Linux client/server code base
Layer NFSv4.1 server on DirectFlow/PanFS MDS
Many parts of pNFS solution not yet done
Iozone -c -e -r448k -s 5g -t #clients
[Diagram: Client Apps, pNFS IFS with objects layout driver, iSCSI OSD (objects); pNFS server layered on the PanFS MDS; NFSv4 extended with orthogonal layout metadata attributes; 2 shelves (18 OSDs)]
www.panasas.com
Thank You!
Slide 31 March 21, 2007 Panasas, Inc.
PETASCALE DATA STORAGE INSTITUTE
SciDAC university-led center of excellence
3 universities, 5 labs, 2006-2011, G. Gibson, PI
PDSI: scientific computing storage issues
www.pdl.cmu.edu/PDSI
Community building: workshops, tutorials, courses
APIs & standards: incubate, prototype, validate
Failure data collection, analysis & publication
Performance trace collection & benchmark publication
IT automation applied to HEC systems & problems
Novel mechanisms for core HEC storage problems
Slide 32 March 21, 2007 Panasas, Inc.
PDSI Example Projects:
HEC POSIX Extensions
HEC Failure Data Collection
Slide 33 March 21, 2007 Panasas, Inc.
Disk drive failure data sources
Data set   Source               Type of drive                  Count    Duration
HPC1       Supercomputing X     18GB & 36GB 10K RPM SCSI       3,400    5 yrs
HPC2                            36GB 10K RPM SCSI              520      2.5 yrs
HPC3                            15K RPM SCSI; 7.2K RPM SATA    14,208   1 yr
HPC4       Various HPCs         250GB, 400GB & 500GB SATA      13,634   3 yrs
COM1       Internet services Y  10K RPM SCSI                   26,734   1 month
COM2                            15K RPM SCSI                   39,039   1.5 yrs
COM3                            10K RPM FC-AL                  3,700    1 yr
Slide 34 March 21, 2007 Panasas, Inc.
Annual replacement rate (ARR) in the field
Datasheet MTTFs are 1,000,000 to 1,500,000 hours.
=> Expected annual replacement rate (ARR): 0.58 - 0.88 %.
Observed average ARR = 3.0%, i.e., MTT-replace of only 300K-400K hours
No evidence that SATA disks exhibit higher replacement rates than SCSI or FC
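The expected ARR figures follow from a standard steady-state identity, ARR = hours-per-year / MTTF; a quick check (plain arithmetic, not vendor data):

```python
HOURS_PER_YEAR = 8760

def arr_percent(mttf_hours):
    """Expected annual replacement rate (%) for a steady-state fleet."""
    return 100 * HOURS_PER_YEAR / mttf_hours

print(round(arr_percent(1_500_000), 2))  # 0.58  (datasheet high end)
print(round(arr_percent(1_000_000), 2))  # 0.88  (datasheet low end)
# An observed ARR of ~3% inverts to an MTT-replace near 300K hours:
print(round(HOURS_PER_YEAR / 0.03))      # 292000
```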
Slide 35 March 21, 2007 Panasas, Inc.
LANL Petascale (Red) FY08
[Diagram: site-wide shared global parallel file system, 650-1500 TB, 50-200 GB/s, spanning lanes
- Lane switches, 6 x 105 GB/s = 630 GB/s; lane passthru switches connect legacy Lightning/Bolt
- NFS, complex and other network services, WAN, archive; FTAs at N gb/s
- Road Runner Base: 70TF, 144-node units, 12 I/O nodes/unit, 4-socket dual-core AMD nodes, 32 GB mem/node, full fat tree, 14 units; acceleration to 1 PF sustained
- 64 I/O nodes, 2 x 1 gbit links each: 16 GB/sec
- 96 I/O nodes, 2 x 1 gbit links each: 24 GB/sec
- 156 I/O nodes, 1 x 10 gbit link each: 195 GB/sec, planned for Accelerated Road Runner, 1 PF sustained
- Bolt: 20TF, 2-socket single/dual-core AMD, 256-node units, reduced fat tree, 1920 nodes, Myrinet
- Lightning: 14TF, 2-socket single-core AMD, 256-node units, full fat tree, 1608 nodes, Myrinet
- IB 4X fat tree; 20 gb/s per link; 4 gb/s per 5-8 TB; compute units and I/O units with failover]
If more bandwidth is needed, we just add more lanes and more storage; this scales nearly linearly, not as N^2 like a fat-tree Storage Area Network.
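The quoted aggregates can be checked with plain arithmetic, assuming 1 GB/s = 8 Gbit/s (`agg_gbytes_per_s` is a hypothetical helper, not LANL tooling):

```python
def agg_gbytes_per_s(nodes, links_per_node, gbit_per_link):
    """Aggregate bandwidth in GB/s from per-node link counts."""
    return nodes * links_per_node * gbit_per_link / 8

print(6 * 105)                       # 630 GB/s across 6 lane switches
print(agg_gbytes_per_s(64, 2, 1))    # 16.0 GB/s, Roadrunner base I/O
print(agg_gbytes_per_s(96, 2, 1))    # 24.0 GB/s
print(agg_gbytes_per_s(156, 1, 10))  # 195.0 GB/s, accelerated config
```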
www.panasas.com