
www.panasas.com

Go Faster. Go Parallel.

Panasas Parallel File System

Brent Welch

Director of Software Architecture, Panasas Inc.

October 16, 2008

HPC User Forum

Slide 2 | October 2008 IDC, Panasas, Inc.

Outline

Panasas background

Storage cluster background, hardware and software

Technical Topics

High Availability

Scalable RAID

pNFS

Brent’s garden, California

Slide 3 | October 2008 IDC, Panasas, Inc.

Panasas Company Overview

Founded: 1999, by Prof. Garth Gibson, co-inventor of RAID

Technology: Parallel file system and parallel storage appliance

Locations: US: HQ in Fremont, CA; R&D centers in Pittsburgh & Minneapolis; EMEA: UK, DE, FR, IT, ES, BE, Russia; APAC: China, Japan, Korea, India, Australia

Customers: FCS October 2003, deployed at 200+ customers

Market Focus: Energy, academia, government, life sciences, manufacturing, finance

Alliances: ISVs and resellers

Primary Investors

Slide 4 | October 2008 IDC, Panasas, Inc.

The New ActiveStor Product Line

Application segments: Design, Modeling and Visualization; Simulation and Analysis; Tiered QOS Storage; Backup/Secondary

ActiveStor 6000: 20, 15 or 10 TB shelves • 20 GB cache/shelf • Integrated 10GigE • 600 MB/sec/shelf • Tiered Parity • ActiveImage • ActiveGuard • ActiveMirror

ActiveScale 3.2

ActiveStor 4000: 20 or 15 TB shelves • 5 GB cache/shelf • Integrated 10GigE • 600 MB/sec/shelf • Tiered Parity • Options: ActiveImage, ActiveGuard, ActiveMirror

ActiveStor 200: 104 TB (3 DBs x 52 SBs) • 5 shelves (20U, 35") • Single 1GigE port/shelf • 350 MB/sec aggregate • 20 GB aggregate cache • Tiered Parity

Slide 5 | October 2008 IDC, Panasas, Inc.

Leaders in HPC choose Panasas

Slide 6 | October 2008 IDC, Panasas, Inc.

US DOE: Panasas selected for Roadrunner – top of the Top 500. The LANL $130M system will deliver 2x the performance of the current top BG/L.

SciDAC: Panasas CTO selected to lead the Petascale Data Storage Institute. CTO Gibson leads PDSI, launched Sep 06, leveraging experience from PDSI members: LBNL/NERSC, LANL, ORNL, PNNL, Sandia NL, and UCSC.

Aerospace: Airframes and engines, both commercial and defense. Boeing HPC file system; major engine manufacturer; top 3 U.S. defense contractors.

Formula 1: HPC file system for the top 2 clusters – 3 teams in total. Top clusters at Renault F1 and BMW Sauber; Ferrari also on Panasas.

Intel: Certifies Panasas storage for a broad range of HPC applications, now ICR. Intel uses Panasas storage for EDA design and in its HPC benchmark center.

SC07: Six Panasas customers won awards at SC07 (Reno) conference

Validation: Extensive recognition and awards for HPC breakthroughs

Panasas Leadership Role in HPC

Slide 7 | October 2008 IDC, Panasas, Inc.

Panasas Joint Alliance Investments with ISVs

Panasas ISV Alliances (application area → vertical focus)

Seismic Processing; Interpretation; Reservoir Modeling → Energy

Computational Fluid Dynamics (CFD); Computational Structural Mechanics (CSM) → Manufacturing; Government; Higher Ed and Research; Energy

Electronic Design Automation (EDA) → Semiconductor

Trading; Derivatives Pricing; Risk Analysis → Financial

Slide 8 | October 2008 IDC, Panasas, Inc.

Intel Use of Panasas in the IC Design Flow

Panasas critical to the tape-out stage

Slide 9 | October 2008 IDC, Panasas, Inc.

University of Cambridge HPC Service, Darwin Supercomputer

Darwin Supercomputer computational units: nine repeating units, each consisting of 64 nodes (2 racks) providing 256 cores, 2340 cores in total

All nodes within a CU are connected to a full-bisection-bandwidth InfiniBand fabric: 900 MB/s, MPI latency of 2 µs

Description of Darwin at U of Cambridge

Source: http://www.hpc.cam.ac.uk

Slide 10 | October 2008 IDC, Panasas, Inc.

Details of the FLUENT 111M Cell Model

Number of cells: 111,091,452

Solver: PBNS, DES, unsteady

Iterations: 5 time steps, 100 total iterations; data save after the last iteration

Output size:
FLUENT v6.3 (serial I/O; size of .dat file): 14,808 MB
FLUENT v12 (serial I/O; size of .dat file): 16,145 MB
FLUENT v12 (parallel I/O; size of .pdat file): 19,683 MB

Unsteady external aero for a 111M cell truck; 5 time steps with 100 iterations, and a single .dat file write

Univ of Cambridge DARWIN Cluster. Location: University of Cambridge, http://www.hpc.cam.ac.uk

Vendor: Dell; 585 nodes; 2340 cores; 8 GB per node; 4.6 TB total memory

CPU: Intel Woodcrest DC, 3.0 GHz / 4 MB L2 cache

Interconnect: InfiniPath QLE7140 SDR HCAs; SilverStorm 9080 and 9240 switches

File System: Panasas PanFS, 4 shelves, 20 TB capacity

Operating System: Scientific Linux CERN SLC release 4.6

DARWIN: 585 nodes; 2340 cores. Panasas: 4 shelves, 20 TB. Truck: 111M cells.

Slide 11 | October 2008 IDC, Panasas, Inc.

FLUENT Comparison of PanFS vs. NFS on the University of Cambridge Cluster (Truck Aero, 111M cells)

Scalability of Solver + Data File Write: time of solver + data file write in seconds vs. number of cores (lower is better)

Cores   PanFS, FLUENT 12   NFS, FLUENT 6.3
64      2541               4568
128     1318               2680
256      779               1790

Chart scaling annotations: 1.9x and 1.7x per step for PanFS, 1.7x and 1.5x for NFS.

NOTE: Read times are not included in these results.

Slide 12 | October 2008 IDC, Panasas, Inc.

FLUENT Comparison of PanFS vs. NFS on the University of Cambridge Cluster (Truck Aero, 111M cells)

Performance of Data File Write: effective rates of I/O (MB/s) for the data file write vs. number of cores (higher is better)

Cores   PanFS, FLUENT 12 (MB/s)
64      273
128     345
256     334

Chart annotations: 39x, 31x, and 20x advantage over NFS, FLUENT 6.3.

NOTE: Data file write only.

Slide 13 | October 2008 IDC, Panasas, Inc.

Panasas Architecture

Cluster technology provides scalable capacity and performance: capacity scales symmetrically with processor, caching, and network bandwidth

Scalable performance with commodity parts provides excellent price/performance

Object-based storage provides additional scalability and security advantages over block-based SAN file systems

Automatic management of storage resources to balance load across the cluster

Shared file system (POSIX) with the advantages of NAS, with direct-to-storage performance advantages of DAS and SAN

(Diagram: Disk, CPU, Memory, Network)

Slide 14 | October 2008 IDC, Panasas, Inc.

Panasas bladeserver building block

(Diagram) The Shelf: StorageBlades and a DirectorBlade plug into a midplane, with battery, embedded switch, power supplies, and rails; 11 slots for blades; 4U high. (Garth Gibson, July 2008)

Slide 15 | October 2008 IDC, Panasas, Inc.

Panasas Blade Hardware

(Diagram) DirectorBlade and StorageBlade modules with an integrated GE switch. Shelf front: 1 DB, 10 SB. Shelf rear: battery module (2 power units). The midplane routes GE and power.

Slide 16 | October 2008 IDC, Panasas, Inc.

Panasas Product Advantages

Proven implementation with appliance-like ease of use/deployment

Running mission-critical workloads at global F500 companies

Scalable performance with Object-based RAID

No degradation as the storage system scales in size

Unmatched RAID rebuild rates – parallel reconstruction

Unique data integrity features

Vertical parity on drives to mitigate media errors and silent corruptions

Per-file RAID provides scalable rebuild and per-file fault isolation

Network verified parity for end-to-end data verification at the client

Scalable system size with integrated cluster management

Storage clusters scaling to 1000+ storage nodes, 100+ metadata managers

Simultaneous access from over 12000 servers

Slide 17 | October 2008 IDC, Panasas, Inc.

Linear Performance Scaling

Breakthrough data throughput AND random I/O

Performance and scalability for all workloads

Slide 18 | October 2008 IDC, Panasas, Inc.

Scaling Clients

IOzone Multi-Shelf Performance Test (4 GB sequential write/read)

(Chart: aggregate throughput in MB/sec, 0 to 3,000, vs. number of clients, 10 to 200, with write and read curves for 1-, 2-, 4-, and 8-shelf systems)

Slide 19 | October 2008 IDC, Panasas, Inc.

Proven Panasas Scalability

Storage Cluster Sizes Today (e.g.)

Boeing, 50 DirectorBlades, 500 StorageBlades in one system. (plus 25 DirectorBlades and 250 StorageBlades each in two other smaller systems.)

LANL RoadRunner. 100 DirectorBlades, 1000 StorageBlades in one system today, planning to increase to 144 shelves next year.

Intel has 5,000 active DF clients against 10-shelf systems, with even more clients mounting DirectorBlades via NFS. They have qualified a 12,000 client version of 2.3, and will deploy “lots” of compute nodes against 3.2 later this year.

BP uses 200 StorageBlade storage pools as their building block

LLNL, two realms, each 60 DirectorBlades (NFS) and 160 StorageBlades

Most customers run systems in the 100 to 200 blade size range

Slide 20 | October 2008 IDC, Panasas, Inc.

Emphasis on Data Integrity

Horizontal Parity

Per-file, Object-based RAID across OSD

Scalable on-line performance

Scalable parallel RAID rebuild

Vertical Parity

Detect and eliminate unreadable sectors and silent data corruption

RAID at the sector level within a drive / OSD

Network Parity

Client verifies per-file parity equation during reads

Provides the only truly end-to-end data integrity solution available today

Many other reliability features…

Media scanning, metadata fail over, network multi-pathing, active hardware monitors, robust cluster management

Slide 21 | October 2008 IDC, Panasas, Inc.

High Availability

Quorum based cluster management

3 or 5 cluster managers to avoid split brain

Replicated system state

Cluster manager controls the blades and all other services

High performance file system metadata fail over

Primary-backup relationship controlled by cluster manager

Low latency log replication to protect journals

Client-aware fail over for application-transparency

NFS level fail over via IP takeover

Virtual NFS servers migrate among DirectorBlade modules

Lock services (lockd/statd) fully integrated with fail over system

Slide 22 | October 2008 IDC, Panasas, Inc.

Technology Review

Turn-key deployment and automatic resource configuration

Scalable Object RAID

Very fast RAID rebuild

Vertical Parity to trap silent corruptions

Network parity for end-to-end data verification

Distributed system platform with quorum-based fault tolerance

Coarse grain metadata clustering

Metadata fail over

Automatic capacity load leveling

Storage Clusters scaling to ~1000 nodes today

Compute clusters scaling to 12,000 nodes today

Blade-based hardware with 1Gb/sec building block

Bigger building block going forward

Slide 23 | October 2008 IDC, Panasas, Inc.

The pNFS Standard

The pNFS standard defines the NFSv4.1 protocol extensions between the server and client

The I/O protocol between the client and storage is specified elsewhere, for example:

SCSI Block Commands (SBC) over Fibre Channel (FC)

SCSI Object-based Storage Device (OSD) over iSCSI

Network File System (NFS)

The control protocol between the server and storage devices is also specified elsewhere, for example:

SCSI Object-based Storage Device (OSD) over iSCSI

(Diagram: client, NFSv4.1 server, and storage)

Slide 24 | October 2008 IDC, Panasas, Inc.

Key pNFS Participants

Panasas (Objects)

Network Appliance (Files over NFSv4)

IBM (Files, based on GPFS)

EMC (Blocks, HighRoad MPFSi)

Sun (Files over NFSv4)

U of Michigan/CITI (Files over PVFS2)

Slide 25 | October 2008 IDC, Panasas, Inc.

pNFS Status

pNFS is part of the IETF NFSv4 minor version 1 standard draft

Working group is passing draft up to IETF area directors, expect RFC later in ’08

Prototype interoperability continues: San Jose Connect-a-thon March ’06, February ’07, May ‘08

Ann Arbor NFS Bake-a-thon September ’06, October ’07

Dallas pNFS inter-op, June ’07, Austin February ’08, (Sept ’08)

Availability: TBD – gated behind NFSv4 adoption and working implementations of pNFS

Patch sets to be submitted to Linux NFS maintainer starting “soon”

Vendor announcements in 2008

Early adopters in 2009

Production ready in 2010

www.panasas.com

Go Faster. Go Parallel.

Questions?

Thank you for your time!

Slide 27 | October 2008 IDC, Panasas, Inc.

Deep Dive: Reliability

High Availability

Cluster Management

Data Integrity

Slide 28 | October 2008 IDC, Panasas, Inc.

Vertical Parity

“RAID” within an individual drive

Seamless recovery from media errors by applying RAID schemes across disk sectors

Repairs media defects by writing through to spare sectors

Detects silent corruptions and prevents reading wrong data

Independent of horizontal array-based parity schemes
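The mechanism can be pictured as ordinary XOR parity applied to small groups of sectors inside a single drive. A minimal sketch, assuming a 512-byte sector and XOR parity (the real OSDFS on-disk layout is not described here):

    # Hedged sketch of sector-level ("vertical") parity inside one drive.
    # Sector size and XOR parity are illustrative assumptions.
    SECTOR_SIZE = 512

    def xor_parity(sectors):
        """XOR a group of equal-sized sectors into one parity sector."""
        parity = bytearray(SECTOR_SIZE)
        for sector in sectors:
            for i, b in enumerate(sector):
                parity[i] ^= b
        return bytes(parity)

    def scrub_group(data_sectors, stored_parity):
        """Media scan check: detect silent corruption in one sector group."""
        return xor_parity(data_sectors) == stored_parity

    def recover_sector(data_sectors, bad_index, stored_parity):
        """Rebuild one unreadable sector from the survivors plus parity;
        the repaired data can then be written through to a spare sector."""
        survivors = [s for i, s in enumerate(data_sectors) if i != bad_index]
        return xor_parity(survivors + [stored_parity])

Because the check is local to the drive, a media error is repaired without invoking the horizontal (cross-blade) RAID at all.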


Slide 29 | October 2008 IDC, Panasas, Inc.

Network Parity

(Diagram: horizontal parity, vertical parity, and network parity layers)

Extends parity capability across the data path to the client or server node

Enables End-to-End data integrity validation

Protects from errors introduced by disks, firmware, server hardware, server software, network components and transmission

Client either receives valid data or an error notification
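In other words, the client re-evaluates the same parity equation used for horizontal RAID, but over the bytes it actually received. A hedged sketch, assuming an XOR (RAID-5 style) parity stripe; this is not the PanFS wire protocol:

    from functools import reduce

    def xor_blocks(blocks):
        """XOR equal-length byte blocks together."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    def verified_read(data_units, parity_unit):
        """Return the stripe's data only if the parity equation still holds
        after the blocks have crossed the network to the client."""
        if xor_blocks(data_units) != parity_unit:
            raise IOError("network parity mismatch: corruption between storage and client")
        return b"".join(data_units)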

Slide 30 | October 2008 IDC, Panasas, Inc.

Panasas Scalability

Two Layer Architecture

Division between these two layers is an important separation of concerns

Platform maintains a robust system model and provides overall control

File system is an application layered over the distributed system platform

Automation in the distributed system platform helps the system adapt to failures without a lot of hand-holding by administrators

The file system uses protocols optimized for performance, and relies on the platform to provide robust protocols for failure handling

(Diagram layers, top to bottom: Applications, Parallel File System, Distributed System Platform, Hardware)

Slide 31 | October 2008 IDC, Panasas, Inc.

Model-based Platform

The distributed system platform maintains a model of the system

what are the basic configuration settings (networking, etc.)

which storage and manager nodes are in the system

where services are running

what errors and faults are present in the system

what recovery actions are in progress

Model-based approach mandatory to reduce administrator overhead

Automatic discovery of resources

Automatic reaction to faults

Automatic capacity balancing

Proactive hardware monitoring

(Diagram: manual vs. automatic management)
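The kind of state such a model tracks might look roughly like the following; the field names are hypothetical, not the actual PanFS cluster-manager schema:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class NodeState:
        blade_id: str
        role: str                  # "storage" or "director"
        services: List[str]        # e.g. ["metadata-manager", "nfs"]
        faults: List[str] = field(default_factory=list)

    @dataclass
    class SystemModel:
        network_config: dict                # basic settings (IPs, gateways, ...)
        nodes: List[NodeState]
        recoveries_in_progress: List[str]   # e.g. ["rebuild:bladeset1"]

        def react_to_fault(self, blade_id: str, fault: str) -> str:
            """Automatic reaction: record the fault and schedule a recovery."""
            for node in self.nodes:
                if node.blade_id == blade_id:
                    node.faults.append(fault)
            action = f"failover:{blade_id}"
            self.recoveries_in_progress.append(action)
            return action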

Slide 32 | October 2008 IDC, Panasas, Inc.

Quorum-based Cluster Management

The system model is replicated on 3 or 5 (or more) nodes

Maintained via Lamport’s PTP (Paxos) quorum voting protocol

PTP handles quorum membership change, brings partitioned members back up to date, provides basic transaction model

Each member keeps the model in a local database that is updated within a PTP transaction

7 or 14 msec update cost, which is dominated by 1 or 2 synchronous disk IOs

This robust mechanism is not on the critical path of any file system operation

Yes: change admin password (involves cluster manager)

Yes: configure quota tree

Yes: initiate service fail over

No: open close read write etc. (does not involve cluster manager)
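A hedged sketch of the split between the two paths: configuration changes commit through the replicated quorum, while file I/O never touches it. The majority-vote logic below is heavily simplified and is not the real PTP protocol, which also handles membership change and catch-up:

    class ClusterManagerReplica:
        """One of the 3 or 5 cluster managers holding the replicated model."""
        def __init__(self, name):
            self.name = name
            self.local_db = {}            # local copy of the system model

        def accept(self, key, value):
            self.local_db[key] = value    # would be a synchronous disk write
            return True                   # ack the transaction

    class Quorum:
        def __init__(self, replicas):
            assert len(replicas) % 2 == 1, "use 3 or 5 managers to avoid split brain"
            self.replicas = replicas

        def commit(self, key, value):
            """Configuration change: needs a majority of managers to ack."""
            acks = sum(r.accept(key, value) for r in self.replicas)
            if acks <= len(self.replicas) // 2:
                raise RuntimeError("no quorum; update rejected")

    # Control path (milliseconds): goes through the quorum.
    managers = Quorum([ClusterManagerReplica(f"cm{i}") for i in range(3)])
    managers.commit("admin_password", "new-secret")

    # Data path: open/close/read/write go directly to the StorageBlades
    # and never involve the cluster manager.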

Slide 33 | October 2008 IDC, Panasas, Inc.

File Metadata

PanFS metadata manager stores metadata in object attributes

All component objects have simple attributes like their capacity, length, and security tag

Two component objects store replica of file-level attributes, e.g. file-length, owner, ACL, parent pointer

Directories contain hints about where components of a file are stored

There is no database on the metadata manager

Just transaction logs that are replicated to a backup via a low-latency network protocol
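A rough picture of where those attributes live; the names below are illustrative and do not match the actual OSD attribute pages or PanFS internals:

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class FileAttributes:
        """File-level attributes, replicated on exactly two component objects."""
        file_length: int
        owner: str
        acl: str
        parent_pointer: str

    @dataclass
    class ComponentObject:
        """One stripe unit's object on a StorageBlade."""
        capacity: int
        length: int
        security_tag: bytes
        file_attrs: Optional[FileAttributes] = None

    def file_level_attrs(components: List[ComponentObject]) -> FileAttributes:
        """The metadata manager can read the file attributes from either replica."""
        return next(c.file_attrs for c in components if c.file_attrs is not None)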

Slide 34 | October 2008 IDC, Panasas, Inc.

File Metadata

File ownership divided along file subtree boundaries (“volumes”)

Multiple metadata managers, each own one or more volumes

Match with the quota-tree abstraction used by the administrator

Creating volumes creates more units of meta-data work

Primary-backup failover model for MDS

90 µs remote log update over 1GE, vs. 2 µs local in-memory log update

Some customers introduce lots of volumes

One volume per DirectorBlade module is ideal

Some customers are stubborn and stick with just one

E.g., 75 TB and 144 StorageBlade modules and a single MDS

Slide 35 | October 2008 IDC, Panasas, Inc.

Scalability over time

Software baseline

Single software product, but with branches corresponding to major releases that come out every year (or so)

Today most customers are on 3.0.x, which had a two year lifetime

We are just introducing 3.2, which has been in QA and beta for 7 months

There is forward development on newer features

Our goal is a major release each year, with a small number of maintenance releases in between major releases

Compatibility is key

New versions upgrade cleanly over old versions

Old clients communicate with new servers, and vice versa

Old hardware is compatible with new software

Integrated data migration to newer hardware platforms

Slide 36 | October 2008 IDC, Panasas, Inc.

Deep Dive: Networking

Scalable networking infrastructure

Integration with compute cluster fabrics

Slide 37 | October 2008 IDC, Panasas, Inc.

LANL Systems

Turquoise Unclassified

Pink 10TF - TLC 1TF - Coyote 20TF (compute)

1 Lane PaScalBB ~ 100 GB/sec (network)

68 Panasas Shelves, 412 TB, 24 GBytes/sec (storage)

Unclassified Yellow

Flash/Gordon 11TF - Yellow Roadrunner Base 5TF – (4 RRp3 CU’s 300 TF)

2 Lane PaScalBB ~ 200 GB/sec

24 Panasas Shelves - 144 TB - 10 GBytes/sec

Classified Red

Lightning/Bolt 35TF - Roadrunner Base 70TF (Accelerated 1.37PF)

6 Lane PaScalBB ~ 600 GB/sec

144 Panasas Shelves - 936 TB - 50 GBytes/sec

Slide 38 | October 2008 IDC, Panasas, Inc.

IB and other network fabrics

Panasas is a TCP/IP, GE-based storage product

Universal deployment, Universal routability

Commodity price curve

Panasas customers use IB, Myrinet, Quadrics, …

Cluster interconnect du jour for performance, not necessarily cost

IO routers connect cluster fabric to GE backbone

Analogous to an “IO node”, but just does TCP/IP routing (no storage)

Robust connectivity through IP multipath routing

Scalable throughput at approx. 650 MB/sec per IO router (PCIe class)

Working on a 1GB/sec IO router

IB-GE switching platforms

QLogic or Voltaire switch provides wire-speed bridging

Slide 39 | October 2008 IDC, Panasas, Inc.

Petascale Red Infrastructure Diagram with Roadrunner Accelerated, FY08

(Diagram) NFS and other network services, WAN, secure core switches, and archive connect alongside the compute systems. Lightning/Bolt (35 TF), Roadrunner Phase 1 (70 TF), and Roadrunner Phase 3 (1.026 PF) compute units (CUs) and IO units reach the site-wide shared global parallel file system (4 GE per 5-8 TB) through IB 4X fat tree and Myrinet fabrics via 1GE and 10GE IO nodes and FTAs (NxGE, Nx10GE), with fail over across lanes. Scalable to 600 GB/sec before adding lanes.

Slide 40 | October 2008 IDC, Panasas, Inc.

LANL Petascale (Red) FY08

(Diagram) NFS complex and other network services, WAN, and archive sit alongside the compute systems and storage:

Lane switches: 6 x 105 = 630 GB/s; IB 4X fat tree and Myrinet fabrics; lane pass-through switches connect legacy Lightning/Bolt

Site-wide shared global parallel file system: 650-1500 TB, 50-200 GB/s (spans lanes); 4 Gb/s per 5-8 TB; FTAs at N Gb/s; 20 Gb/s per link

Roadrunner Base: 70 TF; 144-node units with 12 I/O nodes per unit; 4-socket dual-core AMD nodes with 32 GB memory per node; full fat tree; 14 units; acceleration to 1 PF sustained

I/O nodes: 64 nodes with 2 x 1 Gbit links each (16 GB/sec); 96 nodes with 2 x 1 Gbit links each (24 GB/sec); 156 nodes with one 10 Gbit link each (195 GB/sec), planned for Accelerated Roadrunner at 1 PF sustained

Bolt: 20 TF; 2-socket single/dual-core AMD; 256-node units; reduced fat tree; 1920 nodes

Lightning: 14 TF; 2-socket single-core AMD; 256-node units; full fat tree; 1608 nodes

Compute units (CUs) and IO units with fail over across lanes

If more bandwidth is needed we just add more lanes and more storage; this scales nearly linearly, not N² like a fat-tree Storage Area Network.

Slide 41 | October 2008 IDC, Panasas, Inc.

Multi-Cluster sharing: scalable BW with fail over

(Diagram) Compute clusters A, B, and C, each with its own I/O nodes, connect through layer 2 switches to shared Panasas storage, to the site network, and to services such as NFS, DNS, KRB, and archive. Colors in the figure depict subnets.

Slide 42 | October 2008 IDC, Panasas, Inc.

Deep Dive: Scalable RAID

Per-file RAID

Scalable RAID rebuild

Slide 43 | October 2008 IDC, Panasas, Inc.

Automatic per-file RAID

System assigns RAID level based on file size

<= 64 KB RAID 1 for efficient space allocation

> 64 KB RAID 5 for optimum system performance

> 1 GB two-level RAID-5 for scalable performance

RAID-1 and RAID-10 for optimized small writes

Automatic transition from RAID 1 to 5 without re-striping

Programmatic control for application-specific layout optimizations

Create with layout hint

Inherit layout from parent directory

(Diagram: small file → RAID 1 mirroring; large file → RAID 5 striping; very large file → 2-level RAID 5 striping)

Clients are responsible for writing data and its parity
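As a sketch, the size thresholds above amount to a simple decision rule; the layout-hint handling is an assumption about how the programmatic control might look, not the actual PanFS API:

    KB = 1024
    GB = 1024 ** 3

    def choose_raid_layout(file_size, layout_hint=None, parent_dir_hint=None):
        """Pick a per-file RAID layout from the size-based policy on this slide.

        An explicit layout hint (set at create time or inherited from the
        parent directory) overrides the default.
        """
        if layout_hint or parent_dir_hint:
            return layout_hint or parent_dir_hint
        if file_size <= 64 * KB:
            return "RAID 1"            # mirroring: space-efficient small files
        if file_size <= 1 * GB:
            return "RAID 5"            # striping with parity
        return "2-level RAID 5"        # stripe of RAID 5 groups for very large files

A file that starts out mirrored transitions to RAID 5 as it grows past 64 KB, without re-striping the data already written.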

Slide 44 | October 2008 IDC, Panasas, Inc.


Declustered RAID

Files are striped across component objects on different StorageBlades

Component objects include file data and file parity for reconstruction

File attributes are replicated with two component objects

Declustered, randomized placement distributes RAID workload

(Diagram: a 2-shelf BladeSet holding mirrored or 9-OSD parity stripes; when one OSD fails, rebuild reads about half of each surviving OSD and writes a little to each OSD, so it scales linearly)
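A hedged sketch of randomized, declustered placement; a plain random sample stands in for the real placement policy, which also balances capacity and availability:

    import random

    def place_stripe(num_osds, stripe_width):
        """Choose which OSDs (StorageBlades) hold one file's component objects.

        Because every file lands on a different random subset, the
        reconstruction work for a failed OSD is spread over the whole pool.
        """
        return random.sample(range(num_osds), stripe_width)

    # Example: in a 2-shelf BladeSet of 20 OSDs, each file's 9-OSD parity
    # stripe lands on a different subset.
    placements = {f"file{i}": place_stripe(20, 9) for i in range(4)}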

Slide 45 | October 2008 IDC, Panasas, Inc.

Scalable RAID Rebuild

Shorter repair time in larger storage pools

Customers report 30 minute rebuilds for 800GB in 40+ shelf blade set

Variability at 12 shelves due to uneven utilization of DirectorBlade modules

Larger numbers of smaller files was better

Reduced rebuild at 8 and 10 shelves because of wider parity stripe

Rebuild bandwidth is the rate at which data is regenerated (writes)

Overall system throughput is N times higher because of the necessary reads

Use multiple “RAID engines” (DirectorBlades) to rebuild files in parallel

Declustering spreads disk I/O over more disk arms (StorageBlades)

0

20

40

60

80

100

120

140

0 2 4 6 8 10 12 14

# Shelves

One Volume, 1G Files

One Volume, 100MB Files

N Volumes, 1GB Files

N Volumes, 100MB Files

Rebuild MB/sec

Slide 46 | October 2008 IDC, Panasas, Inc.

Scalable rebuild is mandatory

Having more drives increases risk, just like having more light bulbs increases the odds one will be burnt out at any given time

Larger storage pools must mitigate their risk by decreasing repair times

The math says: if (e.g.) 100 drives are in 10 RAID sets of 10 drives each, and each RAID set has a rebuild time of N hours, then the risk is the same for a single RAID set of 100 drives only if the rebuild time is N/10
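A back-of-envelope version of that argument, assuming independent drive failures at rate λ: data loss requires a second failure in the same RAID set while a repair is in progress, so the exposure per failure scales with the set size times the repair time.

    % Exposure to a second failure during one repair window:
    \[
      E_{\text{10-drive set}} \approx 9\,\lambda\,N
      \qquad
      E_{\text{100-drive pool}} \approx 99\,\lambda\,T
    \]
    % Equal exposure requires T \approx 9N/99, i.e. roughly N/10, and only a
    % rebuild rate that grows with the size of the pool can deliver that.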

Block-based RAID scales in the wrong direction for this to work: bigger RAID sets repair more slowly because more data must be read

Only declustering provides scalable rebuild rates

(Diagram axes: total number of drives, drives per RAID set, repair time)

Slide 47 | October 2008 IDC, Panasas, Inc.

Deep Dive: pNFS

Standards-based parallel file systems: NFSv4.1

Slide 48 | October 2008 IDC, Panasas, Inc.

pNFS: Standard Storage Clusters

pNFS is an extension to the Network File System v4 protocol standard

Allows for parallel and direct access from Parallel Network File System clients to storage devices over multiple storage protocols

Moves the Network File System server out of the data path

(Diagram: pNFS clients exchange metadata and control with the NFSv4.1 server and move data directly to block (FC), object (OSD), or file (NFS) storage)

Slide 49 | October 2008 IDC, Panasas, Inc.

pNFS Layouts

Client gets a layout from the NFS Server

The layout maps the file onto storage devices and addresses

The client uses the layout to perform direct I/O to storage

At any time the server can recall the layout

Client commits changes and returns the layout when it’s done

pNFS is optional, the client can always use regular NFSv4 I/O

(Diagram: the NFSv4.1 server grants a layout to clients, which then perform I/O directly to storage)
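The client-side flow can be sketched roughly as below. The function names are hypothetical stand-ins; the corresponding NFSv4.1 operations are LAYOUTGET, LAYOUTCOMMIT, and LAYOUTRETURN, plus the CB_LAYOUTRECALL callback from the server:

    def pnfs_write(server, storage_devices, path, offset, data):
        """Hypothetical sketch of a pNFS client writing through a layout."""
        layout = server.layoutget(path, offset, len(data))       # map file -> devices
        if layout is None:
            return server.nfs_write(path, offset, data)          # fall back to plain NFSv4 I/O
        for device_id, dev_offset, chunk in layout.map(offset, data):
            storage_devices[device_id].write(dev_offset, chunk)  # direct, parallel I/O
        server.layoutcommit(path, layout)                        # publish new size/mtime
        server.layoutreturn(path, layout)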

Slide 50 | October 2008 IDC, Panasas, Inc.

pNFS Client

Common client for different storage back ends

Wider availability across operating systems

Fewer support issues for storage vendors

(Diagram: client apps sit above the pNFS client and its layout driver; the pNFS client speaks NFSv4.1 to the pNFS server and its cluster filesystem, including layout metadata grant & revoke. Layout driver back ends: 1. SBC (blocks), 2. OSD (objects), 3. NFS (files), 4. PVFS2 (files), 5. future back ends…)
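One way to picture the common client with pluggable back ends is a small dispatch table keyed by layout type; the enum names below come from the NFSv4.1 specification, but the dispatch itself is only an illustration:

    LAYOUT_DRIVERS = {
        "LAYOUT4_NFSV4_1_FILES": "files layout driver (NFS data servers)",
        "LAYOUT4_OSD2_OBJECTS":  "objects layout driver (iSCSI/OSD targets)",
        "LAYOUT4_BLOCK_VOLUME":  "blocks layout driver (FC/iSCSI LUNs)",
    }

    def driver_for(layout_type):
        """The common pNFS client picks a layout driver per layout type."""
        try:
            return LAYOUT_DRIVERS[layout_type]
        except KeyError:
            raise ValueError(f"no layout driver for {layout_type}")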

Slide 51 | October 2008 IDC, Panasas, Inc.

pNFS is not…

Improved cache consistency

NFS has open-to-close consistency enforced by client polling of attributes

NFSv4.1 directory delegations can reduce polling overhead

Perfect POSIX semantics in a distributed file system

NFS semantics are good enough (or, all we’ll give you)

But note also the POSIX High End Computing Extensions Working Group

http://www.opengroup.org/platform/hecewg/

Clustered metadata

Not a server-to-server protocol for scaling metadata

But, it doesn’t preclude such a mechanism

Slide 52 | October 2008 IDC, Panasas, Inc.

Prototype pNFS Performance

(Chart: pNFS IOzone throughput with 8 clients, a 1+10 SB1000 system, and 5 GB files; MB/s from 0 to 450 vs. number of clients from 1 to 8, for write, re-write, read, and re-read)

Slide 53 | October 2008 IDC, Panasas, Inc.

Prototype pNFS Performance

(Chart: pNFS IOzone throughput with 8 clients, a 4+18 system, and 5 GB files; MB/s from 0 to 700 vs. number of clients from 1 to 8, for write, re-write, read, and re-read)

Slide 54 | October 2008 IDC, Panasas, Inc.

Is pNFS Enough?

Standard for out-of-band metadata: a great start toward avoiding the classic server bottleneck

NFS has already relaxed some semantics to favor performance

But there are certainly some workloads that will still hurt

Standard framework for clients of different storage backends: Files (Sun, NetApp, IBM/GPFS, GFS2)

Objects (Panasas, Sun?)

Blocks (EMC, LSI)

PVFS2

Your project… (e.g., dcache.org)

Slide 55 | October 2008 IDC, Panasas, Inc.

Storage Cluster Components

StorageBlade

Processor, Memory, 2 NICs, 2 spindles

Object storage system

Block management

DirectorBlade

Processor, Memory, 2 NICs

Distributed file system

File and Object management

Cluster management

NFS/CIFS re-export

Integrated hardware/software solution

11 blades per 4u shelf

Today: 1 to 100-shelf systems

Tomorrow: 1 to 500-shelf systems

Object-based Clustered File System

Smart, Commodity Hardware

Panasas ActiveScale Storage Cluster

Slide 56 | October 2008 IDC, Panasas, Inc.

Internal cluster management makes a large collection of blades work as a single system

(Diagram) Out-of-band architecture with direct, parallel paths from clients to storage nodes: up to 12,000 compute nodes run the PanFS client (or mount via NFS/CIFS), 100+ manager nodes run the SysMgr and NFS/CIFS re-export services and are reached via RPC, and 1,000+ storage nodes run OSDFS and are accessed over iSCSI/OSD.

Slide 57 | October 2008 IDC, Panasas, Inc.

Panasas Global Storage Model

(Diagram) Client nodes reach Panasas System A and Panasas System B over a TCP/IP network. System A contains BladeSet 1 (VolX, VolY, VolZ) and BladeSet 2 (delta, home); System B contains BladeSet 3 (VolL, VolM, VolN). A BladeSet is the physical storage pool, a volume is the logical quota tree, and the system's DNS name anchors the global name, e.g. /panfs/sysa/delta/file2 and /panfs/sysb/volm/proj38.