© 2010 IBM Corporation
IBM Power Systems Technical University
October 18–22, 2010 — Las Vegas, NV
IBM General Parallel File System – an Overview
Session ID: ST04
Glen Corneau, IBM Advanced Technical Skills
Agenda
Overview and Architecture
– Performance, Security and RAS
– Supported Environments
– Information Lifecycle Management
– Disaster Recovery
– Usage with Oracle RAC
Reference
– FAQ
– Administrative Commands
– Debugging
– Additional sources of Information
IBM General Parallel File System (GPFS™)
IBM General Parallel File System (GPFS) is a scalable, high-performance file management infrastructure for AIX®, Linux® and Windows systems.
A highly available cluster architecture.
Concurrent shared disk access to a single global namespace.
Capabilities for high performance parallel workloads.
[Diagram: File Data Infrastructure Optimization. Application servers, backup/archive, databases and file servers connect to GPFS over SAN, TCP/IP or InfiniBand; availability through data migration, replication and backup; management through centralized monitoring and automated file management.]
IBM GPFS is designed to enable:
A single global namespace across platforms.
High performance common storage.
Eliminating copies of data.
Improved storage use.
Simplified file management.
Basic GPFS Cluster with Shared SAN
All features are included: snapshots, replication and multi-site connectivity are all part of the GPFS license. There are no add-on license keys beyond the client and server licenses, so you get all of the features up front.
Storage area network (SAN)
– Can be a dedicated SAN, vSCSI or NPIV
– Each node accesses storage at “direct” speeds
– Simultaneous LUN access
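As an illustration of how a basic shared-SAN cluster like this might be brought up, here is a hedged sketch using GPFS 3.x-era commands. Node names, disk names and the colon-separated NSD descriptor format are assumptions, and the exact mmcrfs syntax varies slightly across releases, so check the Administration Guide for your level.

# Create a two-node cluster; ssh/scp as remote shell and copy commands
mmcrcluster -N "nodeA:quorum-manager,nodeB:quorum" -p nodeA -r /usr/bin/ssh -R /usr/bin/scp
mmchlicense server --accept -N nodeA,nodeB

# Define NSDs from SAN LUNs (descriptor: DiskName:ServerList::DiskUsage:FailureGroup:DesiredName:StoragePool)
cat > /tmp/disks.txt <<EOF
hdisk2:::dataAndMetadata:1:nsd1:
hdisk3:::dataAndMetadata:2:nsd2:
EOF
mmcrnsd -F /tmp/disks.txt

# Create and mount a file system on those NSDs (mmcrnsd rewrites the file with the NSD names)
mmcrfs fs1 -F /tmp/disks.txt -T /gpfs/fs1 -B 256K -A yes
mmmount fs1 -a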
Why?
Enable virtually seamless multi-site operations
Reduce costs for data administration
Provide flexibility of file system access
Establish highly scalable and reliable data storage
Future-proof by supporting mixed technologies
Network-based block input and output (I/O)
Application data access on network-attached nodes is exactly the same as on a storage area network (SAN) attached node. General Parallel File System (GPFS™) transparently sends the block-level I/O request over a TCP/IP network.
Any node can add direct attachment for greater throughput.
[Diagram: network shared disk (NSD) clients access GPFS over the local area network (LAN); NSD servers attach to the storage over the SAN.]
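A hedged sketch of how this LAN-attached access is typically configured: LUNs seen only by the NSD server nodes are defined with a server list (a comma-separated list of servers per NSD in GPFS 3.3 and later, as I recall), and clients reach them over TCP/IP. Node and disk names are illustrative.

# Descriptor: DiskName:ServerList::DiskUsage:FailureGroup:DesiredName:StoragePool
cat > /tmp/netdisks.txt <<EOF
hdisk10:nsdsrv1,nsdsrv2::dataAndMetadata:1:nsd10:
hdisk11:nsdsrv2,nsdsrv1::dataAndMetadata:2:nsd11:
EOF
mmcrnsd -F /tmp/netdisks.txt

# Clients with no SAN path do block I/O through nsdsrv1/nsdsrv2; a client that
# later gains a direct SAN path will use it automatically.
mmlsnsd -M        # show how each node currently reaches each NSD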
Why?
Tie together multiple sets of data into a single namespace
Allow multiple application groups to share portions or all data
Help enable security-rich, highly available data sharing that’s also high performance
Multi-clustering can expand access and allow greater security
[Diagram: two GPFS clusters, each with its own local area network (LAN) and storage area network (SAN), linked by the network shared disk protocol to create an enterprise-wide global namespace.]
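A hedged sketch of the commands GPFS provides for this kind of cross-cluster mount; cluster names, contact nodes, key file paths and the file system name are illustrative assumptions.

# On each cluster: generate a public key and set a cipher list for cluster authentication
mmauth genkey new
mmchconfig cipherList=AUTHONLY

# On the cluster that owns the file system (home.example.com):
mmauth add remote.example.com -k /tmp/remote_id_rsa.pub
mmauth grant remote.example.com -f fs1

# On the accessing cluster (remote.example.com):
mmremotecluster add home.example.com -n node1,node2 -k /tmp/home_id_rsa.pub
mmremotefs add rfs1 -f fs1 -C home.example.com -T /gpfs/rfs1
mmmount rfs1 -a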
File system configuration and performance data
General Parallel File System (GPFS™) already runs at data sizes most companies will not reach for a few years.
File system specifications
2^64 files per file system
256 file systems
Maximum file system size: 2^99 bytes
Maximum file size equals file system size
Production file systems of 4 PB
Disk input and output:
IBM AIX® 134 GB/sec
Linux® 66 GB/sec
Number of nodes: 1 to 8192
Extreme capacity and scale
Supported server hardware
General Parallel File System (GPFS™) for IBM POWER Systems™ is supported on both IBM AIX® and Linux®. GPFS for Power is supported on multiple IBM POWER platforms:
– Power Systems, System p™, BladeCenter servers, IBM Blue Gene®, IBM System p®
General Parallel File System (GPFS™) for x86 Architecture™ is supported on both Linux® and Windows Server 2008. GPFS for x86 Architecture is supported on multiple x86 and AMD compatible systems:
– IBM Intelligent Cluster, IBM iDataPlex®, IBM System x® rack-optimized servers, IBM BladeCenter® servers, non-IBM x86 and AMD compatible servers
Operating system, application and solution support
Operating systems
– IBM AIX®
– Linux®: Red Hat, SUSE® Linux Enterprise Server, VMware ESX Server
– Windows® Server 2008
Independent software vendor support includes:
– IBM DB2®, Oracle, SAP/Business Objects, SAS, Ab Initio, Informatica, SAF
IBM Solutions
– IBM Smart Business Storage Cloud
– IBM Information Archive
– IBM Smart Analytics Optimizer
– DB2 pureScale
– IBM TotalStorage Virtual Tape Server
– IBM SAP BI Accelerator
– IBM Scale Out NAS
Supported storage hardware
In addition to IBM storage, IBM General Parallel File System (GPFS™) supports storage hardware from these vendors:
GPFS supports many storage systems, and the IBM support team can help customers using storage hardware solutions not on this list of tested devices.
EMC
Hitachi
Hewlett Packard
DDN
Flexible monitoring and management
Simple Network Management Protocol
Customized monitoring
– User-defined exit scripts
– Password-enabled IBM General Parallel File System (GPFS™) commands
Policy-based backup
– IBM Tivoli® Storage Manager incremental forever
– Tivoli Storage Manager parallel backup
– Policy engine can quickly create lists of files for third-party backup software
Dynamic operations include adding/removing nodes, growing/shrinking file systems, adding inodes to file system, moving node roles, restriping and more.
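For example, those dynamic operations map onto ordinary administrative commands that run while the file system stays mounted; names are illustrative and exact flag spellings (for instance the inode-limit option) vary slightly by GPFS release, so verify against the mmchfs man page.

mmaddnode -N newnode1                 # add a node to the cluster
mmchlicense client --accept -N newnode1
mmadddisk fs1 -F /tmp/newdisk.txt     # grow the file system with a new NSD
mmchfs fs1 --inode-limit 2000000      # raise the maximum number of inodes
mmchmgr fs1 nodeB                     # move the file system manager role
mmrestripefs fs1 -b                   # rebalance data across all disks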
Configurable file system management and availability
GPFS can be configured with a smaller number of “trusted” nodes that have remote shell capabilities to all other nodes (admin nodes)
– Remote shell can be prompted or non-prompted
– Can utilize ssh-agent in this configuration
Rolling upgrades are supported
If a node providing a GPFS management function fails, an alternative node automatically assumes responsibility, preventing loss of file system access.
The GPFS daemon will attempt to reconnect when socket connections are broken before starting recovery procedures
With GPFS for Linux, a Clustered NFSv3 solution is available to provide highly-available NFS serving of data in GPFS file systems to non-GPFS clients
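A hedged sketch of what enabling clustered NFS typically involves; the shared-root path, node names and service IP addresses are assumptions, so consult the GPFS clustered NFS documentation for the exact procedure on your release.

# Designate a small GPFS directory to hold shared CNFS state
mmchconfig cnfsSharedRoot=/gpfs/fs1/.cnfs

# Assign a failover NFS service IP address to each CNFS node
mmchnode --cnfs-interface=10.1.1.101 -N nfsnode1
mmchnode --cnfs-interface=10.1.1.102 -N nfsnode2

# Export the GPFS file system through the normal /etc/exports mechanism on those nodes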
Licensing
GPFS Server license (performs mgmt functions or exports data)
– GPFS management roles: quorum node, file system manager, cluster configuration manager, NSD server
– Exports data through applications such as NFS, CIFS, FTP or HTTP
– Data access local or via network
GPFS client license (no mgmt functions, local consumer)
– Important: Data access local or via network
Per PVU on x86 for Linux, Windows
Per core on Power Systems for AIX, Linux
MultiSystem Offering available (1,000+ nodes)
OEM and IP licenses available
Information Lifecycle Management
The information lifecycle management (ILM) toolset includes:
–Disk storage pools
–Filesets (named subdirectories)
–External storage pools
–High-performance metadata processing via the policy engine
[Diagram: GPFS clients running applications access a GPFS file system (volume group) over the GPFS RPC protocol and the storage network; a GPFS manager node provides cluster, lock, quota, allocation and policy management; the file system spans a system pool and data pools (gold, silver, pewter).]
Policy-based storage management provides:
– Placement
– Management (movement, removal)
– Backups and IBM Hierarchical Storage Management (HSM) operations integrated with Tivoli Storage Manager (TSM) and High Performance Storage System (HPSS)
– Reporting
Examples of policy rules (see the sketch below):
– Place new files on fast, reliable storage; move files as they age to slower storage, then to tape
– Place media files on video-friendly storage (fast, smooth), other files on cheaper storage
– Place related files together, e.g. for failure containment
ILM policy processing: 1) scan files, 2) apply rules, 3) perform file operations.
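A hedged sketch of what such rules look like in the GPFS policy language; pool names, file-name patterns and thresholds are placeholders.

/* placement: new media files on the fast pool, everything else on the default pool */
RULE 'media' SET POOL 'gold' WHERE LOWER(NAME) LIKE '%.mpg' OR LOWER(NAME) LIKE '%.avi'
RULE 'default' SET POOL 'silver'

/* management: migrate aging files off the gold pool once it passes 85% full, stopping at 70%;
   delete scratch files not accessed for a year */
RULE 'age' MIGRATE FROM POOL 'gold' THRESHOLD(85,70) TO POOL 'pewter'
     WHERE CURRENT_TIMESTAMP - ACCESS_TIME > INTERVAL '30' DAYS
RULE 'purge' DELETE WHERE PATH_NAME LIKE '/gpfs/fs1/scratch/%'
     AND CURRENT_TIMESTAMP - ACCESS_TIME > INTERVAL '365' DAYS

Placement rules are installed with mmchpolicy fs1 <policyfile>; management rules are evaluated with mmapplypolicy fs1 -P <policyfile>, which scans the files, applies the rules and performs the resulting operations.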
ILM Integration with Hierarchical Storage Management
Policy-managed disk-to-tape migration
The idea: Integrate GPFS policies with Hierarchical Storage Management (IBM Tivoli® Storage Manager)
The advantages:
– Integration of disk-to-disk and disk-to-tape into fully tiered storage
– Finer control of data movement across storage tiers
– More efficient scans and data movement using internal GPFS functions
– Possibility of coalescing small disk files into large tape files
GPFS 3.3 introduced support for IBM Tivoli Storage Manager incremental forever using high performance metadata scan interface.
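A hedged sketch of how a tape tier is expressed as an external pool in the same policy language; the interface script path is a placeholder for the TSM/HSM script shipped with GPFS, so treat it as an assumption.

/* define TSM HSM as an external pool driven by an interface script (path is illustrative) */
RULE EXTERNAL POOL 'hsm' EXEC '/var/mmfs/etc/mmpolicyExec-hsm' OPTS '-v'

/* migrate cold files from the slowest disk pool out to tape */
RULE 'totape' MIGRATE FROM POOL 'pewter' THRESHOLD(90,75) TO POOL 'hsm'
     WHERE CURRENT_TIMESTAMP - ACCESS_TIME > INTERVAL '180' DAYS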
Disaster Recovery – Active/Active Cluster
One geographically dispersed cluster (“stretched” cluster)
– All nodes in either site have SAN/NSD access to the disk
– Site A storage is duplicated in site B with GPFS replication
– Simple recovery actions in case of site failure (more involved if you lose tiebreaker site as well)
Performance implication: GPFS has no knowledge of a replica's physical locality; there is no way to specify disk access priority (e.g. local storage first).
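A hedged sketch of the replication side of this setup: disks at each site are placed in different failure groups and the file system is created with two data and metadata replicas. Disk, NSD and device names are illustrative.

# Failure group 1 = site A storage, failure group 2 = site B storage
cat > /tmp/dr_disks.txt <<EOF
hdiskA1:::dataAndMetadata:1:nsdA1:
hdiskA2:::dataAndMetadata:1:nsdA2:
hdiskB1:::dataAndMetadata:2:nsdB1:
hdiskB2:::dataAndMetadata:2:nsdB2:
EOF
mmcrnsd -F /tmp/dr_disks.txt

# Two copies of data and metadata, one per failure group (i.e. one per site)
mmcrfs drfs -F /tmp/dr_disks.txt -T /gpfs/drfs -m 2 -M 2 -r 2 -R 2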
DR – Active/Passive using Storage Replication
Uses automated commands to keep file system definitions in sync between sites.
Storage subsystem replication keeps the LUNs in sync
Failover requires the production site to be down (i.e. GPFS daemons must be shut down if it is not a total site failure)
More involved configuration and failover than in the case of GPFS replication
DR – Active/Active with Storage Replication
Same node layout as active/active using GPFS replication.
Same disk layout as Active/Passive with Storage Replication
– DiskA is SAN-attached and accessible from sites A & B
– DiskB is SAN-attached and accessible from site B only
– LUN synchronization relationship from diskA to diskB
– Consistency groups should be defined over all logical subsystems at the primary site.
Failover involves disengaging diskA access from site B (either via SAN or via a GPFS user exit)
Disaster Recovery – Point-in-time Storage Copy
Can be used to make off-site point-in-time copy of the LUNs that comprise a GPFS filesystem
Requires temporary suspension of the primary GPFS volumes when initiating storage copy commands (this flushes all buffers/cache to disk for a consistent on-disk file system image), similar to JFS2 freeze/thaw.
Can be used for availability (DR backup) or for other purposes (spinning off copies for slow “to tape” backups, additional data analysis, etc.).
Can have a pseudo-Active/Active configuration with second (or third or more) site live at the same time as primary site.
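The suspension described above maps to the mmfsctl command; a brief sketch, with the storage-side copy step shown only as a placeholder comment because it depends on the subsystem in use.

mmfsctl fs1 suspend       # flush buffers and quiesce the file system for a consistent image
# ... trigger the point-in-time copy on the storage subsystem (FlashCopy or equivalent) ...
mmfsctl fs1 resume        # resume normal I/O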
GPFS With Oracle RAC
Oracle RAC detects the usage of GPFS for its database files and will open them in Direct I/O mode. This bypasses GPFS cache (pagepool) for DB files, but it is still used for other files.
Obtain a copy of My Oracle Support Articles
– 282036.1, entitled “Minimum Software Versions and Patches Required to Support Oracle Products on IBM Power Systems”
– 302806.1, entitled “IBM General Parallel File System (GPFS) and Oracle RAC on AIX 5L and IBM eServer pSeries” [older, not current]
GPFS Versions 3.1, 3.2 are certified on AIX 5.3/6.1 with Oracle RAC
– For exact OS/GPFS/RAC combinations, check the first article above
– vSCSI (with or without NPIV) is supported, again, check the article above for details.
GPFS with Oracle RAC – Basic Tuning hints
Read the section “GPFS use with Oracle” in the GPFS Planning and Installation Guide for details on threads and AIO.
Suggested that Voting and OCR not be in GPFS file systems, but rather in shared raw devices (hdisks), unless using SCSI-3 PR.
For file systems holding large Oracle databases, set the GPFS file system block size to a large value:
– 512 KB is generally suggested.
– 256 KB is suggested if there is activity other than Oracle using the file system and many small files exist which are not in the database.
– 1 MB is suggested for file systems 100 TB or larger.
The large block size makes the allocation of space for the databases manageable and has no effect on performance when Oracle is using the Asynchronous I/O (AIO) and Direct I/O (DIO) features of AIX.
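For example, a file system intended for a large Oracle database might be created with the larger block size suggested above; the device name and disk descriptor file are placeholders.

# 512 KB block size for a database file system; 1 MB is suggested for 100 TB+ file systems
mmcrfs oradata -F /tmp/oradisks.txt -T /oradata -B 512K
mmlsfs oradata -B        # confirm the block size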
Glen's FAQs
Does an NSD mean network-access only?
Can GPFS be used in firewalled environments?
What about logical volumes on AIX?
Can I create 10 file systems on my one LUN GPFS file system?
Can I use my storage subsystem to grow/shrink my GPFS LUN?
What do I have to do in a replicated environment after a non-fatal disk failure?
Can I create a raw GPFS file system?
Can I have a non-striped GPFS file system?
Do I have to restripe my file system after adding a LUN?
What is cluster quorum?
Glen's FAQs
Does an NSD mean network-access only?
Answer:
No.
Each GPFS node does a disk discovery upon daemon startup and will determine at that time if disk access is local, or via the network-based NSD server.
If GPFS detects a failure in disk access locally, it will automatically switch to using the network-based NSD server(s). It will periodically check local access and switch back automatically.
The useNSDservers mount option can be set to change this default behavior.
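For instance, the option can be given at mount time; as I recall the GPFS documentation spells it useNSDserver with the values always, asfound, asneeded and never, so verify against your release.

# Force the network path even when a local SAN path exists
mmmount fs1 -o useNSDserver=always

# Never fall back to the NSD servers; fail I/O if local access is lost
mmmount fs1 -o useNSDserver=never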
Glen's FAQs
Can GPFS be used in firewalled environments?
Answer:
Yes.
GPFS utilizes an IANA-registered port (1191/tcp) for daemon-to-daemon communication. You also have to take into account the remote shell used (typically rsh or ssh) when setting up firewall rules.
Glen's FAQs
What about logical volumes on AIX?
Answer:
New file systems should not be created on logical volumes in AIX.
GPFS does support utilizing logical volumes in the case of older, migrated file systems.
Logical volumes are also typically used when setting up a 3 site disaster recovery configuration for the file system descriptor only (fsdescOnly) disks at the 3rd tiebreaker site.
Glen's FAQs
Can I create 10 file systems on my one LUN GPFS file system?
Answer:
No.
A GPFS disk (NSD) belongs to one-and-only-one file system.
The typical way around this is to utilize GPFS subdirectories and symbolic links.
Example:
/app1 -> /gpfs/app1
/app2 -> /gpfs/app2
This can typically provide a benefit in space utilization and distribution of I/O when used with multi-LUN GPFS file systems.
Glen's FAQs
Can I use my storage subsystem to grow/shrink my GPFS LUN?
Answer:
No.
GPFS does not support the modification of LUN sizes after they are configured in the cluster.
You can:
– Remove a LUN from a file system
– Remove it from GPFS
– Remove from the OS
– Re-create it at the new size
– Add it back to GPFS and the file system
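A hedged sketch of that sequence with placeholder file system, NSD and descriptor names.

mmdeldisk fs1 "nsd5"          # drain data off the disk and remove it from the file system
mmdelnsd "nsd5"               # remove the NSD definition from GPFS
# remove the hdisk from the OS, recreate the LUN at the new size on the storage subsystem,
# rediscover it, then redefine and re-add it:
mmcrnsd -F /tmp/newnsd5.txt
mmadddisk fs1 -F /tmp/newnsd5.txt -r   # -r rebalances existing data onto the new disk (optional)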
Glen's FAQs
What do I have to do in a replicated environment after a non-fatal disk failure?
Answer:
After verification that the disk is back (but not permanently lost), the only action is to change the state of the disk:
mmchdisk <gpfs_device> start -d “gpfsXnsd”
GPFS will scan the file system's metadata and create a list of all replicated data that was changed while the disk was down, and will re-replicate.
GPFS will not automatically perform this process after the disk access is restored. A reboot and/or cluster restart will not change the disk state. This step requires administrative action.
Glen's FAQs
Can I create a raw GPFS file system?
Answer:
No.
GPFS is a “cooked” file system only. The device and file system are synonymous, unlike AIX's LVM where the logical volume and file systems are separate entities that can be acted upon separately.
Glen's FAQs
Can I have a non-striped GPFS file system?
Answer:
Yes, but only if that file system is made up of one LUN.
Any GPFS file system of >1 LUN will automatically be striped.
Glen's FAQs
Do I have to restripe my file system after adding a LUN?
Answer:
No.
GPFS does not require you to restripe after adding a LUN to an existing file system. However, depending on your existing NSD utilization and I/O pattern, you might want to restripe.
If you have full NSDs, then restriping will allow the overall I/O to be more evenly distributed across all the LUNs.
If you have a heavy write environment with new files as opposed to re-writes of existing files, then the file system data will automatically restripe itself over time.
Note: restriping is I/O intensive, but you can select which nodes participate in that operation.
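For example, node names being placeholders:

# Rebalance existing data across all disks, but do the work only on two designated nodes
mmrestripefs fs1 -b -N node1,node2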
Glen's FAQs
What is cluster quorum?
Answer:
GPFS requires a quorum of nodes designated “quorum” to maintain overall cluster availability. How that quorum is calculated depends on the method chosen.
The next two pages explain the two quorum methods.
Legend: q = quorum node, nq = non-quorum node
Node Quorum : Standard Multi-node (default)
Quorum = (number of quorum nodes ÷ 2, rounded down) + 1
GPFS allows some subset of the total node population to be assigned as explicit quorum nodes; only these nodes participate in the quorum calculation.
Large clusters achieve quorum faster and can be hardened against failure more readily with fewer quorum nodes
Typically 7 quorum nodes or fewer; odd numbers are good.
Legend: q = quorum node, t = tiebreaker disk, nq = non-quorum node
Node Quorum: TieBreaker Disks
Clusters may contain up to 8 quorum nodes; only 1 has to be available
– Any number of non-quorum nodes
From one to three disks may be used as tiebreakers; an odd number is good
Tiebreaker disks can be file system disks; they don't need to be dedicated
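A hedged sketch of switching a small cluster to tiebreaker-disk quorum; NSD names are placeholders, and on the GPFS 3.x releases this presentation covers the daemon must be down cluster-wide when the attribute is changed.

mmshutdown -a
mmchconfig tiebreakerDisks="nsd1;nsd2;nsd3"    # one or three NSDs, semicolon-separated
mmstartup -a

# To revert to standard node quorum:
mmchconfig tiebreakerDisks=no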
Common GPFS Commands
Task | View | Action
Cluster configuration | mmlscluster | mmchcluster
Tunables | mmlsconfig | mmchconfig
File system configuration | mmlsfs | mmchfs
File system usage | mmdf | –
Grow / shrink file system | – | mmadddisk / mmdeldisk
Restripe file system data | – | mmrestripefs
Disk state | mmlsdisk | mmchdisk
NSD state | mmlsnsd | mmchnsd
Add / remove nodes | – | mmaddnode / mmdelnode
Snapshots | mmlssnapshot | mmcrsnapshot / mmdelsnapshot
GPFS daemon status | mmgetstate | mmstartup / mmshutdown
Offline file system configuration | – | mmbackupconfig / mmrestoreconfig
File system mount state | mmlsmount | mmmount / mmumount
File system managers | mmlsmgr | mmchmgr
Policy management | mmlspolicy | mmchpolicy / mmapplypolicy
Parallel backup with TSM | – | mmbackup
Generic GPFS Debugging
Check from the top down:
– Is it my membership in the cluster?
• Can we communicate?
• Is the daemon running?
– Is the file system mounted?
• Are they all mounted?
– Is there a problem with the disks?
• From the Operating System's point of view?
• From a GPFS point of view?
– Performance issue?
• Check out standard AIX performance commands
• Examine the mmpmon command
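A hedged example of driving mmpmon, which reads its requests from an input file; the fs_io_s and io_s request names are as I recall them from the GPFS documentation.

cat > /tmp/pmon.cmd <<EOF
fs_io_s
io_s
EOF
mmpmon -i /tmp/pmon.cmd -r 10 -d 5000    # 10 samples, 5000 ms apart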
Debugging GPFS – Some commands
Utilize the documentation (gasp! No!)
– The GPFS Problem Determination Guide has lots of good information!
First, document the cluster using the commands from previous page:
– mmlscluster, mmlsconfig, mmlsnsd, mmlsdisk, mmdf, mmlsfs
Determine the current state of cluster using commands such as
– mmgetstate, mmlsmgr and mmdiag
Read the logs:
– /var/adm/ras/mmfs.log.latest
– /var/adm/ras/mmfs.log.previous
– AIX Error Log / Linux syslog
Information Resources
Main GPFS Page
www.ibm.com/systems/software/cluster/gpfs
GPFS Resources
Links to the documentation, FAQ, Wiki and more
www.ibm.com/systems/clusters/software/gpfs/resources.html
GPFS forum
www-128.ibm.com/developerworks/forums/dw_forum.jsp?forum=479&cat=13
GPFS wiki
http://www.ibm.com/developerworks/wikis/display/hpccentral/General+Parallel+File+System+(GPFS)
What We Covered
Overview and Architecture
– Performance, Security and RAS
– Supported Environments
– Information Lifecycle Management
– Disaster Recovery
– Usage with Oracle RAC
Reference
– FAQ
– Administrative Commands
– Debugging
– Additional sources of Information
Trademarks
The following are trademarks of the International Business Machines Corporation in the United States, other countries, or both.
The following are trademarks or registered trademarks of other companies.
* All other products may be trademarks or registered trademarks of their respective companies.
Notes: Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here. IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply. All customer examples cited or described in this presentation are presented as illustrations of the manner in which some customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics will vary depending on individual customer configurations and conditions. This publication was produced in the United States. IBM may not offer the products, services or features discussed in this document in other countries, and the information may be subject to change without notice. Consult your local IBM business contact for information on the product or services available in your area. All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Information about non-IBM products is obtained from the manufacturers of those products or their published announcements. IBM has not tested those products and cannot confirm the performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. Prices subject to change without notice. Contact your IBM representative or Business Partner for the most current pricing in your geography.
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office. IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency, which is now part of the Office of Government Commerce.
For a complete list of IBM Trademarks, see www.ibm.com/legal/copytrade.shtml:
*, AS/400®, e business(logo)®, DBE, ESCO, eServer, FICON, IBM®, IBM (logo)®, iSeries®, MVS, OS/390®, pSeries®, RS/6000®, S/30, VM/ESA®, VSE/ESA, WebSphere®, xSeries®, z/OS®, zSeries®, z/VM®, System i, System i5, System p, System p5, System x, System z, System z9®, BladeCenter®
Not all common law marks used by IBM are listed on this page. Failure of a mark to appear does not mean that IBM does not use the mark nor does it mean that the product is not actively marketed or is not significant within its relevant market.
Those trademarks followed by ® are registered trademarks of IBM in the United States; all others are trademarks or common law marks of IBM in the United States.