TRANSCRIPT
Why Smart Storage?
➢ Central control of critical data
➤ One central resource to fail over in disaster planning
➤ Banks, trading floors, airlines want zero downtime
➢ Smart storage is shared by all hosts & OSes
➤ Amortize the costs of high availability and disaster planning over all of your hosts
➤ Use different OSes for different jobs (UNIX for the web, IBM mainframes for data processing)
➤ Zero-time “transfer” from host to host when both are connected
➤ Enables cluster file systems
Data Center Storage Systems
➢ Change the way you think of storage
➤ Shared Connectivity Model
➤ “Magic” Disks
➤ Scales to new capacity
➤ Storage that runs for years at a time
➢ Symmetrix case study
➤ Symmetrix 8000 Architecture
➤ Symmetrix Applications
➢ Data center class operating systems
Traditional Model of Connectivity
➢ Direct Connect
➤ Disk attached directly to host
➤ Private - OS controls access and provides security
➤ Storage I/O traffic only
• Separate system used to support network I/O (networking, web browsing, NFS, etc)
Shared Models of Connectivity
➢ VMS Cluster
➤ Shared disk & partitions
➤ Same OS on each node
➤ Scales to dozens of nodes
➢ IBM Mainframes
➤ Shared disk & partitions
➤ Same OS on each node
➤ Handful of nodes
➢ Network Disks
➤ Shared disk/private partition
➤ Same OS
➤ Raw/block access via network
➤ Handful of nodes
New Models of Connectivity
➢ Every host in a data center could be connected to the same storage system
➤ Heterogeneous OS & data format (CKD & FBA)
➤ Management challenge: no central authority to provide access control
[Diagram: shared storage connected to hosts running IRIX, DGUX, FreeBSD, MVS, VMS, Linux, Solaris, HPUX, and NT]
Magic Disks
➢ Instant copy
➤ Devices, files or databases
➢ Remote data mirroring
➤ Metropolitan area
➤ 100’s of kilometers
➢ 1000’s of virtual disks
➤ Dynamic load balancing
➢ Behind-the-scenes backup
➤ No host involved
Scalable Storage Systems
➢ Current systems support
➤ 10’s of terabytes
➤ Dozens of SCSI, Fibre Channel, ESCON channels per host
➤ Highly available (years of run time)
➤ Online code upgrades
➤ Potentially 100’s of hosts connected to the same device
➢ Support for chaining storage boxes together locally or remotely
Longevity
➢ Data should be forever
➤ Storage needs to overcome network failures, power failures, blizzards, asteroid strikes …
➤ Some boxes have run for over 5 years without a reboot or halt of operations
➢ Storage features
➤ No single point of failure inside the box
➤ At least 2 connections to a host
➤ Online code upgrades and patches
➤ Call home on error, ability to fix field problems without disruptions
➤ Remote data mirroring for real disasters
Symmetrix Architecture
➢ 32 PowerPC 750-based “directors”
➢ Up to 32 GB of central “cache” for user data
➢ Support for SCSI, Fibre Channel, ESCON, …
➢ 384 drives (over 28 TB with 73 GB units)
Prefetch is Key
➢ Read hit gets RAM speed, read miss gets spindle speed
➢ What helps cached storage array performance?
➤ Contiguous allocation of files (extent-based file systems) preserves the logical-to-physical mapping
➤ Hints from the host could help prediction
➢ What might hurt performance?
➤ Clustering small, unrelated writes into contiguous blocks (foils prefetch on later read of the data)
➤ Truly random read I/Os
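The hit/miss economics above can be sketched with a toy cache. This is a hypothetical model: block numbering, prefetch depth, and the policy are illustrative, not actual Symmetrix cache internals.

```python
# Toy model of a cached storage array with sequential prefetch.
import random

class PrefetchCache:
    def __init__(self, prefetch_depth=4):
        self.cache = set()                  # block numbers currently cached
        self.prefetch_depth = prefetch_depth
        self.hits = 0                       # served at RAM speed
        self.misses = 0                     # served at spindle speed

    def read(self, block):
        if block in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache.add(block)
        # Prefetch the next few blocks, betting on sequential access.
        for b in range(block + 1, block + 1 + self.prefetch_depth):
            self.cache.add(b)

seq = PrefetchCache()
for block in range(100):                    # contiguous file: prefetch wins
    seq.read(block)

rnd = PrefetchCache()
random.seed(1)
for _ in range(100):                        # random I/O: prefetch is wasted
    rnd.read(random.randrange(10**6))

print("sequential:", seq.hits, seq.misses)  # almost all hits
print("random:    ", rnd.hits, rnd.misses)  # almost all misses
```

The sequential pass misses only the first block; the random pass almost never benefits from the staged blocks, which is exactly why truly random read I/Os hurt.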
Symmetrix Applications
➢ Instant copy
➤ TimeFinder
➢ Remote data copy
➤ SRDF (Symmetrix Remote Data Facility)
➢ Serverless backup and restore
➤ Fastrax
➢ Mainframe & UNIX data sharing
➤ IFS
Business Continuance Problem: “Normal” Daily Operations Cycle
[Diagram: the online day stops at 2 am for backup/DSS and resumes at 6 am — 4 hours of data inaccessibility, a “race to sunrise”]
➢ Creation and control of a copy of any active application volume
➢ Capability to allow the new copy to be used by another application or system
➤ Continuous availability of production data during backups, decision support, batch queries, DW loading, Year 2000 testing, application testing, etc.
➢ Ability to create multiple copies of a single application volume
➢ Non-disruptive re-synchronization when the second application is complete
TimeFinder
[Diagram: production application volumes mirrored to Business Continuance Volumes (BCVs); a BCV is a copy of real production data, used for backups, decision support, data warehousing, Euro conversion, etc.]
Business Continuance Volumes
➢ A Business Continuance Volume (BCV) is created and controlled at the logical volume level
➢ Physical drive sizes can be different, logical size must be identical
➢ Several ACTIVE copies of data at once per Symmetrix
Using TimeFinder
➢ Establish BCV
➢ Stop transactions to clear buffers
➢ Split BCV
➢ Start transactions
➢ Execute against BCVs
➢ Re-establish BCV
[Diagram: standard volume mirrors M1/M2 with an attached BCV]
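The steps above can be sketched as a small state machine. This is a hypothetical sketch: a volume is modeled as a dict of track → data, and a full copy stands in for the real track-level synchronization.

```python
# Hypothetical sketch of the TimeFinder BCV lifecycle.

class BCVPair:
    def __init__(self, std):
        self.std = std                 # standard (production) volume
        self.bcv = None                # business continuance volume
        self.state = "unattached"

    def establish(self):
        self.bcv = dict(self.std)      # synchronize STD -> BCV
        self.state = "established"     # BCV mirrors STD, hidden from hosts

    def split(self):
        assert self.state == "established"
        self.state = "split"           # frozen point-in-time copy, usable

    def reestablish(self):
        assert self.state == "split"
        self.bcv = dict(self.std)      # real arrays copy only changed tracks
        self.state = "established"

prod = {"track0": "sales-db"}
pair = BCVPair(prod)
pair.establish()                       # 1. establish BCV
# 2. stop transactions so host buffers are flushed (not modeled)
pair.split()                           # 3. split the BCV
prod["track0"] = "sales-db-v2"         # 4. transactions resume on production
snapshot = pair.bcv["track0"]          # 5. backup/DSS executes against BCV
pair.reestablish()                     # 6. re-establish for the next cycle
print(snapshot)                        # the point-in-time image survived
```

Note how production writes after the split do not disturb the snapshot the second application sees.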
Re-Establishing a BCV Pair
■ BCV pair “PROD” and “BCV” have been split
■ Tracks on “PROD” updated after split
■ Tracks on “BCV” updated after split
■ Symmetrix keeps a table of these “invalid” tracks after the split
■ At re-establish of the BCV pair, “invalid” tracks are written from “PROD” to “BCV”
■ Synch complete
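The invalid-track bookkeeping can be sketched as follows; it covers both sync directions (re-establish copies PROD over BCV, restore copies BCV over PROD). Data structures and track values are illustrative, not the real on-array tables.

```python
# Hypothetical sketch: after a split, writes on either side mark tracks
# "invalid" in a table; synchronization copies only those tracks, one way.

def sync_pair(prod, bcv, prod_dirty, bcv_dirty, direction):
    invalid = prod_dirty | bcv_dirty               # tracks that now differ
    src, dst = (prod, bcv) if direction == "reestablish" else (bcv, prod)
    for track in invalid:
        dst[track] = src[track]                    # copy invalid tracks only
    return sorted(invalid)

prod = {0: "p0", 1: "p1-new", 2: "p2"}             # track 1 updated on PROD
bcv  = {0: "p0", 1: "p1",     2: "b2-new"}         # track 2 updated on BCV
copied = sync_pair(prod, bcv, {1}, {2}, "reestablish")
print(copied, bcv == prod)                         # only tracks 1 and 2 move
```

Because only the invalid tracks are copied, a re-establish after a short split is far cheaper than the initial full establish.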
[Diagram: split BCV pair with updated tracks on both PROD and BCV; re-establishing the pair overwrites the invalid tracks on BCV with PROD’s data]
Restore a BCV Pair
■ BCV pair “PROD” and “BCV” have been split
■ Tracks on “PROD” updated after split
■ Tracks on “BCV” updated after split
■ Symmetrix keeps a table of these “invalid” tracks after the split
■ At restore of the BCV pair, “invalid” tracks are written from “BCV” to “PROD”
■ Synch complete
[Diagram: split BCV pair with updated tracks on both sides; restoring the pair overwrites the invalid tracks on PROD with BCV’s data]
Make as Many Copies as Needed
■ Establish BCV 1
■ Split BCV 1
■ Establish BCV 2
■ Split BCV 2
■ Establish BCV 3
[Diagram: production volume (M1/M2) with BCV 1 split at 4 PM, BCV 2 split at 5 PM, and BCV 3 established at 6 PM]
The Purpose of SRDF
➢ Local data copies are not enough
➢ Maximalist
➤ Provide a remote copy of the data that will be as usable after a disaster as the primary copy would have been
➢ Minimalist
➤ Provide a means for generating periodic physical backups of the data
Synchronous Data Mirroring
➢ Write is received from the host into the cache of the source
➢ I/O is transmitted to the cache of the target
➢ ACK is provided by the target back to the cache of the source
➢ Ending status is presented to the host
➢ Symmetrix systems destage writes to disk
➢ Useful for disaster recovery
Semi-Synchronous Mirroring
➢ An I/O write is received from the host/server into the cache of the source
➢ Ending status is presented to the host/server.
➢ I/O is transmitted to the cache of the target
➢ ACK is sent by the target back to the cache of the source
➢ Each Symmetrix system destages writes to disk
➢ Useful for adaptive copy
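The two orderings differ only in when the host sees ending status relative to the remote ACK; the steps come from the slides above, while the event-log strings are illustrative.

```python
# Sketch of the synchronous vs. semi-synchronous SRDF write orderings.

def synchronous_write(log):
    log.append("write into source cache")
    log.append("I/O transmitted to target cache")
    log.append("ACK from target to source")
    log.append("ending status to host")     # host waits for the remote ACK

def semi_synchronous_write(log):
    log.append("write into source cache")
    log.append("ending status to host")     # host continues immediately
    log.append("I/O transmitted to target cache")
    log.append("ACK from target to source")

sync_log, semi_log = [], []
synchronous_write(sync_log)
semi_synchronous_write(semi_log)
# Both modes destage cached writes to disk later; the trade-off is host
# write latency (synchronous) versus a window where the target lags
# behind the source (semi-synchronous).
print(sync_log.index("ending status to host"))   # last step
print(semi_log.index("ending status to host"))   # before the remote transfer
```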
Backup / Restore of Big Data
➢ Exploding amounts of data cause backups to run too long
➤ How long does it take you to back up 1 TB of data?
➤ Shrinking backup window and constant pressure for continuous application up-time
➢ Avoid using the production environment for backup
➤ No server CPU or I/O channels
➤ No involvement of the regular network
➢ Performance must scale to match customer growth
➢ Heterogeneous host support
Fastrax Overview
[Diagram: UNIX and Linux hosts at two locations run Fastrax-enabled backup/restore applications via SYMAPI; the Fastrax Data Engine moves data between Symmetrix volumes (STD1, STD2, BCV1, R1, R2, BCV2) and SCSI-attached tape libraries over Fibre Channel point-to-point links]
Fastrax Performance
➢ Performance scales with the number of data movers in the Fastrax box & the number of tape devices
➢ Restore runs as fast as backup
➢ No performance impact on host during restore or backup
InfoMover File System
➢ Transparent availability of MVS data to Unix hosts
➤ MVS datasets available as native Unix files
➢ Sharing a single copy of MVS datasets
➢ Uses MVS security and locking
➤ Standard MVS access methods for locking + security
IFS Implementation
[Diagram: MVS data on a Symmetrix with ESP is shared between an IBM MVS/OS390 mainframe (ESCON or parallel channel) and open-systems hosts — IBM AIX, HP HP-UX, Sun Solaris — over FWD/Ultra SCSI or Fibre Channel, with minimal network overhead: no data transfer over the network]
Symmetrix API Overview
➢ SYMAPI Core Library
➤ Used by “Thin” and Full Clients
➢ SYMAPI Mapping Library
➢ SYMCLI Command Line Interface
Symmetrix API’s
➢ SYMAPI provides the high-level functions
➤ Used by EMC’s ISV partners (Oracle, Veritas, etc) and by EMC applications
➢ SYMCLI is the “Command Line Interface” that invokes SYMAPI
➤ Used by end customers and some ISV applications
Basic Architecture
[Diagram: the Symmetrix Application Programming Interface (SymAPI) is called by the Symmetrix Command Line Interpreter (SymCli) and by other storage management applications; user access to the Solutions Enabler is via SymCli or a storage management application]
Client-Server Architecture
➢ Symapi Server runs on the host computer connected to the Symmetrix storage controller
➢ Symapi client runs on one or more host computers
[Diagram: storage management applications on client hosts use the SymAPI client (or a thin SymAPI client) to reach the SymAPI server on the server host, which links the SymAPI library]
SymAPI Components
[Diagram: component areas include Initialization, Discover and Update, Configuration, Gatekeepers, TimeFinder Functions, Device Groups, DeltaMark Functions, SRDF Functions, Statistics, Mapping Functions, Base Controls, Calypso Controls, Optimizer Controls, and InfoSharing]
Data Object Resolve
[Diagram: an RDBMS data file resolves through the file system, logical volume, and host physical device to the Symmetrix device and its extents]
File System Mapping
➢ File system mapping information includes:
➤ File system attributes and host physical location
➤ Directory attributes and contents
➤ File attributes and host physical extent information, including inode information and fragment size
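The mapping stack (data file → file system → logical volume → host device → Symmetrix device → extents) can be sketched as a resolver. Every class, device name, and block number here is hypothetical; this is not the real SYMAPI mapping library.

```python
# Hypothetical sketch of resolving an RDBMS data file down to array extents.
from dataclasses import dataclass

@dataclass
class Extent:
    symm_device: str      # Symmetrix logical device holding the extent
    start_block: int
    block_count: int

def resolve_data_file(path):
    # A real resolver would walk inode/extent maps, the LVM layout, and
    # the array configuration; canned values stand in for each layer.
    fs_extents = [(0, 1024), (4096, 2048)]        # (LV block, length)
    def lv_to_host(block):                        # logical volume -> host dev
        return "c0t1d2", block + 100
    def host_to_symm(dev, block):                 # host dev -> Symm device
        return "SYM-01A", block
    extents = []
    for start, count in fs_extents:
        dev, hb = lv_to_host(start)
        symm, sb = host_to_symm(dev, hb)
        extents.append(Extent(symm, sb, count))
    return extents

for e in resolve_data_file("/oradata/sales01.dbf"):
    print(e.symm_device, e.start_block, e.block_count)
```

Knowing the extent list at the array level is what lets features like TimeFinder and Fastrax operate on a database file without host involvement.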
Solaris & Sun Starfire
➢ Hardware
➤ Up to 62 I/O channels
➤ 64 CPUs
➤ 64 GB of RAM
➤ 60 TB of disk
➤ Supports multiple domains
➢ Starfire & Symmetrix
➤ ~20% use more than 32 I/O channels
➤ Most use 4 to 8 I/O channels per domain
➤ Oracle instance usually above 1 TB
HPUX & HP 9000 Superdome
➢ Hardware
➤ 192 I/O channels
➤ 64 CPU cards
➤ 128 GB RAM
➤ 1 PB of storage
➢ Superdome and Symm
➤ 16 LUNs per target
➤ Want us to support more than 4000 logical volumes!
Solaris and Fujitsu GP7000F M1000
➢ Hardware
➤ 6-48 I/O slots
➤ 4-32 CPUs
➤ Cross-bar switch
➤ 32 GB RAM
➤ 64-bit PCI bus
➤ Up to 70 TB of storage
Solaris and Fujitsu GP7000F M2000
➢ Hardware
➤ 12-192 I/O slots
➤ 8-128 CPUs
➤ Cross-bar switch
➤ 256 GB RAM
➤ 64-bit PCI bus
➤ Up to 70 TB of storage
AIX 5L & IBM RS/6000 SP
➢ Hardware
➤ Scales to 512 nodes (over 8000 CPUs)
➤ 32 TB RAM
➤ 473 TB internal storage capacity
➤ High-speed interconnect: 1 GB/sec per channel with SP Switch2
➤ Partitioned workloads
➤ Thousands of I/O channels
IBM RS/6000 pSeries 680 AIX 5L
➢ Hardware
➤ 24 CPUs
• 64-bit RS64 IV at 600 MHz
➤ 96 GB RAM
➤ 873.3 GB internal storage capacity
➤ 53 PCI slots
• 33 32-bit / 20 64-bit
Really Big Data
➢ IBM (Sequent) NUMA
➤ 16 NUMA “Quads”
• 4-way / 450 MHz CPUs
• 2 GB memory
• 4 x 100 MB/s FC-SW
➢ Oracle 8.1.5 with up to 42 TB (mirrored) DB
➢ EMC Symmetrix
➤ 20 small Symm 4’s
➤ 2 medium Symm 4’s
[Diagram: NUMA-Q 2000 Xeon quads, joined by the IQ-Link interconnect, connect through cascaded Fibre Channel switch pairs in four fabric domains to EMC disk]
Windows 2000 on IA32
➢ Usually lots of small (1U or 2U) boxes share a Symmetrix
➤ 4 to 8 I/O channels per box
➤ Qualified up to 1 TB per meta volume (although usually deployed with ½ TB or less)
➢ Management is a challenge
➢ Will 2000 on IA64 handle big data better?
Lots of Devices
➢ Customers can use hundreds of targets and LUNs (logical volumes)
➤ 128 SCSI devices per system is too few
➢ Better naming system to track lots of disks
➢ Persistence for “not ready” devices in the name space would help some of our features
➢ devfs solves some of this
➤ Rational naming scheme
➤ Potential for tons of disk devices (need SCSI driver work as well)
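To see why naming hundreds of disks gets awkward, consider the classic /dev/sd* scheme, which counts in bijective base-26 (sda … sdz, then sdaa, sdab, …). The sketch below is illustrative of the naming arithmetic, not of any particular driver's code.

```python
# Generate the nth SCSI disk name under the sda/sdb/.../sdz/sdaa scheme.
import string

def scsi_disk_name(index):
    """Map disk index 0, 1, 2, ... to sda, sdb, ..., sdz, sdaa, ..."""
    n = index + 1                       # bijective base-26: no zero digit
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = string.ascii_lowercase[rem] + letters
    return "sd" + letters

print(scsi_disk_name(0), scsi_disk_name(25),
      scsi_disk_name(26), scsi_disk_name(127))
```

Past the 26th disk the names stop being mnemonic, and nothing in the name survives a device being temporarily “not ready” — which is the persistence problem noted above.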
Support for Dynamic Data
➢ What happens when the LV changes under a running file system? Adding new logical volumes?
➤ Happens with TimeFinder, RDF, Fastrax
➤ Requires remounting, reloading drivers, rebooting?
➤ APIs can be used to give a “heads up” before events
➢ Must be able to invalidate
➤ Data, name, and attribute caches for individual files or logical volumes
➢ Support for dynamically loaded, layered drivers
➢ Dynamic allocation of devices
➤ Especially important for LUNs
➤ Add & remove devices as the Fibre Channel fabric changes
Keep it Open
➢ Open source is good for us
➤ We can fix it or support it if you don’t want to
➤ No need to reverse engineer some closed-source FS/LVM
➢ Leverage storage APIs
➤ Add hooks to Linux file systems, LVMs, sysadmin tools
➢ Make Linux manageable
➤ Good management tools are crucial in large data centers
New Technology Opportunities
➢ Linux can explore new technologies faster than most
➢ iSCSI
➤ SCSI over TCP for remote data copy?
➤ SCSI over TCP for host storage connection?
➤ High-speed/zero-copy TCP is important to storage here!
➢ InfiniBand
➤ Initially targeted at PCI replacement
➤ High-speed, high-performance cluster infrastructure for file systems, LVMs, etc
• Multiple gigabits/sec (2.5 Gb/sec up to 30 Gb/sec)
➤ Support for IB as a storage connection?
➢ Cluster file systems
Linux at EMC
➢ Full support for Linux in SymAPI, RDF, TimeFinder, etc
➢ Working with partners in the application space and the OS space to support Linux
➤ Oracle Open World demo of Oracle on Linux with over 20 Symms (could reach 1 PB of storage!)
[Diagram: EMC Symmetrix Enterprise Storage connected via EMC Connectrix Enterprise Fibre Channel switches, with centralized monitoring and management]
Our Problem: Code Builds
➢ Over 70 OS developers
➤ Each developer builds 15 variations of the OS
➤ Each variation compiles over a million lines of code
➤ A full build uses gigabytes of space, with 100k temporary files
➢ User sandboxes stored in home directories over NFS
➢ A full build took around 2 hours
➢ 2 users could build at once
Our Original Environment
➢ Software
➤ GNU tool chain
➤ CVS for source control
➤ Platform Computing’s Load Sharing Facility
➤ Solaris on build nodes
➢ Hardware
➤ EMC NFS server (Celerra) with EMC Symmetrix back end
➤ 26 Sun Ultra-2 (dual 300 MHz CPU) boxes
➤ FDDI ring used for interconnect
LSF Architecture
➢ Distributed process scheduling and remote execution
➤ No kernel modifications
➤ Prefers static placement for load balancing
➤ Applications need to link a special library
➤ License server controls cluster access
➢ Master node in cluster
➤ Manages load information
➤ Makes scheduling decisions for all nodes
➢ Uses modified GNU Make (lsmake)
MOSIX Architecture
➢ Provides transparent, dynamic migration
➤ Processes can migrate at any time
➤ No user intervention required
➤ A process thinks it is still running on its creation node
➢ Dynamic load balancing
➤ Uses a decentralized algorithm to continually level load in the cluster
➤ Based on the number of CPUs, speed of CPUs, RAM, etc
➢ Worked great for distributed builds in 1989
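The decentralized leveling idea can be shown with a toy simulation: each step, two random nodes compare loads normalized by CPU count and speed, and one process migrates toward the less-loaded node. All numbers are illustrative, and the real MOSIX algorithm also weighs RAM and other resources.

```python
# Toy pairwise load leveling in the MOSIX spirit (no central coordinator).
import random

def load(node):
    return node["procs"] / (node["cpus"] * node["speed"])

def balance_step(nodes, rng):
    a, b = rng.sample(nodes, 2)           # two random peers exchange info
    if load(a) > load(b) and a["procs"]:
        a["procs"] -= 1; b["procs"] += 1  # migrate one process a -> b
    elif load(b) > load(a) and b["procs"]:
        b["procs"] -= 1; a["procs"] += 1

nodes = [{"cpus": 2, "speed": 1.0, "procs": 30},
         {"cpus": 4, "speed": 1.0, "procs": 2},
         {"cpus": 2, "speed": 2.0, "procs": 4}]
rng = random.Random(0)
for _ in range(300):
    balance_step(nodes, rng)
print([n["procs"] for n in nodes])        # roughly proportional to capacity
```

After a few hundred pairwise steps the process counts settle roughly in proportion to each node's capacity, with no node ever holding a global view of the cluster.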
MOSIX Mechanism
➢ Each process has a unique home node (UHN)
➤ The UHN is the node of the process’s creation
➤ The process appears to be running at its UHN
➤ After migration it is invisible to others on its new node
➢ The UHN runs a deputy
➤ Encapsulates system state for the migrated process
➤ Acts as a proxy for some location-sensitive system calls after migration
➤ Significant performance hit for I/O over NFS, for example
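The deputy mechanism can be sketched as follows. This is a hypothetical model: the class names, the dict standing in for open files, and the single "read" call are illustrative, not MOSIX kernel structures.

```python
# Sketch: a migrated process ships location-sensitive system calls back
# to its unique home node (UHN), where the deputy executes them.

class Deputy:
    """Lives on the UHN; keeps the process's home-node context."""
    def __init__(self, open_files):
        self.open_files = open_files          # fd -> file contents (toy)

    def handle(self, syscall, *args):
        if syscall == "read":                 # executed at the home node
            (fd,) = args
            return self.open_files[fd]
        raise NotImplementedError(syscall)

class MigratedProcess:
    """Runs on the new node; location-sensitive calls go home."""
    def __init__(self, deputy):
        self.deputy = deputy

    def read(self, fd):
        # Every such call costs a network round trip to the UHN, which
        # is why I/O-heavy migrated workloads (e.g. over NFS) slow down.
        return self.deputy.handle("read", fd)

deputy = Deputy(open_files={3: b"build log kept on the home node"})
proc = MigratedProcess(deputy)
print(proc.read(3))
```

The round trip per call is the performance hit the last bullet describes, and avoiding the deputy entirely is what motivates the remote-execution and DFSA work later in the talk.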
MOSIX Migration
[Diagram: before migration, the local process runs at user level on its home node; after migration, the remote part runs on the new node while the deputy remains in the home node’s kernel, the two communicating through the link layer, with NFS I/O done via the home node]
MOSIX Enhancements
➢ MOSIX added static placement and remote execution
➤ Leverage the load-balancing infrastructure for placement decisions
➤ Avoid creation of deputies
➢ Lock remotely spawned processes down just in case
➢ Fix several NFS caching-related bugs
➢ Modify some of our makefile rules
MOSIX Remote Execution
[Diagram: remote execution variant of the migration figure — the spawned process runs on the remote node and each node performs its own NFS I/O, avoiding the deputy]
EMC MOSIX cluster
➢ EMC’s original MOSIX cluster
➤ Compute nodes changed from LSF to MOSIX
➤ Network changed from FDDI to 100-megabit Ethernet
➢ The MOSIX cluster immediately moved the bottleneck from the cluster to the network and I/O systems
➢ Performance was great, but we can do better!
Latest Hardware Changes
➢ Network upgrades
➤ New switch deployed
➤ Nodes to switch use 100-megabit Ethernet
➤ Switch to NFS server uses gigabit Ethernet
➢ NFS upgrades
➤ 50-gigabyte striped file systems per user (compared to 9-gigabyte non-striped file systems)
➤ Fast/wide differential SCSI between server and storage
➢ Cluster upgrades
➤ Added 28 more compute nodes
➤ Added 4 “submittal” nodes
EMC MOSIX Cluster
[Diagram: 52 VA Linux 3500s connect over 100-Mbit Ethernet to a Cisco 6506; gigabit Ethernet links the switch to 2 EMC Celerra NFS servers, which attach to 4 Symmetrix systems over SCSI]
Performance
➢ Running Red Hat 6.0 with the 2.2.10 kernel (MOSIX and NFS patches applied)
➢ Builds are now around 15-20 minutes (down from 1-1.5 hours)
➢ Over 35 concurrent builds at once
Build Submissions
[Diagram: Windows, Sun workstation, Linux, FreeBSD, and HP/UX clients on the 100-Mbit corporate network submit builds through dual-NIC “selfish” nodes to the 100-Mbit cluster network; compute nodes reach two NFS servers over gigabit Ethernet, backed by four Symmetrix systems over SCSI]
Cluster File System & MOSIX
[Diagram: the same cluster of 52 VA Linux 3500s with a Cisco 6506 and 1 EMC Celerra NFS server, but the compute nodes also reach the 4 Symmetrix systems directly over Fibre Channel through a Connectrix switch]
DFSA Overview
➢ DFSA provides the structure to allow migrated processes to always do local I/O
➢ MFS (MOSIX File System) created
➤ No caching per node, write-through
➤ Serverless - all nodes can export/import files
➤ Prototype for DFSA testing
➤ Works like non-caching NFS
DFSA Requirements
➢ One active inode/buffer in the cluster for each file
➢ Time-stamps are cluster-wide and increasing
➢ Some new FS operations
➤ Identify: encapsulate dentry info
➤ Compare: are two files the same?
➤ Create: produce a new file from SB/ID info
➢ Some new inode operations
➤ Checkpath: verify the path to the file is unique
➤ Dotdot: give the true parent directory
Information
➢ MOSIX:
➤ http://www.mosix.org/
➢ GFS:
➤ http://www.globalfilesystem.org/
➢ Migration information
➤ Process Migration, Milojicic et al. To appear in ACM Computing Surveys, 2000.
➤ Mobility: Processes, Computers and Agents, Milojicic, Douglis and Wheeler, ACM Press.